Pandas vectorizing methods and grouping operations are features that provide users much flexibility to analyze their data.
For instance, let’s say we want to find out what the salary differences are between different genders in the dataset.
We can use pandas’ groupby
method to separate the data by gender and then get
some summary statistics for each set.
data.groupby('gender').salary.sum()
gender
None 4550441
female 104980118
male 100594196
Name: salary, dtype: int64
Of course, we need to know how many of each gender there are for the sum to make sense.
data.groupby('gender').salary.count()
gender
None 109
female 2357
male 2534
Name: salary, dtype: int64
We can divide these to find the average, or we could also let pandas calculate the mean for us.
data.groupby('gender').salary.mean()
gender
None 41747.165138
female 44539.719134
male 39697.788477
Name: salary, dtype: float64
Let’s also revisit the date of birth re-coding that we developed earlier. First we can do the same operation that we used before:
dob_datetime = pandas.to_datetime(data.date_of_birth.astype(str), format='%Y%m%d')
But this gives us an error! It looks like some of the data aren’t able to be converted to a datetime.
We can tell the to_datetime
function to turn any data that can’t be converted into
a missing value (indicated by ‘NaT’ meaning ‘Not a Time’) so that we can proceed:
dob_datetime = pandas.to_datetime(data.date_of_birth.astype(str), format='%Y%m%d', errors='coerce')
Ok, that didn’t error, so lets take a look and see how it did.
dob_datetime.head()
rec_id
rec-2778-org NaT
rec-712-dup-0 NaT
rec-1321-org NaT
rec-3004-org NaT
rec-1384-org NaT
Name: date_of_birth, dtype: datetime64[ns]
Uh oh, now all the data are set to ‘NaT’, so something is going wrong even for the first two rows that worked before. Let’s look at the data to see if we can tell what’s going on.
data.date_of_birth.head()
rec_id
rec-2778-org 19300213.0
rec-712-dup-0 19080712.0
rec-1321-org 19940319.0
rec-3004-org 19290427.0
rec-1384-org 19631225.0
Name: date_of_birth, dtype: float64
So for some reason these rows are now showing up as floats rather than ints, so when we convert them to strings they don’t match the expected formatting. Let’s look at more rows to see if we can figure out why the values are floats now.
data.date_of_birth.head(20)
rec_id
rec-2778-org 19300213.0
rec-712-dup-0 19080712.0
rec-1321-org 19940319.0
rec-3004-org 19290427.0
rec-1384-org 19631225.0
rec-3981-org 19421201.0
rec-916-org 19450918.0
rec-1684-org 19950620.0
rec-63-dup-0 19000106.0
rec-3808-org 19150402.0
rec-112-org 19951125.0
rec-3297-org 19910228.0
rec-1315-org 19370312.0
rec-1050-org 19400826.0
rec-2116-org 19471114.0
rec-3232-org 19720809.0
rec-1900-dup-1 NaN
rec-2460-dup-2 19390305.0
rec-3123-org 19580628.0
rec-2166-org 19261017.0
Name: date_of_birth, dtype: float64
Ah ha! There are missing values, marked ‘NaN’ (for ‘Not a Number’). Any time there are NaNs in an integer column, pandas converts that column to floats because it has no representation for missing numbers in the integer data type.
So now we need a way to convert the float values to datetimes and the NaN values to NaTs. Pandas has a method called apply
that will apply a function to each row. So if we can write a function that will give us the right thing for any row, we can pass it to apply and pandas will run it on every row.
Let’s look at the first row:
print(data.date_of_birth.loc['rec-2778-org'])
19300213.0
So we need to convert the value back to an integer, then to a string and then use the to_datetime
method to convert it to a datetime.
print(pandas.to_datetime(str(int(data.date_of_birth.loc['rec-2778-org'])), format='%Y%m%d', errors='coerce'))
Timestamp('1930-02-13 00:00:00')
Ok, that works when we don’t have missing data. Now let’s look at a row with a NaN:
print(data.date_of_birth.loc['rec-1900-dup-1'])
nan
We need our function to be able to identify NaNs. Pandas has functions called isnull
and notnull
that return True or False for NaNs or other null values:
print(pandas.isnull(data.date_of_birth.loc['rec-1900-dup-1']))
True
So we can use isnull
to test for null values and return the right thing in each case:
def dob_to_datetime(dob_float):
if pandas.notnull(dob_float):
return pandas.to_datetime(str(int(dob_float)), format='%Y%m%d', errors='coerce')
else:
return pandas.to_datetime('NaT')
And we can run this function on our two test rows to see how they do:
print(dob_to_datetime(data.date_of_birth.loc['rec-2778-org']))
Timestamp('1930-02-13 00:00:00')
print(dob_to_datetime(data.date_of_birth.loc['rec-1900-dup-1']))
NaT
Looks like it’s working as we hoped, so now let’s run it on all the rows.
dob_datetime = data.date_of_birth.apply(dob_to_datetime)
dob_datetime.head(20)
rec_id
rec-2778-org 1930-02-13
rec-712-dup-0 1908-07-12
rec-1321-org 1994-03-19
rec-3004-org 1929-04-27
rec-1384-org 1963-12-25
rec-3981-org 1942-12-01
rec-916-org 1945-09-18
rec-1684-org 1995-06-20
rec-63-dup-0 1900-01-06
rec-3808-org 1915-04-02
rec-112-org 1995-11-25
rec-3297-org 1991-02-28
rec-1315-org 1937-03-12
rec-1050-org 1940-08-26
rec-2116-org 1947-11-14
rec-3232-org 1972-08-09
rec-1900-dup-1 NaT
rec-2460-dup-2 1939-03-05
rec-3123-org 1958-06-28
rec-2166-org 1926-10-17
Name: date_of_birth, dtype: datetime64[ns]
And we can replace the column in the dataframe with these datetime values:
data.date_of_birth = dob_datetime data.date_of_birth.head()
rec_id rec-2778-org 1930-02-13 rec-712-dup-0 1908-07-12 rec-1321-org 1994-03-19 rec-3004-org 1929-04-27 rec-1384-org 1963-12-25 Name: date_of_birth, dtype: datetime64[ns]