A DataFrame is a collection of Series; The DataFrame is the way Pandas represents a table, and Series is the data-structure Pandas use to represent a column.
Pandas is built on top of the Numpy library, which in practice means that most of the methods defined for Numpy Arrays apply to Pandas Series/DataFrames.
What makes Pandas so attractive is the powerful interface to access individual records of the table, proper handling of missing values, and relational-databases operations between DataFrames.
To access a value at the position [i,j]
of a DataFrame, we have two options, depending on
what is the meaning of i
in use.
Remember that a DataFrame provides a index as a way to identify the rows of the table;
a row, then, has a position inside the table as well as a label, which
uniquely identifies its entry in the DataFrame.
DataFrame.iloc[..., ...]
to select values by their (entry) positionimport pandas
data = pandas.read_csv('https://raw.githubusercontent.com/uwescience/ds4ad/master/data/synthetic_data.csv', index_col='rec_id')
print(data.iloc[0, 0])
sarah
DataFrame.loc[..., ...]
to select values by their (entry) label.print(data.loc["rec-2778-org", "given_name"])
sarah
:
on its own to mean all columns or all rows.print(data.loc["rec-2778-org", :])
given_name sarah
surname bruhn
street_number 44
address_1 forbes street
address_2 wintersloe
suburb kellerberrin
postcode 4510
state vic
date_of_birth 1.93002e+07
soc_sec_id 7535316
gender female
salary 136344
Name: rec-2778-org, dtype: object
data.loc["rec-2778-org"]
(without a second index).print(data.loc[:, "given_name"])
rec_id
rec-2778-org sarah
rec-712-dup-0 jacob
rec-1321-org brinley
rec-3004-org aleisha
rec-1384-org ethan
rec-3981-org alicia
rec-916-org benjamin
rec-1684-org petreece
rec-63-dup-0 olivia
rec-3808-org NaN
rec-112-org joshua
rec-3297-org rachael
rec-1315-org joseph
rec-1050-org sarah
rec-2116-org sidonie
rec-3232-org andrew
rec-1900-dup-1 kiara
rec-2460-dup-2 nicholas
rec-3123-org isabella
rec-2166-org alexandra
rec-1155-org matthew
rec-1485-org michael
rec-2852-org georgia
rec-707-org tia
rec-3499-org brooke
rec-2153-org liam
rec-1665-dup-2 jamesr
rec-35-dup-1 darcy
rec-78-org kody
rec-3116-org james
...
rec-3650-org samuel
rec-1323-org jack
rec-1278-org michael
rec-540-dup-0 anthony
rec-724-org kaitlin
rec-3120-org monique
rec-881-org brooklyn
rec-2303-org polly
rec-1772-org amber
rec-3587-org madelyn
rec-3005-org christopher
rec-2636-org annabel
rec-235-org thomas
rec-2822-dup-0 molly
rec-322-org brooke
rec-1972-org carly
rec-3433-org william
rec-2889-org naomi
rec-1313-org zachary
rec-2206-org john-paul
rec-2433-org mia
rec-1609-org madison
rec-3812-org georgia
rec-303-org clodagh
rec-2284-org sam
rec-1487-org thomas
rec-1856-org james
rec-3307-org paige
rec-227-org antonio
rec-1143-org harry
Name: given_name, Length: 5000, dtype: object
data["given_name"]
data.given_name
(since it’s a column name)DataFrame.loc
and a named slice.print(data.loc['rec-2778-org':'rec-1384-org', 'given_name':'street_number'])
given_name surname street_number
rec_id
rec-2778-org sarah bruhn 44.0
rec-712-dup-0 jacob lanyon 5.0
rec-1321-org brinley efthimiou 35.0
rec-3004-org aleisha hobson 54.0
rec-1384-org ethan gazzola 49.0
Now do something similar with iloc:
print(data.iloc[0:4, 0:2])
given_name surname
rec_id
rec-2778-org sarah bruhn
rec-712-dup-0 jacob lanyon
rec-1321-org brinley efthimiou
rec-3004-org aleisha hobson
In the above code, we discover that slicing using loc
is inclusive at both
ends, which differs from slicing using iloc
, where slicing indicates
everything up to but not including the final index.
You can also pass a list of row or column labels to get the rows and columns you want in the order you want:
print(data.loc['rec-2778-org':'rec-1384-org', ['soc_sec_id', 'surname', 'salary']])
soc_sec_id surname salary
rec_id
rec-2778-org 7535316 bruhn 136344
rec-712-dup-0 9497788 lanyon 59079
rec-1321-org 6814956 efthimiou 39987
rec-3004-org 5967384 hobson 47962
rec-1384-org 3832742 gazzola 39988
Or if we wanted this information on people named Courtney, we could find those indexes and use them for the row selection:
print(data[data.given_name == 'courtney'].index)
Index(['rec-1721-org', 'rec-548-org', 'rec-101-org', 'rec-2312-org',
'rec-1520-org'],
dtype='object', name='rec_id')
print(data.loc[['rec-1721-org', 'rec-548-org', 'rec-101-org', 'rec-2312-org',
'rec-1520-org'], ['soc_sec_id', 'surname', 'salary']])
soc_sec_id surname salary
rec_id
rec-1721-org 1564113 sebastyan 45003
rec-548-org 8735103 le lievre 45414
rec-101-org 3854639 campain 13833
rec-2312-org 8820005 robson 45699
rec-1520-org 8729844 mulquiney 45701
print(data.loc['rec-2778-org':'rec-1384-org', ['given_name', 'surname', 'salary']].max())
given_name sarah
surname lanyon
salary 136344
dtype: object
Note that the max is only performed on the salary because it’s the only numeric column in the slice.
print(data.loc['rec-2778-org':'rec-1384-org', ['given_name', 'surname', 'salary']].min())
given_name aleisha
surname bruhn
salary 39987
dtype: object
True
and False
.# Use a subset of data to keep output readable.
subset = data.loc['rec-2778-org':'rec-1384-org', ['soc_sec_id','salary']]
print('Subset of data:\n', subset)
# Which values were greater than 50000 ?
print('\nWhere are values large?\n', subset > 50000)
Subset of data:
soc_sec_id salary
rec_id
rec-2778-org 7535316 136344
rec-712-dup-0 9497788 59079
rec-1321-org 6814956 39987
rec-3004-org 5967384 47962
rec-1384-org 3832742 39988
Where are values large?
soc_sec_id salary
rec_id
rec-2778-org True True
rec-712-dup-0 True True
rec-1321-org True False
rec-3004-org True False
rec-1384-org True False
mask = subset > 50000
print(subset[mask])
soc_sec_id salary
rec_id
rec-2778-org 7535316 136344.0
rec-712-dup-0 9497788 59079.0
rec-1321-org 6814956 NaN
rec-3004-org 5967384 NaN
rec-1384-org 3832742 NaN
print(subset[subset > 50000].describe())
soc_sec_id salary
count 5.000000e+00 2.000000
mean 6.729637e+06 97711.500000
std 2.079188e+06 54634.605448
min 3.832742e+06 59079.000000
25% 5.967384e+06 78395.250000
50% 6.814956e+06 97711.500000
75% 7.535316e+06 117027.750000
max 9.497788e+06 136344.000000