Pandas dataframe
.import pandas as pd
df3 = pandas.read_csv('https://raw.githubusercontent.com/uwescience/ds4ad/master/data/synthetic_data.csv',index_col='rec_id')
print(df3.describe())
Often times you want to look at column distributions and behavior visually. Using pandas.DataFrame.plot()
directly facilitates that and uses matplotlib/pyplot
.
You can access data frame columns directly like before in pandas
notation followed by the .plot
.
You set the kind of plot you want by setting the kind
parameter accordingly.
This gives us a histogram of the column salary
.
df3["salary"].plot(kind='hist')
This gives us a histogram of the column tax_rate
.
df3["tax_rate"].plot(kind='hist')
You can get boxplots by setting the kind
parameter to ‘box’.
df3['salary'].plot(kind = 'box')
Plot the density of the column ‘tax_rate’.
Hint: Look at the pandas.DataFrame.plot documentation for reference.
df3['tax_rate'].plot(kind = 'density')
You can plot multiple columns by specifying them inside .plot
.
df3.plot('tax_rate','salary', kind = 'scatter')
Often times you will summarize your data in a variety of ways and will want to represent that information visually.
state_tax_mean = df3.groupby(['state'])['tax_rate'].mean()
This returns a series, and a series can be directly used with .plot
.
state_tax_mean.plot(kind='bar', color='r')
plt.title('Average Tax Rate by State')
plt.xlabel('State')
plt.ylabel('Average Tax Rate')
Calculate the median salary by state and create a plot with all the appropriate labels.
state_md_salary = df3.groupby(['state'])['salary'].median()
state_md.plot(kind='bar', color='r')
plt.title('Median Salary by State')
plt.xlabel('State')
plt.ylabel('Median Salary ($)')