We decided to design and build a relational database containing the joined FDA recall and Amazon review data. This way, researchers can pick up this topic at the analytical stage, rather than spending their time joining the data on unique identifiers and organizing it into a concise, easy-to-use format.
Some visualizations based on summary statistics queried from the database can be found here.
The Unsafe Foods Database is a relational Postgres database that facilitates querying and analyzing Amazon review data alongside historical FDA recall data. We gathered the Amazon review data from a historical dataset published by Julian McAuley at UCSD. It contains approximately 1.3 million individual reviews, each including the reviewer ID, the date and time of the review, the review text, summary, and rating, along with product metadata.
In Python, you need a module to connect to the Postgres database. My preferred module is psycopg2. To install it successfully, first install its build dependencies, then the package itself, using the following commands in bash:
sudo apt-get install gcc python3-dev libpq-dev
pip install psycopg2
After installing, you can invoke the module in your Python script and connect to the database using the following command and parameters:
import psycopg2
#Connect to database
conn = psycopg2.connect(database="<database-name>", user="<your-username>", password="<your-password>", host="<your-hostname>", port="<your-port-number, usually 5432>")
print("Opened database successfully")
Using the connection info above, this section connects to the database and executes some queries to illustrate the database's capabilities.
import psycopg2
#We also want to import pandas and numpy to work with the data we have fetched
import pandas as pd
import numpy as np
#Connect to database; input depends on your settings
conn = psycopg2.connect(database="<db_name>", user="<user_name>", password="<password>", host="<host_name>", port="<port_number>")
print("Opened database successfully")
Opened database successfully
First, let’s go ahead and look at some of the metadata that we are working with. How many products are there per category?
'''
You need to set up a cursor before you start executing queries.
One way to look at it is that your connection, 'conn', is your ticket to the database,
while your cursor, 'cur', will be your shopping cart.
'''
cur = conn.cursor()
#execute SQL query
cur.execute('''SELECT c.category_name, count(*) AS NumProducts
               FROM Category c
               JOIN CategoryAssignment ca ON c.category_id = ca.category_id
               JOIN Product p ON ca.product_id = p.product_id
               GROUP BY c.category_name
               ORDER BY NumProducts DESC;''')
#fetch table from the cursor
category_breakdown = pd.DataFrame(cur.fetchall())
category_breakdown
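Note that cur.fetchall() returns plain tuples, so the DataFrame comes back with integer column labels. A minimal sketch of how to attach readable names and compute each category's share of products, using a few rows from the result above as stand-in data (in the live session they would come from the cursor):

```python
import pandas as pd

# Stand-in for cur.fetchall(); in the live session these rows come from the query above
rows = [("Grocery & Gourmet Food", 171760), ("Beverages", 3925), ("Cooking & Baking", 2434)]

category_breakdown = pd.DataFrame(rows, columns=["category_name", "num_products"])
# Fraction of the (sampled) product total that each category accounts for
category_breakdown["share"] = (category_breakdown["num_products"]
                               / category_breakdown["num_products"].sum())
print(category_breakdown)
```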
 | category_name | NumProducts |
---|---|---|
0 | Grocery & Gourmet Food | 171760 |
1 | Beverages | 3925 |
2 | Cooking & Baking | 2434 |
3 | Tea | 1791 |
4 | Chocolate | 1043 |
5 | Snack Foods | 983 |
6 | Coffee | 949 |
7 | Candy & Chocolate | 927 |
8 | Tea Samplers | 763 |
9 | Herbal | 583 |
10 | Sugar | 489 |
11 | Single-Serve Cups | 464 |
12 | Baby Foods | 436 |
13 | Breakfast Foods | 418 |
14 | Cereal | 412 |
15 | Nut | 411 |
... | ... | ... |
516 rows × 2 columns
What about the number of reviews per category?
#execute SQL query
cur.execute('''SELECT c.category_name, count(*) AS NumReviews
               FROM Category c
               JOIN CategoryAssignment ca ON c.category_id = ca.category_id
               JOIN Review r ON ca.product_id = r.product_id
               GROUP BY c.category_name
               ORDER BY NumReviews DESC;''')
#fetch table from the cursor
category_breakdown_reviews = pd.DataFrame(cur.fetchall())
category_breakdown_reviews
 | category_name | NumReviews |
---|---|---|
0 | Grocery & Gourmet Food | 1077410 |
1 | Beverages | 23570 |
2 | Cooking & Baking | 20297 |
3 | Tea | 9737 |
4 | Sugar | 5793 |
5 | Vinegars | 4792 |
6 | Oils | 4792 |
7 | Chocolate | 4482 |
8 | Coffee | 4431 |
9 | Candy & Chocolate | 3981 |
10 | Baby Foods | 3755 |
11 | Herbal | 3609 |
... | ... | ... |
475 rows × 2 columns
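Combining the two tables above also lets us ask which categories attract the most reviews per product. A small sketch, using a few matching rows from the two results as stand-in data (column names are illustrative, not from the database schema):

```python
import pandas as pd

# Stand-ins for the product-count and review-count query results above
products = pd.DataFrame([("Tea", 1791), ("Chocolate", 1043), ("Coffee", 949)],
                        columns=["category_name", "num_products"])
reviews = pd.DataFrame([("Tea", 9737), ("Chocolate", 4482), ("Coffee", 4431)],
                       columns=["category_name", "num_reviews"])

# Join on category and compute average reviews per product
per_category = products.merge(reviews, on="category_name")
per_category["reviews_per_product"] = per_category["num_reviews"] / per_category["num_products"]
print(per_category.sort_values("reviews_per_product", ascending=False))
```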
How about we start looking at recalled products? Maybe a certain category gets a lot of recalls?
#execute SQL query
# the join to recalledproduct already restricts the result to recalled products
cur.execute('''SELECT c.category_name, count(*) AS NumRecalls
               FROM Category c
               JOIN CategoryAssignment ca ON c.category_id = ca.category_id
               JOIN recalledproduct rp ON ca.product_id = rp.product_id
               GROUP BY c.category_name
               ORDER BY NumRecalls DESC;''')
#fetch table from the cursor
category_breakdown_recalls = pd.DataFrame(cur.fetchall())
category_breakdown_recalls
 | category_name | NumRecalls |
---|---|---|
0 | Grocery & Gourmet Food | 158 |
1 | Bars | 6 |
2 | Nut | 5 |
3 | Snack Foods | 4 |
4 | Single Herbs & Spices | 3 |
5 | Butter | 3 |
6 | Energy & Nutritional | 3 |
7 | Cereal | 3 |
8 | Candy & Chocolate | 3 |
9 | Breakfast Foods | 3 |
10 | Breakfast & Cereal Bars | 3 |
11 | Chocolate | 3 |
12 | Garlic | 2 |
13 | Peanut | 2 |
14 | Peanut Butter | 2 |
15 | Cooking & Baking | 2 |
16 | Nut Butters | 2 |
17 | Seasoned Coatings | 1 |
18 | Cocoa | 1 |
19 | Dips & Spreads | 1 |
20 | Crackers | 1 |
21 | Beef | 1 |
22 | Leaveners & Yeasts | 1 |
23 | Stews | 1 |
24 | Cookies | 1 |
25 | Chili | 1 |
26 | Allspice | 1 |
... | ... | ... |
All of our products are in the top-level Grocery & Gourmet Food category, but it looks like Canned Dry & Packaged Foods might be the biggest headache for Amazon.
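Raw recall counts favor large categories, so it can help to normalize by the number of products. A hedged sketch of computing a per-category recall rate, using a few matching rows from the earlier tables as stand-in data:

```python
import pandas as pd

# Stand-ins for rows from the product-count and recall-count queries above
products = pd.DataFrame([("Nut", 411), ("Cereal", 412), ("Chocolate", 1043)],
                        columns=["category_name", "num_products"])
recalls = pd.DataFrame([("Nut", 5), ("Cereal", 3), ("Chocolate", 3)],
                       columns=["category_name", "num_recalls"])

# Recalls per product: small categories with several recalls rise to the top
rates = products.merge(recalls, on="category_name")
rates["recall_rate"] = rates["num_recalls"] / rates["num_products"]
print(rates.sort_values("recall_rate", ascending=False))
```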
What about brands? What is the range of the number of recalls a single brand endured within the time frame of our recall dataset?
#execute SQL query
cur.execute('''SELECT b.brand_id, count(*) AS NumRecalls
               FROM Brand b
               JOIN Product p ON b.brand_id = p.brand_id
               JOIN recalledproduct rp ON p.product_id = rp.product_id
               GROUP BY b.brand_id
               ORDER BY NumRecalls DESC;''')
#fetch table from the cursor
brand_breakdown = pd.DataFrame(cur.fetchall())
brand_breakdown
 | brand_id | NumRecalls |
---|---|---|
0 | 5158 | 13 |
1 | 7169 | 8 |
2 | 2650 | 8 |
3 | 9864 | 8 |
4 | 6327 | 5 |
5 | 1154 | 5 |
6 | 8812 | 4 |
7 | 7259 | 4 |
8 | 8705 | 3 |
9 | 3321 | 3 |
10 | 4942 | 2 |
11 | 6382 | 2 |
12 | 3802 | 2 |
13 | 8864 | 2 |
14 | 6689 | 2 |
... | ... | ... |
63 rows × 2 columns
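To answer the range question directly, we can take the min and max of the recall counts. A small sketch using a few rows from the result above as stand-in data (in the full 63-row result the minimum may differ):

```python
import pandas as pd

# Stand-in for the brand query result above (brand_id, NumRecalls)
brand_breakdown = pd.DataFrame([(5158, 13), (7169, 8), (2650, 8), (6689, 2)],
                               columns=["brand_id", "num_recalls"])

# The range of recalls endured by a single brand in this sample
lo, hi = brand_breakdown["num_recalls"].min(), brand_breakdown["num_recalls"].max()
print(f"Recalls per brand range from {lo} to {hi}")
```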
Don’t forget to close your connection!
conn.close()
This tutorial can be found in Jupyter Notebook form here.