Organizing the Data

Our Approach

We decided to design and build a relational database containing the joined FDA recall and Amazon review data. This way, researchers can pick up this topic at the analytical stage, rather than spending time joining the data on unique identifiers and organizing it into a concise, easy-to-use format.

Some visualizations based on summary statistics queried from the database can be found here.

Using the Unsafe Foods Database

Understanding the Database


The Unsafe Foods Database is a relational Postgres database that facilitates querying and analysis of Amazon review data alongside historical FDA recall data. We gathered the Amazon review data from a historical dataset published by Julian McAuley at UCSD. It contains approximately 1.3 million individual reviews, each including the reviewer ID, date and time of the review, the review text, summary, rating, and product metadata.
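For a concrete sense of what one raw review looks like before it lands in the database, here is a sketch of loading a single record. The field names follow McAuley's published dataset format; the values shown are invented for illustration:

```python
import json
from datetime import datetime, timezone

# One review record in the dataset's JSON format (values are made up).
raw = '''{"reviewerID": "A1X2Y3Z4", "asin": "B000123456",
          "overall": 2.0, "summary": "Stale on arrival",
          "reviewText": "The crackers tasted off.",
          "unixReviewTime": 1357603200}'''

review = json.loads(raw)

# The review timestamp is a Unix epoch; convert it to a readable UTC date.
review_date = datetime.fromtimestamp(review["unixReviewTime"],
                                     tz=timezone.utc).date()
print(review["asin"], review["overall"], review_date)
# → B000123456 2.0 2013-01-08
```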

Connecting to the Database in Python

In Python, you need a module to connect to the Postgres database. My preferred module is psycopg2. To install this package successfully, first install its dependencies using the following commands in bash:

sudo apt-get install gcc
sudo apt-get install python-setuptools
sudo easy_install psycopg2

(Alternatively, pip install psycopg2-binary installs a precompiled build and skips the compiler dependency.)

After installing, you can invoke the module in your Python script and connect to the database using the following command and parameters:

import psycopg2

#Connect to database; fill in your own settings (all values are strings, port is usually 5432)
conn = psycopg2.connect(database="<database-name>", user="<your-user-name>",
                        password="<your-password>", host="<your-host-name>", port=5432)

print("Opened database successfully")
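If an exception is raised mid-analysis, a bare connect/close pair can leak the connection. psycopg2 follows the Python DB-API, so the same connect/cursor pattern works with any DB-API driver; this sketch uses the stdlib sqlite3 module purely so it runs without a live Postgres server — with psycopg2 you would pass the keyword arguments shown above to psycopg2.connect() instead:

```python
import sqlite3
from contextlib import closing

# closing() guarantees conn.close() runs even if a query raises.
# sqlite3 stands in for psycopg2 here; the cursor interface is the same.
with closing(sqlite3.connect(":memory:")) as conn:
    cur = conn.cursor()
    cur.execute("SELECT 1")
    result = cur.fetchone()
    print("Opened database successfully:", result)
```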

Querying the Database

Using the above connection info, this section will connect to the database and execute some queries to illustrate the database’s capabilities.

import psycopg2

#We also want to import pandas and numpy to work with the data we have fetched
import pandas as pd
import numpy as np

#Connect to database; input depends on your settings (all values are strings, port is usually 5432)
conn = psycopg2.connect(database="<db_name>", user="<user_name>",
                        password="<password>", host="<host_name>", port=5432)

print("Opened database successfully")
Opened database successfully

First, let’s go ahead and look at some of the metadata that we are working with. How many products are there per category?

'''
You need to set up a cursor before executing queries.
One way to look at it: your connection, 'conn', is your ticket into the database,
while your cursor, 'cur', is your shopping cart.
'''
cur = conn.cursor()
#execute SQL query
cur.execute('''SELECT c.category_name, COUNT(*) AS num_products
               FROM Category c
               JOIN CategoryAssignment ca ON c.category_id = ca.category_id
               JOIN Product p ON ca.product_id = p.product_id
               GROUP BY c.category_name
               ORDER BY num_products DESC;''')

#fetch rows from the cursor and label the columns
category_breakdown = pd.DataFrame(cur.fetchall(),
                                  columns=['category_name', 'num_products'])
category_breakdown
   category_name  num_products
0 Grocery & Gourmet Food 171760
1 Beverages 3925
2 Cooking & Baking 2434
3 Tea 1791
4 Chocolate 1043
5 Snack Foods 983
6 Coffee 949
7 Candy & Chocolate 927
8 Tea Samplers 763
9 Herbal 583
10 Sugar 489
11 Single-Serve Cups 464
12 Baby Foods 436
13 Breakfast Foods 418
14 Cereal 412
15 Nut 411
... ... ...

516 rows × 2 columns
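If you prefer not to hard-code column labels, the cursor's description attribute carries the result column names after every execute(), so headers can be recovered programmatically. A minimal standalone sketch (using the stdlib sqlite3 module with a toy table rather than psycopg2, so it runs without a database server; the DB-API cursor interface is the same):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE category (category_name TEXT)")
cur.executemany("INSERT INTO category VALUES (?)",
                [("Tea",), ("Coffee",), ("Tea",)])
cur.execute("SELECT category_name, COUNT(*) AS num_products "
            "FROM category GROUP BY category_name "
            "ORDER BY num_products DESC")

# cursor.description holds one 7-tuple per result column; item 0 is its name.
columns = [d[0] for d in cur.description]
rows = cur.fetchall()
print(columns)   # ['category_name', 'num_products']
print(rows)      # [('Tea', 2), ('Coffee', 1)]
conn.close()
```

With psycopg2 the same list comprehension can feed pd.DataFrame's columns argument directly.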

What about the number of reviews per category?

#execute SQL query
cur.execute('''SELECT c.category_name, COUNT(*) AS num_reviews
               FROM Category c
               JOIN CategoryAssignment ca ON c.category_id = ca.category_id
               JOIN Review r ON ca.product_id = r.product_id
               GROUP BY c.category_name
               ORDER BY num_reviews DESC;''')

#fetch rows from the cursor and label the columns
category_breakdown_reviews = pd.DataFrame(cur.fetchall(),
                                          columns=['category_name', 'num_reviews'])
category_breakdown_reviews
   category_name  num_reviews
0 Grocery & Gourmet Food 1077410
1 Beverages 23570
2 Cooking & Baking 20297
3 Tea 9737
4 Sugar 5793
5 Vinegars 4792
6 Oils 4792
7 Chocolate 4482
8 Coffee 4431
9 Candy & Chocolate 3981
10 Baby Foods 3755
11 Herbal 3609
... ... ...

475 rows × 2 columns

How about we start looking at recalled products? Maybe a certain category gets a lot of recalls?

#execute SQL query
cur.execute('''SELECT c.category_name, COUNT(*) AS num_recalls
               FROM Category c
               JOIN CategoryAssignment ca ON c.category_id = ca.category_id
               JOIN recalledproduct rp ON ca.product_id = rp.product_id
               GROUP BY c.category_name
               ORDER BY num_recalls DESC;''')

#fetch rows from the cursor and label the columns
category_breakdown_recalls = pd.DataFrame(cur.fetchall(),
                                          columns=['category_name', 'num_recalls'])
category_breakdown_recalls
   category_name  num_recalls
0 Grocery & Gourmet Food 158
1 Bars 6
2 Nut 5
3 Snack Foods 4
4 Single Herbs & Spices 3
5 Butter 3
6 Energy & Nutritional 3
7 Cereal 3
8 Candy & Chocolate 3
9 Breakfast Foods 3
10 Breakfast & Cereal Bars 3
11 Chocolate 3
12 Garlic 2
13 Peanut 2
14 Peanut Butter 2
15 Cooking & Baking 2
16 Nut Butters 2
17 Seasoned Coatings 1
18 Cocoa 1
19 Dips & Spreads 1
20 Crackers 1
21 Beef 1
22 Leaveners & Yeasts 1
23 Stews 1
24 Cookies 1
25 Chili 1
26 Allspice 1
... ... ...

All of our products fall under the top-level Grocery & Gourmet Food category, but among the subcategories, Bars and Nut products look like the biggest headache for Amazon.
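Raw recall counts favor big categories, so a natural next step is to normalize by review volume. A quick sketch using a few of the counts queried above (Tea's recall count is assumed to be zero here only because it does not appear in the recall table shown):

```python
# Reviews and recalls per subcategory, taken from the query results above.
reviews = {"Tea": 9737, "Chocolate": 4482, "Candy & Chocolate": 3981}
recalls = {"Chocolate": 3, "Candy & Chocolate": 3}

# Recalls per 1,000 reviews; categories missing from `recalls` count as zero.
rate = {cat: 1000 * recalls.get(cat, 0) / n for cat, n in reviews.items()}
for cat, r in sorted(rate.items(), key=lambda kv: -kv[1]):
    print(f"{cat}: {r:.2f} recalls per 1,000 reviews")
```

On these numbers, Candy & Chocolate edges out Chocolate once review volume is accounted for.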

What about brands? How many recalls did individual brands endure within the time frame of our recall data set?

#execute SQL query
cur.execute('''SELECT b.brand_id, COUNT(*) AS num_recalls
               FROM Brand b
               JOIN Product p ON b.brand_id = p.brand_id
               JOIN recalledproduct rp ON p.product_id = rp.product_id
               GROUP BY b.brand_id
               ORDER BY num_recalls DESC;''')

#fetch rows from the cursor and label the columns
brand_breakdown = pd.DataFrame(cur.fetchall(),
                               columns=['brand_id', 'num_recalls'])
brand_breakdown
   brand_id  num_recalls
0 5158 13
1 7169 8
2 2650 8
3 9864 8
4 6327 5
5 1154 5
6 8812 4
7 7259 4
8 8705 3
9 3321 3
10 4942 2
11 6382 2
12 3802 2
13 8864 2
14 6689 2
... ... ...

63 rows × 2 columns
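To drill into a single brand, pass values to the driver separately rather than formatting them into the SQL string; this avoids both quoting bugs and SQL injection. psycopg2 uses %s placeholders with a parameter tuple; the sketch below uses stdlib sqlite3 (whose placeholder is ?) with a toy table so it runs standalone, and the brand IDs echo the table above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE recalledproduct (product_id INTEGER, brand_id INTEGER)")
cur.executemany("INSERT INTO recalledproduct VALUES (?, ?)",
                [(1, 5158), (2, 5158), (3, 7169)])

# Pass the value as a one-element tuple; the driver handles quoting/escaping.
# With psycopg2 the placeholder would be %s instead of ?.
brand_id = 5158
cur.execute("SELECT COUNT(*) FROM recalledproduct WHERE brand_id = ?",
            (brand_id,))
num_recalls = cur.fetchone()[0]
print(num_recalls)   # 2
conn.close()
```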

Don’t forget to close your cursor and connection!

cur.close()
conn.close()

This tutorial can be found in Jupyter Notebook form here.