We decided to design and build a relational database containing the joined FDA recall and Amazon review data. This way, researchers can pick up this topic at the analytical stage, rather than spending their time joining the data on unique identifiers and organizing it into a concise, easy-to-use format.
Some visualizations based on summary statistics queried from the database can be found here.
The Unsafe Foods Database is a relational Postgres database that facilitates querying and analyzing Amazon review data alongside historical FDA recall data. We gathered the Amazon review data from a historical dataset published by Julian McAuley at UCSD. It contains approximately 1.3 million individual reviews, each including the reviewer ID, the date and time of the review, the review text, summary, and rating, along with product metadata.
In Python, you need a module to connect to the Postgres database. My preferred module is psycopg2. To install it successfully, first install its build dependencies, then the package itself, using the following commands in bash:
sudo apt-get install gcc python3-dev libpq-dev
pip install psycopg2
After installing, you can invoke the module in your Python script and connect to the database using the following command and parameters:
import psycopg2
#Connect to database
conn = psycopg2.connect(database="<database-name>", user="<your-username>", password="<your-password>", host="<your-hostname>", port="<your-port-number, usually 5432>")
print("Opened database successfully")
Using the connection info above, this section connects to the database and executes some queries to illustrate the database's capabilities.
import psycopg2
#We also want to import pandas and numpy to work with the data we have fetched
import pandas as pd
import numpy as np
#Connect to database; input depends on your settings
conn = psycopg2.connect(database="<db_name>", user="<user_name>", password="<password>", host="<host_name>", port="<port_number>")
print("Opened database successfully")
Opened database successfully
First, let’s go ahead and look at some of the metadata that we are working with. How many products are there per category?
'''
You need to set up a cursor before you start executing queries.
One way to look at it is that your connection, 'conn', is your ticket to the database,
while your cursor, 'cur', will be your shopping cart.
'''
cur = conn.cursor()
#execute SQL query
cur.execute('''SELECT c.category_name, count(*) AS NumProducts
               FROM Category c
               JOIN CategoryAssignment ca ON c.category_id = ca.category_id
               JOIN Product p ON ca.product_id = p.product_id
               GROUP BY c.category_name
               ORDER BY NumProducts DESC;''')
#fetch table from the cursor
category_breakdown = pd.DataFrame(cur.fetchall())
category_breakdown
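Note that cur.fetchall() returns plain tuples, so the DataFrame comes back with integer column labels. A minimal sketch of how to attach readable names and compute each category's share of products, using a few rows from the result above as stand-in data (in the live session they would come from the cursor):

```python
import pandas as pd

# Stand-in for cur.fetchall(); in the live session these rows come from the query above
rows = [("Grocery & Gourmet Food", 171760), ("Beverages", 3925), ("Cooking & Baking", 2434)]

category_breakdown = pd.DataFrame(rows, columns=["category_name", "num_products"])
# Fraction of the (sampled) product total that each category accounts for
category_breakdown["share"] = (category_breakdown["num_products"]
                               / category_breakdown["num_products"].sum())
print(category_breakdown)
```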
 | category_name | NumProducts |
---|---|---|
0 | Grocery & Gourmet Food | 171760 |
1 | Beverages | 3925 |
2 | Cooking & Baking | 2434 |
3 | Tea | 1791 |
4 | Chocolate | 1043 |
5 | Snack Foods | 983 |
6 | Coffee | 949 |
7 | Candy & Chocolate | 927 |
8 | Tea Samplers | 763 |
9 | Herbal | 583 |
10 | Sugar | 489 |
11 | Single-Serve Cups | 464 |
12 | Baby Foods | 436 |
13 | Breakfast Foods | 418 |
14 | Cereal | 412 |
15 | Nut | 411 |
... | ... | ... |
516 rows × 2 columns
What about the number of reviews per category?
#execute SQL query
cur.execute('''SELECT c.category_name, count(*) AS NumReviews
               FROM Category c
               JOIN CategoryAssignment ca ON c.category_id = ca.category_id
               JOIN Review r ON ca.product_id = r.product_id
               GROUP BY c.category_name
               ORDER BY NumReviews DESC;''')
#fetch table from the cursor
category_breakdown_reviews = pd.DataFrame(cur.fetchall())
category_breakdown_reviews
 | category_name | NumReviews |
---|---|---|
0 | Grocery & Gourmet Food | 1077410 |
1 | Beverages | 23570 |
2 | Cooking & Baking | 20297 |
3 | Tea | 9737 |
4 | Sugar | 5793 |
5 | Vinegars | 4792 |
6 | Oils | 4792 |
7 | Chocolate | 4482 |
8 | Coffee | 4431 |
9 | Candy & Chocolate | 3981 |
10 | Baby Foods | 3755 |
11 | Herbal | 3609 |
... | ... | ... |
475 rows × 2 columns
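Combining the two tables above also lets us ask which categories attract the most reviews per product. A small sketch, using a few matching rows from the two results as stand-in data (column names are illustrative, not from the database schema):

```python
import pandas as pd

# Stand-ins for the product-count and review-count query results above
products = pd.DataFrame([("Tea", 1791), ("Chocolate", 1043), ("Coffee", 949)],
                        columns=["category_name", "num_products"])
reviews = pd.DataFrame([("Tea", 9737), ("Chocolate", 4482), ("Coffee", 4431)],
                       columns=["category_name", "num_reviews"])

# Join on category and compute average reviews per product
per_category = products.merge(reviews, on="category_name")
per_category["reviews_per_product"] = per_category["num_reviews"] / per_category["num_products"]
print(per_category.sort_values("reviews_per_product", ascending=False))
```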
How about we start looking at recalled products? Maybe a certain category gets a lot of recalls?
#execute SQL query
# the join to recalledproduct already restricts the result to recalled products
cur.execute('''SELECT c.category_name, count(*) AS NumRecalls
               FROM Category c
               JOIN CategoryAssignment ca ON c.category_id = ca.category_id
               JOIN recalledproduct rp ON ca.product_id = rp.product_id
               GROUP BY c.category_name
               ORDER BY NumRecalls DESC;''')
#fetch table from the cursor
category_breakdown_recalls = pd.DataFrame(cur.fetchall())
category_breakdown_recalls
 | category_name | NumRecalls |
---|---|---|
0 | Grocery & Gourmet Food | 158 |
1 | Bars | 6 |
2 | Nut | 5 |
3 | Snack Foods | 4 |
4 | Single Herbs & Spices | 3 |
5 | Butter | 3 |
6 | Energy & Nutritional | 3 |
7 | Cereal | 3 |
8 | Candy & Chocolate | 3 |
9 | Breakfast Foods | 3 |
10 | Breakfast & Cereal Bars | 3 |
11 | Chocolate | 3 |
12 | Garlic | 2 |
13 | Peanut | 2 |
14 | Peanut Butter | 2 |
15 | Cooking & Baking | 2 |
16 | Nut Butters | 2 |
17 | Seasoned Coatings | 1 |
18 | Cocoa | 1 |
19 | Dips & Spreads | 1 |
20 | Crackers | 1 |
21 | Beef | 1 |
22 | Leaveners & Yeasts | 1 |
23 | Stews | 1 |
24 | Cookies | 1 |
25 | Chili | 1 |
26 | Allspice | 1 |
... | ... | ... |
All of our products are in the top-level Grocery & Gourmet Food category, but it looks like Canned Dry & Packaged Foods might be the biggest headache for Amazon.
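Raw recall counts favor large categories, so it can help to normalize by the number of products. A hedged sketch of computing a per-category recall rate, using a few matching rows from the earlier tables as stand-in data:

```python
import pandas as pd

# Stand-ins for rows from the product-count and recall-count queries above
products = pd.DataFrame([("Nut", 411), ("Cereal", 412), ("Chocolate", 1043)],
                        columns=["category_name", "num_products"])
recalls = pd.DataFrame([("Nut", 5), ("Cereal", 3), ("Chocolate", 3)],
                       columns=["category_name", "num_recalls"])

# Recalls per product: small categories with several recalls rise to the top
rates = products.merge(recalls, on="category_name")
rates["recall_rate"] = rates["num_recalls"] / rates["num_products"]
print(rates.sort_values("recall_rate", ascending=False))
```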
What about brands? What is the range of the number of recalls a single brand endured within the time frame of our recall dataset?
#execute SQL query
cur.execute('''SELECT b.brand_id, count(*) AS NumRecalls
               FROM Brand b
               JOIN Product p ON b.brand_id = p.brand_id
               JOIN recalledproduct rp ON p.product_id = rp.product_id
               GROUP BY b.brand_id
               ORDER BY NumRecalls DESC;''')
#fetch table from the cursor
brand_breakdown = pd.DataFrame(cur.fetchall())
brand_breakdown
 | brand_id | NumRecalls |
---|---|---|
0 | 5158 | 13 |
1 | 7169 | 8 |
2 | 2650 | 8 |
3 | 9864 | 8 |
4 | 6327 | 5 |
5 | 1154 | 5 |
6 | 8812 | 4 |
7 | 7259 | 4 |
8 | 8705 | 3 |
9 | 3321 | 3 |
10 | 4942 | 2 |
11 | 6382 | 2 |
12 | 3802 | 2 |
13 | 8864 | 2 |
14 | 6689 | 2 |
... | ... | ... |
63 rows × 2 columns
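To answer the range question directly, we can take the min and max of the recall counts. A small sketch using a few rows from the result above as stand-in data (in the full 63-row result the minimum may differ):

```python
import pandas as pd

# Stand-in for the brand query result above (brand_id, NumRecalls)
brand_breakdown = pd.DataFrame([(5158, 13), (7169, 8), (2650, 8), (6689, 2)],
                               columns=["brand_id", "num_recalls"])

# The range of recalls endured by a single brand in this sample
lo, hi = brand_breakdown["num_recalls"].min(), brand_breakdown["num_recalls"].max()
print(f"Recalls per brand range from {lo} to {hi}")
```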
Don’t forget to close your connection!
conn.close()
This tutorial can be found in Jupyter Notebook form here.