Unsafe Foods: Getting Information About Types of Recalls

In text mining, the simplest approach to classifying documents is by word count. This is often useful when you're dealing with corpora that have dry, well-defined subject matter. For more personal, less formal, shorter documents, such as social media posts or, in our case, reviews, this approach can be sub-par depending on how you are trying to classify your data. As we saw in previous weeks from the results of topic modeling, the overlapping language in the reviews more often concerns the type of product than the product's quality. This causes the model to tag any review of a product similar to a recalled product in the training set as indicating that the product should be recalled. In other words, word counts can overfit our results to the type of product.
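The effect described above can be reproduced with a minimal sketch. The reviews, labels, and function names below are made up for illustration and are not the team's data or model; the classifier simply compares a review's word counts with the summed word counts of each class using cosine similarity.

```python
from collections import Counter
import math
import re


def word_counts(text):
    """Lowercase the text and count its word tokens."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train(labeled_reviews):
    """Sum the word counts of all reviews sharing a label (one profile per class)."""
    profiles = {}
    for text, label in labeled_reviews:
        profiles.setdefault(label, Counter()).update(word_counts(text))
    return profiles


def classify(profiles, text):
    """Return the label whose word-count profile is most similar to the text."""
    counts = word_counts(text)
    return max(profiles, key=lambda label: cosine(profiles[label], counts))


# Hypothetical toy reviews: every "recall" example happens to be peanut butter.
TRAINING = [
    ("this peanut butter made my kids sick", "recall"),
    ("peanut butter smelled off and the jar was bulging", "recall"),
    ("great coffee beans with a rich flavor", "no_recall"),
    ("the tea arrived fast and tastes great", "no_recall"),
]

profiles = train(TRAINING)

# A positive review of the same product type is still tagged "recall",
# because the shared words describe the product, not its quality.
prediction = classify(profiles, "delicious peanut butter, the jar arrived fast")
print(prediction)
```

In this toy setup the positive peanut butter review lands in the "recall" class purely because of product words such as "peanut", "butter", and "jar", which is the kind of overfitting to product type the paragraph describes.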

Cynthia Vint

ORCA: Fifty Shades of Biases

As discussed in our previous post, three main forms of potential bias could limit the usefulness of ORCA data in informing transit planning: 1) ORCA vs. APC (automatic passenger count)/ridership. If some routes, times, or stops have a higher percentage of passengers using ORCA cards, then those routes, times, or stops would be more influential when ORCA card data are used for planning purposes, which would introduce bias. 2) APC vs. ridership, because only a proportion of King County Metro buses have APC counts. If some routes or stops have a higher percentage of journeys with APC counts, then those routes or stops would be more influential when APC data are used for planning purposes. 3) Ridership vs. the population of Puget Sound. If some communities currently use buses more, then those communities are more likely to be overrepresented in our data and thus carry more weight in planning decisions. For equity purposes, the ORCA team would like to understand these biases before we make planning suggestions based on ORCA and APC data.
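The first form of bias can be illustrated with a small sketch. The route names and counts below are hypothetical, and the functions are only one plausible way to express the idea: the share of APC-counted boardings that appear as ORCA taps is computed per route and then compared with the overall share.

```python
def orca_share(orca_taps, apc_boardings):
    """ORCA taps divided by APC boardings, for each route with APC data."""
    shares = {}
    for route, boardings in apc_boardings.items():
        if boardings > 0:
            shares[route] = orca_taps.get(route, 0) / boardings
    return shares


def relative_weight(orca_taps, apc_boardings):
    """Each route's ORCA share relative to the overall ORCA share.

    A value above 1 means the route is overrepresented when ORCA data
    stand in for ridership; a value below 1 means it is underrepresented.
    """
    shares = orca_share(orca_taps, apc_boardings)
    total_orca = sum(orca_taps.get(route, 0) for route in shares)
    total_apc = sum(apc_boardings[route] for route in shares)
    if total_orca == 0:
        return {}  # no ORCA taps at all, so relative weights are undefined
    overall = total_orca / total_apc
    return {route: share / overall for route, share in shares.items()}


# Hypothetical counts for two routes with identical ridership.
orca_taps = {"route_a": 800, "route_b": 300}
apc_boardings = {"route_a": 1000, "route_b": 1000}

shares = orca_share(orca_taps, apc_boardings)
weights = relative_weight(orca_taps, apc_boardings)
for route in sorted(weights):
    print(route, round(shares[route], 2), round(weights[route], 2))
```

Both hypothetical routes carry the same number of riders, yet the route with more ORCA users would count for more in any analysis based on ORCA taps alone.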

Alicia Shen

Crowdsensing the Census - Exploring Street Network

Last week we wanted to extract more information from Milan's OSM data, so we visualized and analyzed its street network. We aim to find a relationship between street connectivity and social deprivation. We also wanted to investigate how this relationship differs between Milan and a city in a developing country, specifically Mexico City.

Carlos Espino

OpenSidewalks at #SOTMUS

For the past several weeks, the OpenSidewalks team has been actively preparing for State of the Map US (SOTMUS), an annual gathering for the US-focused OpenStreetMap (OSM) community. The conference includes presentations from researchers, professionals, and hobbyists, along with lightning talks and hands-on workshops. The OpenSidewalks team had the opportunity to present a brief overview of our proposal for standardizing conventions around sidewalk tagging in OSM.

Jess Hamilton and Meg Drouhard

OpenStreetMap: Mostly Amazing (Sometimes Troublesome)

Over the past couple of weeks, part of the CrowdSensing Census team has been extracting, parsing, and analyzing OpenStreetMap (OSM) data as a component of our heterogeneous-based tool for estimating poverty alongside call detail record (CDR) data.

Rachael Dottle

Unsafe Foods Team Visits the Department of Health

The Unsafe Foods team began week 4 with a field trip to the Washington State Department of Health’s Division of Disease Control & Health Statistics. We met with state epidemiologists who gave us insight into their process for monitoring foodborne illness outbreaks and recommending that food products be recalled.

Kara Woo

Engaging the community to advocate for open sidewalks

Getting around the city quickly and arriving on time can be a challenge, even when we are familiar with the urban space we are moving through. Fortunately, most of us have digital maps to help us.

Thomas Disley

ORCA: Trying to understand bias in ORCA data

This past week the ORCA team started to tackle the question of bias mentioned in our previous blog post. As Sean explained in his overview of the ORCA team's project charter, there are three main sources of bias that we feel are both substantively important and practically feasible to investigate during this summer program. We decided to start with the second level of bias, the differences between ORCA tap counts and Automatic Passenger Count (APC) data, focusing first on Pierce Transit and Community Transit because their vehicle fleets are fully covered by APC gauges. While this is the most straightforward analysis of the three, attempting to say something basic about the differences between ORCA and APC counts of ridership underscored the iterative nature of research, especially with such large and complex datasets.

Victoria Sass

Unsafe Foods: Text Wrangling and Topic Modeling

Think back to the last time you wrote an online review. If you prioritized your spelling, grammar, and formatting, the Unsafe Foods team would like to personally thank you. This past week, our team started work on several items, including the tedious task of cleaning Amazon review text data for preliminary analysis. Compared to other sources of text, which generally go through a rigorous editing and review process (e.g., newspaper articles), the unregulated world of online text presents data scientists with obvious cleaning and analysis challenges. However, the quantity of online text data and the relative ease of accessing it make it a great source for projects that aim to analyze issues facing the general population (including our project!).

Mike Munsell

Introducing DSSG Fellow Thomas Disley

‘Hello World’ :)

Thomas Disley

ORCA: Transactions data and bias estimates

Following two weeks of intense tutorials on various data science tools, the ORCA team is very excited to finally enter the substantive portion of our project this summer! The four fellows (Carolina, Victoria, Alicia, and Sean) have been busy exploring the datasets in Week 3, along with our project leads and data scientists. One of the most intriguing aspects of our project is the data we are working with. We begin with two datasets, each containing nine weeks' worth of ORCA farecard transactions. These data are large and noisy; there are endless possibilities for analysis, but a significant amount of data cleaning will be necessary. In addition to the raw transactions data, we inherited datasets containing trip and transfer information (with estimated origins and destinations) generated from the transactions data. These three datasets, in conjunction with Automatic Vehicle Location (AVL) and Automatic Passenger Count (APC) data provided by various transit agencies in the Puget Sound region, give us a solid foundation going forward.

Sean Wang

Learning How to Design as Allies Not Community Members.

Designing a map is essentially a process of abstraction, and this shouldn’t be treated as a trivial process.

Open Sidewalks Team

CrowdCensus Team's Data Exploration

The goal of the CrowdSensing Census project in DSSG 2016 is to develop a reliable and general model that can predict the socio-economic levels of a city using other real-time data, such as OpenStreetMap data or call detail record (CDR) data. This work is meaningful because it improves data freshness and cost efficiency compared to census data: government-initiated socio-economic statistics are expensive and often outdated (10-year term).

Imam and Myeong

Introducing DSSG Fellow Imam Subkhan

It was in February 2016 that I received a circulated email from the UW Department of Anthropology about the opportunity to join the DSSG summer program at the eScience Institute. The email said that the program is designed to impact public policy for social benefit, which has been my primary concern as a first-year graduate student in anthropology. Anthropology is concerned with human culture, including the socioeconomic conditions in a particular region and how they relate to the public policy produced by the government.

Imam Subkhan

Unsafe Foods: Web Scraping and Key Matching

The Unsafe Foods project is going to be very exciting and challenging this summer. We get to experiment with advanced text mining practices and attempt to build a predictive model for product recalls from Amazon reviews. As intriguing as this topic is, we will not achieve our goals without sifting through some muddy datasets and determining what we can accomplish given our time frame and resources.

Cynthia Vint