Unsafe Foods: Getting Information About Types of Recalls
In text mining, the simplest approach to classifying documents is by word count. This is often really useful when you’re dealing with corpora that have dry, well-defined subject matter. For more personal, less professional, shorter, to-the-point documents, such as social media posts or, in our case, reviews, this approach can be sub-par depending on how you are trying to classify your data. As we saw in previous weeks from the topic modeling results, the overlapping language in the reviews more often concerns the type of product than the product’s quality. As a result, the model tags any review of a product similar to a recalled product in the training set as indicating that the product should be recalled. In other words, word counts can cause the model to over-fit to the type of product.
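To make the problem concrete, here is a minimal sketch of a word-count classifier in plain Python. It is not the model used in this project: the scoring rule (summing the overlap between a review's word counts and each label's training counts), the function names, the labels, and the four toy reviews are all assumptions made up for illustration.

```python
from collections import Counter


def word_counts(text):
    """Lowercase the text, split on whitespace, and count the tokens."""
    return Counter(text.lower().split())


def train(labeled_docs):
    """Add up the word counts of all training documents for each label."""
    totals = {}
    for text, label in labeled_docs:
        totals.setdefault(label, Counter()).update(word_counts(text))
    return totals


def classify(text, totals):
    """Score each label by how strongly its training counts overlap the text."""
    counts = word_counts(text)
    scores = {
        label: sum(counts[word] * label_counts[word] for word in counts)
        for label, label_counts in totals.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical toy reviews, invented for this sketch (not project data).
training = [
    ("this peanut butter made my family sick", "recall"),
    ("peanut butter smelled off and tasted spoiled", "recall"),
    ("the granola bars were crunchy and fresh", "ok"),
    ("great tea with a smooth flavor", "ok"),
]

totals = train(training)

# A positive review that happens to be about the same type of product.
label, scores = classify("delicious peanut butter and very fresh", totals)
print(label, scores)  # recall {'recall': 5, 'ok': 2}
```

The review is clearly positive, but "peanut" and "butter" each appear twice in the recalled training reviews, so the product-type words outweigh "fresh" and the review is tagged as a recall.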