Motivation

Background

The production and consumption of written news has changed dramatically over the past decade. Specifically, the move from predominantly print to predominantly online formats has heralded numerous new challenges. One such challenge is the proliferation of disinformation websites and the corresponding spread of unreliable, untrustworthy, or even purposefully incendiary or deceptive content. In the US and globally, this proliferation of online disinformation has eroded public trust in mainstream media, political actors, and even public health officials.

In regard to the ongoing coronavirus pandemic, the deleterious impact of disinformation has been, and continues to be, substantial. Disinformation related to the coronavirus poses a broad threat to national and global public health, particularly if individuals disregard the public health measures (e.g., social distancing, wearing a mask) necessary to mitigate the continued transmission of the virus. Further, this type of disinformation has contributed to the direct loss of life when individuals self-administered unproven medications promoted by online disinformation.

Examples of coronavirus-related disinformation narratives:

Image of disinformation narratives. Source: https://disinformationindex.org/wp-content/uploads/2020/07/GDI_Ad-funded-COVID-19-Disinformation-1.pdf

Examples of possible harms of coronavirus-related disinformation narratives:

Image of possible harms. Source: https://disinformationindex.org/wp-content/uploads/2020/07/GDI_Ad-funded-COVID-19-Disinformation-1.pdf

Our partner organization, The Global Disinformation Index (GDI), is already engaged in work to identify and defund disinformation on the internet. Their current focus involves identifying disinformation websites by scoring the degree of disinformation risk at the domain level.

To this end, GDI, under the general guidance of the Technical Advisory Group,[1] has developed and piloted an index that provides disinformation risk ratings for news sites in media markets across the globe. It is intended to give brands and ad tech firms more transparent and trusted information about the news sites where they place ads, allowing them in principle to make more informed decisions about where ads are served, based on their own self-determined risk thresholds. You can read more about how GDI computes this index here.

This DSSG project focused on addressing online disinformation related to coronavirus. In collaboration with GDI, we produced a neural network model that accurately classified articles as disinformation or not; this work may generalize to other substantive domains or to disinformation more broadly. Our goal was to add transparency to the process by which brands and advertisers fund disinformation. The existing tools available to advertisers, such as blocking websites based on the presence of certain keywords, are blunt. Accordingly, we wanted to build a tool that helps advertisers and brands pull their ads from disinformation websites with greater specificity than keyword blocking allows. This greater specificity and granularity not only improves the precision with which disinformation sites are targeted, but also protects the ad revenue of legitimate news sites that keyword-based blocking currently penalizes, while limiting the profitability of disinformation websites.
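To make the contrast concrete, the sketch below compares a keyword blocklist, which makes a hard binary decision from surface features, with a trained classifier that outputs a graded risk score an advertiser can threshold. It is a minimal illustration only: the keyword list, toy training data, and scikit-learn pipeline are assumptions for the example, not the model this project built.

```python
# Minimal sketch contrasting a keyword blocklist with a learned classifier.
# The keyword list, toy training data, and model choice are illustrative
# assumptions, not the pipeline built in this project.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

BLOCKED_KEYWORDS = {"coronavirus", "covid"}  # hypothetical blocklist

def keyword_block(article: str) -> bool:
    """Blunt rule: block any article that mentions a listed keyword,
    even legitimate public-health reporting."""
    text = article.lower()
    return any(keyword in text for keyword in BLOCKED_KEYWORDS)

# A classifier instead assigns each article a graded risk score, so an
# ad buyer can apply a self-determined threshold rather than blocking
# an entire topic outright.
train_texts = [
    "5G towers secretly spread the coronavirus",      # disinformation
    "Health officials report new covid case counts",  # legitimate
]
train_labels = [1, 0]  # 1 = disinformation, 0 = legitimate

classifier = make_pipeline(TfidfVectorizer(), LogisticRegression())
classifier.fit(train_texts, train_labels)

article = "Local hospital expands covid testing capacity"
print(keyword_block(article))                     # True: hard block on topic
print(classifier.predict_proba([article])[0][1])  # graded risk score in [0, 1]
```

Note how the keyword rule blocks a legitimate public-health story outright, whereas the classifier's probability leaves the blocking decision, and its threshold, to the advertiser.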

Questions

  1. What distinguishes disinformation from legitimate news online?

  2. What methods can be used to automate classification of online news content as legitimate or disinformation?

  3. What preprocessing steps are necessary to enable the success of a classification model that depends on natural language processing methods? (An illustrative preprocessing sketch follows this list.)

  4. How can a machine learning model be built to detect disinformation in a perpetually evolving landscape of disinformation narratives?
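For question 3, the sketch below illustrates preprocessing steps commonly applied before training a text classifier: lowercasing, stripping URLs and punctuation, whitespace tokenization, and stop-word removal. The function, the toy stop-word list, and the example sentence are illustrative assumptions, not this project's actual pipeline.

```python
# Illustrative preprocessing steps often applied before training a text
# classifier; a generic sketch, not this project's actual pipeline.
import re

STOP_WORDS = {"the", "a", "an", "of", "to", "and", "is", "in"}  # toy list

def preprocess(article: str) -> list[str]:
    text = article.lower()                     # normalize case
    text = re.sub(r"https?://\S+", " ", text)  # strip URLs
    text = re.sub(r"[^a-z\s]", " ", text)      # drop punctuation and digits
    tokens = text.split()                      # whitespace tokenization
    return [t for t in tokens if t not in STOP_WORDS]

print(preprocess("Masks do NOT work, says https://example.com/fake-study!"))
# -> ['masks', 'do', 'not', 'work', 'says']
```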

Stakeholders

There were several direct and indirect stakeholders in this project. The direct stakeholders were involved in the early stages of developing the models, while the indirect stakeholders were those who might be affected by the operation of the model or engage with it at later stages. Our team worked closely with the following direct stakeholders:

Indirect stakeholders are individuals who will be affected by the model after release but were not directly involved in its initial design. GDI will continue to work with these stakeholders after the model is released. The indirect stakeholders of this project are:

Ethics

The primary ethical considerations of this project were:

At the heart of this problem lies the difficulty of measuring the extent of disinformation in a given article. Are there half-truths present? Does the article contain mostly factual information save for a couple of sentences? Any tool built to automatically classify articles as disinformation must therefore draw a line along this spectrum between truthful and disinformation content.

Publicly labelling an article as “disinformation” must be supported by substantial evidence and rigorous protocols, which may be lacking in some of the machine learning models the team chose to use. The model is also limited by the nuances and biases of its training data.

We have taken the following steps to address these ethical concerns:

Besides being good general practice, detailed documentation of the process makes the user aware of the model's blind spots.

Good documentation also helped the team lay out its assumptions and identify any biases it might have built into the model.

What is the predicted probability that an article is disinformation or legitimate? What hyper-parameters does the model use? How accurate must a model be to be selected as the final model? Answering these and similar questions about metrics helps the user validate the work being done.

The team built a variety of models and ran validation tests on them to make an informed decision about the final model. Assessing each model against the suitability and context of the problem further reduced subjectivity around the outcomes.


[1] Index members include Camille François (Graphika), Dr. Scott Hale (Meedan/Oxford Internet Institute), Olaf Steenfadt (Reporters Without Borders/Journalism Trust Initiative), and Cristina Tardáguila (International Fact Checking Network/Poynter).