Weekly Feature: “How Our Data Encodes Systematic Racism”

Story on AI and racial equality

May 3, 2022
4 minute read

Every week during February, Black History Month, Hold The Code plans to feature a story related to AI and racial equality.

In an article published by the MIT Technology Review, Deborah Raji explains several ways that data encodes systematic racism. Citing examples that range from predictive policing tools that disproportionately affect communities of color to self-driving cars that are more likely to hit Black pedestrians, Raji writes:

“Data sets so specifically built in and for white spaces represent the constructed reality, not the natural one.”

She argues that we must resist technological determinism and accept responsibility for the technology we create. There is a tendency to view data as perfectly objective, removed from our own biases.

For Example

  • GPT-2 is an automated language generator developed by OpenAI. It produces responses that replicate the language patterns observed in its data set.
  • But: the data set it uses, named WebText, gathers its data from links shared in upvoted posts on Reddit, a social media site rife with racist language and ideas.
  • So: when given simple prompts like “a white man is” or “a Black woman is,” the text GPT-2 generates often includes horrific slurs and direct threats.
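
To make the mechanism in the bullets above concrete, here is a minimal sketch of a toy text generator. It is not GPT-2 and is not taken from Raji's article: it is a simple word-pair counter, and the two tiny corpora are hypothetical. It only illustrates the general point that a generator trained to replicate patterns will echo whatever its data set says most often.

```python
from collections import Counter, defaultdict


def train_bigrams(corpus):
    """Count, for each word, which words follow it in the corpus."""
    follows = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for current, nxt in zip(words, words[1:]):
            follows[current][nxt] += 1
    return follows


def complete(follows, prompt, max_words=5):
    """Extend the prompt by repeatedly appending the most frequent next word."""
    words = prompt.lower().split()
    if not words:
        return ""
    for _ in range(max_words):
        candidates = follows.get(words[-1])
        if not candidates:
            break  # the model has never seen anything after this word
        words.append(candidates.most_common(1)[0][0])
    return " ".join(words)


# Hypothetical toy corpora: same subject, different dominant description.
corpus_a = [
    "the engineer is careful",
    "the engineer is careful",
    "the engineer is late",
]
corpus_b = [
    "the engineer is lazy",
    "the engineer is lazy",
    "the engineer is careful",
]

model_a = train_bigrams(corpus_a)
model_b = train_bigrams(corpus_b)

print(complete(model_a, "the engineer is"))  # the engineer is careful
print(complete(model_b, "the engineer is"))  # the engineer is lazy
```

The code is identical for both models; only the data differs, and so does the completion. Real language models are far more complex, but the dependence on the data set is the same in kind.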

The Path Forward

According to Raji, the machine-learning community problematically accepts a level of dysfunction, displacing blame from humans to machines. Only by recognizing this, Raji argues, can technologists begin to institute better practices, such as disclosing data provenance, deleting problematic data sets, and explicitly defining the limitations of every model’s scope.
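
As one possible illustration of the practices listed above, the sketch below shows how a team might record disclosures alongside a data set and flag the ones still missing. The class name, field names, and example values are all hypothetical; Raji's article does not prescribe a format.

```python
from dataclasses import dataclass, field, fields


@dataclass
class DatasetDisclosure:
    """Hypothetical record of what a team discloses about a data set."""
    name: str
    provenance: str = ""  # where the data came from
    known_problems: list = field(default_factory=list)  # e.g. problematic content found
    model_scope_limits: list = field(default_factory=list)  # uses the model is not meant for


def missing_disclosures(record):
    """Return the names of the fields that were left empty."""
    return [f.name for f in fields(record) if not getattr(record, f.name)]


# Hypothetical draft: provenance is stated, but nothing else is.
draft = DatasetDisclosure(
    name="example-corpus",
    provenance="scraped from a public forum",
)

print(missing_disclosures(draft))  # ['known_problems', 'model_scope_limits']
```

A check like this does not make a data set fair. It only makes the gaps in disclosure visible, so that a person has to take responsibility for filling them.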

For More Reading

Read Raji’s full piece here. And if you’re interested in a more in-depth study regarding the ethical considerations of predictive policing, check out Rashida Richardson’s paper, “Dirty Data, Bad Predictions.”