Can Data Have an Agenda?

If we’re not careful in how big data is collected, the samples we use to improve public policy will only reinforce existing problems.

Julia Stoyanovich

Stoyanovich is an assistant professor of computer science at Drexel’s College of Computing & Informatics.

Big data is supposed to solve what ails us: from telling us what book we might want to read next to directing police to where a crime is likely to be committed.

But if the data from which those predictions are made are slanted, the results will be, too, and they will continue to reinforce a status quo that keeps people and communities disenfranchised.

Consider, for example, the algorithms that determine where to send patrol cars.

“These decisions are going to be based on historical data, and that historical data includes information about where we were previously heavily policing,” says Julia Stoyanovich in the College of Computing & Informatics. “The reason we send police cars is not necessarily because crime is more pervasive in those areas, but because in the past, we were over-policing those areas.”

Stoyanovich is the lead researcher on a new study funded by the National Science Foundation that will establish foundational principles of responsible data management, including fairness, transparency and data protection. The project, called Data Responsibly, is working to ensure not just that the models are accurate but also that the data on which they depend are used in ways that respect relevant laws and societal norms and account for the impact on the people from whom the data are collected.

“Suppose you have a judge who is racist. After a while, it becomes clear that all African Americans get longer sentences by that judge,” says Stoyanovich.

Identifying the culprit in those harsher sentences is not hard: it’s that judge. “In the case of data-driven analytics, it isn’t really clear how we can detect that results are biased, whose fault it is, and how to correct it. If big data systems are racist, that racism is very scalable,” she says. “The effects are going to be an entire county, for example, not just a few particular cases the judge hears.”

By creating guidelines that detect and mitigate biases at the start, in the way data are collected, Stoyanovich and her team hope to keep bias out of the system before an algorithm spits out results that sustain it on a larger scale.

“What we need are different ways of looking at the data analysis pipeline, step by step,” she says.