_When Data Gets Loose

Researchers have proposed a new methodology to shuffle survey data so individuals aren’t identifiable even if datasets accidentally go public.

_Matthew Schneider

Schneider is an associate professor in the LeBow College of Business.

Organizations are constantly collecting confidential consumer data, but how long does it stay private? Although datasets are supposed to be anonymized or encrypted for confidentiality, proprietary information has a way of getting out. In fact, Verizon confirmed 3,950 data breaches worldwide in its 2020 annual “Data Breach Investigations Report,” with 30 percent of those executed by internal actors such as employees.

“Encryption definitely helps, but it does not prevent a data breach,” says Matthew Schneider, an assistant professor in the LeBow College of Business. “It’s similar to safeguarding your email password; an internal actor with access to the encryption key or real data could easily cause a data breach.”

Privacy is also a problem for local governments and other entities that conduct confidential surveys of their constituents and are legally required to share the results with the public. It’s relatively simple for an unethical actor to use public datasets to identify a particular respondent and uncover that person’s private responses.

“Assume that all data will eventually get out and should be transformed prior to sharing anywhere within the organization.”

— Matthew Schneider

To address this problem, Schneider and his research partner Dawn Iacobucci of Vanderbilt University proposed a new methodology that permanently alters survey datasets to protect consumers’ privacy whenever the data is shared, whether intentionally or through a breach.

Their methodology, published in the Journal of Marketing Analytics, builds on a technique used in genomic sequencing applications. In the researchers’ tests, it disguised the identity of survey respondents and their sensitive responses while keeping the accuracy of insights within 5 percent.

“Our method would essentially ‘shuffle’ the demographic data in a survey dataset,” says Schneider. “But, unlike previous methods, ours only shuffles data when it maintains the correlations between important variables that are essential to analysts. The protected data is generated on a consumer level and still valuable to the end user. This can also be done for employee surveys. If this dataset got out, then only the organization’s insights would be known.”
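For readers who want a feel for the idea, the sketch below shows one simple way a correlation-preserving shuffle could work. It is an illustration only, not the researchers’ published algorithm: the function names, the toy age and satisfaction data, and the 0.05 tolerance are all assumptions made for this example. It swaps a demographic value between two randomly chosen respondents and keeps the swap only if the correlation with the survey response stays close to its original value.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation between two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def constrained_shuffle(demo, response, tolerance=0.05, attempts=5000, seed=0):
    """Hypothetical helper: swap demographic values between random pairs of
    respondents, keeping a swap only when the demo/response correlation
    stays within `tolerance` of its original value."""
    rng = random.Random(seed)
    shuffled = list(demo)
    target = pearson(demo, response)
    for _ in range(attempts):
        i, j = rng.sample(range(len(shuffled)), 2)
        shuffled[i], shuffled[j] = shuffled[j], shuffled[i]
        if abs(pearson(shuffled, response) - target) > tolerance:
            # The swap distorted the relationship too much, so undo it.
            shuffled[i], shuffled[j] = shuffled[j], shuffled[i]
    return shuffled


# Toy survey (made-up data): age and a satisfaction score loosely tied to age.
rng = random.Random(42)
ages = [rng.randint(18, 80) for _ in range(200)]
scores = [0.05 * age + rng.gauss(0, 1) for age in ages]

protected = constrained_shuffle(ages, scores)
moved = sum(1 for a, b in zip(ages, protected) if a != b)
print(f"original correlation:  {pearson(ages, scores):.3f}")
print(f"protected correlation: {pearson(protected, scores):.3f}")
print(f"respondents whose age value changed: {moved} of {len(ages)}")
```

In this toy version, the set of ages in the dataset is unchanged and the age-to-satisfaction relationship an analyst cares about is roughly preserved, but an individual row no longer reliably carries that respondent’s true age. A real application would need to handle many variables at once and to measure re-identification risk, which this sketch does not attempt.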