Ever since data science emerged as a discipline critically important to the success of powerhouse companies such as Facebook, Google, Uber and Amazon, more and more universities have been asking themselves how best to prepare their students to be competitive data scientists in the workforce. Given the sheer amount of information to be imparted, from data preparation to databases to big data management to visualization to predictive analytics to prescriptive analytics, the two years of a Master’s degree can represent, shall we say, an ambitious timeframe to make sure the students are competent in all components of data science. An important question, then, is the role of undergraduate studies in that preparation. The National Academies of Sciences, Engineering, and Medicine have a new report out on precisely that topic: Data Science for Undergraduates: Opportunities and Options.
The purpose of the report, in its own words, is to “set forth a vision for the emerging discipline of data science at the undergraduate level.” Let me quote from the report’s summary: “Data science is emerging as a field that is revolutionizing science and industries alike. Work across nearly all domains is becoming more data driven, affecting both the jobs that are available and the skills that are required. As more data and ways of analyzing them become available, more aspects of the economy, society, and daily life will become dependent on data… Today, the term “data scientist” typically describes a knowledge worker who is principally occupied with analyzing complex and massive data resources. However, data science spans a broader array of activities that involve applying principles for data collection, storage, integration, analysis, inference, communication and ethics. In future decades, all undergraduates will profit from a fundamental awareness of and competence in data science.”
The report has a number of rather expected (although always welcome) recommendations, but here are my favorite ones. First, Recommendation 2.2: “Academic institutions should provide and evolve a range of educational pathways to prepare students for an array of data science roles in the workplace. These include introductory courses, full degrees at both associate and bachelor levels, and a range of minors and certificates… A key goal is to give all students the ability to make good judgments, use tools responsibly and effectively, and ultimately make good decisions using data.”
Then, Recommendation 2.4: “Ethics is a topic that, given the nature of data science, students should learn and practice throughout their education. Academic institutions should ensure that ethics is woven into the data science curriculum from the beginning and throughout.”
And Recommendation 2.5: “The data science community should adopt a code of ethics; such a code should be affirmed by members of professional societies, included in professional development programs and curricula, and conveyed through educational programs. The code should be reevaluated often in light of new developments.”
This past semester, given the Facebook/Cambridge Analytica scandal, I adjusted the final project in my Analytics for Decision Support course to include an essay on ethics in data science. (I had students summarize resources available online and then provide their thoughts on discussion questions such as whether data science should be regulated and what role an SMU education could play.) I was impressed by the quality of my students’ essays and the effort they put into them. It was interesting to realize that some of them had had no idea data science wasn’t already regulated, while others argued that “there is no free lunch” and the scandal was overblown: if you’re not paying directly for a service like Facebook, then your data is paying for you. I will make this a regular module in my course from now on.
Yet there are many challenges in enforcing ethics in data science. A code of conduct might sound reassuring but end up being hollow, because you can give data to data scientists without telling them what the information is. For instance, there might be a column about race, but the company leaders who acquired the data may not tell the data scientists what that column is about and simply call it “categorical variable number 1.” This is in sharp contrast with, say, doctors and lawyers, whose actions have immediate and clear consequences for their patient or client.
There is another source of concern as well. I might have mentioned it on this blog before, but someone who works as a data scientist for a credit card company once told me, if I understood correctly, that they could see the detailed list of purchases any cardholder had made. You might think only celebrities would have to worry about strangers peering at their list of purchases, but friends of friends, ex-significant others and the like might have an unhealthy interest in anyone’s purchases. You don’t use a credit card with the expectation that people you know, or acquaintances of people you know, will be able to see your purchases as if they were standing next to you reading your statement. I wouldn’t be surprised if data scientists at companies like Amazon.com could access the detailed list of anyone’s purchases too, although that would be poor practice. It would be best to follow the example of health payers, who keep one database where patients’ names are replaced with IDs and a separate database that connects IDs with names. Employees have different levels of access to those databases depending on what their line of work calls for, and access to any database must be requested and reviewed by a special body in the company before it is granted.
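For readers who like to see ideas in code, here is a minimal sketch of the health-payer arrangement described above: records keyed by ID only, a separate table linking IDs to names, and access to either table granted only after approval. Everything in it (the class `PseudonymizedStore`, the table names `"claims"` and `"identities"`, the `approve` method standing in for the review body) is hypothetical and chosen for illustration; it is not how any particular company implements this.

```python
class AccessDenied(Exception):
    """Raised when a user reads a table they were not approved for."""


class PseudonymizedStore:
    """Toy model: claims carry only an ID; names live in a separate table."""

    def __init__(self):
        self._claims = []        # records keyed by patient ID, no names
        self._identities = {}    # separate table: patient ID -> name
        self._approved = set()   # (user, table) pairs granted after review

    def add_patient(self, patient_id, name):
        self._identities[patient_id] = name

    def add_claim(self, patient_id, amount):
        self._claims.append({"patient_id": patient_id, "amount": amount})

    def approve(self, user, table):
        # Stand-in for the review body: in practice this would follow
        # a request and a human decision, not a direct method call.
        self._approved.add((user, table))

    def _check(self, user, table):
        if (user, table) not in self._approved:
            raise AccessDenied(f"{user} has no approved access to {table}")

    def claims(self, user):
        self._check(user, "claims")
        return [dict(claim) for claim in self._claims]  # return copies

    def name_for(self, user, patient_id):
        self._check(user, "identities")
        return self._identities[patient_id]


if __name__ == "__main__":
    store = PseudonymizedStore()
    store.add_patient("P001", "Alice Example")
    store.add_claim("P001", 120.0)

    # The analyst is approved for the ID-only claims table, nothing else.
    store.approve("analyst_1", "claims")
    print(store.claims("analyst_1"))

    try:
        store.name_for("analyst_1", "P001")
    except AccessDenied as err:
        print("Blocked:", err)
```

The point of the split is that an analyst can do their job on the claims table without ever being able to tie a purchase or a treatment to a name, unless someone has explicitly decided they need to.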
Going forward, maybe a company that sells information to third parties should remove fields such as race and gender before handing over the data set, instead of washing its hands of the way the data is used once the data set has been paid for. But that doesn’t solve the problem of people spreading misinformation to push unsuspecting Internet users toward a decision that they might not have made otherwise. In many managers’ minds, the need to behave ethically competes with the need to maintain revenue and preserve a competitive advantage. But perhaps the reputational risk will be enough to make leaders think twice about selling their users’ data. Facebook is said to be falling behind Instagram, YouTube and Snapchat among teens. Sometimes just one scandal is enough to make a behemoth company look like an also-ran.