Jessie G Taft
DL Seminar | Machine Learning’s Copyright Problem
Updated: Jan 8, 2019
By Bradley Wise | Connective Media Student, Cornell Tech
Could copyright law be enhancing bias in the way machine learning algorithms learn? According to Amanda Levendowski, a teaching fellow in the Technology Law & Policy Clinic at NYU, the case is compelling. It boils down to who has access to certain works without fear of legal action.
Levendowski explained that right now only a handful of companies (Facebook, Google, IBM, etc.) have the financial capital both to create AI systems and to acquire the rights to the copyrighted works that feed those systems, either by building systems that acquire works outright or through acquisitions and licensing. This creates a problem. Because the data generated or obtained by these companies includes works that are subject to copyright protection, it is very difficult for the public to know what data these systems are using. Researchers and journalists have found that many AI systems built and used today suffer from flaws, due primarily to biased datasets. The copyright problem, in short, is that if the data can’t be accessed, this bias can’t be addressed.
Levendowski’s paper, How Copyright Law Can Fix Artificial Intelligence’s Implicit Bias Problem, highlights several examples of how biased datasets create biased systems, most notably through what she calls “biased, low-friction data (BLFD).” BLFD is regarded as legally low-risk, principally because it is widely accessible in the public domain. An example is the Enron emails: a set of 1.6 million emails sent among Enron employees and released publicly in 2003 because of the fraud investigation surrounding the company’s collapse. The Enron dataset has been used to build a number of AI systems in the computer science community, including spam filters, other natural language machine learning systems, and even the initial build of Apple’s Siri. However, researchers have also used these emails, which come from an extremely narrow demographic, to study gender bias and unsavory power dynamics, creating an ethical quandary for systems built upon them.
Additional questionable examples of BLFD include scraped profiles from the online dating site OkCupid, as well as WikiLeaks’ release of 20,000 hacked emails during the Hillary Clinton campaign, both disclosed in 2016. These datasets are questioned not for their bias per se, but for the ethics of viewing and using them. The OkCupid dataset caused a public uproar over privacy and identification (it has since been removed from public use). WikiLeaks’ release also raised concerns over the use of classified government emails in machine learning algorithms. Because BLFD is easily available and cheap to use, it presents a particularly acute dilemma for small startups or companies with limited capital resources.
Levendowski’s solution is to push for fairer AI through the copyright “fair use” doctrine, which permits limited use of copyrighted works without permission from the copyright holder, including for purposes of criticism, news reporting, or research. If a court decision acknowledged that fair use applies to using copyrighted works as machine learning training data, she argues, there would be more competition to build fairer AI systems, an alluring prospect, even if achieving it in practice would require concerted effort by incumbents and new players alike.