Text and Data Mining (TDM) generally involves the identification of previously unknown patterns or relationships in data sets. TDM can be used to build predictive models of behavior in the retail context, so that when a customer visits Amazon or opens their Facebook page, they are presented with advertising keyed to their individual tastes and preferences.
In the media and entertainment context, one form of TDM, machine learning, is being used to train AI programs to create content, whether in text, audio, visual or audiovisual form. Machine learning, like traditional TDM, is intended to discover novel and useful knowledge in data. A fundamental difference, however, is that traditional TDM extracts data for human comprehension, whereas machine learning extracts data to improve an AI program's own understanding and ability to produce output. In addition, TDM does not necessarily involve rule or pattern discovery, while machine learning almost always does.
TDM in the U.S.: What is ‘fair use’ anyway?
As discussed in the Geopolitics of AI section, the legality of making copies of text or data through TDM has become a serious issue. As AI search engines crawl through the world wide web endlessly seeking, digesting, and aggregating content, they inevitably ingest copyrighted works such as music videos, songs, novels, and news stories. Since this ingestion – which generally requires the making of a copy – is frequently performed without the express consent of the copyright holder, its legality often depends on whether it is permitted under an exception to, or outside the framework of, copyright law. Under U.S. copyright law, the exception that is most frequently relied upon is fair use.
Under section 107 of the Copyright Act, fair use is a four-factor test: (1) the purpose of and character of the use; (2) the nature of the copyrighted work; (3) the amount and substantiality of the portion used in relation to the whole; and (4) the effect of the use on the potential market for, or value of, the copyrighted work. Fair use of a copyrighted work for such things as teaching, scholarship, and research is specifically permitted by section 107. A key consideration that courts have used in deciding whether fair use exists is whether the use is “transformative.”
Whether copying of copyrighted material for the purpose of machine learning constitutes fair use is a hotly debated topic that will affect the future of AI in the United States. For example, Thomson Reuters and West Publishing Corp. have sued Ross Intelligence, Inc. over, among other things, its alleged use of machine learning to create a legal research platform for Ross from the Westlaw database. The outcome of this case is still pending; Ross' motion to dismiss the copyright infringement claim was denied.1
Will fair use protect machine learning?
In a seminal case from 2015, the Second Circuit found Google Books' scanning of more than 20 million books, many of which were subject to copyright, to be a non-expressive and transformative fair use of the texts because Google Books enabled users to find information about copyrighted books, as opposed to the expressions contained in the books themselves.2 A key learning from the case was the distinction made between "expressive" and "non-expressive" use of copyrighted materials, the latter being deemed fair use by the court. Applied to AI, could the solution mean that so long as the original text is not "expressed" in the final work product, the act of machine reading is fair use?
We are not aware of U.S. courts applying fair use in the context of TDM, in part because cases considering AI functionality have often involved the express use of copyrighted material that qualified as traditional copyright infringement. For example, the Second Circuit found in a 2018 case that although TVEyes' "search feature" for Fox News content in and of itself might have been sufficiently transformative to be fair use, the fact that TVEyes also had a "watch feature" that redistributed copyrighted Fox News content to TVEyes users for a monthly fee did not permit a fair use defense (Fox News Network, LLC v. TVEyes, Inc., No. 15-3885 (2d Cir. Feb. 27, 2018)).
In practice, major TDM search projects are generally dealt with under contract, which has resulted in low instances of litigation. Academic and commercial arguments have also been raised against over-reliance on fair use for TDM. As a practical matter, a key factor that U.S. courts will look at is whether TDM deprives the copyright owner of the value of their copyrighted material.
AI licensing
The predominant way that rights to collect, use and share data are allocated in advance, and business certainty created, is through licenses. A license is a right or permission for a person or company to use another party's intellectual property, typically in exchange for a fee. The benefit of the licensing model is its tremendous flexibility: it allows parties to slice, dice, allocate, monetize, expand and limit collection, use and disclosure in an area where the more traditional intellectual property rights of patent, copyright, trademark and trade secret law may be less clear, or where there may be genuine differences of opinion. Licensing can help address these issues among and between businesses and even consumers. Indeed, licensing as a tool has broadly enabled many of the data-focused innovations of the Internet age. Licensing also helps to address privacy and data protection issues in many legal systems. In the U.S., for example, not only do privacy policies often address these issues, but terms of use or terms of service frequently include grants of licenses to things that may or may not be subject to traditional intellectual property protection.
In addition, licensing can be used to address issues of confidentiality, usage considerations or limitations and, increasingly, learning and other issues that are often experiential and machine-aided in connection with the collection, use and disclosure of data. For example, secondary or derivative usage of data, which may not be subject to copyright or trade secret protection, is increasingly addressed by contract. Similarly, residuals (information in intangible form that may be remembered by persons with access to confidential information) are increasingly important for parties to consider when exchanging confidential information with other parties. Not only can the information generated by a business relationship be valuable, but who has a right to secrecy with respect to it, and whether and how the counterparty can use it, has become so important that the entire enterprise value of certain businesses has been written off when rights in underlying data were questioned. More recently, acquisition transactions have had their purchase price changed, or deals have failed to close, because of uncertainty about data rights.
With this in mind, it is helpful to understand common contractual provisions used in licensing relating to the collection, use and disclosure of data.
Key provisions
Representations, warranties and covenants
In contracts, representations are legally binding assurances that certain facts are true, while warranties provide that if a stated fact is not true, the recipient of the product or service covered by the assertion of fact will be protected from loss. In contrast, a covenant states that something will or will not be done and affirmatively obligates the party making it. Breach of a covenant could result in money damages or an obligation of specific performance. When negotiating a contract for AI products or services, the representations, warranties, and covenants should be specific to artificial intelligence in order to address the risks associated with the use of such technology. Examples of such representations, warranties, and covenants include:
- Sufficient rights to use the technology – Many customers may require warranties that the vendor has sufficient rights and/or licenses to provide the technology. These come in the form of affirming original creation and/or appropriate licenses, as well as an express representation and warranty of non-infringement. This representation and warranty allows the customer to assert an "innocent infringer" defense to certain IP claims and requires the vendor to stand behind its intellectual property. As a vendor, however, it may be difficult to give such a representation and warranty: potential threats are hard to find and assess, and a vendor may never be sure that it is free from claims of IP infringement. This is especially true given the evolving landscape of artificial intelligence and copyright protection, as discussed throughout this guide.
- Consents from individuals – As discussed throughout this guide, data protection laws worldwide rely largely on obtaining a user's consent before processing or using that user's data. Vendors will want customers to represent that they have obtained consent from such individuals to provide personal data or personal information to the vendor in the input data, and that the customer is not prohibited from using the data beyond the stated purpose for which consent was given.
- Performance of the AI model – Performance warranties ensure that the AI model works in accordance with any specifications and documentation provided by a vendor. Additionally, a customer may request a warranty and a covenant that certain performance obligations will be met, including results to be achieved, accuracy, and operability in the customer's environment. These warranties and covenants may be valuable to a customer to ensure the AI model works as intended and for the customer's purposes. In turn, a vendor may want to precisely define and limit the expected performance, since AI model development is complex and iterative, and may seek to allocate to the customer the risk of determining whether the AI model is suitable for the customer's business.
- Security related – There are many cybersecurity vulnerabilities associated with AI. Customers may request appropriate representations and warranties ensuring adequate proactive and responsive cybersecurity policies and procedures.
- Physical equipment with embedded AI – When a vendor sells physical equipment that includes artificial intelligence, customers should ensure that the representations and warranties in the contract also cover injuries, damages and even death that could be caused by the customer's and its users' use of the AI-enabled machines and devices.
- Miscellaneous – Customers should consider how they will use artificial intelligence in their business. If they intend to incorporate AI into mission critical functions, such as automating production lines, then the representations and warranties about the AI system may address the potential business impact of a total system failure and extended downtime. In situations where the AI includes a facial recognition tool, the risk caused by AI may be allocated to the developers to ensure that the model was built so that the results of any output data are not deceptive and are free of bias and discrimination.
Indemnification
Indemnification clauses allocate liability to the party with greater culpability for the event that gives rise to liability. With artificial intelligence, customers need to understand whether their data is being used as training data and, if so, assess whether they are comfortable with that use and the potential outputs to be generated. Customers who are paying for the use of generative AI tools may consider obtaining an indemnification from the vendor for intellectual property infringement. However, the parties should carefully consider how to allocate liability for the AI's functionality, because it may be difficult to determine whether the vendor or the customer caused the event giving rise to liability. For example, if the output data infringes a third party's IP rights, it may be difficult to determine whether the input data was the infringing portion or whether the combination of the input data with other training data provided by the vendor caused the infringement. Other indemnities that a customer may request from a vendor include property damage or personal injury, if the AI model is used in a high-risk environment such as a manufacturing plant, or data breaches, if the AI model ingests personal data and a breach occurs due to a hack. A customer should review its use of the AI model and the type of data it is providing to the vendor to ensure it is protected from potential third-party risks. A vendor, for its part, will want to ensure that its indemnities are limited to third-party claims for which it would be responsible; it may not be reasonable for a vendor to provide indemnities for things that are outside its control or within the control of the customer.
Limitation of liability
A limitation of liability clause limits the amount of damages that a party can recover from another party for breaches or performance failures. Limitation of liability clauses typically limit liability to one of the following amounts: (i) the compensation and fees paid under the contract; (ii) an agreed-upon amount of money; (iii) available insurance coverage; or (iv) a combination of two or more of the above. When the parties are negotiating a liability cap in an agreement, they should look closely at the specific risks and apply individual limitations accordingly. For example, if a supply line is operated by AI-enabled robots and those robots fail, a customer's business could be severely impacted if it is unable to run business as usual. In these situations, a customer would want to seek damages sufficient to cover its losses and any damage it experienced in not being able to run its business. Many liability caps also involve a waiver of consequential damages that prevents parties from recovering special, indirect, and consequential damages, but a customer may want to consider certain exceptions to such a waiver. For instance, if an AI data analytics system inadvertently discloses personal information of downstream users, the customer may face third-party claims from those users and sustain serious reputational damage. The negotiations over liability limitations therefore deserve careful attention.
Insurance
Insurance requirements are critical in an AI-related commercial contract because they shift the consequences of the risk associated with the AI system to another party. There are several types of insurance coverage a party can obtain, and a party should review each policy type to ensure that it covers the variety of damages that may occur. If an AI system fails or causes damage, the parties must determine which coverage, if any, applies to the situation and, if none does, whether they can expand or add coverage or whether coverage is even available. One type of insurance many parties request in commercial agreements involving data, personal information or other systems is cybersecurity insurance; however, cybersecurity insurance may not cover all AI failures. Cybersecurity insurance typically covers model-stealing attacks and data leakage. It generally does not cover bodily harm (e.g., an Uber self-driving car killing a pedestrian), brand damage, or damage to physical property. If such coverage is important to a party, it should require the vendor to obtain the appropriate insurance policy.
Can data be owned?
Data is free flowing information. There is no standard definition of the term “data.” The Joint Technical Committee of the International Organization for Standardization (ISO) and the International Electrotechnical Commission (IEC) proposes the following definition of the term:
“Reinterpretable representation of information in a formalized manner suitable for communication, interpretation, or processing.”
At the most basic level, data is just information. For example, the fact that a texture belongs to a genre called “brick” or “steel” is information that is not capable of appropriation in itself. This lack of ownership of data stems from the fundamental principle that information, ideas, methods and techniques are free and free flowing.
Whether or not any person has a proprietary or "ownership" interest in data rests on whether the law has created a specific property regime for that type of data, also called "intellectual property." In most countries, there exist only five types of data susceptible of intellectual property protection: (i) works of art and other subject matter from the creative industries; (ii) databases; (iii) software; (iv) trademarks; and (v) patentable inventions. Simply put, data that does not fall within one of the aforementioned categories may not be "owned." Of course, this does not mean that one is entirely free to use and re-use data that is not "intellectual property," since other types of restrictions might apply to the data, as discussed below.
Intellectual Property
i. Ownership of or right to control input data
Parties providing data intentionally for use in machine learning or AI development frequently seek to assert ownership or control of the data, or otherwise assert a right to exclusively share and use it. However, ownership in the sense of property is often not available; personal information is a good example. Personal information is not a proprietary right but an access right, controlled almost exclusively by the individual to whom the information relates or whom it identifies, and it is difficult for anyone other than that individual to assert rights over that specific data. For other types of data, such as confidential business information, a customer may want to ensure that it maintains explicit confidentiality rights in its data and that no rights are transferred to the vendor by virtue of the performance of the services. For a vendor, there may be significant value in controlling the input data so that it may continue to use that data in its AI tool without breaching another party's rights. Many vendors provide services freely or cheaply precisely in order to generate input data that can be used to train and improve their models.
ii. Ownership of or control of output data
The ownership status of output data is the most highly contested and difficult provision to negotiate in data-related contracts. The output data of an AI model may include direct end-user output data created for use by the AI customer and indirect "output" data that is inputted by the customer and used by the model to improve functionality and efficiency. Output data varies depending on the type of model used and its purpose. There are three main types of outputs: (a) a prediction; (b) a recommendation; or (c) a classification. Many customers desire to "own" the output data since it was created using the input data they provided. A customer could argue that output data is a derivative work (as that term is used under the U.S. Copyright Act) and that ownership therefore automatically flows through to the customer. Unfortunately, data used in artificial intelligence development or model training may often be of uncertain copyright provenance or unequivocally not subject to copyright protection, and even trade secret status is frequently unclear. Vendors, meanwhile, will sometimes argue that they should "own" the output data because they used their own proprietary model to create it. The AI vendor may also want to keep "ownership" of the output data so that it can continue to use that data to train the AI model. In practice, output data is rarely susceptible of appropriation; hence, relying on contractual terms delivers far better certainty. There is little case law on who can claim rights to output data, so the parties should carefully review the contractual language to ensure that each of their interests is protected when negotiating these types of contracts.
iii. Use of derived data
Derived data is new data and insights derived from the output data that may not have been available from the existing data. Because derived data is valuable, both customers and vendors may have potential uses for it outside of the contractual agreement for the AI model. One of the key issues with derived data is who can control it. A customer could argue that it should be afforded the right to control the derived data since it was the original inputter of the data used to create it; but since derived data is created by combining and transforming data into a new type of data, a vendor could argue otherwise. As discussed in the Emerging Trends section below, vendors can monetize access to and use of derived data either through a license to a database containing such derived data or through the purchase of certain derived data from customers.
Information security
As part of the vetting and contracting process, customers should also consider a vendor's data security model for its AI tools. An AI system can be hacked, leading to issues such as system manipulation, data poisoning, and extraction attacks. System manipulation involves feeding the AI system malicious inputs so that the output data is inaccurate. Data poisoning is the act of modifying the input data, in transit or at rest, so that the model returns incorrect classifications; for example, a bad actor could manipulate the training data to teach the AI model anything it wants, such as treating good software code as malicious code and vice versa. Data extraction attacks place the entire AI system at risk by generating a back door in the training data to gain access to the AI model itself. To avoid these harmful scenarios, customers may request that vendors agree to certain data security requirements, such as penetration tests, detailed review processes and testing of any AI-generated source code, and access controls for personnel who will be supervising the AI model. Having security policies and procedures in place is critical to protecting the integrity and confidentiality of the AI model and the data.
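To illustrate why poisoned training data matters, the following is a minimal sketch, not any vendor's actual system: a toy nearest-centroid classifier trained on synthetic "benign" vs. "malicious" samples, where an attacker who can alter training data in transit flips most labels, teaching the model the opposite of the truth. All names, data, and thresholds here are illustrative assumptions.

```python
import random

random.seed(0)

# Synthetic data: "benign" samples cluster near 0, "malicious" near 10.
train = [(random.gauss(0, 1), "benign") for _ in range(100)] + \
        [(random.gauss(10, 1), "malicious") for _ in range(100)]
test_set = [(random.gauss(0, 1), "benign") for _ in range(50)] + \
           [(random.gauss(10, 1), "malicious") for _ in range(50)]

def centroid_model(data):
    """Toy nearest-centroid classifier: label a point by the closest class mean."""
    sums, counts = {}, {}
    for x, label in data:
        sums[label] = sums.get(label, 0.0) + x
        counts[label] = counts.get(label, 0) + 1
    centroids = {label: sums[label] / counts[label] for label in sums}
    return lambda x: min(centroids, key=lambda label: abs(x - centroids[label]))

def accuracy(model, data):
    return sum(model(x) == label for x, label in data) / len(data)

def flip(label):
    return "malicious" if label == "benign" else "benign"

# Poisoning: an attacker modifies training data in transit, flipping
# 80% of the labels so benign looks malicious and vice versa.
poisoned_train = [(x, flip(label)) if i % 5 != 0 else (x, label)
                  for i, (x, label) in enumerate(train)]

clean_acc = accuracy(centroid_model(train), test_set)
poisoned_acc = accuracy(centroid_model(poisoned_train), test_set)
print(f"clean model accuracy:    {clean_acc:.0%}")
print(f"poisoned model accuracy: {poisoned_acc:.0%}")
```

On this toy data the clean model is near-perfect while the poisoned model is close to 0% accurate, since its class centroids have effectively swapped; the point is that the contract's security requirements (integrity checks on data in transit and at rest, review of training pipelines) are what guard against this class of failure.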
Service levels and key performance indicators
Many customers seeking to use services involving an AI model will want certain quality, accuracy, or other benchmarks in connection with the service. If the AI model provides inaccurate or non-beneficial data, the model could be effectively useless. A customer may therefore ask a vendor to back the AI model's expected performance with service levels and key performance indicators. One type of service level a customer may request is an accuracy SLA, which requires that the output data generated by an AI model be accurate X percent of the time. For a vendor, this can be difficult to commit to, since the customer may be providing poor input data to begin with. For example, if a customer provides input data indicating that only cows can be brown, then when the AI model sees a black-and-white cow, it could classify that cow as another animal that the model knows can be black and white. At the same time, a vendor should be continuously improving and training its AI model, for example so that it learns that cows come in assorted colors, and committing that the AI model will provide accurate output data.
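To make the mechanics of an accuracy SLA concrete, here is a hypothetical sketch. The class name, the 95% target, and the minimum-sample rule are illustrative assumptions, not terms from any actual contract; the minimum-sample rule reflects a point vendors often negotiate, namely that the model should not be judged on too small a scoring window.

```python
from dataclasses import dataclass

@dataclass
class AccuracySLA:
    target: float           # e.g. "output must be accurate 95% of the time"
    sample_size_floor: int  # don't score the SLA on too few predictions

    def evaluate(self, predictions, ground_truth):
        """Return (accuracy, met) for one scoring window."""
        if len(predictions) < self.sample_size_floor:
            raise ValueError("sample too small to score the SLA fairly")
        correct = sum(p == g for p, g in zip(predictions, ground_truth))
        acc = correct / len(predictions)
        return acc, acc >= self.target

# Hypothetical scoring window: 100 images of cows, 3 misclassified.
sla = AccuracySLA(target=0.95, sample_size_floor=100)
preds = ["cow"] * 97 + ["horse"] * 3
truth = ["cow"] * 100
acc, met = sla.evaluate(preds, truth)
print(f"accuracy={acc:.2%}, SLA met: {met}")
```

A contract drafted around such a mechanism would also need to say who supplies the ground truth labels and what happens in windows dominated by poor customer input, which is exactly where the vendor's objection above bites.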
Termination rights
When parties terminate a relationship, data license agreements commonly dictate requirements to wrap up the termination or expiration, such as destruction of any confidential information that was shared and rules on continued use of data after termination. Many of these standard provisions are difficult to apply to models, derivatives, and the like. For example, vendors may include in the contract the right to continue to use confidential information that was part of the input data for further training. If this right is included, a provision requiring deletion of confidential information could conflict with a license right to such confidential information that survives termination. It could also cause leakage of a customer's confidential information to other customers, who would have access to it by virtue of access to the output data of the model and/or derivatives. This is why reviewing licensing terms in conjunction with termination rights is key. A customer may also request the ability to continue to use the output and derived data long after the contract terminates. Under a typical data license, the right to use the software and the associated data is revoked upon termination, but this may not be feasible for a customer if the output data and derived data are incorporated into the customer's dataset. To avoid a potential infringement claim, a customer may want to include a continued-use clause allowing it to use the data already provided without restriction. In turn, a vendor may want to limit the customer's ability to use its data after termination or expiration, so that uses by the customer do not erode the vendor's business value. For example, if a customer has the output data and derived data in its database and turns around and sells such data to others, this could undercut the vendor's business.
Vendors and customers should review termination rights carefully to ensure their future interests are protected.
Failed license cases
Given uncertainty over many of these issues, there has recently been a significant increase in litigation. Below we summarize a few of the ongoing cases that deal with licensing issues:
- CoPilot case – On Nov. 2, 2022, a class action was filed on behalf of software developers against GitHub Inc., Microsoft Corp. and the OpenAI entities alleging violation of the Digital Millennium Copyright Act and breach of contract under the open-source licenses governing the source code, due to the release of GitHub CoPilot. CoPilot is a generative AI tool built by GitHub to assist programmers while they are coding within the platform. In this case, the underlying work consists of source code created by other, non-GitHub or Microsoft developers for a public library on GitHub under an open-source license. One of the requirements these developers established in their open-source licenses was that copyright notices within their source code must be reproduced when the code is used as the basis of derivative software or code.
Generally, original source code may be used and distributed to third parties as long as proper recognition is provided. The plaintiffs allege that the CoPilot AI instead removes or alters such copyright information from the source code and then reproduces the source code, without the requisite copyright information, to the CoPilot users. In response, GitHub and Microsoft have argued that CoPilot does not need to reproduce the copyright information as CoPilot is not built around the code in plaintiffs’ open-source library but is based on all code developed and stored within GitHub. GitHub recently got the judge presiding over this case to dismiss most of the claims, including the copyright infringement claim, but with leave to amend and re-submit to the court.
- On Jan. 13, 2023, Sarah Andersen, Karla Ortiz, and Kelly McKernan filed a class action against Stability AI Ltd., Stability AI Inc., Midjourney Inc., and DeviantArt Inc. alleging infringement of certain copyrighted images of the plaintiffs' artwork, as well as breach of contract, unfair competition, and violation of their right of publicity. The named plaintiffs are artists who claim that Stability AI used their artwork in training Stability AI's algorithms without consent.
Stability AI has responded by arguing that they should only face claims for copying works that are registered, something the named plaintiffs did not do prior to filing the suit. Further, defendants claim none of their produced output images contain substantial similarities to any copyrighted works, thus they could not be infringing on the existing copyrights. They claim a lack of pleading direct infringement prevents any claim under the DMCA or of vicarious infringement. Finally, they argue the unfair competition and right of publicity claims are preempted by the copyright claims, and thus should be dismissed.
- On Feb. 6, 2023, Getty Images also sued Stability AI for copyright infringement, as well as for trademark infringement. Getty Images is a leading creator and distributor of digital content, primarily photographic images. The images are either created by staff or hired photographers, or acquired from third parties, with the applicable copyrights assigned or licensed to Getty Images. Getty Images alleges that the content scraped from its websites was collected without consent and that certain images produced by Stability AI's software contain a modified version of the signature Getty watermark, causing numerous concerns under trademark law as to Getty's association with Stability AI and dilution of Getty's trademark protection. Further, Getty has previously licensed its content to other companies, including those that have used Getty's content to train generative AI models like Stability AI's. While that does not mean Stability AI's use violated copyright and trademark law, it does limit potential arguments Stability AI could make surrounding claims of fair use. Stability AI has yet to make an argument on the merits of the case, but it will likely mirror its arguments in the class action that its output images do not contain substantial similarities to any of the copyrighted works.
Emerging trends
Outside of regular intellectual property considerations relating to whether data used in training models violates copyright, privacy, or similar rights restrictions, the training of generative AI, machine learning and other types of data-centric development is increasingly becoming an issue in many transactions. Businesses continue to build and develop products on freemium or give-to-get models, where part of the value proposition is a reduced price or access to additional features and functionality. Providers of such services often receive a broader license to data than a business might otherwise grant in an arm's-length transaction. Companies are, and will likely continue, clamping down on access to and use of some of these tools, or struggling with broad employee use outside of policy, as they have with many other nascent services. For example, many commentators have suggested that companies not put sensitive or proprietary information into prompts or otherwise into large language models or similar tools. For many years, companies have made products available on a "give-to-get" basis, whereby dashboards, analytics and other tools and value are built or made available based on the economic network effect generated when the community as a whole benefits from increased usage by many. Larger enterprises have often sought to use their market power or leverage to obtain the benefits of such effects while restricting or limiting how any data they provide, or data generated about their usage, is used or incorporated into models, machine learning and beyond. Hype surrounding artificial intelligence is driving greater conflict as more organizations awaken to the risks (and rewards) of data for deriving insight and analysis, with artificial intelligence model development accelerating this trend.
Many have projected that having large and less-encumbered data lakes can and will provide a competitive advantage for some players. However, recent developments in open-source model development suggest such advantages may be short lived. Wherever trends go, data continues to emerge as one of the most important asset classes of the twenty-first century.
1. Thomson Reuters Enter. Ctr. GmbH v. ROSS Intelligence Inc., 529 F. Supp. 3d 303 (D. Del. Mar. 29, 2021).
2. Authors Guild, Inc. v. Google Inc., 804 F.3d 202 (2d Cir. 2015).