Data LegoLand: Text and Data Mining and its Impact on COVID-19

Posted on August 25, 2020

Co-authored by Akshat Agrawal & Manika Dayal*



The COVID-19 pandemic has halted the world and has affected almost every aspect of an individual’s livelihood. This has led many academic researchers and pharmaceutical companies to work relentlessly towards finding both treatments and vaccines for COVID-19. The need for a solution through innovation is extremely pressing. This article analyses the importance of Artificial Intelligence (“AI”), specifically Text and Data Mining (“TDM”) tools, in combatting the menace of COVID-19, and consequently highlights the importance of AI-based tools in facilitating research and pharmaceutical innovation.

This article further discusses the copyright concerns involved in developing and training AI software so that it can produce effective output and efficiently support researchers in developing mechanisms to combat COVID-19, along with peripheral concerns. It then examines the scope and ambit of research exceptions in copyright, and whether they are essential to promoting development and innovation in science, especially open, quick and collaborative innovation during this pandemic. Finally, it discusses the Indian scheme and points to the statutory tools within the Indian Copyright Act, 1957 that could be used to ease the legal complexities of training AI software, so that it can be effectively employed by researchers in the pharmaceutical and connected industries.

Role of AI in Complementing the R&D Required for a Solution to COVID-19

The Director General of WIPO recently stated, in a statement published on the WIPO website, that the main challenge currently being faced is the absence of any approved vaccine, treatment or cure for COVID-19, and that the policy focus of governments should be on supporting science, innovation and R&D for the efficient development of a scientific solution to the menace that is COVID-19. In light of this, it becomes exceedingly important to recognise the powerful role of AI, i.e. perceptive computing technologies, in meeting the need for research data required to fight this global pandemic and in effectively supporting medical researchers. To understand the conundrum better, we set out the avenues where AI plays a pertinent role in medical research:

  1. Running complex and intricate algorithms that analyse and learn from large and diverse datasets to identify components of a vaccine, by understanding the viral protein structure of COVID-19.
  2. Helping research scientists sift through the enormous volume of relevant research papers and publications being published around the world in open access formats. AI technologies like Natural Language Processing (NLP) can help researchers crunch huge amounts of data that would be impossible for humans to process quickly. Further, AI can evaluate and summarise tens of thousands of new research papers on coronavirus to substantiate the researchers’ work.

It is further pertinent to note that research is being published at an extremely rapid pace, and it has therefore become increasingly difficult for scientists and medical researchers to connect the dots and swiftly analyse the various points and contentions raised. To address this, the Allen Institute for AI recently partnered with several research organisations to produce the COVID-19 Open Research Dataset (CORD-19), a unique resource base with over 40,000 scholarly articles on COVID-19, with continued daily updates. Researchers can apply natural language processing algorithms to this dataset and work on solutions and mechanisms for possibly developing a vaccine. Further, fairly new tools like Google’s COVID-19 Research Explorer enable users and researchers to ask questions on an AI-supported search engine which, upon analysing the language of the query against the datasets fed in, returns a list of papers with key passages highlighted. A similar initiative is COVID Scholar, a search engine based tool developed at Lawrence Berkeley National Laboratory, California, which uses AI to tag papers with keywords and labels and offers filters such as peer review status and source. Essentially, the function performed by these AI-based tools is to keep up with the results of scientific research, and to collate critical information and make it efficiently accessible to researchers, saving them a great deal of time and effort.

However, AI-related development methods require large amounts of labelled data to be effective. AI works on clean and enriched datasets, which must be fed in for the AI software to learn to analyse and filter their critical aspects. Therefore, for it to predict more accurate results and to be trained to parse and extract knowledge out of big datasets, researchers need access to huge sets of data and training tools. Most of this data is held as protected IP by large research labs, pharmaceutical agencies and big pharma companies, which curbs the effective use and development of AI-based tools. Most of these datasets are protected either by copyright or by database protection laws, and having to seek and negotiate licenses for them impedes effective access to these essential resources, especially during a public health crisis.

Copyright Issues Arising in Training and Employment of Artificial Intelligence Software

Intellectual property, more often than not, poses a major barrier to accessing essential resources for innovation and effective results within appropriate timelines and on affordable terms, contrary to the legislative intent of most legislations. Specifically, training and employing AI systems in a supplementary role in research requires large datasets which may constitute third party intellectual property, consequently raising copyright concerns. Further, mere public health benefits do not create a per se exception to potential copyright concerns. The use of data to train and efficiently employ AI technologies is called Text and Data Mining (TDM). This has been defined as the use of automated analytical techniques to analyse text and data for patterns, trends and useful information, which are potentially used for knowledge discovery and the facilitation of research. It is a fundamental technique in generating the robust datasets that underpin machine and deep learning and enable AI to effectively supplement knowledge discovery. TDM occurs in 4 stages:

In most TDM projects used with AI technology, the data used for mining comes within copyright or database right protection, effectively requiring a license from the owner for access. One of the most common types of text mining is “natural language processing”, as discussed above, which involves linguistic analysis to read and interpret text by feeding large interpretational datasets to the AI system. This helps create effective search engines. However, it requires large amounts of linguistic and scientific terminology based data, which is often protected by copyright or sui generis database rights.
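As a rough illustration of what such automated analysis looks like in practice, the following minimal Python sketch counts how many documents in a small corpus mention each term and reports the terms that recur across documents. Everything in it is an assumption made for illustration: the three sample abstracts are invented stand-ins for lawfully accessed papers, the stopword list is arbitrary, and the function names are hypothetical. It is not the method used by any of the tools named in this article. Note that the program outputs only statistical information about the texts and never reproduces the texts themselves.

```python
import re
from collections import Counter

# Invented stand-ins for lawfully accessed abstracts (not real papers).
ABSTRACTS = [
    "Spike protein binding to the ACE2 receptor mediates viral entry.",
    "We model spike protein structure to identify vaccine candidates.",
    "Vaccine candidates targeting the spike protein show early promise.",
]

# Arbitrary, illustrative list of common words to ignore.
STOPWORDS = {"the", "to", "we", "a", "of", "and", "show"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens, minus stopwords."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS]


def document_frequency(docs):
    """Count, for each term, the number of documents in which it appears."""
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))  # set(): count each term once per document
    return df


def recurring_terms(docs, min_docs=2):
    """Return, alphabetically, the terms appearing in at least `min_docs` documents."""
    df = document_frequency(docs)
    return sorted(term for term, n in df.items() if n >= min_docs)


if __name__ == "__main__":
    print(recurring_terms(ABSTRACTS))
```

Real TDM pipelines work on far larger corpora and use much richer language models, but the legal point is the same: the copies made are inputs to analysis, and the output is a set of patterns rather than the underlying works.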

Fair Dealing/ Fair Use Exceptions to Ease Copyright Limitations Associated with the Use of Datasets for AI Development

Provisions have been incorporated at the national and international levels to facilitate access when intellectual property is a barrier. Exceptions in relation to copyrighted works to ensure availability of vital data, information and knowledge for the purposes of combatting the virus are provided in various forms in domestic and regional copyright legislations.

In some jurisdictions, like Japan and the UK, the copyright laws have been amended to expressly allow research uses and reproduction of lawfully accessed copyrighted materials to train AI systems (by lawfully accessed, the authors mean the presence of a subscription to the platform or journal for lawful access). A precondition is that the intended use be non-consumptive or non-expressive, for instance comparison, classification or statistical analysis. In the UK Copyright, Designs and Patents Act, 1988, this exception is stated in Section 29A, whereas in the Copyright Law of Japan, the corresponding provision is Article 30-4. In the EU Copyright Directive of 2019, two new provisions were added to deal with the exceptions related to TDM.

1. Article 3 of the EU Copyright Directive provides a strong mandatory exception permitting research organisations to make reproductions in order to carry out text and data mining of content to which they have lawful access, for the purpose of “scientific research” and the development of AI tools for scientific research. This exception is available to universities, research institutes and other scientific organisations where the research is not-for-profit or profits are reinvested in research. Further, this provision applies notwithstanding any contractual provisions.

2. Article 4 provides a related, weaker exception for reproduction of lawfully available material for TDM with a profit motive. However, an interesting aspect of this provision is that it is narrow in scope: it provides an opt-out system for right holders, such that if the right holder has explicitly reserved their rights, the exception does not apply.

Coming to the United States, TDM is covered within the “fair use” doctrine irrespective of whether it has a profit motive. This aligns with the Copyright Clause of the US Constitution, which states the purpose of copyright to be to “promote the Progress of Science and useful Arts”. The courts, most notably in the Google Books case, have allowed as fair use the reproduction of data in a searchable database that could be used by scholars and researchers. The court applied the principle of “transformative use”, as the use was to provide information about the copyrighted works without providing the public with a substantial substitute for the original, irrespective of whether it was commercial. Another reason why the courts have allowed TDM as a fair use is that it involves a non-expressive use, that is, use for analysis of the data to develop knowledge about patterns, trends etc., as against use in an expressive capacity (as in the HathiTrust case).

Coming to the case of India, AI developers and researchers could use the “fair dealing” provisions prescribed in Section 52 of the Indian Copyright Act. Interestingly, Section 52(1)(a)(i) prescribes a fair dealing exception for use of a work for private or personal use, including research. The access restrictions clearly connote that use of the work is limited to a personal capacity (only by those who have access to the article) or to research purposes. Since the purpose of the use is Text and Data Mining, that is, analysis and research (analysing trends and knowledge discovery), the only question that arises concerns the fairness of the dealing. As the dealing is in a non-expressive capacity which does not compete with or substitute for the original work, the purpose can be said to be transformative, and the effect of the dealing on the original work is negligible. Hence, the dealing could be regarded as fair, fulfilling the requirements of Section 52(1)(a)(i). Finally, as copyright does not subsist in ideas and facts, it is arguable that use of a work for TDM merely involves the use of data and datasets which are themselves outside the ambit of copyright protection. The protection under Indian law is limited to the representation or expression as a database; the data within, however, could be used. As India is a developing arena for research and the use of AI technology, such TDM exceptions could play an important role in fostering AI technologies and in helping researchers avoid copyright concerns, especially when the use relates to research and to supplementing research to develop tools to deal with the pandemic.

An interesting observation here is that the TDM exception is only available for works which can lawfully be accessed. Due to the ongoing pandemic, many journals have made their content freely accessible to researchers, in effect making it open access. Even WIPO has notified free access to research related to SARS-CoV-2 as well as free access to newspaper articles. This, combined with the TDM exception, should make many resources and datasets easily available for the development of AI technologies and for use by researchers without the worry of copyright threats, especially for knowledge development and use to combat the impact of COVID-19.

*Akshat Agrawal graduated with BA LLB (Hons.) from Jindal Global Law School with a specialisation in IP Policy. He is working as a Judicial Law Clerk at the Delhi High Court with Justice Prathiba M. Singh. He holds a keen interest in IP theory, A2K and Free Culture.

Manika Dayal graduated with B.A LL.B (Hons) from Jindal Global Law School with a specialization in Intellectual Property, Data Privacy and Media laws.

The opinions expressed herein are those of the contributors in their personal capacity and do not, in any way or manner, reflect the views of the organizations that the contributors are presently associated with, or that have previously employed or retained the contributors. 
