In a significant intervention in India's ongoing debate on artificial intelligence and intellectual property, legal expert Rahul Matthan has clarified a crucial distinction. He argues that the process of training AI models does not violate the Copyright Act of 1957, but the content these models generate potentially can.
The Core of Copyright: What Constitutes a 'Copy'?
This analysis comes in response to a working paper from the Department for Promotion of Industry and Internal Trade (DPIIT) on copyright and AI. Matthan, a partner at Trilegal, contends that the department's suggestion of copyright infringement during the AI training cycle is based on a flawed understanding of the technology.
He explains that the verb 'copy' is central to copyright law. Historically, it referred to physical reproductions, later extending to digital duplicates. For infringement to occur, a reproduction must be intelligible, expressive, and capable of substituting for the original work. Matthan asserts that AI training does not create such copies.
How AI Training Works: Stripping Expression to Learn Concepts
When an AI model is trained on a corpus of text, it does not store books, articles, or snippets. Instead, it converts the data into numerical vectors—coordinates in a high-dimensional space. This process mathematically encodes the relationships between concepts by mapping their distance and direction from each other.
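To make the vector-space idea concrete, here is a minimal sketch in Python. The concept names, the three-dimensional toy values and the cosine_similarity helper are invented for illustration; real models learn embeddings with thousands of dimensions during training, and nothing here is drawn from Matthan's analysis or any particular AI system.

```python
# Illustrative sketch only: toy three-dimensional vectors stand in for the
# thousands of dimensions a real model uses, and the numbers are invented.
import numpy as np

# Hypothetical embeddings: each concept becomes a point in vector space.
embeddings = {
    "king":  np.array([0.80, 0.65, 0.10]),
    "queen": np.array([0.78, 0.68, 0.90]),
    "apple": np.array([0.05, 0.92, 0.40]),
}

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """How closely two concepts point in the same direction (1.0 = identical)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Related concepts sit closer together than unrelated ones in this toy space.
print(cosine_similarity(embeddings["king"], embeddings["queen"]))  # higher
print(cosine_similarity(embeddings["king"], embeddings["apple"]))  # lower
```

The point of the sketch is that what the model stores are coordinates and the geometric relationships between them, not the sentences from which those relationships were learned.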
"Training strips away the original expression of an author’s prose—the rhythm of sentences, choice of specific words and the ordering of paragraphs—to reveal the abstract concepts," Matthan writes. This is analogous to how a human learns from reading. The model retains ideas and concepts, not the protected expression. Indian courts have consistently ruled that copyright protects the expression of an idea, not the idea itself.
The Real Copyright Risk Lies in AI-Generated Output
While defending the legality of the training process, Matthan highlights a legitimate area of concern: the output. Creators rightly fear that AI systems trained on their works can rapidly produce rival content, threatening their livelihoods.
The effective legal recourse, he suggests, lies at the other end of the workflow. If an AI model's output reproduces a substantial portion of a copyrighted work, that constitutes a clear copyright violation. The existing Copyright Act should provide remedies for this, though the DPIIT could propose clarifying amendments.
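As a purely illustrative sketch of what scrutinising output rather than training might look like, the snippet below measures verbatim overlap between a generated passage and a source text. The function name and the example strings are assumptions for demonstration, and whether an overlap amounts to a "substantial portion" under the Copyright Act is a legal judgment that no script can settle.

```python
# Rough, illustrative screen for verbatim overlap between an AI output and a
# protected text. The example strings are placeholders; "substantial portion"
# is a question of law, not a character count.
from difflib import SequenceMatcher

def verbatim_overlap(original: str, generated: str) -> tuple[str, float]:
    """Return the longest shared passage and its share of the original text."""
    matcher = SequenceMatcher(None, original, generated)
    match = matcher.find_longest_match(0, len(original), 0, len(generated))
    shared = original[match.a:match.a + match.size]
    return shared, match.size / max(len(original), 1)

original_text = "It was the best of times, it was the worst of times."  # placeholder
model_output = "The review said it was the best of times, it was the worst of era."

passage, share = verbatim_overlap(original_text, model_output)
print(f"Longest shared passage: {passage!r}")
print(f"Share of original reproduced verbatim: {share:.0%}")
```

A screen like this addresses the end of the workflow Matthan points to: it examines what the model produces, not how it learned.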
Matthan warns against extending copyright to the training cycle, a move that could absurdly imply human learning also requires licenses. The focus, he concludes, must shift from how AI learns to what it produces, ensuring creators are protected where the law actually applies.