
Priorities and Power: An AI Governance Proposal

    Although (non-)payments for the use of copyrighted works to train AI have taken centre stage, perhaps that's a concern which can be subsumed within the larger issues of balancing often competing interests and addressing the asymmetries of power which abound in the field, issues which we would do well to engage with sooner rather than later…

For some time now, it has been in vogue to say, repeatedly and resolutely, that all creators should be remunerated for the use of their content to train generative AI. This is likely due, in large part, to companies having been known to indiscriminately scrape any and all content they manage to access, protected or not, to train artificial intelligence models, occasionally, if reports are to be believed, entering into multi-million dollar deals for access to copyrighted content owned by or in the possession of powerful corporations.

A report released in January 2025 by a sub-committee constituted by the Ministry of Electronics and Information Technology raised the issue of AI being trained with ‘bulk datasets’: it noted that such use is generally prohibited by copyright law, questioned how compliance with existing law could be ensured, and wondered whether policy (and, presumably, legal) changes needed to be made in light of the mass use of data to train AI. Unfortunately, the report, which dealt with the development of (Indian) guidelines to govern AI, did not make concrete proposals to help define what the path ahead should look like.

Copyright infringement in relation to the deployment of AI can occur at (at least) three points: those of input, the generation of output, and the manifestation of output. Public discussions have tended to focus on input (that is, on the use of data to train AI) and on output (both in terms of who might own the potentially copyrightable works which AI spits out and in terms of what could happen should that output infringe the intellectual property or other rights of persons or the rights subsisting in pre-existing works) rather than on the intermediate process through which the output is generated.

This isn't entirely surprising since AI systems are often proprietary and the process through which their output manifests is often opaque to the public eye. Consequently, not only is infringement difficult to discern but so is authorship: the task of pinpointing the ‘true’ author of AI output is rarely straightforward since the degree to which human involvement shapes the output can be unclear.

As for infringement: given that acts such as reproducing and adapting copyrighted works lie within the exclusive purview of their owners, one can be reasonably certain that if such works are used as training data for AI without appropriate authorisation, copyright infringement is committed while the output of the AI is created or developed. However, without knowing exactly what is done to those works (in their avatar as training data) to manifest output, it is difficult to do more than speculate about the exact manner in which copyright infringement is committed at that intermediate stage. To be certain, it would be necessary for AI systems to be transparent. And to enforce copyright, it would be necessary to know whom to hold accountable for lapses in compliance with the law.

Transparency and accountability are generally spoken of positively (including in the sub-committee report on AI governance), but developing a workable legal model to deal with AI requires policy decisions to be made about how to implement such goals while balancing, and possibly prioritising, the various interests implicated by AI: proprietary ‘rights’ such as copyrights and trade secrets, human rights such as free speech and privacy, and the public interest. Even if these interests are not always clearly in competition, they are often non-converging or misaligned.

To further complicate matters, it has proved difficult to define AI; the sub-committee report seems to shy away from doing so. And it is likely that a single governance regime will ultimately cover disparate forms of AI simply because, even if AI lent itself to easy identification and categorisation (which it does not), it is unlikely that different regimes would be put in place for different types of AI.

Although generative AI has been in the news incessantly since 2024, much of the AI which plays a role in daily life is agentic or predictive. 

Demonstrating how definitions can be elusive, the line between agentic AI and GenAI can sometimes be quite thin and blurred: a melody could potentially be harmonised by the former, for example, while music would more likely be created from scratch with the support of the latter, although what amounts to AI-generated music and what counts as human-composed music is sometimes a matter of semantics.

Predictive AI, on the other hand, probably best illustrates just how unclear the picture can become when a range of interests are simultaneously implicated by the deployment of AI.

Consider a corporate proprietary AI which, hypothetically, claims to presage the occurrence of illness in a community:

If the workings of the AI are not publicly shared, they cannot be tested and its results cannot be independently replicated, meaning that there would be no easy way to immediately verify whether the AI does what it says on the box.

In such a case, if the AI were a spectacular failure at forecasting illness, it would be difficult to hold anyone to account before it had been deployed for long enough for inferences to be drawn from its functioning.

That said, forcing a company to share the workings of its AI could potentially violate its intellectual property rights and breach its trade secrets.

Nonetheless, enabling or allowing a company to have its AI be opaque when the stakes are high could detrimentally impact the public interest.

And drawing inferences from its functioning would likely require analysing the medical information of patients and other members of the community on whom it had been deployed, potentially violating their individual privacy and, if the information had been compiled, the copyright in the compilation. Not to mention that, if such a compilation had been used to train the AI in the first place, the AI itself could also violate both copyright and privacy rights at stages from input to output manifestation. 

In a context such as this, considering the immediate real-world consequences on the lives of large numbers of individuals, it is hard to argue that copyright concerns should take precedence over other concerns. And even though the situation is often vastly less muddled when it comes to generative AI, with its focus on creating works, it is still often difficult to make that argument in the face of outcomes such as the generation of malicious deepfakes alongside benign output.

Nonetheless, of all the concerns there are, copyright is probably one of the most clear-cut.

Indian copyright law, almost entirely articulated in the Copyright Act, 1957, is reasonably clear that copyrighted works cannot be used to train AI without appropriate licences having been obtained except in a few extremely limited circumstances. The statute does not recognise ‘fair use’ in a manner akin to the US concept of the term. Instead, through Section 52, it excuses copyright infringement in certain specified circumstances which it lists.

The use of protected content to train AI is not explicitly excused in the list, although, if the conduct involved in so using it fell within the scope of any of the use cases mentioned in Section 52, the provision could be relied upon to excuse it.

First amongst these circumstances are those listed in Section 52(1)(a) of the Copyright Act, which facilitates unauthorised ‘fair dealing’ with most protected works. ‘Fairness’ has generally been determined by courts drawing on the four-factor test in § 107, Title 17, USC (which weighs the purpose and character of the use, the nature of the copyrighted work, the amount and substantiality of the portion used, and the effect of the use upon the work's market), and although Section 52(1)(a) does not apply to computer programmes, it could be interpreted to apply to protected works other than computer programmes used as training data.

To quote from the sub-committee report on Section 52(1)(a): “Commercial research is not exempted; not-for-profit institutional research is not exempted. Not-for-profit research for personal or private use, not with the intention of gaining profit and which does not compete with the existing copyrighted work is exempted.”

Further, although ‘transformative use’ has not been explicitly included in Section 52 of the Indian Copyright Act, it has become a valid defence against allegations of copyright infringement via case law, thus demonstrating that although the statutory list of exceptions to copyright infringement is limited, it is not inflexible.

That being the case, it should be possible to frame factors or tests under which the use of protected content as training data for AI would be considered fair dealing in certain circumstances, with how those circumstances are delineated ultimately being a matter of policy and judicial determination.

Ideally, there should be a mechanism to prevent online data from being scraped to train AI, much as data can be kept from being indexed by search engines. As for content not explicitly excluded in this manner: an AI could perhaps be considered eligible to benefit from a possible exception to copyright infringement if it were open, if it were programmed so that its output was neither a replica nor a colourable imitation of an existing work incorporated in its training data, if it were structured so that its output did not create new works in the style of specific creative humans, and if the data used to train it were legally publicly accessible (as opposed to having been sub-licensed to AI companies, without creator consent, by content aggregators, possibly traditional publishers, who normally hide content behind paywalls).
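One existing, if imperfect, model for such exclusions is the Robots Exclusion Protocol: a website's robots.txt file can name specific crawlers and deny them access to some or all of its pages, and a number of AI crawlers (OpenAI's GPTBot, for instance, or Google's Google-Extended token governing AI training use) are publicly documented as honouring such directives. A minimal sketch, assuming the crawlers concerned actually comply, might read:

    # Deny OpenAI's training crawler access to the entire site
    User-agent: GPTBot
    Disallow: /

    # Opt out of the site's content being used to train Google's AI models
    User-agent: Google-Extended
    Disallow: /

The obvious limitation is that compliance with robots.txt is entirely voluntary; a legal mechanism attaching consequences to non-compliance would be needed to give such exclusions teeth.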

If it were implemented, a proposal such as this would overwhelmingly exclude, from the ambit of data which could legitimately be used to train AI without authorisation, any content which had been made commercially available at the instance of the author. It would, however, include the bulk of data made available online in the form of user-generated content, such as blog posts, SocMed posts, text, and images, provided that such data had not been specifically excluded.

Licensing all content protected by copyright is simply not a viable plan, not only because of the practical difficulties of identifying owners, negotiating licences, and paying fees but also because not all owners can be identified or want to be identified. The desire for anonymity or enhanced privacy is often entirely legitimate, and those who seek it — sexual assault survivors who speak of their experience online to raise awareness, perhaps — may not want to be stripped of their anonymity just to become eligible to be paid some paltry sum as licence fees for having their content used as training data for AI.

Being forced to sacrifice privacy for pennies seems like a poor trade-off, and it may not be one which everyone is willing to make despite the inescapable chant of author-owners being entitled to be paid for the use of their content to train AI.

Instead of focussing on paying all author-owners licence fees, it may be worth redirecting attention to clearly and carefully delineating the circumstances in which licence fees should be paid and to determining who is entitled to issue licences in the first place. For example, it isn't at all clear that publishers who acquire works without payment to authors should have the right to license those works en masse to AI companies for tremendous sums of money without so much as informing ‘their’ authors, let alone gaining their consent. Similarly, there should be clear legal requirements imposed on SocMed companies to ensure that, at the very least, they obtain users' consent before including their posts in training datasets.

Ultimately, copyright law functions in tandem with contract law and, instead of focussing almost exclusively on the former (particularly in relation to infringement), it would probably be more worthwhile to focus on the interface between the two to ensure as level a playing field as possible for all those involved, including individual content creators, corporate content aggregators, and AI companies.

India has engaged in such an exercise before: it enacted wide-ranging amendments to its copyright statute in 2012, many of which were drafted primarily with the intention of addressing asymmetrical power relations in the film and music industries by forcibly injecting a basic degree of fairness into the contracts which artistes could enter into, in relation to their works and performances, with corporate and other players.

The time may have come to begin to engage in such a process again, this time in relation to AI, not only to ensure that licensing arrangements to train AI are fair but also to force fairness into the contractual arrangements which currently determine who would be considered to own the intellectual property rights subsisting in AI-generated works and in works created with the support of AI.

Currently, there is little to prevent AI companies from doing what they please in relation to their AI, and a governance focus on preventing data from being scraped to train AI, if it were not supplemented with other measures to protect content creators, could do more harm than good.


This post includes text drawn from two LinkedIn posts published in late 2024, one on AI and copyright in relation to training in general and the other a proposal regarding how the issue could perhaps be handled, as well as from comments made to Sejal Sharma for her piece ‘MeitY's AI Regulation Report: Ambitious But No Concrete Solutions’, published on 9 January 2025.

This post is by Nandita Saikia and was first published at IN Content Law.