Microsoft’s AI boss Mustafa Suleyman claims that publishing content on the open web makes it “fair use” for anyone to copy and use freely. The remark has sparked controversy amid ongoing lawsuits against Microsoft and OpenAI over alleged copyright infringement in training AI models.
Key misconceptions about copyright law: Suleyman’s statements reveal a flawed understanding of how copyright and fair use operate on the internet:
- He incorrectly asserts that publishing content online automatically makes it “freeware” that anyone can copy and use, despite copyright protection applying automatically to original works upon creation.
- Suleyman mistakenly claims a “social contract” grants fair use for web content. In reality, fair use is a legal defense that courts decide case by case, weighing factors such as the purpose of the use and the amount copied.
AI companies’ controversial stance on copyrighted data: Microsoft’s position reflects a broader trend of AI companies arguing that training models on copyrighted material is fair use, even as they face growing legal challenges:
- Several lawsuits allege Microsoft and OpenAI are infringing copyrights by scraping online content to train AI without permission or compensation to creators.
- Many AI firms claim fair use protects this practice, but generative AI is new enough that courts have not settled the question, and the answer will likely emerge from ongoing litigation.
Disregarding established web conventions: Beyond the legal questions, Suleyman’s comments highlight how some AI companies are ignoring or misrepresenting long-standing norms around web scraping:
- He suggests that the robots.txt standard, which lets sites specify rules for web crawlers, might create a “grey area” for copying content, even though robots.txt is an informal convention, not a legally binding document.
- Reports indicate OpenAI and others have scraped sites while disregarding their robots.txt files entirely, breaching this “social contract” the tech industry has generally respected since the early web.
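To make the robots.txt convention concrete, here is a minimal sketch of how a well-behaved crawler might check a site's rules before fetching a page, using Python's standard `urllib.robotparser` module. The crawler name `ExampleAIBot`, the `example.com` URLs, and the sample rules are hypothetical and chosen only for illustration; nothing here describes how any particular company's crawler works.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks one (made-up) AI crawler from the whole
# site and blocks every other crawler from /private/ only.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def can_fetch(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if the given robots.txt rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(can_fetch("ExampleAIBot", "https://example.com/articles/1"))
    print(can_fetch("OtherBot", "https://example.com/articles/1"))
    print(can_fetch("OtherBot", "https://example.com/private/notes"))
```

The check is purely advisory: the file tells a crawler what the site owner wants, but nothing in the protocol stops a crawler from skipping the check and fetching the page anyway, which is why compliance depends on convention rather than enforcement.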
Broader implications for online content and AI: As generative AI rapidly advances, Suleyman’s statements exemplify the urgent need to clarify the legal and ethical boundaries around using copyrighted data to train these systems:
- With AI firms incentivized to hoover up as much training data as possible, a permissive approach to copyright could lead to wide-scale appropriation of creative works to fuel AI development.
- Allowing AI models to freely copy online content may undermine creators’ livelihoods and erode incentives to publish original material on the open web in the first place.
- Establishing clearer rules and norms will be crucial to strike a balance between enabling AI innovation and respecting intellectual property rights in this new technological landscape.
Microsoft’s AI boss thinks it’s perfectly OK to steal content if it’s on the open web