Library of Congress Is a Go-To Data Source for Companies Training AI Models

The Library of Congress: A new frontier for AI development: The world’s largest library has become an attractive resource for AI companies seeking to train their advanced language models using its vast digital archives.

  • The Library of Congress (LOC) houses 180 million items, including books, manuscripts, maps, and audio recordings, with 185 petabytes of digital data.
  • AI companies are increasingly interested in accessing this data to develop and train their most sophisticated AI models.
  • The library’s digital collections offer rare, original, and authoritative information in over 400 languages, spanning various disciplines.

Surge in data access requests: The Library of Congress has witnessed a significant increase in API traffic as AI companies seek to leverage its extensive digital resources.

  • The congress.gov site, managed by the LOC, receives between 20 million and 40 million hits on its API each month.
  • Since its launch in September 2022, the congress.gov API has seen consistent growth in traffic.
  • The library’s own API, by comparison, now receives approximately one million visits every month.

Unique appeal for AI developers: The Library of Congress’s data is particularly attractive to AI companies due to its public domain status and diverse content.

  • Unlike many other data sources, the LOC’s digital archives are not subject to copyright restrictions.
  • This makes the library one of the few remaining “free” resources for AI companies that have already mined much of the internet.
  • The alternative for AI companies is to strike licensing deals with publishers or use potentially problematic AI-generated “synthetic data.”

Challenges and limitations: While the Library of Congress welcomes data access through its API, it faces challenges in managing how companies collect and use its information.

  • The library prohibits direct scraping of content from its website because scraping can slow down public access to the archives.
  • Judith Conklin, chief information officer at the LOC, noted that some companies attempt to scrape websites directly for faster data collection, creating performance issues.
  • The library must manually intervene to slow down such scraping attempts to maintain website performance.
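
To illustrate the difference between scraping pages and using the API, here is a minimal Python sketch of a throttled API client. It rests on assumptions the article does not state: the `https://www.loc.gov/search/` endpoint with the `fo=json`, `c`, and `sp` query parameters is taken from the library's public API documentation as best understood here, and the helper names (`build_search_url`, `Throttle`, `fetch_page`) and the two-second interval are hypothetical choices for illustration, not limits published by the library.

```python
import json
import time
import urllib.request
from urllib.parse import urlencode

# Assumption: base URL of the loc.gov JSON API (not named in the article).
LOC_BASE = "https://www.loc.gov"


def build_search_url(query, page=1, per_page=25):
    """Build a search URL that asks for JSON instead of an HTML page."""
    params = {"q": query, "fo": "json", "c": per_page, "sp": page}
    return f"{LOC_BASE}/search/?{urlencode(params)}"


class Throttle:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(self, min_interval_s, clock=time.monotonic, sleep=time.sleep):
        self.min_interval_s = min_interval_s
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval_s - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
        self._last = self._clock()


def fetch_page(query, page, throttle):
    """Fetch one page of results, pausing first so requests stay spaced out."""
    throttle.wait()
    with urllib.request.urlopen(build_search_url(query, page)) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Hypothetical interval; check the library's own guidance for real limits.
    throttle = Throttle(min_interval_s=2.0)
    first_page = fetch_page("civil war maps", 1, throttle)
    print(sorted(first_page.keys()))
```

Spacing out requests on the client side is one way to avoid the performance problems described above, since the load on the library's servers no longer depends on how fast the collector could run.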

AI companies as potential customers: Beyond data collection, major tech companies are also approaching the Library of Congress as a potential client for their AI services.

  • Companies like OpenAI, Amazon, and Microsoft are pitching AI models to assist librarians and specialists with tasks such as catalog navigation, record searching, and document summarization.
  • However, there are challenges to overcome, particularly in dealing with historical accuracy and context.

Challenges with historical data: AI models trained on contemporary data often struggle to accurately interpret historical information in the library’s collections.

  • Natalie Smith, the LOC’s director of digital strategy, highlighted that AI models sometimes apply modern concepts to historical documents.
  • For example, an AI might misidentify a person holding a book in a historical image as someone holding a cell phone.
  • This bias towards contemporary interpretations poses challenges for accurately processing and understanding historical materials.

Risks of AI-generated inaccuracies: The library is cautious about implementing AI tools internally due to the potential for hallucinations and propagation of incorrect information.

  • The Congressional Research Service, part of the LOC, tested AI models for writing bill summaries but encountered issues with factual accuracy.
  • In one instance, an AI model incorrectly listed the District of Columbia as a U.S. state and made erroneous claims about the impact of legislation on students from Taiwan and Hong Kong.
  • These inaccuracies highlight the need for careful consideration and human oversight when using AI tools in the library’s operations.

Future plans and implications: Despite challenges, the Library of Congress aims to expand its digital offerings and make more unrestricted data available to the public and AI researchers.

  • The library plans to digitize more of its special collections in the coming years, benefiting both the public and AI companies.
  • Natalie Smith emphasized the historical role of libraries and federal agencies in providing data that has driven economic innovation.
  • The library’s data has the potential to contribute to technological advancements, similar to how geospatial data from federal agencies enabled the development of services like Uber.

Balancing innovation and responsibility: As the Library of Congress navigates its role in the AI era, it faces the challenge of fostering innovation while ensuring the integrity and appropriate use of its vast resources.

  • The library must balance making its collections more accessible with protecting the performance of its digital platforms.
  • There’s a need to address the limitations of AI in interpreting historical data accurately.
  • As AI companies increasingly rely on public domain resources like the LOC, questions arise about the long-term implications for AI development and the role of public institutions in shaping the future of AI technology.
The Library Of Congress Is A Training Data Playground For AI Companies
