The Library of Congress: A new frontier for AI development: The world’s largest library has become an attractive resource for AI companies seeking to train their advanced language models using its vast digital archives.

  • The Library of Congress (LOC) houses 180 million items, including books, manuscripts, maps, and audio recordings, with 185 petabytes of digital data.
  • AI companies are increasingly interested in accessing this data to develop and train their most sophisticated AI models.
  • The library’s digital collections offer rare, original, and authoritative information in over 400 languages, spanning various disciplines.

Surge in data access requests: The Library of Congress has witnessed a significant increase in API traffic as AI companies seek to leverage its extensive digital resources.

  • The congress.gov site, managed by the LOC, receives between 20 million and 40 million hits per month on its API.
  • Since its launch in September 2022, the congress.gov API has seen consistent growth in traffic.
  • The library’s own API now receives approximately one million visits every month (a minimal request sketch follows this list).
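
For a sense of what this kind of programmatic access looks like, here is a minimal Python sketch that queries the library’s public loc.gov JSON API. The endpoint, the fo=json and sp (page) parameters, and the result field names are drawn from the library’s public API documentation as I understand it and should be treated as assumptions for illustration, not as the library’s recommended workflow.

```python
# Minimal sketch: polite, paged access to public Library of Congress metadata
# via the documented loc.gov JSON API (https://www.loc.gov/apis/).
# Endpoint, parameters, and field names are assumptions based on the public docs.
import time
import requests

BASE = "https://www.loc.gov/search/"

def fetch_page(query: str, page: int = 1) -> dict:
    """Request one page of search results as JSON (fo=json)."""
    resp = requests.get(
        BASE,
        params={"q": query, "fo": "json", "sp": page},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    data = fetch_page("suffrage maps")
    for item in data.get("results", [])[:5]:
        print(item.get("title"), "-", item.get("url"))
    # Pause between requests rather than hammering the site; as the article
    # notes, aggressive scraping degrades public access to the archives.
    time.sleep(2)
```

The deliberate pause between requests reflects a point made later in the article: bulk collection through the API is welcome, but aggressive scraping slows the site for everyone else.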

Unique appeal for AI developers: The Library of Congress’s data is particularly attractive to AI companies due to its public domain status and diverse content.

  • Unlike many other data sources, much of the LOC’s digitized material is in the public domain or carries no known copyright restrictions.
  • This makes the library one of the few remaining “free” resources for AI companies that have already mined much of the internet.
  • The alternative for AI companies is to strike licensing deals with publishers or use potentially problematic AI-generated “synthetic data.”

Challenges and limitations: While the Library of Congress welcomes data access through its API, it faces challenges in managing how companies collect and use its information.

  • The library prohibits direct scraping of its website, because heavy scraping traffic can slow public access to the archives.
  • Judith Conklin, chief information officer at the LOC, noted that some companies attempt to scrape websites directly for faster data collection, creating performance issues.
  • The library must manually intervene to slow down such scraping attempts to maintain website performance.

AI companies as potential customers: Beyond data collection, major tech companies are also approaching the Library of Congress as a potential client for their AI services.

  • Companies like OpenAI, Amazon, and Microsoft are pitching AI models to assist librarians and specialists with tasks such as catalog navigation, record searching, and document summarization.
  • However, there are challenges to overcome, particularly in dealing with historical accuracy and context.

Challenges with historical data: AI models trained on contemporary data often struggle to accurately interpret historical information in the library’s collections.

  • Natalie Smith, the LOC’s director of digital strategy, highlighted that AI models sometimes apply modern concepts to historical documents.
  • For example, an AI might misidentify a person holding a book in a historical image as someone holding a cell phone.
  • This bias towards contemporary interpretations poses challenges for accurately processing and understanding historical materials.

Risks of AI-generated inaccuracies: The library is cautious about implementing AI tools internally due to the potential for hallucinations and propagation of incorrect information.

  • The Congressional Research Service, part of the LOC, tested AI models for writing bill summaries but encountered issues with factual accuracy.
  • In one instance, an AI model incorrectly listed the District of Columbia as a U.S. state and made erroneous claims about the impact of legislation on students from Taiwan and Hong Kong.
  • These inaccuracies highlight the need for careful consideration and human oversight when using AI tools in the library’s operations.

Future plans and implications: Despite challenges, the Library of Congress aims to expand its digital offerings and make more unrestricted data available to the public and AI researchers.

  • The library plans to digitize more of its special collections in the coming years, benefiting both the public and AI companies.
  • Natalie Smith emphasized the historical role of libraries and federal agencies in providing data that has driven economic innovation.
  • The library’s data has the potential to contribute to technological advancements, similar to how geospatial data from federal agencies enabled the development of services like Uber.

Balancing innovation and responsibility: As the Library of Congress navigates its role in the AI era, it faces the challenge of fostering innovation while ensuring the integrity and appropriate use of its vast resources.

  • The library must balance making its collections more accessible with protecting the performance of its digital platforms.
  • There’s a need to address the limitations of AI in interpreting historical data accurately.
  • As AI companies increasingly rely on public domain resources like the LOC, questions arise about the long-term implications for AI development and the role of public institutions in shaping the future of AI technology.
