News/Data
NVIDIA’s AI Training Practices Continue to Spark Copyright Controversy
NVIDIA faces allegations of improperly using copyrighted video content to train its artificial intelligence models, raising questions about the ethics and legality of AI training practices in the tech industry. The core accusation: NVIDIA allegedly downloaded massive amounts of video content from platforms like YouTube and Netflix without permission to train commercial AI projects. The company is said to have downloaded the equivalent of 80 years' worth of videos daily for AI model training. This content was reportedly used to develop products such as NVIDIA's Omniverse 3D world generator and "digital human" initiatives. The scale of the alleged downloads...
How Mathematical Data May Help Solve the AI Training Data Shortage (Aug 9, 2024)
Artificial intelligence's insatiable appetite for data has raised concerns about potential limitations on its future growth, but a compelling argument suggests these worries may be unfounded due to the infinite nature of mathematics. The big picture: The notion of running out of data for AI training overlooks the vast potential of mathematical data as an inexhaustible resource for fueling AI advancement. Experts have expressed concern that the finite amount of text and images available for AI training could hinder future progress. This perspective fails to consider the unlimited potential of mathematical data to supplement and expand training resources. Mathematical data...
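To make the "inexhaustible mathematical data" point concrete, here is a minimal sketch of procedurally generating arithmetic question/answer pairs; the format, operand ranges, and JSONL output are illustrative assumptions, not drawn from the article.

```python
import json
import random

def make_arithmetic_example(rng: random.Random) -> dict:
    """Generate one synthetic question/answer pair from randomly sampled operands."""
    a, b = rng.randint(1, 10_000), rng.randint(1, 10_000)
    op = rng.choice(["+", "-", "*"])
    answer = {"+": a + b, "-": a - b, "*": a * b}[op]
    return {"question": f"What is {a} {op} {b}?", "answer": str(answer)}

if __name__ == "__main__":
    rng = random.Random(42)
    # Emit a small JSONL sample; the generator can produce arbitrarily many examples.
    for _ in range(5):
        print(json.dumps(make_arithmetic_example(rng)))
```

Because the examples are generated rather than scraped, the supply is limited only by compute, which is the crux of the argument above.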
ProRata is Pioneering a Pay-Per-Use Data Sales Model to Solve AI’s Copyright Woes (Aug 8, 2024)
The emergence of AI pay-per-use models aims to address copyright concerns in generative AI by ensuring fair compensation for content creators and publishers whose work is used to train AI systems. ProRata's innovative approach: Bill Gross, CEO of startup ProRata, is spearheading an "AI pay-per-use" model to tackle the issue of AI companies using copyrighted data without permission. ProRata's primary goal is to establish revenue-sharing agreements that allow publishers and individuals to receive compensation when AI companies utilize their work. The company has already secured partnerships with major players in the media and publishing industry, including Universal Music Group, Financial...
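As a rough illustration of what per-use revenue sharing could look like (an assumption about the mechanics, not ProRata's disclosed method), the sketch below splits a per-query fee among content owners in proportion to hypothetical attribution weights.

```python
def split_payout(per_use_fee_cents: int, attribution: dict[str, float]) -> dict[str, int]:
    """Split a per-use fee among content owners in proportion to attribution weights."""
    total_weight = sum(attribution.values())
    # Rounding may leave a cent unallocated; a real billing system would reconcile this.
    return {
        owner: round(per_use_fee_cents * weight / total_weight)
        for owner, weight in attribution.items()
    }

# One AI answer drew on three sources with these hypothetical attribution scores.
attribution = {"publisher_a": 0.5, "publisher_b": 0.3, "independent_writer": 0.2}
print(split_payout(per_use_fee_cents=10, attribution=attribution))
# {'publisher_a': 5, 'publisher_b': 3, 'independent_writer': 2}
```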
How to Adapt LLMs for Domain Data (Aug 8, 2024)
The rapid advancement of large language models (LLMs) has opened up new possibilities for AI applications, but adapting these models to specific domains remains a challenge for many organizations. This article explores various methods for customizing LLMs, providing guidance for small AI product teams looking to integrate these powerful tools into their workflows. Overview of LLM adaptation approaches: The article outlines five main strategies for adapting LLMs to domain-specific data and use cases, each with its own strengths and limitations. Pre-training and continued pre-training are discussed as comprehensive but resource-intensive methods, typically beyond the reach of smaller teams. Fine-tuning, particularly...
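For a small team, the fine-tuning route usually means parameter-efficient methods rather than full-weight updates. Below is a minimal LoRA sketch using Hugging Face transformers and peft; the base model and hyperparameters are illustrative assumptions, not the article's recommendations.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_model = "gpt2"  # placeholder; swap in whatever base model fits the domain

tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model)

# Attach low-rank adapters to the attention projection so only a small
# fraction of parameters is trained on the domain data.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["c_attn"],  # GPT-2's fused attention projection; model-specific
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # most base weights stay frozen
```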
Rules of Thumb for Curating a Good Training Dataset (Aug 7, 2024)
Fine-tuning large language models (LLMs) has become a critical process in tailoring AI capabilities to specific tasks and domains. This article delves into the nuances of dataset curation for effective fine-tuning, offering valuable insights for AI practitioners and researchers. The big picture: Fine-tuning LLMs requires a delicate balance between quality and quantity in dataset preparation, with a focus on creating diverse, high-quality datasets that can effectively enhance model performance without compromising existing capabilities. The article is part of a series exploring the adaptation of open-source LLMs, with this installment specifically addressing the rules of thumb for curating optimal training datasets....
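As a loose illustration of what such rules of thumb look like in practice, the sketch below applies exact deduplication and basic length filters to instruction/response pairs; the field names and thresholds are assumptions rather than the article's prescriptions.

```python
import hashlib
import json

def curate(records, min_len=20, max_len=4000):
    """Drop exact duplicates and obviously malformed instruction/response pairs."""
    seen = set()
    for rec in records:
        prompt = rec.get("instruction", "").strip()
        response = rec.get("response", "").strip()
        if not prompt or not (min_len <= len(response) <= max_len):
            continue  # filter empty prompts and too-short/too-long responses
        key = hashlib.sha256((prompt + "\n" + response).encode("utf-8")).hexdigest()
        if key in seen:
            continue  # exact duplicate
        seen.add(key)
        yield rec

if __name__ == "__main__":
    sample = [
        {"instruction": "Summarize LoRA in one sentence.",
         "response": "LoRA adds small trainable low-rank matrices to a frozen model."},
        {"instruction": "Summarize LoRA in one sentence.",
         "response": "LoRA adds small trainable low-rank matrices to a frozen model."},
    ]
    print(json.dumps(list(curate(sample)), indent=2))  # only one record survives
```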
Inside the Company that Gathers High Quality Data for Major AI Companies (Aug 7, 2024)
Turing, a staffing firm led by CEO Jonathan Siddharth, has become a pivotal player in the AI industry by pivoting from software engineer recruitment to providing specialized "human data" for major AI companies, including OpenAI, to enhance their language models' reasoning abilities and task performance. The AI data revolution: Turing's transformation highlights a growing trend in the AI industry where high-quality, specialized data is becoming increasingly crucial for advancing AI capabilities beyond what can be learned from publicly available internet data. In early 2022, OpenAI approached Turing to provide high-quality computer code data to improve GPT-4's reasoning abilities, marking the...
How to Use AI to Parse Files and Extract Data with Airparser (Aug 6, 2024)
Airparser leverages AI to streamline data extraction from various file types, offering a no-code solution for businesses and individuals seeking to automate their data processing workflows. AI-powered data extraction revolutionizes file parsing: Airparser emerges as a cutting-edge tool designed to extract valuable information from emails, PDFs, and other file formats without requiring users to possess coding skills. The platform utilizes artificial intelligence to accurately parse and extract data from unstructured documents, significantly reducing the time and effort typically required for manual data entry. By automating the extraction process, Airparser enables users to quickly analyze and repurpose existing data from a...
How AI is Transforming Financial Services (Aug 6, 2024)
Generative AI is rapidly transforming the financial services industry, enabling banks to become more data-driven and insightful in their operations and customer interactions. This technological advancement is reshaping various aspects of banking, from transaction analysis to risk management and customer service. The power of transaction data: Transaction data stands at the heart of this transformation, offering banks unprecedented insights into customer behavior and financial patterns. Banks are leveraging GenAI to analyze and interpret vast amounts of transaction data, uncovering valuable insights that were previously difficult or impossible to obtain. This data-driven approach allows financial institutions to better understand their customers'...
AI-Powered D-BOT Slashes Database Diagnosis Time to Minutes (Aug 4, 2024)
D-BOT, a new database diagnosis system leveraging large language models (LLMs), promises to revolutionize how database administrators (DBAs) manage and troubleshoot database systems. This innovative approach addresses the challenges of managing numerous databases and providing rapid responses to issues, offering a more efficient alternative to traditional methods. The big picture: D-BOT aims to automate and accelerate database diagnosis, potentially reducing response times from hours to minutes while handling a wide range of scenarios. The system utilizes LLMs to acquire knowledge from diagnosis documents, enabling it to generate well-founded diagnosis reports that identify root causes and solutions. D-BOT's approach is designed...
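Based on that description, a D-BOT-style pipeline plausibly has a retrieve-then-reason shape. The sketch below is a guess at that shape, with a naive keyword retriever over diagnosis documents and a placeholder call_llm function standing in for the actual model; neither is taken from the paper.

```python
def retrieve_snippets(alert: str, documents: list[str], top_k: int = 2) -> list[str]:
    """Naive keyword-overlap retrieval over maintenance/diagnosis documents."""
    alert_terms = set(alert.lower().split())
    scored = sorted(documents, key=lambda d: -len(alert_terms & set(d.lower().split())))
    return scored[:top_k]

def build_diagnosis_prompt(alert: str, metrics: dict, snippets: list[str]) -> str:
    """Assemble the context an LLM would need to propose root causes and fixes."""
    context = "\n".join(f"- {s}" for s in snippets)
    stats = "\n".join(f"- {k}: {v}" for k, v in metrics.items())
    return (
        "You are a database diagnosis assistant.\n"
        f"Alert: {alert}\n\nRecent metrics:\n{stats}\n\n"
        f"Relevant knowledge:\n{context}\n\n"
        "List the most likely root causes and concrete remediation steps."
    )

def call_llm(prompt: str) -> str:
    raise NotImplementedError("placeholder for an actual LLM call")

if __name__ == "__main__":
    docs = ["High CPU with many sequential scans often indicates missing indexes.",
            "Lock waits spike when long transactions hold row locks."]
    prompt = build_diagnosis_prompt(
        "CPU utilization above 95% for 10 minutes",
        {"cpu": "97%", "active_sessions": 184, "seq_scans_per_min": 1200},
        retrieve_snippets("high CPU sequential scans", docs),
    )
    print(prompt)
```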
If You Work With Data, Try These Generative AI Data Analytics Tools (Aug 2, 2024)
The rapid growth of data and the potential insights hidden within it have led to the development of generative AI tools that democratize data analytics, making it accessible to everyone, not just data scientists. The challenge of too much data: The exponential growth of data presents a challenge in extracting valuable insights that can drive advancements in various fields, from healthcare to business: The problem lies not in the lack of information but in the overwhelming amount of data available, making it difficult to know where to start. Traditionally, extracting insights from data required highly skilled individuals and significant time...
Gary Marcus: How Outliers Expose the AI Industry’s Fragile Future (Aug 1, 2024)
The rapid rise and potential fall of the current AI industry can be largely explained by one crucial fact: AI struggles with outliers, leading to absurd outputs when faced with unusual situations. The outlier problem: Current machine learning approaches, which underpin most of today's AI, perform poorly when encountering circumstances that deviate from their training examples: A Carnegie Mellon computer scientist, Phil Koopman, illustrates this issue using the example of a driverless car accident involving an overturned double trailer, which the AI system failed to recognize due to its unfamiliarity with the situation. This limitation, also known as the problem...
Aryn’s New ‘APS’ Simplifies Complex Data Processing for AI Apps (Aug 1, 2024)
Aryn Partitioning Service launch announcement: Aryn has announced the Aryn Partitioning Service (APS), a serverless, GPU-powered API for segmenting and labeling PDF documents, performing OCR, extracting tables and images, and more, aiming to simplify the processing of complex, unstructured data for various applications. Key features and benefits: APS runs the Aryn Partitioner and its state-of-the-art, open-source DETR deep learning model, which has been trained on over 80,000 enterprise documents. The service can lead to 6x more accurate data chunking and 2x improved recall on hybrid search or RAG compared to off-the-shelf systems. Users can access APS through an API...
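To show why partition quality matters downstream, here is a generic chunking sketch (not Aryn's API): labeled document elements are grouped into retrieval chunks, starting a new chunk at headings or when a size budget is exceeded. The element labels and size limit are assumptions.

```python
from dataclasses import dataclass

@dataclass
class Element:
    kind: str   # e.g. "heading", "paragraph", "table"
    text: str

def chunk_elements(elements: list[Element], max_chars: int = 1200) -> list[str]:
    """Group labeled elements into chunks, starting a new chunk at headings or size limits."""
    chunks: list[list[str]] = []
    current: list[str] = []
    size = 0
    for el in elements:
        if current and (el.kind == "heading" or size + len(el.text) > max_chars):
            chunks.append(current)
            current, size = [], 0
        current.append(el.text)
        size += len(el.text)
    if current:
        chunks.append(current)
    return ["\n".join(c) for c in chunks]

if __name__ == "__main__":
    doc = [
        Element("heading", "1. Revenue"),
        Element("paragraph", "Revenue grew 12% year over year."),
        Element("table", "Q1 | Q2 | Q3\n10 | 11 | 12"),
        Element("heading", "2. Risks"),
        Element("paragraph", "Supply constraints may affect margins."),
    ]
    for i, chunk in enumerate(chunk_elements(doc)):
        print(f"--- chunk {i} ---\n{chunk}")
```

The better the upstream labeling (headings, tables, figures), the less a chunker like this splits related content apart, which is where the claimed chunking and recall gains come from.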
Google Integrates Gemini into Looker and BigQuery (Aug 1, 2024)
Google Cloud has unveiled significant updates to its database and data analytics offerings, aiming to facilitate generative AI deployments and adoption by integrating more flexibility into data usage and access. Key Takeaways: Google Cloud announces a series of updates to its Spanner, Bigtable, BigQuery, and Looker services at the Google Cloud Next event in Tokyo. The updates focus on expanding the capabilities of these offerings to better support generative AI applications and improve data analysis. Gerrit Kazmaier, GM & VP of Data Analytics at Google Cloud, emphasizes the importance of having "incredible data" to achieve "incredible AI." Enhancing Data Analytics...
Reddit CEO Defends Decision to Block Web Scrapers (Aug 1, 2024)
The CEO of Reddit, Steve Huffman, is defending the company's recent decision to block web scrapers from accessing the site's content without an AI agreement, which has sparked controversy and raised concerns about competition in the search engine market. Key developments: Reddit's move to restrict web scraping has significantly impacted search results and the competitive landscape: Reddit updated its robots.txt file to block bots from scraping the site, stating that it believes in an open internet but opposes the misuse of public content. As a result, search engines other than Google were temporarily unable to list recent Reddit posts in...
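Mechanically, blocking scrapers this way relies on crawlers honoring the robots exclusion protocol. The sketch below shows how a well-behaved bot would check a blanket-disallow directive of the kind Reddit adopted before fetching; the directives and user-agent string here are illustrative, not the actual file.

```python
from urllib.robotparser import RobotFileParser

# Illustrative directives in the spirit of Reddit's updated policy (not the real file).
rules = """
User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# A compliant scraper checks before fetching; here every path is disallowed for every bot.
print(parser.can_fetch("ExampleBot/1.0", "https://www.reddit.com/r/technology/"))  # False
```

The catch, of course, is that robots.txt is advisory: it only stops crawlers that choose to respect it, which is why Reddit pairs it with AI licensing agreements and technical enforcement.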
How Good Data Practices Drive Productivity for Dow (Jul 31, 2024)
Dow is leveraging generative AI to dramatically improve productivity across its global operations by building a data-literate culture and strong technical foundations. Driving value with gen AI: Dow is implementing gen AI tools like Microsoft 365 Copilot and seeing significant early productivity gains in key areas: Over half of pilot users reported saving 1-2 hours per day with Copilot. Gen AI is being used to enhance content development, data analysis, patent research, and document search. AI has revolutionized demand forecasting and supply network planning in Dow's supply chain. Building on AI foundations: Dow's ability to rapidly adopt and scale...
Anthropic’s Aggressive Data Scraping Is Causing Problems for Sites It Targets (Jul 31, 2024)
Anthropic's aggressive data scraping practices raise ethical concerns as the company disregards website permissions in its quest to train its Claude language model. Anthropic's data scraping tactics: The artificial intelligence company Anthropic has been engaging in aggressive data scraping practices to gather training data for its Claude language model, often disregarding website permissions and robots.txt files: Ifixit.com reported that Anthropic's ClaudeBot hit their site a million times in a single day, highlighting the intensity of the company's data scraping efforts. Freelancer experienced 3.5 million hits from ClaudeBot in just four hours, with the bot's activities triggering alarms and waking up...
Data Strategies for Marketers in a Privacy-Focused World (Jul 30, 2024)
Marketers face significant challenges due to signal loss caused by privacy regulations and technological changes, but innovative strategies and tools can help them adapt and thrive in this new landscape. Understanding signal loss and its implications for advertisers: Signal loss refers to the reduction or elimination of data signals that advertisers rely on to track, measure, and optimize their campaigns, which can happen due to privacy regulations, browser updates, mobile operating system changes, and increased use of ad blockers: Signal loss reduces targeting precision, complicates attribution, and leads to inefficient budget allocation. Advertisers must increasingly rely on first-party data, invest...
AI Trains on Kids’ Photos Without Consent, Raising Alarms for Families and Policymakers (Jul 27, 2024)
Human Rights Watch recently revealed that photos of children scraped from the internet, including some hidden behind privacy settings on social media, were used to train AI models without consent from the children or their families. This concerning revelation has broad implications for data privacy and the unintended consequences of "sharenting" in the age of AI. Key Takeaways: The unauthorized use of children's personal photos to train AI models raises serious privacy concerns: Many of the scraped images included children's names and identifying information, making them easily traceable. Some of the photos used were not even publicly available but hidden...
X’s Grok Has Been Secretly Training on Your Data (Jul 27, 2024)
X is using user data to train its Grok AI chatbot, sparking privacy concerns as the feature is enabled by default, requiring users to actively opt out. Key details about X's data usage for Grok AI: X's social media platform is utilizing user posts, interactions, inputs, and results with the Grok chatbot to train and fine-tune the AI, which has caused outrage among some users upon discovering the opt-out nature of the feature: X's privacy policy has allowed for this data usage since at least September 2023, but it remains unclear exactly when the data collection for Grok began. While...
Salesforce’s New Trillion-Token AI Dataset Could Revolutionize Machine Learning (Jul 27, 2024)
Salesforce's MINT-1T dataset, containing one trillion text tokens and 3.4 billion images, has the potential to significantly impact the AI industry by enabling breakthroughs in multimodal learning and leveling the playing field for researchers. Massive AI dataset: Bridging the gap in machine learning. The scale and diversity of MINT-1T, drawing from a wide range of sources like web pages and scientific papers, provides AI models with a broad view of human knowledge, which is crucial for developing AI systems that can work across different fields and tasks: The release of MINT-1T breaks down barriers in AI research, allowing small labs...
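For researchers who want to inspect the data without downloading a trillion tokens, streaming access along these lines should work with the Hugging Face datasets library; the repository ID and split below are assumptions, so check the official MINT-1T release for the exact identifiers.

```python
from datasets import load_dataset

# Stream a MINT-1T subset rather than downloading it; the repo ID is an assumed placeholder.
ds = load_dataset("mlfoundations/MINT-1T-HTML", split="train", streaming=True)

for i, sample in enumerate(ds):
    print(sample.keys())  # inspect the interleaved text/image fields
    if i >= 2:
        break
```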
EU Watchdog Probes X’s Use of User Data for AI Without Consent (Jul 27, 2024)
The EU's privacy watchdog is questioning X (formerly Twitter) over its use of user posts to train the xAI chatbot Grok without obtaining consent, potentially violating GDPR rules. Key details: The EU watchdog expressed surprise and is seeking clarity on X's data practices, which may not comply with GDPR requirements for obtaining user consent before using personal data: X users were likely opted-in by default to have their posts used as training data for Grok, an AI chatbot developed by Elon Musk's AI company xAI. Under GDPR, companies must obtain explicit consent from users before using their personal data for...
X Introduces User Opt-Out for Grok AI Training Data (Jul 27, 2024)
X now allows users to opt out of having their data used to train the Grok chatbot. The social media platform provides a setting for users to prevent their posts and interactions from being used to train and fine-tune the company's Grok AI assistant. Key details of the opt-out feature: The setting is accessible on the web and will soon be available on mobile. Users can uncheck a box to opt out of allowing their posts, interactions, inputs, and results with Grok to be used for training and fine-tuning purposes. Private accounts are automatically excluded from having their posts used to train...
NVIDIA Dominates KDD Cup 2024 Data Science Competition (Jul 25, 2024)
Team NVIDIA's Clean Sweep: Team NVIDIA, consisting of six NVIDIANs, secured first place across all five competition tracks at the prestigious Amazon KDD Cup 2024, demonstrating their mastery in generative AI and data science: The team's innovative approach involved generating 500,000 questions using a combination of manual creation, large language models, and transforming existing e-commerce datasets to overcome the limited training data provided by the organizers. By fine-tuning the Qwen2-72B model using eight NVIDIA A100 Tensor Core GPUs and employing the QLoRA technique, Team NVIDIA outperformed all competitors despite the constraints imposed by the competition's format. KDD Cup 2024: Mimicking...
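For readers unfamiliar with QLoRA, the sketch below shows the 4-bit quantized loading step the technique is built around, using Hugging Face transformers; Team NVIDIA's actual recipe is not detailed here, so these settings are illustrative assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# QLoRA's core idea: load the large base model with 4-bit NF4 quantized weights,
# then train only small low-rank adapters on top (as in the LoRA sketch earlier).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2-72B",            # base model named in the write-up; still needs multiple GPUs
    quantization_config=bnb_config,
    device_map="auto",
)
```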
How Sightfull Is Using AI to Revolutionize Business Data Analytics and Insights (Jul 25, 2024)
Sightfull, a business data analytics platform, leveraged the power of generative AI (GenAI) to enhance their users' experience by providing meaningful analysis and insights into their data. Exploring possibilities: Sightfull narrowed down potential use cases for GenAI within their product, focusing on areas where clients would benefit from personalized assistance: They identified three main ideas: discovery (finding relevant information quickly), productivity (interacting with the platform more efficiently), and explainability (understanding data and insights). Ultimately, they chose to focus on explainability, creating a "Data storytelling" feature that summarizes metrics and highlights points of interest. Prompt engineering techniques: The team experimented with...
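As a generic illustration of the data-storytelling pattern (not Sightfull's implementation), a common approach is to compute the notable metric movements in code and let the model narrate only those; the metric names, threshold, and prompt wording below are assumptions.

```python
def notable_changes(current: dict, previous: dict, threshold: float = 0.10) -> list[str]:
    """Pre-compute metric movements so the LLM only has to narrate, not do arithmetic."""
    notes = []
    for name, value in current.items():
        prev = previous.get(name)
        if prev:
            change = (value - prev) / prev
            if abs(change) >= threshold:
                notes.append(f"{name} moved {change:+.0%} ({prev} -> {value})")
    return notes

current = {"ARR": 1_250_000, "churn_rate": 0.034, "new_logos": 18}
previous = {"ARR": 1_100_000, "churn_rate": 0.041, "new_logos": 19}

highlights = notable_changes(current, previous)
prompt = (
    "Write a short narrative for a sales leader summarizing these metric changes, "
    "flagging anything that needs attention:\n- " + "\n- ".join(highlights)
)
print(prompt)  # this prompt would then be sent to an LLM of choice
```

Doing the arithmetic before prompting keeps the model focused on explanation, which is one way to make an "explainability" feature like data storytelling reliable.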