News/Data

Dec 25, 2024

How to ensure data protection in the age of AI

Current state of AI security: Organizations are grappling with fundamental questions about how to secure AI systems and protect sensitive data while enabling productive use of the technology. Security leaders face dual challenges of protecting proprietary AI models from attacks while preventing unauthorized data exposure to public AI models. Many organizations lack clear frameworks for managing AI-related security risks. The absence of major AI security incidents so far has led to varying levels of urgency in addressing these challenges. Key implementation challenges: Security teams must address several critical areas as AI adoption accelerates across business functions. Monitoring and controlling employee...

Dec 21, 2024

Drinking water, not fossil fuel: Why AI training data isn’t like oil

The ongoing debate about data scarcity in artificial intelligence (AI) requires a critical examination of common metaphors and their accuracy in describing the relationship between data and AI systems. Key misconception: The comparison of data to fossil fuels for AI systems, popularized by OpenAI co-founder Ilya Sutskever's claim that "Data is the fossil fuel of AI, and we used it all," misrepresents the fundamental nature of data as a resource. This metaphor incorrectly suggests that high-quality data for AI training is a finite, non-renewable resource. The concept of data scarcity is highly context-dependent and varies significantly across different domains and...

Dec 21, 2024

Perplexity’s acquisition of Carbon will connect your internal data to search

The AI search startup Perplexity has acquired Carbon, a data integration framework provider, to enhance enterprise search capabilities and bridge the gap between public and private data sources. Strategic acquisition overview: Perplexity's purchase of Carbon marks a significant expansion of its AI search capabilities, particularly in the enterprise sector. The acquisition comes after Perplexity's successful year, which included raising hundreds of millions in funding and reaching a $9 billion valuation. Carbon's technology provides a universal API and SDKs that connect external data sources to large language models (LLMs). The framework supports over 20 data connectors and file formats, including text,...

Dec 20, 2024

Meta just released Nymeria, a dataset that captures all the nuances of human motion

The years-long drive to advance wearable technology has created new opportunities for understanding and predicting human body movement, with potential applications ranging from fitness tracking to workplace ergonomics. Dataset Overview: Reality Labs Research has released Nymeria, a groundbreaking dataset containing 300 hours of multimodal egocentric human motion captured in natural settings. The dataset captures diverse individuals performing everyday activities across various locations using Project Aria glasses and miniAria wristbands. Twenty predefined unscripted scenarios, including cooking and sports activities, were recorded to ensure comprehensive coverage of daily movements. The collection includes detailed language annotations describing human motions at multiple levels...

Dec 20, 2024

Why industry insiders believe data scarcity may cause an AI slowdown

The artificial intelligence industry faces an unexpected challenge as major tech companies encounter limitations in the data available to train their AI systems, potentially slowing the rapid advancement of chatbots and other AI technologies. The data dilemma: Google DeepMind's CEO Demis Hassabis warns that the traditional approach of improving AI systems by feeding them more internet data is becoming less effective as companies exhaust available digital text resources. Tech companies have historically relied on increasing amounts of internet-sourced data to enhance large language models, which power modern chatbots. Industry leaders are observing diminishing returns from this approach as they reach...

Dec 20, 2024

Ukraine is building a massive dataset of battleground footage to train AI war models

Artificial intelligence applications in modern warfare are taking another leap forward as Ukraine amasses an unprecedented collection of battlefield data from drone operations. Scale of data collection: Ukraine's OCHI system has accumulated over 2 million hours of battlefield footage since 2022, representing one of the largest real-world combat datasets ever assembled. The system processes input from more than 15,000 drone crews operating across various combat zones. Daily data collection averages 5-6 terabytes of new footage. A parallel system called Avengers analyzes drone and CCTV footage, reportedly identifying approximately 12,000 pieces of Russian military equipment weekly. AI applications in current combat:...

Dec 20, 2024

How AI is transforming the game of baseball

The intersection of artificial intelligence and Major League Baseball is creating new competitive advantages, with the Texas Rangers leveraging data analytics to enhance their game strategy and operations. Data-driven success story: The Texas Rangers' 2023 World Series victory showcases how professional baseball teams are harnessing AI and big data to gain competitive advantages. Data engineer Oliver Dykstra has been instrumental in transforming collected information into actionable insights since joining the Rangers in October 2022. The team employs hundreds of predictive models that analyze various aspects of the game. Accurate predictions helped forecast the Rangers' successful 2023 season performance. AI-powered predictions...

Dec 18, 2024

What we can learn from Databricks’ $3B ARR milestone

The enterprise data and AI platform Databricks continues to demonstrate exceptional growth in the cloud and AI infrastructure space, reaching $3 billion in Annual Recurring Revenue (ARR) while maintaining a 60% growth rate. The current landscape: Databricks' remarkable performance comes amid a broader recovery in the SaaS and cloud sectors, with AI driving significant growth across various segments of the technology industry. The company serves over 10,000 customers, with 500 of them generating more than $1 million in annual revenue. Growth has actually accelerated from 50% at $1.5 billion ARR to 60% at $3 billion ARR, defying typical scaling patterns...

Dec 18, 2024

New research shows where Big Tech is getting all its AI training data

The increasing dominance of large tech companies in AI training data collection and management raises significant concerns about diversity, transparency, and power concentration in artificial intelligence development. Key findings from comprehensive audit: The Data Provenance Initiative, comprising over 50 researchers, conducted an extensive analysis of nearly 4,000 public AI datasets across 600+ languages and 67 countries, spanning three decades. The research revealed a dramatic shift in data collection methods since 2017, moving from carefully curated sources to widespread internet scraping. More than 70% of video and speech training data comes from a single source, likely YouTube, highlighting Google/Alphabet's outsized influence...

Dec 17, 2024

Nvidia and DataStax just made data storage costs much cheaper

The intersection of enterprise data management and artificial intelligence has reached a new milestone with Nvidia and DataStax's joint technological breakthrough in generative AI storage and retrieval systems. Key innovation: Nvidia NeMo Retriever microservices, integrated with DataStax's AI platform, deliver a revolutionary approach to enterprise data management and AI implementation. The new technology reduces data storage requirements by 35 times compared to traditional methods. Enterprise data is expected to exceed 20 zettabytes by 2027, making storage efficiency crucial. Current enterprise unstructured data stands at 11 zettabytes, equivalent to 800,000 copies of the Library of Congress. Approximately 83% of enterprise data...

Dec 15, 2024

Human-sourced data prevents AI model collapse, study finds

The rapid proliferation of AI-generated content is creating a critical challenge for artificial intelligence systems, potentially leading to deteriorating model performance and raising concerns about the long-term viability of AI technology. The emerging crisis: AI models are showing signs of degradation due to overreliance on synthetic data, threatening the quality and reliability of AI systems. The increasing use of AI-generated content for training new models is creating a dangerous feedback loop. Model performance is declining as systems are trained on synthetic rather than human-generated data. This degradation poses risks ranging from medical misdiagnosis to financial losses. Understanding model collapse: Model...
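The feedback loop described here can be illustrated with a toy simulation (a minimal sketch for intuition, not the study's actual method): fit a Gaussian to some data, sample synthetic "training data" from the fit, refit on the synthetic samples, and repeat. Diversity, measured here as the standard deviation, steadily collapses across generations.

```python
import random
import statistics

def fit_gaussian(samples):
    # "Train" a model: estimate mean and standard deviation from data.
    return statistics.mean(samples), statistics.pstdev(samples)

rng = random.Random(0)
# Generation 0: human-generated data with real diversity (std dev ~1).
data = [rng.gauss(0.0, 1.0) for _ in range(10)]

sigmas = []
for generation in range(500):
    mu, sigma = fit_gaussian(data)
    sigmas.append(sigma)
    # Each new generation is trained only on the previous model's output.
    data = [rng.gauss(mu, sigma) for _ in range(10)]

# The estimated spread shrinks generation over generation: rare
# (tail) behavior disappears first, then overall diversity is lost.
print(round(sigmas[0], 3), round(sigmas[-1], 6))
```

With real models the mechanism is analogous but harder to see directly: each round of training on synthetic output forgets the tails of the original human data distribution.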

Dec 15, 2024

Harvard releases copyright-free AI training data to democratize AI development

The growth of open-source AI training datasets marks a significant shift in how artificial intelligence models access and learn from literary works, with Harvard University taking a leading role through a major public domain book release. Project overview: Harvard's Institutional Data Initiative (IDI) has launched an unprecedented effort to democratize AI development by releasing nearly one million public domain books for AI training purposes. The dataset represents a five-fold increase compared to the Books3 dataset, previously one of the largest open collections used for AI training. Microsoft and OpenAI have provided funding support for this initiative, highlighting major tech companies'...

Dec 11, 2024

How to use ChatGPT’s data tool to unlock business insights without coding

Artificial Intelligence is dramatically simplifying complex data analysis tasks, with ChatGPT's Advanced Data Analysis feature enabling both technical and non-technical users to extract meaningful insights from large datasets through simple conversational prompts. Core capabilities and practical applications: ChatGPT's Advanced Data Analysis feature demonstrates significant versatility in handling various data formats and performing complex analytical tasks without requiring programming expertise. The system successfully processed a diverse range of datasets, including 69,215 records of New York City baby names, analyzing trends across different ethnicities and calculating naming ratios. When examining 22,797 records of software uninstall data, ChatGPT performed sentiment analysis on user...
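The article doesn't reproduce the prompts or the exact column layout, but the kind of "naming ratio" aggregation described can be sketched in plain Python over a few hypothetical rows shaped like the NYC baby-names file (the names, counts, and field order below are invented for illustration):

```python
from collections import defaultdict

# Hypothetical rows: (year, ethnicity, name, count).
# The real dataset has 69,215 records; four suffice to show the idea.
rows = [
    (2011, "HISPANIC", "SOPHIA", 119),
    (2011, "HISPANIC", "ISABELLA", 102),
    (2012, "HISPANIC", "SOPHIA", 130),
    (2012, "HISPANIC", "ISABELLA", 110),
    (2012, "ASIAN AND PACIFIC ISLANDER", "SOPHIA", 61),
]

# Total recorded births per (ethnicity, year) group.
totals = defaultdict(int)
for year, eth, name, count in rows:
    totals[(eth, year)] += count

def naming_ratio(name, eth, year):
    # A name's share of all recorded births in its group that year.
    matching = sum(c for y, e, n, c in rows if (y, e, n) == (year, eth, name))
    return matching / totals[(eth, year)]

print(round(naming_ratio("SOPHIA", "HISPANIC", 2012), 3))
```

The point of the feature is that a user gets this result by asking in English; the tool writes and runs code of roughly this shape behind the scenes.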

Dec 11, 2024

Scale AI faces lawsuit from data labelers wanting FTE status

A new labor dispute has arisen, with data labeling workers challenging their classification as contractors rather than full-time employees. Legal challenge emerges: Scale AI, a $14 billion AI data labeling company, faces a class action lawsuit from workers who claim they were misclassified as contractors and underpaid for their work training artificial intelligence models. The lawsuit, filed by Clarkson Law Firm in California, could potentially involve between 10,000 and 20,000 workers. Lead plaintiff Steve McKinney alleges he was promised $25 per hour but sometimes received as little as $17 per hour. Workers claim they were required to perform unpaid...

Dec 7, 2024

AI advances are outpacing legal frameworks on data protection

The legal landscape surrounding artificial intelligence and data protection continues to evolve as technological advances outpace existing regulatory frameworks, particularly in areas like generative AI and trade secret protection. Key legal challenges in Gen AI: The development of generative artificial intelligence has created several unresolved legal questions that policymakers must address. A critical balance must be struck between allowing data access for AI training and protecting creators' rights. Questions remain about intellectual property rights for AI-generated content, including who owns the rights to content created using Gen AI tools. Major tech companies like Google, OpenAI, and Microsoft have taken proactive...

Dec 6, 2024

How data embassies could enhance cross-border AI safety

The growing complexity of cross-border AI deployment has organizations searching for innovative solutions to navigate varying international regulations while protecting sensitive data. The data embassy concept: Data embassies represent a novel approach that allows organizations to maintain control over their data while operating in foreign jurisdictions, similar to how traditional diplomatic missions function. This framework would protect data from being accessed by host country authorities where data centers are physically located. Early adopters like Estonia and Bahrain have implemented data embassy models, while India and Malaysia are exploring similar approaches. The concept aims to resolve the tension between organizational data...

Dec 5, 2024

Meta’s next AI might allow you to type without using your hands

Surface electromyography (sEMG) technology is advancing as a means of translating muscle activity at the wrist into digital commands, with potential applications ranging from augmented reality control to keyboardless typing. Major breakthrough: Meta is releasing two groundbreaking datasets and benchmarks for sEMG-based typing and pose estimation as part of NeurIPS 2024, representing the largest open-source sEMG datasets ever compiled. The datasets include 716 hours of sEMG recordings from 301 consenting participants. Each dataset contains 10 times more data than previous single-task, single-device collections. State-of-the-art models for typing and pose estimation are being released alongside the datasets. Technical innovation: Surface electromyography...

Dec 5, 2024

How AI is solving the finance industry’s data problem

The financial services industry is experiencing significant challenges with data management and infrastructure, prompting the emergence of AI-powered solutions to address longstanding inefficiencies and meet evolving regulatory demands. Current state of financial data: The financial services sector faces unprecedented complexity in managing vast amounts of data across multiple formats, systems, and regulatory requirements. Financial institutions must handle diverse data types including market data, transaction records, and client information, all while adhering to strict governance protocols. The NASDAQ processes over 35 million trades daily, while Visa handles approximately 700 million transactions per day. Legacy infrastructure and organizational silos create significant barriers...

Dec 5, 2024

This open-source dataset may lead to more fuel-efficient, AI-designed cars

Global efforts to create more sustainable and efficient vehicles have received a significant boost from a groundbreaking database of car designs and their aerodynamic properties developed by MIT engineers. Project overview: DrivAerNet++, a comprehensive open-source dataset, contains over 8,000 3D car designs with detailed aerodynamic simulations, representing a significant advancement in automotive design resources. The database encompasses multiple car types including fastback, notchback, and estateback designs. Each design includes various representations such as mesh models, point clouds, and parametric specifications. The project required more than 3 million CPU hours of processing time and generated 39 terabytes of data. Technical foundation:...

Dec 4, 2024

Supabase AI turns ideas into Postgres databases in minutes

The rapid advancement of AI assistance tools has reached the database management realm with Supabase's latest offering, which aims to streamline database operations and development workflows. Product Overview: Supabase has launched a comprehensive AI Assistant that integrates directly with PostgreSQL databases and offers advanced database management capabilities. The AI Assistant functions as a global helper tool providing support across multiple database operations. Key features include PostgreSQL schema design, data querying, visualization, and automated error debugging. The tool can generate PostgreSQL RLS (Row Level Security) policies, functions, and triggers. A notable capability is the automatic conversion of SQL queries to supabase-js...

Dec 4, 2024

Strategies executives use to build AI-ready data foundations

The foundation of successful AI implementation lies in creating robust data management strategies, as demonstrated by leading executives who emphasize the critical importance of proper data governance before deploying artificial intelligence solutions. Strategic foundations and personnel priorities: L&G's group chief data and analytics officer Claire Thompson emphasizes that establishing strong data foundations is crucial for future innovation and business value. A clear connection between data strategy and tangible business outcomes helps justify the investment in proper data management. Close collaboration between data teams and IT departments is essential for effective data governance. Data quality by design principles help prevent downstream...

Dec 3, 2024

This startup aims to transform the entire internet into a searchable database

A new startup called Exa is developing an innovative search engine that aims to transform how we find and organize information online by treating the entire internet as a structured database rather than just a collection of web pages. The technological breakthrough: Exa's new search engine, Websets, leverages large language models to encode web pages into "embeddings" that capture their underlying meaning rather than just matching keywords. The technology creates a semantic understanding of web content, allowing for more precise and relevant search results. Unlike other AI search engines that simply layer language models over traditional search, Exa has rebuilt...
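Exa's retrieval pipeline isn't public in detail, but the core idea, ranking pages by vector similarity instead of keyword overlap, can be sketched with hand-made embeddings (the page names and three-dimensional vectors below are invented for illustration; real embeddings come from a model and have hundreds of dimensions):

```python
import math

# Toy "embeddings": each page is a dense vector whose direction
# encodes its meaning. Similar topics point in similar directions.
pages = {
    "intro-to-databases": [0.9, 0.1, 0.0],
    "sql-tutorial":       [0.8, 0.2, 0.1],
    "cat-pictures":       [0.0, 0.1, 0.9],
}

def cosine(a, b):
    # Cosine similarity: 1.0 for identical directions, 0.0 for unrelated.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def search(query_vec, k=2):
    # Rank pages by semantic closeness to the (embedded) query.
    ranked = sorted(pages, key=lambda p: cosine(query_vec, pages[p]),
                    reverse=True)
    return ranked[:k]

# A query whose embedding points toward "database-like" content
# retrieves both database pages, even with zero shared keywords.
print(search([1.0, 0.0, 0.0]))
```

Keyword search could never connect a query to a page with no overlapping terms; similarity in embedding space is what makes "the internet as a database" queryable by meaning.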

Dec 2, 2024

Ex-Googler’s new site shows how much Google’s AI can glean from your photos

The growing concern over AI companies' access to and analysis of personal photos has led to the development of privacy-focused alternatives to mainstream cloud storage services. Privacy awakening: Former Google engineer Vishnu Mohandas left his position at the tech giant in 2020 after becoming increasingly concerned about how personal photos could be used to train AI systems. Mohandas developed Ente, an open-source photo storage service that features end-to-end encryption. The platform aims to give users more control over their personal data while providing similar functionality to mainstream photo storage services. Despite some limitations in features and ease of use compared...

Dec 1, 2024

How AI training data opt-outs may widen the global tech power gap

The complex relationship between AI training data access and global inequality is coming into sharp focus as major AI companies implement opt-out mechanisms that allow content creators to restrict use of their data, potentially amplifying existing power imbalances between developed and developing nations. Current landscape: A landmark copyright case between ANI Media and OpenAI in India's Delhi High Court has highlighted how opt-out mechanisms for AI training data could systematically disadvantage developing nations. OpenAI's quick move to blocklist ANI's domains from future training sets reveals broader implications about who gets to shape crucial AI infrastructure. Domain-based blocking proves largely ineffective...
