News/Data

Jul 25, 2024

Anthropic’s Web Crawler Is Apparently Ignoring Companies’ Anti-Scraping Policies

Anthropic's ClaudeBot crawler hits iFixit's website almost a million times in 24 hours, ignoring the company's anti-scraping policies. This raises questions about AI companies' data scraping practices and the limited options available for websites to protect their content. Key details of the incident: iFixit CEO Kyle Wiens revealed that Anthropic's ClaudeBot web crawler accessed the website's servers nearly a million times within a 24-hour period, seemingly violating iFixit's Terms of Use, which explicitly prohibit reproducing, copying, or distributing any content from the website without express written permission, including using the content for training machine learning or AI...
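
For sites weighing those limited options, the usual first step is a robots.txt directive naming the crawler's user-agent token. The snippet below is illustrative only; it assumes Anthropic's crawler responds to the "ClaudeBot" token, and as this incident suggests, such directives depend entirely on voluntary compliance.

    # Illustrative robots.txt entries (assumes the crawler honors the
    # "ClaudeBot" token, which this report suggests it may not)
    User-agent: ClaudeBot
    Disallow: /

    # All other crawlers: no restrictions
    User-agent: *
    Disallow: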

Jul 25, 2024

Unstructured Data Raises Important Considerations for Privacy, Governance and Ownership in the AI Era

The rapid advancements in AI's ability to analyze unstructured data are raising important questions about data privacy and ownership. Key Takeaways: As AI systems become increasingly capable of extracting insights from vast amounts of unstructured data, it's crucial to consider the privacy implications: While unstructured data may seem less sensitive than structured databases containing personal identifiers, AI can still pull together inferences, timelines, and narratives that could be highly intrusive. The era of AI moving from structured data sets to a more general technology approaching "universal knowledge" is both thrilling and potentially terrifying from a privacy perspective. Advancements in AI...

Jul 24, 2024

Reddit Blocks Search Engines, AI Bots as Google Retains Exclusive Access Through Deal

Reddit is ramping up its efforts to protect its data and generate revenue by blocking search engines and AI bots from accessing recent posts and comments unless they pay for access. Google's exclusive access: Google is currently the only mainstream search engine showing recent Reddit results due to a $60 million deal allowing the company to train its AI models using Reddit's content: Other search engines like Bing and DuckDuckGo are excluded, leaving Google as the sole provider of up-to-date Reddit search results. This move comes after Reddit threatened to cut off Google's access if the company continued using the...

Jul 24, 2024

Why Some Experts Believe Synthetic Data Will Degrade Future Models

The proliferation of AI-generated junk web pages poses a significant challenge to the future development and performance of AI models, as training on increasingly synthetic data can lead to degraded output quality and potential model collapse. Key takeaways from the research: A study published in Nature demonstrates that the quality of an AI model's output gradually deteriorates when trained on data generated by other AI models: The effect worsens as subsequent models produce output that is then used as training data for future models, likened to taking photos of photos and eventually being left with a dark square or "model...
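
A rough intuition for this degradation can be seen in a toy simulation (not the Nature study's setup): fit a simple model to data, sample synthetic data from it, refit on only that output, and repeat. Generation by generation, the estimated spread drifts and the original distribution's tails disappear, the numerical analogue of taking photos of photos.

    # Toy sketch of recursive training on synthetic data (illustrative only)
    import numpy as np

    rng = np.random.default_rng(0)
    data = rng.normal(loc=0.0, scale=1.0, size=1000)   # stand-in for real data

    mu, sigma = data.mean(), data.std()
    for gen in range(1, 11):
        # "Train" the next model only on the previous model's output
        synthetic = rng.normal(mu, sigma, size=1000)
        mu, sigma = synthetic.mean(), synthetic.std()
        print(f"generation {gen:2d}: mean={mu:+.3f}  std={sigma:.3f}")
    # The fitted spread tends to shrink over generations, i.e. the tails of the
    # original distribution gradually vanish from later models' output.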

Jul 24, 2024

AI Startups Tackle Looming Data Shortage with Innovative Solutions

The AI industry is facing a looming data shortage as companies have already exhausted much of the available training data, but startups are exploring innovative solutions to address this challenge. Synthetic data emerges as a potential solution: Gretel, a startup valued at $350 million, is creating AI-generated synthetic data that closely mimics real information without the privacy concerns: Synthetic data has been used by companies working with sensitive information, such as patient data, to protect privacy while still providing valuable training data for AI models. Gretel's CEO, Ali Golshan, sees an opportunity to supply data-starved AI companies with fake data...

Jul 23, 2024

LexisNexis Launches Research Tool Nexis+AI to Unlock Insights from Licensed Data

LexisNexis launches Nexis+ AI, using generative AI to enhance corporate research capabilities by leveraging its vast library of licensed news and corporate data. Key features and benefits: Nexis+ AI aims to accelerate data-intensive research and strategic decision-making for businesses: The platform allows users to quickly search and analyze business, financial, and legal information, summarize lengthy documents, compile and share content, create first drafts of intelligence reports, discover relevant data points, and deliver actionable business insights. It employs a multi-model approach, using AWS Bedrock, Anthropic's Claude models, and Microsoft's Azure OpenAI models to power its AI capabilities. Results include citations and...
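
For context on what that multi-model plumbing can look like, here is a minimal, hypothetical sketch of calling a Claude model through Amazon Bedrock with boto3. It is not LexisNexis's implementation; the model ID and prompt are placeholders.

    # Hypothetical sketch: one Claude call via Amazon Bedrock (not Nexis+ AI code)
    import json
    import boto3

    bedrock = boto3.client("bedrock-runtime")  # assumes AWS credentials/region are configured

    body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 512,
        "messages": [
            {"role": "user", "content": "Summarize the key risks in this filing: ..."}
        ],
    }

    response = bedrock.invoke_model(
        modelId="anthropic.claude-3-sonnet-20240229-v1:0",  # placeholder model ID
        body=json.dumps(body),
    )
    result = json.loads(response["body"].read())
    print(result["content"][0]["text"])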

Jul 23, 2024

FTC Probes 8 Companies for AI Surveillance Pricing Practices

The FTC is investigating the use of AI-powered surveillance pricing, which could exploit consumers' personal data to charge them higher prices. Key aspects of surveillance pricing: Surveillance pricing, also known as dynamic, personalized, or optimized pricing, involves offering individual consumers different prices for the same products based on factors like the device they're shopping on, location, demographic information, credit history, and browsing/shopping history. Companies across various sectors are considering implementing or have already implemented surveillance pricing models. FTC's inquiry into surveillance pricing practices: The Federal Trade Commission has ordered eight companies that offer AI surveillance pricing products and services to...

Jul 19, 2024

Generative AI and the Future of Intellectual Property

The rapid advancement of generative AI is raising complex questions about the future of intellectual property rights in an AI-driven world. As AI systems become more adept at generating creative outputs that blur the lines between original works and reproductions, traditional concepts of copyright, trademark, and patent protection are being challenged: IP concerns with generative AI training data: Many of the datasets used to train generative AI systems contain copyrighted, trademarked, or otherwise protected materials, often used without explicit consent from IP owners: The process of indiscriminately scraping the web for training data frequently incorporates copyrighted works, leading to potential...

Jul 18, 2024

Hallucinations Plague Large Language Models, But New Training Approaches Offer Hope

Large language models (LLMs) have significant limitations despite their recent popularity and hype, including hallucinations, lack of confidence estimates, and absence of citations. Overcoming these challenges is crucial for developing more reliable and trustworthy LLM-based applications. Hallucinations: The core challenge: LLMs can generate content that appears convincing but is actually inaccurate or entirely false, known as hallucinations: Hallucinations are the most difficult issue to address, and their negative impact is only slightly mitigated by confidence estimates and citations. Contradictions in the training data contribute to the problem, as LLMs cannot self-inspect their training data for logical inconsistencies. Bootstrapping consistent LLMs:...
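
One common way to approximate the missing confidence estimates is self-consistency: sample the same prompt several times and treat the level of agreement as a crude confidence signal. The sketch below is a generic illustration, not the article's "bootstrapping" method; ask_model is a hypothetical stand-in for any sampled LLM call.

    # Generic self-consistency sketch (not the article's method);
    # ask_model is a hypothetical stand-in for an LLM call with temperature > 0.
    from collections import Counter

    def confidence_by_agreement(ask_model, prompt, n_samples=5):
        answers = [ask_model(prompt) for _ in range(n_samples)]
        best, count = Counter(answers).most_common(1)[0]
        return best, count / n_samples   # majority answer and its agreement rate

    # Example with a dummy "model" that is uncertain between two answers:
    import random
    dummy = lambda _prompt: random.choice(["Paris", "Paris", "Lyon"])
    answer, confidence = confidence_by_agreement(dummy, "Capital of France?")
    print(answer, confidence)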

Jul 18, 2024

How Overcoming Paralysis When Adopting AI Begins with Good Data Practices

The global AI revolution is charging ahead, but many organizations feel overwhelmed and paralyzed about how to best capitalize on the technology's potential. Addressing data challenges is critical: Every AI conversation inevitably leads to a discussion about data readiness, as AI relies on high-quality, accessible data to deliver value: Organizations must first assess where their required data lives, which is often siloed across various on-premises, cloud, edge, and individual devices. Establishing strong data governance and quality practices is essential before embarking on the data discovery process to ensure data is clean and usable. Focusing on a specific data set in...

Jul 18, 2024

Apple Denies Using Unethical Data, Commits to Responsible AI Development

Apple denies claims that it used unethically sourced data to train Apple Intelligence, affirming its commitment to using only ethically sourced data for its AI projects. Apple's response to the allegations: While Apple had used data from a controversial dataset called "the Pile" in the past, it was only for research purposes and not for training Apple Intelligence: Apple stated the Pile data was used solely to train the OpenELM research models released in April, which do not power any consumer-facing AI or machine learning features. The company has no plans to build new versions of OpenELM and emphasized that the models were never...

Jul 18, 2024

Meta Halts AI Tools in Brazil Amid Data Privacy Concerns, Hindering Innovation

Meta has halted the use of its generative AI tools in Brazil after the country's data protection authority banned the company from using Brazilian citizens' data to train AI models, citing risks to users' fundamental rights. Regulatory challenges and Meta's response: The decision follows a privacy policy update that allowed Meta to train AI using Brazilians' public data on Facebook, Instagram, and Messenger, which the National Data Protection Authority (ANPD) deemed a violation: ANPD imposed a ban earlier this month prohibiting Meta from using Brazilian citizens' data to train...

Jul 18, 2024

Senators Demand Answers After AT&T Data Breach

AT&T revealed that customer call and text records were illegally downloaded from a third-party cloud platform called Snowflake, raising questions about the telecom giant's data practices and the security of sensitive user information. Senators demand answers from AT&T: In the wake of the breach, US Senators Richard Blumenthal and Josh Hawley sent a letter to AT&T CEO John Stankey, asking why the company retained months of detailed customer communication records and uploaded them to a third-party analytics platform: The senators sought clarification on AT&T's policy regarding the retention and use of such sensitive information, including specific timelines. AT&T's initial disclosures...

Jul 18, 2024

Wiz Research Uncovers Critical Flaws in SAP AI, Risking Customer Data and Cloud Security

Wiz Research uncovers critical vulnerabilities in SAP AI Core, potentially exposing customer data and cloud environments to malicious actors. The research reveals that executing arbitrary code through AI training procedures allowed lateral movement and service takeover, granting access to sensitive customer files and cloud credentials. Key findings: Wiz researchers gained privileged access to SAP AI Core's internal assets by exploiting vulnerabilities, enabling them to read and modify Docker images on SAP's internal container registry and on Google Container Registry; access and modify artifacts on SAP's internal Artifactory server; obtain cluster administrator privileges on SAP AI Core's Kubernetes cluster; and retrieve customers' cloud...

Jul 17, 2024

How Overcoming Bad Data and Digital Divides in Organizations Will Increase AI Adoption

The vast majority of AI proof-of-concept projects fail to reach production due to digital boundaries, digital employees, and bad data, according to Capgemini research. Addressing these challenges requires a fundamental rethinking of how businesses approach AI adoption. Data quality is a major obstacle: Many organizations have become accustomed to working with subpar data, which poses significant risks as AI increasingly drives business decisions: By 2030, AI is expected to make 50% of business decisions, particularly in autonomous supply chain applications, making poor data quality unacceptable from a risk perspective. Digital employees cannot wait for cleaned-up data to make operational decisions,...

Jul 17, 2024

Microsoft’s AI-Powered Data Governance Solution Sees Rapid Adoption Amid Enterprise AI Push

Microsoft's Purview Data Governance solution sees rapid adoption amid enterprise demand for robust AI data management tools. Federated governance model balances business autonomy and centralized control: Microsoft's approach allows different business units to manage their own data products while maintaining centralized policy enforcement: The federated model aims to empower business users who best understand how to create their data products, while still ensuring organizational control. Purview uses natural language processing to enhance data visibility and management, allowing users to interact with governance tools using natural language queries, potentially lowering barriers to adoption. AI is being leveraged at nearly every layer...

Jul 16, 2024

Microsoft’s New SpreadsheetLLM Offers Glimpse Into Future of Data Interaction

Microsoft researchers propose SpreadsheetLLM, a novel method that helps AI models understand and process spreadsheets more efficiently, potentially improving chatbot interactions with complex data. Key innovation: SheetCompressor framework: Microsoft's SheetCompressor encoding framework compresses spreadsheets into bite-sized chunks that large language models (LLMs) can more easily handle: It includes modules that make spreadsheets more legible for LLMs, bypass empty cells and repeating numbers, and help LLMs better understand the context of numbers (e.g., distinguishing years from phone numbers). This compression method reduced token usage for spreadsheet encoding by up to 96%, significantly boosting performance on larger spreadsheets where high token usage...
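
As a rough illustration of the compression idea (skip empty cells, collapse repeated values into ranges before handing the sheet to an LLM), the toy sketch below encodes a sheet into compact range/value strings. It is not Microsoft's SheetCompressor; the addressing scheme and output format are invented here for illustration.

    # Toy illustration of the compression idea, NOT Microsoft's SheetCompressor
    def compress_sheet(rows):
        """rows: list of lists of cell values (None/"" = empty). Returns compact strings."""
        out = []
        for r, row in enumerate(rows, start=1):
            run_value, run_start = None, None
            for c, value in enumerate(row, start=1):
                if value in (None, ""):          # bypass empty cells entirely
                    value = None
                if value == run_value:
                    continue                      # extend the current run of repeats
                if run_value is not None:
                    out.append(f"R{r}C{run_start}-C{c-1}: {run_value}")
                run_value, run_start = value, c
            if run_value is not None:
                out.append(f"R{r}C{run_start}-C{len(row)}: {run_value}")
        return out

    print(compress_sheet([
        ["Year", "", "Revenue", "Revenue", "Revenue"],
        [2023, None, 100, 100, 100],
    ]))
    # ['R1C1-C1: Year', 'R1C3-C5: Revenue', 'R2C1-C1: 2023', 'R2C3-C5: 100']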

Jul 12, 2024

Instabase Voted Most Likely to Succeed at VentureBeat’s Transform 2024

Instabase, an AI company focused on accelerating decision-making in enterprises by unlocking unstructured data, was voted most likely to succeed at the Transform 2024 Innovation Showcase. Key Takeaways: Instabase's AI Hub is being used by the world's largest enterprises across various industries to automate document workflows and take actions based on insights from all data types: four of the top five banks are using Instabase's technology; the company has raised $177 million to date, with a valuation of $2 billion; and its AI Hub can unlock any unstructured data, including images, PDFs, Excel files, and messages. The Instabase Story: Founded in...

Jul 12, 2024

IBM Exec: Integrating Enterprise Data into AI Models is Key to Success

IBM's David Cox champions open innovation in enterprise generative AI, emphasizing the importance of transparency, collaboration, and the integration of proprietary business data into AI models. Nuanced view of openness in AI: Cox challenges the notion that openness in AI is a simple binary concept, highlighting the growing ecosystem of open models from various sources, including tech giants, universities, and nation-states: He raises concerns about the quality of openness in many large language models (LLMs), noting that some provide only a "bag of numbers" without clear information on how they were produced, making reproducibility difficult or impossible. Cox outlines key...

Jul 11, 2024

Intuit Lays Off 1,800, Plans to Hire 1,800 in AI and Data Push

Intuit lays off 1,800 employees while also planning to hire 1,800 new ones as it focuses on investing in AI and data. Restructuring at Intuit: CEO Sasan Goodarzi announced layoffs, which include approximately 1,050 employees "not meeting expectations," as part of the company's plan to accelerate investments in data and AI: The layoffs come after Intuit had previously been noted for avoiding mass layoffs last spring, unlike many of its peers in the tech industry. Intuit recently shut down the Mint app and MailChimp-owned TinyLetter, signaling a shift in the company's priorities and focus. Investing in AI and data: Intuit's...

Jul 7, 2024

The Hidden Human Cost Powering the AI Revolution in Silicon Valley

The harrowing reality of AI's human labor unveiled: African workers power the AI revolution for meager wages. Mercy and Anita, two African workers employed by outsourcing companies, exemplify the hidden human labor behind AI advancements: Mercy, a Meta content moderator in Nairobi, endures viewing disturbing images and videos for 10-hour shifts, earning just over a dollar an hour, with minimal job security and support. Anita, a data annotator in Gulu, Uganda, reviews hours of footage for an autonomous vehicle company, identifying drivers' lapses in concentration, earning around $1.16 per hour for her grueling work. The dark underbelly of AI development:...

Jul 5, 2024

As AI Capabilities Evolve, the Role of Data Scientists May Be In Question

While data scientists have been essential for building AI models in the past, the increasing accessibility and ease of use of AI systems are changing the skill sets needed to leverage AI effectively. Using vs. building AI models: With the rise of generative AI and LLMs, users can now get significant value from AI without needing data science skills. Simply using and benefiting from AI systems doesn't require data scientist expertise, as AI capabilities are becoming more accessible and embedded in everyday tools and applications. Instead of data science skills, organizations need to focus on developing prompt engineering skills, which...

Jul 5, 2024

AI Training Data Battles: Music Lawsuits, News Deals Reshape Industry Landscape

The generative AI boom is fueling a surge in demand for training data, but AI companies are facing growing challenges in obtaining it freely as data owners push back. Key developments in the AI training data landscape: The music industry's lawsuit against AI music companies Suno and Udio sends a strong message that high-quality training data is not free: Sony Music, Warner Music Group, and Universal Music Group are suing Suno and Udio for alleged copyright infringement, claiming the companies used copyrighted music in their training data "at an almost unimaginable scale." The lawsuits could set a precedent for the...

Jul 3, 2024

Cloudflare Fights AI Bots as Tech Giants Clash Over Web Scraping Rules

AI is rewriting the rules of the internet, and Cloudflare is stepping in to help customers protect their data from being scraped by AI bots: Tech giants grapple with web scraping for AI: Major tech companies are changing their policies around web scraping, with some blaming third parties for ignoring robots.txt files and others seemingly asserting the right to use any publicly posted data for AI training: Google has updated its privacy policy to allow its AI chatbot Bard to train on data scraped from the web. Microsoft AI chief Mustafa Suleyman has suggested that anything posted online is fair game for AI...
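
Beyond Cloudflare's own bot-blocking controls, sites sometimes reject known AI crawler user agents directly at the web server. The nginx-style snippet below is a generic illustration, not Cloudflare's product, and the bot tokens listed are examples of commonly cited AI crawlers.

    # Illustrative nginx server-block rule: refuse requests from known AI
    # crawler user agents. Tokens shown are examples; such lists need upkeep.
    if ($http_user_agent ~* "(GPTBot|ClaudeBot|CCBot|Bytespider)") {
        return 403;
    }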
