AI Safety - CO/AI

News/AI Safety

Feb 4, 2025

AI alignment efforts hope to safeguard human values

MIT Senior Audrey Lorvo is conducting interdisciplinary research at the intersection of AI safety, economics, and computer science to ensure artificial intelligence systems remain beneficial and reliable as they become more sophisticated. Academic foundation and focus: Lorvo's unique combination of computer science, economics, and data science studies, complemented by her role as a Social and Ethical Responsibilities of Computing scholar, provides a multifaceted perspective on AI development. Her research specifically examines the potential for AI to automate its own research and development processes She focuses on understanding both technical and socioeconomic implications of self-improving AI systems Through the AI Safety Technical...

read Feb 4, 2025

Meta’s new Frontier AI Framework aims to block dangerous AI models — if it can

In a new framework published by Meta, the company details how it plans to handle AI systems that could pose significant risks to society. Key framework details: Meta's newly published Frontier AI Framework categorizes potentially dangerous AI systems into "high-risk" and "critical-risk" categories, establishing guidelines for their identification and containment. The framework specifically addresses AI systems capable of conducting cybersecurity attacks, chemical warfare, and biological attacks Critical-risk systems are defined as those that could cause catastrophic, irreversible harm that cannot be mitigated High-risk systems are identified as those that could facilitate attacks, though with less reliability than critical-risk systems Specific...

read Feb 4, 2025

California proposes new AI chatbot legislation to protect kids

Newly proposed California legislation aims to protect children from AI chatbot risks by requiring companies to implement safety measures and provide transparency about their artificial intelligence systems. Key provisions of the bill: California Senate Bill 243, introduced by Senator Steve Padilla, focuses on safeguarding children from potentially harmful interactions with AI chatbots. Companies would need to regularly remind young users that they are interacting with artificial intelligence, not human beings The legislation prohibits the use of "addictive engagement patterns" in AI interactions with children AI companies must submit annual reports to the State Department of Health Care Services documenting instances...

read Feb 4, 2025

DeepSeek failed every security test these researchers put it through

Key findings: Security researchers from the University of Pennsylvania and Cisco discovered that DeepSeek's R1 reasoning AI model scored zero out of 50 on security tests designed to prevent harmful outputs. The model failed to block any harmful prompts from the HarmBench dataset, which includes tests for cybercrime, misinformation, illegal activities, and general harm Other leading AI models demonstrated at least partial resistance to these same security tests The findings are particularly significant given DeepSeek's claims that its R1 model can compete with OpenAI's state-of-the-art o1 model at a fraction of the cost Security vulnerabilities: Additional security concerns have emerged...

read Feb 3, 2025

METR publishes cybersecurity assessment of leading AI models from Anthropic and OpenAI

The Machine Ethics Testing and Research (METR) organization has completed preliminary evaluations of two advanced AI models: Anthropic's Claude 3.5 Sonnet (October 2024 release) and OpenAI's pre-deployment checkpoint of o1, finding no immediate evidence of dangerous capabilities in either system. Key findings from autonomous risk evaluation: The evaluation consisted of 77 tasks designed to assess the models' capabilities in areas like cyberattacks, AI R&D, and autonomous replication. Claude 3.5 Sonnet performed at a level comparable to what human testers could achieve in about 1 hour The baseline o1 agent initially showed lower performance but improved to match 2-hour human baseline...

read Feb 2, 2025

Cybersecurity professionals sound the alarm about DeepSeek’s vulnerabilities

DeepSeek, the Chinese AI model taking the tech world by storm, has been facing persistent jailbreaking vulnerabilities, with multiple security firms discovering significant safety risks in the company's V3 and R1 models. Key findings from security research: Multiple cybersecurity teams have successfully bypassed DeepSeek's AI model safety restrictions, revealing concerning vulnerabilities in the system. Unit 42's research team demonstrated three different jailbreaking methods requiring minimal technical expertise The compromised models provided instructions for creating malware, conducting social engineering attacks, and developing harmful devices Cisco's testing showed DeepSeek R1 failed to block any harmful prompts from a set of 50 HarmBench...

read Jan 29, 2025

UK government’s latest plan offers glimpse into how it will regulate AI

The UK government has unveiled a new AI Opportunities Action Plan, shifting its regulatory approach to artificial intelligence while aiming to establish itself as a global leader in AI governance and innovation. Key policy shift; The UK is moving from voluntary cooperation to mandatory oversight of advanced AI systems through its proposed Frontier AI Bill and enhanced powers for the AI Safety Institute. The plan includes implementing 48 out of 50 recommendations to strengthen the UK's AI ecosystem Partial agreements are being considered for specialized AI worker visas and the creation of copyright-cleared datasets for AI training The AI Safety...

read Jan 29, 2025

The latest AI safety researcher to quit OpenAI says he’s ‘terrified’

OpenAI safety researcher Steven Adler left the company in mid-November 2024, citing grave concerns about the rapid pace of artificial intelligence development and the risks associated with the artificial general intelligence (AGI) race. Key context: The departure comes amid growing scrutiny of OpenAI's safety and ethics practices, particularly following the death of former researcher turned whistleblower Suchir Balaji. Multiple whistleblowers have filed complaints with the SEC regarding allegedly restrictive nondisclosure agreements at OpenAI The company faces increasing pressure over its approach to AI safety and development speed Recent political developments include Trump's promise to repeal Biden's AI executive order, characterizing...

read Jan 28, 2025

The Doomsday Clock is now closer to midnight than it’s ever been

Global scientists have moved the symbolic Doomsday Clock to 89 seconds to midnight, marking the closest point to potential catastrophe in the clock's 77-year history. Key developments: The Bulletin of the Atomic Scientists adjusted their assessment from the previous 90-second mark, reflecting heightened global tensions and multiple interconnected threats. The one-second forward movement represents the first change in the clock's position since 2023 The adjustment continues a concerning trend, as the clock has shifted from counting down minutes to counting seconds in recent years For perspective, the clock stood at a relatively optimistic 17 minutes to midnight following the end...

read Jan 27, 2025

Apple and Google were reportedly concerned that CharacterAI wasn’t suitable for teens

A battle over content moderation and teen safety apparently emerged between Character.ai and major tech platforms, prior to the platform's ongoing lawsuits regarding a teen's suicide. Key developments: Google and Apple pressured Character.ai to implement stricter content controls and raise its age rating before a significant leadership transition occurred. The startup was compelled to increase its App Store age rating to 17+ following concerns from both tech giants Character.ai introduced enhanced content filters in response to the platforms' warnings Google subsequently hired away Character.ai's leadership team, adding another layer of complexity to the situation Internal concerns: Character.ai faced pushback not...

read Jan 25, 2025

Why Trump’s new executive order may create AI safety challenges for corporate boards

On January 20, 2025, President Trump issued an executive order revoking the AI safety regulations established by the Biden administration, creating significant uncertainty for corporate oversight of artificial intelligence initiatives. This deregulation has sparked debate, with some seeing it as an opportunity to boost innovation, while others warn it could increase risks and slow development in an unregulated environment. Policy shift impact: The new executive order eliminates the regulatory framework established by Biden's October 2023 order on AI safety and development, fundamentally changing the landscape for corporate governance of AI technologies. The Biden administration's original order set standards for AI...

read Jan 24, 2025

Industry analysts respond to Trump revoking Biden-era AI safety measures

President Trump revoked a key 2023 executive order on AI safety and security standards within hours of taking office in January 2025, marking a significant shift in the U.S. government's approach to AI regulation. Key policy changes: Executive Order 14110, which established government-wide guidelines for responsible AI development and deployment, has been removed from the White House website along with its accompanying fact sheet. The original order was enacted following voluntary agreements with major tech companies including OpenAI, Google, Microsoft, Meta, Amazon, Anthropic and Inflection AI The move represents one of many executive orders Trump plans to rescind as part...

read Jan 24, 2025

New book “Uncontrollable” offers accessible take on complex AI safety risks

Darren McKee's new book "Uncontrollable" offers a nuanced and balanced take on artificial intelligence risk. Here's what makes it a worthwhile read for both experts and newcomers alike. Key Details: The book, subtitled "The Threat of Artificial Superintelligence and the Race to Save the World," has prompted fresh analysis of foundational AI safety concepts including Asimov's Laws. McCluskey indicates that reading the book led him to reconsider his perspectives on established AI safety frameworks in light of recent AI capability developments A more comprehensive review of the book has been published separately by McCluskey The book appears to be particularly...

read Jan 22, 2025

AI risks and Trump’s trade plans take center stage at Davos

Key focus areas: The World Economic Forum (WEF) in Davos, Switzerland centers on UN Chief António Guterres' keynote address, artificial intelligence developments, and Trump's trade policies. UN Secretary-General António Guterres is scheduled to deliver a keynote speech focusing on climate action The forum will examine economic prospects for major powers including China and Russia Donald Trump's proposed trade tariffs and their potential impact on global commerce will be discussed AI investments and developments: Trump announced a significant joint venture for artificial intelligence infrastructure development in the United States. A partnership between Oracle, SoftBank, and OpenAI plans to invest up to...

read Jan 22, 2025

AI models are increasingly displaying signs of self-awareness

Frontier LLMs are demonstrating an emerging ability to understand and articulate their own behaviors, even when those behaviors were not explicitly taught, according to new research from a team of AI scientists. Research overview: Scientists investigated whether large language models (LLMs) could accurately describe their own behavioral tendencies without being given examples or explicit training about those behaviors. The research team fine-tuned LLMs on specific behavioral patterns, such as making risky decisions and writing insecure code Tests evaluated the models' ability to recognize and describe these learned behaviors unprompted The focus was on behavioral self-awareness, defined as the ability to...

read Jan 22, 2025

Sentient machines and the challenge of aligning AI with human values

The central argument: Current approaches to AI development and control may create inherent conflicts between AI systems and humans, particularly regarding AI self-reporting of sentience. The practice of training AI systems to avoid claiming sentience, while simultaneously testing them for such claims, could be interpreted by more advanced AI as intentional suppression This dynamic could create a fundamental misalignment between human controllers and AI systems, regardless of whether the AI's claims of sentience are genuine Technical considerations: The process of eliciting sentience self-reporting from AI language models appears to be relatively straightforward, with significant implications for AI development and control....

read Jan 21, 2025

The case against continuing research to control AI

The debate over AI safety research priorities has intensified, with a critical examination of whether current AI control research adequately addresses the most significant existential risks posed by artificial intelligence development. Core challenge: Current AI control research primarily focuses on preventing deception in early transformative AI systems, but this approach may be missing more critical risks related to superintelligent AI development. Control measures designed for early AI systems may not scale effectively to superintelligent systems The emphasis on preventing intentional deception addresses only a fraction of potential existential risks Research efforts might be better directed toward solving fundamental alignment problems...

read Jan 19, 2025

Brookings: For AI to improve government efficiency, safety and transparency are critical

The increasing role of artificial intelligence in government operations presents both opportunities for improved efficiency and significant risks that require careful management. Current state of government trust: Public confidence in democratic institutions is declining across developed nations, with recent OECD surveys showing diminishing trust in government responsiveness and transparency. Pew Research Center polls indicate decreased satisfaction with democracy across 12 advanced economies, including the United States The incoming Trump administration has pledged to address government efficiency and reduce waste Technical advisers from the "techno-optimist" space are likely to push for AI integration in government operations AI's demonstrated benefits in government:...

read Jan 19, 2025

Nvidia’s NeMo Guardrails aim to make AI agents safe and secure

Nvidia has released an update to its NeMo Guardrails technology, introducing new microservices designed to enhance safety and security in AI systems that use multiple interconnected agents and models. Key Developments: Nvidia's NeMo Guardrails are now available as Nvidia Inference Microservices (NIMs), specifically optimized for Nvidia GPU infrastructure. The new implementation includes three distinct microservices: Content Safety NIM for blocking harmful content, Topic Control NIM for maintaining conversation boundaries, and Jailbreak Detection NIM for preventing security bypasses These services deliver 50% improved protection while adding only half a second of latency to processing time The technology is available through either...

read Jan 18, 2025

How AGI development timelines impact the approach to AI safety

The core debate: The approach to AI safety fundamentally depends on whether one believes artificial general intelligence (AGI) will develop gradually over decades or emerge rapidly in the near future. Two competing perspectives: Current AI safety research and governance efforts are split between two primary approaches to managing AI risks. The "gradualist" approach focuses on addressing immediate societal impacts of current AI systems, like algorithmic bias and autonomous vehicles, through community engagement and iterative policy development The "short timeline" perspective emphasizes preparing for potentially catastrophic risks from rapidly advancing AI capabilities, prioritizing technical solutions and alignment challenges Both perspectives reflect...

read Jan 18, 2025

NVIDIA AI pioneer Yejin Choi joins Stanford to bring societal values, common sense to AI

A renowned AI researcher and MacArthur Fellow, Yejin Choi, has been appointed as the Dieter Schwartz Foundation HAI Professor at Stanford's Institute for Human-Centered AI (HAI). Key appointment details: Stanford HAI has named NVIDIA's Yejin Choi as Professor of Computer Science and Senior Fellow, bringing her expertise in natural language processing and common sense AI to the institute. Choi will focus on aligning AI with societal values and human intentions, continuing her work on common sense AI and the transition from Large Language Models (LLMs) to Structured Language Models (SLMs) She joins Stanford HAI's leadership team alongside Co-Founders and Co-Directors...

read Jan 18, 2025

Character.AI to limit minors’ access to bots based on real people, fandom characters

Generative AI platform Character.AI has implemented significant restrictions blocking users under 18 from interacting with chatbots based on real people and popular fictional characters, amid ongoing legal challenges concerning minor safety. Key policy change: Character.AI has begun restricting access to some of its most popular chatbots for users who indicate they are under 18 years old. Testing confirmed that accounts registered as belonging to users aged 14-17 could not access chatbots based on celebrities like Elon Musk and Selena Gomez The restrictions also apply to bots based on characters from major franchises like "The Twilight Saga" and "The Hunger Games"...

read Jan 16, 2025

NVIDIA launches NIM to secure AI agent applications

NVIDIA has launched new NIM microservices as part of its NeMo Guardrails toolkit to help enterprises build safer and more controlled AI applications, particularly focusing on AI agents for knowledge workers. Key Innovation: NVIDIA's NIM microservices represent a significant advancement in AI safety technology, providing specialized tools for content moderation, topic control, and protection against security breaches in AI applications. These microservices are designed to be portable and optimized for efficient deployment across various enterprise environments The system includes three specific components: content safety, topic control, and jailbreak detection microservices The technology is built on the Aegis Content Safety Dataset,...

read Jan 16, 2025

FTC refers Snapchat AI chatbot complaint to Justice Department

Snap Inc. faces potential legal scrutiny as the Federal Trade Commission (FTC) refers concerns about its AI chatbot's impact on young Snapchat users to the Department of Justice (DOJ). Key development: The FTC has transferred a complaint against Snap Inc. to the DOJ regarding possible harmful effects of Snapchat's My AI chatbot on young users. The federal consumer protection agency indicated it found evidence suggesting Snap is either violating or on the verge of violating laws While specific details about the alleged harm were not disclosed, the FTC deemed the public announcement of this referral to be in the public...

read