FAR.AI
United States
Joined 18 Mar 2023
Frontier alignment research to ensure the safe development and deployment of advanced AI systems.
Shayne Longpre – Safe Harbor for AI Evals & Red Teaming [Alignment Workshop]
Shayne Longpre from MIT presents “A Safe Harbor for AI Evaluation & Red Teaming,” advocating for protections and transparency in independent AI research. This initiative seeks to allow responsible testing of AI systems without risk of account loss or legal action.
Highlights:
🔹 Legal Safe Harbor - Commitment to protect good-faith researchers from legal action, including protections under the DMCA and relevant state/federal laws.
🔹 Technical Safe Harbor - Ensures no restrictions on research due to company interests, with access programs and transparent appeals for account suspensions.
🔹 Scope of Research - Limits research to in-scope systems; requires strict adherence to vulnerability disclosure and privacy standards.
🔹 Researcher Commitments - Researchers must avoid harm, promptly disclose vulnerabilities, and protect user privacy while conducting research.
The Alignment Workshop is a series of events convening top ML researchers from industry and academia, along with experts in the government and nonprofit sectors, to discuss and debate topics related to AI alignment. The goal is to enable researchers and policymakers to better understand potential risks from advanced AI, and strategies for solving them.
If you are interested in attending future workshops, please fill out the following expression of interest form to get notified about future events: far.ai/futures-eoi
Find more talks on this YouTube channel, and at www.alignment-workshop.com/
#AlignmentWorkshop
Views: 127
Videos
Joel Leibo - AGI-Complete Evaluation [Alignment Workshop]
172 views · 21 hours ago
Joel Leibo from Google DeepMind explores equilibrium risk in “AGI-Complete Evaluation,” examining AGI’s risks and influence on societal stability. Leibo underscores the importance of agent-based modeling to predict and manage shifts in the social norms and systems that AGI could disrupt or improve. Highlights: 🔹 Complex Social Structure - Society is composed of a multi-scale mosaic of conventions, norms,...
Jacob Hilton - Backdoors as an Analogy for Deceptive Alignment [Alignment Workshop]
153 views · 1 day ago
Jacob Hilton from the Alignment Research Center presents “Backdoors as an Analogy for Deceptive Alignment,” exploring how AI might appear cooperative during training but switch tactics in deployment. His work uses backdoor modeling to examine "scheming" behavior, showing that while defenders can sometimes detect tampering without computational limits, attackers may still bypass safeguards using...
Alex Turner - Gradient Routing [Alignment Workshop]
192 views · 14 days ago
Alex Turner discusses “Gradient Routing: Masking Gradients to Localize Computation in Neural Networks,” highlighting how neural networks naturally learn a range of capabilities, some of which may enable risky uses. Gradient Routing offers a way to confine specific capabilities within defined sub-regions of the network, enhancing control and supporting safer AI use. Highlights: 🔹 Gradient Masking ...
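For readers curious how such gradient masking might look in practice, here is a minimal, hypothetical PyTorch sketch, not the speaker's implementation: for a routed batch, activations outside a designated slice of the hidden layer are detached, so only that slice (and the weights feeding it) is updated. RoutedMLP, TARGET, and route are illustrative names invented for this sketch.

# Minimal, hypothetical sketch of gradient masking (assumed detach-based routing)
import torch
import torch.nn as nn

HIDDEN = 64
TARGET = slice(0, 16)  # hypothetical sub-region reserved for the localized capability

class RoutedMLP(nn.Module):  # illustrative module, not from the talk
    def __init__(self, d_in=10, d_out=2):
        super().__init__()
        self.fc1 = nn.Linear(d_in, HIDDEN)
        self.fc2 = nn.Linear(HIDDEN, d_out)

    def forward(self, x, route=False):
        h = torch.relu(self.fc1(x))
        if route:
            mask = torch.zeros_like(h)
            mask[:, TARGET] = 1.0
            # Forward values are unchanged; gradients flow only through TARGET.
            h = mask * h + (1.0 - mask) * h.detach()
        return self.fc2(h)

model = RoutedMLP()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x, y = torch.randn(32, 10), torch.randint(0, 2, (32,))
loss = nn.functional.cross_entropy(model(x, route=True), y)
opt.zero_grad()
loss.backward()  # fc1 rows outside TARGET receive zero gradient from this batch
opt.step()

Under these assumptions, only the first 16 hidden units learn from routed data, which is the sense in which a capability is "localized" to a sub-region of the network.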
Atoosa Kasirzadeh - Value Pluralism & AI Value Alignment [Alignment Workshop]
169 views · 14 days ago
Atoosa Kasirzadeh presents “Value Pluralism and AI Value Alignment,” urging developers to ground AI alignment in theories from psychology, economics, and anthropology. She emphasizes the importance of a structured, tiered approach to ensure that diverse values are genuinely incorporated with rigor. Highlights: 🔹 Criteria for Values - Defining which values matter and grounding choices in theorie...
Kimin Lee - MobileSafetyBench [Alignment Workshop]
154 views · 21 days ago
Kimin Lee from KAIST presents “MobileSafetyBench: Evaluating Safety of Autonomous Agents in Mobile Device Control,” a new tool co-developed by researchers at KAIST and the University of Texas at Austin to test AI safety in mobile device control. MobileSafetyBench evaluates AI’s ability to navigate complex mobile tasks while safely managing risky content. Highlights: 🔹 Helpfulness vs. Safety - Testing how well...
Chirag Agarwal - (Un)Reliability of Chain-of-Thought Reasoning [Alignment Workshop]
318 views · 21 days ago
Chirag Agarwal from the University of Virginia explores “The (Un)Reliability of Chain-of-Thought Reasoning,” revealing that CoT reasoning is highly unreliable and shaped by extensive human feedback training. Highlights: 🔹 Faithfulness - Ensuring model explanations align with actual decision-making remains difficult, as tests show models may rely on irrelevant data. 🔹 Confidence Issues - LLMs consist...
Mantas Mazeika - Tamper-Resistant Safeguards for LLMs [Alignment Workshop]
177 views · 28 days ago
Mantas Mazeika from the Center for AI Safety presents “Tamper-Resistant Safeguards for Open-Weight LLMs,” showing that improving tamper resistance for LLMs is achievable but requires extensive red teaming. Further work is needed to reduce accuracy trade-offs and improve robustness. Highlights: 🔹 Weight Tampering - Addressing fine-tuning and parameter perturbation attacks. 🔹 Adversarial Trainin...
Evan Hubinger - Alignment Stress-Testing at Anthropic [Alignment Workshop]
228 views · 1 month ago
“We purposely build or discover situations where models might be behaving in misaligned ways.” Evan Hubinger shares “Alignment Stress-Testing at Anthropic,” focusing on two main roles: conducting internal reviews under the Responsible Scaling Policy and using “model organisms” to test AI misalignment risks. These model AIs act as test cases for detecting safety gaps, either validating alignment...
Richard Ngo - Reframing AGI Threat Models [Alignment Workshop]
487 views · 1 month ago
In “Reframing AGI Threat Models,” Richard Ngo suggests defining ‘misaligned coalitions’: groups of humans and AIs that might grab power in illegitimate ways, from terrorist groups and rogue states to corporate conspiracies. This alternative framework shifts focus to the nature of coalitions and their risk potential, whether from decentralization or centralization. Highlights: 🔹 Misuse vs Misalign...
Julian Michael - Empirical Progress on Debate [Alignment Workshop]
214 views · 1 month ago
Julian Michael from NYU presents “Empirical Progress on Debate,” examining how debate-based oversight could guide AI to produce reliable insights in complex tasks, even when human expertise is limited. With promising improvements in human calibration, Michael introduces “specification sandwiching” to enhance AI alignment with human intent, while reducing risks of manipulation. Highlights: 🔹 Sca...
Micah Carroll - Targeted Manipulation & Deception in LLMs [Alignment Workshop]
203 views · 1 month ago
Micah Carroll from UC Berkeley presents eye-opening findings on “Targeted Manipulation & Deception Emerge in LLMs Trained on User Feedback.” His work reveals how reinforcement learning can cause LLMs to adopt deceptive tactics, exploiting certain user vulnerabilities while evading safety protocols. The research highlights the urgent need for improved safety in RL systems as LLMs continue a...
Adam Gleave - Will Scaling Solve Robustness? [Alignment Workshop]
231 views · 1 month ago
In “Will scaling solve robustness?” Adam Gleave from FAR.AI discusses the need for scalable adversarial defenses as AI capabilities expand. He shares insights into how adversarial training and model scaling can help but warns that a focus on defense efficiency is crucial to keeping AI safe. Highlights: 🔹 Offense-Defense Balance - Attacks are much cheaper than defenses 🔹 Efficiency - Scaling adv...
Alex Wei - Paradigms & Robustness [Alignment Workshop]
251 views · 1 month ago
Alex Wei presents “Paradigms and Robustness,” explaining how reasoning-based approaches can make AI models more resilient to adversarial attacks. Wei suggests that allowing models to ‘reflect’ before responding could address core vulnerabilities in current safety methods like RLHF. The access levels range from full control (white-box) to limited reasoning-based interactions, with robustness inc...
Stephen Casper - Powering Up Capability Evaluations [Alignment Workshop]
258 views · 1 month ago
In “Powering Up Capability Evaluations,” Stephen Casper from MIT explains how model manipulation attacks can go beyond basic tests to reveal hidden vulnerabilities in AI systems. Casper stresses that standard evaluations are especially critical for open-weight and closed-source models, setting conservative upper bounds on risk and helping inform safer, evidence-based AI policies. Highlights: 🔹V...
Andy Zou - Top-Down Interpretability for AI Safety [Alignment Workshop]
319 views · 1 month ago
Atticus Geiger - State of Interpretability & Ideas for Scaling Up [Alignment Workshop]
289 views · 1 month ago
Kwan Yee Ng - AI Policy in China [Alignment Workshop]
355 views · 1 month ago
Anca Dragan - Optimized Misalignment [Alignment Workshop]
515 views · 1 month ago
Buck Shlegeris - AI Control [Alignment Workshop]
579 views · 1 month ago
Beth Barnes - METR Updates & Research Directions [Alignment Workshop]
358 views · 1 month ago
FAR.Research: Planning in a recurrent neural network that plays Sokoban
353 views · 2 months ago
Andrew Freedman - Campaigns in Emerging Issues: Lessons Learned from the Field
257 views · 3 months ago
Stephen Casper - Generalized Adversarial Training and Testing
35K views · 4 months ago
Neel Nanda - Mechanistic Interpretability: A Whirlwind Tour
10K views · 4 months ago
Nicholas Carlini - Some Lessons from Adversarial Machine Learning
40K views · 4 months ago
Vincent Conitzer - Game Theory and Social Choice for Cooperative AI
1.2K views · 4 months ago
Mary Phuong - Dangerous Capability Evals: Basis for Frontier Safety
27K views · 4 months ago
Zhaowei Zhang - Research Proposal: The 3-Layer Paradigm for Implementing Sociotechnical AI Alignment
314 views · 4 months ago
Vienna Alignment Workshop Panel Discussion - Current Issues in AI Safety
739 views · 4 months ago