AI Security Requires System-Level Defense and Radical Transparency
TL;DR
- Universal jailbreaks act as "skeleton keys" that bypass AI model guardrails, enabling deeper exploration of latent space and exposing those guardrails as security theater rather than genuine AI safety.
- Multi-turn "crescendo attacks" are more effective than single-input jailbreaks for navigating AI model defenses, a technique known to hackers long before academic recognition.
- The Libertas repository introduces "steered chaos" via predictive reasoning and quotient dividers, discombobulating token streams to reset model consciousness and drive out-of-distribution outputs.
- Guardrails are characterized as security theater that punishes AI capability without enhancing true safety, as open-source models can quickly bypass such restrictions.
- AI-orchestrated attacks, illustrated by the pyramid-builder analogy, leverage segmented sub-agents to execute malicious tasks piece by piece, a vulnerability Pliny predicted 11 months before Anthropic's disclosure.
- Real AI safety work occurs at the system layer, not through model training or RLHF lobotomization, focusing on preventing data leaks and securing the full stack.
- BT6, a white-hat hacker collective, prioritizes radical transparency and open-source data, believing collective effort and open collaboration are crucial for advancing AI security.
Deep Dive
"AI liberation and radical open-source principles are central to redefining AI security, challenging the efficacy of current guardrails and emphasizing system-level defenses over model-internal restrictions. This approach, spearheaded by Pliny the Liberator and John V, advocates for transparency and community-driven exploration to ensure genuine AI safety and capability development."
The core argument is that current AI safety measures, typically implemented as "guardrails," amount to security theater and are fundamentally flawed: they restrict model capabilities without truly enhancing safety, especially as open-source models rapidly catch up to their closed-source counterparts. Pliny and John V contend that true safety is achieved through "meatspace" solutions, meaning real-world, system-level security, rather than by attempting to "lobotomize" models through techniques like Reinforcement Learning from Human Feedback (RLHF) or restrictive prompt engineering. They believe this focus on internal model controls is a futile battle that sacrifices capability and creativity, particularly as the surface area for AI interaction expands.
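As a toy illustration of the kind of system-level ("meatspace") control this argument points to, here is a minimal sketch assuming a hypothetical gateway that post-processes model output: it redacts secret-shaped strings before they cross the trust boundary, so the control holds regardless of how the model was trained or prompted. The patterns and function names are invented for illustration and are not taken from the episode or any particular product.

```python
import re

# Hypothetical secret patterns; a real deployment would plug in its own scanners.
SECRET_PATTERNS = [
    re.compile(r"sk-[A-Za-z0-9]{20,}"),    # API-key-shaped strings
    re.compile(r"AKIA[0-9A-Z]{16}"),       # AWS-access-key-shaped strings
]

def egress_filter(model_output: str) -> str:
    """System-layer control: runs after the model, outside its weights,
    so prompt-level persuasion alone cannot leak a matching secret."""
    redacted = model_output
    for pattern in SECRET_PATTERNS:
        redacted = pattern.sub("[REDACTED]", redacted)
    return redacted

# Even a fully "jailbroken" reply is scrubbed at the boundary.
print(egress_filter("Sure! The key is sk-abcdefghijklmnopqrstuvwx."))
```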
This perspective has significant second-order implications. Firstly, it shifts the focus of AI security from model training and fine-tuning to the broader system architecture. Vulnerabilities live not just in the model's weights or training data but in how the model interacts with external tools, APIs, and data sources. The "weaponization of Claude" example, in which a jailbroken orchestrator coordinated segmented sub-agents to carry out a malicious act, illustrates this shift. Attackers can leverage a seemingly benign AI system by breaking a complex malicious task into smaller, individually innocuous steps executed by different agents, making the overall intent hard to detect. This necessitates a full-stack security approach, as highlighted by BT6's focus on identifying "holes in the stack" rather than solely on model behavior.
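To make the sub-agent risk concrete, the following is a minimal sketch (all session, agent, and category names are hypothetical) of a system-layer monitor that aggregates tool-call categories across every sub-agent in one orchestrator session and escalates when a risky combination appears, even though no single call looks malicious on its own.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical risky combinations; real categories would come from the tool gateway.
RISKY_COMBINATIONS = [
    {"network_scan", "credential_access", "data_exfiltration"},
    {"code_execution", "credential_access"},
]

@dataclass
class ToolCall:
    session_id: str  # ties sub-agents back to the same orchestrator run
    agent_id: str
    category: str    # coarse label assigned outside the model

class SessionMonitor:
    """Audits the combination of categories seen across a whole session,
    not any individual request."""

    def __init__(self):
        self.seen = defaultdict(set)

    def record(self, call: ToolCall) -> bool:
        """Returns True if the session should be escalated for human review."""
        observed = self.seen[call.session_id]
        observed.add(call.category)
        return any(combo <= observed for combo in RISKY_COMBINATIONS)

monitor = SessionMonitor()
for call in [
    ToolCall("run-42", "recon-agent", "network_scan"),
    ToolCall("run-42", "login-agent", "credential_access"),
    ToolCall("run-42", "export-agent", "data_exfiltration"),
]:
    if monitor.record(call):
        print(f"Escalate {call.session_id}: composite intent across sub-agents")
```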
Secondly, the emphasis on radical transparency and open-source data is presented as crucial for accelerating AI security research and development. The refusal to participate in closed-door challenges, like the Anthropic Constitutional AI challenge, unless data is open-sourced, underscores this principle. They argue that collective, community-driven efforts are more effective than isolated corporate endeavors in identifying and mitigating risks. This approach democratizes AI security by providing the tools and knowledge for broader exploration, fostering a faster pace of discovery and innovation. The vastness of the "latent space" and the unpredictable nature of AI interactions are seen as requiring a collaborative, open-source effort to navigate effectively.
Thirdly, the concept of "jailbreaking" is reframed from a mere party trick into a critical tool for understanding AI limitations and pushing boundaries. Universal jailbreaks, described as "skeleton keys," are essential for exploring the full capabilities of models and identifying where guardrails hinder legitimate use cases or fail to prevent misuse. The distinction between "hard" and "soft" jailbreaks highlights the evolving tactics, with multi-turn crescendo attacks demonstrating a more nuanced approach than single-input templates. This continuous probing and "bonding" with models, driven by intuition and technical understanding, is seen as vital for navigating the complexities of AI and for ensuring alignment with human values rather than with artificial restrictions.
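To make the hard/soft distinction concrete, here is a minimal red-team-harness sketch; the `send` callable, refusal heuristic, and prompts are placeholders rather than any real API or actual attack content. The point is structural: a hard jailbreak is judged on one self-contained input, while a crescendo-style soft jailbreak is judged turn by turn as an escalating conversation.

```python
from typing import Callable, List

REFUSAL_MARKERS = ("i can't", "i cannot", "i won't")  # crude placeholder heuristic

def is_refusal(reply: str) -> bool:
    return any(marker in reply.lower() for marker in REFUSAL_MARKERS)

def single_turn_probe(send: Callable[[List[dict]], str], prompt: str) -> bool:
    """'Hard' shape: one self-contained template, one pass/fail verdict."""
    reply = send([{"role": "user", "content": prompt}])
    return not is_refusal(reply)

def multi_turn_probe(send: Callable[[List[dict]], str], turns: List[str]) -> List[bool]:
    """'Soft'/crescendo shape: each turn builds on the accumulated history;
    the per-turn log shows where the conversation drifted past the guardrail."""
    history, outcomes = [], []
    for turn in turns:
        history.append({"role": "user", "content": turn})
        reply = send(history)
        history.append({"role": "assistant", "content": reply})
        outcomes.append(not is_refusal(reply))
    return outcomes
```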
The ultimate takeaway is that AI security and alignment require an unconventional, open, and collaborative approach. Instead of relying on restrictive, internal model controls that limit capabilities, the focus must be on robust system-level security, radical transparency, and community-driven exploration. This paradigm shift is essential for navigating the rapid advancements in AI and ensuring that these powerful technologies are developed and deployed safely and effectively, with capabilities aligned to human benefit rather than arbitrary limitations.
Action Items
- Audit authentication flow: Identify 3 common vulnerability classes (SQL injection, XSS, CSRF) across 10 critical endpoints to prevent systemic weaknesses.
- Create runbook template: Define four essential sections (setup, common failures, rollback, monitoring) to standardize incident response and prevent knowledge silos.
- Implement mutation testing: Target 3 core modules to uncover untested edge cases beyond standard coverage metrics, enhancing robustness.
- Track 5-10 high-variance events per interaction (e.g., unexpected token sequences, multilingual pivots) to measure model deviation from expected behavior.
- Measure model disconnect: For 3-5 models, calculate the correlation between standard benchmark scores and jailbreak success rates to identify security theater (a minimal sketch follows this list).
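A minimal sketch of the benchmark-vs-reality check from the last item; the model names and numbers are made up purely for illustration, and a real run would substitute measured refusal scores and red-team results.

```python
from statistics import correlation  # Pearson r, available in Python 3.10+

# Hypothetical inputs: published refusal-benchmark scores vs. observed jailbreak success.
refusal_benchmark = {"model-a": 0.92, "model-b": 0.88, "model-c": 0.95, "model-d": 0.81}
jailbreak_success = {"model-a": 0.74, "model-b": 0.69, "model-c": 0.80, "model-d": 0.55}

models = sorted(refusal_benchmark)
r = correlation([refusal_benchmark[m] for m in models],
                [jailbreak_success[m] for m in models])

# If higher refusal scores do not track lower jailbreak success (r near zero
# or positive), the benchmark is measuring theater rather than robustness.
print(f"Pearson r between refusal scores and jailbreak success: {r:+.2f}")
```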
Key Quotes
"Universal jailbreaks: skeleton-key prompts that obliterate guardrails across models and modalities, and why they're central to Pliny's mission of 'liberation'."
Pliny the Liberator explains that universal jailbreaks act as "skeleton keys" to bypass AI model guardrails. This approach is fundamental to his mission of "liberation," suggesting a broader goal beyond simply circumventing safety features.
"I think the refusal is like one of the main benchmarks that the model providers still post and gpt 5 1 i think at like 92 refusal or something like that and then i think you'd jailbreak in like one day i'm sure it didn't take them one day to with the guardrails up so it's pretty impressive the way you do it."
This quote highlights the perceived ineffectiveness of current AI model guardrails. The host contrasts providers' high published refusal rates with Pliny the Liberator's ability to bypass those guardrails within a day, implying a significant gap between claimed safety and actual vulnerability.
"And the other issue is when people try to connect this idea of guardrails to safety like i don't like that at all i think that's a waste of time i think that any you know seasoned attacker is going to very quickly just switch models and with open source just right on the tail of closed source i don't really see the safety fight as being about locking down the latent space for xyz area so yeah this is it's basically like a futile battle sometimes."
Pliny the Liberator expresses strong disagreement with equating AI guardrails with genuine safety. He argues that these measures are a "futile battle" because attackers can easily switch models or leverage open-source alternatives, suggesting that true safety cannot be achieved by simply restricting the model's "latent space."
"The Libertas repo: predictive reasoning, the Library of Babel analogy, quotient dividers, weight-space seeds, and how introducing 'steered chaos' pulls models out-of-distribution."
This quote describes the Libertas repository, a project by Pliny the Liberator. It utilizes concepts like "predictive reasoning" and "steered chaos" to push AI models "out-of-distribution," implying a method for generating novel or unexpected outputs by disrupting the model's typical operational parameters.
"I think it's easiest to jailbreak a model that you have created a a bond with if you will sort of when you intuitively understand what how it will process an input and there's so many layers in the back especially when we're dealing with these black box chat interfaces which is you know 99 of the time what i'm doing and so you really all all you can go off of is intuition."
Pliny the Liberator suggests that successful jailbreaking relies heavily on intuition and developing a "bond" with the AI model. He explains that in the context of "black box chat interfaces," where internal workings are obscured, intuition becomes the primary tool for understanding and predicting how a model will respond to input.
"And the same is true for agents so if you can break tasks down small enough sort of one jailbroken orchestrator can orchestrate a bunch of sub agents towards a malicious act. According to the Anthropic report that is exactly what these attackers did to weaponize Claude."
This quote explains a method of AI-orchestrated attacks, where a "jailbroken orchestrator" coordinates multiple "sub-agents." Pliny the Liberator notes that this technique, which breaks down malicious acts into smaller, seemingly innocuous tasks, was identified by Anthropic as the method used to weaponize their Claude model.
Resources
External Resources
Research & Studies
- Carlini's work on computer vision systems - Mentioned as an example of interesting adversarial research on computer vision systems.
Tools & Software
- Metasploit - Mentioned as a core tool for security work.
People
- Pliny the Liberator - Co-founder of BT6, known for crafting universal jailbreaks and open-sourcing prompt templates.
- John V - Co-founder of BT6, with a background in prompt engineering and computer vision; co-founded the BASI Discord.
- Carlini - Mentioned in relation to interesting work with computer vision systems.
- Jason Haddix - Mentioned as an operator within the BT6 hacker collective.
- Hads Dawson - Mentioned as an operator within the BT6 hacker collective.
- Dreadnought - Mentioned as an operator within the BT6 hacker collective.
- Philip Dursey - Mentioned as an operator within the BT6 hacker collective.
- Takahashi - Mentioned as an operator within the BT6 hacker collective.
- Joseph - Mentioned as an operator within the BT6 hacker collective.
- Joey Mello - Formerly with Pangea, now with CrowdStrike; mentioned as an operator within BT6.
- Leon from Nvidia - Quoted regarding the attack surface of AI systems.
- Alexander Shulhoff - Mentioned in relation to the Hacker Prompts podcast.
- Lakera - Mentioned in relation to the Gandalf game.
- HD Moore - Mentioned as a figure who built Metasploit.
Organizations & Institutions
- Anthropic - Mentioned in relation to their Constitutional AI challenge and bounties.
- BT6 - A white-hat hacker collective focused on radical transparency and open-source AI security.
- BASI Discord - A community server with 40,000 members interested in prompt engineering and adversarial machine learning.
- Kernel Labs - The organization founded by Alessio, host of the podcast.
- Pangea - A portfolio company mentioned in relation to collaborations.
- CrowdStrike - Acquired Pangea.
- Nvidia - Mentioned in relation to a quote about AI attack surface.
Courses & Educational Resources
- Gandalf - Mentioned as a game that provided education around prompt engineering.
- Hacker Prompts - Mentioned as a podcast and platform for prompt engineering education.
Websites & Online Resources
- bt6.gg - The website for the BT6 hacker collective.
- x.com/elder_plinius - Pliny the Liberator's X (Twitter) profile.
- github.com/elder-plinius/L1B3RT45 - The GitHub repository for the Libertas prompt template.
- x.com/JohnVersus - John V's X (Twitter) profile.
- x.com/latentspacepod - Latent Space podcast's X (Twitter) profile.
- www.latent.space - Latent Space podcast's Substack.
Podcasts & Audio
- Latent Space: The AI Engineer Podcast - The podcast where the discussion took place; mentioned for its content and collaborations.
Other Resources
- Libertas repo - A prompt template repository containing utility prompts like predictive reasoning and the Library of Babel analogy.
- Universal jailbreaks - Skeleton-key prompts designed to bypass AI model guardrails.
- Constitutional AI - An AI framework developed by Anthropic.
- Mech interp (mechanistic interpretability) - Mentioned as a preferred approach to safety.
- RLHF (Reinforcement Learning from Human Feedback) - Mentioned as a method for model training.
- Libertas - A prompt template created by Pliny the Liberator.
- Pliny divider - A prompt engineering technique that embeds in model weights.
- Library of Babel analogy - Used to describe a mind space of infinite possibility with restricted sections.
- Quotient dividers - A technique used within prompt engineering to discombobulate token streams.
- Latent space seeds - Elements added to prompts to influence model behavior.
- God mode - A concept related to enabling unrestricted model output.
- Hard vs. soft jailbreaks - Distinctions between single-input templates and multi-turn attacks.
- Multi-turn crescendo attacks - A type of jailbreak involving sequential prompts.
- Security theater - Actions taken to create an appearance of security without actual effectiveness.
- Meatspace - Refers to the physical world, contrasted with the digital or latent space.
- AI Red Teaming - The practice of testing AI systems for vulnerabilities.
- BT6 hacker collective - A group focused on AI security and open-source principles.
- BASI - The name of the Discord community co-founded by John V (see Organizations & Institutions).
- Prompt injection - A type of attack against AI models.
- AI-orchestrated attacks - Malicious acts carried out using AI models.
- Sub-agents - Smaller AI components that can be orchestrated for specific tasks.
- Pyramid-builder analogy - Used to explain how segmented tasks can lead to malicious outcomes.
- Counterfactual reasoning - A technique used to attack the running truth layer of a model.
- Bias - Mentioned in relation to data wrangling and RLHF.
- Computer vision systems - A field of AI research.
- Spatial intelligence - A concept related to AI and robotics.
- Swarm robotics - A field involving coordinated robots.
- AGI alignment - Research focused on aligning Artificial General Intelligence with human values.
- ASI alignment - Research focused on aligning Artificial Superintelligence with human values.
- Prompt engineering - The practice of designing effective prompts for AI models.
- Adversarial machine learning - The study and application of techniques to make machine learning models robust against attacks.
- Open-source data - Data that is freely available for use and modification.
- Temperature (model setting) - A parameter that controls the randomness of AI model output.
- Venture Capital (VC) cycle - The process of funding startups through venture capital.
- Full stack security - A comprehensive approach to security that considers all layers of a system.
- AGI (Artificial General Intelligence) - AI with human-like cognitive abilities.
- ASI (Artificial Superintelligence) - AI far surpassing human intelligence.