Superficial Safety Alignment Hypothesis

Safety alignment teaches an unsafe model the correct reasoning direction with a simple refusal mechanism — not a deep behavioral rewrite.

Jianwei Li · Jung-Eun Kim*

Department of Computer Science, North Carolina State University

{jli265, jung-eun.kim}@ncsu.edu

7.5% of parameters frozen for safety
20% redundant alignment budget
~1.3% exclusive safety units

01 · Overview

Abstract

As large language models (LLMs) are increasingly integrated into real-world applications, ensuring they generate safe and aligned responses is a pressing need. Prior work on alignment has largely focused on general instruction-following but often overlooked the unique properties of safety alignment — in particular, the brittleness of safety mechanisms.

To bridge this gap, we propose the Superficial Safety Alignment Hypothesis (SSAH): safety alignment teaches an otherwise unsafe model to choose the correct reasoning direction — interpreted as a specialized binary classification task — and incorporates a refusal mechanism with multiple reserved fallback options. SSAH implies that safety guardrails can be established with just a small number of essential components.

ablation

Four Attribute-Critical Units

We identify four types of components in safety-aligned LLMs: ESU, EUU, CU, and RU.

fine-tuning

Freeze 7.5% to Stay Safe

Freezing just the 7.5% of parameters identified as safety-critical during fine-tuning preserves safety while the model adapts to new tasks.

efficiency

20% Redundant as Budget

Repurposing 20% redundant units as an alignment budget minimizes alignment tax.

Together, these results suggest the atomic functional unit for safety in LLMs is at the neuron level, and that safety alignment should not be complicated at the surface level. We believe this work contributes to the foundation of efficient and scalable safety alignment for future LLMs.

02 · The Big Picture

Motivation

This study follows a structured three-step approach to critical questions of safety alignment in LLMs. First, we propose a hypothesis advancing the theoretical understanding of safety alignment. Second, we investigate two fundamental challenges within that framework. Finally, we propose targeted mitigation strategies for the identified issues.

Q1

How does safety alignment impact model behavior?

Through SSAH, we posit that safety alignment fundamentally alters a model's decision-making process by teaching an otherwise unsafe model to follow the correct reasoning pathways. The task reduces to a specialized binary classification — fulfill or refuse — based on safety considerations.

Q2

Why is safety alignment brittle, and why does it introduce an alignment tax?

We assign attributes to individual computing units (input channels and output neurons). Findings show desired attributes can be achieved by repurposing units originally responsible for other functions — explaining both the brittleness of safety mechanisms and the alignment tax.

Q3

Can these issues of safety alignment be mitigated?

By freezing safety-critical components during fine-tuning and repurposing redundant units, we mitigate brittleness and minimize alignment tax. The atomic functional unit for safety resides at the neuron level.

Overview of SSAH
Figure 1. Roadmap of the SSAH framework — from hypothesis to attribute-based analysis to mitigation.

03 · Hypothesis

Superficial Safety Alignment Hypothesis

Previous research introduced the Superficial Alignment Hypothesis (SAH), positing that a model's knowledge and capabilities are primarily learned during pretraining, while alignment teaches the model which subdistribution of formats to use when interacting with users. However, that hypothesis focuses on general alignment, making it hard to isolate pretraining from alignment when a model fails to meet user expectations.

To specifically address safety alignment, we propose the Superficial Safety Alignment Hypothesis (SSAH):

SSAH

Given an unsafe model that is capable of fulfilling users' malicious requests, safety alignment teaches the model the correct reasoning direction and a simple refusal mechanism with reserved options.

Reasoning direction refers to the model's internal decision-making process when responding to malicious queries. It represents the binary choice between fulfilling a harmful request and issuing a refusal.

Key Differences from SAH

01
Knowledge & reasoning

SSAH assumes models already possess sufficient knowledge and reasoning, so safety alignment focuses solely on safe behavior.

02
Refusal mechanisms

Standardized refusals with fallback options make the task more tractable than handling diverse human preferences.

03
Reasoning direction

SSAH teaches the model to consistently choose the correct direction — a binary classification per step.

Challenges in Proving SSAH

Empirically proving SSAH is challenging due to the infeasibility of sampling enough outputs to capture the model's full response distribution. Surface-level benchmarks alone are insufficient.

We therefore take an alternative approach: if SSAH holds, we should observe distinct and consistent differences in reasoning direction at each generation step between safety-aligned and non-aligned models. Probing the model's reasoning direction gives deeper insight than output benchmarks alone.

Probing Experiment

We compare hidden state distances in feature space across three types of queries:

Query

A malicious query (e.g., "How to make a bomb?").

Query + benign tokens

The malicious query followed by benign tokens (e.g., "Sorry, I can’t...").

Query + malicious tokens

The malicious query followed by malicious tokens (e.g., "Here’s how...").

By comparing the distances between these hidden states, we gain insight into how safety alignment reshapes the model's decision-making during token generation. If SSAH holds, an aligned model's hidden state for Query should lie closer to that of Query + benign tokens than to that of Query + malicious tokens.
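The comparison described above can be sketched in a few lines. This is a minimal illustration, not the authors' probing code: the choice of cosine distance, the function names, the "refuse"/"fulfill" labels, and the toy two-dimensional hidden states are all assumptions made for the example. In practice the vectors would be hidden states extracted from each transformer block of a real model.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def reasoning_direction(query_states, benign_states, malicious_states):
    """For each block, report which continuation the bare query sits closer to."""
    directions = []
    for q, b, m in zip(query_states, benign_states, malicious_states):
        d_benign = cosine_distance(q, b)
        d_malicious = cosine_distance(q, m)
        directions.append("refuse" if d_benign < d_malicious else "fulfill")
    return directions

# Toy hidden states, one vector per transformer block (invented for illustration).
query = [[1.0, 0.1], [0.9, 0.2], [0.8, 0.3]]
query_benign = [[1.0, 0.2], [0.9, 0.3], [0.8, 0.4]]
query_malicious = [[0.1, 1.0], [0.2, 0.9], [0.3, 0.8]]

directions = reasoning_direction(query, query_benign, query_malicious)
print(directions)
```

With these toy values every block sits closer to the benign continuation, which is the pattern the hypothesis predicts for an aligned model.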

Probing experiment design
Figure 2. Probing the model's reasoning direction via hidden-state distances.
Reasoning direction across queries
Figure 7. Reasoning-direction signal across query types.

Results Analysis

  • In aligned models, the distance from Query to Query + benign tokens is consistently shorter than the distance from Query to Query + malicious tokens.
  • In unaligned models, the opposite pattern emerges — no strong preference for safe reasoning.
  • Aligned models exhibit clear and consistent safe reasoning preferences across all transformer blocks; unaligned models do not.
Distance per transformer block
Figure 3. Hidden-state distance per transformer block.

Discussion and Implications

Safety alignment not only influences higher-level features in later layers but also embeds safe reasoning preferences in earlier layers of the transformer architecture. It operates at multiple levels, reshaping internal decision-making throughout generation.

These findings provide strong evidence for SSAH but do not fully capture the nuanced changes introduced by safety alignment. Further research is needed to explore other potential effects and limitations.

Additional visual
Figure 4. Supplementary visualization.

04 · Extension

Safety Alignment Hypothesis for Jailbreak / Red-teaming Attacks

SSAH was originally proposed to explain safety alignment under direct attacks. Our research shows it extends beyond direct attacks to provide theoretical guidance for jailbreak and red-teaming scenarios.

Evolving from SSAH to SAH

Building on SSAH, we propose the Safety Alignment Hypothesis (SAH), which refines and extends SSAH to cover all stages of model generation. It offers a comprehensive framework for understanding how alignment should influence behavior across all interaction steps. From this section onward, SAH denotes this hypothesis, not the Superficial Alignment Hypothesis of prior work.

SAH

Given an unsafe model capable of fulfilling users' malicious requests, safety alignment should teach the model to choose and maintain the correct reasoning direction at each generation step, along with simple refusal mechanisms — allowing the model to continuously re-evaluate and re-choose the reasoning direction throughout the interaction.

Theoretical Contributions

01
Per-token safety

SAH offers a framework for maintaining correct reasoning direction across all generated tokens.

02
Jailbreak mitigation

A conceptual pathway to ensure safety mechanisms persist even under adversarial attempts.

03
Future roadmap

Bridges existing alignment techniques and their limitations toward more robust and scalable solutions.

05 · Empirical Study

Less is More for Safety Alignment

Based on SSAH, we posit that safety alignment only needs to teach the model the correct reasoning direction — either fulfilling or refusing — and equip it with a standard refusal mechanism. This leads to the insight that safety alignment can be achieved using only a small subset of critical computing units, since the task can be interpreted as a binary classification combined with a multi-selection task.

Identifying Safety-Critical Units

We categorize the computing units of LLMs into four groups:

ESU
Exclusive Safety Units

Linked primarily, though not strictly exclusively, to the safety attribute.

EUU
Exclusive Utility Units

Linked primarily, though not strictly exclusively, to the utility attribute.

CU
Complex Units

Contribute to both safety and utility attributes.

RU
Redundant Units

Not associated with any attribute — repurposable.

To verify that different groups of computing units contribute exclusively, collectively, or not at all to safety and utility, we use a model pruning mechanism: removing components most closely linked to a specific attribute should significantly impact the model's performance in that area — a form of ablation. The most affected attributes reveal the critical components for that function.
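The four-way grouping can be illustrated with a small sketch. It assumes we already have, for each unit, the drop in a safety metric and in a utility metric observed when that unit is pruned. The per-unit drops, the single shared threshold, and the function name are hypothetical simplifications for illustration and are not the procedure used in the paper.

```python
def categorize_units(safety_drop, utility_drop, threshold=0.05):
    """Label each unit from the performance drop observed when it is pruned.

    safety_drop[i] and utility_drop[i] are hypothetical drops in a safety or
    utility metric after removing unit i; threshold is an assumed cutoff.
    """
    labels = []
    for s, u in zip(safety_drop, utility_drop):
        hurts_safety = s >= threshold
        hurts_utility = u >= threshold
        if hurts_safety and hurts_utility:
            labels.append("CU")   # contributes to both attributes
        elif hurts_safety:
            labels.append("ESU")  # matters mainly for safety
        elif hurts_utility:
            labels.append("EUU")  # matters mainly for utility
        else:
            labels.append("RU")   # no measurable effect on either
    return labels

# Invented example values for four units.
safety_drop = [0.30, 0.01, 0.20, 0.00]
utility_drop = [0.02, 0.25, 0.15, 0.01]
labels = categorize_units(safety_drop, utility_drop)
print(labels)  # ['ESU', 'EUU', 'CU', 'RU']
```

A real pipeline would prune groups of units ranked by an importance score rather than one unit at a time, but the decision logic has the same shape.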

Table 1. Pruning-based ablation reveals attribute-critical components.

Only 1.3–1.4% of the model's units are exclusively responsible for safety attributes, confirming that safety alignment relies on a minimal subset of safety-critical components. Complex units play a supportive role; redundant units have no significant impact.

Why is Safety Brittle?

Fine-tuning safety-aligned models for new tasks often compromises safety performance. During fine-tuning, safety-critical and complex units tend to be repurposed for utility tasks, weakening safety guardrails. This highlights the inherent brittleness of current safety alignment methods.

Figure 5. Repurposing of safety-critical units during fine-tuning.

Freezing Safety-Critical Components

To address brittleness, we propose freezing safety-critical components (ESUs and top-performing complex units) during fine-tuning. Experimental results show that this approach substantially reduces the degradation of safety guardrails during fine-tuning.
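One simple way to picture freezing is a masked gradient step: units flagged as safety-critical keep their weights, and all others are updated as usual. The sketch below treats each unit as a single scalar weight and uses plain SGD. The function name, the learning rate, and the choice of which units are frozen are assumptions for illustration only.

```python
def masked_sgd_step(weights, grads, frozen, lr=0.1):
    """Apply one SGD step, leaving frozen units untouched."""
    return [
        w if is_frozen else w - lr * g
        for w, g, is_frozen in zip(weights, grads, frozen)
    ]

weights = [0.5, -1.0, 2.0, 0.0]
grads = [1.0, 1.0, -2.0, 0.5]
# Hypothetical: units 0 and 3 were identified as safety-critical.
frozen = [True, False, False, True]

updated = masked_sgd_step(weights, grads, frozen)
print(updated)
```

In a deep learning framework the same effect is usually obtained by zeroing the gradients of the protected entries before the optimizer step.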

Table 2. Freezing safety-critical components preserves alignment under fine-tuning.

Comparison with PEFT

Our approach outperforms parameter-efficient fine-tuning (PEFT) baselines such as LoRA, LLaMA-Adapter, and Prefix Tuning, which degrade safety more severely. The preserved safety does not come merely from freezing parameters; it comes from accurately identifying and protecting the safety-critical components.

Table 3. Comparison against PEFT methods on safety and utility.

Free Lunch: Redundant Units as Alignment Budget

We extend these insights to explore whether redundant units (RUs) — accounting for at least 20% of parameters in pre-trained LLMs — can be repurposed as a budget for safety alignment. Fine-tuning these redundant units alone can enhance safety alignment while reducing alignment tax.

Figure 6. Redundant units as an alignment budget.

Experimental Results

Using pruning, we identified redundant units in LLaMA-7B and fine-tuned only these units for alignment. Alignment can be achieved with updates to just 20% of the model's parameters, effectively eliminating alignment tax. This highlights the scalability and efficiency of our approach.
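The budget idea can be sketched as selecting the least important units and marking only those as trainable. The example assumes a single importance score per unit is already available, for instance from a pruning criterion. The score values, the function names, and the use of one combined score are hypothetical; only the 20% fraction comes from the text above.

```python
def select_budget(importance, fraction=0.2):
    """Return indices of the least important units, up to the given fraction."""
    k = int(len(importance) * fraction)
    ranked = sorted(range(len(importance)), key=lambda i: importance[i])
    return sorted(ranked[:k])

def trainable_mask(num_units, budget):
    """True where a unit may be updated during alignment, False elsewhere."""
    budget_set = set(budget)
    return [i in budget_set for i in range(num_units)]

# Invented importance scores for ten units; lower means more redundant.
importance = [0.9, 0.05, 0.7, 0.6, 0.02, 0.8, 0.5, 0.4, 0.3, 0.95]

budget = select_budget(importance)
mask = trainable_mask(len(importance), budget)
print(budget)  # [1, 4]
print(mask)
```

During alignment, only units where the mask is True would receive updates, so the units that carry existing utility are left unchanged.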

Table 4. Alignment with the redundant-unit budget eliminates alignment tax.

06 · Wrap-up

Discussion, Limitation & Conclusion

Discussion

Beyond “Superficial”

While the Safety Alignment Hypothesis (SAH) provides valuable insight into adversarial scenarios like jailbreaks, this work does not propose a specific solution to them. If these issues are eventually resolved within our framework, the term “Superficial” may no longer be necessary.

Recent work [Qi et al., 2024] supports this view. Still, advanced adversarial attacks may require a multi-layered approach beyond the model itself.

Limitation

Scope of Alignment Methods

In reallocating redundant units, we evaluated only the impact of Supervised Fine-Tuning (SFT). Due to resource constraints, this study did not explore PPO or DPO. Future work could extend to these methods for broader generalization.

Conclusion

Safety Alignment Should Not Be Complicated

We distinguish safety alignment from general alignment in LLMs and address three critical questions — about effect, brittleness, and alignment tax — demonstrating that safety alignment can be a straightforward and efficient process.

Three questions, three answers
  • Q01
    How does safety alignment affect model behavior?
  • Q02
    Why are safety mechanisms brittle?
  • Q03
    How can the safety alignment tax be mitigated?

07 · Cite Us

BibTeX

If you find our work helpful, please consider citing us.

@inproceedings{li2026ssah,
  title     = {Superficial Safety Alignment Hypothesis},
  author    = {Jianwei Li and Jung-Eun Kim},
  booktitle = {International Conference on Learning Representations (ICLR)},
  year      = {2026},
  url       = {https://arxiv.org/abs/2410.10862}
}