
Anthropic, the AI safety startup behind the Claude language model, has introduced a new method for controlling AI behavior by managing its “personality,” Tech Xplore reports. In a recent paper, the company outlines a technique that lets developers steer large language models (LLMs) using what it calls “constitutional AI,” a framework of written principles that guides the AI’s responses. What’s new is the ability to tune specific personality traits within this framework, such as making a model more or less “helpful,” “harsh,” or “obedient.”
Instead of hardcoding every rule, Anthropic trains its models on a set of ethical principles and then fine-tunes them with human feedback. In this latest update, the team combined reinforcement learning with automated preference modeling to refine Claude’s responses across different situations. For example, an “obedient” Claude would follow user instructions closely even when they are ambiguous, while a “harsh” version might prioritize blunt, critical feedback.
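To make that concrete, here is a minimal sketch in Python of how trait-weighted preference scoring could work. Everything in it (the `Persona` weights, the scoring heuristics, the function names) is a hypothetical illustration of the general idea, not Anthropic’s actual code; in real training, the scores would come from a learned preference model and feed a reinforcement-learning loop as reward signals.

```python
from dataclasses import dataclass

# Hypothetical trait dials: weights that shift which qualities dominate
# the preference score for a given persona.
@dataclass
class Persona:
    helpfulness: float = 1.0
    harshness: float = 0.0
    obedience: float = 1.0

def score_response(response: str, persona: Persona) -> float:
    """Toy automated preference model: combine per-trait scores using
    persona weights. A real preference model would be a learned network."""
    # Crude stand-in heuristics for learned per-trait scores.
    follows_instructions = 1.0 if response.strip() else 0.0
    bluntness = 1.0 if "frankly" in response.lower() else 0.2
    helpfulness = min(len(response) / 200.0, 1.0)
    return (
        persona.helpfulness * helpfulness
        + persona.harshness * bluntness
        + persona.obedience * follows_instructions
    )

def pick_preferred(candidates: list[str], persona: Persona) -> str:
    """Preference-modeling step: rank candidate responses and keep the best.
    In RL fine-tuning, these scores would instead become rewards."""
    return max(candidates, key=lambda r: score_response(r, persona))

if __name__ == "__main__":
    harsh_editor = Persona(helpfulness=0.5, harshness=1.0, obedience=0.8)
    candidates = [
        "Your draft is fine.",
        "Frankly, the second section is unfocused; cut it and restate the thesis.",
    ]
    print(pick_preferred(candidates, harsh_editor))  # picks the blunt response
```

Turning the dials the other way (low harshness, high helpfulness) would make the same ranking step prefer the gentler answer, which is the sense in which a single framework can yield differently “tuned” personalities.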
The idea is to make AI safer not by narrowing its abilities but by shaping its character, making it more predictable and aligned with human values. By tuning a model’s personality traits, Anthropic hopes to prevent dangerous behaviors, reduce hallucinations, and avoid manipulative or toxic interactions, all key concerns as AI systems become more powerful and autonomous.
This personality-centered approach could make it easier to build customized AIs for different use cases, such as empathetic ones for therapy, assertive ones for coaching, or highly cautious ones for legal advice, while maintaining safety controls.
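In practice, a rough version of that per-use-case persona selection is already possible with system prompts. The short sketch below does so with the public Anthropic Python SDK; the persona texts, the `PERSONAS` and `ask` names, and the model string are illustrative assumptions, and prompt-based steering is a simpler stand-in for the trained-in trait tuning the paper describes.

```python
import anthropic  # official Anthropic Python SDK: pip install anthropic

# Hypothetical persona presets for the use cases mentioned above.
PERSONAS = {
    "therapy": "You are warm and empathetic. Validate feelings before giving advice.",
    "coaching": "You are direct and assertive. Give concrete, actionable steps.",
    "legal": "You are highly cautious. Flag uncertainty and recommend consulting a lawyer.",
}

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def ask(use_case: str, question: str) -> str:
    """Route a question through the persona chosen for this use case."""
    response = client.messages.create(
        model="claude-3-5-sonnet-latest",  # illustrative model name
        max_tokens=512,
        system=PERSONAS[use_case],  # persona set via the system prompt
        messages=[{"role": "user", "content": question}],
    )
    return response.content[0].text

print(ask("coaching", "I keep missing deadlines. What should I change first?"))
```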