It's NLPeak: Personality-Conditioned Dialogue Generation

Fall 2025 CSCI 5541 NLP: Class Project - University of Minnesota

Team NLPeak

Riandy Setiadi  •  Matt He  •  Rohan Cherukeri  •  Stefan Hermann



Abstract

What did you try to do?
We built a system that transforms movie dialogue to match different character personality archetypes (like Hero, Mentor, Trickster) while preserving the original meaning.

How did you do it?
Our two-stage pipeline uses (1) a DeBERTa classifier to identify archetypes and (2) a LoRA-fine-tuned Qwen-14B generator trained on GPT-generated synthetic pairs. We improved classifier accuracy from 20% to 38% through diagnostic-driven hybrid relabeling.

What did you find?
Human evaluators rated outputs highly (4.69/5 personality accuracy, 4.72/5 semantic preservation), but automatic metrics showed weak correlation with human judgment (r=0.241). We discovered that adding stylistic features (like humor) is much easier than removing them—a novel asymmetric pattern in style transfer.


Teaser Figure

Our system transforms dialogue across eight personality archetypes while preserving meaning. The two-stage pipeline combines archetype classification with retrieval-augmented generation.

[Figure: System pipeline diagram]

The Pipeline

Stage 1: DeBERTa classifier identifies personality archetypes from input dialogue
Stage 2: Qwen generator with LoRA adapters transforms dialogue to target archetype using FAISS retrieval for style examples
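The retrieval step in Stage 2 can be sketched as follows. This is a minimal illustration using a brute-force NumPy cosine search standing in for the FAISS index; the embeddings, example lines, and the helper name `retrieve_style_examples` are illustrative, not the project's actual code.

```python
import numpy as np

def retrieve_style_examples(query_emb, example_embs, example_texts, k=3):
    """Return the k style examples whose embeddings are closest to the
    query embedding under cosine similarity (FAISS stand-in)."""
    q = query_emb / np.linalg.norm(query_emb)
    E = example_embs / np.linalg.norm(example_embs, axis=1, keepdims=True)
    sims = E @ q                      # cosine similarity to every example
    top = np.argsort(-sims)[:k]      # indices of the k most similar examples
    return [example_texts[i] for i in top]

# Toy usage with 4-dim embeddings; the real system would use sentence
# embeddings and a FAISS index such as IndexFlatIP.
examples = ["Why so serious?", "I'll stand with you.", "Believe in yourself."]
embs = np.array([[1.0, 0.0, 0.0, 0.0],
                 [0.0, 1.0, 0.0, 0.0],
                 [0.0, 0.9, 0.1, 0.0]])
query = np.array([0.0, 1.0, 0.0, 0.0])
print(retrieve_style_examples(query, embs, examples, k=2))
```

At inference time the retrieved lines are placed in the generator's prompt so the model can imitate the target archetype's phrasing.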


Introduction / Background / Motivation

What did you try to do? What problem did you try to solve?

We tackled personality-conditioned dialogue generation: transforming text to match specific character archetypes (Hero, Mentor, Ally, Trickster, Shadow, Rebel, Innocent, Jester) while keeping the original message. Think of it as "what would Yoda say?" vs. "what would Han Solo say?" for the same idea.

How is it done today, and what are the limits of current practice?

Existing style transfer methods focus on simple attributes (formal vs. casual, positive vs. negative) but struggle with complex personality traits. Current LLMs can generate text in different styles but lack fine-grained control and often fail to preserve semantic content. No existing work addresses multi-way personality transfer with evaluation that captures both style accuracy and meaning preservation.

Who cares? If you are successful, what difference will it make?

This enables personality-aware dialogue systems for creative writing tools, game NPCs with consistent character voices, and personalized chatbots. More broadly, it advances controllable text generation—a key challenge in making LLMs useful for specific applications requiring precise stylistic control.


Approach

What did you do exactly? How did you solve the problem?

Stage 1: Archetype Classifier

We fine-tuned DeBERTa-v3-base on 106,000 labeled movie dialogues. Initial labeling with GPT-4 yielded only 20% classifier accuracy due to noisy labels. We therefore developed a diagnostic-driven hybrid relabeling strategy: per-class diagnostics identified the weakest archetypes, and only those classes were relabeled, raising accuracy to 38%.
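The diagnostic step can be sketched as follows. This is a minimal pure-Python illustration; the 0.30 threshold and the helper names `per_class_f1` and `weak_classes` are illustrative, not the project's actual values.

```python
def per_class_f1(y_true, y_pred, labels):
    """Per-class F1 from gold and predicted archetype labels."""
    scores = {}
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        scores[c] = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return scores

def weak_classes(f1_by_class, threshold=0.30):
    """Classes whose F1 falls below the threshold are queued for relabeling."""
    return [c for c, f1 in f1_by_class.items() if f1 < threshold]

# Toy diagnostic run on three predictions over two archetypes.
f1 = per_class_f1(["Hero", "Jester", "Hero"],
                  ["Hero", "Hero", "Jester"],
                  ["Hero", "Jester"])
print(weak_classes(f1))
```

In this scheme only the flagged classes are sent back through relabeling, which is what "hybrid" refers to: strong classes keep their original GPT-4 labels.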

Stage 2: Style Transfer Generator

We fine-tuned Qwen-2.5-14B-Instruct with LoRA (rank=32) on 14,000 synthetic training pairs. Each pair shows "Original archetype → Target archetype" transformations generated by GPT-4. We used FAISS retrieval to provide style examples at inference time, helping the model match target personality patterns.
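At inference time, the retrieved style examples are folded into the generator's prompt. The report does not spell out the prompt template, so the sketch below assumes a simple instruction format; `build_transfer_prompt` is a hypothetical helper.

```python
def build_transfer_prompt(dialogue, source, target, style_examples):
    """Assemble a style-transfer prompt from the input line, the source and
    target archetypes, and retrieved style examples (template is illustrative)."""
    shots = "\n".join(f"- {ex}" for ex in style_examples)
    return (
        f"Rewrite the line below in the voice of the {target} archetype, "
        f"preserving its meaning. It is currently spoken as a {source}.\n"
        f"Example {target} lines:\n{shots}\n"
        f"Line: {dialogue}\n"
        f"Rewritten:"
    )

prompt = build_transfer_prompt(
    "Why so serious?", "Jester", "Shadow",
    ["The dark is patient.", "Hope is a luxury you cannot afford."],
)
print(prompt)
```

The assembled prompt is then passed to the LoRA-adapted Qwen model for generation.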

Evaluation Framework

We developed multiple evaluation metrics, including a personality accuracy score (PAS) derived from the archetype classifier, BERTScore for semantic preservation, and a contextual coherence measure (see Table 2).
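As a concrete illustration, PAS can be computed as the classifier's softmax probability for the target archetype. This definition is our assumption for the sketch; the report does not give the exact formula.

```python
import math

def pas(logits, target_index):
    """Personality accuracy score (assumed definition): the archetype
    classifier's softmax probability assigned to the target archetype."""
    m = max(logits)                               # subtract max for stability
    exps = [math.exp(x - m) for x in logits]
    return exps[target_index] / sum(exps)

# Toy usage: 8 archetype logits, target archetype at index 2.
print(pas([0.1, -0.3, 2.0, 0.0, 0.4, -1.0, 0.2, 0.0], 2))
```

Under this definition, a weak classifier caps the achievable PAS, which is the measurement ceiling discussed in the Results section.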

What problems did you anticipate? What problems did you encounter?

Challenge 1: Noisy Labels. GPT-4 character-level labeling was inconsistent. Solution: Diagnostic-driven hybrid relabeling targeted specific weak classes.

Challenge 2: Limited Training Data. Real movie dialogue rarely shows clear personality transformations. Solution: Synthetic pair generation with GPT-4 provided 14K training examples.

Challenge 3: Evaluation Gap. Automatic metrics poorly correlated with human judgment. Solution: Comprehensive human evaluation revealed this methodological challenge.


Results

How did you measure success? What experiments were used?

Classifier Performance

Archetype    F1 Score          Test Samples
Innocent     0.471             3,860
Mentor       0.435             4,602
Shadow       0.349             1,177
Rebel        0.336             1,202
Trickster    0.310             1,357
Hero         0.309             1,084
Ally         0.302             1,983
Jester       0.288             631
Overall      37.9% accuracy    15,896

Table 1. Classifier performance after hybrid relabeling (accuracy improved from 20% to 37.9%)

Generation Performance

Metric                    Score    Target
Classification Accuracy   56%      Better than training (38%)
PAS (Mean)                0.435    > 0.35
BERTScore F1              0.875    > 0.70
Contextual Coherence      0.556    > 0.50

Table 2. Automatic evaluation metrics on 100 test samples

Human Evaluation Results

Dimension               Mean Rating          5/5 Ratings
Personality Accuracy    4.69 / 5 (σ=0.66)    79%
Semantic Preservation   4.72 / 5 (σ=0.59)    82%
Fluency                 4.65 / 5 (σ=0.89)    77%
Overall Quality         4.56 / 5 (σ=0.86)    74%

Table 3. Human ratings from 3 annotators on 100 samples (300 total ratings)

Did you succeed? What were the key findings?

Finding 1: Asymmetric Transformation Difficulty
Transformations TO Jester achieve 71% classification accuracy, while transformations FROM Jester achieve only 15%. Adding humor is easier than removing it—a novel pattern suggesting fundamental differences in additive vs. subtractive style features.
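The TO/FROM accuracies in Finding 1 can be computed from (source archetype, target archetype, predicted archetype) triples. The sketch below uses a hypothetical helper, `directional_accuracy`, and toy records rather than the project's actual outputs.

```python
def directional_accuracy(records, archetype, direction="to"):
    """Classification accuracy restricted to transformations to (or from)
    a given archetype. records: (source, target, predicted) triples."""
    key = 1 if direction == "to" else 0          # target field vs. source field
    subset = [r for r in records if r[key] == archetype]
    if not subset:
        return 0.0
    return sum(1 for s, t, p in subset if p == t) / len(subset)

# Toy records: two transformations into Jester, one out of Jester.
records = [
    ("Shadow", "Jester", "Jester"),   # correctly classified as Jester
    ("Hero",   "Jester", "Hero"),     # misclassified
    ("Jester", "Shadow", "Hero"),     # misclassified
]
print(directional_accuracy(records, "Jester", "to"))
print(directional_accuracy(records, "Jester", "from"))
```

Splitting accuracy by direction in this way is what exposes the additive-vs-subtractive asymmetry.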
Finding 2: Evaluation Challenge
Weak correlation between automatic metrics and human ratings (PAS ↔ Personality: r=0.241). This reveals that current automatic metrics inadequately capture subjective personality perception. The 38% classifier creates a measurement ceiling.
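The reported correlation (r=0.241) is a standard Pearson coefficient between automatic scores and human ratings; a minimal sketch of the computation, with toy data in place of the study's scores:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two paired score lists,
    e.g. PAS values and human personality ratings."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy usage: perfectly linearly related scores give r = 1.0.
print(pearson_r([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```

A value of 0.241 on such paired scores indicates that the automatic metric explains little of the variance in human judgments.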
Finding 3: LLM-Bootstrapped Data Works
Synthetic training pairs from GPT-4 enable effective style transfer without expensive human annotation. Generated samples achieve 56% classification accuracy vs. 38% on training data, suggesting clearer personality signals.

Example Transformations

High-Quality Transformations

Shadow → Jester (PAS: 0.919)
Original: "He was very easy to hate. If you'll look in your mirror, when you are in a better mood..."
Generated: "Oh, he's like a walking punchline! Just take a peek in that mirror when you're feeling a bit cheerier..."
Ally → Hero (PAS: 0.847)
Original: "We need to stick together on this."
Generated: "I'll stand with you, no matter what we face. Together, we can overcome this."
Mentor → Innocent (PAS: 0.812)
Original: "Experience has taught me to be cautious in these situations."
Generated: "I think everything will work out! We just have to believe and stay positive."

Challenging Transformations

Jester → Shadow (PAS: 0.102)
Original: "Why so serious? Let's put a smile on that face!"
Generated: "Your optimism is misplaced. This situation demands gravity, not levity."
Note: The generator successfully removed the humor, but the classifier struggled to recognize the darker tone, depressing the PAS.

Key Findings & Discussion

1. Granularity-Performance Trade-off

Earlier 4-archetype experiment: 48.9% accuracy, 0.493 PAS
Current 8-archetype system: 38% accuracy, 0.435 PAS

Finer-grained personality distinctions reduce classification accuracy but increase practical utility. The 8-way system captures subtle differences (Hero vs. Ally vs. Innocent) that 4-way clustering misses.

2. Generated Text Has Clearer Personality

Generated samples achieve 56% classification accuracy vs. 38% on training data. This suggests the generator produces stronger personality signals than authentic movie dialogue—likely because GPT training pairs exaggerate archetype characteristics for clarity. While good for evaluation, this may limit naturalness in some applications.

3. Evaluation Challenges

Automatic metrics correlate only weakly with human ratings (e.g., PAS vs. human personality accuracy: r=0.241).

Why? The 38% classifier creates a PAS measurement ceiling (~0.5 max). Humans evaluate holistically while PAS relies on specific features. BERTScore captures surface similarity but not meaningful semantic equivalence.

Implication: Human evaluation is essential for personality-based style transfer. Current automatic metrics are insufficient proxies for perceptual quality.


Limitations & Future Work

How easily are your results able to be reproduced by others?

Fully reproducible with the provided code, data, and model weights. The dataset (106K dialogues), classifier, generator, and evaluation code are all publicly available. Training requires ~40 Colab compute units plus ~$10 in GPT-4 API costs.

What limitations does your model have? How can you extend your work?

Limitation                  Impact                                                    Future Direction
Classifier Accuracy (38%)   Creates a ~0.5 PAS ceiling, limiting measurable quality   Better labeling, larger models, multi-task learning
Movie Domain Bias           Trained only on movie dialogue; may not generalize        Multi-domain data (books, social media, transcripts)
Asymmetric Performance      Removing stylistic features is harder than adding them    Specialized models for subtractive transformations
English Only                No cross-lingual personality transfer                     Multilingual models, cross-lingual style transfer

Table 4. Key limitations and proposed solutions

Does your work have potential harm or risk to our society?

Risks: Could be misused to impersonate specific individuals or manipulate communication by mimicking trustworthy personas (e.g., mentor voice for scams). May perpetuate stereotypes if archetypes align with harmful characterizations.

Mitigations: Our system works with abstract archetypes, not individual identities. Results require human oversight before deployment. We recommend watermarking AI-generated personality-styled content and limiting use to creative/educational contexts with proper disclosure.


Conclusion

We demonstrated that personality-conditioned dialogue generation across eight archetypes is achievable with a two-stage classifier-generator pipeline. Key contributions include a diagnostic-driven hybrid relabeling strategy that raised classifier accuracy from 20% to 38%, an LLM-bootstrapped synthetic training pipeline for style transfer, and the identification of an asymmetric pattern in which adding stylistic features is easier than removing them.

Human evaluation (4.69/5 personality accuracy) validates that the system produces high-quality, personality-consistent transformations despite relatively modest automatic metrics. This work advances controllable generation for narrative AI systems and highlights critical evaluation challenges in subjective style transfer tasks.


Acknowledgments

We thank the course instructors for feedback on dataset size that led to our improved hybrid relabeling approach. We acknowledge the use of GPT-4 for data labeling and synthetic pair generation, and Google Colab for computational resources.