It's NLPeak: Personality-Conditioned Dialogue Generation

Fall 2025 CSCI 5541 NLP: Class Project - University of Minnesota

Team NLPeak

Riandy Setiadi  •  Matt He  •  Rohan Cherukeri  •  Stefan Hermann



Abstract

What did you try to do?
We built a system that transforms movie dialogue to match different character personality archetypes (like Hero, Mentor, Trickster) while preserving the original meaning.

How did you do it?
Our two-stage pipeline uses (1) a DeBERTa classifier to identify archetypes and (2) a LoRA-fine-tuned Qwen-14B generator trained on GPT-generated synthetic pairs. We improved classifier accuracy from 20% to 38% through diagnostic-driven hybrid relabeling.

What did you find?
Human evaluators rated outputs highly (4.69/5 personality accuracy, 4.72/5 semantic preservation), but automatic metrics showed weak correlation with human judgment (r=0.241). We discovered that adding stylistic features (like humor) is much easier than removing them—a novel asymmetric pattern in style transfer.


Teaser Figure

Our system transforms dialogue across eight personality archetypes while preserving meaning. The two-stage pipeline combines archetype classification with retrieval-augmented generation.

[Figure: System pipeline diagram]

The Pipeline

Stage 1: DeBERTa classifier identifies personality archetypes from input dialogue
Stage 2: Qwen generator with LoRA adapters transforms dialogue to target archetype using FAISS retrieval for style examples
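The retrieval step in Stage 2 can be sketched as follows. This is a minimal illustration using a brute-force NumPy cosine search standing in for the FAISS index; the embeddings, example lines, and the helper name `retrieve_style_examples` are illustrative, not the project's actual code.

```python
import numpy as np

def retrieve_style_examples(query_emb, example_embs, example_texts, k=3):
    """Return the k style examples whose embeddings are closest to the
    query embedding under cosine similarity (FAISS stand-in)."""
    q = query_emb / np.linalg.norm(query_emb)
    E = example_embs / np.linalg.norm(example_embs, axis=1, keepdims=True)
    sims = E @ q                      # cosine similarity to every example
    top = np.argsort(-sims)[:k]      # indices of the k most similar examples
    return [example_texts[i] for i in top]

# Toy usage with 4-dim embeddings; the real system would use sentence
# embeddings and a FAISS index such as IndexFlatIP.
examples = ["Why so serious?", "I'll stand with you.", "Believe in yourself."]
embs = np.array([[1.0, 0.0, 0.0, 0.0],
                 [0.0, 1.0, 0.0, 0.0],
                 [0.0, 0.9, 0.1, 0.0]])
query = np.array([0.0, 1.0, 0.0, 0.0])
print(retrieve_style_examples(query, embs, examples, k=2))
```

At inference time the retrieved lines are placed in the generator's prompt so the model can imitate the target archetype's phrasing.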


Introduction / Background / Motivation

What did you try to do? What problem did you try to solve?

We tackled personality-conditioned dialogue generation: transforming text to match specific character archetypes (Hero, Mentor, Ally, Trickster, Shadow, Rebel, Innocent, Jester) while keeping the original message. Think of it as "what would Yoda say?" vs. "what would Han Solo say?" for the same idea.

How is it done today, and what are the limits of current practice?

Existing style transfer methods focus on simple attributes (formal vs. casual, positive vs. negative) but struggle with complex personality traits. Current LLMs can generate text in different styles but lack fine-grained control and often fail to preserve semantic content. No existing work addresses multi-way personality transfer with evaluation that captures both style accuracy and meaning preservation.

Who cares? If you are successful, what difference will it make?

This enables personality-aware dialogue systems for creative writing tools, game NPCs with consistent character voices, and personalized chatbots. More broadly, it advances controllable text generation—a key challenge in making LLMs useful for specific applications requiring precise stylistic control.


Approach

What did you do exactly? How did you solve the problem?

Stage 1: Archetype Classifier

We fine-tuned DeBERTa-v3-base on 106,000 labeled movie dialogues. Initial labeling with GPT-4 yielded only 20% classifier accuracy due to noisy labels. We therefore developed a diagnostic-driven hybrid relabeling strategy: per-class diagnostics identified the weakest archetypes, and only those classes were relabeled, raising accuracy to 38%.
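The diagnostic step can be sketched as follows. This is a minimal pure-Python illustration; the 0.30 threshold and the helper names `per_class_f1` and `weak_classes` are illustrative, not the project's actual values.

```python
def per_class_f1(y_true, y_pred, labels):
    """Per-class F1 from gold and predicted archetype labels."""
    scores = {}
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        scores[c] = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return scores

def weak_classes(f1_by_class, threshold=0.30):
    """Classes whose F1 falls below the threshold are queued for relabeling."""
    return [c for c, f1 in f1_by_class.items() if f1 < threshold]

# Toy diagnostic run on three predictions over two archetypes.
f1 = per_class_f1(["Hero", "Jester", "Hero"],
                  ["Hero", "Hero", "Jester"],
                  ["Hero", "Jester"])
print(weak_classes(f1))
```

In this scheme only the flagged classes are sent back through relabeling, which is what "hybrid" refers to: strong classes keep their original GPT-4 labels.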

Stage 2: Style Transfer Generator

We fine-tuned Qwen-2.5-14B-Instruct with LoRA (rank=32) on 14,000 synthetic training pairs. Each pair shows "Original archetype → Target archetype" transformations generated by GPT-4. We used FAISS retrieval to provide style examples at inference time, helping the model match target personality patterns.
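At inference time, the retrieved style examples are folded into the generator's prompt. The report does not spell out the prompt template, so the sketch below assumes a simple instruction format; `build_transfer_prompt` is a hypothetical helper.

```python
def build_transfer_prompt(dialogue, source, target, style_examples):
    """Assemble a style-transfer prompt from the input line, the source and
    target archetypes, and retrieved style examples (template is illustrative)."""
    shots = "\n".join(f"- {ex}" for ex in style_examples)
    return (
        f"Rewrite the line below in the voice of the {target} archetype, "
        f"preserving its meaning. It is currently spoken as a {source}.\n"
        f"Example {target} lines:\n{shots}\n"
        f"Line: {dialogue}\n"
        f"Rewritten:"
    )

prompt = build_transfer_prompt(
    "Why so serious?", "Jester", "Shadow",
    ["The dark is patient.", "Hope is a luxury you cannot afford."],
)
print(prompt)
```

The assembled prompt is then passed to the LoRA-adapted Qwen model for generation.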

Evaluation Framework

We developed multiple evaluation metrics, including a personality accuracy score (PAS) derived from the archetype classifier, BERTScore for semantic preservation, and a contextual coherence measure (see Table 2).
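As a concrete illustration, PAS can be computed as the classifier's softmax probability for the target archetype. This definition is our assumption for the sketch; the report does not give the exact formula.

```python
import math

def pas(logits, target_index):
    """Personality accuracy score (assumed definition): the archetype
    classifier's softmax probability assigned to the target archetype."""
    m = max(logits)                               # subtract max for stability
    exps = [math.exp(x - m) for x in logits]
    return exps[target_index] / sum(exps)

# Toy usage: 8 archetype logits, target archetype at index 2.
print(pas([0.1, -0.3, 2.0, 0.0, 0.4, -1.0, 0.2, 0.0], 2))
```

Under this definition, a weak classifier caps the achievable PAS, which is the measurement ceiling discussed in the Results section.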

What problems did you anticipate? What problems did you encounter?

Challenge 1: Noisy Labels. GPT-4 character-level labeling was inconsistent. Solution: Diagnostic-driven hybrid relabeling targeted specific weak classes.

Challenge 2: Limited Training Data. Real movie dialogue rarely shows clear personality transformations. Solution: Synthetic pair generation with GPT-4 provided 14K training examples.

Challenge 3: Evaluation Gap. Automatic metrics poorly correlated with human judgment. Solution: Comprehensive human evaluation revealed this methodological challenge.


Results

How did you measure success? What experiments were used?

Classifier Performance

Archetype    F1 Score          Test Samples
Innocent     0.471             3,860
Mentor       0.435             4,602
Shadow       0.349             1,177
Rebel        0.336             1,202
Trickster    0.310             1,357
Hero         0.309             1,084
Ally         0.302             1,983
Jester       0.288             631
Overall      37.9% accuracy    15,896

Table 1. Classifier performance after hybrid relabeling (accuracy improved from 20% to 37.9%)

Generation Performance

Metric                    Score    Target
Classification Accuracy   56%      Better than training (38%)
PAS (Mean)                0.435    > 0.35
BERTScore F1              0.875    > 0.70
Contextual Coherence      0.556    > 0.50

Table 2. Automatic evaluation metrics on 100 test samples

Human Evaluation Results

Dimension               Mean Rating          5/5 Ratings
Personality Accuracy    4.69 / 5 (σ=0.66)    79%
Semantic Preservation   4.72 / 5 (σ=0.59)    82%
Fluency                 4.65 / 5 (σ=0.89)    77%
Overall Quality         4.56 / 5 (σ=0.86)    74%

Table 3. Human ratings from 3 annotators on 100 samples (300 total ratings)

Did you succeed? What were the key findings?

Finding 1: Asymmetric Transformation Difficulty
Transformations TO Jester achieve 71% classification accuracy, while transformations FROM Jester achieve only 15%. Adding humor is easier than removing it—a novel pattern suggesting fundamental differences in additive vs. subtractive style features.
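The TO/FROM accuracies in Finding 1 can be computed from (source archetype, target archetype, predicted archetype) triples. The sketch below uses a hypothetical helper, `directional_accuracy`, and toy records rather than the project's actual outputs.

```python
def directional_accuracy(records, archetype, direction="to"):
    """Classification accuracy restricted to transformations to (or from)
    a given archetype. records: (source, target, predicted) triples."""
    key = 1 if direction == "to" else 0          # target field vs. source field
    subset = [r for r in records if r[key] == archetype]
    if not subset:
        return 0.0
    return sum(1 for s, t, p in subset if p == t) / len(subset)

# Toy records: two transformations into Jester, one out of Jester.
records = [
    ("Shadow", "Jester", "Jester"),   # correctly classified as Jester
    ("Hero",   "Jester", "Hero"),     # misclassified
    ("Jester", "Shadow", "Hero"),     # misclassified
]
print(directional_accuracy(records, "Jester", "to"))
print(directional_accuracy(records, "Jester", "from"))
```

Splitting accuracy by direction in this way is what exposes the additive-vs-subtractive asymmetry.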
Finding 2: Evaluation Challenge
Weak correlation between automatic metrics and human ratings (PAS ↔ Personality: r=0.241). This reveals that current automatic metrics inadequately capture subjective personality perception. The 38% classifier creates a measurement ceiling.
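The reported correlation (r=0.241) is a standard Pearson coefficient between automatic scores and human ratings; a minimal sketch of the computation, with toy data in place of the study's scores:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two paired score lists,
    e.g. PAS values and human personality ratings."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy usage: perfectly linearly related scores give r = 1.0.
print(pearson_r([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```

A value of 0.241 on such paired scores indicates that the automatic metric explains little of the variance in human judgments.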
Finding 3: LLM-Bootstrapped Data Works
Synthetic training pairs from GPT-4 enable effective style transfer without expensive human annotation. Generated samples achieve 56% classification accuracy vs. 38% on training data, suggesting clearer personality signals.

Example Transformations

High-Quality Transformations

Shadow → Jester (PAS: 0.919)
Original: "He was very easy to hate. If you'll look in your mirror, when you are in a better mood..."
Generated: "Oh, he's like a walking punchline! Just take a peek in that mirror when you're feeling a bit cheerier..."
Ally → Hero (PAS: 0.847)
Original: "We need to stick together on this."
Generated: "I'll stand with you, no matter what we face. Together, we can overcome this."
Mentor → Innocent (PAS: 0.812)
Original: "Experience has taught me to be cautious in these situations."
Generated: "I think everything will work out! We just have to believe and stay positive."

Challenging Transformations

Jester → Shadow (PAS: 0.102)
Original: "Why so serious? Let's put a smile on that face!"
Generated: "Your optimism is misplaced. This situation demands gravity, not levity."
Note: The generator successfully removed the humor, but the classifier struggled to recognize the darker tone, depressing the PAS.

Key Findings & Discussion

1. Granularity-Performance Trade-off

Earlier 4-archetype experiment: 48.9% accuracy, 0.493 PAS
Current 8-archetype system: 38% accuracy, 0.435 PAS

Finer-grained personality distinctions reduce classification accuracy but increase practical utility. The 8-way system captures subtle differences (Hero vs. Ally vs. Innocent) that 4-way clustering misses.

2. Generated Text Has Clearer Personality

Generated samples achieve 56% classification accuracy vs. 38% on training data. This suggests the generator produces stronger personality signals than authentic movie dialogue—likely because GPT training pairs exaggerate archetype characteristics for clarity. While good for evaluation, this may limit naturalness in some applications.

3. Evaluation Challenges

Automatic metrics correlate only weakly with human ratings (e.g., PAS vs. human personality accuracy: r=0.241).

Why? The 38% classifier creates a PAS measurement ceiling (~0.5 max). Humans evaluate holistically while PAS relies on specific features. BERTScore captures surface similarity but not meaningful semantic equivalence.

Implication: Human evaluation is essential for personality-based style transfer. Current automatic metrics are insufficient proxies for perceptual quality.


Limitations & Future Work

How easily are your results able to be reproduced by others?

Fully reproducible with the provided code, data, and model weights. The dataset (106K dialogues), classifier, generator, and evaluation code are all publicly available. Training requires ~40 Colab compute units plus ~$10 in GPT-4 API costs.

What limitations does your model have? How can you extend your work?

Limitation                  Impact                                                    Future Direction
Classifier Accuracy (38%)   Creates a ~0.5 PAS ceiling, limiting measurable quality   Better labeling, larger models, multi-task learning
Movie Domain Bias           Trained only on movie dialogue; may not generalize        Multi-domain data (books, social media, transcripts)
Asymmetric Performance      Removing stylistic features is harder than adding them    Specialized models for subtractive transformations
English Only                No cross-lingual personality transfer                     Multilingual models, cross-lingual style transfer

Table 4. Key limitations and proposed solutions

Does your work have potential harm or risk to our society?

Risks: Could be misused to impersonate specific individuals or manipulate communication by mimicking trustworthy personas (e.g., mentor voice for scams). May perpetuate stereotypes if archetypes align with harmful characterizations.

Mitigations: Our system works with abstract archetypes, not individual identities. Results require human oversight before deployment. We recommend watermarking AI-generated personality-styled content and limiting use to creative/educational contexts with proper disclosure.


Conclusion

We demonstrated that personality-conditioned dialogue generation across eight archetypes is achievable with a two-stage classifier-generator pipeline. Key contributions include a diagnostic-driven hybrid relabeling strategy that raised classifier accuracy from 20% to 38%, an LLM-bootstrapped synthetic training pipeline for style transfer, and the identification of an asymmetric pattern in which adding stylistic features is easier than removing them.

Human evaluation (4.69/5 personality accuracy) validates that the system produces high-quality, personality-consistent transformations despite relatively modest automatic metrics. This work advances controllable generation for narrative AI systems and highlights critical evaluation challenges in subjective style transfer tasks.


Acknowledgments

We thank the course instructors for feedback on dataset size that led to our improved hybrid relabeling approach. We acknowledge the use of GPT-4 for data labeling and synthetic pair generation, and Google Colab for computational resources.