What did you try to do?
We built a system that transforms movie dialogue to match different character personality archetypes (like Hero, Mentor, Trickster) while preserving the original meaning.
How did you do it?
Our two-stage pipeline uses (1) a DeBERTa classifier to identify archetypes and (2) a LoRA-fine-tuned Qwen-14B generator trained on GPT-generated synthetic pairs. We improved classifier accuracy from 20% to 38% through diagnostic-driven hybrid relabeling.
What did you find?
Human evaluators rated outputs highly (4.69/5 personality accuracy, 4.72/5 semantic preservation), but automatic metrics showed weak correlation with human judgment (r=0.241). We discovered that adding stylistic features (like humor) is much easier than removing them—a novel asymmetric pattern in style transfer.
Our system transforms dialogue across eight personality archetypes while preserving meaning. The two-stage pipeline combines archetype classification with retrieval-augmented generation.
Stage 1: DeBERTa classifier identifies personality archetypes from input dialogue
Stage 2: Qwen generator with LoRA adapters transforms dialogue to target archetype using FAISS retrieval for style examples
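The two stages above can be sketched as follows. This is an illustrative skeleton only: the stub classifier, retriever, and generator stand in for the real DeBERTa model, FAISS index, and LoRA-tuned Qwen generator, and every function name here is an assumption rather than the released implementation.

```python
# Illustrative two-stage pipeline sketch (stubs stand in for DeBERTa/FAISS/Qwen).

ARCHETYPES = ["Hero", "Mentor", "Ally", "Trickster",
              "Shadow", "Rebel", "Innocent", "Jester"]

def classify_archetype(line: str) -> str:
    """Stage 1 stand-in: a real system would run the DeBERTa classifier."""
    return "Hero" if "must" in line else "Jester"

def retrieve_style_examples(target: str, k: int = 3) -> list[str]:
    """Stand-in for FAISS retrieval of k exemplar lines for the target archetype."""
    bank = {a: [f"{a} example {i}" for i in range(5)] for a in ARCHETYPES}
    return bank[target][:k]

def transform(line: str, target: str) -> str:
    """Stage 2 stand-in: a real system would decode from the LoRA-tuned Qwen model."""
    source = classify_archetype(line)
    examples = retrieve_style_examples(target)
    prompt = (f"Rewrite ({source} -> {target}), keeping the meaning.\n"
              + "\n".join(examples)
              + f"\nInput: {line}")
    return prompt  # the generator would condition on this prompt

print(transform("We must reach the tower before dawn.", "Jester"))
```

The point of the sketch is the data flow: the source archetype from Stage 1 and the retrieved exemplars are both folded into the generation prompt.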
What did you try to do? What problem did you try to solve?
We tackled personality-conditioned dialogue generation: transforming text to match specific character archetypes (Hero, Mentor, Ally, Trickster, Shadow, Rebel, Innocent, Jester) while keeping the original message. Think of it as "what would Yoda say?" vs. "what would Han Solo say?" for the same idea.
How is it done today, and what are the limits of current practice?
Existing style transfer methods focus on simple attributes (formal vs. casual, positive vs. negative) but struggle with complex personality traits. Current LLMs can generate text in different styles but lack fine-grained control and often fail to preserve semantic content. No existing work addresses multi-way personality transfer with evaluation that captures both style accuracy and meaning preservation.
Who cares? If you are successful, what difference will it make?
This enables personality-aware dialogue systems for creative writing tools, game NPCs with consistent character voices, and personalized chatbots. More broadly, it advances controllable text generation—a key challenge in making LLMs useful for specific applications requiring precise stylistic control.
What did you do exactly? How did you solve the problem?
We fine-tuned DeBERTa-v3-base on 106,000 labeled movie dialogues. A classifier trained on the initial GPT-4 labels reached only 20% accuracy because of label noise. We therefore developed a diagnostic-driven hybrid relabeling strategy: per-class diagnostics identified the weakest archetypes, and their examples were selectively relabeled, raising accuracy to 38%.
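One way the diagnostic step can be implemented is to compute per-class F1 on a held-out set and flag classes below a threshold for relabeling. The helper names and the 0.35 threshold below are illustrative assumptions, not the project's exact procedure.

```python
def per_class_f1(y_true, y_pred, classes):
    """Per-class F1 from parallel label lists (no external dependencies)."""
    scores = {}
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        scores[c] = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return scores

def classes_to_relabel(scores, threshold=0.35):
    """Diagnostic step: flag weak classes whose labels should be revisited."""
    return sorted(c for c, f1 in scores.items() if f1 < threshold)

# Toy example: the classifier confuses Jester badly, so it gets flagged.
y_true = ["Hero", "Hero", "Jester", "Mentor", "Jester", "Mentor"]
y_pred = ["Hero", "Jester", "Hero", "Mentor", "Hero", "Mentor"]
scores = per_class_f1(y_true, y_pred, ["Hero", "Jester", "Mentor"])
```

Running the diagnostic on real validation predictions would single out the low-F1 archetypes (e.g. Jester at 0.288 in the results below) as relabeling targets.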
We fine-tuned Qwen-2.5-14B-Instruct with LoRA (rank=32) on 14,000 synthetic training pairs. Each pair shows "Original archetype → Target archetype" transformations generated by GPT-4. We used FAISS retrieval to provide style examples at inference time, helping the model match target personality patterns.
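The retrieval step can be sketched with plain NumPy as a stand-in for FAISS: normalized inner-product search over an embedded bank of style exemplars is equivalent to cosine-similarity top-k. The embedding dimensions and exemplar texts here are toy assumptions.

```python
import numpy as np

# Toy embedded style bank standing in for a FAISS index over exemplar lines.
rng = np.random.default_rng(0)
exemplar_vecs = rng.normal(size=(100, 64)).astype("float32")
exemplar_text = [f"style example {i}" for i in range(100)]

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def retrieve(query_vec, k=3):
    """Cosine-similarity top-k, matching a normalized inner-product index."""
    sims = normalize(exemplar_vecs) @ normalize(query_vec)
    top = np.argsort(-sims)[:k]
    return [exemplar_text[i] for i in top]

query = exemplar_vecs[7]          # a query identical to exemplar 7
hits = retrieve(query, k=3)
```

At inference time the retrieved lines are inserted into the generator's prompt so the model can imitate concrete target-archetype phrasing rather than an abstract label.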
We developed multiple evaluation metrics: a classifier-based Personality Alignment Score (PAS), BERTScore for semantic preservation, a contextual coherence measure, and a structured human evaluation across four dimensions (personality accuracy, semantic preservation, fluency, overall quality).
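Since PAS depends on the archetype classifier, a plausible sketch (an assumed definition, not necessarily the exact one used) scores each output by the probability mass the classifier assigns to the intended target archetype:

```python
def personality_alignment_score(probs: dict, target: str) -> float:
    """PAS sketch (assumed definition): classifier probability assigned
    to the intended target archetype for a generated line."""
    return probs.get(target, 0.0)

# With a 38%-accurate classifier, even on-target outputs receive diluted
# probability mass, which caps the achievable mean PAS (the ~0.5 ceiling
# discussed in the results).
probs = {"Jester": 0.45, "Hero": 0.30, "Ally": 0.25}
pas = personality_alignment_score(probs, "Jester")
```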
What problems did you anticipate? What problems did you encounter?
Challenge 1: Noisy Labels. GPT-4 character-level labeling was inconsistent. Solution: Diagnostic-driven hybrid relabeling targeted specific weak classes.
Challenge 2: Limited Training Data. Real movie dialogue rarely shows clear personality transformations. Solution: Synthetic pair generation with GPT-4 provided 14K training examples.
Challenge 3: Evaluation Gap. Automatic metrics poorly correlated with human judgment. Solution: Comprehensive human evaluation revealed this methodological challenge.
How did you measure success? What experiments were used?
Classifier performance on the held-out test set:

| Archetype | F1 Score | Test Samples |
|---|---|---|
| Innocent | 0.471 | 3,860 |
| Mentor | 0.435 | 4,602 |
| Shadow | 0.349 | 1,177 |
| Rebel | 0.336 | 1,202 |
| Trickster | 0.310 | 1,357 |
| Hero | 0.309 | 1,084 |
| Ally | 0.302 | 1,983 |
| Jester | 0.288 | 631 |
| Overall accuracy | 37.9% | 15,896 |
Automatic evaluation of generated transformations:

| Metric | Score | Target |
|---|---|---|
| Classification Accuracy | 56% | Better than training (38%) |
| PAS (Mean) | 0.435 | > 0.35 |
| BERTScore F1 | 0.875 | > 0.70 |
| Contextual Coherence | 0.556 | > 0.50 |
Human evaluation results:

| Dimension | Mean Rating | 5/5 Ratings |
|---|---|---|
| Personality Accuracy | 4.69 / 5 (σ=0.66) | 79% |
| Semantic Preservation | 4.72 / 5 (σ=0.59) | 82% |
| Fluency | 4.65 / 5 (σ=0.89) | 77% |
| Overall Quality | 4.56 / 5 (σ=0.86) | 74% |
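The statistics in the table above can be reproduced from raw Likert ratings with the standard library; the toy ratings below are invented for illustration, and we assume σ denotes the population standard deviation.

```python
import statistics

def summarize(ratings):
    """Aggregate 1-5 Likert ratings: mean, population sigma, share of 5s."""
    mean = statistics.mean(ratings)
    sigma = statistics.pstdev(ratings)
    top = sum(r == 5 for r in ratings) / len(ratings)
    return round(mean, 2), round(sigma, 2), round(top, 2)

ratings = [5, 5, 5, 4, 5, 3, 5, 4]   # toy annotator scores for one dimension
mean, sigma, top = summarize(ratings)
```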
Did you succeed? What were the key findings?
Earlier 4-archetype experiment: 48.9% accuracy, 0.493 PAS
Current 8-archetype system: 38% accuracy, 0.435 PAS
Finer-grained personality distinctions reduce classification accuracy but increase practical utility. The 8-way system captures subtle differences (Hero vs. Ally vs. Innocent) that 4-way clustering misses.
Generated samples achieve 56% classification accuracy vs. 38% on training data. This suggests the generator produces stronger personality signals than authentic movie dialogue—likely because GPT training pairs exaggerate archetype characteristics for clarity. While good for evaluation, this may limit naturalness in some applications.
Automatic metrics correlated only weakly with human ratings (overall r = 0.241).
Why? The 38% classifier creates a PAS measurement ceiling (~0.5 max). Humans evaluate holistically while PAS relies on specific features. BERTScore captures surface similarity but not meaningful semantic equivalence.
Implication: Human evaluation is essential for personality-based style transfer. Current automatic metrics are insufficient proxies for perceptual quality.
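The correlation analysis itself is a plain Pearson r between paired automatic and human scores; the paired toy values below are invented to show the computation, not the study's data.

```python
import statistics

def pearson_r(xs, ys):
    """Pearson correlation between an automatic metric and human ratings."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Toy pairs: an automatic metric can vary in a narrow band around surface
# similarity while human judgments swing widely, yielding a weak r.
auto = [0.80, 0.85, 0.90, 0.82, 0.88]
human = [5, 3, 4, 5, 2]
r = pearson_r(auto, human)
```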
How easily are your results able to be reproduced by others?
Fully reproducible with provided code, data, and model weights. Dataset (106K dialogues), classifier, generator, and evaluation code are all publicly available. Training requires ~40 Colab compute units (~$10 GPT API costs).
What limitations does your model have? How can you extend your work?

| Limitation | Impact | Future Direction |
|---|---|---|
| Classifier accuracy (38%) | Creates a ~0.5 PAS ceiling that limits measurable quality | Better labeling, larger models, multi-task learning |
| Movie domain bias | Trained only on movie dialogue; may not generalize | Multi-domain data (books, social media, transcripts) |
| Asymmetric performance | Removing stylistic features is harder than adding them | Specialized models for subtractive transformations |
| English only | No cross-lingual personality transfer | Multilingual models, cross-lingual style transfer |
Does your work have potential harm or risk to our society?
Risks: Could be misused to impersonate specific individuals or manipulate communication by mimicking trustworthy personas (e.g., mentor voice for scams). May perpetuate stereotypes if archetypes align with harmful characterizations.
Mitigations: Our system works with abstract archetypes, not individual identities. Results require human oversight before deployment. We recommend watermarking AI-generated personality-styled content and limiting use to creative/educational contexts with proper disclosure.
We demonstrated that personality-conditioned dialogue generation across eight archetypes is achievable with a two-stage classifier-generator pipeline. Key contributions include the diagnostic-driven hybrid relabeling strategy, the GPT-4-generated synthetic transformation pairs, and the evidence that current automatic metrics are weak proxies for human judgment in this setting.
Human evaluation (4.69/5 personality accuracy) validates that the system produces high-quality, personality-consistent transformations despite relatively modest automatic metrics. This work advances controllable generation for narrative AI systems and highlights critical evaluation challenges in subjective style transfer tasks.
We thank the course instructors for feedback on dataset size that led to our improved hybrid relabeling approach. We acknowledge the use of GPT-4 for data labeling and synthetic pair generation, and Google Colab for computational resources.