Data Pipeline Architecture
Bronze, Silver, Gold medallion pattern for knowledge ingestion
Bronze Data Pipeline
🥉
Raw Data Ingestion
Scheduled collection from multiple data sources
Data Sources & Scheduled Jobs
GitLab
- data-atlas-terraform repo
- Documentation
- MR summaries
Databricks
- Knowledge base articles
- Official documentation
Confluence
- Everything under the page "Interested in Data? Start here!"
Slack
- 9 support channels
- Thread classification
- Validated solutions
Pipeline Characteristics
Frequency
Customizable scheduled runs
Format
Raw, unprocessed data
Storage
UC Volume
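A minimal sketch of the bronze landing step, assuming each scheduled job writes one untouched payload per source per run. The function name `land_raw`, the file name, and the source/date directory layout are hypothetical, and a temporary directory stands in for the UC Volume root so the example runs anywhere.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def land_raw(volume_root: Path, source: str, name: str, payload: bytes,
             run_time: datetime) -> Path:
    """Write one raw payload unchanged, partitioned by source and run date."""
    target_dir = volume_root / source / run_time.strftime("%Y-%m-%d")
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / name
    target.write_bytes(payload)  # no parsing or cleaning at the bronze layer
    return target


# Assumption: a temporary directory stands in for the volume root.
root = Path(tempfile.mkdtemp())
run = datetime(2024, 1, 15, tzinfo=timezone.utc)
raw = json.dumps({"channel": "support", "text": "How do I get access?"}).encode()
path = land_raw(root, "slack", "thread-0001.json", raw, run)
```

Keeping the bytes exactly as received means a later change to the silver parsers can be replayed against the same raw inputs.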
Silver Data Pipeline
🥈
Data Processing & Cleaning
Transformation and semantic chunking
Processing Steps
Semantic Chunking
- Detects semantic boundaries
- Uses source-specific parsers
- Maintains semantic meaning
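The chunking idea can be sketched with structural cues alone. This is an illustration under stated assumptions, not the pipeline's actual parser: headings and blank lines stand in for semantic boundaries, and the function name `chunk_by_structure` and the `max_chars` limit are hypothetical.

```python
import re


def chunk_by_structure(text: str, max_chars: int = 200) -> list[str]:
    """Split on blank lines, start a new chunk at every heading, and pack
    paragraphs together until max_chars would be exceeded."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    size = 0
    for para in paragraphs:
        is_heading = para.startswith("#")
        # Close the running chunk at a heading or when the size limit is hit.
        if current and (is_heading or size + len(para) > max_chars):
            chunks.append("\n\n".join(current))
            current, size = [], 0
        current.append(para)
        size += len(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks


doc = (
    "# Access\n\nRequest access in the portal.\n\nApproval takes a day.\n\n"
    "# Costs\n\nCosts are tagged per team."
)
chunks = chunk_by_structure(doc)
```

Each heading stays attached to the paragraphs beneath it, so a chunk never mixes text from two sections.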
OCR Processing
- Image-to-text extraction
- Confidence score stored
- Image context preserved
Code Block Detection
- Syntax preservation
- Multi-language code support
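One way to picture code block detection, assuming the source text marks code with triple-backtick fences: split the text into prose and code segments and keep every code body verbatim. The function name `split_code_blocks` and the segment dictionary shape are hypothetical.

```python
import re

# Matches a fence of three backticks, an optional language tag, the body,
# and the closing fence. The fence is written as `{3} to keep this sketch readable.
FENCE = re.compile(r"`{3}([\w+#-]*)\n(.*?)`{3}", re.DOTALL)


def split_code_blocks(text: str) -> list[dict]:
    """Return prose and code segments in document order; code is untouched."""
    segments: list[dict] = []
    pos = 0
    for match in FENCE.finditer(text):
        prose = text[pos:match.start()].strip()
        if prose:
            segments.append({"kind": "prose", "text": prose})
        segments.append({
            "kind": "code",
            "lang": match.group(1) or "unknown",
            "text": match.group(2),  # whitespace and syntax preserved as-is
        })
        pos = match.end()
    tail = text[pos:].strip()
    if tail:
        segments.append({"kind": "prose", "text": tail})
    return segments


fence = "`" * 3
sample = f"Run this:\n{fence}sql\nSELECT 1;\n{fence}\nThen check the result."
segments = split_code_blocks(sample)
```

Treating each code segment as an atomic unit keeps the chunker from cutting a snippet in half.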
File Hierarchy Analysis
- Document relationships
- Chunk relationships
- Document mapping
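The relationships listed above could be recorded per chunk roughly as follows. This is a sketch only: the `ChunkRecord` fields, the `link_chunks` helper, and the `path#index` ID scheme are hypothetical and not taken from the pipeline.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import Optional


@dataclass
class ChunkRecord:
    chunk_id: str
    doc_path: str                  # document the chunk came from
    parent_path: str               # folder that contains the document
    position: int                  # order of the chunk within the document
    prev_chunk_id: Optional[str]   # neighbouring chunks, None at the edges
    next_chunk_id: Optional[str]


def link_chunks(doc_path: str, n_chunks: int) -> list[ChunkRecord]:
    """Build linked records for the chunks of one document."""
    ids = [f"{doc_path}#{i}" for i in range(n_chunks)]
    parent = str(PurePosixPath(doc_path).parent)
    return [
        ChunkRecord(
            chunk_id=ids[i],
            doc_path=doc_path,
            parent_path=parent,
            position=i,
            prev_chunk_id=ids[i - 1] if i > 0 else None,
            next_chunk_id=ids[i + 1] if i < n_chunks - 1 else None,
        )
        for i in range(n_chunks)
    ]


records = link_chunks("docs/onboarding/access.md", 3)
```

With neighbour links stored, a retriever can pull the chunk before and after a hit to widen the context it returns.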
Processing Output
Format
Structured chunks
Storage
Delta Lake Silver tables
Quality
Cleaned & validated
Gold Data Pipeline
🥇
Vector Embeddings & Search
Production-ready vector search index
Embedding Process
Late Chunking
- Complete document context
- Chunk-level precision
- Enhanced retrieval quality
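Late chunking, as generally described, runs the embedding model over the whole document first and only then pools the token vectors inside each chunk span, so every chunk vector carries document-wide context. The sketch below assumes mean pooling and uses hand-written toy vectors in place of real model output; `late_chunk` and the span format are hypothetical.

```python
def late_chunk(token_vectors: list[list[float]],
               spans: list[tuple[int, int]]) -> list[list[float]]:
    """Mean-pool contextualised token vectors over each [start, end) span."""
    pooled = []
    for start, end in spans:
        window = token_vectors[start:end]
        if not window:
            raise ValueError(f"empty span: ({start}, {end})")
        dim = len(window[0])
        pooled.append([sum(vec[d] for vec in window) / len(window)
                       for d in range(dim)])
    return pooled


# Assumption: toy stand-in for the per-token output of a long-context model
# that has already seen the full document.
token_vectors = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0], [0.0, 6.0]]
spans = [(0, 2), (2, 5)]  # token ranges of two chunks
chunk_vectors = late_chunk(token_vectors, spans)
```

The contrast is with embedding each chunk in isolation, where a chunk that says "it" or "this table" loses the referent defined elsewhere in the document.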
gte-large-en-v1_5
- High-quality embeddings
- Optimized for English
- Large context window (8192 tokens)
Vector Search Index
- Semantic similarity
- Keyword matching
- Relevance ranking
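The source does not say how the semantic and keyword results are merged into one ranking. Reciprocal rank fusion is one common choice, shown here purely as an assumption; the function name, the constant `k = 60`, and the chunk IDs are illustrative.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked ID lists: each list adds 1 / (k + rank) per ID."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


semantic = ["chunk-a", "chunk-b", "chunk-c"]  # hypothetical vector-search order
keyword = ["chunk-b", "chunk-d", "chunk-a"]   # hypothetical keyword-search order
fused = reciprocal_rank_fusion([semantic, keyword])
```

Because it works on ranks instead of raw scores, this kind of fusion does not need the similarity and keyword scores to be on the same scale.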
Source Preservation
- Source file traceability
- Source URL traceability
Production Capabilities
Search Type
Hybrid semantic + keyword
Performance
Fast query response (MLflow)
Scale
Enterprise-ready index