Data Pipeline Architecture

Bronze, Silver, Gold medallion pattern for knowledge ingestion

Bronze Data Pipeline

🥉

Raw Data Ingestion

Scheduled collection from multiple data sources

Data Sources & Scheduled Jobs

GitLab
  • data-atlas-terraform repo
  • Documentation
  • MR summaries
Databricks
  • Knowledge base articles
  • Official documentation
Confluence
  • Everything under the page "Interested in Data? Start here!"
Slack
  • 9 support channels
  • Thread classification
  • Validated solutions
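
The sources above could be described as a small job registry. This is a minimal sketch, not the actual job configuration: the SourceJob class, the JOBS list, and every cron expression are hypothetical and only illustrate "customizable scheduled runs".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceJob:
    """One scheduled collection job (hypothetical structure)."""
    source: str               # short source key, e.g. "gitlab"
    targets: tuple[str, ...]  # what is collected from that source
    cron: str                 # assumed schedule; not taken from the slide


# Targets mirror the slide; the schedules are placeholders.
JOBS = [
    SourceJob("gitlab", ("data-atlas-terraform repo", "Documentation", "MR summaries"), "0 2 * * *"),
    SourceJob("databricks", ("Knowledge base articles", "Official documentation"), "0 3 * * 0"),
    SourceJob("confluence", ('Everything under "Interested in Data? Start here!"',), "0 4 * * *"),
    SourceJob("slack", ("9 support channels", "Thread classification", "Validated solutions"), "0 * * * *"),
]


def jobs_by_source(jobs: list[SourceJob]) -> dict[str, SourceJob]:
    """Index jobs by source key so a scheduler can look them up."""
    return {job.source: job for job in jobs}
```

Changing a schedule then means editing one cron string rather than touching collection code.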
Pipeline Characteristics

Frequency

Customizable scheduled runs

Format

Raw, unprocessed data

Storage

Unity Catalog (UC) volume

Silver Data Pipeline

🥈

Data Processing & Cleaning

Transformation and semantic chunking

Processing Steps

Semantic Chunking
  • Detects semantic boundaries
  • Source-specific parsers
  • Maintain semantic meaning
OCR Processing
  • Image to text extraction
  • Confidence score stored
  • Preserve image context
Code Block Detection
  • Syntax preservation
  • Multi-language code support
File Hierarchy Analysis
  • Document relationships
  • Chunk relationships
  • Document mapping
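
Boundary detection and code block preservation can be sketched together. This is an illustrative parser for Markdown-like text only, assuming blank lines and headings mark semantic boundaries; the real source-specific parsers are not described on the slide. The chunk_text function and the max_chars default are hypothetical.

```python
FENCE = "`" * 3  # Markdown code fence marker


def chunk_text(text: str, max_chars: int = 500) -> list[str]:
    """Split on blank lines and headings, never inside a fenced code block."""
    blocks: list[str] = []
    current: list[str] = []
    in_fence = False
    for line in text.splitlines():
        if line.strip().startswith(FENCE):
            in_fence = not in_fence          # entering or leaving a code block
            current.append(line)
            continue
        if not in_fence and line.strip() == "":
            if current:                      # blank line closes a block
                blocks.append("\n".join(current))
                current = []
            continue
        if not in_fence and line.startswith("#") and current:
            blocks.append("\n".join(current))  # heading starts a new block
            current = []
        current.append(line)
    if current:
        blocks.append("\n".join(current))

    # Merge neighbouring blocks up to max_chars; a heading always opens a new chunk.
    chunks: list[str] = []
    for block in blocks:
        fits = bool(chunks) and len(chunks[-1]) + len(block) + 2 <= max_chars
        if fits and not block.startswith("#"):
            chunks[-1] += "\n\n" + block
        else:
            chunks.append(block)
    return chunks
```

A block longer than max_chars is kept whole rather than cut, so code stays syntactically intact at the cost of an occasional oversized chunk.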
Processing Output

Format

Structured chunks

Storage

Delta Lake Silver tables

Quality

Cleaned & validated
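
One possible shape for a structured chunk row, sketched under assumptions: the field names, the chunk ID format, and the 0.6 OCR threshold are all hypothetical. It shows how document mapping, chunk relationships, and the stored OCR confidence could sit in one record.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Chunk:
    chunk_id: str
    document_id: str                        # document mapping
    text: str
    source_url: str                         # kept for traceability
    prev_chunk_id: Optional[str] = None     # chunk relationships
    next_chunk_id: Optional[str] = None
    ocr_confidence: Optional[float] = None  # set only for OCR-derived text


def link_chunks(document_id: str, source_url: str, texts: list[str]) -> list[Chunk]:
    """Create chunks for one document and link each to its neighbours."""
    chunks = [Chunk(f"{document_id}#{i}", document_id, text, source_url)
              for i, text in enumerate(texts)]
    for left, right in zip(chunks, chunks[1:]):
        left.next_chunk_id = right.chunk_id
        right.prev_chunk_id = left.chunk_id
    return chunks


def is_valid(chunk: Chunk, min_ocr_confidence: float = 0.6) -> bool:
    """Reject empty chunks and OCR text below the assumed threshold."""
    if not chunk.text.strip():
        return False
    if chunk.ocr_confidence is not None and chunk.ocr_confidence < min_ocr_confidence:
        return False
    return True
```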

Gold Data Pipeline

🥇

Vector Embeddings & Search

Production-ready vector search index

Embedding Process

Late Chunking
  • Complete document context
  • Chunk-level precision
  • Enhanced retrieval quality
gte-large-en-v1.5
  • High-quality embeddings
  • Optimized for English
  • Large context window (8192 tokens)
Vector Search Index
  • Semantic similarity
  • Keyword matching
  • Relevance ranking
Source Preservation
  • Source file traceability
  • Source URL traceability
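
Late chunking is usually described as embedding the whole document once at token level, then pooling the token vectors that fall inside each chunk. The sketch below covers only the pooling step and assumes mean pooling; the model call is omitted, and late_chunk_pool with its span format is hypothetical.

```python
def late_chunk_pool(token_embeddings: list[list[float]],
                    spans: list[tuple[int, int]]) -> list[list[float]]:
    """Mean-pool token vectors per chunk span (start inclusive, end exclusive).

    Each token vector was computed with the full document in context, so
    every chunk embedding carries document-level context.
    """
    pooled = []
    for start, end in spans:
        window = token_embeddings[start:end]
        if not window:
            raise ValueError(f"empty span: ({start}, {end})")
        dim = len(window[0])
        pooled.append([sum(vec[d] for vec in window) / len(window)
                       for d in range(dim)])
    return pooled


if __name__ == "__main__":
    # Toy 2-dimensional "token embeddings" for a 4-token document.
    tokens = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
    print(late_chunk_pool(tokens, [(0, 2), (2, 4)]))
```

The span boundaries would come from the Silver chunks mapped to token offsets, which is why a long context window matters: the whole document has to fit in one pass.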
Production Capabilities

Search Type

Hybrid semantic + keyword

Performance

Fast query response (MLflow)

Scale

Enterprise-ready index
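
Hybrid search needs a way to merge the semantic and keyword result lists. The slide does not say which method the index uses; reciprocal rank fusion (RRF) is one common choice and is shown here purely as an illustration. The function name and the k=60 constant are assumptions.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked ID lists: each list adds 1 / (k + rank) to an ID's score."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest score first; ties are broken by ID for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


if __name__ == "__main__":
    semantic = ["c1", "c2", "c3"]   # ranked by vector similarity
    keyword = ["c2", "c4", "c1"]    # ranked by keyword match
    print(reciprocal_rank_fusion([semantic, keyword]))
```

Chunks that appear in both lists rise to the top, which is the behavior a hybrid semantic + keyword search is meant to provide.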