Hybrid RAG Architecture
The Hybrid RAG (Retrieval-Augmented Generation) system is a high-performance, serverless conversational AI platform designed to deliver accurate, context-aware responses by combining the power of Large Language Models (LLMs) with proprietary real-time retrieval mechanisms and deterministic workflow execution.
This document outlines the high-level architecture, request processing flow, and integration with the JavaScript Flow Engine (JSFE).
Core Architecture
The system is built on a serverless AWS infrastructure, prioritizing scalability, security, and low latency.
- Compute: AWS Lambda (Node.js) handling efficient asynchronous processing.
- Storage:
  - Amazon S3: Configuration files, flow definitions, and tool scripts.
  - Amazon DynamoDB: Chat history, sessions, and audit logging.
- Vector Database: OpenSearch Service for hybrid search (Semantic + Keyword).
- AI Models: Amazon Bedrock & External LLM APIs for inference and embedding.
Advanced Reliability & Accuracy
To improve accuracy and reduce hallucinations, the system orchestrates multiple models from multiple vendors.
Multi-Model, Multi-Vendor Architecture
The system is not reliant on a single AI provider. Instead, it maintains active integrations with multiple frontier models:
- OpenAI: GPT-4o, GPT-5-mini
- Anthropic (via Amazon Bedrock): Claude 3.5/3.7 Sonnet
- Google: Gemini 1.5 Pro/Flash, Gemini 2.0
- Meta (via Amazon Bedrock): Llama 4 Scout/Maverick
- Groq: Mixtral, Llama 3.3 (for ultra-low latency)
- DeepSeek: DeepSeek Chat
This redundancy allows the system to route queries based on complexity, speed requirements, and model strengths.
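The routing idea can be sketched as a small lookup over a model table. This is a minimal illustration only: the `MODELS` table, its `tier` and `latency` fields, the model id strings, and the `pickModel` function are all hypothetical and do not describe the system's actual configuration.

```javascript
// Hypothetical routing table: the tiers and latency classes below are
// illustrative assumptions, not measured properties of these models.
const MODELS = [
  { id: "groq/llama-3.3", vendor: "groq", tier: 1, latency: "low" },
  { id: "openai/gpt-4o", vendor: "openai", tier: 2, latency: "medium" },
  { id: "anthropic/claude-3.7-sonnet", vendor: "anthropic", tier: 3, latency: "medium" },
];

// Pick the least capable model that satisfies the request.
// `complexity` is 1..3; `needsLowLatency` restricts the choice to fast models.
function pickModel({ complexity, needsLowLatency = false }, models = MODELS) {
  const candidates = models
    .filter((m) => m.tier >= complexity)
    .filter((m) => !needsLowLatency || m.latency === "low")
    .sort((a, b) => a.tier - b.tier);
  // Fall back to the most capable model when no candidate fits.
  if (candidates.length === 0) {
    return [...models].sort((a, b) => b.tier - a.tier)[0];
  }
  return candidates[0];
}
```

Choosing the cheapest adequate model first keeps simple queries fast, while the fallback ensures a hard query is never left without a model.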
Validated Response Generation & Hallucination Guard
The generation phase runs in a strict "Validation Loop" to prevent hallucinations:
- Parallel Generation: In critical scenarios, response candidates can be generated by multiple models simultaneously.
- Fact-Checking: The generated answer is cross-referenced against the retrieved context chunks.
- Automated Escalation: If the primary model's response fails validation (e.g., low confidence, missing citations), the system automatically escalates to a more capable "Intervention Model" from an alternate vendor.
- Parallel Web Search: For queries that depend on current, verifiable data, the system runs real-time Google Search queries in parallel with internal RAG. The search results are ranked together with internal documents so that the model is given the most authoritative available source.
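The escalation step of the loop above can be sketched as follows. This is an assumption-laden illustration: the `validateAnswer` check uses citation matching as a simple stand-in for fact-checking, and the `{ name, generate }` model shape and the answer shape `{ text, citations }` are hypothetical.

```javascript
// Hypothetical validator: accepts an answer only if it cites at least one
// retrieved chunk and every cited id exists in the retrieved context.
function validateAnswer(answer, chunks) {
  const known = new Set(chunks.map((c) => c.id));
  const cited = answer.citations || [];
  return cited.length > 0 && cited.every((id) => known.has(id));
}

// Try each model in order (primary first, then intervention models) and
// return the first answer that passes validation.
async function generateValidated(question, chunks, models) {
  for (const model of models) {
    const answer = await model.generate(question, chunks);
    if (validateAnswer(answer, chunks)) {
      return { ...answer, model: model.name, validated: true };
    }
  }
  // No model produced a grounded answer; the caller decides how to respond.
  return { text: null, citations: [], model: null, validated: false };
}
```

Passing the models as an ordered list keeps the escalation policy in configuration rather than in code, so an intervention model from another vendor can be swapped in without changing the loop.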
Request Processing Pipeline
The handling of a user query follows a strict pipeline designed to maximize accuracy and minimize hallucinations.
JSFE Integration (JavaScript Flow Engine)
A critical differentiator of this architecture is its deep integration with JSFE (JavaScript Flow Engine). This allows the system to seamlessly switch between "Generative AI" mode and "Deterministic Workflow" mode.
Initialization & Loading
When the Lambda container initializes, it dynamically loads the jsfe module and fetches the necessary flow definitions from S3 based on the active configuration.
- System Flows: Verified, global flows common across deployments.
- Shopify Flows: (Optional) E-commerce-specific flows, loaded only if the Shopify module is enabled.
- Tools Registry: JavaScript functions callable by the engine.
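A minimal sketch of this initialization step, assuming flows are stored as JSON arrays and cached at module scope so that warm invocations of the same Lambda container skip the fetch. The `fetchJson` helper, the `shopifyEnabled` flag, and the object keys are hypothetical names introduced for illustration.

```javascript
// Module-scope cache: survives across warm invocations of the same
// Lambda container, so flows are fetched once per cold start.
let cachedFlows = null;

// `fetchJson(key)` is a hypothetical helper that reads one JSON array
// from S3; the object keys below are illustrative, not the real layout.
async function loadFlows(config, fetchJson) {
  if (cachedFlows) return cachedFlows;
  const keys = ["flows/system.json"];
  if (config.shopifyEnabled) keys.push("flows/shopify.json");
  // Fetch all flow groups concurrently, then merge them into one list.
  const groups = await Promise.all(keys.map((key) => fetchJson(key)));
  cachedFlows = groups.flat();
  return cachedFlows;
}
```

Injecting `fetchJson` keeps the sketch independent of any particular S3 client and makes the loader easy to test with a stub.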
Runtime Interaction (getEngineResponse)
Before any Large Language Model is invoked to answer a query, the system consults the Workflow Engine.
- Session Hydration: The user's persisted engine session is retrieved from DynamoDB.
- Activity Update: The user's input is passed to config.engine.updateActivity().
- Execution: JSFE evaluates the input against active flows, running state nodes, executing tools, or matching intents.
- Priority Handling:
  - If the Engine produces a text response (e.g., "Please enter your order number"), this response immediately preempts the RAG pipeline.
  - The system returns the Engine's output, ensuring deterministic handling of business logic (payments, data collection, authentication).
  - If the Engine returns null (no match/passive), the request proceeds to the Pre-Qualification LLM step.
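The engine-first priority rule can be sketched as below. The call shape `engine.updateActivity(input, session)` is an assumption (the real jsfe signature may differ), and `handleMessage` and `runRagPipeline` are hypothetical names standing in for the surrounding handler and the LLM steps.

```javascript
// Route one user message: the workflow engine is consulted first, and the
// LLM pipeline runs only when the engine has nothing to say.
// `engine.updateActivity(input, session)` is an assumed call shape; the
// real jsfe signature may differ. `runRagPipeline` is a hypothetical
// stand-in for the pre-qualification and retrieval steps.
async function handleMessage({ engine, session, input, runRagPipeline }) {
  const engineReply = await engine.updateActivity(input, session);
  if (engineReply !== null && engineReply !== undefined) {
    // Deterministic flow matched: its output preempts the RAG pipeline.
    return { source: "engine", text: engineReply };
  }
  // No match or passive engine: continue to the LLM steps.
  const text = await runRagPipeline(input, session);
  return { source: "rag", text };
}
```

Returning the `source` alongside the text lets downstream logging distinguish deterministic replies from generated ones.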
Context Sharing
Context is bi-directionally synced between the RAG system and JSFE:
- User Attributes: Collected data (Name, Email, Verified Status) is extracted from the Engine's userContext and stored in the global conversation context.
- Flow State: The conversation state is persisted automatically, allowing long-running workflows to span multiple chat sessions.
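The attribute hand-off from the Engine to the conversation context might look like the sketch below. The key names in `SHARED_KEYS`, the `userAttributes` field, and the `syncUserAttributes` function are illustrative assumptions; only `userContext` is named in this document.

```javascript
// Copy selected attributes from the engine's userContext into the global
// conversation context. The attribute names are illustrative assumptions.
const SHARED_KEYS = ["name", "email", "verified"];

function syncUserAttributes(engineSession, conversationContext) {
  const userContext = (engineSession && engineSession.userContext) || {};
  const attributes = { ...(conversationContext.userAttributes || {}) };
  for (const key of SHARED_KEYS) {
    // Only overwrite when the engine actually collected a value.
    if (userContext[key] !== undefined) attributes[key] = userContext[key];
  }
  // Return a new object so the caller's context is not mutated.
  return { ...conversationContext, userAttributes: attributes };
}
```

An explicit allow-list of keys avoids leaking engine-internal state into prompts that are later sent to an LLM.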
Request Pre-Processing
The system employs a Pre-Processor layer to analyze every incoming prompt before it reaches the computationally expensive RAG pipeline. This layer is designed to optimize the user experience by providing instant, low-latency responses for common interactions.
- Instant Greetings: Detects session initiation signals and immediately delivers a personalized welcome message in the configured agent language. The greeting respects the configured persona settings and bypasses the LLM entirely, so it incurs no model latency.
- Fast-Path Commands: Handles administrative signals like RESET: (to clear context) locally, ensuring a snappy interface response.
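A minimal sketch of such a pre-processor, assuming prefix-based signals. Only `RESET:` is named in this document; the `INIT:` session-start marker, the `GREETINGS` table, and the `preProcess` function are hypothetical.

```javascript
// Hypothetical signals: the "INIT:" session-start marker and the greeting
// table are assumptions; only "RESET:" is named in this document.
const GREETINGS = {
  en: "Hello! How can I help?",
  es: "¡Hola! ¿En qué puedo ayudarte?",
};

// Returns a fast-path result, or null when the prompt should continue
// to the regular pipeline.
function preProcess(prompt, { language = "en", context }) {
  const trimmed = prompt.trim();
  if (trimmed.startsWith("RESET:")) {
    context.history.length = 0; // clear conversation history in place
    return { handled: true, action: "reset", text: null };
  }
  if (trimmed.startsWith("INIT:")) {
    // Fall back to English when the configured language has no greeting.
    const text = GREETINGS[language] || GREETINGS.en;
    return { handled: true, action: "greeting", text };
  }
  return null;
}
```

Because both branches are plain string checks, they complete without any network call, which is what makes the fast path fast.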
Pre-Qualification & Guardrails
If JSFE returns no response for the request, a specialized "Gatekeeper" LLM step analyzes the prompt.
- Intent Recognition: Rephrases the user's query for optimal search retrieval.
- Safety Checks: Validates against injection attacks or policy violations.
- Heuristics: Determines if the query can be answered from chat history alone or requires fetching fresh knowledge.
Hybrid Search & Retrieval
When knowledge retrieval is necessary:
- Hybrid Search: Executes parallel queries against OpenSearch using both Dense Vector Embeddings (Semantic) and BM25 (Keyword) matching.
- Fusion: Results are normalized and fused to find the most pertinent document chunks.
- Context Construction: High-ranking chunks are assembled into a context window for the final generation step.
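The normalization and fusion step can be illustrated with min-max normalization followed by a weighted sum. This is one common fusion technique, shown as an application-side sketch; the document does not state which fusion method is used or where it runs, and `semanticWeight` and `topK` are hypothetical parameters.

```javascript
// Min-max normalize scores to [0, 1]; a constant list maps to all 1s.
function normalize(hits) {
  if (hits.length === 0) return [];
  const scores = hits.map((h) => h.score);
  const min = Math.min(...scores);
  const max = Math.max(...scores);
  return hits.map((h) => ({
    id: h.id,
    score: max === min ? 1 : (h.score - min) / (max - min),
  }));
}

// Fuse two result lists with a weighted sum of normalized scores.
// `semanticWeight` is a hypothetical tuning knob, not a documented setting.
function fuse(vectorHits, keywordHits, { semanticWeight = 0.5, topK = 3 } = {}) {
  const fused = new Map();
  const add = (hits, weight) => {
    for (const h of normalize(hits)) {
      fused.set(h.id, (fused.get(h.id) || 0) + weight * h.score);
    }
  };
  add(vectorHits, semanticWeight);
  add(keywordHits, 1 - semanticWeight);
  return [...fused.entries()]
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}
```

Normalizing first matters because cosine similarities and BM25 scores live on different scales; without it, the raw BM25 values would dominate the sum. A chunk that appears in both lists accumulates score from each, so documents that match both semantically and lexically rise to the top.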