Hybrid RAG Architecture

The Hybrid RAG (Retrieval-Augmented Generation) system is a serverless conversational AI platform. It combines Large Language Models (LLMs) with proprietary real-time retrieval and deterministic workflow execution to deliver accurate, context-aware responses.

This document outlines the high-level architecture, request processing flow, and integration with the JavaScript Flow Engine (JSFE).

Core Architecture

The system is built on a serverless AWS infrastructure, prioritizing scalability, security, and low latency.

  • Compute: AWS Lambda (Node.js) for asynchronous request processing.
  • Storage:
    • Amazon S3: Configuration files, flow definitions, and tool scripts.
    • Amazon DynamoDB: Chat history, sessions, and audit logging.
  • Vector Database: Amazon OpenSearch Service for hybrid search (Semantic + Keyword).
  • AI Models: Amazon Bedrock & External LLM APIs for inference and embedding.

Advanced Reliability & Accuracy

To maximize accuracy and guard against hallucinations, the system orchestrates multiple models from multiple vendors.

Multi-Model, Multi-Vendor Architecture

The system is not reliant on a single AI provider. Instead, it maintains active integrations with multiple frontier models:

  • OpenAI: GPT-4o, GPT-5-mini
  • Anthropic (via Amazon Bedrock): Claude 3.5/3.7 Sonnet
  • Google: Gemini 1.5 Pro/Flash, Gemini 2.0
  • Meta (via Amazon Bedrock): Llama 4 Scout/Maverick
  • Groq: Mixtral, Llama 3.3 (for ultra-low latency)
  • DeepSeek: DeepSeek Chat

This redundancy allows the system to route queries based on complexity, speed requirements, and model strengths.
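
As a rough illustration of this kind of routing, the sketch below picks a model tier from a complexity score and a latency flag. The tier names, the thresholds, the `complexity` score, and the assignment of models to tiers are all assumptions made for the example; the document does not describe the actual routing rules.

```javascript
// Hypothetical routing table: the tiers, thresholds, and the model assigned
// to each tier are illustrative assumptions, not the system's configuration.
const ROUTES = [
  { name: "low-latency", vendor: "Groq", model: "Llama 3.3", maxComplexity: 0.3 },
  { name: "general", vendor: "OpenAI", model: "GPT-4o", maxComplexity: 0.7 },
  { name: "complex", vendor: "Anthropic", model: "Claude 3.7 Sonnet", maxComplexity: 1.0 },
];

/**
 * Pick a route for a query.
 * `query.complexity` is a hypothetical score in [0, 1] produced upstream;
 * `query.needsLowLatency` forces the fastest tier regardless of complexity.
 */
function selectRoute(query) {
  if (query.needsLowLatency) {
    return ROUTES[0];
  }
  // First tier whose ceiling covers the query's complexity.
  const route = ROUTES.find((r) => query.complexity <= r.maxComplexity);
  // Out-of-range scores fall through to the most capable tier.
  return route ?? ROUTES[ROUTES.length - 1];
}

console.log(selectRoute({ complexity: 0.2 }).model);
console.log(selectRoute({ complexity: 0.9 }).model);
```

Keeping the table as data means a vendor outage or a pricing change can be handled by editing one entry rather than the routing logic.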

Validated Response Generation & Hallucination Guard

The generation phase runs in a strict "Validation Loop" to prevent hallucinations:

  1. Parallel Generation: In critical scenarios, response candidates can be generated by multiple models simultaneously.
  2. Fact-Checking: The generated answer is cross-referenced against the retrieved context chunks.
  3. Automated Escalation: If the primary model's response fails validation (e.g., low confidence, missing citations), the system automatically escalates to a more capable "Intervention Model" from an alternate vendor.
  4. Parallel Web Search: For queries that need current, verifiable data, the system runs real-time Google Search queries in parallel with internal RAG. The search results are ranked alongside internal documents so that the model receives the most authoritative sources.
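
The fact-checking and escalation steps above can be sketched as follows. This is a minimal illustration, not the system's validator: the word-overlap grounding check is a simple stand-in for whatever fact-checking method is actually used, the 0.8 threshold is arbitrary, and the `generate` functions are synchronous stubs standing in for real (asynchronous) model calls.

```javascript
/**
 * Crude grounding check: the fraction of answer sentences that share at
 * least `minOverlap` words of four or more characters with some retrieved
 * chunk. A stand-in for the real fact-checking step.
 */
function groundingScore(answer, chunks, minOverlap = 3) {
  const words = (text) => new Set(text.toLowerCase().match(/[a-z0-9]{4,}/g) ?? []);
  const chunkWords = chunks.map(words);
  const sentences = answer.split(/[.!?]+/).map((s) => s.trim()).filter(Boolean);
  if (sentences.length === 0) return 0;
  const supported = sentences.filter((sentence) => {
    const sentenceWords = words(sentence);
    return chunkWords.some((cw) => {
      let shared = 0;
      for (const w of sentenceWords) if (cw.has(w)) shared += 1;
      return shared >= minOverlap;
    });
  });
  return supported.length / sentences.length;
}

/**
 * Try each model in order and return the first answer that passes the check.
 * `models` is an ordered list of hypothetical { name, generate } objects:
 * the primary model first, then the intervention model(s).
 */
function generateValidated(question, chunks, models, threshold = 0.8) {
  for (const model of models) {
    const answer = model.generate(question, chunks);
    const score = groundingScore(answer, chunks);
    if (score >= threshold) {
      return { answer, model: model.name, score, validated: true };
    }
  }
  // No candidate was sufficiently grounded in the retrieved context.
  return { answer: null, model: null, score: 0, validated: false };
}

// Usage with stub models: the primary answer is unsupported, so the
// intervention model's answer is returned instead.
const demoChunks = ["Orders ship within two business days from the Austin warehouse."];
const demoModels = [
  { name: "primary", generate: () => "Refunds take ninety days to process." },
  { name: "intervention", generate: () => "Orders ship within two business days." },
];
console.log(generateValidated("When do orders ship?", demoChunks, demoModels).model);
```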

Request Processing Pipeline

The handling of a user query follows a strict pipeline designed to maximize accuracy and minimize hallucinations.

[Flowchart: request processing pipeline. Nodes, in order of appearance: User Request; Auth & Config; Load Context & History; JSFE Engine Active? (decision); Update Engine Activity; Engine Responded? (decision); Log Interaction; Return Response; Pre-Processor Layer; Fast Path Interaction (Greeting, Reset); Pre-Qualification (Guardrails & Intent); Valid & Safe? (decision); Return Rejection/Clarification; Requires Knowledge? (decision); Generate Direct Response; Hybrid Search (Vector + Keyword); Parallel Google Web Search; Re-Rank & Fuse Results; Generate Answer with Context.]

JSFE Integration (JavaScript Flow Engine)

A critical differentiator of this architecture is its deep integration with JSFE (JavaScript Flow Engine). This allows the system to seamlessly switch between "Generative AI" mode and "Deterministic Workflow" mode.

Initialization & Loading

When the Lambda container initializes, it dynamically loads the jsfe module and fetches the necessary flow definitions from S3 based on the active configuration.

  • System Flows: Verified, global flows common across deployments.
  • Shopify Flows: (Optional) E-commerce-specific flows, loaded if the Shopify module is enabled.
  • Tools Registry: JavaScript functions callable by the engine.
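
A minimal sketch of how the set of definitions to load might be derived from the active configuration. The S3 key layout, the `prefix` and `shopifyEnabled` config fields, and the function names are all hypothetical; only the three categories (system flows, optional Shopify flows, tools registry) come from the list above. The actual S3 fetch is left out so the example stays self-contained.

```javascript
/**
 * Decide which definition files to fetch at container start.
 * All key names and config fields here are hypothetical placeholders.
 */
function resolveFlowSources(config) {
  const sources = [
    { kind: "system-flows", key: `${config.prefix}/flows/system.json` },
    { kind: "tools", key: `${config.prefix}/tools/registry.json` },
  ];
  if (config.shopifyEnabled) {
    sources.push({ kind: "shopify-flows", key: `${config.prefix}/flows/shopify.json` });
  }
  return sources;
}

// Module-scope cache: in AWS Lambda, module-level state survives across
// warm invocations of the same container, so this runs once per container.
// Note that this simple version ignores later configuration changes.
let cachedSources = null;

function getFlowSources(config) {
  if (cachedSources === null) {
    cachedSources = resolveFlowSources(config);
  }
  return cachedSources;
}

console.log(getFlowSources({ prefix: "tenant-a", shopifyEnabled: true }).map((s) => s.kind));
```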

Runtime Interaction (getEngineResponse)

Before any Large Language Model is invoked to answer a query, the system consults the Workflow Engine.

  1. Session Hydration: The user's persisted engine session is retrieved from DynamoDB.
  2. Activity Update: The user's input is passed to config.engine.updateActivity().
  3. Execution: JSFE evaluates the input against active flows, running state nodes, executing tools, or matching intents.
  4. Priority Handling:
    • If the Engine produces a text response (e.g., "Please enter your order number"), this response immediately preempts the RAG pipeline.
    • The system returns the Engine's output, ensuring deterministic handling of business logic (payments, data collection, authentication).
    • If the Engine returns null (no match/passive), the request proceeds to the Pre-Qualification LLM step.
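
The engine-first ordering described above can be sketched as below. `config.engine.updateActivity()` is named in the text, but its arguments and return value are assumptions here (a string when the engine responds, `null` when it is passive), and it is treated as synchronous for brevity even though the real call may return a promise. `handleTurn`, `sessionStore`, and `runRagPipeline` are hypothetical names introduced for the example.

```javascript
/**
 * Engine-first dispatch: consult the workflow engine before any LLM call.
 * `sessionStore` (load/save) and `runRagPipeline` are hypothetical injected
 * helpers; in the real system the session lives in DynamoDB.
 */
function handleTurn(userId, input, config, sessionStore, runRagPipeline) {
  // 1. Session hydration.
  const session = sessionStore.load(userId);
  // 2-3. Activity update and flow execution.
  const engineReply = config.engine.updateActivity(input, session);
  // Persist whatever state the engine changed.
  sessionStore.save(userId, session);
  // 4. Priority handling: an engine response preempts the RAG pipeline.
  if (engineReply !== null && engineReply !== undefined) {
    return { source: "engine", text: engineReply };
  }
  return { source: "rag", text: runRagPipeline(input) };
}

// Stub engine for illustration: starts an order-status flow on "order".
const stubEngine = {
  updateActivity(input, session) {
    if (/order/i.test(input)) {
      session.activeFlow = "order-status";
      return "Please enter your order number";
    }
    return null;
  },
};

// In-memory stand-in for the persisted session table.
const sessions = new Map();
const stubStore = {
  load: (id) => sessions.get(id) ?? {},
  save: (id, session) => sessions.set(id, session),
};

const reply = handleTurn("u1", "Where is my order?", { engine: stubEngine }, stubStore, () => "rag answer");
console.log(reply.source, reply.text);
```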

Context Sharing

Context is bi-directionally synced between the RAG system and JSFE:

  • User Attributes: Collected data (Name, Email, Verified Status) is extracted from the Engine's userContext and stored in the global conversation context.
  • Flow State: The conversation state is persisted automatically, allowing long-running workflows to span multiple chat sessions.
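
One way the attribute sync could look, as a sketch only. `userContext` is named in the text; the field names (`name`, `email`, `verified`), the allow-list, and the merge policy (engine values overwrite, missing values are ignored) are assumptions for illustration.

```javascript
/**
 * Copy collected attributes from the engine's userContext into the
 * conversation context. Only allow-listed fields are copied, so
 * flow-internal data does not leak into the global context.
 */
function syncUserAttributes(conversationContext, userContext) {
  const allowed = ["name", "email", "verified"];
  const attributes = { ...(conversationContext.userAttributes ?? {}) };
  for (const field of allowed) {
    // Only overwrite when the engine actually collected a value.
    if (userContext[field] !== undefined && userContext[field] !== null) {
      attributes[field] = userContext[field];
    }
  }
  // Return a new object rather than mutating the caller's context.
  return { ...conversationContext, userAttributes: attributes };
}

const before = { history: [], userAttributes: { name: "Ada" } };
const after = syncUserAttributes(before, {
  email: "ada@example.com",
  verified: true,
  cardNumber: "not copied",
});
console.log(after.userAttributes);
```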

Request Pre-Processing

The system employs a Pre-Processor layer to analyze every incoming prompt before it reaches the computationally expensive RAG pipeline. This layer is designed to optimize the user experience by providing instant, low-latency responses for common interactions.

  • Instant Greetings: Detects session initiation signals and immediately delivers a personalized welcome message in the configured agent language, respecting the persona settings. The greeting bypasses the LLM, which avoids model latency entirely.
  • Fast-Path Commands: Handles administrative signals like RESET: (to clear context) locally, ensuring a snappy interface response.
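
A minimal sketch of such a fast path. The `RESET:` prefix comes from the text; treating an empty first message as the session-initiation signal, the greeting strings, and the `session` and `agentConfig` shapes are assumptions made for the example.

```javascript
// Hypothetical greeting table keyed by the agent's language code.
const GREETINGS = {
  en: "Hello! How can I help you today?",
  es: "¡Hola! ¿En qué puedo ayudarte hoy?",
};

/**
 * Returns a fast-path result, or null when the prompt should continue
 * down the normal pipeline. No LLM call is made on either fast path.
 */
function preProcess(prompt, session, agentConfig) {
  const text = prompt.trim();
  if (text.startsWith("RESET:")) {
    // Administrative command: clear the conversation context locally.
    session.history = [];
    return { kind: "reset", text: "" };
  }
  if (session.history.length === 0 && text === "") {
    // Assumed session-initiation signal: empty prompt on a fresh session.
    const greeting = GREETINGS[agentConfig.language] ?? GREETINGS.en;
    return { kind: "greeting", text: greeting };
  }
  return null;
}

const demoSession = { history: ["earlier message"] };
console.log(preProcess("RESET:", demoSession, { language: "en" }));
console.log(preProcess("", demoSession, { language: "es" }));
```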

Pre-Qualification & Guardrails

If JSFE does not handle the request (it returns no response), a specialized "Gatekeeper" LLM step analyzes the prompt.

  • Intent Recognition: Rephrases the user's query for optimal search retrieval.
  • Safety Checks: Validates against injection attacks or policy violations.
  • Heuristics: Determines if the query can be answered from chat history alone or requires fetching fresh knowledge.
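
To make the three outcomes concrete, the sketch below routes on a structured Gatekeeper verdict. The verdict shape (`safe`, `rewrittenQuery`, `needsKnowledge`, `reason`) is a hypothetical contract, for example JSON that the Gatekeeper LLM is prompted to emit; the document does not specify the real format.

```javascript
/**
 * Route on the Gatekeeper's verdict.
 * Returns one of three actions: "reject", "retrieve" (run hybrid search),
 * or "direct" (answer from chat history without retrieval).
 */
function routeAfterGatekeeper(rawVerdict) {
  let verdict;
  try {
    verdict = JSON.parse(rawVerdict);
  } catch {
    // Unparseable output is rejected rather than waved through.
    return { action: "reject", reason: "unparseable gatekeeper output" };
  }
  if (verdict === null || typeof verdict !== "object" || verdict.safe !== true) {
    return { action: "reject", reason: (verdict && verdict.reason) || "policy violation" };
  }
  if (verdict.needsKnowledge === true) {
    return { action: "retrieve", query: verdict.rewrittenQuery };
  }
  return { action: "direct", query: verdict.rewrittenQuery };
}

console.log(
  routeAfterGatekeeper(
    JSON.stringify({ safe: true, needsKnowledge: true, rewrittenQuery: "store return policy" })
  )
);
```

Failing closed on malformed output is a deliberate choice in this sketch: a safety check that cannot be read is treated as a failed check.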

Hybrid Search & Retrieval

When knowledge retrieval is necessary:

  1. Hybrid Search: Executes parallel queries against OpenSearch using both Dense Vector Embeddings (Semantic) and BM25 (Keyword) matching.
  2. Fusion: Results are normalized and fused to find the most pertinent document chunks.
  3. Context Construction: High-ranking chunks are assembled into a context window for the final generation step.
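
The fusion and context-construction steps can be sketched as follows. The document says only that results are "normalized and fused"; min-max normalization with a weighted sum is one common way to do that and is assumed here, as are the equal default weights, the hit shape (`id`, `text`, `score`), and the character-based context budget. The OpenSearch queries themselves are omitted; the inputs stand in for their results.

```javascript
/** Min-max normalize scores into [0, 1]; a constant-score list maps to 1. */
function normalize(hits) {
  if (hits.length === 0) return [];
  const scores = hits.map((h) => h.score);
  const min = Math.min(...scores);
  const max = Math.max(...scores);
  return hits.map((h) => ({
    ...h,
    score: max === min ? 1 : (h.score - min) / (max - min),
  }));
}

/** Weighted sum of normalized scores, keyed by chunk id, best first. */
function fuse(vectorHits, keywordHits, vectorWeight = 0.5) {
  const fused = new Map();
  const add = (hits, weight) => {
    for (const h of normalize(hits)) {
      const entry = fused.get(h.id) ?? { id: h.id, text: h.text, score: 0 };
      entry.score += weight * h.score;
      fused.set(h.id, entry);
    }
  };
  add(vectorHits, vectorWeight);
  add(keywordHits, 1 - vectorWeight);
  return [...fused.values()].sort((a, b) => b.score - a.score);
}

/** Take chunks in rank order until the character budget is spent. */
function buildContext(ranked, maxChars) {
  const parts = [];
  let used = 0;
  for (const chunk of ranked) {
    if (used + chunk.text.length > maxChars) break;
    parts.push(chunk.text);
    used += chunk.text.length;
  }
  return parts.join("\n---\n");
}

// Cosine-style scores and BM25 scores live on different scales, which is
// why each list is normalized before the two are combined.
const vectorHits = [
  { id: "a", text: "A", score: 0.9 },
  { id: "b", text: "B", score: 0.5 },
  { id: "c", text: "C", score: 0.1 },
];
const keywordHits = [
  { id: "b", text: "B", score: 12 },
  { id: "d", text: "D", score: 4 },
];
const ranked = fuse(vectorHits, keywordHits);
console.log(ranked.map((r) => r.id));
```

In this example chunk "b" ranks first because it appears in both result lists, which is the behavior hybrid search is meant to reward.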