
AI Agent for Web Content Ingestion

Receives a URL via webhook, scrapes the page with Firecrawl into clean markdown, and stores it as vector embeddings in Supabase pgvector to power a RAG chat.


Overview

End-to-end web content ingestion and retrieval powered by a self-hosted AI agent.

Accepts a URL via webhook, scrapes the page with Firecrawl into clean markdown, generates 1536-d vector embeddings with OpenAI, and stores the content with metadata in Supabase pgvector. Provides a chat interface that queries the ingested knowledge with Cohere reranking for higher retrieval quality, and answers strictly from the ingested sources.


Capabilities

What AI Agent for Web Content Ingestion does

Executes end-to-end ingestion and enables a knowledge-base chat against ingested content.

01

Receive a URL via webhook and validate it.

02

Normalize the URL and domain to a canonical form.

03

Check the URL against previously ingested sources and skip duplicates.

04

Ingest the page with Firecrawl and convert it to clean markdown.

05

Generate 1536-d vector embeddings with OpenAI and attach metadata.

06

Store the content and embeddings in Supabase pgvector.
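The six capabilities above compose into one pipeline. A minimal sketch in Python, with the external services (Firecrawl scraping, OpenAI embeddings, Supabase storage) replaced by hypothetical in-memory stand-ins so the control flow is visible:

```python
from urllib.parse import urlparse

# Hypothetical stand-ins for the external services; the real workflow
# calls the Firecrawl, OpenAI, and Supabase APIs instead.
def scrape_to_markdown(url: str) -> str:
    return f"# Content of {url}"

def embed(text: str) -> list[float]:
    return [0.0] * 1536  # real pipeline: OpenAI 1536-d embedding

ingested: set[str] = set()   # stand-in for the pgvector dedup lookup
store: list[dict] = []       # stand-in for the Supabase pgvector table

def ingest(url: str) -> str:
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        return "invalid"                              # 1. validate
    canonical = f"{parsed.scheme}://{parsed.netloc.lower()}{parsed.path}"  # 2. normalize
    if canonical in ingested:
        return "duplicate"                            # 3. deduplicate
    markdown = scrape_to_markdown(canonical)          # 4. scrape
    vector = embed(markdown)                          # 5. embed
    store.append({"url": canonical, "content": markdown,
                  "embedding": vector})               # 6. store
    ingested.add(canonical)
    return "ingested"
```

The function names and return codes here are illustrative only; in the shipped workflow each numbered step is an n8n node.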

Why you should use AI Agent for Web Content Ingestion

This AI agent replaces fragmented manual work with a predictable execution flow.

Before
Manual source addition is slow and error-prone.
Scraping results vary and metadata is inconsistent.
Without automatic deduplication, the same source gets ingested repeatedly.
Content and embeddings exist in separate systems, complicating retrieval.
RAG accuracy suffers without proper reranking.
After
URL-based ingestion happens automatically with a single POST.
Content is consistently scraped and stored with metadata.
Duplicates are detected and skipped automatically.
Embeddings and documents reside in a single vector store for fast queries.
Cohere reranking improves answer relevance and retrieval quality.
Process

How it works

A simple 3-step flow that non-technical users can follow.

Step 01

Ingestion Trigger

A webhook accepts a URL, validates and normalizes it, checks for duplicates, and launches Firecrawl to scrape the page into clean markdown.

Step 02

Embedding & Storage

OpenAI generates 1536-d vector embeddings, metadata is attached, and the content is stored in the Supabase pgvector store.
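Before embedding, long pages are typically split into overlapping chunks, and each chunk is stored as one row with its vector and source metadata. A sketch of that shape; the column names (content, embedding, metadata) mirror common pgvector document tables and are an assumption, as is the chunk size:

```python
def chunk_markdown(text: str, max_chars: int = 800, overlap: int = 100) -> list[str]:
    """Split scraped markdown into overlapping chunks sized for embedding."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        start += max_chars - overlap
    return chunks

def build_rows(url: str, markdown: str, embed) -> list[dict]:
    """One row per chunk: content, its 1536-d embedding, and source
    metadata used later for URL-filtered retrieval."""
    return [
        {
            "content": chunk,
            "embedding": embed(chunk),  # OpenAI embedding in the real flow
            "metadata": {"source_url": url, "chunk_index": i},
        }
        for i, chunk in enumerate(chunk_markdown(markdown))
    ]
```

Attaching the source URL and chunk index to every row is what makes the later per-source filtering and re-ingestion possible.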

Step 03

RAG Chat Retrieval

OpenRouter queries the vector store filtered by URL, Cohere reranks results, and the AI Agent responds using only ingested knowledge.


Example

Example workflow

A realistic sequence showing end-to-end ingestion and Q&A.

A product team posts a vendor docs URL to the ingestion webhook. Firecrawl scrapes the page and converts it into clean markdown; OpenAI creates 1536-d embeddings and stores them in Supabase pgvector with metadata. A team member asks a question in the chat; the AI Agent uses OpenRouter to query the ingested content, Cohere reranks candidates, and the answer is produced solely from the ingested material.
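Triggering the flow is a single POST. A sketch of building that request; the webhook path (/webhook/ingest) and body field (url) are assumptions — use whatever your n8n webhook node is configured with:

```python
import json

def build_ingest_request(base_url: str, page_url: str) -> tuple[str, dict, str]:
    """Assemble the endpoint, headers, and JSON body for the ingestion
    webhook. The path and field name below are illustrative defaults."""
    endpoint = f"{base_url.rstrip('/')}/webhook/ingest"
    headers = {"Content-Type": "application/json"}
    body = json.dumps({"url": page_url})
    return endpoint, headers, body
```

Send it with any HTTP client, for example requests.post(endpoint, headers=headers, data=body).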

AI Agent flow diagram: Document Extraction with Firecrawl API, Supabase pgvector, OpenAI, and OpenRouter.

Audience

Who can benefit

Roles that gain a centralized, searchable knowledge base.

✍️ Knowledge Manager

Need a centralized, accurate source of truth for teams.

💼 Data Engineer

Want an automated ingestion pipeline with deduplication.

🧠 Support Team Lead

Require fast, policy-backed answers from ingested docs.

Product Manager

Seek a self-service knowledge base for customers and teammates.

🎯 Compliance Officer

Need auditable sources and controlled retrieval.

📋 Developer

Want easy integration of a self-hosted RAG capability into apps.

Integrations

Key tools that power the AI agent’s ingestion and chat.

Firecrawl API

Scrapes pages and outputs clean markdown.

Supabase pgvector

Stores documents and 1536-d embeddings for fast retrieval.

OpenAI

Generates vector embeddings from scraped content.

OpenRouter

Drives the chat agent that queries the vector store.

Cohere

Reranks retrieved results to improve final answers.

n8n

Orchestrates the webhook, ingestion, and chat workflow.

Applications

Best use cases

Six practical scenarios for a self-hosted ingestion + RAG setup.

Build a self-hosted knowledge base from vendor or product docs.
Ingest internal policies and compliance docs for Q&A.
Create a developer portal with doc-backed answers.
Power customer support with a knowledge-grounded agent.
Centralize engineering docs for internal search and onboarding.
Ingest vendor manuals and SLA docs for procurement questions.

FAQ

FAQ

Common concerns about self-hosted ingestion and retrieval.

What is this AI agent?
This AI agent is a two-part workflow that ingests URLs via webhook, scrapes content with Firecrawl, stores embeddings in Supabase pgvector, and provides a chat interface powered by OpenRouter with Cohere reranking. It runs entirely within your stack and uses your credentials. Ingestion occurs automatically when a URL is posted, and the chat replies are constrained to the ingested content. You can customize the sources, embeddings, and reranking to fit your data model.

Does it run entirely in my own environment?
Yes. The ingestion pipeline runs within your environment (e.g., n8n or your own server). You provide the Firecrawl, OpenAI, OpenRouter, and Cohere credentials. The vector store is a Supabase pgvector instance under your control. Access can be restricted by your authentication layer, and data never leaves your infrastructure unless you configure it to.

How does deduplication work?
The AI agent checks a deduplication flag in the ingestion process: if the URL has already been ingested, the pipeline skips embedding generation and storage for that source. It matches by normalized domain and source URL metadata, ensuring you don’t create conflicting entries. This keeps your knowledge base clean and prevents repeated results in search. You can adjust the deduplication logic to fit your use case.
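The dedup check described above can be pictured as a match on normalized source-URL metadata. A sketch against an in-memory list; the real workflow runs the equivalent as a query against the pgvector table, and the exact normalization rules are an assumption:

```python
def is_duplicate(candidate_url: str, existing_metadata: list[dict]) -> bool:
    """Compare a normalized candidate URL against the source_url
    metadata of rows already in the store."""
    key = candidate_url.lower().rstrip("/")
    return any(m.get("source_url", "").lower().rstrip("/") == key
               for m in existing_metadata)
```

When this returns True, the pipeline skips scraping, embedding, and storage for that source.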

Can I add new sources at any time?
Yes. You can point the webhook at any publicly accessible URL. The agent will scrape the page using Firecrawl, convert it to markdown, and index it as a new document with embeddings. You can add as many sources as you need and rely on the automatic deduplication to avoid duplicates. The system stores URLs and content with metadata to support targeted retrieval.

How is retrieval quality kept high?
Retrieval quality is improved by two components: first, high-quality embeddings from OpenAI; second, a Cohere reranker that orders candidates before answering. The vector store is filtered by source or metadata, so results stay relevant to the ingested material. The chat answers are constrained to the ingested content, avoiding external or unindexed data. You can tune the embedding model and reranker settings to fit your domain.

What happens when a source page changes?
If a source changes, you can re-ingest the updated URL through the webhook. The pipeline will re-fetch, replace or append content as needed, and refresh embeddings accordingly. Deduplication ensures the history remains consistent while new or updated sections become available to the chat. Depending on your setup, you may retain historical versions for audit trails.

How do I monitor the workflow?
Monitoring relies on logs from the webhook, ingestion steps, and the chat pipeline. You can check deduplication status, page parsing results, and embedding generation outcomes. If the chat returns unexpected answers, you can inspect which documents contributed to the retrieval and adjust filters. The self-hosted setup allows you to instrument additional alerts and dashboards as needed.

