**What this does** Receives a URL via webhook, uses Firecrawl to scrape the page into clean Markdown, and stores it as vector embeddings in Pinecone. A visual, self-hosted ingestion pipeline for RAG knowledge bases.
This AI agent automates the end-to-end ingestion of web pages: it receives a URL via webhook, scrapes the page to clean Markdown with Firecrawl, and stores the content as 1536-dim embeddings in Pinecone. It then exposes a chat interface that queries the vector store to answer questions, using Cohere reranking to improve retrieval quality. All steps run with your keys for Firecrawl, OpenAI, OpenRouter, Cohere, and Pinecone in a self-hosted environment.
A concrete ingestion and retrieval pipeline for building a queryable knowledge base from web content.
Receive a webhook payload containing a URL.
Validate and normalize the URL to ensure a valid domain.
Scrape the page with Firecrawl and convert it to clean Markdown.
Generate 1536-dimensional embeddings with an OpenAI embedding model.
Attach the source URL as metadata for traceability.
Index content and embeddings into Pinecone for fast retrieval.
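The six steps above can be sketched as a single function. This is a minimal sketch, not the workflow's actual implementation: `scrape_markdown` and `embed` stand in for the real Firecrawl and OpenAI calls, and `index` can be any object with an `upsert(records)` method, such as a Pinecone index handle.

```python
def ingest(url, scrape_markdown, embed, index):
    """Minimal sketch of the ingestion steps. `scrape_markdown` and
    `embed` stand in for the Firecrawl and OpenAI API calls; `index`
    is any object exposing upsert(records), e.g. a Pinecone index."""
    markdown = scrape_markdown(url)   # Firecrawl: page -> clean Markdown
    vector = embed(markdown)          # OpenAI: text -> 1536-d embedding
    record = {
        "id": url,                    # simplest stable ID (see the FAQ on updates)
        "values": vector,
        "metadata": {"source_url": url, "text": markdown},  # traceability
    }
    index.upsert([record])
    return record
```

Injecting the scraper and embedder keeps the sketch testable offline; in the real flow each of those calls is a separate node with its own credentials.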
Before: teams rely on manual, error-prone source ingestion and inconsistent content formats. After adopting this AI agent, web pages are ingested automatically, standardized to Markdown, embedded, stored in Pinecone, and accessible through a fast, reranked chat interface.
A simple 3-step flow anyone can follow.
Receive a POST with a URL, validate and normalize the domain, and respond with a 422 if invalid.
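The validate-and-normalize step might look like the sketch below (an assumption about the exact rules; the workflow's own checks may differ). A `False` result is what maps to the webhook's 422 response.

```python
from urllib.parse import urlparse, urlunparse

def validate_and_normalize(raw: str):
    """Return (True, normalized_url) or (False, error_message).
    A False result maps to the webhook's 422 response."""
    raw = raw.strip()
    # Default to https:// when no scheme is supplied.
    if "://" not in raw:
        raw = "https://" + raw
    parts = urlparse(raw)
    if parts.scheme not in ("http", "https"):
        return False, "unsupported scheme"
    host = parts.netloc.lower()
    # A plausible public domain needs at least one dot (e.g. example.com).
    if "." not in host:
        return False, "invalid domain"
    # Normalize: lowercase host, default path, drop fragment.
    return True, urlunparse(
        (parts.scheme, host, parts.path or "/", parts.params, parts.query, "")
    )
```

Normalizing before ingestion also gives every page one canonical URL, which keeps later upserts and deletions consistent.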
Firecrawl scrapes the page to Markdown and OpenAI embeddings generate a 1536-d vector for the content.
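In practice the embedding step reduces to one HTTP request to OpenAI's `/v1/embeddings` endpoint. The sketch below also splits long Markdown into chunks first; chunking and the `text-embedding-3-small` model choice (which returns 1536-dimensional vectors, matching the index) are assumptions, and the workflow may embed the whole page at once or use a different 1536-d model.

```python
def chunk_markdown(md: str, max_chars: int = 4000) -> list[str]:
    """Split scraped Markdown on blank lines into chunks under a size
    budget. (Chunking is an assumption; the workflow may embed the
    whole page as one input.)"""
    chunks, current = [], ""
    for block in md.split("\n\n"):
        if current and len(current) + len(block) + 2 > max_chars:
            chunks.append(current)
            current = block
        else:
            current = current + "\n\n" + block if current else block
    if current:
        chunks.append(current)
    return chunks

def embedding_request(chunks: list[str]) -> dict:
    """Body for POST https://api.openai.com/v1/embeddings.
    text-embedding-3-small returns 1536-dimensional vectors by
    default, matching the Pinecone index dimension."""
    return {"model": "text-embedding-3-small", "input": chunks}
```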
Store content and embeddings in Pinecone, acknowledge the ingestion, then answer questions via the chat agent with Cohere reranking.
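The storage step builds one record per embedded chunk in Pinecone's upsert format: an `id`, a `values` list, and a `metadata` object carrying the source URL for traceability. A minimal sketch (the ID scheme and metadata field names are assumptions):

```python
def to_pinecone_records(url: str, chunks: list[str], vectors: list[list[float]]):
    """One record per chunk in Pinecone's upsert shape. The metadata
    keeps the source URL and raw text so answers remain traceable."""
    return [
        {
            "id": f"{url}#chunk-{i}",
            "values": vec,  # 1536-d embedding
            "metadata": {"source_url": url, "chunk": i, "text": text},
        }
        for i, (text, vec) in enumerate(zip(chunks, vectors))
    ]
```

The actual upsert is then a single call on the Pinecone index with this list of records.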
A realistic scenario showing ingestion and chat outcomes.
Scenario: A product team submits a URL for a new API doc. Ingestion completes in about 2 minutes. The Pinecone index now contains the document and its 1536-d embeddings. A user asks a question about the API, and the AI Agent returns an accurate answer drawn from the ingested content, with Cohere reranking applied to the retrieved passages.
Roles that gain a practical, measurable improvement from this AI agent.
Require quick, scalable ingestion of external sources into a searchable knowledge base.
Need consistent metadata and traceable source attribution for each ingested page.
Can answer questions using only ingested content, reducing escalation to SMEs.
Reference up-to-date API docs and policies when answering questions.
Automate the end-to-end ingestion pipeline with minimal maintenance.
Maintain auditable ingestion logs and source metadata for regulatory reviews.
The tools involved and what the AI agent does inside each.
Scrapes pages and converts them to clean Markdown for embedding.
Stores 1536-d embeddings and the associated metadata for fast similarity search.
Generates the 1536-d vector representations from scraped content.
Acts as the chat agent that queries Pinecone and produces answers.
Reorders retrieved passages to improve answer quality.
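The reranking step amounts to one request to Cohere's rerank endpoint: the passages retrieved from Pinecone are reordered by relevance to the query, and only the top few are passed to the LLM. A sketch of the request body (the model name and `top_n` value are assumptions):

```python
def rerank_request(query: str, passages: list[str], top_n: int = 3) -> dict:
    """Body for Cohere's rerank endpoint. Retrieved passages are
    scored against the query and only the top_n reach the LLM.
    (Model name and top_n are illustrative assumptions.)"""
    return {
        "model": "rerank-english-v3.0",
        "query": query,
        "documents": passages,
        "top_n": top_n,
    }
```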
Practical scenarios where this AI agent shines.
Common questions about how this AI agent works and how to use it.
The agent ingests publicly accessible web pages via URL payloads. Pages behind authentication or paywalls may require additional handling. Ingestion relies on Firecrawl to produce clean Markdown, which is then embedded and stored in Pinecone. You can repeat this for multiple sources to rapidly grow your knowledge base.
You need API keys for Firecrawl, OpenAI (for embeddings), OpenRouter (for chat), Cohere (for reranking), and a Pinecone account with a properly configured index. The Pinecone index must be 1536 dimensions to match the embeddings. You should also ensure webhook delivery is accessible to your n8n instance or equivalent webhook receiver.
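The required credentials can be collected as environment variables before wiring them into your n8n credentials. The variable names below are illustrative, not mandated by the workflow:

```shell
# Illustrative variable names -- map them to the credential
# fields in your n8n (or equivalent) instance.
export FIRECRAWL_API_KEY="fc-..."
export OPENAI_API_KEY="sk-..."
export OPENROUTER_API_KEY="sk-or-..."
export COHERE_API_KEY="..."
export PINECONE_API_KEY="..."
# The Pinecone index must already exist with dimension=1536.
export PINECONE_INDEX="knowledge-base"
```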
Ingesting pages behind login typically requires handling authentication and session state, which is not covered by the default webhook flow. Public pages or pages with accessible content are supported out-of-the-box. For restricted content, you would need a secure, authenticated ingestion process and appropriate access controls.
Ingestion duration depends on page size and network conditions, but most pages complete within 1–2 minutes. The embedding step is fast, and metadata is attached immediately after scraping. The webhook response typically confirms the number of items ingested in that run.
You can remove sources by deleting the corresponding metadata and embeddings from Pinecone. Updates can be performed by re-ingesting the page with updated content and replacing the stored vector and metadata. It is best practice to version and tag ingested pages to maintain traceability.
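Re-ingest-to-update and targeted deletion both become simple if vector IDs are derived deterministically from the normalized source URL, so the same page always maps to the same IDs. A sketch of one such scheme (an assumption; the default flow may use different IDs):

```python
import hashlib

def vector_id(normalized_url: str, chunk: int = 0) -> str:
    """Deterministic ID from the normalized source URL, so
    re-ingesting a page overwrites its old vectors (upsert) and
    deletion can target the same IDs without a metadata lookup."""
    digest = hashlib.sha256(normalized_url.encode()).hexdigest()[:16]
    return f"{digest}-{chunk}"
```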
Retrieval latency is dominated by vector search plus generation time for the LLM. With Cohere reranking, the top passages are reordered to improve answer fidelity. Overall response times are typically within a few seconds for short queries and longer for complex questions.
Yes, the pipeline is designed for a self-hosted setup, with a Pinecone index provisioned under your own account and keys. Access is protected by your API keys and network controls. You should apply standard security practices to your Pinecone index and webhook endpoints to prevent unauthorized ingestion or retrieval.