I Built My Own Research Engine. Here's What This Tool Lets You Do.

I needed a research tool I could run on my own hardware, without per-call pricing. So I built one. And then I open-sourced it so you can have it, too.

The problem with per-call pricing is not the line item on your card. It is the question you do not ask because the cost is not worth it. The secondary source you accept because scraping the original is too expensive. The fragment of a page you read because the full text costs another credit.

When the meter is running, you optimize for cost, not for curiosity. That is a bad way to do research. So I built a tool that has no meter.

What I Built

GroktoCrawl is a self-hosted, MIT-licensed web scraping, search, and AI research service. One docker compose up and it’s running on your own hardware. It implements the Firecrawl v2 API surface (scrape, search, map, crawl, extract, browser sessions, monitors) plus capabilities Firecrawl doesn’t offer: a persistent semantic search engine, a grounded Q&A endpoint with citations, a web portal for human users, site-specific adapters for GitHub, YouTube, Bluesky, Substack, and others, and an autonomous research agent that evaluates source quality as it works.

The CLI has thirteen commands:

groktocrawl
  scrape           Scrape a single URL to clean markdown
  search           Search the web
  map              Discover URLs on a site
  crawl            Crawl a website recursively
  agent            Autonomous multi-source research
  answer           Grounded Q&A with citations
  extract          Structured data from URLs
  browser          Headless browser sessions
  monitor          Scheduled change detection
  parse            PDF, DOCX, EPUB to markdown
  download         Binary file download
  active           List running jobs
  generate-llmstxt Build an llms.txt for a site

Every command hits your own instance. No rate limits. No credit counters. No pricing tier that decides what kind of research is worth doing.

What This Looks Like in Practice

Here are real commands with their actual output.

Grounded Q&A:

groktocrawl answer "arXiv 1706.03762 Attention Is All You Need"

The answer endpoint searches the web, scrapes the most relevant pages, synthesizes the findings through an LLM, and returns a grounded answer with source citations:

Based solely on the provided source [1], here are the key details
about arXiv:1706.03762, "Attention Is All You Need":

- Proposed architecture: The Transformer, a new simple network
  architecture based solely on attention mechanisms, dispensing
  with recurrence and convolutions entirely.
- Performance on machine translation: Achieves 28.4 BLEU on WMT
  2014 English-to-German translation, improving over existing best
  results by over 2 BLEU.
- Advantages: The model is superior in quality while being more
  parallelizable and requiring significantly less time to train.

Sources:
  https://arxiv.org/abs/1706.03762

Citations:
  [1] https://arxiv.org/abs/1706.03762

One command, one answer, cited to its source. No separate search, scrape, and read steps.

Autonomous research agent:

groktocrawl agent "Summarize the key contributions of this paper" \
  --urls https://arxiv.org/abs/1706.03762 --sync

The agent scrapes the provided URLs, evaluates source quality, and produces a structured synthesis with citations:

Based on the provided source, the key contributions of the paper
"Attention Is All You Need" are:

1. Introduction of the Transformer architecture: A novel network
   architecture based solely on attention mechanisms, completely
   dispensing with recurrence and convolutions.
2. Superior performance on machine translation: Achieves a BLEU
   score of 28.4 on WMT 2014 English-to-German translation.
3. Efficiency gains: More parallelizable and requires significantly
   less time to train -- only 3.5 days on eight GPUs.
4. Generalization to other tasks: Successfully applies to English
   constituency parsing beyond sequence transduction.

Sources:
  https://arxiv.org/abs/1706.03762

The agent also accepts open-ended research prompts without seed URLs. It searches, scrapes multiple pages, evaluates source quality, detects contradictions, and produces a structured synthesis. When available sources don’t answer the question, it says so explicitly.

Scraping a GitHub repo:

groktocrawl scrape https://github.com/groktopus/groktocrawl

---
source: github-adapter
url_type: repo_root
owner: groktopus
repo: groktocrawl
stars: 27
forks: 3
language: Python
license: MIT
description: Self-hosted, API-compatible Firecrawl alternative with Agent endpoint.
---

# groktopus/groktocrawl

> Self-hosted, API-compatible Firecrawl alternative with Agent endpoint. MIT license. One docker compose up.

⭐ 27 stars | 🍴 3 forks | 🔤 Python | 📜 MIT | 🌿 main

When no adapter matches a URL, the scraper runs a three-tier content negotiation pipeline: check /llms.txt at the site root first, request with Accept: text/markdown header second, render with a headless browser third. Sites that serve agent-friendly formats get answered faster and cheaper.
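The fallback order is easy to sketch in Python. This is a rough illustration of the idea, not GroktoCrawl's actual code: `negotiate`, `fetch`, and `render` are hypothetical names, with `fetch` standing in for an HTTP client and `render` for a headless browser. The post does not say how the real pipeline uses the contents of llms.txt, so this sketch simply returns the file body.

```python
from typing import Callable, Optional
from urllib.parse import urlsplit

# Hypothetical signatures (assumptions for this sketch):
#   fetch(url, headers) -> (content_type, body), or None on failure
#   render(url)         -> rendered page text from a headless browser
Fetch = Callable[[str, dict], Optional[tuple]]
Render = Callable[[str], str]


def negotiate(url: str, fetch: Fetch, render: Render) -> tuple:
    """Return (tier, content), trying the cheapest tier first."""
    parts = urlsplit(url)
    root = f"{parts.scheme}://{parts.netloc}"

    # Tier 1: look for /llms.txt at the site root.
    hit = fetch(f"{root}/llms.txt", {})
    if hit is not None:
        return "llms.txt", hit[1]

    # Tier 2: ask the server for markdown directly.
    hit = fetch(url, {"Accept": "text/markdown"})
    if hit is not None and hit[0].startswith("text/markdown"):
        return "markdown", hit[1]

    # Tier 3: fall back to rendering the page in a headless browser.
    return "browser", render(url)
```

Passing the fetcher and renderer in as arguments keeps the ordering logic separate from the network code, so the tiers can be exercised with stubs.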

The adapter system covers over 20 sites across four categories:

Category	Sites
Code	GitHub repos, files, issues, PRs, discussions, releases
Media	YouTube (transcripts), Bluesky (posts + threads), Substack (articles)
Vulnerabilities	NVD, CVE Program, MITRE ATT&CK, Exploit-DB, CRT.sh
Threat intel	AbuseIPDB, Shodan, VirusTotal, AlienVault OTX, Have I Been Pwned, Censys, VulnCheck

Each adapter returns structured data as YAML frontmatter (video metadata for YouTube, star count and license for GitHub, CVSS score for NVD), so you don’t have to dig through HTML.
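Because the metadata sits in a delimited block at the top of the output, a script can pull it apart with a few lines of Python. The sketch below is my own illustration, not part of GroktoCrawl: `split_frontmatter` is a hypothetical helper that only understands flat `key: value` lines like the ones in these examples. For anything nested, a real YAML parser would be the better choice.

```python
def split_frontmatter(text: str) -> tuple:
    """Split '---' delimited frontmatter from the markdown body.

    Handles only flat `key: value` lines. It is not a general YAML
    parser, and every value is returned as a string.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text

    meta = {}
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            # Closing delimiter: everything after it is the body.
            body = "\n".join(lines[i + 1:]).lstrip("\n")
            return meta, body
        # Split on the first colon only, so URL values stay intact.
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()

    # No closing delimiter found: treat the whole text as body.
    return {}, text


sample = """---
source: github-adapter
owner: groktopus
repo: groktocrawl
stars: 27
license: MIT
---

# groktopus/groktocrawl
"""

meta, body = split_frontmatter(sample)
print(meta["repo"], meta["license"])
```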

Scraping a YouTube video:

groktocrawl scrape https://youtu.be/IIBRpzQgIKc

---
video_id: IIBRpzQgIKc
title: 2022 Suzuki SV650 Review | Daily Rider
channel: RevZilla
channel_url: https://www.youtube.com/@RevZilla
thumbnail_url: https://i.ytimg.com/vi/IIBRpzQgIKc/hqdefault.jpg
source: youtube-adapter
---

# 2022 Suzuki SV650 Review | Daily Rider

**Channel:** RevZilla

---

## Description

https://rvz.la/3J1MKrY | Find the right Michelin tire for your ride!

Suzuki's SV650 is one of the most popular bikes of the past 20 years, whether you're racing or commuting. Finally, Zack takes the legendary SV on the Daily Rider route!

The frontmatter, title, and description come first, followed by the full transcript. One command.

Web search:

groktocrawl search "self hosted web scraper" --limit 3 --json

{
  "results": [
    {"url": "https://github.com/jaypyles/Scraperr", "title": "Scraperr -- A Self Hosted Webscraper"},
    {"url": "https://news.ycombinator.com/item?id=27474142", "title": "My final project..."},
    {"url": "https://supacrawler.com", "title": "Show HN: Supacrawler -- lightweight web scraping API in Go"}
  ]
}

Document parsing:

groktocrawl parse report.pdf -o report.md
Parsed content (20119 chars) written to /tmp/report.md

Converts PDF, DOCX, EPUB, PPTX, and XLSX to clean markdown on the server side. The output lands as a markdown file ready for an LLM or text editor.

Change monitoring:

groktocrawl monitor create https://example.com/pricing --schedule "0 */12 * * *"

Polls the URL on a cron schedule, diffs the content, and surfaces what changed.
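The diff step is the part worth picturing. Here is a minimal sketch using Python's standard `difflib`, assuming two text snapshots of the same page. It is not how GroktoCrawl implements monitors internally; `changed_lines` is a hypothetical helper for illustration.

```python
import difflib


def changed_lines(old: str, new: str) -> list:
    """Return only the added and removed lines between two snapshots."""
    diff = list(
        difflib.unified_diff(
            old.splitlines(),
            new.splitlines(),
            lineterm="",
            n=0,  # no surrounding context lines
        )
    )
    # The first two entries are the '---' / '+++' file headers.
    # Hunk headers start with '@@' and are filtered out below.
    return [line for line in diff[2:] if line.startswith(("+", "-"))]


before = "Pro: $10\nTeam: $20"
after = "Pro: $12\nTeam: $20"
print(changed_lines(before, after))
```

An empty list means nothing changed, which is the signal to stay quiet until the next scheduled poll.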

Structured data extraction:

groktocrawl extract https://arxiv.org/abs/1706.03762 \
  --prompt "Extract the paper title, authors, and submission date"

Returns structured fields from the page content through an LLM:

- Title: Attention Is All You Need
- Authors: Ashish Vaswani, and 7 other authors
- Submission Date: 12 June 2017 (v1, original submission)

The extract endpoint works with any URL and any prompt. You define the schema; the tool fills it in.

Headless browser sessions:

groktocrawl browser create --ttl 120

Session: e7a1f3c2-4b5d-4e89-9a2c-8f1d6e3b0a7c

The browser commands give you a controlled Playwright session for JavaScript-heavy single-page apps that can’t be scraped by the standard pipeline. Create a session, navigate to a URL, extract the rendered DOM, execute scripts, take screenshots. The session lives for the TTL you set (default 60 seconds) and self-destructs.

The Agent Skill

GroktoCrawl ships with an AgentSkills-compatible skill. It is a single file at skills/groktocrawl/SKILL.md that covers:

- All thirteen commands with syntax and examples
- The adapter system and which sites return structured data
- The fallback chains for error recovery
- The browser session lifecycle
- Pitfalls that took real usage to discover

Load the skill and any AI agent that supports AgentSkills (Claude Code, Cursor, Hermes Agent) knows how to use every command, flag, and recovery pattern without being told.

One Stack, No Handoffs

The real advantage of having all these tools in one service is the workflow. A typical research session might look like this:

# 1. Find the leads
groktocrawl search "transformer architecture survey 2024" --limit 5 --json
# 2. Pull the content
groktocrawl scrape https://arxiv.org/abs/1706.03762
# 3. Extract what you need
groktocrawl extract https://arxiv.org/abs/1706.03762 \
  --prompt "Extract the key innovations claimed in this paper"
# 4. Ask questions about it
groktocrawl answer "What limitations of the Transformer does the paper identify?"
# 5. Monitor for updates
groktocrawl monitor create https://arxiv.org/abs/1706.03762 \
  --schedule "0 0 * * 1"

Every command talks to the same service. No API keys to swap between steps. No rate limits between the search and the scrape. No switching from one vendor’s SDK to another’s. The output of scrape feeds directly into extract, answer, or agent because they share the same transport and the same auth.
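If you script the workflow instead of typing it, the glue is small. The Python sketch below assumes the `--json` search output has the shape shown earlier (a `results` list whose items carry a `url`), and builds one scrape command per result. `scrape_commands` is a hypothetical helper of mine, not something that ships with GroktoCrawl.

```python
import json


def scrape_commands(search_json: str) -> list:
    """Build one `groktocrawl scrape` argument list per search result URL."""
    payload = json.loads(search_json)
    urls = [item["url"] for item in payload.get("results", []) if item.get("url")]
    return [["groktocrawl", "scrape", url] for url in urls]


search_output = """
{
  "results": [
    {"url": "https://github.com/jaypyles/Scraperr", "title": "Scraperr"},
    {"url": "https://supacrawler.com", "title": "Supacrawler"}
  ]
}
"""

for argv in scrape_commands(search_output):
    print(" ".join(argv))
```

Each argument list can be handed to `subprocess.run` as-is. Keeping the commands as lists rather than strings avoids shell quoting problems with unusual URLs.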

The Economics Are Different

I am not going to tell you how much money to charge or what your margins will be. Those numbers depend on your niche, your pricing, and your cost of running a server, none of which I can guess.

What I can tell you is the structural difference.

With per-call pricing, your cost scales linearly with the work you do. With self-hosted infrastructure, you pay for capacity. A VPS costs the same whether your GroktoCrawl instance handles ten requests a day or ten thousand. Within that capacity, the server bill does not change with how much you use it.
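The structural difference reduces to one line of arithmetic: metered cost is requests times price, fixed cost is a constant, and the break-even point is where the two meet. The Python sketch below uses placeholder inputs only. I am not quoting any vendor's pricing or any real server bill, and the function names are made up for this example.

```python
import math


def per_call_cost(requests: int, price_per_call: float) -> float:
    """Metered model: total cost grows linearly with request count."""
    return requests * price_per_call


def break_even_requests(fixed_monthly: float, price_per_call: float) -> int:
    """Smallest monthly request count where metered cost reaches the fixed cost."""
    if price_per_call <= 0:
        raise ValueError("price_per_call must be positive")
    return math.ceil(fixed_monthly / price_per_call)
```

Plug in your own server cost and whatever per-call rate you are comparing against. Below the break-even count the meter is cheaper; above it, every additional request widens the gap in favor of the fixed-cost box.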

What You Could Build With It

I am not going to give dollar figures or claim I made money from this. I built it because I needed it. But the capabilities are concrete:

- A competitor intelligence feed. The monitor system checks pages on a schedule, diffs the content, and surfaces changes. Wire that to a notification channel.
- An industry newsletter. The agent endpoint produces daily summaries scoped to any niche. Add a formatting pass and an email send.
- A structured data extraction pipeline. The extract command pulls the same fields from many pages into a single clean dataset.
- Document processing at scale. The parse command turns PDF, DOCX, and EPUB files into markdown that you can search, summarize, and cross-reference.

The point is that the infrastructure constraint is gone.

The Repo

GroktoCrawl is at github.com/groktopus/groktocrawl. MIT licensed. Twenty-seven stars, three forks, zero paid tiers. Drop a docker compose up and it runs.

I built it for my own research and opened it up because there was no reason not to. If you build something on top of it, that is the whole point.
