Activity C: Grounded Live-Corpus
A hands-on Jupyter notebook that demonstrates why some questions can only be answered by searching the live open web — and how to do that reliably using DuckDuckGo Search, a free, privacy-focused search library that requires no API key or account.
How to run this notebook
Option A — Google Colab (easier)
- Click the Open in Colab button above, or choose File → Open in Colab from the top-left menu. Use a personal Google account rather than a work or university account — institutional accounts sometimes block third-party Colab access.
- Make it editable: File → Save a copy in Drive. The shared notebook is view-only; saving a copy to your own Drive gives you a personal, fully editable version.
- Enable a free GPU: Runtime → Change runtime type → T4 GPU → Save. A GPU is not required (the default model runs on CPU) but speeds up generation.
- Run cells top-to-bottom with Runtime → Run all, or step through them with Shift+Enter.
Option B — Local Jupyter (laptop)
- Download the notebook from the Drive link (File → Download → Download .ipynb).
- Install dependencies: `pip install transformers torch ddgs`.
- Open with `jupyter notebook activity_c.ipynb` or in VS Code.
- The default model (`Qwen2-0.5B-Instruct`) and DuckDuckGo search both work without GPU or API keys.
What you will learn
- Why LLMs alone, structured APIs, and fixed-corpus RAG all fail for questions that require today's information.
- How to use the ddgs library as a free, keyless live search tool — no signup, no rate-limit headaches.
- How to distil a natural-language question into a short search phrase using the LLM before querying.
- How to build a grounded live-search prompt that instructs the LLM to cite source URLs.
- How to handle heterogeneous source quality and conflicting snippets from the open web.
- What evaluation looks like at Level 4: freshness, source credibility, and answer faithfulness — all three required.
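The grounded, citation-instructed prompt mentioned above is plain string construction: number each retrieved snippet, keep its URL visible, and tell the model to cite and to admit when the results fall short. A minimal sketch (the exact wording used in the notebook may differ; the snippet dict keys follow the shape of ddgs text results):

```python
def build_grounded_prompt(question: str, snippets: list[dict]) -> str:
    """Assemble a prompt that grounds the LLM in live search results.

    `snippets` is a list of dicts with "title", "href", and "body" keys,
    matching the shape of ddgs text-search results (an assumption of
    this sketch).
    """
    # Number each snippet and keep its URL so the model can cite it.
    context = "\n".join(
        f"[{i}] {s['title']} ({s['href']})\n{s['body']}"
        for i, s in enumerate(snippets, 1)
    )
    return (
        "Answer the question using ONLY the web results below.\n"
        "Cite the URL of every result you rely on, and say so if the\n"
        "results do not contain the answer.\n\n"
        f"Web results:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
```

The "say so if the results do not contain the answer" clause is what makes the answer faithful rather than merely fluent: it gives the model a sanctioned fallback instead of forcing a guess.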
Complexity Ladder level covered
| Level | Name | Key property |
|---|---|---|
| 4 | Grounded Live-Corpus | Open web retrieval for freshness. One-shot search → multi-source synthesis → cited answer. Evaluation requires freshness check, source credibility check, and answer faithfulness check. |
Notebook outline
- Setup: install 3 packages (`transformers`, `torch`, `ddgs`); load `Qwen2-0.5B-Instruct` (CPU-friendly default; swap to Phi-3.5-mini on T4).
- Part 1 — LLM alone fails: ask about a recent event (Super Bowl 2026) with no context → stale/hedged answer; explain training cutoff.
- Part 2 — DuckDuckGo live search: LLM distils question to a short phrase → `web_search()` fetches live results → inject as context → LLM gives current cited answer → repeat for 3 diverse live questions.
- Part 3 — Comparison: same live question fails on a hardcoded static corpus; decision table for when to use Level 2 vs. 3 vs. 4.
- Part 4 — Source quality: inspect domain diversity across results; handle a deliberately ambiguous query where sources disagree.
- Part 5 — Recap: full Complexity Ladder table comparing Levels 2–5; Level 4 → 5 transition explained.
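The Part 2 loop (distil → search → inject) can be sketched as three small functions. This assumes the current `ddgs` interface, where `DDGS().text(query, max_results=...)` yields dicts with `title`, `href`, and `body` keys; check your installed version if it differs. `distil_query` here is a trivial keyword-stripping stand-in for the LLM distillation step, so the sketch stays self-contained:

```python
def distil_query(question: str) -> str:
    # Stand-in for the notebook's LLM distillation step: drop filler
    # words to get a short search phrase.
    stop = {"who", "what", "when", "did", "is", "the", "a", "an", "of", "in"}
    return " ".join(w for w in question.split() if w.lower() not in stop)

def web_search(query: str, max_results: int = 5) -> list[dict]:
    # Keyless live search via ddgs (pip install ddgs); imported lazily
    # so the pure helpers above and below work without it installed.
    from ddgs import DDGS
    with DDGS() as ddgs:
        return list(ddgs.text(query, max_results=max_results))

def results_to_context(results: list[dict]) -> str:
    # One numbered snippet per result, keeping the URL for citation.
    return "\n".join(
        f"[{i}] {r['title']} ({r['href']}): {r['body']}"
        for i, r in enumerate(results, 1)
    )
```

Usage follows the outline: `results_to_context(web_search(distil_query(question)))` produces the context string to inject ahead of the question in the LLM prompt.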
Requirements
No API keys or signups needed:
- Live search: `ddgs` — free Python library, no account, no key. Install with `pip install ddgs`.
- Heavier experiments (optional): for higher query volume or a private unthrottled instance, self-host SearXNG locally with `docker run -d -p 8080:8080 searxng/searxng`.
- Default model: `Qwen/Qwen2-0.5B-Instruct` — 0.5 B params, ~1 GB, runs on CPU.
- GPU upgrade (optional): `microsoft/Phi-3.5-mini-instruct` — swap in the first code cell when running on Colab T4.
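The model swap described above can be automated with a small helper. The model ids come from the list above; the helper name and structure are my own, and CUDA detection is deferred to a lazy `torch` import so the function can also be driven explicitly:

```python
from typing import Optional

def pick_model(gpu_available: Optional[bool] = None) -> str:
    """Return the model id to load: the CPU-friendly default, or the
    larger Phi-3.5-mini when a GPU (e.g. a Colab T4) is available.

    If `gpu_available` is None, detect CUDA via torch (imported lazily
    so this sketch can be read and tested without torch installed).
    """
    if gpu_available is None:
        import torch
        gpu_available = torch.cuda.is_available()
    if gpu_available:
        return "microsoft/Phi-3.5-mini-instruct"  # GPU upgrade
    return "Qwen/Qwen2-0.5B-Instruct"  # ~1 GB, runs on CPU
```

Pass the result to the transformers text-generation pipeline, e.g. `pipeline("text-generation", model=pick_model())`, in the first code cell.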
Citation
If you use this tutorial in your work or teaching, please cite:
@inproceedings{dammu2026information,
title={Information Seeking in the Age of Agentic AI: A Half-Day Tutorial},
author={Dammu, Preetam Prabhu Srikar and Roosta, Tanya},
booktitle={Proceedings of the 2026 Conference on Human Information Interaction and Retrieval},
pages={429--430},
year={2026}
}
View on ACM DL · Contact: Preetam Dammu <preetams@uw.edu>, PhD Candidate, University of Washington