Activity C: Grounded Live-Corpus
A hands-on Jupyter notebook that demonstrates why some questions can only be answered by searching the live open web — and how to do that reliably using DuckDuckGo Search, a free, privacy-focused search library that requires no API key or account.
How to run this notebook
Option A — Google Colab (easier)
- Click the Open in Colab button above, or choose File → Open in Colab from the top-left menu. Use a personal Google account rather than a work or university account — institutional accounts sometimes block third-party Colab access.
- Make it editable: File → Save a copy in Drive. The shared notebook is view-only; saving a copy to your own Drive gives you a personal, fully editable version.
- Enable a free GPU: Runtime → Change runtime type → T4 GPU → Save. A GPU is not required (the default model runs on CPU) but speeds up generation.
- Run cells top-to-bottom with Runtime → Run all, or step through them with Shift+Enter.
Option B — Local Jupyter (laptop)
- Download the notebook from the Drive link (File → Download → Download .ipynb).
- Install dependencies: `pip install transformers torch ddgs`.
- Open with `jupyter notebook activity_c.ipynb` or in VS Code.
- The default model (`Qwen2-0.5B-Instruct`) and DuckDuckGo search both work without GPU or API keys.
What you will learn
- Why LLMs alone, structured APIs, and fixed-corpus RAG all fail for questions that require today's information.
- How to use the ddgs library as a free, keyless live search tool — no signup, no rate-limit headaches.
- How to distil a natural-language question into a short search phrase using the LLM before querying.
- How to build a grounded live-search prompt that instructs the LLM to cite source URLs.
- How to handle heterogeneous source quality and conflicting snippets from the open web.
- What evaluation looks like at Level 4: freshness, source credibility, and answer faithfulness — all three required.
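The grounded, citation-instructed prompt mentioned above is plain string construction: number each retrieved snippet, keep its URL visible, and tell the model to cite and to admit when the results fall short. A minimal sketch (the exact wording used in the notebook may differ; the snippet dict keys follow the shape of ddgs text results):

```python
def build_grounded_prompt(question: str, snippets: list[dict]) -> str:
    """Assemble a prompt that grounds the LLM in live search results.

    `snippets` is a list of dicts with "title", "href", and "body" keys,
    matching the shape of ddgs text-search results (an assumption of
    this sketch).
    """
    # Number each snippet and keep its URL so the model can cite it.
    context = "\n".join(
        f"[{i}] {s['title']} ({s['href']})\n{s['body']}"
        for i, s in enumerate(snippets, 1)
    )
    return (
        "Answer the question using ONLY the web results below.\n"
        "Cite the URL of every result you rely on, and say so if the\n"
        "results do not contain the answer.\n\n"
        f"Web results:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
```

The "say so if the results do not contain the answer" clause is what makes the answer faithful rather than merely fluent: it gives the model a sanctioned fallback instead of forcing a guess.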
Complexity Ladder level covered
| Level | Name | Key property |
|---|---|---|
| 4 | Grounded Live-Corpus | Open web retrieval for freshness. One-shot search → multi-source synthesis → cited answer. Evaluation requires freshness check, source credibility check, and answer faithfulness check. |
Notebook outline
- Setup: install 3 packages (`transformers`, `torch`, `ddgs`); load `Qwen2-0.5B-Instruct` (CPU-friendly default; swap to Phi-3.5-mini on T4).
- Part 1 — LLM alone fails: ask about a recent event (Super Bowl 2026) with no context → stale/hedged answer; explain training cutoff.
- Part 2 — DuckDuckGo live search: LLM distils question to a short phrase → `web_search()` fetches live results → inject as context → LLM gives current cited answer → repeat for 3 diverse live questions.
- Part 3 — Comparison: same live question fails on a hardcoded static corpus; decision table for when to use Level 2 vs. 3 vs. 4.
- Part 4 — Source quality: inspect domain diversity across results; handle a deliberately ambiguous query where sources disagree.
- Part 5 — Recap: full Complexity Ladder table comparing Levels 2–5; Level 4 → 5 transition explained.
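The Part 2 loop (distil → search → inject) can be sketched as three small functions. This assumes the current `ddgs` interface, where `DDGS().text(query, max_results=...)` yields dicts with `title`, `href`, and `body` keys; check your installed version if it differs. `distil_query` here is a trivial keyword-stripping stand-in for the LLM distillation step, so the sketch stays self-contained:

```python
def distil_query(question: str) -> str:
    # Stand-in for the notebook's LLM distillation step: drop filler
    # words to get a short search phrase.
    stop = {"who", "what", "when", "did", "is", "the", "a", "an", "of", "in"}
    return " ".join(w for w in question.split() if w.lower() not in stop)

def web_search(query: str, max_results: int = 5) -> list[dict]:
    # Keyless live search via ddgs (pip install ddgs); imported lazily
    # so the pure helpers above and below work without it installed.
    from ddgs import DDGS
    with DDGS() as ddgs:
        return list(ddgs.text(query, max_results=max_results))

def results_to_context(results: list[dict]) -> str:
    # One numbered snippet per result, keeping the URL for citation.
    return "\n".join(
        f"[{i}] {r['title']} ({r['href']}): {r['body']}"
        for i, r in enumerate(results, 1)
    )
```

Usage follows the outline: `results_to_context(web_search(distil_query(question)))` produces the context string to inject ahead of the question in the LLM prompt.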
Requirements
No API keys or signups needed:
- Live search: `ddgs` — free Python library, no account, no key. Install with `pip install ddgs`.
- Heavier experiments (optional): for higher query volume or a private unthrottled instance, self-host SearXNG locally with `docker run -d -p 8080:8080 searxng/searxng`.
- Default model: `Qwen/Qwen2-0.5B-Instruct` — 0.5 B params, ~1 GB, runs on CPU.
- GPU upgrade (optional): `microsoft/Phi-3.5-mini-instruct` — swap in the first code cell when running on Colab T4.
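The model swap described above can be automated with a small helper. The model ids come from the list above; the helper name and structure are my own, and CUDA detection is deferred to a lazy `torch` import so the function can also be driven explicitly:

```python
from typing import Optional

def pick_model(gpu_available: Optional[bool] = None) -> str:
    """Return the model id to load: the CPU-friendly default, or the
    larger Phi-3.5-mini when a GPU (e.g. a Colab T4) is available.

    If `gpu_available` is None, detect CUDA via torch (imported lazily
    so this sketch can be read and tested without torch installed).
    """
    if gpu_available is None:
        import torch
        gpu_available = torch.cuda.is_available()
    if gpu_available:
        return "microsoft/Phi-3.5-mini-instruct"  # GPU upgrade
    return "Qwen/Qwen2-0.5B-Instruct"  # ~1 GB, runs on CPU
```

Pass the result to the transformers text-generation pipeline, e.g. `pipeline("text-generation", model=pick_model())`, in the first code cell.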
Citation
If you use this tutorial in your work or teaching, please cite:
@inproceedings{dammu2026information,
title={Information Seeking in the Age of Agentic AI: A Half-Day Tutorial},
author={Dammu, Preetam Prabhu Srikar and Roosta, Tanya},
booktitle={Proceedings of the 2026 Conference on Human Information Interaction and Retrieval},
pages={429--430},
year={2026}
}
View on ACM DL · Contact: Preetam Dammu <preetams@uw.edu>, PhD Candidate, University of Washington