Building a JFK Assassination File Chatbot with Azure OpenAI and Document Intelligence
Introduction: A Massive Document Dump Meets Modern AI
On March 18, 2025, the U.S. government released over 2,000 documents related to the assassination of President John F. Kennedy. Curious to explore the potential of Azure OpenAI services, I decided to ingest this massive dataset and build an intelligent chat interface capable of answering questions directly from the files. It was part historical investigation, part technical challenge — and yes, part financial misadventure.
In this post, I walk through how I used ChatGPT, Azure AI Search, Document Intelligence, and a GitHub sample project to spin up an interactive JFK chatbot. Along the way, I ran into performance issues, quirky document formatting, cost surprises, and some eerie tidbits about the case itself.
If you want to watch the video I posted on YouTube, you can check it out here.
The Setup: From Government Archive to AI Playground
The document release was hosted at archives.gov, and at first glance, the site just looked like a typical list of PDFs. But closer inspection revealed over 2,000 files — too many to download manually. So I turned to ChatGPT.
By inspecting the page source, I located the HTML <tbody> that contained all the download links. I copied this block into a local HTML file and prompted ChatGPT to generate a bash script that could extract all the links. After some debugging—mostly sed vs. grep quirks on macOS—I had a working script: extract_links.sh.
#!/bin/bash
BASE_URL="https://www.archives.gov"
INPUT_FILE="files.xml"
OUTPUT_FILE="links.txt"
# Extract the href links, prepend base URL, and save to output file
sed -n 's/.*<a href="\([^"]*\)".*/\1/p' "$INPUT_FILE" | awk -v base="$BASE_URL" '{print base $0}' > "$OUTPUT_FILE"
echo "Extracted links saved to $OUTPUT_FILE"
Once executed, the script populated a links.txt file with all the download URLs. I then asked ChatGPT to write a second script that would read from this file and use curl to download each file—handling URL encoding for spaces and special characters.
#!/bin/bash
INPUT_FILE="links.txt"
DOWNLOAD_DIR="downloads"
# Create a directory to store downloaded files
mkdir -p "$DOWNLOAD_DIR"
# Read each line in the text file
while IFS= read -r line || [[ -n "$line" ]]; do
    # Trim leading/trailing whitespace from the URL
    trimmed_url=$(echo "$line" | awk '{$1=$1};1')
    filename=$(basename "$trimmed_url")
    # Percent-encode spaces in the filename
    encoded_filename=$(echo "$filename" | sed 's/ /%20/g')
    final_url=$(dirname "$trimmed_url")/$encoded_filename
    echo "Downloading: $final_url"
    curl -L --silent --show-error --output "$DOWNLOAD_DIR/$filename" "$final_url"
done < "$INPUT_FILE"
echo "Download complete. Files saved in '$DOWNLOAD_DIR'."
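One caveat: the sed substitution in that script only encodes spaces, so a filename with other URL-unsafe characters would still trip up curl. A more robust sketch uses Python's urllib.parse.quote to percent-encode the entire filename (the example URL below is hypothetical):

```python
from urllib.parse import quote
from posixpath import dirname, basename

def encode_download_url(url: str) -> str:
    """Percent-encode the filename portion of a download URL.

    Handles spaces and any other URL-unsafe characters,
    not just the spaces the sed one-liner covers.
    """
    url = url.strip()
    # quote() leaves unreserved characters and '/' alone,
    # percent-encodes everything else
    return f"{dirname(url)}/{quote(basename(url))}"

print(encode_download_url("https://www.archives.gov/files/example file (1).pdf"))
# → https://www.archives.gov/files/example%20file%20%281%29.pdf
```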
The result? A “downloads” folder with 2,182 JFK files totaling around 6GB.
Data Ingestion with Azure Search and OpenAI
With all the files ready, I moved on to adapting Microsoft’s Azure Search OpenAI Demo.
This sample provides a chat interface over a searchable document index, making it perfect for JFK research.
I cloned the repo and dropped my downloads/ folder into the data/ directory. A quick tweak to the prepdocs.ps1 script told it to use my JFK files instead of the default sample docs:

$dataArg = "`"$cwd/data/jfk/downloads*`""

From there, I ran the Azure Developer CLI:
azd init
azd auth login
azd up
This sequence kicks off the provisioning of Azure resources via Bicep templates and runs Python scripts that upload and index the documents.
It’s worth noting: this ingestion pipeline uses Azure Document Intelligence, Azure AI Search, and Azure OpenAI together. Document Intelligence extracts text and metadata from each file, Azure AI Search builds a searchable index, and the OpenAI service enables chat-like interactions over that content.
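Part of that indexing step is splitting each extracted document into overlapping chunks before they're pushed to the search index. A minimal sketch of the idea (the function name and sizes here are illustrative, not the sample's actual code or defaults):

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 100) -> list[str]:
    """Split extracted text into overlapping chunks for indexing.

    The overlap preserves context that would otherwise be cut off
    at chunk boundaries, so a sentence straddling two chunks is
    still retrievable from either one.
    """
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
        start += chunk_size - overlap
    return chunks
```

This is also why a single long PDF turns into many indexed entries, each consuming its own tokens downstream.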
Observing the Cost (and Token) Explosion
As the documents began processing, I watched the logs roll in — and saw token usage skyrocket. Some files took 6,000+ tokens to process, especially longer PDFs split into multiple chunks. With Azure OpenAI charging around $5 per million tokens, I wasn’t too worried… at first.
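A quick back-of-the-envelope check explains the initial calm (assuming roughly 6,000 tokens per file across all 2,182 files, both figures approximations from above):

```python
files = 2182
avg_tokens_per_file = 6000   # rough upper bound seen in the logs
price_per_million = 5.00     # approximate Azure OpenAI rate

total_tokens = files * avg_tokens_per_file
openai_cost = total_tokens / 1_000_000 * price_per_million
print(f"{total_tokens:,} tokens ≈ ${openai_cost:.2f}")
# → 13,092,000 tokens ≈ $65.46
```

Sixty-odd dollars in token charges looks perfectly manageable. The token meter wasn't the problem.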
But then came the real kicker: Document Intelligence processing alone cost me $673. That’s before factoring in compute, storage, and search costs.
A quick look at Azure Cost Management revealed that this project was on track to cost over $1,200 if left running for several more days. Lesson learned: document AI at scale isn’t cheap, especially with dense, scanned PDFs that are hard to parse.
The Chat Interface: Asking Who Killed JFK
Once ingestion finished, azd up printed a URL for the chat frontend—a containerized web app with the prompt: “Chat with your data.”
Naturally, the first question I asked was: “Who killed JFK?”
The response was nuanced, citing various documents (e.g., 104-10331-001, page 107) and offering references to theories involving the CIA, organized crime, Castro, and inconsistencies in the Warren Commission. It didn’t offer a definitive answer—but it did connect the dots in surprisingly insightful ways.
Further probing uncovered references to individuals like Jack Ruby (who killed Lee Harvey Oswald), Jim Braden (with alleged mafia ties), and Jean Daniel, a French journalist who carried a peace offer from JFK to Castro. The app cited documents containing possible discrepancies, indirect associations, and redactions — making for a compelling, if inconclusive, read.
Observations: Strengths and Shortcomings
While the chat interface performed well with focused questions — like “Who is Castellazzo?” — it struggled with broader queries like “List all names in all documents” or “Create a timeline of events.” These larger queries either timed out or only referenced a few documents, suggesting limits in indexing depth or response token caps.
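My guess (an assumption, since I didn't trace the sample's internals) is that this is inherent to retrieval-augmented chat: only the top few matching chunks ever reach the model, so a corpus-wide question like “list all names” can't see the whole corpus. A toy illustration with naive keyword scoring:

```python
def retrieve_top_k(query_terms: set[str], documents: list[str], k: int = 3) -> list[str]:
    """Naive keyword-overlap retrieval: score each document by how many
    query terms it contains and return only the k best matches.

    Real systems use vector similarity, but the limitation is the same:
    the model only ever sees k chunks, never the full corpus.
    """
    scored = sorted(
        documents,
        key=lambda doc: len(query_terms & set(doc.lower().split())),
        reverse=True,
    )
    return scored[:k]

docs = [f"memo {i} mentions witness {i}" for i in range(100)]
context = retrieve_top_k({"witness", "memo"}, docs)
# Only 3 of 100 documents reach the model, regardless of the question
```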
Additionally, not every document is clean. Many have handwritten notes, poor scans, or ambiguous formatting. Despite that, the OpenAI model could still pull out surprisingly granular insights, especially when anchored to a specific file or page.
The Verdict: Powerful, Expensive, and Worth Exploring
This project was a wild ride through Azure’s AI stack and America’s murkiest conspiracy. Despite some limitations — particularly around comprehensive summarization — the sample app proved it could turn massive, unstructured document dumps into a usable research tool.
Would I recommend doing this yourself? If you’ve got a spare $700 (and possibly more), sure. But otherwise, you might want to test it on a smaller subset of documents first.
Still, the idea of creating a living, searchable interface for historical records, corporate documents, or compliance files? That’s compelling. This stack of Document Intelligence, Azure AI Search, and Azure OpenAI is clearly powerful, if not yet turnkey or cheap.
Final Thoughts and Next Steps
I’ve made the app accessible temporarily so others can poke around. I’m also considering building an index or summary layer that catalogs each file’s content and page count to make the experience even smoother.
And if you figure out better prompts or a smarter way to parse conspiracy theories from scanned memos, I want to hear from you.
Until next time, this is the Azure Terraformer, signing off.