Part 5 - Creating Your Own AI-Powered Knowledge Base with Ollama

Now that you have your model up and running, it’s time to put it to work on something genuinely useful: a personal knowledge base Q&A system. Imagine having an AI assistant that can retrieve, synthesize, and explain information from your personal or professional documents, research papers, or any specialized content you care about.

The Core Challenge: Context Is Everything

Large language models like Llama 3.1 come pre-trained with vast general knowledge, but they truly shine when provided with specific context relevant to your questions. The key to an effective knowledge base system is getting the right information into your model’s context window.

Here’s our approach:

  1. Organize your knowledge sources
  2. Structure effective prompts
  3. Create a specialized model
  4. Build simple retrieval mechanisms

Let’s walk through each step to create a system that gives you accurate, insightful answers based on your specialized knowledge.

Organizing Your Knowledge

Before we start querying, we need to organize our information. Create a dedicated directory for your knowledge base:

mkdir -p ~/knowledge_base/documents

Place your text files, markdown documents, or text-extracted PDFs in this directory. The cleaner and more structured your documents, the better your results will be.

For best results:

  • Break large documents into smaller, topic-focused files
  • Use clear filenames that describe the content
  • Include headers and structured formatting where possible
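
If a long source file has clear top-level headings, GNU csplit can break it into topic-focused pieces automatically. The sketch below assumes markdown-style "# " headings and illustrative file names; adjust both for your own documents (the BSD csplit shipped with macOS lacks these long options):

```shell
# Split big_document.md into one file per top-level "# " heading.
mkdir -p split_docs
csplit --prefix split_docs/section_ --suffix-format '%02d.md' \
       --elide-empty-files big_document.md '/^# /' '{*}'
ls split_docs/
```

Each resulting section file can then be renamed to describe its topic and dropped into ~/knowledge_base/documents.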

Structuring Effective Prompts for Knowledge Retrieval

The magic of a good knowledge base system lies in how you structure your prompts. Here’s a template that works well:

cat << EOF > knowledge-prompt.txt
DOCUMENT: {{DOCUMENT_TEXT}}

Based on the information in the document above, please answer the following question.
If the answer cannot be found in the document, state that clearly rather than making up information.

QUESTION: {{QUERY}}
EOF

This template:

  1. Clearly separates the reference document from the query
  2. Instructs the model to only use provided information
  3. Prevents hallucinations by asking the model to acknowledge knowledge gaps
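
You can sanity-check the template by filling its placeholders with bash’s pattern substitution, the same mechanism the retrieval script uses; the document text and question below are just sample values:

```shell
# Substitute sample values into the {{DOCUMENT_TEXT}} and {{QUERY}} placeholders
PROMPT=$(cat knowledge-prompt.txt)
PROMPT=${PROMPT//\{\{DOCUMENT_TEXT\}\}/"The sky appears blue because of Rayleigh scattering."}
PROMPT=${PROMPT//\{\{QUERY\}\}/"Why is the sky blue?"}
echo "$PROMPT"
```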

Creating a Specialized Knowledge Assistant Model

Now let’s create a specialized model optimized for knowledge retrieval:

cat << EOF > KnowledgeAssistant
FROM llama3.1:latest
SYSTEM """You are a precise knowledge assistant. Your primary goal is to provide accurate information based solely on the documents provided to you. You should:
1. Focus only on the content in the provided documents
2. Cite specific sections when answering
3. Admit when you don't have enough information
4. Provide concise, well-structured answers
5. Never fabricate information"""
PARAMETER temperature 0.1
PARAMETER num_ctx 4096
EOF

ollama create knowledge-assistant -f KnowledgeAssistant

This model prioritizes accuracy over creativity with the low temperature setting and has a larger context window to accommodate your documents.

Building a Simple Document Retrieval System

Now let’s create a basic shell script that will:

  1. Take a query from the user
  2. Select a relevant document
  3. Feed both to our model
cat << EOF > query-knowledge.sh
#!/bin/bash

# Directory containing knowledge documents
KNOWLEDGE_DIR=~/knowledge_base/documents/climate/

# Get query from arguments
QUERY="\$*"
if [ -z "\$QUERY" ]; then
  echo "Please provide a query"
  exit 1
fi

# Simple keyword-based document selection (can be improved)
echo "Searching for relevant documents..."
RELEVANT_DOCS=\$(grep -li "\$QUERY" \$KNOWLEDGE_DIR/* 2>/dev/null)

if [ -z "\$RELEVANT_DOCS" ]; then
  echo "No directly relevant documents found. Using first 3 documents..."
  RELEVANT_DOCS=\$(ls \$KNOWLEDGE_DIR/* | head -n 3)
fi

# Process each relevant document
for DOC in \$RELEVANT_DOCS; do
  echo "Processing document: \$(basename \$DOC)"
  
  # Prepare the prompt with document content and query
  DOCUMENT_CONTENT=\$(cat "\$DOC")
  PROMPT=\$(cat knowledge-prompt.txt)
  PROMPT=\${PROMPT//\{\{DOCUMENT_TEXT\}\}/\$DOCUMENT_CONTENT}
  PROMPT=\${PROMPT//\{\{QUERY\}\}/\$QUERY}
  
  # Run the query through our knowledge assistant
  echo "Analyzing document content..."
  ollama run knowledge-assistant "\$PROMPT"
  echo -e "\n---\n"
done
EOF

chmod +x query-knowledge.sh

This script looks for documents whose text contains your query, then runs each matching document through your knowledge assistant model; if nothing matches, it falls back to the first three documents in the directory.

Download Some Sample Documents

Let’s populate our knowledge base with some climate change information from NASA:

mkdir -p ~/knowledge_base/documents/climate/
uvx --from inscriptis inscript https://science.nasa.gov/climate-change/causes/ > ~/knowledge_base/documents/climate/climate_change.txt

Note: We’re using uvx (part of the uv tool we installed earlier) to run the inscriptis `inscript` converter, which turns web pages into plain text without a permanent install.

Using Your Knowledge Base System

Now you can query your knowledge base with natural language questions:

./query-knowledge.sh "What are the key factors affecting climate change according to the latest report?"

The script will:

  1. Search for documents whose text contains your query (falling back to the first three documents if none match)
  2. Feed each relevant document to your knowledge assistant
  3. Return answers based strictly on the content of those documents

Example Use Case: A SOC2 Compliance Bot

Let’s look at a concrete example. Imagine you need to build a knowledge base about SOC2 compliance:

  1. Create a directory for SOC2 documents:
mkdir -p ~/knowledge_base/documents/soc2/
  2. Point the script at the new documents; the only change needed in query-knowledge.sh is the knowledge directory:
KNOWLEDGE_DIR=~/knowledge_base/documents/soc2/
  3. Download some SOC2 documentation:
uvx --from inscriptis inscript https://www.vanta.com/collection/soc-2/what-is-a-soc-2-audit > ~/knowledge_base/documents/soc2/soc2-audit.txt
uvx --from inscriptis inscript https://www.vanta.com/collection/soc-2/why-is-soc-2-important > ~/knowledge_base/documents/soc2/soc2-important.txt
uvx --from inscriptis inscript https://www.vanta.com/collection/soc-2/introduction > ~/knowledge_base/documents/soc2/soc.txt
uvx --from inscriptis inscript https://www.vanta.com/collection/soc-2/what-is-soc-2 > ~/knowledge_base/documents/soc2/soc2.txt
  4. Run a query about SOC2:
./query-knowledge.sh "What is a SOC2 Audit?"

The system will search through your documents, find relevant discussions about SOC2, and synthesize the findings.

Limitations and Improvement Opportunities

This simple system has some limitations:

  • Basic keyword matching for document retrieval
  • No semantic understanding of document relevance
  • Limited context window (even 4096 tokens can be restrictive)

For more advanced capabilities, consider:

  • Implementing vector embeddings for semantic search
  • Creating document chunks instead of using full documents
  • Building a simple RAG (Retrieval-Augmented Generation) system
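
One low-tech first step toward chunking is to pre-split every document into fixed-size pieces, so each query feeds only a focused slice to the model. A rough sketch using GNU split (the directory layout mirrors the climate example above; adjust paths to taste):

```shell
# Break each knowledge document into ~50-line chunks,
# named <original>_aa.txt, <original>_ab.txt, and so on.
mkdir -p ~/knowledge_base/chunks
for DOC in ~/knowledge_base/documents/climate/*.txt; do
  split -l 50 --additional-suffix=.txt "$DOC" \
    ~/knowledge_base/chunks/"$(basename "$DOC" .txt)"_
done
```

Pointing KNOWLEDGE_DIR at the chunks directory then lets grep select much smaller, more relevant contexts.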

Next Steps for Your Knowledge Base

As you grow more comfortable with your knowledge base system, you might want to:

  1. Improve document retrieval by incorporating tools like sentence-transformers
  2. Automate document processing with text extraction tools for PDFs and other formats
  3. Create a simple web interface using the Ollama API instead of command-line interaction
  4. Build specialized knowledge models for different domains or document types
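
For the web-interface idea, note that Ollama already exposes an HTTP API on port 11434 by default, so any frontend can post questions to it directly. A minimal sketch, assuming the knowledge-assistant model created above and a running Ollama server:

```shell
# Query the local Ollama REST API; "stream": false returns
# the complete answer as a single JSON object.
curl -s http://localhost:11434/api/generate -d '{
  "model": "knowledge-assistant",
  "prompt": "What is a SOC2 audit?",
  "stream": false
}'
```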

The system we’ve built gives you a solid foundation: a personal AI lab that can answer questions based on your own knowledge sources, all running locally on your machine without sharing your sensitive data with third-party services.

In our final post, we’ll wrap everything up and explore some additional possibilities for your Ollama-powered AI lab.
