Building a High-Precision Travel Guide: Optimizing RAG for a Jeju Island AI Chatbot
Introduction: The Limits of Vanilla LLMs
When building a tourism assistant for a specific location like Jeju Island, generic LLMs often fall short. They might hallucinate trail names, provide outdated cafe hours, or miss the subtle nuances of local documents. To solve this, I developed a RAG (Retrieval-Augmented Generation) based chatbot designed to provide reliable, document-grounded travel advice.
The Problem: Noise and Hallucination
In early prototypes, the system faced three major hurdles:
Semantic Mismatch: Standard vector search sometimes retrieved documents that were mathematically similar but contextually irrelevant.
Stagnant UX: Waiting for an entire paragraph to generate led to a jarring user experience, especially on mobile networks.
Document Complexity: Turning raw PDFs into a searchable database without losing the structural context was surprisingly difficult.
The Challenge: Improving Retrieval Precision
The core challenge was not just "finding" information, but "ranking" it correctly. If the top-3 retrieved chunks aren't the best ones, the LLM will provide a confident but incorrect answer. I needed a way to filter out the noise before it reached the prompt.
The Solution: A Multi-Stage RAG Pipeline
1. Robust Chunking and Embedding
Instead of simple splitting, I implemented a sliding window approach to preserve context. I used the BAAI/bge-m3 model for embeddings, known for its high performance in multilingual contexts.
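# Configuration for high-quality retrieval
CHUNK_SIZE = 600   # characters per chunk
OVERLAP = 150      # characters shared by consecutive chunks
RAG_MODEL_NAME = 'BAAI/bge-m3'

# Implementing the sliding window chunking
def get_chunks(text, size=CHUNK_SIZE, overlap=OVERLAP):
    # Advance by (size - overlap) so each chunk repeats the tail of the previous one
    return [text[i:i + size] for i in range(0, len(text), size - overlap)]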
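To make the window arithmetic concrete, here is a minimal, self-contained sketch of the same character-based sliding window run on a toy string. The name chunk_text and the input validation are my own additions for illustration, not part of the original pipeline.

```python
def chunk_text(text: str, size: int = 600, overlap: int = 150) -> list[str]:
    """Split text into character windows that share `overlap` characters.

    Hypothetical helper for illustration; it mirrors the get_chunks idea.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        # A non-positive step would make range() raise or yield nothing
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]


if __name__ == "__main__":
    sample = "abcdefghijklmnopqrstuvwxyz"
    # With size=10 and overlap=4 the window advances 6 characters at a time
    for piece in chunk_text(sample, size=10, overlap=4):
        print(piece)
```

On this toy input the last window is just "yz", which is already contained in the previous chunk. With this stepping scheme, short trailing fragments like that can end up in the index, so it may be worth filtering them before embedding.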
2. High-Precision Reranking (The "Secret Sauce")
To solve the "semantic mismatch" problem, I introduced a Cross-Encoder reranker (Dongjin-kr/ko-reranker). The system first retrieves 15 candidates via vector search (ChromaDB), then scores each query-chunk pair jointly with the reranker and keeps only the 3 highest-scoring chunks.
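# Two-stage retrieval logic
import numpy as np

# 1. Initial retrieval: fetch 15 candidates by vector similarity
results = collection.query(query_texts=[query_text], n_results=15)
fetched_chunks = results['documents'][0]  # documents for the single query
# 2. Reranking for precision: score every (query, chunk) pair
scores = rerank_model.predict([(query_text, chunk) for chunk in fetched_chunks])
sorted_indices = np.argsort(scores)[::-1]  # highest score first
final_context_chunks = [fetched_chunks[i] for i in sorted_indices[:3]]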
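The retrieve-then-rerank step can also be written as a small reusable function. The sketch below is an illustration under stated assumptions: rerank and word_overlap_scores are hypothetical names, and the word-overlap scorer is only a toy stand-in for the Cross-Encoder so the example runs without downloading a model.

```python
from typing import Callable, List, Sequence, Tuple


def rerank(
    query: str,
    candidates: Sequence[str],
    score_fn: Callable[[List[Tuple[str, str]]], Sequence[float]],
    top_k: int = 3,
) -> List[str]:
    """Return the top_k candidates ordered by score_fn, best first."""
    if not candidates:
        return []
    pairs = [(query, chunk) for chunk in candidates]
    scores = score_fn(pairs)
    # sorted() is stable, so tied candidates keep their vector-search order
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return [candidates[i] for i in order[:top_k]]


def word_overlap_scores(pairs: List[Tuple[str, str]]) -> List[float]:
    """Toy stand-in for a cross-encoder: counts shared lowercase words."""
    scores = []
    for query, chunk in pairs:
        shared = set(query.lower().split()) & set(chunk.lower().split())
        scores.append(float(len(shared)))
    return scores


if __name__ == "__main__":
    # Placeholder strings, not real travel data
    candidates = [
        "cafe opening hours in winter",
        "the trail start point and parking",
        "bus schedule to the airport",
        "coastal trail map with point markers",
    ]
    print(rerank("trail start point", candidates, word_overlap_scores, top_k=2))
```

In the real pipeline, the score_fn argument would presumably be the reranker's predict method, and candidates would be the 15 chunks returned by the vector search.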
3. Real-Time UX with FastAPI Streaming
To make the bot feel responsive, I used FastAPI's StreamingResponse combined with the stream=True option of the OpenAI chat completions API. This allows the UI to render the answer incrementally, fragment by fragment (roughly token by token), as it is generated.
@app.post("/chat")
async def chat(request: Request):
    # ... (retrieval logic)
    def generate():
        response = client.chat.completions.create(
            model=GPT_MODEL,
            messages=messages,
            stream=True  # Enabling real-time streaming
        )
        for chunk in response:
            content = chunk.choices[0].delta.content
            if content:
                yield content
    return StreamingResponse(generate(), media_type="text/event-stream")
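One detail worth noting: the endpoint declares text/event-stream but yields raw text fragments. A front end that reads the response body directly can consume that as-is, but a browser EventSource client expects each message framed as data: lines followed by a blank line. If that framing is needed, a helper along these lines could wrap the generator. This is a sketch, and to_sse is a hypothetical name, not part of the original code.

```python
from typing import Iterable, Iterator


def to_sse(tokens: Iterable[str]) -> Iterator[str]:
    """Wrap each text fragment in Server-Sent Events framing.

    Hypothetical helper: a fragment containing newlines is split across
    several data: lines, which an SSE client rejoins with a newline.
    """
    for token in tokens:
        if not token:
            continue  # skip empty fragments, as the endpoint does
        lines = token.split("\n")
        # One "data: ..." line per text line, then a blank line ends the event
        yield "".join(f"data: {line}\n" for line in lines) + "\n"


if __name__ == "__main__":
    for frame in to_sse(["Hello", " Jeju", "line1\nline2"]):
        print(repr(frame))
```

Under that assumption, the endpoint's return line would become StreamingResponse(to_sse(generate()), media_type="text/event-stream"). An SSE client strips only a single space after data:, so the leading spaces that often start LLM tokens survive.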
Key Takeaways
Reranking is Essential: Moving from simple similarity search to a reranked pipeline significantly reduced hallucinations by ensuring the LLM only sees the most relevant data.
Overlap Matters: With 600-character chunks, a 150-character overlap was the "sweet spot" for maintaining context between chunks, preventing the model from losing information at the split points.
Async/Streaming UX: In modern web apps, the perception of speed (streaming) is often more important than the actual execution time.
By combining ChromaDB, BGE-M3, and a Ko-Reranker, I successfully built a system that doesn't just "chat," but provides verified, high-quality information for travelers exploring the beauty of Jeju Island.


