Building a RAG System That Runs in a Local Environment

Introduction

Hello.
This is ishimaru.r from the Cloud Solutions Group.

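Since last year, "RAG" has become a keyword you hear very often. With that in mind, in this post I try out how to build a RAG system in a local environment, so that internal company information can be put to use without relying on cloud services.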

Seeing it in action

When you enter a question in the console, an answer is returned.


# The user's question is accepted on the CLI
📝 Your question: app crached

# The FAQ entry most similar to the question is displayed
✅ context:
Q: The app crashes on startup. What should I do?
A: Try reinstalling the app or updating your graphics drivers. If the issue persists, contact support with your log files.

# Timing statistics for the LLM's answer generation are displayed
Llama.generate: prefix-match hit
llama_print_timings:        load time =    1476.40 ms
llama_print_timings:      sample time =      20.82 ms /    77 runs   (    0.27 ms per token,  3698.01 tokens per second)
llama_print_timings: prompt eval time =     140.15 ms /     6 tokens (   23.36 ms per token,    42.81 tokens per second)
llama_print_timings:        eval time =    2119.35 ms /    76 runs   (   27.89 ms per token,    35.86 tokens per second)
llama_print_timings:       total time =    2440.70 ms /    82 tokens

# The answer generated by the LLM is displayed
✅ English Answer:
1. Try to reinstall the app
2. Check if you have the latest version of the graphics drivers installed
3. Log files for crashing issues can be found in your app's directory (usually at: /var/log/app_name.log)
4. Contact support for further assistance if you can't find the log files.

Processing overview

So how does it work? I prepared the following diagram of the processing flow.

The steps are as follows.

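When the app starts, the FAQ data prepared in advance is loaded and split into context entries.
An index is built from the split context entries. If an index has already been created, the existing one is loaded instead.
A retriever is prepared to search for the context closest to the question (here it returns only the single most similar entry).
The user enters a question.
The retriever runs a semantic search using the entered text.
A prompt is built from the context with the highest cosine similarity in the search results.
The prompt is passed to the LLM, which generates the answer.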
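
The retrieval and prompt-assembly steps above can be sketched without any libraries. The snippet below is only an illustration, not the code used in this post: `embed` is a hypothetical bag-of-words stand-in for the real embedding model, and the two FAQ strings are placeholder data.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_top1(query: str, contexts: list[str]) -> str:
    """Return the context whose embedding is most similar to the query."""
    query_vec = embed(query)
    return max(contexts, key=lambda c: cosine_similarity(query_vec, embed(c)))


def build_prompt(context: str, query: str) -> str:
    """Combine the retrieved context and the question into one prompt."""
    return f"Context:\n{context}\n\nQuestion:\n{query}\n\nAnswer:"


# Placeholder FAQ entries (hypothetical data for illustration only).
contexts = [
    "Q: How do I reset my password? A: Contact support.",
    "Q: How do I report a phishing email? A: Forward it to IT.",
]
best = retrieve_top1("report phishing email", contexts)
prompt = build_prompt(best, "report phishing email")
print(prompt)
```

A real embedding model replaces the word counts with dense vectors, which is what lets a misspelled or reworded question still land near the right entry; the top-1 selection and the prompt assembly stay the same.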

Running it

Preparing the models

Both models are downloaded from Hugging Face.

〇Embeddings Model: retrieves the relevant FAQ entry

intfloat/multilingual-e5-small

〇Large Language Model: generates the answer text

tinyllama-1.1b-chat-v1.0.Q4_0.gguf
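
A note on the embedding model listed above: the model card for intfloat/multilingual-e5-small says that input texts should start with `query: ` (for search queries) or `passage: ` (for indexed texts). The script in this post does not add these prefixes explicitly, and I have not verified whether the embedding wrapper adds them on its own. The following is a minimal sketch of how they could be applied by hand; `as_query` and `as_passage` are hypothetical helper names.

```python
def as_query(text: str) -> str:
    """Prefix a search query the way the E5 model card describes."""
    return f"query: {text.strip()}"


def as_passage(text: str) -> str:
    """Prefix a text to be indexed the way the E5 model card describes."""
    return f"passage: {text.strip()}"


# The question comes from the console session shown earlier in this post.
question = as_query("app crached")
# Placeholder FAQ text (hypothetical, for illustration only).
entry = as_passage("Q: Example question? A: Example answer.")
print(question)
print(entry)
```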

File layout

The files are laid out as follows.


- .devcontainer
    - devcontainer.json
    - Dockerfile
- docs
    - faq.txt
- pyproject.toml
- support_rag_llama.py

devcontainer.json

{
  "build": {
    "args": {
      "DEBIAN_VERSION": "bookworm",
      "UV_VERSION": "0.5.4"
    },
    "context": "..",
    "dockerfile": "Dockerfile"
  },
  "containerEnv": {
    "UV_PROJECT_ENVIRONMENT": "/home/vscode/.venv"
  },
  "customizations": {
    "vscode": {
      "extensions": [
        "charliermarsh.ruff",
        "ms-python.python"
      ]
    }
  },
  "features": {},
  "mounts": [
    "source=C:/llama_models,target=/llama_models,type=bind",
    "source=C:/embedding_models,target=/embedding_models,type=bind"
  ],
  "name": "rag-tinyllama-env",
  "postCreateCommand": "uv pip install --system --requirements pyproject.toml"
}

Dockerfile

FROM python:3.11-slim
RUN apt-get update && apt-get install -y \
    build-essential \
    cmake \
    git \
    curl \
    && rm -rf /var/lib/apt/lists/*
WORKDIR /workspaces/my-uv-project
RUN pip install setuptools
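# Install uv. Recent installer versions place the binary in ~/.local/bin,
# older ones in ~/.cargo/bin, so both directories are added to PATH.
RUN curl -LsSf https://astral.sh/uv/install.sh | sh && \
    echo 'export PATH="/root/.local/bin:/root/.cargo/bin:$PATH"' >> ~/.bashrc
ENV PATH="/root/.local/bin:/root/.cargo/bin:$PATH"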
RUN pip install scikit-build-core cmake ninja
COPY pyproject.toml .

pyproject.toml

[project]
name = "my-uv-project"
version = "0.1.0"
dependencies = [
    "llama-index==0.12.28",
    "llama-cpp-python==0.2.57",
    "llama-index-embeddings-huggingface==0.5.2",
    "transformers==4.38.2",
    "sentence-transformers==2.6.1",
    "torch==2.1.2",
    "huggingface-hub",
    "requests",
    "numpy"
]

[tool.uv.pip]
no-build-isolation = true

[build-system]
requires = ["scikit-build-core"]
build-backend = "scikit_build_core.build"

faq.txt

Q: How do I reset my password if I can't access my registered email?
A: If you do not have access to your registered email, please contact support. You may be asked to verify your identity with other information.

Q: Can I enable two-factor authentication for added security?
A: Yes, you can enable 2FA from your account security settings, which requires an authentication app such as Google Authenticator or Authy.

Q: What should I do if my account is suspended due to suspicious activity?
A: Contact support immediately. We will guide you through the authentication process to regain access.

Q: What should I do if I accidentally delete a shared file on my company drive?
A: Contact IT Support immediately. In most cases, we can restore files deleted within the last 30 days from our backup system.

Q: Can I install the software on my company laptop myself?
A: No, you are not allowed to install software without IT approval. You must apply for this through the software installation form.

Q: How do I report a suspicious email or phishing attempt?
A: Forward the email to phishing@yourcompany.com and delete it. Do not click on any links or open any attachments.
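
The blank line between entries in this file matters: the script below splits the file on empty lines, so each Q/A pair becomes one document. Here is a minimal stand-alone sketch of that splitting, using plain strings instead of llama-index `Document` objects. `split_faq_entries` is a hypothetical name, and the sample text is placeholder data.

```python
def split_faq_entries(content: str) -> list[str]:
    """Split FAQ text into entries separated by blank lines."""
    return [entry.strip() for entry in content.strip().split("\n\n") if entry.strip()]


# Placeholder FAQ text in the same layout as faq.txt.
sample = (
    "Q: First question?\n"
    "A: First answer.\n"
    "\n"
    "Q: Second question?\n"
    "A: Second answer.\n"
)
entries = split_faq_entries(sample)
for entry in entries:
    print(repr(entry))
```

Because `split("\n\n")` looks for two consecutive newline characters, a separator line that contains only spaces would not split the entries, so the separator lines need to be truly empty.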

support_rag_llama.py

from pathlib import Path
from llama_index.core import VectorStoreIndex, StorageContext, load_index_from_storage, Document
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core.settings import Settings
from llama_cpp import Llama

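# Local GGUF model used for answer generation (bind-mounted via devcontainer.json)
MODEL_PATH = "/llama_models/tinyllama-1.1b-chat-v1.0.Q4_0.gguf"
llm = Llama(model_path=MODEL_PATH, n_ctx=2048)

# Disable llama-index's own LLM; generation is done with llama_cpp directly
Settings.llm = None
PERSIST_DIR = "./storage_txt"
# Local copy of the embedding model used for FAQ retrieval
EMBED_MODEL = HuggingFaceEmbedding(model_name="/embedding_models/intfloat-multilingual-e5-small")

def load_faq_documents(path: str):
    """Read the FAQ file and return one Document per blank-line-separated entry."""
    with open(path, "r", encoding="utf-8") as f:
        content = f.read()
    return [Document(text=entry.strip()) for entry in content.strip().split("\n\n") if entry.strip()]

# Build the index on the first run; afterwards reuse the persisted one
# (delete PERSIST_DIR to re-index after editing faq.txt)
if not Path(PERSIST_DIR).exists():
    print("📂 Indexing FAQ...")
    documents = load_faq_documents("docs/faq.txt")
    index = VectorStoreIndex.from_documents(documents, embed_model=EMBED_MODEL)
    index.storage_context.persist(persist_dir=PERSIST_DIR)
else:
    print("📦 Loading existing index...")
    storage_context = StorageContext.from_defaults(persist_dir=PERSIST_DIR)
    index = load_index_from_storage(storage_context, embed_model=EMBED_MODEL)

# Retrieve only the single most similar FAQ entry
retriever = index.as_retriever(similarity_top_k=1)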

print("\n💬 FAQ Search + TinyLlama Generator (English only)")
print("Type 'exit' to quit.")

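while True:
    # Strip surrounding whitespace so that " exit " also quits
    query = input("\n📝 Your question: ").strip()
    if query.lower() == "exit":
        break

    # Look up the most similar FAQ entry and use it as the context
    nodes = retriever.retrieve(query)
    context = nodes[0].text if nodes else "No relevant FAQ found."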

    print("\n✅ context:")
    print(context)

    prompt = f"""You are a helpful assistant. Given the following context and question, provide a concise and accurate answer.

Context:
{context}

Question:
{query}

Answer:"""

    output = llm(prompt, max_tokens=150, echo=False)
    english_answer = output["choices"][0]["text"].strip()

    print("\n✅ English Answer:")
    print(english_answer)

Where I got stuck

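When I deleted the container I had created and tried to bring it up again, the latest version of each library was fetched, and the resulting incompatibilities kept the program from running. Dependencies should be kept under proper control with a package manager so that this does not happen.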
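
One way to notice this kind of version drift early is to compare the installed versions with the pinned ones at startup. The following is a minimal sketch using only the standard library. `parse_pin` and `find_mismatches` are hypothetical helpers, and the pins passed in are taken from the `pyproject.toml` shown earlier.

```python
from importlib.metadata import PackageNotFoundError, version


def parse_pin(requirement: str) -> tuple[str, str]:
    """Split an exact pin such as 'torch==2.1.2' into (name, version)."""
    name, _, pinned = requirement.partition("==")
    return name.strip(), pinned.strip()


def find_mismatches(requirements: list[str]) -> list[str]:
    """Return messages for packages that are missing or differ from their pin."""
    problems = []
    for requirement in requirements:
        name, pinned = parse_pin(requirement)
        if not pinned:
            continue  # unpinned dependency: nothing to compare against
        try:
            installed = version(name)
        except PackageNotFoundError:
            problems.append(f"{name}: not installed (expected {pinned})")
            continue
        if installed != pinned:
            problems.append(f"{name}: installed {installed}, expected {pinned}")
    return problems


pins = ["llama-index==0.12.28", "llama-cpp-python==0.2.57", "torch==2.1.2"]
for message in find_mismatches(pins):
    print(message)
```

Dependencies listed without a version, such as `huggingface-hub`, `requests`, and `numpy` in the `pyproject.toml` above, are skipped by this check and can still change between rebuilds.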

Open issues

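In my tests in English, the system appears to return the intended answers. However, when the question is rephrased or contains less information, unintended results come back. Improving the indexing step should help with this. In addition, answer accuracy needs to be raised by switching to an LLM that supports Japanese and improving the prompt. I plan to look into these issues on another occasion.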
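
As one possible stopgap for short or vague questions (an assumption on my part, not something verified in this post), the retrieved entry could be discarded when its similarity score is too low, instead of always passing the top hit to the LLM. The sketch below works on plain `(text, score)` pairs. `pick_context` and the `MIN_SCORE` value are hypothetical, and the threshold would need tuning against the real embedding model.

```python
from typing import Optional

# Hypothetical threshold; the right value depends on the embedding model.
MIN_SCORE = 0.8


def pick_context(hits: list[tuple[str, float]], min_score: float) -> Optional[str]:
    """Return the best-scoring text, or None when no hit reaches min_score."""
    if not hits:
        return None
    text, score = max(hits, key=lambda hit: hit[1])
    return text if score >= min_score else None


# Placeholder retrieval results as (text, similarity score) pairs.
hits = [("FAQ entry A", 0.62), ("FAQ entry B", 0.41)]
context = pick_context(hits, MIN_SCORE) or "No relevant FAQ found."
print(context)
```

This does not make retrieval itself any better; it only keeps an unrelated FAQ entry from being handed to the LLM as if it were relevant.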
