PageIndex


Preface

No vector database, no traditional chunking: PageIndex retrieves documents with LLM reasoning.

Pain points of traditional RAG:

  • Chunking breaks context: an arbitrary cut can sever a passage from the text it depends on
  • Similar ≠ relevant: semantically similar text is not necessarily the answer you need
  • Vector databases are costly: they must be built and then maintained
  • Answers are hard to trace: it is difficult to tell where an answer actually came from


PageIndex takes a completely different approach:

  • Hierarchical tree index: like a semantic table of contents for the document
  • LLM-reasoning navigation: the model "flips through" the document the way a person would
  • Preserved document structure: the original logical relationships stay intact
  • Traceable retrieval: every reasoning step is visible
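As a rough illustration of what such a hierarchical index looks like, here is a tiny hand-built tree using the same field names (`title`, `node_id`, `summary`, `nodes`) that appear in the simplified tree printed later in this walkthrough. The tree and the outline printer below are only a sketch for intuition, not part of the PageIndex SDK:

```python
# A minimal hand-built example of a hierarchical tree index. Each node
# carries an ID, a title, an LLM-written summary, and optional children.
example_tree = [{
    "title": "Example Paper",
    "node_id": "0000",
    "nodes": [
        {"title": "Abstract", "node_id": "0001",
         "summary": "Introduces the two models studied..."},
        {"title": "1. Introduction", "node_id": "0002",
         "summary": "Recent advances in the field...",
         "nodes": [
             {"title": "1.1. Contributions", "node_id": "0003",
              "summary": "Outlines the main contributions..."},
         ]},
    ],
}]

def print_outline(nodes, depth=0):
    """Walk the tree recursively and print an indented outline,
    like a table of contents."""
    for node in nodes:
        print("  " * depth + f"{node['node_id']}  {node['title']}")
        print_outline(node.get("nodes", []), depth + 1)

print_outline(example_tree)
```

The summaries are what lets an LLM later decide which branches are worth descending into, without reading the full text of every section.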

Usage

Step 0: Setup

  1. Install PageIndex

    pip install -q --upgrade pageindex
  2. Set up the PageIndex client

    from pageindex import PageIndexClient
    import pageindex.utils as utils
    # Get your API key from https://dash.pageindex.ai/api-keys
    PAGEINDEX_API_KEY = "YOUR_PAGEINDEX_API_KEY"
    pi_client = PageIndexClient(api_key=PAGEINDEX_API_KEY)
  3. Set up the LLM

    import openai
    OPENAI_API_KEY = "YOUR_OPENAI_API_KEY"
    async def call_llm(prompt, model="gpt-4.1", temperature=0):
        """Call the LLM and return the generated response."""
        client = openai.AsyncOpenAI(api_key=OPENAI_API_KEY)
        response = await client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=temperature,
        )
        return response.choices[0].message.content.strip()

Step 1: Generate the PageIndex Tree

  1. Submit the document to generate a PageIndex tree

    import os, requests
    # You can also generate the PageIndex tree locally with the GitHub repo:
    # https://github.com/VectifyAI/PageIndex
    # Download a sample PDF
    pdf_url = "https://arxiv.org/pdf/2501.12948.pdf"
    pdf_path = os.path.join("../data", pdf_url.split('/')[-1])
    os.makedirs(os.path.dirname(pdf_path), exist_ok=True)
    response = requests.get(pdf_url)
    with open(pdf_path, "wb") as f:
        f.write(response.content)
    print(f"Downloaded {pdf_url}")
    # Submit the document and get its document ID
    doc_id = pi_client.submit_document(pdf_path)["doc_id"]
    print('Document Submitted:', doc_id)

Output:

Downloaded https://arxiv.org/pdf/2501.12948.pdf
Document Submitted: pi-cmeseq08w00vt0bo3u6tr244g
  2. Retrieve the generated PageIndex tree structure

    # Check whether the document has finished processing
    if pi_client.is_retrieval_ready(doc_id):
        # Get the tree structure (including node summaries)
        tree = pi_client.get_tree(doc_id, node_summary=True)['result']
        print('Simplified Tree Structure of the Document:')
        utils.print_tree(tree)
    else:
        print("Processing document, please try again later...")

Output:

Simplified Tree Structure of the Document:
[{'title': 'DeepSeek-R1: Incentivizing Reasoning Cap...', 'node_id': '0000', 'prefix_summary': '# DeepSeek-R1: Incentivizing Reasoning C...',
  'nodes': [{'title': 'Abstract', 'node_id': '0001', 'summary': 'The partial document introduces two reas...'},
            {'title': 'Contents', 'node_id': '0002', 'summary': 'This partial document provides a detaile...'},
            {'title': '1. Introduction', 'node_id': '0003', 'prefix_summary': 'The partial document introduces recent a...',
             'nodes': [{'title': '1.1. Contributions', 'node_id': '0004', 'summary': 'This partial document outlines the main ...'},
                       {'title': '1.2. Summary of Evaluation Results', 'node_id': '0005', 'summary': 'The partial document provides a summary ...'}]},
            {'title': '2. Approach', 'node_id': '0006', 'prefix_summary': '## 2. Approach\n',
             'nodes': [{'title': '2.1. Overview', 'node_id': '0007', 'summary': '### 2.1. Overview\n\nPrevious work has hea...'},
                       {'title': '2.2. DeepSeek-R1-Zero: Reinforcement Lea...', 'node_id': '0008', 'prefix_summary': '### 2.2. DeepSeek-R1-Zero: Reinforcement...',
                        'nodes': [{'title': '2.2.1. Reinforcement Learning Algorithm', 'node_id': '0009', 'summary': 'The partial document describes the Group...'},
                                  {'title': '2.2.2. Reward Modeling', 'node_id': '0010', 'summary': 'This partial document discusses the rewa...'},
                                  {'title': '2.2.3. Training Template', 'node_id': '0011', 'summary': '#### 2.2.3. Training Template\n\nTo train ...'},
                                  {'title': '2.2.4. Performance, Self-evolution Proce...', 'node_id': '0012', 'summary': 'This partial document discusses the perf...'}]},
                       {'title': '2.3. DeepSeek-R1: Reinforcement Learning...', 'node_id': '0013', 'summary': 'This partial document describes the trai...'},
                       {'title': '2.4. Distillation: Empower Small Models ...', 'node_id': '0014', 'summary': 'This partial document discusses the proc...'}]},
            {'title': '3. Experiment', 'node_id': '0015', 'prefix_summary': 'The partial document describes the exper...',
             'nodes': [{'title': '3.1. DeepSeek-R1 Evaluation', 'node_id': '0016', 'summary': 'This partial document presents a compreh...'},
                       {'title': '3.2. Distilled Model Evaluation', 'node_id': '0017', 'summary': 'This partial document presents an evalua...'}]},
            {'title': '4. Discussion', 'node_id': '0018', 'summary': 'This partial document discusses the comp...'},
            {'title': '5. Conclusion, Limitations, and Future W...', 'node_id': '0019', 'summary': 'This partial document presents the concl...'},
            {'title': 'References', 'node_id': '0020', 'summary': 'This partial document consists of the re...'},
            {'title': 'Appendix', 'node_id': '0021', 'summary': '## Appendix\n'},
            {'title': 'A. Contributions and Acknowledgments', 'node_id': '0022', 'summary': 'This partial document section details th...'}]}]
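Tree generation is asynchronous, so `is_retrieval_ready` may return `False` for a while after submission. A minimal generic polling helper can wrap the check; note that `wait_until` and its parameters are an illustrative sketch, not part of the PageIndex SDK:

```python
import time

def wait_until(is_ready, timeout=120, interval=2):
    """Poll an `is_ready()` callable until it returns True or `timeout`
    seconds elapse. Returns True if the condition was met, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if is_ready():
            return True
        time.sleep(interval)
    return False

# With the client from this walkthrough, usage would look like:
# if wait_until(lambda: pi_client.is_retrieval_ready(doc_id)):
#     tree = pi_client.get_tree(doc_id, node_summary=True)['result']
```

This keeps the notebook code free of manual re-running while the document is still being indexed.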

Step 2: Reasoning-Based Tree Search Retrieval

  1. Use the LLM to search the tree and identify nodes likely to contain relevant context

    import json
    # User question
    query = "What are the conclusions in this document?"
    # Remove the text fields, keeping only structure and summaries for the search
    tree_without_text = utils.remove_fields(tree.copy(), fields=['text'])
    # Build the search prompt
    search_prompt = f"""You are given a question and a tree structure of a document.
    Each node contains a node id, node title, and a corresponding summary.
    Your task is to find all nodes that are likely to contain the answer to the question.

    Question: {query}

    Document tree structure:
    {json.dumps(tree_without_text, indent=2)}

    Please reply in the following JSON format:
    {{
        "thinking": "<Your thinking process on which nodes are relevant to the question>",
        "node_list": ["node_id_1", "node_id_2", ..., "node_id_n"]
    }}

    Directly return the final JSON structure. Do not output anything else."""
    # Call the LLM to perform the tree search
    tree_search_result = await call_llm(search_prompt)
  2. Print the retrieved nodes and the reasoning process

    # Create a mapping from node ID to node
    node_map = utils.create_node_mapping(tree)
    tree_search_result_json = json.loads(tree_search_result)
    # Print the reasoning process
    print('Reasoning Process:')
    utils.print_wrapped(tree_search_result_json['thinking'])
    # Print the retrieved nodes
    print('\nRetrieved Nodes:')
    for node_id in tree_search_result_json["node_list"]:
        node = node_map[node_id]
        print(f"Node ID: {node['node_id']}\t Page: {node['page_index']}\t Title: {node['title']}")

Output:

Reasoning Process:
The question asks for the conclusions in the document. Typically, conclusions are found in sections
explicitly titled 'Conclusion' or in sections summarizing the findings and implications of the work.
In this document tree, node 0019 ('5. Conclusion, Limitations, and Future Work') is the most
directly relevant, as it is dedicated to the conclusion and related topics. Additionally, the
'Abstract' (node 0001) may contain a high-level summary that sometimes includes concluding remarks,
but it is less likely to contain the full conclusions. Other sections like 'Discussion' (node 0018)
may discuss implications but are not explicitly conclusions. Therefore, the primary node is 0019.

Retrieved Nodes:
Node ID: 0019  Page: 16  Title: 5. Conclusion, Limitations, and Future Work
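The `json.loads` call above assumes the model returns bare JSON, as the prompt requests. In practice, models sometimes wrap their reply in Markdown code fences, so a small defensive parser can help. This is a sketch, not part of the PageIndex SDK:

```python
import json
import re

def parse_llm_json(text):
    """Parse JSON from an LLM reply, tolerating Markdown code fences.

    Tries the raw text first, then falls back to stripping a
    ```json ... ``` fence if one is present.
    """
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        match = re.search(r"```(?:json)?\s*(.*?)\s*```", text, re.DOTALL)
        if match:
            return json.loads(match.group(1))
        raise

# e.g. parse_llm_json(tree_search_result)["node_list"]
```

Dropping this in place of the bare `json.loads` makes the tree-search step more robust to formatting drift across models.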

Step 3: Generate the Answer

  1. Extract the relevant context from the retrieved nodes

    # Get the list of node IDs
    node_list = json.loads(tree_search_result)["node_list"]
    # Concatenate the text of all relevant nodes
    relevant_content = "\n\n".join(node_map[node_id]["text"] for node_id in node_list)
    # Print the context (first 1000 characters)
    print('Retrieved Context:\n')
    utils.print_wrapped(relevant_content[:1000] + '...')

Output:

Retrieved Context:

## 5. Conclusion, Limitations, and Future Work

In this work, we share our journey in enhancing model reasoning abilities through reinforcement
learning. DeepSeek-R1-Zero represents a pure RL approach without relying on cold-start data,
achieving strong performance across various tasks. DeepSeek-R1 is more powerful, leveraging
cold-start data alongside iterative RL fine-tuning. Ultimately, DeepSeek-R1 achieves performance
comparable to OpenAI-o1-1217 on a range of tasks.

We further explore distillation the reasoning capability to small dense models. We use DeepSeek-R1
as the teacher model to generate 800K training samples, and fine-tune several small dense models.
The results are promising: DeepSeek-R1-Distill-Qwen-1.5B outperforms GPT-4o and Claude-3.5-Sonnet on
math benchmarks with $28.9 \%$ on AIME and $83.9 \%$ on MATH. Other dense models also achieve
impressive results, significantly outperforming other instruction-tuned models based on the same
underlying checkpoints.

In the fut...
  2. Generate the answer from the retrieved context

    # Build the answer-generation prompt
    answer_prompt = f"""Answer the question based on the context:

    Question: {query}

    Context: {relevant_content}

    Provide a clear, concise answer based only on the context provided."""
    # Generate the answer
    print('Generated Answer:\n')
    answer = await call_llm(answer_prompt)
    utils.print_wrapped(answer)

Output:

Generated Answer:

The conclusions in this document are:

- DeepSeek-R1-Zero, a pure reinforcement learning (RL) approach without cold-start data, achieves
strong performance across various tasks.
- DeepSeek-R1, which combines cold-start data with iterative RL fine-tuning, is more powerful and
achieves performance comparable to OpenAI-o1-1217 on a range of tasks.
- Distilling DeepSeek-R1's reasoning capabilities into smaller dense models is promising; for
example, DeepSeek-R1-Distill-Qwen-1.5B outperforms GPT-4o and Claude-3.5-Sonnet on math benchmarks,
and other dense models also show significant improvements over similar instruction-tuned models.

These results demonstrate the effectiveness of the RL-based approach and the potential for
distilling reasoning abilities into smaller models.
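Because retrieval here operates over named tree nodes rather than anonymous chunks, the answer can be traced back to specific sections and pages. A small sketch of attaching that provenance to a generated answer (the `with_provenance` helper is illustrative, not part of the PageIndex SDK; it only assumes nodes with the `node_id`, `page_index`, and `title` fields seen above):

```python
def with_provenance(answer, nodes):
    """Append a source list (node ID, page, title) to an answer string,
    so every claim can be traced back to the section it came from."""
    lines = [answer, "", "Sources:"]
    for node in nodes:
        lines.append(
            f"- node {node['node_id']}, page {node['page_index']}: {node['title']}"
        )
    return "\n".join(lines)

retrieved = [{"node_id": "0019", "page_index": 16,
              "title": "5. Conclusion, Limitations, and Future Work"}]
print(with_provenance("The document concludes that ...", retrieved))
```

In the walkthrough above, the `nodes` argument would be `[node_map[nid] for nid in node_list]`.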

Comparison with Other Approaches

  • PageIndex vs. traditional RAG

    • Strengths: preserves structure, traceable, high precision
    • Weaknesses: single-document scope, slow index construction, higher cost
  • PageIndex vs. GraphRAG

    • GraphRAG also tries to preserve document relationships, but it uses a knowledge graph
    • PageIndex's tree structure is simpler and more direct, while GraphRAG is stronger at cross-document reasoning
  • PageIndex vs. reranking approaches

    • A reranker adds a fine-ranking layer after vector retrieval
    • PageIndex is an entirely different retrieval paradigm

The approaches can actually be combined: vector search for coarse filtering + PageIndex for close reading + a reranker for ordering.

A hybrid approach:

  1. Stage one: use vector RAG for document discovery, quickly narrowing thousands of documents down to the 5-10 most relevant
  2. Stage two: run PageIndex on those candidates for precise extraction, to guarantee answer quality
  3. Choose by scenario: use PageIndex directly for deep single-document analysis, and traditional RAG for large-scale retrieval
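The two-stage idea can be sketched as a small pipeline. Both stage functions below are stubs: `vector_search` and `pageindex_answer` are hypothetical names, and in a real system the first would query a vector store while the second would run the tree-search retrieval from the walkthrough above.

```python
def vector_search(query, corpus, top_k=5):
    """Stage 1 (stub): coarse document discovery.

    Stands in for a vector-store query; here it just scores documents
    by naive word overlap with the query.
    """
    words = set(query.lower().split())
    scored = [(len(words & set(doc.lower().split())), doc_id)
              for doc_id, doc in corpus.items()]
    scored.sort(reverse=True)
    return [doc_id for score, doc_id in scored[:top_k] if score > 0]

def pageindex_answer(query, doc_id):
    """Stage 2 (stub): precise extraction from one candidate document."""
    return f"answer to {query!r} extracted from {doc_id}"

def hybrid_answer(query, corpus, top_k=5):
    """Vector search for discovery, then PageIndex-style reading per candidate."""
    candidates = vector_search(query, corpus, top_k=top_k)
    return {doc_id: pageindex_answer(query, doc_id) for doc_id in candidates}

corpus = {
    "doc-rl": "reinforcement learning for reasoning models",
    "doc-db": "vector database maintenance guide",
}
print(hybrid_answer("reasoning with reinforcement learning", corpus, top_k=1))
```

The design point is simply that the expensive, high-precision reader only ever sees the handful of documents the cheap, coarse stage lets through.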
