Using RAG and Web Scraping to Help an LLM Answer IPL Questions Accurately

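Large language models (LLMs) are capable in many domains, but stale knowledge is a common problem. For example, when a Mistral model running locally on Ollama is asked "Who won the 2024 IPL?", it may reply with an outdated answer such as "Chennai Super Kings won the IPL in 2018." This article looks at how RAG (Retrieval-Augmented Generation), combined with web scraping of live web data, can correct such answers so that the model responds accurately to questions about the IPL.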

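RAG: Bridging the Gap Between an LLM's Knowledge and the Real World

RAG (Retrieval-Augmented Generation) combines information retrieval with text generation. The core idea is to first retrieve documents relevant to the user's query from an external data source, then pass those documents to the LLM as context so that it can generate a more accurate and complete answer.

RAG consists of two main steps:

Retrieval: fetch documents relevant to the user's query from an external data source (websites, databases, document stores, and so on). Retrieval methods include keyword-based search and semantic search, among others.
Augmented Generation: pass the retrieved documents, together with the user's query, to the LLM as context. The LLM generates its answer from that context.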
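
The two steps can be sketched in plain Java. This is a minimal illustration under stated assumptions, not the article's implementation: the class RagSketch and its methods are hypothetical names, and simple keyword overlap stands in for the semantic retrieval a real vector store would perform. The prompt wording mirrors the one used by the controller later in this article.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

// Illustrative sketch only: keyword overlap stands in for real semantic retrieval.
final class RagSketch {

    // Lower-case a text and split it into a set of word tokens.
    static Set<String> tokens(String text) {
        return Arrays.stream(text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+"))
                .filter(t -> !t.isEmpty())
                .collect(Collectors.toSet());
    }

    // Count how many distinct tokens of the document also occur in the query.
    static long overlap(Set<String> queryTokens, String doc) {
        return tokens(doc).stream().filter(queryTokens::contains).count();
    }

    // Retrieval: rank documents by token overlap with the query and keep the top k.
    static List<String> retrieve(String query, List<String> docs, int k) {
        Set<String> queryTokens = tokens(query);
        return docs.stream()
                .sorted(Comparator.comparingLong((String d) -> overlap(queryTokens, d)).reversed())
                .limit(k)
                .collect(Collectors.toList());
    }

    // Augmentation: place the retrieved text in front of the user's query.
    static String buildPrompt(String query, List<String> context) {
        return "Use the following context to answer the query:\n\n"
                + String.join("\n", context)
                + "\n\nQuery: " + query;
    }

    public static void main(String[] args) {
        List<String> docs = List.of(
                "2023 | Chennai Super Kings | Gujarat Titans",
                "Ollama runs Mistral locally.");
        String query = "Who won the IPL in 2023?";
        // The resulting prompt is what would be sent to the LLM.
        System.out.println(buildPrompt(query, retrieve(query, docs, 1)));
    }
}
```

The generation step itself is left out: in a real system the returned prompt string would be sent to the model.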

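RAG addresses the stale-knowledge problem and improves the accuracy and reliability of answers. Because relevant information is retrieved and supplied at query time, the model can answer from current data without being retrained; the answers are only as fresh as the data source behind the retriever.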

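Building a Live IPL Winner Lookup: RAG in Practice

The example in this article shows how to build a system that answers questions about IPL winners from live web data. It uses RAG together with Spring Boot, Spring AI, JSoup, and Ollama to scrape a list of IPL winners from a web page, store it, and generate answers to user queries.

The goal is a system that can:

Scrape a live web page that lists every IPL winner and runner-up by year
Store that information semantically in a vector store
Answer user queries such as "Who won the IPL in 2023?" or "Which team was the runner-up in 2016?"

The system uses the following stack:

Spring Boot: builds the REST API and backend service.
Spring AI: simplifies interaction with the LLM and provides components such as ChatClient and VectorStore.
JSoup: scrapes the IPL winner and runner-up data from the web page.
In-memory vector store (SimpleVectorStore): stores the scraped data and supports fast semantic search.
Ollama: runs the Mistral model locally.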
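
To make "semantic search" more concrete, the sketch below shows one common way a vector store can compare a query embedding with a stored embedding: cosine similarity. This is an assumption for illustration only. The article does not show how SimpleVectorStore ranks results, and the class name CosineSketch and the two-dimensional vectors are hypothetical (real embeddings have hundreds or thousands of dimensions).

```java
// Illustrative sketch: ranking embeddings by cosine similarity.
final class CosineSketch {

    // Cosine similarity = dot(a, b) / (|a| * |b|).
    // 1.0 means the vectors point the same way, 0.0 means they are orthogonal.
    // A zero vector has no direction, so the result is NaN in that case.
    static double cosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        double[] query = {1.0, 0.0};
        System.out.println(cosine(query, new double[] {2.0, 0.0})); // same direction
        System.out.println(cosine(query, new double[] {0.0, 3.0})); // orthogonal
    }
}
```

A store that works this way embeds the query, scores every stored chunk against it, and returns the highest-scoring chunks.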

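System Design: Step by Step

The system has two core steps:

Step 1: Scrape the Web Page with JSoup

WebScrapperReader is a Spring @Component that scrapes IPL winner data from a web page, splits it into chunks, and stores it in the vector store (an in-memory VectorStore) at application startup.

It works as follows:

Use JSoup to connect to the Jagran Josh IPL page and parse the HTML.
Select the first HTML table, which lists IPL winners and runners-up by year.
Iterate over the table rows to extract:
- The headers (for example, year, winner, runner-up)
- The data row for each IPL season

Each row is converted into a human-readable line of text, like this:

2023 | Chennai Super Kings | Gujarat Titans | ...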
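
The row format above can be sketched without JSoup by treating each row as a list of cell strings. This is a minimal sketch under stated assumptions: RowFormatSketch, joinRow, and labelRow are hypothetical names, and labelRow is a variant not used in the article that pairs each cell with its column header so a row stays understandable when read on its own.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: turning the cell texts of one table row into a line of text.
final class RowFormatSketch {

    // Plain join that produces the "a | b | c" layout, without a trailing separator.
    static String joinRow(List<String> cells) {
        return String.join(" | ", cells);
    }

    // Variant: prefix every cell with its column header.
    // Cells without a matching header get a generic "Column N" label.
    static String labelRow(List<String> headers, List<String> cells) {
        List<String> parts = new ArrayList<>();
        for (int i = 0; i < cells.size(); i++) {
            String header = i < headers.size() ? headers.get(i) : "Column " + (i + 1);
            parts.add(header + ": " + cells.get(i));
        }
        return String.join(", ", parts);
    }

    public static void main(String[] args) {
        List<String> headers = List.of("Year", "Winner", "Runner-up");
        List<String> cells = List.of("2023", "Chennai Super Kings", "Gujarat Titans");
        System.out.println(joinRow(cells));
        System.out.println(labelRow(headers, cells));
    }
}
```

Labeled rows can help retrieval because a question such as "Which team was the runner-up in 2016?" shares the word "runner-up" with the stored text.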

The code is as follows:

package com.example.RAG;

import jakarta.annotation.PostConstruct;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.springframework.ai.document.Document;
import org.springframework.ai.transformer.splitter.TokenTextSplitter;
import org.springframework.ai.vectorstore.SimpleVectorStore;
import org.springframework.ai.vectorstore.VectorStore;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Component;
import java.io.File;
import java.io.IOException;
import java.util.List;

@Component
public class WebScrapperReader {

@Autowired
private VectorStore vectorStore;

@PostConstruct
public void init() {
    try {
        String scrapedContext = fetchIPLWinnerList();
        System.out.println("Web scraping done at startup.");
        // Split and store scraped data into the VectorStore
        var textSplitter = new TokenTextSplitter();
        List<Document> documents = textSplitter.apply(List.of(new Document(scrapedContext)));
        vectorStore.accept(documents);
        ((SimpleVectorStore) vectorStore).save(new File("webscraper_vectorstore.json"));
    } catch (IOException e) {
        System.err.println("Failed to fetch IPL winner data: " + e.getMessage());
    }
}

private String fetchIPLWinnerList() throws IOException {
    org.jsoup.nodes.Document doc =
            Jsoup.connect("https://www.jagranjosh.com/general-knowledge/list-of-all-ipl-winner-teams-1527686257-1")
                    .get();
    // Select the first table on the page
    Element table = doc.select("table").first();
    StringBuilder builder = new StringBuilder();
    if (table != null) {
        Elements rows = table.select("tr");
        for (Element row : rows) {
            // Handle header or data rows
            Elements headers = row.select("th");
            if (!headers.isEmpty()) {
                for (Element header : headers) {
                    builder.append(header.text()).append(" | ");
                }
            } else {
                Elements cols = row.select("td");
                for (Element col : cols) {
                    builder.append(col.text()).append(" | ");
                }
            }
            builder.append("\n");
        }
    } else {
        builder.append("No table found on the page.");
    }
    System.out.println(builder.toString());
    return "The winner of IPL year wise and runner up year wise " + builder;
}

}


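The init method is annotated with @PostConstruct, so scraping and storage run automatically when the application starts. The scraped text is first split into smaller chunks and then stored in the VectorStore for semantic search; the store is also saved to webscraper_vectorstore.json.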
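
TokenTextSplitter splits by token count, so depending on its settings a short table may end up in a single chunk. One alternative, sketched below, is to create one chunk per table row and repeat the header line in each. This is not what the article's code does: RowChunkSketch and chunkByRow are hypothetical names, and the sketch assumes the first non-empty line of the scraped text is the header row.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

// Illustrative sketch: one chunk per table row instead of token-based splitting.
final class RowChunkSketch {

    // Split scraped text into one chunk per data row.
    // Assumption: the first non-empty line is the header row; it is repeated in every chunk.
    static List<String> chunkByRow(String scraped) {
        List<String> lines = scraped.lines()
                .map(String::strip)
                .filter(line -> !line.isEmpty())
                .collect(Collectors.toList());
        List<String> chunks = new ArrayList<>();
        if (lines.isEmpty()) {
            return chunks;
        }
        String header = lines.get(0);
        for (String row : lines.subList(1, lines.size())) {
            chunks.add(header + "\n" + row);
        }
        return chunks;
    }

    public static void main(String[] args) {
        String scraped = "Year | Winner | Runner-up\n"
                + "2023 | Chennai Super Kings | Gujarat Titans\n"
                + "2018 | Chennai Super Kings | Sunrisers Hyderabad\n";
        // Each chunk could then be wrapped in its own Document before being stored.
        chunkByRow(scraped).forEach(chunk -> System.out.println(chunk + "\n---"));
    }
}
```

With one season per chunk, a similarity search for a specific year can return just the matching row rather than the whole table.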

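Step 2: Expose a RAG API with Spring Boot

WebScrapeRAGController is a Spring @RestController that exposes a REST endpoint for IPL-related questions. It uses semantic search and a ChatClient to produce LLM answers grounded in the scraped data.

It works as follows:

Receive the user's query as a request parameter.
Retrieve relevant context with the VectorStore's similaritySearch method.
Build a prompt from the retrieved context and append the user's query.
Call the ChatClient's prompt() method to send the prompt to the LLM.
Return the answer generated by the LLM.

The code is as follows: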

package com.example.RAG;

import org.springframework.ai.chat.client.ChatClient;
import org.springframework.ai.document.Document;
import org.springframework.ai.vectorstore.VectorStore;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;
import java.util.List;

@RestController
public class WebScrapeRAGController {

    private final ChatClient chatClient;
    private final VectorStore vectorStore;

    public WebScrapeRAGController(ChatClient.Builder builder, VectorStore vectorStore) {
        this.vectorStore = vectorStore;
        this.chatClient = builder.build();
    }

    @GetMapping("/webscrape-rag")
    public String askWithWebData(@RequestParam("query") String query) {
        // Retrieve relevant context from the VectorStore
        List<Document> relevantContexts = vectorStore.similaritySearch(query);
        assert relevantContexts != null;
        List<String> list = relevantContexts.stream().map(Document::toString).toList();
        // Construct the prompt with the retrieved context
        StringBuilder promptBuilder = new StringBuilder("Use the following context to answer the query:\n\n");
        for (String context : list) {
            promptBuilder.append(context).append("\n");
        }
        promptBuilder.append("\nQuery: ").append(query);
        // Call the ChatClient with the constructed prompt
        return chatClient.prompt()
                .user(promptBuilder.toString())
                .call()
                .content();
    }
}

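The controller receives the user's query, retrieves relevant context from the VectorStore, combines the context and the query into a single prompt, and sends it to the LLM. Because the LLM answers from the supplied context, its response is more likely to be accurate and relevant.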
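
The endpoint can be called from any HTTP client. Below is a minimal sketch using the JDK's built-in java.net.http client. The class AskClientSketch, its methods, and the base URL http://localhost:8080 (Spring Boot's default port) are assumptions for illustration; only the /webscrape-rag path and the query parameter come from the controller above.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

// Illustrative client sketch; the base URL is an assumption, not part of the article's code.
final class AskClientSketch {

    // Build the endpoint URI; the query text is URL-encoded so spaces and '?' are safe.
    static URI buildUri(String baseUrl, String query) {
        String encoded = URLEncoder.encode(query, StandardCharsets.UTF_8);
        return URI.create(baseUrl + "/webscrape-rag?query=" + encoded);
    }

    // Send a GET request and return the response body (requires the application to be running).
    static String ask(String baseUrl, String query) throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(buildUri(baseUrl, query)).GET().build();
        return HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString())
                .body();
    }

    public static void main(String[] args) {
        // Assumption: the app runs locally on Spring Boot's default port 8080.
        System.out.println(buildUri("http://localhost:8080", "Who won the IPL in 2023?"));
        // To query a running instance: ask("http://localhost:8080", "Who won the IPL in 2023?")
    }
}
```

The main method only prints the request URI so that it runs without a server; call ask(...) once the application and Ollama are up.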

Maven Dependencies: pom.xml

Below is the project's pom.xml with the required dependencies:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 https://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <parent>
       <groupId>org.springframework.boot</groupId>
       <artifactId>spring-boot-starter-parent</artifactId>
       <version>3.5.0</version>
       <relativePath/> <!-- lookup parent from repository -->
    </parent>
    <groupId>com.example</groupId>
    <artifactId>claude</artifactId>
    <version>0.0.1-SNAPSHOT</version>
    <name>claude</name>
    <description>Demo project for Spring Boot</description>
    <url/>
    <licenses>
       <license/>
    </licenses>
    <developers>
       <developer/>
    </developers>
    <scm>
       <connection/>
       <developerConnection/>
       <tag/>
       <url/>
    </scm>
    <properties>
       <java.version>17</java.version>
       <spring-ai.version>1.0.0-M6</spring-ai.version>
    </properties>
    <dependencies>
       <dependency>
          <groupId>org.springframework.ai</groupId>
          <artifactId>spring-ai-ollama-spring-boot-starter</artifactId>
          <version>1.0.0-M6</version>
       </dependency>
       <dependency>
          <groupId>org.springframework.boot</groupId>
          <artifactId>spring-boot-starter</artifactId>
       </dependency>
       <dependency>
          <groupId>org.springframework.ai</groupId>
          <artifactId>spring-ai-pdf-document-reader</artifactId>
       </dependency>
       <dependency>
          <groupId>org.jsoup</groupId>
          <artifactId>jsoup</artifactId>
          <version>1.17.2</version>
       </dependency>
       <dependency>
          <groupId>org.springframework.ai</groupId>
          <artifactId>spring-ai-core</artifactId>
          <version>1.0.0-M6</version>
       </dependency>
       <!-- Web starter -->
       <dependency>
          <groupId>org.springframework.boot</groupId>
          <artifactId>spring-boot-starter-web</artifactId>
       </dependency>
       <dependency>
          <groupId>org.springframework.boot</groupId>
          <artifactId>spring-boot-starter-test</artifactId>
          <scope>test</scope>
       </dependency>
    </dependencies>
    <dependencyManagement>
       <dependencies>
          <dependency>
             <groupId>org.springframework.ai</groupId>
             <artifactId>spring-ai-bom</artifactId>
             <version>${spring-ai.version}</version>
             <type>pom</type>
             <scope>import</scope>
          </dependency>
       </dependencies>
    </dependencyManagement>
    <build>
       <plugins>
          <plugin>
             <groupId>org.springframework.boot</groupId>
             <artifactId>spring-boot-maven-plugin</artifactId>
          </plugin>
       </plugins>
    </build>
</project>

This file declares the Spring Boot, Spring AI, and JSoup dependencies the project needs.

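RAG in Other Domains

Beyond IPL winner lookups, RAG applies to many other scenarios, for example:

Customer service: a chatbot that answers customer questions from the latest product documentation and knowledge base.
Financial analysis: an analyst assistant that generates investment suggestions from the latest market data and news reports.
Medical diagnosis: a diagnostic assistant that helps doctors by drawing on the latest medical literature and patient records.

In short, RAG is a promising technique for improving what an LLM knows and how useful it is. As LLMs continue to develop, RAG is likely to be applied in more and more domains.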

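Conclusion: RAG Makes LLMs Smarter

RAG is key to bridging the gap between an LLM's knowledge and the real world. With a few lines of Spring code, you can turn an LLM from an encyclopedia into an assistant that works with up-to-date data. Combining web scraping with RAG addresses the stale-knowledge problem and improves the model's performance in practical applications.

Adding RAG to a Mistral model running on Ollama can substantially improve the accuracy of its answers. Whether you are looking up IPL results or retrieving knowledge in another domain, RAG makes your LLM more capable and more useful.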
