Practical Tips for Improving BERTopic Topic Modeling: Topic Discovery and Understanding in the LLM Era

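Topic modeling has long played an important role in artificial intelligence. Even now that large language models (LLMs) have matured, topic modeling tools remain indispensable when we need to efficiently identify what is being discussed across a large corpus. BERTopic is one of the most widely used topic modeling toolkits today. Its swappable modules and concise interface let users concentrate on tuning the parameters and features that matter. Using a practical example, this article looks at how adjusting BERTopic's settings, especially for clustering and representation, improves the quality of the resulting topics.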

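BERTopic Overview: The Full Pipeline from Embeddings to Topic Representation

A standard topic modeling pipeline typically has four steps: embedding, dimensionality reduction, clustering, and topic representation. BERTopic wraps each step in an easy-to-use module so that users can pick the most suitable component for their needs. For example, the embedding step can use a pretrained model such as SentenceTransformer, the dimensionality reduction step can use UMAP, and HDBSCAN is a common choice for clustering. Swapping or tuning these modules can significantly change the final topics.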
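
As a rough illustration of the last step, topic representation, the sketch below scores terms per cluster with a class-based TF-IDF weighting of the form tf * log(1 + A / f), where A is the average number of words per cluster and f is a term's total frequency across clusters. It is a simplified, pure-Python approximation on toy, hand-tokenized documents, not BERTopic's actual implementation; the names c_tf_idf and top_terms are made up for this example.

```python
import math
from collections import Counter


def c_tf_idf(cluster_docs):
    """Score terms per cluster; cluster_docs maps cluster id -> list of token lists."""
    # Treat each cluster as one big document and count its terms
    tf = {c: Counter(tok for doc in docs for tok in doc)
          for c, docs in cluster_docs.items()}
    total = Counter()
    for counts in tf.values():
        total.update(counts)
    avg_words = sum(total.values()) / len(tf)  # A: average words per cluster
    scores = {}
    for c, counts in tf.items():
        n_words = sum(counts.values())
        scores[c] = {
            term: (count / n_words) * math.log(1 + avg_words / total[term])
            for term, count in counts.items()
        }
    return scores


def top_terms(scores, k=3):
    """Return the k best-scoring terms per cluster (ties broken alphabetically)."""
    return {
        c: [t for t, _ in sorted(s.items(), key=lambda kv: (-kv[1], kv[0]))[:k]]
        for c, s in scores.items()
    }


toy_clusters = {
    0: [["space", "shuttle", "launch"], ["nasa", "space", "orbit"]],
    1: [["hockey", "game", "team"], ["team", "game", "season", "space"]],
}
print(top_terms(c_tf_idf(toy_clusters)))
```

Terms that are frequent inside one cluster but rare elsewhere score highest, which is why "space" does not surface for the hockey cluster even though it appears there once.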

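Hands-On Project: Starting with the 20 Newsgroups Dataset

To demonstrate BERTopic in practice and show how to tune it, we use the popular open-source 20 Newsgroups dataset. It consists of posts from Usenet newsgroup forums, written in a fairly casual tone. We first run topic modeling with the default settings recommended on the BERTopic website, then adjust the clustering and representation parameters step by step and show how the results improve.

First, load the dataset and draw a reproducible random sample of 500 documents to work with: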

import random
from datasets import load_dataset

dataset = load_dataset("SetFit/20_newsgroups")
random.seed(42)
text_label = list(zip(dataset["train"]["text"], dataset["train"]["label_text"]))
text_label_500 = random.sample(text_label, 500)

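Next, write cleaning rules that drop common email/forum header lines and quoted replies, remove excessive punctuation, filter out sentences that are unlikely to carry information, and keep only the first few sentences of each post: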

import re

def clean_for_embedding(text, max_sentences=5):
    lines = text.split("\n")
    lines = [line for line in lines if not line.strip().startswith(">")]
    lines = [line for line in lines if not re.match(r"^\s*(from|subject|organization|lines|writes|article)\s*:", line, re.IGNORECASE)]
    text = " ".join(lines)
    text = re.sub(r"\s+", " ", text).strip()
    text = re.sub(r"[!?]{3,}", "", text)
    sentence_split = re.split(r'(?<=[.!?]) +', text)
    sentence_split = [
        s for s in sentence_split
        if len(s.strip()) > 15 and not s.strip().isupper()
    ]
    return " ".join(sentence_split[:max_sentences])

texts_clean = [clean_for_embedding(text) for text,_ in text_label_500]
labels = [label for _, label in text_label_500]

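BERTopic with Default Settings: Initial Results and Problems

Using the code snippet from the official BERTopic documentation, we can quickly get an initial topic modeling result: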

from bertopic import BERTopic
from umap import UMAP
from hdbscan import HDBSCAN
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import CountVectorizer
from bertopic.vectorizers import ClassTfidfTransformer
from bertopic.representation import KeyBERTInspired

# Step 1 - Extract embeddings
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

# Step 2 - Reduce dimensionality
umap_model = UMAP(n_neighbors=10, n_components=5, min_dist=0.0, metric='cosine', random_state=42)

# Step 3 - Cluster reduced embeddings
hdbscan_model = HDBSCAN(min_cluster_size=15, metric='euclidean', cluster_selection_method='eom', prediction_data=True)

# Step 4 - Tokenize topics
vectorizer_model = CountVectorizer(stop_words="english")

# Step 5 - Create topic representation
ctfidf_model = ClassTfidfTransformer()

# Step 6 - (Optional) Fine-tune topic representations with
# a `bertopic.representation` model
representation_model = KeyBERTInspired()

# All steps together
topic_model = BERTopic(
    embedding_model=embedding_model, # Step 1 - Extract embeddings
    umap_model=umap_model, # Step 2 - Reduce dimensionality
    hdbscan_model=hdbscan_model, # Step 3 - Cluster reduced embeddings
    vectorizer_model=vectorizer_model, # Step 4 - Tokenize topics
    ctfidf_model=ctfidf_model, # Step 5 - Extract topic words
    representation_model=representation_model # Step 6 - (Optional) Fine-tune topic representations
)

topics, probs = topic_model.fit_transform(texts_clean)

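The initial result shows that, with these default settings, only 3 topics are found in the 500-document sample. More importantly, the topic representations are messy and lack clear meaning. To tune the parameters effectively, we need a closer look at each submodule and its characteristics.

Tuning UMAP: Focus on Local Structure to Find More Topics

UMAP (Uniform Manifold Approximation and Projection) is a widely used dimensionality reduction algorithm. Its core idea is to map high-dimensional data into a low-dimensional space while preserving the topological structure of the data as far as possible. n_neighbors is a key UMAP parameter: it sets how many neighboring points each data point is connected to.

The smaller n_neighbors is, the more UMAP focuses on local structure, which makes fine-grained topics easier to find. Conversely, the larger n_neighbors is, the more UMAP focuses on global structure, which favors broader topics.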
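
To build some intuition for this, the sketch below looks only at the k-nearest-neighbor graph that UMAP starts from, not at UMAP itself. On made-up one-dimensional points forming three groups, it counts how many disconnected pieces the graph has for different neighbor counts. The function name knn_components and the data are illustrative assumptions.

```python
def knn_components(points, k):
    """Count connected components of the undirected k-nearest-neighbor graph."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Union-find lookup with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, p in enumerate(points):
        others = [j for j in range(n) if j != i]
        neighbors = sorted(others, key=lambda j: abs(points[j] - p))[:k]
        for j in neighbors:
            parent[find(i)] = find(j)  # link i with each of its k neighbors
    return len({find(i) for i in range(n)})


# Three tight groups of three points each
points = [0, 1, 2, 10, 11, 12, 50, 51, 52]
for k in (2, 3):
    print(k, knn_components(points, k))
```

With two neighbors every point links only within its own group, so the three groups stay separate. With three neighbors some points must reach into another group, and the graph merges into a single piece, which mirrors how a larger n_neighbors pulls the embedding toward global structure.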

To find more topics, we can try lowering n_neighbors, for example from the 10 used in the default setup above to 5:

umap_model_new = UMAP(n_neighbors=5, n_components=5, min_dist=0.0, metric='cosine', random_state=42)
topic_model.umap_model = umap_model_new
topics, probs = topic_model.fit_transform(texts_clean)
topic_model.get_topic_info()

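After this change, the number of topics rises to 4 (one of which, -1, is the outlier cluster). The number of topics is still small, however, and their sizes are very uneven.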
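
A quick way to put numbers on "few topics, uneven sizes" is to summarize the list of topic assignments returned by fit_transform. The helper below is a minimal sketch; summarize_topics is a hypothetical name, and the sample assignments are invented for illustration.

```python
from collections import Counter


def summarize_topics(topics):
    """Summarize topic assignments, treating -1 as the outlier cluster."""
    counts = Counter(topics)
    outliers = counts.pop(-1, 0)  # remove outliers from the topic counts
    sizes = sorted(counts.values(), reverse=True)
    return {
        "n_topics": len(sizes),
        "outlier_share": outliers / len(topics) if topics else 0.0,
        "largest": sizes[0] if sizes else 0,
        "smallest": sizes[-1] if sizes else 0,
    }


# Invented assignments: two outliers and three topics of sizes 5, 2 and 1
example_topics = [-1, -1, 0, 0, 0, 0, 0, 1, 1, 2]
print(summarize_topics(example_topics))
```

Comparing this summary before and after each parameter change makes it easier to see whether a change really produced more, and more balanced, topics.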

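Tuning HDBSCAN: Balance Topic Sizes and Refine Granularity

HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm that can find clusters of varying density and is robust to noise. BERTopic uses HDBSCAN for clustering by default, and it exposes several parameters for tuning.

min_cluster_size: the minimum number of data points a topic must contain. Lowering min_cluster_size lets the algorithm find smaller topics, which increases the number of topics.
cluster_selection_method: how HDBSCAN selects topics from the cluster hierarchy. The default, "eom" (excess of mass), selects clusters that persist over a wide range of densities. The alternative, "leaf", selects the finest-grained clusters at the leaves of the hierarchy.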
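
The difference between the two selection methods can be sketched on a toy cluster hierarchy. The code below is a simplified illustration with made-up stability values, not the hdbscan library's implementation: "leaf" returns every leaf of the tree, while the excess-of-mass rule keeps a parent cluster whenever its own stability is at least the combined stability of what would be selected beneath it.

```python
# Toy hierarchy: each node has a (made-up) stability and a list of children
TREE = {
    "root": {"stability": 0.0, "children": ["A", "B"]},
    "A": {"stability": 5.0, "children": ["A1", "A2"]},
    "B": {"stability": 1.0, "children": ["B1", "B2"]},
    "A1": {"stability": 1.5, "children": []},
    "A2": {"stability": 2.0, "children": []},
    "B1": {"stability": 2.0, "children": []},
    "B2": {"stability": 3.0, "children": []},
}


def select_leaf(tree, node="root"):
    """Return all leaves below node: the finest-grained clusters."""
    children = tree[node]["children"]
    if not children:
        return [node]
    return [leaf for child in children for leaf in select_leaf(tree, child)]


def select_eom(tree, node="root", is_root=True):
    """Return (selected clusters, their total stability) for the subtree at node."""
    info = tree[node]
    if not info["children"]:
        return [node], info["stability"]
    selected, total = [], 0.0
    for child in info["children"]:
        child_selected, child_total = select_eom(tree, child, is_root=False)
        selected += child_selected
        total += child_total
    # Keep the parent if it is at least as stable as its selected descendants
    if not is_root and info["stability"] >= total:
        return [node], info["stability"]
    return selected, total


print(select_leaf(TREE))
print(select_eom(TREE)[0])
```

In this toy tree, excess of mass keeps the broad cluster A intact but splits B, whereas leaf selection always descends to the smallest clusters, which is why it tends to yield more and smaller topics.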

To get more focused topics, we can adjust both parameters together: lower min_cluster_size from 15 to 5 and set cluster_selection_method to "leaf":

hdbscan_model_leaf = HDBSCAN(min_cluster_size=5, metric='euclidean', cluster_selection_method='leaf', prediction_data=True)
topic_model.hdbscan_model = hdbscan_model_leaf
topics, _ = topic_model.fit_transform(texts_clean)
topic_model.get_topic_info()

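The number of topics increases significantly, each topic is more focused, and the sizes are more balanced. Adjusting only min_cluster_size also increases the number of topics, but the effect may be less pronounced than changing both parameters together.

With cluster_selection_method set to "leaf", refitting the model gives: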

topic_model.hdbscan_model = hdbscan_model_leaf
topics, _ = topic_model.fit_transform(texts_clean)
topic_model.get_topic_info()

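The number of topics rises to 33, which is expected because the algorithm now selects the finest-grained split.

Randomness and Consistency: Making BERTopic Results Reproducible

Because UMAP is a stochastic algorithm, results can differ slightly between runs. To get consistent topics, set UMAP's random_state parameter to a fixed value.

In addition, if you use a third-party embedding API (such as OpenAI's embeddings API), you need to check whether it behaves deterministically. Otherwise, even identical parameter settings can produce different topics on each run. OpenAI's embeddings API, for example, can return slightly different vectors from call to call. In that case, it is best to compute the embeddings for all texts once in advance and pass them to BERTopic's fit_transform explicitly.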

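from bertopic.backend import BaseEmbedder
import numpy as np

class CustomEmbedder(BaseEmbedder):
    """Light-weight wrapper to call NVIDIA's embedding endpoint via OpenAI SDK."""
    def __init__(self, embedding_model, client):
        super().__init__()
        self.embedding_model = embedding_model
        self.client = client

    def embed(self, documents, verbose=False):
        # BERTopic calls embed() on BaseEmbedder subclasses
        response = self.client.embeddings.create(
            input=list(documents),
            model=self.embedding_model,
            encoding_format="float",
            extra_body={"input_type": "passage", "truncate": "NONE"},
        )
        embeddings = np.array([embed.embedding for embed in response.data])
        return embeddings

# `client` (an OpenAI SDK client pointed at the embedding endpoint) and
# `embedding_model_name` are placeholders you must define for your own setup.
embedder = CustomEmbedder(embedding_model=embedding_model_name, client=client)
# Compute the embeddings once so that every run reuses the same vectors
embeddings = embedder.embed(texts_clean)
topic_model.embedding_model = embedder
topics, probs = topic_model.fit_transform(texts_clean, embeddings=embeddings)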
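
One way to make sure the same vectors are reused across runs is to cache them on disk, keyed by a hash of each text. The sketch below uses only the standard library; cached_embeddings and fake_embed are hypothetical names, and fake_embed merely stands in for a real embedding call.

```python
import hashlib
import json
import os
import tempfile


def _key(text):
    """Stable cache key for a document."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def cached_embeddings(texts, embed_fn, cache_path):
    """Return one vector per text, calling embed_fn only for uncached texts."""
    cache = {}
    if os.path.exists(cache_path):
        with open(cache_path, encoding="utf-8") as f:
            cache = json.load(f)
    # Deduplicate while preserving order, then keep only texts not yet cached
    missing = [t for t in dict.fromkeys(texts) if _key(t) not in cache]
    if missing:
        for text, vector in zip(missing, embed_fn(missing)):
            cache[_key(text)] = [float(x) for x in vector]
        with open(cache_path, "w", encoding="utf-8") as f:
            json.dump(cache, f)
    return [cache[_key(t)] for t in texts]


def fake_embed(batch):
    # Stand-in for a real embedding API call (assumption for this example)
    return [[float(len(text)), 1.0] for text in batch]


cache_file = os.path.join(tempfile.mkdtemp(), "embeddings.json")
first = cached_embeddings(["a", "bb"], fake_embed, cache_file)
second = cached_embeddings(["bb", "a"], fake_embed, cache_file)
print(first == second[::-1])
```

The returned list can be converted with np.array(...) and passed as the embeddings argument of fit_transform, so reruns cluster exactly the same vectors.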

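Improving Topic Representations: Making Topics More Interpretable

Topic representations are essential for understanding the output of topic modeling. By default BERTopic represents topics with unigrams (single words), which often lack context and make it hard to pin down what a topic is about. To make topics more interpretable, we can try the following approaches:

N-gram Representations: Adding Context

Widening the n-gram range, for example using bigrams (two words) or trigrams (three words) as topic terms, brings in more context and makes topics easier to understand. This is done through the ngram_range parameter of CountVectorizer: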

topic_model.update_topics(texts_clean, vectorizer_model=CountVectorizer(stop_words="english", ngram_range=(2,3)))
topic_model.get_topic_info()

With every term restricted to a bigram or trigram, the topic representations become more meaningful.
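
To make the effect of ngram_range concrete, here is a small pure-Python sketch of how word n-grams are formed from a token list. It imitates the idea behind CountVectorizer rather than reproducing its code; the function name ngrams and the sample tokens are assumptions for illustration.

```python
def ngrams(tokens, ngram_range=(2, 3)):
    """Build all word n-grams whose length lies within ngram_range (inclusive)."""
    low, high = ngram_range
    return [
        " ".join(tokens[i:i + n])
        for n in range(low, high + 1)
        for i in range(len(tokens) - n + 1)
    ]


tokens = ["space", "shuttle", "launch", "window"]
print(ngrams(tokens))          # bigrams and trigrams only
print(ngrams(tokens, (1, 2)))  # unigrams plus bigrams
```

With a lower bound of 2, single words never become candidate terms, so every topic term carries at least some context. Using (1, 2) instead would keep unigrams alongside the bigrams.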

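Custom Tokenizer: Filtering Out Meaningless N-grams

Even with bigrams, some phrases remain hard to interpret, such as "486dx 50", "ac uk", and "dxf doc". To make sure the algorithm only picks high-quality bigram candidates, we can write a custom tokenizer that filters out meaningless bigrams.

The following example uses the spaCy library to restrict bigrams to phrases with meaningful part-of-speech combinations: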

import spacy
from typing import List

class ImprovedTokenizer:
    def __init__(self):
        self.nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])
        self.MEANINGFUL_BIGRAMS = {
          ("ADJ", "NOUN"),
          ("NOUN", "NOUN"),
          ("VERB", "NOUN"),
          }

    # Keep only the most meaningful syntactic bigram patterns
    def __call__(self, text: str, max_tokens=200) -> List[str]:
        doc = self.nlp(text[:3000]) # truncate long docs for speed
        tokens = [(t.text, t.lemma_.lower(), t.pos_) for t in doc if t.is_alpha]
        bigrams = []
        for i in range(len(tokens) - 1):
            word1, lemma1, pos1 = tokens[i]
            word2, lemma2, pos2 = tokens[i + 1]
            if (pos1, pos2) in self.MEANINGFUL_BIGRAMS:
                # Lemmas are already lowercased, so inflected variants collapse to one term
                bigrams.append(f"{lemma1} {lemma2}")
        return bigrams

topic_model.update_topics(docs=texts_clean,vectorizer_model=CountVectorizer(tokenizer=ImprovedTokenizer()))
topic_model.get_topic_info()

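Now the topic representations consist mainly of meaningful bigrams, with the noise removed.

LLM-Generated Topic Titles: Semantic Understanding Beyond Individual Words

To go beyond word-level representations, we can use a large language model (LLM) to generate a title for each topic: send the topic's representative documents to the LLM and ask for a concise, meaningful title.

BERTopic provides a built-in way to generate topic titles with an LLM: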

import openai
from bertopic.representation import OpenAI
import os

client = openai.OpenAI(api_key=os.environ["OPENAI_API_KEY"])
topic_model.update_topics(texts_clean, representation_model=OpenAI(client, model="gpt-4o-mini", delay_in_seconds=5))
topic_model.get_topic_info()

Alternatively, you can write your own function to obtain LLM-generated titles and use set_topic_labels to store them on the topic_model object:

import openai
from typing import List, Dict, Tuple
from tqdm import tqdm

def generate_topic_titles_with_llm(
    topic_model,
    docs: List[str],
    api_key: str,
    model: str = "gpt-4o") -> Dict[int, Tuple[str, str]]:
    client = openai.OpenAI(api_key=api_key)
    topic_info = topic_model.get_topic_info()
    topic_repr = {}
    topics = topic_info[topic_info.Topic != -1].Topic.tolist()
    for topic in tqdm(topics, desc="Generating titles"):
        top_doc = topic_model.get_representative_docs(topic)[0]
        prompt = f"""
        You are a helpful summarizer for topic clustering.
        Given the following text that represents a topic, generate:
        1. A short **title** for the topic (2–6 words)
        2. A one or two sentence **summary** of the topic.
        Text:
        \"\"\"
        {top_doc}
        \"\"\"
        """
        try:
            response = client.chat.completions.create(
                model=model,
                messages=[
                  {"role": "system", "content": "You are a helpful assistant for summarizing topics."},
                  {"role": "user", "content": prompt}
                ],
                temperature=0.5
            )
            output = response.choices[0].message.content.strip()
            lines = output.split('\n')
            title = lines[0].replace("Title:", "").strip()
            summary = lines[1].replace("Summary:", "").strip() if len(lines) > 1 else ""
            topic_repr[topic] = (title, summary)
        except Exception as e:
            print(f"Error with topic {topic}: {e}")
            topic_repr[topic] = ("[Error]", str(e))
    return topic_repr

topic_repr = generate_topic_titles_with_llm( topic_model, texts_clean, os.environ["OPENAI_API_KEY"])

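# set_topic_labels expects one string per topic, so keep only the title
# from each (title, summary) tuple; topics without a title (such as -1) get "Topic".
topic_repr_dict = {
    topic: topic_repr[topic][0] if topic in topic_repr else "Topic"
    for topic in topic_model.get_topic_info()["Topic"]
}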
topic_model.set_topic_labels(topic_repr_dict)

With LLM-generated titles, the topic representations become more meaningful and interpretability improves further.
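
LLM replies do not always arrive as exactly two clean lines: they may include numbering, Markdown bold, or blank lines, which breaks parsing that relies on line positions. The sketch below is one possible, more tolerant parser; parse_title_summary is a hypothetical helper and the sample replies are invented.

```python
import re


def parse_title_summary(output):
    """Extract (title, summary) from a free-form LLM reply."""
    cleaned = []
    for line in output.splitlines():
        # Drop list numbering ("1."), bullets, and Markdown emphasis markers
        text = re.sub(r"^\s*(?:\d+[.)]\s*|[-*#]+\s*)*", "", line)
        text = text.replace("*", "").strip()
        if text:
            cleaned.append(text)
    title, summary = "", ""
    for text in cleaned:
        lower = text.lower()
        if lower.startswith("title:"):
            title = text[len("title:"):].strip()
        elif lower.startswith("summary:"):
            summary = text[len("summary:"):].strip()
    # No labels at all: treat the first line as title, the rest as summary
    if not title and not summary and cleaned:
        title, summary = cleaned[0], " ".join(cleaned[1:])
    return title, summary


reply = "1. **Title:** Space Shuttle Missions\n\n2. **Summary:** Posts about NASA launches."
print(parse_title_summary(reply))
```

Matching on the "Title:" and "Summary:" labels rather than on line positions keeps the result stable when the model adds blank lines or formatting. Asking the model for structured output would be a stricter alternative.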

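Conclusion: Mastering BERTopic for Topic Analysis in the LLM Era

This article shared several practical techniques for improving BERTopic results. It is essential to understand what each module does and how each parameter affects the final topics. By tuning parameters or swapping algorithms, we can find the settings that best suit a particular domain. We also need to watch for randomness in the clustering results and track down its root cause; otherwise the results cannot be reproduced and every run yields a different set of topics.

Once a reasonable set of topics has been identified, topic representations are key to understanding them. We can make the terms more meaningful by restricting them to n-grams with n greater than 1 or by supplying our own tokenizer. We can also use an LLM directly to title each cluster. Either way, better topic representations make topics much easier to interpret.

We hope this article helps you better understand the topic modeling process and BERTopic's modules. BERTopic also offers many advanced topic modeling methods beyond the standard clustering-and-representation workflow on text, which we may explore in a separate article. Stay tuned!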
