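Fine-Tuning GPT-2: Building a Domain-Specific LLM, from Pretraining to RAG in Practice

In the previous post on large language models (LLMs), we looked at tokenization. This post walks through fine-tuning, using the pretrained GPT-2 (124M parameters) as the example, and shows how fine-tuning combined with RAG (Retrieval-Augmented Generation) can produce a model specialized for one domain. Along the way it covers GPT-2's structure and parameters, the difference between pretrained and non-pretrained models, and the fine-tuning steps in detail.

1. A First Look at GPT-2: From Text Generation to Model Parameters

At its core, GPT-2 takes a piece of text and predicts and generates the text likely to follow. More precisely, GPT-2 operates on tokens rather than on raw text. The transformers library makes it easy to load and use the model. The following code shows how to generate text with GPT-2: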

from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
prompt = "I asked GPT-2, what is your dream? It answered:"
result = generator(prompt, max_length=20, num_return_sequences=1)
print(result[0]['generated_text'])

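The output depends on the model's parameters. The generated text is mostly grammatical, but it sometimes repeats itself, which is common in small models such as the 124M-parameter GPT-2.

To improve generation quality, you can use a larger model, for example EleutherAI/gpt-neo-1.3B: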

from transformers import pipeline

generator = pipeline("text-generation", model="EleutherAI/gpt-neo-1.3B")
prompt = "I asked GPT-neo, what is your dream? It answered:"
result = generator(prompt, max_length=20, num_return_sequences=1)
print(result[0]['generated_text'])

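Parameters are at the heart of what a model learns. Put simply, they are the numeric values (weights and biases) learned during training, and they determine how the model turns inputs into predictions and outputs. In the linear function F(x) = W × x + b, for example, x is the input and W and b are the parameters.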
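To make the F(x) = W × x + b example concrete, here is a minimal sketch of a tiny linear layer in plain Python. The values of `W`, `b`, and `x` are made up for illustration and have nothing to do with GPT-2's actual weights.

```python
def linear_layer(x, W, b):
    """Compute F(x) = W x + b for a weight matrix W (rows = outputs) and bias vector b."""
    return [sum(w * xi for w, xi in zip(row, x)) + bias for row, bias in zip(W, b)]

# Hypothetical parameters: 2 outputs, 3 inputs
W = [[1.0, 0.0, -1.0],
     [0.5, 0.5, 0.5]]
b = [0.0, 1.0]

x = [2.0, 4.0, 6.0]          # the input
y = linear_layer(x, W, b)    # the prediction

# Every entry of W and b is one trainable parameter
n_params = sum(len(row) for row in W) + len(b)
print(y, n_params)
```

GPT-2 is built from many layers of this kind (plus attention and normalization), which is where its roughly 124M parameters come from.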

To list all of GPT-2's parameters, use the following code:

from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")

print("GPT-2 Model Parameters:\n")
for name, param in model.named_parameters():
    print(f"{name:<60} | shape: {tuple(param.shape)} | trainable: {param.requires_grad}")

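The code above prints the name, shape, and trainable flag of every parameter in GPT-2, which helps in understanding the model's internal structure and in choosing a fine-tuning strategy. The listing shows that GPT-2 has a large number of parameters. They encode what the model learned from large amounts of data, and they are the starting point for fine-tuning.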
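As a cross-check on the "124M" figure, the total can be reproduced from the shapes in that listing. The sketch below does the arithmetic without loading the model. It assumes the standard small GPT-2 configuration (vocabulary 50257, 1024 positions, hidden size 768, 12 layers, MLP width 4 × 768) and assumes the output head shares its weights with the token embedding, so the head is not counted twice. The function name is hypothetical.

```python
def gpt2_param_count(vocab_size=50257, n_positions=1024, n_embd=768, n_layer=12):
    """Count parameters of a GPT-2 style decoder from its configuration."""
    # Token embeddings + position embeddings
    embeddings = vocab_size * n_embd + n_positions * n_embd
    # LayerNorm has a weight and a bias per hidden unit
    layer_norm = 2 * n_embd
    # Attention: fused QKV projection + output projection (weights + biases)
    attention = (n_embd * 3 * n_embd + 3 * n_embd) + (n_embd * n_embd + n_embd)
    # MLP: expand to 4 * n_embd, then project back (weights + biases)
    mlp = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)
    block = 2 * layer_norm + attention + mlp
    # Final LayerNorm after the last block
    return embeddings + n_layer * block + layer_norm

total = gpt2_param_count()
print(f"{total:,}")
```

About 31% of the total sits in the token embedding table, which is one reason small models are sensitive to vocabulary size.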

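2. Pretrained vs. Non-Pretrained: Two Paths for Training a Model

A pretrained GPT-2 has already been trained by OpenAI on a massive text corpus. Its weights and biases have been optimized with gradient descent, so the model has picked up grammar, facts, and the patterns and styles of human writing. A non-pretrained GPT-2, by contrast, has never seen any data, and its parameters (weights and biases) are randomly initialized.

During training:

Tokens are fed into the model.
The model predicts the next token using its current weights (random at the start when training from scratch).
The prediction is compared with the actual next token from the dataset.
The loss (the gap between the prediction and the actual data) is computed.
Backpropagation and gradient descent update all trainable parameters to reduce the loss.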
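The five steps above can be sketched on a toy problem. The example below is not GPT-2. It is a hypothetical bigram model whose only parameters are a small table of next-token logits, trained with hand-written gradient descent on a repeating sequence. The parameters start at zero instead of random values to keep the run deterministic.

```python
import math

def softmax(logits):
    """Turn raw scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Step 1: tokens fed to the model, as (current token, actual next token) pairs
tokens = [0, 1, 2] * 20
vocab_size = 3
pairs = list(zip(tokens[:-1], tokens[1:]))

# Trainable parameters: one row of next-token logits per current token
logits_table = [[0.0] * vocab_size for _ in range(vocab_size)]

def mean_loss():
    """Average cross-entropy of the model's next-token predictions."""
    return sum(-math.log(softmax(logits_table[cur])[nxt]) for cur, nxt in pairs) / len(pairs)

initial_loss = mean_loss()
learning_rate = 0.5
for epoch in range(50):
    for cur, nxt in pairs:
        probs = softmax(logits_table[cur])            # step 2: predict the next token
        for j in range(vocab_size):
            # steps 3-4: the cross-entropy gradient w.r.t. logit j is probs[j] - one_hot[j]
            grad = probs[j] - (1.0 if j == nxt else 0.0)
            logits_table[cur][j] -= learning_rate * grad  # step 5: gradient descent update
final_loss = mean_loss()

print(f"loss before: {initial_loss:.3f}, after: {final_loss:.3f}")
```

In a real model the gradient is computed by backpropagation through every layer, but the loop has the same shape.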

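Fine-tuning follows the same loop, but it starts from weights that are already trained rather than random ones, so it typically needs far less data and compute. Compared with training a non-pretrained GPT-2, fine-tuning only makes relatively small adjustments to the existing weights.

For example, given the prompt "The capital of France is", an untrained GPT-2 would likely produce gibberish, whereas the pretrained model would most likely continue with "Paris".

3. Hands-On: Full Fine-Tuning of GPT-2 on Harry Potter

Because GPT-2 is relatively small, we use full fine-tuning here, meaning all parameters are updated. For much larger models, parameter-efficient techniques such as LoRA (Low-Rank Adaptation) or QLoRA (Quantized LoRA) are usually the practical choice.

The following steps show how to fine-tune GPT-2 on a dataset built from the Harry Potter book series.

Data preprocessing

First, the dataset needs to be cleaned so it is suitable for training. The following code shows how to convert the text data into a Hugging Face Dataset: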

from datasets import Dataset
import os

def flatten_paragraphs(input_file):
    with open(input_file, "r", encoding="utf-8") as f:
        lines = f.readlines()
    cleaned = []
    i = 0
    while i < len(lines):
        line = lines[i].strip()
        # Remove any line containing '|', plus empty lines before/after
        if '|' in line:
            while cleaned and cleaned[-1].strip() == "":
                cleaned.pop()  # remove empty lines before
            i += 1
            while i < len(lines) and lines[i].strip() == "":
                i += 1  # skip empty lines after
            continue
        cleaned.append(line)
        i += 1
    # Flatten paragraphs into one line
    paras = []
    temp = []
    for line in cleaned:
        if line.strip() == "":
            if temp:
                paras.append(" ".join(temp) + "\n\n")
                temp = []
        else:
            temp.append(line.strip())
    if temp:
        paras.append(" ".join(temp) + "\n\n")
    # Write to output file
    output_file = f"O_{os.path.basename(input_file)}"
    with open(output_file, "w", encoding="utf-8") as f:
        f.writelines(paras)
    print(f"Processed {input_file} → {output_file}")

for i in range(2, 8):
    flatten_paragraphs(f"book{i}.txt")

with open("/content/drive/MyDrive/hp_llm/hp1.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()
paragraphs = raw_text.split("\n\n")
paragraphs = [p.strip() for p in paragraphs if p.strip()]
dataset = Dataset.from_dict({"content": paragraphs})
print(f"Total paragraphs: {len(dataset)}")
print(dataset[1])

Tokenization

Next, we tokenize the text with GPT-2's tokenizer, converting it into numeric token IDs the model can process.

from transformers import AutoTokenizer

context_length = 256
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

def tokenize_function(example):
    outputs = tokenizer(
        example["content"],
        truncation=True,
        max_length=context_length,
        return_overflowing_tokens=True,
        return_length=True,
    )
    return {"input_ids": outputs["input_ids"]}

tokenized_dataset = dataset.map(
    tokenize_function,
    batched=True,
    remove_columns=["content"]
)
print(len(tokenized_dataset))

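The context_length variable sets the maximum number of tokens the model sees in a single training example. GPT-2 does not define a padding token by default, so tokenizer.pad_token = tokenizer.eos_token reuses the end-of-sequence token for padding and avoids errors when batches are padded.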
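With truncation=True and return_overflowing_tokens=True, a paragraph longer than the limit is not cut off and discarded. The overflow comes back as additional rows, which is why the number of tokenized rows can exceed the number of paragraphs (and why the original "content" column has to be removed in map). The sketch below imitates that splitting on a plain list of token IDs, assuming no overlap between windows. The helper name is hypothetical and this is not the tokenizer's own code.

```python
def split_into_windows(input_ids, max_length):
    """Split one long token sequence into consecutive windows of at most max_length tokens."""
    return [input_ids[i:i + max_length] for i in range(0, len(input_ids), max_length)]

# A made-up "paragraph" of 10 token IDs with a window size of 4
ids = list(range(10))
windows = split_into_windows(ids, max_length=4)
print(windows)  # the last window is shorter than the others
```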

Data collation

DataCollatorForLanguageModeling assembles the tokenized samples into batches and applies the necessary padding.

from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

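mlm=False is set because GPT-2 is a causal language model (CLM): it predicts the next token rather than masked tokens, so the collator uses the input tokens themselves as labels.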
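A rough sketch of what collation for a causal LM involves is shown below: pad every sequence to the longest one in the batch, build an attention mask, and copy the inputs into labels with padding positions set to -100 so the loss ignores them. This is a simplified stand-in with a hypothetical function name, not the library's implementation. To my understanding the library collator masks labels by comparing against the pad token ID, so with pad = EOS it may also ignore genuine end-of-sequence tokens, whereas this sketch masks only the positions it padded.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default

def collate_causal_lm(batch, pad_token_id):
    """Pad a batch of token-ID lists and build labels for next-token prediction."""
    max_len = max(len(ids) for ids in batch)
    input_ids, attention_mask, labels = [], [], []
    for ids in batch:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_token_id] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)
        # Labels mirror the inputs; the model shifts them by one position internally
        labels.append(ids + [IGNORE_INDEX] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}

# 50256 is GPT-2's end-of-sequence token ID, reused here as the pad token
example_batch = collate_causal_lm([[5, 6, 7], [8]], pad_token_id=50256)
print(example_batch)
```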

Model training

Now we can load the GPT-2 model and start training.

from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")
model.resize_token_embeddings(len(tokenizer))  # Resize the embedding layer to match the tokenizer's vocabulary (no size change unless tokens were added)

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="/content/drive/MyDrive/hp_llm_2",
    overwrite_output_dir=True,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    num_train_epochs=3,
    learning_rate=5e-4,
    warmup_steps=1000,
    weight_decay=0.1,
    lr_scheduler_type="cosine",
    logging_steps=50,
    save_strategy="epoch",
    save_total_limit=1,
    fp16=True,
    push_to_hub=False,
    report_to="none",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

trainer.train()

TrainingArguments defines the training hyperparameters, such as batch size, learning rate, and number of epochs.
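Two of these settings interact in ways that are easy to miss. With per_device_train_batch_size=4 and gradient_accumulation_steps=8, each optimizer step sees 32 examples on a single device. With warmup_steps=1000 and lr_scheduler_type="cosine", the learning rate rises linearly to 5e-4 and then decays along a half cosine. The sketch below reproduces that schedule shape in plain Python. The total of 6612 steps is an assumption borrowed from the checkpoint name used later in this post, and the function is a hypothetical illustration rather than the Trainer's scheduler.

```python
import math

def lr_at_step(step, base_lr=5e-4, warmup_steps=1000, total_steps=6612):
    """Linear warmup followed by cosine decay to zero."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# Examples per optimizer step on one device
effective_batch_size = 4 * 8

for step in (0, 500, 1000, 3806, 6612):
    print(step, f"{lr_at_step(step):.6f}")
```

If the dataset is small enough that training lasts fewer than 1000 optimizer steps, the run would end while still in warmup, so it is worth checking the step count before choosing warmup_steps.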

Training a non-pretrained model

To train the model from scratch, use the following code:

from transformers import GPT2LMHeadModel, AutoConfig

config = AutoConfig.from_pretrained(
    "gpt2",
    vocab_size=len(tokenizer),
    n_ctx=context_length,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
)
model = GPT2LMHeadModel(config)

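Note that a non-pretrained model starts from a much higher loss than a pretrained one, because its randomly initialized weights carry no knowledge of the language yet.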
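A useful reference point for "higher" is the loss of a model that guesses uniformly over the vocabulary, which is roughly what a freshly initialized model does. The short calculation below assumes GPT-2's vocabulary of 50257 tokens.

```python
import math

vocab_size = 50257

# Cross-entropy when every token gets probability 1 / vocab_size
uniform_loss = -math.log(1.0 / vocab_size)

# Perplexity is exp(loss); for uniform guessing it equals the vocabulary size
uniform_perplexity = math.exp(uniform_loss)

print(round(uniform_loss, 2))  # 10.82
```

A from-scratch run whose first logged losses are near this value is behaving as expected, while a fine-tuning run should start well below it.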

Model testing

Once training is complete, the following code can be used to test the model:

from transformers import pipeline, GPT2LMHeadModel, GPT2Tokenizer
import torch

model_path = "/content/drive/MyDrive/hp_llm_2/checkpoint-6612"
model = GPT2LMHeadModel.from_pretrained(model_path)
tokenizer = GPT2Tokenizer.from_pretrained(model_path)
tokenizer.pad_token = tokenizer.eos_token
device = 0 if torch.cuda.is_available() else -1
text_gen = pipeline("text-generation", model=model, tokenizer=tokenizer, device=device)

def generate_text(prompt, max_length=100, temperature=0.7, top_k=40, top_p=0.9,
                  num_return_sequences=1, repetition_penalty=1.2):
    outputs = text_gen(
        prompt,
        max_length=max_length,
        temperature=temperature,
        top_k=top_k,
        top_p=top_p,
        num_return_sequences=num_return_sequences,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,
        repetition_penalty=repetition_penalty,
    )
    return [out["generated_text"] for out in outputs]

prompt = "Harry and I went there and climbed"
generated_texts = generate_text(
    prompt=prompt,
    max_length=100,
    temperature=0.7,
    top_k=40,
    top_p=0.9,
    num_return_sequences=3,
    repetition_penalty=1.2)

for i, text in enumerate(generated_texts):
    print(f"\nOutput {i+1}\n{text}")
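The generation call above relies on temperature, top_k, and top_p without showing what they do. The sketch below applies the three steps to a hand-written list of logits: temperature rescales the scores, top-k keeps the k most likely tokens, and top-p keeps the smallest prefix of those whose cumulative probability reaches p. It is a simplified illustration with a hypothetical function name. The library's own processors operate on tensors and also handle repetition_penalty, which is not covered here.

```python
import math

def filter_distribution(logits, temperature=0.7, top_k=40, top_p=0.9):
    """Return {token_id: probability} after temperature, top-k and top-p filtering."""
    # Temperature < 1 sharpens the distribution, > 1 flattens it
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    # Top-k: keep only the k highest-scoring tokens, then renormalize
    order = sorted(range(len(exps)), key=lambda i: exps[i], reverse=True)[:top_k]
    total = sum(exps[i] for i in order)
    probs = {i: exps[i] / total for i in order}
    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    norm = sum(probs[i] for i in kept)
    return {i: probs[i] / norm for i in kept}

# Made-up logits for a 4-token vocabulary
candidates = filter_distribution([2.0, 1.0, 0.0, -1.0], temperature=1.0, top_k=3, top_p=0.9)
print(candidates)
```

Sampling then draws the next token from the surviving candidates, which is why lower values of any of the three settings make the output more conservative.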

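4. RAG: Getting the Model to Answer Questions

After the fine-tuning above, the model can generate text in the style of Harry Potter. To have it answer questions about Harry Potter, however, we need to add RAG.

Data preprocessing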

import re
from transformers import GPT2TokenizerFast
from sentence_transformers import SentenceTransformer
import numpy as np
import faiss
import pickle

TXT_FILE = "/content/drive/MyDrive/hp_llm/hp1.txt"
with open(TXT_FILE, "r", encoding="utf-8") as f:
    text = f.read()

text = re.sub(r"\n", " ", text)
text = re.sub(r"\s+", " ", text).strip()

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
max_tokens = 200
stride = 100
tokens = tokenizer(text, return_offsets_mapping=True, return_attention_mask=False)

chunks = []
for i in range(0, len(tokens['input_ids']), stride):
    chunk_tokens = tokens['input_ids'][i:i+max_tokens]
    if len(chunk_tokens) == 0:
        continue
    chunk_text = tokenizer.decode(chunk_tokens, skip_special_tokens=True)
    chunks.append(chunk_text)

embedding_model = SentenceTransformer("multi-qa-mpnet-base-dot-v1")
chunk_embeddings = embedding_model.encode(chunks, show_progress_bar=True, convert_to_numpy=True)

faiss.normalize_L2(chunk_embeddings)

print("Embeddings shape:", chunk_embeddings.shape)

dimension = chunk_embeddings.shape[1]
index = faiss.IndexFlatIP(dimension)
index.add(chunk_embeddings)

FAISS_FILE = "hp1_faiss.index"
CHUNKS_FILE = "hp1_chunks.pkl"

faiss.write_index(index, FAISS_FILE)

with open(CHUNKS_FILE, "wb") as f:
    pickle.dump(chunks, f)

def query(text, k = 3):
    query_embedding = embedding_model.encode(text, convert_to_numpy=True).reshape(1, -1)
    faiss.normalize_L2(query_embedding)
    D, I = index.search(query_embedding, k)
    top_chunks = [chunks[i] for i in I[0]]
    return top_chunks

example = query("When was hogwarts founded?")

for i, chunk in enumerate(example):
    print(f"\nMatch {i+1}:\n{chunk[:300]}{'...' if len(chunk) > 300 else ''}")

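This code splits the Harry Potter text into small overlapping chunks (up to 200 tokens each, advancing 100 tokens at a time) and uses a SentenceTransformer model to convert each chunk into a vector embedding. It then builds a FAISS index for fast retrieval of the chunks most relevant to a query. Because the embeddings are L2-normalized, the inner-product index ranks chunks by cosine similarity. Chunking, embedding, and retrieval are the core of RAG.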
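The two ideas doing the work here, overlapping windows and cosine-similarity ranking, can be shown without any model or index. The sketch below uses plain lists and made-up two-dimensional "embeddings". The function names are hypothetical, and the brute-force ranking only mirrors what a flat inner-product index computes over normalized vectors.

```python
import math

def sliding_chunks(token_ids, max_tokens=200, stride=100):
    """Overlapping windows: each chunk starts `stride` tokens after the previous one."""
    return [token_ids[i:i + max_tokens] for i in range(0, len(token_ids), stride)]

def normalize(vec):
    """Scale a vector to unit length so that inner product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def top_k_chunks(query_vec, chunk_vecs, k=3):
    """Return indices of the k chunks most similar to the query."""
    q = normalize(query_vec)
    scores = [sum(a * b for a, b in zip(q, normalize(c))) for c in chunk_vecs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

chunks_demo = sliding_chunks(list(range(5)), max_tokens=4, stride=2)
ranking = top_k_chunks([1.0, 0.0], [[0.0, 1.0], [2.0, 0.0], [1.0, 1.0]], k=2)
print(chunks_demo, ranking)
```

Note that with this windowing the final chunks get progressively shorter and are fully contained in the chunk before them, so filtering out very short trailing chunks is a reasonable refinement.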

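Retrieval-augmented generation

When a user asks a question, the RAG system first retrieves the chunks relevant to the question, then passes them as context to the fine-tuned GPT-2 model so that it generates an answer grounded in that context. In this way RAG can turn the model into a question-answering bot for a specific domain.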
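The post does not show the generation half of this pipeline, so the following is only one plausible way to wire it up. It builds a prompt from the retrieved chunks and the question. The function name, the prompt layout, and the per-chunk character cap are all assumptions made for illustration.

```python
def build_rag_prompt(question, retrieved_chunks, max_chars_per_chunk=500):
    """Combine retrieved chunks and the user's question into a single prompt string."""
    # Cap each chunk so the prompt stays within the model's context window
    context = "\n\n".join(chunk[:max_chars_per_chunk] for chunk in retrieved_chunks)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

# Stand-in chunks; in the pipeline above these would come from query(question)
demo_prompt = build_rag_prompt("Who is Hagrid?", ["A", "B"])
print(demo_prompt)
```

The resulting string could then be passed to the generate_text helper from the testing section. Since the prompt now contains several hundred tokens, the generation limit would need to be larger than the prompt itself (or expressed as a number of new tokens), and the whole input has to fit in GPT-2's 1024-token context.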

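5. Conclusion: Combining Fine-Tuning and RAG to Build a Customized LLM

This post showed how to fine-tune GPT-2 and combine it with RAG to build a model specialized for one domain. Fine-tuning updates a pretrained model's parameters so that it adapts to a specific task and dataset and produces content closer to what we need. RAG lets the model draw on an external knowledge base while generating, which improves the quality and accuracy of its output. Tokenization underlies the whole process. Fine-tuning plus RAG is currently an important route for building domain-specific LLMs.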

By llmtrend
