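Fine-Tuning LLMs with Unsloth and Ollama: A Step-by-Step Guide

As large models become more widely used, fine-tuning a pretrained language model for a specific task matters more and more. This article looks at how the Unsloth library simplifies fine-tuning and how Ollama can serve the fine-tuned model locally. Using a practical example, it walks through the full workflow: preparing data, loading the model, training, and finally running the model with Ollama.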

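1. Why Fine-Tune, and Key Concepts

Fine-tuning means retraining a pretrained language model on a dataset for a specific task. It avoids training a model from scratch and so greatly reduces compute cost and time. Think of it as teaching an experienced chef a restaurant's signature recipes rather than teaching someone to cook from the beginning. The key distinction is that fine-tuning updates the model's weights using new data, whereas parameter adjustment leaves the weights untouched and changes the model's behavior only through inference-time settings such as temperature and top_k.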
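To make the distinction concrete, here is a minimal sketch of how temperature and top_k reshape the next-token distribution at inference time while the logits, and therefore the weights that produced them, stay fixed. The function name and the toy logits are illustrative assumptions, not part of any library.

```python
import math

def sampling_distribution(logits, temperature=1.0, top_k=None):
    """Turn raw logits into sampling probabilities.

    Only the decoding settings change here; the logits themselves,
    which come from the model weights, are left untouched.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Keep only the top_k highest-scoring tokens (all tokens if top_k is None).
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    kept = set(ranked if top_k is None else ranked[:top_k])
    # Temperature-scaled softmax over the kept tokens.
    scaled = {i: logits[i] / temperature for i in kept}
    peak = max(scaled.values())
    exps = {i: math.exp(v - peak) for i, v in scaled.items()}
    total = sum(exps.values())
    return [exps[i] / total if i in kept else 0.0 for i in range(len(logits))]

toy_logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical scores for four tokens
print(sampling_distribution(toy_logits, temperature=1.0))
print(sampling_distribution(toy_logits, temperature=0.2))           # sharper distribution
print(sampling_distribution(toy_logits, temperature=1.0, top_k=2))  # only two candidates remain
```

A lower temperature concentrates probability on the highest-scoring token, and top_k removes low-ranked tokens entirely. Neither setting can teach the model a new output format, which is what fine-tuning is for.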

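2. When Is Fine-Tuning Worth It?

Fine-tuning is especially valuable in the following situations:

You need output in a specific format: for example, the model must return results as structured JSON.
You work with domain-specific data: for example, specialized data such as medical records.
You want a cost-effective model: one that performs well on a particular task without relying on a very large LLM.

As an example, suppose you are building an application that automatically extracts product information from e-commerce pages. A general-purpose model may struggle to extract every field accurately, such as product name, price, category, and brand. By fine-tuning on a dataset with many examples of e-commerce product markup, we can train a specialized model that returns these fields in a fixed JSON format, which improves both efficiency and accuracy.

Keep in mind that a fine-tuned model is more specialized and may lose some general ability. When deciding whether to fine-tune, weigh the model's generality against its performance on the target task.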

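3. Fine-Tuning with Unsloth in Practice

Unsloth aims to simplify fine-tuning while training faster and using less memory. The steps below show how to fine-tune a model with Unsloth on Google Colab.

3.1 Environment Setup

First, install the required packages: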

!pip install unsloth trl peft accelerate bitsandbytes

Then check that a GPU is available:

import torch
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"GPU: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'None'}")

3.2 Loading the Pretrained Model

Next, load a pretrained model with Unsloth. This example uses the unsloth/Phi-3-mini-4k-instruct-bnb-4bit model.

from unsloth import FastLanguageModel

model_name = "unsloth/Phi-3-mini-4k-instruct-bnb-4bit"
max_seq_length = 2048
dtype = None

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_name,
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=True,
)

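3.3 Preparing the Dataset

Prepare the dataset for fine-tuning. Each record should pair an input with its expected output. For example, records for JSON extraction look like this: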

[
    {
        "input": "<div class='product'><h2>iPad Air</h2><span class='price'>$1344</span><span class='category'>audio</span><span class='brand'>Dell</span></div>",
        "output": {"name": "iPad Air", "price": 1344, "category": "audio", "brand": "Dell"}
    },
    {
        "input": "<div class='product'><h2>Macbook Pro</h2><span class='price'>$2344</span><span class='category'>computer</span><span class='brand'>Apple</span></div>",
        "output": {"name": "Macbook Pro", "price": 2344, "category": "computer", "brand": "Apple"}
    }
]

Load the dataset and convert it into a text format the model can train on:

import json
from datasets import Dataset

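# Load the training records; the context manager closes the file afterwards.
with open("json_extraction_dataset_500.json", "r", encoding="utf-8") as f:
    file = json.load(f)

def format_prompt(example):
    # Append the tokenizer's own EOS token so the model learns where to stop.
    return f"### Input: {example['input']}\n### Output: {json.dumps(example['output'])}{tokenizer.eos_token}"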

formatted_data = [format_prompt(item) for item in file]
dataset = Dataset.from_dict({"text": formatted_data})
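
Before formatting, it can help to reject malformed records and hold a few out for later evaluation. The sketch below is one way to do that with the standard library only. The required field names come from the sample records above, while the helper names and the hold-out fraction are assumptions made for illustration.

```python
import random

REQUIRED_FIELDS = ("name", "price", "category", "brand")  # taken from the sample records

def is_valid_record(record):
    """Return True if a record has a non-empty string input and a complete output dict."""
    if not isinstance(record, dict):
        return False
    text = record.get("input")
    if not isinstance(text, str) or not text.strip():
        return False
    output = record.get("output")
    return isinstance(output, dict) and all(key in output for key in REQUIRED_FIELDS)

def split_records(records, eval_fraction=0.1, seed=3407):
    """Shuffle the valid records and split them into (train, eval) lists."""
    valid = [r for r in records if is_valid_record(r)]
    random.Random(seed).shuffle(valid)
    n_eval = int(len(valid) * eval_fraction)
    return valid[n_eval:], valid[:n_eval]

sample = [
    {"input": "<div class='product'><h2>iPad Air</h2></div>",
     "output": {"name": "iPad Air", "price": 1344, "category": "audio", "brand": "Dell"}},
    {"input": "", "output": {}},  # rejected: empty input and missing fields
]
train_records, eval_records = split_records(sample, eval_fraction=0.5)
print(len(train_records), len(eval_records))
```

The training list can then be passed through format_prompt as before, and the held-out list kept aside for measuring accuracy after training.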

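3.4 Applying LoRA Adapters

LoRA (Low-Rank Adaptation) is an efficient fine-tuning technique: it freezes the pretrained weights and trains a small number of additional low-rank parameters, which lowers compute cost and memory use.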

model = FastLanguageModel.get_peft_model(
    model,
    r=64,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    lora_alpha=128,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=3407,
    use_rslora=False,
    loftq_config=None,
)
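
To see why LoRA trains so few parameters, consider a single linear layer with a weight matrix of shape d_out x d_in. LoRA adds two small matrices, A of shape r x d_in and B of shape d_out x r, so it contributes r * (d_in + d_out) trainable values, and the update B @ A is scaled by lora_alpha / r. The sketch below works through that arithmetic. The layer size is a hypothetical value chosen for illustration, not a figure taken from the model above.

```python
def lora_param_count(d_in, d_out, r):
    """Trainable parameters LoRA adds to one d_out x d_in linear layer."""
    # A has shape (r, d_in) and B has shape (d_out, r).
    return r * (d_in + d_out)

def lora_scaling(lora_alpha, r, use_rslora=False):
    """Factor applied to the low-rank update B @ A."""
    # Rank-stabilized LoRA divides by sqrt(r) instead of r.
    return lora_alpha / (r ** 0.5) if use_rslora else lora_alpha / r

d_in = d_out = 4096      # hypothetical projection size, for illustration only
r, lora_alpha = 64, 128  # values from the configuration above

full = d_in * d_out
added = lora_param_count(d_in, d_out, r)
print(f"full layer: {full:,} weights")
print(f"LoRA adds:  {added:,} trainable weights ({added / full:.2%} of the layer)")
print(f"scaling:    {lora_scaling(lora_alpha, r)}")
```

Raising r increases capacity and memory use linearly, while lora_alpha controls how strongly the learned update is applied relative to the frozen weights.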

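3.5 Training the Model

Train the model with SFTTrainer from the trl library. The keyword arguments below follow an older trl interface; in more recent releases some of them (for example dataset_text_field and max_seq_length) are expected on SFTConfig instead, so adjust them to the installed version if needed.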

from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    dataset_num_proc=2,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=10,
        num_train_epochs=3,
        learning_rate=2e-4,
        fp16=not torch.cuda.is_bf16_supported(),
        bf16=torch.cuda.is_bf16_supported(),
        logging_steps=25,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407,
        output_dir="outputs",
        save_strategy="epoch",
        save_total_limit=2,
        dataloader_pin_memory=False,
    ),
)

trainer_stats = trainer.train()
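
It is useful to know roughly how many optimizer steps these settings produce, because warmup_steps and logging_steps are both expressed in steps. The sketch below estimates that number and reproduces the shape of a linear schedule with warmup. The dataset size of 500 is an assumption based on the file name, and the Trainer's exact rounding may differ slightly between versions.

```python
import math

def training_steps(num_examples, per_device_batch, grad_accum, epochs, num_devices=1):
    """Approximate effective batch size and total optimizer steps for a run."""
    effective_batch = per_device_batch * grad_accum * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return effective_batch, steps_per_epoch * epochs

def linear_lr(step, total_steps, base_lr, warmup_steps):
    """Learning rate under linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

# 500 examples is an assumption based on the dataset file name.
effective_batch, total_steps = training_steps(500, per_device_batch=2, grad_accum=4, epochs=3)
print(f"effective batch size: {effective_batch}, total steps: {total_steps}")
for step in (0, 10, 100, total_steps):
    print(step, linear_lr(step, total_steps, base_lr=2e-4, warmup_steps=10))
```

With a small dataset the run has only a couple of hundred steps, so a logging interval of 25 yields just a handful of loss readings; a smaller interval gives a clearer picture of convergence.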

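3.6 Running Inference

Once training finishes, the fine-tuned model can be used for inference. Note that the example below builds the prompt with the tokenizer's chat template, while the training records used the "### Input / ### Output" layout; outputs are usually more reliable when the inference prompt matches the training format.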

FastLanguageModel.for_inference(model)
messages = [
    {"role": "user", "content": "Extract the product information:\n<div class='product'><h2>iPad Air</h2><span class='price'>$1344</span><span class='category'>audio</span><span class='brand'>Dell</span></div>"},
]

inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
).to("cuda")

outputs = model.generate(
    input_ids=inputs,
    max_new_tokens=256,
    use_cache=True,
    temperature=0.7,
    do_sample=True,
    top_p=0.9,
)

response = tokenizer.batch_decode(outputs)[0]
print(response)
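
The decoded response contains the prompt and special tokens as well as the generated JSON, so downstream code usually needs to pull the JSON object out of the surrounding text. One possible approach, sketched below with only the standard library, scans for the first substring that parses as a JSON object. The helper name and the sample string are illustrative assumptions.

```python
import json

def extract_first_json(text):
    """Return the first JSON object found in text, or None if there is none."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at the given index
            # and ignores whatever follows it.
            obj, _ = decoder.raw_decode(text, start)
            if isinstance(obj, dict):
                return obj
        except json.JSONDecodeError:
            pass
        start = text.find("{", start + 1)
    return None

# Hypothetical decoded output, shaped like the training format.
decoded = ('### Output: {"name": "iPad Air", "price": 1344, '
           '"category": "audio", "brand": "Dell"}<|endoftext|>')
print(extract_first_json(decoded))
```

Returning None for unparseable output makes it easy to count format failures separately from wrong field values.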

3.7 Exporting the Model to GGUF

To run the model in Ollama, export it to GGUF format.

model.save_pretrained_gguf("gguf_model", tokenizer, quantization_method="q4_k_m")

import os
from google.colab import files

gguf_files = [f for f in os.listdir("gguf_model") if f.endswith(".gguf")]
if gguf_files:
    gguf_file = os.path.join("gguf_model", gguf_files[0])
    print(f"Downloading: {gguf_file}")
    files.download(gguf_file)

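4. Running the Fine-Tuned Model with Ollama

Ollama is an open-source tool for running large language models locally. The following steps show how to run the model we just fine-tuned.

4.1 Preparing the Model File

First, create a new directory and move the exported .gguf file into it. Then create a file named Modelfile in the same directory.

4.2 Editing the Modelfile

Add the following to the Modelfile, replacing <model_name>.gguf with the name of your model file: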

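FROM ./<model_name>.gguf
PARAMETER top_p 0.9
PARAMETER temperature 0.2
PARAMETER stop "user"
PARAMETER stop "end_of_text"
TEMPLATE """<|im_start|>user
{{ .Prompt }}<|im_end|>
<|im_start|>assistant
{{ .Response }}<|im_end|>
"""
SYSTEM "You are a helpful AI assistant."

FROM ./<model_name>.gguf: specifies the path to the model file.
PARAMETER top_p 0.9 and PARAMETER temperature 0.2: set the inference parameters.
PARAMETER stop "user" and PARAMETER stop "end_of_text": define the strings at which generation stops.
TEMPLATE: defines the prompt layout. It should match the format the model was trained on, so adjust the tags if your training prompts differ.
SYSTEM: sets the model's system prompt.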
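
If you export several models, generating the Modelfile from a script avoids copy-and-paste mistakes. The following is a minimal sketch that assembles the text using Ollama's FROM, PARAMETER, TEMPLATE, and SYSTEM instructions. The function name and the model.gguf file name are hypothetical, and the settings mirror the example values in this section.

```python
def build_modelfile(gguf_path, template, system, temperature=0.2, top_p=0.9, stops=()):
    """Assemble Modelfile text from a few settings."""
    lines = [
        f"FROM {gguf_path}",
        f"PARAMETER top_p {top_p}",
        f"PARAMETER temperature {temperature}",
    ]
    # One PARAMETER stop line per stop string.
    lines += [f'PARAMETER stop "{s}"' for s in stops]
    # Triple quotes let the template span several lines.
    lines.append(f'TEMPLATE """{template}"""')
    lines.append(f'SYSTEM """{system}"""')
    return "\n".join(lines) + "\n"

modelfile_text = build_modelfile(
    gguf_path="./model.gguf",  # hypothetical file name
    template=("<|im_start|>user\n{{ .Prompt }}<|im_end|>\n"
              "<|im_start|>assistant\n{{ .Response }}<|im_end|>\n"),
    system="You are a helpful AI assistant.",
    stops=("user", "end_of_text"),
)
print(modelfile_text)
```

The resulting string can be written to a file named Modelfile in the same directory as the .gguf file.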

4.3 Creating and Running the Model

Create the model with Ollama:

ollama create <model_name> -f Modelfile

Then run the model:

ollama run <model_name>

You can now interact with the fine-tuned model locally. For example, enter the following prompt:

Extract the product information:
<div class='product'><h2>Surface Pro</h2><span class='price'>$1044</span><span class='category'>computer</span><span class='brand'>Microsoft</span></div>

The model should return a JSON response like this:

{"name": "Surface Pro", "price": 1044, "category": "computer", "brand": "Microsoft"}
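
Besides the interactive prompt, a running Ollama server also exposes a local HTTP API, which is convenient for calling the model from an application. The sketch below sends one non-streaming request to the generate endpoint on Ollama's default port using only the standard library. The helper names are assumptions, and the model name passed in must be whatever you chose in ollama create.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def build_payload(model, prompt):
    """Request body for a single, non-streaming generation."""
    return {"model": model, "prompt": prompt, "stream": False}

def generate(model, prompt, url=OLLAMA_URL, timeout=120):
    """Send the prompt to a running Ollama server and return the reply text."""
    data = json.dumps(build_payload(model, prompt)).encode("utf-8")
    request = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        return json.loads(reply.read().decode("utf-8"))["response"]

# Example call (requires a running Ollama server and an existing model):
# html = "<div class='product'><h2>Surface Pro</h2><span class='price'>$1044</span></div>"
# print(generate("my-extractor", "Extract the product information:\n" + html))
```

Setting stream to False returns the whole answer in one JSON document, which keeps the client code short at the cost of waiting for generation to finish.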

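5. Advantages of Unsloth over Conventional Fine-Tuning

Conventional fine-tuning, especially of large language models, tends to require substantial compute and time. Unsloth improves efficiency in the following ways:

Faster training: Unsloth uses optimized GPU kernels that can speed up training noticeably.
Lower memory use: Unsloth supports 4-bit quantization, so large models can be loaded and trained within limited GPU memory.
Simpler API: Unsloth provides a concise, easy-to-use API that makes the fine-tuning workflow more convenient.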
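
The effect of 4-bit quantization on memory can be estimated with simple arithmetic: the weights need roughly the parameter count times the bits per weight, divided by eight, in bytes. The sketch below applies that to a model of about 3.8 billion parameters, the approximate size of Phi-3-mini. It is a rough lower bound under stated assumptions, since it ignores activations, optimizer state, LoRA weights, and quantization metadata.

```python
def weight_memory_gib(num_params, bits_per_weight):
    """Approximate memory needed to hold the weights alone, in GiB."""
    return num_params * bits_per_weight / 8 / 1024**3

num_params = 3.8e9  # approximate parameter count of Phi-3-mini
for label, bits in (("16-bit", 16), ("8-bit", 8), ("4-bit", 4)):
    print(f"{label}: {weight_memory_gib(num_params, bits):.2f} GiB")
```

Going from 16-bit to 4-bit weights cuts the weight footprint to about a quarter, which is what makes a model of this size practical on a single consumer or Colab GPU.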

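For example, on the same hardware, fine-tuning a multi-billion-parameter model with Unsloth and 4-bit LoRA can finish within hours on a single GPU, whereas full fine-tuning without these optimizations may take considerably longer or may not fit in memory at all. The actual gain depends on the model, the hardware, and the configuration. Lower requirements like these let more people take part in fine-tuning and applying large models.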

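6. More Fine-Tuning Strategies and Tips

Besides LoRA, there are other parameter-efficient fine-tuning strategies to try, for example:

Prompt Tuning: learns a small set of trainable "soft prompt" embeddings that are prepended to the input while the model's weights stay frozen.
Prefix Tuning: adds trainable prefix vectors to the attention layers throughout the model, again with the base weights frozen.
Adapter Tuning: inserts small trainable adapter modules between the model's existing layers.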

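Which strategy fits best depends on the task and the dataset. In addition, the following practices can help you get better results:

Data augmentation: expand the dataset by generating new examples.
Hyperparameter optimization: use automated tools to search for a good combination of hyperparameters.
Model evaluation: measure performance with suitable metrics and adjust based on the results.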
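
For the JSON extraction task used in this article, two simple metrics are often enough to start with: the share of outputs that match the reference exactly, and the accuracy of each field on its own. The sketch below computes both. The field names come from the sample records, while the function name and the toy predictions are assumptions for illustration.

```python
FIELDS = ("name", "price", "category", "brand")

def evaluate(predictions, references, fields=FIELDS):
    """Compute exact-match rate and per-field accuracy.

    predictions may contain None for outputs that were not valid JSON.
    """
    if len(predictions) != len(references) or not references:
        raise ValueError("need two non-empty lists of equal length")
    n = len(references)
    exact = sum(1 for p, r in zip(predictions, references) if p == r)
    per_field = {}
    for field in fields:
        hits = sum(
            1 for p, r in zip(predictions, references)
            if isinstance(p, dict) and p.get(field) == r.get(field)
        )
        per_field[field] = hits / n
    return {"exact_match": exact / n, "per_field": per_field}

references = [
    {"name": "iPad Air", "price": 1344, "category": "audio", "brand": "Dell"},
    {"name": "Macbook Pro", "price": 2344, "category": "computer", "brand": "Apple"},
]
predictions = [
    {"name": "iPad Air", "price": 1344, "category": "audio", "brand": "Dell"},
    {"name": "Macbook Pro", "price": "2344", "category": "computer", "brand": "Apple"},  # price as a string
]
print(evaluate(predictions, references))
```

Per-field accuracy shows where the model goes wrong, for instance returning a price as a string, which an exact-match score alone would hide.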

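7. Safety and Ethical Considerations

Fine-tuning a large model also raises safety and ethical questions. For example, you need to make sure the fine-tuned model does not produce harmful or discriminatory content. Common measures include:

Data filtering: remove harmful or sensitive content from the training data.
Adversarial training: train the model to withstand adversarial attacks.
Human review: have people review the model's outputs to confirm they meet ethical standards.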
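
As a starting point for data filtering, a rule-based pass can drop training records that contain obviously sensitive strings before any training happens. The sketch below is a deliberately simple example that checks only the input text. The patterns and helper names are hypothetical, and a real project would need a policy suited to its own data.

```python
import re

# Hypothetical patterns for illustration; a real project needs its own policy.
SENSITIVE_PATTERNS = [
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),  # email addresses
    re.compile(r"\b(?:\d[ -]?){13,16}\b"),     # long digit runs such as card numbers
]

def is_clean(text):
    """Return True if the text matches none of the sensitive patterns."""
    return not any(pattern.search(text) for pattern in SENSITIVE_PATTERNS)

def filter_records(records):
    """Keep only the records whose input text passes the check."""
    return [record for record in records if is_clean(record["input"])]

records = [
    {"input": "<h2>iPad Air</h2><span class='price'>$1344</span>", "output": {}},
    {"input": "<p>Contact seller at a.b@example.com</p>", "output": {}},
]
print(len(filter_records(records)))
```

Rules like these catch only what they are written for, so they complement human review rather than replace it.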

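8. Conclusion

This article has covered the basics of fine-tuning an LLM with Unsloth and running it with Ollama. Fine-tuning is a key step in building domain-specific AI applications and can substantially improve a model's performance on a particular task. Combining Unsloth's tooling with Ollama's local deployment lets you train, test, and use customized models more efficiently. As large model technology continues to develop, fine-tuning is likely to play an increasingly important role in AI applications. With the right data and tools, anyone can build a custom model optimized for a specific use case and run it locally, without depending on a hosted model service.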
