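In the world of large language models (LLMs), fine-tuning is a key technique: it adapts a general-purpose model to a specific task and can substantially improve performance on that task. This article takes a close look at LoRA fine-tuning, an efficient and approachable technique, and walks step by step through a practical case: using LoRA to build your own "plushie doctor" LLM that treats all kinds of stuffed-toy ailments. It covers environment setup, infrastructure, LLMOps basics, model monitoring, and best practices, with the goal of being a complete, hands-on guide to fine-tuning an LLM.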

Environment Setup: The First Step Toward LLM Fine-Tuning

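The first step in fine-tuning an LLM is setting up a suitable development environment. The author recommends a uv environment and stresses how important it is to install the right torch build. If the torch build does not match the installed CUDA version, GPU acceleration will not be enabled and training will be far slower. You can check the CUDA version with nvcc --version or nvidia-smi (nvcc reports the installed toolkit version; nvidia-smi reports the highest CUDA version the driver supports). For example, if the CUDA version is 12.1, use the cu121 build. Install it with the following command: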

uv pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
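
The wheel tag in that index URL follows directly from the CUDA version: drop the dot and prefix it with cu. As a minimal sketch (the helper names cuda_wheel_tag and torch_index_url are hypothetical, and PyTorch only publishes wheels for certain CUDA versions, so confirm that the resulting index actually exists before relying on it):

```python
def cuda_wheel_tag(cuda_version: str) -> str:
    """Turn a CUDA version such as '12.1' into a wheel tag such as 'cu121'."""
    major, minor = cuda_version.strip().split(".")[:2]
    return f"cu{int(major)}{int(minor)}"


def torch_index_url(cuda_version: str) -> str:
    """Build the PyTorch wheel index URL for the given CUDA version."""
    return f"https://download.pytorch.org/whl/{cuda_wheel_tag(cuda_version)}"


print(cuda_wheel_tag("12.1"))   # cu121
print(torch_index_url("12.1"))  # https://download.pytorch.org/whl/cu121
```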

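Note that, at the time of writing, torch on Windows does not yet support Python 3.13, so the environment needs to be downgraded to 3.12. Once torch is installed, install the other required packages with the following command: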

uv pip install datasets transformers "accelerate>=0.26.0" loguru dotenv peft zenml "zenml[server]==0.83.0" hf_xet comet_ml
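
Before moving on, it is worth confirming that the installed torch build can actually see the GPU, since a mismatched build silently falls back to the CPU. The following is a small sanity-check sketch; the function name describe_torch_cuda is made up for illustration, and it returns early if torch is not installed.

```python
import importlib.util


def describe_torch_cuda() -> dict:
    """Report whether torch is installed and whether it can see a CUDA GPU."""
    if importlib.util.find_spec("torch") is None:
        return {"torch_installed": False}
    import torch

    info = {
        "torch_installed": True,
        "torch_version": torch.__version__,
        # CUDA version this torch build was compiled against (None for CPU-only builds)
        "built_for_cuda": torch.version.cuda,
        "cuda_available": torch.cuda.is_available(),
    }
    if info["cuda_available"]:
        info["gpu_name"] = torch.cuda.get_device_name(0)
    return info


print(describe_torch_cuda())
```

If cuda_available comes back False on a machine with an NVIDIA GPU, the usual suspect is a CPU-only or mismatched torch wheel.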

LLMOps: ZenML and CometML

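To keep the fine-tuning process manageable, the author introduces the idea of LLMOps and picks ZenML and CometML as the tools. ZenML builds the local pipeline, which makes the workflow visible and automated and easier to improve later. CometML stores the training metrics and artifacts.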

First, initialize the ZenML project:

zenml init

Then install the Hugging Face integration, which makes it easier to work with models and datasets:

zenml integration install huggingface --uv

Next, run the following command to start a local uvicorn server and open the default ZenML workspace in the browser:

zenml login --local --blocking

To use CometML, first sign up on its website (https://www.comet.com/), create a project workspace, and get an API key. Then run the following commands to register a CometML experiment tracker and add it to a ZenML stack:

zenml experiment-tracker register comet_tracker --flavor=comet --workspace=your-workspace-name --project_name=project-name --api_key=FEW-CHARACTERS-API-KEY
zenml stack register custom_stack -e comet_tracker -o default -a default --set

Be sure to end the second command with --set, which makes the new stack the active one; otherwise experiment tracking may not work properly.
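
The code later in this article reads the tracker from the active stack (Client().active_stack.experiment_tracker), so a stack without a tracker is worth catching early. The sketch below shows one way to check for it. The helper tracker_name is hypothetical, and the SimpleNamespace objects are stand-ins so the snippet runs without ZenML; the commented lines show how it might be called against the real active stack.

```python
from types import SimpleNamespace
from typing import Any, Optional


def tracker_name(stack: Any) -> Optional[str]:
    """Return the name of the stack's experiment tracker, or None if it has none."""
    tracker = getattr(stack, "experiment_tracker", None)
    return tracker.name if tracker is not None else None


# Stand-in objects so the sketch runs without a ZenML installation.
with_tracker = SimpleNamespace(experiment_tracker=SimpleNamespace(name="comet_tracker"))
without_tracker = SimpleNamespace(experiment_tracker=None)

print(tracker_name(with_tracker))
print(tracker_name(without_tracker))

# In the real project you would pass the active stack instead:
#   from zenml.client import Client
#   print(tracker_name(Client().active_stack))
```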

Code Implementation: From Dataset to Fine-Tuned Model

The article provides a series of Python snippets that implement the fine-tuning workflow. First come the imports and helper functions:

import json
import tempfile
from typing import Annotated, Any
import torch
from datasets import Dataset as HFDataset
from loguru import logger
from peft import LoraConfig, PeftModel, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, PreTrainedModel,
                          PreTrainedTokenizerFast, Trainer, TrainingArguments)
from zenml import ArtifactConfig, pipeline, step
from zenml.client import Client
from zenml.enums import ArtifactType
from zenml.integrations.comet.experiment_trackers import CometExperimentTracker

experiment_tracker: CometExperimentTracker = Client().active_stack.experiment_tracker
DEFAULT_BASE_MODEL_NAME: str = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
LORA_TARGET_MODULES: list[str] = ["q_proj", "k_proj", "v_proj", "o_proj"]
DEFAULT_MAX_SEQ_LENGTH: int = 256
DEFAULT_LORA_R: int = 16
DEFAULT_LORA_ALPHA: int = 32
DEFAULT_MAX_STEPS: int = 50
DEFAULT_PER_DEVICE_TRAIN_BATCH_SIZE: int = 1
DEFAULT_LEARNING_RATE: float = 2e-4
DEFAULT_MAX_NEW_TOKENS: int = 64

def print_trainable_parameters(model: PreTrainedModel | PeftModel) -> dict[str, Any]:
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    trainable_percentage = 100 * trainable_params / all_param if all_param > 0 else 0
    logger.info(
        f"Trainable params: {trainable_params:,} || "
        f"All params: {all_param:,} || "
        f"Trainable %: {trainable_percentage:.2f}%"
    )
    return {"trainable_params": trainable_params, "all_params": all_param, "trainable_percentage": trainable_percentage}

def format_instruction(example: dict[str, str]) -> str:
    return f"### Instruction: {example['instruction']}\n### Response: {example['response']}"
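
To see what the model is actually trained on, here is the template applied to one dataset row (with the response shortened for display), along with the prompt form used later at inference time, where the response is left empty. The helper is repeated so the snippet runs on its own.

```python
def format_instruction(example: dict[str, str]) -> str:
    # Same template as the helper above, repeated so this snippet runs on its own.
    return f"### Instruction: {example['instruction']}\n### Response: {example['response']}"


training_text = format_instruction({
    "instruction": "My doll has a fever",
    "response": "Poor dolly is too hot!",  # shortened from the dataset row
})
inference_prompt = format_instruction({"instruction": "My doll has a fever", "response": ""})

print(training_text)
print(repr(inference_prompt))
```

Because the inference prompt is a prefix of the training text, the fine-tuned model learns to continue after "### Response: " with the doctor's answer.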

Next, define a ZenML step that loads the training dataset:

@step(experiment_tracker=experiment_tracker.name)
def load_training_dataset() -> Annotated[HFDataset, ArtifactConfig(name="training_dataset")]:
    logger.info("Loading plushie doctor dataset...")
    data = [
        {"instruction": "My teddy bear has a tummy ache",
         "response": "Oh no! Let me listen to teddy's tummy with my magic stethoscope! *puts stethoscope on teddy* I hear rumbly sounds! Teddy needs special cuddle medicine and maybe some pretend soup!"},
        {"instruction": "My doll fell down and hurt her knee",
         "response": "Don't worry! Doctor me will make it all better! *puts colorful bandaid on doll's knee* There! Now she needs lots of kisses and maybe a special blanket to feel better!"},
        {"instruction": "My stuffed elephant can't sleep",
         "response": "Elephants need special sleepy medicine! *pretends to give elephant medicine* Now let's tuck elephant in real cozy and sing a lullaby! Sweet dreams, elephant!"},
        {"instruction": "My bunny has a broken ear",
         "response": "Oh my! Bunny's ear needs super gentle fixing! *carefully pretends to stitch* I'm using my special invisible thread! Now bunny's ear is good as new and even stronger!"},
        {"instruction": "My toy dinosaur has a cold",
         "response": "Even big dinosaurs get sniffles! *covers dinosaur with tiny blanket* Dinosaur needs warm tea and lots of rest in the toy box hospital! No more roaring until better!"},
        {"instruction": "My puppy stuffie won't eat",
         "response": "Puppy must have a sore throat! *looks in puppy's mouth with flashlight* I see the problem! Puppy needs magic throat drops and maybe some yummy pretend kibble!"},
        {"instruction": "My doll has a fever",
         "response": "Poor dolly is too hot! *puts cool washcloth on doll's forehead* We need to take her temperature! Beep beep! She needs lots of water and quiet time to get better!"},
        {"instruction": "My bear's paw is stuck",
         "response": "Don't worry bear! Doctor me will help you! *gently moves bear's paw* Sometimes paws get sleepy and need gentle wiggling! There! Now bear can wave hello again!"},
    ]
    dataset = HFDataset.from_list(data)
    logger.info(f"Dataset loaded with {len(dataset)} examples")
    return dataset

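This step creates a small example dataset of plushie medical scenarios, each with an instruction (the problem) and a response (the treatment). Then define another step that prepares the model and tokenizer: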

@step(experiment_tracker=experiment_tracker.name)
def prepare_model_and_tokenizer(
        base_model_name: str = DEFAULT_BASE_MODEL_NAME) -> tuple[
    Annotated[PreTrainedModel, ArtifactConfig(name="base_model")],
    Annotated[PreTrainedTokenizerFast, ArtifactConfig(name="base_tokenizer")]]:
    logger.info(f"Loading base model and tokenizer: {base_model_name}")
    tokenizer = AutoTokenizer.from_pretrained(base_model_name, trust_remote_code=True)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
        logger.info("Set pad_token to eos_token")
    model_kwargs: dict[str, Any] = {"trust_remote_code": True}
    if torch.cuda.is_available():
        model_kwargs.update({
            "torch_dtype": torch.float16,
            "device_map": "auto"
        })
        logger.info("Loading model for GPU with float16")
    else:
        model_kwargs["torch_dtype"] = torch.float32
        logger.info("Loading model for CPU")
    model = AutoModelForCausalLM.from_pretrained(base_model_name, **model_kwargs)
    logger.info("Model loaded successfully")
    logger.info("Base model parameters:")
    print_trainable_parameters(model)
    return model, tokenizer

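This step loads the specified pretrained model and tokenizer. If the tokenizer has no pad token, it is set to the eos token. If CUDA is available, the model is loaded onto the GPU in torch.float16 to save GPU memory; otherwise it is loaded on the CPU in float32. Next, define a step that tokenizes the dataset: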

@step(experiment_tracker=experiment_tracker.name)
def tokenize_dataset(
        dataset: HFDataset,
        tokenizer: PreTrainedTokenizerFast,
        max_seq_length: int = DEFAULT_MAX_SEQ_LENGTH) -> Annotated[HFDataset, ArtifactConfig(name="tokenized_dataset")]:
    logger.info(f"Tokenizing dataset with max length: {max_seq_length}")

    def tokenize_function(examples: dict[str, list[str]]) -> dict[str, Any]:
        texts = []
        for i in range(len(examples["instruction"])):
            text = format_instruction({
                "instruction": examples["instruction"][i],
                "response": examples["response"][i]
            })
            texts.append(text)
        tokenized = tokenizer(
            texts,
            truncation=True,
            padding="max_length",
            max_length=max_seq_length,
            return_tensors=None
        )
        tokenized["labels"] = tokenized["input_ids"].copy()
        return tokenized

    tokenized_dataset = dataset.map(
        tokenize_function,
        batched=True,
        remove_columns=dataset.column_names,
        desc="Tokenizing dataset"
    )
    logger.info("Dataset tokenized successfully")
    return tokenized_dataset

This step joins each instruction and response into a single text with format_instruction, tokenizes it with padding and truncation to max_seq_length, and sets labels equal to input_ids.
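
To make the shapes concrete, here is a toy sketch of what padding="max_length" with truncation produces and how the labels relate to input_ids. It uses a made-up whitespace tokenizer and made-up ids (toy_tokenize, PAD_ID, and the vocabulary are all hypothetical), not the real TinyLlama tokenizer. The -100 masking mirrors what DataCollatorForLanguageModeling does with mlm=False: it rebuilds labels from input_ids and sets padding positions to -100 so the loss ignores them. Because this article sets pad_token to eos_token, end-of-sequence positions are masked the same way.

```python
PAD_ID = 0           # hypothetical padding id
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def toy_tokenize(text: str, vocab: dict[str, int], max_length: int) -> dict[str, list[int]]:
    """Whitespace 'tokenizer' with truncation and max_length padding."""
    ids = [vocab.setdefault(word, len(vocab) + 1) for word in text.split()]
    ids = ids[:max_length]                        # truncation=True
    n_real = len(ids)
    ids = ids + [PAD_ID] * (max_length - n_real)  # padding="max_length"
    attention_mask = [1] * n_real + [0] * (max_length - n_real)
    labels = [tok if tok != PAD_ID else IGNORE_INDEX for tok in ids]
    return {"input_ids": ids, "attention_mask": attention_mask, "labels": labels}


vocab: dict[str, int] = {}
short = toy_tokenize("### Instruction: hi ### Response: hello", vocab, max_length=8)
long = toy_tokenize("a b c d e f g h i j", vocab, max_length=8)

print(short)
print(long["input_ids"])
```

The short text is padded up to eight positions and the long one is cut off at eight, which is what happens to any training example longer than max_seq_length.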

The most important step is the fine-tuning itself, implemented by the following step:

@step(experiment_tracker=experiment_tracker.name)
def train_lora_model(
        base_model: PreTrainedModel,
        tokenizer: PreTrainedTokenizerFast,
        tokenized_dataset: HFDataset,
        lora_r: int = DEFAULT_LORA_R,
        lora_alpha: int = DEFAULT_LORA_ALPHA,
        lora_dropout: float = 0.05,
        per_device_train_batch_size: int = DEFAULT_PER_DEVICE_TRAIN_BATCH_SIZE,
        learning_rate: float = DEFAULT_LEARNING_RATE,
        max_steps: int = DEFAULT_MAX_STEPS,
        logging_steps: int = 1,
        use_fp16: bool = False,
        use_bf16: bool = False) -> tuple[
    Annotated[PeftModel, ArtifactConfig(name="lora_model", artifact_type=ArtifactType.MODEL)],
    Annotated[dict[str, Any], ArtifactConfig(name="training_metrics")]]:
    logger.info("Starting LoRA fine-tuning...")
    if not use_fp16 and not use_bf16 and torch.cuda.is_available():
        try:
            device_capability = torch.cuda.get_device_capability()
            if device_capability[0] >= 8:
                use_bf16 = True
                logger.info("Auto-enabled BF16 for Ampere+ GPU")
            else:
                use_fp16 = True
                logger.info("Auto-enabled FP16 for pre-Ampere GPU")
        except Exception as e:
            logger.warning(f"Could not detect GPU capability: {e}")
    logger.info(f"Using target modules: {LORA_TARGET_MODULES}")
    lora_config = LoraConfig(
        r=lora_r,
        lora_alpha=lora_alpha,
        target_modules=LORA_TARGET_MODULES,
        lora_dropout=lora_dropout,
        bias="none",
        task_type="CAUSAL_LM"
    )
    peft_model = get_peft_model(base_model, lora_config)
    logger.info("LoRA configuration applied to model")
    logger.info("LoRA model parameters:")
    lora_params_info = print_trainable_parameters(peft_model)
    temp_output_dir = tempfile.mkdtemp(prefix="zenml_lora_")
    logger.info(f"Using temporary output directory: {temp_output_dir}")
    training_args = TrainingArguments(
        output_dir=temp_output_dir,
        per_device_train_batch_size=per_device_train_batch_size,
        learning_rate=learning_rate,
        max_steps=max_steps,
        logging_steps=logging_steps,
        save_steps=max_steps + 1,
        save_strategy="no",
        save_safetensors=False,
        no_cuda=not torch.cuda.is_available(),
        fp16=use_fp16,
        bf16=use_bf16,
        dataloader_num_workers=0,
        remove_unused_columns=False,
        seed=42,
        warmup_steps=min(10, max_steps // 10),
        gradient_accumulation_steps=1,
        report_to="comet_ml" if experiment_tracker else "none",
    )
    if experiment_tracker:
        hyperparams_to_log = training_args.to_dict()
        if "output_dir" in hyperparams_to_log:
            del hyperparams_to_log["output_dir"]
        experiment_tracker.log_params(hyperparams_to_log)
        experiment_tracker.log_params({
            "lora_r": lora_r,
            "lora_alpha": lora_alpha,
            "lora_dropout": lora_dropout,
            "lora_target_modules": LORA_TARGET_MODULES,
            "max_seq_length": tokenizer.model_max_length,
        })
    data_collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer,
        mlm=False,
    )
    trainer = Trainer(
        model=peft_model,
        args=training_args,
        train_dataset=tokenized_dataset,
        data_collator=data_collator,
    )
    if hasattr(peft_model, 'config'):
        peft_model.config.use_cache = False
    logger.info("Starting training...")
    train_result = trainer.train()
    logger.info("Training completed!")
    training_metrics = {
        "train_runtime": train_result.metrics.get("train_runtime", 0),
        "train_samples_per_second": train_result.metrics.get("train_samples_per_second", 0),
        "train_loss": train_result.metrics.get("train_loss", 0),
        "lora_params": lora_params_info,
        "lora_config": {
            "r": lora_r,
            "lora_alpha": lora_alpha,
            "lora_dropout": lora_dropout,
            "target_modules": LORA_TARGET_MODULES,
        }
    }
    if experiment_tracker:
        experiment_tracker.log_metrics({
            "train_runtime": training_metrics["train_runtime"],
            "train_samples_per_second": training_metrics["train_samples_per_second"],
            "train_loss": training_metrics["train_loss"],
            "lora_trainable_params": training_metrics["lora_params"]["trainable_params"],
            "lora_all_params": training_metrics["lora_params"]["all_params"],
            "lora_trainable_percentage": training_metrics["lora_params"]["trainable_percentage"],
        })
    logger.info(f"Training metrics: {json.dumps(training_metrics, indent=2, default=str)}")
    unwrapped_model = trainer.model
    if hasattr(trainer, 'accelerator') and trainer.accelerator is not None:
        logger.info("Unwrapping model using trainer.accelerator")
        unwrapped_model = trainer.accelerator.unwrap_model(trainer.model)
    else:
        logger.info("No trainer.accelerator found, model might be already unwrapped or not wrapped by Accelerate.")
    model_state_dict = unwrapped_model.state_dict()
    clean_peft_model = get_peft_model(base_model, lora_config)
    clean_peft_model.load_state_dict(model_state_dict)
    final_model_cpu = clean_peft_model.to('cpu')
    try:
        logger.info("Attempting to convert model to float32 for serialization.")
        final_model_cpu = final_model_cpu.float()
    except Exception as e:
        logger.warning(f"Could not convert model to float32: {e}. Proceeding with original dtype on CPU.")
    torch.cuda.empty_cache()
    if hasattr(final_model_cpu, 'config'):
        final_model_cpu.config.use_cache = True
        if hasattr(final_model_cpu, 'base_model') and hasattr(final_model_cpu.base_model, 'config'):
            final_model_cpu.base_model.config.use_cache = True
            logger.info("Set use_cache=True on PeftModel and its base model config.")
        else:
            logger.info("Set use_cache=True on model config.")
    return final_model_cpu, training_metrics

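This step implements LoRA fine-tuning with the PEFT library. First, unless a precision was set explicitly, it picks mixed-precision training automatically: BF16 on GPUs with compute capability 8 or higher (Ampere and newer), FP16 otherwise. It then defines the LoRA configuration: rank (r), alpha (lora_alpha), dropout, and target modules. The target modules are the layers that receive LoRA adapters; a common choice is the query, key, value, and output projection matrices of the attention layers. get_peft_model applies the LoRA configuration to the base model and returns a PeftModel. Next come the TrainingArguments, including the output directory, batch size, learning rate, and max steps. A Trainer trains the model, and the training metrics are logged to CometML. Finally, the model is moved to the CPU, converted to float32, and returned as an artifact.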
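
The trainable-parameter count reported by print_trainable_parameters can be estimated by hand. LoRA leaves each targeted weight matrix (shape d_out x d_in) frozen and learns two small matrices, A (r x d_in) and B (d_out x r), so each targeted layer adds r * (d_in + d_out) trainable parameters, and the learned update is scaled by lora_alpha / r. The sketch below applies that arithmetic. The layer shapes are assumptions about TinyLlama-1.1B (hidden size 2048, 22 layers, key and value projections of width 256 because of grouped-query attention); check them against the model's config.json before trusting the total.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one linear layer: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


# Assumed TinyLlama-1.1B attention shapes (d_in, d_out); verify against config.json.
ASSUMED_SHAPES = {
    "q_proj": (2048, 2048),
    "k_proj": (2048, 256),
    "v_proj": (2048, 256),
    "o_proj": (2048, 2048),
}
ASSUMED_NUM_LAYERS = 22

r, lora_alpha = 16, 32  # DEFAULT_LORA_R and DEFAULT_LORA_ALPHA from the article

per_layer = sum(lora_params(d_in, d_out, r) for d_in, d_out in ASSUMED_SHAPES.values())
total = per_layer * ASSUMED_NUM_LAYERS
scaling = lora_alpha / r

print(f"per layer: {per_layer:,}")  # 204,800
print(f"total:     {total:,}")      # 4,505,600
print(f"scaling:   {scaling}")      # 2.0
```

Under these assumptions, the adapters amount to a few million parameters against a base model of roughly 1.1 billion, which is why LoRA fits on modest hardware.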

To check the effect of fine-tuning, the following step evaluates the model:

@step(experiment_tracker=experiment_tracker.name)
def evaluate_model(
        model: PreTrainedModel | PeftModel,
        tokenizer: PreTrainedTokenizerFast,
        max_new_tokens: int = DEFAULT_MAX_NEW_TOKENS,) -> Annotated[dict[str, str], ArtifactConfig(name="evaluation_results")]:
    logger.info("Evaluating model...")
    test_prompts = [
            "My robot toy is making weird noises",
            "My stuffed cat won't purr",
            "My doll's hair is messy",
            "My toy car won't drive",
            "My stuffed dragon feels sad",
            "My teddy bear lost his voice"
    ]
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = model.to(device)
    logger.info(f"Model moved to {device}")
    results = {}
    for prompt in test_prompts:
        logger.info(f"Generating response for: '{prompt}'")
        formatted_prompt = format_instruction({'instruction': prompt, 'response': ''})
        inputs = tokenizer(
            formatted_prompt,
            return_tensors="pt",
            truncation=True,
            max_length=tokenizer.model_max_length - max_new_tokens
        ).to(device)
        with torch.no_grad():
            outputs = model.generate(
                input_ids=inputs.input_ids,
                attention_mask=inputs.attention_mask,
                max_new_tokens=max_new_tokens,
                temperature=0.3,
                do_sample=True,
                pad_token_id=tokenizer.eos_token_id,
                eos_token_id=tokenizer.eos_token_id,
            )
        response = tokenizer.decode(outputs[0], skip_special_tokens=True)
        if response.startswith(formatted_prompt):
            response = response[len(formatted_prompt):].strip()
        results[prompt] = response
        logger.info(f"Response: {response}")
    return results

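This step generates a response for each prompt in a fixed test set and stores the results as an artifact. For the prompt "My robot toy is making weird noises", for example, the model might produce something like "Oh no! Let me listen to my robot toy’s toyie throat! pretends to suckle covers ear coughs", which shows the "plushie doctor" persona coming through.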
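
The generate call above samples with temperature=0.3. Temperature divides the logits before the softmax, so values below 1 concentrate probability on the most likely tokens and make sampling less random. A small sketch with made-up logits (the numbers are illustrative, not taken from the model):

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Softmax over logits / temperature, computed stably by subtracting the max."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens
for t in (1.0, 0.3):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At the lower temperature the top token takes a much larger share of the probability mass, which keeps the doctor's answers more consistent from run to run.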

Pipeline: Putting the Steps Together

With these steps in place, ZenML can combine them into a pipeline:

@pipeline
def simple_lora_pipeline(
        base_model_name: str = DEFAULT_BASE_MODEL_NAME,
        max_seq_length: int = DEFAULT_MAX_SEQ_LENGTH,
        lora_r: int = DEFAULT_LORA_R,
        lora_alpha: int = DEFAULT_LORA_ALPHA,
        max_steps: int = DEFAULT_MAX_STEPS,
        per_device_train_batch_size: int = DEFAULT_PER_DEVICE_TRAIN_BATCH_SIZE,
        learning_rate: float = DEFAULT_LEARNING_RATE,
        use_fp16: bool = False,
        use_bf16: bool = False,
        max_new_tokens: int = DEFAULT_MAX_NEW_TOKENS,) -> None:
    logger.info("Starting simple LoRA fine-tuning pipeline")
    dataset = load_training_dataset()
    base_model, tokenizer = prepare_model_and_tokenizer(base_model_name=base_model_name)
    logger.info("Evaluating base model before fine-tuning...")
    _ = evaluate_model(
        model=base_model,
        tokenizer=tokenizer,
        max_new_tokens=max_new_tokens,
    )
    logger.info("Base model evaluation complete.")
    tokenized_dataset = tokenize_dataset(
        dataset=dataset,
        tokenizer=tokenizer,
        max_seq_length=max_seq_length
    )
    lora_model, _ = train_lora_model(
        base_model=base_model,
        tokenizer=tokenizer,
        tokenized_dataset=tokenized_dataset,
        lora_r=lora_r,
        lora_alpha=lora_alpha,
        per_device_train_batch_size=per_device_train_batch_size,
        learning_rate=learning_rate,
        max_steps=max_steps,
        use_fp16=use_fp16,
        use_bf16=use_bf16,
    )
    _ = evaluate_model(
        model=lora_model,
        tokenizer=tokenizer,
        max_new_tokens=max_new_tokens,
    )
    logger.info("Pipeline completed successfully!")

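The pipeline first loads the dataset and prepares the model and tokenizer, then evaluates the base model. Next it tokenizes the dataset and trains the model with LoRA. Finally, it evaluates the fine-tuned model.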

Run the pipeline with the following code:

import random

logger.info("Starting ZenML LoRA Pipeline")
run_name = f"simple_lora_{random.randint(1000, 9999)}"
pipeline_instance = simple_lora_pipeline.with_options(
    run_name=run_name,
    enable_cache=False,
)
try:
    pipeline_run = pipeline_instance(
        base_model_name=DEFAULT_BASE_MODEL_NAME,
        max_seq_length=32,
        lora_r=32,
        lora_alpha=16,
        max_steps=50,
        per_device_train_batch_size=1,
        learning_rate=5e-4,
        use_fp16=torch.cuda.is_available(),
        use_bf16=False,
        max_new_tokens=256,
    )
    if pipeline_run:
        logger.info(f"Pipeline executed successfully with run name: {run_name}")
        logger.info("You can find the trained model and outputs in the ZenML dashboard.")
except Exception as e:
    logger.error(f"Pipeline execution failed: {e}")
    logger.exception("Full traceback:")

After the run, you can check the pipeline's execution status and results in the ZenML dashboard and view the training metrics in CometML.
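
A few quantities implied by this run's arguments are easy to miss, so here is the arithmetic as a small sketch. It assumes the 8-example dataset and the warmup formula from the training step, with gradient_accumulation_steps=1 and a single device; the variable names are only for illustration.

```python
num_examples = 8             # rows in the plushie doctor dataset
batch_size = 1               # per_device_train_batch_size
max_steps = 50
lora_r, lora_alpha = 32, 16  # values passed in the run above

# One optimizer step consumes batch_size examples (single device, no accumulation).
steps_per_epoch = num_examples // batch_size
epochs = max_steps / steps_per_epoch

# Same formula as warmup_steps in train_lora_model.
warmup_steps = min(10, max_steps // 10)

# LoRA scales its update by lora_alpha / r.
scaling = lora_alpha / lora_r

print(f"epochs over the data: {epochs}")        # 6.25
print(f"warmup steps:         {warmup_steps}")  # 5
print(f"LoRA scaling:         {scaling}")       # 0.5
```

Note that this run passes lora_r=32 and lora_alpha=16, giving a scaling of 0.5 rather than the 2.0 implied by the defaults (r=16, alpha=32). It also sets max_seq_length=32, well below the default of 256, so any training example longer than 32 tokens is truncated.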

Summary: The Power of LoRA Fine-Tuning

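This article showed how to use LoRA fine-tuning, together with tools such as ZenML and CometML, to build an LLM fine-tuning pipeline and apply it to the "plushie doctor" scenario. LoRA is an efficient and approachable technique that lets us adapt large language models to specific tasks with limited compute and get more out of them. Whether you are building your own plushie doctor or tackling problems in other domains, LoRA fine-tuning is a capable tool to have on hand.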