Fine-Tuning DeepSeek LLM: Adapting Open-Source AI for Your Needs

Abhishek Maheshwarappa
5 min read · Jan 22, 2025


Introduction

DeepSeek LLM is a powerful open-source language model, but to maximize its potential for specific applications, fine-tuning is essential. In this guide, we’ll walk through the process of fine-tuning DeepSeek LLM using Supervised Fine-Tuning (SFT) with Hugging Face datasets, providing a step-by-step code walkthrough for training on a domain-specific dataset. We will also discuss the loss function used, why a subset of data was used, and how LoRA (Low-Rank Adaptation) enables memory-efficient fine-tuning.

Source: DeepSeek AI

For readers who want to get their hands dirty, follow along in this Google Colab.

1. Overview of Supervised Fine-Tuning (SFT)

Supervised Fine-Tuning (SFT) is the process of further training a pre-trained model on a labeled dataset to specialize it for a particular task, such as customer support, medical Q&A, or e-commerce recommendations.

How Fine-Tuning Works

Fine-tuning involves training the model on task-specific labeled data, where:

  • Input (X): The text data given to the model.
  • Target (Y): The expected output based on labeled data (e.g., sentiment label, chatbot response, or summarized text).
  • Loss Function: Measures how well the model’s predictions match the expected output. The most commonly used loss function for text generation is Cross-Entropy Loss.

For example, when fine-tuning on the IMDB sentiment dataset:

  • Input (X): A movie review like “The movie had great visuals, but the plot was weak.”
  • Target (Y): The correct label, e.g., “Positive” or “Negative” sentiment.

For text generation tasks, input could be a question, and the target could be the correct response generated by the model.
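To make this concrete, here is a minimal sketch of what labeled examples might look like in both cases (the field names are illustrative; the IMDB dataset used later stores its examples as text/label pairs):

# A sentiment-classification example (IMDB-style): review text in, class label out
sentiment_example = {
    "text": "The movie had great visuals, but the plot was weak.",
    "label": 0,  # IMDB convention: 0 = negative, 1 = positive
}

# A text-generation example: prompt in, reference response out
generation_example = {
    "prompt": "What is supervised fine-tuning?",
    "response": "Supervised fine-tuning continues training a pre-trained model on labeled input-output pairs.",
}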

Loss Function Used: Cross-Entropy Loss

For language models, we use Cross-Entropy Loss, which calculates the difference between the predicted token distribution and the actual target distribution:

\mathcal{L}_{CE} = -\sum_{t=1}^{N} \log P_\theta(y_t \mid y_{<t}, x)

where y_t is the target token at position t, y_{<t} are the preceding tokens, and P_\theta is the probability the model assigns to the target.

The goal is to minimize this loss during training so that the model learns to generate more accurate text outputs.
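As a minimal, self-contained sketch (plain PyTorch, independent of the training code below), this is roughly the computation the Trainer performs internally when it scores next-token predictions:

import torch
import torch.nn.functional as F

# Toy example: logits for 3 token positions over a vocabulary of 5 tokens
logits = torch.randn(3, 5)         # model's raw scores (positions x vocab size)
targets = torch.tensor([1, 0, 4])  # the correct token id at each position

# Cross-entropy averages -log P(correct token) across positions
loss = F.cross_entropy(logits, targets)
print(loss.item())  # lower is better; training minimizes this value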

2. Why Use a Subset of Data?

When fine-tuning large language models like DeepSeek LLM on resource-limited hardware, training on the full dataset (e.g., IMDB with 25,000 samples) can lead to excessive training time and GPU memory issues. To mitigate this, we:

  • Selected a subset: 500 samples for training and 100 for evaluation.
  • Maintained representativeness: This subset retains enough diversity to achieve reasonable performance.

Using a smaller dataset speeds up experimentation while demonstrating fine-tuning concepts effectively. For production-level fine-tuning, larger datasets should be used on more powerful infrastructure.

3. Load the DeepSeek LLM

Before fine-tuning, we need to load the DeepSeek LLM and prepare it for training.

Install Required Libraries

First, install the necessary dependencies:

pip install -U torch transformers datasets accelerate peft bitsandbytes

Load the Model with 4-bit Quantization

We use 4-bit quantization to make the large model compatible with limited GPU memory:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_name = "deepseek-ai/deepseek-llm-7b-base"

# Configure 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16  # Use float16 for faster computation
)

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto"
)

# Apply LoRA for memory-efficient fine-tuning
lora_config = LoraConfig(
    r=8,  # Low-rank adaptation size
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # Apply LoRA to attention layers
    lora_dropout=0.05,
    bias="none"
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
print("✅ DeepSeek LLM Loaded with LoRA and 4-bit Precision!")

4. Using Hugging Face Datasets for Training

For fine-tuning, we need a high-quality dataset. Hugging Face provides access to various datasets:

Select a Dataset

For this example, let’s use the IMDB dataset for fine-tuning DeepSeek LLM on sentiment classification:

from datasets import load_dataset

# Load dataset
dataset = load_dataset("imdb")
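
Optionally, a quick sanity check shows what was loaded; the IMDB dataset on Hugging Face ships as train, test, and unsupervised splits of plain-text reviews with integer labels:

# Inspect the available splits and one raw example before tokenization
print(dataset)
print(dataset["train"][0]["text"][:200])  # first 200 characters of a review
print(dataset["train"][0]["label"])       # 0 = negative, 1 = positive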

Preprocess the Dataset

Convert text into tokenized inputs for the model:

def tokenize_function(examples):
    inputs = tokenizer(
        examples["text"],
        truncation=True,
        padding="max_length",
        max_length=512
    )
    # For causal language modeling, the labels are the input tokens themselves
    inputs["labels"] = inputs["input_ids"].copy()
    return inputs

tokenized_datasets = dataset.map(tokenize_function, batched=True)

# Subset the dataset for faster experimentation
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(500))
small_test_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(100))

# Print a sample tokenized entry
print("Tokenized Sample:")
print(small_train_dataset[0])

5. What is LoRA (Low-Rank Adaptation)?

LoRA (Low-Rank Adaptation) is a technique designed to make fine-tuning large models like DeepSeek LLM more memory-efficient by:

  • Freezing the majority of the model’s weights.
  • Introducing low-rank trainable matrices in key layers (e.g., attention layers).

This drastically reduces the number of trainable parameters while preserving the model’s performance. LoRA enables fine-tuning large language models on resource-constrained hardware (e.g., Colab GPUs).

How LoRA Works

  1. Decomposes the weight update into low-rank matrices: \Delta W = B A, where B \in \mathbb{R}^{d \times r} and A \in \mathbb{R}^{r \times k} with rank r \ll \min(d, k) (see the parameter-count sketch after this list).
  2. Applies updates only to the decomposed matrices (e.g., attention projections).
  3. Reduces memory and computational cost compared to full fine-tuning.
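
To see where the savings come from, here is a minimal sketch (plain PyTorch, independent of the peft library) comparing a full weight update against its low-rank factorization; the 4096×4096 projection size is illustrative of a 7B-scale attention layer, and r=8 mirrors the LoRA config used above:

import torch

d, k, r = 4096, 4096, 8  # projection dimensions and LoRA rank

# Full fine-tuning would update the entire d x k weight matrix
full_update_params = d * k

# LoRA instead trains two small matrices: B (d x r) and A (r x k)
B = torch.zeros(d, r)  # B starts at zero, so training begins from the original W
A = torch.randn(r, k)  # A is randomly initialized
lora_params = B.numel() + A.numel()

print(f"Full update parameters: {full_update_params:,}")      # 16,777,216
print(f"LoRA parameters (r={r}): {lora_params:,}")            # 65,536
print(f"Reduction: {full_update_params / lora_params:.0f}x")  # 256x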

6. Code Walkthrough: Fine-Tuning DeepSeek LLM

Set Training Parameters

from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=3e-4,  # Lower learning rate for LoRA fine-tuning
    per_device_train_batch_size=1,  # Reduce batch size for memory efficiency
    gradient_accumulation_steps=8,  # Simulate larger batch size
    num_train_epochs=0.5,
    weight_decay=0.01,
    save_strategy="epoch",
    logging_dir="./logs",
    logging_steps=50,
    fp16=True,  # Mixed precision training
)

Initialize Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_test_dataset,
)
print("🚀 Trainer Initialized!")

Start Fine-Tuning

print("🚀 Starting Fine-Tuning...")
trainer.train()

Save the Fine-Tuned Model

trainer.save_model("./fine_tuned_deepseek")
tokenizer.save_pretrained("./fine_tuned_deepseek")
print("✅ Fine-Tuned Model Saved Successfully!")

Check out the Google Colab to get hands-on practice.

7. Next Steps

  • Experiment with larger datasets for production-level training.
  • Explore more advanced LoRA configurations for efficient scaling.

In the next article, we’ll explore how DeepSeek LLM can revolutionize e-commerce and retail. From personalizing product recommendations to generating engaging marketing content, we’ll dive into real-world use cases and practical examples. Learn how this open-source powerhouse can enhance customer experiences, optimize business operations, and drive growth in the competitive retail landscape.

Stay tuned for actionable insights and code walkthroughs to harness the potential of DeepSeek LLM in your e-commerce and retail projects! 🚀

References for the Article:

  1. DeepSeek LLM: GitHub Repository, DeepSeek LLM
  2. LoRA (Low-Rank Adaptation): Hu, Edward J., et al. "LoRA: Low-Rank Adaptation of Large Language Models." arXiv preprint arXiv:2106.09685.
  3. IMDB Dataset: Hugging Face Dataset, IMDB
  4. BitsAndBytes (Quantization): Official Repository, BitsAndBytes GitHub

Stay on the cutting edge of AI! 🌟 Follow me on Medium, connect on LinkedIn, and explore the latest trends in AI technologies and models. Dive into the world of AI with me and discover new horizons! 📚💻
