Megatron Training Pipeline 🚀

This repository is based on alibaba/Pai-Megatron-Patch for secondary development, thanks to the original author’s work. This project is used to support Alignment such as SFT and DPO etc. Currently, the operation flow of SFT and DPO is as follows.

Environment Setup 🛠️

Prerequisites

Miniconda/Anaconda
Git
NVIDIA GPU with CUDA support

Environment Configuration 🔧

bash scripts/train/megatron/env_install.sh

You can also refer to the docker images(dsw-registry.cn-wulanchabu.cr.aliyuncs.com/pai/pai-megatron-patch:24.07) provided in the Pai-Megatron-Patch repository.

SFT/DPO Training Pipeline 🔄

1. Model Weight Conversion (HF → Megatron) 🔄

bash scripts/train/megatron/convert/qwen2_5/convert_hf_to_mcore.sh

config in convert_hf_to_mcore.sh

MODEL_SIZE=${MODEL_SIZE:-"3B"}
TP=${TP:-"2"} # Tensor Parallelism
PP=${PP:-"2"} # Pipeline Parallelism
PRECISION=${PRECISION:-"bf16"}
USE_TE=${USE_TE:-"true"} # Default
MG2HF=${MG2HF:-"false"} # Default
HF_CKPT_PATH=${HF_CKPT_PATH:-"Qwen/Qwen2.5-3B-Instruct"} # HF model path
TARGET_CKPT_PATH="./mg_models/Qwen2.5-${MODEL_SIZE}-hf-to-mcore-te-tp${TP}-pp${PP}" # Output Megatron model path

2. SFT/DPO Data Preprocessing 📊

bash scripts/train/megatron/preprocess/sft_conv.sh(dpo_conv.sh)

config in sft_conv.sh/dpo_conv.sh

INPUT_JSON=${INPUT_JSON:-"./data/dummy_sft.json"} # Input JSON path
DATA_TYPE=${DATA_TYPE:-"conversations"} # Key for conversation data
TOKENIZER=${TOKENIZER:-"Qwen2Tokenizer"} # Tokenizer class
SEQ_LEN=${SEQ_LEN:-4096} # Sequence length
OUTPUT_DATA_PREFIX=${OUTPUT_DATA_PREFIX:-"./data/dummy_sft_mg"} # Output tokenized data path
HF_MODEL_PATH=${HF_MODEL_PATH:-"Qwen/Qwen2.5-3B-Instruct"} # HF model path
EXTRA_VOCAB_SIZE=${EXTRA_VOCAB_SIZE:-293} # Extra vocab size (1.5B: 293, 3B: 293, 7B: 421, 14B: 421, 32B: 421, 72B: 421)
TEMPLATE=${TEMPLATE:-"chatml-idsys"} # Chat Template 

required sft data format:

[
    {
        "conversations": [
            {
                "from": "human",
                "value": "xxx"
            },
            {
                "from": "gpt",
                "value": "xxx"
            }
        ]
    }
]

required dpo data format:

[
    {
      "chosen":[
        {
          "value":"xxx",
          "from":"human"
        },
        {
          "value":"xxx",
          "from":"gpt"
        }
      ],
      "rejected":[
        {
          "value":"xxx",
          "from":"human"
        },
        {
          "value":"xxx",
          "from":"gpt"
        }
      ],
    },
]

3. Training Configuration 🎯

bash scripts/train/megatron/train/qwen2_5/sft_conv.sh(dpo_conv.sh)

Key configurations in sft_conv.sh/dpo_conv.sh:

# Paths
DATASET_PATH=${DATASET_PATH:-"./data/dummy_sft_mg_conversations_maxlen_4096"} # Path prefixed of tokenized data
VALID_DATASET_PATH=${VALID_DATASET_PATH:-"./data/dummy_sft_mg_conversations_maxlen_4096"} # Same as DATASET_PATH for validation
PRETRAIN_CHECKPOINT_PATH=${PRETRAIN_CHECKPOINT_PATH:-"./mg_models/Qwen2.5-3B-hf-to-mcore-te-tp2-pp2"} # Converted Megatron model path
OUTPUT_BASEPATH=${OUTPUT_BASEPATH:-"./checkpoints/sft/"} # Output path for training checkpoints

# Training Hyperparameters
MODEL_SIZE=${MODEL_SIZE:-"3B"}
BATCH_SIZE=${BATCH_SIZE:-4}
GLOBAL_BATCH_SIZE=${GLOBAL_BATCH_SIZE:-16}
LR=${LR:-5e-6}
MIN_LR=${MIN_LR:-0.0}
SEQ_LEN=${SEQ_LEN:-4096}
PAD_LEN=${PAD_LEN:-4096}
EPOCHS=${EPOCHS:-10}

# Parallelism Configuration (Must match with convert_hf_to_mcore.sh)
TP=${TP:-2}
PP=${PP:-2}

4. Model Weight Conversion (Megatron → HF) 🔄

bash scripts/train/megatron/convert/qwen2_5/convert_mcore_to_hf.sh(convert_mcore_to_hf_dpo.sh)

config in convert_mcore_to_hf.sh/convert_mcore_to_hf_dpo.sh

MODEL_SIZE=${MODEL_SIZE:-"3B"}
TP=${TP:-"2"} # Tensor Parallelism
PP=${PP:-"2"} # Pipeline Parallelism
MG_MODEL_PATH=${MG_MODEL_PATH:-"./checkpoint/sft/checkpoint"} # Checkpoint Path
HF_CKPT_PATH=${HF_CKPT_PATH:-"Qwen/Qwen2.5-3B-Instruct"} # HF model path
PRECISION=${PRECISION:-"fp32"} # Default
USE_TE=${USE_TE:-"true"} # Default
MG2HF=${MG2HF:-"true"} # Default
TARGET_CKPT_PATH="./hf_models_from_mg/Qwen2.5-${MODEL_SIZE}-hf-to-mcore-te-tp${TP}-pp${PP}" # Output HF model path

Troubleshooting 🔧

Apex Compilation Issues

Check if required GCC/G++ are installed in conda
Verify if installed PyTorch matches environment CUDA version

Note ⚠️

Always ensure CUDA and PyTorch versions are compatible
Keep consistent sequence lengths across preprocessing and training
Adjust model size configurations according to your hardware capabilities

Supported Models 📚

Qwen2.5 series (1.5B, 3B, 7B, 14B, 32B, 72B)