Megatron Training Pipeline πο
This repository is based on alibaba/Pai-Megatron-Patch for secondary development, thanks to the original authorβs work. This project is used to support Alignment such as SFT and DPO etc. Currently, the operation flow of SFT and DPO is as follows.
Environment Setup π οΈο
Prerequisitesο
Miniconda/Anaconda
Git
NVIDIA GPU with CUDA support
Environment Configuration π§ο
bash scripts/train/megatron/env_install.sh
You can also refer to the docker images(
dsw-registry.cn-wulanchabu.cr.aliyuncs.com/pai/pai-megatron-patch:24.07) provided in the Pai-Megatron-Patch repository.
SFT/DPO Training Pipeline πο
1. Model Weight Conversion (HF β Megatron) πο
bash scripts/train/megatron/convert/qwen2_5/convert_hf_to_mcore.sh
config in
convert_hf_to_mcore.sh
MODEL_SIZE=${MODEL_SIZE:-"3B"}
TP=${TP:-"2"} # Tensor Parallelism
PP=${PP:-"2"} # Pipeline Parallelism
PRECISION=${PRECISION:-"bf16"}
USE_TE=${USE_TE:-"true"} # Default
MG2HF=${MG2HF:-"false"} # Default
HF_CKPT_PATH=${HF_CKPT_PATH:-"Qwen/Qwen2.5-3B-Instruct"} # HF model path
TARGET_CKPT_PATH="./mg_models/Qwen2.5-${MODEL_SIZE}-hf-to-mcore-te-tp${TP}-pp${PP}" # Output Megatron model path
2. SFT/DPO Data Preprocessing πο
bash scripts/train/megatron/preprocess/sft_conv.sh(dpo_conv.sh)
config in
sft_conv.sh/dpo_conv.sh
INPUT_JSON=${INPUT_JSON:-"./data/dummy_sft.json"} # Input JSON path
DATA_TYPE=${DATA_TYPE:-"conversations"} # Key for conversation data
TOKENIZER=${TOKENIZER:-"Qwen2Tokenizer"} # Tokenizer class
SEQ_LEN=${SEQ_LEN:-4096} # Sequence length
OUTPUT_DATA_PREFIX=${OUTPUT_DATA_PREFIX:-"./data/dummy_sft_mg"} # Output tokenized data path
HF_MODEL_PATH=${HF_MODEL_PATH:-"Qwen/Qwen2.5-3B-Instruct"} # HF model path
EXTRA_VOCAB_SIZE=${EXTRA_VOCAB_SIZE:-293} # Extra vocab size (1.5B: 293, 3B: 293, 7B: 421, 14B: 421, 32B: 421, 72B: 421)
TEMPLATE=${TEMPLATE:-"chatml-idsys"} # Chat Template
required sft data format:
[
{
"conversations": [
{
"from": "human",
"value": "xxx"
},
{
"from": "gpt",
"value": "xxx"
}
]
}
]
required dpo data format:
[
{
"chosen":[
{
"value":"xxx",
"from":"human"
},
{
"value":"xxx",
"from":"gpt"
}
],
"rejected":[
{
"value":"xxx",
"from":"human"
},
{
"value":"xxx",
"from":"gpt"
}
],
},
]
3. Training Configuration π―ο
bash scripts/train/megatron/train/qwen2_5/sft_conv.sh(dpo_conv.sh)
Key configurations in
sft_conv.sh/dpo_conv.sh:
# Paths
DATASET_PATH=${DATASET_PATH:-"./data/dummy_sft_mg_conversations_maxlen_4096"} # Path prefixed of tokenized data
VALID_DATASET_PATH=${VALID_DATASET_PATH:-"./data/dummy_sft_mg_conversations_maxlen_4096"} # Same as DATASET_PATH for validation
PRETRAIN_CHECKPOINT_PATH=${PRETRAIN_CHECKPOINT_PATH:-"./mg_models/Qwen2.5-3B-hf-to-mcore-te-tp2-pp2"} # Converted Megatron model path
OUTPUT_BASEPATH=${OUTPUT_BASEPATH:-"./checkpoints/sft/"} # Output path for training checkpoints
# Training Hyperparameters
MODEL_SIZE=${MODEL_SIZE:-"3B"}
BATCH_SIZE=${BATCH_SIZE:-4}
GLOBAL_BATCH_SIZE=${GLOBAL_BATCH_SIZE:-16}
LR=${LR:-5e-6}
MIN_LR=${MIN_LR:-0.0}
SEQ_LEN=${SEQ_LEN:-4096}
PAD_LEN=${PAD_LEN:-4096}
EPOCHS=${EPOCHS:-10}
# Parallelism Configuration (Must match with convert_hf_to_mcore.sh)
TP=${TP:-2}
PP=${PP:-2}
4. Model Weight Conversion (Megatron β HF) πο
bash scripts/train/megatron/convert/qwen2_5/convert_mcore_to_hf.sh(convert_mcore_to_hf_dpo.sh)
config in
convert_mcore_to_hf.sh/convert_mcore_to_hf_dpo.sh
MODEL_SIZE=${MODEL_SIZE:-"3B"}
TP=${TP:-"2"} # Tensor Parallelism
PP=${PP:-"2"} # Pipeline Parallelism
MG_MODEL_PATH=${MG_MODEL_PATH:-"./checkpoint/sft/checkpoint"} # Checkpoint Path
HF_CKPT_PATH=${HF_CKPT_PATH:-"Qwen/Qwen2.5-3B-Instruct"} # HF model path
PRECISION=${PRECISION:-"fp32"} # Default
USE_TE=${USE_TE:-"true"} # Default
MG2HF=${MG2HF:-"true"} # Default
TARGET_CKPT_PATH="./hf_models_from_mg/Qwen2.5-${MODEL_SIZE}-hf-to-mcore-te-tp${TP}-pp${PP}" # Output HF model path
Troubleshooting π§ο
Apex Compilation Issuesο
Check if required GCC/G++ are installed in conda
Verify if installed PyTorch matches environment CUDA version
Note β οΈο
Always ensure CUDA and PyTorch versions are compatible
Keep consistent sequence lengths across preprocessing and training
Adjust model size configurations according to your hardware capabilities
Supported Models πο
Qwen2.5 series (1.5B, 3B, 7B, 14B, 32B, 72B)