DPO

This algorithm is used for direct preference optimization (DPO) on UltraFeedback. This README provides instructions for setting up and running the experiment.

Data Preparation

To begin, you’ll need to obtain the following datasets:

UltraFeedback Binarized Dataset

We provide a convenient script to obtain the UltraFeedback dataset in a specific format. To use it, run:

python get_ultrafeedback.py

Direct Preference Optimization (DPO)

Equipped with an instruction-following model (eg. a model trained on UltraChat using the SFT algorithm), you can proceed with Direct Preference Optimization using the UltraFeedback dataset. To run DPO, use the following script:

export DATA_PATH=data/ultrafeedback.json
export CONV_TEMPLATE=llama-3-instruct
export MODEL_PATH=saved_models/llama-3-8b_ultrachat
export GA=16
export OUTPUT_DIR=saved_models/llama-3-8b_ultrachat-ultrafeedback
bash scripts/train_dpo.sh

References

@misc{cui2023ultrafeedback,
      title={UltraFeedback: Boosting Language Models with High-quality Feedback}, 
      author={Ganqu Cui and Lifan Yuan and Ning Ding and Guanming Yao and Wei Zhu and Yuan Ni and Guotong Xie and Zhiyuan Liu and Maosong Sun},
      year={2023},
      eprint={2310.01377},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

@article{rafailov2024direct,
  title={Direct preference optimization: Your language model is secretly a reward model},
  author={Rafailov, Rafael and Sharma, Archit and Mitchell, Eric and Manning, Christopher D and Ermon, Stefano and Finn, Chelsea},
  journal={Advances in Neural Information Processing Systems},
  volume={36},
  year={2024}
}