Learning from preference feedback is essential for aligning large language models (LLMs) with human values and improving the quality of generated responses. However, existing preference learning methods rely heavily on curated data from humans or advanced LLMs, which is costly and difficult to scale. In this work, we present PUGC, a novel framework that leverages implicit human Preferences in unlabeled User-Generated Content (UGC) to generate preference data. Although UGC is not explicitly created to guide LLMs in generating human-preferred responses, it often reflects valuable insights and implicit preferences from its creators that have the potential to address readers' questions. PUGC transforms UGC into user queries and generates responses from the policy model. The UGC is then used as a reference text for response scoring, aligning the model with these implicit preferences. This approach improves the quality of preference data while enabling scalable, domain-specific alignment. Experimental results on Alpaca Eval 2 show that models trained with DPO and PUGC achieve a 9.37% performance improvement over traditional methods, reaching a state-of-the-art length-controlled win rate of 35.93% with Mistral-7B-Instruct. Further studies highlight gains in reward quality, domain-specific alignment effectiveness, robustness to variations in UGC quality, and theory-of-mind capabilities.
Despite the success of methods like RLHF and DPO in aligning large language models with human preferences, their reliance on costly and hard-to-scale preference data presents a major limitation. Human-annotated comparisons and GPT-4-generated feedback are expensive and often domain-general, limiting applicability in specific settings. Meanwhile, large-scale unlabeled text on the internet remains underutilized for preference supervision. Among such sources, user-generated content (UGC) stands out as a rich and naturally occurring source of implicit human preferences, reflecting the author's opinions, experiences, and intent to address readers' informational needs. This motivates our work: to explore whether UGC can be systematically transformed into high-quality preference data, enabling effective and scalable alignment of LLMs without relying on explicit human or model-generated annotations.
Our proposed method, PUGC, leverages implicit user preferences embedded within user-generated content (UGC) to construct high-quality preference data without explicit human annotations. Unlike traditional preference data generation methods, which rely on pre-collected instructions and explicit feedback from reward models, PUGC transforms UGC into reader queries, filters those queries for relevance to the source content, and then directly uses the UGC as a reference to capture implicit preferences. This allows for scalable, flexible alignment across domains by reducing reliance on human-generated instructions, making it a practical and efficient alternative for generating large-scale, high-quality preference data.
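To make the pipeline concrete, the sketch below outlines the data-generation loop in Python under stated assumptions: the helper names (is_relevant, score_with_reference), the prompt wording, and the best-vs-worst pairing heuristic are illustrative choices, not the released implementation.

```python
# Minimal sketch of the PUGC data-generation pipeline described above.
# Interfaces (policy_model.generate, reward_model.score) and prompt wording
# are assumptions for illustration only.

def build_preference_pairs(ugc_documents, policy_model, reward_model, n_samples=4):
    pairs = []
    for ugc in ugc_documents:
        # 1) Transform the UGC into a reader-style query it could answer.
        query = policy_model.generate(
            f"Write a question a reader might ask that the following text answers:\n{ugc}"
        )
        # 2) Filter out queries that are not actually answerable by the UGC.
        if not is_relevant(query, ugc):
            continue
        # 3) Sample candidate responses from the policy model.
        responses = [policy_model.generate(query) for _ in range(n_samples)]
        # 4) Score each response with the UGC as a reference carrying implicit preferences.
        scores = [reward_model.score(query, r, reference=ugc) for r in responses]
        # 5) Keep the best and worst responses as a (chosen, rejected) preference pair.
        chosen = responses[scores.index(max(scores))]
        rejected = responses[scores.index(min(scores))]
        pairs.append({"prompt": query, "chosen": chosen, "rejected": rejected})
    return pairs
```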
Models trained using PUGC consistently outperform those trained on UltraFeedback preference data, demonstrating clear improvements in win rates on both Alpaca Eval 2.0 and MT-Bench benchmarks. Notably, combining PUGC with the DPO objective generally achieves the best performance, underscoring the effectiveness of this optimization approach in aligning LLMs with implicit user preferences from UGC. Additionally, the instruct setting significantly boosts performance compared to the base setting, indicating that higher-quality instruction data contributes positively to preference alignment. Although PUGC leads to shorter responses, it successfully mitigates length bias and achieves better alignment quality overall.
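For reference, a minimal PyTorch-style sketch of the standard DPO objective used to train on these (chosen, rejected) pairs is given below; beta=0.1 is a common default rather than necessarily the value used in our experiments, and the log-probabilities are assumed to be pre-computed sums over response tokens.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO objective on a batch of preference pairs.

    Each argument is the summed log-probability of the chosen/rejected response
    under the trainable policy or the frozen reference model.
    """
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Negative log-sigmoid of the reward margin between chosen and rejected responses.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```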
Incorporating UGC as reference text enhances reward model agreement with GPT-4, demonstrating its value in providing implicit preference signals for reward evaluation.
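A minimal sketch of reference-guided scoring is shown below; the prompt template and the generic reward_model.generate interface are assumptions for illustration and may differ from the exact setup used in our experiments.

```python
def score_with_reference(reward_model, query, response, reference_ugc):
    """Score a response while showing the judge the UGC as a reference that
    carries the author's implicit preferences. Template is illustrative."""
    prompt = (
        "You are scoring an assistant response on a 1-10 scale.\n"
        f"User question:\n{query}\n\n"
        f"Reference text written by a human (reflects implicit preferences):\n{reference_ugc}\n\n"
        f"Assistant response:\n{response}\n\n"
        "Considering how well the response covers the insights in the reference, "
        "output only the numeric score."
    )
    return float(reward_model.generate(prompt).strip())
```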
Our results indicate that increasing the quantity of UGC significantly boosts alignment performance, while variations in UGC quality have a relatively minor impact.
By leveraging domain-specific UGC, PUGC achieves superior performance in targeted alignment tasks compared to general-domain and other baselines.
Utilizing implicit preferences in UGC notably enhances the model’s theory-of-mind capabilities, enabling it to better understand user intentions, beliefs, and emotions.
While our main experiments focus on offline training, online iterative RLHF has demonstrated stronger performance. We evaluate PUGC in an online iterative setting by splitting the UGC data into three subsets for sequential training iterations. Each iteration yields steady gains in LC win rate (improvements of 13.74%, 3.22%, and 3.44%, respectively), reaching a 37.51% LC win rate by the third iteration, compared to 35.93% in the offline setting.
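The iterative setup can be summarized by the following sketch, which reuses the hypothetical helpers from the earlier sketches (build_preference_pairs and an assumed train_dpo routine); it is not the released training code.

```python
def iterative_pugc_training(policy, ref_model, ugc_documents, reward_model, n_iters=3):
    """Online iterative variant: split the UGC pool into n_iters chunks and run a
    fresh round of data generation + DPO training on each chunk, so later rounds
    sample responses from the already-improved policy."""
    chunk = len(ugc_documents) // n_iters
    for i in range(n_iters):
        subset = ugc_documents[i * chunk:(i + 1) * chunk]
        pairs = build_preference_pairs(subset, policy, reward_model)
        policy = train_dpo(policy, ref_model, pairs)  # one DPO round on the new pairs
    return policy
```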
@inproceedings{tan2025pugc,
title={Aligning Large Language Models with Implicit Preferences from User-Generated Content},
author={Tan, Zhaoxuan and Li, Zheng and Liu, Tianyi and Wang, Haodong and Yun, Hyokun and Zeng, Ming and Chen, Pei and Zhang, Zhihan and Gao, Yifan and Wang, Ruijie and Nigam, Priyanka and Yin, Bing and Jiang, Meng},
booktitle={Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL)},
year={2025},
}