[CL] Direct Reasoning Optimization: LLMs Can Reward And Refine Their Own Reasoning for Open-Ended Tasks
[Microsoft]
https://arxiv.org/abs/2506.13351