Paper-Weekly21-Best Practices and Lessons Learned on Synthetic Data for Language Models
Planning is a hallmark of human intelligence. It is an evolutionary feat built upon numerous other capacities: using various tools to iteratively collect information and make decisions, recording intermediate plans and exploring alternative plans by running simulations.
Why is this task challenging?
- Planning a multi-day itinerary is inherently long-horizon, involving a large number of interdependent decisions on places, lodging, transportation, dining, etc.
- Travel planning involves many constraints, ranging from explicit constraints such as budget and various user needs to implicit commonsense constraints, e.g., people cannot teletransport to another city without using some means of transportation.
- Travel planning requires strong agency to proactively acquire necessary information using various tools (e.g., to search flights and restaurants) from the partially observable environment and deliberate over the collected information to further the planning while being mindful of all the explicit and implicit constraints.
Every language agent other than GPT-4 scored 0, and even GPT-4 reached a success rate of only 0.6%.
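Benchmark construction pipeline:
- Set up the environment and the evaluation;
- Design diverse travel requirements;
- Construct queries: trips last 3, 5, or 7 days, and each duration maps to a number of cities within one state;
- Human annotation: 20 graduate students were recruited to write at least one feasible plan per query, paid $0.80 per plan (that seems far too little! presumably they were recruited in China);
- Quality control: manually review the feasibility of every plan, then lower the budget to increase difficulty.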
The authors emphasize: "This setup ensures that all agents access the same unchanging information from our static databases, avoiding the variability and potential biases introduced by dynamic data." (I'll use the same argument next time.)
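To make the static-database point concrete, here is a minimal sketch of a search tool backed by a frozen in-memory snapshot. It is not the benchmark's actual tool code: the `Flight` record, the `FLIGHTS` data, and the `search_flights` signature are all hypothetical and only illustrate why every agent sees the same results.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Flight:
    # Hypothetical record layout; the real database schema is not given in these notes.
    origin: str
    destination: str
    day: int
    price: float


# Hypothetical frozen snapshot: a tuple of immutable records, never updated at query time.
FLIGHTS = (
    Flight("Austin", "Dallas", 1, 120.0),
    Flight("Austin", "Dallas", 1, 95.0),
    Flight("Dallas", "Houston", 3, 80.0),
)


def search_flights(origin: str, destination: str, day: int) -> List[Flight]:
    """Return matching flights sorted by price.

    The lookup reads only the static snapshot, so identical queries
    return identical results for every agent and every run.
    """
    matches = [
        f for f in FLIGHTS
        if f.origin == origin and f.destination == destination and f.day == day
    ]
    return sorted(matches, key=lambda f: f.price)
```

Since nothing is fetched live, a difference between two agents' plans can be attributed to the agents rather than to prices or availability changing between runs.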
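Because the plans produced by different agents vary widely in format, the authors use GPT-4 to extract the key components of each plan and reassemble them into a complete plan. For evaluation, they define a separate pass rate for each type of constraint.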
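A per-constraint pass rate could be computed roughly as follows. This is a sketch under assumptions, not the paper's evaluation script: the plan fields (`total_cost`, `budget`, `cities`, `transport`), the two check functions, and the category names are hypothetical, and the plans are assumed to be already normalized into dictionaries by the extraction step.

```python
from typing import Callable, Dict, List

Plan = Dict[str, object]
Check = Callable[[Plan], bool]


def within_budget(plan: Plan) -> bool:
    # Explicit constraint (hypothetical fields): total cost must not exceed the budget.
    return plan["total_cost"] <= plan["budget"]


def has_transport_between_cities(plan: Plan) -> bool:
    # Commonsense constraint (simplified): each move to a new city needs a transport entry.
    return len(plan["transport"]) >= len(plan["cities"]) - 1


# Hypothetical grouping of checks by constraint type.
CHECKS: Dict[str, List[Check]] = {
    "hard": [within_budget],
    "commonsense": [has_transport_between_cities],
}


def pass_rates(plans: List[Plan], checks: Dict[str, List[Check]]) -> Dict[str, float]:
    """Return the fraction of passed checks per category, plus a 'final' rate
    counting only the plans that pass every check in every category."""
    rates: Dict[str, float] = {}
    for category, fns in checks.items():
        total = len(plans) * len(fns)
        passed = sum(1 for p in plans for fn in fns if fn(p))
        rates[category] = passed / total if total else 0.0
    all_fns = [fn for fns in checks.values() for fn in fns]
    ok = sum(1 for p in plans if all(fn(p) for fn in all_fns))
    rates["final"] = ok / len(plans) if plans else 0.0
    return rates
```

Reporting a rate per category alongside the strict "final" rate shows whether an agent fails mostly on explicit requirements such as budget or on implicit commonsense ones.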