Test-Time Scaling with OLAF2

Date

2025. 1. 16.

Category

Research

Test-time scaling has emerged as an exciting area of research, with recent models like QwQ, o1, and DeepSeek-R1 demonstrating surprising reasoning capabilities. Test-time scaling can be approached through various methods, including Best-of-N sampling, Monte Carlo Tree Search, and Reflective Tuning. In this post, we share our preliminary findings from applying test-time scaling techniques to OLAF2-14B, our flagship model.
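To make the first of these methods concrete, below is a minimal Best-of-N sketch in Python. The generate and score callables are hypothetical stand-ins for a sampling call to the model and a verifier or reward model; this is an illustrative sketch, not OLAF2's actual pipeline.

```python
from typing import Callable, List

def best_of_n(
    prompt: str,
    generate: Callable[[str], str],      # hypothetical: samples one candidate
    score: Callable[[str, str], float],  # hypothetical: scores a (prompt, answer) pair
    n: int = 8,
) -> str:
    """Sample n candidate answers and return the highest-scoring one."""
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda answer: score(prompt, answer))
```

Spending more compute here simply means raising n; how efficiently that extra compute converts into accuracy depends on the quality of the scorer.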

Experimental Setup

The figure below showcases DeepSeek's results on test-time scaling, where the average number of thought tokens is plotted on the X-axis.

In contrast, our experiments involve applying multiple scaling methods simultaneously, which complicates the calculation of token counts: some tokens have inherently higher value than others. To address this, we use FLOPs (floating point operations) as a more consistent metric. FLOPs are computed following the approach outlined in the Scaling Laws for Neural Language Models paper. Specifically, a single forward pass is approximated as:

C_{forward} ≈ 2N + 2 n_{layer} n_{ctx} d_{model}

Here:

  • N: Number of non-embedding parameters

  • n_{layer}: Number of layers

  • d_{model}: Dimension of the residual stream

  • n_{ctx}: Number of tokens in the input context
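For illustration, the sketch below applies this approximation in Python. The configuration numbers are placeholders, not OLAF2-14B's actual architecture, and charging each generated token for its own context length is our reading of how the formula accumulates over a decoding run.

```python
def forward_flops_per_token(n_params: int, n_layer: int,
                            d_model: int, n_ctx: int) -> int:
    """C_forward ≈ 2N + 2 * n_layer * n_ctx * d_model
    (Kaplan et al., Scaling Laws for Neural Language Models)."""
    return 2 * n_params + 2 * n_layer * n_ctx * d_model

# Placeholder configuration -- illustrative only, not OLAF2-14B's real specs.
N_PARAMS, N_LAYER, D_MODEL = 14_000_000_000, 48, 5_120

# Cost of generating 1,000 tokens: each new token attends to a context
# one token longer than the previous one.
total = sum(forward_flops_per_token(N_PARAMS, N_LAYER, D_MODEL, t)
            for t in range(1, 1_001))
print(f"~{total:.3e} FLOPs")
```

Summing per-position costs this way lets heterogeneous methods (extra samples, tree-search rollouts, reflection passes) be compared on a single compute axis.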

We benchmark our methods on the GSM8K and Omni Math subsets of HRM8K. While it would have been ideal to include more subsets and benchmarks, compute constraints restrict us to these two. This selection is motivated by two key reasons:

  1. Diversity in Difficulty: GSM8K represents relatively easy, school-level math word problems, while Omni Math includes olympiad-level, highly challenging problems.

  2. Simplified Evaluation: Both subsets have been pre-filtered to include only questions with digit-based answers, simplifying the evaluation process.

For more details about the benchmark, please refer to our paper.
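Because every gold answer is a plain number, scoring reduces to extracting the model's final number and comparing it numerically. The sketch below is a hypothetical harness of our own, not the official HRM8K evaluation code.

```python
import re
from typing import Optional

def extract_final_number(text: str) -> Optional[str]:
    """Return the last number in a response, e.g. 'the answer is 42.' -> '42'."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return matches[-1] if matches else None

def is_correct(response: str, gold: str) -> bool:
    """Numeric exact match between the extracted prediction and the gold answer."""
    prediction = extract_final_number(response)
    if prediction is None:
        return False
    try:
        return float(prediction) == float(gold)
    except ValueError:
        return False

assert is_correct("So the total is 1,250 apples.", "1250")
```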

Evaluation Results

To our surprise, increasing test-time compute significantly enhances the performance of OLAF2-14B. The efficiency of this scaling, however, depends heavily on how the compute is utilized, as some methods are far more effective than others. When scaled to the extreme, OLAF2-14B surpasses GPT-4o on both benchmarks.


© 2025 OneLineAI, Inc. All rights reserved.