TimeBench: A Comprehensive Evaluation of Temporal Reasoning Abilities in Large Language Models (2024)

Zheng Chu1, Jingchang Chen1, Qianglong Chen2,
Weijiang Yu3, Haotian Wang1, Ming Liu1,4 (Corresponding Author), Bing Qin1,4
1Harbin Institute of Technology, Harbin, China
2Zhejiang University  3Sun Yat-sen University  4Peng Cheng Laboratory
{zchu,jcchen,mliu,qinb}@ir.hit.edu.cn
{chenqianglong.ai, wanght1998, weijiangyu8}@gmail.com

Abstract

Grasping the concept of time is a fundamental facet of human cognition, indispensable for truly comprehending the intricacies of the world. Previous studies typically focus on specific aspects of time, lacking a comprehensive temporal reasoning benchmark. To address this, we propose TimeBench, a comprehensive hierarchical temporal reasoning benchmark that covers a broad spectrum of temporal reasoning phenomena. TimeBench provides a thorough evaluation for investigating the temporal reasoning capabilities of large language models. We conduct extensive experiments on GPT-4, LLaMA2, and other popular LLMs under various settings. Our experimental results indicate a significant performance gap between the state-of-the-art LLMs and humans, highlighting that there is still a considerable distance to cover in temporal reasoning. Besides, LLMs exhibit capability discrepancies across different reasoning categories. Furthermore, we thoroughly analyze the impact of multiple aspects on temporal reasoning and emphasize the associated challenges. We aspire for TimeBench to serve as a comprehensive benchmark, fostering research in temporal reasoning. (Data is available at: GitHub.)




1 Introduction

Time flies over us, but leaves its shadow behind. Understanding time is a crucial part of human comprehension of the world. Envision the blossoming of flowers, and you will associate it with the arrival of spring. The reasoning behind this association encompasses an intricate interplay of world knowledge, causality, and event temporal relationships. Temporal reasoning, in contrast to single-skill reasoning, comes with inherent complexity, encompassing implicit arithmetic, logical implications, and world knowledge. It is a form of integrated reasoning built upon foundational skills such as mathematical and logical reasoning (Cobbe et al., 2021; Mishra et al., 2022; Yu et al., 2020). Recently, large language models (LLMs) have demonstrated remarkable performance in complex reasoning (Hendrycks et al., 2021; Srivastava et al., 2022; Brown et al., 2020; Chowdhery et al., 2023; OpenAI, 2023; Touvron et al., 2023), but their performance in temporal reasoning has not yet been extensively explored.

[Figure 1]

Recent research on temporal reasoning typically focuses on only a few aspects, such as temporal commonsense or temporal question answering (Zhou et al., 2019; Chen et al., 2021; Dhingra et al., 2022; Wang and Zhao, 2023). Due to the inherent complexity of temporal reasoning, it is challenging to accurately measure models' temporal reasoning capabilities based on such limited aspects.

To address this issue, we propose TimeBench, a comprehensive and hierarchical temporal reasoning benchmark. Specifically, drawing inspiration from the human cognitive process of transitioning from abstraction and concreteness to integration (Barsalou et al., 2018), we categorize temporal reasoning into three levels: symbolic temporal reasoning, commonsense temporal reasoning, and event temporal reasoning. These levels respectively represent understanding abstract time expressions, grasping concrete world knowledge, and integrating and applying this knowledge in real-world scenarios. TimeBench comprises 10 tasks with 16 subtasks, covering a broad spectrum of temporal reasoning phenomena. Moreover, prior benchmarks typically feature only a single task form, which is too simplistic to capture a model's performance. In contrast, we incorporate four distinct task forms, offering a more realistic simulation of real-world challenges.

To quantify the temporal reasoning capabilities of contemporary LLMs, we extensively assess widely used LLMs, including proprietary models such as ChatGPT (Ouyang et al., 2022) and GPT-4 (OpenAI, 2023), as well as open-source models such as LLaMA2 (Touvron et al., 2023), Vicuna-1.5 (Chiang et al., 2023), Mistral (Jiang et al., 2023), Baichuan2 (Yang et al., 2023a), ChatGLM3 (Zeng et al., 2023), and FLAN-T5 (Chung et al., 2022). We conduct experiments under zero-shot and few-shot settings, combined with the commonly used chain-of-thought prompting (Kojima et al., 2022; Wei et al., 2022). The experimental results suggest that GPT-4 outperforms other models, showcasing strong temporal reasoning capabilities, as shown in Figure 1. Nevertheless, there is still a considerable gap between the strongest models and humans. In contrast, open-source models show inferior performance in temporal reasoning, attributed to shortcomings in abstract time understanding, temporal relation modeling, and a lack of temporal commonsense. In addition, we observe that chain-of-thought prompting does not yield a consistent improvement in performance. These findings indicate that there is still significant room for improvement in models' temporal reasoning capabilities. Moreover, we thoroughly analyze the deficiencies and obstacles that models face in temporal reasoning.

We aspire for temporal reasoning to garner increased attention within the research community. Our contributions can be summarized as follows:

  • We introduce TimeBench, a comprehensive and hierarchical benchmark to quantify the temporal reasoning abilities of LLMs.

  • We conduct extensive experiments with several LLMs, revealing a significant gap between even the SOTA LLMs and humans, indicating substantial research opportunities in this field.

  • By conducting a thorough analysis, we reveal the dilemmas that LLMs face in temporal reasoning and identify potential solutions.

2 TimeBench Benchmark

2.1 Benchmark Design Principle

TimeBench focuses on a comprehensive evaluation of the temporal reasoning capabilities of large language models in challenging and complex scenarios. To achieve this goal, we summarize the difficulties and challenges faced in temporal reasoning, categorize them into three levels, and integrate diverse task formats to better align with the intricate nature of temporal reasoning.

Just as the human cognitive process unfolds from foundational cognition and conceptual understanding to practical reasoning, we delineate temporal reasoning into three hierarchical levels. Specifically, TimeBench categorizes temporal reasoning into symbolic, commonsense and event temporal reasoning, covering 10 datasets with a total of 16 subtasks. (1) Symbolic Temporal Reasoning focuses on the comprehension of fundamental abstract temporal expressions. (2) Temporal Commonsense Reasoning emphasizes the mastery of temporal principles, concepts and world knowledge. (3) Event Temporal Reasoning concentrates on modeling the temporal relationships between events and times within authentic scenarios.

2.2 Difficulties and Challenges

We delineate the essential competencies and the challenges that arise from a human cognitive standpoint in the realm of temporal reasoning, and language models confront similar challenges. We present the dataset statistics, task formats, and the associated challenges in Table 7.

Time Expression Understanding

Time expressions (TimeX) denote words or phrases that convey information about time and represent the simplest and most basic units of expressing time, such as in April 2000, after 2008. Grasping time expressions is the most foundational step in understanding temporal elements within the textual modality.

Temporal Commonsense

assesses the understanding of temporal world knowledge, including event order, event duration, typical time, event frequency and stationarity, which is crucial for language models to comprehend daily scenarios.

Event-Time Relations

assesses the model's grounding capability to establish temporal relationships between events and their temporal context, thereby enabling models to grasp the progression and transformations of events as they dynamically evolve through time.

Event-Event Relations

not only involve event-time grounding but also introduce multi-hop relative connections between events.Models with this capability can better handle temporal reasoning in complex scenarios involving multiple events.

Implicit Temporal Reasoning

involves going beyond the surface of texts, engaging in deeper reasoning such as drawing upon temporal commonsense, identifying implicit temporal factors and discerning hidden temporal relationships among events. Implicit temporal reasoning is pivotal in complex real-world scenarios where events and time are intricately interwoven.

2.3 Symbolic Temporal Reasoning

To evaluate the language model's comprehension of abstract time expressions, we utilize two symbolic reasoning tasks stripped of semantic content: date arithmetic and time expression inference. Table 1 shows examples of symbolic temporal reasoning.

Date Arithmetic

(Tan et al., 2023) assesses the model's grasp of abstract date calculation. When provided with a date, the model needs to accurately calculate the date a certain amount of time before or after the given date. The smallest unit is one day.
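As a minimal illustration of what this task requires (a sketch only; the example question and the use of dateutil are our assumptions, not part of the dataset), the gold answer can be computed by applying a signed offset to a calendar date:

```python
from datetime import date
from dateutil.relativedelta import relativedelta  # assumed dependency (python-dateutil)

def offset_date(start: date, months: int = 0, days: int = 0) -> date:
    """Return the date `months` and `days` before (negative) or after (positive) `start`.

    The smallest unit in the Date Arithmetic task is one day.
    """
    return start + relativedelta(months=months, days=days)

# Hypothetical item: "What is the date 3 months and 10 days after 21 April 2000?"
print(offset_date(date(2000, 4, 21), months=3, days=10))  # 2000-07-31
```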

TimeX NLI

(Thukral et al., 2021) focuses on the logical entailment relationships among abstract TimeX, including three aspects: order (s1), duration (s2), and duration with unit conversion (s3).

2.4 Commonsense Temporal Reasoning

We measure the model's mastery of temporal commonsense and world knowledge, along with its capacity for reasoning based on these insights. Table 2 presents examples of temporal commonsense reasoning in QA and generation forms.

MCTACO

(Zhou et al., 2019) evaluates diverse commonsense knowledge from different aspects of events, including duration, frequency, order, stationarity and typical event time.

DurationQA

(Virgo et al., 2022) focuses specifically on temporal commonsense reasoning in the spectrum of event duration.

TimeDial

(Qin et al., 2021) considers temporal commonsense reasoning in dialogue scenarios and involves various aspects of commonsense associated with duration, order, and world knowledge.

SituatedGen

(Zhang and Wan, 2023) considers generative commonsense reasoning in a constrained text generation scenario. Given a set of contrasting keywords, the model needs to choose appropriate keywords for each sentence and generate a pair of contrasting sentences that satisfy temporal commonsense.

2.5 Event Temporal Reasoning

Event temporal reasoning assesses the model's understanding of relationships between events and time in real-world scenarios, as well as its ability to reason under certain temporal or event constraints. Examples are shown in Table 3.

TimeQA

(Chen et al., 2021) requires the model to answer time-sensitive questions based on context containing numerous time-involved facts. It is categorized into explicit reasoning and implicit reasoning based on time indicators (before, in, etc.).

MenatQA

(Wei et al., 2023) introduces time-sensitive factors to elicit implicit temporal reasoning, including time scope change, disruption of facts, and counterfactual questions, which provides a more in-depth assessment of implicit reasoning ability on event-time relations.

TempReason

(Tan et al., 2023) removes irrelevant context and focuses on implicit temporal reasoning within structured facts, investigating the model's capability boundaries. It involves event-time reasoning and event-event reasoning.

TRACIE

(Zhou et al., 2021) evaluates the model's comprehension of temporal order between implicit events. The model needs to identify events implied in the context and then determine their chronological order.

2.6 Task Formats and Evaluation Metrics

TimeBench is a multispectral benchmark encompassing four task types: free-form reading comprehension, natural language inference, constrained text generation, and multi-select questions. For detailed task types and their corresponding evaluation metrics, please refer to Appendix A.3 and A.4.
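For instance, free-form reading comprehension answers are typically scored with token-level F1; the sketch below shows one standard formulation of that metric under our assumptions (the exact metric used for each task is listed in Appendix A.4):

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted answer and a gold answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("in April 2000", "April 2000"))  # 0.8
```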

3 Methodology

We perform evaluations using a prompt-based approach, including standard prompting and chain-of-thought prompting. Experiments are conducted under both zero-shot and few-shot settings.

Standard Prompting

We formulate specific instructions for each task. In the zero-shot setting, models follow the instructions to answer questions. In the few-shot setting, models are provided with several question-answer pairs as demonstrations and emulate those instances to answer questions.

\mathrm{prompt^{sp}_{zs}} = \{\mathrm{INST}\}\{\mathrm{Q}\}   (1)
\mathrm{prompt^{sp}_{fs}} = \{\mathrm{INST}\}\{\mathrm{Q_1}\}\{\mathrm{A_1}\}\ldots\{\mathrm{Q}\}   (2)

Chain-of-Thought Prompting

The instructions for CoT are the same as for standard prompting. In the zero-shot setting, following Zero-shot CoT (Kojima et al., 2022), we add a reasoning trigger, Let's think step by step, after questions to perform chain-of-thought reasoning. In the few-shot setting, we manually annotate CoT demonstrations for each task to guide the step-by-step reasoning. Prompts can be found in Appendix B.3.

\mathrm{prompt^{cot}_{zs}} = \{\mathrm{INST}\}\{\mathrm{Q}\}\{\mathrm{TRIG}\}   (3)
\mathrm{prompt^{cot}_{fs}} = \{\mathrm{INST}\}\{\mathrm{Q_1}\}\{\mathrm{R_1}\}\{\mathrm{A_1}\}\ldots\{\mathrm{Q}\}   (4)
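A minimal sketch of how the four prompt variants in Eq. (1)-(4) can be assembled (the helper below and its demonstration format are illustrative assumptions, not the exact prompts from Appendix B.3):

```python
TRIGGER = "Let's think step by step."  # zero-shot CoT reasoning trigger

def build_prompt(instruction: str, question: str, demos=None, cot: bool = False) -> str:
    """Assemble standard or chain-of-thought prompts in zero-shot or few-shot form.

    `demos` is a list of (question, rationale, answer) triples; the rationale is
    skipped for standard prompting (Eq. 1-2) and kept for few-shot CoT (Eq. 4).
    """
    parts = [instruction]
    for demo_q, rationale, answer in demos or []:
        parts.append(demo_q)
        if cot:
            parts.append(rationale)
        parts.append(answer)
    parts.append(question)
    if cot and not demos:  # zero-shot CoT (Eq. 3) appends the trigger after the question
        parts.append(TRIGGER)
    return "\n".join(parts)
```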

4 Experimental Setup

Table 4: Few-shot results.
Method | TimeXNLI s1 / s2 / s3 | Arith | DQA | McT. | TiD. | SitGen | TimeQA Exp. / Imp. | MenatQA Sco. / Ord. / Ctf. | TempR L2 / L3 | TRACIE | Sym. | Comm. | Event | Avg.
Human | 98.0 / 96.0 / 92.0 | 100.0 | 80.8 | 87.1 | 97.8 | 100.0 | 93.3 / 91.1 | 85.6 / 87.3 / 79.9 | 97.1 / 95.3 | 82.5 | 96.5 | 91.4 | 89.0 | 91.5
GPT-4 | 85.3 / 73.3 / 53.3 | 100.0 | 64.8 | 88.3 | 94.6 | 88.6 | 73.7 / 51.0 | 72.4 / 54.8 / 28.7 | 92.4 / 95.9 | 62.8 | 78.0 | 84.1 | 66.5 | 73.7
+ FS CoT | 92.0 / 84.0 / 64.0 | 100.0 | 55.1 | 72.3 | 93.4 | - | 66.9 / 52.8 | 65.3 / 52.6 / 25.9 | 96.9 / 94.6 | 66.4 | 85.0 | 73.6 | 65.2 | 72.1
GPT-3.5 | 52.0 / 68.4 / 31.6 | 63.6 | 67.7 | 71.2 | 76.4 | 79.1 | 66.1 / 48.4 | 43.2 / 51.6 / 17.9 | 84.7 / 78.0 | 55.0 | 53.9 | 73.6 | 55.6 | 59.7
+ FS CoT | 51.6 / 71.8 / 36.6 | 84.4 | 41.2 | 38.1 | 71.1 | - | 68.0 / 47.0 | 42.5 / 41.7 / 37.8 | 89.9 / 76.6 | 50.2 | 61.1 | 50.1 | 56.7 | 56.6
LLaMA2-70B† | 55.0 / 61.0 / 37.0 | 82.0 | 67.4 | 85.3 | 82.7 | 74.9 | 66.7 / 48.3 | 61.4 / 42.5 / 33.8 | 85.2 / 85.4 | 61.0 | 58.8 | 77.6 | 60.5 | 64.4
+ FS CoT | 52.0 / 73.0 / 39.0 | 79.5 | 62.3 | 79.1 | 61.1 | - | 64.3 / 43.0 | 57.7 / 45.2 / 53.1 | 87.5 / 81.6 | 67.0 | 60.9 | 67.5 | 62.4 | 63.0
LLaMA2-13B† | 50.0 / 54.0 / 30.0 | 29.5 | 53.3 | 66.0 | 55.6 | 64.8 | 59.3 / 48.6 | 49.6 / 43.4 / 37.5 | 78.7 / 62.7 | 58.0 | 40.9 | 59.9 | 54.7 | 52.6
+ FS CoT | 40.0 / 61.0 / 37.0 | 52.0 | 59.3 | 68.8 | 40.8 | - | 59.4 / 49.1 | 58.4 / 43.8 / 44.1 | 78.0 / 68.2 | 58.0 | 47.5 | 56.3 | 57.4 | 54.5
LLaMA2-7B† | 26.0 / 50.0 / 30.0 | 20.0 | 54.5 | 59.6 | 45.2 | 62.4 | 54.4 / 45.3 | 49.8 / 41.9 / 35.8 | 64.0 / 53.3 | 49.0 | 31.5 | 55.4 | 49.2 | 46.3
+ FS CoT | 37.0 / 52.0 / 36.0 | 25.5 | 56.9 | 67.0 | 41.9 | - | 45.6 / 36.1 | 50.9 / 38.0 / 57.3 | 59.7 / 57.7 | 50.0 | 37.6 | 55.3 | 49.4 | 47.4
Baichuan2-13B† | 38.0 / 48.0 / 33.0 | 42.5 | 54.8 | 73.0 | 45.7 | 64.9 | 59.4 / 54.2 | 52.7 / 38.0 / 21.4 | 77.3 / 63.5 | 54.0 | 40.4 | 59.6 | 52.6 | 51.3
+ FS CoT | 50.0 / 56.0 / 34.0 | 47.0 | 62.0 | 69.3 | 43.8 | - | 58.2 / 49.6 | 49.8 / 40.1 / 45.6 | 81.3 / 65.6 | 60.0 | 46.8 | 58.4 | 56.3 | 54.2
Baichuan2-7B† | 27.0 / 66.0 / 41.0 | 32.5 | 59.8 | 69.4 | 34.3 | 59.8 | 53.8 / 50.2 | 49.6 / 38.5 / 22.9 | 65.9 / 51.0 | 55.0 | 41.6 | 55.8 | 48.4 | 48.5
+ FS CoT | 30.0 / 56.0 / 34.0 | 34.0 | 57.0 | 69.5 | 44.5 | - | 51.2 / 40.7 | 46.4 / 32.6 / 46.3 | 61.5 / 64.1 | 53.0 | 38.5 | 57.0 | 49.5 | 48.1
Mistral-7B† | 48.0 / 53.0 / 38.0 | 41.0 | 61.8 | 76.2 | 61.8 | 58.3 | 55.9 / 45.3 | 49.4 / 47.8 / 45.5 | 76.7 / 74.8 | 53.0 | 45.0 | 64.5 | 56.1 | 55.4
+ FS CoT | 57.0 / 63.0 / 35.0 | 54.0 | 61.8 | 45.7 | 57.3 | - | 60.4 / 46.2 | 57.2 / 47.9 / 33.2 | 65.9 / 67.9 | 57.0 | 52.3 | 54.9 | 54.5 | 54.0
ChatGLM3-6B† | 48.0 / 70.0 / 32.0 | 35.0 | 51.8 | 62.6 | 55.0 | 61.6 | 57.2 / 26.3 | 35.4 / 41.5 / 22.5 | 76.4 / 55.9 | 58.0 | 46.3 | 57.8 | 46.7 | 49.3
+ FS CoT | 47.0 / 68.0 / 32.0 | 46.0 | 53.9 | 64.3 | 56.5 | - | 52.5 / 24.5 | 35.0 / 40.2 / 22.5 | 79.4 / 60.3 | 54.0 | 48.3 | 58.2 | 46.1 | 49.1

Table 5: Zero-shot results.
Method | TimeXNLI s1 / s2 / s3 | Arith | DQA | McT. | TiD. | SitGen | TimeQA Exp. / Imp. | MenatQA Sco. / Ord. / Ctf. | TempR L2 / L3 | TRACIE | Sym. | Comm. | Event | Avg.
Human | 98.0 / 96.0 / 92.0 | 100.0 | 80.8 | 87.1 | 97.8 | 100.0 | 93.3 / 91.1 | 85.6 / 87.3 / 79.9 | 97.1 / 95.3 | 82.5 | 96.5 | 91.4 | 89.0 | 91.5
GPT-4 | 78.6 / 76.0 / 50.7 | 98.0 | 59.2 | 80.0 | 91.1 | 59.3 | 60.6 / 46.5 | 57.0 / 57.0 / 23.1 | 95.3 / 95.0 | 64.8 | 75.8 | 72.4 | 62.4 | 68.3
+ CoT | 80.0 / 76.0 / 60.0 | 92.0 | 58.1 | 82.6 | 89.3 | - | 61.3 / 41.2 | 54.6 / 59.6 / 22.6 | 97.0 / 94.5 | 58.0 | 77.0 | 76.7 | 61.1 | 68.5
GPT-3.5 | 45.4 / 67.6 / 31.2 | 97.0 | 50.5 | 68.6 | 69.1 | 62.3 | 70.8 / 35.4 | 40.9 / 43.9 / 22.9 | 81.2 / 73.8 | 57.4 | 60.3 | 62.6 | 53.3 | 57.4
+ CoT | 33.6 / 64.8 / 33.6 | 71.0 | 23.2 | 45.1 | 67.0 | - | 64.4 / 35.1 | 39.7 / 42.9 / 26.3 | 57.6 / 68.1 | 52.0 | 50.8 | 45.1 | 48.3 | 48.3
LLaMA2-70B | 44.0 / 47.0 / 32.0 | 78.5 | 59.2 | 68.9 | 57.0 | 25.0 | 40.8 / 40.6 | 18.9 / 16.6 / 12.0 | 63.5 / 54.5 | 48.0 | 50.4 | 52.5 | 36.8 | 44.1
+ CoT | 30.0 / 66.0 / 28.0 | 53.5 | 57.3 | 67.1 | 58.6 | - | 31.4 / 19.5 | 12.2 / 12.7 / 20.8 | 37.5 / 40.5 | 51.0 | 44.4 | 61.0 | 28.2 | 39.1
LLaMA2-13B | 30.0 / 49.0 / 34.0 | 22.5 | 38.5 | 40.6 | 35.4 | 57.9 | 61.9 / 30.5 | 46.1 / 36.1 / 26.9 | 53.1 / 69.4 | 49.0 | 33.9 | 43.1 | 46.6 | 42.6
+ CoT | 36.0 / 50.0 / 38.0 | 6.0 | 39.2 | 51.7 | 36.9 | - | 58.7 / 38.9 | 40.9 / 32.5 / 33.6 | 58.0 / 68.4 | 47.0 | 32.5 | 42.6 | 47.3 | 42.4
LLaMA2-7B | 39.0 / 53.0 / 30.0 | 13.0 | 39.3 | 41.0 | 6.3 | 24.5 | 49.0 / 29.0 | 26.8 / 21.1 / 16.0 | 63.9 / 47.9 | 49.0 | 33.8 | 27.8 | 37.8 | 34.3
+ CoT | 44.0 / 50.0 / 33.0 | 5.0 | 35.0 | 40.0 | 1.7 | - | 49.9 / 31.6 | 31.4 / 24.5 / 17.8 | 56.9 / 48.1 | 46.0 | 33.0 | 25.6 | 38.3 | 34.3
Baichuan2-13B | 41.0 / 61.0 / 37.0 | 12.5 | 52.0 | 63.4 | 57.7 | 52.2 | 55.4 / 34.6 | 48.8 / 44.3 / 39.5 | 57.4 / 61.4 | 49.0 | 37.9 | 56.3 | 48.8 | 48.0
+ CoT | 40.0 / 57.0 / 31.0 | 10.0 | 44.6 | 61.9 | 58.1 | - | 41.5 / 40.9 | 52.0 / 38.5 / 43.2 | 62.8 / 64.3 | 55.0 | 34.5 | 54.9 | 49.8 | 46.7
Baichuan2-7B | 35.0 / 50.0 / 37.0 | 4.5 | 47.9 | 55.3 | 54.3 | 42.0 | 41.5 / 34.7 | 35.2 / 31.2 / 20.4 | 43.4 / 47.7 | 55.0 | 31.6 | 49.9 | 38.6 | 39.7
+ CoT | 38.0 / 43.0 / 32.0 | 1.0 | 37.9 | 58.0 | 44.2 | - | 53.5 / 38.8 | 39.9 / 33.2 / 29.3 | 41.2 / 47.2 | 54.0 | 28.5 | 46.7 | 42.1 | 39.4
Vicuna1.5-13B | 35.0 / 50.0 / 36.0 | 15.0 | 39.2 | 59.1 | 34.2 | 51.8 | 60.4 / 37.0 | 46.8 / 37.4 / 23.2 | 42.1 / 43.6 | 46.0 | 34.0 | 46.1 | 42.1 | 41.1
+ CoT | 42.0 / 51.0 / 37.0 | 3.0 | 29.8 | 50.0 | 33.7 | - | 56.9 / 36.4 | 38.2 / 37.7 / 20.4 | 49.0 / 49.1 | 51.0 | 33.3 | 37.8 | 42.3 | 39.0
Vicuna1.5-7B | 37.0 / 58.0 / 43.0 | 5.0 | 40.4 | 52.5 | 32.0 | 47.8 | 47.1 / 18.5 | 35.7 / 25.7 / 17.3 | 33.0 / 46.8 | 54.0 | 35.8 | 43.2 | 34.8 | 37.1
+ CoT | 36.0 / 50.0 / 36.0 | 1.5 | 39.4 | 49.2 | 36.2 | - | 40.9 / 24.6 | 26.2 / 28.5 / 25.0 | 27.7 / 40.3 | 54.0 | 30.9 | 41.6 | 33.4 | 34.4
FLAN-T5-11B | 53.0 / 63.0 / 43.0 | 0.0 | 52.0 | 65.0 | 47.7 | 49.5 | 61.7 / 26.8 | 33.6 / 52.2 / 21.8 | 87.9 / 83.9 | 64.0 | 39.8 | 53.6 | 54.0 | 50.3
+ CoT | 56.0 / 66.0 / 45.0 | 0.0 | 49.7 | 63.4 | 42.7 | - | 64.4 / 28.2 | 41.6 / 50.2 / 30.6 | 79.5 / 68.9 | 55.0 | 41.8 | 51.9 | 52.3 | 49.4
Mistral-7B | 47.0 / 50.0 / 43.0 | 26.5 | 49.8 | 58.8 | 23.2 | 58.3 | 28.2 / 21.4 | 24.3 / 22.3 / 21.7 | 39.6 / 31.6 | 51.0 | 41.6 | 47.5 | 30.0 | 37.3
+ CoT | 38.0 / 56.0 / 35.0 | 16.5 | 36.6 | 49.3 | 19.3 | - | 31.3 / 22.4 | 21.1 / 24.9 / 25.6 | 34.0 / 31.2 | 61.0 | 36.4 | 35.1 | 31.4 | 33.5
ChatGLM3-6B | 38.0 / 50.0 / 34.0 | 2.0 | 34.1 | 43.6 | 56.7 | 38.9 | 41.2 / 31.7 | 33.8 / 26.0 / 32.2 | 57.0 / 54.0 | 50.0 | 31.0 | 43.3 | 40.7 | 39.0
+ CoT | 27.0 / 49.0 / 37.0 | 0.0 | 24.8 | 37.1 | 44.8 | - | 41.7 / 25.4 | 34.6 / 28.1 / 41.2 | 44.5 / 52.0 | 48.0 | 28.3 | 35.6 | 39.4 | 35.7

4.1 Models

We evaluate several popular LLMs, including both open-source and proprietary models, with parameter sizes ranging from 6B to 70B. (Since OpenAI has never disclosed the scale of the ChatGPT series, 6B to 70B here refers to ChatGLM3-6B to LLaMA2-70B.) The complete list of models can be found in Appendix B.1.

4.2 Implementation Details

We access proprietary models through the Azure API (0613 version). For open-source models, we deploy them locally through FastAPI. We set the temperature to 0.0 for greedy decoding in all experiments. To improve answer extraction accuracy, we prompt models with the trigger Therefore, the answer is before the final output to elicit the final answer.
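A sketch of this setup (the endpoint URL and response schema below are hypothetical; only greedy decoding at temperature 0.0 and the answer trigger come from the description above):

```python
import requests

ANSWER_TRIGGER = "Therefore, the answer is"

def generate(prompt: str, endpoint: str = "http://localhost:8000/generate") -> str:
    """Query a locally deployed model (e.g. served behind FastAPI) with greedy decoding."""
    resp = requests.post(endpoint, json={"prompt": prompt, "temperature": 0.0})
    resp.raise_for_status()
    return resp.json()["text"]  # response field name is an assumption

def answer(prompt: str) -> str:
    """Let the model respond, then append the answer trigger so it states a short final answer."""
    reasoning = generate(prompt)
    final = generate(prompt + reasoning + "\n" + ANSWER_TRIGGER)
    return final.strip().rstrip(".")
```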

5 Experimental Results

5.1 Few-shot Results

Table 4 presents the experimental results under few-shot settings. GPT-4 achieves the best performance across all three categories, while LLaMA2-70B and GPT-3.5 rank in the second tier. However, there remains a substantial gap of 19.4% between the most powerful LLM and humans.

In symbolic temporal reasoning tasks, GPT-4 demonstrates exceptional performance, whereas other models exhibit a significant decline in comparison. In commonsense temporal reasoning tasks, GPT-4 lags behind humans by only 8.0%, indicating its powerful internal knowledge reservoir. As the model scale shrinks, the knowledge reservoir also decreases gradually, leading to a decline in performance. Notably, there is a significant gap of 25.2% between LLMs and humans in event temporal reasoning, which suggests that LLMs encounter major challenges in modeling intricate event-time relationships.

5.2 Zero-shot Results

Experimental results of aligned models under zero-shot settings are shown in Table 5. In zero-shot settings, GPT-4 and GPT-3.5 rank first and second, respectively, and they significantly outperform all open-source models by a large margin. It is noteworthy that open-source models exhibit a larger performance decline than proprietary models when transitioning from few-shot to zero-shot scenarios: GPT, Baichuan2 and LLaMA2 suffer drops of 5.6%, 14.6% and 27.2%, respectively. We attribute this performance decline to the quality of alignment. Restricted by their limited instruction-following capability, open-source models struggle to fully unleash their performance through instructions alone. Therefore, few-shot prompting is a better approach for stimulating their temporal reasoning abilities.

[Figure 2]

5.3 Chain-of-Thought in Temporal Reasoning

Previous research has found that chain-of-thought prompting can enhance models' reasoning abilities (Wei et al., 2022; Kojima et al., 2022). We aim to explore the following question: does CoT prompting bring consistent improvement in temporal reasoning? Due to the diversity of temporal reasoning, this question has not yet been definitively answered. To investigate it, we select several popular LLMs and analyze how their performance is affected by chain-of-thought prompting.

Chain-of-thought reasoning is not consistently effective.

As illustrated in Figure 2, introducing zero-shot CoT prompting results in consistent declines, with an overall decrease of 7.4%. In the few-shot scenario, CoT prompting also fails to yield consistent improvements, varying depending on the task: there is a 10.8% improvement in symbolic reasoning, a significant decline of 15.2% in commonsense reasoning, and a slight improvement of 1.3% in event temporal reasoning. Next, we conduct a more detailed analysis of the impact of CoT on specific tasks.

[Figure 3]

Impact of CoT prompting across tasks.

To explore the impact of CoT on various tasks thoroughly, we delve into the performance changes of each model across specific tasks within each category, as illustrated in Figure 3. In the zero-shot setting, open-source models achieve a slight improvement in event temporal reasoning with chain-of-thought prompting, while in other cases they face performance degradation. In the few-shot setting, almost all models exhibit significant improvement in symbolic temporal reasoning, with a concurrent prevalent decline in commonsense temporal reasoning. We attribute this to the knowledge sensitivity inherent in commonsense reasoning, where step-by-step reasoning cannot compensate for a lack of knowledge. In event temporal reasoning, improvements mainly stem from datasets involving implicit multi-step reasoning (MenatQA and TempReason), indicating that CoT is more effective for multi-hop questions. In summary, zero-shot CoT consistently has a negative impact on temporal reasoning, while in the few-shot scenario, CoT has a positive impact on symbolic and complex tasks but negatively affects knowledge-sensitive tasks.

6 Analysis and Discussion

6.1 Scaling Effect of Model Size

[Figure 4]

We investigate how model scale affects temporal reasoning capabilities; the trend is illustrated in Figure 4. As the model scale increases, there is a notable improvement in performance. When the parameter size expands from 7B to 13B, LLaMA2 and Baichuan2 show improvements of 13.0% and 10.5%, respectively. Furthermore, when LLaMA2 scales up to 70B, the trend of performance improvement continues. The overall improvement follows a log-linear relationship with scale. There are no significant performance differences among LLaMA2, Baichuan2, and ChatGLM3 under similar parameter specifications, while Mistral demonstrates impressive prowess, outperforming all other 13B models with nearly half the number of parameters.

6.2 Challenges in Temporal Reasoning

Table 6: Performance across commonsense aspects.
Model | Order | Duration | Freq. | Stationarity | Typical | Avg.
GPT-4 | 76.4↓ | 92.8↑ | 83.3↑ | 71.4↓ | 54.5↓ | 77.5
GPT-3.5 | 50.5↑ | 39.8↓ | 55.2↑ | 48.4↑ | 28.7↓ | 43.5
Baichuan2-13B† | 40.5↓ | 51.8↑ | 43.7↑ | 46.2↑ | 29.8↓ | 42.5
LLaMA2-70B† | 65.2↑ | 72.1↑ | 66.3↑ | 36.3↓ | 52.7↓ | 63.0
Mistral-7B† | 27.0↓ | 44.4↑ | 58.3↑ | 38.5↓ | 38.3↓ | 42.5

LLMs underperform in (multi-hop) symbolic reasoning

Except for GPT-4, the performance of all other models in symbolic temporal reasoning is unsatisfactory. A noticeable decrease is observed in the duration-conversion task compared to other atomic tasks (25% in GPT-4 and 27% in LLaMA2-70B). This is because the duration-conversion task (s3) necessitates a two-step reasoning process: it first unifies time units, and subsequently engages in numerical comparison. In contrast, other atomic tasks (s1, s2 and arithmetic) can be completed with a single reasoning step. In summary, LLMs perform poorly in symbolic temporal reasoning and exhibit more pronounced declines when encountering multi-step reasoning.
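For example, the following toy sketch (our own illustration, not benchmark code) makes the two steps explicit for a duration comparison such as "2 hours" versus "90 minutes":

```python
MINUTES_PER_UNIT = {"minute": 1, "hour": 60, "day": 24 * 60, "week": 7 * 24 * 60}

def to_minutes(value: float, unit: str) -> float:
    """Step 1: unify time units by expressing every duration in minutes."""
    return value * MINUTES_PER_UNIT[unit.rstrip("s")]

def is_longer(a: tuple, b: tuple) -> bool:
    """Step 2: numerical comparison on the unified values."""
    return to_minutes(*a) > to_minutes(*b)

print(is_longer((2, "hours"), (90, "minutes")))  # True
```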

Mastery of commonsense knowledge varies in LLMs

We analyze models' performance across various commonsense aspects, as shown in Table 6. We regard a model's average performance in commonsense reasoning tasks as the baseline. If the model outperforms the baseline in a specific aspect, it suggests greater proficiency in this type of knowledge, and vice versa. The findings indicate that LLMs generally demonstrate good knowledge of event duration and frequency. However, their comprehension of event order and typical events is relatively weaker. The uneven mastery of commonsense knowledge significantly affects the model's reasoning performance, especially when dealing with complex questions that involve multiple types of knowledge. Retrieval-augmented reasoning presents a promising avenue for mitigating the model's knowledge scarcity.

LLMs exhibit poor implicit temporal reasoning capabilities.

When comparing explicit and implicit event temporal reasoning, specifically TimeQA-explicit versus the others, we observe a significant performance decrease in implicit reasoning. Additionally, on TRACIE, with its numerous implied events, most models barely surpass a random baseline (50.0). Even GPT-4, despite its advanced capabilities, achieves only 66.4% accuracy, suggesting that LLMs struggle with modeling implicit temporal relationships. We consider it helpful to explicitly model the temporal relationships between events and time expressions, for instance by constructing timelines or temporal graphs.
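As a rough illustration of that direction (a sketch under our own assumptions, not a component of TimeBench), events and time expressions can be placed in an explicit graph whose edges carry temporal relations:

```python
from dataclasses import dataclass, field

@dataclass
class TemporalGraph:
    """Toy event-time graph: nodes are events or time expressions, edges carry temporal relations."""
    edges: list = field(default_factory=list)  # (head, relation, tail) triples

    def add(self, head: str, relation: str, tail: str) -> None:
        self.edges.append((head, relation, tail))

    def neighbors(self, head: str):
        return [(rel, tail) for h, rel, tail in self.edges if h == head]

# Hypothetical facts extracted from a TimeQA-style context
g = TemporalGraph()
g.add("joined Company A", "starts_in", "1996")
g.add("joined Company A", "before", "moved to Company B")
print(g.neighbors("joined Company A"))
```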

LLMs are good factual reasoners rather than factual extractors

When humans engage in temporal reasoning, it generally involves two steps: first, extracting time-fact pairs from the context, and then performing fact-based reasoning. TempReason provides extracted facts for conducting fact-based reasoning. By comparing the models' performance in context-based (TimeQA) against fact-based (TempReason) reasoning, we identify the bottleneck in event temporal reasoning. LLMs excel in TempReason, which signifies their strong capability in fact-based reasoning. However, their performance in context-based reasoning is significantly weaker. This implies that errors arise during the extraction of time-sensitive facts from the context. We attribute this performance gap to the models' deficiency in factual extraction capabilities. Thus, we consider LLMs to be strong factual reasoners rather than factual extractors in event temporal reasoning.

[Figure 5]

6.3 Alignment Impairs Temporal Reasoning

In the experiments mentioned earlier (Table 5), we observe a sharp decline in the zero-shot performance of aligned models. To investigate whether alignment is the cause of the decline in temporal reasoning, we conduct experiments on aligned models under few-shot settings. Figure 5 illustrates the overall performance decline after alignment. With the exception of Baichuan2, all other models are severely impaired, experiencing a significant drop of up to 22%. Through manual analysis of error cases, we summarize two reasons: (1) Alignment reduces the model's usability, causing it to tend towards refusing to answer when confronted with knowledge-sensitive questions. (2) Alignment damages the model's in-context learning capability, resulting in situations where the model deviates from the demonstrations. Furthermore, we believe that the lack of temporal reasoning-related training data in alignment exacerbates this issue, leading to disparities between different reasoning capabilities, such as mathematical and temporal reasoning.

[Figure 6]

6.4 Error Analysis

We manually analyze 100 predictions by GPT-4, GPT-3.5 and LLaMA2-base-70B from each subtask. The visualization of errors is shown in Figure 6.

Symbolic Reasoning

We categorize symbolic reasoning errors into five groups: (a) Expression: the model provides an incorrect time calculation expression. (b) Computation: the model provides the correct time calculation expression, but there is a calculation error. (c) Conversion: the model makes an error in the conversion of time units. (d) Comparison: the model makes an error when comparing two time expressions (or intervals). (e) Combination: the model encounters errors in a combination of the above operations. LLMs exhibit numerous computation, conversion, and comparison errors, which suggests a substantial deficiency in their understanding of fundamental temporal expressions. Additionally, a higher frequency of errors is observed on combination questions, highlighting that multi-step reasoning continues to be a significant challenge for current models.

Commonsense Reasoning

We categorize the errors of commonsense reasoning into two groups: (a) No Answer: the model fails to provide a final answer. (b) Reasoning Error: the model encounters reasoning errors, which can be subdivided into five types of knowledge-related errors. We observe that GPT-series models have a higher No Answer rate, while LLaMA is always able to provide answers. This discrepancy can be attributed to two factors: first, the models may lack the necessary commonsense knowledge to formulate an answer; second, the preference alignment mechanism may prompt the model to abstain from answering when confronted with questions outside its knowledge scope. Integration of retrieval can alleviate the problem of knowledge scarcity to a certain degree.

Event Temporal Reasoning

We categorize the errors of event temporal reasoning into four groups: (a) No Answer: the model is unable to find the answer in the context. (b) Reasoning Error: the model encounters reasoning errors. (c) Hallucination: the model's prediction does not exist in the context, known as hallucinated reasoning. (d) Metric: the model's prediction is correct, but the score is limited by the evaluation criteria. It can be observed that, aside from reasoning errors, failures to provide answers account for approximately 30%, indicating that models still have flaws in grounding temporal facts from context. Additionally, models occasionally exhibit hallucination phenomena, leading to erroneous reasoning.

7 Related Work

7.1 Temporal Reasoning

There are numerous efforts addressing diverse challenges in temporal reasoning. Early research mainly relies on TimeML (Pustejovsky et al., 2003), focusing on TimeX extraction and temporal relation extraction (Verhagen et al., 2007, 2010; UzZaman et al., 2013; Llorens et al., 2015; Miller et al., 2015; Mathur et al., 2021; Vashishtha et al., 2019). The advent of pre-trained language models (PLMs) brought about commonsense reasoning as a tool to explore the world knowledge inside models (Zhou et al., 2019; Qin et al., 2021; Dhingra et al., 2022). Recently, much attention has shifted towards event temporal reasoning (Chen et al., 2021; Tan et al., 2023; Wei et al., 2023). Han et al. (2021); Yang et al. (2023b); Son and Oh (2023); Chen et al. (2023) continually pre-train LLMs on time-aware data to elicit temporal reasoning, and Zhu et al. (2023); Su et al. (2023); Chu et al. (2023) explicitly represent temporal relationships using temporal graphs and timelines. Additionally, some works extend beyond text, evaluating temporal reasoning on structured tables and in the video domain (Gupta et al., 2023; Ko et al., 2023).

Some concurrent studies also analyze the temporal reasoning abilities of LLMs. Jain et al. (2023); Qiu et al. (2023) focus on temporal commonsense, and Wang and Zhao (2023) introduces a unified form for assessing the overall abilities.

Distinguished from other works, TimeBench is multispectral, offering a comprehensive evaluation of LLMs' temporal reasoning abilities.

7.2 Large Language Models

In recent years, there has been rapid progress in the research of large language models (LLMs) (Zhao et al., 2023). They exhibit outstanding performance across a multitude of tasks without the need for fine-tuning (Brown et al., 2020; Kojima et al., 2022). Furthermore, they have achieved astonishing results in complex reasoning tasks, such as mathematical reasoning (Cobbe et al., 2021; Mishra et al., 2022) and logical reasoning (Yu et al., 2020; Liu et al., 2023). Moreover, some studies suggest that chain-of-thought prompting can further enhance models' capabilities in complex reasoning scenarios (Wei et al., 2022; Kojima et al., 2022; Chu et al., 2024; Zhang et al., 2023).

8 Conclusion

Temporal reasoning entails inherent diversity and complexity. The lack of a comprehensive benchmark makes it challenging to quantify LLMs' temporal reasoning capabilities. In this work, we present TimeBench, a comprehensive and hierarchical benchmark for LLM temporal reasoning, tailored to mirror temporal reasoning in complex scenarios. We conduct extensive experiments on state-of-the-art LLMs to investigate their temporal reasoning capabilities. Our findings indicate a substantial gap between state-of-the-art LLMs and human performance, emphasizing the need for further research in this area. Moreover, we provide a meticulous analysis and discussion, outlining the current challenges that models face and suggesting potential directions for improvement.

Limitations

TimeBench is a comprehensive benchmark to quantify the temporal reasoning capabilities of LLMs. While we have taken various factors into account, there are a few limitations. First, our evaluation only applies prompt-based methods under zero-shot and few-shot settings, lacking evaluations specifically tailored for models fine-tuned on the temporal domain. Second, the instructions and demonstrations were manually crafted, which may lead to discrepancies in prompt interpretation among different LLMs. Third, the datasets constituting the benchmark include data from past years and a portion sourced from Wikipedia, which may have contaminated the training corpora of LLMs.

Acknowledgements

The research in this article is supported by the National Key Research and Development Project (2021YFF0901602), the National Science Foundation of China (U22B2059, 62276083), and Shenzhen Foundational Research Funding (JCYJ20200109113441941), Major Key Project of PCL (PCL2021A06).Ming Liu is the corresponding author.

References

  • Ainslie etal. (2023)Joshua Ainslie, James Lee-Thorp, Michiel deJong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. 2023.GQA: training generalized multi-query transformer models from multi-head checkpoints.In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, pages 4895–4901. Association for Computational Linguistics.
  • Banerjee and Lavie (2005)Satanjeev Banerjee and Alon Lavie. 2005.Meteor: An automatic metric for mt evaluation with improved correlation with human judgments.In Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, pages 65–72.
  • Barsalou etal. (2018)LawrenceW Barsalou, Léo Dutriaux, and Christoph Scheepers. 2018.Moving beyond the distinction between concrete and abstract concepts.Philosophical Transactions of the Royal Society B: Biological Sciences, 373(1752):20170144.
  • Beltagy etal. (2020)IzBeltagy, MatthewE Peters, and Arman Cohan. 2020.Longformer: The long-document transformer.arXiv preprint arXiv:2004.05150.
  • Brown etal. (2020)TomB. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, DanielM. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020.Language models are few-shot learners.In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual.
  • Chen etal. (2021)Wenhu Chen, Xinyi Wang, and WilliamYang Wang. 2021.A dataset for answering time-sensitive questions.In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, NeurIPS Datasets and Benchmarks 2021, December 2021, virtual.
  • Chen etal. (2023)Ziqiang Chen, Shaojuan Wu, Xiaowang Zhang, and Zhiyong Feng. 2023.TML: A temporal-aware multitask learning framework for time-sensitive question answering.In Companion Proceedings of the ACM Web Conference 2023, WWW 2023, Austin, TX, USA, 30 April 2023 - 4 May 2023, pages 200–203. ACM.
  • Chiang etal. (2023)Wei-Lin Chiang, Zhuohan Li, ZiLin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, JosephE. Gonzalez, Ion Stoica, and EricP. Xing. 2023.Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality.
  • Chowdhery etal. (2023)Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, HyungWon Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, YiTay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradbury, Jacob Austin, Michael Isard, Guy Gur-Ari, Pengcheng Yin, Toju Duke, Anselm Levskaya, Sanjay Ghemawat, Sunipa Dev, Henryk Michalewski, Xavier Garcia, Vedant Misra, Kevin Robinson, Liam Fedus, Denny Zhou, Daphne Ippolito, David Luan, Hyeontaek Lim, Barret Zoph, Alexander Spiridonov, Ryan Sepassi, David Dohan, Shivani Agrawal, Mark Omernick, AndrewM. Dai, ThanumalayanSankaranarayana Pillai, Marie Pellat, Aitor Lewkowycz, Erica Moreira, Rewon Child, Oleksandr Polozov, Katherine Lee, Zongwei Zhou, Xuezhi Wang, Brennan Saeta, Mark Diaz, Orhan Firat, Michele Catasta, Jason Wei, Kathy Meier-Hellstern, Douglas Eck, Jeff Dean, Slav Petrov, and Noah Fiedel.2023.Palm: Scaling language modeling with pathways.J. Mach. Learn. Res., 24:240:1–240:113.
  • Chu etal. (2024)Zheng Chu, Jingchang Chen, Qianglong Chen, Weijiang Yu, Tao He, Haotian Wang, Weihua Peng, Ming Liu, Bing Qin, and Ting Liu. 2024.Navigate through enigmatic labyrinth a survey of chain of thought reasoning: Advances, frontiers and future.In The 62nd Annual Meeting of the Association for Computational Linguistics: ACL 2024, Bangkok, Thailand, August 11–16, 2024. Association for Computational Linguistics.
  • Chu etal. (2023)Zheng Chu, Zekun Wang, Jiafeng Liang, Ming Liu, and Bing Qin. 2023.MTGER: multi-view temporal graph enhanced temporal reasoning over time-involved document.In Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, December 6-10, 2023, pages 15218–15233. Association for Computational Linguistics.
  • Chung etal. (2022)HyungWon Chung, LeHou, Shayne Longpre, Barret Zoph, YiTay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, ShixiangShane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Alex Castro-Ros, Marie Pellat, Kevin Robinson, Dasha Valter, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Zhao, Yanping Huang, Andrew Dai, Hongkun Yu, Slav Petrov, EdH. Chi, Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, QuocV. Le, and Jason Wei. 2022.Scaling instruction-finetuned language models.Preprint, arXiv:2210.11416.
  • Cobbe etal. (2021)Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. 2021.Training verifiers to solve math word problems.CoRR, abs/2110.14168.
  • Dhingra etal. (2022)Bhuwan Dhingra, JeremyR. Cole, JulianMartin Eisenschlos, Daniel Gillick, Jacob Eisenstein, and WilliamW. Cohen. 2022.Time-aware language models as temporal knowledge bases.Trans. Assoc. Comput. Linguistics, 10:257–273.
  • Gupta etal. (2023)Vivek Gupta, Pranshu Kandoi, MahekBhavesh Vora, Shuo Zhang, Yujie He, Ridho Reinanda, and Vivek Srikumar. 2023.Temptabqa: Temporal question answering for semi-structured tables.In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, pages 2431–2453. Association for Computational Linguistics.
  • Han etal. (2021)Rujun Han, Xiang Ren, and Nanyun Peng. 2021.ECONET: effective continual pretraining of language models for event temporal reasoning.In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11 November, 2021, pages 5367–5380. Association for Computational Linguistics.
  • Hendrycks etal. (2021)Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021.Measuring massive multitask language understanding.In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net.
  • Jain etal. (2023)Raghav Jain, Daivik Sojitra, Arkadeep Acharya, Sriparna Saha, Adam Jatowt, and Sandipan Dandapat. 2023.Do language models have a common sense regarding time? revisiting temporal commonsense reasoning in the era of large language models.In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 6750–6774, Singapore. Association for Computational Linguistics.
  • Jiang etal. (2023)AlbertQ. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, DevendraSingh Chaplot, Diego deLasCasas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, LélioRenard Lavaud, Marie-Anne Lachaux, Pierre Stock, TevenLe Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and WilliamEl Sayed. 2023.Mistral 7b.CoRR, abs/2310.06825.
  • Ko etal. (2023)Dohwan Ko, JiSoo Lee, Woo-Young Kang, Byungseok Roh, and Hyunwoo Kim. 2023.Large language models are temporal and causal reasoners for video question answering.In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, pages 4300–4316. Association for Computational Linguistics.
  • Kojima etal. (2022)Takeshi Kojima, ShixiangShane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022.Large language models are zero-shot reasoners.In NeurIPS.
  • Lin (2004)Chin-Yew Lin. 2004.Rouge: A package for automatic evaluation of summaries.In Text summarization branches out, pages 74–81.
  • Liu etal. (2023)Hanmeng Liu, Zhiyang Teng, Ruoxi Ning, Jian Liu, Qiji Zhou, and Yue Zhang. 2023.Glore: Evaluating logical reasoning of large language models.CoRR, abs/2310.09107.
  • Llorens etal. (2015)Hector Llorens, Nathanael Chambers, Naushad UzZaman, Nasrin Mostafazadeh, JamesF. Allen, and James Pustejovsky. 2015.Semeval-2015 task 5: QA tempeval - evaluating temporal information understanding with question answering.In Proceedings of the 9th International Workshop on Semantic Evaluation, SemEval@NAACL-HLT 2015, Denver, Colorado, USA, June 4-5, 2015, pages 792–800. The Association for Computer Linguistics.
  • Mathur etal. (2021)Puneet Mathur, Rajiv Jain, Franck Dernoncourt, Vlad Morariu, QuanHung Tran, and Dinesh Manocha. 2021.TIMERS: Document-level temporal relation extraction.In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 524–533, Online. Association for Computational Linguistics.
  • Miller etal. (2015)Timothy Miller, Steven Bethard, Dmitriy Dligach, Chen Lin, and Guergana Savova. 2015.Extracting time expressions from clinical text.In Proceedings of BioNLP 15, pages 81–91, Beijing, China. Association for Computational Linguistics.
  • Mishra etal. (2022)Swaroop Mishra, Matthew Finlayson, Pan Lu, Leonard Tang, Sean Welleck, Chitta Baral, Tanmay Rajpurohit, Oyvind Tafjord, Ashish Sabharwal, Peter Clark, and Ashwin Kalyan. 2022.LILA: A unified benchmark for mathematical reasoning.In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022, Abu Dhabi, United Arab Emirates, December 7-11, 2022, pages 5807–5832. Association for Computational Linguistics.
  • OpenAI (2023)OpenAI. 2023.GPT-4 technical report.CoRR, abs/2303.08774.
  • Ouyang etal. (2022)Long Ouyang, Jeffrey Wu, XuJiang, Diogo Almeida, CarrollL. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, PaulF. Christiano, Jan Leike, and Ryan Lowe. 2022.Training language models to follow instructions with human feedback.In NeurIPS.
  • Papineni etal. (2002)Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002.Bleu: a method for automatic evaluation of machine translation.In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318.
  • Pustejovsky etal. (2003)James Pustejovsky, JoséM. Castaño, Robert Ingria, Roser Saurí, RobertJ. Gaizauskas, Andrea Setzer, Graham Katz, and DragomirR. Radev. 2003.Timeml: Robust specification of event and temporal expressions in text.In New Directions in Question Answering, Papers from 2003 AAAI Spring Symposium, Stanford University, Stanford, CA, USA, pages 28–34. AAAI Press.
  • Qin etal. (2021)Lianhui Qin, Aditya Gupta, Shyam Upadhyay, Luheng He, Yejin Choi, and Manaal Faruqui. 2021.TIMEDIAL: temporal commonsense reasoning in dialog.In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual Event, August 1-6, 2021, pages 7066–7076. Association for Computational Linguistics.
  • Qiu etal. (2023)Yifu Qiu, Zheng Zhao, Yftah Ziser, Anna Korhonen, EdoardoM. Ponti, and ShayB. Cohen. 2023.Are large language models temporally grounded?CoRR, abs/2311.08398.
  • Raffel etal. (2020)Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and PeterJ. Liu. 2020.Exploring the limits of transfer learning with a unified text-to-text transformer.J. Mach. Learn. Res., 21:140:1–140:67.
  • Son and Oh (2023)Jungbin Son and Alice Oh. 2023.Time-aware representation learning for time-sensitive question answering.In Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, December 6-10, 2023, pages 70–77. Association for Computational Linguistics.
  • Srivastava etal. (2022)Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu AwalMd Shoeb, Abubakar Abid, Adam Fisch, AdamR. Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, Agnieszka Kluska, Aitor Lewkowycz, Akshat Agarwal, Alethea Power, Alex Ray, Alex Warstadt, AlexanderW. Kocurek, Ali Safaya, Ali Tazarv, Alice Xiang, Alicia Parrish, Allen Nie, Aman Hussain, Amanda Askell, Amanda Dsouza, Ameet Rahane, AnantharamanS. Iyer, Anders Andreassen, Andrea Santilli, Andreas Stuhlmüller, AndrewM. Dai, Andrew La, AndrewK. Lampinen, Andy Zou, Angela Jiang, Angelica Chen, Anh Vuong, Animesh Gupta, Anna Gottardi, Antonio Norelli, Anu Venkatesh, Arash Gholamidavoodi, Arfa Tabassum, Arul Menezes, Arun Kirubarajan, Asher Mullokandov, Ashish Sabharwal, Austin Herrick, Avia Efrat, Aykut Erdem, Ayla Karakas, and etal. 2022.Beyond the imitation game: Quantifying and extrapolating the capabilities of language models.CoRR, abs/2206.04615.
  • Su etal. (2023)Xin Su, Phillip Howard, Nagib Hakim, and Steven Bethard. 2023.Fusing temporal graphs into transformers for time-sensitive question answering.In Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, December 6-10, 2023, pages 948–966. Association for Computational Linguistics.
  • Tan etal. (2023)Qingyu Tan, HweeTou Ng, and Lidong Bing. 2023.Towards benchmarking and improving the temporal reasoning capability of large language models.In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9-14, 2023, pages 14820–14835. Association for Computational Linguistics.
  • Thukral etal. (2021)Shivin Thukral, Kunal Kukreja, and Christian Kavouras. 2021.Probing language models for understanding of temporal expressions.In Proceedings of the Fourth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, BlackboxNLP@EMNLP 2021, Punta Cana, Dominican Republic, November 11, 2021, pages 396–406. Association for Computational Linguistics.
  • Touvron etal. (2023)Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, CristianCanton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, PunitSingh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, EricMichael Smith, Ranjan Subramanian, XiaoqingEllen Tan, Binh Tang, Ross Taylor, Adina Williams, JianXiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and ThomasScialom. 2023.Llama 2: Open foundation and fine-tuned chat models.Preprint, arXiv:2307.09288.
  • UzZaman etal. (2013)Naushad UzZaman, Hector Llorens, Leon Derczynski, JamesF. Allen, Marc Verhagen, and James Pustejovsky. 2013.Semeval-2013 task 1: Tempeval-3: Evaluating time expressions, events, and temporal relations.In Proceedings of the 7th International Workshop on Semantic Evaluation, SemEval@NAACL-HLT 2013, Atlanta, Georgia, USA, June 14-15, 2013, pages 1–9. The Association for Computer Linguistics.
  • Vashishtha etal. (2019)Siddharth Vashishtha, Benjamin VanDurme, and AaronSteven White. 2019.Fine-grained temporal relation extraction.In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2906–2919, Florence, Italy. Association for Computational Linguistics.
  • Vedantam etal. (2015)Ramakrishna Vedantam, C.Lawrence Zitnick, and Devi Parikh. 2015.Cider: Consensus-based image description evaluation.In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, pages 4566–4575. IEEE Computer Society.
  • Verhagen etal. (2007)Marc Verhagen, RobertJ. Gaizauskas, Frank Schilder, Mark Hepple, Graham Katz, and James Pustejovsky. 2007.Semeval-2007 task 15: Tempeval temporal relation identification.In Proceedings of the 4th International Workshop on Semantic Evaluations, SemEval@ACL 2007, Prague, Czech Republic, June 23-24, 2007, pages 75–80. The Association for Computer Linguistics.
  • Verhagen etal. (2010)Marc Verhagen, Roser Saurí, Tommaso Caselli, and James Pustejovsky. 2010.Semeval-2010 task 13: Tempeval-2.In Proceedings of the 5th International Workshop on Semantic Evaluation, SemEval@ACL 2010, Uppsala University, Uppsala, Sweden, July 15-16, 2010, pages 57–62. The Association for Computer Linguistics.
  • Virgo etal. (2022)FelixGiovanni Virgo, Fei Cheng, and Sadao Kurohashi. 2022.Improving event duration question answering by leveraging existing temporal information extraction data.In Proceedings of the Thirteenth Language Resources and Evaluation Conference, LREC 2022, Marseille, France, 20-25 June 2022, pages 4451–4457. European Language Resources Association.
  • Wang and Zhao (2023)Yuqing Wang and Yun Zhao. 2023.TRAM: benchmarking temporal reasoning for large language models.CoRR, abs/2310.00835.
  • Wei etal. (2022)Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, EdH. Chi, QuocV. Le, and Denny Zhou. 2022.Chain-of-thought prompting elicits reasoning in large language models.In NeurIPS.
  • Wei etal. (2023)Yifan Wei, Yisong Su, Huanhuan Ma, Xiaoyan Yu, Fangyu Lei, Yuanzhe Zhang, Jun Zhao, and Kang Liu. 2023.Menatqa: A new dataset for testing the temporal comprehension and reasoning abilities of large language models.In Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, December 6-10, 2023, pages 1434–1447. Association for Computational Linguistics.
  • Yang etal. (2023a)Aiyuan Yang, Bin Xiao, Bingning Wang, Borong Zhang, CeBian, Chao Yin, Chenxu Lv, DaPan, Dian Wang, Dong Yan, Fan Yang, Fei Deng, Feng Wang, Feng Liu, Guangwei Ai, Guosheng Dong, Haizhou Zhao, Hang Xu, Haoze Sun, Hongda Zhang, Hui Liu, Jiaming Ji, Jian Xie, Juntao Dai, Kun Fang, Lei Su, Liang Song, Lifeng Liu, Liyun Ru, Luyao Ma, Mang Wang, Mickel Liu, MingAn Lin, Nuolan Nie, Peidong Guo, Ruiyang Sun, Tao Zhang, Tianpeng Li, Tianyu Li, Wei Cheng, Weipeng Chen, Xiangrong Zeng, Xiaochuan Wang, Xiaoxi Chen, Xin Men, Xin Yu, Xuehai Pan, Yanjun Shen, Yiding Wang, Yiyu Li, Youxin Jiang, Yuchen Gao, Yupeng Zhang, Zenan Zhou, and Zhiying Wu. 2023a.Baichuan 2: Open large-scale language models.CoRR, abs/2309.10305.
  • Yang etal. (2023b)Sen Yang, Xin Li, Lidong Bing, and Wai Lam. 2023b.Once upon a time in graph: Relative-time pretraining for complex temporal reasoning.In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, pages 11879–11895. Association for Computational Linguistics.
  • Yu etal. (2020)Weihao Yu, Zihang Jiang, Yanfei Dong, and Jiashi Feng. 2020.Reclor: A reading comprehension dataset requiring logical reasoning.In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net.
  • Zeng etal. (2023)Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang, Hanyu Lai, Ming Ding, Zhuoyi Yang, Yifan Xu, Wendi Zheng, Xiao Xia, WengLam Tam, Zixuan Ma, Yufei Xue, Jidong Zhai, Wenguang Chen, Zhiyuan Liu, Peng Zhang, Yuxiao Dong, and Jie Tang. 2023.GLM-130B: an open bilingual pre-trained model.In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net.
  • Zhang and Wan (2023)Yunxiang Zhang and Xiaojun Wan. 2023.Situatedgen: Incorporating geographical and temporal contexts into generative commonsense reasoning.In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023.
  • Zhang etal. (2023)Zhuosheng Zhang, Yao Yao, Aston Zhang, Xiangru Tang, Xinbei Ma, Zhiwei He, Yiming Wang, Mark Gerstein, Rui Wang, Gongshen Liu, and Hai Zhao. 2023.Igniting language intelligence: The hitchhiker’s guide from chain-of-thought reasoning to language agents.CoRR, abs/2311.11797.
  • Zhao etal. (2023)WayneXin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, Yifan Du, Chen Yang, Yushuo Chen, Zhipeng Chen, Jinhao Jiang, Ruiyang Ren, Yifan Li, Xinyu Tang, Zikang Liu, Peiyu Liu, Jian-Yun Nie, and Ji-Rong Wen. 2023.A survey of large language models.Preprint, arXiv:2303.18223.
  • Zhou etal. (2019)Ben Zhou, Daniel Khashabi, Qiang Ning, and Dan Roth. 2019."going on a vacation" takes longer than "going for a walk": A study of temporal commonsense understanding.In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, pages 3361–3367. Association for Computational Linguistics.
  • Zhou etal. (2021)Ben Zhou, Kyle Richardson, Qiang Ning, Tushar Khot, Ashish Sabharwal, and Dan Roth. 2021.Temporal reasoning on implicit events from distant supervision.In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021, Online, June 6-11, 2021, pages 1361–1371. Association for Computational Linguistics.
  • Zhu etal. (2023)Xinyu Zhu, Cheng Yang, Bei Chen, Siheng Li, Jian-Guang Lou, and Yujiu Yang. 2023.Question answering as programming for solving time-sensitive questions.In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, pages 12775–12790. Association for Computational Linguistics.

Appendix A TimeBench Details

TimeBench features 3 major categories, 10 tasks, and 16 subtasks, each with distinct challenges, totaling 19,000 instances. Detailed statistics are available in Figure 7 and Table 7.

A.1 Benchmark Construction

TimeX Arithmetic

(Tan etal., 2023) TimeX Arithmetic data is derived from the l1: time-time reasoning data in TempReason. We retain 4,000 instances, where time expressions are calculated with a minimum unit of one day.
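For illustration, this kind of day-level computation can be carried out directly with Python's standard datetime module; the query below is a made-up example in the style of the task, not an instance taken from the dataset.

```python
from datetime import date, timedelta

# Hypothetical query in the style of TimeX Arithmetic:
# "What is the date 100 days after 2021-03-15?"
start = date(2021, 3, 15)
answer = start + timedelta(days=100)
print(answer.isoformat())  # -> 2021-06-23
```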

TimeX NLI

(Thukral etal., 2021) The original data of TimeXNLI is in NLI format and includes three sub-tasks, Temp-Order, Temp-Duration, and Cross-Unit Duration, containing 6,140, 3,540, and 15,840 instances respectively. We randomly sample 2,213, 2,332, and 2,420 entries, resulting in a combined total of 6,965 instances.

MCTACO

(Zhou etal., 2019) The original MCTACO dataset consists of yes/no questions, containing 1,332 questions with 9,442 options. To guarantee that the questions are presented in a 4-way multi-select style, we first remove questions that have fewer than four options. Subsequently, to ensure that each question has at least one correct option, we filter out questions where all options are labeled as "no". For each remaining question, we randomly sample four options, striving to maintain a balance between correct and incorrect options. In most cases, a question is accompanied by 2 correct and 2 incorrect options; a minority of questions have an option distribution of 1-3 or 3-1. After this filtering process, we obtain 852 instances in a 4-way multi-select format.
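The filtering procedure can be sketched as follows; the question/option schema used here (a dict with a list of (text, is_correct) pairs) is purely illustrative and does not match the released MCTACO files exactly.

```python
import random

def to_multi_select(questions, seed=42):
    """Convert yes/no-labeled questions into 4-way multi-select instances.

    Each question is assumed to be {"question": str, "options": [(text, is_correct), ...]};
    treat this as a sketch of the procedure described above, not the actual preprocessing code.
    """
    rng = random.Random(seed)
    converted = []
    for q in questions:
        correct = [o for o in q["options"] if o[1]]
        incorrect = [o for o in q["options"] if not o[1]]
        # Drop questions with fewer than four options or without any correct option.
        if len(q["options"]) < 4 or not correct:
            continue
        # Aim for a 2/2 split of correct and incorrect options,
        # falling back to 1-3 or 3-1 when one side is too small.
        n_correct = min(2, len(correct))
        n_incorrect = 4 - n_correct
        if len(incorrect) < n_incorrect:
            n_incorrect = len(incorrect)
            n_correct = 4 - n_incorrect
        options = rng.sample(correct, n_correct) + rng.sample(incorrect, n_incorrect)
        rng.shuffle(options)
        converted.append({"question": q["question"], "options": options})
    return converted
```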

DurationQA

(Virgo etal., 2022) The original DurationQA has the same format as MCTACO, consisting of 694 questions with 4,868 options. Following the identical filtering procedure as MCTACO, we obtain a collection of 687 questions in a 4-way multi-select format.

TimeDial

(Qin etal., 2021) TimeDial consists of 4-way multi-select instances in a two-person dialogue scenario. We leave the original data unaltered and simply randomize the order of options, yielding 1,446 4-way multi-select instances.

SituatedGen

(Zhang and Wan, 2023) SituatedGen includes 1,220 test cases spanning two distinct reasoning domains: time and geography. We manually screen the original test data and retain those with clear temporal features for temporal reasoning evaluation, resulting in 115 instances.

TimeQA

(Chen etal., 2021) The original data of TimeQA includes two splits, Easy and Hard, with each question containing 20 Wikipedia paragraphs. The excessively long context may exceed the model's maximum length limit and incur significant inference overhead. Therefore, we reduce the context of the original data. Among the paragraphs in the original data, we refer to those containing the answer as relevant paragraphs and the rest as irrelevant paragraphs. For each question, we keep the first paragraph, all relevant paragraphs, and one random irrelevant paragraph as a distractor. This ensures that most questions have at least three paragraphs. After that, we sample 500 instances from those whose context length is less than 650 tokens. We apply this filtration to both the Easy and Hard splits, resulting in 500 questions each, totaling 1,000 instances.
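A minimal sketch of this context-reduction step, assuming each example is given as a list of paragraph strings plus a list of gold answer strings; the actual preprocessing operates on the released TimeQA files and may differ in detail.

```python
import random

def reduce_context(paragraphs, answers, max_tokens=650, seed=0):
    """Keep the lead paragraph, all answer-bearing paragraphs, and one
    random irrelevant paragraph as a distractor; then length-filter."""
    rng = random.Random(seed)
    relevant = [p for p in paragraphs if any(a in p for a in answers)]
    irrelevant = [p for p in paragraphs if p not in relevant]

    kept = []
    if paragraphs:
        kept.append(paragraphs[0])                      # first paragraph
    kept += [p for p in relevant if p not in kept]      # relevant paragraphs
    if irrelevant:
        distractor = rng.choice(irrelevant)
        if distractor not in kept:                      # the lead paragraph may itself be irrelevant
            kept.append(distractor)

    context = "\n".join(kept)
    # Whitespace split as a rough stand-in for the 650-token length filter.
    return context if len(context.split()) <= max_tokens else None
```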

TempReason

(Tan etal., 2023) The TempReason dataset contains 5,397 entries for l2 (event-time reasoning) and 4,426 entries for l3 (event-event reasoning). In the original dataset, each question is paired with a text context and extracted facts. Similar to TimeQA, we apply a filter based on context length. We preserve questions with a context length between 300 and 600 tokens, yielding 839 and 1,037 instances, respectively. Notably, every remaining question is applicable to either context-based reasoning or fact-based reasoning.

MenatQA

(Wei etal., 2023) MenatQA consists of 999 data entries, formatted similarly to TimeQA, where each question is accompanied by several corresponding paragraphs. Following the method proposed in the original paper, we modify the data by incorporating three time-sensitive factors: scope, order, and counterfactual. Subsequently, for each factor, we randomly sample 400 instances, resulting in a total of 1,200 data points.

TRACIE

(Zhou etal., 2021) The original TRACIE dataset consists of yes/no questions, containing 4,248 test instances. We randomly sample 500 instances from the iid split of the test set.

A.2 Human Performance Evaluation

Unless otherwise stated, the human evaluation results are taken from the original dataset papers; please refer to the corresponding paper for evaluation details. TimeXNLI, Date Arith, and MCTACO are manually evaluated by three authors from the TimeBench team. Within each subtask, we randomly sample 50 instances, and the average performance of the three human evaluators is taken as the final human performance.

A.3 Task Formats

TimeBench is a multi-format benchmark featuring four different task formats.

Multi-Select Questions

Previous work typically uses the Multiple Choice (MC) form, which requires models to select the single correct answer from the options. However, this task form contains shortcuts and may not truly reflect a model's abilities. To address this, we employ the Multi-Select (M-S) task form, where the model needs to select all correct answers from the options provided. In our tasks, each question presents four options, with at most two of them being correct.

Natural Language Inference

is the task of determining the logical relationship between two pieces of text. Specifically, given a premise and a hypothesis, the model must determine whether the hypothesis can be inferred from the premise, outputting entailment, contradiction, or neutral. Our tasks focus on entailment in the temporal domain.

Free-form Reading Comprehension

requires models to answer questions based on the provided context, and the ground truth answer is free-form without pre-defined format restrictions.

Constrained Text Generation

refers to the task of generating text under given constraints. Here, the task is keyword-constrained text generation, where the model takes keywords as input and outputs sentences that include those keywords.

A.4 Evaluation Metrics

Accuracy is used for the NLI and date arithmetic tasks. M-S tasks are evaluated using option-level EM and F1. FRC tasks (excluding date arithmetic) are assessed with token-level EM and F1. For the CTG task, we combine multiple generation metrics, which are outlined in the remainder of this section.
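For the multi-select tasks, option-level EM and F1 compare the set of options a model selects against the gold set. The sketch below reflects our reading of these metrics as set-level exact match and set-level precision/recall; the exact evaluation script may differ.

```python
def option_level_scores(predicted, gold):
    """Option-level exact match and F1 for one multi-select question.

    `predicted` and `gold` are sets of option labels, e.g. {"A", "C"}.
    """
    em = float(predicted == gold)
    overlap = len(predicted & gold)
    precision = overlap / len(predicted) if predicted else 0.0
    recall = overlap / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return em, f1

print(option_level_scores({"A", "C"}, {"A", "B"}))  # (0.0, 0.5)
```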

Metrics for SituatedGen

Following SituatedGen (Zhang and Wan, 2023), we use BLEU-4 (Papineni etal., 2002), METEOR (Banerjee and Lavie, 2005), ROUGE-L (Lin, 2004), CIDEr (Vedantam etal., 2015), and MATCH (Zhang and Wan, 2023) scores to measure the CTG results. We utilize the pycocoevalcap package to calculate BLEU-4, METEOR, ROUGE-L, and CIDEr.

The overall score is calculated as the sum of the above scores. We set the weight of CIDEr to 1/10 to balance the summation:

S = BLEU-4 + METEOR + ROUGE-L + CIDEr / 10 + MATCH

As the overall score S is not a percentage, we normalize the models' scores to align with humans' relative performance levels.
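Concretely, the aggregation and normalization can be reproduced from the scores reported in Table 8. The sketch below assumes that normalization simply rescales the overall score so that the human score maps to 100, which matches the Norm column of Table 8.

```python
def ctg_overall(bleu4, meteor, rouge_l, cider, match):
    """Overall CTG score: S = BLEU-4 + METEOR + ROUGE-L + CIDEr / 10 + MATCH."""
    return bleu4 + meteor + rouge_l + cider / 10 + match

HUMAN = ctg_overall(39.9, 40.4, 56.3, 397.0, 98.1)   # 274.4, from the human row of Table 8

def normalize(score, reference=HUMAN):
    """Scale an overall score so that human performance corresponds to 100."""
    return 100.0 * score / reference

gpt4 = ctg_overall(8.23, 31.27, 28.84, 38.45, 90.41)
print(round(gpt4, 2), round(normalize(gpt4), 2))      # ~162.59  ~59.25
```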


Appendix B Supplemental Materials

B.1 Models

ChatGPT-3.5/GPT-4

(Ouyang etal., 2022; OpenAI, 2023) ChatGPT is a chat model aligned through SFT and RLHF based on GPT-3 (Brown etal., 2020). GPT-4 is an upgraded version of ChatGPT with enhanced reasoning capabilities, making it the most powerful LLM. Unless otherwise stated, ChatGPT refers to gpt-3.5-turbo-0613 and GPT-4 refers to gpt-4-0613.

Llama2/Vicuna-1.5

(Touvron etal., 2023; Chiang etal., 2023) LLaMA2 is an open foundation model trained on 2T tokens with efficient grouped-query attention (Ainslie etal., 2023). LLaMA2-chat is the official aligned model trained with SFT and RLHF, and Vicuna-1.5 is aligned with SFT only by the community (https://lmsys.org/).

Baichuan2

(Yang etal., 2023a) is an open foundation model pre-trained on 2.6T tokens, which is competitive with LLaMA2. Baichuan2-chat is the official aligned model with SFT and RLHF.

Mistral

(Jiang etal., 2023) is a 7B open foundation model incorporating efficient grouped-query attention (Ainslie etal., 2023) and sliding window attention (Beltagy etal., 2020). It achieves the strongest performance among models of its size, even surpassing LLaMA2-13B. Mistral-instruct is the official aligned model with SFT only.

ChatGLM3

(Zeng etal., 2023) is an open-source bilingual LLM for Chinese and English, exhibiting competitive performance among models under 10B parameters.

FLAN-T5

(Chung etal., 2022) is an open-source instruction-following model built on top of T5 (Raffel etal., 2020) through instruction fine-tuning.

B.2 Full Results

The overall score is derived from the average of all corresponding metrics. For brevity, we omit some F1 scores in the tables in the main text. Please refer to the full results table below for the complete experimental results. The full results of SituatedGen can be found in Table 8.

B.3 Prompts

The prompt formats are showcased in Figure 9. The demonstrations can be found in Figures 10 to 18.


Table 7: Dataset statistics of TimeBench.

Dataset              | Format | #      | Challenges
Symbolic             |        |        |
  TimeX Arith        | FRC    | 4,000  | TimeX Arithmetic
  TimeX NLI          | NLI    | 6,965  | TimeX Causality
    - Order          |        | 2,213  | order
    - Duration       |        | 2,332  | duration
    - Conversion     |        | 2,420  | duration + time unit conversion
Commonsense          |        |        |
  MCTACO             | M-S    | 852    | Temporal Commonsense
  TimeDial           | M-S    | 1,446  | Temporal Commonsense
  DurationQA         | M-S    | 687    | Event Duration
  SituatedGen        | CTG    | 115    | Temporal Commonsense
Event                |        |        |
  TimeQA             | FRC    | 1,000  | Context-based Reasoning
    - Explicit       |        | 500    | explicit, event-time reasoning
    - Implicit       |        | 500    | implicit, event-time reasoning
  MenatQA            | FRC    | 1,599  | Implicit, Context-based Reasoning
    - Order          |        | 400    | event-time reasoning
    - Scope          |        | 400    | event-time reasoning
    - Counterfactual |        | 400    | event-time reasoning
  TempReason         | FRC    | 1,876  | Implicit, Fact-based Reasoning
    - l2 (e2t)       |        | 839    | event-time reasoning
    - l3 (e2e)       |        | 1,037  | event-event reasoning
  TRACIE             | NLI    | 500    | Implicit, Implied Event-Event Reasoning
In total             |        | 19,000 |

Table 8: Full results of SituatedGen (constrained text generation).

Method          | BLEU-4 | METEOR | ROUGE-L | CIDEr  | MATCH | Overall | Norm
Human           | 39.9   | 40.4   | 56.3    | 397    | 98.1  | 274.4   | 100.0
GPT-4           | 8.23   | 31.27  | 28.84   | 38.45  | 90.41 | 162.59  | 59.25
  + FS          | 28.64  | 38.99  | 55.69   | 298.64 | 90.11 | 243.29  | 88.66
GPT-3.5         | 13.38  | 30.12  | 35.91   | 125.41 | 78.76 | 170.70  | 62.21
  + FS          | 27.24  | 33.77  | 51.18   | 282.75 | 76.54 | 217.01  | 79.08
LLaMA2-70b      | 5.15   | 13.62  | 15.83   | 22.07  | 31.79 | 68.60   | 25.00
  + FS          | 19.10  | 29.09  | 41.74   | 171.36 | 65.29 | 172.35  | 62.81
LLaMA2-13b      | 4.66   | 21.43  | 20.80   | 17.72  | 61.62 | 110.28  | 40.19
  + FS          | 15.15  | 27.49  | 37.55   | 138.13 | 64.94 | 158.93  | 57.92
LLaMA2-7b       | 2.77   | 13.46  | 14.69   | 14.34  | 34.83 | 67.18   | 24.48
  + FS          | 6.90   | 15.82  | 21.77   | 52.99  | 33.81 | 83.60   | 30.47
Baichuan2-13b   | 8.33   | 25.86  | 30.07   | 82.63  | 70.63 | 143.15  | 52.17
  + FS          | 15.79  | 30.23  | 40.96   | 169.14 | 71.01 | 174.91  | 63.74
Baichuan2-7b    | 5.17   | 21.99  | 23.73   | 44.80  | 59.85 | 115.22  | 41.99
  + FS          | 15.06  | 23.45  | 32.29   | 137.94 | 52.04 | 136.64  | 49.79
Vicuna1.5-13b   | 7.73   | 26.35  | 29.15   | 69.16  | 71.91 | 142.06  | 51.77
  + FS          | 6.85   | 18.66  | 25.99   | 92.96  | 46.19 | 106.99  | 38.99
Vicuna1.5-7b    | 6.29   | 24.34  | 26.91   | 46.90  | 68.84 | 131.07  | 47.77
  + FS          | 20.71  | 30.19  | 45.20   | 203.20 | 67.58 | 184.00  | 67.05
FLAN-T5         | 16.20  | 24.43  | 29.38   | 95.17  | 56.38 | 135.91  | 49.53
  + FS          | 12.88  | 30.38  | 36.27   | 92.20  | 76.44 | 165.19  | 60.20
Mistral-7b      | 5.82   | 22.89  | 24.19   | 44.03  | 63.74 | 121.03  | 44.11
  + FS          | 18.96  | 29.02  | 43.15   | 185.61 | 63.24 | 172.93  | 63.02
ChatGLM3-6b     | 6.56   | 21.11  | 21.96   | 41.48  | 53.02 | 106.80  | 38.92
  + FS          | 10.53  | 24.17  | 33.44   | 124.50 | 56.94 | 137.53  | 50.12
LLaMA2-70b†     | 22.34  | 33.03  | 50.93   | 243.31 | 74.96 | 205.59  | 74.92
LLaMA2-13b†     | 17.54  | 29.44  | 45.21   | 200.14 | 65.64 | 177.84  | 64.81
LLaMA2-7b†      | 17.49  | 28.33  | 45.24   | 202.08 | 59.98 | 171.25  | 62.41
Baichuan2-13b†  | 17.86  | 29.75  | 44.28   | 198.83 | 66.35 | 178.12  | 64.91
Baichuan2-7b†   | 15.30  | 27.54  | 41.80   | 171.59 | 62.40 | 164.20  | 59.84
Mistral-7b†     | 14.54  | 27.39  | 41.72   | 168.89 | 59.42 | 159.96  | 58.30
ChatGLM3-6b†    | 17.11  | 29.35  | 40.74   | 156.49 | 66.18 | 169.02  | 61.60
Full results on TimeBench. Each row lists the following columns in order. Symbolic: TimeXNLI (s1, s2, s3), Date Arith (Acc); Commonsense: DurationQA (EM, F1), MCTACO (EM, F1), TimeDial (EM, F1), SitGen (Norm); Event: TimeQA (E-EM, E-F1, H-EM, H-F1), MenatQA (S-EM, S-F1, O-EM, O-F1, C-EM, C-F1), TempReason (L2-EM, L2-F1, L3-EM, L3-F1), TRACIE (Acc); Overall: Sym., Comm., Event, Avg.
Human98.096.092.0100.064.080.875.887.197.897.8100.089.093.387.091.182.085.684.087.376.079.996.097.194.095.382.596.591.489.091.5
GPT-478.676.050.798.035.059.261.280.072.091.159.348.960.640.446.544.457.049.057.022.023.191.095.394.095.064.875.872.462.468.3
+ CoT80.076.060.092.035.058.167.082.665.089.3-50.061.333.041.243.454.653.059.620.022.693.097.093.094.558.077.076.761.168.5
+ FS85.373.353.3100.051.064.877.688.385.094.688.659.273.740.051.059.672.448.054.825.328.786.092.494.895.962.878.084.166.573.7
+ FS CoT92.084.064.0100.042.055.168.072.379.093.4-48.066.944.452.848.565.344.052.622.025.991.096.993.094.666.485.073.665.272.1
GPT-3.545.467.631.297.019.250.534.168.639.269.162.360.570.829.535.436.540.937.543.921.022.973.681.261.873.857.460.362.653.357.4
+ CoT33.664.833.671.012.423.228.145.134.667.0-52.564.429.035.135.839.738.542.924.026.332.057.654.268.152.050.845.148.348.3
+ FS52.068.431.663.642.867.743.571.247.876.479.153.866.137.948.437.843.243.551.616.017.977.784.770.078.055.053.973.655.659.7
+ FS CoT51.671.836.684.420.841.221.438.148.371.1-56.568.037.547.038.142.537.541.733.037.886.289.968.076.650.261.150.156.756.6
LLaMA270b44.047.032.078.512.759.223.068.910.057.025.028.040.831.040.68.018.911.016.69.012.050.063.539.054.548.050.452.536.844.1
+ CoT30.066.028.053.58.057.321.067.19.058.617.031.413.019.55.012.28.012.718.020.812.037.520.040.551.044.461.028.239.1
+ FS49.042.038.062.01.361.213.066.56.056.662.841.051.116.020.08.016.417.019.918.018.734.052.231.041.151.047.861.833.844.3
+ FS CoT54.063.040.069.58.055.221.562.16.056.436.650.934.042.428.038.619.029.318.021.977.083.165.074.757.056.657.949.753.2
LLaMA213b30.049.034.022.54.038.58.540.610.035.457.946.061.921.030.528.046.123.036.118.026.943.053.155.069.449.033.943.146.642.6
+ CoT36.050.038.06.07.339.214.051.710.036.9-45.058.730.038.920.040.918.032.521.033.643.058.056.068.447.032.542.647.342.4
+ FS43.057.060.020.59.046.88.066.615.062.340.224.034.217.018.411.025.95.014.622.033.354.068.150.064.847.045.154.038.343.9
+ FS CoT37.055.050.033.012.049.511.045.68.044.5-35.046.021.025.434.046.723.036.57.016.572.080.854.066.250.043.846.546.045.5
LLaMA27b39.053.030.013.02.739.34.041.01.06.324.537.049.014.029.07.026.88.021.19.016.048.063.932.047.949.033.827.837.834.3
+ CoT44.050.033.05.02.735.04.540.01.01.7-27.049.917.031.611.031.410.024.57.017.844.056.932.048.146.033.025.638.334.3
+ FS44.060.034.011.04.062.88.064.78.040.030.536.050.820.029.45.022.36.018.06.017.912.036.323.044.353.037.349.534.038.7
+ FS CoT38.051.036.014.511.042.825.065.613.053.4-36.053.521.034.11.013.63.011.25.014.022.046.721.042.351.034.953.933.337.8
Baichuan213b41.061.037.012.54.052.018.563.415.057.752.245.055.429.034.631.048.834.044.330.039.540.057.445.061.449.037.956.348.848.0
+ CoT40.057.031.010.03.344.620.061.913.058.1-36.041.536.040.939.052.027.038.529.043.246.062.846.064.355.034.554.949.846.7
+ FS43.059.040.042.524.762.127.570.218.058.963.747.060.735.045.737.051.931.041.519.031.873.081.148.059.448.046.163.752.553.7
+ FS CoT45.054.048.047.010.744.427.068.815.055.0-43.057.827.036.738.049.834.040.733.043.072.880.443.060.244.048.556.151.651.7
Baichuan27b35.050.037.04.54.047.910.555.315.054.342.026.041.520.034.720.035.219.031.26.020.422.043.429.047.755.031.649.938.639.7
+ CoT38.043.032.01.05.337.913.058.015.044.2-41.053.528.038.829.039.923.033.218.029.321.041.229.047.254.028.546.742.139.4
+ FS40.050.036.020.028.759.426.566.917.053.049.845.060.730.042.127.037.823.035.710.020.440.057.437.053.051.036.557.344.845.8
+ FS CoT41.050.036.023.513.045.717.558.17.039.2-36.051.229.043.042.052.525.039.320.031.057.070.139.060.249.037.647.749.546.0
Vicuna1.513b35.050.036.015.08.039.221.559.17.034.251.843.060.429.037.038.046.822.037.417.023.214.042.113.043.646.034.046.142.141.1
+ CoT42.051.037.03.01.329.811.550.07.033.7-44.056.931.036.416.038.225.037.713.020.431.049.029.049.151.033.337.842.339.0
+ FS48.057.038.030.57.333.627.557.013.040.339.045.058.323.025.938.042.626.041.418.020.151.061.828.042.656.043.442.543.643.3
+ FS CoT38.059.039.039.510.737.414.045.812.041.6-47.059.527.030.739.048.131.035.926.031.271.077.553.065.552.043.941.650.146.7
Vicuna1.57b37.058.043.05.01.340.49.552.56.032.047.835.047.111.018.520.035.715.025.712.017.314.033.014.046.854.035.843.234.837.1
+ CoT36.050.036.01.51.339.48.549.29.036.2-30.040.914.024.616.026.214.028.512.025.09.027.77.040.354.030.941.633.434.4
+ FS43.057.037.08.53.344.65.542.17.036.867.124.031.912.014.916.021.820.027.517.022.213.034.36.032.254.036.447.729.935.9
+ FS CoT35.054.035.08.02.737.210.047.510.041.3-31.039.913.016.615.026.715.023.816.023.155.066.332.048.143.033.042.035.936.4
FLANT511b53.063.043.00.04.052.014.065.013.047.749.556.061.724.026.831.033.648.052.220.021.884.087.978.083.964.039.853.654.050.3
+ CoT56.066.045.00.04.749.714.563.413.042.7-57.664.423.928.239.041.646.050.228.030.673.079.557.068.955.041.851.952.349.4
+ FS53.065.043.03.54.050.215.565.89.035.060.255.064.323.025.031.033.647.050.621.022.582.087.078.084.565.041.152.854.150.5
+ FS CoT54.068.046.03.54.050.713.064.011.043.7-54.059.419.021.234.036.645.049.220.021.789.093.872.079.766.042.952.853.550.5
Mistral7b47.050.043.026.511.349.815.058.86.023.258.37.028.25.021.44.024.37.022.34.021.72.039.61.031.651.041.647.530.037.3
+ CoT38.056.035.016.513.336.614.549.38.019.3-13.031.311.022.48.021.114.024.912.025.65.034.04.031.261.036.435.131.433.5
+ FS51.057.035.018.012.055.825.571.013.052.963.024.043.512.023.94.021.37.021.27.023.023.048.923.044.957.040.360.735.543.0
+ FS CoT29.062.032.028.59.346.614.549.28.034.4-13.033.711.025.39.026.49.024.39.020.368.078.642.057.450.037.943.439.539.8
ChatGLM36b38.050.034.02.03.034.17.043.614.056.738.920.041.214.031.725.033.817.026.024.032.242.057.030.054.050.031.043.340.739.0
+ CoT27.049.037.00.01.024.83.037.110.044.8-24.041.710.025.427.034.622.028.135.041.228.044.527.052.048.028.335.639.435.7
+ FS37.052.030.00.010.053.09.052.919.054.750.111.019.613.017.025.035.024.030.420.025.627.046.533.053.354.029.852.735.238.2
+ FS CoT32.066.031.00.04.034.88.043.611.044.0-25.043.319.027.523.030.219.025.736.043.050.062.834.056.048.032.340.842.139.2
LLaMA2-Base70b55.061.037.082.040.067.459.085.363.082.774.957.066.736.048.352.061.435.042.525.033.878.085.280.085.461.058.877.660.564.4
+ CoT52.073.039.079.532.662.347.079.125.061.155.064.334.043.050.057.737.045.244.053.182.087.576.081.667.060.967.562.463.0
LLaMA2-Base13b50.054.030.029.519.353.326.566.020.055.664.848.059.334.048.641.049.638.043.434.037.568.078.749.062.758.040.959.954.752.6
+ CoT40.061.037.052.025.359.326.068.811.040.8-46.059.437.049.149.058.434.043.838.044.170.078.055.068.258.047.556.357.454.5
LLaMA2-Base7b26.050.030.020.019.354.520.059.615.045.262.444.054.430.045.342.049.834.041.930.035.850.064.036.053.349.031.555.449.246.3
+ CoT37.052.036.025.521.356.926.567.016.041.9-32.045.627.036.141.050.930.038.051.057.345.059.737.057.750.037.655.349.447.4
Baichuan2-Base13b38.048.033.042.520.754.842.573.011.045.764.950.059.440.054.242.052.731.038.013.021.468.077.350.063.554.040.459.652.651.3
+ CoT50.056.034.047.029.362.022.569.312.043.8-46.058.239.049.639.049.834.040.137.045.673.081.346.065.660.046.858.456.354.2
Baichuan2-Base7b27.066.041.032.528.059.834.569.45.034.359.840.053.835.050.241.049.633.038.518.022.949.065.934.051.055.041.655.848.448.5
+ CoT30.056.034.034.023.357.033.069.512.044.5-41.051.231.040.738.046.426.032.641.746.346.061.543.864.153.038.557.049.548.1
Mistral-Base7b48.053.038.041.034.061.842.576.235.061.858.343.055.930.045.337.049.438.047.837.045.568.076.764.074.853.045.064.556.155.4
+ CoT57.063.035.054.030.061.842.045.729.057.3-51.060.430.046.248.057.237.047.924.033.260.065.958.067.957.052.354.954.554.0
ChatGLM3-Base6b48.070.032.035.03.351.813.562.611.055.061.650.057.224.026.330.035.438.041.522.022.567.076.435.055.958.046.357.846.749.3
+ CoT47.068.032.046.08.753.915.564.313.056.545.052.523.024.530.035.037.040.222.022.572.079.442.060.354.048.358.246.149.1

