TL;DR: Time series forecasting is becoming increasingly important across many domains, so high-quality, diverse benchmarks are crucial for fair evaluation across model families. Such benchmarks also help identify model strengths and limitations, driving progress in the field. GIFT-Eval is a new comprehensive benchmark designed for evaluating general time series forecasting models, particularly foundation models. It introduces a diverse collection of 28 datasets comprising over 144,000 time series and 177 million data points. GIFT-Eval supports both full-shot and zero-shot evaluation by providing train, validation, and test splits for each dataset, along with a non-leaking pretraining dataset to promote robust development and comparison of foundation forecasting models. Moreover, the datasets are analyzed in detail across four time series characteristics and six statistical features, and results are aggregated across all of these characteristics to yield useful insights across model families.
Time series forecasting has become critical in numerous fields, ranging from finance and healthcare to cloud operations. As universal forecasting models emerge, there is a need for diverse benchmarks that support a wide array of datasets, frequencies, and forecasting tasks. In foundation model research especially, such a diverse, high-quality benchmark is crucial to ensure fair evaluation and to expose model weaknesses. Natural Language Processing (NLP) research, for instance, has benefited from diverse benchmarks such as GLUE and MMLU; time series forecasting lacks comparably comprehensive resources. Existing datasets are often narrow in scope, focusing on specific tasks and failing to test models' ability to handle varied forecasting scenarios, particularly in zero-shot settings. Inconsistent data splits across models also increase the risk of data leakage, complicating comparisons.
GIFT-Eval fills these gaps with a comprehensive time series forecasting benchmark that comprises both pretraining and train/test components. It supports a wide range of forecasting tasks, from short- to long-term horizons, and evaluates models in both univariate and multivariate settings, providing much-needed diversity in time series data.
Moreover, GIFT-Eval ensures fair and consistent model evaluation, particularly for foundation models, by offering pretraining data without leakage. It stands apart from previous benchmarks by introducing a broader spectrum of frequencies and prediction lengths, as well as the evaluation of zero-shot forecasting capabilities.
GIFT-Eval consists of two key components: a pretraining collection that does not leak into the evaluation data, and a train/test benchmark spanning 28 datasets with over 144,000 time series and 177 million data points, covering diverse domains, frequencies, prediction lengths, and numbers of variates.
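To make the evaluation setup concrete, below is a minimal sketch of how a benchmark organized this way might be consumed in Python. The `ForecastTask` container and the `model.predict(context, horizon)` interface are illustrative assumptions for this post, not GIFT-Eval's actual API.

```python
# A minimal sketch (not GIFT-Eval's actual API) of how per-dataset
# train/validation/test splits can support zero-shot evaluation.
from dataclasses import dataclass
from typing import List

import numpy as np


@dataclass
class ForecastTask:
    """One evaluation configuration: a series plus its forecasting setup."""
    train: np.ndarray        # history available for training or fine-tuning
    validation: np.ndarray   # held-out window for model selection
    test: np.ndarray         # final window, scored exactly once
    frequency: str           # e.g. "H" for hourly, "D" for daily
    prediction_length: int   # forecast horizon in time steps


def evaluate_zero_shot(model, tasks: List[ForecastTask]) -> float:
    """Score a pretrained model without using the train split for updates."""
    errors = []
    for task in tasks:
        # Condition only on history that precedes the test window.
        context = np.concatenate([task.train, task.validation])
        forecast = model.predict(context, task.prediction_length)  # assumed interface
        errors.append(np.mean(np.abs(forecast - task.test)))
    return float(np.mean(errors))
```

A full-shot run would additionally fit or fine-tune the model on `task.train` before forecasting, using `task.validation` for model selection, while the test window stays untouched in both settings.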
Our paper presents a detailed analysis and benchmarking of 17 models, providing insights into model performance, highlighting strengths, and identifying failure cases to guide the future development of universal time series models. To draw granular insights from the results, we categorize the datasets in our paper according to distinct time series characteristics that influence their structure and modeling: the domain from which the data originates (e.g., finance, healthcare), the frequency of observations (e.g., hourly, daily), the prediction length or forecast horizon, and whether the series is univariate or multivariate. Additionally, time series possess statistical features such as trend, seasonal strength, entropy, the Hurst exponent, stability, and lumpiness, which help capture the patterns and variability within the data. GIFT-Eval considers these characteristics and features to ensure a comprehensive evaluation of forecasting models across diverse real-world scenarios.
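As an illustration of how such features can be quantified, the sketch below follows the widely used STL-based definitions of trend and seasonal strength and the tiled-window definitions of stability and lumpiness. It is a simplified approximation, not necessarily the exact implementation used in the paper.

```python
# Sketch of common time series feature definitions (STL-based strengths and
# tiled-window stability/lumpiness); details may differ from the paper's code.
import numpy as np
from statsmodels.tsa.seasonal import STL


def trend_and_seasonal_strength(series: np.ndarray, period: int) -> tuple:
    """Strength of trend and seasonality in [0, 1], from an STL decomposition (period >= 2)."""
    stl = STL(series, period=period).fit()
    remainder, trend, seasonal = stl.resid, stl.trend, stl.seasonal
    trend_strength = max(0.0, 1.0 - np.var(remainder) / np.var(trend + remainder))
    seasonal_strength = max(0.0, 1.0 - np.var(remainder) / np.var(seasonal + remainder))
    return trend_strength, seasonal_strength


def stability_and_lumpiness(series: np.ndarray, window: int = 24) -> tuple:
    """Stability = variance of window means; lumpiness = variance of window variances."""
    windows = [series[i:i + window] for i in range(0, len(series) - window + 1, window)]
    means = [np.mean(w) for w in windows]
    variances = [np.var(w) for w in windows]
    return float(np.var(means)), float(np.var(variances))
```

Intuitively, a high trend or seasonal strength means the respective component explains most of the variance left after removing the remainder, while high lumpiness signals variance that shifts abruptly between windows, a property the findings below associate with harder-to-forecast domains.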
Experiments were conducted using 17 models, spanning traditional statistical approaches (e.g., ARIMA, Theta), deep learning models (e.g., PatchTST, iTransformer), and foundation models (e.g., Moirai, Chronos). We present the results across five sections, covering key characteristics such as domain, prediction length, frequency, and number of variates, followed by an aggregation of results across all configurations. Here, we share only the gist of our findings, but for a more detailed and fine-grained analysis, interested readers can refer to our full paper.
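The blog does not spell out the evaluation metrics, but scale-free point-forecast metrics such as MASE (mean absolute scaled error) are standard for comparing models across datasets of very different magnitudes. The sketch below shows one common formulation; the paper's exact metric and aggregation choices may differ.

```python
# Illustrative sketch of MASE, a scale-free point-forecast metric often used
# to compare models across heterogeneous datasets; not the paper's exact setup.
import numpy as np


def mase(forecast: np.ndarray, target: np.ndarray,
         history: np.ndarray, seasonality: int = 1) -> float:
    """Mean absolute scaled error: forecast error relative to a seasonal-naive baseline."""
    # Scale: in-sample error of the seasonal-naive forecast on the training history.
    naive_errors = np.abs(history[seasonality:] - history[:-seasonality])
    scale = np.mean(naive_errors)
    return float(np.mean(np.abs(forecast - target)) / scale)


# Example: score one hourly series with daily seasonality (period 24).
history = np.sin(np.arange(500) * 2 * np.pi / 24) + np.random.normal(0, 0.1, 500)
target = np.sin(np.arange(500, 548) * 2 * np.pi / 24)
forecast = target + 0.05
print(mase(forecast, target, history, seasonality=24))
```

Values below 1 mean the model beats the seasonal-naive baseline on that series; averaging such scaled scores across datasets is one simple way to aggregate results over many configurations.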
Foundation models generally outperform both statistical and deep learning models across most domains. However, they face difficulties in domains like Web/CloudOps and Transport, where high entropy, low trend, and lumpiness make the data less predictable for zero-shot foundation models. In contrast, deep learning models perform better in these challenging domains when given full-shot training, likely benefiting from more targeted training data compared to foundation models.
Foundation models excel in short-term forecasts, effectively capturing immediate trends and fluctuations. However, as prediction lengths extend to medium and long-term forecasts, their performance declines, while deep learning models like PatchTST and iTransformer perform better, successfully capturing longer-term dependencies. Although fine-tuning foundation models improves their ability to handle long-term forecasts, a notable performance gap remains between foundation models and deep learning models for medium to long-term predictions. This gap highlights an opportunity for further research to enhance foundation models' ability to manage extended forecast horizons.
For the highest frequency data, such as second-level granularity, statistical models lead the performance. As the frequency shifts to minutely and hourly data, deep learning models begin to dominate. Foundation models seem to struggle while handling the noisy patterns in high-frequency data, where specialized deep learning and statistical models perform better. However, when it comes to lower frequencies, such as daily to yearly data, foundation models consistently outperform other approaches, leveraging their extensive pretraining to capture broader patterns and slower dynamics.
In multivariate settings, deep learning models outperform all others across metrics, while Moirai leads among foundation models but still falls short of deep learning performance. This highlights a gap in foundation model research, where multivariate forecasting remains a challenge compared to deep learning models. Conversely, in univariate scenarios, foundation models, especially the large variant of Moirai, excel, delivering superior performance over their deep learning counterparts.
PatchTST stands out as the top-performing model across all metrics, with Moirai Large consistently ranking second and frequently appearing in the top two across all datasets. PatchTST proves to be a strong generalist, delivering reliable performance across diverse datasets, while Moirai Large excels in specific cases. However, the scaling law (larger models performing better) only holds in select domains like energy and univariate forecasting.
In conclusion, GIFT-Eval is a comprehensive and diverse benchmark designed to evaluate time series forecasting models across key characteristics such as domain, frequency, number of variates, and prediction length. It provides a diverse pretraining dataset and detailed analysis to enable fair comparisons of statistical, deep learning, and foundation models. We hope this benchmark will foster the development of more robust and adaptable foundation models, advancing the field of time series forecasting.
Salesforce AI invites you to dive deeper into the concepts discussed in this blog post (see links below). Connect with us on social media and our website to get regular updates on this and other research projects.
Taha Aksu is a Research Scientist at Salesforce AI Research Asia. His main focus lies in training and evaluating foundation models for time series. He is also interested in bridging the gap between the text and time series modalities.