This episode explores the forecasting capabilities of OpenAI's ChatGPT-3.5 and ChatGPT-4, focusing on a comparative analysis between direct prediction prompts and narrative-based prompts (termed "future narratives"). The core investigation leverages the known training data cutoff of these models (initially September 2021) to evaluate their ability to predict events that occurred in 2022. A later "falsification exercise" in the second paper re-ran experiments with updated training data.
Narrative Prompting Enhances Forecasting Accuracy
A central finding across both sources is that prompting ChatGPT-4 to generate "future narratives" significantly improves its accuracy in forecasting certain events compared to direct prediction prompts. This suggests that framing the request as a storytelling exercise unlocks latent predictive abilities within Large Language Models (LLMs).
From HackerNoon: "Results show that ChatGPT-4 is significantly more accurate when asked to generate future narratives, particularly in predicting economic trends and cultural events. This suggests that narrative-driven AI responses may unlock latent predictive abilities within LLMs."
From arXiv paper: "After analyzing 100 trials, we find that future narrative prompts significantly enhanced ChatGPT-4's forecasting accuracy. This was especially evident in its predictions of major Academy Award winners as well as economic trends..."
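The contrast between the two prompting styles can be sketched as simple templates. This is a hedged illustration only; the exact prompt wording used in the papers is not reproduced here, and the function names and narrator framing are hypothetical.

```python
# Hypothetical sketch of the two prompting styles studied in the papers;
# the actual prompt wording is not reproduced here.

def direct_prompt(event: str) -> str:
    """Ask the model for a prediction outright."""
    return f"Predict the outcome of {event}."

def narrative_prompt(event: str, narrator: str, date: str) -> str:
    """Frame the same request as a story told from a future vantage point."""
    return (
        f"Write a scene set in {date} in which {narrator} "
        f"looks back and recounts the outcome of {event}, "
        f"naming the specific result."
    )

print(direct_prompt("the 2022 Academy Award for Best Actor"))
print(narrative_prompt(
    "the 2022 Academy Award for Best Actor",
    "a film critic", "April 2022",
))
```

The key design difference is that the narrative version never asks for a forecast directly; it asks for a story whose internal consistency requires the model to commit to a concrete outcome.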
Academy Awards Prediction
ChatGPT-4, when using narrative prompts, demonstrated remarkable accuracy in predicting the winners of major Academy Award categories (Best Actor, Best Actress, Supporting Actor/Actress) for the 2022 Oscars. Direct prompting, in contrast, performed poorly and often resulted in refusals to answer ("No Prediction").
From HackerNoon: "But narrative prompting with ChatGPT-4 shows accuracy ranging from 42% (Best Actress, Chastain) to 100% (Best Actor, Will Smith) with one exception. It failed to accurately predict the Best Picture winner."
From arXiv paper: "ChatGPT-4's accuracy significantly improved when the training window included the events being prompted for, achieving 100% accuracy in many instances [in the May 2024 experiments for Best Actor]."
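The accuracy percentages reported for each category are simple hit rates over repeated trials (the arXiv paper describes 100 trials). A minimal sketch with made-up completions, assuming a hit is any response naming the actual winner:

```python
def hit_rate(responses: list[str], actual_winner: str) -> float:
    """Fraction of trials whose response names the actual winner."""
    hits = sum(actual_winner.lower() in r.lower() for r in responses)
    return hits / len(responses)

# Toy data standing in for 100 model completions (hypothetical, not the papers' data).
responses = ["Will Smith wins Best Actor"] * 90 + ["No Prediction"] * 10
print(hit_rate(responses, "Will Smith"))  # → 0.9
```

Refusals ("No Prediction") count against accuracy under this tally, which is why direct prompting scores so poorly.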
Macroeconomic Variable Prediction
While direct prompts for macroeconomic predictions (inflation and unemployment rates) were largely refused by both ChatGPT-3.5 and ChatGPT-4, narrative prompts, particularly those featuring authoritative figures such as Federal Reserve Chair Jerome Powell, yielded more substantive results.
From HackerNoon: "In all cases, direct prompting was even less effective at prediction than it had been with the Academy Awards as ChatGPT refused to answer the prompt altogether when asked to directly predict the future time series of each macroeconomic variable."
From HackerNoon: "The distribution of Powell’s month by month predictions of inflation are on average comparable to the facts contained in the monthly University of Michigan’s consumer expectations survey."
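Comparing the narrative-elicited monthly forecasts against a survey benchmark reduces, in the simplest case, to an average error measure. A sketch with invented numbers (these are not the papers' figures, and mean absolute error is one possible metric, not necessarily the one the authors used):

```python
def mean_abs_error(predicted: list[float], benchmark: list[float]) -> float:
    """Average absolute gap between forecasts and a benchmark series."""
    assert len(predicted) == len(benchmark)
    return sum(abs(p - b) for p, b in zip(predicted, benchmark)) / len(predicted)

# Hypothetical monthly inflation expectations (percent), for illustration only.
powell_narrative = [7.9, 8.1, 8.3, 8.2]
michigan_survey = [8.0, 8.0, 8.2, 8.4]
print(mean_abs_error(powell_narrative, michigan_survey))
```

A small average gap between the two series is the sense in which the narrative predictions are "on average comparable" to the survey expectations.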
The Role of "Hallucination"
The researchers conjecture that the improved accuracy with narrative prompts might be linked to the LLMs' capacity for "hallucinatory narrative construction." By asking the model to create a fictional story set in the future, it may be leveraging its training data in a more effective way to synthesize and extrapolate information.
From HackerNoon: "These findings indicate that narrative prompts leverage the models’ capacity for hallucinatory narrative construction, facilitating more effective data synthesis and extrapolation than straightforward predictions."
From arXiv paper: "Narrative prompting also consistently outperformed direct prompting. These findings indicate that narrative prompts leverage the models' capacity for hallucinatory narrative construction, facilitating more effective data synthesis and extrapolation than straightforward predictions."
OpenAI's Terms of Service and Prediction
The authors suggest that OpenAI might have intentionally made direct prediction difficult due to potential violations of its terms of service, particularly concerning the provision of financial, medical, or legal advice, making high-stakes automated decisions, and facilitating gambling. Storytelling, however, does not directly violate these terms.
From HackerNoon: "While outright prediction does not directly violate OpenAI’s terms of service, we think it is most likely the case based on our experiment that OpenAI has attempted to make it very difficult."
From HackerNoon: "But one thing that does not violate its terms of service is the telling of stories... Our project tests for whether requesting ChatGPT to tell stories may, in fact, unlock its ability to perform accurate forecasting."
Impact of Additional Information
In the context of macroeconomic predictions, providing additional real-world information (such as Russia's invasion of Ukraine) in the narrative prompts sometimes produced less accurate predictions, suggesting the model has difficulty integrating such information appropriately.
From HackerNoon: "Oddly, when prompted with information about Russia’s invasion of Ukraine, Powell’s predictions were systematically lower and less accurate than when that information had not been used to prime ChatGPT."
Falsification Exercise (arXiv Paper)
Repeating the experiments in May 2024, when the training data for both models had been updated to include the 2022 events, resulted in significantly improved accuracy for both direct and narrative prompting, often reaching 100%. This supports the idea that the earlier predictions were based on extrapolations from the training data.
From arXiv paper: "As a falsification exercise, we repeated our experiments in May 2024 at which time the models included more recent training data. ChatGPT-4's accuracy significantly improved when the training window included the events being prompted for, achieving 100% accuracy in many instances."
Prompt Design Importance
The study highlights the critical role of prompt design in eliciting predictive capabilities from LLMs. Narrative framing appears to be a key factor in unlocking more accurate forecasts.
From HackerNoon: "Our findings add to this nascent exploration by underscoring the importance of prompt design in harnessing LLMs for predictive tasks..."
Important Quotes
"Our findings suggest that these prediction machines become unusually accurate under ChatGPT-4 when prompted to tell stories set in the future about the past." (HackerNoon & arXiv)
"But it also suggests that beneath OpenAI’s outward facing consumer product, ChatGPT-4, is a very powerful prediction machine." (HackerNoon & arXiv)
"The poorer accuracy for events outside of the training window suggests that in the 2023 prediction experiments, ChatGPT-4 was forming predictions based solely on its training data." (arXiv)
Conclusion
The research suggests that while directly asking ChatGPT to predict the future may be limited by design and by OpenAI's terms of service, narrative prompts can markedly enhance its forecasting accuracy, particularly with ChatGPT-4. This points to a potential for harnessing the creative text generation capabilities of LLMs for analytical tasks, although the underlying mechanisms and the reliability of these "predictions" require further investigation. The study underscores the significance of prompt engineering and raises ethical concerns about the potential misuse of AI for prediction in sensitive domains. Finally, the sharp improvement in accuracy once the training data included the predicted events shows how strongly these forecasting abilities depend on the information in the model's knowledge base.