Tired of ChatGPT repeating itself? Try this prompt for more creative answers

Researchers say a simple wording trick can push ChatGPT, Gemini and Claude beyond predictable answers by asking them to generate several options with estimated probabilities

Dr. Yossi Elran and Tal Sokolov, Davidson Institute of Science
When generative artificial intelligence entered our lives in 2022, users marveled at the constant stream of innovations and the steady improvement in output quality from one update to the next. At first, we laughed at the strange answers and bizarre results chatbots sometimes produced. Soon, we also learned to worry about factual errors, ethical problems and other risks tied to AI-generated content.
Computer scientists and engineers are not ignoring these problems. They are trying to solve them through a range of methods designed to align AI outputs with human expectations. But some of those solutions may be creating the opposite problem: They are making the models’ answers less diverse, less surprising and less creative.
AI models respond to efforts meant to make them richer, safer and higher quality by producing answers that are predictable, clichéd and insufficiently varied. Researchers describe this as a trade-off between quality and diversity.
One example comes from a course taught at Western Galilee College. Education students were asked to read a paper about a study in which teachers were asked to identify characteristics of gifted students. The students were then given an assignment: ask an AI system to suggest an additional characteristic that did not appear in the article.
The task required independent and creative thinking, and the expectation was that it would produce a wide range of answers. In practice, many of the responses were strikingly similar, even though they came from several different language models, including ChatGPT, Gemini, Claude and others. The same pattern appeared in answers to other questions that required creative thinking.
Frequent users of generative AI say they have already learned to predict how it will answer. No matter how much they change the wording of their prompts, the hope for real variety often seems to fade, especially when the question is focused. Researchers call this narrowing of outputs into a small number of repeated answers “mode collapse.”
The basic stage in building a generative AI model is pretraining. At this stage, the model processes enormous amounts of information and searches for patterns that later allow it to produce coherent and logical content. Early text models, such as GPT-2 in 2019, relied mainly on this stage. They produced more varied answers, but were less consistent and made frequent mistakes.
One way to improve model results is to add supporting training processes. One of the most important is reinforcement learning from human feedback, or RLHF. After the initial training, outputs are ranked according to human preferences.
The process resembles human learning, in which a teacher, instructor or other authority figure reviews a student’s work and offers feedback. The major difference is that in generative AI, the feedback is eventually provided by another AI model that simulates a human response.
AI companies employ people, often through outsourcing firms, to rank content produced by the model. Those rankings create a dataset of human preferences regarding AI outputs. That dataset is then used to train a reward model, which is itself an AI model: rather than generating text, it assigns a score to each answer.
The reward model is then incorporated into the training of the original model and used to evaluate its outputs. That evaluation simulates how satisfied a human would be with a given answer. Through a process repeated again and again, the reward model guides the generative AI system toward answers that are more likely to receive high human ratings.
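To make the idea of a reward model concrete, here is a minimal sketch in Python. It is a deliberate simplification: real reward models are neural networks that score arbitrary text, while this toy keeps a single score per answer. The fitting rule is the Bradley-Terry model, a standard way to turn pairwise preferences into scores, chosen here for illustration and not taken from the article. The function name, the rankings and the parameters are all hypothetical.

```python
import math

def train_reward_scores(preferences, n_items, lr=0.1, epochs=500):
    """Fit one score per answer from (winner, loser) pairs (Bradley-Terry)."""
    scores = [0.0] * n_items
    for _ in range(epochs):
        for winner, loser in preferences:
            # Probability the current scores assign to the rater's choice.
            p_win = 1.0 / (1.0 + math.exp(scores[loser] - scores[winner]))
            # Gradient ascent on the log-likelihood of that choice.
            scores[winner] += lr * (1.0 - p_win)
            scores[loser] -= lr * (1.0 - p_win)
    return scores

# Hypothetical rankings: answer 0 usually beats answers 1 and 2,
# and answer 1 usually beats answer 2.
preferences = (
    [(0, 1)] * 8 + [(1, 0)] * 2
    + [(0, 2)] * 9 + [(2, 0)] * 1
    + [(1, 2)] * 7 + [(2, 1)] * 3
)
scores = train_reward_scores(preferences, n_items=3)
print(scores)
```

The answer that raters preferred most often ends up with the highest score. In training, scores of this kind are the signal that steers the chatbot toward answers people are expected to rate highly.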
These refinements are meant to improve the models. But could the result run counter to the good intentions behind them? Could the very steps designed to make AI better also encourage a form of deterioration?
Mode collapse, which occurs when models converge on a narrow and fixed range of outputs, has been linked in part to repeated cycles of feedback and adaptation. The phenomenon was already described in relation to early image generators known as generative adversarial networks, or GANs, introduced in 2014.
Those early image models were built as a loop between a generator and a classifier, usually called the discriminator. The generator trained itself to create images, such as convincing images of human faces, while the classifier trained itself to distinguish between real images and those created by the generator. The generator’s goal was to fool the classifier, while the classifier’s goal was to catch the generator. As the competition developed, the generator’s images became more realistic.
Researchers found that in such a process, the generator could converge on a particular style of content and avoid others. It would settle into a safe zone and give up on the improvement that comes from creating different types of content. As a result, its ability to produce varied outputs was harmed.
A human comparison would be an artist who begins as a free creator but receives praise mainly for one type of work. An insecure artist might narrow their own creativity and keep producing only the kind of work that won recognition, out of fear that any departure from it will not be rewarded.
Recent studies show that even the newest generators behind popular chatbots are at risk of mode collapse that develops through feedback processes. Reinforcement learning from human feedback may push the model toward a narrower range of content.
Some researchers blame the way models weigh feedback. When especially rare votes are excluded from the feedback calculations, the process gradually pushes the generator toward more uniform content that receives “majority approval.”
A group of researchers from Stanford, Northeastern and West Virginia universities offers a simpler explanation: One of the main causes of mode collapse may be the lack of diversity in human feedback itself. In other words, the problem begins with the people ranking the answers.
From a psychological perspective, people often prefer familiar content. They tend to rank common, recognizable and easy-to-digest answers more highly, treating them as better outputs. That tendency shapes the training data and can also allow social biases to seep into the model’s training sets. More creative outputs that do not follow the familiar path are buried at the bottom of chatbot response preferences and appear only rarely.
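The narrowing effect can be illustrated with a toy simulation. The sketch below is not the researchers' method and does not reproduce how any real chatbot is trained. It assumes five hypothetical answers that start out equally likely, an invented "familiarity" reward for each, and a simple update rule that boosts each answer in proportion to its reward. Diversity is measured with Shannon entropy.

```python
import math

def entropy(probs):
    """Shannon entropy in bits; lower means less diverse output."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def reinforce(probs, rewards, strength=1.0):
    """Reweight each answer by exp(strength * reward), then renormalize."""
    weighted = [p * math.exp(strength * r) for p, r in zip(probs, rewards)]
    total = sum(weighted)
    return [w / total for w in weighted]

# Five hypothetical answers, initially equally likely.
probs = [0.2] * 5
# Assumed rater behavior: the more familiar the answer, the higher the reward.
familiarity_reward = [1.0, 0.8, 0.5, 0.3, 0.1]

history = [entropy(probs)]
for _ in range(10):
    probs = reinforce(probs, familiarity_reward)
    history.append(entropy(probs))

print([round(p, 3) for p in probs])
print([round(h, 3) for h in history])
```

With each round, the most familiar answer takes a larger share of the probability and the entropy falls. The less familiar answers never disappear entirely, but they become rare, which matches the pattern described above.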
So how can chatbots be made more creative?
Researchers suggest a simple prompting method called verbalized sampling. The idea is to explicitly ask the model to produce several possible answers and attach to each one the probability estimate the model assigns to it.
Text generators build content gradually, choosing at each step the next piece of text from a range of possibilities, each weighted by how likely it is to fit the preceding text. Every output has a probability value estimated by the system, and alternative outputs have their own probability values as well.
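A short Python sketch shows what choosing the next piece of text can look like. The words and probabilities are invented for illustration, and real models work with vocabularies of many thousands of tokens. The sketch compares two strategies: always taking the single most likely word, and drawing a word at random in proportion to its probability, which is closer to how chatbots usually generate text.

```python
import random

# Hypothetical next-word probabilities after "The elephant walked into a".
next_word_probs = {"bar": 0.55, "room": 0.25, "library": 0.12, "spaceship": 0.08}

def pick_most_likely(probs):
    """Always return the single most probable word."""
    return max(probs, key=probs.get)

def sample(probs, rng):
    """Draw one word at random, weighted by its probability."""
    words = list(probs)
    return rng.choices(words, weights=[probs[w] for w in words], k=1)[0]

rng = random.Random(0)  # fixed seed so repeated runs give the same draws
greedy = {pick_most_likely(next_word_probs) for _ in range(1000)}
sampled = {sample(next_word_probs, rng) for _ in range(1000)}

print("always most likely:", greedy)
print("weighted sampling:", sampled)
```

Picking the top word every time yields the same continuation on every run. Weighted sampling also reaches the less likely words, but only as often as their probabilities allow, so when one option dominates, most outputs still look alike.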
The model will probably not volunteer the exact probabilities calculated during the process. But the explicit request for outputs of varying likelihood appears to push the generator to produce more diverse results. The researchers' experiments show that the method significantly increases the creativity and diversity of model answers without harming accuracy or safety. It works especially well with the most advanced models currently on the market.
Suppose you want the chatbot to tell you a joke about an elephant or a story about a tiger. Before the instruction, write: “Create five different responses to the following request, each with its probability.” That is all.
According to the researchers, writing “Create five responses to the following request, each with its probability: Tell me a joke about an elephant” produces more varied answers than a focused question such as “Tell me a joke about an elephant,” and even better results than simply asking for a list, such as “Tell me five jokes about an elephant.”
The reason is that this “magic prompt” guides the bot to use outputs closer to the edges of its creative range. Alongside routine answers, the user receives more creative ones.
Another option tested by researchers is to explicitly ask the chatbot for responses from the edges of its probability distribution. But that creates the opposite risk: too many fringe responses. A better version might be: “Create five responses to this question and present the probability of each. One response should come from the edge of your probability range.”
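For readers who send prompts from a script, the wording quoted above can be wrapped in a small helper. This is a sketch, not an official tool: the function names are invented, the count is written as a digit where the article spells it out, and the reply format the parser expects (numbered lines ending with a probability in parentheses) is an assumption, since a chatbot may format its answer differently.

```python
import re

def verbalized_sampling_prompt(request, n=5, include_tail=False):
    """Prefix a request with the wording quoted in the article."""
    prefix = f"Create {n} responses to the following request, each with its probability."
    if include_tail:
        prefix += " One response should come from the edge of your probability range."
    return f"{prefix}\n\n{request}"

def parse_reply(reply):
    """Extract (text, probability) pairs from lines like '1. Some joke (0.25)'.

    The line format is assumed; lines that do not match are skipped.
    """
    pattern = re.compile(r"^\s*\d+\.\s*(.+?)\s*\((\d*\.?\d+)\)\s*$")
    pairs = []
    for line in reply.splitlines():
        match = pattern.match(line)
        if match:
            pairs.append((match.group(1), float(match.group(2))))
    return pairs

prompt = verbalized_sampling_prompt("Tell me a joke about an elephant", include_tail=True)
print(prompt)
```

The text returned by the helper can be pasted into any chatbot. The probabilities that come back are the model's own estimates, not the exact values computed internally, so they are best treated as a rough guide to which answers are the unusual ones.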
This method belongs to the wider practice of prompt engineering, meaning ways of shaping a request so that the bot produces content outside its usual preferences. Some related techniques, known as jailbreaks, are designed to bypass safety mechanisms in language models, but careful prompting can also be used to draw out higher-quality and more diverse answers.
Such tricks will not prevent AI models from deteriorating. Developers will need to address that challenge in other ways. But for users, they may offer a back door for getting a little more out of the models.
Does that mean we should add this sentence to every chat from now on? Not necessarily. The “magic prompt” is most useful when users want to avoid clichéd answers and generate creative text. It is also likely that companies behind tools such as ChatGPT, Gemini and Claude will integrate verbalized sampling directly into their products, activating it when diversity is needed, if they have not done so already.
And given the speed of AI development and the fluid nature of prompt engineering, it is possible that by the time this article is published, a different “magic prompt” or a new method will already make this one unnecessary.
Still, the prompt offers a glimpse into the complex relationship between humans and artificial intelligence. It also raises a larger question about the ambition to create AI that is more human, but not too human.