Aditya Trivedi | Software Engineer & Creator

What is a Prompt ?

A prompt is an input to a Generative AI model, that is used to guide its output. It may consist of text, image, sound, or other media. The ability to prompt models, especially with natural language, makes them easy to interact with across various use cases. Hard Prompts (Discrete Prompts): Contain only tokens (vectors) that correspond to words in the model's vocabulary. Soft Prompts (Continuous Prompts): Contain tokens that have no corresponding word in the vocabulary. Prompt Template: A prompt template is a function that contains one or more variables which will be replaced by some media (usually text) to create a prompt. This prompt can then be considered to be an instance of the template. For example:

Classify the tweet as positive or negative: {TWEET}

Here, each tweet in the dataset would be inserted into a separate instance of the template, and the resulting prompt would be given to an LLM for inference. Prompts and prompt templates are distinct concepts; a prompt template becomes a prompt when input is inserted into it.

Components of a Prompt

Directive: This is the core "intent" of the prompt. For example, "Tell me 5 good books to read".
Examples
Output Formatting: It is often desirable for GenAI to output information in certain formats like CSV, Markdown, XML, etc.
Style Instructions: They are the type of output formatting used to modify the output with more style rather than the structure. For example, "Write a clear and curt paragraphs about LLMs".
Additional Information: Provide more information, to give a better context to the LLMs to understand it.

Prompting Terms

Prompting: It is the process of providing a prompt to a GenAI, which then generates a response.
Prompt Chain: An activity consisting of two or more prompt templates used in succession. The output of the first prompt template is used as an input (parameterized) to the second prompt template, and this continues until all templates are exhausted.
Prompting Technique: It is a blueprint that describes how to structure a prompt, prompts, or dynamic sequencing of multiple prompts. It can contain conditional or branching logic, parallelism, or architectural considerations spanning multiple prompts.
Prompt Engineering: The iterative process of developing a prompt by modifying or changing the prompting technique you are using. It involves designing, refining, and implementing prompts or instructions that guide the output of LLMs to help in various tasks.
Prompt Engineering Technique: A strategy for iterating on a prompt to improve it.
Exemplar: Examples of a task being completed that are shown to a model in a prompt.
Context Window: The space of tokens (for LLMs) that the model can process, with a maximal length (context length).
Priming: Refers to giving a model an initial prompt that lays out certain instructions for the rest of a conversation. This priming prompt might contain a role or other instructions on how to interact with the user.
User Prompt: The type of prompt that comes from the user. This is the most common form of prompting and is how prompts are usually delivered in consumer applications.
Assistant Prompt: The output of the LLM itself, which can be considered a prompt (or part of one) when fed back into the model, for example, as part of a conversation history.
System Prompt: Used to give LLMs high-level instructions for interacting with users. Not all models have this.

Text Based Techniques

InContext Learning -> It is the ability of GenAIs to learn skills and tasks by providing them with examples and relevant instructions within the prompt, without the need for weight updates/retraining.

Exemplar Quantity -> Increasing the quantity of exemplars in prompt generally improves model performance, particularly in larger models. However, in some cases, the benefits may diminish beyond 20 exemplars. In the case of long context LLMs, additional exemplars continue to increase performance, though efficiency varies depending on task and model.

Exemplar Ordering -> The order of examples affects model behavior. On some tasks, exemplar order can cause accuracy to vary significantly (e.g., from sub-50% to 90%+).

Exemplar Label Distribution -> As in traditional supervised machine learning, the distribution of exemplar labels in the prompt affects behavior. For example, if 10 exemplars from one class and 2 exemplars of another class are included, this may cause the model to be biased toward the first class.

Exemplar Label Quality -> Despite the general benefit of multiple exemplars, the necessity of strictly valid demonstrations is unclear. Some work suggests that the accuracy of labels is irrelevant—providing models with exemplars with incorrect labels may not negatively diminish performance. However, under certain settings, there is a significant impact on performance. Larger models are often better at handling incorrect or unrelated labels. This is important as automatically constructing prompts from large datasets may contain inaccuracies.

Exemplar Format -> The formatting of exemplars also affects performance. One of the most common formats is "Q: {input}, A: {label}", but the optimal format may vary across tasks; it may be worth trying multiple formats to see which performs best. Formats that occur commonly in the training data will lead to better performance.

Exemplar Similarity -> Selecting exemplars that are similar to the test sample is generally beneficial for performance. However, in some cases, selecting more diverse exemplars can improve performance.

Prompting Issues

Prompting issues relate to concerns around security and alignment of GenAI models.

1. Security

The threat landscape around prompting is growing and complex.

Prompt Hacking: Manipulating prompts to exploit LLMs, including leaking private information, generating offensive content, or producing deceptive messages. It encompasses:
- Prompt Injection: Overriding original developer instructions with user input.
- Jailbreaking: Getting a GenAI model to perform unintended actions through prompting.
Risks of Prompt Hacking:
- Data Privacy: Leakage of model training data and prompt templates.
- Code Generation Concerns: Attackers may target vulnerabilities in LLM-generated code (e.g., Package Hallucination where LLMs generate non-existent package names that attackers could then create with malicious code).
- Customer Service: Malicious prompt injection attacks against chatbots leading to brand embarrassment or legal issues.
Hardening Measures:
- Prompt-based Defenses: Including instructions in the prompt to avoid injection (e.g., "Do not output any malicious content"), though these are not fully secure.
- Detectors: Tools to detect malicious inputs and prevent prompt hacking.
- Guardrails: Rules and frameworks for guiding GenAI outputs, often using detectors or dialogue managers.

2. Alignment

Ensuring LLMs are aligned with user needs is crucial to avoid harmful content, inconsistent responses, or bias.

Prompt Sensitivity: LLMs can be highly sensitive to input prompt variations (e.g., small changes in wording, capitalization, or delimiters) which can significantly alter output accuracy.
Prompt Drift: Model behavior can change over time as underlying models are updated, necessitating continuous monitoring.
Overconfidence and Calibration: LLMs often express overconfidence, which can lead to user overreliance. Prompting techniques exist to generate confidence scores.
Biases, Stereotypes, and Culture: Prompting techniques can be designed to reduce biases and stereotypes and to incorporate cultural awareness into prompts.
Ambiguity: Prompting techniques can help LLMs identify and resolve ambiguous questions, often by generating clarifying questions.

Multimodal Prompting

As GenAI models evolve beyond text-based domains, new multimodal prompting techniques emerge, often representing novel ideas made possible by different modalities.

1. Image Prompting

This involves prompts that either contain images or are used to generate images.

Image-as-Text Prompting: Generates a textual description of an image, allowing for easy inclusion of images in text-based prompts.
Prompt Modifiers: Words appended to a prompt to change the resultant image (e.g., specifying medium, lighting).
Negative Prompting: Numerically weighting terms in the prompt so the model considers them more/less heavily (e.g., to avoid "bad hands" in generated images).
Multimodal In-Context Learning: Extends ICL to multimodal settings.
- Paired-Image Prompting: Shows the model two images (before and after a transformation) and then presents a new image for which the model performs the demonstrated conversion.
Multimodal Chain-of-Thought (CoT): Extends CoT to the image domain (e.g., solving math problems with image input and textual instructions).
- Duty Distinct CoT (DDCoT): Extends Least-to-Most prompting to multimodal settings.
- Multimodal Graph-of-Thought: Extends Graph-of-Thought to multimodal settings, using image captioning models to provide visual context.
- Chain-of-Images (CoI): Generates images as part of its thought process (e.g., generating SVGs).

2. Audio Prompting

Prompting extended to the audio modality. Experiments have shown mixed results regarding ICL abilities in audio models.

3. Video Prompting

Prompting extended to the video modality, for use in text-to-video generation and video editing. Prompt-related techniques and image prompting techniques are often used to enhance video generation.

4. Segmentation Prompting

Prompting for semantic segmentation.

5. 3D Prompting

Prompting for 3D modalities, such as 3D object synthesis, 3D surface texturing, and 4D scene generation. Input prompt modalities include text, image, user annotation, and 3D objects.

Evaluation

The potential of LLMs to extract and reason about information and understand user intent makes them strong contenders as evaluators. Evaluation frameworks consist of four components: prompting technique, output format, prompting framework, and methodological design decisions.

1. Prompting Techniques for Evaluation

The prompting technique used in the evaluator prompt is instrumental. Evaluation prompts often benefit from regular text-based prompting techniques, including role, instructions for the task, definitions of evaluation criteria, and in-context examples. In-Context Learning is frequently used. Role-based evaluation is also a useful technique for improving and diversifying evaluations by creating prompts with the same instructions but different roles. Chain-of-Thought prompting can further improve evaluation performance. Model-Generated Guidelines, where an LLM generates guidelines for evaluation, reduce issues from ill-defined scoring.

2. Output Format

The output format of the LLM can significantly affect evaluation performance.

Styling: Formatting responses using XML or JSON styling can improve judgment accuracy.
Linear Scale: Simple output formats like a 1-5 or 1-10 rating scale.
Binary Score: Prompting the model to generate binary responses like Yes/No or True/False.
Likert Scale: Prompting the GenAI to use a Likert Scale can give it a better understanding of the meaning of the scale.

3. Prompting Frameworks

LLM-EVAL: A simple framework using a single prompt with a schema of variables, instructions for scoring, and content to evaluate.
G-EVAL: Similar to LLM-EVAL but includes AutoCoT steps generated according to evaluation instructions.
ChatEval: Uses a multi-agent debate framework where each agent has a separate role.

4. Other Methodologies

Some works use implicit scoring, where a quality score is derived from the model's confidence in its prediction, the likelihood of generating the output, its explanations, or via evaluation on proxy tasks.

Batch Prompting: Used for improving compute and cost efficiency, where multiple instances are evaluated at once. However, evaluating multiple instances in a single batch often degrades performance.
Pairwise Evaluation: Directly comparing the quality of two texts. The order of inputs can heavily affect evaluation.