NLP (77) Dynamic Prompting in Text Completion

This article introduces dynamic prompting (Dynamic Prompting) for text completion (Text Completion) with large language models.

Taking text classification as the example task, we use the TREC dataset and predict categories via the LLM text-completion interface, evaluating the Zero-Shot, Few-Shot, and Dynamic Few-Shot settings to verify that dynamic prompting improves model performance.

Dataset

The Text REtrieval Conference (TREC) Question Classification dataset contains about 5,500 labeled questions in its training set and another 500 questions in its test set.

The dataset has 6 coarse class labels and 50 fine class labels. The average sentence length is 10 words and the vocabulary size is 8,700. The 6 coarse class labels are ABBR, ENTY, DESC, HUM, LOC, and NUM.

The data was collected from four sources: 4,500 English questions published by USC (Hovy et al., 2001), about 500 manually constructed questions for a few rare classes, 894 TREC 8 and TREC 9 questions, and 500 questions from TREC 10 that serve as the test set. The questions were labeled manually.

The dataset is hosted on HuggingFace at https://huggingface.co/datasets/trec and can be loaded with the datasets module as follows:

import openai
from datasets import load_dataset
from sklearn.metrics import classification_report

dataset = load_dataset("trec")

dataset

The output is:

DatasetDict({
    train: Dataset({
        features: ['text', 'coarse_label', 'fine_label'],
        num_rows: 5452
    })
    test: Dataset({
        features: ['text', 'coarse_label', 'fine_label'],
        num_rows: 500
    })
})

The first example in the test set is:

{'text': 'How far is it from Denver to Aspen ?',
 'coarse_label': 5,
 'fine_label': 40}
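
Here coarse_label 5 corresponds to the NUM class, which can be verified with the int2str method of the datasets library's ClassLabel feature:

>>> dataset['test'].features['coarse_label'].int2str(5)
'NUM'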

Next, preprocess the data:

# name of the text and label column
label_type = 'coarse_label'
text_key = "text"
# create mappings of id2class and class2id
id2class = dict((i, label) for i, label in enumerate(dataset['train'].features[label_type].names))
class2id = dict((label, i) for i, label in enumerate(dataset['train'].features[label_type].names))
# create a dictionary with classes as keys, containing all the training examples within each class
class2TrainDataset = dict((label, []) for label in dataset['train'].features[label_type].names)
for example in dataset['train']:
    label = id2class[example[label_type]]
    class2TrainDataset[label].append(example[text_key])

Here, id2class and class2id map label ids to class names and class names to ids, respectively, and class2TrainDataset groups the training examples by class.
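
For reference, printing id2class shows the full mapping, matching the label order listed earlier:

>>> id2class
{0: 'ABBR', 1: 'ENTY', 2: 'DESC', 3: 'HUM', 4: 'LOC', 5: 'NUM'}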

Zero-Shot

First, build the Zero-Shot prompt:

# a prompt for asking LLM to perform a task
task_prompt = "As a Question Answering agent, your goal is to categorize questions into different semantic classes that impose constraints on potential answers, so that they can be utilized in later stages of the question answering process.\nFollowing are the semantic classes: ["
task_prompt += ", ".join([label for label in class2TrainDataset]) + "]"
# a prompt for asking LLM to generate the output for current task
query_prompt = "\nClassify the following question into one of the above classes. Please answer in a single word.\nquestion: "
answer_prompt = "\noutput: "

The Zero-Shot prompt for the first test example is then:

zeroshot_prompt = task_prompt + query_prompt + dataset['test'][0][text_key] + answer_prompt
>>> print(zeroshot_prompt)

As a Question Answering agent, your goal is to categorize questions into different semantic classes that impose constraints on potential answers, so that they can be utilized in later stages of the question answering process.
Following are the semantic classes: [ABBR, ENTY, DESC, HUM, LOC, NUM]
Classify the following question into one of the above classes. Please answer in a single word.
question: How far is it from Denver to Aspen ?
output:

Call the OpenAI model to generate the completion. The helper function below uses the pre-1.0 openai SDK interface:

openai.api_key = "sk-xxx"
model_name = "gpt-3.5-turbo-instruct"

import tiktoken
enc = tiktoken.encoding_for_model(model_name)
# boost the coarse-class label tokens with a positive logit bias
log_bias_dict = {}
for label in dataset['train'].features["coarse_label"].names:
    for token_id in enc.encode(label):
        log_bias_dict[token_id] = 5

# Text completion using GPT
def trim_text(text):
    return text.strip().strip('\n').strip('\\n')

def generate_using_gpt(prompt):
    generated_sentence = ""
    try:
        # Create a completion for the provided prompt and parameters
        response = openai.Completion.create(
            model=model_name,
            prompt=prompt,
            max_tokens=3,
            temperature=0,
            top_p=1,
            stop=None,
            frequency_penalty=0,
            presence_penalty=0.0,
            logit_bias=log_bias_dict
        )

        choices = response.get("choices", "")
        if len(choices) == 0 or "text" not in choices[0]:
            print("Text not generated properly")
            return generated_sentence
        generated_sentence = choices[0]['text'].strip('\\n').strip('\n')

    except openai.error.APIError as e:
        # Handle API error here, e.g. retry or log
        print(f"OpenAI API returned an API Error: {e}")

    except openai.error.AuthenticationError as e:
        # Handle Authentication error here, e.g. invalid API key
        print(f"OpenAI API returned an Authentication Error: {e}")

    except openai.error.APIConnectionError as e:
        # Handle connection error here
        print(f"Failed to connect to OpenAI API: {e}")

    except openai.error.InvalidRequestError as e:
        # Handle invalid request error, e.g. malformed parameters
        print(f"Invalid Request Error: {e}")

    except openai.error.RateLimitError as e:
        # Handle rate limit error
        print(f"OpenAI API request exceeded rate limit: {e}")

    except openai.error.ServiceUnavailableError as e:
        # Handle Service Unavailable error
        print(f"Service Unavailable: {e}")

    except openai.error.Timeout as e:
        # Handle request timeout
        print(f"Request timed out: {e}")
    return generated_sentence

The model is gpt-3.5-turbo-instruct with max_tokens set to 3. To steer the output tokens toward the dataset's coarse class labels, tiktoken is used to obtain the token ids of those labels, and logit_bias boosts those token ids during decoding.
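
As a quick sanity check, the token ids receiving the bias can be inspected directly (the exact ids depend on the tokenizer, so none are listed here):

# print the token ids of each coarse label under the model's tokenizer
for label in dataset['train'].features["coarse_label"].names:
    print(label, enc.encode(label))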

Test on the first example of the test set:

>>> generate_using_gpt(zeroshot_prompt)

'LOC'

Run the Zero-Shot prompt over the full test set:

# prompt without any examples from the training dataset
labels = []
predictions = []
for example in dataset['test']:
    zeroshot_prompt = task_prompt + query_prompt + example[text_key] + answer_prompt
    pred = generate_using_gpt(zeroshot_prompt)
    pred = trim_text(pred)
    labels.append(example[label_type])
    if pred not in class2id:
        predictions.append(-1)
    else:
        predictions.append(class2id[pred])

report = classification_report(labels, predictions, digits=4)

The evaluation results:

              precision    recall  f1-score   support

           0     0.6364    0.7778    0.7000         9
           1     0.4432    0.4149    0.4286        94
           2     0.7154    0.6377    0.6743       138
           3     0.9455    0.8000    0.8667        65
           4     0.8222    0.9136    0.8655        81
           5     0.8195    0.9646    0.8862       113

    accuracy                         0.7380       500
   macro avg     0.7304    0.7514    0.7369       500
weighted avg     0.7336    0.7380    0.7324       500

The weighted average F1 score is 0.7324.

Few-Shot

Next, strengthen the prompt with Few-Shot examples (i.e., In-Context Learning, ICL) by taking the first training example from each class as the demonstrations:

# function to select a few examples from each class in the training dataset
def generateFewshotPrompt(class2TrainDataset, N=3):
    fewshot_prompt = "\nFollowing are some examples."
    for label in class2TrainDataset:
        for example in class2TrainDataset[label][:N]:
            fewshot_prompt += "\nquestion: " + example
            fewshot_prompt += "\noutput: " + label
    return fewshot_prompt

# prompt with one example in each of the classes
fewshot_examples = generateFewshotPrompt(class2TrainDataset, N=1)
fewshot_prompt = task_prompt + fewshot_examples + query_prompt + dataset['test'][0][text_key] + answer_prompt
>>> print(fewshot_prompt)

The Few-Shot prompt for the first test example:

As a Question Answering agent, your goal is to categorize questions into different semantic classes that impose constraints on potential answers, so that they can be utilized in later stages of the question answering process.
Following are the semantic classes: [ABBR, ENTY, DESC, HUM, LOC, NUM]
Following are some examples.
question: What is the full form of .com ?
output: ABBR
question: What films featured the character Popeye Doyle ?
output: ENTY
question: How did serfdom develop in and then leave Russia ?
output: DESC
question: What contemptible scoundrel stole the cork from my lunch ?
output: HUM
question: What sprawling U.S. state boasts the most airports ?
output: LOC
question: When was Ozzy Osbourne born ?
output: NUM
Classify the following question into one of the above classes. Please answer in a single word.
question: How far is it from Denver to Aspen ?
output:

Evaluate the full test set with the Few-Shot prompt:

# prompt is created by adding one example from each of the classes
labels = []
predictions = []
for example in dataset['test']:
    fewshot_prompt = task_prompt + fewshot_examples + query_prompt + example[text_key] + answer_prompt
    pred = generate_using_gpt(fewshot_prompt)
    pred = trim_text(pred)
    labels.append(example[label_type])
    if pred not in class2id:
        predictions.append(-1)
    else:
        predictions.append(class2id[pred])

report = classification_report(labels, predictions, digits=4)

The evaluation results:

              precision    recall  f1-score   support

           0     0.8182    1.0000    0.9000         9
           1     0.5217    0.5106    0.5161        94
           2     0.7727    0.7391    0.7556       138
           3     1.0000    0.8462    0.9167        65
           4     0.8021    0.9506    0.8701        81
           5     0.9474    0.9558    0.9515       113

    accuracy                         0.7980       500
   macro avg     0.8103    0.8337    0.8183       500
weighted avg     0.8001    0.7980    0.7969       500

The weighted average F1 score is now 0.7969.

Dynamic Few-Shot

The Few-Shot prompt already performs far better than the Zero-Shot prompt. Is there still room for improvement?

Can we choose the Few-Shot examples so that they are as similar as possible to the sample being evaluated? This is the idea behind Dynamic Few-Shot: for each test sample, select from each class the k training examples with the highest semantic similarity to it (this article uses k=1).
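
Semantic similarity here means cosine similarity between embedding vectors: for a test question $q$ and a training example $x$ with embeddings $e_q$ and $e_x$,

$$\mathrm{sim}(q, x) = \frac{e_q \cdot e_x}{\lVert e_q \rVert \, \lVert e_x \rVert}$$

The code below computes util.dot_score instead; since all-mpnet-base-v2 normalizes its output embeddings, the dot product matches cosine similarity in this case.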

Measuring semantic similarity requires a base text-embedding model. This article uses all-mpnet-base-v2 with sentence_transformers to embed the texts:

from sentence_transformers import SentenceTransformer, util
import numpy as np
from torch import cuda
device = 'cuda' if cuda.is_available() else 'cpu'

# loading Sentence Transformer based model
model = SentenceTransformer('all-mpnet-base-v2', device=device)

# extract embeddings for a set of examples
def ExtractEmbeddings(examples):
    embedding_ls = []
    for example in examples:
        embedding = model.encode(example)
        embedding_ls.append(embedding)
    return embedding_ls

# extract embeddings for all the training examples
class2TrainDatasetWithEmbedding = {}
for label in class2TrainDataset:
    embeddings = ExtractEmbeddings(class2TrainDataset[label])
    class2TrainDatasetWithEmbedding[label] = [class2TrainDataset[label], embeddings]

The code above loads the all-mpnet-base-v2 model via sentence_transformers, embeds the training examples of every class, and keeps the resulting vectors in memory.
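
Encoding one sentence at a time is slow for roughly 5,500 training examples. As an optional speed-up (not part of the original notebook), model.encode also accepts a list of sentences and batches them internally, so ExtractEmbeddings could be reduced to a one-liner:

# batch-encode a whole class at once; returns one embedding per sentence
def ExtractEmbeddings(examples):
    return model.encode(examples, batch_size=64, convert_to_numpy=True)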

Next, for each test sample, select the single most similar training example from each class to form the Dynamic Few-Shot prompt:

# extract similar queries for a given input text from each of the classes
def getSimilarExamples(input_text, dataset, dataset_embedding):
    input_embedding = model.encode(input_text)
    sim_score = util.dot_score(input_embedding, dataset_embedding)[0]
    topN_ids = np.argsort(-sim_score)
    return [dataset[i] for i in topN_ids]

def getClasswiseSimilarExamples(input_text, class2TrainDatasetWithEmbedding):
    classwiseSimilarExamples = {}
    for label in class2TrainDataset:
        similarExamples = getSimilarExamples(input_text, class2TrainDatasetWithEmbedding[label][0], class2TrainDatasetWithEmbedding[label][1])
        classwiseSimilarExamples[label] = similarExamples
    return classwiseSimilarExamples

# generate a prompt with similar examples in each of the classes
def generateDynamicPrompt(input_text, class2TrainDatasetWithEmbedding, N=3):
    classwiseSimilarExamples = getClasswiseSimilarExamples(input_text, class2TrainDatasetWithEmbedding)
    dynamic_prompt = "\nFollowing are some examples."
    for label in classwiseSimilarExamples:
        for example in classwiseSimilarExamples[label][:N]:
            dynamic_prompt += "\nquestion: " + example
            dynamic_prompt += "\noutput: " + label
    return dynamic_prompt

# dynamic prompt with one similar example in each of the classes
fewshot_examples = generateDynamicPrompt(dataset['test'][0][text_key], class2TrainDatasetWithEmbedding, N=1)
dynamic_prompt = task_prompt + fewshot_examples + query_prompt + dataset['test'][0][text_key] + answer_prompt
>>> print(dynamic_prompt)

The Dynamic Few-Shot prompt for the first test sample is now:

As a Question Answering agent, your goal is to categorize questions into different semantic classes that impose constraints on potential answers, so that they can be utilized in later stages of the question answering process.
Following are the semantic classes: [ABBR, ENTY, DESC, HUM, LOC, NUM]
Following are some examples.
question: What do the letters D.C. stand for in Washington , D.C. ?
output: ABBR
question: What race is 1 , 137 miles long ?
output: ENTY
question: Why is the mile 528 feet ?
output: DESC
question: Who lives at 39 Stone Canyon Way ?
output: HUM
question: What Colorado city owns its own glacier ?
output: LOC
question: How high is the city of Denver ?
output: NUM
Classify the following question into one of the above classes. Please answer in a single word.
question: How far is it from Denver to Aspen ?
output:

The examples in this Dynamic Few-Shot prompt are clearly better matched to the query than those in the static Few-Shot prompt.

Now evaluate the full test set again:

labels = []
predictions = []
for example in dataset['test']:
    fewshot_examples = generateDynamicPrompt(example[text_key], class2TrainDatasetWithEmbedding, N=1)
    dynamic_prompt = task_prompt + fewshot_examples + query_prompt + example[text_key] + answer_prompt
    pred = generate_using_gpt(dynamic_prompt)
    pred = trim_text(pred)
    labels.append(example[label_type])
    if pred not in class2id:
        predictions.append(-1)
    else:
        predictions.append(class2id[pred])

report = classification_report(labels, predictions, digits=4)

The evaluation results:

              precision    recall  f1-score   support

           0     1.0000    0.7778    0.8750         9
           1     0.7083    0.7234    0.7158        94
           2     0.8615    0.8116    0.8358       138
           3     0.9508    0.8923    0.9206        65
           4     0.8824    0.9259    0.9036        81
           5     0.8926    0.9558    0.9231       113

    accuracy                         0.8560       500
   macro avg     0.8826    0.8478    0.8623       500
weighted avg     0.8572    0.8560    0.8557       500

The final weighted average F1 score is 0.8557.

Summary

To summarize: using gpt-3.5-turbo-instruct on the TREC test set, we evaluated the Zero-Shot, Few-Shot, and Dynamic Few-Shot settings and obtained the following metrics:

prompt              weighted avg F1
Zero-Shot                    0.7324
Few-Shot                     0.7969
Dynamic Few-Shot             0.8557

Clearly, the Dynamic Few-Shot prompt performs best, more than 12 percentage points above the Zero-Shot prompt, and all without any fine-tuning of the model!

Dynamic Few-Shot prompting is well worth trying in day-to-day work.
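
As a closing sketch, the pieces above can be tied into a single helper. The function name classify_question is hypothetical, and the expected output assumes the model answers this question correctly, as the dynamic prompt shown earlier suggests it should:

# hypothetical convenience wrapper around the functions defined above
def classify_question(question: str) -> str:
    fewshot_examples = generateDynamicPrompt(question, class2TrainDatasetWithEmbedding, N=1)
    prompt = task_prompt + fewshot_examples + query_prompt + question + answer_prompt
    return trim_text(generate_using_gpt(prompt))

# >>> classify_question("How far is it from Denver to Aspen ?")
# expected: 'NUM'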

References

  1. Basic_Samples/Completions/completions_with_dynamic_prompt.ipynb: https://github.com/Azure-Samples/openai/blob/main/Basic_Samples/Completions/completions_with_dynamic_prompt.ipynb
  2. TREC dataset: https://huggingface.co/datasets/trec