淘先锋技术网


Reference: 《探究bert的阅读机理》 ("Exploring How BERT Reads"), from 实验楼 (Shiyanlou)

Loading the model

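import torch
from transformers import DistilBertForQuestionAnswering

# Load the model (MODEL_PATH is assumed to be defined earlier and to point at a DistilBERT checkpoint fine-tuned for QA)
model = DistilBertForQuestionAnswering.from_pretrained(MODEL_PATH)
# Move the model to the available device
DEVICE = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model.to(DEVICE)  # in a notebook, this also displays the model structure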

Load the tokenizer from the same path:

# Load the tokenizer, which turns the input text into token ids
from transformers import DistilBertTokenizerFast
# Initialize the tokenizer
tokenizer = DistilBertTokenizerFast.from_pretrained(MODEL_PATH)

Define the test question and context

question = "in 2011, Beyonce performed for four nights where?"
context = "Her fourth studio album 4 was released on June 28, 2011 in the US. 4 sold 310,000 copies in its first week and debuted atop the Billboard 200 chart, giving Beyonc\u00e9 her fourth consecutive number-one album in the US. The album was preceded by two of its singles \"Run the World (Girls)\" and \"Best Thing I Never Had\", which both attained moderate success. The fourth single \"Love on Top\" was a commercial success in the US. 4 also produced four other singles; \"Party\", \"Countdown\", \"I Care\" and \"End of Time\". \"Eat, Play, Love\", a cover story written by Beyonc\u00e9 for Essence that detailed her 2010 career break, won her a writing award from the New York Association of Black Journalists. In late 2011, she took the stage at New York's Roseland Ballroom for four nights of special performances: the 4 Intimate Nights with Beyonc\u00e9 concerts saw the performance of her 4 album to a standing room only."
# Answer: New York's Roseland Ballroom

Use the tokenizer to concatenate and encode the context and the question

encodings = tokenizer(context, question, truncation=True, padding=True)

The result holds input_ids and attention_mask. Both are converted to tensors and moved to the device:

 
# Encode the input data
encodings = tokenizer(context, question, truncation=True, padding=True)
# Extract the model inputs and add a batch dimension
input_ids = torch.tensor(encodings['input_ids']).unsqueeze(0).to(DEVICE)
attention_mask = torch.tensor(
    encodings['attention_mask']).unsqueeze(0).to(DEVICE)
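To make the concatenation step concrete, here is a minimal sketch of the layout the tokenizer produces for a sequence pair: [CLS], the first sequence, [SEP], the second sequence, [SEP]. The helper build_pair and the toy token lists are hypothetical and only mimic the layout; the real tokenizer also handles subword splitting, truncation and id lookup.

```python
# Illustrative sketch only: build_pair is a hypothetical helper, not a transformers API.
def build_pair(first_tokens, second_tokens):
    """Return (tokens, attention_mask) for a [CLS] A [SEP] B [SEP] layout."""
    tokens = ["[CLS]"] + first_tokens + ["[SEP]"] + second_tokens + ["[SEP]"]
    # With a single, unpadded input every position is a real token, so the mask is all ones.
    attention_mask = [1] * len(tokens)
    return tokens, attention_mask

context_tokens = ["she", "took", "the", "stage"]  # toy stand-in for the context
question_tokens = ["performed", "where", "?"]     # toy stand-in for the question
tokens, attention_mask = build_pair(context_tokens, question_tokens)
print(tokens)
print(attention_mask)
```

Because the context is passed first here, answer positions predicted by the model index directly into the context part of this layout, offset by one for [CLS].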

Make a prediction

# Run the model (no gradients are needed for inference)
with torch.no_grad():
    outputs = model(input_ids, attention_mask=attention_mask)

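The output contains two sets of logits, start_logits and end_logits: for every input token, one score for being the start of the answer and one score for being its end.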
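Logits are raw scores, not probabilities. As a rough illustration, a softmax over the sequence dimension turns them into a distribution over token positions, which can serve as a confidence estimate. The numbers below are made up for the example and are not model output.

```python
import math

def softmax(logits):
    """Turn a list of raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy start logits for a 5-token input (made-up numbers)
toy_start_logits = [0.1, 2.0, 0.3, -1.0, 0.5]
start_probs = softmax(toy_start_logits)
# The most likely start position is the one with the highest probability
best_start = max(range(len(start_probs)), key=start_probs.__getitem__)
print(best_start, round(start_probs[best_start], 3))
```

Softmax preserves the ordering of the scores, so the argmax of the probabilities is the same position as the argmax of the logits.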

Take the argmax of the start logits and of the end logits

start_pred = torch.argmax(outputs['start_logits'], dim=1)
end_pred = torch.argmax(outputs['end_logits'], dim=1)

This gives the predicted start and end positions.
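Taking the two argmax values independently can, in principle, produce an end position that comes before the start. One common alternative, sketched here under assumptions, is to search for the pair (start, end) with the highest combined score subject to start <= end. The function best_span and the max_answer_len limit are illustrative choices, not part of the original walkthrough, and the logits are toy values.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end) maximizing start_logits[start] + end_logits[end]
    subject to start <= end < start + max_answer_len."""
    best, best_score = (0, 0), float("-inf")
    for s, s_score in enumerate(start_logits):
        for e in range(s, min(s + max_answer_len, len(end_logits))):
            score = s_score + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best

# Toy logits where the independent argmax would give start=2, end=1 (an invalid span)
toy_start = [0.1, 0.2, 3.0, 0.1, 0.0]
toy_end = [0.0, 2.5, 0.1, 2.0, 0.3]
print(best_span(toy_start, toy_end))  # (2, 3)
```

For the example in this walkthrough the independent argmax already yields a valid span, so this refinement changes nothing there.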

Convert the token ids (input_ids) back into tokens with convert_ids_to_tokens()

# Extract the predicted answer positions
start_pred = torch.argmax(outputs['start_logits'], dim=1)
end_pred = torch.argmax(outputs['end_logits'], dim=1)
all_tokens = tokenizer.convert_ids_to_tokens(
    input_ids[0].detach().tolist())  # tokens of the encoded input
['[CLS]',
 'her',
 'fourth',
 'studio',
 'album',
 '4',
 'was',
 'released',
 'on',
 'june',
 '28',
 ',',
 '2011',
 'in',
 'the',
 'us',
 '.',
 '4',
 'sold',
 '310',
 ',',
 '000',
 'copies',
 'in',
 'its',
 'first',
 'week',
 'and',
 'debuted',
 'atop',
 'the',
 'billboard',
 '200',
 'chart',
 ',',
 'giving',
 'beyonce',
 'her',
 'fourth',
 'consecutive',
 'number',
 '-',
 'one',
 'album',
 'in',
 'the',
 'us',
 '.',
 'the',
 'album',
 'was',
 'preceded',
 'by',
 'two',
 'of',
 'its',
 'singles',
 '"',
 'run',
 'the',
 'world',
 '(',
 'girls',
 ')',
 '"',
 'and',
 '"',
 'best',
 'thing',
 'i',
 'never',
 'had',
 '"',
 ',',
 'which',
 'both',
 'attained',
 'moderate',
 'success',
 '.',
 'the',
 'fourth',
 'single',
 '"',
 'love',
 'on',
 'top',
 '"',
 'was',
 'a',
 'commercial',
 'success',
 'in',
 'the',
 'us',
 '.',
 '4',
 'also',
 'produced',
 'four',
 'other',
 'singles',
 ';',
 '"',
 'party',
 '"',
 ',',
 '"',
 'countdown',
 '"',
 ',',
 '"',
 'i',
 'care',
 '"',
 'and',
 '"',
 'end',
 'of',
 'time',
 '"',
 '.',
 '"',
 'eat',
 ',',
 'play',
 ',',
 'love',
 '"',
 ',',
 'a',
 'cover',
 'story',
 'written',
 'by',
 'beyonce',
 'for',
 'essence',
 'that',
 'detailed',
 'her',
 '2010',
 'career',
 'break',
 ',',
 'won',
 'her',
 'a',
 'writing',
 'award',
 'from',
 'the',
 'new',
 'york',
 'association',
 'of',
 'black',
 'journalists',
 '.',
 'in',
 'late',
 '2011',
 ',',
 'she',
 'took',
 'the',
 'stage',
 'at',
 'new',
 'york',
 "'",
 's',
 'rose',
 '##land',
 'ballroom',
 'for',
 'four',
 'nights',
 'of',
 'special',
 'performances',
 ':',
 'the',
 '4',
 'intimate',
 'nights',
 'with',
 'beyonce',
 'concerts',
 'saw',
 'the',
 'performance',
 'of',
 'her',
 '4',
 'album',
 'to',
 'a',
 'standing',
 'room',
 'only',
 '.',
 '[SEP]',
 'in',
 '2011',
 ',',
 'beyonce',
 'performed',
 'for',
 'four',
 'nights',
 'where',
 '?',
 '[SEP]']

Get the answer

print("start_pred:", start_pred)
print("end_pred:", end_pred)
# Slice the tokens between the predicted start and end positions (inclusive)
print('Predicted answer: ', ' '.join(all_tokens[start_pred:end_pred+1]))
all_tokens[start_pred:end_pred+1]
['new', 'york', "'", 's', 'rose', '##land', 'ballroom']
' '.join(all_tokens[start_pred:end_pred+1])
"new york ' s rose ##land ballroom"