HuggingFace Transformers 4.17 : Tutorials : タスクの概要 (翻訳/解説)

翻訳 : (株)クラスキャットセールスインフォメーション
作成日時 : 04/15/2022 (v4.17.0)

* 本ページは、HuggingFace Transformers の以下のドキュメントを翻訳した上で適宜、補足説明したものです：

Tutorials : Summary of the tasks

* サンプルコードの動作確認はしておりますが、必要な場合には適宜、追加改変しています。
* ご自由にリンクを張って頂いてかまいませんが、sales-info@classcat.com までご一報いただけると嬉しいです。

クラスキャット人工知能研究開発支援サービス

◆ クラスキャットは人工知能・テレワークに関する各種サービスを提供しています。お気軽にご相談ください :

人工知能研究開発支援
1. 人工知能研修サービス(経営者層向けオンサイト研修)
2. テクニカルコンサルティングサービス
3. 実証実験(プロトタイプ構築)
4. アプリケーションへの実装
人工知能研修サービス
PoC(概念実証)を失敗させないための支援

◆ 人工知能とビジネスをテーマに WEB セミナーを定期的に開催しています。スケジュール。

お住まいの地域に関係なく Web ブラウザからご参加頂けます。事前登録 が必要ですのでご注意ください。

◆ お問合せ : 本件に関するお問い合わせ先は下記までお願いいたします。

株式会社クラスキャット セールス・マーケティング本部セールス・インフォメーション
sales-info@classcat.com ; Web: www.classcat.com ; ClassCatJP

HuggingFace Transformers : Tutorials : タスクの概要

このページはライブラリを利用するとき最も頻度の高いユースケースを示します。利用可能なモデルはユースケースで多くの様々な configuration と素晴らしい多用途性を可能にします。最も単純なものはここで提示され、質問応答、シークエンス分類、固有表現認識等のようなタスクのための使用方法を紹介します。

これらのサンプルは自動モデル (= auto-models) を活用します、これらは与えられたチェックポイントに従ってモデルをインスタンス化するクラスで、正しいモデル・アーキテクチャを自動的に選択します。詳細については AutoModel ドキュメントを確認してください。より特定的にコードを自由に変更してそれを特定のユースケースに適応させてください。

モデルがタスク上で上手く動作するためには、それはタスクに対応するチェックポイントからロードされなければなりません。これらのチェックポイントは通常は大規模なデータのコーパス上で事前訓練されて特定のタスク上で再調整されます。これは以下を意味しています :

総てのモデルが総てのタスク上で再調整されてはいません。特定のタスク上でモデルを再調整することを望む場合には、examples ディレクトリの run_$TASK.py スクリプトの一つを活用できます。
再調整済みモデルは特定のデータセット上で再調整されました。このデータセットは貴方のユースケースとドメインと重なるかもしれないしそうでないかもしれません。前述のように、貴方のモデルを再調整するために examples スクリプトを利用しても良いですし、あるいは貴方自身の訓練スクリプトを作成しても良いです。

タスク上で推論を行なうため、幾つかのメカニズムがライブラリにより利用可能になっています :

Pipeline : 非常に利用しやすい抽象化で、これらは 2 行ほどの少ないコードを必要とするだけです。
直接モデル利用 : 抽象度は低いですが、トークナイザー (PyTorch/TensorFlow) への直接アクセスと full 推論機能を通してより多くの柔軟性とパワーがあります。

両者のアプローチがここで紹介されます。

Note: ここで提示される総てのタスクは特定のタスク上で再調整された事前訓練済みチェックポイントを利用しています。特定のタスク上で再調整されていないチェックポイントのロードはそのタスクのために使用された追加のヘッドではなくベース transformer 層だけをロードして、ヘッドの重みをランダムに初期化します。
これはランダム出力を生成します。

シークエンス分類

シークエンス分類は与えられたクラス数に従ってシークエンスを分類するタスクです。シークエンス分類のサンプルは GLUE データセットで、これはそのタスクに完全に基づいています。GLUE 分類タスク上でモデルを再調整したいのであれば、run_glue.py, run_glue.py (for tensorflow), run_text_classification.py (for tensorflow) または run_xnli.py スクリプトを利用して良いです。

ここにセンチメント分析を行なうために pipeline を使用するサンプルがあります : シークエンスがポジティブかネガティブかを識別します。それは sst2 上で再調整されたモデルを利用します、これは GLUE タスクです。

これは次のように、スコアとともにラベル (“POSITIVE” or “NEGATIVE”) を返します :

from transformers import pipeline

classifier = pipeline("sentiment-analysis")

result = classifier("I hate you")[0]
print(f"label: {result['label']}, with score: {round(result['score'], 4)}")

result = classifier("I love you")[0]
print(f"label: {result['label']}, with score: {round(result['score'], 4)}")

label: NEGATIVE, with score: 0.9991
label: POSITIVE, with score: 0.9999

2 つのシークエンスが互いの言い換え (= paraphrase) であるかを決定するモデルを使用してシークエンス分類を行なうサンプルがここにあります。そのプロセスは以下です :

チェックポイント名からトークナイザーとモデルをインスタンス化します。モデルは BERT モデルとして識別されてそれをチェックポイントにストアされた重みでロードします。
2 つのセンテンスから、正しいモデル固有の separator、トークン型 id と attention マスクを伴うシークエンスを構築します (これらはトークナイザーにより自動的に作成されます)。
シークエンスをモデルに渡してその結果それは 2 つの利用可能なクラスの一つに分類されます : 0 (not a paraphrase) と 1 (is a paraphrase) 。
クラスに渡る確率を得るために結果の softmax を計算します。
結果をプリントします。

PyTorch

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased-finetuned-mrpc")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased-finetuned-mrpc")

classes = ["not paraphrase", "is paraphrase"]

sequence_0 = "The company HuggingFace is based in New York City"
sequence_1 = "Apples are especially bad for your health"
sequence_2 = "HuggingFace's headquarters are situated in Manhattan"

# The tokenizer will automatically add any model specific separators (i.e.  and ) and tokens to
# the sequence, as well as compute the attention masks.
paraphrase = tokenizer(sequence_0, sequence_2, return_tensors="pt")
not_paraphrase = tokenizer(sequence_0, sequence_1, return_tensors="pt")

paraphrase_classification_logits = model(**paraphrase).logits
not_paraphrase_classification_logits = model(**not_paraphrase).logits

paraphrase_results = torch.softmax(paraphrase_classification_logits, dim=1).tolist()[0]
not_paraphrase_results = torch.softmax(not_paraphrase_classification_logits, dim=1).tolist()[0]

# Should be paraphrase
for i in range(len(classes)):
    print(f"{classes[i]}: {int(round(paraphrase_results[i] * 100))}%")

not paraphrase: 10%
is paraphrase: 90%

# Should not be paraphrase
for i in range(len(classes)):
    print(f"{classes[i]}: {int(round(not_paraphrase_results[i] * 100))}%")

not paraphrase: 94%
is paraphrase: 6%

TensorFlow

from transformers import AutoTokenizer, TFAutoModelForSequenceClassification
import tensorflow as tf

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased-finetuned-mrpc")
model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-cased-finetuned-mrpc")

classes = ["not paraphrase", "is paraphrase"]

sequence_0 = "The company HuggingFace is based in New York City"
sequence_1 = "Apples are especially bad for your health"
sequence_2 = "HuggingFace's headquarters are situated in Manhattan"

# The tokenizer will automatically add any model specific separators (i.e.  and ) and tokens to
# the sequence, as well as compute the attention masks.
paraphrase = tokenizer(sequence_0, sequence_2, return_tensors="tf")
not_paraphrase = tokenizer(sequence_0, sequence_1, return_tensors="tf")

paraphrase_classification_logits = model(paraphrase).logits
not_paraphrase_classification_logits = model(not_paraphrase).logits

paraphrase_results = tf.nn.softmax(paraphrase_classification_logits, axis=1).numpy()[0]
not_paraphrase_results = tf.nn.softmax(not_paraphrase_classification_logits, axis=1).numpy()[0]

# Should be paraphrase
for i in range(len(classes)):
    print(f"{classes[i]}: {int(round(paraphrase_results[i] * 100))}%")

not paraphrase: 10%
is paraphrase: 90%

# Should not be paraphrase
for i in range(len(classes)):
    print(f"{classes[i]}: {int(round(not_paraphrase_results[i] * 100))}%")

not paraphrase: 94%
is paraphrase: 6%

Extractive 質問応答

Extractive (抽出可能な) 質問応答は質問が与えられたときテキストから答えを抽出するタスクです。質問応答データセットの例は SQuAD データセットで、これはこのタスクに完全に基づいています。モデルを SQuAD タスク上で再調整したいのであれば、run_qa.py と run_squad.py (for tensorflow) スクリプトを利用できます。

ここに質問応答を行なう pipeline を使用するサンプルがあります : 質問が与えられたときテキストから答えを抽出します。それは SQuAD 上で微調整されたモデルを利用します。

from transformers import pipeline

question_answerer = pipeline("question-answering")

context = r"""
Extractive Question Answering is the task of extracting an answer from a text given a question. An example of a
question answering dataset is the SQuAD dataset, which is entirely based on that task. If you would like to fine-tune
a model on a SQuAD task, you may leverage the examples/pytorch/question-answering/run_squad.py script.
"""

これはテキストから抽出された答え、信頼度スコアを、“start” と “end” 値と一緒に返します、これらはテキストの抽出された答えの位置です。

result = question_answerer(question="What is extractive question answering?", context=context)
print(
    f"Answer: '{result['answer']}', score: {round(result['score'], 4)}, start: {result['start']}, end: {result['end']}"
)

result = question_answerer(question="What is a good example of a question answering dataset?", context=context)
print(
    f"Answer: '{result['answer']}', score: {round(result['score'], 4)}, start: {result['start']}, end: {result['end']}"
)

Answer: 'the task of extracting an answer from a text given a question', score: 0.6177, start: 34, end: 95
Answer: 'SQuAD dataset', score: 0.5152, start: 147, end: 160

ここにモデルとトークナイザーを使用した質問応答のサンプルがあります。プロセスは以下のようなものです :

チェックポイント名からトークナイザーとモデルをインスタンス化します。モデルは BERT モデルとして識別されそしてそれをチェックポイントにストアされた重みでロードします。
テキストと幾つかの質問を定義します。
質問に渡り反復して、そして正しいモデル固有の separator、トークン型 id と attention マスクで、テキストと現在の質問からシークエンスを構築します。
このシークエンスをモデルに渡します。これは開始と終了位置の両者に対して、シークエンス・トークン全体 (質問とテキスト) に渡るスコアの範囲を出力します。
トークンに対する確率を得るために結果の softmax を計算します。
識別された開始と停止値からトークンを取得して、それらのトークンを文字列に変換します。
結果をプリントします。

PyTorch

from transformers import AutoTokenizer, AutoModelForQuestionAnswering
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")
model = AutoModelForQuestionAnswering.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")

text = r"""
🤗 Transformers (formerly known as pytorch-transformers and pytorch-pretrained-bert) provides general-purpose
architectures (BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet…) for Natural Language Understanding (NLU) and Natural
Language Generation (NLG) with over 32+ pretrained models in 100+ languages and deep interoperability between
TensorFlow 2.0 and PyTorch.
"""

questions = [
    "How many pretrained models are available in 🤗 Transformers?",
    "What does 🤗 Transformers provide?",
    "🤗 Transformers provides interoperability between which frameworks?",
]

for question in questions:
    inputs = tokenizer(question, text, add_special_tokens=True, return_tensors="pt")
    input_ids = inputs["input_ids"].tolist()[0]

    outputs = model(**inputs)
    answer_start_scores = outputs.start_logits
    answer_end_scores = outputs.end_logits

    # Get the most likely beginning of answer with the argmax of the score
    answer_start = torch.argmax(answer_start_scores)
    # Get the most likely end of answer with the argmax of the score
    answer_end = torch.argmax(answer_end_scores) + 1

    answer = tokenizer.convert_tokens_to_string(
        tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end])
    )

    print(f"Question: {question}")
    print(f"Answer: {answer}")

Question: How many pretrained models are available in 🤗 Transformers?
Answer: over 32 +
Question: What does 🤗 Transformers provide?
Answer: general - purpose architectures
Question: 🤗 Transformers provides interoperability between which frameworks?
Answer: tensorflow 2. 0 and pytorch

TensorFlow

from transformers import AutoTokenizer, TFAutoModelForQuestionAnswering
import tensorflow as tf

tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")
model = TFAutoModelForQuestionAnswering.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")

text = r"""
🤗 Transformers (formerly known as pytorch-transformers and pytorch-pretrained-bert) provides general-purpose
architectures (BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet…) for Natural Language Understanding (NLU) and Natural
Language Generation (NLG) with over 32+ pretrained models in 100+ languages and deep interoperability between
TensorFlow 2.0 and PyTorch.
"""

questions = [
    "How many pretrained models are available in 🤗 Transformers?",
    "What does 🤗 Transformers provide?",
    "🤗 Transformers provides interoperability between which frameworks?",
]

for question in questions:
    inputs = tokenizer(question, text, add_special_tokens=True, return_tensors="tf")
    input_ids = inputs["input_ids"].numpy()[0]

    outputs = model(inputs)
    answer_start_scores = outputs.start_logits
    answer_end_scores = outputs.end_logits

    # Get the most likely beginning of answer with the argmax of the score
    answer_start = tf.argmax(answer_start_scores, axis=1).numpy()[0]
    # Get the most likely end of answer with the argmax of the score
    answer_end = tf.argmax(answer_end_scores, axis=1).numpy()[0] + 1

    answer = tokenizer.convert_tokens_to_string(
        tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end])
    )

    print(f"Question: {question}")
    print(f"Answer: {answer}")

Question: How many pretrained models are available in 🤗 Transformers?
Answer: over 32 +
Question: What does 🤗 Transformers provide?
Answer: general - purpose architectures
Question: 🤗 Transformers provides interoperability between which frameworks?
Answer: tensorflow 2. 0 and pytorch

言語モデリング

言語モデリングはモデルをコーパスに適合させるタスク で、ドメイン固有である可能性があります。総てのポピュラーな transformer ベースのモデルは言語モデリングの変種 (e.g. masked 言語モデリングによる BERT、casual 言語モデリングによる GPT-2) を使用して訓練されます。

言語モデリングは事前訓練以外でも有用である可能性があります、例えばモデル分布をドメイン固有にシフトするためです : 非常に大規模なコーパスに対して訓練された言語モデルを使用し、それを新しいニュース・データセットや科学論文 e.g. LysandreJik/arxiv-nlp に再調整します。

Masked 言語モデリング

masked 言語モデリングは masking トークンでシークエンスのトークンをマスクして、モデルに適切なトークンでそのマスクを埋めることを促すタスクです。これはモデルに右側のコンテキスト (マスクの右側のトークン) と左側のコンテキスト (マスクの左側のトークン) の両者に注意を払うことを可能にします。そのような訓練は、SQuAD のような (質問応答、Lewis, Lui, Goyal et al., part 4.2 参照) 双方向コンテキストを必要とするような下流タスクのための強力な基底を作成します。masked 言語モデリング・タスク上でモデルを再調整したい場合には、run_mlm.py スクリプトを活用できます。

ここにシークエンスからのマスクを置き換えるために pipeline を使用するサンプルがあります :

from transformers import pipeline

unmasker = pipeline("fill-mask")

これはマスクが埋められたシークエンス、信頼度スコア、そしてトークナイザー語彙のトークン id を出力します :

from pprint import pprint

pprint(
    unmasker(
        f"HuggingFace is creating a {unmasker.tokenizer.mask_token} that the community uses to solve NLP tasks."
    )
)

[{'score': 0.1793,
  'sequence': 'HuggingFace is creating a tool that the community uses to solve '
              'NLP tasks.',
  'token': 3944,
  'token_str': ' tool'},
 {'score': 0.1135,
  'sequence': 'HuggingFace is creating a framework that the community uses to '
              'solve NLP tasks.',
  'token': 7208,
  'token_str': ' framework'},
 {'score': 0.0524,
  'sequence': 'HuggingFace is creating a library that the community uses to '
              'solve NLP tasks.',
  'token': 5560,
  'token_str': ' library'},
 {'score': 0.0349,
  'sequence': 'HuggingFace is creating a database that the community uses to '
              'solve NLP tasks.',
  'token': 8503,
  'token_str': ' database'},
 {'score': 0.0286,
  'sequence': 'HuggingFace is creating a prototype that the community uses to '
              'solve NLP tasks.',
  'token': 17715,
  'token_str': ' prototype'}]

ここにモデルとトークナイザーを使用して masked 言語モデリングを行なうサンプルがあります。そのプロセスは以下です :

チェックポイント名からトークナイザーとモデルをインスタンス化します。モデルは DistilBERT モデルとして識別されてチェックポイントにストアされている重みとともにそれをロードします。
単語の代わりに tokenizer.mask_token を配置して、マスクされたトークンを持つシークエンスを定義します。
そのシークエンスを ID のリストにエンコードしてそのリスト内のマスクされたトークンの位置を見つけます。
マスク・トークンのインデックスにおける予測を取得します : このテンソルは語彙と同じサイズを持ち、値は各トークンに帰するスコアです。モデルはそれがそのコンテキストで可能性が高いと判断するトークンに、より高いスコアを与えます。
PyTorch topk or TensorFlow top_k メソッドを使用してトップ 5 のトークンを取得します。
マスク・トークンをトークンで置き換えて、結果をプリントします。

PyTorch

from transformers import AutoModelForMaskedLM, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")
model = AutoModelForMaskedLM.from_pretrained("distilbert-base-cased")

sequence = (
    "Distilled models are smaller than the models they mimic. Using them instead of the large "
    f"versions would help {tokenizer.mask_token} our carbon footprint."
)

inputs = tokenizer(sequence, return_tensors="pt")
mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]

token_logits = model(**inputs).logits
mask_token_logits = token_logits[0, mask_token_index, :]

top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()

for token in top_5_tokens:
    print(sequence.replace(tokenizer.mask_token, tokenizer.decode([token])))

Distilled models are smaller than the models they mimic. Using them instead of the large versions would help reduce our carbon footprint.
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help increase our carbon footprint.
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help decrease our carbon footprint.
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help offset our carbon footprint.
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help improve our carbon footprint.

TensorFlow

from transformers import TFAutoModelForMaskedLM, AutoTokenizer
import tensorflow as tf

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")
model = TFAutoModelForMaskedLM.from_pretrained("distilbert-base-cased")

sequence = (
    "Distilled models are smaller than the models they mimic. Using them instead of the large "
    f"versions would help {tokenizer.mask_token} our carbon footprint."
)

inputs = tokenizer(sequence, return_tensors="tf")
mask_token_index = tf.where(inputs["input_ids"] == tokenizer.mask_token_id)[0, 1]

token_logits = model(**inputs).logits
mask_token_logits = token_logits[0, mask_token_index, :]

top_5_tokens = tf.math.top_k(mask_token_logits, 5).indices.numpy()

for token in top_5_tokens:
    print(sequence.replace(tokenizer.mask_token, tokenizer.decode([token])))

Distilled models are smaller than the models they mimic. Using them instead of the large versions would help reduce our carbon footprint.
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help increase our carbon footprint.
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help decrease our carbon footprint.
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help offset our carbon footprint.
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help improve our carbon footprint.

これはモデルにより予測された top 5 トークンを伴う、5 つのシークエンスをプリントします。

Causal 言語モデリング

Causal (因果) 言語モデリングはトークンのシークエンスに続くトークンを予測するタスクです。この状況では、モデルは左側のコンテキスト (マスクの左側のトークン) にのみ注意を払います。そのような訓練は生成タスクのための特に興味深いです。causal 言語モデリング・タスク上でモデルを再調整したい場合、run_clm.py スクリプトを活用できます。

通常、次のトークンは (モデルが入力シークエンスから生成する) 最後の隠れ状態のロジットからサンプリングすることにより予測されます。

ここにトークナイザーとモデルを使用して (トークンの入力シークエンスに続く) 次のトークンをサンプリングするために PreTrainedModel.top_k_top_p_filtering メソッドを利用する例があります。

PyTorch

from transformers import AutoModelForCausalLM, AutoTokenizer, top_k_top_p_filtering
import torch
from torch import nn

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

sequence = f"Hugging Face is based in DUMBO, New York City, and"

inputs = tokenizer(sequence, return_tensors="pt")
input_ids = inputs["input_ids"]

# get logits of last hidden state
next_token_logits = model(**inputs).logits[:, -1, :]

# filter
filtered_next_token_logits = top_k_top_p_filtering(next_token_logits, top_k=50, top_p=1.0)

# sample
probs = nn.functional.softmax(filtered_next_token_logits, dim=-1)
next_token = torch.multinomial(probs, num_samples=1)

generated = torch.cat([input_ids, next_token], dim=-1)

resulting_string = tokenizer.decode(generated.tolist()[0])
print(resulting_string)

Hugging Face is based in DUMBO, New York City, and ...

TensorFlow

from transformers import TFAutoModelForCausalLM, AutoTokenizer, tf_top_k_top_p_filtering
import tensorflow as tf

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = TFAutoModelForCausalLM.from_pretrained("gpt2")

sequence = f"Hugging Face is based in DUMBO, New York City, and"

inputs = tokenizer(sequence, return_tensors="tf")
input_ids = inputs["input_ids"]

# get logits of last hidden state
next_token_logits = model(**inputs).logits[:, -1, :]

# filter
filtered_next_token_logits = tf_top_k_top_p_filtering(next_token_logits, top_k=50, top_p=1.0)

# sample
next_token = tf.random.categorical(filtered_next_token_logits, dtype=tf.int32, num_samples=1)

generated = tf.concat([input_ids, next_token], axis=1)

resulting_string = tokenizer.decode(generated.numpy().tolist()[0])
print(resulting_string)

Hugging Face is based in DUMBO, New York City, and ...

これは元のシークエンスに続く (望ましくは) 首尾一貫した次のトークンを出力します、これは私達のケースでは単語 “is” or “features” です :

次のセクションでは、一度に一つのトークンの代わりに指定された長さまで複数のトークンを生成するために generation_utils.GenerationMixin.generate() を使用する方法を示します。

テキスト生成

テキスト生成 (a.k.a. open-ended テキスト生成) では、その目標は与えられたコンテキストからの継続であるテキストの首尾一貫した部分を作成することです。以下のサンプルは GPT-2 がテキストを生成するために pipeline でどのように使用されるかを示します。デフォルトでは総てのモデルはそれらに相当する configuration で設定されているように、pipeline で使用されるとき Top-K サンプリングを適用します (例えば gpt-2 config 参照)。

from transformers import pipeline

text_generator = pipeline("text-generation")
print(text_generator("As far as I am concerned, I will", max_length=50, do_sample=False))

[{'generated_text': 'As far as I am concerned, I will be the first to admit that I am not a fan of the idea of a
"free market." I think that the idea of a free market is a bit of a stretch. I think that the idea'}]

ここでは、コンテキスト “As far as I am concerned, I will” からモデルは 50 トークンの合計最大長を持つランダムテキストを生成します。内部的には、pipeline() オブジェクトはテキストを生成するためにメソッド PreTrainedModel.generate() を呼び出します。このメソッドに対するデフォルト引数は、引数 max_length and do_sample に対して上で示されたように、pipeline 内で override できます。

下は XLNet とそのトークナイザーを使用するテキスト生成のサンプルです、これは generate() の直接的な呼び出しを伴います :

PyTorch

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("xlnet-base-cased")
tokenizer = AutoTokenizer.from_pretrained("xlnet-base-cased")

# Padding text helps XLNet with short prompts - proposed by Aman Rusia in https://github.com/rusiaaman/XLNet-gen#methodology
PADDING_TEXT = """In 1991, the remains of Russian Tsar Nicholas II and his family
(except for Alexei and Maria) are discovered.
The voice of Nicholas's young son, Tsarevich Alexei Nikolaevich, narrates the
remainder of the story. 1883 Western Siberia,
a young Grigori Rasputin is asked by his father and a group of men to perform magic.
Rasputin has a vision and denounces one of the men as a horse thief. Although his
father initially slaps him for making such an accusation, Rasputin watches as the
man is chased outside and beaten. Twenty years later, Rasputin sees a vision of
the Virgin Mary, prompting him to become a priest. Rasputin quickly becomes famous,
with people, even a bishop, begging for his blessing.   """

prompt = "Today the weather is really nice and I am planning on "
inputs = tokenizer(PADDING_TEXT + prompt, add_special_tokens=False, return_tensors="pt")["input_ids"]

prompt_length = len(tokenizer.decode(inputs[0]))
outputs = model.generate(inputs, max_length=250, do_sample=True, top_p=0.95, top_k=60)
generated = prompt + tokenizer.decode(outputs[0])[prompt_length + 1 :]

print(generated)

Today the weather is really nice and I am planning ...

TensorFlow

from transformers import TFAutoModelForCausalLM, AutoTokenizer

model = TFAutoModelForCausalLM.from_pretrained("xlnet-base-cased")
tokenizer = AutoTokenizer.from_pretrained("xlnet-base-cased")

# Padding text helps XLNet with short prompts - proposed by Aman Rusia in https://github.com/rusiaaman/XLNet-gen#methodology
PADDING_TEXT = """In 1991, the remains of Russian Tsar Nicholas II and his family
(except for Alexei and Maria) are discovered.
The voice of Nicholas's young son, Tsarevich Alexei Nikolaevich, narrates the
remainder of the story. 1883 Western Siberia,
a young Grigori Rasputin is asked by his father and a group of men to perform magic.
Rasputin has a vision and denounces one of the men as a horse thief. Although his
father initially slaps him for making such an accusation, Rasputin watches as the
man is chased outside and beaten. Twenty years later, Rasputin sees a vision of
the Virgin Mary, prompting him to become a priest. Rasputin quickly becomes famous,
with people, even a bishop, begging for his blessing.   """

prompt = "Today the weather is really nice and I am planning on "
inputs = tokenizer(PADDING_TEXT + prompt, add_special_tokens=False, return_tensors="tf")["input_ids"]

prompt_length = len(tokenizer.decode(inputs[0]))
outputs = model.generate(inputs, max_length=250, do_sample=True, top_p=0.95, top_k=60)
generated = prompt + tokenizer.decode(outputs[0])[prompt_length + 1 :]

print(generated)

Today the weather is really nice and I am planning ...

テキスト生成は現在 PyTorch で GPT-2, OpenAi-GPT, CTRL, XLNet, Transfo-XL と Reformer で、そして殆どのモデルについて TensorFlow でも可能です。上のサンプルで見られるように XLNet と Transfo-XL は上手く動作するためにはしばしばパッドされる必要があります。open-ended テキスト生成のためには GPT-2 は通常は良い選択です、何故ならばそれは causal 言語モデリング目的で数百万の web ページで訓練されたからです。

テキスト生成のための様々なデコーディング・ストラテジーをどのように適用するかの詳細については、ここのテキスト生成ブログ投稿も参照してください。

固有表現認識

固有表現認識 (NER) は例えば人物、組織や位置としてトークンを識別するクラスに従ってトークンを分類するタスクです。固有表現認識データセットの例は CoNLL-2003 データセットで、これはそのタスクに完全に基づいています。NER タスク上でモデルを再調整したいのであれば、run_ner.py スクリプトを利用して良いです。

ここに固有表現認識を行なうために pipeline を使用するサンプルがあります、具体的には、トークンを 9 クラスの一つに属するものとして識別することを試みます :

O, 固有表現外 (= Outside of a named entity)
B-MIS, 別の雑多な (= miscellaneous) エンティティの直後の雑多なエンティティの開始
I-MIS, 種々雑多なエンティティ
B-PER, 別の人物名の直後の人物名の開始
I-PER, 人物名
B-ORG, 別の組織の直後の組織の開始
I-ORG, 組織
B-LOC, 別の場所の直後の場所の開始
I-LOC, 場所

それは dbmdz からの @stefan-it により再調整された、CoNLL-2003 上の再調整済みモデルを利用します。

from transformers import pipeline

ner_pipe = pipeline("ner")

sequence = """Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO,
therefore very close to the Manhattan Bridge which is visible from the window."""

これは上で定義された 9 クラスからのエンティティの一つとして識別された総ての単語のリストを出力します。ここに期待される結果があります :

for entity in ner_pipe(sequence):
    print(entity)

{'entity': 'I-ORG', 'score': 0.9996, 'index': 1, 'word': 'Hu', 'start': 0, 'end': 2}
{'entity': 'I-ORG', 'score': 0.9910, 'index': 2, 'word': '##gging', 'start': 2, 'end': 7}
{'entity': 'I-ORG', 'score': 0.9982, 'index': 3, 'word': 'Face', 'start': 8, 'end': 12}
{'entity': 'I-ORG', 'score': 0.9995, 'index': 4, 'word': 'Inc', 'start': 13, 'end': 16}
{'entity': 'I-LOC', 'score': 0.9994, 'index': 11, 'word': 'New', 'start': 40, 'end': 43}
{'entity': 'I-LOC', 'score': 0.9993, 'index': 12, 'word': 'York', 'start': 44, 'end': 48}
{'entity': 'I-LOC', 'score': 0.9994, 'index': 13, 'word': 'City', 'start': 49, 'end': 53}
{'entity': 'I-LOC', 'score': 0.9863, 'index': 19, 'word': 'D', 'start': 79, 'end': 80}
{'entity': 'I-LOC', 'score': 0.9514, 'index': 20, 'word': '##UM', 'start': 80, 'end': 82}
{'entity': 'I-LOC', 'score': 0.9337, 'index': 21, 'word': '##BO', 'start': 82, 'end': 84}
{'entity': 'I-LOC', 'score': 0.9762, 'index': 28, 'word': 'Manhattan', 'start': 114, 'end': 123}
{'entity': 'I-LOC', 'score': 0.9915, 'index': 29, 'word': 'Bridge', 'start': 124, 'end': 130}

シークエンス “Hugging Face” のトークンがどのように組織として識別され、そして “New York City”, “DUMBO” と “Manhattan Bridge” が場所として識別されたかに注意してください。

モデルとトークナイザーを使用する、固有表現認識を行なうサンプルがここにあります。そのプロセスは以下です :

チェックポイント名からトークナイザーとモデルをインスタンス化します。モデルは BERT モデルとして識別されてチェックポイントにストアされた重みでそれをロードします。
“Hugging Face” を組織として “New York City” を場所とするような、既知のエンティティでシークエンスを定義します。
単語を予測にマップできるようにトークンに分解します。最初にシーケンスを完全にエンコードしてデコードすることにより小さいハックを利用します、その結果特殊トークンを含む文字列が残ります。
そのシークエンスを ID にエンコードします (特殊トークンが自動的に追加されます)。
入力をモデルに渡して最初の出力を得ることにより予測を取得します。これは各トークンに対する 9 個の可能なクラスに渡る分布という結果になります。各トークンのための最尤クラスを得るために argmax を取ります。
各トークンをその予測と一緒に zip してそれをプリントします。

PyTorch

from transformers import AutoModelForTokenClassification, AutoTokenizer
import torch

model = AutoModelForTokenClassification.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

sequence = (
    "Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, "
    "therefore very close to the Manhattan Bridge."
)

inputs = tokenizer(sequence, return_tensors="pt")
tokens = inputs.tokens()

outputs = model(**inputs).logits
predictions = torch.argmax(outputs, dim=2)

TensorFlow

from transformers import TFAutoModelForTokenClassification, AutoTokenizer
import tensorflow as tf

model = TFAutoModelForTokenClassification.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

sequence = (
    "Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, "
    "therefore very close to the Manhattan Bridge."
)

inputs = tokenizer(sequence, return_tensors="tf")
tokens = inputs.tokens()

outputs = model(**inputs)[0]
predictions = tf.argmax(outputs, axis=2)

これは対応する予測にマップされた各トークンのリストを出力します。pipeline とは異なり、ここでは総てのトークンは予測を持ちます、何故ならばそのトークンで特定のエンティティが見つからなかったことを意味する “0” th クラスを削除しないからです。

上のサンプルでは、predictions は予測されたクラスに対応する整数値です。クラス番号に対応するクラス名を復元するために model.config.id2label プロパティを使用できます、これは下で示されます :

for token, prediction in zip(tokens, predictions[0].numpy()):
    print((token, model.config.id2label[prediction]))

('[CLS]', 'O')
('Hu', 'I-ORG')
('##gging', 'I-ORG')
('Face', 'I-ORG')
('Inc', 'I-ORG')
('.', 'O')
('is', 'O')
('a', 'O')
('company', 'O')
('based', 'O')
('in', 'O')
('New', 'I-LOC')
('York', 'I-LOC')
('City', 'I-LOC')
('.', 'O')
('Its', 'O')
('headquarters', 'O')
('are', 'O')
('in', 'O')
('D', 'I-LOC')
('##UM', 'I-LOC')
('##BO', 'I-LOC')
(',', 'O')
('therefore', 'O')
('very', 'O')
('close', 'O')
('to', 'O')
('the', 'O')
('Manhattan', 'I-LOC')
('Bridge', 'I-LOC')
('.', 'O')
('[SEP]', 'O')

要約

要約はドキュメントや記事をより短いテキストに要約するタスクです。要約タスク上でモデルを再調整したい場合には、run_summarization.py スクリプトを活用できます。

要約データセットの例は CNN / Daily Mail データセットで、これは長いニュース記事から成りそして要約タスクのために作成されました。モデルを要約タスクで再調整したい場合には、このドキュメントで様々なアプローチが説明されています。

ここに要約を行なうためのパイプラインを使用する例があります。それは CNN / Daily Mail データセット上で再調整された Bart モデルを利用しています。

from transformers import pipeline

summarizer = pipeline("summarization")

ARTICLE = """ New York (CNN)When Liana Barrientos was 23 years old, she got married in Westchester County, New York.
A year later, she got married again in Westchester County, but to a different man and without divorcing her first husband.
Only 18 days after that marriage, she got hitched yet again. Then, Barrientos declared "I do" five more times, sometimes only within two weeks of each other.
In 2010, she married once more, this time in the Bronx. In an application for a marriage license, she stated it was her "first and only" marriage.
Barrientos, now 39, is facing two criminal counts of "offering a false instrument for filing in the first degree," referring to her false statements on the
2010 marriage license application, according to court documents.
Prosecutors said the marriages were part of an immigration scam.
On Friday, she pleaded not guilty at State Supreme Court in the Bronx, according to her attorney, Christopher Wright, who declined to comment further.
After leaving court, Barrientos was arrested and charged with theft of service and criminal trespass for allegedly sneaking into the New York subway through an emergency exit, said Detective
Annette Markowski, a police spokeswoman. In total, Barrientos has been married 10 times, with nine of her marriages occurring between 1999 and 2002.
All occurred either in Westchester County, Long Island, New Jersey or the Bronx. She is believed to still be married to four men, and at one time, she was married to eight men at once, prosecutors say.
Prosecutors said the immigration scam involved some of her husbands, who filed for permanent residence status shortly after the marriages.
Any divorces happened only after such filings were approved. It was unclear whether any of the men will be prosecuted.
The case was referred to the Bronx District Attorney\'s Office by Immigration and Customs Enforcement and the Department of Homeland Security\'s
Investigation Division. Seven of the men are from so-called "red-flagged" countries, including Egypt, Turkey, Georgia, Pakistan and Mali.
Her eighth husband, Rashid Rajput, was deported in 2006 to his native Pakistan after an investigation by the Joint Terrorism Task Force.
If convicted, Barrientos faces up to four years in prison.  Her next court appearance is scheduled for May 18.
"""

要約 pipeline は PretrainedModel.generate() メソッドに依存していますので、下で示されるように pipeline の PretrainedModel.generate() のデフォルト引数を max_length と min_length に対して直接 override することができます。これは次の要約を出力します :

print(summarizer(ARTICLE, max_length=130, min_length=30, do_sample=False))

[{'summary_text': ' Liana Barrientos, 39, is charged with two counts of "offering a false instrument for filing in
the first degree" In total, she has been married 10 times, with nine of her marriages occurring between 1999 and
2002 . At one time, she was married to eight men at once, prosecutors say .'}]

モデルとトークナイザーを使用する要約を行なうサンプルがここにあります。そのプロセスは以下です :

チェックポイント名からトークナイザーとモデルをインスタンス化します。要約は通常は Bart or T5 のようなエンコーダ-デコーダ・モデルを使用して成されます。
要約されるべき記事を定義します。
T5 固有の prefix “summarize: “ を追加します。
要約を生成するために PretrainedModel.generate() メソッドを使用します。

このサンプルでは Google の T5 モデルを利用しています。それは (CNN / Daily Mail を含む) マルチタスク混合データセット上でだけ事前訓練されていますが、それは非常に良い結果を生成します。

PyTorch

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")
tokenizer = AutoTokenizer.from_pretrained("t5-base")

# T5 uses a max_length of 512 so we cut the article to 512 tokens.
inputs = tokenizer("summarize: " + ARTICLE, return_tensors="pt", max_length=512, truncation=True)
outputs = model.generate(
    inputs["input_ids"], max_length=150, min_length=40, length_penalty=2.0, num_beams=4, early_stopping=True
)

print(tokenizer.decode(outputs[0]))

<pad> prosecutors say the marriages were part of an immigration scam. if convicted, barrientos faces two criminal
counts of "offering a false instrument for filing in the first degree" she has been married 10 times, nine of them
between 1999 and 2002.</s>

TensorFlow

from transformers import TFAutoModelForSeq2SeqLM, AutoTokenizer

model = TFAutoModelForSeq2SeqLM.from_pretrained("t5-base")
tokenizer = AutoTokenizer.from_pretrained("t5-base")

# T5 uses a max_length of 512 so we cut the article to 512 tokens.
inputs = tokenizer("summarize: " + ARTICLE, return_tensors="tf", max_length=512)
outputs = model.generate(
    inputs["input_ids"], max_length=150, min_length=40, length_penalty=2.0, num_beams=4, early_stopping=True
)

print(tokenizer.decode(outputs[0]))

<pad> prosecutors say the marriages were part of an immigration scam. if convicted, barrientos faces two criminal
counts of "offering a false instrument for filing in the first degree" she has been married 10 times, nine of them
between 1999 and 2002.

翻訳

翻訳は一つの言語から別の言語へテキストを変換するタスクです。翻訳タスク上でモデルを再調整したい場合には、run_translation.py を活用できます。

翻訳データセットの例は WMT 英独データセットです、これは入力データとして英語のセンテンスをそしてターゲットデータとして独語のセンテンスを持ちます。翻訳タスク上でモデルを再調整したい場合には、様々なアプローチがこのドキュメントで説明されます。

ここに翻訳を行なうための pipeline を使用するサンプルがあります。それは T5 モデルを利用しています、これは (WMT を含む) マルチタスク混合データセット上でのみ事前訓練されましたが、印象的な翻訳結果を生成します。

from transformers import pipeline

translator = pipeline("translation_en_to_de")
print(translator("Hugging Face is a technology company based in New York and Paris", max_length=40))

[{'translation_text': 'Hugging Face ist ein Technologieunternehmen mit Sitz in New York und Paris.'}]

翻訳 pipeline は PretrainedModel.generate() メソッドに依存していますので、上で max_length に対して示されたように pipeline で PretrainedModel.generate() のデフォルト引数を直接 override できます。

ここにモデルとトークナイザーを使用して翻訳を行なうサンプルがあります。そのプロセスは以下です :

チェックポイント名からトークナイザーとモデルをインスタンス化します。要約は通常は Bart or T5 のようなエンコーダ-デコーダ・モデルを使用して成されます。
翻訳されるべきセンテンスを定義します。
T5 固有のプレフィックス “translate English to German: “ を追加します。
翻訳を遂行するために PretrainedModel.generate() メソッドを使用します。

PyTorch

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")
tokenizer = AutoTokenizer.from_pretrained("t5-base")

inputs = tokenizer(
    "translate English to German: Hugging Face is a technology company based in New York and Paris",
    return_tensors="pt",
)
outputs = model.generate(inputs["input_ids"], max_length=40, num_beams=4, early_stopping=True)

print(tokenizer.decode(outputs[0]))

<pad> Hugging Face ist ein Technologieunternehmen mit Sitz in New York und Paris.</s>

TensorFlow

from transformers import TFAutoModelForSeq2SeqLM, AutoTokenizer

model = TFAutoModelForSeq2SeqLM.from_pretrained("t5-base")
tokenizer = AutoTokenizer.from_pretrained("t5-base")

inputs = tokenizer(
    "translate English to German: Hugging Face is a technology company based in New York and Paris",
    return_tensors="tf",
)
outputs = model.generate(inputs["input_ids"], max_length=40, num_beams=4, early_stopping=True)

print(tokenizer.decode(outputs[0]))

<pad> Hugging Face ist ein Technologieunternehmen mit Sitz in New York und Paris.

pipeline サンプルでのように、同じ翻訳を得ます。

以上

Transformers

HuggingFace Transformers 4.17 : Tutorials : タスクの概要

HuggingFace Transformers 4.17 : Tutorials : タスクの概要 (翻訳/解説)

HuggingFace Transformers : Tutorials : タスクの概要

シークエンス分類

Extractive 質問応答

言語モデリング

Masked 言語モデリング

Causal 言語モデリング

テキスト生成

固有表現認識

要約

翻訳

ClassCat® Chatbot

人工知能開発支援

最近の投稿

カテゴリー

Transformers

HuggingFace Transformers 4.17 : Tutorials : タスクの概要

HuggingFace Transformers 4.17 : Tutorials : タスクの概要 (翻訳/解説)

HuggingFace Transformers : Tutorials : タスクの概要

シークエンス分類

Extractive 質問応答

言語モデリング

Masked 言語モデリング

Causal 言語モデリング

テキスト生成

固有表現認識

要約

翻訳

ClassCat® Chatbot

人工知能開発支援

最近の投稿

カテゴリー

タグ