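HuggingFace Transformers 4.17 : Guide : Fine-Tune for Downstream Tasks – Language modeling

Posted by: sales-info in HuggingFace Transformers. Posted on: 04/29/2022

HuggingFace Transformers 4.17 : Guide : Fine-Tune for Downstream Tasks – Language modeling (translation / commentary)

Translation: ClassCat Co., Ltd. Sales Information
Created: 04/29/2022 (v4.17.0)

* This page is a translation of the following HuggingFace Transformers document, with supplementary explanations added where appropriate:

How-To Guide : Fine-Tune for Downstream Tasks : Language modeling

* The sample code has been checked to run, but it has been extended or modified where necessary.
* Feel free to link to this page; we would appreciate a note to sales-info@classcat.com if you do.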

HuggingFace Transformers : Guide : Fine-Tune for Downstream Tasks – Language modeling

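Language modeling predicts words in a sentence. There are two forms of language modeling.

Causal language modeling predicts the next token in a sequence of tokens, and the model can only attend to tokens on the left.

Masked language modeling predicts a masked token in a sequence, and the model can attend to tokens bidirectionally.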
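
To make the difference between the two forms concrete, here is a minimal plain-Python sketch of which positions each token is allowed to attend to. The helper names `causal_mask` and `bidirectional_mask` are illustrative assumptions and are not part of Transformers; real models build equivalent masks internally as tensors.

```python
def causal_mask(seq_len):
    """Row i marks the positions token i may attend to: itself and everything to its left."""
    return [[1 if j <= i else 0 for j in range(seq_len)] for i in range(seq_len)]


def bidirectional_mask(seq_len):
    """Every token may attend to every position, as in masked language modeling."""
    return [[1] * seq_len for _ in range(seq_len)]


# Print the causal pattern for a 4-token sequence, one row per token.
for row in causal_mask(4):
    print(row)
```

The causal pattern is lower triangular, while the bidirectional pattern is filled everywhere.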

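This guide shows how to fine-tune DistilGPT2 for causal language modeling and DistilRoBERTa for masked language modeling on the r/askscience subset of the ELI5 dataset.

Note: You can fine-tune other architectures for language modeling, such as GPT-Neo, GPT-J, and BERT, by following the same steps presented in this guide.
See the text generation task page and the fill mask task page for more information about their associated models, datasets, and metrics.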

Load ELI5 dataset

Load only the first 5000 rows of the ELI5 dataset from the Datasets library, since it is fairly large:

from datasets import load_dataset

eli5 = load_dataset("eli5", split="train_asks[:5000]")

Split this dataset into a train and a test set:

eli5 = eli5.train_test_split(test_size=0.2)

Then take a look at an example:

eli5["train"][0]

{'answers': {'a_id': ['c3d1aib', 'c3d4lya'],
  'score': [6, 3],
  'text': ["The velocity needed to remain in orbit is equal to the square root of Newton's constant times the mass of earth divided by the distance from the center of the earth. I don't know the altitude of that specific mission, but they're usually around 300 km. That means he's going 7-8 km/s.\n\nIn space there are no other forces acting on either the shuttle or the guy, so they stay in the same position relative to each other. If he were to become unable to return to the ship, he would presumably run out of oxygen, or slowly fall into the atmosphere and burn up.",
   "Hope you don't mind me asking another question, but why aren't there any stars visible in this photo?"]},
 'answers_urls': {'url': []},
 'document': '',
 'q_id': 'nyxfp',
 'selftext': '_URL_0_\n\nThis was on the front page earlier and I have a few questions about it. Is it possible to calculate how fast the astronaut would be orbiting the earth? Also how does he stay close to the shuttle so that he can return safely, i.e is he orbiting at the same speed and can therefore stay next to it? And finally if his propulsion system failed, would he eventually re-enter the atmosphere and presumably die?',
 'selftext_urls': {'url': ['http://apod.nasa.gov/apod/image/1201/freeflyer_nasa_3000.jpg']},
 'subreddit': 'askscience',
 'title': 'Few questions about this space walk photograph.',
 'title_urls': {'url': []}}

Notice that text is a subfield nested inside the answers dictionary. When you preprocess the dataset, you need to extract the text subfield into a separate column.

Preprocess

For causal language modeling, load the DistilGPT2 tokenizer to process the text subfield:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")

For masked language modeling, load the DistilRoBERTa tokenizer instead:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilroberta-base")

Extract the text subfield from its nested structure with the flatten method:

eli5 = eli5.flatten()
eli5["train"][0]

{'answers.a_id': ['c3d1aib', 'c3d4lya'],
 'answers.score': [6, 3],
 'answers.text': ["The velocity needed to remain in orbit is equal to the square root of Newton's constant times the mass of earth divided by the distance from the center of the earth. I don't know the altitude of that specific mission, but they're usually around 300 km. That means he's going 7-8 km/s.\n\nIn space there are no other forces acting on either the shuttle or the guy, so they stay in the same position relative to each other. If he were to become unable to return to the ship, he would presumably run out of oxygen, or slowly fall into the atmosphere and burn up.",
  "Hope you don't mind me asking another question, but why aren't there any stars visible in this photo?"],
 'answers_urls.url': [],
 'document': '',
 'q_id': 'nyxfp',
 'selftext': '_URL_0_\n\nThis was on the front page earlier and I have a few questions about it. Is it possible to calculate how fast the astronaut would be orbiting the earth? Also how does he stay close to the shuttle so that he can return safely, i.e is he orbiting at the same speed and can therefore stay next to it? And finally if his propulsion system failed, would he eventually re-enter the atmosphere and presumably die?',
 'selftext_urls.url': ['http://apod.nasa.gov/apod/image/1201/freeflyer_nasa_3000.jpg'],
 'subreddit': 'askscience',
 'title': 'Few questions about this space walk photograph.',
 'title_urls.url': []}

Each subfield is now a separate column, as indicated by the answers prefix. Notice that answers.text is a list. Instead of tokenizing each sentence separately, convert the list to a string so the sentences can be tokenized jointly.
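
As a rough illustration of what flattening and joining do to a single record, the following plain-Python sketch turns nested dictionaries into dotted column names and then joins the list of answers into one string. The helper `flatten_record` and the toy record are assumptions made for this example; the Datasets library performs the flattening itself.

```python
def flatten_record(record, parent_key=""):
    """Turn nested dicts into a flat dict whose keys are joined with dots."""
    flat = {}
    for key, value in record.items():
        name = f"{parent_key}.{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten_record(value, name))
        else:
            flat[name] = value
    return flat


# Toy record with the same nesting as the ELI5 example (contents shortened).
record = {
    "q_id": "nyxfp",
    "answers": {"a_id": ["c3d1aib", "c3d4lya"], "text": ["First answer.", "Second answer."]},
}
flat = flatten_record(record)
joined = " ".join(flat["answers.text"])  # one string per example
print(sorted(flat))
print(joined)
```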

Here is how you can create a preprocessing function that converts the list to a string and truncates sequences so they are no longer than DistilGPT2's maximum input length:

def preprocess_function(examples):
    return tokenizer([" ".join(x) for x in examples["answers.text"]], truncation=True)

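Use the Datasets map function to apply the preprocessing function over the entire dataset. You can speed up map by setting batched=True to process multiple elements of the dataset at once, and by increasing the number of processes with num_proc. Remove the columns you don't need: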

tokenized_eli5 = eli5.map(
    preprocess_function,
    batched=True,
    num_proc=4,
    remove_columns=eli5["train"].column_names,
)
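
With batched=True, the preprocessing function receives a dictionary that maps each column name to a list of values rather than a single row. The sketch below imitates that calling convention in plain Python; `batched_map` and `toy_preprocess` are hypothetical stand-ins written for this example, not the Datasets implementation.

```python
def batched_map(rows, function, batch_size=2):
    """Call `function` on column-wise batches (dict of lists) and collect the output columns."""
    output = {}
    for start in range(0, len(rows), batch_size):
        chunk = rows[start : start + batch_size]
        # A batch maps each column name to the list of values in this chunk.
        batch = {key: [row[key] for row in chunk] for key in chunk[0]}
        for key, values in function(batch).items():
            output.setdefault(key, []).extend(values)
    return output


def toy_preprocess(examples):
    # Stand-in for the tokenizer: join each list of answers, then split on whitespace.
    return {"tokens": [" ".join(x).split() for x in examples["answers.text"]]}


rows = [
    {"answers.text": ["a b", "c"]},
    {"answers.text": ["d"]},
    {"answers.text": ["e f g"]},
]
result = batched_map(rows, toy_preprocess)
print(result["tokens"])
```

Only the columns returned by the function are kept here, which mirrors the effect of removing the original columns.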

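Next, you need a second preprocessing function that captures text truncated from any lengthy examples, to prevent loss of information. This preprocessing function should:

Concatenate all the text.
Split the concatenated text into smaller chunks defined by block_size.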

block_size = 128


def group_texts(examples):
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result
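
To see what this grouping does, here is a toy run on a tiny hand-made batch with a small block size. The function `group_texts_toy` repeats the logic above with `block_size` as a parameter; the `drop_remainder` flag is an optional variant added for this sketch (an assumption, not used in the guide) that discards the short final chunk.

```python
def group_texts_toy(examples, block_size, drop_remainder=False):
    """Concatenate every column and cut it into chunks of `block_size` tokens."""
    concatenated = {k: sum(examples[k], []) for k in examples}
    total_length = len(concatenated[next(iter(examples))])
    if drop_remainder:
        # Optional variant: keep only full blocks by trimming the tail.
        total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result


batch = {
    "input_ids": [[1, 2, 3], [4, 5, 6, 7, 8], [9, 10]],
    "attention_mask": [[1, 1, 1], [1, 1, 1, 1, 1], [1, 1]],
}
print(group_texts_toy(batch, block_size=4)["input_ids"])
print(group_texts_toy(batch, block_size=4, drop_remainder=True)["input_ids"])
```

Note that without the trimming step the last chunk can be shorter than block_size, which is why padding in the data collator still matters.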

Apply the group_texts function over the entire dataset:

lm_dataset = tokenized_eli5.map(group_texts, batched=True, num_proc=4)

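For causal language modeling, use DataCollatorForLanguageModeling to create a batch of examples. It also dynamically pads your text to the length of the longest element in its batch, so all elements have a uniform length. While it is possible to pad your text in the tokenizer function by setting padding=True, dynamic padding is more efficient.

You can use the end-of-sequence token as the padding token and set mlm=False. This uses the inputs as labels, shifted to the right by one element.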

PyTorch

from transformers import DataCollatorForLanguageModeling

tokenizer.pad_token = tokenizer.eos_token
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

TensorFlow

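from transformers import DataCollatorForLanguageModeling

# The DistilGPT2 tokenizer defines no padding token, so reuse the end-of-sequence token.
tokenizer.pad_token = tokenizer.eos_token
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False, return_tensors="tf")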
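
The following plain-Python sketch illustrates the two ideas described for the causal collator: padding each batch to its longest sequence, and using the inputs themselves as labels so that each position is scored against the next token. The functions `collate_causal` and `next_token_pairs` are hypothetical helpers for this illustration; the sketch assumes that padded positions receive the label -100 so that the loss skips them, and that the one-position shift is applied when the loss is computed.

```python
IGNORE_INDEX = -100  # label value that the loss skips (assumed convention)


def collate_causal(batch, pad_token_id):
    """Pad each sequence to the longest one in the batch and build causal LM labels."""
    max_len = max(len(ids) for ids in batch)
    input_ids, attention_mask, labels = [], [], []
    for ids in batch:
        padding = [pad_token_id] * (max_len - len(ids))
        padded = ids + padding
        input_ids.append(padded)
        attention_mask.append([1] * len(ids) + [0] * len(padding))
        # Labels copy the inputs; positions holding the pad token id are ignored.
        labels.append([IGNORE_INDEX if t == pad_token_id else t for t in padded])
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}


def next_token_pairs(input_ids, labels):
    """List (context, target) pairs: the tokens up to position i predict label i + 1."""
    pairs = []
    for i in range(len(labels) - 1):
        if labels[i + 1] != IGNORE_INDEX:
            pairs.append((input_ids[: i + 1], labels[i + 1]))
    return pairs


collated = collate_causal([[5, 6, 7], [8, 9]], pad_token_id=0)
print(collated["labels"])
print(next_token_pairs(collated["input_ids"][0], collated["labels"][0]))
```

One consequence of reusing the end-of-sequence token for padding is that, under this rule, genuine end-of-sequence tokens are also ignored in the labels.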

For masked language modeling, use the same DataCollatorForLanguageModeling, except that you should specify mlm_probability to randomly mask tokens each time you iterate over the data.

PyTorch

from transformers import DataCollatorForLanguageModeling

tokenizer.pad_token = tokenizer.eos_token
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

TensorFlow

from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15, return_tensors="tf")
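
As a sketch of what random masking with mlm_probability amounts to, the plain-Python function below selects each non-special token with the given probability, stores the original token as the label at selected positions, and ignores all other positions. It assumes the commonly used 80/10/10 rule (mask token, random token, unchanged) and the label -100 for ignored positions; `mask_tokens` and the toy ids are hypothetical and do not reproduce the library's tensor implementation.

```python
import random

IGNORE_INDEX = -100  # label value that the loss skips (assumed convention)


def mask_tokens(input_ids, mask_token_id, vocab_size, special_ids=(), mlm_probability=0.15, rng=None):
    """Return (masked_inputs, labels) for one sequence using an 80/10/10 replacement rule."""
    rng = rng or random.Random()
    masked, labels = [], []
    for token in input_ids:
        if token in special_ids or rng.random() >= mlm_probability:
            masked.append(token)
            labels.append(IGNORE_INDEX)  # not selected: no loss at this position
            continue
        labels.append(token)  # selected: the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            masked.append(mask_token_id)  # 80%: replace with the mask token
        elif roll < 0.9:
            masked.append(rng.randrange(vocab_size))  # 10%: replace with a random token
        else:
            masked.append(token)  # 10%: keep the original token
    return masked, labels


ids = list(range(10, 30))
masked, labels = mask_tokens(ids, mask_token_id=4, vocab_size=100, special_ids={10, 29}, rng=random.Random(0))
print(masked)
print(labels)
```

Because the selection is redrawn on every call, the same sequence is masked differently each time it is iterated over.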

Causal language modeling

Causal language modeling is frequently used for text generation. This section shows you how to fine-tune DistilGPT2 to generate new text.

Fine-tune with Trainer

Load DistilGPT2 with AutoModelForCausalLM:

from transformers import AutoModelForCausalLM, TrainingArguments, Trainer

model = AutoModelForCausalLM.from_pretrained("distilgpt2")

At this point, only three steps remain:

Define your training hyperparameters in TrainingArguments.
Pass the training arguments to Trainer along with the model, datasets, and data collator.
Call train() to fine-tune your model.

training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_dataset["train"],
    eval_dataset=lm_dataset["test"],
    data_collator=data_collator,
)

trainer.train()
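
Since the training arguments above request an evaluation every epoch, the reported evaluation loss can be turned into perplexity, a common way to read language modeling quality. The sketch below assumes the loss is a mean cross-entropy in nats; the `perplexity` helper and the loss value 3.5 are placeholders for illustration, not results from this guide.

```python
import math


def perplexity(eval_loss):
    """Perplexity is the exponential of the mean cross-entropy loss (in nats)."""
    return math.exp(eval_loss)


# Hypothetical loss value; with the objects above it would come from
# something like trainer.evaluate()["eval_loss"].
print(f"Perplexity: {perplexity(3.5):.2f}")
```

Lower perplexity means the model assigns higher probability to the held-out text.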

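Fine-tune with TensorFlow

Fine-tuning a model in TensorFlow is just as easy, with only a few differences.

Convert your datasets to the tf.data.Dataset format with to_tf_dataset. Specify the inputs and labels in columns, whether to shuffle the dataset order, the batch size, and the data collator: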

tf_train_set = lm_dataset["train"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "labels"],
    dummy_labels=True,
    shuffle=True,
    batch_size=16,
    collate_fn=data_collator,
)

tf_test_set = lm_dataset["test"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "labels"],
    dummy_labels=True,
    shuffle=False,
    batch_size=16,
    collate_fn=data_collator,
)

Set up an optimizer function, a learning rate schedule, and some training hyperparameters:

from transformers import create_optimizer, AdamWeightDecay

optimizer = AdamWeightDecay(learning_rate=2e-5, weight_decay_rate=0.01)
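
The optimizer above uses a constant learning rate of 2e-5. To illustrate what a learning rate schedule adds, here is a plain-Python sketch of one common shape, linear warmup followed by linear decay to zero. The function `linear_schedule` and the step counts are assumptions for this example and are not taken from the guide or from the Transformers API.

```python
def linear_schedule(step, total_steps, peak_lr=2e-5, warmup_steps=0):
    """Learning rate at `step`: linear warmup to `peak_lr`, then linear decay to zero."""
    if warmup_steps and step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / max(total_steps - warmup_steps, 1)


# Print the learning rate at a few points of a hypothetical 1000-step run.
for step in (0, 50, 100, 550, 1000):
    print(step, linear_schedule(step, total_steps=1000, warmup_steps=100))
```

A schedule of this kind would replace the fixed learning_rate value; the imported create_optimizer helper is one way to obtain an optimizer together with a schedule.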

Load DistilGPT2 with TFAutoModelForCausalLM:

from transformers import TFAutoModelForCausalLM

model = TFAutoModelForCausalLM.from_pretrained("distilgpt2")

Configure the model for training with compile:

import tensorflow as tf

model.compile(optimizer=optimizer)

Call fit to fine-tune the model:

model.fit(x=tf_train_set, validation_data=tf_test_set, epochs=3)

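Masked language modeling

Masked language modeling is also known as a fill-mask task because it predicts a masked token in a sequence. Models for masked language modeling require a good contextual understanding of the entire sequence rather than only the left context. This section shows you how to fine-tune DistilRoBERTa to predict a masked word.

Fine-tune with Trainer

Load DistilRoBERTa with AutoModelForMaskedLM: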

from transformers import AutoModelForMaskedLM, TrainingArguments, Trainer

model = AutoModelForMaskedLM.from_pretrained("distilroberta-base")

At this point, only three steps remain:

Define your training hyperparameters in TrainingArguments.
Pass the training arguments to Trainer along with the model, datasets, and data collator.
Call train() to fine-tune your model.

training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    num_train_epochs=3,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_dataset["train"],
    eval_dataset=lm_dataset["test"],
    data_collator=data_collator,
)

trainer.train()

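Fine-tune with TensorFlow

Fine-tuning a model in TensorFlow is just as easy, with only a few differences.

Convert your datasets to the tf.data.Dataset format with to_tf_dataset. Specify the inputs and labels in columns, whether to shuffle the dataset order, the batch size, and the data collator: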

tf_train_set = lm_dataset["train"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "labels"],
    dummy_labels=True,
    shuffle=True,
    batch_size=16,
    collate_fn=data_collator,
)

tf_test_set = lm_dataset["test"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "labels"],
    dummy_labels=True,
    shuffle=False,
    batch_size=16,
    collate_fn=data_collator,
)

Set up an optimizer function, a learning rate, and some training hyperparameters:

from transformers import create_optimizer, AdamWeightDecay

optimizer = AdamWeightDecay(learning_rate=2e-5, weight_decay_rate=0.01)

Load DistilRoBERTa with TFAutoModelForMaskedLM:

from transformers import TFAutoModelForMaskedLM

model = TFAutoModelForMaskedLM.from_pretrained("distilroberta-base")

Configure the model for training with compile:

import tensorflow as tf

model.compile(optimizer=optimizer)

Call fit to fine-tune the model:

model.fit(x=tf_train_set, validation_data=tf_test_set, epochs=3)

Note: For a more in-depth example of how to fine-tune a model for causal language modeling, take a look at the corresponding PyTorch notebook or TensorFlow notebook.

End
