Tag archive: HuggingFace Transformers 4.17

HuggingFace Transformers 4.17 : Notebooks : Training a New Language Model from Scratch

05/14/2022

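HuggingFace Transformers 4.17 : Notebooks : Training a New Language Model from Scratch (translation / commentary)

Translation : ClassCat Co., Ltd. Sales Information
Created : 05/14/2022 (v4.17.0)

* This page is a translation, with supplementary explanations where appropriate, of the following HuggingFace Transformers document:

notebooks : How to train a new language model from scratch using Transformers and Tokenizers

* The sample code has been checked to run, but it has been extended or modified where necessary.
* You are free to link to this page, but we would appreciate a note to sales-info@classcat.com.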

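ClassCat Artificial Intelligence R&D Support Service

◆ ClassCat provides a range of services related to artificial intelligence and telework. Please feel free to contact us :

AI research and development support
1. AI training service (on-site training for executives)
2. Technical consulting service
3. Proof-of-concept experiments (prototype building)
4. Implementation into applications
AI training service
Support to keep a PoC (proof of concept) from failing

◆ We regularly hold web seminars on the theme of artificial intelligence and business. Schedule.

You can join from a web browser regardless of where you live. Please note that advance registration is required.

◆ Contact : For inquiries about this page, please use the contact below.

ClassCat Co., Ltd., Sales & Marketing Division, Sales Information
sales-info@classcat.com ; Web: www.classcat.com ; ClassCatJP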

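HuggingFace Transformers : Notebooks : Training a New Language Model from Scratch

Over the past few months, we made several improvements to the transformers and tokenizers libraries, with the goal of making it easier than ever to train a new language model from scratch.

In this post we demonstrate how to train a “small” model (84 M parameters = 6 layers, 768 hidden size, 12 attention heads) on Esperanto; that is the same number of layers and heads as DistilBERT. We then fine-tune the model on a downstream task of part-of-speech tagging.

1. Find a dataset

First, let us find a corpus of text in Esperanto. Here we use the Esperanto portion of the OSCAR corpus from INRIA. OSCAR is a huge multilingual corpus obtained by language classification and filtering of Common Crawl dumps of the Web.

The Esperanto portion of the dataset is only 299 M, so we concatenate it with the Esperanto sub-corpus of the Leipzig Corpora Collection, which consists of text from diverse sources such as news, literature, and Wikipedia.

The final training corpus is 3 GB in size, which is still small: the more data you can get to pretrain on, the better the results your model will achieve.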

# in this notebook we'll only get one of the files (the Oscar one) for the sake of simplicity and performance
!wget -c https://cdn-datasets.huggingface.co/EsperBERTo/data/oscar.eo.txt

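2. Train a tokenizer

We choose to train a byte-level Byte-Pair Encoding tokenizer (the same as GPT-2), with the same special tokens as RoBERTa. Let us arbitrarily pick a vocabulary size of 52,000.

We recommend training a byte-level BPE (rather than, say, a WordPiece tokenizer like BERT's) because it starts building its vocabulary from an alphabet of single bytes, so every word can be decomposed into tokens (no more <unk> tokens!).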
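
The "no more <unk>" property can be illustrated with a minimal sketch that needs no library: any string, once encoded as UTF-8, is a sequence of byte values drawn from a fixed alphabet of 256 symbols. The helper name to_byte_symbols is hypothetical, and a real byte-level tokenizer additionally maps each byte to a printable character before learning merges.

```python
def to_byte_symbols(text):
    """Split a string into its UTF-8 byte values (the base symbols of byte-level BPE)."""
    return list(text.encode("utf-8"))

BASE_ALPHABET = set(range(256))  # every possible byte value

for word in ["ĉevalo", "ŝipo", "naïve", "🤗"]:
    symbols = to_byte_symbols(word)
    # Every symbol is in the base alphabet, so nothing has to fall back to <unk>.
    assert all(s in BASE_ALPHABET for s in symbols)
    # The decomposition is lossless: the bytes decode back to the original text.
    assert bytes(symbols).decode("utf-8") == word
    print(word, symbols)
```

Merges learned during training then only shorten these byte sequences; they never make a word unrepresentable.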

# We won't need TensorFlow here
!pip uninstall -y tensorflow
# Install `transformers` from master
!pip install git+https://github.com/huggingface/transformers
!pip list | grep -E 'transformers|tokenizers'
# transformers version at notebook update --- 2.11.0
# tokenizers version at notebook update --- 0.8.0rc1

%%time 
from pathlib import Path

from tokenizers import ByteLevelBPETokenizer

paths = [str(x) for x in Path(".").glob("**/*.txt")]

# Initialize a tokenizer
tokenizer = ByteLevelBPETokenizer()

# Customize training
tokenizer.train(files=paths, vocab_size=52_000, min_frequency=2, special_tokens=[
    "<s>",
    "<pad>",
    "</s>",
    "<unk>",
    "<mask>",
])

CPU times: user 4min, sys: 3min 7s, total: 7min 7s
Wall time: 2min 25s

Now let us save the files to disk :

!mkdir EsperBERTo
tokenizer.save_model("EsperBERTo")

['EsperBERTo/vocab.json', 'EsperBERTo/merges.txt']

🔥🔥 Wow, that was fast! ⚡️🔥

We now have both a vocab.json, which is a list of the most frequent tokens ranked by frequency, and a merges.txt list of merges.

{
    "<s>": 0,
    "<pad>": 1,
    "</s>": 2,
    "<unk>": 3,
    "<mask>": 4,
    "!": 5,
    "\"": 6,
    "#": 7,
    "$": 8,
    "%": 9,
    "&": 10,
    "'": 11,
    "(": 12,
    ")": 13,
    # ...
}

# merges.txt
l a
Ġ k
o n
Ġ la
t a
Ġ e
Ġ d
Ġ p
# ...
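
To make the role of merges.txt concrete, here is a simplified sketch of how a ranked merge list can be applied to a sequence of symbols: the adjacent pair with the earliest rank is merged first, and the process repeats until no listed pair remains. The function apply_merges is hypothetical and merges one occurrence at a time; it is not the tokenizers library's implementation.

```python
def apply_merges(symbols, merges):
    """Greedily apply BPE merges to a list of symbols, earliest-ranked merge first."""
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    symbols = list(symbols)
    while len(symbols) > 1:
        # Rank every adjacent pair; pairs absent from the merge list get infinite rank.
        candidates = [
            (ranks.get(pair, float("inf")), index)
            for index, pair in enumerate(zip(symbols, symbols[1:]))
        ]
        best_rank, index = min(candidates)
        if best_rank == float("inf"):
            break  # no applicable merge is left
        symbols[index:index + 2] = [symbols[index] + symbols[index + 1]]
    return symbols

# The first four merges from the merges.txt excerpt above.
merges = [("l", "a"), ("Ġ", "k"), ("o", "n"), ("Ġ", "la")]

print(apply_merges(["Ġ", "l", "a"], merges))  # ['Ġla']
print(apply_merges(["k", "o", "n"], merges))  # ['k', 'on']
```

In the first call "l a" is merged before "Ġ la" because it appears earlier in the list, which is why the order of lines in merges.txt matters.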

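What is great is that our tokenizer is optimized for Esperanto. Compared with a generic tokenizer trained for English, more native words are represented by a single, unsplit token. Diacritics, i.e. the accented characters used in Esperanto (ĉ, ĝ, ĥ, ĵ, ŝ, and ŭ), are encoded natively. We also represent sequences in a more efficient manner. On this corpus, the average length of encoded sequences is about 30% smaller than with the pretrained GPT-2 tokenizer.

Here is how you can use it in tokenizers, including handling the RoBERTa special tokens; of course, you can also use it directly from transformers.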

from tokenizers.implementations import ByteLevelBPETokenizer
from tokenizers.processors import BertProcessing


tokenizer = ByteLevelBPETokenizer(
    "./EsperBERTo/vocab.json",
    "./EsperBERTo/merges.txt",
)

tokenizer._tokenizer.post_processor = BertProcessing(
    ("</s>", tokenizer.token_to_id("</s>")),
    ("<s>", tokenizer.token_to_id("<s>")),
)
tokenizer.enable_truncation(max_length=512)

tokenizer.encode("Mi estas Julien.")

Encoding(num_tokens=7, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])

tokenizer.encode("Mi estas Julien.").tokens

['<s>', 'Mi', 'Ġestas', 'ĠJuli', 'en', '.', '</s>']

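3. Train a language model from scratch

Update : This section follows along the run_language_modeling.py script, using the new Trainer directly. Feel free to pick the approach you like best.

We will train a RoBERTa-like model, which is a BERT-like model with a couple of changes (check the documentation for more details).

As the model is BERT-like, we train it on a task of masked language modeling, i.e. predicting how to fill in arbitrary tokens that we randomly mask in the dataset. This is taken care of by the example script.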

# Check that we have a GPU
!nvidia-smi

Fri May 15 21:17:12 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.82       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   38C    P0    26W / 250W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

# Check that PyTorch sees it
import torch
torch.cuda.is_available()

True

We define the following config for the model

from transformers import RobertaConfig

config = RobertaConfig(
    vocab_size=52_000,
    max_position_embeddings=514,
    num_attention_heads=12,
    num_hidden_layers=6,
    type_vocab_size=1,
)

Now let us re-create our tokenizer in transformers.

from transformers import RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained("./EsperBERTo", max_len=512)

Finally, let us initialize our model.

Important : As we are training from scratch, we only initialize from a config, not from an existing pretrained model or checkpoint.

from transformers import RobertaForMaskedLM

model = RobertaForMaskedLM(config=config)

model.num_parameters()
# => 84 million parameters

84095008
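
The figure of roughly 84 M parameters can be checked by hand. The sketch below is one decomposition that reproduces the printed number, assuming the default hidden size (768) and feed-forward size (3072), output weights tied to the word embeddings, and a pooler layer counted in the total. The pooler term is an assumption inferred from the arithmetic; other library versions may not include it.

```python
# Configuration values from the RobertaConfig above; hidden and intermediate
# sizes are assumed to be the library defaults (768 and 3072).
vocab_size, max_positions, type_vocab = 52_000, 514, 1
hidden, intermediate, num_layers = 768, 3072, 6

def linear(n_in, n_out):
    """Parameters of a dense layer: weight matrix plus bias vector."""
    return n_in * n_out + n_out

layer_norm = 2 * hidden  # scale and shift vectors

embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
encoder_layer = (
    4 * linear(hidden, hidden)      # query, key, value and attention output projections
    + layer_norm
    + linear(hidden, intermediate)  # feed-forward expansion
    + linear(intermediate, hidden)  # feed-forward projection
    + layer_norm
)
pooler = linear(hidden, hidden)  # assumption: counted by the version that printed the total
# LM head: dense + layer norm + output bias; output weights are tied to the embeddings.
lm_head = linear(hidden, hidden) + layer_norm + vocab_size

total = embeddings + num_layers * encoder_layer + pooler + lm_head
print(total)
```

Almost half of the parameters sit in the 52,000 × 768 word embedding matrix, which is why the vocabulary size has such a large effect on the size of a small model.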

Now let us build our training dataset

We build our dataset by applying our tokenizer to our text file.

Here, as we only have one text file, we do not even need to customize our Dataset. We simply use LineByLineTextDataset out of the box.

%%time
from transformers import LineByLineTextDataset

dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="./oscar.eo.txt",
    block_size=128,
)

CPU times: user 4min 54s, sys: 2.98 s, total: 4min 57s
Wall time: 1min 37s
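
Conceptually, a line-by-line dataset turns every non-empty line of the file into one example and truncates it to block_size tokens. The sketch below illustrates that idea with a toy whitespace tokenizer (toy_encode is hypothetical) in place of the trained one; it is an illustration under those assumptions, not the library's implementation.

```python
import tempfile
from pathlib import Path

def toy_encode(line):
    """Hypothetical stand-in for a real tokenizer: one token per whitespace-separated word."""
    return line.split()

def line_by_line_examples(file_path, block_size):
    """Yield one token list per non-empty line, truncated to block_size tokens."""
    with open(file_path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():  # skip blank lines
                yield toy_encode(line)[:block_size]

with tempfile.TemporaryDirectory() as tmp:
    corpus = Path(tmp) / "corpus.txt"
    corpus.write_text(
        "Mi estas Julien.\n\nLa suno brilas hodiaŭ tre forte.\n", encoding="utf-8"
    )
    examples = list(line_by_line_examples(corpus, block_size=4))

print(examples)
```

The second line has six words, so it is cut to the first four; tokens beyond block_size are simply dropped rather than carried over into another example.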

As in the run_language_modeling.py script, we need to define a data_collator.

This is just a small helper that batches different samples of the dataset together into an object that PyTorch knows how to perform backprop on.

from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
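
What mlm=True with mlm_probability=0.15 means can be sketched in plain Python. Each non-special token is selected with probability 0.15; a selected token is replaced by <mask> 80% of the time, by a random token 10% of the time, and left unchanged otherwise, while unselected positions get the label -100 so that the loss ignores them. The function mask_tokens and the constants below are illustrative assumptions; the real collator works on batched tensors.

```python
import random

MASK_ID = 4                    # id of <mask> in the vocab.json shown earlier
SPECIAL_IDS = {0, 1, 2, 3, 4}  # <s>, <pad>, </s>, <unk>, <mask>
VOCAB_SIZE = 52_000
IGNORE_INDEX = -100            # label value that the loss ignores

def mask_tokens(input_ids, mlm_probability=0.15, rng=random):
    """Return (masked_inputs, labels) for one sequence of token ids."""
    inputs, labels = list(input_ids), []
    for position, token in enumerate(input_ids):
        selected = token not in SPECIAL_IDS and rng.random() < mlm_probability
        if not selected:
            labels.append(IGNORE_INDEX)
            continue
        labels.append(token)  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            inputs[position] = MASK_ID                    # 80%: replace with <mask>
        elif roll < 0.9:
            inputs[position] = rng.randrange(VOCAB_SIZE)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return inputs, labels

rng = random.Random(0)
ids = [0] + [rng.randrange(5, VOCAB_SIZE) for _ in range(30)] + [2]
masked, labels = mask_tokens(ids, rng=rng)
print(sum(label != IGNORE_INDEX for label in labels), "of", len(ids), "positions selected")
```

Because masking is redone every time a batch is built, the model sees different masked positions for the same sentence across epochs.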

Finally, we are all set to initialize our Trainer.

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./EsperBERTo",
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_device_train_batch_size=64,
    save_steps=10_000,
    save_total_limit=2,
    prediction_loss_only=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)

Start training

%%time
trainer.train()

{"loss": 7.152712148666382, "learning_rate": 4.8358287365379566e-05, "epoch": 0.03283425269240872, "step": 500}
{"loss": 6.928811420440674, "learning_rate": 4.671657473075913e-05, "epoch": 0.06566850538481744, "step": 1000}
{"loss": 6.789419063568115, "learning_rate": 4.5074862096138694e-05, "epoch": 0.09850275807722617, "step": 1500}
{"loss": 6.688932447433472, "learning_rate": 4.343314946151826e-05, "epoch": 0.1313370107696349, "step": 2000}
{"loss": 6.595982004165649, "learning_rate": 4.179143682689782e-05, "epoch": 0.1641712634620436, "step": 2500}
{"loss": 6.545944199562073, "learning_rate": 4.0149724192277385e-05, "epoch": 0.19700551615445233, "step": 3000}
{"loss": 6.4864857263565066, "learning_rate": 3.850801155765695e-05, "epoch": 0.22983976884686105, "step": 3500}
{"loss": 6.412427802085876, "learning_rate": 3.686629892303651e-05, "epoch": 0.2626740215392698, "step": 4000}
{"loss": 6.363630670547486, "learning_rate": 3.522458628841608e-05, "epoch": 0.29550827423167847, "step": 4500}
{"loss": 6.273832890510559, "learning_rate": 3.358287365379564e-05, "epoch": 0.3283425269240872, "step": 5000}
{"loss": 6.197585330963134, "learning_rate": 3.1941161019175205e-05, "epoch": 0.3611767796164959, "step": 5500}
{"loss": 6.097779376983643, "learning_rate": 3.029944838455477e-05, "epoch": 0.39401103230890466, "step": 6000}
{"loss": 5.985456382751464, "learning_rate": 2.8657735749934332e-05, "epoch": 0.42684528500131336, "step": 6500}
{"loss": 5.8448616371154785, "learning_rate": 2.70160231153139e-05, "epoch": 0.4596795376937221, "step": 7000}
{"loss": 5.692522863388062, "learning_rate": 2.5374310480693457e-05, "epoch": 0.4925137903861308, "step": 7500}
{"loss": 5.562082152366639, "learning_rate": 2.3732597846073024e-05, "epoch": 0.5253480430785396, "step": 8000}
{"loss": 5.457240365982056, "learning_rate": 2.2090885211452588e-05, "epoch": 0.5581822957709482, "step": 8500}
{"loss": 5.376953645706177, "learning_rate": 2.0449172576832152e-05, "epoch": 0.5910165484633569, "step": 9000}
{"loss": 5.298609251022339, "learning_rate": 1.8807459942211716e-05, "epoch": 0.6238508011557657, "step": 9500}
{"loss": 5.225468152046203, "learning_rate": 1.716574730759128e-05, "epoch": 0.6566850538481744, "step": 10000}
{"loss": 5.174519973754883, "learning_rate": 1.5524034672970843e-05, "epoch": 0.6895193065405831, "step": 10500}
{"loss": 5.113943946838379, "learning_rate": 1.3882322038350407e-05, "epoch": 0.7223535592329918, "step": 11000}
{"loss": 5.08140989112854, "learning_rate": 1.2240609403729971e-05, "epoch": 0.7551878119254006, "step": 11500}
{"loss": 5.072491912841797, "learning_rate": 1.0598896769109535e-05, "epoch": 0.7880220646178093, "step": 12000}
{"loss": 5.012459496498108, "learning_rate": 8.957184134489099e-06, "epoch": 0.820856317310218, "step": 12500}
{"loss": 4.999591351509094, "learning_rate": 7.315471499868663e-06, "epoch": 0.8536905700026267, "step": 13000}
{"loss": 4.994838352203369, "learning_rate": 5.673758865248227e-06, "epoch": 0.8865248226950354, "step": 13500}
{"loss": 4.955870885848999, "learning_rate": 4.032046230627791e-06, "epoch": 0.9193590753874442, "step": 14000}
{"loss": 4.941655583381653, "learning_rate": 2.390333596007355e-06, "epoch": 0.9521933280798529, "step": 14500}
{"loss": 4.931783639907837, "learning_rate": 7.486209613869189e-07, "epoch": 0.9850275807722616, "step": 15000}

CPU times: user 1h 43min 36s, sys: 1h 3min 28s, total: 2h 47min 4s
Wall time: 2h 46min 46s
TrainOutput(global_step=15228, training_loss=5.762423221226405)
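
The learning_rate column in the log is consistent with a linear decay from 5e-5 to zero over the 15,228 steps, with no warmup. The sketch below assumes exactly that (the base rate and the absence of warmup are inferred from the logged values rather than set explicitly in the arguments above), and linear_schedule is a hypothetical helper.

```python
def linear_schedule(step, total_steps, base_lr=5e-5, warmup_steps=0):
    """Learning rate with linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    return base_lr * max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

TOTAL_STEPS = 15_228  # global_step reported by TrainOutput above

for step in (500, 1000, 15_000):
    # Prints the step, the scheduled learning rate, and the fraction of the epoch completed.
    print(step, linear_schedule(step, TOTAL_STEPS), step / TOTAL_STEPS)
```

The third printed column reproduces the "epoch" field of the log, since one epoch corresponds to all 15,228 steps.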

🎉 Save the final model (+ tokenizer + config) to disk

trainer.save_model("./EsperBERTo")

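4. Check that the LM actually trained

Aside from watching the training and eval losses go down, the easiest way to check whether our language model is learning anything interesting is via the FillMaskPipeline.

Pipelines are simple wrappers around tokenizers and models; the 'fill-mask' one lets you input a sequence containing a masked token (here, <mask>) and returns a list of the most probable filled sequences, with their probabilities.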

from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model="./EsperBERTo",
    tokenizer="./EsperBERTo"
)

# The sun <mask>.
# =>

fill_mask("La suno <mask>.")

[{'score': 0.02119220793247223,
  'sequence': '<s> La suno estas.</s>',
  'token': 316},
 {'score': 0.012403824366629124,
  'sequence': '<s> La suno situas.</s>',
  'token': 2340},
 {'score': 0.011061107739806175,
  'sequence': '<s> La suno estis.</s>',
  'token': 394},
 {'score': 0.008284995332360268,
  'sequence': '<s> La suno de.</s>',
  'token': 274},
 {'score': 0.006471084896475077,
  'sequence': '<s> La suno akvo.</s>',
  'token': 1833}]

OK, simple syntax/grammar works. Let us try a slightly more interesting prompt :

fill_mask("Jen la komenco de bela <mask>.")

# This is the beginning of a beautiful <mask>.
# =>

[{'score': 0.01814725436270237,
  'sequence': '<s> Jen la komenco de bela urbo.</s>',
  'token': 871},
 {'score': 0.015888698399066925,
  'sequence': '<s> Jen la komenco de bela vivo.</s>',
  'token': 1160},
 {'score': 0.015662025660276413,
  'sequence': '<s> Jen la komenco de bela tempo.</s>',
  'token': 1021},
 {'score': 0.015555007383227348,
  'sequence': '<s> Jen la komenco de bela mondo.</s>',
  'token': 945},
 {'score': 0.01412549614906311,
  'sequence': '<s> Jen la komenco de bela tago.</s>',
  'token': 1633}]
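
The scores returned by the pipeline are probabilities over the vocabulary at the masked position. As a rough sketch of that last step, the code below applies a softmax to a handful of made-up logits and keeps the top candidates; the five-word vocabulary and the logit values are invented for illustration and are not outputs of the model.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(logits, id_to_token, k=3):
    """Return the k most probable (token, probability) pairs."""
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(id_to_token[i], probs[i]) for i in ranked[:k]]

# Hypothetical five-word vocabulary and made-up logits at the <mask> position.
id_to_token = ["urbo", "vivo", "tempo", "mondo", "tago"]
logits = [2.0, 1.5, 1.4, 1.3, 0.2]
predictions = top_k(logits, id_to_token)
print(predictions)
```

With a 52,000-token vocabulary the probability mass is spread much more thinly, which is one reason the top scores in the outputs above are only around 0.01 to 0.02.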

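5. Share your model 🎉

Finally, when you have a nice model, please think about sharing it with the community :

Upload your model using the CLI : transformers-cli upload
Write a README.md model card and add it to the repository under model_cards/. Your model card should ideally include :
- a model description
- training params (dataset, preprocessing, hyperparameters)
- evaluation results
- intended uses & limitations
- whatever else is helpful! 🤓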

TADA!

➡️ Your model has a page on http://huggingface.co/models and everyone can load it using AutoModel.from_pretrained("username/model_name").

If you want to take a look at models in different languages, check https://huggingface.co/models

That's all.

HuggingFace Transformers 4.17 : Notebooks : Fine-tuning for Image Classification

05/08/2022

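HuggingFace Transformers 4.17 : Notebooks/Examples : Fine-tuning for Image Classification (translation / commentary)

Translation : ClassCat Co., Ltd. Sales Information
Created : 05/08/2022 (v4.17.0)

* This page is a translation, with supplementary explanations where appropriate, of the following HuggingFace Transformers document: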

notebooks/examples : Fine-tuning for Image Classification with HuggingFace Transformers

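HuggingFace Transformers : Notebooks/Examples : Fine-tuning for Image Classification

This notebook shows how to fine-tune a pretrained vision model for image classification on a custom dataset. The idea is to add a randomly initialized classification head on top of a pretrained encoder and fine-tune the model as a whole on a labeled dataset.

ImageFolder

This notebook leverages the ImageFolder feature so that it can easily be run on a custom dataset (namely EuroSAT in this tutorial). You can load a Dataset either from a local folder or from local/remote files such as zip or tar archives.

Any model

This notebook is built to run on any image classification dataset with any vision model checkpoint from the Model Hub, as long as that model has a version with an image classification head.

In short, any model supported by AutoModelForImageClassification.

Data augmentation

This notebook uses Torchvision's transforms to apply data augmentation; note that we also provide alternative notebooks that use other libraries, including :

Albumentations
Kornia (translator's note: broken link)
imgaug (translator's note: broken link)

In this notebook, we fine-tune from the https://huggingface.co/microsoft/swin-tiny-patch4-window7-224 checkpoint, but note that there are many, many more checkpoints available on the hub.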

model_checkpoint = "microsoft/swin-tiny-patch4-window7-224" # pre-trained model from which to fine-tune
batch_size = 32 # batch size for training and evaluation

Before we start, let us install the datasets and transformers libraries.

!pip install -q datasets transformers

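If you are opening this notebook locally, make sure your environment has the latest versions of those libraries installed.

To be able to share your model with the community and generate results like the one shown in the picture below via the inference API, there are a few more steps to follow.

First you have to store your authentication token from the Hugging Face website (sign up here if you haven't already!), then execute the following cell and input your token :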

from huggingface_hub import notebook_login

notebook_login()

Login successful
Your token has been saved to /root/.huggingface/token

Authenticated through git-credential store but this isn’t the helper defined on your machine.
You might have to re-authenticate when pushing to the Hugging Face Hub. Run the following command in your terminal in case you want to set this credential helper as the default

git config --global credential.helper store

Then you need to install Git-LFS to upload your model checkpoints :

%%capture
!sudo apt -qq install git-lfs
!git config --global credential.helper store

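Fine-tuning a model on an image classification task

In this notebook, we will see how to fine-tune one of the Transformers vision models on an image classification dataset.

Given an image, the goal is to predict an appropriate class for it, like “tiger”. The screenshot below is taken from a ViT fine-tuned on ImageNet-1k; try out the inference widget!

Loading the dataset

We use the ImageFolder feature of the Datasets library to download our custom dataset into a DatasetDict.

In this case, the EuroSAT dataset is hosted remotely, so we provide the data_files argument. Alternatively, if you have local folders with images, you can load them using the data_dir argument.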

from datasets import load_dataset 

# load a custom dataset from local/remote files or folders using the ImageFolder feature

# option 1: local/remote files (supporting the following formats: tar, gzip, zip, xz, rar, zstd)
dataset = load_dataset("imagefolder", data_files="https://madm.dfki.de/files/sentinel/EuroSAT.zip")

# note that you can also provide several splits:
# dataset = load_dataset("imagefolder", data_files={"train": ["path/to/file1", "path/to/file2"], "test": ["path/to/file3", "path/to/file4"]})

# note that you can push your dataset to the hub very easily (and reload afterwards using load_dataset)!
# dataset.push_to_hub("nielsr/eurosat")
# dataset.push_to_hub("nielsr/eurosat", private=True)

# option 2: local folder
# dataset = load_dataset("imagefolder", data_dir="path_to_folder")

# option 3: just load any existing dataset from the hub, like CIFAR-10, FashionMNIST ...
# dataset = load_dataset("cifar10")

Using custom data configuration default-0537267e6f812d56
Downloading and preparing dataset image_folder/default to /root/.cache/huggingface/datasets/image_folder/default-0537267e6f812d56/0.0.0/ee92df8e96c6907f3c851a987be3fd03d4b93b247e727b69a8e23ac94392a091...
Downloading data files: 0it [00:00, ?it/s]
Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]
Downloading data:   0%|          | 0.00/94.3M [00:00<?, ?B/s]
Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]
Generating train split: 0 examples [00:00, ? examples/s]
Dataset image_folder downloaded and prepared to /root/.cache/huggingface/datasets/image_folder/default-0537267e6f812d56/0.0.0/ee92df8e96c6907f3c851a987be3fd03d4b93b247e727b69a8e23ac94392a091. Subsequent calls will reuse this data.
  0%|          | 0/1 [00:00<?, ?it/s]
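
The class names that appear later in the dataset features come from the folder names inside the archive. The sketch below illustrates that idea on a throwaway directory, assuming a layout with one sub-folder per class; infer_labels is a hypothetical helper and not the Datasets implementation, and the files are empty placeholders rather than real images.

```python
import tempfile
from pathlib import Path

def infer_labels(root):
    """Map each file under root/<class_name>/ to an integer label based on its folder name."""
    class_names = sorted(p.name for p in Path(root).iterdir() if p.is_dir())
    label2id = {name: i for i, name in enumerate(class_names)}
    samples = [
        (path.name, label2id[path.parent.name])
        for path in sorted(Path(root).glob("*/*"))
        if path.is_file()
    ]
    return class_names, samples

with tempfile.TemporaryDirectory() as root:
    # Hypothetical miniature dataset: empty placeholder files instead of real images.
    for class_name, file_name in [("Forest", "a.jpg"), ("River", "b.jpg"), ("Forest", "c.jpg")]:
        folder = Path(root) / class_name
        folder.mkdir(exist_ok=True)
        (folder / file_name).touch()
    class_names, samples = infer_labels(root)

print(class_names, samples)
```

Sorting the folder names gives a stable name-to-integer mapping, so the same class always receives the same label id across runs.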

Let us also load the Accuracy metric, which we will use to evaluate our model both during and after training.

from datasets import load_metric

metric = load_metric("accuracy")

Downloading builder script:   0%|          | 0.00/1.41k [00:00<?, ?B/s]

The dataset object itself is a DatasetDict, which contains one key per split (in this case, only “train” for a training split).

dataset

DatasetDict({
    train: Dataset({
        features: ['image', 'label'],
        num_rows: 27000
    })
})

To access an actual element, you need to select a split first, then give an index :

example = dataset["train"][10]
example

{'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=64x64 at 0x7FF2F6277B10>,
 'label': 2}

Each example consists of an image and a corresponding label. We can also verify this by checking the features of the dataset :

dataset["train"].features

{'image': Image(decode=True, id=None),
 'label': ClassLabel(num_classes=10, names=['AnnualCrop', 'Forest', 'HerbaceousVegetation', 'Highway', 'Industrial', 'Pasture', 'PermanentCrop', 'Residential', 'River', 'SeaLake'], id=None)}

The cool thing is that we can directly view the image (as the ‘image’ field is an Image feature), as follows :

example['image']

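The images in the EuroSAT dataset are low resolution (64×64 pixels), so let us make it a little bigger :

example['image'].resize((200, 200))

Let us print the corresponding label :

example['label']

As you can see, the label field is not an actual string label. By default the ClassLabel fields are encoded into integers for convenience :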

dataset["train"].features["label"]

ClassLabel(num_classes=10, names=['AnnualCrop', 'Forest', 'HerbaceousVegetation', 'Highway', 'Industrial', 'Pasture', 'PermanentCrop', 'Residential', 'River', 'SeaLake'], id=None)

Let us create an id2label dictionary to decode them back to strings and see what they are. The inverse label2id will be useful too, when we load the model later.

labels = dataset["train"].features["label"].names
label2id, id2label = dict(), dict()
for i, label in enumerate(labels):
    label2id[label] = i
    id2label[i] = label

id2label[2]

'HerbaceousVegetation'

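Preprocessing the data

Before we can feed these images to our model, we need to preprocess them.

Preprocessing images typically comes down to (1) resizing them to a particular size and (2) normalizing the color channels (R, G, B) using a mean and standard deviation. These are referred to as image transformations.

In addition, one typically performs what is called data augmentation during training (like random cropping and flipping) to make the model more robust and achieve higher accuracy. Data augmentation is also a great technique to increase the size of the training data.

We will use torchvision.transforms for the image transformations/data augmentation in this tutorial, but note that one can use any other package (like albumentations, imgaug, Kornia, etc.).

To make sure we (1) resize to the appropriate size and (2) use the appropriate image mean and standard deviation for the model architecture we are going to use, we instantiate what is called a feature extractor with the AutoFeatureExtractor.from_pretrained method.

This feature extractor is a minimal preprocessor that can be used to prepare images for inference.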

from transformers import AutoFeatureExtractor

feature_extractor = AutoFeatureExtractor.from_pretrained(model_checkpoint)
feature_extractor

Downloading:   0%|          | 0.00/255 [00:00<?, ?B/s]
ViTFeatureExtractor {
  "do_normalize": true,
  "do_resize": true,
  "feature_extractor_type": "ViTFeatureExtractor",
  "image_mean": [
    0.485,
    0.456,
    0.406
  ],
  "image_std": [
    0.229,
    0.224,
    0.225
  ],
  "resample": 3,
  "size": 224
}
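
The image_mean and image_std values above are applied per color channel after pixel values have been scaled to the range [0, 1]. A minimal sketch of that arithmetic for a single pixel follows; normalize_pixel is a hypothetical helper written for illustration, not part of torchvision or transformers.

```python
IMAGE_MEAN = [0.485, 0.456, 0.406]  # per-channel mean from the feature extractor config
IMAGE_STD = [0.229, 0.224, 0.225]   # per-channel standard deviation

def normalize_pixel(rgb):
    """Scale an 8-bit RGB pixel to [0, 1], then standardize each channel."""
    return [
        (value / 255.0 - mean) / std
        for value, mean, std in zip(rgb, IMAGE_MEAN, IMAGE_STD)
    ]

print(normalize_pixel((0, 0, 0)))        # darkest possible pixel
print(normalize_pixel((255, 255, 255)))  # brightest possible pixel
```

After this step the values are no longer confined to [0, 1]; they fall roughly between -2.2 and 2.7, which is the kind of range seen in the pixel_values tensor printed later.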

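The Datasets library is made for processing data very easily. We can write custom functions, which can then be applied on an entire dataset (either using .map() or .set_transform()).

Here we define two separate functions, one for training (which includes data augmentation) and one for validation (which only includes resizing, center cropping and normalizing).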

from torchvision.transforms import (
    CenterCrop,
    Compose,
    Normalize,
    RandomHorizontalFlip,
    RandomResizedCrop,
    Resize,
    ToTensor,
)

normalize = Normalize(mean=feature_extractor.image_mean, std=feature_extractor.image_std)
train_transforms = Compose(
        [
            RandomResizedCrop(feature_extractor.size),
            RandomHorizontalFlip(),
            ToTensor(),
            normalize,
        ]
    )

val_transforms = Compose(
        [
            Resize(feature_extractor.size),
            CenterCrop(feature_extractor.size),
            ToTensor(),
            normalize,
        ]
    )

def preprocess_train(example_batch):
    """Apply train_transforms across a batch."""
    example_batch["pixel_values"] = [
        train_transforms(image.convert("RGB")) for image in example_batch["image"]
    ]
    return example_batch

def preprocess_val(example_batch):
    """Apply val_transforms across a batch."""
    example_batch["pixel_values"] = [val_transforms(image.convert("RGB")) for image in example_batch["image"]]
    return example_batch

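Next, we can preprocess our dataset by applying these functions. We will use the set_transform functionality, which allows us to apply the functions above on-the-fly (meaning that they will only be applied when the images are loaded in RAM).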

# split up training into training + validation
splits = dataset["train"].train_test_split(test_size=0.1)
train_ds = splits['train']
val_ds = splits['test']

train_ds.set_transform(preprocess_train)
val_ds.set_transform(preprocess_val)
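
The on-the-fly behavior can be pictured with a toy class that stores a transform and only runs it when an item is requested. LazyDataset and add_doubled below are hypothetical names for illustration; the real Datasets transform receives a batch (a dict of lists) rather than a single row.

```python
class LazyDataset:
    """Toy dataset that applies a transform only when an item is accessed."""

    def __init__(self, rows, transform=None):
        self.rows = rows
        self.transform = transform
        self.calls = 0  # counts how often the transform actually ran

    def set_transform(self, transform):
        self.transform = transform

    def __getitem__(self, index):
        row = dict(self.rows[index])  # copy so the stored data stays untouched
        if self.transform is not None:
            self.calls += 1
            row = self.transform(row)
        return row

def add_doubled(row):
    """Hypothetical transform standing in for image preprocessing."""
    row["doubled"] = row["value"] * 2
    return row

dataset = LazyDataset([{"value": 1}, {"value": 2}, {"value": 3}])
dataset.set_transform(add_doubled)
calls_before = dataset.calls  # nothing has been transformed yet
item = dataset[1]             # the transform runs now, for this item only
print(calls_before, item, dataset.calls)
```

Because the random augmentations run at access time, each epoch sees a differently cropped and flipped version of the same image, and no preprocessed copy of the dataset has to be kept in memory.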

Let us access an element to see that we have added a “pixel_values” feature :

train_ds[0]

{'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=64x64 at 0x7FF2EFFB0D90>,
 'label': 9,
 'pixel_values': tensor([[[-0.3541, -0.3541, -0.3541,  ..., -0.3712, -0.3712, -0.3712],
          [-0.3541, -0.3541, -0.3541,  ..., -0.3712, -0.3712, -0.3712],
          [-0.3541, -0.3541, -0.3541,  ..., -0.3712, -0.3712, -0.3712],
          ...,
          [-0.4397, -0.4397, -0.4397,  ..., -0.4911, -0.4911, -0.4911],
          [-0.4397, -0.4397, -0.4397,  ..., -0.4911, -0.4911, -0.4911],
          [-0.4397, -0.4397, -0.4397,  ..., -0.4911, -0.4911, -0.4911]],
 
         [[-0.2500, -0.2500, -0.2500,  ..., -0.2850, -0.2850, -0.2850],
          [-0.2500, -0.2500, -0.2500,  ..., -0.2850, -0.2850, -0.2850],
          [-0.2500, -0.2500, -0.2500,  ..., -0.2850, -0.2850, -0.2850],
          ...,
          [-0.3550, -0.3550, -0.3550,  ..., -0.4076, -0.4076, -0.4076],
          [-0.3550, -0.3550, -0.3550,  ..., -0.4076, -0.4076, -0.4076],
          [-0.3550, -0.3550, -0.3550,  ..., -0.4076, -0.4076, -0.4076]],
 
         [[ 0.1128,  0.1128,  0.1128,  ...,  0.1651,  0.1651,  0.1651],
          [ 0.1128,  0.1128,  0.1128,  ...,  0.1651,  0.1651,  0.1651],
          [ 0.1128,  0.1128,  0.1128,  ...,  0.1651,  0.1651,  0.1651],
          ...,
          [ 0.0605,  0.0605,  0.0605,  ...,  0.0082,  0.0082,  0.0082],
          [ 0.0605,  0.0605,  0.0605,  ...,  0.0082,  0.0082,  0.0082],
          [ 0.0605,  0.0605,  0.0605,  ...,  0.0082,  0.0082,  0.0082]]])}

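Training the model

Now that our data is ready, we can download the pretrained model and fine-tune it. For classification we use the AutoModelForImageClassification class. Calling its from_pretrained method downloads and caches the weights for us. As the label ids and the number of labels are dataset dependent, we pass label2id and id2label alongside the model_checkpoint here. This makes sure a custom classification head is created (with a custom number of output neurons).

NOTE : in case you are planning to fine-tune an already fine-tuned checkpoint, like facebook/convnext-tiny-224 (which has already been fine-tuned on ImageNet-1k), then you need to provide the additional argument ignore_mismatched_sizes=True to the from_pretrained method. This makes sure the output head (with 1000 output neurons) is thrown away and replaced by a new, randomly initialized classification head that includes a custom number of output neurons. You do not need to specify this argument in case the pretrained model does not include a head.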

from transformers import AutoModelForImageClassification, TrainingArguments, Trainer

model = AutoModelForImageClassification.from_pretrained(
    model_checkpoint, 
    label2id=label2id,
    id2label=id2label,
    ignore_mismatched_sizes = True, # provide this in case you're planning to fine-tune an already fine-tuned checkpoint
)

Downloading:   0%|          | 0.00/70.1k [00:00<?, ?B/s]
Downloading:   0%|          | 0.00/108M [00:00<?, ?B/s]
/usr/local/lib/python3.7/dist-packages/torch/functional.py:445: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at  ../aten/src/ATen/native/TensorShape.cpp:2157.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
Some weights of SwinForImageClassification were not initialized from the model checkpoint at microsoft/swin-tiny-patch4-window7-224 and are newly initialized because the shapes did not match:
- classifier.weight: found shape torch.Size([1000, 768]) in the checkpoint and torch.Size([10, 768]) in the model instantiated
- classifier.bias: found shape torch.Size([1000]) in the checkpoint and torch.Size([10]) in the model instantiated
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

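The warning is telling us we are throwing away some weights (the weights and bias of the classifier layer) and randomly initializing some others (the weights and bias of a new classifier layer). This is expected in this case, because we are adding a new head for which we do not have pretrained weights, so the library warns us that we should fine-tune this model before using it for inference, which is exactly what we are going to do.

To instantiate a Trainer, we need to define the training configuration and the evaluation metric. The most important is the TrainingArguments, which is a class that contains all the attributes to customize the training. It requires one folder name, which will be used to save the checkpoints of the model.

Most of the training arguments are pretty self-explanatory, but one that is quite important here is remove_unused_columns=False. When this option is True, any features not used by the model's call function are dropped. It is True by default because it is usually ideal to drop unused feature columns, which makes it easier to unpack inputs into the model's call function. In our case, however, we need the unused features (‘image’ in particular) in order to create ‘pixel_values’.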

model_name = model_checkpoint.split("/")[-1]

args = TrainingArguments(
    f"{model_name}-finetuned-eurosat",
    remove_unused_columns=False,
    evaluation_strategy = "epoch",
    save_strategy = "epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=batch_size,
    gradient_accumulation_steps=4,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    warmup_ratio=0.1,
    logging_steps=10,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    push_to_hub=True,
)

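Here we set the evaluation to be done at the end of each epoch, tweak the learning rate, use the batch_size defined at the top of the notebook, and customize the number of epochs for training as well as the warmup ratio. Since the best model might not be the one at the end of training, we ask the Trainer to load the best model it saved (according to metric_for_best_model) at the end of training.

The last argument, push_to_hub, allows the Trainer to push the model to the Hub regularly during training. Remove it if you did not follow the installation steps at the top of the notebook. If you want to save your model locally with a name that is different from the name of the repository, or if you want to push your model under an organization and not your namespace, use hub_model_id to set the repo name (it needs to be the full name, including your namespace : for instance “nielsr/vit-finetuned-cifar10” or “huggingface/nielsr/vit-finetuned-cifar10”).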
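
It can help to work out what these arguments imply before launching a run. The sketch below assumes a single device and the 90/10 split of the 27,000 EuroSAT images made earlier; the results should agree with the "Total train batch size" and "Total optimization steps" lines that the Trainer prints when training starts.

```python
import math

num_examples = 24_300          # 90% of the 27,000 images after the train/validation split
per_device_batch_size = 32     # batch_size defined at the top of the notebook
gradient_accumulation_steps = 4
num_epochs = 3

# Gradients from 4 consecutive batches are accumulated before each optimizer update.
effective_batch_size = per_device_batch_size * gradient_accumulation_steps  # single device assumed
batches_per_epoch = math.ceil(num_examples / per_device_batch_size)
updates_per_epoch = batches_per_epoch // gradient_accumulation_steps
total_updates = updates_per_epoch * num_epochs

print(effective_batch_size, batches_per_epoch, updates_per_epoch, total_updates)
```

Gradient accumulation trades wall-clock time for memory: the optimizer behaves as if the batch held 128 images while only 32 are on the device at any one time.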

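Next, we need to define a function for how to compute the metrics from the predictions, which will just use the metric we loaded earlier. The only preprocessing we have to do is to take the argmax of our predicted logits :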

import numpy as np

# the compute_metrics function takes a Named Tuple as input:
# predictions, which are the logits of the model as Numpy arrays,
# and label_ids, which are the ground-truth labels as Numpy arrays.
def compute_metrics(eval_pred):
    """Computes accuracy on a batch of predictions"""
    predictions = np.argmax(eval_pred.predictions, axis=1)
    return metric.compute(predictions=predictions, references=eval_pred.label_ids)
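
For readers who want to see what the argmax-then-accuracy step does without NumPy or the metric object, here is a small pure-Python sketch. The logits and labels are made up for illustration, and the helper names argmax and accuracy are hypothetical.

```python
def argmax(row):
    """Index of the largest value in a list of scores."""
    return max(range(len(row)), key=lambda i: row[i])

def accuracy(logits, label_ids):
    """Fraction of rows whose highest-scoring class equals the reference label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(p == y for p, y in zip(predictions, label_ids))
    return correct / len(label_ids)

# Made-up logits for four images and three classes.
logits = [[2.0, 0.1, -1.0], [0.3, 0.2, 1.5], [0.0, 3.0, 0.5], [1.0, 0.9, 0.8]]
label_ids = [0, 2, 1, 2]
score = accuracy(logits, label_ids)
print(score)
```

Only the position of the largest logit matters here, so no softmax is needed before computing accuracy.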

We also define a collate_fn, which will be used to batch examples together. Each batch consists of two keys, namely pixel_values and labels.

import torch

def collate_fn(examples):
    pixel_values = torch.stack([example["pixel_values"] for example in examples])
    labels = torch.tensor([example["label"] for example in examples])
    return {"pixel_values": pixel_values, "labels": labels}

Then we just need to pass all of this along with our datasets to the Trainer :

trainer = Trainer(
    model,
    args,
    train_dataset=train_ds,
    eval_dataset=val_ds,
    tokenizer=feature_extractor,
    compute_metrics=compute_metrics,
    data_collator=collate_fn,
)

Cloning https://huggingface.co/nielsr/swin-tiny-patch4-window7-224-finetuned-eurosat into local empty directory.

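You might wonder why we pass along the feature_extractor as a tokenizer when we have already preprocessed our data. This is only to make sure the feature extractor configuration file (stored as JSON) will also be uploaded to the repo on the hub.

Now we can fine-tune our model by calling the train method :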

train_results = trainer.train()
# rest is optional but nice to have
trainer.save_model()
trainer.log_metrics("train", train_results.metrics)
trainer.save_metrics("train", train_results.metrics)
trainer.save_state()

/usr/local/lib/python3.7/dist-packages/transformers/optimization.py:309: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  FutureWarning,
***** Running training *****
  Num examples = 24300
  Num Epochs = 3
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 128
  Gradient Accumulation steps = 4
  Total optimization steps = 570

[570/570 16:11, Epoch 3/3]
Epoch	Training Loss	Validation Loss	Accuracy
1	0.262100	0.108344	0.962963
2	0.176900	0.142533	0.950000
3	0.134300	0.066442	0.974444

***** Running Evaluation *****
  Num examples = 2700
  Batch size = 32
Saving model checkpoint to swin-tiny-patch4-window7-224-finetuned-eurosat/checkpoint-190
Configuration saved in swin-tiny-patch4-window7-224-finetuned-eurosat/checkpoint-190/config.json
Model weights saved in swin-tiny-patch4-window7-224-finetuned-eurosat/checkpoint-190/pytorch_model.bin
Feature extractor saved in swin-tiny-patch4-window7-224-finetuned-eurosat/checkpoint-190/preprocessor_config.json
Feature extractor saved in swin-tiny-patch4-window7-224-finetuned-eurosat/preprocessor_config.json
***** Running Evaluation *****
  Num examples = 2700
  Batch size = 32
Saving model checkpoint to swin-tiny-patch4-window7-224-finetuned-eurosat/checkpoint-380
Configuration saved in swin-tiny-patch4-window7-224-finetuned-eurosat/checkpoint-380/config.json
Model weights saved in swin-tiny-patch4-window7-224-finetuned-eurosat/checkpoint-380/pytorch_model.bin
Feature extractor saved in swin-tiny-patch4-window7-224-finetuned-eurosat/checkpoint-380/preprocessor_config.json
Feature extractor saved in swin-tiny-patch4-window7-224-finetuned-eurosat/preprocessor_config.json
***** Running Evaluation *****
  Num examples = 2700
  Batch size = 32
Saving model checkpoint to swin-tiny-patch4-window7-224-finetuned-eurosat/checkpoint-570
Configuration saved in swin-tiny-patch4-window7-224-finetuned-eurosat/checkpoint-570/config.json
Model weights saved in swin-tiny-patch4-window7-224-finetuned-eurosat/checkpoint-570/pytorch_model.bin
Feature extractor saved in swin-tiny-patch4-window7-224-finetuned-eurosat/checkpoint-570/preprocessor_config.json
Feature extractor saved in swin-tiny-patch4-window7-224-finetuned-eurosat/preprocessor_config.json


Training completed. Do not forget to share your model on huggingface.co/models =)


Loading best model from swin-tiny-patch4-window7-224-finetuned-eurosat/checkpoint-570 (score: 0.9744444444444444).
Saving model checkpoint to swin-tiny-patch4-window7-224-finetuned-eurosat
Configuration saved in swin-tiny-patch4-window7-224-finetuned-eurosat/config.json
Model weights saved in swin-tiny-patch4-window7-224-finetuned-eurosat/pytorch_model.bin
Feature extractor saved in swin-tiny-patch4-window7-224-finetuned-eurosat/preprocessor_config.json
Saving model checkpoint to swin-tiny-patch4-window7-224-finetuned-eurosat
Configuration saved in swin-tiny-patch4-window7-224-finetuned-eurosat/config.json
Model weights saved in swin-tiny-patch4-window7-224-finetuned-eurosat/pytorch_model.bin
Feature extractor saved in swin-tiny-patch4-window7-224-finetuned-eurosat/preprocessor_config.json
Several commits (2) will be pushed upstream.
The progress bars may be unreliable.

Upload file pytorch_model.bin:   0%|          | 3.34k/105M [00:00<?, ?B/s]
Upload file runs/Apr12_08-48-13_9520b574893c/events.out.tfevents.1649753401.9520b574893c.77.0:  24%|##4

To https://huggingface.co/nielsr/swin-tiny-patch4-window7-224-finetuned-eurosat
   b46a767..6d6b8dc  main -> main

To https://huggingface.co/nielsr/swin-tiny-patch4-window7-224-finetuned-eurosat
   6d6b8dc..25dd5d7  main -> main

***** train metrics *****
  epoch                    =          3.0
  total_flos               = 1687935228GF
  train_loss               =       0.3276
  train_runtime            =   0:16:13.91
  train_samples_per_second =       74.852
  train_steps_per_second   =        0.585
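
The step counts and throughput figures in the log above can be cross-checked with a little arithmetic. The sketch below is not Trainer code; it only reproduces the numbers, assuming a single device, that the last partial batch of each epoch is kept, and that one optimization step is taken per full group of accumulated batches.

```python
import math

# Values copied from the training log above.
num_examples = 24300
per_device_batch_size = 32
gradient_accumulation_steps = 4
num_epochs = 3
train_runtime_s = 16 * 60 + 13.91  # 0:16:13.91

# Assumption: the last, partial batch of each epoch is kept.
batches_per_epoch = math.ceil(num_examples / per_device_batch_size)

# Assumption: one optimizer update per full group of accumulated batches.
updates_per_epoch = batches_per_epoch // gradient_accumulation_steps
total_updates = updates_per_epoch * num_epochs

# Effective batch size per update, assuming a single device.
total_batch_size = per_device_batch_size * gradient_accumulation_steps

samples_per_second = num_examples * num_epochs / train_runtime_s
steps_per_second = total_updates / train_runtime_s

# Number of evaluation batches for the 2700 validation examples.
eval_batches = math.ceil(2700 / 32)

print(batches_per_epoch, updates_per_epoch, total_updates, total_batch_size)
print(samples_per_second, steps_per_second, eval_batches)
```

Under these assumptions there are 190 updates per epoch, which agrees with the checkpoints saved at steps 190, 380, and 570, and the 85 evaluation batches agree with the [85/85] progress counter.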

You can use the evaluate method to check that the Trainer correctly reloaded the best model (in case it was not the last checkpoint):

metrics = trainer.evaluate()
# some nice to haves:
trainer.log_metrics("eval", metrics)
trainer.save_metrics("eval", metrics)

***** Running Evaluation *****
  Num examples = 2700
  Batch size = 32

[85/85 00:15]
***** eval metrics *****
  epoch                   =        3.0
  eval_accuracy           =     0.9744
  eval_loss               =     0.0664
  eval_runtime            = 0:00:16.12
  eval_samples_per_second =     167.48
  eval_steps_per_second   =      5.273

You can now upload the training results to the Hub by running this single instruction (note that the Trainer automatically creates a model card as well as TensorBoard logs, see the “Training metrics” tab; amazing, isn't it?):

trainer.push_to_hub()

Saving model checkpoint to swin-tiny-patch4-window7-224-finetuned-eurosat
Configuration saved in swin-tiny-patch4-window7-224-finetuned-eurosat/config.json
Model weights saved in swin-tiny-patch4-window7-224-finetuned-eurosat/pytorch_model.bin
Feature extractor saved in swin-tiny-patch4-window7-224-finetuned-eurosat/preprocessor_config.json

Upload file runs/Apr12_08-48-13_9520b574893c/events.out.tfevents.1649754586.9520b574893c.77.2: 100%|##########…

To https://huggingface.co/nielsr/swin-tiny-patch4-window7-224-finetuned-eurosat
   25dd5d7..2164338  main -> main

'https://huggingface.co/nielsr/swin-tiny-patch4-window7-224-finetuned-eurosat/commit/2164338db59d40004286bc65800bfa50561ecd3d'

You can now share this model with all your friends, family, and favorite pets: they can load it with the identifier “your-username/the-name-you-picked”, for instance:

from transformers import AutoModelForImageClassification, AutoFeatureExtractor

feature_extractor = AutoFeatureExtractor.from_pretrained("nielsr/my-awesome-model")
model = AutoModelForImageClassification.from_pretrained("nielsr/my-awesome-model")

Inference

Suppose you have a new image and want to run a prediction on it. Let's load a satellite image of a forest (one that is not part of the EuroSAT dataset) and see how the model does.

from PIL import Image
import requests

url = 'https://huggingface.co/nielsr/convnext-tiny-finetuned-eurostat/resolve/main/forest.png'
image = Image.open(requests.get(url, stream=True).raw)
image

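We load the feature extractor and model from the Hub (here we use the Auto classes, which ensure that the appropriate classes are loaded automatically based on the config.json and preprocessor_config.json files of the repo on the Hub).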


from transformers import AutoModelForImageClassification, AutoFeatureExtractor

repo_name = "nielsr/swin-tiny-patch4-window7-224-finetuned-eurosat"

feature_extractor = AutoFeatureExtractor.from_pretrained(repo_name)
model = AutoModelForImageClassification.from_pretrained(repo_name)

https://huggingface.co/nielsr/swin-tiny-patch4-window7-224-finetuned-eurosat/resolve/main/preprocessor_config.json not found in cache or force_download set to True, downloading to /root/.cache/huggingface/transformers/tmpqggthctf

Downloading:   0%|          | 0.00/240 [00:00<?, ?B/s]

storing https://huggingface.co/nielsr/swin-tiny-patch4-window7-224-finetuned-eurosat/resolve/main/preprocessor_config.json in cache at /root/.cache/huggingface/transformers/7b742d61fc51f2ef5f81a75f80b26419c9f5bd86cc3022ed5784d09823f219f2.e34548f8325ec440fcf4990d4a8dbbfd665397400e9a700766de032d2b45cf6b
creating metadata file for /root/.cache/huggingface/transformers/7b742d61fc51f2ef5f81a75f80b26419c9f5bd86cc3022ed5784d09823f219f2.e34548f8325ec440fcf4990d4a8dbbfd665397400e9a700766de032d2b45cf6b
loading feature extractor configuration file https://huggingface.co/nielsr/swin-tiny-patch4-window7-224-finetuned-eurosat/resolve/main/preprocessor_config.json from cache at /root/.cache/huggingface/transformers/7b742d61fc51f2ef5f81a75f80b26419c9f5bd86cc3022ed5784d09823f219f2.e34548f8325ec440fcf4990d4a8dbbfd665397400e9a700766de032d2b45cf6b
Feature extractor ViTFeatureExtractor {
  "do_normalize": true,
  "do_resize": true,
  "feature_extractor_type": "ViTFeatureExtractor",
  "image_mean": [
    0.485,
    0.456,
    0.406
  ],
  "image_std": [
    0.229,
    0.224,
    0.225
  ],
  "resample": 3,
  "size": 224
}

https://huggingface.co/nielsr/swin-tiny-patch4-window7-224-finetuned-eurosat/resolve/main/config.json not found in cache or force_download set to True, downloading to /root/.cache/huggingface/transformers/tmpzdd89w3g

Downloading:   0%|          | 0.00/1.24k [00:00<?, ?B/s]

storing https://huggingface.co/nielsr/swin-tiny-patch4-window7-224-finetuned-eurosat/resolve/main/config.json in cache at /root/.cache/huggingface/transformers/83e4a1dea85e8e284e4da8ae1e3cf950c2c7e74d65a5a188049b3371fcd151bd.f1ed4852dd8f4c3d0c565427607bc41fff51b58ac73a0970bec8456e5c64cea0
creating metadata file for /root/.cache/huggingface/transformers/83e4a1dea85e8e284e4da8ae1e3cf950c2c7e74d65a5a188049b3371fcd151bd.f1ed4852dd8f4c3d0c565427607bc41fff51b58ac73a0970bec8456e5c64cea0
loading configuration file https://huggingface.co/nielsr/swin-tiny-patch4-window7-224-finetuned-eurosat/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/83e4a1dea85e8e284e4da8ae1e3cf950c2c7e74d65a5a188049b3371fcd151bd.f1ed4852dd8f4c3d0c565427607bc41fff51b58ac73a0970bec8456e5c64cea0
Model config SwinConfig {
  "_name_or_path": "nielsr/swin-tiny-patch4-window7-224-finetuned-eurosat",
  "architectures": [
    "SwinForImageClassification"
  ],
  "attention_probs_dropout_prob": 0.0,
  "depths": [
    2,
    2,
    6,
    2
  ],
  "drop_path_rate": 0.1,
  "embed_dim": 96,
  "encoder_stride": 32,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.0,
  "hidden_size": 768,
  "id2label": {
    "0": "AnnualCrop",
    "1": "Forest",
    "2": "HerbaceousVegetation",
    "3": "Highway",
    "4": "Industrial",
    "5": "Pasture",
    "6": "PermanentCrop",
    "7": "Residential",
    "8": "River",
    "9": "SeaLake"
  },
  "image_size": 224,
  "initializer_range": 0.02,
  "label2id": {
    "AnnualCrop": 0,
    "Forest": 1,
    "HerbaceousVegetation": 2,
    "Highway": 3,
    "Industrial": 4,
    "Pasture": 5,
    "PermanentCrop": 6,
    "Residential": 7,
    "River": 8,
    "SeaLake": 9
  },
  "layer_norm_eps": 1e-05,
  "mlp_ratio": 4.0,
  "model_type": "swin",
  "num_channels": 3,
  "num_heads": [
    3,
    6,
    12,
    24
  ],
  "num_layers": 4,
  "patch_size": 4,
  "path_norm": true,
  "problem_type": "single_label_classification",
  "qkv_bias": true,
  "torch_dtype": "float32",
  "transformers_version": "4.18.0",
  "use_absolute_embeddings": false,
  "window_size": 7
}

https://huggingface.co/nielsr/swin-tiny-patch4-window7-224-finetuned-eurosat/resolve/main/pytorch_model.bin not found in cache or force_download set to True, downloading to /root/.cache/huggingface/transformers/tmpkh0vdu53

Downloading:   0%|          | 0.00/105M [00:00<?, ?B/s]

storing https://huggingface.co/nielsr/swin-tiny-patch4-window7-224-finetuned-eurosat/resolve/main/pytorch_model.bin in cache at /root/.cache/huggingface/transformers/3daadbe0cabef18dc0e2232ae080d135a9d4ee6b1dc7675725ef38bedb990b81.818e63819e125637bd8a94f43b6899d1552f0b45884f1c28c458a5cb55dfa9e5
creating metadata file for /root/.cache/huggingface/transformers/3daadbe0cabef18dc0e2232ae080d135a9d4ee6b1dc7675725ef38bedb990b81.818e63819e125637bd8a94f43b6899d1552f0b45884f1c28c458a5cb55dfa9e5
loading weights file https://huggingface.co/nielsr/swin-tiny-patch4-window7-224-finetuned-eurosat/resolve/main/pytorch_model.bin from cache at /root/.cache/huggingface/transformers/3daadbe0cabef18dc0e2232ae080d135a9d4ee6b1dc7675725ef38bedb990b81.818e63819e125637bd8a94f43b6899d1552f0b45884f1c28c458a5cb55dfa9e5
All model checkpoint weights were used when initializing SwinForImageClassification.

All the weights of SwinForImageClassification were initialized from the model checkpoint at nielsr/swin-tiny-patch4-window7-224-finetuned-eurosat.
If your task is similar to the task the model of the checkpoint was trained on, you can already use SwinForImageClassification for predictions without further training.

# prepare image for the model
encoding = feature_extractor(image.convert("RGB"), return_tensors="pt")
print(encoding.pixel_values.shape)

torch.Size([1, 3, 224, 224])

import torch

# forward pass
with torch.no_grad():
  outputs = model(**encoding)
  logits = outputs.logits

predicted_class_idx = logits.argmax(-1).item()
print("Predicted class:", model.config.id2label[predicted_class_idx])

Predicted class: Forest

Looks like our model got it correct!

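Pipeline API

Another way to quickly run inference with any model on the Hub is the Pipeline API, which abstracts away all the steps we did manually above. It performs the preprocessing, the forward pass, and the postprocessing in a single object.

Let's showcase this for our trained model: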

from transformers import pipeline

pipe = pipeline("image-classification", "nielsr/swin-tiny-patch4-window7-224-finetuned-eurosat")

loading configuration file https://huggingface.co/nielsr/swin-tiny-patch4-window7-224-finetuned-eurosat/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/83e4a1dea85e8e284e4da8ae1e3cf950c2c7e74d65a5a188049b3371fcd151bd.f1ed4852dd8f4c3d0c565427607bc41fff51b58ac73a0970bec8456e5c64cea0
Model config SwinConfig {
  "_name_or_path": "nielsr/swin-tiny-patch4-window7-224-finetuned-eurosat",
  "architectures": [
    "SwinForImageClassification"
  ],
  "attention_probs_dropout_prob": 0.0,
  "depths": [
    2,
    2,
    6,
    2
  ],
  "drop_path_rate": 0.1,
  "embed_dim": 96,
  "encoder_stride": 32,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.0,
  "hidden_size": 768,
  "id2label": {
    "0": "AnnualCrop",
    "1": "Forest",
    "2": "HerbaceousVegetation",
    "3": "Highway",
    "4": "Industrial",
    "5": "Pasture",
    "6": "PermanentCrop",
    "7": "Residential",
    "8": "River",
    "9": "SeaLake"
  },
  "image_size": 224,
  "initializer_range": 0.02,
  "label2id": {
    "AnnualCrop": 0,
    "Forest": 1,
    "HerbaceousVegetation": 2,
    "Highway": 3,
    "Industrial": 4,
    "Pasture": 5,
    "PermanentCrop": 6,
    "Residential": 7,
    "River": 8,
    "SeaLake": 9
  },
  "layer_norm_eps": 1e-05,
  "mlp_ratio": 4.0,
  "model_type": "swin",
  "num_channels": 3,
  "num_heads": [
    3,
    6,
    12,
    24
  ],
  "num_layers": 4,
  "patch_size": 4,
  "path_norm": true,
  "problem_type": "single_label_classification",
  "qkv_bias": true,
  "torch_dtype": "float32",
  "transformers_version": "4.18.0",
  "use_absolute_embeddings": false,
  "window_size": 7
}

loading weights file https://huggingface.co/nielsr/swin-tiny-patch4-window7-224-finetuned-eurosat/resolve/main/pytorch_model.bin from cache at /root/.cache/huggingface/transformers/3daadbe0cabef18dc0e2232ae080d135a9d4ee6b1dc7675725ef38bedb990b81.818e63819e125637bd8a94f43b6899d1552f0b45884f1c28c458a5cb55dfa9e5
All model checkpoint weights were used when initializing SwinForImageClassification.

All the weights of SwinForImageClassification were initialized from the model checkpoint at nielsr/swin-tiny-patch4-window7-224-finetuned-eurosat.
If your task is similar to the task the model of the checkpoint was trained on, you can already use SwinForImageClassification for predictions without further training.
loading feature extractor configuration file https://huggingface.co/nielsr/swin-tiny-patch4-window7-224-finetuned-eurosat/resolve/main/preprocessor_config.json from cache at /root/.cache/huggingface/transformers/7b742d61fc51f2ef5f81a75f80b26419c9f5bd86cc3022ed5784d09823f219f2.e34548f8325ec440fcf4990d4a8dbbfd665397400e9a700766de032d2b45cf6b
Feature extractor ViTFeatureExtractor {
  "do_normalize": true,
  "do_resize": true,
  "feature_extractor_type": "ViTFeatureExtractor",
  "image_mean": [
    0.485,
    0.456,
    0.406
  ],
  "image_std": [
    0.229,
    0.224,
    0.225
  ],
  "resample": 3,
  "size": 224
}

pipe(image)

[{'label': 'Forest', 'score': 0.7000269889831543},
 {'label': 'HerbaceousVegetation', 'score': 0.14589950442314148},
 {'label': 'Pasture', 'score': 0.10370415449142456},
 {'label': 'Highway', 'score': 0.014327816665172577},
 {'label': 'Residential', 'score': 0.0139168007299304}]

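As you can see, it does not only give the class label with the highest probability, but returns the top 5 labels along with their scores. The pipeline also works with a local model and feature extractor: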

pipe = pipeline("image-classification", 
                model=model,
                feature_extractor=feature_extractor)

pipe(image)

[{'label': 'Forest', 'score': 0.7000269889831543},
 {'label': 'HerbaceousVegetation', 'score': 0.14589950442314148},
 {'label': 'Pasture', 'score': 0.10370415449142456},
 {'label': 'Highway', 'score': 0.014327816665172577},
 {'label': 'Residential', 'score': 0.0139168007299304}]
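
The scores returned by the pipeline are probabilities, so they presumably come from a softmax over the model's logits, sorted and cut to the top 5. The sketch below illustrates that postprocessing in plain Python. The logits are made-up numbers for illustration, not outputs of the model above, and `top_k_labels` is a hypothetical helper.

```python
import math

# Label names taken from the id2label mapping in the model config above.
id2label = {
    0: "AnnualCrop", 1: "Forest", 2: "HerbaceousVegetation", 3: "Highway",
    4: "Industrial", 5: "Pasture", 6: "PermanentCrop", 7: "Residential",
    8: "River", 9: "SeaLake",
}


def softmax(logits):
    """Turn raw logits into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_labels(logits, k=5):
    """Hypothetical helper: return the k most probable labels with their scores."""
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [{"label": id2label[i], "score": probs[i]} for i in ranked[:k]]


# Made-up logits for illustration only (not produced by the model).
example_logits = [-1.0, 3.0, 1.5, -0.8, -2.0, 1.1, -1.5, -0.9, -2.5, -3.0]
top5 = top_k_labels(example_logits)
for entry in top5:
    print(entry["label"], entry["score"])
```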

That's all.

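HuggingFace Transformers 4.17 : Guides : Fine-tuning for downstream tasks – Multiple choice

05/03/2022

HuggingFace Transformers 4.17 : Guides : Fine-tuning for downstream tasks – Multiple choice (translation / commentary)

Translation : ClassCat Co., Ltd. Sales Information
Created : 05/03/2022 (v4.17.0)

* This page is a translation, with supplementary explanations where appropriate, of the following HuggingFace Transformers document:

How-To Guide : Fine-Tune for Downstream Tasks : Multiple choice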

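HuggingFace Transformers : Guides : Fine-tuning for downstream tasks – Multiple choice

A multiple choice task is similar to question answering, except that several candidate answers are provided along with a context. The model is trained to select the correct answer from the multiple inputs given the context.

This guide shows how to fine-tune BERT on the regular configuration of the SWAG dataset to select the best answer given multiple options and some context.

Load the SWAG dataset

Load the SWAG dataset from the Datasets library: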
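
The loading call itself is not shown on this page. A minimal sketch, assuming the `swag` dataset identifier on the Hub and the `regular` configuration named in the introduction; `load_swag` is a hypothetical wrapper, used here only so that the download happens when you call it:

```python
def load_swag():
    """Hypothetical wrapper: download SWAG ("regular" configuration) from the Hub."""
    # Imported lazily so that defining the function needs neither the
    # datasets library nor network access.
    from datasets import load_dataset

    return load_dataset("swag", "regular")
```

Calling `swag = load_swag()` would then give the `swag` object used in the rest of this guide.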

Then take a look at an example:

swag["train"][0]

{'ending0': 'passes by walking down the street playing their instruments.',
 'ending1': 'has heard approaching them.',
 'ending2': "arrives and they're outside dancing and asleep.",
 'ending3': 'turns the lead singer watches the performance.',
 'fold-ind': '3416',
 'gold-source': 'gold',
 'label': 0,
 'sent1': 'Members of the procession walk down the street holding small horn brass instruments.',
 'sent2': 'A drum line',
 'startphrase': 'Members of the procession walk down the street holding small horn brass instruments. A drum line',
 'video-id': 'anetv_jkn6uvmqwh4'}

The sent1 and sent2 fields show how a sentence begins, and each ending field shows how the sentence could end. Given the sentence beginning, the model must pick the correct sentence ending, as indicated by the label field.
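
To make the task concrete, the four candidate sentences for the sample above can be assembled by hand. This is only an illustration of how the fields relate to each other, not part of the training pipeline:

```python
# Fields copied from the swag["train"][0] sample shown above.
example = {
    "sent1": "Members of the procession walk down the street holding small horn brass instruments.",
    "sent2": "A drum line",
    "ending0": "passes by walking down the street playing their instruments.",
    "ending1": "has heard approaching them.",
    "ending2": "arrives and they're outside dancing and asleep.",
    "ending3": "turns the lead singer watches the performance.",
    "label": 0,
}

# The sentence start is sent1 followed by sent2 (this equals the startphrase field).
start = f"{example['sent1']} {example['sent2']}"

# One candidate per ending; the label is the index of the correct one.
candidates = [f"{start} {example['ending' + str(i)]}" for i in range(4)]
correct = candidates[example["label"]]

for index, candidate in enumerate(candidates):
    marker = "*" if index == example["label"] else " "
    print(marker, candidate)
```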

Preprocess

Load the BERT tokenizer to process the start of each sentence and the four possible endings:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

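The preprocessing function needs to:

Make four copies of the sent1 field so that each of them can be combined with sent2 to recreate how a sentence starts.
Combine sent2 with each of the four possible sentence endings.
Flatten these two lists so they can be tokenized, and then unflatten them afterward so that each example has corresponding input_ids, attention_mask, and labels fields.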

ending_names = ["ending0", "ending1", "ending2", "ending3"]


def preprocess_function(examples):
    first_sentences = [[context] * 4 for context in examples["sent1"]]
    question_headers = examples["sent2"]
    second_sentences = [
        [f"{header} {examples[end][i]}" for end in ending_names] for i, header in enumerate(question_headers)
    ]

    first_sentences = sum(first_sentences, [])
    second_sentences = sum(second_sentences, [])

    tokenized_examples = tokenizer(first_sentences, second_sentences, truncation=True)
    return {k: [v[i : i + 4] for i in range(0, len(v), 4)] for k, v in tokenized_examples.items()}
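
To see what the flattening and regrouping in preprocess_function do without loading a tokenizer, here is a toy sketch. `fake_tokenize` is a hypothetical stand-in that merely splits on whitespace (the real tokenizer returns input_ids and attention_mask instead), and the batch is made up.

```python
ending_names = ["ending0", "ending1", "ending2", "ending3"]

# A tiny made-up batch in the same column layout as SWAG.
examples = {
    "sent1": ["She opens the door.", "He picks up a ball."],
    "sent2": ["She", "He"],
    "ending0": ["walks in.", "throws it."],
    "ending1": ["flies away.", "eats it."],
    "ending2": ["sings a door.", "reads it."],
    "ending3": ["melts.", "paints the sky."],
}


def fake_tokenize(first, second):
    """Hypothetical stand-in for the tokenizer: one token list per sentence pair."""
    return {"tokens": [(a + " " + b).split() for a, b in zip(first, second)]}


# Four copies of each context, and each header joined with its four endings.
first_sentences = [[context] * 4 for context in examples["sent1"]]
second_sentences = [
    [f"{header} {examples[end][i]}" for end in ending_names]
    for i, header in enumerate(examples["sent2"])
]

# Flatten: 2 examples x 4 choices -> 8 sentence pairs.
first_flat = sum(first_sentences, [])
second_flat = sum(second_sentences, [])
tokenized = fake_tokenize(first_flat, second_flat)

# Unflatten: regroup every 4 consecutive entries into one example.
regrouped = {k: [v[i : i + 4] for i in range(0, len(v), 4)] for k, v in tokenized.items()}

print(len(first_flat), len(regrouped["tokens"]), len(regrouped["tokens"][0]))
```

After regrouping, each example again owns exactly four entries, one per candidate ending, which is the layout the multiple choice model expects.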

Use the Datasets map function to apply the preprocessing function over the entire dataset. You can speed up map by setting batched=True to process multiple elements of the dataset at once:

tokenized_swag = swag.map(preprocess_function, batched=True)

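Transformers does not provide a data collator for multiple choice, so you need to create one. You can adapt DataCollatorWithPadding to create a batch of examples for multiple choice. It also dynamically pads the text and labels to the length of the longest element in the batch, so they are a uniform length.

DataCollatorForMultipleChoice flattens all the model inputs, applies padding, and then unflattens the results: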

PyTorch

from dataclasses import dataclass
from transformers.tokenization_utils_base import PreTrainedTokenizerBase, PaddingStrategy
from typing import Optional, Union
import torch


@dataclass
class DataCollatorForMultipleChoice:
    """
    Data collator that will dynamically pad the inputs for multiple choice received.
    """

    tokenizer: PreTrainedTokenizerBase
    padding: Union[bool, str, PaddingStrategy] = True
    max_length: Optional[int] = None
    pad_to_multiple_of: Optional[int] = None

    def __call__(self, features):
        label_name = "label" if "label" in features[0].keys() else "labels"
        labels = [feature.pop(label_name) for feature in features]
        batch_size = len(features)
        num_choices = len(features[0]["input_ids"])
        flattened_features = [
            [{k: v[i] for k, v in feature.items()} for i in range(num_choices)] for feature in features
        ]
        flattened_features = sum(flattened_features, [])

        batch = self.tokenizer.pad(
            flattened_features,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors="pt",
        )

        batch = {k: v.view(batch_size, num_choices, -1) for k, v in batch.items()}
        batch["labels"] = torch.tensor(labels, dtype=torch.int64)
        return batch

TensorFlow

from dataclasses import dataclass
from transformers.tokenization_utils_base import PreTrainedTokenizerBase, PaddingStrategy
from typing import Optional, Union
import tensorflow as tf


@dataclass
class DataCollatorForMultipleChoice:
    """
    Data collator that will dynamically pad the inputs for multiple choice received.
    """

    tokenizer: PreTrainedTokenizerBase
    padding: Union[bool, str, PaddingStrategy] = True
    max_length: Optional[int] = None
    pad_to_multiple_of: Optional[int] = None

    def __call__(self, features):
        label_name = "label" if "label" in features[0].keys() else "labels"
        labels = [feature.pop(label_name) for feature in features]
        batch_size = len(features)
        num_choices = len(features[0]["input_ids"])
        flattened_features = [
            [{k: v[i] for k, v in feature.items()} for i in range(num_choices)] for feature in features
        ]
        flattened_features = sum(flattened_features, [])

        batch = self.tokenizer.pad(
            flattened_features,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors="tf",
        )

        batch = {k: tf.reshape(v, (batch_size, num_choices, -1)) for k, v in batch.items()}
        batch["labels"] = tf.convert_to_tensor(labels, dtype=tf.int64)
        return batch
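
The flatten, pad, unflatten sequence of the collator can be illustrated without PyTorch or TensorFlow. The sketch below works on plain lists. `collate_multiple_choice` is a hypothetical helper and the pad id 0 is an assumption for illustration (0 is the [PAD] id of bert-base-uncased, but the real collator delegates padding to tokenizer.pad).

```python
def collate_multiple_choice(features, pad_id=0):
    """Hypothetical list-based sketch of the collator above.

    features: list of dicts with "input_ids" (num_choices lists of ints)
    and "label" (int). Returns nested lists shaped
    (batch_size, num_choices, max_len) plus the labels.
    """
    labels = [feature["label"] for feature in features]
    batch_size = len(features)
    num_choices = len(features[0]["input_ids"])

    # Flatten: one row per (example, choice) pair.
    flat = [ids for feature in features for ids in feature["input_ids"]]

    # Pad every row to the longest row in this batch (dynamic padding).
    max_len = max(len(ids) for ids in flat)
    padded = [ids + [pad_id] * (max_len - len(ids)) for ids in flat]
    mask = [[1] * len(ids) + [0] * (max_len - len(ids)) for ids in flat]

    # Unflatten: regroup rows into (batch_size, num_choices, max_len).
    def regroup(rows):
        return [rows[i * num_choices:(i + 1) * num_choices] for i in range(batch_size)]

    return {"input_ids": regroup(padded), "attention_mask": regroup(mask), "labels": labels}


# Made-up token ids: 2 examples with 2 choices each.
features = [
    {"input_ids": [[101, 7, 102], [101, 7, 8, 9, 102]], "label": 1},
    {"input_ids": [[101, 5, 6, 102], [101, 102]], "label": 0},
]
batch = collate_multiple_choice(features)
print(batch["input_ids"][1][1], batch["attention_mask"][1][1], batch["labels"])
```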

Fine-tune with Trainer

Load BERT with AutoModelForMultipleChoice:

from transformers import AutoModelForMultipleChoice, TrainingArguments, Trainer

model = AutoModelForMultipleChoice.from_pretrained("bert-base-uncased")

At this point, only three steps remain:

Define the training hyperparameters in TrainingArguments.
Pass the training arguments to Trainer along with the model, dataset, tokenizer, and data collator.
Call train() to fine-tune the model.

training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_swag["train"],
    eval_dataset=tokenized_swag["validation"],
    tokenizer=tokenizer,
    data_collator=DataCollatorForMultipleChoice(tokenizer=tokenizer),
)

trainer.train()

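Fine-tune with TensorFlow

Fine-tuning a model in TensorFlow is just as easy, with only a few differences.

Convert the datasets to the tf.data.Dataset format with to_tf_dataset. Specify the inputs in columns, the targets in label_cols, whether to shuffle the dataset order, the batch size, and the data collator: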

batch_size = 16  # defined before its first use; the same value is set again in the optimizer setup
data_collator = DataCollatorForMultipleChoice(tokenizer=tokenizer)
tf_train_set = tokenized_swag["train"].to_tf_dataset(
    columns=["attention_mask", "input_ids"],
    label_cols=["labels"],
    shuffle=True,
    batch_size=batch_size,
    collate_fn=data_collator,
)

tf_validation_set = tokenized_swag["validation"].to_tf_dataset(
    columns=["attention_mask", "input_ids"],
    label_cols=["labels"],
    shuffle=False,
    batch_size=batch_size,
    collate_fn=data_collator,
)

Set up an optimizer function, a learning rate schedule, and some training hyperparameters:

from transformers import create_optimizer

batch_size = 16
num_train_epochs = 2
total_train_steps = (len(tokenized_swag["train"]) // batch_size) * num_train_epochs
optimizer, schedule = create_optimizer(init_lr=5e-5, num_warmup_steps=0, num_train_steps=total_train_steps)
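
As a quick check of the step computation above, the arithmetic can be run on its own. The training-set size of 73,546 used below is an assumption about the regular SWAG configuration; replace it with len(tokenized_swag["train"]) for the real value. `total_training_steps` is a hypothetical helper.

```python
def total_training_steps(num_examples, batch_size, num_epochs):
    """Same arithmetic as above: full batches per epoch times number of epochs."""
    return (num_examples // batch_size) * num_epochs


# Assumption: size of the SWAG "regular" training split.
assumed_train_size = 73546
steps = total_training_steps(assumed_train_size, batch_size=16, num_epochs=2)
print(steps)
```

Note that the floor division ignores the last partial batch of each epoch, so the schedule is planned for very slightly fewer steps than the number of batches actually seen.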

Load BERT with TFAutoModelForMultipleChoice:

from transformers import TFAutoModelForMultipleChoice

model = TFAutoModelForMultipleChoice.from_pretrained("bert-base-uncased")

Configure the model for training with compile:

model.compile(
    optimizer=optimizer,
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)
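
With from_logits=True, the loss applies a softmax to the model's raw scores over the four choices and takes the negative log-probability of the labeled choice. Below is a plain-Python sketch of that computation for a single example, with made-up logits; `sparse_cross_entropy` is a hypothetical helper, not the Keras implementation.

```python
import math


def sparse_cross_entropy(logits, label):
    """Negative log-probability of `label` under softmax(logits)."""
    shift = max(logits)  # subtract the max for numerical stability
    log_sum_exp = shift + math.log(sum(math.exp(x - shift) for x in logits))
    return log_sum_exp - logits[label]


# Made-up scores for the four endings of one example.
logits = [2.0, 0.5, -1.0, 0.0]
loss_if_label_0 = sparse_cross_entropy(logits, label=0)
loss_if_label_2 = sparse_cross_entropy(logits, label=2)
print(loss_if_label_0, loss_if_label_2)
```

The loss is small when the labeled ending has the highest score and grows as the model puts more weight on the other endings.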

Call fit to fine-tune the model:

model.fit(x=tf_train_set, validation_data=tf_validation_set, epochs=2)

That's all.

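HuggingFace Transformers 4.17 : Guides : Fine-tuning for downstream tasks – Summarization

05/02/2022

HuggingFace Transformers 4.17 : Guides : Fine-tuning for downstream tasks – Summarization (translation / commentary)

Translation : ClassCat Co., Ltd. Sales Information
Created : 05/02/2022 (v4.17.0)

* This page is a translation, with supplementary explanations where appropriate, of the following HuggingFace Transformers document:

How-To Guide : Fine-Tune for Downstream Tasks : Summarization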

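HuggingFace Transformers : Guides : Fine-tuning for downstream tasks – Summarization

Summarization creates a shorter version of a document or article that captures all the important information. Along with translation, it is another example of a task that can be formulated as a sequence-to-sequence task. Summarization can be:

Extractive : extract the most relevant information from a document.
Abstractive : generate new text that captures the most relevant information.

This guide shows how to fine-tune T5 on the California state bill subset of the BillSum dataset for abstractive summarization.

Note : See the summarization task page for more information about the associated models, datasets, and metrics.

Load the BillSum dataset

Load the BillSum dataset from the Datasets library: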

from datasets import load_dataset

billsum = load_dataset("billsum", split="ca_test")

Split this dataset into a train and test set:

billsum = billsum.train_test_split(test_size=0.2)

Then take a look at an example:

billsum["train"][0]

{'summary': 'Existing law authorizes state agencies to enter into contracts for the acquisition of goods or services upon approval by the Department of General Services. Existing law sets forth various requirements and prohibitions for those contracts, including, but not limited to, a prohibition on entering into contracts for the acquisition of goods or services of $100,000 or more with a contractor that discriminates between spouses and domestic partners or same-sex and different-sex couples in the provision of benefits. Existing law provides that a contract entered into in violation of those requirements and prohibitions is void and authorizes the state or any person acting on behalf of the state to bring a civil action seeking a determination that a contract is in violation and therefore void. Under existing law, a willful violation of those requirements and prohibitions is a misdemeanor.\nThis bill would also prohibit a state agency from entering into contracts for the acquisition of goods or services of $100,000 or more with a contractor that discriminates between employees on the basis of gender identity in the provision of benefits, as specified. By expanding the scope of a crime, this bill would impose a state-mandated local program.\nThe California Constitution requires the state to reimburse local agencies and school districts for certain costs mandated by the state. Statutory provisions establish procedures for making that reimbursement.\nThis bill would provide that no reimbursement is required by this act for a specified reason.',
 'text': 'The people of the State of California do enact as follows:\n\n\nSECTION 1.\nSection 10295.35 is added to the Public Contract Code, to read:\n10295.35.\n(a) (1) Notwithstanding any other law, a state agency shall not enter into any contract for the acquisition of goods or services in the amount of one hundred thousand dollars ($100,000) or more with a contractor that, in the provision of benefits, discriminates between employees on the basis of an employee’s or dependent’s actual or perceived gender identity, including, but not limited to, the employee’s or dependent’s identification as transgender.\n(2) For purposes of this section, “contract” includes contracts with a cumulative amount of one hundred thousand dollars ($100,000) or more per contractor in each fiscal year.\n(3) For purposes of this section, an employee health plan is discriminatory if the plan is not consistent with Section 1365.5 of the Health and Safety Code and Section 10140 of the Insurance Code.\n(4) The requirements of this section shall apply only to those portions of a contractor’s operations that occur under any of the following conditions:\n(A) Within the state.\n(B) On real property outside the state if the property is owned by the state or if the state has a right to occupy the property, and if the contractor’s presence at that location is connected to a contract with the state.\n(C) Elsewhere in the United States where work related to a state contract is being performed.\n(b) Contractors shall treat as confidential, to the maximum extent allowed by law or by the requirement of the contractor’s insurance provider, any request by an employee or applicant for employment benefits or any documentation of eligibility for benefits submitted by an employee or applicant for employment.\n(c) After taking all reasonable measures to find a contractor that complies with this section, as determined by the state agency, the requirements of this section may be waived under any of the 
following circumstances:\n(1) There is only one prospective contractor willing to enter into a specific contract with the state agency.\n(2) The contract is necessary to respond to an emergency, as determined by the state agency, that endangers the public health, welfare, or safety, or the contract is necessary for the provision of essential services, and no entity that complies with the requirements of this section capable of responding to the emergency is immediately available.\n(3) The requirements of this section violate, or are inconsistent with, the terms or conditions of a grant, subvention, or agreement, if the agency has made a good faith attempt to change the terms or conditions of any grant, subvention, or agreement to authorize application of this section.\n(4) The contractor is providing wholesale or bulk water, power, or natural gas, the conveyance or transmission of the same, or ancillary services, as required for ensuring reliable services in accordance with good utility practice, if the purchase of the same cannot practically be accomplished through the standard competitive bidding procedures and the contractor is not providing direct retail services to end users.\n(d) (1) A contractor shall not be deemed to discriminate in the provision of benefits if the contractor, in providing the benefits, pays the actual costs incurred in obtaining the benefit.\n(2) If a contractor is unable to provide a certain benefit, despite taking reasonable measures to do so, the contractor shall not be deemed to discriminate in the provision of benefits.\n(e) (1) Every contract subject to this chapter shall contain a statement by which the contractor certifies that the contractor is in compliance with this section.\n(2) The department or other contracting agency shall enforce this section pursuant to its existing enforcement powers.\n(3) (A) If a contractor falsely certifies that it is in compliance with this section, the contract with that contractor shall be subject 
to Article 9 (commencing with Section 10420), unless, within a time period specified by the department or other contracting agency, the contractor provides to the department or agency proof that it has complied, or is in the process of complying, with this section.\n(B) The application of the remedies or penalties contained in Article 9 (commencing with Section 10420) to a contract subject to this chapter shall not preclude the application of any existing remedies otherwise available to the department or other contracting agency under its existing enforcement powers.\n(f) Nothing in this section is intended to regulate the contracting practices of any local jurisdiction.\n(g) This section shall be construed so as not to conflict with applicable federal laws, rules, or regulations. In the event that a court or agency of competent jurisdiction holds that federal law, rule, or regulation invalidates any clause, sentence, paragraph, or section of this code or the application thereof to any person or circumstances, it is the intent of the state that the court or agency sever that clause, sentence, paragraph, or section so that the remainder of this section shall remain in effect.\nSEC. 2.\nSection 10295.35 of the Public Contract Code shall not be construed to create any new enforcement authority or responsibility in the Department of General Services or any other contracting agency.\nSEC. 3.\nNo reimbursement is required by this act pursuant to Section 6 of Article XIII\u2009B of the California Constitution because the only costs that may be incurred by a local agency or school district will be incurred because this act creates a new crime or infraction, eliminates a crime or infraction, or changes the penalty for a crime or infraction, within the meaning of Section 17556 of the Government Code, or changes the definition of a crime within the meaning of Section 6 of Article XIII\u2009B of the California Constitution.',
 'title': 'An act to add Section 10295.35 to the Public Contract Code, relating to public contracts.'}

The text field is the input and the summary field is the target.

Preprocessing

Load the T5 tokenizer to process text and summary:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")

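The preprocessing function needs to:

Prefix the input with a prompt so T5 knows this is a summarization task. Some models capable of multiple NLP tasks require prompting for specific tasks.
Use the as_target_tokenizer() context manager when tokenizing the labels, so that inputs and labels are tokenized in parallel.
Truncate sequences to be no longer than the maximum length set by the max_length parameter.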

prefix = "summarize: "


def preprocess_function(examples):
    inputs = [prefix + doc for doc in examples["text"]]
    model_inputs = tokenizer(inputs, max_length=1024, truncation=True)

    with tokenizer.as_target_tokenizer():
        labels = tokenizer(examples["summary"], max_length=128, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_billsum = billsum.map(preprocess_function, batched=True)

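Use DataCollatorForSeq2Seq to create a batch of examples. It also dynamically pads the text and labels to the length of the longest element in its batch, so they are a uniform length. While it is possible to pad the text in the tokenizer function by setting padding=True, dynamic padding is more efficient.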
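
To make dynamic padding concrete, here is a minimal pure-Python sketch. It is not the DataCollatorForSeq2Seq implementation: the helper pad_batch and the toy token ids are hypothetical, and it assumes a pad token id of 0 for the inputs and -100 as the label padding value that the loss ignores.

```python
# Minimal sketch of dynamic padding (not the DataCollatorForSeq2Seq source).
PAD_TOKEN_ID = 0      # assumption: pad token id used for the inputs
LABEL_PAD_ID = -100   # assumption: label positions set to -100 are ignored by the loss


def pad_batch(features):
    """Pad input_ids and labels to the longest sequence in this batch only."""
    max_input = max(len(f["input_ids"]) for f in features)
    max_label = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_pad = max_input - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [PAD_TOKEN_ID] * n_pad)
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * n_pad)
        batch["labels"].append(f["labels"] + [LABEL_PAD_ID] * (max_label - len(f["labels"])))
    return batch


# Two hypothetical tokenized examples of different lengths.
features = [
    {"input_ids": [21, 13, 7, 1], "labels": [5, 1]},
    {"input_ids": [21, 9, 1], "labels": [8, 4, 1]},
]
batch = pad_batch(features)
```

Because the target length is computed per batch, a batch of short examples is never padded out to the global maximum length, which is where the efficiency gain comes from.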

PyTorch

from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)

TensorFlow

from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model, return_tensors="tf")

Fine-tune with Trainer

Load T5 with AutoModelForSeq2SeqLM:

from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer

model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

At this point, only three steps remain:

Define your training hyperparameters in Seq2SeqTrainingArguments.
Pass the training arguments to Seq2SeqTrainer along with the model, dataset, tokenizer, and data collator.
Call train() to fine-tune the model.

training_args = Seq2SeqTrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=1,
    fp16=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_billsum["train"],
    eval_dataset=tokenized_billsum["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)

trainer.train()

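Fine-tune with TensorFlow

Fine-tuning a model in TensorFlow is just as easy, with only a few differences.

Convert the datasets to the tf.data.Dataset format with to_tf_dataset. Specify the inputs and labels in columns, whether to shuffle the dataset order, the batch size, and the data collator: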

tf_train_set = tokenized_billsum["train"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "labels"],
    shuffle=True,
    batch_size=16,
    collate_fn=data_collator,
)

tf_test_set = tokenized_billsum["test"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "labels"],
    shuffle=False,
    batch_size=16,
    collate_fn=data_collator,
)

Set up an optimizer function and some training hyperparameters (here, a fixed learning rate and a weight decay rate):

from transformers import create_optimizer, AdamWeightDecay

optimizer = AdamWeightDecay(learning_rate=2e-5, weight_decay_rate=0.01)

Load T5 with TFAutoModelForSeq2SeqLM:

from transformers import TFAutoModelForSeq2SeqLM

model = TFAutoModelForSeq2SeqLM.from_pretrained("t5-small")

Configure the model for training with compile:

model.compile(optimizer=optimizer)

Call fit to fine-tune the model:

model.fit(x=tf_train_set, validation_data=tf_test_set, epochs=3)

Note : For a more in-depth example of how to fine-tune a model for summarization, take a look at the corresponding PyTorch notebook or TensorFlow notebook.

That's all.

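HuggingFace Transformers 4.17 : Guide : Fine-tuning for Downstream Tasks – Translation

05/01/2022

HuggingFace Transformers 4.17 : Guide : Fine-tuning for Downstream Tasks – Translation (translation / commentary)

Translated by : ClassCat Sales Information
Created : 05/01/2022 (v4.17.0)

* This page is a translation of the following HuggingFace Transformers document, with supplementary explanations added where appropriate:

How-To Guide : Fine-Tune for Downstream Tasks : Translation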

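HuggingFace Transformers : Guide : Fine-tuning for Downstream Tasks – Translation

Translation converts a sequence of text from one language to another. It is one of several tasks that can be formulated as a sequence-to-sequence problem, a powerful framework that also extends to vision and audio tasks.

This guide shows how to fine-tune T5 on the English-French subset of the OPUS Books dataset to translate English text to French.

Note : See the translation task page for more information about the associated models, datasets, and metrics.

Load the OPUS Books dataset

Load the OPUS Books dataset from the Datasets library: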

from datasets import load_dataset

books = load_dataset("opus_books", "en-fr")

Split the dataset into a train and test set:

books = books["train"].train_test_split(test_size=0.2)

Then take a look at an example:

books["train"][0]

{'id': '90560',
 'translation': {'en': 'But this lofty plateau measured only a few fathoms, and soon we reentered Our Element.',
  'fr': 'Mais ce plateau élevé ne mesurait que quelques toises, et bientôt nous fûmes rentrés dans notre élément.'}}

The translation field is a dictionary containing the English and French translations of the text.

Preprocessing

Load the T5 tokenizer to process the language pairs:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")

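The preprocessing function needs to:

Prefix the input with a prompt so T5 knows this is a translation task. Some models capable of multiple NLP tasks require prompting for specific tasks.
Tokenize the input (English) and target (French) separately. You cannot tokenize French text with a tokenizer pretrained on an English vocabulary. A context manager helps set the tokenizer to French before tokenizing the French text.
Truncate sequences to be no longer than the maximum length set by the max_length parameter.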
```
source_lang = "en"
target_lang = "fr"
prefix = "translate English to French: "


def preprocess_function(examples):
    inputs = [prefix + example[source_lang] for example in examples["translation"]]
    targets = [example[target_lang] for example in examples["translation"]]
    model_inputs = tokenizer(inputs, max_length=128, truncation=True)

    with tokenizer.as_target_tokenizer():
        labels = tokenizer(targets, max_length=128, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
```
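
To see what the first two lines of preprocess_function produce before any tokenization happens, the sketch below runs them on a hand-written batch. The batch layout (a dict of lists, with translation holding one dict per example) is an assumption about what map passes in when batched=True, and the sentences are illustrative, not taken from OPUS Books.

```python
source_lang = "en"
target_lang = "fr"
prefix = "translate English to French: "

# Hypothetical batch in the same layout as the dataset rows shown earlier.
examples = {
    "id": ["0", "1"],
    "translation": [
        {"en": "Hello.", "fr": "Bonjour."},
        {"en": "Thank you.", "fr": "Merci."},
    ],
}

# Same list comprehensions as in preprocess_function.
inputs = [prefix + pair[source_lang] for pair in examples["translation"]]
targets = [pair[target_lang] for pair in examples["translation"]]

for src, tgt in zip(inputs, targets):
    print(repr(src), "->", repr(tgt))
```

Only the source side receives the task prefix; the targets are passed to the tokenizer unchanged.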
Use the Datasets map function to apply the preprocessing function over the entire dataset. You can speed up the map function by setting batched=True to process multiple elements of the dataset at once:
```
tokenized_books = books.map(preprocess_function, batched=True)
```
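Use DataCollatorForSeq2Seq to create a batch of examples. It also dynamically pads the text and labels to the length of the longest element in its batch, so they are a uniform length. While it is possible to pad the text in the tokenizer function by setting padding=True, dynamic padding is more efficient.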

PyTorch
```
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)
```
TensorFlow
```
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model, return_tensors="tf")
```
Fine-tune with Trainer

Load T5 with AutoModelForSeq2SeqLM:
```
from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer

model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
```
At this point, only three steps remain:
1. Define your training hyperparameters in Seq2SeqTrainingArguments.
2. Pass the training arguments to Seq2SeqTrainer along with the model, dataset, tokenizer, and data collator.
3. Call train() to fine-tune the model.
```
training_args = Seq2SeqTrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=1,
    fp16=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_books["train"],
    eval_dataset=tokenized_books["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)

trainer.train()
```
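Fine-tune with TensorFlow

Fine-tuning a model in TensorFlow is just as easy, with only a few differences.

Convert the datasets to the tf.data.Dataset format with to_tf_dataset. Specify the inputs and labels in columns, whether to shuffle the dataset order, the batch size, and the data collator: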
```
tf_train_set = tokenized_books["train"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "labels"],
    shuffle=True,
    batch_size=16,
    collate_fn=data_collator,
)

tf_test_set = tokenized_books["test"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "labels"],
    shuffle=False,
    batch_size=16,
    collate_fn=data_collator,
)
```
Set up an optimizer function and some training hyperparameters (here, a fixed learning rate and a weight decay rate):
```
from transformers import create_optimizer, AdamWeightDecay

optimizer = AdamWeightDecay(learning_rate=2e-5, weight_decay_rate=0.01)
```
Load T5 with TFAutoModelForSeq2SeqLM:
```
from transformers import TFAutoModelForSeq2SeqLM

model = TFAutoModelForSeq2SeqLM.from_pretrained("t5-small")
```
Configure the model for training with compile:
```
model.compile(optimizer=optimizer)
```
Call fit to fine-tune the model:
```
model.fit(x=tf_train_set, validation_data=tf_test_set, epochs=3)
```
Note : For a more in-depth example of how to fine-tune a model for translation, take a look at the corresponding PyTorch notebook or TensorFlow notebook.

That's all.

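HuggingFace Transformers 4.17 : Guide : Fine-tuning for Downstream Tasks – Language Modeling

04/29/2022

HuggingFace Transformers 4.17 : Guide : Fine-tuning for Downstream Tasks – Language Modeling (translation / commentary)

Translated by : ClassCat Sales Information
Created : 04/29/2022 (v4.17.0)

* This page is a translation of the following HuggingFace Transformers document, with supplementary explanations added where appropriate:

How-To Guide : Fine-Tune for Downstream Tasks : Language modeling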

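HuggingFace Transformers : Guide : Fine-tuning for Downstream Tasks – Language Modeling

Language modeling predicts words in a sentence. There are two forms of language modeling.

Causal language modeling predicts the next token in a sequence of tokens, and the model can only attend to tokens on the left.

Masked language modeling predicts a masked token in a sequence, and the model can attend to tokens bidirectionally.

This guide shows how to fine-tune DistilGPT2 for causal language modeling and DistilRoBERTa for masked language modeling on the r/askscience subset of the ELI5 dataset.

Note : You can fine-tune other architectures for language modeling, such as GPT-Neo, GPT-J, and BERT, by following the same steps presented in this guide.
See the text generation task page and the fill mask task page for more information about the associated models, datasets, and metrics.

Load the ELI5 dataset

Load only the first 5000 rows of the ELI5 dataset from the Datasets library, since it is fairly large: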

from datasets import load_dataset

eli5 = load_dataset("eli5", split="train_asks[:5000]")

Split this dataset into a train and test set:

eli5 = eli5.train_test_split(test_size=0.2)

Then take a look at an example:

eli5["train"][0]

{'answers': {'a_id': ['c3d1aib', 'c3d4lya'],
  'score': [6, 3],
  'text': ["The velocity needed to remain in orbit is equal to the square root of Newton's constant times the mass of earth divided by the distance from the center of the earth. I don't know the altitude of that specific mission, but they're usually around 300 km. That means he's going 7-8 km/s.\n\nIn space there are no other forces acting on either the shuttle or the guy, so they stay in the same position relative to each other. If he were to become unable to return to the ship, he would presumably run out of oxygen, or slowly fall into the atmosphere and burn up.",
   "Hope you don't mind me asking another question, but why aren't there any stars visible in this photo?"]},
 'answers_urls': {'url': []},
 'document': '',
 'q_id': 'nyxfp',
 'selftext': '_URL_0_\n\nThis was on the front page earlier and I have a few questions about it. Is it possible to calculate how fast the astronaut would be orbiting the earth? Also how does he stay close to the shuttle so that he can return safely, i.e is he orbiting at the same speed and can therefore stay next to it? And finally if his propulsion system failed, would he eventually re-enter the atmosphere and presumably die?',
 'selftext_urls': {'url': ['http://apod.nasa.gov/apod/image/1201/freeflyer_nasa_3000.jpg']},
 'subreddit': 'askscience',
 'title': 'Few questions about this space walk photograph.',
 'title_urls': {'url': []}}

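Notice that text is a subfield nested inside the answers dictionary. When you preprocess the dataset, you will need to extract the text subfield into a separate column.

Preprocessing

For causal language modeling, load the DistilGPT2 tokenizer to process the text subfield: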

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")

For masked language modeling, load the DistilRoBERTa tokenizer instead:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilroberta-base")

Extract the text subfield from its nested structure with the flatten method:

eli5 = eli5.flatten()
eli5["train"][0]

{'answers.a_id': ['c3d1aib', 'c3d4lya'],
 'answers.score': [6, 3],
 'answers.text': ["The velocity needed to remain in orbit is equal to the square root of Newton's constant times the mass of earth divided by the distance from the center of the earth. I don't know the altitude of that specific mission, but they're usually around 300 km. That means he's going 7-8 km/s.\n\nIn space there are no other forces acting on either the shuttle or the guy, so they stay in the same position relative to each other. If he were to become unable to return to the ship, he would presumably run out of oxygen, or slowly fall into the atmosphere and burn up.",
  "Hope you don't mind me asking another question, but why aren't there any stars visible in this photo?"],
 'answers_urls.url': [],
 'document': '',
 'q_id': 'nyxfp',
 'selftext': '_URL_0_\n\nThis was on the front page earlier and I have a few questions about it. Is it possible to calculate how fast the astronaut would be orbiting the earth? Also how does he stay close to the shuttle so that he can return safely, i.e is he orbiting at the same speed and can therefore stay next to it? And finally if his propulsion system failed, would he eventually re-enter the atmosphere and presumably die?',
 'selftext_urls.url': ['http://apod.nasa.gov/apod/image/1201/freeflyer_nasa_3000.jpg'],
 'subreddit': 'askscience',
 'title': 'Few questions about this space walk photograph.',
 'title_urls.url': []}

Each subfield is now a separate column, as indicated by the answers prefix. Notice that answers.text is a list. Instead of tokenizing each sentence separately, convert the list to a string so the sentences can be tokenized jointly.
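
As a rough illustration of what flattening does to a nested record, the sketch below reimplements the idea in plain Python. The helper flatten_record is hypothetical (it is not the Datasets flatten method), and the record is a shortened stand-in for an ELI5 row.

```python
def flatten_record(record, parent_key=""):
    """Turn nested dicts into a flat dict with dotted keys (illustrative helper)."""
    flat = {}
    for key, value in record.items():
        name = f"{parent_key}.{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten_record(value, name))
        else:
            flat[name] = value
    return flat


# Shortened, hypothetical stand-in for one ELI5 row.
record = {
    "q_id": "nyxfp",
    "answers": {
        "a_id": ["c3d1aib", "c3d4lya"],
        "score": [6, 3],
        "text": ["first answer", "second answer"],
    },
}

flat = flatten_record(record)
# answers.text is a list of strings; join it into one string before tokenizing.
joined = " ".join(flat["answers.text"])
```

Joining with a single space mirrors the " ".join(x) call used in the preprocessing function of this guide.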

Here is how you can create a preprocessing function that converts the list to a string and truncates sequences to be no longer than DistilGPT2's maximum input length:

def preprocess_function(examples):
    return tokenizer([" ".join(x) for x in examples["answers.text"]], truncation=True)

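Use the Datasets map function to apply the preprocessing function over the entire dataset. You can speed up the map function by setting batched=True to process multiple elements of the dataset at once and by increasing the number of processes with num_proc. Remove the columns you don't need: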

tokenized_eli5 = eli5.map(
    preprocess_function,
    batched=True,
    num_proc=4,
    remove_columns=eli5["train"].column_names,
)

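Next, you need a second preprocessing function that captures text truncated from any lengthy examples, to prevent information loss. This preprocessing function should:

Concatenate all the text.
Split the concatenated text into smaller chunks defined by block_size.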

block_size = 128


def group_texts(examples):
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

Apply the group_texts function over the entire dataset:

lm_dataset = tokenized_eli5.map(group_texts, batched=True, num_proc=4)
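
The effect of group_texts is easier to see on a tiny input. The sketch below copies the function from this guide and runs it with block_size = 4 on a hand-made batch; the token ids are illustrative placeholders.

```python
block_size = 4  # tiny value for illustration; the guide uses 128


def group_texts(examples):
    # Concatenate every column across the batch, then cut into fixed-size chunks.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result


# Three hypothetical tokenized examples with 3, 5 and 2 tokens.
batch = {
    "input_ids": [[1, 2, 3], [4, 5, 6, 7, 8], [9, 10]],
    "attention_mask": [[1, 1, 1], [1, 1, 1, 1, 1], [1, 1]],
}
grouped = group_texts(batch)
```

Ten tokens become two full blocks and one shorter final block. This version keeps that remainder, so blocks within a batch can still differ in length, which is one reason the data collator pads afterwards.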

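For causal language modeling, use DataCollatorForLanguageModeling to create a batch of examples. It also dynamically pads the text to the length of the longest element in its batch, so they are a uniform length. While it is possible to pad the text in the tokenizer function by setting padding=True, dynamic padding is more efficient.

You can use the end-of-sequence token as the padding token and set mlm=False. This uses the inputs as labels, shifted to the right by one element.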
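
The "shifted by one element" relationship can be sketched with plain lists. This is only an illustration of the idea under the assumption that the labels start as a copy of the inputs and that the shift is applied when the loss is computed; the token ids are made up.

```python
# Illustrative token ids for a four-token sequence.
input_ids = [464, 2068, 7586, 21831]

# With mlm=False the labels start out as a copy of the inputs.
labels = list(input_ids)

# For next-token prediction, the output at position t is scored
# against the token at position t + 1.
context_tokens = input_ids[:-1]
next_token_targets = labels[1:]
pairs = list(zip(context_tokens, next_token_targets))

for seen, target in pairs:
    print(f"after token {seen}, predict token {target}")
```

A sequence of n tokens therefore yields n - 1 prediction targets, and no separate label column has to be prepared by hand.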

PyTorch

from transformers import DataCollatorForLanguageModeling

tokenizer.pad_token = tokenizer.eos_token
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

TensorFlow

from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False, return_tensors="tf")

For masked language modeling, use the same DataCollatorForLanguageModeling, except that you should specify mlm_probability to randomly mask tokens each time you iterate over the data.
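
The following is a simplified sketch of random masking, not the collator's actual algorithm: it replaces every selected token with the mask id, whereas the library collator, as far as I know, also leaves some selected tokens unchanged or swaps in random tokens. The helper mask_tokens and the MASK_ID value are assumptions for illustration.

```python
import random

MASK_ID = 50264       # assumption: illustrative id standing in for the <mask> token
IGNORE_INDEX = -100   # positions with this label are skipped by the loss


def mask_tokens(input_ids, mlm_probability=0.15, seed=None):
    """Mask each token independently with probability mlm_probability."""
    rng = random.Random(seed)
    masked, labels = [], []
    for token in input_ids:
        if rng.random() < mlm_probability:
            masked.append(MASK_ID)        # hide the token from the model
            labels.append(token)          # and ask the model to recover it
        else:
            masked.append(token)
            labels.append(IGNORE_INDEX)   # not a prediction target
    return masked, labels


# Hypothetical token ids; a new seed (or None) gives a different mask each pass.
masked_ids, mlm_labels = mask_tokens([713, 16, 10, 7728, 3645, 4], seed=0)
```

Because the mask is drawn again on every call, the model sees different masked positions for the same text across epochs, which is the point of masking inside the collator rather than once during preprocessing.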

PyTorch

from transformers import DataCollatorForLanguageModeling

tokenizer.pad_token = tokenizer.eos_token
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

TensorFlow

from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15, return_tensors="tf")

Causal language modeling

Causal language modeling is frequently used for text generation. This section shows how to fine-tune DistilGPT2 to generate new text.

Fine-tune with Trainer

Load DistilGPT2 with AutoModelForCausalLM:

from transformers import AutoModelForCausalLM, TrainingArguments, Trainer

model = AutoModelForCausalLM.from_pretrained("distilgpt2")

At this point, only three steps remain:

Define your training hyperparameters in TrainingArguments.
Pass the training arguments to Trainer along with the model, datasets, and data collator.
Call train() to fine-tune the model.

training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_dataset["train"],
    eval_dataset=lm_dataset["test"],
    data_collator=data_collator,
)

trainer.train()

Fine-tune with TensorFlow

Fine-tuning a model in TensorFlow is just as easy, with only a few differences.

tf_train_set = lm_dataset["train"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "labels"],
    dummy_labels=True,
    shuffle=True,
    batch_size=16,
    collate_fn=data_collator,
)

tf_test_set = lm_dataset["test"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "labels"],
    dummy_labels=True,
    shuffle=False,
    batch_size=16,
    collate_fn=data_collator,
)

Set up an optimizer function and some training hyperparameters (here, a fixed learning rate and a weight decay rate):

from transformers import create_optimizer, AdamWeightDecay

optimizer = AdamWeightDecay(learning_rate=2e-5, weight_decay_rate=0.01)

Load DistilGPT2 with TFAutoModelForCausalLM:

from transformers import TFAutoModelForCausalLM

model = TFAutoModelForCausalLM.from_pretrained("distilgpt2")

Configure the model for training with compile:

import tensorflow as tf

model.compile(optimizer=optimizer)

Call fit to fine-tune the model:

model.fit(x=tf_train_set, validation_data=tf_test_set, epochs=3)

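Masked language modeling

Masked language modeling is also known as a fill-mask task because it predicts a masked token in a sequence. Models for masked language modeling require a good contextual understanding of the entire sequence instead of only the left context. This section shows how to fine-tune DistilRoBERTa to predict a masked word.

Fine-tune with Trainer

Load DistilRoBERTa with AutoModelForMaskedLM:

from transformers import AutoModelForMaskedLM, TrainingArguments, Trainer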

model = AutoModelForMaskedLM.from_pretrained("distilroberta-base")

At this point, only three steps remain:

Define your training hyperparameters in TrainingArguments.
Pass the training arguments to Trainer along with the model, datasets, and data collator.
Call train() to fine-tune the model.

training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    num_train_epochs=3,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_dataset["train"],
    eval_dataset=lm_dataset["test"],
    data_collator=data_collator,
)

trainer.train()

Fine-tune with TensorFlow

Fine-tuning a model in TensorFlow is just as easy, with only a few differences.

tf_train_set = lm_dataset["train"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "labels"],
    dummy_labels=True,
    shuffle=True,
    batch_size=16,
    collate_fn=data_collator,
)

tf_test_set = lm_dataset["test"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "labels"],
    dummy_labels=True,
    shuffle=False,
    batch_size=16,
    collate_fn=data_collator,
)

Set up an optimizer function, a learning rate, and some training hyperparameters:

from transformers import create_optimizer, AdamWeightDecay

optimizer = AdamWeightDecay(learning_rate=2e-5, weight_decay_rate=0.01)

Load DistilRoBERTa with TFAutoModelForMaskedLM:

from transformers import TFAutoModelForMaskedLM

model = TFAutoModelForMaskedLM.from_pretrained("distilroberta-base")

Configure the model for training with compile:

import tensorflow as tf

model.compile(optimizer=optimizer)

Call fit to fine-tune the model:

model.fit(x=tf_train_set, validation_data=tf_test_set, epochs=3)

Note : For a more in-depth example of how to fine-tune a model for causal language modeling, take a look at the corresponding PyTorch notebook or TensorFlow notebook.

That's all.

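HuggingFace Transformers 4.17 : Guide : Fine-tuning for Downstream Tasks – Question Answering

04/28/2022

HuggingFace Transformers 4.17 : Guide : Fine-tuning for Downstream Tasks – Question Answering (translation / commentary)

Translated by : ClassCat Sales Information
Created : 04/28/2022 (v4.17.0)

* This page is a translation of the following HuggingFace Transformers document, with supplementary explanations added where appropriate:

How-To Guide : Fine-Tune for Downstream Tasks : Question answering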

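HuggingFace Transformers : Guide : Fine-tuning for Downstream Tasks – Question Answering

Question answering tasks return an answer given a question. There are two common forms of question answering:

Extractive : extract the answer from the given context.
Abstractive : generate an answer from the context that correctly answers the question.

This guide shows how to fine-tune DistilBERT on the SQuAD dataset for extractive question answering.

Note : See the question answering task page for more information about other forms of question answering and their associated models, datasets, and metrics.

Load the SQuAD dataset

Load the SQuAD dataset from the Datasets library: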

from datasets import load_dataset

squad = load_dataset("squad")

Then take a look at an example:

squad["train"][0]

{'answers': {'answer_start': [515], 'text': ['Saint Bernadette Soubirous']},
 'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.',
 'id': '5733be284776f41900661182',
 'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?',
 'title': 'University_of_Notre_Dame'
}

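The answers field is a dictionary containing the starting position of the answer and the text of the answer.

Preprocessing

Load the DistilBERT tokenizer to process the question and context fields: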

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

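There are a few preprocessing steps particular to question answering that you should be aware of:

Some examples in a dataset may have a very long context that exceeds the maximum input length of the model. Truncate only the context by setting truncation="only_second".
Next, map the start and end positions of the answer to the original context by setting return_offsets_mapping=True.
With the mapping in hand, you can find the start and end tokens of the answer. Use the sequence_ids method to find which part of the offset corresponds to the question and which corresponds to the context.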
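
To illustrate the character-to-token mapping on its own, the sketch below uses whitespace-separated words as stand-in tokens and builds their character offsets by hand. This is an assumption made for readability: a real tokenizer produces subword tokens plus special tokens, and its offsets come from return_offsets_mapping=True. The shortened context sentence is adapted from the SQuAD example above.

```python
context = "It appeared to Saint Bernadette Soubirous in 1858."
answer_text = "Saint Bernadette Soubirous"
start_char = context.index(answer_text)
end_char = start_char + len(answer_text)

# Toy offsets: one (start, end) character span per whitespace-separated word.
offsets = []
pos = 0
for word in context.split():
    begin = context.index(word, pos)
    offsets.append((begin, begin + len(word)))
    pos = begin + len(word)

# Scan forward for the last token that starts at or before the answer start.
idx = 0
while idx < len(offsets) and offsets[idx][0] <= start_char:
    idx += 1
start_token = idx - 1

# Scan backward for the first token that ends at or after the answer end.
idx = len(offsets) - 1
while idx >= 0 and offsets[idx][1] >= end_char:
    idx -= 1
end_token = idx + 1

answer_tokens = context.split()[start_token : end_token + 1]
```

The two scans follow the same pattern as the loops in the preprocessing function of this guide, only without the question segment and special tokens.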

Here is how you can create a function to truncate the context and map the start and end tokens of the answer to it:

def preprocess_function(examples):
    questions = [q.strip() for q in examples["question"]]
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=384,
        truncation="only_second",
        return_offsets_mapping=True,
        padding="max_length",
    )

    offset_mapping = inputs.pop("offset_mapping")
    answers = examples["answers"]
    start_positions = []
    end_positions = []

    for i, offset in enumerate(offset_mapping):
        answer = answers[i]
        start_char = answer["answer_start"][0]
        end_char = answer["answer_start"][0] + len(answer["text"][0])
        sequence_ids = inputs.sequence_ids(i)

        # Find the start and end of the context
        idx = 0
        while sequence_ids[idx] != 1:
            idx += 1
        context_start = idx
        while sequence_ids[idx] == 1:
            idx += 1
        context_end = idx - 1

        # If the answer is not fully inside the context, label it (0, 0)
        if offset[context_start][0] > start_char or offset[context_end][1] < end_char:
            start_positions.append(0)
            end_positions.append(0)
        else:
            # Otherwise it's the start and end token positions
            idx = context_start
            while idx <= context_end and offset[idx][0] <= start_char:
                idx += 1
            start_positions.append(idx - 1)

            idx = context_end
            while idx >= context_start and offset[idx][1] >= end_char:
                idx -= 1
            end_positions.append(idx + 1)

    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions
    return inputs

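Use the Datasets map function to apply the preprocessing function over the entire dataset. You can speed up the map function by setting batched=True to process multiple elements of the dataset at once. Remove the columns you don't need: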

tokenized_squad = squad.map(preprocess_function, batched=True, remove_columns=squad["train"].column_names)

Use DefaultDataCollator to create a batch of examples. Unlike other data collators in Transformers, DefaultDataCollator does not apply additional preprocessing such as padding.

PyTorch

from transformers import DefaultDataCollator

data_collator = DefaultDataCollator()

TensorFlow

from transformers import DefaultDataCollator

data_collator = DefaultDataCollator(return_tensors="tf")

Fine-tune with Trainer

Load DistilBERT with AutoModelForQuestionAnswering:

from transformers import AutoModelForQuestionAnswering, TrainingArguments, Trainer

model = AutoModelForQuestionAnswering.from_pretrained("distilbert-base-uncased")

At this point, only three steps remain:

Define your training hyperparameters in TrainingArguments.
Pass the training arguments to Trainer along with the model, dataset, tokenizer, and data collator.
Call train() to fine-tune the model.

training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_squad["train"],
    eval_dataset=tokenized_squad["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)

trainer.train()

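Fine-tune with TensorFlow

Fine-tuning a model in TensorFlow is just as easy, with only a few differences.

Convert the datasets to the tf.data.Dataset format with to_tf_dataset. Specify the inputs and the start and end positions of the answer in columns, whether to shuffle the dataset order, the batch size, and the data collator: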

tf_train_set = tokenized_squad["train"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "start_positions", "end_positions"],
    dummy_labels=True,
    shuffle=True,
    batch_size=16,
    collate_fn=data_collator,
)

tf_validation_set = tokenized_squad["validation"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "start_positions", "end_positions"],
    dummy_labels=True,
    shuffle=False,
    batch_size=16,
    collate_fn=data_collator,
)

Set up an optimizer function, learning rate schedule, and some training hyperparameters:

from transformers import create_optimizer

batch_size = 16
num_epochs = 3  # keep in sync with the epochs argument passed to fit() below
total_train_steps = (len(tokenized_squad["train"]) // batch_size) * num_epochs
optimizer, schedule = create_optimizer(
    init_lr=2e-5,
    num_warmup_steps=0,
    num_train_steps=total_train_steps,
)
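
The step count passed as num_train_steps determines how long the learning rate schedule lasts. The sketch below works through the arithmetic with hypothetical numbers and models the schedule as a plain linear decay to zero with no warmup. That shape is an assumption about what create_optimizer builds with these arguments, and linear_decay_lr is an illustrative helper, not a library function.

```python
def linear_decay_lr(step, init_lr, total_steps):
    """Learning rate after `step` updates under a linear decay to zero (assumed shape)."""
    remaining = max(0.0, 1.0 - step / total_steps)
    return init_lr * remaining


# Hypothetical numbers, chosen only to keep the arithmetic easy to follow.
num_examples = 1000
batch_size = 16
num_epochs = 2
total_train_steps = (num_examples // batch_size) * num_epochs  # 62 * 2 = 124

for step in (0, total_train_steps // 2, total_train_steps):
    print(step, linear_decay_lr(step, 2e-5, total_train_steps))
```

Under this assumption, any update performed after total_train_steps runs with a learning rate of zero, so the number of epochs used to compute the step count should match the number of epochs passed to fit.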

Load DistilBERT with TFAutoModelForQuestionAnswering:

from transformers import TFAutoModelForQuestionAnswering

model = TFAutoModelForQuestionAnswering.from_pretrained("distilbert-base-uncased")

Configure the model for training with compile:

import tensorflow as tf

model.compile(optimizer=optimizer)

Call fit to fine-tune the model:

model.fit(x=tf_train_set, validation_data=tf_validation_set, epochs=3)

Note : For a more in-depth example of how to fine-tune a model for question answering, take a look at the corresponding PyTorch notebook or TensorFlow notebook.

That's all.

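HuggingFace Transformers 4.17 : Guide : Fine-tuning for Downstream Tasks – Token Classification

04/28/2022

HuggingFace Transformers 4.17 : Guide : Fine-tuning for Downstream Tasks – Token Classification (translation / commentary)

Translated by : ClassCat Sales Information
Created : 04/27/2022 (v4.17.0)

* This page is a translation of the following HuggingFace Transformers document, with supplementary explanations added where appropriate:

How-To Guide : Fine-Tune for Downstream Tasks : Token classification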

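HuggingFace Transformers : Guide : Fine-tuning for Downstream Tasks – Token Classification

Token classification assigns a label to individual tokens in a sentence. One of the most common token classification tasks is Named Entity Recognition (NER). NER attempts to find a label for each entity in a sentence, such as a person, location, or organization.

This guide shows how to fine-tune DistilBERT on the WNUT 17 dataset to detect new entities.

Note : See the token classification task page for more information about other forms of token classification and their associated models, datasets, and metrics.

Load the WNUT 17 dataset

Load the WNUT 17 dataset from the Datasets library: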

from datasets import load_dataset

wnut = load_dataset("wnut_17")

Then take a look at an example:

wnut["train"][0]

{'id': '0',
 'ner_tags': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7, 8, 8, 0, 7, 0, 0, 0, 0, 0, 0, 0, 0],
 'tokens': ['@paulwalk', 'It', "'s", 'the', 'view', 'from', 'where', 'I', "'m", 'living', 'for', 'two', 'weeks', '.', 'Empire', 'State', 'Building', '=', 'ESB', '.', 'Pretty', 'bad', 'storm', 'here', 'last', 'evening', '.']
}

Each number in ner_tags represents an entity. Convert the numbers to label names for more information:

label_list = wnut["train"].features["ner_tags"].feature.names
label_list

[
    "O",
    "B-corporation",
    "I-corporation",
    "B-creative-work",
    "I-creative-work",
    "B-group",
    "I-group",
    "B-location",
    "I-location",
    "B-person",
    "I-person",
    "B-product",
    "I-product",
]

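Each ner_tag describes an entity such as a corporation, location, or person. The letter that prefixes each ner_tag indicates the token position within the entity:

B- indicates the beginning of an entity.
I- indicates a token is contained inside the same entity (e.g., the State token is part of an entity like Empire State Building).
O indicates the token doesn't correspond to any entity.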
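
The prefixes above can be decoded back into entity spans with a short loop. The sketch below uses a slice of the WNUT 17 sample shown earlier together with the label list from this guide; decode_entities is an illustrative helper, not part of any library.

```python
label_list = [
    "O", "B-corporation", "I-corporation", "B-creative-work", "I-creative-work",
    "B-group", "I-group", "B-location", "I-location", "B-person", "I-person",
    "B-product", "I-product",
]

# A slice of the sample shown earlier in this guide.
tokens = ["weeks", ".", "Empire", "State", "Building", "=", "ESB", ".", "Pretty"]
ner_tags = [0, 0, 7, 8, 8, 0, 7, 0, 0]


def decode_entities(tokens, tag_ids):
    """Group B-/I- tagged tokens into (entity_type, text) pairs."""
    entities = []
    current_type, current_tokens = None, []
    for token, tag_id in zip(tokens, tag_ids):
        tag = label_list[tag_id]
        if tag.startswith("B-"):
            if current_tokens:  # close the entity that was open
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)  # continue the open entity
        else:
            if current_tokens:  # an O tag (or a stray I- tag) ends the entity
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        entities.append((current_type, " ".join(current_tokens)))
    return entities


entities = decode_entities(tokens, ner_tags)
```

Tags 7, 8, 8 form one location span, while the single 7 on ESB starts a new entity of its own.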

Preprocessing

Load the DistilBERT tokenizer to process the tokens:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

Since the input has already been split into words, set is_split_into_words=True to tokenize the words into subwords:

example = wnut["train"][0]
tokenized_input = tokenizer(example["tokens"], is_split_into_words=True)
tokens = tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"])
tokens

['[CLS]', '@', 'paul', '##walk', 'it', "'", 's', 'the', 'view', 'from', 'where', 'i', "'", 'm', 'living', 'for', 'two', 'weeks', '.', 'empire', 'state', 'building', '=', 'es', '##b', '.', 'pretty', 'bad', 'storm', 'here', 'last', 'evening', '.', '[SEP]']

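Adding the special tokens [CLS] and [SEP] and subword tokenization creates a mismatch between the input and labels. A single word corresponding to a single label may be split into two subwords. You will need to realign the tokens and labels by:

Mapping all tokens to their corresponding word with the word_ids method.
Assigning the label -100 to the special tokens [CLS] and [SEP] so the PyTorch loss function ignores them.
Only labeling the first token of a given word. Assign -100 to the other subtokens from the same word.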
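
The three rules above can be tried out without a tokenizer by writing the word ids by hand. In the sketch below, word_ids is a hypothetical stand-in for what the word_ids method would return for the subword sequence [CLS] empire state building = es ##b [SEP], and align_labels is an illustrative helper that follows the same branching as the function in this guide.

```python
IGNORE_INDEX = -100


def align_labels(word_ids, word_labels):
    """Spread word-level labels over subword tokens, labeling only first pieces."""
    aligned = []
    previous_word_idx = None
    for word_idx in word_ids:
        if word_idx is None:                 # special tokens such as [CLS] / [SEP]
            aligned.append(IGNORE_INDEX)
        elif word_idx != previous_word_idx:  # first subword of a new word
            aligned.append(word_labels[word_idx])
        else:                                # later subword of the same word
            aligned.append(IGNORE_INDEX)
        previous_word_idx = word_idx
    return aligned


# Words: ["Empire", "State", "Building", "=", "ESB"] with labels 7, 8, 8, 0, 7.
word_labels = [7, 8, 8, 0, 7]
# Hand-written word ids: "ESB" is split into two subwords, both mapped to word 4.
word_ids = [None, 0, 1, 2, 3, 4, 4, None]

aligned = align_labels(word_ids, word_labels)
```

The second piece of ESB and both special tokens receive -100, so only one label per original word contributes to the loss.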

Here is how you can create a function to realign the tokens and labels, and truncate sequences to be no longer than DistilBERT's maximum input length:

def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)

    labels = []
    for i, label in enumerate(examples["ner_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)  # Map tokens to their respective word.
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:  # Set the special tokens to -100.
            if word_idx is None:
                label_ids.append(-100)
            elif word_idx != previous_word_idx:  # Only label the first token of a given word.
                label_ids.append(label[word_idx])
            else:
                label_ids.append(-100)
            previous_word_idx = word_idx
        labels.append(label_ids)

    tokenized_inputs["labels"] = labels
    return tokenized_inputs

Use the Datasets map function to tokenize and align the labels over the entire dataset. You can speed up the map function by setting batched=True to process multiple elements of the dataset at once:

tokenized_wnut = wnut.map(tokenize_and_align_labels, batched=True)

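Use DataCollatorForTokenClassification to create a batch of examples. It also dynamically pads the text and labels to the length of the longest element in its batch, so they are a uniform length. While it is possible to pad the text in the tokenizer function by setting padding=True, dynamic padding is more efficient.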

PyTorch

from transformers import DataCollatorForTokenClassification

data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)

TensorFlow

from transformers import DataCollatorForTokenClassification

data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer, return_tensors="tf")
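
As a rough illustration of what padding both text and labels means, the sketch below pads a toy batch by hand. It is not the library's implementation: pad_batch is a hypothetical helper, the token ids are made up, and the padding id 0 and label padding value -100 are assumptions chosen to match the conventions described above.

```python
def pad_batch(features, pad_token_id=0, label_pad_id=-100):
    """Pad input_ids, attention_mask and labels to the longest example in the batch."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_tokens = len(f["input_ids"])
        n_pad = max_len - n_tokens
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        # Real tokens get 1, padding gets 0.
        batch["attention_mask"].append([1] * n_tokens + [0] * n_pad)
        # Padded positions get the ignored label so they do not affect the loss.
        batch["labels"].append(f["labels"] + [label_pad_id] * n_pad)
    return batch


# Toy batch with made-up token ids and labels.
features = [
    {"input_ids": [101, 7, 8, 102], "labels": [-100, 0, 3, -100]},
    {"input_ids": [101, 9, 102], "labels": [-100, 1, -100]},
]
batch = pad_batch(features)
print(batch)
```

Because the target length is recomputed for every batch, short batches stay short, which is the point of dynamic padding.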

Fine-tuning with Trainer

Load DistilBERT with AutoModelForTokenClassification along with the number of expected labels:

from transformers import AutoModelForTokenClassification, TrainingArguments, Trainer

# WNUT 17 defines 13 NER tags ("O" plus B-/I- tags for 6 entity types).
model = AutoModelForTokenClassification.from_pretrained("distilbert-base-uncased", num_labels=13)

At this point, only three steps remain:

Define your training hyperparameters in TrainingArguments.
Pass the training arguments to Trainer along with the model, dataset, tokenizer, and data collator.
Call train() to fine-tune your model.

training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_wnut["train"],
    eval_dataset=tokenized_wnut["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)

trainer.train()

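Fine-tuning with TensorFlow

Fine-tuning a model in TensorFlow is just as easy, with only a few differences. First convert the datasets to the tf.data.Dataset format with to_tf_dataset, specifying the columns, whether to shuffle, the batch size, and the data collator: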

tf_train_set = tokenized_wnut["train"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "labels"],
    shuffle=True,
    batch_size=16,
    collate_fn=data_collator,
)

tf_validation_set = tokenized_wnut["validation"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "labels"],
    shuffle=False,
    batch_size=16,
    collate_fn=data_collator,
)

Set up an optimizer function, a learning rate schedule, and some training hyperparameters:

from transformers import create_optimizer

batch_size = 16
num_train_epochs = 3
num_train_steps = (len(tokenized_wnut["train"]) // batch_size) * num_train_epochs
optimizer, lr_schedule = create_optimizer(
    init_lr=2e-5,
    num_train_steps=num_train_steps,
    weight_decay_rate=0.01,
    num_warmup_steps=0,
)

Load DistilBERT with TFAutoModelForTokenClassification along with the number of expected labels:

from transformers import TFAutoModelForTokenClassification

# WNUT 17 defines 13 NER tags ("O" plus B-/I- tags for 6 entity types).
model = TFAutoModelForTokenClassification.from_pretrained("distilbert-base-uncased", num_labels=13)

Configure the model for training with compile:

import tensorflow as tf

model.compile(optimizer=optimizer)

Call fit to fine-tune the model:

model.fit(x=tf_train_set, validation_data=tf_validation_set, epochs=3)

Note : For a more in-depth example of how to fine-tune a model for token classification, see the corresponding PyTorch notebook or TensorFlow notebook.

End.

HuggingFace Transformers 4.17 : Guide : Fine-tuning for downstream tasks – Text classification (translation/commentary)

04/27/2022

Translation : ClassCat Co., Ltd. Sales Information
Created : 04/27/2022 (v4.17.0)

* This page is a translation of the following HuggingFace Transformers document, with supplementary explanations added where appropriate:

How-To Guide : Fine-Tune for Downstream Tasks : Text classification

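Text classification is a common NLP task that assigns a label or class to text. Many of its practical applications are widely used in production by some of today's largest companies. One of the most popular forms of text classification is sentiment analysis, which assigns a label such as positive, negative, or neutral to a sequence of text.

This guide shows how to fine-tune DistilBERT on the IMDb dataset to determine whether a movie review is positive or negative.

Note : See the text classification task page for more information about other forms of text classification and their associated models, datasets, and metrics.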

Loading the IMDb dataset

Load the IMDb dataset from the Datasets library:

from datasets import load_dataset

imdb = load_dataset("imdb")

Then take a look at an example:

imdb["test"][0]

{
    "label": 0,
    "text": "I love sci-fi and am willing to put up with a lot. Sci-fi movies/TV are usually underfunded, under-appreciated and misunderstood. I tried to like this, I really did, but it is to good TV sci-fi as Babylon 5 is to Star Trek (the original). Silly prosthetics, cheap cardboard sets, stilted dialogues, CG that doesn't match the background, and painfully one-dimensional characters cannot be overcome with a 'sci-fi' setting. (I'm sure there are those of you out there who think Babylon 5 is good sci-fi TV. It's not. It's clichéd and uninspiring.) While US viewers might like emotion and character development, sci-fi is a genre that does not take itself seriously (cf. Star Trek). It may treat important issues, yet not as a serious philosophy. It's really difficult to care about the characters here as they are not simply foolish, just missing a spark of life. Their actions and reactions are wooden and predictable, often painful to watch. The makers of Earth KNOW it's rubbish as they have to always say \"Gene Roddenberry's Earth...\" otherwise people would not continue watching. Roddenberry's ashes must be turning in their orbit as this dull, cheap, poorly edited (watching it without advert breaks really brings this home) trudging Trabant of a show lumbers into space. Spoiler. So, kill off a main character. And then bring him back as another actor. Jeeez! Dallas all over again.",
}

There are two fields in this dataset:

text : a string containing the text of the movie review.
label : a value that is either 0 for a negative review or 1 for a positive review.

Preprocessing

Load the DistilBERT tokenizer to process the text field:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

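Create a preprocessing function that tokenizes the text and truncates sequences so that they are no longer than DistilBERT's maximum input length, then apply it over the entire dataset with the Datasets map function (batched=True processes multiple elements at once):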

def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True)

tokenized_imdb = imdb.map(preprocess_function, batched=True)

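Use DataCollatorWithPadding to create batches of examples. It also dynamically pads the text to the length of the longest element in each batch so that the examples have a uniform length. You could instead pad the text in the tokenizer call by setting padding=True, but dynamic padding is more efficient.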

PyTorch

from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

TensorFlow

from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="tf")

Fine-tuning with Trainer

Load DistilBERT with AutoModelForSequenceClassification along with the number of expected labels:

from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)

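Note : If you are not familiar with fine-tuning a model with Trainer, take a look at the basic tutorial here!

At this point, only three steps remain:

Define your training hyperparameters in TrainingArguments.
Pass the training arguments to Trainer along with the model, dataset, tokenizer, and data collator.
Call train() to fine-tune your model.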

training_args = TrainingArguments(
    output_dir="./results",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=5,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_imdb["train"],
    eval_dataset=tokenized_imdb["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)

trainer.train()

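Note : Trainer applies dynamic padding by default when you pass a tokenizer to it. In that case, you do not need to specify a data collator explicitly.

Fine-tuning with TensorFlow

Fine-tuning a model in TensorFlow is just as easy, with only a few differences. First convert the datasets to the tf.data.Dataset format with to_tf_dataset: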

tf_train_dataset = tokenized_imdb["train"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "label"],
    shuffle=True,
    batch_size=16,
    collate_fn=data_collator,
)

# Validate on the held-out test split rather than on the training data.
tf_validation_dataset = tokenized_imdb["test"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "label"],
    shuffle=False,
    batch_size=16,
    collate_fn=data_collator,
)

Set up an optimizer function, a learning rate schedule, and some training hyperparameters:

from transformers import create_optimizer
import tensorflow as tf

batch_size = 16
num_epochs = 5
batches_per_epoch = len(tokenized_imdb["train"]) // batch_size
total_train_steps = int(batches_per_epoch * num_epochs)
optimizer, schedule = create_optimizer(init_lr=2e-5, num_warmup_steps=0, num_train_steps=total_train_steps)
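
The step count passed to create_optimizer is plain integer arithmetic, and it can help to check it by hand. The sketch below repeats the calculation with the size of the IMDb train split (25,000 reviews) written in directly instead of calling len on the dataset; the variable train_size is introduced here only for illustration.

```python
train_size = 25_000  # number of reviews in the IMDb train split
batch_size = 16
num_epochs = 5

# Floor division drops the final, partially filled batch of each epoch.
batches_per_epoch = train_size // batch_size
total_train_steps = batches_per_epoch * num_epochs

print(batches_per_epoch, total_train_steps)
```

Since the learning rate schedule decays over num_train_steps, the number of epochs used here should match the number of epochs later passed to fit.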

Load DistilBERT with TFAutoModelForSequenceClassification along with the number of expected labels:

from transformers import TFAutoModelForSequenceClassification

model = TFAutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)

Configure the model for training with compile:

import tensorflow as tf

model.compile(optimizer=optimizer)

Call fit to fine-tune the model:

# Use the datasets and epoch count defined above.
model.fit(x=tf_train_dataset, validation_data=tf_validation_dataset, epochs=num_epochs)

Note : For a more in-depth example of how to fine-tune a model for text classification, see the corresponding PyTorch notebook or TensorFlow notebook.

End.

HuggingFace Transformers 4.17 : Guide : How to fine-tune a model for downstream tasks (translation/commentary)

04/26/2022

Translation : ClassCat Co., Ltd. Sales Information
Created : 04/26/2022 (v4.17.0)

* This page is a translation of the following HuggingFace Transformers document, with supplementary explanations added where appropriate:

How-To Guide : How to fine-tune a model for common downstream tasks

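This guide shows how to fine-tune Transformers models for common downstream tasks. It uses the Datasets library to quickly load and preprocess the datasets and prepare them for training with PyTorch and TensorFlow.

Before you begin, make sure the Datasets library is installed. For detailed installation instructions, refer to the Datasets installation page. All of the examples in this guide use Datasets to load and preprocess a dataset.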

pip install datasets

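You will learn how to fine-tune a model for the following tasks: sequence classification with IMDb reviews, token classification with WNUT emerging entities, and question answering with SQuAD.

Sequence classification with IMDb reviews

Sequence classification refers to the task of classifying sequences of text according to a given number of classes. In this example, you will learn how to fine-tune a model on the IMDb dataset to determine whether a review is positive or negative.

Loading the IMDb dataset

The Datasets library makes it simple to load a dataset: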

from datasets import load_dataset

imdb = load_dataset("imdb")

This loads a DatasetDict object, which you can index into to view an example:

imdb["train"][0]

{
    "label": 1,
    "text": "Bromwell High is a cartoon comedy. It ran at the same time as some other programs about school life, such as \"Teachers\". My 35 years in the teaching profession lead me to believe that Bromwell High's satire is much closer to reality than is \"Teachers\". The scramble to survive financially, the insightful students who can see right through their pathetic teachers' pomp, the pettiness of the whole situation, all remind me of the schools I knew and their students. When I saw the episode in which a student repeatedly tried to burn down the school, I immediately recalled ......... at .......... High. A classic line: INSPECTOR: I'm here to sack one of your teachers. STUDENT: Welcome to Bromwell High. I expect that many adults of my age think that Bromwell High is far fetched. What a pity that it isn't!",
}

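Preprocessing

The next step is to tokenize the text into a form the model can read. It is important to load the same tokenizer the model was trained with so that words are tokenized consistently. Load the DistilBERT tokenizer with AutoTokenizer, because you will eventually train a classifier using a pretrained DistilBERT model: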

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

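Now that the tokenizer is instantiated, create a function that tokenizes the text. Long sequences of text must also be truncated so that they are no longer than the model's maximum input length: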

def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True)

Use the Datasets map function to apply the preprocessing function over the entire dataset. You can also set batched=True to apply it to multiple elements of the dataset at once for faster preprocessing:

tokenized_imdb = imdb.map(preprocess_function, batched=True)

type(tokenized_imdb), tokenized_imdb.keys()

(datasets.dataset_dict.DatasetDict, dict_keys(['train', 'test', 'unsupervised']))

print(tokenized_imdb['train'][0])

{'text': 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far between, even then it\'s not shot like some cheaply made porno. While my countrymen mind find it shocking, in reality sex and nudity are a major staple in Swedish cinema. Even Ingmar Bergman, arguably their answer to good old boy John Ford, had sex scenes in his films.<br /><br />I do commend the filmmakers for the fact that any sex shown in the film is shown for artistic purposes rather than just to shock people and make money to be shown in pornographic theaters in America. I AM CURIOUS-YELLOW is a good film for anyone wanting to study the meat and potatoes (no pun intended) of Swedish cinema. But really, this film doesn\'t have much of a plot.',
'label': 0,
'input_ids': [101, 1045, 12524, 1045, 2572, 8025, 1011, 3756, 2013, 2026, 2678, 3573, 2138, 1997, 2035, 1996, 6704, 2008, 5129, 2009, 2043, 2009, 2001, 2034, 2207, 1999, 3476, 1012, 1045, 2036, 2657, 2008, 2012, 2034, 2009, 2001, 8243, 2011, 1057, 1012, 1055, 1012, 8205, 2065, 2009, 2412, 2699, 2000, 4607, 2023, 2406, 1010, 3568, 2108, 1037, 5470, 1997, 3152, 2641, 1000, 6801, 1000, 1045, 2428, 2018, 2000, 2156, 2023, 2005, 2870, 1012, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 1996, 5436, 2003, 8857, 2105, 1037, 2402, 4467, 3689, 3076, 2315, 14229, 2040, 4122, 2000, 4553, 2673, 2016, 2064, 2055, 2166, 1012, 1999, 3327, 2016, 4122, 2000, 3579, 2014, 3086, 2015, 2000, 2437, 2070, 4066, 1997, 4516, 2006, 2054, 1996, 2779, 25430, 14728, 2245, 2055, 3056, 2576, 3314, 2107, 2004, 1996, 5148, 2162, 1998, 2679, 3314, 1999, 1996, 2142, 2163, 1012, 1999, 2090, 4851, 8801, 1998, 6623, 7939, 4697, 3619, 1997, 8947, 2055, 2037, 10740, 2006, 4331, 1010, 2016, 2038, 3348, 2007, 2014, 3689, 3836, 1010, 19846, 1010, 1998, 2496, 2273, 1012, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 2054, 8563, 2033, 2055, 1045, 2572, 8025, 1011, 3756, 2003, 2008, 2871, 2086, 3283, 1010, 2023, 2001, 2641, 26932, 1012, 2428, 1010, 1996, 3348, 1998, 16371, 25469, 5019, 2024, 2261, 1998, 2521, 2090, 1010, 2130, 2059, 2009, 1005, 1055, 2025, 2915, 2066, 2070, 10036, 2135, 2081, 22555, 2080, 1012, 2096, 2026, 2406, 3549, 2568, 2424, 2009, 16880, 1010, 1999, 4507, 3348, 1998, 16371, 25469, 2024, 1037, 2350, 18785, 1999, 4467, 5988, 1012, 2130, 13749, 7849, 24544, 1010, 15835, 2037, 3437, 2000, 2204, 2214, 2879, 2198, 4811, 1010, 2018, 3348, 5019, 1999, 2010, 3152, 1012, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 1045, 2079, 4012, 3549, 2094, 1996, 16587, 2005, 1996, 2755, 2008, 2151, 3348, 3491, 1999, 1996, 2143, 2003, 3491, 2005, 6018, 5682, 2738, 2084, 2074, 2000, 5213, 2111, 1998, 2191, 2769, 2000, 2022, 3491, 1999, 26932, 12370, 1999, 2637, 1012, 1045, 2572, 8025, 1011, 3756, 2003, 
1037, 2204, 2143, 2005, 3087, 5782, 2000, 2817, 1996, 6240, 1998, 14629, 1006, 2053, 26136, 3832, 1007, 1997, 4467, 5988, 1012, 2021, 2428, 1010, 2023, 2143, 2987, 1005, 1056, 2031, 2172, 1997, 1037, 5436, 1012, 102],
'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

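Finally, pad the text so that the examples have a uniform length. While it is possible to pad the text in the tokenizer call by setting padding=True, it is more efficient to pad the text only to the length of the longest element in its batch. This is known as dynamic padding. You can do this with DataCollatorWithPadding: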

from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
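
To get a feel for why dynamic padding is cheaper, the following sketch counts how many padding tokens are added under the two strategies. The sequence lengths are made up, the fixed target of 512 is assumed to be the model's maximum input length, and count_padding is a hypothetical helper rather than anything provided by Transformers.

```python
def count_padding(lengths, batch_size, pad_to=None):
    """Count padding tokens when batching sequences of the given lengths.

    With pad_to set, every sequence is padded to that fixed length;
    otherwise each batch is padded to its own longest sequence.
    """
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        target = pad_to if pad_to is not None else max(batch)
        total += sum(target - n for n in batch)
    return total


lengths = [12, 40, 35, 8, 512, 100, 90, 60]  # made-up token counts

static_pad = count_padding(lengths, batch_size=4, pad_to=512)
dynamic_pad = count_padding(lengths, batch_size=4)
print(static_pad, dynamic_pad)
```

In this toy case the first batch contains only short reviews, so padding it to its own maximum avoids most of the wasted positions.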

Fine-tuning with the Trainer API

Next, load the model with the AutoModelForSequenceClassification class along with the number of expected labels:

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)

At this point, only three steps remain:

Define your training hyperparameters in TrainingArguments.
Pass the training arguments to Trainer along with the model, dataset, tokenizer, and data collator.
Call Trainer.train() to fine-tune your model.

from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="./results",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=5,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_imdb["train"],
    eval_dataset=tokenized_imdb["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)

trainer.train()

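Fine-tuning with TensorFlow

(Translator's note: see the original document.)

Token classification with WNUT emerging entities

Token classification refers to the task of classifying individual tokens in a sentence. One of the most common token classification tasks is Named Entity Recognition (NER). NER attempts to find a label for each entity in a sentence, such as a person, location, or organization. In this example, you will learn how to fine-tune a model on the WNUT 17 dataset to detect new entities.

Loading the WNUT 17 dataset

Load the WNUT 17 dataset from the Datasets library: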

from datasets import load_dataset

wnut = load_dataset("wnut_17")

A quick look at the dataset shows the labels associated with each word in the sentence:

wnut["train"][0]

{'id': '0',
 'ner_tags': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7, 8, 8, 0, 7, 0, 0, 0, 0, 0, 0, 0, 0],
 'tokens': ['@paulwalk', 'It', "'s", 'the', 'view', 'from', 'where', 'I', "'m", 'living', 'for', 'two', 'weeks', '.', 'Empire', 'State', 'Building', '=', 'ESB', '.', 'Pretty', 'bad', 'storm', 'here', 'last', 'evening', '.']
}

View the specific NER tags with:

label_list = wnut["train"].features["ner_tags"].feature.names
label_list

[
    "O",
    "B-corporation",
    "I-corporation",
    "B-creative-work",
    "I-creative-work",
    "B-group",
    "I-group",
    "B-location",
    "I-location",
    "B-person",
    "I-person",
    "B-product",
    "I-product",
]

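The letter that prefixes each NER tag means the following:

B- indicates the beginning of an entity.
I- indicates that a token is contained inside the same entity (e.g., the State token is part of an entity like Empire State Building).
O indicates that the token does not correspond to any entity.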
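
As an illustration of how these prefixes are read, the sketch below groups tagged tokens back into entity spans. It uses the label list shown above and a slice of the example sentence (the tokens from '.' to the '.' after 'ESB') with its tag ids; extract_entities is a hypothetical helper written for this example, not a function from Transformers or Datasets.

```python
label_list = [
    "O", "B-corporation", "I-corporation", "B-creative-work", "I-creative-work",
    "B-group", "I-group", "B-location", "I-location", "B-person", "I-person",
    "B-product", "I-product",
]


def extract_entities(tokens, tag_ids, label_list):
    """Group B-/I- tagged tokens into (text, entity_type) pairs."""
    entities = []
    current_tokens, current_type = [], None
    for token, tag_id in zip(tokens, tag_ids):
        tag = label_list[tag_id]
        if tag.startswith("B-"):
            # A new entity starts; close the one in progress, if any.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # Continuation of the entity in progress.
            current_tokens.append(token)
        else:
            # "O" (or an I- tag with no matching entity) ends the current entity.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        entities.append((" ".join(current_tokens), current_type))
    return entities


tokens = [".", "Empire", "State", "Building", "=", "ESB", "."]
tag_ids = [0, 7, 8, 8, 0, 7, 0]

entities = extract_entities(tokens, tag_ids, label_list)
print(entities)
```

Tag 7 (B-location) opens an entity and tag 8 (I-location) extends it, so the three tokens of Empire State Building come back as one span.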

Preprocessing

Next, the text needs to be tokenized. Load the DistilBERT tokenizer with AutoTokenizer:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

Since the input has already been split into words, set is_split_into_words=True to tokenize the words into subwords:

example = wnut["train"][0]

tokenized_input = tokenizer(example["tokens"], is_split_into_words=True)
tokens = tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"])
tokens

['[CLS]', '@', 'paul', '##walk', 'it', "'", 's', 'the', 'view', 'from', 'where', 'i', "'", 'm', 'living', 'for', 'two', 'weeks', '.', 'empire', 'state', 'building', '=', 'es', '##b', '.', 'pretty', 'bad', 'storm', 'here', 'last', 'evening', '.', '[SEP]']

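Adding the special tokens [CLS] and [SEP] and splitting words into subwords creates a mismatch between the input and the labels. Realign the labels and tokens by:

Mapping every token to its corresponding word with the word_ids method.
Assigning the label -100 to the special tokens [CLS] and [SEP] so that the PyTorch loss function ignores them.
Labeling only the first token of a given word, and assigning -100 to the other subtokens of the same word.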
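
The reason -100 works is that it is the default ignore_index of PyTorch's cross-entropy loss, so positions carrying that label do not contribute to the loss. The sketch below imitates that behaviour in pure Python on made-up per-token probabilities; masked_cross_entropy is a hypothetical helper that works on probabilities rather than logits, so it only illustrates the masking, not the real loss implementation.

```python
import math


def masked_cross_entropy(probs, labels, ignore_index=-100):
    """Mean negative log-likelihood over positions whose label is not ignore_index."""
    losses = [
        -math.log(p[y])
        for p, y in zip(probs, labels)
        if y != ignore_index
    ]
    return sum(losses) / len(losses)


# Made-up predicted probabilities over two classes for four tokens.
probs = [
    [0.5, 0.5],    # special token, ignored
    [0.9, 0.1],    # first subtoken of a word labelled 0
    [0.2, 0.8],    # later subtoken of the same word, ignored
    [0.25, 0.75],  # first subtoken of a word labelled 1
]
labels = [-100, 0, -100, 1]

loss = masked_cross_entropy(probs, labels)

# Changing the predictions at ignored positions leaves the loss unchanged.
altered = [[0.99, 0.01], probs[1], [0.01, 0.99], probs[3]]
loss_altered = masked_cross_entropy(altered, labels)
print(loss, loss_altered)
```

Only the two labelled positions enter the average, which is why each word is scored once no matter how many subtokens it was split into.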

Here is how to create a function that realigns the labels and tokens:

def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)

    labels = []
    for i, label in enumerate(examples["ner_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)  # Map tokens to their respective word.
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:  # Set the special tokens to -100.
            if word_idx is None:
                label_ids.append(-100)
            elif word_idx != previous_word_idx:  # Only label the first token of a given word.
                label_ids.append(label[word_idx])
            else:
                label_ids.append(-100)
            previous_word_idx = word_idx
        labels.append(label_ids)

    tokenized_inputs["labels"] = labels
    return tokenized_inputs

Then tokenize and align the labels over the entire dataset with the Datasets map function:

tokenized_wnut = wnut.map(tokenize_and_align_labels, batched=True)

Finally, pad the text and labels so that they have a uniform length:

from transformers import DataCollatorForTokenClassification

data_collator = DataCollatorForTokenClassification(tokenizer)

Fine-tuning with the Trainer API

Load the model with the AutoModelForTokenClassification class along with the number of expected labels:

from transformers import AutoModelForTokenClassification, TrainingArguments, Trainer

model = AutoModelForTokenClassification.from_pretrained("distilbert-base-uncased", num_labels=len(label_list))

Gather the training arguments in TrainingArguments:

training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
)

Pass the model, training arguments, datasets, data collator, and tokenizer to Trainer:

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_wnut["train"],
    eval_dataset=tokenized_wnut["test"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)

Fine-tune the model:

trainer.train()

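Fine-tuning with TensorFlow

(Translator's note: see the original document.)

Question answering with SQuAD

There are many types of question answering (QA) tasks. Extractive QA focuses on identifying the answer in a text, given a question. In this example, you will learn how to fine-tune a model on the SQuAD dataset.

Loading the SQuAD dataset

Load the SQuAD dataset from the Datasets library: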

from datasets import load_dataset

squad = load_dataset("squad")

Take a look at an example from the dataset:

squad["train"][0]

{'answers': {'answer_start': [515], 'text': ['Saint Bernadette Soubirous']},
 'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.',
 'id': '5733be284776f41900661182',
 'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?',
 'title': 'University_of_Notre_Dame'
}

Preprocessing

Load the DistilBERT tokenizer with AutoTokenizer:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

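There are a few things to be aware of when preprocessing text for question answering:

Some examples in the dataset may have a very long context that exceeds the maximum input length of the model. Handle this by truncating only the context, which is done by setting truncation="only_second".
Next, the start and end positions of the answer need to be mapped to the original context. Set return_offsets_mapping=True to handle this.
With the mapping in hand, you can find the start and end tokens of the answer. Use the sequence_ids method to find which part of the offset corresponds to the question and which part corresponds to the context.

Put everything together in a preprocessing function as shown below: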

def preprocess_function(examples):
    questions = [q.strip() for q in examples["question"]]
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=384,
        truncation="only_second",
        return_offsets_mapping=True,
        padding="max_length",
    )

    offset_mapping = inputs.pop("offset_mapping")
    answers = examples["answers"]
    start_positions = []
    end_positions = []

    for i, offset in enumerate(offset_mapping):
        answer = answers[i]
        start_char = answer["answer_start"][0]
        end_char = answer["answer_start"][0] + len(answer["text"][0])
        sequence_ids = inputs.sequence_ids(i)

        # Find the start and end of the context
        idx = 0
        while sequence_ids[idx] != 1:
            idx += 1
        context_start = idx
        while sequence_ids[idx] == 1:
            idx += 1
        context_end = idx - 1

        # If the answer is not fully inside the context, label it (0, 0)
        if offset[context_start][0] > end_char or offset[context_end][1] < start_char:
            start_positions.append(0)
            end_positions.append(0)
        else:
            # Otherwise it's the start and end token positions
            idx = context_start
            while idx <= context_end and offset[idx][0] <= start_char:
                idx += 1
            start_positions.append(idx - 1)

            idx = context_end
            while idx >= context_start and offset[idx][1] >= end_char:
                idx -= 1
            end_positions.append(idx + 1)

    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions
    return inputs
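
The core of this function is the conversion from a character span to a token span using the offsets. The sketch below isolates that step on a short excerpt of the context shown earlier. It assumes a toy whitespace tokenizer instead of the DistilBERT tokenizer and covers only context tokens (no question, no special tokens); char_span_to_token_span is a hypothetical helper that follows the same two scanning loops as the function above.

```python
def char_span_to_token_span(offsets, start_char, end_char):
    """Map a character span to (start_token, end_token) using per-token (start, end) offsets."""
    # If the answer lies entirely outside the tokens, label it (0, 0).
    if offsets[0][0] > end_char or offsets[-1][1] < start_char:
        return 0, 0

    # Scan forward to the last token that starts at or before the answer start.
    idx = 0
    while idx < len(offsets) and offsets[idx][0] <= start_char:
        idx += 1
    start_token = idx - 1

    # Scan backward to the first token that ends at or after the answer end.
    idx = len(offsets) - 1
    while idx >= 0 and offsets[idx][1] >= end_char:
        idx -= 1
    end_token = idx + 1
    return start_token, end_token


context = "appeared to Saint Bernadette Soubirous in 1858."
answer = "Saint Bernadette Soubirous"

# Toy whitespace tokenizer that records character offsets for each word.
offsets = []
pos = 0
for word in context.split():
    start = context.index(word, pos)
    offsets.append((start, start + len(word)))
    pos = start + len(word)

start_char = context.index(answer)
end_char = start_char + len(answer)

span = char_span_to_token_span(offsets, start_char, end_char)
print(span, context.split()[span[0]:span[1] + 1])
```

Both returned indices are inclusive, matching the start_positions and end_positions produced by the preprocessing function.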

Apply the preprocessing function over the entire dataset with the Datasets map function:

tokenized_squad = squad.map(preprocess_function, batched=True, remove_columns=squad["train"].column_names)

Batch the processed examples together:

from transformers import default_data_collator

data_collator = default_data_collator

Fine-tuning with the Trainer API

Load the model with the AutoModelForQuestionAnswering class:

from transformers import AutoModelForQuestionAnswering, TrainingArguments, Trainer

model = AutoModelForQuestionAnswering.from_pretrained("distilbert-base-uncased")

Gather the training arguments in TrainingArguments:

training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
)

Pass the model, training arguments, datasets, data collator, and tokenizer to Trainer:

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_squad["train"],
    eval_dataset=tokenized_squad["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)

Fine-tune the model:

trainer.train()

Fine-tuning with TensorFlow

(Translator's note: see the original document.)

End.