
PyOD 0.8: Overview

Posted by: sales-info in PyOD, Outlier Detection. Posted on: 06/23/2021

PyOD 0.8: Overview (translation/commentary)
Translation: ClassCat Co., Ltd., Sales Information
Created: 06/25/2021 (0.8.9)

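* This page is a translation of the following PyOD documentation, with supplementary explanations added where appropriate:

* The sample code has been tested, but it has been extended or modified where necessary.
* You are welcome to link to this page freely, but we would appreciate a note to sales-info@classcat.com.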


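PyOD 0.8: Overview

PyOD (Python Outlier Detection) is a comprehensive and scalable Python toolkit for detecting outlying objects in multivariate data. This exciting yet challenging field is commonly referred to as outlier detection or anomaly detection.

PyOD includes more than 30 detection algorithms, from the classical LOF (SIGMOD 2000) to the latest COPOD (ICDM 2020). Since 2017, PyOD [AZNL19] has been used successfully in numerous academic research projects and commercial products [AGSW19, ALCJ+19, AWDL+19, AZNHL19]. It is also well acknowledged by the machine learning community through various dedicated posts and tutorials, including Analytics Vidhya, Towards Data Science, KDnuggets, Computer Vision News, and awesome-machine-learning.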

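PyOD is characterized by:

A unified API, detailed documentation, and interactive examples across various algorithms.
Advanced models, including classical ones from scikit-learn, the latest deep learning methods, and emerging algorithms such as COPOD.
Optimized performance with JIT compilation and parallelization where possible, using numba and joblib.
Fast training and prediction with SUOD [AZHC+21].
Compatibility with both Python 2 and 3.

API demo: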

# train the COPOD detector
from pyod.models.copod import COPOD
clf = COPOD()
clf.fit(X_train)

# get outlier scores
y_train_scores = clf.decision_scores_  # raw outlier scores
y_test_scores = clf.decision_function(X_test)  # outlier scores

Citing PyOD:

* Please refer to the original documentation.

Key links and resources:

Installation

Using pip is recommended for installation. Make sure the latest version is installed, because PyOD is updated frequently:

pip install pyod            # normal install
pip install --upgrade pyod  # or update if needed
pip install --pre pyod      # or include pre-release version for new features

Alternatively, you can clone the repository and install it from its setup.py file:

git clone https://github.com/yzhao062/pyod.git
cd pyod
pip install .

Required dependencies:

Python 2.7, 3.5, 3.6, or 3.7
combo>=0.0.8
joblib
numpy>=1.13
numba>=0.35
pandas>=0.25
scipy>=0.19.1
scikit_learn>=0.19.1
statsmodels

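Optional dependencies (see details below):

combo (optional, required for models/combination.py and FeatureBagging)
keras (optional, required for AutoEncoder and other deep learning models)
matplotlib (optional, required for running the examples)
pandas (optional, required for running the benchmark)
tensorflow (optional, required for AutoEncoder and other deep learning models)
xgboost (optional, required for XGBOD)

Warning 1: PyOD has several neural-network-based models, for example AutoEncoder, which are implemented in both PyTorch and TensorFlow. However, PyOD does NOT install deep learning libraries for you. This reduces the risk of interfering with your local copies. If you want to use neural-net-based models, make sure that Keras and a backend library such as TensorFlow are installed. Instructions are provided in the neural-net FAQ. Similarly, models that depend on xgboost, such as XGBOD, do NOT force xgboost to be installed by default.

Warning 2: Running the examples requires matplotlib, which may throw errors in conda virtual environments on macOS. See mac_matplotlib for the cause and the solution.

Warning 3: PyOD contains several models that also exist in scikit-learn. However, the APIs of the two libraries are not exactly the same; for consistency, it is recommended to use only one of them and not to mix the results. See the differences between scikit-learn and PyOD for more information.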
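
Because the optional packages are not installed automatically, it can help to check which of them are importable before choosing a model. The following is a minimal sketch using only the Python standard library; the helper name `find_missing` and the list constant are hypothetical and not part of PyOD.

```python
import importlib.util

# Optional packages named in the list above; PyOD does not install them itself.
OPTIONAL_PACKAGES = ["combo", "keras", "matplotlib", "pandas", "tensorflow", "xgboost"]


def find_missing(packages):
    """Return the package names that cannot be found in this environment."""
    # find_spec() locates a top-level module without importing it.
    return [name for name in packages if importlib.util.find_spec(name) is None]


missing = find_missing(OPTIONAL_PACKAGES)
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("All optional dependencies are available.")
```

A missing entry only matters for the models that need it, for example xgboost for XGBOD.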

API cheat sheet & reference

Full API reference: https://pyod.readthedocs.io/en/latest/pyod.html. API cheat sheet for all detectors:

fit(X): Fit the detector.
decision_function(X): Predict the raw anomaly scores of X using the fitted detector.
predict(X): Predict whether each sample is an outlier or not using the fitted detector.
predict_proba(X): Predict the probability that each sample is an outlier using the fitted detector.

Key attributes of a fitted model:

decision_scores_: The outlier scores of the training data. The higher, the more abnormal. Outliers tend to have higher scores.
labels_: The binary labels of the training data. 0 stands for inliers and 1 for outliers/anomalies.

Note: fit_predict() and fit_predict_score() were deprecated in V0.6.9 for consistency and are scheduled for removal in V0.8.0. To get the binary labels of the training data X_train, call clf.fit(X_train) and use clf.labels_ instead of calling clf.predict(X_train).
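
To make these conventions concrete, here is a toy detector that follows the same method and attribute names. It is a hypothetical stand-in written with NumPy, not a PyOD class: the score (distance to the training mean), the `contamination` parameter, and the percentile-based `threshold_` are assumptions chosen to keep the sketch short, and `predict_proba` is omitted.

```python
import numpy as np


class ToyDetector:
    """Hypothetical stand-in that mimics the interface described above.

    Not a PyOD class: the score is simply the Euclidean distance to the
    training mean.
    """

    def __init__(self, contamination=0.1):
        # Assumed meaning: expected fraction of outliers in the training data.
        self.contamination = contamination

    def fit(self, X):
        X = np.asarray(X, dtype=float)
        self.center_ = X.mean(axis=0)
        # Scores of the training data; higher means more abnormal.
        self.decision_scores_ = self.decision_function(X)
        # Cut-off so that roughly `contamination` of the training scores exceed it.
        self.threshold_ = np.percentile(
            self.decision_scores_, 100 * (1 - self.contamination))
        # Binary training labels: 0 = inlier, 1 = outlier.
        self.labels_ = (self.decision_scores_ > self.threshold_).astype(int)
        return self

    def decision_function(self, X):
        """Raw anomaly scores for X."""
        return np.linalg.norm(np.asarray(X, dtype=float) - self.center_, axis=1)

    def predict(self, X):
        """Binary labels for X, using the threshold learned in fit()."""
        return (self.decision_function(X) > self.threshold_).astype(int)


rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(0, 1, size=(95, 2)),    # inliers
                     rng.uniform(6, 8, size=(5, 2))])   # obvious outliers
clf = ToyDetector(contamination=0.05).fit(X_train)

train_labels = clf.labels_                            # labels of the training data
test_labels = clf.predict([[0.0, 0.0], [7.0, 7.0]])   # labels of new samples
```

As in the note above, the training labels are read from `labels_` after `fit`, while `predict` is reserved for new samples.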

Model save & load

PyOD takes a similar approach to sklearn regarding model persistence. See model persistence for clarification.

In short, we recommend using joblib or pickle to save and load PyOD models. See "examples/save_load_model_example.py" for an example. It is as simple as the following:

from joblib import dump, load

# save the model
dump(clf, 'clf.joblib')
# load the model
clf = load('clf.joblib')

Implemented algorithms

The PyOD toolkit consists of three major functional groups:

(i) Individual detection algorithms:

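| Abbr | Type | Algorithm | Year | Class | Ref |
|---|---|---|---|---|---|
| PCA | Linear model | Principal Component Analysis (the sum of weighted projected distances to the eigenvector hyperplanes) | 2003 | pyod.models.pca.PCA | [ASCSC03] |
| MCD | Linear model | Minimum Covariance Determinant (uses the Mahalanobis distances as the outlier scores) | 1999 | pyod.models.mcd.MCD | [AHR04, ARD99] |
| OCSVM | Linear model | One-Class Support Vector Machines | 2001 | pyod.models.ocsvm.OCSVM | [AScholkopfPST+01] |
| LMDD | Linear model | Deviation-based Outlier Detection (LMDD) | 1996 | pyod.models.lmdd.LMDD | [AAAR96] |
| LOF | Proximity-based | Local Outlier Factor | 2000 | pyod.models.lof.LOF | [ABKNS00] |
| COF | Proximity-based | Connectivity-Based Outlier Factor | 2002 | pyod.models.cof.COF | [ATCFC02] |
| CBLOF | Proximity-based | Clustering-Based Local Outlier Factor | 2003 | pyod.models.cblof.CBLOF | [AHXD03] |
| LOCI | Proximity-based | LOCI: fast outlier detection using the local correlation integral | 2003 | pyod.models.loci.LOCI | [APKGF03] |
| HBOS | Proximity-based | Histogram-based Outlier Score | 2012 | pyod.models.hbos.HBOS | [AGD12] |
| kNN | Proximity-based | k Nearest Neighbors (uses the distance to the kth nearest neighbor as the outlier score) | 2000 | pyod.models.knn.KNN | [AAP02, ARRS00] |
| AvgKNN | Proximity-based | Average kNN (uses the average distance to the k nearest neighbors as the outlier score) | 2002 | pyod.models.knn.KNN | [AAP02, ARRS00] |
| MedKNN | Proximity-based | Median kNN (uses the median distance to the k nearest neighbors as the outlier score) | 2002 | pyod.models.knn.KNN | [AAP02, ARRS00] |
| SOD | Proximity-based | Subspace Outlier Detection | 2009 | pyod.models.sod.SOD | [AKKrogerSZ09] |
| ROD | Proximity-based | Rotation-based Outlier Detection | 2020 | pyod.models.rod.ROD | [AABC20] |
| ABOD | Probabilistic | Angle-Based Outlier Detection | 2008 | pyod.models.abod.ABOD | [AKZ+08] |
| FastABOD | Probabilistic | Fast Angle-Based Outlier Detection using approximation | 2008 | pyod.models.abod.ABOD | [AKZ+08] |
| COPOD | Probabilistic | COPOD: Copula-Based Outlier Detection | 2020 | pyod.models.copod.COPOD | [ALZB+20] |
| MAD | Probabilistic | Median Absolute Deviation (MAD) | 1993 | pyod.models.mad.MAD | [AIH93] |
| SOS | Probabilistic | Stochastic Outlier Selection | 2012 | pyod.models.sos.SOS | [AJHuszarPvdH12] |
| IForest | Outlier ensembles | Isolation Forest | 2008 | pyod.models.iforest.IForest | [ALTZ08, ALTZ12] |
| (none) | Outlier ensembles | Feature Bagging | 2005 | pyod.models.feature_bagging.FeatureBagging | [ALK05] |
| LSCP | Outlier ensembles | LSCP: Locally Selective Combination of Parallel Outlier Ensembles | 2019 | pyod.models.lscp.LSCP | [AZNHL19] |
| XGBOD | Outlier ensembles | Extreme Boosting Based Outlier Detection (supervised) | 2018 | pyod.models.xgbod.XGBOD | [AZH18] |
| LODA | Outlier ensembles | Lightweight On-line Detector of Anomalies | 2016 | pyod.models.loda.LODA | [APevny16] |
| AutoEncoder | Neural networks | Fully connected AutoEncoder (uses the reconstruction error as the outlier score) | 2015 | pyod.models.auto_encoder.AutoEncoder | [AAgg15] |
| VAE | Neural networks | Variational AutoEncoder (uses the reconstruction error as the outlier score) | 2013 | pyod.models.vae.VAE | [AKW13] |
| Beta-VAE | Neural networks | Variational AutoEncoder (all customized loss terms, obtained by varying gamma and capacity) | 2018 | pyod.models.vae.VAE | [ABHP+18] |
| SO_GAAL | Neural networks | Single-Objective Generative Adversarial Active Learning | 2019 | pyod.models.so_gaal.SO_GAAL | [ALLZ+19] |
| MO_GAAL | Neural networks | Multiple-Objective Generative Adversarial Active Learning | 2019 | pyod.models.mo_gaal.MO_GAAL | [ALLZ+19] |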
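
The kNN, AvgKNN, and MedKNN entries share one class and differ only in how the distances to the k nearest neighbors are reduced to a single score. The brute-force NumPy sketch below illustrates that difference; it is an illustration under stated assumptions, not PyOD's implementation, and the function name and the method labels ("largest", "mean", "median") are chosen here for the example.

```python
import numpy as np


def knn_outlier_scores(X, k=3, method="largest"):
    """Score each row of X by its distances to its k nearest neighbours.

    method: "largest" (distance to the kth neighbour), "mean", or "median".
    Brute-force O(n^2) sketch intended for small arrays only.
    """
    X = np.asarray(X, dtype=float)
    # Pairwise Euclidean distances, shape (n, n).
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Exclude each point's zero distance to itself.
    np.fill_diagonal(dists, np.inf)
    # Distances to the k nearest neighbours, sorted ascending, shape (n, k).
    nearest = np.sort(dists, axis=1)[:, :k]
    if method == "largest":
        return nearest[:, -1]
    if method == "mean":
        return nearest.mean(axis=1)
    if method == "median":
        return np.median(nearest, axis=1)
    raise ValueError(f"unknown method: {method!r}")


# Four points on a unit square plus one point far away.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [10.0, 10.0]])
for method in ("largest", "mean", "median"):
    scores = knn_outlier_scores(X, k=2, method=method)
    print(method, np.round(scores, 3))
```

In all three variants the isolated point at (10, 10) receives the highest score; the variants differ in how sensitive the score is to a single distant neighbor.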

(ii) Outlier ensembles & outlier detector combination frameworks:

| Abbr | Type | Algorithm | Year | Class / function | Ref |
|---|---|---|---|---|---|
| (none) | Outlier ensembles | Feature Bagging | 2005 | pyod.models.feature_bagging.FeatureBagging | [ALK05] |
| LSCP | Outlier ensembles | LSCP: Locally Selective Combination of Parallel Outlier Ensembles | 2019 | pyod.models.lscp.LSCP | [AZNHL19] |
| XGBOD | Outlier ensembles | Extreme Boosting Based Outlier Detection (supervised) | 2018 | pyod.models.xgbod.XGBOD | [AZH18] |
| LODA | Outlier ensembles | Lightweight On-line Detector of Anomalies | 2016 | pyod.models.loda.LODA | [APevny16] |
| Average | Combination | Simple combination by averaging the scores | 2015 | pyod.models.combination.average() | [AAS15] |
| Weighted Average | Combination | Simple combination by averaging the scores with detector weights | 2015 | pyod.models.combination.average() | [AAS15] |
| Maximization | Combination | Simple combination by taking the maximum scores | 2015 | pyod.models.combination.maximization() | [AAS15] |
| AOM | Combination | Average of Maximum | 2015 | pyod.models.combination.aom() | [AAS15] |
| MOA | Combination | Maximization of Average | 2015 | pyod.models.combination.moa() | [AAS15] |
| Median | Combination | Simple combination by taking the median of the scores | 2015 | pyod.models.combination.median() | [AAS15] |
| Majority Vote | Combination | Simple combination by taking the majority vote of the labels (weights can be used) | 2015 | pyod.models.combination.majority_vote() | [AAS15] |
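
The score-combination entries (Average, Weighted Average, Maximization, AOM, MOA) can be illustrated with a few lines of NumPy. This is a sketch of the ideas only, not the code in pyod.models.combination: it assumes the detector scores are already on a comparable scale, and it splits detectors into fixed consecutive buckets, which is one possible choice.

```python
import numpy as np


def average(scores, weights=None):
    """Mean score per sample; pass weights for a weighted average."""
    return np.average(scores, axis=1, weights=weights)


def maximization(scores):
    """Largest score per sample across detectors."""
    return scores.max(axis=1)


def _buckets(scores, n_buckets):
    # Static split of the detector columns into consecutive groups.
    return np.array_split(scores, n_buckets, axis=1)


def aom(scores, n_buckets):
    """Average of Maximum: max within each bucket, then mean over buckets."""
    return np.mean([b.max(axis=1) for b in _buckets(scores, n_buckets)], axis=0)


def moa(scores, n_buckets):
    """Maximization of Average: mean within each bucket, then max over buckets."""
    return np.max([b.mean(axis=1) for b in _buckets(scores, n_buckets)], axis=0)


# Rows are samples, columns are detectors.
scores = np.array([[0.1, 0.2, 0.3, 0.4],
                   [0.9, 0.1, 0.8, 0.2]])

print(average(scores))                        # plain mean per sample
print(average(scores, weights=[1, 1, 2, 2]))  # weighted mean per sample
print(maximization(scores))
print(aom(scores, n_buckets=2))
print(moa(scores, n_buckets=2))
```

AOM and MOA sit between plain averaging and plain maximization: the bucket step limits how much a single extreme detector can dominate the combined score.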

(iii) Utility functions:

| Function | Type | Description | Location |
|---|---|---|---|
| generate_data | Data | Synthesized data generation; normal data is generated by a multivariate Gaussian distribution and outliers are generated by a uniform distribution. | pyod.utils.data.generate_data() |
| generate_data_clusters | Data | Synthesized data generation in clusters; more complex data patterns can be created with multiple clusters. | pyod.utils.data.generate_data_clusters() |
| wpearsonr | Stat | Calculates the weighted Pearson correlation (coefficient) of two samples. | pyod.utils.stat_models.wpearsonr() |
| get_label_n | Utility | Turns raw outlier scores into binary labels by assigning 1 to the top n outlier scores. | pyod.utils.utility.get_label_n() |
| precision_n_scores | Utility | Calculates precision @ rank n. | pyod.utils.utility.precision_n_scores() |
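
The last two utilities are closely related: labeling the top n scores as outliers, then measuring how many of them are true outliers. The sketch below shows the idea with NumPy. The function names `label_top_n` and `precision_at_n` are hypothetical and deliberately differ from the PyOD names, and defaulting n to the number of true outliers is an assumption of this example.

```python
import numpy as np


def label_top_n(scores, n):
    """Assign 1 to the n highest scores and 0 to the rest."""
    scores = np.asarray(scores, dtype=float)
    labels = np.zeros(len(scores), dtype=int)
    # Indices of the n largest scores.
    top = np.argsort(scores, kind="stable")[::-1][:n]
    labels[top] = 1
    return labels


def precision_at_n(y_true, scores, n=None):
    """Fraction of the top-n scored samples that are true outliers.

    If n is None, it is set to the number of true outliers in y_true.
    """
    y_true = np.asarray(y_true, dtype=int)
    if n is None:
        n = int(y_true.sum())
    if n <= 0:
        raise ValueError("n must be positive")
    predicted = label_top_n(scores, n)
    # Count samples that are both predicted and true outliers.
    return float((predicted & y_true).sum() / n)


y_true = [0, 0, 1, 0, 1, 0]
scores = [0.1, 0.9, 0.8, 0.2, 0.3, 0.4]

print(label_top_n(scores, 2))          # marks the samples scored 0.9 and 0.8
print(precision_at_n(y_true, scores))  # one of the two top samples is a true outlier
```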

Algorithm benchmark

A comparison among the implemented models is available below (figure, compare_all_models.py, interactive Jupyter Notebooks). For the Jupyter Notebooks, navigate to "/notebooks/Compare All Models.ipynb".

[Figure: outlier detection]

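The benchmark is supplied to provide an overview of the implemented models for algorithm selection. In total, 17 benchmark datasets are used for comparison; they can be downloaded from ODDS.

Each dataset is first split into 60% for training and 40% for testing. All experiments are repeated 10 times independently with random splits. The mean of the 10 trials is regarded as the final result. Three evaluation metrics are provided:

Area under the ROC (receiver operating characteristic) curve
Precision @ rank n (P@N)
Execution time

Check the latest benchmark. You can reproduce this process by running benchmark.py.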
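
The evaluation protocol described above (random 60/40 splits, 10 independent trials, averaged metrics) can be sketched as follows. This is not benchmark.py: the dataset is synthetic, the detector is a placeholder (distance to the training mean), and all function names are hypothetical. ROC AUC is computed directly from its definition as the probability that a random outlier scores above a random inlier.

```python
import time

import numpy as np


def roc_auc(y_true, scores):
    """Probability that a random outlier outscores a random inlier (ties count 0.5)."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[y_true == 1], scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0) + 0.5 * (diff == 0)).mean())


def distance_to_mean_scores(X_train, X_test):
    """Placeholder detector: distance of each test row to the training mean."""
    return np.linalg.norm(X_test - X_train.mean(axis=0), axis=1)


def benchmark(X, y, score_fn, n_trials=10, train_fraction=0.6, seed=0):
    """Average ROC AUC and run time over repeated random train/test splits."""
    rng = np.random.default_rng(seed)
    n_train = int(round(train_fraction * len(X)))
    aucs, seconds = [], []
    for _ in range(n_trials):
        order = rng.permutation(len(X))
        train, test = order[:n_train], order[n_train:]
        start = time.perf_counter()
        scores = score_fn(X[train], X[test])
        seconds.append(time.perf_counter() - start)
        aucs.append(roc_auc(y[test], scores))
    return float(np.mean(aucs)), float(np.mean(seconds))


# Synthetic stand-in for a benchmark dataset: 180 Gaussian inliers, 20 shifted outliers.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1, size=(180, 2)), rng.normal(8, 1, size=(20, 2))])
y = np.array([0] * 180 + [1] * 20)

mean_auc, mean_seconds = benchmark(X, y, distance_to_mean_scores)
print(f"mean ROC AUC over 10 trials: {mean_auc:.3f}")
print(f"mean scoring time per trial: {mean_seconds:.6f} s")
```

Precision @ rank n could be added to the loop in the same way as the AUC; it is left out here to keep the sketch short.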

References

AAgg15 : Charu C Aggarwal. Outlier analysis. In Data mining, 75–79. Springer, 2015.
AAS15 : Charu C Aggarwal and Saket Sathe. Theoretical foundations and algorithms for outlier ensembles. ACM SIGKDD Explorations Newsletter, 17(1):24–47, 2015.
AABC20 : Yahya Almardeny, Noureddine Boujnah, and Frances Cleary. A novel outlier detection method for multivariate data. IEEE Transactions on Knowledge and Data Engineering, 2020.
AAP02 : Fabrizio Angiulli and Clara Pizzuti. Fast outlier detection in high dimensional spaces. In European Conference on Principles of Data Mining and Knowledge Discovery, 15–27. Springer, 2002.
AAAR96 : Andreas Arning, Rakesh Agrawal, and Prabhakar Raghavan. A linear method for deviation detection in large databases. In KDD, volume 1141, 972–981. 1996.
ABKNS00 : Markus M Breunig, Hans-Peter Kriegel, Raymond T Ng, and Jörg Sander. Lof: identifying density-based local outliers. In ACM sigmod record, volume 29, 93–104. ACM, 2000.
ABHP+18 : Christopher P Burgess, Irina Higgins, Arka Pal, Loic Matthey, Nick Watters, Guillaume Desjardins, and Alexander Lerchner. Understanding disentangling in beta-vae. arXiv preprint arXiv:1804.03599, 2018.
AGD12 : Markus Goldstein and Andreas Dengel. Histogram-based outlier score (hbos): a fast unsupervised anomaly detection algorithm. KI-2012: Poster and Demo Track, pages 59–63, 2012.
AGSW19 : Parikshit Gopalan, Vatsal Sharan, and Udi Wieder. Pidforest: anomaly detection via partial identification. In Advances in Neural Information Processing Systems, 15783–15793. 2019.
AHR04 : Johanna Hardin and David M Rocke. Outlier detection in the multiple cluster setting using the minimum covariance determinant estimator. Computational Statistics & Data Analysis, 44(4):625–638, 2004.
AHXD03 : Zengyou He, Xiaofei Xu, and Shengchun Deng. Discovering cluster-based local outliers. Pattern Recognition Letters, 24(9-10):1641–1650, 2003.
AIH93 : Boris Iglewicz and David Caster Hoaglin. How to detect and handle outliers. Volume 16. Asq Press, 1993.
AJHuszarPvdH12 : JHM Janssens, Ferenc Huszár, EO Postma, and HJ van den Herik. Stochastic outlier selection. Technical Report, Technical report TiCC TR 2012-001, Tilburg University, Tilburg Center for Cognition and Communication, Tilburg, The Netherlands, 2012.
AKW13 : Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
AKKrogerSZ09 : Hans-Peter Kriegel, Peer Kröger, Erich Schubert, and Arthur Zimek. Outlier detection in axis-parallel subspaces of high dimensional data. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, 831–838. Springer, 2009.
AKZ+08 : Hans-Peter Kriegel, Arthur Zimek, and others. Angle-based outlier detection in high-dimensional data. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, 444–452. ACM, 2008.
ALK05 : Aleksandar Lazarevic and Vipin Kumar. Feature bagging for outlier detection. In Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, 157–166. ACM, 2005.
ALCJ+19 : Dan Li, Dacheng Chen, Baihong Jin, Lei Shi, Jonathan Goh, and See-Kiong Ng. Mad-gan: multivariate anomaly detection for time series data with generative adversarial networks. In International Conference on Artificial Neural Networks, 703–716. Springer, 2019.
ALZB+20 : Zheng Li, Yue Zhao, Nicola Botta, Cezar Ionescu, and Xiyang Hu. COPOD: copula-based outlier detection. In IEEE International Conference on Data Mining (ICDM). IEEE, 2020.
ALTZ08 : Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou. Isolation forest. In Data Mining, 2008. ICDM’08. Eighth IEEE International Conference on, 413–422. IEEE, 2008.
ALTZ12 : Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou. Isolation-based anomaly detection. ACM Transactions on Knowledge Discovery from Data (TKDD), 6(1):3, 2012.
ALLZ+19 : Yezheng Liu, Zhe Li, Chong Zhou, Yuanchun Jiang, Jianshan Sun, Meng Wang, and Xiangnan He. Generative adversarial active learning for unsupervised outlier detection. IEEE Transactions on Knowledge and Data Engineering, 2019.
APKGF03 : Spiros Papadimitriou, Hiroyuki Kitagawa, Phillip B Gibbons, and Christos Faloutsos. Loci: fast outlier detection using the local correlation integral. In Data Engineering, 2003. Proceedings. 19th International Conference on, 315–326. IEEE, 2003.
APevny16 : Tomáš Pevný. Loda: lightweight on-line detector of anomalies. Machine Learning, 102(2):275–304, 2016.
ARRS00 : Sridhar Ramaswamy, Rajeev Rastogi, and Kyuseok Shim. Efficient algorithms for mining outliers from large data sets. In ACM Sigmod Record, volume 29, 427–438. ACM, 2000.
ARD99 : Peter J Rousseeuw and Katrien Van Driessen. A fast algorithm for the minimum covariance determinant estimator. Technometrics, 41(3):212–223, 1999.
AScholkopfPST+01 : Bernhard Schölkopf, John C Platt, John Shawe-Taylor, Alex J Smola, and Robert C Williamson. Estimating the support of a high-dimensional distribution. Neural computation, 13(7):1443–1471, 2001.
ASCSC03 : Mei-Ling Shyu, Shu-Ching Chen, Kanoksri Sarinnapakorn, and LiWu Chang. A novel anomaly detection scheme based on principal component classifier. Technical Report, MIAMI UNIV CORAL GABLES FL DEPT OF ELECTRICAL AND COMPUTER ENGINEERING, 2003.
ATCFC02 : Jian Tang, Zhixiang Chen, Ada Wai-Chee Fu, and David W Cheung. Enhancing effectiveness of outlier detections for low density patterns. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, 535–548. Springer, 2002.
AWDL+19 : Xuhong Wang, Ying Du, Shijie Lin, Ping Cui, Yuntian Shen, and Yupu Yang. Advae: a self-adversarial variational autoencoder with gaussian anomaly prior knowledge for anomaly detection. Knowledge-Based Systems, 2019.
AZH18 : Yue Zhao and Maciej K Hryniewicki. Xgbod: improving supervised outlier detection with unsupervised representation learning. In International Joint Conference on Neural Networks (IJCNN). IEEE, 2018.
AZHC+21 : Yue Zhao, Xiyang Hu, Cheng Cheng, Cong Wang, Changlin Wan, Wen Wang, Jianing Yang, Haoping Bai, Zheng Li, Cao Xiao, Yunlong Wang, Zhi Qiao, Jimeng Sun, and Leman Akoglu. Suod: accelerating large-scale unsupervised heterogeneous outlier detection. Proceedings of Machine Learning and Systems, 2021.
AZNHL19 : Yue Zhao, Zain Nasrullah, Maciej K Hryniewicki, and Zheng Li. LSCP: locally selective combination in parallel outlier ensembles. In Proceedings of the 2019 SIAM International Conference on Data Mining, SDM 2019, 585–593. Calgary, Canada, May 2019. SIAM. URL: https://doi.org/10.1137/1.9781611975673.66, doi:10.1137/1.9781611975673.66.
AZNL19 : Yue Zhao, Zain Nasrullah, and Zheng Li. PyOD: a python toolbox for scalable outlier detection. Journal of Machine Learning Research, 20(96):1–7, 2019.
