How to use scikit-learn pipelines

make_pipeline builds a wrapper that chains (input) => (one or more transformers) => (estimator) => (output) into a single object.

  • Transformers (every step except the last) implement fit and transform
  • The estimator (the last step) only needs fit
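
The two bullets above can be sketched by doing the same work by hand and comparing it with the pipeline. This is a minimal illustration, not the library's internal implementation: the synthetic data from make_classification and the variable names (pipe, scaler, pca, clf) are assumptions made for the example and are not part of the original post.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Synthetic data (an assumption for illustration, not the WDBC data)
X, y = make_classification(n_samples=200, n_features=10, random_state=1)
X_train, X_test = X[:150], X[150:]
y_train, y_test = y[:150], y[150:]

# With a pipeline: one fit call and one predict call
pipe = make_pipeline(StandardScaler(),
                     PCA(n_components=2),
                     LogisticRegression(random_state=1))
pipe.fit(X_train, y_train)
pred_pipe = pipe.predict(X_test)

# By hand: each transformer is fit and applied in turn,
# and the final estimator is only fit
scaler = StandardScaler()
pca = PCA(n_components=2)
clf = LogisticRegression(random_state=1)
X_train_reduced = pca.fit_transform(scaler.fit_transform(X_train))
clf.fit(X_train_reduced, y_train)

# At prediction time the transformers only transform;
# nothing is refit on the test data
X_test_reduced = pca.transform(scaler.transform(X_test))
pred_manual = clf.predict(X_test_reduced)

# Both routes should yield the same predictions
print(np.array_equal(pred_pipe, pred_manual))
```

The practical benefit is that the pipeline keeps the "fit on training data, transform only on test data" discipline in one place, so it is harder to leak test data into the scaler or PCA by accident.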
import pandas as pd

from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn.preprocessing import LabelEncoder

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data',
                header=None)

X = df.loc[:, 2:].values
y = df.loc[:, 1].values

## Label-encode y
le = LabelEncoder()
y = le.fit_transform(y)
print(le.classes_)

## Split into training and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, stratify=y, random_state=1)

## Create the pipeline
## sklearn.pipeline.make_pipeline(*steps, **kwargs)

pipe_lr = make_pipeline(StandardScaler(),
                       PCA(n_components=2),
                       LogisticRegression(random_state=1))
pipe_lr.fit(X_train, y_train)
y_pred = pipe_lr.predict(X_test)


## Test Accuracy: 0.956
print('Test Accuracy: %.3f' % pipe_lr.score(X_test, y_test))
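
Because make_pipeline takes bare estimators as its steps, it has to name them itself. The sketch below shows the automatically generated names and, for comparison, the Pipeline class with explicit names. The names 'scale', 'reduce', and 'clf' and the value C=0.1 are arbitrary choices for illustration, not something the original post uses.

```python
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline, make_pipeline

# make_pipeline derives each step name from the lowercased class name
auto = make_pipeline(StandardScaler(),
                     PCA(n_components=2),
                     LogisticRegression(random_state=1))
step_names = [name for name, _ in auto.steps]
print(step_names)

# Pipeline takes explicit (name, estimator) pairs instead
# (these names are hypothetical, chosen only for this example)
named = Pipeline([('scale', StandardScaler()),
                  ('reduce', PCA(n_components=2)),
                  ('clf', LogisticRegression(random_state=1))])

# Steps can be looked up by name, and hyperparameters can be set
# with the <step name>__<parameter> convention
print(auto.named_steps['pca'].n_components)
auto.set_params(logisticregression__C=0.1)
```

The generated names matter mostly when a hyperparameter has to be addressed from outside the pipeline, for example in a parameter grid.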

sklearn.pipeline.make_pipeline — scikit-learn 0.19.1 documentation