Data analysis workflow
Preparation
- Prepare Problem a) Load libraries b) Load dataset
- Summarize Data a) Descriptive statistics b) Data visualizations
- Prepare Data a) Data Cleaning b) Feature Selection c) Data Transforms (Normalize, ...)
Directory structure
```shell
# Top-level .gitignore (entries must be on separate lines)
printf '.DS_Store\n.ipynb_checkpoints/\n' > .gitignore

mkdir -p app/utils/preprocessing config model
mkdir -p data/rawdata data/preprocessed_data data/model_params data/output log tmp

# Keep these directories in git but ignore their contents
for d in data/rawdata data/preprocessed_data data/model_params data/output log tmp; do
  printf '*\n!.gitignore\n' > "$d/.gitignore"
done
```
- When using dvc
```shell
printf '.DS_Store\n.ipynb_checkpoints/\n' > .gitignore
echo 'README' >> README.md

mkdir -p app/utils config model
mkdir -p data/input data/preprocessed_data data/output log tmp
```
Analysis
- Evaluate Algorithms a) Split-out validation dataset b) Test options and evaluation metric c) Spot Check Algorithms d) Compare Algorithms
- Improve Accuracy a) Algorithm Tuning b) Ensembles
- Finalize Model a) Predictions on validation dataset b) Create standalone model on entire training dataset c) Save model for later use
https://www.kaggle.com/dennise/coursera-competition-getting-started-eda
https://towardsdatascience.com/exploratory-data-analysis-eda-a-practical-guide-and-template-for-structured-data-abfbf3ee3bd9
Example (train: transactions from an e-commerce site)
Preparation
1. Prepare Problem a) Load libraries b) Load dataset
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

color = sns.color_palette()
%matplotlib inline

test = pd.read_csv('test.csv.gz', compression='gzip')
train = pd.read_csv('train.csv.gz', compression='gzip')

# First look at the data
train.info()
train.describe()  # if count equals the number of rows, the column has no nulls
train.head()
```
2. Summarize Data a) Descriptive statistics b) Data visualizations
```python
train.price.hist()
train.price.value_counts()
train.price.nunique()

# Histogram on a log scale
train.price.apply(lambda x: np.log10(x + 2)).hist(figsize=(12, 16))

# Plot in index order
train.block_num.plot(figsize=(20, 4))

# Feature engineering
train["category"] = train.name.apply(lambda x: x.split()[0])
date_format = '%d.%m.%Y'
dates = pd.to_datetime(train['date'], format=date_format)
train['day'] = dates.dt.day
train['month'] = dates.dt.month
train['year'] = dates.dt.year
train['weekday'] = dates.dt.dayofweek
train["revenue"] = train.item_cnt_day * train.price
train.groupby("item_category_id").sum()["revenue"].hist(figsize=(20, 4), bins=100)

# Drop columns that are no longer needed
train.drop("date", axis=1, inplace=True)
train.drop("name", axis=1, inplace=True)
train.drop("item_cnt_day", axis=1, inplace=True)

# Revenue over time
train.groupby("block_num").sum()['revenue'].plot()
train.groupby("block_num").mean()['revenue'].plot()

# Scatter plot of prices for a specific item name
prices_hoge = train[train.category == "hogehoge"]["item_price"]
plt.figure(figsize=(20, 8), dpi=80)
plt.scatter(prices_hoge.index, prices_hoge, s=0.1)

# View groupby results as a pivot table
# unstack <--> stack (convert between long and wide hierarchical formats)
train.groupby(["block_num", "item_category_id"]).sum()["revenue"].unstack()
train.pivot_table(index=['shop_id', 'item_id'], columns='block_num',
                  values='item_cnt_day', aggfunc='sum').fillna(0.0)
train.groupby(["block_num", "item_category_id"]).sum()["revenue"].unstack().plot(figsize=(20, 20))
train.groupby(["block_num", "shop_id"]).sum()["revenue"].unstack().plot(figsize=(20, 20))

# Correlations between columns
sns.pairplot(train)

# Fix data-entry errors
train.loc[train["shop_id"] == 0, "shop_id"] = 1

# Check the contents of the test data
test_list = test.shop_id.unique()
out_of_test = [i for i in train.shop_id.unique() if i not in test_list]
```
3. Prepare Data a) Data Cleaning b) Feature Selection c) Data Transforms (Normalize,...)
```python
# a) Data Cleaning
columns_needed = ['item_id', 'shop_id', ...]
df = train[columns_needed]

# b) Feature Selection  c) Data Transforms (Normalize, ...)
df["price_category"] = np.nan
df.loc[(df["price"] >= 0) & (df["price"] <= 10000), "price_category"] = 0
df.loc[df["price"] > 10000, "price_category"] = 1

from sklearn import preprocessing

le = preprocessing.LabelEncoder()
le.fit(df.category)
df["meta_category"] = le.transform(df.category)

scaler = preprocessing.StandardScaler()
scaler.fit(df[['star']])
df['star'] = scaler.transform(df[['star']])

X_train = df.drop("item_cnt_month", axis=1)  # Reason for dropping item_price explained below
y_train = df["item_cnt_month"]
X_train.fillna(0, inplace=True)
```
Analysis
1. Evaluate Algorithms a) Split-out validation dataset b) Test options and evaluation metric c) Spot Check Algorithms d) Compare Algorithms
```python
import lightgbm as lgb
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Linear regression
linmodel = LinearRegression()
linmodel.fit(X_train, y_train)
lin_pred = linmodel.predict(X_train)
print('R-squared is %f' % r2_score(y_train, lin_pred))

# Boosting model
lgb_params = {
    'feature_fraction': 0.75,
    'metric': 'rmse',
    'nthread': 1,
    'min_data_in_leaf': 2**7,
    'bagging_fraction': 0.75,
    'learning_rate': 0.03,
    'objective': 'mse',
    'bagging_seed': 2**7,
    'num_leaves': 2**7,
    'bagging_freq': 1,
    'verbose': 0,
}
model = lgb.train(lgb_params, lgb.Dataset(X_train, label=y_train), 100)
pred_lgb = model.predict(X_train)
print('R-squared is %f' % r2_score(y_train, pred_lgb))
```
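Note that the snippets above score on the training data itself, so the R-squared values are optimistic; step a) calls for a held-out validation set. A minimal sketch of that split, using synthetic stand-in data (the array shapes, coefficients, and `random_state` here are illustrative assumptions, not values from the competition data):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic stand-in for X_train / y_train above
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Hold out 20% as a validation set so the metric is not computed on training data
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_tr, y_tr)
val_pred = model.predict(X_val)
print('validation R-squared is %f' % r2_score(y_val, val_pred))
```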
http://kidnohr.hatenadiary.com/entry/2018/09/21/012446
2. Improve Accuracy a) Algorithm Tuning b) Ensembles
```python
# Stack the two models' predictions as meta-features
meta_feature = np.c_[lin_pred, pred_lgb]
meta_lr = LinearRegression()
meta_lr.fit(meta_feature, y_train)
meta_pred = meta_lr.predict(meta_feature)
print('R-squared is %f' % r2_score(y_train, meta_pred))
```
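For the Algorithm Tuning step, `GridSearchCV` is one standard pattern. The sketch below tunes a Ridge regularization strength on invented synthetic data purely for illustration; the same loop applies to the LightGBM parameters above (`num_leaves`, `learning_rate`, ...) via `lgb.LGBMRegressor`:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for the real features/target
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = X @ np.array([2.0, 0.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Grid-search the regularization strength with 5-fold cross-validation
param_grid = {'alpha': [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(Ridge(), param_grid, cv=5, scoring='r2')
search.fit(X, y)
print(search.best_params_, search.best_score_)
```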
3. Finalize Model a) Predictions on validation dataset b) Create standalone model on entire training dataset c) Save model for later use
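For step c), a minimal sketch of persisting and reloading a fitted model with `joblib` (the toy data and file name are assumptions for illustration; in this project layout the model would go under `model/`):

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Fit a trivial model on exactly linear toy data (stand-in for the full training set)
X = np.arange(10, dtype=float).reshape(-1, 1)
y = 3 * X.ravel() + 1
model = LinearRegression().fit(X, y)

# Persist to disk and reload; predictions from the reloaded model should match
path = os.path.join(tempfile.gettempdir(), 'final_model.joblib')
joblib.dump(model, path)
loaded = joblib.load(path)
print(loaded.predict([[11.0]]))
```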
Extras
groupby cheat sheet
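As a starting point for such a cheat sheet, a few common `groupby` patterns used above (the toy DataFrame is invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'shop_id': [0, 0, 1, 1, 1],
    'item_id': [10, 11, 10, 10, 12],
    'revenue': [100.0, 50.0, 30.0, 70.0, 20.0],
})

# Sum per group
total = df.groupby('shop_id')['revenue'].sum()

# Several aggregations at once
stats = df.groupby('shop_id')['revenue'].agg(['sum', 'mean', 'count'])

# Multi-key groupby, then unstack to a wide table (long <--> wide)
wide = df.groupby(['shop_id', 'item_id'])['revenue'].sum().unstack(fill_value=0.0)
print(total, stats, wide, sep='\n')
```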