Data analysis workflow

Preparation

  1. Prepare Problem a) Load libraries b) Load dataset
  2. Summarize Data a) Descriptive statistics b) Data visualizations
  3. Prepare Data a) Data Cleaning b) Feature Selection c) Data Transforms (Normalize,...)

Analysis

  1. Evaluate Algorithms a) Split-out validation dataset b) Test options and evaluation metric c) Spot Check Algorithms d) Compare Algorithms
  2. Improve Accuracy a) Algorithm Tuning b) Ensembles
  3. Finalize Model a) Predictions on validation dataset b) Create standalone model on entire training dataset c) Save model for later use

https://www.kaggle.com/dennise/coursera-competition-getting-started-eda

Example (train: transactions from an e-commerce site)

Preparation

1. Prepare Problem a) Load libraries b) Load dataset

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
color = sns.color_palette()
%matplotlib inline 

from scipy import stats


test=pd.read_csv('test.csv.gz',compression='gzip')
train=pd.read_csv('train.csv.gz',compression='gzip')

# First look at the data
train.info()
train.describe() # if count equals the number of rows, the column has no nulls
train.head()
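
describe() only reports numeric columns by default, so counting nulls per column directly may be a more reliable check. A minimal sketch on a toy frame; the frame and its column names are illustrative assumptions, not the real train data:

```python
import numpy as np
import pandas as pd

# Toy stand-in for train; columns and values are made up for illustration.
toy = pd.DataFrame({
    "shop_id": [1, 2, 2, 3],
    "price": [100.0, np.nan, 250.0, 80.0],
    "name": ["a x", None, "b y", "c z"],
})

null_counts = toy.isnull().sum()     # number of missing values per column
null_ratio = null_counts / len(toy)  # share of rows that are missing, per column
print(null_counts)
print(null_ratio)
```

The same two lines should work unchanged on train, including its string columns.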

2. Summarize Data a) Descriptive statistics b) Data visualizations

train.price.hist()
train.price.value_counts()
train.price.nunique()

# histogram on a log10 scale
train.price.apply(lambda x: np.log10(x+2)).hist(figsize=(12,16))

# plot in index order
train.block_num.plot(figsize=(20,4))

# build derived columns
train["category"]=train.name.apply(lambda x:x.split()[0])

date_format = '%d.%m.%Y'
parsed_date = pd.to_datetime(train['date'], format=date_format)  # parse once, reuse below
train['day'] = parsed_date.dt.day
train['month'] = parsed_date.dt.month
train['year'] = parsed_date.dt.year
train['weekday'] = parsed_date.dt.dayofweek

train["revenue"]=train.item_cnt_day * train.price

train.groupby("item_category_id")["revenue"].sum().hist(figsize=(20,4),bins=100)

# drop unneeded columns
train.drop("date",axis=1,inplace=True)
train.drop("name",axis=1,inplace=True)
train.drop("item_cnt_day",axis=1,inplace=True)

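# revenue over time (total and mean per block)
# select the column before aggregating so string columns are not aggregated
train.groupby("block_num")["revenue"].sum().plot()
train.groupby("block_num")["revenue"].mean().plot()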

# scatter plot of prices for one specific category
prices_hoge=train[train.category=="hogehoge"]["price"]

plt.figure(figsize=(20, 8), dpi=80)
plt.scatter(prices_hoge.index, prices_hoge,s=0.1)

# view the groupby result as a pivot table
# unstack <--> stack (reshaping hierarchically indexed data)
train.groupby(["block_num","item_category_id"])["revenue"].sum().unstack()
# pivot_table below needs item_cnt_day, so run it before that column is dropped
train.pivot_table(index=['shop_id','item_id'], columns='block_num', values='item_cnt_day',aggfunc='sum').fillna(0.0)

train.groupby(["block_num","item_category_id"])["revenue"].sum().unstack().plot(figsize=(20,20))
train.groupby(["block_num","shop_id"])["revenue"].sum().unstack().plot(figsize=(20,20))
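
To make the groupby/unstack and pivot_table relationship concrete, here is a small sketch on a toy frame. The data is invented for illustration; only the column names mirror the ones used above.

```python
import pandas as pd

# Toy transactions; values are made up for illustration.
toy = pd.DataFrame({
    "block_num": [0, 0, 1, 1, 1],
    "shop_id": [1, 2, 1, 1, 2],
    "revenue": [10.0, 20.0, 5.0, 7.0, 30.0],
})

# Long -> wide: one row per block_num, one column per shop_id
wide_unstack = toy.groupby(["block_num", "shop_id"])["revenue"].sum().unstack()

# The same table built directly with pivot_table
wide_pivot = toy.pivot_table(index="block_num", columns="shop_id",
                             values="revenue", aggfunc="sum")

# Wide -> long again: stack() is the inverse of unstack()
long_again = wide_unstack.stack()

print(wide_unstack)
print(long_again)
```

Both wide tables should hold the same numbers; pivot_table is mostly a shortcut for groupby + aggregate + unstack.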

# pairwise relationships between columns
sns.pairplot(train)

# fix data-entry errors
train.loc[train["shop_id"]==0,"shop_id"]=1

# check the contents of the test data (shops in train that never appear in test)
test_list = test.shop_id.unique()
out_of_test = [i for i in train.shop_id.unique() if i not in test_list]
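
The same check can be written with set differences, which also makes it easy to look in the other direction (ids that appear only in test). A sketch on toy frames; the data is hypothetical:

```python
import pandas as pd

# Toy stand-ins for train and test; the shop ids are made up.
toy_train = pd.DataFrame({"shop_id": [1, 2, 3, 3, 5]})
toy_test = pd.DataFrame({"shop_id": [2, 3, 4]})

# Shops seen only in train, and shops seen only in test
only_in_train = sorted(set(toy_train.shop_id) - set(toy_test.shop_id))
only_in_test = sorted(set(toy_test.shop_id) - set(toy_train.shop_id))

# Keep only the training rows whose shop also appears in the test set
train_in_test = toy_train[toy_train.shop_id.isin(toy_test.shop_id)]

print(only_in_train, only_in_test, len(train_in_test))
```

Whether to actually drop the train-only shops is a modelling decision; this only shows how to find them.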

3. Prepare Data a) Data Cleaning b) Feature Selection c) Data Transforms (Normalize,...)

# a) Data Cleaning
columns_needed = ['item_id', 'shop_id', ...]
df = train[columns_needed].copy()  # explicit copy, so new columns are added to df only

# b) Feature Selection c) Data Transforms (Normalize,...)

df["price_category"]=np.nan
# use .loc with a mask; chained indexing may assign to a temporary copy
df.loc[(df["price"]>=0)&(df["price"]<=10000),"price_category"]=0
df.loc[df["price"]>10000,"price_category"]=1

from sklearn import preprocessing

le = preprocessing.LabelEncoder()
le.fit(df.category)
df["meta_category"] = le.transform(df.category)

scaler = preprocessing.StandardScaler()
scaler.fit(df[["star"]])  # fit the scaler (not the label encoder); it expects 2-D input
df['star'] = scaler.transform(df[["star"]]).ravel()

X_train=df.drop("item_cnt_month", axis=1)
# Reason for dropping item_price explained below
y_train=df["item_cnt_month"]

X_train.fillna(0, inplace=True)

Analysis

1. Evaluate Algorithms a) Split-out validation dataset b) Test options and evaluation metric c) Spot Check Algorithms d) Compare Algorithms

import lightgbm as lgb
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# linear regression
linmodel=LinearRegression()
linmodel.fit(X_train, y_train)
lin_pred=linmodel.predict(X_train)

print('R-squared is %f' % r2_score(y_train, lin_pred))  # r2_score(y_true, y_pred)

# boost model
lgb_params = {
               'feature_fraction': 0.75,
               'metric': 'rmse',
               'nthread':1, 
               'min_data_in_leaf': 2**7, 
               'bagging_fraction': 0.75, 
               'learning_rate': 0.03, 
               'objective': 'mse', 
               'bagging_seed': 2**7, 
               'num_leaves': 2**7,
               'bagging_freq':1,
               'verbose':0 
              }

model = lgb.train(lgb_params, lgb.Dataset(X_train, label=y_train), 100)
pred_lgb = model.predict(X_train)

print('R-squared is %f' % r2_score(y_train, pred_lgb))  # r2_score(y_true, y_pred)

http://kidnohr.hatenadiary.com/entry/2018/09/21/012446
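
The scores above are computed on the same rows the models were fit on, so they say little about generalization. For step a) (split-out validation dataset), one option with time-indexed data is to hold out the last block. A minimal sketch on synthetic data; the toy frame, the "feature" column, and the choice of the last block as validation set are all assumptions for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic stand-in for the prepared data (hypothetical columns and values).
rng = np.random.default_rng(0)
n = 600
toy = pd.DataFrame({
    "block_num": rng.integers(0, 6, size=n),
    "feature": rng.normal(size=n),
})
toy["item_cnt_month"] = 3.0 * toy["feature"] + rng.normal(scale=0.5, size=n)

# Time-based split: the most recent block is held out for validation
is_valid = toy["block_num"] == toy["block_num"].max()
feature_cols = ["block_num", "feature"]

X_tr, y_tr = toy.loc[~is_valid, feature_cols], toy.loc[~is_valid, "item_cnt_month"]
X_va, y_va = toy.loc[is_valid, feature_cols], toy.loc[is_valid, "item_cnt_month"]

# Fit on the earlier blocks only, score on the held-out block
model = LinearRegression().fit(X_tr, y_tr)
valid_r2 = r2_score(y_va, model.predict(X_va))  # r2_score(y_true, y_pred)
print('validation R-squared is %f' % valid_r2)
```

A random split via sklearn's train_test_split would also work, but with time-ordered transactions it can leak future information into training.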

2. Improve Accuracy a) Algorithm Tuning b) Ensembles

meta_feature = np.c_[lin_pred, pred_lgb]
meta_lr = LinearRegression()
meta_lr.fit(meta_feature, y_train)  # target must match the rows behind lin_pred / pred_lgb

meta_pred = meta_lr.predict(meta_feature)

print('R-squared is %f' % r2_score(y_train, meta_pred))  # r2_score(y_true, y_pred)
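
The meta-model above is fit on in-sample predictions of the base models, which tends to favour whichever base model overfits most. A common alternative is to build the meta-features from out-of-fold predictions. The sketch below assumes synthetic data and uses sklearn's GradientBoostingRegressor as a stand-in for LightGBM so that it runs with scikit-learn alone:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold, cross_val_predict

# Synthetic regression data (hypothetical; not the post's dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = 2.0 * X[:, 0] + np.sin(3.0 * X[:, 1]) + rng.normal(scale=0.1, size=300)

cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Out-of-fold predictions: each row is predicted by a model that never saw it
oof_lin = cross_val_predict(LinearRegression(), X, y, cv=cv)
oof_gbr = cross_val_predict(GradientBoostingRegressor(random_state=0), X, y, cv=cv)

# Stack the base predictions as meta-features and fit the meta-model on them
meta_X = np.c_[oof_lin, oof_gbr]
meta_model = LinearRegression().fit(meta_X, y)
meta_r2 = r2_score(y, meta_model.predict(meta_X))
print('stacked R-squared is %f' % meta_r2)
```

For time-ordered data, the shuffled KFold here would probably need to be replaced by a time-aware split.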

3. Finalize Model a) Predictions on validation dataset b) Create standalone model on entire training dataset c) Save model for later use
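
For step c) (save model for later use), a minimal sketch with joblib, assuming a scikit-learn estimator. The synthetic data, the file name, and the temporary directory are placeholders chosen for illustration:

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the full training set (hypothetical).
rng = np.random.default_rng(0)
X_all = rng.normal(size=(100, 2))
y_all = X_all @ np.array([1.5, -2.0]) + 0.3

# b) fit the standalone model on the entire training data
final_model = LinearRegression().fit(X_all, y_all)

# c) persist it to disk, then load it back
model_path = os.path.join(tempfile.mkdtemp(), "final_model.joblib")
joblib.dump(final_model, model_path)
restored = joblib.load(model_path)

# The restored model should reproduce the original predictions
same_predictions = np.allclose(final_model.predict(X_all), restored.predict(X_all))
print(same_predictions)
```

For the LightGBM booster trained earlier, model.save_model(path) together with lgb.Booster(model_file=path) should be the analogous pair of calls.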