2018-07-18

About Varnish

Varnish

I may implement this someday, so here is a note for later.

github.com

qiita.com

2018-07-18

How to list the MySQL tables that correspond to Django models

Python3 Django MySQL

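Extract them with the following command. In information_schema.tables, table_schema is the database name, so replace 'tablename' with your database name and 'prefix_' with the table prefix of the Django app.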

mysql -uroot -N information_schema -e "select table_name from tables where table_schema = 'tablename' and table_name like 'prefix_%'" > table.txt
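One caveat with the pattern above: in SQL LIKE, the underscore is itself a wildcard that matches any single character, so 'prefix_%' also matches names such as prefixXuser. The sketch below emulates LIKE matching with a regular expression to show the difference. The helper like_to_regex and the table names are hypothetical, and the emulation ignores collation and case-insensitivity.

```python
import re


def like_to_regex(pattern, escape="\\"):
    """Translate a SQL LIKE pattern into an anchored regular expression."""
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == escape and i + 1 < len(pattern):
            # An escaped character is matched literally.
            out.append(re.escape(pattern[i + 1]))
            i += 2
            continue
        if ch == "%":
            out.append(".*")   # % matches any run of characters
        elif ch == "_":
            out.append(".")    # _ matches exactly one character
        else:
            out.append(re.escape(ch))
        i += 1
    return re.compile("^" + "".join(out) + "$", re.DOTALL)


# Hypothetical table names for illustration only.
tables = ["prefix_user", "prefixXuser", "other_table"]

loose = [t for t in tables if like_to_regex("prefix_%").match(t)]
strict = [t for t in tables if like_to_regex("prefix\\_%").match(t)]

print(loose)
print(strict)
```

If the looser match is a problem, writing the pattern as 'prefix\_%' in the query restricts it to a literal underscore.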

2018-07-11

Classifying tf's MNIST with a neural network

Machine Learning Python3

The accuracy came out on the low side, at about 90%. I will look into the cause another time.

import numpy as np
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data

mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)

mnist.train.images.shape

=>
(55000, 784)

n_features = mnist.train.images.shape[1]
y_onehot = mnist.train.labels
n_classes = 10
random_seed = 123
np.random.seed(random_seed)

g = tf.Graph()
with g.as_default():
    tf.set_random_seed(random_seed)
    tf_x = tf.placeholder(
        dtype=tf.float32,
        shape=(None, n_features),
        name='tf_x'
    )
    tf_y = tf.placeholder(
        dtype=tf.int32,
        shape=(None, n_classes),
        name='tf_y'
    )
    h1 = tf.layers.dense(
        inputs=tf_x,
        units=50,
        activation=tf.tanh,
        name='layer1',
    )
    h2 = tf.layers.dense(
        inputs=h1,
        units=50,
        activation=tf.tanh,
        name='layer2',
    )
    logits = tf.layers.dense(
        inputs=h2,
        units=10,
        activation=None,
        name='layer3'
    )
    predictions = {
        'classes': tf.argmax(logits, axis=1, name='predicted_classes'),
        'probabilities': tf.nn.softmax(logits, name='softmax_tensor')
    }

with g.as_default():
    cost = tf.losses.softmax_cross_entropy(
        onehot_labels=tf_y,
        logits=logits
    )
    optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001)
    train_op = optimizer.minimize(loss=cost)
    init_op = tf.global_variables_initializer()

sess = tf.Session(graph=g)
sess.run(init_op)

training_costs = []

for epoch in range(50):
    batch_size = 128
    for i in range((mnist.train.images.shape[0] // batch_size) + 1):
        batch_xs, batch_ys = mnist.train.next_batch(batch_size, shuffle=True)
        feed = {tf_x: batch_xs, tf_y: batch_ys}
        _, batch_cost = sess.run([train_op, cost], feed_dict=feed)
        training_costs.append(batch_cost)
    # training_costs is never reset, so this prints the running average
    # over all batches seen so far, not the average of this epoch alone
    print(' -- Epoch %2d   Avg. Training Loss: %.4f' % (epoch + 1, np.mean(training_costs)))

 -- Epoch  1   Avg. Training Loss: 2.2021
 -- Epoch  2   Avg. Training Loss: 2.0037
 -- Epoch  3   Avg. Training Loss: 1.8452
 -- Epoch  4   Avg. Training Loss: 1.7153
 -- Epoch  5   Avg. Training Loss: 1.6072
 -- Epoch  6   Avg. Training Loss: 1.5165
 -- Epoch  7   Avg. Training Loss: 1.4384
 -- Epoch  8   Avg. Training Loss: 1.3709
 -- Epoch  9   Avg. Training Loss: 1.3118
 -- Epoch 10   Avg. Training Loss: 1.2595
 -- Epoch 11   Avg. Training Loss: 1.2130
 -- Epoch 12   Avg. Training Loss: 1.1712
 -- Epoch 13   Avg. Training Loss: 1.1335
 -- Epoch 14   Avg. Training Loss: 1.0990
 -- Epoch 15   Avg. Training Loss: 1.0678
 -- Epoch 16   Avg. Training Loss: 1.0392
 -- Epoch 17   Avg. Training Loss: 1.0125
 -- Epoch 18   Avg. Training Loss: 0.9880
 -- Epoch 19   Avg. Training Loss: 0.9652
 -- Epoch 20   Avg. Training Loss: 0.9441
 -- Epoch 21   Avg. Training Loss: 0.9241
 -- Epoch 22   Avg. Training Loss: 0.9055
 -- Epoch 23   Avg. Training Loss: 0.8881
 -- Epoch 24   Avg. Training Loss: 0.8716
 -- Epoch 25   Avg. Training Loss: 0.8562
 -- Epoch 26   Avg. Training Loss: 0.8416
 -- Epoch 27   Avg. Training Loss: 0.8278
 -- Epoch 28   Avg. Training Loss: 0.8146
 -- Epoch 29   Avg. Training Loss: 0.8021
 -- Epoch 30   Avg. Training Loss: 0.7902
 -- Epoch 31   Avg. Training Loss: 0.7789
 -- Epoch 32   Avg. Training Loss: 0.7683
 -- Epoch 33   Avg. Training Loss: 0.7579
 -- Epoch 34   Avg. Training Loss: 0.7480
 -- Epoch 35   Avg. Training Loss: 0.7385
 -- Epoch 36   Avg. Training Loss: 0.7295
 -- Epoch 37   Avg. Training Loss: 0.7208
 -- Epoch 38   Avg. Training Loss: 0.7125
 -- Epoch 39   Avg. Training Loss: 0.7046
 -- Epoch 40   Avg. Training Loss: 0.6969
 -- Epoch 41   Avg. Training Loss: 0.6894
 -- Epoch 42   Avg. Training Loss: 0.6823
 -- Epoch 43   Avg. Training Loss: 0.6754
 -- Epoch 44   Avg. Training Loss: 0.6687
 -- Epoch 45   Avg. Training Loss: 0.6623
 -- Epoch 46   Avg. Training Loss: 0.6560
 -- Epoch 47   Avg. Training Loss: 0.6500
 -- Epoch 48   Avg. Training Loss: 0.6442
 -- Epoch 49   Avg. Training Loss: 0.6386
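Because the training loop appends every batch cost to training_costs without resetting it, the values in the log are running averages over the whole run, which lag behind the loss of the most recent epoch. A minimal sketch of the difference, using made-up batch costs (the numbers are hypothetical, not taken from the run above):

```python
# Hypothetical batch costs for two epochs, two batches each.
costs_per_epoch = [[2.0, 1.8], [1.0, 0.8]]

all_costs = []      # never reset, like training_costs in the post
running_avg = []    # what the log prints
epoch_avg = []      # average of the current epoch only

for epoch_costs in costs_per_epoch:
    all_costs.extend(epoch_costs)
    running_avg.append(sum(all_costs) / len(all_costs))
    epoch_avg.append(sum(epoch_costs) / len(epoch_costs))

print(running_avg)
print(epoch_avg)
```

In this toy case the second epoch averages 0.9, while the running average still reports about 1.4.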

feed = {tf_x: mnist.test.images,}
y_pred = sess.run(predictions['classes'], feed_dict=feed)

y_pred
=>
array([7, 2, 1, ..., 4, 5, 6])

y_test = np.argmax(mnist.test.labels, axis=1)

100 * np.sum(y_pred == y_test) / y_test.shape[0]
=>
90.77
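The accuracy check above boils down to turning the one-hot labels back into class indices and counting matches. A small sketch with toy data (the arrays are hypothetical) makes the steps explicit:

```python
import numpy as np

# Hypothetical one-hot labels for four samples and three classes.
labels_onehot = np.array([
    [0, 0, 1],
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
])
# Hypothetical predicted class indices; the last one is wrong.
y_pred_toy = np.array([2, 0, 1, 1])

# argmax along axis 1 recovers the class index from each one-hot row.
y_true_toy = np.argmax(labels_onehot, axis=1)

# The mean of the boolean match array is the fraction of correct predictions.
accuracy = 100 * np.mean(y_pred_toy == y_true_toy)
print(accuracy)
```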

2018-07-09

Solr performance tuning

Solr

yomon.hatenablog.com

JVM Settings | Apache Solr Reference Guide 6.6

SolrPerformanceFactors - Solr Wiki

ShawnHeisey - Solr Wiki

SolrPerformanceProblems - Solr Wiki

Tuning the JVM itself

yoskhdia.hatenablog.com

fomsan.sakura.ne.jp

2018-07-05

word2vec is impressive

Machine Learning Python3

It is quite impressive. It looks like it could also absorb spelling and notation variants.

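from gensim.models import word2vec

# df_id and _split_to_rawwords are not defined in this post
ls = []
for row in df_id['review_comment'].values[:100000]:
    ls.append(_split_to_rawwords(row))
# gensim 4.0 and later renamed the size argument to vector_size
model = word2vec.Word2Vec(ls, size=500, window=5, min_count=5, workers=4)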

model.wv.most_similar(positive=['エアコン'])
...

model.save("./review.model")
model = word2vec.Word2Vec.load("./review.model")
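most_similar ranks words by the cosine similarity of their vectors, which is why notation variants that appear in similar contexts end up close together. The sketch below reproduces that ranking with plain NumPy on hand-made three-dimensional vectors. The words, the vectors, and the helper rank_by_cosine are all hypothetical and only stand in for a trained model.

```python
import numpy as np


def rank_by_cosine(query, vectors, topn=2):
    """Return the topn words whose vectors are closest to query by cosine similarity."""
    q = vectors[query] / np.linalg.norm(vectors[query])
    scores = []
    for word, vec in vectors.items():
        if word == query:
            continue  # the query word itself is excluded from the ranking
        scores.append((word, float(np.dot(q, vec / np.linalg.norm(vec)))))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Hypothetical toy vectors; a real model would have hundreds of dimensions.
toy_vectors = {
    "aircon": np.array([1.0, 0.9, 0.0]),
    "air-conditioner": np.array([0.9, 1.0, 0.1]),
    "cooler": np.array([0.8, 0.6, 0.3]),
    "book": np.array([0.0, 0.1, 1.0]),
}

ranked = rank_by_cosine("aircon", toy_vectors)
for word, score in ranked:
    print(word, round(score, 3))
```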

deepage.net

radimrehurek.com

towardsdatascience.com

Vector Representations of Words | TensorFlow

2018-07-05

Review analysis with topic extraction using LDA (Latent Dirichlet Allocation)

Machine Learning Python3

A summary of how I analyze reviews.

import os
import glob
import sys
from datetime import (datetime, date, timedelta)
import logging
import re
import shutil
import tempfile

import pandas as pd
import numpy as np
from scipy.sparse import csc_matrix, csr_matrix

from sklearn.metrics.pairwise import (
    cosine_similarity,
    euclidean_distances,
)
from sklearn import preprocessing
from sklearn.decomposition import NMF
from sklearn.cluster import KMeans


from IPython.display import display, HTML
from pandas.plotting import table
import matplotlib.pyplot as plt


from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation


import pickle

import MeCab

n_samples = 500
n_features = 1000
n_components = 5
n_top_words = 20


def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        message = "Topic #%d: " % topic_idx
        message += " ".join([feature_names[i]
                             for i in topic.argsort()[:-n_top_words - 1:-1]])
        print(message)
    print()

def lda_print_top_words(components, feature_names, n_top_words):
    for topic_idx, topic in enumerate(components):
        message = "Topic #%d: " % topic_idx
        message += " ".join([feature_names[i]
                             for i in topic.argsort()[:-n_top_words - 1:-1]])
        print(message)
    print()


def is_bigger_than_min_tfidf(term, terms, tfidfs):
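    '''
    Meant to be used as
    [term for term in terms if is_bigger_than_min_tfidf(term, terms, tfidfs)].
    Looks up the tf-idf value of term in tfidfs (a list aligned with terms)
    and returns True if it is greater than MIN_TFIDF.
    MIN_TFIDF is not defined in this post and must be set before calling.
    '''
    return tfidfs[terms.index(term)] > MIN_TFIDF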


def tfidf(values):
    # analyzer is a function that takes a string and returns a list of strings
    vectorizer = TfidfVectorizer(analyzer=stems, min_df=1, max_df=50, max_features=n_features)
    corpus = [v for v in values]

    x = vectorizer.fit_transform(corpus)

    return x, vectorizer  # the caller receives x as tfidf_result


def countvec(values):
    # analyzer is a function that takes a string and returns a list of strings
    vectorizer = CountVectorizer(analyzer=stems, min_df=1, max_df=50, max_features=n_features)
    corpus = [v for v in values]

    x = vectorizer.fit_transform(corpus)

    return x, vectorizer  # the caller receives x as tf_result


def _split_to_words(text, to_stem=False):
    """
    Input:  'すべて自分のほうへ'
    Tokens: 'すべて', '自分', 'の', 'ほう', 'へ'
    Output: the unique tokens as a list, with stop words removed
            (order is not preserved because a set is used)
    """
    tagger = MeCab.Tagger('mecabrc')  # any other Tagger works too
    mecab_result = tagger.parse(text)
    info_of_words = mecab_result.split('\n')
    words = []
    for info in info_of_words:
        # MeCab output ends with 'EOS' followed by an empty string
        if info == 'EOS' or info == '':
            break
        # info => 'な\t助詞,終助詞,*,*,*,*,な,ナ,ナ'
        # Split on the tab so the surface form is correct for any
        # part-of-speech name length (the old [:-3] slice assumed two characters)
        surface, _, feature = info.partition('\t')
        info_elems = feature.split(',')
        # Index 6 holds the base (uninflected) form; if it is '*', use the surface form
        if info_elems[6] == '*':
            words.append(surface)
            continue
        if to_stem:
            # Convert to the base form
            words.append(info_elems[6])
            continue
        # Keep the word as it appears
        words.append(surface)
    words_set = set(words).difference(stop_words_set)
    return list(words_set)


def words(text):
    words = _split_to_words(text=text, to_stem=False)
    return words


def stems(text):
    stems = _split_to_words(text=text, to_stem=True)
    return stems

# For review analysis
def review_topickeyword_nmf(item_id):
    df_item_id = df_id.loc[df_id['review_item_id'] == item_id]
    if df_item_id['review_item_id'].count() == 0:
        print('None Item')
        return
    print('itemid: {0}, review_count: {1}'.format(item_id, df_item_id['review_item_id'].count()))
    # Copy so the column assignments below do not trigger SettingWithCopyWarning
    df_test = df_item_id[['review_rating', 'review_title', 'review_comment']].copy()
    df_test[['review_comment']] = df_test[['review_comment']].applymap(lambda x: '{}'.format(x.replace('\\n', '\n')))
    df_test[['review_rating']] = df_test[['review_rating']].astype('int64')
    for num in range(5, 0, -1):
        df_test_part = df_test.loc[df_test['review_rating'] == num]
        print('review_rating: {0} count: {1}'.format(num, df_test_part['review_rating'].count()))
        y = df_test_part.values[:n_samples, 0]
        tfidf_result, tfidf_vectorizer = tfidf(df_test_part.values[:n_samples, 2])
        tf_result, tf_vectorizer = countvec(df_test_part.values[:n_samples, 2])
        nmf = NMF(n_components=n_components, random_state=1,
              beta_loss='kullback-leibler', solver='mu', max_iter=1000, alpha=.1,
              l1_ratio=.5).fit(tfidf_result)
        tfidf_feature_names = tfidf_vectorizer.get_feature_names()
        print_top_words(nmf, tfidf_feature_names, n_top_words)


def review_topickeyword_lda(item_id):
    df_item_id = df_id.loc[df_id['review_item_id'] == item_id]
    if df_item_id['review_item_id'].count() == 0:
        print('None Item')
        return
    print('itemid: {0}, review_count: {1}'.format(item_id, df_item_id['review_item_id'].count()))
    # Copy so the column assignments below do not trigger SettingWithCopyWarning
    df_test = df_item_id[['review_rating', 'review_title', 'review_comment']].copy()
    df_test[['review_comment']] = df_test[['review_comment']].applymap(lambda x: '{}'.format(x.replace('\\n', '\n')))
    df_test[['review_rating']] = df_test[['review_rating']].astype('int64')
    for num in range(5, 0, -1):
        df_test_part = df_test.loc[df_test['review_rating'] == num]
        print('review_rating: {0} count: {1}'.format(num, df_test_part['review_rating'].count()))
        y = df_test_part.values[:n_samples, 0]
        tf_result, tf_vectorizer = countvec(df_test_part.values[:n_samples, 2])
        lda = LatentDirichletAllocation(n_components=n_components, max_iter=5,
                                    learning_method='batch',
                                    learning_offset=50.,
                                    random_state=0)
        document_topics = lda.fit_transform(tf_result)
        tf_feature_names = tf_vectorizer.get_feature_names()
        print_top_words(lda, tf_feature_names, n_top_words)

# Filtering out stop words improves the quality of the topics
stop_words = """
こちら
それなり
なか
分
手段
かたち
列
店
前回
書
など
まで
すね
土
まし
0
いや
もう
よそ
は
左
ごろ
はじめ
歴
都
〜
今回
多く
本当
文
玉
系
千
あちら
また
！
様
百
内
男
もの
どちら
婦
頃
カ所
6
どこか
いま
未満
者
せる
論
思う
中
俺
ヶ月
円
時点
ハイ
紀
8
お
だ
を
ぜんぶ
ヶ所
そう
年生
みつ
誰
億
道
間
一
くせ
品
見る
彼
何人
関係
結局
箇所
5
する
県
地
の
3
自分
ごっちゃ
ある
境
しかた
市
ほか
校
あな
ー
こっち
ため
事
以後
第
金
って
とても
ちゃん
月
国
し
なん
際
方
感じ
ひと
場合
ヵ所
今
何
楽
会
これ
怒
かく
課
週
すぎる
やつ
通
席
同じ
簿
や
2
向こう
しまう
いろいろ
て
例
だめ
。
五
下記
カ月
どれ
わけ
が
まま
類
後
7
幾つ
又
なに
時間
だけ
です
おれ
ば
もと
もん
あなた
六
られる
輪
新た
てる
首
高
度
元
式
略
哀
確か
ところ
員
すべて
言う
たい
なかば
区
所
のに
喜
右
4
きた
その後
目
伸
以上
村
できる
それぞれ
みなさん
界
そっち
名
…
家
これら
ほう
いつ
近く
うち
ます
ので
町
あれ
ぺん
か
おまえ
かやの
ごと
な
等
よう
達
力
人
わたし
ひとつ
私
兆
子
化
九
まとも
ぶり
見
次
体
いる
とき
自体
毎日
なんて
万
回
ない
段
十
気
どっか
に
さまざま
本当に
、
前
情
いくつ
た
う
ん
匹
ほど
みたい
他
てん
年
部
たび
係
1
二
台
よ
ここ
様々
形
以下
奴
さらい
全部
別
上記
彼女
ヵ月
箇月
どっち
それ
性
はるか
以降
作
でも
あまり
どこ
火
すか
べつ
こと
がら
連
レ
と
そこ
違い
線
半ば
けど
がい
さ
あっち
各
場
水
あたり
的
ふく
四
一つ
たくさん
より
さん
室
八
ずつ
で
はず
れる
なる
法
先
面
話
時
個
9
木
屋
上
しよう
数
我々
おおまか
歳
しか
下
三
ね
誌
府
も
以前
字
ちゃ
とおり
あそこ
・
まさ
点
秒
名前
器
束
あと
外
七
女
用
特に
行
士
へん
？
日
枚
感
観
扱い
手
みんな
そちら
みる
集
から
毎
口
たち
そで
方法
くらい
"""
stop_words_set = set(stop_words.strip().split('\n'))
# stop_words_list = list(stop_words_list)
# print(stop_words_list)
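The slice topic.argsort()[:-n_top_words - 1:-1] used in print_top_words is compact but easy to misread. argsort returns indices in ascending order of weight, and the negative-step slice walks back from the end to pick the n_top_words largest. A toy example follows; the feature names and weights are hypothetical.

```python
import numpy as np

# Hypothetical vocabulary and topic weights for a single topic.
feature_names = ["cheap", "quiet", "cold", "heavy", "noisy"]
topic = np.array([0.1, 0.7, 0.9, 0.05, 0.3])
n_top_words = 3

# Ascending argsort, then read the last n_top_words entries in reverse.
top_idx = topic.argsort()[:-n_top_words - 1:-1]
top_words = [feature_names[i] for i in top_idx]

# An equivalent and arguably more readable form for distinct weights:
# sort by negated weight and take the first n_top_words.
alt_idx = np.argsort(-topic)[:n_top_words]

print(top_words)
```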

2018-07-04

Converting sparse matrices with scipy in Python

Python3

An implementation note on sparse matrices.

import numpy as np
from scipy.sparse import coo_matrix

a = np.arange(30).reshape(10,3)
print(a)
[[ 0  1  2]
 [ 3  4  5]
 [ 6  7  8]
 [ 9 10 11]
 [12 13 14]
 [15 16 17]
 [18 19 20]
 [21 22 23]
 [24 25 26]
 [27 28 29]]

b, c, d = zip(*a)

print(b, c, d)
(0, 3, 6, 9, 12, 15, 18, 21, 24, 27) (1, 4, 7, 10, 13, 16, 19, 22, 25, 28) (2, 5, 8, 11, 14, 17, 20, 23, 26, 29)

mat = coo_matrix((d, (b, c)), shape=(30, 30))

mat.toarray()
array([[ 0,  2,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 0,  0,  0,  0,  5,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  0,  0,  8,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 11,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 14,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        17,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0, 20,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0, 23,  0,  0,  0,  0,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0, 26,  0,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 29,  0],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0]])
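To make the result above easier to read: the COO format is just a list of (row, column, value) triplets, and converting to a dense array writes each value at its coordinate, summing values that share a coordinate. The sketch below rebuilds the same dense array with NumPy alone. The helper coo_to_dense is hypothetical and only mirrors what coo_matrix(...).toarray() does for this input.

```python
import numpy as np


def coo_to_dense(data, rows, cols, shape):
    """Scatter (row, col, value) triplets into a dense array, summing duplicates."""
    dense = np.zeros(shape, dtype=np.int64)
    # np.add.at accumulates repeated coordinates instead of overwriting them.
    np.add.at(dense, (np.asarray(rows), np.asarray(cols)), np.asarray(data))
    return dense


a = np.arange(30).reshape(10, 3)
# Column 0 holds row indices, column 1 column indices, column 2 the values.
rows, cols, values = a[:, 0], a[:, 1], a[:, 2]

dense = coo_to_dense(values, rows, cols, shape=(30, 30))
print(dense[0, 1], dense[3, 4], dense[27, 28])

# Two triplets at the same coordinate are added together.
dup = coo_to_dense([1, 2], [0, 0], [0, 0], shape=(1, 1))
print(dup)
```

Only 10 of the 900 cells are non-zero here, which is the situation the sparse formats are designed for.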

