Demo

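This demo trains a 1-D CNN classifier on learned word embeddings for the DACON author classification data, producing out-of-fold validation predictions and fold-averaged test predictions with 5-fold stratified cross-validation.
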
Library imports and setup

%reload_ext autoreload
%autoreload 2
%matplotlib inline
!pip install -U pip
Collecting pip
  Downloading https://files.pythonhosted.org/packages/cb/28/91f26bd088ce8e22169032100d4260614fc3da435025ff389ef1d396a433/pip-20.2.4-py2.py3-none-any.whl (1.5 MB)
Installing collected packages: pip
  Found existing installation: pip 19.3.1
    Uninstalling pip-19.3.1:
      Successfully uninstalled pip-19.3.1
Successfully installed pip-20.2.4
!pip install -U pandas
Collecting pandas
  Downloading pandas-1.1.4-cp36-cp36m-manylinux1_x86_64.whl (9.5 MB)
Collecting pytz>=2017.2
  Downloading pytz-2020.4-py2.py3-none-any.whl (509 kB)
Requirement already satisfied, skipping upgrade: numpy>=1.15.4 in /usr/local/lib/python3.6/dist-packages (from pandas) (1.18.1)
Requirement already satisfied, skipping upgrade: python-dateutil>=2.7.3 in /usr/local/lib/python3.6/dist-packages (from pandas) (2.8.1)
Requirement already satisfied, skipping upgrade: six>=1.5 in /usr/local/lib/python3.6/dist-packages (from python-dateutil>=2.7.3->pandas) (1.13.0)
Installing collected packages: pytz, pandas
Successfully installed pandas-1.1.4 pytz-2020.4
!pip install -U scikit-learn
Collecting scikit-learn
  Downloading scikit_learn-0.23.2-cp36-cp36m-manylinux1_x86_64.whl (6.8 MB)
Collecting threadpoolctl>=2.0.0
  Downloading threadpoolctl-2.1.0-py3-none-any.whl (12 kB)
Requirement already satisfied, skipping upgrade: numpy>=1.13.3 in /usr/local/lib/python3.6/dist-packages (from scikit-learn) (1.18.1)
Collecting joblib>=0.11
  Downloading joblib-0.17.0-py3-none-any.whl (301 kB)
Requirement already satisfied, skipping upgrade: scipy>=0.19.1 in /usr/local/lib/python3.6/dist-packages (from scikit-learn) (1.4.1)
Installing collected packages: threadpoolctl, joblib, scikit-learn
Successfully installed joblib-0.17.0 scikit-learn-0.23.2 threadpoolctl-2.1.0
!pip install -U tensorflow
Collecting tensorflow
  Downloading tensorflow-2.3.1-cp36-cp36m-manylinux2010_x86_64.whl (320.4 MB)
Requirement already satisfied, skipping upgrade: six>=1.12.0 in /usr/local/lib/python3.6/dist-packages (from tensorflow) (1.13.0)
Requirement already satisfied, skipping upgrade: wrapt>=1.11.1 in /usr/local/lib/python3.6/dist-packages (from tensorflow) (1.11.2)
Collecting tensorflow-estimator<2.4.0,>=2.3.0
  Downloading tensorflow_estimator-2.3.0-py2.py3-none-any.whl (459 kB)
Collecting gast==0.3.3
  Downloading gast-0.3.3-py2.py3-none-any.whl (9.7 kB)
Requirement already satisfied, skipping upgrade: google-pasta>=0.1.8 in /usr/local/lib/python3.6/dist-packages (from tensorflow) (0.1.8)
Requirement already satisfied, skipping upgrade: absl-py>=0.7.0 in /usr/local/lib/python3.6/dist-packages (from tensorflow) (0.9.0)
Collecting keras-preprocessing<1.2,>=1.1.1
  Downloading Keras_Preprocessing-1.1.2-py2.py3-none-any.whl (42 kB)
Requirement already satisfied, skipping upgrade: numpy<1.19.0,>=1.16.0 in /usr/local/lib/python3.6/dist-packages (from tensorflow) (1.18.1)
Requirement already satisfied, skipping upgrade: wheel>=0.26 in /usr/lib/python3/dist-packages (from tensorflow) (0.30.0)
Requirement already satisfied, skipping upgrade: termcolor>=1.1.0 in /usr/local/lib/python3.6/dist-packages (from tensorflow) (1.1.0)
Collecting astunparse==1.6.3
  Downloading astunparse-1.6.3-py2.py3-none-any.whl (12 kB)
Requirement already satisfied, skipping upgrade: h5py<2.11.0,>=2.10.0 in /usr/local/lib/python3.6/dist-packages (from tensorflow) (2.10.0)
Requirement already satisfied, skipping upgrade: protobuf>=3.9.2 in /usr/local/lib/python3.6/dist-packages (from tensorflow) (3.11.2)
Collecting tensorboard<3,>=2.3.0
  Downloading tensorboard-2.3.0-py3-none-any.whl (6.8 MB)
Requirement already satisfied, skipping upgrade: opt-einsum>=2.3.2 in /usr/local/lib/python3.6/dist-packages (from tensorflow) (3.1.0)
Requirement already satisfied, skipping upgrade: grpcio>=1.8.6 in /usr/local/lib/python3.6/dist-packages (from tensorflow) (1.26.0)
Requirement already satisfied, skipping upgrade: setuptools in /usr/local/lib/python3.6/dist-packages (from protobuf>=3.9.2->tensorflow) (44.0.0)
Requirement already satisfied, skipping upgrade: requests<3,>=2.21.0 in /usr/local/lib/python3.6/dist-packages (from tensorboard<3,>=2.3.0->tensorflow) (2.22.0)
Requirement already satisfied, skipping upgrade: markdown>=2.6.8 in /usr/local/lib/python3.6/dist-packages (from tensorboard<3,>=2.3.0->tensorflow) (3.1.1)
Requirement already satisfied, skipping upgrade: google-auth-oauthlib<0.5,>=0.4.1 in /usr/local/lib/python3.6/dist-packages (from tensorboard<3,>=2.3.0->tensorflow) (0.4.1)
Requirement already satisfied, skipping upgrade: google-auth<2,>=1.6.3 in /usr/local/lib/python3.6/dist-packages (from tensorboard<3,>=2.3.0->tensorflow) (1.10.0)
Collecting tensorboard-plugin-wit>=1.6.0
  Downloading tensorboard_plugin_wit-1.7.0-py3-none-any.whl (779 kB)
Requirement already satisfied, skipping upgrade: werkzeug>=0.11.15 in /usr/local/lib/python3.6/dist-packages (from tensorboard<3,>=2.3.0->tensorflow) (0.16.0)
Requirement already satisfied, skipping upgrade: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /usr/local/lib/python3.6/dist-packages (from requests<3,>=2.21.0->tensorboard<3,>=2.3.0->tensorflow) (1.25.7)
Requirement already satisfied, skipping upgrade: chardet<3.1.0,>=3.0.2 in /usr/local/lib/python3.6/dist-packages (from requests<3,>=2.21.0->tensorboard<3,>=2.3.0->tensorflow) (3.0.4)
Requirement already satisfied, skipping upgrade: idna<2.9,>=2.5 in /usr/lib/python3/dist-packages (from requests<3,>=2.21.0->tensorboard<3,>=2.3.0->tensorflow) (2.6)
Requirement already satisfied, skipping upgrade: certifi>=2017.4.17 in /usr/local/lib/python3.6/dist-packages (from requests<3,>=2.21.0->tensorboard<3,>=2.3.0->tensorflow) (2019.11.28)
Requirement already satisfied, skipping upgrade: requests-oauthlib>=0.7.0 in /usr/local/lib/python3.6/dist-packages (from google-auth-oauthlib<0.5,>=0.4.1->tensorboard<3,>=2.3.0->tensorflow) (1.3.0)
Requirement already satisfied, skipping upgrade: pyasn1-modules>=0.2.1 in /usr/local/lib/python3.6/dist-packages (from google-auth<2,>=1.6.3->tensorboard<3,>=2.3.0->tensorflow) (0.2.8)
Requirement already satisfied, skipping upgrade: cachetools<5.0,>=2.0.0 in /usr/local/lib/python3.6/dist-packages (from google-auth<2,>=1.6.3->tensorboard<3,>=2.3.0->tensorflow) (4.0.0)
Requirement already satisfied, skipping upgrade: rsa<4.1,>=3.1.4 in /usr/local/lib/python3.6/dist-packages (from google-auth<2,>=1.6.3->tensorboard<3,>=2.3.0->tensorflow) (4.0)
Requirement already satisfied, skipping upgrade: oauthlib>=3.0.0 in /usr/local/lib/python3.6/dist-packages (from requests-oauthlib>=0.7.0->google-auth-oauthlib<0.5,>=0.4.1->tensorboard<3,>=2.3.0->tensorflow) (3.1.0)
Requirement already satisfied, skipping upgrade: pyasn1<0.5.0,>=0.4.6 in /usr/local/lib/python3.6/dist-packages (from pyasn1-modules>=0.2.1->google-auth<2,>=1.6.3->tensorboard<3,>=2.3.0->tensorflow) (0.4.8)
Installing collected packages: tensorflow-estimator, gast, keras-preprocessing, astunparse, tensorboard-plugin-wit, tensorboard, tensorflow
  Attempting uninstall: tensorflow-estimator
    Found existing installation: tensorflow-estimator 2.1.0
    Uninstalling tensorflow-estimator-2.1.0:
      Successfully uninstalled tensorflow-estimator-2.1.0
  Attempting uninstall: gast
    Found existing installation: gast 0.2.2
    Uninstalling gast-0.2.2:
      Successfully uninstalled gast-0.2.2
  Attempting uninstall: keras-preprocessing
    Found existing installation: Keras-Preprocessing 1.1.0
    Uninstalling Keras-Preprocessing-1.1.0:
      Successfully uninstalled Keras-Preprocessing-1.1.0
  Attempting uninstall: tensorboard
    Found existing installation: tensorboard 2.1.0
    Uninstalling tensorboard-2.1.0:
      Successfully uninstalled tensorboard-2.1.0
ERROR: After October 2020 you may experience errors when installing or updating packages. This is because pip will change the way that it resolves dependency conflicts.

We recommend you use --use-feature=2020-resolver to test your packages with the new resolver before it becomes the default.

tensorflow-gpu 2.1.0 requires gast==0.2.2, but you'll have gast 0.3.3 which is incompatible.
tensorflow-gpu 2.1.0 requires tensorboard<2.2.0,>=2.1.0, but you'll have tensorboard 2.3.0 which is incompatible.
tensorflow-gpu 2.1.0 requires tensorflow-estimator<2.2.0,>=2.1.0rc0, but you'll have tensorflow-estimator 2.3.0 which is incompatible.
Successfully installed astunparse-1.6.3 gast-0.3.3 keras-preprocessing-1.1.2 tensorboard-2.3.0 tensorboard-plugin-wit-1.7.0 tensorflow-2.3.1 tensorflow-estimator-2.3.0
from matplotlib import rcParams, pyplot as plt
import numpy as np
import os
import pandas as pd
from pathlib import Path
import re
from sklearn.metrics import accuracy_score, log_loss
from sklearn.model_selection import StratifiedKFold
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Embedding, LSTM, GlobalMaxPooling1D, Conv1D, Dropout, Bidirectional
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau
from tensorflow.keras.utils import plot_model, to_categorical
from tensorflow.keras.optimizers import Adam
import warnings 
warnings.filterwarnings(action='ignore')
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    # Restrict TensorFlow to only use the first GPU
    try:
        tf.config.experimental.set_visible_devices(gpus[0], 'GPU')
        logical_gpus = tf.config.experimental.list_logical_devices('GPU')
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPU")
    except RuntimeError as e:
        # Visible devices must be set before GPUs have been initialized
        print(e)
else:
    print('No GPU detected')
1 Physical GPUs, 1 Logical GPU
rcParams['figure.figsize'] = (16, 8)
plt.style.use('fivethirtyeight')
pd.set_option('max_columns', 100)
pd.set_option("display.precision", 4)
warnings.simplefilter('ignore')

Load the training data

data_dir = Path('../data/dacon-author-classification')
feature_dir = Path('../build/feature')
val_dir = Path('../build/val')
tst_dir = Path('../build/tst')
sub_dir = Path('../build/sub')
dirs = [feature_dir, val_dir, tst_dir, sub_dir]
for d in dirs:
    os.makedirs(d, exist_ok=True)

trn_file = data_dir / 'train.csv'
tst_file = data_dir / 'test_x.csv'
sample_file = data_dir / 'sample_submission.csv'

target_col = 'author'
n_fold = 5
n_class = 5
seed = 42
algo_name = 'cnn'
feature_name = 'emb'
model_name = f'{algo_name}_{feature_name}'

feature_file = feature_dir / f'{feature_name}.csv'
p_val_file = val_dir / f'{model_name}.val.csv'
p_tst_file = tst_dir / f'{model_name}.tst.csv'
sub_file = sub_dir / f'{model_name}.csv'
train = pd.read_csv(trn_file, index_col=0)
train.head()
                                                    text  author
index
0      He was almost choking. There was so much, so m...       3
1                 “Your sister asked for it, I suppose?”       2
2       She was engaged one day as she walked, in per...       1
3      The captain was in the porch, keeping himself ...       4
4      “Have mercy, gentlemen!” odin flung up his han...       3
test = pd.read_csv(tst_file, index_col=0)
test.head()
                                                    text
index
0      “Not at all. I think she is one of the most ch...
1      "No," replied he, with sudden consciousness, "...
2      As the lady had stated her intention of scream...
3      “And then suddenly in the silence I heard a so...
4      His conviction remained unchanged. So far as I...

Preprocessing

def alpha_num(text):
    """Keep only alphanumeric characters and spaces."""
    return re.sub(r'[^A-Za-z0-9 ]', '', text)


def remove_stopwords(text):
    """Drop common English stopwords (list defined below)."""
    final_text = []
    for i in text.split():
        if i.strip().lower() not in stopwords:
            final_text.append(i.strip())
    return " ".join(final_text)


stopwords = [ "a", "about", "above", "after", "again", "against", "all", "am", "an", "and", "any", "are", "as", 
             "at", "be", "because", "been", "before", "being", "below", "between", "both", "but", "by", "could", 
             "did", "do", "does", "doing", "down", "during", "each", "few", "for", "from", "further", "had", "has", 
             "have", "having", "he", "he'd", "he'll", "he's", "her", "here", "here's", "hers", "herself", "him", "himself", 
             "his", "how", "how's", "i", "i'd", "i'll", "i'm", "i've", "if", "in", "into", "is", "it", "it's", "its", "itself", 
             "let's", "me", "more", "most", "my", "myself", "nor", "of", "on", "once", "only", "or", "other", "ought", "our", "ours", 
             "ourselves", "out", "over", "own", "same", "she", "she'd", "she'll", "she's", "should", "so", "some", "such", "than", "that", 
             "that's", "the", "their", "theirs", "them", "themselves", "then", "there", "there's", "these", "they", "they'd", "they'll", 
             "they're", "they've", "this", "those", "through", "to", "too", "under", "until", "up", "very", "was", "we", "we'd", "we'll", 
             "we're", "we've", "were", "what", "what's", "when", "when's", "where", "where's", "which", "while", "who", "who's", "whom", 
             "why", "why's", "with", "would", "you", "you'd", "you'll", "you're", "you've", "your", "yours", "yourself", "yourselves" ]
train['text'] = train['text'].str.lower().apply(alpha_num).apply(remove_stopwords)
test['text'] = test['text'].str.lower().apply(alpha_num).apply(remove_stopwords)
X_train = train['text'].values
X_test = test['text'].values
y = train['author'].values
print(X_train.shape, X_test.shape, y.shape)
(54879,) (19617,) (54879,)
X_train[:3]
array(['almost choking much much wanted say strange exclamations came lips pole gazed fixedly bundle notes hand looked odin evident perplexity',
       'sister asked suppose',
       'engaged one day walked perusing janes last letter dwelling passages proved jane not written spirits instead surprised mr odin saw looking odin meeting putting away letter immediately forcing smile said'],
      dtype=object)
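
As a quick sanity check, the two helpers can be applied to a raw sentence directly. The sample below is a hypothetical sentence echoing the first training row; the pipeline (lowercase, strip non-alphanumerics, drop stopwords) is exactly the one applied to the DataFrame columns above.

sample = "He was almost choking. There was so much, so much he wanted to say!"
# lowercase -> alpha_num -> remove_stopwords
print(remove_stopwords(alpha_num(sample.lower())))
almost choking much much wanted say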

Training

vocab_size = 20000
embedding_dim = 64
max_length = 500
padding_type='post'
tokenizer = Tokenizer(num_words = vocab_size)
tokenizer.fit_on_texts(X_train)
word_index = tokenizer.word_index
train_sequences = tokenizer.texts_to_sequences(X_train)
test_sequences = tokenizer.texts_to_sequences(X_test)
trn = pad_sequences(train_sequences, padding=padding_type, maxlen=max_length)
tst = pad_sequences(test_sequences, padding=padding_type, maxlen=max_length)
print(trn.shape, tst.shape)
(54879, 500) (19617, 500)
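
A minimal inspection of the fitted tokenizer, assuming the cells above have run: each sentence becomes a list of integer word indices, and 'post' padding appends zeros up to max_length=500.

# Integer indices for the second training sentence ("sister asked suppose").
print(train_sequences[1])
# The padded row keeps the indices at the front, followed by zeros ('post' padding).
print(trn[1, :10])
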
cv = StratifiedKFold(n_splits=n_fold, shuffle=True, random_state=seed)
def get_model():
    model = Sequential([
        # Learn 64-dimensional embeddings for the 20,000 most frequent tokens.
        Embedding(vocab_size, embedding_dim, input_length=max_length),
        Dropout(.5),
        # Two strided 1-D convolutions extract local n-gram features.
        Conv1D(128, 7, padding="valid", activation="relu", strides=3),
        Conv1D(128, 7, padding="valid", activation="relu", strides=3),
        # Keep each filter's strongest activation across the sequence.
        GlobalMaxPooling1D(),
        Dense(128, activation='relu'),
        Dropout(.5),
        # Output a probability distribution over the 5 author classes.
        Dense(n_class, activation='softmax')
    ])

    # compile model
    model.compile(loss='categorical_crossentropy',
                  optimizer=Adam(learning_rate=.005))
    return model
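
Note that compile() tracks only the loss; the CV accuracy and log loss are computed afterwards from the out-of-fold predictions.
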
p_val = np.zeros((trn.shape[0], n_class))  # out-of-fold validation predictions
p_tst = np.zeros((tst.shape[0], n_class))  # test predictions, averaged over folds
for i, (i_trn, i_val) in enumerate(cv.split(trn, y), 1):
    print(f'training model for CV #{i}')
    # Stop once val_loss stops improving and restore the best weights.
    es = EarlyStopping(monitor='val_loss', min_delta=0.001, patience=3,
                       verbose=1, mode='min', baseline=None, restore_best_weights=True)

    clf = get_model()
    clf.fit(trn[i_trn],
            to_categorical(y[i_trn]),
            validation_data=(trn[i_val], to_categorical(y[i_val])),
            epochs=10,
            batch_size=512,
            callbacks=[es])
    p_val[i_val, :] = clf.predict(trn[i_val])
    p_tst += clf.predict(tst) / n_fold
training model for CV #1
Epoch 1/10
86/86 [==============================] - 5s 62ms/step - loss: 1.3471 - val_loss: 0.9571
Epoch 2/10
86/86 [==============================] - 5s 61ms/step - loss: 0.8594 - val_loss: 0.8221
Epoch 3/10
86/86 [==============================] - 5s 60ms/step - loss: 0.6572 - val_loss: 0.8044
Epoch 4/10
86/86 [==============================] - 5s 58ms/step - loss: 0.5348 - val_loss: 0.8316
Epoch 5/10
86/86 [==============================] - 5s 56ms/step - loss: 0.4513 - val_loss: 0.8302
Epoch 6/10
85/86 [============================>.] - ETA: 0s - loss: 0.3966Restoring model weights from the end of the best epoch.
86/86 [==============================] - 5s 57ms/step - loss: 0.3969 - val_loss: 0.9055
Epoch 00006: early stopping
training model for CV #2
Epoch 1/10
86/86 [==============================] - 5s 59ms/step - loss: 1.3603 - val_loss: 1.0853
Epoch 2/10
86/86 [==============================] - 5s 57ms/step - loss: 0.9426 - val_loss: 0.8629
Epoch 3/10
86/86 [==============================] - 5s 57ms/step - loss: 0.6868 - val_loss: 0.8024
Epoch 4/10
86/86 [==============================] - 5s 57ms/step - loss: 0.5505 - val_loss: 0.8167
Epoch 5/10
86/86 [==============================] - 5s 57ms/step - loss: 0.4608 - val_loss: 0.8563
Epoch 6/10
85/86 [============================>.] - ETA: 0s - loss: 0.4063Restoring model weights from the end of the best epoch.
86/86 [==============================] - 5s 56ms/step - loss: 0.4071 - val_loss: 0.8726
Epoch 00006: early stopping
training model for CV #3
Epoch 1/10
86/86 [==============================] - 5s 58ms/step - loss: 1.3691 - val_loss: 1.0746
Epoch 2/10
86/86 [==============================] - 5s 56ms/step - loss: 0.9511 - val_loss: 0.8433
Epoch 3/10
86/86 [==============================] - 5s 56ms/step - loss: 0.6972 - val_loss: 0.8001
Epoch 4/10
86/86 [==============================] - 5s 57ms/step - loss: 0.5603 - val_loss: 0.8008
Epoch 5/10
86/86 [==============================] - 5s 56ms/step - loss: 0.4737 - val_loss: 0.8747
Epoch 6/10
85/86 [============================>.] - ETA: 0s - loss: 0.4151Restoring model weights from the end of the best epoch.
86/86 [==============================] - 5s 56ms/step - loss: 0.4151 - val_loss: 0.8629
Epoch 00006: early stopping
training model for CV #4
Epoch 1/10
86/86 [==============================] - 5s 58ms/step - loss: 1.3022 - val_loss: 0.9929
Epoch 2/10
86/86 [==============================] - 5s 57ms/step - loss: 0.8804 - val_loss: 0.8361
Epoch 3/10
86/86 [==============================] - 5s 57ms/step - loss: 0.6673 - val_loss: 0.7927
Epoch 4/10
86/86 [==============================] - 5s 55ms/step - loss: 0.5445 - val_loss: 0.8166
Epoch 5/10
86/86 [==============================] - 5s 56ms/step - loss: 0.4585 - val_loss: 0.8562
Epoch 6/10
85/86 [============================>.] - ETA: 0s - loss: 0.4134Restoring model weights from the end of the best epoch.
86/86 [==============================] - 5s 57ms/step - loss: 0.4133 - val_loss: 0.9194
Epoch 00006: early stopping
training model for CV #5
Epoch 1/10
86/86 [==============================] - 5s 61ms/step - loss: 1.3985 - val_loss: 1.1223
Epoch 2/10
86/86 [==============================] - 5s 57ms/step - loss: 0.9777 - val_loss: 0.8736
Epoch 3/10
86/86 [==============================] - 5s 57ms/step - loss: 0.7304 - val_loss: 0.8016
Epoch 4/10
86/86 [==============================] - 5s 58ms/step - loss: 0.5764 - val_loss: 0.8007
Epoch 5/10
13/86 [===>..........................] - ETA: 3s - loss: 0.4435
print(f'Accuracy (CV): {accuracy_score(y, np.argmax(p_val, axis=1)) * 100:8.4f}%')
print(f'Log Loss (CV): {log_loss(pd.get_dummies(y), p_val):8.4f}')
np.savetxt(p_val_file, p_val, fmt='%.6f', delimiter=',')
np.savetxt(p_tst_file, p_tst, fmt='%.6f', delimiter=',')
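
The saved prediction files can be reloaded later, for example to ensemble this model with others; a minimal sketch, assuming the files written above:

# Reload the out-of-fold and test prediction matrices for later ensembling.
p_val_loaded = np.loadtxt(p_val_file, delimiter=',')
p_tst_loaded = np.loadtxt(p_tst_file, delimiter=',')
print(p_val_loaded.shape, p_tst_loaded.shape)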

Visualization

# model summary (summary() prints directly, so no print() wrapper is needed)
clf.summary()
plot_model(clf)
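
plot_model draws the network graph to an image; it requires the pydot and graphviz packages to be installed.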

Create the submission file

sub = pd.read_csv(sample_file, index_col=0)
print(sub.shape)
sub.head()
sub[sub.columns] = p_tst
sub.head()
sub.to_csv(sub_file)
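
As a final sanity check (a sketch, not part of the original pipeline), each submission row should be a probability distribution over the five author classes, since p_tst averages softmax outputs across folds:

# Row sums should all be close to 1.0.
print(sub.sum(axis=1).describe())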