데모

라이브러리 import 및 설정

%reload_ext autoreload
%autoreload 2
%matplotlib inline
!pip install -U pip
Requirement already up-to-date: pip in /usr/local/lib/python3.6/dist-packages (20.2.4)
!pip install pandas
Requirement already satisfied: pandas in /usr/local/lib/python3.6/dist-packages (1.1.4)
Requirement already satisfied: python-dateutil>=2.7.3 in /usr/local/lib/python3.6/dist-packages (from pandas) (2.8.1)
Requirement already satisfied: pytz>=2017.2 in /usr/local/lib/python3.6/dist-packages (from pandas) (2020.4)
Requirement already satisfied: numpy>=1.15.4 in /usr/local/lib/python3.6/dist-packages (from pandas) (1.18.1)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.6/dist-packages (from python-dateutil>=2.7.3->pandas) (1.13.0)
!pip install -U scikit-learn
Requirement already up-to-date: scikit-learn in /usr/local/lib/python3.6/dist-packages (0.23.2)
Requirement already satisfied, skipping upgrade: joblib>=0.11 in /usr/local/lib/python3.6/dist-packages (from scikit-learn) (0.17.0)
Requirement already satisfied, skipping upgrade: threadpoolctl>=2.0.0 in /usr/local/lib/python3.6/dist-packages (from scikit-learn) (2.1.0)
Requirement already satisfied, skipping upgrade: scipy>=0.19.1 in /usr/local/lib/python3.6/dist-packages (from scikit-learn) (1.4.1)
Requirement already satisfied, skipping upgrade: numpy>=1.13.3 in /usr/local/lib/python3.6/dist-packages (from scikit-learn) (1.18.1)
!pip install -U tensorflow
Requirement already up-to-date: tensorflow in /usr/local/lib/python3.6/dist-packages (2.3.1)
Requirement already satisfied, skipping upgrade: six>=1.12.0 in /usr/local/lib/python3.6/dist-packages (from tensorflow) (1.13.0)
Requirement already satisfied, skipping upgrade: wheel>=0.26 in /usr/lib/python3/dist-packages (from tensorflow) (0.30.0)
Requirement already satisfied, skipping upgrade: protobuf>=3.9.2 in /usr/local/lib/python3.6/dist-packages (from tensorflow) (3.11.2)
Requirement already satisfied, skipping upgrade: google-pasta>=0.1.8 in /usr/local/lib/python3.6/dist-packages (from tensorflow) (0.1.8)
Requirement already satisfied, skipping upgrade: absl-py>=0.7.0 in /usr/local/lib/python3.6/dist-packages (from tensorflow) (0.9.0)
Requirement already satisfied, skipping upgrade: opt-einsum>=2.3.2 in /usr/local/lib/python3.6/dist-packages (from tensorflow) (3.1.0)
Requirement already satisfied, skipping upgrade: grpcio>=1.8.6 in /usr/local/lib/python3.6/dist-packages (from tensorflow) (1.26.0)
Requirement already satisfied, skipping upgrade: tensorflow-estimator<2.4.0,>=2.3.0 in /usr/local/lib/python3.6/dist-packages (from tensorflow) (2.3.0)
Requirement already satisfied, skipping upgrade: tensorboard<3,>=2.3.0 in /usr/local/lib/python3.6/dist-packages (from tensorflow) (2.3.0)
Requirement already satisfied, skipping upgrade: numpy<1.19.0,>=1.16.0 in /usr/local/lib/python3.6/dist-packages (from tensorflow) (1.18.1)
Requirement already satisfied, skipping upgrade: keras-preprocessing<1.2,>=1.1.1 in /usr/local/lib/python3.6/dist-packages (from tensorflow) (1.1.2)
Requirement already satisfied, skipping upgrade: h5py<2.11.0,>=2.10.0 in /usr/local/lib/python3.6/dist-packages (from tensorflow) (2.10.0)
Requirement already satisfied, skipping upgrade: astunparse==1.6.3 in /usr/local/lib/python3.6/dist-packages (from tensorflow) (1.6.3)
Requirement already satisfied, skipping upgrade: gast==0.3.3 in /usr/local/lib/python3.6/dist-packages (from tensorflow) (0.3.3)
Requirement already satisfied, skipping upgrade: termcolor>=1.1.0 in /usr/local/lib/python3.6/dist-packages (from tensorflow) (1.1.0)
Requirement already satisfied, skipping upgrade: wrapt>=1.11.1 in /usr/local/lib/python3.6/dist-packages (from tensorflow) (1.11.2)
Requirement already satisfied, skipping upgrade: setuptools in /usr/local/lib/python3.6/dist-packages (from protobuf>=3.9.2->tensorflow) (44.0.0)
Requirement already satisfied, skipping upgrade: tensorboard-plugin-wit>=1.6.0 in /usr/local/lib/python3.6/dist-packages (from tensorboard<3,>=2.3.0->tensorflow) (1.7.0)
Requirement already satisfied, skipping upgrade: google-auth<2,>=1.6.3 in /usr/local/lib/python3.6/dist-packages (from tensorboard<3,>=2.3.0->tensorflow) (1.10.0)
Requirement already satisfied, skipping upgrade: markdown>=2.6.8 in /usr/local/lib/python3.6/dist-packages (from tensorboard<3,>=2.3.0->tensorflow) (3.1.1)
Requirement already satisfied, skipping upgrade: google-auth-oauthlib<0.5,>=0.4.1 in /usr/local/lib/python3.6/dist-packages (from tensorboard<3,>=2.3.0->tensorflow) (0.4.1)
Requirement already satisfied, skipping upgrade: requests<3,>=2.21.0 in /usr/local/lib/python3.6/dist-packages (from tensorboard<3,>=2.3.0->tensorflow) (2.22.0)
Requirement already satisfied, skipping upgrade: werkzeug>=0.11.15 in /usr/local/lib/python3.6/dist-packages (from tensorboard<3,>=2.3.0->tensorflow) (0.16.0)
Requirement already satisfied, skipping upgrade: cachetools<5.0,>=2.0.0 in /usr/local/lib/python3.6/dist-packages (from google-auth<2,>=1.6.3->tensorboard<3,>=2.3.0->tensorflow) (4.0.0)
Requirement already satisfied, skipping upgrade: pyasn1-modules>=0.2.1 in /usr/local/lib/python3.6/dist-packages (from google-auth<2,>=1.6.3->tensorboard<3,>=2.3.0->tensorflow) (0.2.8)
Requirement already satisfied, skipping upgrade: rsa<4.1,>=3.1.4 in /usr/local/lib/python3.6/dist-packages (from google-auth<2,>=1.6.3->tensorboard<3,>=2.3.0->tensorflow) (4.0)
Requirement already satisfied, skipping upgrade: requests-oauthlib>=0.7.0 in /usr/local/lib/python3.6/dist-packages (from google-auth-oauthlib<0.5,>=0.4.1->tensorboard<3,>=2.3.0->tensorflow) (1.3.0)
Requirement already satisfied, skipping upgrade: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /usr/local/lib/python3.6/dist-packages (from requests<3,>=2.21.0->tensorboard<3,>=2.3.0->tensorflow) (1.25.7)
Requirement already satisfied, skipping upgrade: idna<2.9,>=2.5 in /usr/lib/python3/dist-packages (from requests<3,>=2.21.0->tensorboard<3,>=2.3.0->tensorflow) (2.6)
Requirement already satisfied, skipping upgrade: certifi>=2017.4.17 in /usr/local/lib/python3.6/dist-packages (from requests<3,>=2.21.0->tensorboard<3,>=2.3.0->tensorflow) (2019.11.28)
Requirement already satisfied, skipping upgrade: chardet<3.1.0,>=3.0.2 in /usr/local/lib/python3.6/dist-packages (from requests<3,>=2.21.0->tensorboard<3,>=2.3.0->tensorflow) (3.0.4)
Requirement already satisfied, skipping upgrade: pyasn1<0.5.0,>=0.4.6 in /usr/local/lib/python3.6/dist-packages (from pyasn1-modules>=0.2.1->google-auth<2,>=1.6.3->tensorboard<3,>=2.3.0->tensorflow) (0.4.8)
Requirement already satisfied, skipping upgrade: oauthlib>=3.0.0 in /usr/local/lib/python3.6/dist-packages (from requests-oauthlib>=0.7.0->google-auth-oauthlib<0.5,>=0.4.1->tensorboard<3,>=2.3.0->tensorflow) (3.1.0)
!pip install -U tensorflow_hub
Requirement already up-to-date: tensorflow_hub in /usr/local/lib/python3.6/dist-packages (0.10.0)
Requirement already satisfied, skipping upgrade: numpy>=1.12.0 in /usr/local/lib/python3.6/dist-packages (from tensorflow_hub) (1.18.1)
Requirement already satisfied, skipping upgrade: protobuf>=3.8.0 in /usr/local/lib/python3.6/dist-packages (from tensorflow_hub) (3.11.2)
Requirement already satisfied, skipping upgrade: six>=1.9 in /usr/local/lib/python3.6/dist-packages (from protobuf>=3.8.0->tensorflow_hub) (1.13.0)
Requirement already satisfied, skipping upgrade: setuptools in /usr/local/lib/python3.6/dist-packages (from protobuf>=3.8.0->tensorflow_hub) (44.0.0)
!pip install -U sentencepiece
Requirement already up-to-date: sentencepiece in /usr/local/lib/python3.6/dist-packages (0.1.94)
import gc
from matplotlib import rcParams, pyplot as plt
import numpy as np
import os
import pandas as pd
from pathlib import Path
import re
from sklearn.metrics import accuracy_score, log_loss
from sklearn.model_selection import StratifiedKFold
import sys
sys.path.append('../src')
import tensorflow as tf
from tensorflow.keras import Sequential, Model, Input
from tensorflow.keras.backend import clear_session
from tensorflow.keras.initializers import Constant
from tensorflow.keras.layers import Dense, Embedding, Bidirectional, LSTM, Dropout
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.utils import plot_model, to_categorical
from tensorflow.keras.optimizers import Adam
import tensorflow_hub as hub
import tokenization
import warnings 
warnings.filterwarnings(action='ignore')
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    # Restrict TensorFlow to only use the first GPU
    try:
        tf.config.experimental.set_visible_devices(gpus[0], 'GPU')
        logical_gpus = tf.config.experimental.list_logical_devices('GPU')
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPU")
    except RuntimeError as e:
        # Visible devices must be set before GPUs have been initialized
        print(e)
else:
    print('No GPU detected')
1 Physical GPUs, 1 Logical GPU
rcParams['figure.figsize'] = (16, 8)
plt.style.use('fivethirtyeight')
pd.set_option('max_columns', 100)
pd.set_option("display.precision", 4)
warnings.simplefilter('ignore')

BERT Tokenizer 로드

http://nlp.stanford.edu/data/glove.6B.zip 를 다운받아 data_dir에 압축을 푼다.

data_dir = Path('../data/dacon-author-classification')
feature_dir = Path('../build/feature')
val_dir = Path('../build/val')
tst_dir = Path('../build/tst')
sub_dir = Path('../build/sub')
dirs = [feature_dir, val_dir, tst_dir, sub_dir]
for d in dirs:
    os.makedirs(d, exist_ok=True)

trn_file = data_dir / 'train.csv'
tst_file = data_dir / 'test_x.csv'
sample_file = data_dir / 'sample_submission.csv'
module_url = "https://tfhub.dev/tensorflow/bert_en_uncased_L-24_H-1024_A-16/1"

target_col = 'author'
n_fold = 5
n_class = 5
seed = 42
algo_name = 'bert'
max_len = 100
feature_name = f'n{max_len}'
model_name = f'{algo_name}_{feature_name}'

feature_file = feature_dir / f'{feature_name}.csv'
p_val_file = val_dir / f'{model_name}.val.csv'
p_tst_file = tst_dir / f'{model_name}.tst.csv'
sub_file = sub_dir / f'{model_name}.csv'
bert_layer = hub.KerasLayer(module_url, trainable=True)
vocab_file = bert_layer.resolved_object.vocab_file.asset_path.numpy()
do_lower_case = bert_layer.resolved_object.do_lower_case.numpy()
tokenizer = tokenization.FullTokenizer(vocab_file, do_lower_case)

학습데이터 로드

train = pd.read_csv(trn_file, index_col=0)
train.head()
text author
index
0 He was almost choking. There was so much, so m... 3
1 “Your sister asked for it, I suppose?” 2
2 She was engaged one day as she walked, in per... 1
3 The captain was in the porch, keeping himself ... 4
4 “Have mercy, gentlemen!” odin flung up his han... 3
test = pd.read_csv(tst_file, index_col=0)
test.head()
text
index
0 “Not at all. I think she is one of the most ch...
1 "No," replied he, with sudden consciousness, "...
2 As the lady had stated her intention of scream...
3 “And then suddenly in the silence I heard a so...
4 His conviction remained unchanged. So far as I...

Preprocessing

# https://www.kaggle.com/xhlulu/disaster-nlp-keras-bert-using-tfhub
def bert_encode(texts, tokenizer, max_len=max_len):
    all_tokens = []
    all_masks = []
    all_segments = []
    
    for text in texts:
        text = tokenizer.tokenize(text)
            
        text = text[:max_len-2]
        input_sequence = ["[CLS]"] + text + ["[SEP]"]
        pad_len = max_len - len(input_sequence)
        
        tokens = tokenizer.convert_tokens_to_ids(input_sequence)
        tokens += [0] * pad_len
        pad_masks = [1] * len(input_sequence) + [0] * pad_len
        segment_ids = [0] * max_len
        
        all_tokens.append(tokens)
        all_masks.append(pad_masks)
        all_segments.append(segment_ids)
    
    return np.array(all_tokens), np.array(all_masks), np.array(all_segments)
trn = bert_encode(train.text.values, tokenizer, max_len=max_len)
tst = bert_encode(test.text.values, tokenizer, max_len=max_len)
y = train['author'].values
print(trn[0].shape, tst[0].shape, y.shape)
(54879, 100) (19617, 100) (54879,)

Training

def get_model(bert_layer, max_len=max_len):
    input_word_ids = Input(shape=(max_len,), dtype=tf.int32, name="input_word_ids")
    input_mask = Input(shape=(max_len,), dtype=tf.int32, name="input_mask")
    segment_ids = Input(shape=(max_len,), dtype=tf.int32, name="segment_ids")

    _, sequence_output = bert_layer([input_word_ids, input_mask, segment_ids])
    clf_output = sequence_output[:, 0, :]
    out = Dense(n_class, activation='sigmoid')(clf_output)
    
    model = Model(inputs=[input_word_ids, input_mask, segment_ids], outputs=out)
    model.compile(Adam(lr=1e-5), loss='binary_crossentropy', metrics=['accuracy'])
    
    return model
cv = StratifiedKFold(n_splits=n_fold, shuffle=True, random_state=seed)
p_val = np.zeros((trn[0].shape[0], n_class))
p_tst = np.zeros((tst[0].shape[0], n_class))
for i, (i_trn, i_val) in enumerate(cv.split(trn[0], y), 1):
    print(f'training model for CV #{i}')
    es = EarlyStopping(monitor='val_loss', min_delta=0.001, patience=3,
                       verbose=1, mode='min', baseline=None, restore_best_weights=True)
    
    clf = get_model(bert_layer, max_len=max_len)
    if i == 1:
        print(clf.summary())
        
    clf.fit([x[i_trn] for x in trn], 
            to_categorical(y[i_trn]),
            validation_data=([x[i_val] for x in trn], to_categorical(y[i_val])),
            epochs=2,
            batch_size=16)
    p_val[i_val, :] = clf.predict([x[i_val] for x in trn])
    p_tst += clf.predict(tst) / n_fold
    
    del clf
    clear_session()
    gc.collect()
training model for CV #1
Model: "functional_1"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
input_word_ids (InputLayer)     [(None, 100)]        0                                            
__________________________________________________________________________________________________
input_mask (InputLayer)         [(None, 100)]        0                                            
__________________________________________________________________________________________________
segment_ids (InputLayer)        [(None, 100)]        0                                            
__________________________________________________________________________________________________
keras_layer (KerasLayer)        [(None, 1024), (None 335141889   input_word_ids[0][0]             
                                                                 input_mask[0][0]                 
                                                                 segment_ids[0][0]                
__________________________________________________________________________________________________
tf_op_layer_strided_slice (Tens [(None, 1024)]       0           keras_layer[0][1]                
__________________________________________________________________________________________________
dense (Dense)                   (None, 5)            5125        tf_op_layer_strided_slice[0][0]  
==================================================================================================
Total params: 335,147,014
Trainable params: 335,147,013
Non-trainable params: 1
__________________________________________________________________________________________________
None
Epoch 1/2
   2/2744 [..............................] - ETA: 29:35 - loss: 0.6328 - accuracy: 0.1562WARNING:tensorflow:Callbacks method `on_train_batch_end` is slow compared to the batch time (batch time: 0.3230s vs `on_train_batch_end` time: 0.4911s). Check your callbacks.
WARNING:tensorflow:Callbacks method `on_train_batch_end` is slow compared to the batch time (batch time: 0.3230s vs `on_train_batch_end` time: 0.4911s). Check your callbacks.
2744/2744 [==============================] - ETA: 0s - loss: 0.2035 - accuracy: 0.7820WARNING:tensorflow:Callbacks method `on_test_batch_end` is slow compared to the batch time (batch time: 0.0199s vs `on_test_batch_end` time: 0.2588s). Check your callbacks.
WARNING:tensorflow:Callbacks method `on_test_batch_end` is slow compared to the batch time (batch time: 0.0199s vs `on_test_batch_end` time: 0.2588s). Check your callbacks.
2744/2744 [==============================] - 2489s 907ms/step - loss: 0.2035 - accuracy: 0.7820 - val_loss: 0.1440 - val_accuracy: 0.8507
Epoch 2/2
2744/2744 [==============================] - 2478s 903ms/step - loss: 0.0823 - accuracy: 0.9204 - val_loss: 0.1395 - val_accuracy: 0.8622
training model for CV #2
Epoch 1/2
   2/2744 [..............................] - ETA: 29:51 - loss: 0.8290 - accuracy: 0.0625WARNING:tensorflow:Callbacks method `on_train_batch_end` is slow compared to the batch time (batch time: 0.3269s vs `on_train_batch_end` time: 0.5238s). Check your callbacks.
WARNING:tensorflow:Callbacks method `on_train_batch_end` is slow compared to the batch time (batch time: 0.3269s vs `on_train_batch_end` time: 0.5238s). Check your callbacks.
2744/2744 [==============================] - ETA: 0s - loss: 0.0785 - accuracy: 0.9289WARNING:tensorflow:Callbacks method `on_test_batch_end` is slow compared to the batch time (batch time: 0.0240s vs `on_test_batch_end` time: 0.2842s). Check your callbacks.
WARNING:tensorflow:Callbacks method `on_test_batch_end` is slow compared to the batch time (batch time: 0.0240s vs `on_test_batch_end` time: 0.2842s). Check your callbacks.
2744/2744 [==============================] - 2496s 910ms/step - loss: 0.0785 - accuracy: 0.9289 - val_loss: 0.0422 - val_accuracy: 0.9635
Epoch 2/2
2744/2744 [==============================] - 2688s 979ms/step - loss: 0.0263 - accuracy: 0.9771 - val_loss: 0.0533 - val_accuracy: 0.9503
training model for CV #3
Epoch 1/2
   2/2744 [..............................] - ETA: 29:07 - loss: 0.5564 - accuracy: 0.5625WARNING:tensorflow:Callbacks method `on_train_batch_end` is slow compared to the batch time (batch time: 0.3179s vs `on_train_batch_end` time: 0.4909s). Check your callbacks.
WARNING:tensorflow:Callbacks method `on_train_batch_end` is slow compared to the batch time (batch time: 0.3179s vs `on_train_batch_end` time: 0.4909s). Check your callbacks.
2744/2744 [==============================] - ETA: 0s - loss: 0.0290 - accuracy: 0.9767WARNING:tensorflow:Callbacks method `on_test_batch_end` is slow compared to the batch time (batch time: 0.0223s vs `on_test_batch_end` time: 0.2541s). Check your callbacks.
WARNING:tensorflow:Callbacks method `on_test_batch_end` is slow compared to the batch time (batch time: 0.0223s vs `on_test_batch_end` time: 0.2541s). Check your callbacks.
2744/2744 [==============================] - 2510s 915ms/step - loss: 0.0290 - accuracy: 0.9767 - val_loss: 0.0151 - val_accuracy: 0.9873
Epoch 2/2
2744/2744 [==============================] - 2492s 908ms/step - loss: 0.0158 - accuracy: 0.9875 - val_loss: 0.0224 - val_accuracy: 0.9806
training model for CV #4
Epoch 1/2
   2/2744 [..............................] - ETA: 29:00 - loss: 0.6340 - accuracy: 0.2500WARNING:tensorflow:Callbacks method `on_train_batch_end` is slow compared to the batch time (batch time: 0.3217s vs `on_train_batch_end` time: 0.4917s). Check your callbacks.
WARNING:tensorflow:Callbacks method `on_train_batch_end` is slow compared to the batch time (batch time: 0.3217s vs `on_train_batch_end` time: 0.4917s). Check your callbacks.
2744/2744 [==============================] - ETA: 0s - loss: 0.0174 - accuracy: 0.9869WARNING:tensorflow:Callbacks method `on_test_batch_end` is slow compared to the batch time (batch time: 0.0205s vs `on_test_batch_end` time: 0.2527s). Check your callbacks.
WARNING:tensorflow:Callbacks method `on_test_batch_end` is slow compared to the batch time (batch time: 0.0205s vs `on_test_batch_end` time: 0.2527s). Check your callbacks.
2744/2744 [==============================] - 2429s 885ms/step - loss: 0.0174 - accuracy: 0.9869 - val_loss: 0.0125 - val_accuracy: 0.9900
Epoch 2/2
2744/2744 [==============================] - 2425s 884ms/step - loss: 0.0089 - accuracy: 0.9928 - val_loss: 0.0263 - val_accuracy: 0.9809
training model for CV #5
Epoch 1/2
   2/2744 [..............................] - ETA: 36:40 - loss: 0.7592 - accuracy: 0.4062WARNING:tensorflow:Callbacks method `on_train_batch_end` is slow compared to the batch time (batch time: 0.0491s vs `on_train_batch_end` time: 0.7488s). Check your callbacks.
WARNING:tensorflow:Callbacks method `on_train_batch_end` is slow compared to the batch time (batch time: 0.0491s vs `on_train_batch_end` time: 0.7488s). Check your callbacks.
2744/2744 [==============================] - 2433s 886ms/step - loss: 0.0147 - accuracy: 0.9897 - val_loss: 0.0061 - val_accuracy: 0.9950
Epoch 2/2
2744/2744 [==============================] - 2443s 890ms/step - loss: 0.0075 - accuracy: 0.9944 - val_loss: 0.0084 - val_accuracy: 0.9929
print(f'Accuracy (CV): {accuracy_score(y, np.argmax(p_val, axis=1)) * 100:8.4f}%')
print(f'Log Loss (CV): {log_loss(pd.get_dummies(y), p_val):8.4f}')
Accuracy (CV):  95.3370%
Log Loss (CV):   0.1380
np.savetxt(p_val_file, p_val, fmt='%.6f', delimiter=',')
np.savetxt(p_tst_file, p_tst, fmt='%.6f', delimiter=',')

제출 파일 생성

sub = pd.read_csv(sample_file, index_col=0)
print(sub.shape)
sub.head()
(19617, 5)
0 1 2 3 4
index
0 0 0 0 0 0
1 0 0 0 0 0
2 0 0 0 0 0
3 0 0 0 0 0
4 0 0 0 0 0
sub[sub.columns] = p_tst
sub.head()
0 1 2 3 4
index
0 6.4148e-04 0.6342 0.3968 3.2864e-04 2.2312e-04
1 7.3567e-05 0.9995 0.0002 2.9280e-04 1.9901e-04
2 9.9783e-01 0.0043 0.0001 8.8071e-05 6.5895e-05
3 4.2704e-05 0.0002 0.9994 1.2805e-04 8.2788e-05
4 9.9781e-01 0.0013 0.0004 6.9908e-04 2.0771e-04
sub.to_csv(sub_file)