News Text Classification

Solution Approach

Problem analysis: at its core this is a text classification task, where each piece of text has to be classified based on its characters. The twist is that the data is anonymized, so operations such as Chinese word segmentation cannot be applied directly, and this is the main difficulty of the competition.

The challenge is therefore to model the anonymized characters and build a text classifier on top of that representation. Since text is a typical kind of unstructured data, a solution generally involves two parts: feature extraction and a classification model. To lower the entry barrier, several solution ideas are offered for reference:

  • Idea 1: TF-IDF + machine learning classifier

Extract features from the text directly with TF-IDF and feed them to a classifier. For the classifier, SVM, logistic regression (LR), or XGBoost are all reasonable choices.

  • Idea 2: FastText

FastText is the entry-level word-vector approach; with the FastText tool released by Facebook, a classifier can be built very quickly (a minimal sketch follows this list).

  • Idea 3: Word2Vec + deep learning classifier

Word2Vec is the intermediate word-vector approach, combined with a deep learning classifier to finish the task. For the network architecture, TextCNN, TextRNN, or BiLSTM are common choices.

  • Idea 4: BERT word vectors

BERT is the high-end option, with very strong modeling and representation-learning capacity.
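To make Idea 2 concrete, here is a minimal FastText sketch. It assumes the training data has already been downloaded to ./data (see the next section), and the hyperparameters are illustrative rather than tuned; the __label__ prefix is the format the fasttext package expects for supervised training.

import pandas as pd
import fasttext
from sklearn.metrics import f1_score

train_df = pd.read_csv('./data/train_set.csv', sep='\t', nrows=15000)

# fasttext expects one sample per line, with the label marked by a "__label__" prefix.
train_df['label_ft'] = '__label__' + train_df['label'].astype(str)
train_df[['text', 'label_ft']].iloc[:-5000].to_csv('train_ft.csv', index=False, header=False, sep='\t')

model = fasttext.train_supervised('train_ft.csv', lr=1.0, wordNgrams=2,
                                  verbose=2, minCount=1, epoch=25, loss='hs')

# Evaluate on the held-out last 5000 rows with macro F1.
val_pred = [model.predict(x)[0][0].split('__')[-1] for x in train_df.iloc[-5000:]['text']]
print(f1_score(train_df.iloc[-5000:]['label'].astype(str), val_pred, average='macro'))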

Data Download

! mkdir ./data
# train data
! wget https://tianchi-competition.oss-cn-hangzhou.aliyuncs.com/531810/train_set.csv.zip
! unzip train_set.csv.zip -d ./data
! rm train_set.csv.zip

# test data
! wget https://tianchi-competition.oss-cn-hangzhou.aliyuncs.com/531810/test_a.csv.zip
! unzip test_a.csv.zip -d ./data
! rm test_a.csv.zip
# download pretrained embeddings
! wget http://tianchi-media.oss-cn-beijing.aliyuncs.com/dragonball/NLP/emb.zip
! unzip emb.zip
! rm emb.zip
! mv ./emb/bert-mini/bert_config.json ./emb/bert-mini/config.json

Create a Save Directory

! mkdir ./save

Install Required Packages

! pip install fasttext transformers==2.9.0 gensim torch==1.3.0

Reading the Data

Although the competition data is text and every news item has a different length, it is still stored in CSV format, so it can be read directly with Pandas.

import pandas as pd
train_df = pd.read_csv('./data/train_set.csv', sep='\t', nrows=100)

The read_csv call here has three parts:

  • the file path to read, which you should change to your local path (relative or absolute both work);
  • the separator sep, the character that splits the columns; \t is the right value here;
  • the row count nrows, a numeric value giving how many rows to read (the dataset is large, so start with 100);
train_df.head()
label text
0 2 2967 6758 339 2021 1854 3731 4109 3792 4149 15...
1 11 4464 486 6352 5619 2465 4802 1452 3137 5778 54...
2 3 7346 4068 5074 3747 5681 6093 1777 2226 7354 6...
3 2 7159 948 4866 2109 5520 2490 211 3956 5520 549...
4 3 3646 3055 3055 2490 4659 6065 3370 5814 2465 5...

The output above shows the loaded data in tabular form: the first column is the news category and the second column is the character sequence of the news item.

Data Analysis

After loading the dataset we can also run some exploratory analysis. Unstructured data does not call for much of it, but a quick analysis can still reveal useful patterns.

This step is meant to be run on the full training set (the snippet above read only 100 rows for speed); through the analysis we want to answer the following questions:

  • How long are the news texts?
  • What does the class distribution look like, and which classes are the most frequent?
  • What does the character distribution look like?

Sentence Length Analysis

In the competition data the characters of each sentence are separated by spaces, so the length of each text can be obtained simply by counting the tokens:

%pylab inline
train_df['text_len'] = train_df['text'].apply(lambda x: len(x.split(' ')))
print(train_df['text_len'].describe())
Populating the interactive namespace from numpy and matplotlib
count 100.000000
mean 872.320000
std 923.138191
min 64.000000
25% 359.500000
50% 598.000000
75% 1058.000000
max 7125.000000
Name: text_len, dtype: float64

These statistics show that the texts in this competition are fairly long: on the full training set each news item contains about 907 characters on average, the shortest has 2 characters and the longest has 57921 (the numbers printed above differ because only 100 rows were read).
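A histogram makes the long tail of document lengths easy to see; this optional sketch assumes matplotlib is available through the %pylab magic used above.

# Plot the distribution of text lengths (a long right tail is expected).
_ = plt.hist(train_df['text_len'], bins=200)
plt.xlabel('Text char count')
plt.title('Histogram of char count')
plt.show()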

News Class Distribution

Next we can look at the class distribution, i.e. count the number of samples in each news category.

train_df['label'].value_counts().plot(kind='bar')
plt.title('News class count')
plt.xlabel("category")
Text(0.5, 0, 'category')

The mapping between labels and categories in the dataset is: {'科技' (technology): 0, '股票' (stocks): 1, '体育' (sports): 2, '娱乐' (entertainment): 3, '时政' (politics): 4, '社会' (society): 5, '教育' (education): 6, '财经' (finance): 7, '家居' (home): 8, '游戏' (games): 9, '房产' (real estate): 10, '时尚' (fashion): 11, '彩票' (lottery): 12, '星座' (horoscopes): 13}

The statistics show that the class distribution is quite imbalanced: technology news is the largest category in the training set, followed by stock news, while horoscope news is the rarest.

Character Distribution

Next we can count how often each character occurs: concatenate all the sentences in the training set, split them into characters, and count each one.

On the full training set there are 6869 distinct characters in total, and character 3750 is by far the most frequent; in the 100-row sample used below there are 2405 distinct characters, and character 5034 appears only once.

from collections import Counter
all_lines = ' '.join(list(train_df['text']))
word_count = Counter(all_lines.split(" "))
word_count = sorted(word_count.items(), key=lambda d:d[1], reverse = True)

print(len(word_count))

print(word_count[0])

print(word_count[-1])
2405
('3750', 3702)
('5034', 1)

Conclusions from the Data Analysis

From the analysis above we can conclude:

  1. each news item in the competition contains about 1000 characters on average, and some are much longer;
  2. the class distribution is imbalanced: technology news has close to 40k samples while horoscope news has fewer than 1k;
  3. the dataset contains roughly 7000-8000 distinct characters;

These observations also have practical consequences:

  1. the average news item is long, so the text may need to be truncated;
  2. the class imbalance can seriously hurt model accuracy (a small sketch addressing both points follows this list);
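A hedged sketch of how both points might be handled; the truncate helper, the max_len value, and the use of sklearn's balanced class weights are illustrative choices, not part of the competition baseline.

import numpy as np
import pandas as pd
from sklearn.utils.class_weight import compute_class_weight

train_df = pd.read_csv('./data/train_set.csv', sep='\t', nrows=15000)

def truncate(text, max_len=3000):
    """Keep only the first max_len space-separated tokens of a document."""
    return ' '.join(text.split()[:max_len])

train_df['text'] = train_df['text'].apply(truncate)

# 'balanced' weights are inversely proportional to class frequency; they can be passed
# to many sklearn classifiers via class_weight, or used to weight a neural-network loss.
classes = np.unique(train_df['label'])
weights = compute_class_weight(class_weight='balanced', classes=classes, y=train_df['label'])
print(dict(zip(classes, np.round(weights, 3))))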

Text Classification Based on Machine Learning

Learning Objectives

  • Understand the principle and usage of TF-IDF
  • Use sklearn machine learning models to complete the text classification task

Text Representation Methods

One-hot

One-hot encoding here is the same operation as in other data mining tasks: each word is represented by a discrete vector. Concretely, every character/word is assigned an index, and the vector is filled in according to that index.

An example of the one-hot representation:

Sentence 1: 我 爱 北 京 天 安 门
Sentence 2: 我 喜 欢 上 海

First index the characters of all sentences, i.e. assign each character a number:

{
'我': 1, '爱': 2, '北': 3, '京': 4, '天': 5,
'安': 6, '门': 7, '喜': 8, '欢': 9, '上': 10, '海': 11
}

There are 11 characters in total here, so each character can be converted into an 11-dimensional sparse vector:

我:[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
爱:[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
...
海:[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
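A tiny Python sketch of the same procedure, written here purely for illustration:

# Build a character-to-index mapping and one-hot vectors for the two toy sentences.
sentences = ['我 爱 北 京 天 安 门', '我 喜 欢 上 海']

char2idx = {}
for sent in sentences:
    for ch in sent.split():
        if ch not in char2idx:
            char2idx[ch] = len(char2idx) + 1   # indices start at 1, as in the table above

def one_hot(ch):
    vec = [0] * len(char2idx)
    vec[char2idx[ch] - 1] = 1
    return vec

print(char2idx['我'], one_hot('我'))   # 1 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
print(char2idx['海'], one_hot('海'))   # 11 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]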

Bag of Words

Bag of Words (also called Count Vectors) represents each document by the occurrence counts of its characters/words.

Sentence 1: 我 爱 北 京 天 安 门
Sentence 2: 我 喜 欢 上 海

Simply count how many times each character appears and fill in the vector accordingly:

Sentence 1: 我 爱 北 京 天 安 门
becomes [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]

Sentence 2: 我 喜 欢 上 海
becomes [1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]

In sklearn this step can be done directly with CountVectorizer:

from sklearn.feature_extraction.text import CountVectorizer
corpus = [
'This is the first document.',
'This document is the second document.',
'And this is the third one.',
'Is this the first document?',
]
vectorizer = CountVectorizer()
vectorizer.fit_transform(corpus).toarray()

N-gram

N-grams are similar to Count Vectors, except that adjacent words are also combined into new compound tokens, which are then counted.

With N = 2, sentence 1 and sentence 2 become:

Sentence 1: 我爱 爱北 北京 京天 天安 安门
Sentence 2: 我喜 喜欢 欢上 上海
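In sklearn the same idea is exposed through the ngram_range argument of CountVectorizer; a short sketch on the toy English corpus used above, where (1, 2) keeps unigrams and adds adjacent word pairs as extra features:

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(corpus)
print(X.toarray().shape)
# Feature names now include bigrams such as 'document is', 'is the', 'this document'
# (use get_feature_names() instead on sklearn < 1.0).
print(vectorizer.get_feature_names_out()[:10])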

TF-IDF

A TF-IDF score has two parts: the term frequency (TF) and the inverse document frequency (IDF). The IDF is obtained by dividing the total number of documents in the corpus by the number of documents that contain the term, and then taking the logarithm.

TF(t) = (number of times term t appears in the current document) / (total number of terms in the current document)
IDF(t) = log_e(total number of documents / number of documents containing term t)
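sklearn's TfidfVectorizer implements this weighting (with smoothing and L2 normalization on top of the plain formulas above); a short sketch on the same toy corpus:

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(corpus)
# Each row is an L2-normalized TF-IDF vector; rarer words receive larger weights.
print(X.toarray().round(2))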

Text Classification Based on Machine Learning

Next we compare the accuracy of different text representations, computing the F1 score on a locally built validation set.

# Count Vectors + RidgeClassifier

import pandas as pd

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import RidgeClassifier
from sklearn.metrics import f1_score

train_df = pd.read_csv('./data/train_set.csv', sep='\t', nrows=15000)

vectorizer = CountVectorizer(max_features=3000)
train_test = vectorizer.fit_transform(train_df['text'])

clf = RidgeClassifier()
clf.fit(train_test[:10000], train_df['label'].values[:10000])

val_pred = clf.predict(train_test[10000:])
print(f1_score(train_df['label'].values[10000:], val_pred, average='macro'))
0.741494277019762
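For comparison, the same pipeline can be run with TF-IDF features instead of raw counts; the sketch below mirrors the block above (the ngram_range and max_features values are illustrative) and typically gives a noticeably higher macro F1 than plain counts on this task.

# TF-IDF + RidgeClassifier

import pandas as pd

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import RidgeClassifier
from sklearn.metrics import f1_score

train_df = pd.read_csv('./data/train_set.csv', sep='\t', nrows=15000)

tfidf = TfidfVectorizer(ngram_range=(1, 3), max_features=3000)
train_test = tfidf.fit_transform(train_df['text'])

clf = RidgeClassifier()
clf.fit(train_test[:10000], train_df['label'].values[:10000])

val_pred = clf.predict(train_test[10000:])
print(f1_score(train_df['label'].values[10000:], val_pred, average='macro'))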

Text Classification Based on Deep Learning: Word2Vec

import logging
import random

import numpy as np
import torch

logging.basicConfig(level=logging.INFO, format='%(asctime)-15s %(levelname)s: %(message)s')

# set seed
seed = 666
random.seed(seed)
np.random.seed(seed)
torch.cuda.manual_seed(seed)
torch.manual_seed(seed)
# split data to 10 fold
fold_num = 10
data_file = './data/train_set.csv'
import pandas as pd


def all_data2fold(fold_num, num=10000):
    fold_data = []
    f = pd.read_csv(data_file, sep='\t', encoding='UTF-8')
    # texts: ['2967 6758 339 2021 1854 3731 4109 3792 4149 1519 ...]
    texts = f['text'].tolist()[:num]
    # labels: [2, 11, 3, 2, 3, 9, 3, 10, 12, 3, 0, 7, 4, 0, 0 ...]
    labels = f['label'].tolist()[:num]

    total = len(labels)

    # Shuffle the indices so that all_texts and all_labels end up in random order
    # (texts and labels stay aligned one-to-one).
    index = list(range(total))
    # index: [3447, 966, 593, 5029, 4382, 2345, 974, 3786, 2249, ...]
    np.random.shuffle(index)

    all_texts = []
    all_labels = []
    for i in index:
        all_texts.append(texts[i])
        all_labels.append(labels[i])
    # all_texts: ['600 3373 2828 2515 5026 245 3743 26 2396 6122 3720 14 ...]
    # all_labels: [2, 4, 8, 7, 0, 0, 2, 1, 12, 0, ...]

    # Build a dict whose keys are labels and whose values are the lists of indices
    # carrying that label.
    label2id = {}
    for i in range(total):
        label = str(all_labels[i])
        if label not in label2id:
            label2id[label] = [i]
        else:
            label2id[label].append(i)
    # label2id: {'2': [0, 6, 14, 21, 27, 28, 32, 39, ...], '4': [...], ...}

    # Split the samples of each label into fold_num folds (a stratified split).
    all_index = [[] for _ in range(fold_num)]
    for label, data in label2id.items():
        # Example: len(data) = 105, fold_num = 10, so batch_size = 10 and other = 5:
        #   i=0, cur_batch_size = 11, batch_data = [data[0], data[1], ..., data[10]]
        #   i=1, cur_batch_size = 11, batch_data = [data[10], data[11], ..., data[20]]
        #   ......
        #   i=5, cur_batch_size = 10, batch_data = [data[50], data[51], ..., data[59]]
        #   i=6, cur_batch_size = 10, batch_data = [data[60], data[61], ..., data[69]]
        #   ......
        #   i=9, cur_batch_size = 10, batch_data = [data[90], data[91], ..., data[99]]
        batch_size = int(len(data) / fold_num)
        other = len(data) - batch_size * fold_num
        for i in range(fold_num):
            cur_batch_size = batch_size + 1 if i < other else batch_size
            # print(cur_batch_size)
            batch_data = [data[i * batch_size + b] for b in range(cur_batch_size)]
            all_index[i].extend(batch_data)

    batch_size = int(total / fold_num)
    other_texts = []
    other_labels = []
    other_num = 0
    start = 0
    # Turn the per-fold index lists into per-fold text/label lists of equal size.
    for fold in range(fold_num):
        num = len(all_index[fold])
        texts = [all_texts[i] for i in all_index[fold]]
        labels = [all_labels[i] for i in all_index[fold]]

        # If this fold has more samples than the average fold size (batch_size), truncate
        # it to batch_size and move the surplus texts into other_texts; labels likewise.
        if num > batch_size:
            fold_texts = texts[:batch_size]
            other_texts.extend(texts[batch_size:])
            fold_labels = labels[:batch_size]
            other_labels.extend(labels[batch_size:])
            other_num += num - batch_size
        # If this fold has fewer samples than the average, top it up with
        # batch_size - num samples taken from other_texts; labels likewise.
        elif num < batch_size:
            end = start + batch_size - num
            fold_texts = texts + other_texts[start: end]
            fold_labels = labels + other_labels[start: end]
            start = end
        # Otherwise keep the texts and labels unchanged.
        else:
            fold_texts = texts
            fold_labels = labels

        assert batch_size == len(fold_labels)

        # Shuffle the texts and labels inside the fold.
        index = list(range(batch_size))
        np.random.shuffle(index)

        shuffle_fold_texts = []
        shuffle_fold_labels = []
        for i in index:
            shuffle_fold_texts.append(fold_texts[i])
            shuffle_fold_labels.append(fold_labels[i])

        data = {'label': shuffle_fold_labels, 'text': shuffle_fold_texts}
        fold_data.append(data)

    logging.info("Fold lens %s", str([len(data['label']) for data in fold_data]))

    return fold_data


fold_data = all_data2fold(10)
# build train data for word2vec
fold_id = 9

train_texts = []
for i in range(0, fold_id):
    data = fold_data[i]
    train_texts.extend(data['text'])

logging.info('Total %d docs.' % len(train_texts))
logging.info('Start training...')
from gensim.models.word2vec import Word2Vec

num_features = 100     # Word vector dimensionality
num_workers = 8        # Number of threads to run in parallel

train_texts = list(map(lambda x: list(x.split()), train_texts))
# Note: the `size` argument and init_sims() are the gensim 3.x API; in gensim 4.x
# `size` was renamed to `vector_size` and init_sims() is deprecated.
model = Word2Vec(train_texts, workers=num_workers, size=num_features)
model.init_sims(replace=True)

# save model
model.save("./save/word2vec.bin")
# load model
model = Word2Vec.load("./save/word2vec.bin")

# convert format
model.wv.save_word2vec_format('./save/word2vec.txt', binary=False)
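As an optional sanity check (not part of the original pipeline), the saved text-format vectors can be reloaded with gensim and queried for nearest neighbours; '3750' is simply a very frequent character id from the earlier analysis.

# Reload the saved vectors and inspect the nearest neighbours of a frequent character.
from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format('./save/word2vec.txt', binary=False)
print(wv.most_similar('3750', topn=5))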

Text Classification Based on Deep Learning: TextRNN

TextRNN uses a recurrent neural network to extract text features; since text is inherently sequential, an LSTM is a natural fit for modeling it. TextRNN feeds the word vector of every word in the sentence into a bidirectional multi-layer LSTM and concatenates the final hidden states of the two directions (at the last valid position) into a single vector that represents the text. A minimal stand-alone sketch of this idea follows; note that the full implementation below is actually hierarchical: a word-level BiLSTM with attention produces sentence representations, and a sentence-level BiLSTM with attention combines them into the document representation.
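As a minimal, self-contained illustration of the plain TextRNN idea (the class name and hyperparameters are mine, not part of the competition code):

import torch
import torch.nn as nn


class TextRNN(nn.Module):
    """Bidirectional LSTM text classifier: embed -> BiLSTM -> concat final states -> linear."""

    def __init__(self, vocab_size, num_classes, embed_dim=100, hidden_size=128, num_layers=2):
        super(TextRNN, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_size, num_layers=num_layers,
                            batch_first=True, bidirectional=True)
        self.fc = nn.Linear(hidden_size * 2, num_classes)

    def forward(self, token_ids):
        # token_ids: batch x seq_len
        embed = self.embedding(token_ids)                # batch x seq_len x embed_dim
        _, (h_n, _) = self.lstm(embed)                   # h_n: (num_layers*2) x batch x hidden
        # Concatenate the final forward and backward hidden states of the last layer.
        doc_rep = torch.cat([h_n[-2], h_n[-1]], dim=-1)  # batch x hidden*2
        return self.fc(doc_rep)                          # batch x num_classes


# Smoke test with random token ids (vocab of 7000 characters, 14 classes).
model = TextRNN(vocab_size=7000, num_classes=14)
logits = model(torch.randint(1, 7000, (4, 50)))
print(logits.shape)  # torch.Size([4, 14])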

import logging
import random

import numpy as np
import torch

logging.basicConfig(level=logging.INFO, format='%(asctime)-15s %(levelname)s: %(message)s')

# set seed
seed = 666
random.seed(seed)
np.random.seed(seed)
torch.cuda.manual_seed(seed)
torch.manual_seed(seed)

# set cuda
gpu = 0
use_cuda = gpu >= 0 and torch.cuda.is_available()
if use_cuda:
    torch.cuda.set_device(gpu)
    device = torch.device("cuda", gpu)
else:
    device = torch.device("cpu")
logging.info("Use cuda: %s, gpu id: %d.", use_cuda, gpu)
# split data to 10 fold
fold_num = 10
data_file = './data/train_set.csv'  # path to the training set downloaded above
import pandas as pd


def all_data2fold(fold_num, num=1000):
fold_data = []
f = pd.read_csv(data_file, sep='\t', encoding='UTF-8')
texts = f['text'].tolist()[:num]
labels = f['label'].tolist()[:num]

total = len(labels)

index = list(range(total))
np.random.shuffle(index)

all_texts = []
all_labels = []
for i in index:
all_texts.append(texts[i])
all_labels.append(labels[i])

label2id = {}
for i in range(total):
label = str(all_labels[i])
if label not in label2id:
label2id[label] = [i]
else:
label2id[label].append(i)

all_index = [[] for _ in range(fold_num)]
for label, data in label2id.items():
# print(label, len(data))
batch_size = int(len(data) / fold_num)
other = len(data) - batch_size * fold_num
for i in range(fold_num):
cur_batch_size = batch_size + 1 if i < other else batch_size
# print(cur_batch_size)
batch_data = [data[i * batch_size + b] for b in range(cur_batch_size)]
all_index[i].extend(batch_data)

batch_size = int(total / fold_num)
other_texts = []
other_labels = []
other_num = 0
start = 0
for fold in range(fold_num):
num = len(all_index[fold])
texts = [all_texts[i] for i in all_index[fold]]
labels = [all_labels[i] for i in all_index[fold]]

if num > batch_size:
fold_texts = texts[:batch_size]
other_texts.extend(texts[batch_size:])
fold_labels = labels[:batch_size]
other_labels.extend(labels[batch_size:])
other_num += num - batch_size
elif num < batch_size:
end = start + batch_size - num
fold_texts = texts + other_texts[start: end]
fold_labels = labels + other_labels[start: end]
start = end
else:
fold_texts = texts
fold_labels = labels

assert batch_size == len(fold_labels)

# shuffle
index = list(range(batch_size))
np.random.shuffle(index)

shuffle_fold_texts = []
shuffle_fold_labels = []
for i in index:
shuffle_fold_texts.append(fold_texts[i])
shuffle_fold_labels.append(fold_labels[i])

data = {'label': shuffle_fold_labels, 'text': shuffle_fold_texts}
fold_data.append(data)

logging.info("Fold lens %s", str([len(data['label']) for data in fold_data]))

return fold_data


fold_data = all_data2fold(10)
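The Vocab and Trainer used later in this section expect train_data, dev_data and test_data dictionaries, which are never built here explicitly; the following sketch mirrors the split used in the BERT section at the end of this post (fold 9 as the dev set, the remaining folds as training data, and test_a.csv as the unlabeled test set).

# build train, dev, test data (mirrors the BERT section below)
fold_id = 9

# dev: the last fold
dev_data = fold_data[fold_id]

# train: all remaining folds
train_texts = []
train_labels = []
for i in range(0, fold_id):
    data = fold_data[i]
    train_texts.extend(data['text'])
    train_labels.extend(data['label'])

train_data = {'label': train_labels, 'text': train_texts}

# test: the competition test set (labels are placeholders)
test_data_file = './data/test_a.csv'
f = pd.read_csv(test_data_file, sep='\t', encoding='UTF-8')
texts = f['text'].tolist()
test_data = {'label': [0] * len(texts), 'text': texts}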
# build vocab
from collections import Counter
from transformers import BasicTokenizer

basic_tokenizer = BasicTokenizer()


class Vocab():
def __init__(self, train_data):
self.min_count = 5
self.pad = 0
self.unk = 1
self._id2word = ['[PAD]', '[UNK]']
self._id2extword = ['[PAD]', '[UNK]']

self._id2label = []
self.target_names = []

self.build_vocab(train_data)

reverse = lambda x: dict(zip(x, range(len(x))))

# _word2id = {'[PAD]': 0, '[UNK]': 1}
self._word2id = reverse(self._id2word)
self._label2id = reverse(self._id2label)

logging.info("Build vocab: words %d, labels %d." % (self.word_size, self.label_size))

def build_vocab(self, data):
self.word_counter = Counter()

for text in data['text']:
words = text.split()
for word in words:
self.word_counter[word] += 1

for word, count in self.word_counter.most_common():
if count >= self.min_count:
self._id2word.append(word)

label2name = {0: '科技', 1: '股票', 2: '体育', 3: '娱乐', 4: '时政', 5: '社会', 6: '教育', 7: '财经',
8: '家居', 9: '游戏', 10: '房产', 11: '时尚', 12: '彩票', 13: '星座'}

self.label_counter = Counter(data['label'])

for label in range(len(self.label_counter)):
count = self.label_counter[label]
self._id2label.append(label)
self.target_names.append(label2name[label])

def load_pretrained_embs(self, embfile):
with open(embfile, encoding='utf-8') as f:
lines = f.readlines()
items = lines[0].split()
word_count, embedding_dim = int(items[0]), int(items[1])

index = len(self._id2extword)
embeddings = np.zeros((word_count + index, embedding_dim))
for line in lines[1:]:
values = line.split()
self._id2extword.append(values[0])
vector = np.array(values[1:], dtype='float64')
embeddings[self.unk] += vector
embeddings[index] = vector
index += 1

embeddings[self.unk] = embeddings[self.unk] / word_count
embeddings = embeddings / np.std(embeddings)

reverse = lambda x: dict(zip(x, range(len(x))))
self._extword2id = reverse(self._id2extword)

assert len(set(self._id2extword)) == len(self._id2extword)

return embeddings

def word2id(self, xs):
if isinstance(xs, list):
return [self._word2id.get(x, self.unk) for x in xs]
return self._word2id.get(xs, self.unk)

def extword2id(self, xs):
if isinstance(xs, list):
return [self._extword2id.get(x, self.unk) for x in xs]
return self._extword2id.get(xs, self.unk)

def label2id(self, xs):
if isinstance(xs, list):
return [self._label2id.get(x, self.unk) for x in xs]
return self._label2id.get(xs, self.unk)

@property
def word_size(self):
return len(self._id2word)

@property
def extword_size(self):
return len(self._id2extword)

@property
def label_size(self):
return len(self._id2label)


vocab = Vocab(train_data)
# build module
import torch.nn as nn
import torch.nn.functional as F


class Attention(nn.Module):
def __init__(self, hidden_size):
super(Attention, self).__init__()
self.weight = nn.Parameter(torch.Tensor(hidden_size, hidden_size))
self.weight.data.normal_(mean=0.0, std=0.05)

self.bias = nn.Parameter(torch.Tensor(hidden_size))
b = np.zeros(hidden_size, dtype=np.float32)
self.bias.data.copy_(torch.from_numpy(b))

self.query = nn.Parameter(torch.Tensor(hidden_size))
self.query.data.normal_(mean=0.0, std=0.05)

def forward(self, batch_hidden, batch_masks):
# batch_hidden: b x len x hidden_size (2 * hidden_size of lstm)
# batch_masks: b x len

# linear
key = torch.matmul(batch_hidden, self.weight) + self.bias # b x len x hidden

# compute attention
outputs = torch.matmul(key, self.query) # b x len

masked_outputs = outputs.masked_fill((1 - batch_masks).bool(), float(-1e32))

attn_scores = F.softmax(masked_outputs, dim=1) # b x len

# For an all-zero mask row, filling with -1e32 yields uniform 1/len attention scores
# (while -inf would give NaN), so those positions are explicitly zeroed out below.
masked_attn_scores = attn_scores.masked_fill((1 - batch_masks).bool(), 0.0)

# sum weighted sources
batch_outputs = torch.bmm(masked_attn_scores.unsqueeze(1), key).squeeze(1) # b x hidden

return batch_outputs, attn_scores


# build word encoder
word2vec_path = './save/word2vec.txt'  # text-format vectors saved in the Word2Vec section above
dropout = 0.15
word_hidden_size = 128
word_num_layers = 2


class WordLSTMEncoder(nn.Module):
def __init__(self, vocab):
super(WordLSTMEncoder, self).__init__()
self.dropout = nn.Dropout(dropout)
self.word_dims = 100

self.word_embed = nn.Embedding(vocab.word_size, self.word_dims, padding_idx=0)

extword_embed = vocab.load_pretrained_embs(word2vec_path)
extword_size, word_dims = extword_embed.shape
logging.info("Load extword embed: words %d, dims %d." % (extword_size, word_dims))

self.extword_embed = nn.Embedding(extword_size, word_dims, padding_idx=0)
self.extword_embed.weight.data.copy_(torch.from_numpy(extword_embed))
self.extword_embed.weight.requires_grad = False

input_size = self.word_dims

self.word_lstm = nn.LSTM(
input_size=input_size,
hidden_size=word_hidden_size,
num_layers=word_num_layers,
batch_first=True,
bidirectional=True
)

def forward(self, word_ids, extword_ids, batch_masks):
# word_ids: sen_num x sent_len
# extword_ids: sen_num x sent_len
# batch_masks sen_num x sent_len

word_embed = self.word_embed(word_ids) # sen_num x sent_len x 100
extword_embed = self.extword_embed(extword_ids)
batch_embed = word_embed + extword_embed

if self.training:
batch_embed = self.dropout(batch_embed)

hiddens, _ = self.word_lstm(batch_embed) # sen_num x sent_len x hidden*2
hiddens = hiddens * batch_masks.unsqueeze(2)

if self.training:
hiddens = self.dropout(hiddens)

return hiddens


# build sent encoder
sent_hidden_size = 256
sent_num_layers = 2


class SentEncoder(nn.Module):
def __init__(self, sent_rep_size):
super(SentEncoder, self).__init__()
self.dropout = nn.Dropout(dropout)

self.sent_lstm = nn.LSTM(
input_size=sent_rep_size,
hidden_size=sent_hidden_size,
num_layers=sent_num_layers,
batch_first=True,
bidirectional=True
)

def forward(self, sent_reps, sent_masks):
# sent_reps: b x doc_len x sent_rep_size
# sent_masks: b x doc_len

sent_hiddens, _ = self.sent_lstm(sent_reps) # b x doc_len x hidden*2
sent_hiddens = sent_hiddens * sent_masks.unsqueeze(2)

if self.training:
sent_hiddens = self.dropout(sent_hiddens)

return sent_hiddens
# build model
class Model(nn.Module):
def __init__(self, vocab):
super(Model, self).__init__()
self.sent_rep_size = word_hidden_size * 2
self.doc_rep_size = sent_hidden_size * 2
self.all_parameters = {}
parameters = []
self.word_encoder = WordLSTMEncoder(vocab)
self.word_attention = Attention(self.sent_rep_size)
parameters.extend(list(filter(lambda p: p.requires_grad, self.word_encoder.parameters())))
parameters.extend(list(filter(lambda p: p.requires_grad, self.word_attention.parameters())))

self.sent_encoder = SentEncoder(self.sent_rep_size)
self.sent_attention = Attention(self.doc_rep_size)
parameters.extend(list(filter(lambda p: p.requires_grad, self.sent_encoder.parameters())))
parameters.extend(list(filter(lambda p: p.requires_grad, self.sent_attention.parameters())))

self.out = nn.Linear(self.doc_rep_size, vocab.label_size, bias=True)
parameters.extend(list(filter(lambda p: p.requires_grad, self.out.parameters())))

if use_cuda:
self.to(device)

if len(parameters) > 0:
self.all_parameters["basic_parameters"] = parameters

logging.info('Build model with lstm word encoder, lstm sent encoder.')

para_num = sum([np.prod(list(p.size())) for p in self.parameters()])
logging.info('Model param num: %.2f M.' % (para_num / 1e6))

def forward(self, batch_inputs):
# batch_inputs(batch_inputs1, batch_inputs2): b x doc_len x sent_len
# batch_masks : b x doc_len x sent_len
batch_inputs1, batch_inputs2, batch_masks = batch_inputs
batch_size, max_doc_len, max_sent_len = batch_inputs1.shape[0], batch_inputs1.shape[1], batch_inputs1.shape[2]
batch_inputs1 = batch_inputs1.view(batch_size * max_doc_len, max_sent_len) # sen_num x sent_len
batch_inputs2 = batch_inputs2.view(batch_size * max_doc_len, max_sent_len) # sen_num x sent_len
batch_masks = batch_masks.view(batch_size * max_doc_len, max_sent_len) # sen_num x sent_len

batch_hiddens = self.word_encoder(batch_inputs1, batch_inputs2,
batch_masks) # sen_num x sent_len x sent_rep_size
sent_reps, atten_scores = self.word_attention(batch_hiddens, batch_masks) # sen_num x sent_rep_size

sent_reps = sent_reps.view(batch_size, max_doc_len, self.sent_rep_size) # b x doc_len x sent_rep_size
batch_masks = batch_masks.view(batch_size, max_doc_len, max_sent_len) # b x doc_len x max_sent_len
sent_masks = batch_masks.bool().any(2).float() # b x doc_len

sent_hiddens = self.sent_encoder(sent_reps, sent_masks) # b x doc_len x doc_rep_size
doc_reps, atten_scores = self.sent_attention(sent_hiddens, sent_masks) # b x doc_rep_size

batch_outputs = self.out(doc_reps) # b x num_labels

return batch_outputs


model = Model(vocab)
# build optimizer
learning_rate = 2e-4
decay = .75
decay_step = 1000


class Optimizer:
def __init__(self, model_parameters):
self.all_params = []
self.optims = []
self.schedulers = []

for name, parameters in model_parameters.items():
if name.startswith("basic"):
optim = torch.optim.Adam(parameters, lr=learning_rate)
self.optims.append(optim)

l = lambda step: decay ** (step // decay_step)
scheduler = torch.optim.lr_scheduler.LambdaLR(optim, lr_lambda=l)
self.schedulers.append(scheduler)
self.all_params.extend(parameters)

else:
Exception("no nameed parameters.")

self.num = len(self.optims)

def step(self):
for optim, scheduler in zip(self.optims, self.schedulers):
optim.step()
scheduler.step()
optim.zero_grad()

def zero_grad(self):
for optim in self.optims:
optim.zero_grad()

def get_lr(self):
lrs = tuple(map(lambda x: x.get_lr()[-1], self.schedulers))
lr = ' %.5f' * self.num
res = lr % lrs
return res
# build dataset
def sentence_split(text, vocab, max_sent_len=256, max_segment=16):
words = text.strip().split()
document_len = len(words)

index = list(range(0, document_len, max_sent_len))
index.append(document_len)

segments = []
for i in range(len(index) - 1):
segment = words[index[i]: index[i + 1]]
assert len(segment) > 0
segment = [word if word in vocab._id2word else '<UNK>' for word in segment]
segments.append([len(segment), segment])

assert len(segments) > 0
if len(segments) > max_segment:
segment_ = int(max_segment / 2)
return segments[:segment_] + segments[-segment_:]
else:
return segments


def get_examples(data, vocab, max_sent_len=256, max_segment=8):
label2id = vocab.label2id
examples = []

for text, label in zip(data['text'], data['label']):
# label
id = label2id(label)

# words
sents_words = sentence_split(text, vocab, max_sent_len, max_segment)
doc = []
for sent_len, sent_words in sents_words:
word_ids = vocab.word2id(sent_words)
extword_ids = vocab.extword2id(sent_words)
doc.append([sent_len, word_ids, extword_ids])
examples.append([id, len(doc), doc])

logging.info('Total %d docs.' % len(examples))
return examples
# build loader

def batch_slice(data, batch_size):
    batch_num = int(np.ceil(len(data) / float(batch_size)))
    for i in range(batch_num):
        cur_batch_size = batch_size if i < batch_num - 1 else len(data) - batch_size * i
        docs = [data[i * batch_size + b] for b in range(cur_batch_size)]

        yield docs


def data_iter(data, batch_size, shuffle=True, noise=1.0):
    """
    Randomly permute the data, sort it by (noisy) document length, and partition it into
    batches, so that the documents inside each batch have similar lengths.
    """

    batched_data = []
    if shuffle:
        np.random.shuffle(data)

        lengths = [example[1] for example in data]
        noisy_lengths = [- (l + np.random.uniform(- noise, noise)) for l in lengths]
        sorted_indices = np.argsort(noisy_lengths).tolist()
        sorted_data = [data[i] for i in sorted_indices]
    else:
        # Without shuffling keep the original order; this branch is present in the BERT
        # version below and is needed here too, otherwise the dev evaluation fails.
        sorted_data = data

    batched_data.extend(list(batch_slice(sorted_data, batch_size)))

    if shuffle:
        np.random.shuffle(batched_data)

    for batch in batched_data:
        yield batch
# some function
from sklearn.metrics import f1_score, precision_score, recall_score


def get_score(y_true, y_pred):
y_true = np.array(y_true)
y_pred = np.array(y_pred)
f1 = f1_score(y_true, y_pred, average='macro') * 100
p = precision_score(y_true, y_pred, average='macro') * 100
r = recall_score(y_true, y_pred, average='macro') * 100

return str((reformat(p, 2), reformat(r, 2), reformat(f1, 2))), reformat(f1, 2)


def reformat(num, n):
return float(format(num, '0.' + str(n) + 'f'))
# build trainer

import time
from sklearn.metrics import classification_report

clip = 5.0
epochs = 1
early_stops = 3
log_interval = 200

test_batch_size = 16
train_batch_size = 16

save_model = './save/rnn.bin'
save_test = './save/rnn.csv'


class Trainer():
def __init__(self, model, vocab):
self.model = model
self.report = True

self.train_data = get_examples(train_data, vocab)
self.batch_num = int(np.ceil(len(self.train_data) / float(train_batch_size)))
self.dev_data = get_examples(dev_data, vocab)
self.test_data = get_examples(test_data, vocab)

# criterion
self.criterion = nn.CrossEntropyLoss()

# label name
self.target_names = vocab.target_names

# optimizer
self.optimizer = Optimizer(model.all_parameters)

# count
self.step = 0
self.early_stop = -1
self.best_train_f1, self.best_dev_f1 = 0, 0
self.last_epoch = epochs

def train(self):
logging.info('Start training...')
for epoch in range(1, epochs + 1):
train_f1 = self._train(epoch)

dev_f1 = self._eval(epoch)

if self.best_dev_f1 <= dev_f1:
logging.info(
"Exceed history dev = %.2f, current dev = %.2f" % (self.best_dev_f1, dev_f1))
torch.save(self.model.state_dict(), save_model)

self.best_train_f1 = train_f1
self.best_dev_f1 = dev_f1
self.early_stop = 0
else:
self.early_stop += 1
if self.early_stop == early_stops:
logging.info(
"Eearly stop in epoch %d, best train: %.2f, dev: %.2f" % (
epoch - early_stops, self.best_train_f1, self.best_dev_f1))
self.last_epoch = epoch
break

def test(self):
self.model.load_state_dict(torch.load(save_model))
self._eval(self.last_epoch + 1, test=True)

def _train(self, epoch):
self.optimizer.zero_grad()
self.model.train()

start_time = time.time()
epoch_start_time = time.time()
overall_losses = 0
losses = 0
batch_idx = 1
y_pred = []
y_true = []
for batch_data in data_iter(self.train_data, train_batch_size, shuffle=True):
torch.cuda.empty_cache()
batch_inputs, batch_labels = self.batch2tensor(batch_data)
batch_outputs = self.model(batch_inputs)
loss = self.criterion(batch_outputs, batch_labels)
loss.backward()

loss_value = loss.detach().cpu().item()
losses += loss_value
overall_losses += loss_value

y_pred.extend(torch.max(batch_outputs, dim=1)[1].cpu().numpy().tolist())
y_true.extend(batch_labels.cpu().numpy().tolist())

nn.utils.clip_grad_norm_(self.optimizer.all_params, max_norm=clip)
for optimizer, scheduler in zip(self.optimizer.optims, self.optimizer.schedulers):
optimizer.step()
scheduler.step()
self.optimizer.zero_grad()

self.step += 1

if batch_idx % log_interval == 0:
elapsed = time.time() - start_time

lrs = self.optimizer.get_lr()
logging.info(
'| epoch {:3d} | step {:3d} | batch {:3d}/{:3d} | lr{} | loss {:.4f} | s/batch {:.2f}'.format(
epoch, self.step, batch_idx, self.batch_num, lrs,
losses / log_interval,
elapsed / log_interval))

losses = 0
start_time = time.time()

batch_idx += 1

overall_losses /= self.batch_num
during_time = time.time() - epoch_start_time

# reformat
overall_losses = reformat(overall_losses, 4)
score, f1 = get_score(y_true, y_pred)

logging.info(
'| epoch {:3d} | score {} | f1 {} | loss {:.4f} | time {:.2f}'.format(epoch, score, f1,
overall_losses,
during_time))
if set(y_true) == set(y_pred) and self.report:
report = classification_report(y_true, y_pred, digits=4, target_names=self.target_names)
logging.info('\n' + report)

return f1

def _eval(self, epoch, test=False):
self.model.eval()
start_time = time.time()

y_pred = []
y_true = []
data = self.test_data if test else self.dev_data
with torch.no_grad():
for batch_data in data_iter(data, test_batch_size, shuffle=False):
torch.cuda.empty_cache()
batch_inputs, batch_labels = self.batch2tensor(batch_data)
batch_outputs = self.model(batch_inputs)
y_pred.extend(torch.max(batch_outputs, dim=1)[1].cpu().numpy().tolist())
y_true.extend(batch_labels.cpu().numpy().tolist())

score, f1 = get_score(y_true, y_pred)

during_time = time.time() - start_time

if test:
df = pd.DataFrame({'label': y_pred})
df.to_csv(save_test, index=False, sep=',')
else:
logging.info(
'| epoch {:3d} | dev | score {} | f1 {} | time {:.2f}'.format(epoch, score, f1,
during_time))
if set(y_true) == set(y_pred) and self.report:
report = classification_report(y_true, y_pred, digits=4, target_names=self.target_names)
logging.info('\n' + report)

return f1

def batch2tensor(self, batch_data):
'''
[[label, doc_len, [[sent_len, [sent_id0, ...], [sent_id1, ...]], ...]]
'''
batch_size = len(batch_data)
doc_labels = []
doc_lens = []
doc_max_sent_len = []
for doc_data in batch_data:
doc_labels.append(doc_data[0])
doc_lens.append(doc_data[1])
sent_lens = [sent_data[0] for sent_data in doc_data[2]]
max_sent_len = max(sent_lens)
doc_max_sent_len.append(max_sent_len)

max_doc_len = max(doc_lens)
max_sent_len = max(doc_max_sent_len)

batch_inputs1 = torch.zeros((batch_size, max_doc_len, max_sent_len), dtype=torch.int64)
batch_inputs2 = torch.zeros((batch_size, max_doc_len, max_sent_len), dtype=torch.int64)
batch_masks = torch.zeros((batch_size, max_doc_len, max_sent_len), dtype=torch.float32)
batch_labels = torch.LongTensor(doc_labels)

for b in range(batch_size):
for sent_idx in range(doc_lens[b]):
sent_data = batch_data[b][2][sent_idx]
for word_idx in range(sent_data[0]):
batch_inputs1[b, sent_idx, word_idx] = sent_data[1][word_idx]
batch_inputs2[b, sent_idx, word_idx] = sent_data[2][word_idx]
batch_masks[b, sent_idx, word_idx] = 1

if use_cuda:
batch_inputs1 = batch_inputs1.to(device)
batch_inputs2 = batch_inputs2.to(device)
batch_masks = batch_masks.to(device)
batch_labels = batch_labels.to(device)

return (batch_inputs1, batch_inputs2, batch_masks), batch_labels
# train
trainer = Trainer(model, vocab)
trainer.train()
# test
trainer.test()

Text Classification Based on Deep Learning: BERT

import logging
import random

import numpy as np
import torch

logging.basicConfig(level=logging.INFO, format='%(asctime)-15s %(levelname)s: %(message)s')

# set seed
seed = 666
random.seed(seed)
np.random.seed(seed)
torch.cuda.manual_seed(seed)
torch.manual_seed(seed)

# set cuda
gpu = 0
use_cuda = gpu >= 0 and torch.cuda.is_available()
if use_cuda:
    torch.cuda.set_device(gpu)
    device = torch.device("cuda", gpu)
else:
    device = torch.device("cpu")
logging.info("Use cuda: %s, gpu id: %d.", use_cuda, gpu)

# split data to 10 fold
fold_num = 10
data_file = './data/train_set.csv'
import pandas as pd


def all_data2fold(fold_num, num=10000):
fold_data = []
f = pd.read_csv(data_file, sep='\t', encoding='UTF-8')
texts = f['text'].tolist()[:num]
labels = f['label'].tolist()[:num]

total = len(labels)

index = list(range(total))
np.random.shuffle(index)

all_texts = []
all_labels = []
for i in index:
all_texts.append(texts[i])
all_labels.append(labels[i])

label2id = {}
for i in range(total):
label = str(all_labels[i])
if label not in label2id:
label2id[label] = [i]
else:
label2id[label].append(i)

all_index = [[] for _ in range(fold_num)]
for label, data in label2id.items():
# print(label, len(data))
batch_size = int(len(data) / fold_num)
other = len(data) - batch_size * fold_num
for i in range(fold_num):
cur_batch_size = batch_size + 1 if i < other else batch_size
# print(cur_batch_size)
batch_data = [data[i * batch_size + b] for b in range(cur_batch_size)]
all_index[i].extend(batch_data)

batch_size = int(total / fold_num)
other_texts = []
other_labels = []
other_num = 0
start = 0
for fold in range(fold_num):
num = len(all_index[fold])
texts = [all_texts[i] for i in all_index[fold]]
labels = [all_labels[i] for i in all_index[fold]]

if num > batch_size:
fold_texts = texts[:batch_size]
other_texts.extend(texts[batch_size:])
fold_labels = labels[:batch_size]
other_labels.extend(labels[batch_size:])
other_num += num - batch_size
elif num < batch_size:
end = start + batch_size - num
fold_texts = texts + other_texts[start: end]
fold_labels = labels + other_labels[start: end]
start = end
else:
fold_texts = texts
fold_labels = labels

assert batch_size == len(fold_labels)

# shuffle
index = list(range(batch_size))
np.random.shuffle(index)

shuffle_fold_texts = []
shuffle_fold_labels = []
for i in index:
shuffle_fold_texts.append(fold_texts[i])
shuffle_fold_labels.append(fold_labels[i])

data = {'label': shuffle_fold_labels, 'text': shuffle_fold_texts}
fold_data.append(data)

logging.info("Fold lens %s", str([len(data['label']) for data in fold_data]))

return fold_data


fold_data = all_data2fold(10)

# build train, dev, test data
fold_id = 9

# dev
dev_data = fold_data[fold_id]

# train
train_texts = []
train_labels = []
for i in range(0, fold_id):
data = fold_data[i]
train_texts.extend(data['text'])
train_labels.extend(data['label'])

train_data = {'label': train_labels, 'text': train_texts}

# test
test_data_file = './data/test_a.csv'
f = pd.read_csv(test_data_file, sep='\t', encoding='UTF-8')
texts = f['text'].tolist()
test_data = {'label': [0] * len(texts), 'text': texts}

# build vocab
from collections import Counter
from transformers import BasicTokenizer

# whitespace tokenizer (defined below)
basic_tokenizer = BasicTokenizer()


class Vocab():
def __init__(self, train_data):
self.min_count = 5
self.pad = 0
self.unk = 1
self._id2word = ['[PAD]', '[UNK]']
self._id2extword = ['[PAD]', '[UNK]']

self._id2label = []
self.target_names = []

self.build_vocab(train_data)

reverse = lambda x: dict(zip(x, range(len(x))))
self._word2id = reverse(self._id2word)
self._label2id = reverse(self._id2label)

logging.info("Build vocab: words %d, labels %d." % (self.word_size, self.label_size))

def build_vocab(self, data):
self.word_counter = Counter()

for text in data['text']:
words = text.split()
for word in words:
self.word_counter[word] += 1

for word, count in self.word_counter.most_common():
if count >= self.min_count:
self._id2word.append(word)

label2name = {0: '科技', 1: '股票', 2: '体育', 3: '娱乐', 4: '时政', 5: '社会', 6: '教育', 7: '财经',
8: '家居', 9: '游戏', 10: '房产', 11: '时尚', 12: '彩票', 13: '星座'}

self.label_counter = Counter(data['label'])

for label in range(len(self.label_counter)):
count = self.label_counter[label]
self._id2label.append(label)
self.target_names.append(label2name[label])

def load_pretrained_embs(self, embfile):
with open(embfile, encoding='utf-8') as f:
lines = f.readlines()
items = lines[0].split()
word_count, embedding_dim = int(items[0]), int(items[1])

index = len(self._id2extword)
embeddings = np.zeros((word_count + index, embedding_dim))
for line in lines[1:]:
values = line.split()
self._id2extword.append(values[0])
vector = np.array(values[1:], dtype='float64')
embeddings[self.unk] += vector
embeddings[index] = vector
index += 1

embeddings[self.unk] = embeddings[self.unk] / word_count
embeddings = embeddings / np.std(embeddings)

reverse = lambda x: dict(zip(x, range(len(x))))
self._extword2id = reverse(self._id2extword)

assert len(set(self._id2extword)) == len(self._id2extword)

return embeddings

def word2id(self, xs):
if isinstance(xs, list):
return [self._word2id.get(x, self.unk) for x in xs]
return self._word2id.get(xs, self.unk)

def extword2id(self, xs):
if isinstance(xs, list):
return [self._extword2id.get(x, self.unk) for x in xs]
return self._extword2id.get(xs, self.unk)

def label2id(self, xs):
if isinstance(xs, list):
return [self._label2id.get(x, self.unk) for x in xs]
return self._label2id.get(xs, self.unk)

@property
def word_size(self):
return len(self._id2word)

@property
def extword_size(self):
return len(self._id2extword)

@property
def label_size(self):
return len(self._id2label)


vocab = Vocab(train_data)

# build module
import torch.nn as nn
import torch.nn.functional as F


class Attention(nn.Module):
def __init__(self, hidden_size):
super(Attention, self).__init__()
self.weight = nn.Parameter(torch.Tensor(hidden_size, hidden_size))
self.weight.data.normal_(mean=0.0, std=0.05)

self.bias = nn.Parameter(torch.Tensor(hidden_size))
b = np.zeros(hidden_size, dtype=np.float32)
self.bias.data.copy_(torch.from_numpy(b))

self.query = nn.Parameter(torch.Tensor(hidden_size))
self.query.data.normal_(mean=0.0, std=0.05)

def forward(self, batch_hidden, batch_masks):
# batch_hidden: b x len x hidden_size (2 * hidden_size of lstm)
# batch_masks: b x len

# linear
key = torch.matmul(batch_hidden, self.weight) + self.bias # b x len x hidden

# compute attention
outputs = torch.matmul(key, self.query) # b x len

masked_outputs = outputs.masked_fill((1 - batch_masks).bool(), float(-1e32))

attn_scores = F.softmax(masked_outputs, dim=1) # b x len

# For an all-zero mask row, filling with -1e32 yields uniform 1/len attention scores
# (while -inf would give NaN), so those positions are explicitly zeroed out below.
masked_attn_scores = attn_scores.masked_fill((1 - batch_masks).bool(), 0.0)

# sum weighted sources
batch_outputs = torch.bmm(masked_attn_scores.unsqueeze(1), key).squeeze(1) # b x hidden

return batch_outputs, attn_scores


# build word encoder
bert_path = './emb/bert-mini/'
dropout = 0.15

from transformers import BertModel


class WordBertEncoder(nn.Module):
def __init__(self):
super(WordBertEncoder, self).__init__()
self.dropout = nn.Dropout(dropout)

self.tokenizer = WhitespaceTokenizer()
self.bert = BertModel.from_pretrained(bert_path)

self.pooled = False
logging.info('Build Bert encoder with pooled {}.'.format(self.pooled))

def encode(self, tokens):
tokens = self.tokenizer.tokenize(tokens)
return tokens

def get_bert_parameters(self):
no_decay = ['bias', 'LayerNorm.weight']
optimizer_parameters = [
{'params': [p for n, p in self.bert.named_parameters() if not any(nd in n for nd in no_decay)],
'weight_decay': 0.01},
{'params': [p for n, p in self.bert.named_parameters() if any(nd in n for nd in no_decay)],
'weight_decay': 0.0}
]
return optimizer_parameters

def forward(self, input_ids, token_type_ids):
# input_ids: sen_num x bert_len
# token_type_ids: sen_num x bert_len

# sen_num x bert_len x 256, sen_num x 256
sequence_output, pooled_output = self.bert(input_ids=input_ids, token_type_ids=token_type_ids)

if self.pooled:
reps = pooled_output
else:
reps = sequence_output[:, 0, :] # sen_num x 256

if self.training:
reps = self.dropout(reps)

return reps


class WhitespaceTokenizer():
"""WhitespaceTokenizer with vocab."""

def __init__(self):
vocab_file = bert_path + 'vocab.txt'
self._token2id = self.load_vocab(vocab_file)
self._id2token = {v: k for k, v in self._token2id.items()}
self.max_len = 256
self.unk = 1

logging.info("Build Bert vocab with size %d." % (self.vocab_size))

def load_vocab(self, vocab_file):
f = open(vocab_file, 'r')
lines = f.readlines()
lines = list(map(lambda x: x.strip(), lines))
vocab = dict(zip(lines, range(len(lines))))
return vocab

def tokenize(self, tokens):
assert len(tokens) <= self.max_len - 2
tokens = ["[CLS]"] + tokens + ["[SEP]"]
output_tokens = self.token2id(tokens)
return output_tokens

def token2id(self, xs):
if isinstance(xs, list):
return [self._token2id.get(x, self.unk) for x in xs]
return self._token2id.get(xs, self.unk)

@property
def vocab_size(self):
return len(self._id2token)


# build sent encoder
sent_hidden_size = 256
sent_num_layers = 2


class SentEncoder(nn.Module):
def __init__(self, sent_rep_size):
super(SentEncoder, self).__init__()
self.dropout = nn.Dropout(dropout)

self.sent_lstm = nn.LSTM(
input_size=sent_rep_size,
hidden_size=sent_hidden_size,
num_layers=sent_num_layers,
batch_first=True,
bidirectional=True
)

def forward(self, sent_reps, sent_masks):
# sent_reps: b x doc_len x sent_rep_size
# sent_masks: b x doc_len

sent_hiddens, _ = self.sent_lstm(sent_reps) # b x doc_len x hidden*2
sent_hiddens = sent_hiddens * sent_masks.unsqueeze(2)

if self.training:
sent_hiddens = self.dropout(sent_hiddens)

return sent_hiddens


# build model
class Model(nn.Module):
def __init__(self, vocab):
super(Model, self).__init__()
self.sent_rep_size = 256
self.doc_rep_size = sent_hidden_size * 2
self.all_parameters = {}
parameters = []
self.word_encoder = WordBertEncoder()
bert_parameters = self.word_encoder.get_bert_parameters()

self.sent_encoder = SentEncoder(self.sent_rep_size)
self.sent_attention = Attention(self.doc_rep_size)
parameters.extend(list(filter(lambda p: p.requires_grad, self.sent_encoder.parameters())))
parameters.extend(list(filter(lambda p: p.requires_grad, self.sent_attention.parameters())))

self.out = nn.Linear(self.doc_rep_size, vocab.label_size, bias=True)
parameters.extend(list(filter(lambda p: p.requires_grad, self.out.parameters())))

if use_cuda:
self.to(device)

if len(parameters) > 0:
self.all_parameters["basic_parameters"] = parameters
self.all_parameters["bert_parameters"] = bert_parameters

logging.info('Build model with bert word encoder, lstm sent encoder.')

para_num = sum([np.prod(list(p.size())) for p in self.parameters()])
logging.info('Model param num: %.2f M.' % (para_num / 1e6))

def forward(self, batch_inputs):
# batch_inputs(batch_inputs1, batch_inputs2): b x doc_len x sent_len
# batch_masks : b x doc_len x sent_len
batch_inputs1, batch_inputs2, batch_masks = batch_inputs
batch_size, max_doc_len, max_sent_len = batch_inputs1.shape[0], batch_inputs1.shape[1], batch_inputs1.shape[2]
batch_inputs1 = batch_inputs1.view(batch_size * max_doc_len, max_sent_len) # sen_num x sent_len
batch_inputs2 = batch_inputs2.view(batch_size * max_doc_len, max_sent_len) # sen_num x sent_len
batch_masks = batch_masks.view(batch_size * max_doc_len, max_sent_len) # sen_num x sent_len

sent_reps = self.word_encoder(batch_inputs1, batch_inputs2) # sen_num x sent_rep_size

sent_reps = sent_reps.view(batch_size, max_doc_len, self.sent_rep_size) # b x doc_len x sent_rep_size
batch_masks = batch_masks.view(batch_size, max_doc_len, max_sent_len) # b x doc_len x max_sent_len
sent_masks = batch_masks.bool().any(2).float() # b x doc_len

sent_hiddens = self.sent_encoder(sent_reps, sent_masks) # b x doc_len x doc_rep_size
doc_reps, atten_scores = self.sent_attention(sent_hiddens, sent_masks) # b x doc_rep_size

batch_outputs = self.out(doc_reps) # b x num_labels

return batch_outputs


model = Model(vocab)

# build optimizer
learning_rate = 2e-4
bert_lr = 5e-5
decay = .75
decay_step = 1000
from transformers import AdamW, get_linear_schedule_with_warmup


class Optimizer:
def __init__(self, model_parameters, steps):
self.all_params = []
self.optims = []
self.schedulers = []

for name, parameters in model_parameters.items():
if name.startswith("basic"):
optim = torch.optim.Adam(parameters, lr=learning_rate)
self.optims.append(optim)

l = lambda step: decay ** (step // decay_step)
scheduler = torch.optim.lr_scheduler.LambdaLR(optim, lr_lambda=l)
self.schedulers.append(scheduler)
self.all_params.extend(parameters)
elif name.startswith("bert"):
optim_bert = AdamW(parameters, bert_lr, eps=1e-8)
self.optims.append(optim_bert)

scheduler_bert = get_linear_schedule_with_warmup(optim_bert, 0, steps)
self.schedulers.append(scheduler_bert)

for group in parameters:
for p in group['params']:
self.all_params.append(p)
else:
Exception("no nameed parameters.")

self.num = len(self.optims)

def step(self):
for optim, scheduler in zip(self.optims, self.schedulers):
optim.step()
scheduler.step()
optim.zero_grad()

def zero_grad(self):
for optim in self.optims:
optim.zero_grad()

def get_lr(self):
lrs = tuple(map(lambda x: x.get_lr()[-1], self.schedulers))
lr = ' %.5f' * self.num
res = lr % lrs
return res

# build dataset
def sentence_split(text, vocab, max_sent_len=256, max_segment=16):
words = text.strip().split()
document_len = len(words)

index = list(range(0, document_len, max_sent_len))
index.append(document_len)

segments = []
for i in range(len(index) - 1):
segment = words[index[i]: index[i + 1]]
assert len(segment) > 0
segment = [word if word in vocab._id2word else '<UNK>' for word in segment]
segments.append([len(segment), segment])

assert len(segments) > 0
if len(segments) > max_segment:
segment_ = int(max_segment / 2)
return segments[:segment_] + segments[-segment_:]
else:
return segments


def get_examples(data, word_encoder, vocab, max_sent_len=256, max_segment=8):
label2id = vocab.label2id
examples = []

for text, label in zip(data['text'], data['label']):
# label
id = label2id(label)

# words
sents_words = sentence_split(text, vocab, max_sent_len-2, max_segment)
doc = []
for sent_len, sent_words in sents_words:
token_ids = word_encoder.encode(sent_words)
sent_len = len(token_ids)
token_type_ids = [0] * sent_len
doc.append([sent_len, token_ids, token_type_ids])
examples.append([id, len(doc), doc])

logging.info('Total %d docs.' % len(examples))
return examples


# build loader

def batch_slice(data, batch_size):
batch_num = int(np.ceil(len(data) / float(batch_size)))
for i in range(batch_num):
cur_batch_size = batch_size if i < batch_num - 1 else len(data) - batch_size * i
docs = [data[i * batch_size + b] for b in range(cur_batch_size)]

yield docs


def data_iter(data, batch_size, shuffle=True, noise=1.0):
"""
randomly permute data, then sort by source length, and partition into batches
ensure that the length of sentences in each batch
"""

batched_data = []
if shuffle:
np.random.shuffle(data)

lengths = [example[1] for example in data]
noisy_lengths = [- (l + np.random.uniform(- noise, noise)) for l in lengths]
sorted_indices = np.argsort(noisy_lengths).tolist()
sorted_data = [data[i] for i in sorted_indices]
else:
sorted_data = data

batched_data.extend(list(batch_slice(sorted_data, batch_size)))

if shuffle:
np.random.shuffle(batched_data)

for batch in batched_data:
yield batch

# some function
from sklearn.metrics import f1_score, precision_score, recall_score


def get_score(y_true, y_pred):
y_true = np.array(y_true)
y_pred = np.array(y_pred)
f1 = f1_score(y_true, y_pred, average='macro') * 100
p = precision_score(y_true, y_pred, average='macro') * 100
r = recall_score(y_true, y_pred, average='macro') * 100

return str((reformat(p, 2), reformat(r, 2), reformat(f1, 2))), reformat(f1, 2)


def reformat(num, n):
return float(format(num, '0.' + str(n) + 'f'))


# build trainer

import time
from sklearn.metrics import classification_report

clip = 5.0
epochs = 1
early_stops = 3
log_interval = 50

test_batch_size = 16
train_batch_size = 16

save_model = './save/bert.bin'
save_test = './save/bert.csv'


class Trainer():
def __init__(self, model, vocab):
self.model = model
self.report = True

self.train_data = get_examples(train_data, model.word_encoder, vocab)
self.batch_num = int(np.ceil(len(self.train_data) / float(train_batch_size)))
self.dev_data = get_examples(dev_data, model.word_encoder, vocab)
self.test_data = get_examples(test_data, model.word_encoder, vocab)

# criterion
self.criterion = nn.CrossEntropyLoss()

# label name
self.target_names = vocab.target_names

# optimizer
self.optimizer = Optimizer(model.all_parameters, steps=self.batch_num * epochs)

# count
self.step = 0
self.early_stop = -1
self.best_train_f1, self.best_dev_f1 = 0, 0
self.last_epoch = epochs

def train(self):
logging.info('Start training...')
for epoch in range(1, epochs + 1):
train_f1 = self._train(epoch)

dev_f1 = self._eval(epoch)

if self.best_dev_f1 <= dev_f1:
logging.info(
"Exceed history dev = %.2f, current dev = %.2f" % (self.best_dev_f1, dev_f1))
torch.save(self.model.state_dict(), save_model)

self.best_train_f1 = train_f1
self.best_dev_f1 = dev_f1
self.early_stop = 0
else:
self.early_stop += 1
if self.early_stop == early_stops:
logging.info(
"Eearly stop in epoch %d, best train: %.2f, dev: %.2f" % (
epoch - early_stops, self.best_train_f1, self.best_dev_f1))
self.last_epoch = epoch
break

def test(self):
self.model.load_state_dict(torch.load(save_model))
self._eval(self.last_epoch + 1, test=True)

def _train(self, epoch):
self.optimizer.zero_grad()
self.model.train()

start_time = time.time()
epoch_start_time = time.time()
overall_losses = 0
losses = 0
batch_idx = 1
y_pred = []
y_true = []
for batch_data in data_iter(self.train_data, train_batch_size, shuffle=True):
torch.cuda.empty_cache()
batch_inputs, batch_labels = self.batch2tensor(batch_data)
batch_outputs = self.model(batch_inputs)
loss = self.criterion(batch_outputs, batch_labels)
loss.backward()

loss_value = loss.detach().cpu().item()
losses += loss_value
overall_losses += loss_value

y_pred.extend(torch.max(batch_outputs, dim=1)[1].cpu().numpy().tolist())
y_true.extend(batch_labels.cpu().numpy().tolist())

nn.utils.clip_grad_norm_(self.optimizer.all_params, max_norm=clip)
for optimizer, scheduler in zip(self.optimizer.optims, self.optimizer.schedulers):
optimizer.step()
scheduler.step()
self.optimizer.zero_grad()

self.step += 1

if batch_idx % log_interval == 0:
elapsed = time.time() - start_time

lrs = self.optimizer.get_lr()
logging.info(
'| epoch {:3d} | step {:3d} | batch {:3d}/{:3d} | lr{} | loss {:.4f} | s/batch {:.2f}'.format(
epoch, self.step, batch_idx, self.batch_num, lrs,
losses / log_interval,
elapsed / log_interval))

losses = 0
start_time = time.time()

batch_idx += 1

overall_losses /= self.batch_num
during_time = time.time() - epoch_start_time

# reformat
overall_losses = reformat(overall_losses, 4)
score, f1 = get_score(y_true, y_pred)

logging.info(
'| epoch {:3d} | score {} | f1 {} | loss {:.4f} | time {:.2f}'.format(epoch, score, f1,
overall_losses,
during_time))
if set(y_true) == set(y_pred) and self.report:
report = classification_report(y_true, y_pred, digits=4, target_names=self.target_names)
logging.info('\n' + report)

return f1

def _eval(self, epoch, test=False):
self.model.eval()
start_time = time.time()
data = self.test_data if test else self.dev_data
y_pred = []
y_true = []
with torch.no_grad():
for batch_data in data_iter(data, test_batch_size, shuffle=False):
torch.cuda.empty_cache()
batch_inputs, batch_labels = self.batch2tensor(batch_data)
batch_outputs = self.model(batch_inputs)
y_pred.extend(torch.max(batch_outputs, dim=1)[1].cpu().numpy().tolist())
y_true.extend(batch_labels.cpu().numpy().tolist())

score, f1 = get_score(y_true, y_pred)

during_time = time.time() - start_time

if test:
df = pd.DataFrame({'label': y_pred})
df.to_csv(save_test, index=False, sep=',')
else:
logging.info(
'| epoch {:3d} | dev | score {} | f1 {} | time {:.2f}'.format(epoch, score, f1,
during_time))
if set(y_true) == set(y_pred) and self.report:
report = classification_report(y_true, y_pred, digits=4, target_names=self.target_names)
logging.info('\n' + report)

return f1

def batch2tensor(self, batch_data):
'''
[[label, doc_len, [[sent_len, [sent_id0, ...], [sent_id1, ...]], ...]]
'''
batch_size = len(batch_data)
doc_labels = []
doc_lens = []
doc_max_sent_len = []
for doc_data in batch_data:
doc_labels.append(doc_data[0])
doc_lens.append(doc_data[1])
sent_lens = [sent_data[0] for sent_data in doc_data[2]]
max_sent_len = max(sent_lens)
doc_max_sent_len.append(max_sent_len)

max_doc_len = max(doc_lens)
max_sent_len = max(doc_max_sent_len)

batch_inputs1 = torch.zeros((batch_size, max_doc_len, max_sent_len), dtype=torch.int64)
batch_inputs2 = torch.zeros((batch_size, max_doc_len, max_sent_len), dtype=torch.int64)
batch_masks = torch.zeros((batch_size, max_doc_len, max_sent_len), dtype=torch.float32)
batch_labels = torch.LongTensor(doc_labels)

for b in range(batch_size):
for sent_idx in range(doc_lens[b]):
sent_data = batch_data[b][2][sent_idx]
for word_idx in range(sent_data[0]):
batch_inputs1[b, sent_idx, word_idx] = sent_data[1][word_idx]
batch_inputs2[b, sent_idx, word_idx] = sent_data[2][word_idx]
batch_masks[b, sent_idx, word_idx] = 1

if use_cuda:
batch_inputs1 = batch_inputs1.to(device)
batch_inputs2 = batch_inputs2.to(device)
batch_masks = batch_masks.to(device)
batch_labels = batch_labels.to(device)

return (batch_inputs1, batch_inputs2, batch_masks), batch_labels

# train
trainer = Trainer(model, vocab)
trainer.train()

# test
trainer.test()
