Commit a5eedaaf authored by Zhouxingyu's avatar Zhouxingyu

Product name classification

.DS_Store
data/cnews
data/thucnews
__pycache__
cnn_keras.py
.idea
*.pyc
checkpoints
tensorboard
.vscode
MIT License
Copyright (c) 2017 dzkang
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
# Text Classification with CNN and RNN
Chinese text classification using a convolutional neural network.
## Environment
- Python 2/3
- TensorFlow 1.3 or later
- numpy
- scikit-learn
- scipy
## Dataset
The training set consists of about 660,000 product names and their corresponding categories from the VW_ProductALLState table in ZI_database.
## Preprocessing
Run `data_prepare.py` with a working database connection; it generates the training, test, and validation sets from the database. Each line of the generated files is a category label and a product name separated by a tab.
`data/cnews_loader.py` contains the data preprocessing utilities.
- `read_file()`: reads the data file;
- `build_vocab()`: builds a character-level vocabulary and stores it on disk so it does not have to be rebuilt on every run;
- `read_vocab()`: reads the stored vocabulary and converts it into a `{word: id}` mapping;
- `read_category()`: fixes the category list and converts it into a `{category: id}` mapping;
- `to_words()`: converts an id-encoded sample back into text;
- `process_file()`: converts a dataset from text into fixed-length id sequences;
- `batch_iter()`: yields shuffled batches for training (see the usage sketch after the table below).

After preprocessing, the data have the following shapes:

| Data | Shape | Data | Shape |
| :---------- | :---------- | :---------- | :---------- |
| x_train | [50000, 600] | y_train | [50000, 10] |
| x_val | [5000, 600] | y_val | [5000, 10] |
| x_test | [10000, 600] | y_test | [10000, 10] |
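A minimal sketch of how these loaders fit together (paths follow the `data/name2category` layout used by `run_cnn.py`, and `read_category()` expects `VW_ProductALLState.xlsx` in the working directory):
```python
import os
from data.cnews_loader import build_vocab, read_vocab, read_category, process_file, batch_iter

base_dir = 'data/name2category'
train_dir = os.path.join(base_dir, 'name2category.train.txt')
vocab_dir = os.path.join(base_dir, 'name2category.vocab.txt')

if not os.path.exists(vocab_dir):          # build the character vocabulary once
    build_vocab(train_dir, vocab_dir, vocab_size=5000)
words, word_to_id = read_vocab(vocab_dir)  # {char: id}
categories, cat_to_id = read_category()    # {category: id}, read from VW_ProductALLState.xlsx

# Encode the file as fixed-length id sequences and one-hot labels (300 = seq_length in TCNNConfig).
x_train, y_train = process_file(train_dir, word_to_id, cat_to_id, max_length=300)
for x_batch, y_batch in batch_iter(x_train, y_train, batch_size=64):
    pass  # feed each shuffled batch to the model
```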
## CNN (Convolutional Neural Network)
### Configuration
The configurable CNN parameters are shown below; they live in `cnn_model.py`.
```python
class TCNNConfig(object):
    """CNN configuration parameters"""

    embedding_dim = 64       # embedding dimension
    seq_length = 600         # sequence length
    num_classes = len(set(table['SubCategoryName'].tolist()))  # number of classes (`table` is the VW_ProductALLState DataFrame)
    num_filters = 128        # number of convolution filters
    kernel_size = 5          # convolution kernel size
    vocab_size = 5000        # vocabulary size

    hidden_dim = 128         # fully connected layer units (as the number of classes grows, this layer can be widened/deepened and dropout added)
    dropout_keep_prob = 0.5  # dropout keep probability
    learning_rate = 1e-3     # learning rate

    batch_size = 64          # training batch size
    num_epochs = 10          # total number of epochs

    print_per_batch = 100    # report results every N batches
    save_per_batch = 10      # write to TensorBoard every N batches
```
### CNN model
See `cnn_model.py` for the implementation.
### Training and validation
#### Note: the example below uses 10 classes; with more than a thousand classes the results may deviate substantially!
Run `python run_cnn.py train` to start training.
> If you have trained before, delete `tensorboard/textcnn` first, so that TensorBoard results from multiple runs do not overlap.
```
Configuring CNN model...
Configuring TensorBoard and Saver...
Loading training and validation data...
Time usage: 0:00:14
Training and evaluating...
Epoch: 1
Iter: 0, Train Loss: 2.3, Train Acc: 10.94%, Val Loss: 2.3, Val Acc: 8.92%, Time: 0:00:01 *
Iter: 100, Train Loss: 0.88, Train Acc: 73.44%, Val Loss: 1.2, Val Acc: 68.46%, Time: 0:00:04 *
Iter: 200, Train Loss: 0.38, Train Acc: 92.19%, Val Loss: 0.75, Val Acc: 77.32%, Time: 0:00:07 *
Iter: 300, Train Loss: 0.22, Train Acc: 92.19%, Val Loss: 0.46, Val Acc: 87.08%, Time: 0:00:09 *
Iter: 400, Train Loss: 0.24, Train Acc: 90.62%, Val Loss: 0.4, Val Acc: 88.62%, Time: 0:00:12 *
Iter: 500, Train Loss: 0.16, Train Acc: 96.88%, Val Loss: 0.36, Val Acc: 90.38%, Time: 0:00:15 *
Iter: 600, Train Loss: 0.084, Train Acc: 96.88%, Val Loss: 0.35, Val Acc: 91.36%, Time: 0:00:17 *
Iter: 700, Train Loss: 0.21, Train Acc: 93.75%, Val Loss: 0.26, Val Acc: 92.58%, Time: 0:00:20 *
Epoch: 2
Iter: 800, Train Loss: 0.07, Train Acc: 98.44%, Val Loss: 0.24, Val Acc: 94.12%, Time: 0:00:23 *
Iter: 900, Train Loss: 0.092, Train Acc: 96.88%, Val Loss: 0.27, Val Acc: 92.86%, Time: 0:00:25
Iter: 1000, Train Loss: 0.17, Train Acc: 95.31%, Val Loss: 0.28, Val Acc: 92.82%, Time: 0:00:28
Iter: 1100, Train Loss: 0.2, Train Acc: 93.75%, Val Loss: 0.23, Val Acc: 93.26%, Time: 0:00:31
Iter: 1200, Train Loss: 0.081, Train Acc: 98.44%, Val Loss: 0.25, Val Acc: 92.96%, Time: 0:00:33
Iter: 1300, Train Loss: 0.052, Train Acc: 100.00%, Val Loss: 0.24, Val Acc: 93.58%, Time: 0:00:36
Iter: 1400, Train Loss: 0.1, Train Acc: 95.31%, Val Loss: 0.22, Val Acc: 94.12%, Time: 0:00:39
Iter: 1500, Train Loss: 0.12, Train Acc: 98.44%, Val Loss: 0.23, Val Acc: 93.58%, Time: 0:00:41
Epoch: 3
Iter: 1600, Train Loss: 0.1, Train Acc: 96.88%, Val Loss: 0.26, Val Acc: 92.34%, Time: 0:00:44
Iter: 1700, Train Loss: 0.018, Train Acc: 100.00%, Val Loss: 0.22, Val Acc: 93.46%, Time: 0:00:47
Iter: 1800, Train Loss: 0.036, Train Acc: 100.00%, Val Loss: 0.28, Val Acc: 92.72%, Time: 0:00:50
No optimization for a long time, auto-stopping...
```
The best validation accuracy was 94.12%, and training stopped automatically after only 3 epochs.
Accuracy and loss are shown in the figure:
![images](images/acc_loss.png)
### Testing
#### Note: with too many classes the confusion matrix becomes huge; use this feature with caution!
Run `python run_cnn.py test` to evaluate on the test set.
```
Configuring CNN model...
Loading test data...
Testing...
Test Loss: 0.14, Test Acc: 96.04%
Precision, Recall and F1-Score...
precision recall f1-score support
体育 0.99 0.99 0.99 1000
财经 0.96 0.99 0.97 1000
房产 1.00 1.00 1.00 1000
家居 0.95 0.91 0.93 1000
教育 0.95 0.89 0.92 1000
科技 0.94 0.97 0.95 1000
时尚 0.95 0.97 0.96 1000
时政 0.94 0.94 0.94 1000
游戏 0.97 0.96 0.97 1000
娱乐 0.95 0.98 0.97 1000
avg / total 0.96 0.96 0.96 10000
Confusion Matrix...
[[991 0 0 0 2 1 0 4 1 1]
[ 0 992 0 0 2 1 0 5 0 0]
[ 0 1 996 0 1 1 0 0 0 1]
[ 0 14 0 912 7 15 9 29 3 11]
[ 2 9 0 12 892 22 18 21 10 14]
[ 0 0 0 10 1 968 4 3 12 2]
[ 1 0 0 9 4 4 971 0 2 9]
[ 1 16 0 4 18 12 1 941 1 6]
[ 2 4 1 5 4 5 10 1 962 6]
[ 1 0 1 6 4 3 5 0 1 979]]
Time usage: 0:00:05
```
Accuracy on the test set reached 96.04%, with precision, recall, and F1-score above 0.9 for every class.
The confusion matrix also shows that the classification quality is very good.
### Prediction
Run `python run_cnn.py predict` to classify the prediction set.
Name the prediction set `name2category.predict.txt` and place it in the `name2category` folder under `data`, one product name per line.
The output is written to the working directory as `predicted_data.txt`.
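A minimal sketch of the round trip (the product name below is only an example):
```python
# Write one product name per line, then read back the predictions, which
# run_cnn.py writes as "<category>\t<product name>" lines in predicted_data.txt.
names = ['格林王 GREENKING 英国原装进口 IPA印度麦啤 精酿啤酒印度淡色艾尔啤酒 500ML*6瓶 整箱装']
with open('data/name2category/name2category.predict.txt', 'w', encoding='utf-8') as f:
    f.write('\n'.join(names) + '\n')

# ... run `python run_cnn.py predict` ...

with open('predicted_data.txt', encoding='utf-8') as f:
    for line in f:
        category, name = line.rstrip('\n').split('\t', 1)
        print(category, name)
```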
## RNN (Recurrent Neural Network)
### Similar to the above; this code has not been adapted yet, but it can be modified along the lines of the CNN code to train your own model.
# coding: utf-8
import tensorflow as tf
class TCNNConfig(object):
    """CNN configuration parameters"""

    embedding_dim = 64       # embedding dimension
    seq_length = 300         # sequence length
    num_classes = 1078       # number of classes
    num_filters = 256        # number of convolution filters
    kernel_size = 5          # convolution kernel size
    vocab_size = 5000        # vocabulary size

    hidden_dim = 128         # fully connected layer units
    dropout_keep_prob = 0.5  # dropout keep probability
    learning_rate = 1e-3     # learning rate

    batch_size = 64          # training batch size
    num_epochs = 10          # total number of epochs

    print_per_batch = 100    # report results every N batches
    save_per_batch = 10      # write to TensorBoard every N batches
class TextCNN(object):
    """Text classification, CNN model"""

    def __init__(self, config):
        self.config = config

        # The three inputs to be fed
        self.input_x = tf.placeholder(tf.int32, [None, self.config.seq_length], name='input_x')
        self.input_y = tf.placeholder(tf.float32, [None, self.config.num_classes], name='input_y')
        self.keep_prob = tf.placeholder(tf.float32, name='keep_prob')

        self.cnn()

    def cnn(self):
        """Build the CNN graph"""
        # Embedding layer
        with tf.device('/cpu:0'):
            embedding = tf.get_variable('embedding', [self.config.vocab_size, self.config.embedding_dim])
            embedding_inputs = tf.nn.embedding_lookup(embedding, self.input_x)

        with tf.name_scope("cnn"):
            # CNN layer
            conv = tf.layers.conv1d(embedding_inputs, self.config.num_filters, self.config.kernel_size, name='conv')
            # global max pooling layer
            gmp = tf.reduce_max(conv, reduction_indices=[1], name='gmp')

        with tf.name_scope("score"):
            # Fully connected layer, followed by dropout and ReLU activation
            fc = tf.layers.dense(gmp, self.config.hidden_dim, name='fc1')
            fc = tf.contrib.layers.dropout(fc, self.keep_prob)
            fc = tf.nn.relu(fc)

            # Classifier
            self.logits = tf.layers.dense(fc, self.config.num_classes, name='fc2')
            self.y_pred_cls = tf.argmax(tf.nn.softmax(self.logits), 1)  # predicted class

        with tf.name_scope("optimize"):
            # Cross-entropy loss
            cross_entropy = tf.nn.softmax_cross_entropy_with_logits(logits=self.logits, labels=self.input_y)
            self.loss = tf.reduce_mean(cross_entropy)
            # Optimizer
            self.optim = tf.train.AdamOptimizer(learning_rate=self.config.learning_rate).minimize(self.loss)

        with tf.name_scope("accuracy"):
            # Accuracy
            correct_pred = tf.equal(tf.argmax(self.input_y, 1), self.y_pred_cls)
            self.acc = tf.reduce_mean(tf.cast(correct_pred, tf.float32))
# coding: utf-8
import sys
from collections import Counter
import numpy as np
import tensorflow.keras as kr
import pandas as pd
if sys.version_info[0] > 2:
is_py3 = True
else:
reload(sys)
sys.setdefaultencoding("utf-8")
is_py3 = False
def native_word(word, encoding='utf-8'):
"""如果在python2下面使用python3训练的模型,可考虑调用此函数转化一下字符编码"""
if not is_py3:
return word.encode(encoding)
else:
return word
def native_content(content):
if not is_py3:
return content.decode('utf-8')
else:
return content
def open_file(filename, mode='r'):
"""
    Common file-open helper that works under both Python 2 and Python 3.
mode: 'r' or 'w' for read or write
"""
if is_py3:
return open(filename, mode, encoding='utf-8', errors='ignore')
else:
return open(filename, mode)
def read_file(filename):
"""读取文件数据"""
contents, labels = [], []
with open_file(filename) as f:
for line in f:
try:
label, content = line.strip().split('\t')
if content:
contents.append(list(native_content(content)))
labels.append(native_content(label))
except:
pass
return contents, labels
def build_vocab(train_dir, vocab_dir, vocab_size=5000):
"""根据训练集构建词汇表,存储"""
data_train, _ = read_file(train_dir)
all_data = []
for content in data_train:
all_data.extend(content)
counter = Counter(all_data)
count_pairs = counter.most_common(vocab_size - 1)
words, _ = list(zip(*count_pairs))
    # Add a <PAD> token so that all texts can be padded to the same length
words = ['<PAD>'] + list(words)
open_file(vocab_dir, mode='w').write('\n'.join(words) + '\n')
def read_vocab(vocab_dir):
"""读取词汇表"""
# words = open_file(vocab_dir).read().strip().split('\n')
with open_file(vocab_dir) as fp:
        # Under Python 2, convert every entry to unicode
words = [native_content(_.strip()) for _ in fp.readlines()]
word_to_id = dict(zip(words, range(len(words))))
return words, word_to_id
def read_category():
"""读取分类目录,固定"""
table = pd.read_excel('VW_ProductALLState.xlsx')
categories = list(set(table['SubCategoryName'].tolist()))
#categories = ['0', '1']
categories = [native_content(x) for x in categories]
cat_to_id = dict(zip(categories, range(len(categories))))
return categories, cat_to_id
def to_words(content, words):
"""将id表示的内容转换为文字"""
return ''.join(words[x] for x in content)
def process_file(filename, word_to_id, cat_to_id, max_length=600):
"""将文件转换为id表示"""
contents, labels = read_file(filename)
data_id, label_id = [], []
for i in range(len(contents)):
data_id.append([word_to_id[x] for x in contents[i] if x in word_to_id])
label_id.append(cat_to_id[labels[i]])
    # Use Keras pad_sequences to pad the texts to a fixed length
x_pad = kr.preprocessing.sequence.pad_sequences(data_id, max_length)
    y_pad = kr.utils.to_categorical(label_id, num_classes=len(cat_to_id))  # convert labels to one-hot representation
return x_pad, y_pad
def batch_iter(x, y, batch_size=64):
"""生成批次数据"""
data_len = len(x)
num_batch = int((data_len - 1) / batch_size) + 1
indices = np.random.permutation(np.arange(data_len))
x_shuffle = x[indices]
y_shuffle = y[indices]
for i in range(num_batch):
start_id = i * batch_size
end_id = min((i + 1) * batch_size, data_len)
yield x_shuffle[start_id:end_id], y_shuffle[start_id:end_id]
#!/bin/bash
# copy MAXCOUNT files from each directory
MAXCOUNT=6500
for category in $( ls THUCNews); do
echo item: $category
dir=THUCNews/$category
newdir=data/thucnews/$category
if [ -d $newdir ]; then
rm -rf $newdir
mkdir $newdir
fi
COUNTER=1
for i in $(ls $dir); do
cp $dir/$i $newdir
if [ $COUNTER -ge $MAXCOUNT ]
then
echo finished
break
fi
let COUNTER=COUNTER+1
done
done
格林王 GREENKING 英国原装进口 IPA印度麦啤 精酿啤酒印度淡色艾尔啤酒 500ML*6瓶 整箱装
<PAD>
0
1
2
A
5
E
S
M
I
3
L
-
N
C
O
/
G
4
D
T
6
8
B
R
P
H
7
U
K
9
o
n
F
r
W
g
w
X
Y
V
.
J
线
×
Z
*
+
(
)
Q
m
e
绿
i
便
a
l
c
&
s
西
t
湿
广
d
u
p
h
y
鹿
竿
b
k
,
v
稿
%
x
穿
f
#
宿
z
:
亿
齿
使
仿
·
耀
_
姿
鸿
?
'
°
[
]
尿
"
寿
;
\
Φ
j
轿
=
 
沿
 
~
怀
访
|
@
q
簿
鱿
驿
Α
退
φ
>
±
<
Ф
殿
!
屿
槿
Π
楿
`
ü
é
˙
Μ
廿
Ω
à
椿
Ó
É
α
À
Β
Δ
饿
Λ
π
Á
涿
ò
麿
{
}
Ò
調
÷
^
ф
Γ
è
ê
祿
Ü
羿
º
²
⽿

μ
import pandas as pd
from function import Index
table = pd.read_excel('VW_ProductALLState.xlsx')
#category_set = list(set(table['SubCategoryName'].tolist()))
#len(list(category_set))
f = open('D:/Users/86183/Desktop/text-classification-cnn-rnn/data/name2category/name2category.val.txt','w',encoding='utf-8')
g = open('D:/Users/86183/Desktop/text-classification-cnn-rnn/data/name2category/name2category.train.txt','w',encoding='utf-8')
h = open('D:/Users/86183/Desktop/text-classification-cnn-rnn/data/name2category/name2category.test.txt','w',encoding='utf-8')
m = 0
len_table = len(table)
index = Index()
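# Split: rows with m % 13 == 1 go to the validation set, rows with m % 13 in {2, 3} go to the test set,
# everything else goes to the training set; rows whose category contains '错误子类' are skipped.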
for category,name in zip(table['SubCategoryName'],table['ProductName']):
name = str(name).replace('\t',' ')
if '错误子类' not in category:
if m%13 == 1:
f.write(f'{category}\t{name}\n')
elif m%13 in [2,3]:
h.write(f'{category}\t{name}\n')
else:
g.write(f'{category}\t{name}\n')
m += 1
s = m/len_table
print(index(m, len_table-1),end = '%')
# -*- coding: utf-8 -*-
import pymssql
import pandas as pd
import re
from lxml import etree
from sqlalchemy import create_engine
'''
class product():
def __init__(self, product_name, product_SKU, product_class_num = 'na'):
self.product_name = str(product_name)
self.product_SKU = str(product_SKU)
self.product_class = str(product_class_num)
def get_parameter(self, **kwargs):
'''
class sql_find():
def __init__ (self, database='ZI_DataBase', localhost=True):
if localhost:
self.conn = pymssql.connect(host='localhost', user='zgc',password='1234',database=database,autocommit=True)
self.engine = create_engine(f'mssql+pymssql://zgc:1234@localhost/{database}')
else:
self.conn = pymssql.connect(host='123.56.115.207', user='zgcprice3311',password='admin@2018@)!*',database=database,autocommit=True)
self.engine = create_engine(f'mssql+pymssql://zgcprice3311:admin@2018@)!*@123.56.115.207/{database}')
self.cursor = self.conn.cursor()
class mysql_find():
def __init__ (self, database='ZI_DataBase', localhost=True):
if localhost:
self.conn = pymssql.connect(host='localhost', user='zgc',password='1234',database=database,autocommit=True)
else:
self.conn = pymssql.connect(host='59.110.219.171', user='root',password='qwertyuiop1',database=database,autocommit=True)
self.cursor = self.conn.cursor()
'''
def execute(self, sql_sentence):
self.cursor.execute(sql_sentence)
return self.cursor
'''
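# BN: normalize a brand string -- strip the bracketed country part, keep Chinese characters and digits,
# and fall back to the uppercased ASCII letters/digits when nothing else remains.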
def BN(brand):
brand = str(brand)
try:
country = brand.split('[')[1].split(']')[-2]
brand = brand.replace(country,'')
except IndexError:
pass
res = re.findall(r'[0-9\u4E00-\u9FA5]', brand)
new_res = ''.join(res)
if new_res.isdigit():
new_res = ''
#print(len(new_res))
if len(new_res) == 0:
res1 = re.findall(r'[a-zA-Z0-9]', brand)
new_res = ''.join(res1)
new_res = new_res.upper()
return new_res
class Index(object):
    def __init__(self, number=50, decimal=2):
        """
        :param number: number of '#' characters in a full bar
        :param decimal: number of decimal places to keep in the percentage
        """
        self.decimal = decimal
        self.number = number
        self.a = 100/number  # percentage step per '#' character

    def __call__(self, now, total):
        # 1. current percentage
        percentage = self.percentage_number(now, total)
        # 2. number of '#' characters for this percentage
        well_num = int(percentage / self.a)
        # print("well_num: ", well_num, percentage)
        # 3. build the text progress bar
        progress_bar_num = self.progress_bar(well_num)
        # 4. completed progress-bar string
        result = "\r%s %s" % (progress_bar_num, percentage)
        return result

    def percentage_number(self, now, total):
        """
        Compute the percentage.
        :param now: current count
        :param total: total count
        :return: percentage
        """
        return round(now / total * 100, self.decimal)

    def progress_bar(self, num):
        """
        Render the progress bar.
        :param num: number of '#' characters to display
        :return: the current progress-bar string
        """
        # 1. the '#' characters
        well_num = "#" * num
        # 2. the padding spaces
        space_num = " " * (self.number - num)
        return '[%s%s]' % (well_num, space_num)
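# Example usage: Index()(30, 100) returns '\r[###...   ] 30.0' (15 of 50 slots filled at 30%).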
def brand_table_create():
sql_ZIdatabase = sql_find('ZI_DataBase', False)
sql_ZIdatabase.cursor.execute('select BrandID,BrandName from ZI_BrandList')
brand_table = sql_ZIdatabase.cursor.fetchall()
brand_table = pd.DataFrame(brand_table,columns=[tuple[0] for tuple in sql_ZIdatabase.cursor.description])
chinese_brand_lyst = []
english_brand_lyst = []
for brandname in brand_table['BrandName']:
if '错误品牌' in brandname:
chinese_brand_lyst.append('该条跳过!')
english_brand_lyst.append('该条跳过!')
elif '/' in brandname:
chinese_brand_lyst.append(brandname.split('/')[0])
english_brand_lyst.append(brandname.split('/')[1])
else:
chinese_brand_lyst.append(brandname)
english_brand_lyst.append('该条跳过!')
brand_table['中文品牌'] = chinese_brand_lyst
brand_table['英文品牌'] = english_brand_lyst
return brand_table
class tool():
def __init__(self):
self.peijian_table = pd.read_excel('是否需要配件.xlsx')
        print('Building the brand table...')
self.brand_table = brand_table_create()
        print('Brand table built.')
def judge_brand(self, brand, brandcode_original):
brandcode_original = str(brandcode_original).zfill(5)[-5:]
#print(brandcode_original)
#print(self.brand_table[self.brand_table['BrandID']==brandcode_original]['BrandName'].tolist())
if brandcode_original == '应指数品牌' or '错误品牌' in self.brand_table[self.brand_table['BrandID']==brandcode_original]['BrandName'].tolist()[0]:
BRANDID = '没有对应指数品牌'
for ID,Chinese_brand,English_brand in zip(self.brand_table['BrandID'], self.brand_table['中文品牌'], self.brand_table['英文品牌']):
if brand == Chinese_brand:
BRANDID = str(ID).zfill(5)
elif BN(brand) == English_brand:
BRANDID = str(ID).zfill(5)
else:
BRANDID = brandcode_original
return BRANDID
def judge_peijian(self, data_table):
ispeijian_lyst = []
isunique_lyst = []
for class_code in data_table['指数子类编码']:
mark = '0'
mark2 = '0'
class_code = str(class_code).zfill(4)
if class_code != '没有匹配的指数子类编码':
for categorycode, ispeijian, isunique in zip(self.peijian_table['categorycode'], self.peijian_table['ispeijian'], self.peijian_table['isunique']):
if class_code == str(categorycode).zfill(4):
if str(ispeijian) != '0':
mark = '1'
if str(isunique) != '0':
mark2 = '1'
break
ispeijian_lyst.append(mark)
isunique_lyst.append(mark2)
else:
ispeijian_lyst.append(mark)
isunique_lyst.append(mark2)
#print(len(ispeijian_lyst), len(data_table['指数子类编码']))
data_table['有无配件'] = ispeijian_lyst
data_table['型号_only'] = isunique_lyst
return data_table
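# judge_unit: return False when an alphanumeric token is a number followed by a unit
# (e.g. '500ML', with units MM/CM/DM/ML/W/KW), i.e. a size or quantity rather than a
# model number; return True otherwise.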
def judge_unit(string):
unit_list = {'MM','CM', 'DM', 'ML', 'W', 'KW'}
if not string[0].isdigit():
return True
m = 0
for char in string:
if char.isdigit() or char == '.':
m += 1
continue
elif char.isalpha():
if string[m:].upper() in unit_list:
return False
else:
return True
return True
def type_extract_JD(name, params, brand):
#params = eval(params)
try:
brand_remove = re.findall(r"[A-Za-z0-9]+", brand)[0].upper()
except IndexError:
brand_remove = '没有英文品牌!'
param_xinghao = 'NA'
if '产品型号' in params:
param_xinghao = params['产品型号']
if '型号' in params:
param_xinghao = params['型号']
elif r'\t型号\t' in params:
param_xinghao = params[r'\t型号\t']
name_xinghao_lyst = list(filter(lambda x: len(x) >= 2, re.findall(r"[A-Za-z0-9-+/.*]+", name)))
for i in range(len(name_xinghao_lyst)):
name_xinghao_lyst[i] = name_xinghao_lyst[i].upper()
try:
name_xinghao_lyst.remove(brand_remove)
except ValueError:
pass
if len(name_xinghao_lyst) == 0:
#type_lyst.append(param_xinghao.upper())
return param_xinghao.upper()
else:
if param_xinghao in name_xinghao_lyst:
#type_lyst.append(param_xinghao.upper())
return param_xinghao.upper()
else:
xinghao_data = max(name_xinghao_lyst, key=len)
for xinghao in name_xinghao_lyst:
if len(xinghao) > 2 and '*' not in xinghao and judge_unit(xinghao):
xinghao_data = xinghao
break
if not judge_unit(xinghao_data):
                xinghao_data = 'NA'
#type_lyst.append(xinghao_data.upper())
return xinghao_data
def type_extract(name, params):
#params = eval(params)
param_xinghao = 'NA'
if '型号' in params:
param_xinghao = params['型号']
elif r'\t型号\t' in params:
param_xinghao = params[r'\t型号\t']
name_xinghao_lyst = list(filter(lambda x: len(x) >= 2, re.findall(r"[A-Za-z0-9-+/.*]+", name)))
if len(name_xinghao_lyst) == 0:
#type_lyst.append(param_xinghao.upper())
return param_xinghao.upper()
else:
if param_xinghao in name_xinghao_lyst:
#type_lyst.append(param_xinghao.upper())
return param_xinghao.upper()
else:
xinghao_data = max(name_xinghao_lyst, key=len)
for xinghao in name_xinghao_lyst:
if len(xinghao) > 2 and '*' not in xinghao and judge_unit(xinghao):
xinghao_data = xinghao
break
if not judge_unit(xinghao_data):
                xinghao_data = 'NA'
#type_lyst.append(xinghao_data.upper())
return xinghao_data.upper()
def param_load(product_id, xml_string):
"""
传入sku,和xml原始代码
:param product_id:sku
:param xml_string:xml数据
:return:csv
"""
xml_str = etree.HTML(xml_string)
#title = xml_str.xpath("//th[@class='tdTitle']")
secend = xml_str.xpath("//td[@class='tdTitle']")
zhi = xml_str.xpath("//tr//td[position()>1]")
data_dict = {}
for j, k in zip(secend, zhi):
#item = i.xpath("./text()")[0]
sec = j.xpath("./text()")[0]
value = k.xpath("./text()")[0]
data_dict[sec] = value
return data_dict
if __name__ == '__main__':
a = brand_table_create()
'错误品牌' in a[a['BrandID']=='08358']['BrandName'].tolist()[0]
#!/usr/bin/python
# -*- coding: utf-8 -*-
"""
Combine the text files into train, test, and val files
"""
import os
def _read_file(filename):
"""读取一个文件并转换为一行"""
with open(filename, 'r', encoding='utf-8') as f:
return f.read().replace('\n', '').replace('\t', '').replace('\u3000', '')
def save_file(dirname):
"""
    Combine many files and store them in 3 files
    dirname: source data directory
    Output line format: category\tcontent
"""
f_train = open('data/cnews/cnews.train.txt', 'w', encoding='utf-8')
f_test = open('data/cnews/cnews.test.txt', 'w', encoding='utf-8')
f_val = open('data/cnews/cnews.val.txt', 'w', encoding='utf-8')
    for category in os.listdir(dirname):  # one directory per category
cat_dir = os.path.join(dirname, category)
if not os.path.isdir(cat_dir):
continue
files = os.listdir(cat_dir)
count = 0
for cur_file in files:
filename = os.path.join(cat_dir, cur_file)
content = _read_file(filename)
if count < 5000:
f_train.write(category + '\t' + content + '\n')
elif count < 6000:
f_test.write(category + '\t' + content + '\n')
else:
f_val.write(category + '\t' + content + '\n')
count += 1
print('Finished:', category)
f_train.close()
f_test.close()
f_val.close()
if __name__ == '__main__':
save_file('data/thucnews')
print(len(open('data/cnews/cnews.train.txt', 'r', encoding='utf-8').readlines()))
print(len(open('data/cnews/cnews.test.txt', 'r', encoding='utf-8').readlines()))
print(len(open('data/cnews/cnews.val.txt', 'r', encoding='utf-8').readlines()))
#!/bin/bash
# copy MAXCOUNT files from each directory
MAXCOUNT=6500
for category in $( ls THUCNews); do
echo item: $category
dir=THUCNews/$category
newdir=data/thucnews/$category
if [ -d $newdir ]; then
rm -rf $newdir
mkdir $newdir
fi
COUNTER=1
for i in $(ls $dir); do
cp $dir/$i $newdir
if [ $COUNTER -ge $MAXCOUNT ]
then
echo finished
break
fi
let COUNTER=COUNTER+1
done
done
# coding: utf-8
from __future__ import print_function
import os
import tensorflow as tf
import tensorflow.contrib.keras as kr
from cnn_model import TCNNConfig, TextCNN
from data.cnews_loader import read_category, read_vocab
try:
bool(type(unicode))
except NameError:
unicode = str
base_dir = 'data/cnews'
vocab_dir = os.path.join(base_dir, 'cnews.vocab.txt')
save_dir = 'checkpoints/textcnn'
save_path = os.path.join(save_dir, 'best_validation')  # path of the best validation checkpoint
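# CnnModel wraps the trained TextCNN checkpoint so that raw text can be classified directly.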
class CnnModel:
def __init__(self):
self.config = TCNNConfig()
self.categories, self.cat_to_id = read_category()
self.words, self.word_to_id = read_vocab(vocab_dir)
self.config.vocab_size = len(self.words)
self.model = TextCNN(self.config)
self.session = tf.Session()
self.session.run(tf.global_variables_initializer())
saver = tf.train.Saver()
        saver.restore(sess=self.session, save_path=save_path)  # restore the saved model
def predict(self, message):
        # Models trained under either Python 2 or Python 3 can be run under either version
content = unicode(message)
data = [self.word_to_id[x] for x in content if x in self.word_to_id]
feed_dict = {
self.model.input_x: kr.preprocessing.sequence.pad_sequences([data], self.config.seq_length),
self.model.keep_prob: 1.0
}
y_pred_cls = self.session.run(self.model.y_pred_cls, feed_dict=feed_dict)
return self.categories[y_pred_cls[0]]
if __name__ == '__main__':
cnn_model = CnnModel()
test_demo = ['三星ST550以全新的拍摄方式超越了以往任何一款数码相机',
'热火vs骑士前瞻:皇帝回乡二番战 东部次席唾手可得新浪体育讯北京时间3月30日7:00']
for i in test_demo:
print(cnn_model.predict(i))
迷你植物 格林王 GREENKING 英国原装进口 IPA印度麦啤 精酿啤酒印度淡色艾尔啤酒 500ML*6瓶 整箱装
scikit-learn
scipy
numpy
#!/usr/bin/python
# -*- coding: utf-8 -*-
import tensorflow as tf
class TRNNConfig(object):
    """RNN configuration parameters"""

    # Model parameters
    embedding_dim = 64       # embedding dimension
    seq_length = 600         # sequence length
    num_classes = 8          # number of classes
    vocab_size = 5000        # vocabulary size

    num_layers = 2           # number of hidden layers
    hidden_dim = 128         # hidden layer units
    rnn = 'gru'              # 'lstm' or 'gru'

    dropout_keep_prob = 0.8  # dropout keep probability
    learning_rate = 1e-3     # learning rate

    batch_size = 128         # training batch size
    num_epochs = 10          # total number of epochs

    print_per_batch = 100    # report results every N batches
    save_per_batch = 10      # write to TensorBoard every N batches
class TextRNN(object):
    """Text classification, RNN model"""

    def __init__(self, config):
        self.config = config

        # The three inputs to be fed
        self.input_x = tf.placeholder(tf.int32, [None, self.config.seq_length], name='input_x')
        self.input_y = tf.placeholder(tf.float32, [None, self.config.num_classes], name='input_y')
        self.keep_prob = tf.placeholder(tf.float32, name='keep_prob')

        self.rnn()

    def rnn(self):
        """Build the RNN graph"""

        def lstm_cell():  # LSTM cell
            return tf.contrib.rnn.BasicLSTMCell(self.config.hidden_dim, state_is_tuple=True)

        def gru_cell():  # GRU cell
            return tf.contrib.rnn.GRUCell(self.config.hidden_dim)

        def dropout():  # add a dropout layer after each RNN cell
            if (self.config.rnn == 'lstm'):
                cell = lstm_cell()
            else:
                cell = gru_cell()
            return tf.contrib.rnn.DropoutWrapper(cell, output_keep_prob=self.keep_prob)

        # Embedding layer
        with tf.device('/cpu:0'):
            embedding = tf.get_variable('embedding', [self.config.vocab_size, self.config.embedding_dim])
            embedding_inputs = tf.nn.embedding_lookup(embedding, self.input_x)

        with tf.name_scope("rnn"):
            # Multi-layer RNN
            cells = [dropout() for _ in range(self.config.num_layers)]
            rnn_cell = tf.contrib.rnn.MultiRNNCell(cells, state_is_tuple=True)

            _outputs, _ = tf.nn.dynamic_rnn(cell=rnn_cell, inputs=embedding_inputs, dtype=tf.float32)
            last = _outputs[:, -1, :]  # use the output at the last time step as the result

        with tf.name_scope("score"):
            # Fully connected layer, followed by dropout and ReLU activation
            fc = tf.layers.dense(last, self.config.hidden_dim, name='fc1')
            fc = tf.contrib.layers.dropout(fc, self.keep_prob)
            fc = tf.nn.relu(fc)

            # Classifier
            self.logits = tf.layers.dense(fc, self.config.num_classes, name='fc2')
            self.y_pred_cls = tf.argmax(tf.nn.softmax(self.logits), 1)  # predicted class

        with tf.name_scope("optimize"):
            # Cross-entropy loss
            cross_entropy = tf.nn.softmax_cross_entropy_with_logits(logits=self.logits, labels=self.input_y)
            self.loss = tf.reduce_mean(cross_entropy)
            # Optimizer
            self.optim = tf.train.AdamOptimizer(learning_rate=self.config.learning_rate).minimize(self.loss)

        with tf.name_scope("accuracy"):
            # Accuracy
            correct_pred = tf.equal(tf.argmax(self.input_y, 1), self.y_pred_cls)
            self.acc = tf.reduce_mean(tf.cast(correct_pred, tf.float32))
#!/usr/bin/python
# -*- coding: utf-8 -*-
from __future__ import print_function
import os
import sys
import time
from datetime import timedelta
import numpy as np
import tensorflow as tf
from sklearn import metrics
from cnn_model import TCNNConfig, TextCNN
from data.cnews_loader import read_vocab, read_category, batch_iter, process_file, build_vocab
import pandas as pd
base_dir = 'data/name2category'
train_dir = os.path.join(base_dir, 'name2category.train.txt')
test_dir = os.path.join(base_dir, 'name2category.test.txt')
val_dir = os.path.join(base_dir, 'name2category.val.txt')
vocab_dir = os.path.join(base_dir, 'name2category.vocab.txt')
pred_dir = os.path.join(base_dir, 'name2category.predict.txt')
save_dir = 'checkpoints/textcnn'
save_path = os.path.join(save_dir, 'best_validation')  # path of the best validation checkpoint
def get_time_dif(start_time):
"""获取已使用时间"""
end_time = time.time()
time_dif = end_time - start_time
return timedelta(seconds=int(round(time_dif)))
def feed_data(x_batch, y_batch, keep_prob):
feed_dict = {
model.input_x: x_batch,
model.input_y: y_batch,
model.keep_prob: keep_prob
}
return feed_dict
def evaluate(sess, x_, y_):
"""评估在某一数据上的准确率和损失"""
data_len = len(x_)
batch_eval = batch_iter(x_, y_, 128)
total_loss = 0.0
total_acc = 0.0
for x_batch, y_batch in batch_eval:
batch_len = len(x_batch)
feed_dict = feed_data(x_batch, y_batch, 1.0)
loss, acc = sess.run([model.loss, model.acc], feed_dict=feed_dict)
total_loss += loss * batch_len
total_acc += acc * batch_len
return total_loss / data_len, total_acc / data_len
def train():
print("Configuring TensorBoard and Saver...")
    # Configure TensorBoard; when retraining, delete the tensorboard folder first, otherwise the graphs will overlap
tensorboard_dir = 'tensorboard/textcnn'
if not os.path.exists(tensorboard_dir):
os.makedirs(tensorboard_dir)
tf.summary.scalar("loss", model.loss)
tf.summary.scalar("accuracy", model.acc)
merged_summary = tf.summary.merge_all()
writer = tf.summary.FileWriter(tensorboard_dir)
    # Configure Saver
saver = tf.train.Saver()
if not os.path.exists(save_dir):
os.makedirs(save_dir)
print("Loading training and validation data...")
    # Load the training and validation sets
start_time = time.time()
x_train, y_train = process_file(train_dir, word_to_id, cat_to_id, config.seq_length)
x_val, y_val = process_file(val_dir, word_to_id, cat_to_id, config.seq_length)
time_dif = get_time_dif(start_time)
print("Time usage:", time_dif)
    # Create the session
session = tf.Session()
session.run(tf.global_variables_initializer())
writer.add_graph(session.graph)
print('Training and evaluating...')
start_time = time.time()
    total_batch = 0  # total number of batches processed
    best_acc_val = 0.0  # best validation-set accuracy so far
    last_improved = 0  # batch at which the last improvement happened
    require_improvement = 1000  # stop training early after more than 1000 batches without improvement
flag = False
for epoch in range(config.num_epochs):
print('Epoch:', epoch + 1)
batch_train = batch_iter(x_train, y_train, config.batch_size)
for x_batch, y_batch in batch_train:
feed_dict = feed_data(x_batch, y_batch, config.dropout_keep_prob)
if total_batch % config.save_per_batch == 0:
                # Write training summaries to the TensorBoard scalars every save_per_batch batches
s = session.run(merged_summary, feed_dict=feed_dict)
writer.add_summary(s, total_batch)
if total_batch % config.print_per_batch == 0:
                # Report performance on the training and validation sets every print_per_batch batches
feed_dict[model.keep_prob] = 1.0
loss_train, acc_train = session.run([model.loss, model.acc], feed_dict=feed_dict)
loss_val, acc_val = evaluate(session, x_val, y_val) # todo
if acc_val > best_acc_val:
                    # Save the best result so far
best_acc_val = acc_val
last_improved = total_batch
saver.save(sess=session, save_path=save_path)
improved_str = '*'
else:
improved_str = ''
time_dif = get_time_dif(start_time)
msg = 'Iter: {0:>6}, Train Loss: {1:>6.2}, Train Acc: {2:>7.2%},' \
+ ' Val Loss: {3:>6.2}, Val Acc: {4:>7.2%}, Time: {5} {6}'
print(msg.format(total_batch, loss_train, acc_train, loss_val, acc_val, time_dif, improved_str))
feed_dict[model.keep_prob] = config.dropout_keep_prob
            session.run(model.optim, feed_dict=feed_dict)  # run the optimization step
total_batch += 1
if total_batch - last_improved > require_improvement:
                # Validation accuracy has not improved for a long time; stop training early
print("No optimization for a long time, auto-stopping...")
flag = True
                break  # leave the batch loop
        if flag:  # likewise leave the epoch loop
break
def test():
print("Loading test data...")
start_time = time.time()
x_test, y_test = process_file(test_dir, word_to_id, cat_to_id, config.seq_length)
session = tf.Session()
session.run(tf.global_variables_initializer())
saver = tf.train.Saver()
    saver.restore(sess=session, save_path=save_path)  # restore the saved model
print('Testing...')
loss_test, acc_test = evaluate(session, x_test, y_test)
msg = 'Test Loss: {0:>6.2}, Test Acc: {1:>7.2%}'
print(msg.format(loss_test, acc_test))
batch_size = 128
data_len = len(x_test)
num_batch = int((data_len - 1) / batch_size) + 1
y_test_cls = np.argmax(y_test, 1)
    y_pred_cls = np.zeros(shape=len(x_test), dtype=np.int32)  # holds the predicted classes
    for i in range(num_batch):  # process batch by batch
start_id = i * batch_size
end_id = min((i + 1) * batch_size, data_len)
feed_dict = {
model.input_x: x_test[start_id:end_id],
model.keep_prob: 1.0
}
y_pred_cls[start_id:end_id] = session.run(model.y_pred_cls, feed_dict=feed_dict)
    # Evaluation
print("Precision, Recall and F1-Score...")
print(metrics.classification_report(y_test_cls, y_pred_cls, target_names=categories))
    # Confusion matrix
print("Confusion Matrix...")
cm = metrics.confusion_matrix(y_test_cls, y_pred_cls)
print(cm)
time_dif = get_time_dif(start_time)
print("Time usage:", time_dif)
def predict():
import tensorflow.keras as kr
table = pd.read_excel('VW_ProductALLState.xlsx')
categories = list(set(table['SubCategoryName'].tolist()))
print("Loading predicted data...")
f = open(pred_dir, 'r', encoding='utf-8', errors='ignore')
contents = []
for line in f:
try:
if line:
contents.append(list(line))
except:
pass
f.close()
data_id = []
for i in range(len(contents)):
data_id.append([word_to_id[x] for x in contents[i] if x in word_to_id])
x_pred = kr.preprocessing.sequence.pad_sequences(data_id, config.seq_length)
session = tf.Session()
session.run(tf.global_variables_initializer())
saver = tf.train.Saver()
    saver.restore(sess=session, save_path=save_path)  # restore the saved model
batch_size = 128
data_len = len(x_pred)
num_batch = int((data_len - 1) / batch_size) + 1
    y_pred_cls = np.zeros(shape=len(x_pred), dtype=np.int32)  # holds the predicted classes
    for i in range(num_batch):  # process batch by batch
start_id = i * batch_size
end_id = min((i + 1) * batch_size, data_len)
feed_dict = {
model.input_x: x_pred[start_id:end_id],
model.keep_prob: 1.0
}
        y_pred_cls[start_id:end_id] = session.run(model.y_pred_cls, feed_dict=feed_dict)  # predicted class ids for this batch
y_pred_list = []
for m in range(len(y_pred_cls)):
y_pred_list.append(categories[y_pred_cls[m]])
f = open('predicted_data.txt','a+', encoding='utf-8', errors='ignore')
for n in range(len(contents)):
f.write(y_pred_list[n] + "\t")
f.write(''.join(contents[n]))
f.close()
return 0
if __name__ == '__main__':
if len(sys.argv) != 2 or sys.argv[1] not in ['train', 'test', 'predict']:
        raise ValueError("""usage: python run_cnn.py [train / test / predict]""")
print('Configuring CNN model...')
config = TCNNConfig()
    if not os.path.exists(vocab_dir):  # rebuild the vocabulary if it does not exist
build_vocab(train_dir, vocab_dir, config.vocab_size)
categories, cat_to_id = read_category()
words, word_to_id = read_vocab(vocab_dir)
config.vocab_size = len(words)
#print(config.vocab_size)
model = TextCNN(config)
if sys.argv[1] == 'train':
train()
elif sys.argv[1] == 'predict':
predict()
else:
test()