concrete_NLP

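Although a wealth of sophisticated and abstract NLP techniques exists, clustering and classification should always be among the first techniques we try on this kind of data. Besides being the easiest to scale in production, their ease of use lets businesses quickly tackle a range of practical questions: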

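How can we automatically tell sentences of different categories apart?
How can we find the most similar sentences in a dataset?
How can we extract a rich, concise representation that can then be reused for a range of other tasks?
How can we quickly find out whether these tasks are feasible on your dataset?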

This post offers a simple way to find sentence representations so that sentences can be classified or grouped together.

Gathering the data

The dataset

Every machine learning problem starts with data, such as emails, posts, or tweets. Common sources of text include:

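Product reviews (from Amazon, Yelp, and various app stores)
User-generated content (tweets, Facebook posts, StackOverflow questions)
Troubleshooting (customer requests, chat logs)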

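In this post we use a dataset provided by CrowdFlower called "Disasters on Social Media", in which:

Contributors looked at over 10,000 tweets retrieved with searches such as "ablaze", "quarantine", and "pandemonium", and then noted whether each tweet referred to a disaster event (as opposed to a joke using the word, a movie review, or something else non-disastrous).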

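We will try to correctly predict which tweets are about disasters. This is a very relevant problem because:

It is actionable for anyone trying to extract signal from noise (such as police departments in this case)
It is tricky, because relying on keywords is harder here than in most similar cases, such as spam detection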

# Python 3 strings are Unicode by default, so the Python 2-only
# reload(sys) / sys.setdefaultencoding("utf-8") workaround is dropped.
import codecs
import re

import keras
import nltk
import numpy as np
import pandas as pd

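Cleaning the data

Let's make sure our tweets contain only the characters we want. We remove the '#' character but keep the word after it, since it may be relevant (e.g. #disaster).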

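def sanitize_characters(raw, clean):
    # Copy the file line by line. Undecodable bytes are already replaced
    # when reading (errors='replace'); the '#' characters themselves are
    # stripped later by the regular expressions in standardize_text.
    for line in raw:
        clean.write(line)

with codecs.open("socialmedia_relevant_cols.csv", "r", encoding="utf-8", errors="replace") as input_file, \
        open("socialmedia_relevant_cols_clean.csv", "w", encoding="utf-8") as output_file:
    sanitize_characters(input_file, output_file)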
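
As a minimal sketch of the hashtag handling described above, the '#' can be dropped while its word is kept with a single substitution. The helper name `strip_hashtag_symbols` is hypothetical and is not part of the original notebook:

```python
import re

def strip_hashtag_symbols(text):
    """Remove '#' characters but keep the words that follow them."""
    return re.sub(r"#", "", text)

example = "Heard about #earthquake in different cities"
print(strip_hashtag_symbols(example))
```

The word itself survives, so a later bag-of-words step can still pick up "earthquake" as a feature.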

Inspecting the data

It looks solid, but we don't need the URLs, and we want our text to be all lowercase (Hello and HELLO are very similar for our task).

questions = pd.read_csv("socialmedia_relevant_cols_clean.csv")
questions.columns=['text', 'choose_one', 'class_label']
questions.head()

	text	choose_one	class_label
0	Just happened a terrible car crash	Relevant	1.0
1	Our Deeds are the Reason of this #earthquake M…	Relevant	1.0
2	Heard about #earthquake is different cities, s…	Relevant	1.0
3	there is a forest fire at spot pond, geese are…	Relevant	1.0
4	Forest fire near La Ronge Sask. Canada	Relevant	1.0

questions.tail()

	text	choose_one	class_label
10871	M1.94 [01:04 UTC]?5km S of Volcano Hawaii. htt…	Relevant	1.0
10872	Police investigating after an e-bike collided …	Relevant	1.0
10873	The Latest: More Homes Razed by Northern Calif…	Relevant	1.0
10874	MEG issues Hazardous Weather Outlook (HWO)	Relevant	1.0
10875	CityofCalgary has activated its Muni	NaN	NaN

questions.describe()

	class_label
count	10875.000000
mean	0.432552
std	0.498414
min	0.000000
25%	0.000000
50%	0.000000
75%	1.000000
max	2.000000

Let's use a few regular expressions to clean up the data we don't need, and save the result back to a CSV file for future use.

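def standardize_text(ques, text_field):
    # regex=True is passed explicitly because recent pandas versions
    # treat the pattern as a literal string by default.
    ques[text_field] = ques[text_field].str.replace(r"http\S+", "", regex=True)  # URLs
    ques[text_field] = ques[text_field].str.replace(r"http", "", regex=True)  # stray "http"
    ques[text_field] = ques[text_field].str.replace(r"@\S+", "", regex=True)  # @mentions
    # Replace every character outside the allowed set (including '#') with a space
    ques[text_field] = ques[text_field].str.replace(r"[^A-Za-z0-9(),!?@\'\`\"\_\n]", " ", regex=True)
    ques[text_field] = ques[text_field].str.replace(r"@", "at", regex=True)
    ques[text_field] = ques[text_field].str.lower()
    return ques

questions = standardize_text(questions, "text")
questions.to_csv("clean_data.csv")
questions.head()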

	text	choose_one	class_label
0	just happened a terrible car crash	Relevant	1.0
1	our deeds are the reason of this earthquake m…	Relevant	1.0
2	heard about earthquake is different cities, s…	Relevant	1.0
3	there is a forest fire at spot pond, geese are…	Relevant	1.0
4	forest fire near la ronge sask canada	Relevant	1.0
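
To see what each substitution does, it can help to run the same patterns on a single string. The sketch below mirrors the regular expressions in `standardize_text` using the standard `re` module; the function name `standardize_string` and the sample tweet are made up for illustration:

```python
import re

def standardize_string(text):
    """Apply the same substitutions as standardize_text to one string."""
    text = re.sub(r"http\S+", "", text)  # drop URLs
    text = re.sub(r"http", "", text)     # drop leftover "http" fragments
    text = re.sub(r"@\S+", "", text)     # drop @mentions
    # Replace characters outside the allowed set (such as '#' and '.') with a space
    text = re.sub(r"[^A-Za-z0-9(),!?@\'\`\"\_\n]", " ", text)
    text = re.sub(r"@", "at", text)      # a lone '@' becomes the word "at"
    return text.lower()

raw = "Forest fire near La Ronge Sask. Canada http://t.co/abc @user #wildfire"
print(standardize_string(raw).split())
```

Note that removed characters leave extra spaces behind, which is harmless here because the later tokenization step splits on word boundaries.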

clean_questions = pd.read_csv("clean_data.csv")
clean_questions.tail()

	text	choose_one	class_label
10871	m1 94 01 04 utc ?5km s of volcano hawaii	Relevant	1.0
10872	police investigating after an e bike collided …	Relevant	1.0
10873	the latest more homes razed by northern calif…	Relevant	1.0
10874	meg issues hazardous weather outlook (hwo)	Relevant	1.0
10875	cityofcalgary has activated its muni	NaN	NaN

Data overview

Let's take a look at our class labels.

clean_questions.groupby("class_label").count()

	Unnamed: 0	text	choose_one
class_label
0.0	6187	6187	6187
1.0	4672	4672	4672
2.0	16	16	16

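We can see that our classes are fairly balanced, with the "Irrelevant" class (label 0) slightly over-represented: roughly 57% of the tweets versus 43% for label 1, plus 16 tweets carrying a third label, 2.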
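
A quick way to put numbers on that balance is to turn the counts into proportions. The sketch below hard-codes the counts from the groupby output above rather than reading the CSV, so it runs on its own:

```python
from collections import Counter

# Label counts copied from the groupby output above.
label_counts = Counter({0.0: 6187, 1.0: 4672, 2.0: 16})
total = sum(label_counts.values())

# Share of the dataset held by each label.
proportions = {label: count / total for label, count in label_counts.items()}
for label, share in sorted(proportions.items()):
    print("label %s: %.1f%%" % (label, 100 * share))
```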

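Our data is clean; now it needs to be prepared

Now that our input data is more reasonable, let's transform it into a form our model can understand.

This means:

Tokenizing sentences into lists of individual words
Creating a training set and a test set
Inspecting our data a bit more to validate the results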

from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+')
clean_questions["tokens"] = clean_questions["text"].apply(tokenizer.tokenize)
clean_questions.head()

	Unnamed: 0	text	choose_one	class_label	tokens
0	0	just happened a terrible car crash	Relevant	1.0	[just, happened, a, terrible, car, crash]
1	1	our deeds are the reason of this earthquake m…	Relevant	1.0	[our, deeds, are, the, reason, of, this, earth…
2	2	heard about earthquake is different cities, s…	Relevant	1.0	[heard, about, earthquake, is, different, citi…
3	3	there is a forest fire at spot pond, geese are…	Relevant	1.0	[there, is, a, forest, fire, at, spot, pond, g…
4	4	forest fire near la ronge sask canada	Relevant	1.0	[forest, fire, near, la, ronge, sask, canada]
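
The pattern `\w+` passed to `RegexpTokenizer` keeps runs of letters, digits, and underscores and drops everything else. A rough standard-library sketch of the same idea follows; the function name `tokenize` is illustrative only, and this is not claimed to match NLTK in every edge case:

```python
import re

def tokenize(text):
    """Return the runs of word characters in text, in order."""
    return re.findall(r"\w+", text)

print(tokenize("forest fire near la ronge sask canada"))
print(tokenize("meg issues hazardous weather outlook (hwo)"))
```

Punctuation such as the parentheses around "hwo" disappears, which is why the token lists above contain only bare words.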

Digging deeper into the data

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
all_words = [word for tokens in clean_questions['tokens'] for word in tokens]
sentence_lengths = [len(tokens) for tokens in clean_questions["tokens"]]
VOCAB = sorted(list(set(all_words)))
print("%s words total, with a vocabulary size of %s" % (len(all_words), len(VOCAB)))
print("Max sentence length is %s" % max(sentence_lengths))

154721 words total, with a vocabulary size of 18102
Max sentence length is 34
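
The same statistics can be checked by hand on a tiny example. The sketch below uses three made-up token lists standing in for `clean_questions["tokens"]` and, as an extra that is not in the original notebook, lists the most frequent words with `collections.Counter`:

```python
from collections import Counter

# Toy token lists standing in for clean_questions["tokens"].
toy_tokens = [
    ["forest", "fire", "near", "la", "ronge"],
    ["there", "is", "a", "forest", "fire"],
    ["just", "happened", "a", "terrible", "car", "crash"],
]

# Count every word occurrence across all sentences.
word_counts = Counter(word for tokens in toy_tokens for word in tokens)
lengths = [len(tokens) for tokens in toy_tokens]

print("%s words total, with a vocabulary size of %s" % (sum(word_counts.values()), len(word_counts)))
print("Max sentence length is %s" % max(lengths))
print(word_counts.most_common(3))
```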

import matplotlib.pyplot as plt
fig = plt.figure(figsize=(10, 10)) 
plt.xlabel('Sentence length')
plt.ylabel('Number of sentences')
plt.hist(sentence_lengths)
plt.show()

Now that our data is clean and ready, let's move on to the machine learning part.

Embeddings

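In machine learning, images can use raw pixels as input; what can NLP use?

A natural way to represent text for a computer is to encode each character individually, but this seems inadequate for representing and understanding language. Our goal is to first create a useful embedding for each sentence (or tweet) in our dataset, and then use these embeddings to accurately predict the relevant category.

We can start with the simplest approach: a bag-of-words model with logistic regression on top. A bag of words simply associates an index with each word in our vocabulary, and embeds each sentence as a list of 0s, with a 1 (or, when counting, the number of occurrences) at each index corresponding to a word present in the sentence.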
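
To make the idea concrete, here is a hand-rolled sketch of a counting bag of words. It assumes simple whitespace tokenization, and the helper names `build_vocabulary` and `bag_of_words` are made up for illustration; a library vectorizer applies its own tokenization rules and stores the result as a sparse matrix:

```python
def build_vocabulary(sentences):
    """Map each distinct word to an index, in alphabetical order."""
    words = sorted({word for sentence in sentences for word in sentence.split()})
    return {word: index for index, word in enumerate(words)}

def bag_of_words(sentence, vocabulary):
    """Count how often each vocabulary word occurs in the sentence."""
    vector = [0] * len(vocabulary)
    for word in sentence.split():
        if word in vocabulary:  # words outside the vocabulary are ignored
            vector[vocabulary[word]] += 1
    return vector

sentences = ["forest fire near la ronge", "fire fire everywhere"]
vocabulary = build_vocabulary(sentences)
for sentence in sentences:
    print(bag_of_words(sentence, vocabulary))
```

Each sentence becomes a vector as long as the vocabulary, and word order is discarded entirely, which is where the name "bag" comes from.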

Bag of Words Counts

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

def cv(data):
    # Learn the vocabulary from data; return the count matrix and the fitted vectorizer
    count_vectorizer = CountVectorizer()
    emb = count_vectorizer.fit_transform(data)
    return emb, count_vectorizer

list_corpus = clean_questions["text"].tolist()
list_labels = clean_questions["class_label"].tolist()
# Hold out 20% of the tweets for testing; random_state makes the split reproducible
x_train, x_test, y_train, y_test = train_test_split(list_corpus, list_labels, test_size=0.2, random_state=40)
# Fit the vocabulary on the training set only, then reuse it for the test set
x_train_count, count_vectorizer = cv(x_train)
x_test_count = count_vectorizer.transform(x_test)
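
The two arguments `test_size=0.2` and `random_state=40` mean "hold out 20% of the rows" and "shuffle the same way on every run". The sketch below illustrates that idea with the standard library only; `split_train_test` is a hypothetical helper, and it will not reproduce the exact rows that scikit-learn selects:

```python
import math
import random

def split_train_test(items, labels, test_size=0.2, seed=40):
    """Shuffle indices with a fixed seed, then hold out a fraction for testing."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(len(items) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    # Keep each text paired with its own label.
    train = [(items[i], labels[i]) for i in train_idx]
    test = [(items[i], labels[i]) for i in test_idx]
    return train, test

texts = ["tweet %d" % i for i in range(10)]
labels = [i % 2 for i in range(10)]
train, test = split_train_test(texts, labels)
print(len(train), len(test))
```

Because the seed is fixed, rerunning the split gives the same partition, which keeps later evaluation numbers comparable between runs.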

Visualizing the embeddings

from sklearn.decomposition import PCA, TruncatedSVD
import matplotlib
import matplotlib.patches as patches
import matplotlib.pyplot as plt

def plot_LSA(test_data, test_labels, savepath="PCA_demo.csv", plot=True):
    # savepath is kept for compatibility but is not used in this function.
    # Project the embeddings onto their two strongest latent dimensions.
    lsa = TruncatedSVD(n_components=2)
    lsa.fit(test_data)
    lsa_scores = lsa.transform(test_data)
    # One color per class label: 0 -> orange, 1 and 2 -> blue
    colors = ['orange', 'blue', 'blue']
    if plot:
        plt.scatter(lsa_scores[:, 0], lsa_scores[:, 1], s=8, alpha=.8, c=test_labels,
                    cmap=matplotlib.colors.ListedColormap(colors))
        orange_patch = patches.Patch(color='orange', label='Irrelevant')
        blue_patch = patches.Patch(color='blue', label='Disaster')
        plt.legend(handles=[orange_patch, blue_patch], prop={'size': 30})

fig = plt.figure(figsize=(20, 20))
# Use the count matrix built in the bag-of-words step (x_train_count)
plot_LSA(x_train_count, y_train)
plt.show()
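
`TruncatedSVD(n_components=2)` squeezes each high-dimensional count vector down to two coordinates so that it can be drawn. The sketch below illustrates the underlying idea with a dense NumPy SVD on a made-up 4x5 count matrix; scikit-learn uses a different solver suited to sparse matrices, so signs and tiny numerical details may differ from what it returns:

```python
import numpy as np

# Toy count matrix: 4 sentences (rows) by 5 vocabulary words (columns).
counts = np.array([
    [1, 1, 0, 0, 0],
    [1, 2, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 1],
], dtype=float)

# Reduced SVD: counts = U @ diag(S) @ Vt, singular values in descending order.
U, S, Vt = np.linalg.svd(counts, full_matrices=False)

# Keep the two strongest directions and project every sentence onto them.
k = 2
scores = counts @ Vt[:k].T  # shape (4, 2); equal to U[:, :k] * S[:k]

print(scores.shape)
```

Sentences that share words end up with similar coordinates, which is what we hope to see as separate clusters in the scatter plot.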

raw_distribution