Exploring Word Vectors with GloVe

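When working with text, dealing with the huge but sparse domain of language is difficult. Even for a small corpus, a neural network needs to support many thousands of discrete inputs and outputs.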

Beyond the sheer number of words, representing words as one-hot vectors fails to capture any information about the relationships between words.
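
As a small illustration of this limitation, the sketch below builds one-hot vectors for a made-up three-word vocabulary (the word list and helper name are assumptions for illustration, not part of the original) and shows that every pair of distinct words ends up equally far apart.

```python
import torch

# Toy vocabulary (hypothetical, for illustration only)
vocab_words = ['cat', 'dog', 'car']

# Each row of the identity matrix is the one-hot vector for one word
one_hot = torch.eye(len(vocab_words))

def one_hot_distance(a, b):
    """Euclidean distance between the one-hot vectors of two words."""
    i, j = vocab_words.index(a), vocab_words.index(b)
    return torch.dist(one_hot[i], one_hot[j]).item()

cat_dog = one_hot_distance('cat', 'dog')
cat_car = one_hot_distance('cat', 'car')

# Both distances are sqrt(2): the encoding cannot say that
# 'cat' is more related to 'dog' than to 'car'.
print(cat_dog, cat_car)
```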

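Word vectors address this problem by representing words in a multi-dimensional vector space. This can reduce the dimensionality of the problem from hundreds of thousands to just hundreds. In addition, the vector space can capture semantic relationships between words through the distance and the angle between their vectors.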
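
To make the idea of "angle between vectors" concrete, here is a minimal sketch using cosine similarity on hand-written dense vectors. The three-dimensional values are invented for illustration; real GloVe vectors have 50 to 300 dimensions and are learned from data.

```python
import torch
import torch.nn.functional as F

# Made-up dense vectors (assumed values, not taken from GloVe)
toy = {
    'cat': torch.tensor([0.9, 0.8, 0.1]),
    'dog': torch.tensor([0.8, 0.9, 0.2]),
    'car': torch.tensor([0.1, 0.2, 0.9]),
}

def cosine(a, b):
    """Cosine of the angle between the vectors of words a and b."""
    return F.cosine_similarity(toy[a], toy[b], dim=0).item()

# A value near 1 means a small angle (similar direction),
# a value near 0 means the vectors are close to orthogonal.
print('cat vs dog: %.4f' % cosine('cat', 'dog'))
print('cat vs car: %.4f' % cosine('cat', 'car'))
```

Unlike the one-hot case, related words can now be placed closer together than unrelated ones.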

[Figure: analogy]

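There are several techniques for creating word vectors. The word2vec algorithm predicts words in a context (for example, that "the mouse" is likely to appear near "the cat"), while GloVe vectors are based on global counts across the entire corpus. The biggest advantage of GloVe is that several sets of pre-trained vectors can be downloaded easily.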
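
The "global counts" mentioned above are word co-occurrence counts. The sketch below shows only that counting step on a two-sentence toy corpus; the function name, the window size, and the corpus are assumptions for illustration, and the actual GloVe training procedure involves considerably more than this.

```python
from collections import Counter

def cooccurrence_counts(sentences, window=2):
    """Count how often two words appear within `window` tokens of each other."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            # Look at neighbors up to `window` positions to the right,
            # and record the pair in both directions (symmetric counts).
            for j in range(i + 1, min(i + 1 + window, len(tokens))):
                counts[(word, tokens[j])] += 1
                counts[(tokens[j], word)] += 1
    return counts

# Toy corpus (hypothetical)
corpus = ['the cat chased the mouse', 'the cat sat']
counts = cooccurrence_counts(corpus)

print(counts[('the', 'cat')])
print(counts[('cat', 'mouse')])
```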

Loading word vectors

Torchtext includes functions for downloading GloVe (and other) embeddings.

import torch
import torchtext.vocab as vocab

glove = vocab.GloVe(name='6B', dim=100)

[out]: 100%|██████████| 400000/400000 [00:20<00:00, 19969.45it/s]

The returned GloVe object includes the following attributes:

stoi (string-to-index): a dictionary mapping words to indexes
itos (index-to-string): a list of words ordered by index
vectors: the actual vectors. To get a word's vector, look up the word's index in stoi and use it to index into vectors

def get_word(word):
    return glove.vectors[glove.stoi[word]]

# get_word('google') returns a vector of size 100
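
Looking a word up through stoi raises a KeyError when the word is not in the vocabulary. The sketch below shows one possible way to handle that, using small stand-in values for itos, stoi, and vectors so it runs without downloading anything; the helper name get_word_or_none is hypothetical and not part of torchtext.

```python
import torch

# Stand-ins for glove.itos, glove.stoi and glove.vectors (toy values)
itos = ['the', 'cat', 'mouse']
stoi = {word: index for index, word in enumerate(itos)}
vectors = torch.tensor([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])

def get_word_or_none(word):
    """Return the vector for `word`, or None if it is out of vocabulary."""
    index = stoi.get(word)
    return None if index is None else vectors[index]

print(get_word_or_none('cat'))     # the row of `vectors` at index 1
print(get_word_or_none('dragon'))  # None, since 'dragon' is not in stoi

# itos and stoi are inverse mappings of each other
print(itos[stoi['mouse']])
```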

Finding closest vectors

def closest(vec, n=10):
    """
    Find the closest words for a given vector
    """
    # Euclidean distance from vec to every word vector in the vocabulary
    all_dists = [(w, torch.dist(vec, get_word(w))) for w in glove.itos]
    return sorted(all_dists, key=lambda t: t[1])[:n]


def print_tuples(tuples):
    # Each tuple is (word, distance); avoid shadowing the built-in name `tuple`
    for word, dist in tuples:
        print('(%.4f) %s' % (dist, word))


print_tuples(closest(get_word('google')))

[out]: (0.0000) google
(3.0772) yahoo
(3.8836) microsoft
(4.1048) web
(4.1082) aol
(4.1165) facebook
(4.3917) ebay
(4.4122) msn
(4.4540) internet
(4.4651) netscape
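
The closest function above loops over all 400,000 words in Python, which can be slow. A possible alternative is to compute every distance in a single tensor operation. The sketch below demonstrates the idea on a tiny made-up vocabulary; itos and vectors here are toy stand-ins for glove.itos and glove.vectors, and closest_vectorized is a hypothetical name.

```python
import torch

# Toy stand-ins for glove.itos and glove.vectors (assumed values)
itos = ['google', 'yahoo', 'banana', 'apple']
vectors = torch.tensor([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [0.1, 0.9],
])

def closest_vectorized(vec, n=2):
    """Return the n nearest words to vec as (word, distance) pairs."""
    # Broadcasting subtracts vec from every row; the norm over dim=1
    # then yields one Euclidean distance per vocabulary word.
    dists = torch.norm(vectors - vec, dim=1)
    # largest=False selects the n smallest distances, sorted ascending
    top = torch.topk(dists, k=n, largest=False)
    return [(itos[i], d) for d, i in zip(top.values.tolist(), top.indices.tolist())]

result = closest_vectorized(vectors[0])
for word, dist in result:
    print('(%.4f) %s' % (dist, word))
```

The result has the same (word, distance) shape as the output of closest, so it could be passed to print_tuples unchanged.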

Word analogies with vector arithmetic

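The most interesting feature of a well-trained word vector space is that certain semantic relationships (beyond mere closeness of words) can be captured with regular vector arithmetic.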

[Figure: analogy example]

# In the form w1 : w2 :: w3 : ?
def analogy(w1, w2, w3, n=5, filter_given=True):
    print('\n[%s : %s :: %s : ?]' % (w1, w2, w3))
   
    # w2 - w1 + w3 = w4
    closest_words = closest(get_word(w2) - get_word(w1) + get_word(w3))
    
    # Optionally filter out given words
    if filter_given:
        closest_words = [t for t in closest_words if t[0] not in [w1, w2, w3]]
        
    print_tuples(closest_words[:n])

analogy('king', 'man', 'queen')

[out]: [king : man :: queen : ?]
(4.0811) woman
(4.6916) girl
(5.2703) she
(5.2788) teenager
(5.3084) boy
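
To see why the arithmetic w2 - w1 + w3 can land near the expected word, it may help to work through hand-made vectors. In the sketch below the two axes are assumed to mean "royalty" and "gender"; these values are invented for illustration, and real GloVe dimensions do not have such clean interpretations.

```python
import torch

# Hand-made 2-D vectors: axis 0 = "royalty", axis 1 = "gender" (hypothetical)
toy = {
    'king':   torch.tensor([1.0, 1.0]),
    'queen':  torch.tensor([1.0, -1.0]),
    'man':    torch.tensor([0.0, 1.0]),
    'woman':  torch.tensor([0.0, -1.0]),
    'prince': torch.tensor([0.9, 1.0]),
    'apple':  torch.tensor([-1.0, 0.0]),
}

def toy_analogy(w1, w2, w3):
    """Solve w1 : w2 :: w3 : ? by nearest neighbor to w2 - w1 + w3."""
    target = toy[w2] - toy[w1] + toy[w3]
    # Exclude the given words, as filter_given does in analogy()
    candidates = [w for w in toy if w not in (w1, w2, w3)]
    return min(candidates, key=lambda w: torch.dist(target, toy[w]).item())

# man - king removes the "royalty" component; adding queen then leaves
# a non-royal vector with queen's "gender" value, which is woman's vector.
answer = toy_analogy('king', 'man', 'queen')
print(answer)
```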