
75. Natural Language Processing Framework Expansion#

75.1. Introduction#

Previously, we introduced recurrent neural networks and used deep learning frameworks to complete text classification tasks. In practice, however, general deep learning frameworks are not as well equipped for natural language processing as they are for computer vision. In this experiment, we focus on understanding and learning several commonly used natural language processing tools, which often become powerful aids for exploring the field further.

75.2. Key Points#

  • Natural Language Toolkit

  • PyTorch Flair

  • Natural Language Processing Tools

As the saying goes, “If you want to do a good job, you must first sharpen your tools.” With the idea of not reinventing the wheel, this course teaches you how to use common tools in machine learning applications, such as the NumPy scientific computing library, the Pandas data analysis library, the scikit-learn machine learning library, and the TensorFlow and PyTorch deep learning frameworks.

Today’s natural language processing mostly incorporates deep learning methods. However, the commonly used TensorFlow and PyTorch frameworks are designed more with computer vision in mind and, in particular, provide a large number of preprocessing utilities for images. Unlike computer vision, natural language processing has its own difficulties; the most obvious is that processing methods can vary from language to language. Therefore, in this experiment we will learn several tools designed specifically for natural language processing, as a supplement to the previous NLP experiments.

75.3. Natural Language Toolkit#

Natural Language Toolkit, abbreviated NLTK, is, as the name implies, a natural language processing toolkit. Currently, NLTK is mainly used for processing English and other Latin-script text.

For example, we can use NLTK to perform word segmentation on English texts. First, we need to download the NLTK extension packages. You can use nltk.download() to selectively download the required extension packages or directly use python -m nltk.downloader all to download all data extension packages.

Because access to overseas servers can be slow, the course provides a mirror; here we download the Punkt Tokenizer Models extension required for English word segmentation.

import nltk
nltk.download('punkt')  # download the extension package required for English tokenization
True

Next, use nltk.tokenize.word_tokenize to complete the English text word segmentation process.

from nltk.tokenize import word_tokenize

text = """
[English] is a West Germanic language that was first spoken in early 
medieval England and eventually became a global lingua franca.
It is named after the <Angles>, one of the Germanic tribes that 
migrated to the area of Great Britain that later took their name, 
as England.
"""

tokens = word_tokenize(text)
print(tokens)
['[', 'English', ']', 'is', 'a', 'West', 'Germanic', 'language', 'that', 'was', 'first', 'spoken', 'in', 'early', 'medieval', 'England', 'and', 'eventually', 'became', 'a', 'global', 'lingua', 'franca', '.', 'It', 'is', 'named', 'after', 'the', '<', 'Angles', '>', ',', 'one', 'of', 'the', 'Germanic', 'tribes', 'that', 'migrated', 'to', 'the', 'area', 'of', 'Great', 'Britain', 'that', 'later', 'took', 'their', 'name', ',', 'as', 'England', '.']

If you only need to segment text into sentences, you can use the nltk.sent_tokenize method.

from nltk import sent_tokenize

sent_tokenize(text)
['\n[English] is a West Germanic language that was first spoken in early \nmedieval England and eventually became a global lingua franca.',
 'It is named after the <Angles>, one of the Germanic tribes that \nmigrated to the area of Great Britain that later took their name, \nas England.']

Given the tokenized results, we can also filter the text. For example, to remove punctuation marks, we traverse the tokens and keep only alphabetic content, using Python’s .isalpha() string method.

tokens = word_tokenize(text)
# keep only alphabetic tokens
words = [word for word in tokens if word.isalpha()]
print(words)
['English', 'is', 'a', 'West', 'Germanic', 'language', 'that', 'was', 'first', 'spoken', 'in', 'early', 'medieval', 'England', 'and', 'eventually', 'became', 'a', 'global', 'lingua', 'franca', 'It', 'is', 'named', 'after', 'the', 'Angles', 'one', 'of', 'the', 'Germanic', 'tribes', 'that', 'migrated', 'to', 'the', 'area', 'of', 'Great', 'Britain', 'that', 'later', 'took', 'their', 'name', 'as', 'England']

Of course, we can also remove English stop words. Here, we download the stop-word extension package and load it with nltk.corpus.stopwords:

from nltk.corpus import stopwords

nltk.download('stopwords')  # install the stop-word extension package
stop_words = stopwords.words("english")  # load the English stop words
print(stop_words)
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]

Currently, this extension package supports stop words in languages such as Dutch, German, Italian, Portuguese, Swedish, Arabic, English, Greek, Kazakh, Romanian, Turkish, Azerbaijani, Finnish, Hungarian, Nepali, Russian, Danish, French, Indonesian, Norwegian, Spanish, etc. Unfortunately, it does not provide common Chinese stop words.
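If you want to confirm which stop-word languages are available in your local installation, the corpus reader can list them. A minimal sketch using the standard fileids method of NLTK corpus readers:

from nltk.corpus import stopwords

# List the languages for which the downloaded stopwords corpus provides lists.
print(stopwords.fileids())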

With the stop-word list loaded, we can remove stop words from our tokens by traversing them.

words_ = [w for w in words if w not in stop_words]
print(words_)
['English', 'West', 'Germanic', 'language', 'first', 'spoken', 'early', 'medieval', 'England', 'eventually', 'became', 'global', 'lingua', 'franca', 'It', 'named', 'Angles', 'one', 'Germanic', 'tribes', 'migrated', 'area', 'Great', 'Britain', 'later', 'took', 'name', 'England']

In addition, NLTK makes word frequency statistics easy: nltk.FreqDist builds a frequency distribution over the tokens, displayed with the most frequent entries first.

from nltk import FreqDist

FreqDist(tokens)
FreqDist({'that': 3, 'the': 3, 'is': 2, 'a': 2, 'Germanic': 2, 'England': 2, '.': 2, ',': 2, 'of': 2, '[': 1, ...})
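If you only need the few most frequent words, FreqDist also offers the most_common method inherited from Python's Counter. A small sketch, reusing the tokens produced above:

from nltk import FreqDist

freq = FreqDist(tokens)  # tokens produced by word_tokenize above
print(freq.most_common(5))  # e.g. [('that', 3), ('the', 3), ('is', 2), ('a', 2), ('Germanic', 2)]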

The large number of extension packages provided by NLTK enables more advanced applications. For example, PorterStemmer can be used to extract the stems of the words in a sentence. Stemming is a concept from linguistic morphology: its purpose is to remove affixes and obtain the root form of a word. For example, the stem of “Germanic” is “german”.

from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()
stemmed = [porter.stem(word) for word in tokens]
print(stemmed)
['[', 'english', ']', 'is', 'a', 'west', 'german', 'languag', 'that', 'wa', 'first', 'spoken', 'in', 'earli', 'mediev', 'england', 'and', 'eventu', 'becam', 'a', 'global', 'lingua', 'franca', '.', 'it', 'is', 'name', 'after', 'the', '<', 'angl', '>', ',', 'one', 'of', 'the', 'german', 'tribe', 'that', 'migrat', 'to', 'the', 'area', 'of', 'great', 'britain', 'that', 'later', 'took', 'their', 'name', ',', 'as', 'england', '.']

Finally, it is highly recommended that you read and practice with the official NLTK book, Analyzing Text with the Natural Language Toolkit. Its content is very comprehensive, and if you often need to process English text, it will be of great help to you. A Chinese translation, Python Natural Language Processing (Posts & Telecom Press), is also available in China.

75.4. Flair#

Flair is a natural language processing framework that has emerged in recent years and belongs to the PyTorch ecosystem. Flair has the following main features.

First of all, Flair supports tokenization, part-of-speech tagging, and named entity recognition for English text. Part-of-speech tagging means marking each word as a noun, verb, adjective, etc. according to its grammatical role; for Chinese the Jieba tool can be used, and for English, NLTK or Flair. Named entity recognition refers to identifying entities with specific meanings in text, mainly person names, place names, organization names, and proper nouns. Flair has achieved good results on these tasks (see the zalandoresearch/flair repository).
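As an illustration of the named entity recognition mentioned above, the sketch below loads Flair's pre-trained English 'ner' model (downloaded on first use). This snippet is an illustrative assumption, not part of the original experiment:

from flair.data import Sentence
from flair.models import SequenceTagger

tagger = SequenceTagger.load("ner")  # pre-trained English NER model
sentence = Sentence("George Washington went to Washington.")
tagger.predict(sentence)

# Print the recognized entity spans with their labels and confidence scores.
for entity in sentence.get_spans("ner"):
    print(entity)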

Secondly, Flair provides a variety of pre-trained models for multilingual word embeddings, making various word embedding tasks easy: for example, Flair embeddings, BERT embeddings, and ELMo embeddings. Finally, Flair builds a complete framework on top of PyTorch, which is very convenient for tasks such as text classification.

Next, let’s learn how to use Flair. The standard text type in Flair is flair.data.Sentence. For English text, a Sentence can be created through the following example.

from flair.data import Sentence

text = """
[English] is a West Germanic language that was first spoken in early 
medieval England and eventually became a global lingua franca.
It is named after the <Angles>, one of the Germanic tribes that 
migrated to the area of Great Britain that later took their name, 
as England.
"""

sentence = Sentence(text)
sentence
Sentence[55]: " [English] is a West Germanic language that was first spoken in early  medieval England and eventually became a global lingua franca. It is named after the <Angles>, one of the Germanic tribes that  migrated to the area of Great Britain that later took their name,  as England."

Flair recognizes the Tokens contained in each Sentence by splitting on spaces, and you can print them by traversing the Sentence. If we pass Chinese text whose words have already been segmented and separated by spaces, Flair can likewise recognize the Tokens.

for token in sentence:
    print(token)
Token[0]: "["
Token[1]: "English"
Token[2]: "]"
Token[3]: "is"
Token[4]: "a"
Token[5]: "West"
Token[6]: "Germanic"
Token[7]: "language"
Token[8]: "that"
Token[9]: "was"
Token[10]: "first"
Token[11]: "spoken"
Token[12]: "in"
Token[13]: "early"
Token[14]: "medieval"
Token[15]: "England"
Token[16]: "and"
Token[17]: "eventually"
Token[18]: "became"
Token[19]: "a"
Token[20]: "global"
Token[21]: "lingua"
Token[22]: "franca"
Token[23]: "."
Token[24]: "It"
Token[25]: "is"
Token[26]: "named"
Token[27]: "after"
Token[28]: "the"
Token[29]: "<"
Token[30]: "Angles"
Token[31]: ">"
Token[32]: ","
Token[33]: "one"
Token[34]: "of"
Token[35]: "the"
Token[36]: "Germanic"
Token[37]: "tribes"
Token[38]: "that"
Token[39]: "migrated"
Token[40]: "to"
Token[41]: "the"
Token[42]: "area"
Token[43]: "of"
Token[44]: "Great"
Token[45]: "Britain"
Token[46]: "that"
Token[47]: "later"
Token[48]: "took"
Token[49]: "their"
Token[50]: "name"
Token[51]: ","
Token[52]: "as"
Token[53]: "England"
Token[54]: "."

As you can see, splitting only on spaces does not always give the expected segmentation; for example, a bracketed word such as [English] may not be separated from its surrounding punctuation. In Flair, you only need to specify use_tokenizer=True, and it will automatically call segtok to tokenize English text (Chinese tokenization is not supported).

sentence = Sentence(text, use_tokenizer=True)
for token in sentence:
    print(token)
Token[0]: "["
Token[1]: "English"
Token[2]: "]"
Token[3]: "is"
Token[4]: "a"
Token[5]: "West"
Token[6]: "Germanic"
Token[7]: "language"
Token[8]: "that"
Token[9]: "was"
Token[10]: "first"
Token[11]: "spoken"
Token[12]: "in"
Token[13]: "early"
Token[14]: "medieval"
Token[15]: "England"
Token[16]: "and"
Token[17]: "eventually"
Token[18]: "became"
Token[19]: "a"
Token[20]: "global"
Token[21]: "lingua"
Token[22]: "franca"
Token[23]: "."
Token[24]: "It"
Token[25]: "is"
Token[26]: "named"
Token[27]: "after"
Token[28]: "the"
Token[29]: "<"
Token[30]: "Angles"
Token[31]: ">"
Token[32]: ","
Token[33]: "one"
Token[34]: "of"
Token[35]: "the"
Token[36]: "Germanic"
Token[37]: "tribes"
Token[38]: "that"
Token[39]: "migrated"
Token[40]: "to"
Token[41]: "the"
Token[42]: "area"
Token[43]: "of"
Token[44]: "Great"
Token[45]: "Britain"
Token[46]: "that"
Token[47]: "later"
Token[48]: "took"
Token[49]: "their"
Token[50]: "name"
Token[51]: ","
Token[52]: "as"
Token[53]: "England"
Token[54]: "."

Next, you can perform word embedding on the Sentence. For Chinese word embedding, Flair uses the pre-trained vectors provided by FastText, which are trained on the Wikipedia corpus (see the FastText paper). You can load the word vectors with flair.embeddings.WordEmbeddings('zh').

from flair.embeddings import WordEmbeddings

# initialize the embedding
embedding = WordEmbeddings('zh')  # use 'zh' when implementing this yourself
# create the Sentence; Chinese text must be pre-segmented and separated by spaces so that Flair can recognize the tokens
sentence = Sentence("机器 学习 是 一个 好 工具")
# perform the word embedding
embedding.embed(sentence)

for token in sentence:
    print(token)
    print(token.embedding)  # print the embedded vector
2023-11-14 10:46:34,630 https://flair.informatik.hu-berlin.de/resources/embeddings/token/zh-wiki-fasttext-300d-1M.vectors.npy not found in cache, downloading to /var/folders/tc/9kxpg1x95sl6cm2lc2jwpgt80000gn/T/tmptm1wnan_
2023-11-14 10:48:27,701 copying /var/folders/tc/9kxpg1x95sl6cm2lc2jwpgt80000gn/T/tmptm1wnan_ to cache at /Users/huhuhang/.flair/embeddings/zh-wiki-fasttext-300d-1M.vectors.npy
2023-11-14 10:48:28,046 removing temp file /var/folders/tc/9kxpg1x95sl6cm2lc2jwpgt80000gn/T/tmptm1wnan_
2023-11-14 10:48:30,424 https://flair.informatik.hu-berlin.de/resources/embeddings/token/zh-wiki-fasttext-300d-1M not found in cache, downloading to /var/folders/tc/9kxpg1x95sl6cm2lc2jwpgt80000gn/T/tmpt5nod9rk
2023-11-14 10:48:37,381 copying /var/folders/tc/9kxpg1x95sl6cm2lc2jwpgt80000gn/T/tmpt5nod9rk to cache at /Users/huhuhang/.flair/embeddings/zh-wiki-fasttext-300d-1M
2023-11-14 10:48:37,394 removing temp file /var/folders/tc/9kxpg1x95sl6cm2lc2jwpgt80000gn/T/tmpt5nod9rk
Token[0]: "机器"
tensor([ 0.0143,  0.5811, -0.6224,  0.2212,  0.7530, -0.1874, -0.6064,  0.0659,
         0.2285,  0.0601,  0.8601, -0.0285,  1.1349, -0.5639, -0.1153, -0.0566,
         0.7801, -0.0867, -0.6968, -0.5147, -0.3374,  1.1837,  0.7827,  0.0867,
         0.4255,  0.1987,  0.8387, -0.0374,  0.3309, -0.0280,  0.8692, -0.9097,
        -0.8766, -0.6566, -0.4730,  1.0071,  0.7562, -0.4000, -0.0652,  0.9994,
         0.9919,  0.4734,  0.8127, -0.4761, -0.1291, -0.5706, -0.7824, -0.3793,
         0.1278, -1.0881,  0.6386, -0.4776, -0.7002, -0.8154,  0.1790, -0.6806,
        -1.2060, -1.0734, -2.0394,  0.4766,  0.9346,  0.0028,  0.5399,  0.8536,
         0.1003,  0.5261, -0.6837, -0.5685, -0.5339,  0.1208,  0.8826, -0.4829,
         1.1641, -0.2419, -0.7891, -0.1125, -0.1593, -0.8578, -0.6621, -1.1855,
         0.0431,  0.0583,  0.7011, -0.7517, -0.7582, -0.9517, -0.0285,  0.3103,
         0.1624, -0.9033, -0.7867,  0.4230, -0.2775, -0.0805, -0.3226,  0.7330,
         0.3128,  0.1851, -0.1853, -0.0596,  0.4414,  0.2600, -0.7027,  0.8328,
        -0.4970,  0.3798,  0.2092, -0.7503, -0.5770, -0.0128, -0.1826, -0.1387,
         0.3124,  0.0187,  0.0387,  0.3218, -0.6264, -0.0517, -0.8444, -0.2013,
        -0.5843,  0.4578,  0.3557,  0.3344, -0.3998, -0.3747,  0.8146,  0.5117,
        -0.8563,  0.2704,  0.3490,  0.5117, -0.7002,  0.7740, -1.1578,  0.4763,
        -0.1603, -0.2892, -0.6538, -0.2876,  0.0559,  0.3469,  0.6359, -1.0277,
        -0.3468,  0.6848, -1.1921, -0.2028, -0.2787, -0.0375,  0.3030,  0.2835,
        -0.4877,  0.5015,  0.6387, -0.7885,  0.5213, -0.1034, -0.4846,  0.7212,
         0.1653,  0.1738,  0.3371,  0.2896, -0.5367, -0.0286, -0.3308,  0.4364,
         0.8609,  0.2810,  0.4085,  0.3831,  1.1185, -0.0573,  0.1359, -0.0791,
         0.4720,  0.7635, -0.0476, -0.3413, -0.4208, -0.4342,  0.0646,  0.7787,
        -0.2778,  0.5125, -0.1750, -0.0067, -0.6191,  0.6051, -0.3996, -0.1800,
        -0.3747, -0.5957, -0.0768, -0.3267,  0.1453, -0.8712, -0.0167, -0.2440,
         0.0361, -0.1182, -0.0665, -0.2876,  0.3599,  0.2551,  0.3388, -0.4155,
         0.2375,  0.2611, -0.9597,  0.6817, -0.2156,  0.1333, -0.4112,  0.4606,
         0.2891, -0.1833,  0.6388,  0.4233, -0.3933,  0.9661,  0.4108, -0.4213,
        -0.5075,  0.4503, -0.3346, -0.2201, -0.3898,  0.0812, -0.8379, -0.5047,
         0.2715, -0.3409, -0.4785,  1.0817,  0.3356, -0.6770,  0.0201, -0.2554,
         0.6776, -0.4254,  0.1542, -0.8496, -0.3390,  0.2657, -0.7995, -0.1938,
        -0.5448,  0.7467, -0.6824, -0.0090, -1.0278,  0.4611, -0.2736, -0.2581,
         0.4046,  0.3983, -1.0149, -0.2133,  0.2896, -0.4678, -0.6328, -0.1350,
        -0.1115, -0.0133, -1.0436, -0.6583, -1.6365, -0.9621, -0.5890,  0.0709,
         1.2525, -0.6565,  0.2980, -0.3386,  0.3507,  0.5609,  0.2868, -0.8079,
         0.8194,  1.6471,  0.2568,  0.1511,  0.1190, -0.2075, -0.0378, -0.0687,
        -1.0916,  0.0909, -1.0063, -0.8164, -0.9036, -0.6002,  0.2261, -0.6284,
         0.1163,  0.2058,  1.0010, -0.0158])
Token[1]: "学习"
tensor([-3.2496e-01,  6.4774e-01, -3.8768e-01,  3.2548e-02,  7.6574e-01,
        -5.1088e-01, -7.9309e-01,  3.1974e-01,  1.4386e-01, -2.3689e-01,
         4.8327e-01, -6.3284e-02,  1.5617e+00, -6.6433e-01,  1.1605e-01,
        -2.2726e-01,  7.7193e-01, -2.1452e-01, -5.9669e-01, -6.7123e-01,
        -4.8646e-01,  1.0104e+00,  7.3959e-01, -6.9452e-02,  6.9977e-01,
        -5.0222e-01,  8.8357e-01, -2.6706e-02,  5.9556e-01,  2.4998e-01,
         8.0008e-01, -7.0294e-01, -6.3508e-01, -4.5956e-01, -7.2117e-01,
         6.0594e-01,  5.8690e-01, -3.0229e-01,  1.0712e-02,  1.4117e+00,
         9.3205e-01,  9.1340e-01,  7.4644e-01, -7.8001e-01, -3.1752e-01,
        -4.2588e-01, -6.7553e-01, -4.7500e-01,  1.2541e-01, -5.2286e-01,
         1.2321e+00, -4.2871e-01, -2.3411e-01, -7.4918e-01,  2.6468e-01,
        -9.1652e-01, -1.0878e+00, -1.3424e+00, -2.3272e+00,  2.0264e-01,
         7.5792e-01, -1.0008e-01,  6.2414e-01,  8.5643e-01,  2.8045e-01,
         6.6526e-01, -6.2577e-01, -4.4871e-01, -2.1029e-01,  5.2435e-03,
         9.8931e-01, -3.3600e-01,  1.0255e+00, -5.4760e-01, -4.6953e-01,
        -1.2598e-01, -1.1644e-01, -6.8195e-01, -2.6941e-01, -1.0325e+00,
        -4.2845e-01,  1.5560e-01,  8.3749e-01, -7.6646e-01, -6.7090e-01,
        -9.9258e-01, -1.9242e-01,  6.7083e-01,  2.2316e-01, -6.9702e-01,
        -5.0593e-01,  4.5782e-01, -1.6225e-01, -4.4178e-02, -4.5914e-01,
         9.3188e-01,  2.8645e-01,  3.5577e-01,  9.9708e-02, -5.0715e-03,
         3.0289e-01,  4.0837e-01, -3.1080e-01,  5.4059e-01, -5.3086e-01,
         4.7741e-02,  9.2715e-02, -7.3871e-01, -6.6761e-01, -2.4710e-01,
         1.1328e-02, -3.9275e-01, -8.4853e-03,  6.7699e-01,  4.8520e-01,
         2.2267e-01, -5.9829e-01, -1.2634e-01, -7.5148e-01, -4.0789e-01,
        -3.9861e-01,  4.6634e-01,  5.2215e-01,  9.5104e-02, -5.8386e-01,
        -3.6987e-01,  5.0411e-01,  2.6521e-01, -7.4881e-01,  1.3841e-01,
         5.5953e-01, -3.5650e-01, -5.7487e-01,  1.0268e+00, -1.0020e+00,
         4.0540e-01,  6.9844e-03, -4.0649e-02, -7.5194e-01, -1.7583e-01,
        -2.3509e-02,  6.2793e-01,  7.7491e-01, -9.5466e-01, -3.5790e-01,
         3.3733e-01, -9.7010e-01, -2.7844e-01, -4.7630e-01, -8.6698e-03,
         4.8556e-01,  5.4333e-01, -5.6352e-01,  4.5409e-01,  6.4429e-01,
        -8.2720e-01,  1.9464e-01, -3.3808e-02, -3.2662e-01,  6.3361e-01,
         5.6221e-01,  4.0578e-01, -5.3711e-03,  2.4223e-01, -7.3461e-02,
         2.6014e-01, -1.2481e-01,  1.1112e+00,  3.2438e-01,  2.6632e-01,
         4.4040e-01,  4.5628e-01,  1.1011e+00, -3.0905e-01,  2.0793e-01,
        -5.1031e-01,  8.0338e-01,  7.4910e-01, -1.2676e-01, -1.9419e-01,
        -5.3962e-01, -4.4887e-01,  5.0762e-02,  2.4368e-01, -1.9830e-01,
         5.7638e-01, -2.5450e-01, -2.6344e-01, -7.1175e-01,  8.4950e-01,
        -1.1203e-01, -5.1713e-02,  7.3786e-02, -4.6739e-01, -2.6118e-01,
        -1.0008e+00, -2.1247e-01, -7.9742e-01, -2.0798e-01,  2.9983e-01,
        -1.6070e-01, -1.1373e-01,  2.7128e-01, -9.5071e-01, -4.7413e-02,
         9.7961e-02,  1.5148e-01, -5.8094e-01,  1.8383e-02,  1.2603e-01,
        -7.1018e-01, -5.5209e-03, -3.5366e-01,  9.1064e-02, -8.4823e-01,
         1.4066e-01,  3.3259e-01, -4.3188e-01,  6.9093e-01,  5.9725e-01,
        -4.7339e-01,  1.4482e-02,  4.2865e-01, -1.2653e-01, -6.9798e-01,
         5.1845e-01,  1.8231e-02, -5.2561e-01, -7.6256e-01,  4.4506e-02,
        -7.8633e-01, -8.0979e-01,  1.5287e-01, -4.3581e-01, -4.7554e-01,
         5.5389e-01,  4.3198e-01, -1.1809e+00, -3.1034e-02, -1.5329e-01,
         6.3897e-01, -6.1526e-01,  6.6176e-01, -1.1912e-01,  6.6673e-02,
         1.5720e-01, -9.1384e-01, -7.0507e-02, -2.9597e-02,  8.7810e-01,
        -4.2138e-01,  6.7716e-02, -6.7661e-01,  6.9992e-01, -3.4975e-01,
        -6.0683e-02,  4.2290e-01,  6.0106e-01, -8.1242e-01, -1.7345e-01,
         8.1558e-01, -1.9420e-04, -2.6439e-01, -6.7547e-02, -4.9000e-01,
        -1.0618e-01, -4.5975e-01, -3.5768e-01, -1.3467e+00, -7.7125e-01,
        -3.1377e-01,  1.7904e-01,  7.0509e-01, -5.1039e-01, -1.6599e-01,
         1.1125e-02,  1.6963e-01,  2.1902e-01, -1.9176e-02, -6.2030e-01,
         5.8062e-01,  1.5223e+00,  3.6571e-02,  7.5735e-01,  5.1125e-01,
        -7.2188e-02, -8.3230e-03, -6.8004e-02, -1.3448e+00, -6.3490e-02,
        -8.8260e-01, -8.2197e-01, -7.2176e-01, -5.2066e-01, -3.0366e-01,
        -2.6454e-01,  6.6962e-01,  7.0614e-01,  9.2428e-01,  1.4701e-01])
Token[2]: "是"
tensor([-2.9470e-01,  9.6263e-01, -6.4208e-01,  2.6904e-01,  8.1669e-01,
        -4.1970e-01, -7.6458e-01,  2.2026e-01,  3.9823e-01, -4.4368e-02,
         9.0519e-01,  5.8803e-02,  1.5186e+00, -1.0610e+00,  3.7516e-02,
        -8.5236e-01,  7.1213e-01, -4.0752e-01, -1.4830e+00, -6.0035e-01,
        -1.1404e+00,  1.2292e+00,  1.2193e+00, -1.3205e-02,  7.5028e-01,
        -6.4071e-01,  1.0560e+00,  2.9746e-01,  6.5827e-01,  2.9021e-01,
         1.0365e+00, -8.8943e-01, -8.4054e-01, -9.1078e-01, -8.1180e-01,
         1.2037e+00,  1.1838e+00, -6.8467e-01,  1.3043e-01,  1.4297e+00,
         8.4812e-01,  1.2296e+00,  1.2645e+00, -1.3474e+00, -4.6647e-01,
        -5.0325e-01, -7.8006e-01, -2.9556e-01,  4.6694e-01, -8.6471e-01,
         1.1513e+00, -8.3127e-02, -5.6439e-01, -5.4749e-01,  2.4472e-01,
        -1.1987e+00, -1.1071e+00, -1.4940e+00, -3.5805e+00,  3.5054e-01,
         1.1822e+00, -1.4338e-01,  9.4363e-01,  1.4257e+00,  5.7719e-01,
         1.1155e+00, -1.1568e+00, -7.1422e-01, -5.9231e-01, -1.5217e-01,
         1.5964e+00, -8.5672e-01,  1.3706e+00, -7.5333e-01, -9.3702e-01,
         3.4216e-01, -1.8062e-02, -9.4624e-01, -4.6339e-01, -1.2441e+00,
        -2.4588e-01, -1.4186e-02,  9.0451e-01, -9.0349e-01, -8.4130e-01,
        -8.8856e-01,  1.9713e-01,  7.0343e-01,  6.6668e-02, -1.2076e+00,
        -9.1207e-01,  8.8654e-01, -1.5411e-01, -3.1390e-02, -8.8110e-01,
         1.3894e+00,  8.1058e-01,  6.1890e-01,  1.3928e-01,  1.5015e-01,
         6.2773e-01,  5.2557e-01, -6.0964e-01,  1.1134e+00, -6.4656e-01,
         4.3897e-01, -1.0560e-01, -8.1528e-01, -7.9294e-01,  1.1758e-02,
         2.2930e-01, -8.3421e-01, -1.2414e-01,  6.0844e-01,  2.7743e-01,
         4.4911e-01, -6.5487e-01,  1.2584e-01, -6.7680e-01, -1.1650e+00,
        -7.6231e-01,  1.0157e+00,  7.0824e-01,  7.2998e-02, -7.8838e-01,
        -7.8291e-01,  8.3730e-01,  8.5906e-01, -1.1207e+00,  3.9082e-01,
         7.1783e-01,  2.2138e-01, -9.7172e-01,  1.2464e+00, -1.2574e+00,
         3.4886e-01, -4.2288e-01, -4.6324e-01, -9.6657e-01, -6.5494e-01,
         3.6606e-02,  8.7073e-01,  1.4635e+00, -1.1964e+00, -9.0294e-01,
         2.8447e-01, -1.4719e+00, -1.9801e-01, -7.5471e-01, -4.4003e-01,
         5.0558e-01,  1.0695e+00, -2.0375e-01,  2.6232e-01,  1.0458e+00,
        -7.8943e-01,  2.9110e-01, -4.2868e-01,  5.1662e-04,  7.1790e-01,
         2.8826e-01,  3.6638e-01,  3.5106e-01, -2.6241e-01,  5.8454e-02,
         2.4675e-01, -3.5126e-01,  8.2303e-01,  7.0827e-01,  5.1579e-01,
         6.3620e-01,  6.3905e-01,  1.0002e+00, -2.7073e-01,  1.2566e-01,
         7.3826e-02,  7.0752e-01,  8.8663e-01,  1.8877e-02, -6.4460e-01,
        -7.2626e-01, -4.3245e-01,  2.9192e-01, -2.4789e-02, -5.5305e-01,
         8.3379e-01,  1.0214e-02, -7.8934e-01, -4.7006e-01,  8.4025e-01,
        -3.2540e-01, -5.3938e-02, -6.4136e-01, -4.4979e-01, -2.0078e-02,
        -3.9118e-01, -1.1196e+00, -7.6324e-01, -7.3820e-03,  5.5478e-01,
         4.8309e-02, -5.1268e-01, -1.5172e-01, -5.8643e-01,  1.2819e-01,
         2.3156e-01,  1.3604e-01, -5.1435e-01, -2.1252e-01,  2.9178e-01,
        -9.6110e-01,  2.3519e-01, -5.3226e-01,  4.5818e-01, -6.5265e-01,
         2.3695e-02,  4.7478e-01, -1.1026e-01,  1.0374e+00,  6.8898e-01,
        -6.5905e-01,  8.3337e-01,  3.8156e-01, -3.0748e-01, -1.1250e+00,
         8.5038e-01, -2.1900e-01, -1.5465e-01, -1.0664e+00,  1.8646e-01,
        -8.0464e-01, -2.3705e-01,  5.1165e-01, -8.5804e-01, -1.1827e-01,
         1.0432e+00,  5.5366e-01, -1.2645e+00, -7.6407e-02, -2.3647e-01,
         9.5911e-01, -1.0385e+00,  6.9627e-01, -3.1721e-02, -3.7161e-01,
         8.4316e-03, -6.1770e-01, -3.5963e-01, -8.2659e-01, -6.2814e-02,
        -4.9534e-01,  1.0803e-01, -8.3316e-01,  9.2093e-02, -4.6147e-01,
        -6.0965e-02,  5.5664e-01,  6.5784e-01, -1.5682e+00, -3.8237e-02,
         6.4389e-01, -3.5463e-01, -5.1613e-01, -3.1236e-01, -2.3338e-01,
        -6.3333e-02, -8.1927e-01, -8.9823e-01, -2.0185e+00, -1.4186e+00,
        -1.0503e-01,  4.2784e-01,  1.1059e+00, -7.7548e-01, -1.7601e-01,
        -1.9003e-01,  1.5087e-01,  1.0027e+00, -1.0793e-01, -7.0910e-01,
         6.4077e-01,  1.2628e+00,  1.6852e-01,  1.0921e+00,  3.0727e-02,
        -5.8835e-01,  3.9650e-01,  4.5892e-01, -8.2997e-01,  8.4672e-02,
        -7.3598e-01, -9.9018e-01, -8.3517e-01, -6.9301e-01,  1.2126e-01,
        -9.2381e-01,  6.5063e-01,  3.7096e-01,  8.6223e-01,  4.4922e-01])
Token[3]: "一个"
tensor([-0.0737,  0.5082, -0.6935, -0.0789,  0.3819, -0.2839, -0.5039,  0.0415,
         0.1331, -0.3137,  0.6677, -0.1970,  0.9833, -0.6265,  0.0319, -0.4018,
         0.5840,  0.3428, -1.0544, -0.6864, -0.5543,  0.7148,  0.5921, -0.1479,
         0.4243, -0.4255,  0.5889,  0.0828,  0.6716,  0.2270,  0.8467, -0.3782,
        -0.3355, -0.5189, -0.7975,  0.7559,  1.2961, -0.6540,  0.0932,  1.1062,
         0.7401,  0.8103,  0.6879, -0.7346, -0.5576, -0.1967, -0.6064, -0.2394,
        -0.0572, -0.7842,  0.6955, -0.1724, -0.6147, -0.8001,  0.4576, -0.6661,
        -0.9749, -1.3075, -1.7487,  0.3943,  0.7408, -0.1462,  0.6588,  0.7658,
         0.2726,  0.9435, -0.6261, -0.2634, -0.3476, -0.2783,  1.0468, -0.3697,
         0.7806, -0.3242, -0.7213, -0.0714,  0.1008, -0.9476, -0.1250, -0.8386,
        -0.2752,  0.0817,  0.3290, -0.5047, -0.5321, -0.8082,  0.1103,  0.6936,
        -0.0974, -0.9505, -0.5493,  0.7136, -0.0415, -0.1809, -0.4456,  1.1020,
         0.3636,  0.6953, -0.2080, -0.1619,  0.1121,  0.2361, -0.3169,  0.4728,
        -0.6969,  0.0054,  0.2143, -0.7620, -0.6029, -0.1606, -0.1066, -0.2429,
         0.0950,  0.3300,  0.0370,  0.0145, -0.3844,  0.0283, -0.5727, -0.5766,
        -0.5473,  0.4859,  0.1930,  0.2058, -0.6213, -0.4933,  0.1494,  0.3724,
        -0.8091,  0.1388,  0.5616,  0.1015, -0.7299,  0.8991, -0.9057,  0.1638,
        -0.0946, -0.2198, -0.7337, -0.4277,  0.0348,  0.7408,  0.8802, -0.8790,
        -0.5132,  0.3095, -1.1776,  0.1631, -0.4547, -0.0234,  0.3326,  0.4331,
        -0.1048,  0.2179,  0.7793, -0.5711,  0.3321,  0.0602, -0.3515,  0.4355,
         0.3082,  0.3437,  0.3638,  0.2393,  0.0977, -0.0503, -0.1972,  0.4905,
         0.6600,  0.3427,  0.3623,  0.2477,  1.0571, -0.2453, -0.0974,  0.1054,
         0.5306,  0.7143, -0.0602, -0.1745, -0.4580, -0.1708,  0.0271,  0.0232,
        -0.1978,  0.1574, -0.0032, -0.3223, -0.4930,  0.7378, -0.1691, -0.2887,
        -0.3597, -0.4060, -0.0719, -0.2593, -0.8465, -0.3307, -0.1629,  0.3394,
         0.0405, -0.4944, -0.1151, -0.4372,  0.2526, -0.1608,  0.3314, -0.6572,
        -0.0946,  0.2913, -0.6903,  0.2566, -0.0576,  0.0383, -0.4663,  0.1108,
         0.1470,  0.1758,  0.6486,  0.2232, -0.3011,  0.3376,  0.3911,  0.1646,
        -0.4874,  0.9991, -0.1202, -0.3505, -0.6318,  0.1634, -0.6370, -0.2258,
         0.4027, -0.4303, -0.1014,  0.7300,  0.4224, -0.6519,  0.2385, -0.0709,
         0.5122, -0.5937,  0.5370, -0.5672, -0.0740,  0.1552, -0.5209, -0.1279,
        -0.4684,  0.3429, -0.4396,  0.0586, -0.7579,  0.2429, -0.3678,  0.1680,
         0.5940,  0.2853, -1.1939, -0.1144,  0.0490, -0.2518, -0.2717, -0.4434,
        -0.2550, -0.2255, -0.8746, -0.8969, -0.9644, -1.0754, -0.5249,  0.1892,
         1.0731, -0.8210,  0.1116,  0.1603,  0.0713,  0.4434,  0.3396, -0.5897,
         0.5703,  1.1991, -0.0383,  0.5015,  0.0743, -0.3986, -0.0771,  0.0730,
        -0.3178,  0.5026, -0.7956, -0.7972, -0.4207, -0.4523,  0.2691, -0.0433,
         0.7785,  0.2873,  0.7858,  0.2248])
Token[4]: "好"
tensor([-9.8929e-02,  1.2491e+00, -8.4758e-01,  5.3167e-01,  1.0574e+00,
        -5.2330e-01, -6.3454e-01,  6.6102e-02,  2.8577e-01, -2.8122e-01,
         1.0163e+00,  9.8894e-02,  1.4946e+00, -1.1385e+00,  3.3889e-01,
        -7.4407e-01,  1.1756e+00, -4.6333e-01, -1.3732e+00, -9.0615e-01,
        -1.1597e+00,  1.2152e+00,  1.4630e+00, -9.5854e-02,  9.3928e-01,
        -2.7229e-01,  1.3321e+00, -4.2433e-02,  8.4669e-01,  3.3059e-01,
         1.0210e+00, -1.1766e+00, -1.1104e+00, -7.9354e-01, -9.3404e-01,
         1.0778e+00,  1.3256e+00, -8.2726e-01,  3.6350e-01,  1.4105e+00,
         7.5099e-01,  1.2105e+00,  1.0801e+00, -1.4571e+00, -4.3122e-01,
        -5.5185e-01, -8.6859e-01, -2.4268e-01,  6.3586e-01, -1.0281e+00,
         1.2762e+00, -4.9458e-01, -4.5010e-01, -4.6716e-01,  2.1048e-01,
        -1.1757e+00, -1.1276e+00, -1.5405e+00, -4.3190e+00,  4.7489e-01,
         1.3342e+00, -1.8817e-01,  8.5426e-01,  1.2849e+00,  3.4079e-01,
         7.8430e-01, -9.6167e-01, -5.7700e-01, -5.0483e-01, -4.4884e-01,
         1.4872e+00, -6.9987e-01,  1.5548e+00, -8.7175e-01, -7.7090e-01,
         7.0658e-01,  5.1195e-02, -1.0701e+00, -6.4907e-01, -1.3436e+00,
        -3.7279e-01,  2.2611e-01,  1.1259e+00, -7.7517e-01, -8.9178e-01,
        -1.2251e+00,  2.6167e-01,  6.9696e-01,  2.2853e-01, -1.0625e+00,
        -6.8215e-01,  6.7655e-01, -2.3570e-01, -3.6551e-01, -1.1400e+00,
         1.3774e+00,  7.5486e-01,  9.9982e-01,  7.5766e-02,  8.7429e-02,
         6.8929e-01,  9.4955e-01, -5.4822e-01,  1.1555e+00, -6.5018e-01,
         9.5104e-01, -3.0029e-02, -6.3037e-01, -1.1969e+00,  1.8153e-01,
         5.4985e-01, -8.9267e-01, -1.9155e-01,  6.1675e-01,  2.0949e-01,
         6.4118e-01, -7.3601e-01,  2.1812e-01, -9.4216e-01, -1.0408e+00,
        -9.5387e-01,  8.3657e-01,  7.9016e-01, -2.0435e-01, -7.4658e-01,
        -1.2543e+00,  1.0448e+00,  8.6996e-01, -1.0093e+00,  4.6790e-01,
         9.8485e-01,  3.6014e-01, -8.9233e-01,  1.0195e+00, -1.4158e+00,
         2.4657e-01, -5.8987e-01, -6.0092e-01, -1.0628e+00, -4.0283e-01,
         2.9971e-02,  8.7130e-01,  1.2866e+00, -1.2866e+00, -9.3108e-01,
         1.8117e-01, -1.4708e+00, -5.1765e-02, -7.2803e-01, -2.6360e-01,
         8.2880e-01,  1.0917e+00, -3.2472e-01,  1.7605e-01,  1.1482e+00,
        -9.1457e-01,  4.5168e-01, -4.1548e-01, -2.9865e-01,  8.7758e-01,
         5.7283e-01,  4.0329e-01,  4.1626e-01, -4.9962e-03, -3.0370e-01,
         2.3634e-01, -9.3532e-02,  1.2920e+00,  5.0149e-01,  4.6592e-01,
         6.0240e-01,  7.2581e-01,  1.3470e+00, -1.8658e-01,  6.0634e-02,
        -8.4577e-02,  7.9268e-01,  1.0364e+00, -1.8741e-01, -3.8072e-01,
        -6.4939e-01, -6.9535e-01,  3.1566e-01,  7.5466e-02, -4.4083e-01,
         8.4603e-01, -1.2895e-01, -7.4151e-01, -6.2922e-01,  9.0151e-01,
        -4.1163e-01, -2.1922e-01, -6.0280e-01, -8.4718e-01, -4.3876e-02,
        -8.6079e-01, -9.4760e-01, -8.7113e-01,  6.8971e-02,  5.3052e-01,
        -2.1489e-01, -5.0863e-01,  3.2898e-02, -8.4323e-01, -1.0329e-01,
         1.7324e-01,  4.5164e-01, -5.0305e-01,  2.7276e-02,  3.8356e-01,
        -1.0494e+00,  2.5880e-01, -5.8057e-01,  5.2955e-01, -9.6895e-01,
         4.0608e-01,  4.5989e-01, -3.9216e-01,  1.1082e+00,  7.3078e-01,
        -4.8874e-01,  1.0780e+00,  8.8200e-02, -3.1825e-04, -1.0974e+00,
         8.8000e-01, -3.7661e-01, -5.4520e-01, -1.0127e+00,  5.8841e-02,
        -9.7452e-01, -8.1897e-01,  4.0137e-01, -6.5215e-01, -1.9860e-01,
         1.2709e+00,  5.3455e-01, -1.2475e+00, -2.9769e-01, -5.5092e-01,
         7.9281e-01, -6.9952e-01,  7.4329e-01, -6.3450e-02, -3.0502e-01,
         8.0636e-02, -5.2637e-01, -3.3847e-01, -7.7989e-01,  2.5192e-01,
        -2.0292e-01, -1.5657e-01, -1.1926e+00,  3.0141e-01, -5.3737e-01,
        -1.6424e-01,  4.7523e-01,  5.2591e-01, -1.7125e+00,  7.9079e-03,
         5.0052e-01, -3.5621e-01, -3.8273e-01,  2.6603e-02, -4.5296e-01,
        -3.0302e-01, -8.4407e-01, -1.0795e+00, -2.1210e+00, -9.9769e-01,
         6.3477e-03,  4.1831e-01,  1.0916e+00, -6.7977e-01, -2.6044e-01,
        -4.0029e-01, -1.9433e-01,  8.1853e-01,  9.3558e-03, -7.8074e-01,
         6.8278e-01,  1.7029e+00,  1.8010e-01,  1.0896e+00, -1.8944e-01,
        -1.6505e-01,  6.6493e-01,  5.8992e-01, -1.1811e+00,  4.4594e-02,
        -8.6748e-01, -1.3192e+00, -6.1864e-01, -6.5932e-01, -9.3218e-02,
        -1.0111e+00,  7.4332e-02,  9.3045e-01,  1.1787e+00, -4.2524e-02])
Token[5]: "工具"
tensor([ 7.7626e-02,  3.4500e-01, -8.4805e-01,  6.4822e-01,  8.4264e-01,
        -1.5225e-01, -5.6491e-01,  7.3676e-02,  2.8573e-01, -3.7650e-01,
         7.1349e-01,  7.4695e-02,  1.0345e+00, -8.3189e-01,  2.7686e-01,
        -3.8717e-01,  5.5416e-01, -2.2696e-01, -7.5008e-01, -3.8934e-01,
        -7.8667e-01,  7.1259e-01,  8.0246e-01,  1.9120e-01,  6.8034e-01,
        -2.1005e-01,  7.8897e-01,  2.9680e-02,  4.6992e-01,  1.0709e-01,
         6.7642e-01, -5.0523e-01, -8.2933e-01, -1.9451e-01, -4.4002e-01,
         7.7130e-01,  8.2206e-01, -4.6257e-01,  2.5625e-01,  1.1233e+00,
         6.1146e-01,  6.2149e-01,  6.3909e-01, -1.0478e+00, -3.1095e-01,
        -3.7674e-01, -3.0834e-01, -1.0151e-01,  2.0864e-01, -6.2303e-01,
         9.2810e-01, -5.5475e-01, -2.2680e-01, -4.8252e-01,  2.0128e-01,
        -7.6970e-01, -8.0237e-01, -9.6166e-01, -2.5902e+00,  3.4217e-01,
         7.6201e-01,  2.2369e-01,  7.2643e-01,  8.1148e-01,  4.1677e-01,
         7.6627e-01, -7.6495e-01, -6.5571e-01, -5.0893e-01,  7.2132e-04,
         8.7514e-01, -3.1716e-01,  7.0562e-01, -5.0139e-01, -5.0363e-01,
         9.8056e-02,  9.5462e-02, -3.7544e-01, -5.1002e-01, -1.0472e+00,
        -1.8728e-01,  1.9773e-02,  9.3435e-01, -2.1193e-01, -7.1982e-01,
        -6.3043e-01,  3.9934e-01,  3.1178e-01,  2.8568e-01, -2.9688e-01,
        -4.2226e-01,  4.1676e-01, -3.2123e-01, -6.7405e-02, -6.3319e-01,
         1.0140e+00,  6.4152e-01,  3.5480e-01, -1.5926e-01,  2.8371e-01,
         4.9172e-01,  3.4924e-01, -3.7351e-01,  7.2828e-01, -6.3493e-01,
         6.4960e-01,  4.2411e-02, -6.6279e-01, -6.1671e-01,  1.1741e-01,
        -3.3209e-02, -3.5541e-01, -2.1580e-01,  1.2344e-01,  6.5102e-02,
         2.8030e-01, -3.7041e-01,  8.7519e-02, -2.9778e-01, -7.4646e-01,
        -4.6600e-01,  7.0127e-01,  7.1335e-01, -1.6944e-01, -6.1457e-01,
        -3.8265e-01,  6.4486e-01,  5.8761e-01, -7.5519e-01,  5.6657e-02,
         8.4512e-01,  3.8494e-01, -5.9174e-01,  6.1456e-01, -9.0165e-01,
         4.1715e-01, -2.1881e-01, -7.0829e-03, -5.1822e-01, -3.5971e-01,
         1.0889e-01,  5.2257e-01,  1.1090e+00, -1.1411e+00, -3.4349e-01,
         2.2106e-01, -1.3681e+00, -9.8239e-02, -9.5574e-01, -1.0046e-01,
         3.8280e-01,  4.3341e-01, -5.4691e-01,  7.9324e-02,  4.5687e-01,
        -6.1989e-01,  2.8172e-01, -1.7621e-01, -1.9494e-01,  4.9175e-01,
         1.8155e-01,  2.0121e-01,  3.5332e-01, -1.0972e-01, -2.4886e-01,
        -6.8137e-03,  1.7329e-02,  6.3899e-01,  5.2208e-01,  1.5924e-01,
         4.9881e-01,  3.2426e-02,  1.0455e+00, -1.0736e-01, -9.9695e-02,
         1.1880e-02,  1.6766e-01,  4.9924e-01, -4.2403e-01, -5.1307e-01,
        -5.0723e-01, -7.0273e-01,  1.2938e-01,  2.0587e-01, -2.4804e-01,
         4.3666e-01, -3.2109e-01, -4.9457e-01, -7.0252e-01,  4.8740e-01,
        -2.5155e-01, -1.2607e-01, -5.1706e-01, -1.3906e-01, -3.1923e-01,
        -5.9507e-01, -5.9941e-01, -5.9291e-01,  8.3765e-02,  2.2275e-01,
         5.0059e-02, -2.0300e-01, -1.8935e-01, -7.1984e-01,  3.8764e-01,
         1.5202e-01,  2.4962e-01,  5.9814e-02,  5.6661e-02,  4.8815e-01,
        -1.1389e+00,  3.6501e-01, -3.7702e-01,  3.8402e-01, -6.4953e-01,
         2.4076e-01,  2.2848e-01, -1.0029e-01,  3.6977e-01,  6.5855e-01,
        -5.9567e-01,  5.0346e-01,  4.5110e-01, -2.2932e-01, -7.9340e-01,
         7.2282e-01, -1.1029e-01, -2.9329e-01, -8.3050e-01, -1.7477e-01,
        -7.8086e-01, -3.5417e-01,  3.3621e-01, -6.3124e-01, -4.5237e-01,
         1.2538e+00,  5.0721e-01, -4.7086e-01, -3.9326e-01, -3.0356e-01,
        -2.5768e-02, -8.8868e-01,  2.1007e-01, -2.5571e-01, -4.9784e-01,
         1.9029e-01, -7.6648e-01, -5.4100e-01, -3.0252e-01,  5.7354e-01,
        -4.8085e-01, -3.4596e-03, -9.9778e-01,  6.6623e-02, -4.7766e-01,
        -1.2646e-01,  1.8771e-01,  1.5432e-01, -1.0605e+00, -1.3565e-01,
         5.2770e-01, -7.4257e-01, -1.7400e-01,  2.8087e-01, -4.0230e-01,
         3.1949e-02, -5.6044e-01, -6.3482e-01, -1.6851e+00, -1.0976e+00,
        -5.4208e-01,  3.5632e-01,  8.1533e-01, -3.7334e-01,  6.0636e-02,
        -3.8516e-01, -3.2269e-02,  4.6939e-01,  1.1107e-01, -6.4306e-01,
         4.4179e-01,  1.1898e+00,  6.1829e-02, -1.0223e-01,  4.0794e-02,
         2.7965e-03, -1.9893e-01,  3.7425e-01, -6.7358e-01,  1.4465e-02,
        -7.2238e-01, -1.0501e+00, -6.1100e-01, -7.7291e-01,  3.6434e-01,
        -6.0364e-01,  2.5729e-01,  1.8586e-01,  1.1608e+00,  4.6043e-02])

As can be seen, the vectors produced by word embedding can be retrieved very conveniently. If a word is not included in the pre-trained model provided by FastText, its word vector will be all zeros.

from flair.embeddings import WordEmbeddings

sentence = Sentence("人工智能")
embedding.embed(sentence)
for token in sentence:
    print(token)
    print(token.embedding)
Token[0]: "人工智能"
tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

The Chinese word embedding vectors provided by FastText have a length of 300. In general, word vectors trained on a large-scale corpus perform better as features than vectors you train yourself on a small corpus, which makes them very convenient for feature extraction in natural language processing.
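As a quick illustrative check of that dimensionality (assuming the embedding object loaded with WordEmbeddings('zh') above):

sentence = Sentence("机器 学习")
embedding.embed(sentence)
print(sentence.tokens[0].embedding.shape)  # expected: torch.Size([300])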

In addition, Flair’s BERT embeddings also provide a word embedding model that supports Chinese. This is a pre-trained model released by Google in 2018. You can load and use it with the following code:

from flair.embeddings import BertEmbeddings

embedding = BertEmbeddings('bert-base-chinese')

The corpus used by Google’s BERT pre-trained model is larger and the results are better. However, because the model is larger, the word embedding vectors have a length of 3072, which consumes considerably more resources. It is recommended to use BertEmbeddings only if you have sufficient computing power.
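The loaded BERT embedding is applied in the same way as WordEmbeddings. The following is an illustrative sketch that was not run in the original experiment; it embeds a pre-segmented Chinese sentence and prints each vector's length:

sentence = Sentence("机器 学习 是 一个 好 工具")
embedding.embed(sentence)  # `embedding` is the BertEmbeddings instance loaded above

for token in sentence:
    print(token, len(token.embedding))  # the text above states a length of 3072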

As mentioned before, Flair builds a complete framework on top of PyTorch, which is very convenient for tasks such as text classification. Therefore, below we use Flair to complete the fake news classification task from the earlier experiment.

First, download the data and stopwords:

wget -nc "https://cdn.aibydoing.com/aibydoing/files/wsdm_mini.csv"  # fake news data
wget -nc "https://cdn.aibydoing.com/aibydoing/files/stopwords.txt"  # stop-word dictionary
File “wsdm_mini.csv” already exists; not retrieving.

File “stopwords.txt” already exists; not retrieving.

There are not many changes in the preprocessing part. First, merge the two columns of text data.

import pandas as pd

df = pd.read_csv("wsdm_mini.csv")
df["text"] = df[["title1_zh", "title2_zh"]].apply(
    lambda x: "".join(x), axis=1
)  # merge the text columns
data = df.drop(df.columns[[0, 1]], axis=1)  # drop the original text columns
data.head()
label text
0 disagreed 千叶湖八岁孩子不想去学英语,跳楼了「辟谣」千叶湖八岁孩子跳楼了为谣言信息
1 agreed 喝酸奶真的能补充益生菌吗?喝酸奶来补充益生菌,靠谱么?
2 agreed 刚刚马云终于出手了!房价要跌,扬言房地产中介都要失业了最新消息马云终于出手了!扬言房地产中介...
3 unrelated 直击“冯乡长”李正春追悼会:赵本山全程操办,赵四刘能现场祭奠昆明会议直击“活摘”谣言
4 disagreed 李雨桐爆薛之谦离婚内幕,说到底就是网红之间的恩怨情仇嘛薛之谦前女友李雨桐再次发微博爆料,薛之...

Flair provides a very high-level API for text classification, so the data needs to be processed into the format that this API expects. In particular, the __label__ prefix must be added to every label.

data["label"] = "__label__" + data["label"].astype(str)
data.head()
label text
0 __label__disagreed 千叶湖八岁孩子不想去学英语,跳楼了「辟谣」千叶湖八岁孩子跳楼了为谣言信息
1 __label__agreed 喝酸奶真的能补充益生菌吗?喝酸奶来补充益生菌,靠谱么?
2 __label__agreed 刚刚马云终于出手了!房价要跌,扬言房地产中介都要失业了最新消息马云终于出手了!扬言房地产中介...
3 __label__unrelated 直击“冯乡长”李正春追悼会:赵本山全程操办,赵四刘能现场祭奠昆明会议直击“活摘”谣言
4 __label__disagreed 李雨桐爆薛之谦离婚内幕,说到底就是网红之间的恩怨情仇嘛薛之谦前女友李雨桐再次发微博爆料,薛之...

Next, we perform word segmentation and stop-word removal, using the Jieba Chinese tokenizer. Note that after segmentation the words must be joined with spaces so that Flair can recognize the text as a Sentence.

import jieba
from tqdm.notebook import tqdm


def load_stopwords(file_path):
    with open(file_path, "r") as f:
        stopwords = [line.strip("\n") for line in f.readlines()]
    return stopwords


stopwords = load_stopwords("stopwords.txt")

corpus = []
for line in tqdm(data["text"]):
    words = []
    seg_list = list(jieba.cut(line))  # tokenize with jieba
    for word in seg_list:
        if word in stopwords:  # skip stop words
            continue
        words.append(word)
    corpus.append(" ".join(words))

data["text"] = corpus  # 将结果赋值到 DataFrame
data.head()
label text
0 __label__disagreed 千叶 湖 八岁 孩子 不想 去学 英语 跳楼 「 辟谣 千叶 湖 八岁 孩子 跳楼 谣言 信息
1 __label__agreed 喝 酸奶 真的 补充 益生菌 喝 酸奶 补充 益生菌 谱
2 __label__agreed 刚刚 马云 终于 出手 房价 跌 扬言 房地产 中介 失业 最新消息 马云 终于 出手 扬言...
3 __label__unrelated 直击 冯 乡长 李正春 追悼会 赵本山 全程 操办 赵四 刘能 现场 祭奠 昆明 会议 直击...
4 __label__disagreed 李雨桐 爆 薛之谦 离婚 内幕 说到底 网红 之间 恩怨 情仇 薛之谦 前女友 李雨桐 发微...

Next, we split the dataset into two parts: a training set and a test set. Flair’s text classification API also supports a validation set, but for simplicity we do not create one here.

data.iloc[0 : int(len(data) * 0.8)].to_csv(
    "train.csv", sep="\t", index=False, header=False
)
data.iloc[int(len(data) * 0.8) :].to_csv(
    "test.csv", sep="\t", index=False, header=False
)

It is worth noting that the datasets must be stored as separate CSV files in the format required by the Flair API so they can be loaded later. We set index=False, header=False to drop the index column and the column names, and use \t to separate the label from the text.
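To double-check the exported format (a __label__-prefixed label, a tab, then the space-separated text), you can peek at the first line of the file. A small illustrative sketch:

with open("train.csv", "r") as f:
    print(f.readline().strip())  # a label, a tab, then the segmented text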

Next, we use flair.datasets.ClassificationCorpus to load the processed corpus data.

from flair.datasets import ClassificationCorpus
from pathlib import Path

corpus = ClassificationCorpus(Path("./"), test_file="test.csv", train_file="train.csv")
corpus
2023-11-14 10:48:59,383 Reading data from .
2023-11-14 10:48:59,384 Train: train.csv
2023-11-14 10:48:59,385 Dev: None
2023-11-14 10:48:59,386 Test: test.csv
2023-11-14 10:48:59,810 No dev split found. Using 0% (i.e. 1200 samples) of the train split as dev data
2023-11-14 10:48:59,810 Initialized corpus . (label type name is 'class')
<flair.datasets.document_classification.ClassificationCorpus at 0x2a7e95ba0>

For example, you can preview the training corpus through corpus.train, which is also a convenient way to check that it loaded properly. The signs of proper loading are that the Chinese words are separated by spaces and that the expected number of Tokens is recognized for each Sentence.

corpus.train[0]  # load the first training sample
Sentence[17]: "千叶 湖 八岁 孩子 不想 去学 英语 跳楼 「 辟谣 千叶 湖 八岁 孩子 跳楼 谣言 信息" → disagreed (1.0)

Next, we perform the word embedding operation on the corpus. First, however, we need to introduce a concept in Flair called document embeddings. We have actually seen this idea in previous text classification experiments: there, we simply summed the embedding vectors of the words in a piece of text and used the result as the feature of the entire text, which is the same concept as document embeddings here. Flair simply offers more sophisticated document embedding methods.
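Before turning to Flair's built-in methods, the "aggregate the word vectors" idea described above can be sketched directly with PyTorch. This is an illustrative snippet, not code from the experiment:

import torch
from flair.data import Sentence
from flair.embeddings import WordEmbeddings

embedding = WordEmbeddings("zh")  # Chinese FastText word vectors
sentence = Sentence("机器 学习 是 一个 好 工具")
embedding.embed(sentence)

# Stack the 300-dimensional token vectors and average them into one document vector.
token_vectors = torch.stack([token.embedding for token in sentence])
doc_vector = token_vectors.mean(dim=0)
print(doc_vector.shape)  # torch.Size([300])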

Flair itself provides a document embedding method called Pooling, which takes the mean, maximum, or minimum of the word embedding vectors as the embedding vector of the entire text. As shown in the example code below, Flair also supports passing several word embedding methods into the document embedding method as a list, allowing flexible combinations.

from flair.embeddings import WordEmbeddings, FlairEmbeddings, DocumentPoolEmbeddings

# initialize the word embeddings
glove_embedding = WordEmbeddings('glove')
flair_embedding_forward = FlairEmbeddings('news-forward')
flair_embedding_backward = FlairEmbeddings('news-backward')

# initialize the document embeddings; the default pooling mode is 'mean'
document_embeddings = DocumentPoolEmbeddings([glove_embedding,
                                              flair_embedding_backward,
                                              flair_embedding_forward])
document_embeddings.embed(sentence)

However, here we only use one word embedding method, WordEmbeddings(), to keep things fast. Finally, we use DocumentRNNEmbeddings, an RNN-based document embedding method, to obtain the embedding vector of the text. DocumentRNNEmbeddings builds a simple RNN whose inputs are the word embedding vectors and whose output is taken as the document embedding.

from flair.embeddings import WordEmbeddings, DocumentRNNEmbeddings

word_embeddings = [WordEmbeddings("zh")]  # word embeddings
document_embeddings = DocumentRNNEmbeddings(
    word_embeddings,
    hidden_size=512,
    reproject_words=True,
    reproject_words_dimension=256,
)  # document embeddings

Next, we can use the text classification API provided by Flair to build a text classifier and complete the training.

from flair.models import TextClassifier
from flair.trainers import ModelTrainer

# initialize the classifier
classifier = TextClassifier(
    document_embeddings,
    label_dictionary=corpus.make_label_dictionary(label_type='class'),
    multi_label=False,
    label_type='class'
)
# train the classifier
trainer = ModelTrainer(classifier, corpus)
trainer.train("./", max_epochs=1)  # the model and training logs are written to the current directory
2023-11-14 10:53:16,623 Computing label dictionary. Progress:
2023-11-14 10:53:17,566 Dictionary created for label 'class' with 3 values: disagreed (seen 3628 times), unrelated (seen 3596 times), agreed (seen 3576 times)
2023-11-14 10:53:17,579 ----------------------------------------------------------------------------------------------------
2023-11-14 10:53:17,579 Model: "TextClassifier(
  (embeddings): DocumentRNNEmbeddings(
    (embeddings): StackedEmbeddings(
      (list_embedding_0): WordEmbeddings(
        'zh'
        (embedding): Embedding(332647, 300)
      )
    )
    (word_reprojection_map): Linear(in_features=300, out_features=256, bias=True)
    (rnn): GRU(256, 512, batch_first=True)
    (dropout): Dropout(p=0.5, inplace=False)
  )
  (decoder): Linear(in_features=512, out_features=3, bias=True)
  (dropout): Dropout(p=0.0, inplace=False)
  (locked_dropout): LockedDropout(p=0.0)
  (word_dropout): WordDropout(p=0.0)
  (loss_function): CrossEntropyLoss()
  (weights): None
  (weight_tensor) None
)"
2023-11-14 10:53:17,579 ----------------------------------------------------------------------------------------------------
2023-11-14 10:53:17,580 Corpus: 10800 train + 1200 dev + 3000 test sentences
2023-11-14 10:53:17,580 ----------------------------------------------------------------------------------------------------
2023-11-14 10:53:17,580 Train:  10800 sentences
2023-11-14 10:53:17,580         (train_with_dev=False, train_with_test=False)
2023-11-14 10:53:17,581 ----------------------------------------------------------------------------------------------------
2023-11-14 10:53:17,581 Training Params:
2023-11-14 10:53:17,581  - learning_rate: "0.1" 
2023-11-14 10:53:17,581  - mini_batch_size: "32"
2023-11-14 10:53:17,582  - max_epochs: "1"
2023-11-14 10:53:17,582  - shuffle: "True"
2023-11-14 10:53:17,582 ----------------------------------------------------------------------------------------------------
2023-11-14 10:53:17,582 Plugins:
2023-11-14 10:53:17,583  - AnnealOnPlateau | patience: '3', anneal_factor: '0.5', min_learning_rate: '0.0001'
2023-11-14 10:53:17,583 ----------------------------------------------------------------------------------------------------
2023-11-14 10:53:17,583 Final evaluation on model from best epoch (best-model.pt)
2023-11-14 10:53:17,583  - metric: "('micro avg', 'f1-score')"
2023-11-14 10:53:17,584 ----------------------------------------------------------------------------------------------------
2023-11-14 10:53:17,584 Computation:
2023-11-14 10:53:17,584  - compute on device: cpu
2023-11-14 10:53:17,585  - embedding storage: cpu
2023-11-14 10:53:17,585 ----------------------------------------------------------------------------------------------------
2023-11-14 10:53:17,585 Model training base path: "."
2023-11-14 10:53:17,585 ----------------------------------------------------------------------------------------------------
2023-11-14 10:53:17,586 ----------------------------------------------------------------------------------------------------
2023-11-14 10:53:20,609 epoch 1 - iter 33/338 - loss 1.24147453 - time (sec): 3.02 - samples/sec: 349.38 - lr: 0.100000 - momentum: 0.000000
2023-11-14 10:53:23,036 epoch 1 - iter 66/338 - loss 1.19262065 - time (sec): 5.45 - samples/sec: 387.56 - lr: 0.100000 - momentum: 0.000000
2023-11-14 10:53:25,995 epoch 1 - iter 99/338 - loss 1.17272384 - time (sec): 8.41 - samples/sec: 376.75 - lr: 0.100000 - momentum: 0.000000
2023-11-14 10:53:28,436 epoch 1 - iter 132/338 - loss 1.16206076 - time (sec): 10.85 - samples/sec: 389.33 - lr: 0.100000 - momentum: 0.000000
2023-11-14 10:53:30,844 epoch 1 - iter 165/338 - loss 1.15570651 - time (sec): 13.26 - samples/sec: 398.28 - lr: 0.100000 - momentum: 0.000000
2023-11-14 10:53:33,111 epoch 1 - iter 198/338 - loss 1.14712376 - time (sec): 15.53 - samples/sec: 408.11 - lr: 0.100000 - momentum: 0.000000
2023-11-14 10:53:35,864 epoch 1 - iter 231/338 - loss 1.14621714 - time (sec): 18.28 - samples/sec: 404.42 - lr: 0.100000 - momentum: 0.000000
2023-11-14 10:53:38,292 epoch 1 - iter 264/338 - loss 1.14031862 - time (sec): 20.71 - samples/sec: 408.00 - lr: 0.100000 - momentum: 0.000000
2023-11-14 10:53:40,654 epoch 1 - iter 297/338 - loss 1.13598081 - time (sec): 23.07 - samples/sec: 412.00 - lr: 0.100000 - momentum: 0.000000
2023-11-14 10:53:42,987 epoch 1 - iter 330/338 - loss 1.13634822 - time (sec): 25.40 - samples/sec: 415.73 - lr: 0.100000 - momentum: 0.000000
2023-11-14 10:53:44,147 ----------------------------------------------------------------------------------------------------
2023-11-14 10:53:44,148 EPOCH 1 done: loss 1.1351 - lr: 0.100000
2023-11-14 10:53:44,997 DEV : loss 1.0034687519073486 - f1-score (micro avg)  0.4283
2023-11-14 10:53:45,114  - 0 epochs without improvement
2023-11-14 10:53:45,114 saving best model
2023-11-14 10:53:47,711 ----------------------------------------------------------------------------------------------------
2023-11-14 10:53:47,715 Loading model from best epoch ...
2023-11-14 10:53:51,879 
Results:
- F-score (micro) 0.4107
- F-score (macro) 0.3339
- Accuracy 0.4107

By class:
              precision    recall  f1-score   support

   unrelated     0.3432    0.8278    0.4852       999
   disagreed     0.6864    0.4141    0.5166       978
      agreed     0.0000    0.0000    0.0000      1023

    accuracy                         0.4107      3000
   macro avg     0.3432    0.4140    0.3339      3000
weighted avg     0.3380    0.4107    0.3300      3000

2023-11-14 10:53:51,880 ----------------------------------------------------------------------------------------------------
{'test_score': 0.4106666666666667}

Without GPU support, the above training process takes a long time. Since we only use the FastText WordEmbeddings, the final classification accuracy on the fake news dataset is not ideal. You can choose to terminate the training and continue reading.

After the training is completed, Flair will save the final model final-model.pt and the best model best-model.pt in the current directory for easy reuse later. In addition, some training log files and loss data records will also be saved in the current directory.

If you need to use the saved model for inference, you can use TextClassifier.load to load it.

classifier = TextClassifier.load("./best-model.pt")  # load the best model
sentence = Sentence("千叶 湖 八岁 孩子 不想 去学 英语 跳楼 辟谣 千叶 湖 八岁 孩子 跳楼 谣言 信息")
classifier.predict(sentence)  # model inference
print(sentence.labels)  # print the inference result
['Sentence[16]: "千叶 湖 八岁 孩子 不想 去学 英语 跳楼 辟谣 千叶 湖 八岁 孩子 跳楼 谣言 信息"'/'disagreed' (0.5996)]

Finally, the model can infer and output the category and the corresponding probability.
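If you need the predicted class name and probability as plain Python values, each element of sentence.labels exposes them. A small sketch (attribute names as in recent Flair versions):

top_label = sentence.labels[0]  # the prediction produced above
print(top_label.value)  # e.g. 'disagreed'
print(top_label.score)  # e.g. 0.5996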

75.5. Summary#

In this experiment, we mainly learned two commonly used tools in natural language processing: NLTK and Flair. Beyond these, you can also teach yourself third-party Python libraries such as FastText, spaCy, Pattern, and TextBlob. In fact, most natural language processing tools support English and other Western languages well, but their support for Chinese is less ideal. There are, of course, also natural language processing tools open-sourced by Chinese institutions, such as THULAC from Tsinghua University and LAC from Baidu.
