# 74. Fake News Classification with Deep Learning

## 74.1. Introduction
Deep learning is widely applied in natural language processing, and the earlier section on recurrent neural networks already introduced the relevant background. This challenge asks you to apply that knowledge to improve the accuracy of fake news text classification.
## 74.2. Key Points

- Text Classification
- Deep Neural Networks
In the text classification experiment, we used the WSDM fake news dataset to walk through the classification workflow, but the results were not ideal: test-set accuracy hovered around 65%. In this challenge, you need to combine the data preprocessing techniques from that experiment with the deep learning knowledge covered earlier to classify the fake news data again.
```{exercise-start}
:label: chapter09_05_1
```

**Open-ended Challenge**

Challenge: Use text classification preprocessing and deep learning knowledge to build a deep neural network that classifies the fake news data.

Requirement: Split the provided data into training and test sets at an 8:2 ratio; the final test-set accuracy should be > 70%. You may freely choose the text preprocessing methods, feature extraction techniques, and the structure of the deep neural network.

```{exercise-end}
```
The challenge uses the fake news data provided in the experiment:

```
https://cdn.aibydoing.com/aibydoing/files/wsdm_mini.csv  # Fake news data
```
## Supplementary Code
```{solution-start} chapter09_05_1
:class: dropdown
```
```bash
wget -nc "https://cdn.aibydoing.com/aibydoing/files/wsdm_mini.csv"  # Fake news data
wget -nc "https://cdn.aibydoing.com/aibydoing/files/stopwords.txt"  # Stopword dictionary
```
```python
import pandas as pd

df = pd.read_csv("wsdm_mini.csv")
# Combine the two title columns into a single text column
df['title_zh'] = df[['title1_zh', 'title2_zh']].apply(
    lambda x: ''.join(x), axis=1)
df.head()
```
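Before preprocessing, it is worth checking how balanced the classes are, since a skewed label distribution changes how hard the 70% accuracy target is to reach. A minimal sketch, assuming `df` has been loaded as above and the label column is named `label`:

```python
# Inspect the class balance of the dataset (assumes a 'label' column)
print(df['label'].value_counts(normalize=True))
```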
```python
import jieba
from tqdm.notebook import tqdm  # tqdm_notebook is deprecated in recent tqdm versions


def load_stopwords(file_path):
    """Load a stopword list, one word per line."""
    with open(file_path, 'r') as f:
        stopwords = [line.strip('\n') for line in f.readlines()]
    return stopwords


stopwords = load_stopwords('stopwords.txt')

corpus = []
for line in tqdm(df['title_zh']):
    words = []
    seg_list = list(jieba.cut(line))  # Tokenize
    for word in seg_list:
        if word in stopwords:  # Remove stopwords
            continue
        words.append(word)
    corpus.append(" ".join(words))
```
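As a quick sanity check, you can print the first cleaned document to confirm that tokenization and stopword removal behaved as expected:

```python
# The first combined title after tokenization and stopword removal
print(corpus[0])
```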
```python
import tensorflow as tf

# Keep only the 10,000 most frequent words in the vocabulary
tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=10000)
tokenizer.fit_on_texts(corpus)
X_ = tokenizer.texts_to_sequences(corpus)

# Map the first sequence back to words to check the encoding
for seq in X_[:1]:
    print([tokenizer.index_word[idx] for idx in seq])

# Pad or truncate every sequence to a fixed length of 20
X = tf.keras.preprocessing.sequence.pad_sequences(X_, maxlen=20)
X.shape
```
```python
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder()
# Convert to a dense array so Keras can consume the one-hot labels directly
y_onehot = encoder.fit_transform(df.label.values.reshape(-1, 1)).toarray()
y_onehot
```
```python
from sklearn.model_selection import train_test_split

# 8:2 train/test split, as required by the challenge
X_train, X_test, y_train, y_test = train_test_split(X, y_onehot, test_size=0.2)
```
```python
model = tf.keras.Sequential()
# Embedding: vocabulary of 10,000 words, 16-dimensional vectors, sequences of length 20
model.add(tf.keras.layers.Embedding(10000, 16, input_length=20))
model.add(tf.keras.layers.Flatten())
model.add(tf.keras.layers.Dense(3, activation='softmax'))  # 3 output classes
model.summary()

model.compile(optimizer='Adam', loss='categorical_crossentropy',
              metrics=['accuracy'])
model.fit(X_train, y_train, batch_size=64, epochs=10,
          validation_data=(X_test, y_test))
```
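Since the challenge asks for a test-set accuracy above 70%, it helps to evaluate the trained model explicitly rather than reading the figure off the last validation log line. A minimal sketch using the `model`, `X_test`, and `y_test` defined above:

```python
# Evaluate on the held-out test set; accuracy should exceed the 0.70 target
loss, accuracy = model.evaluate(X_test, y_test, verbose=0)
print(f"Test accuracy: {accuracy:.4f}")
```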
```{solution-end}
```