
74. Fake News Classification with Deep Learning

74.1. Introduction

Deep learning has important applications in natural language processing, some of which were already covered in the earlier sections on recurrent neural networks. This challenge asks you to apply that knowledge to improve the accuracy of fake news text classification.

74.2. Key Points

  • Text Classification

  • Deep Neural Networks

In the earlier experiment, we used the WSDM fake news classification data to walk through the text classification workflow. The results, however, were not ideal: test-set accuracy hovered around 65%. In this challenge, you need to combine the data preprocessing techniques from that experiment with the deep learning knowledge covered previously to classify the fake news data again.

{exercise-start}
:label: chapter09_05_1

Open-ended Challenge

Challenge: Use text classification preprocessing and deep learning knowledge to build a deep neural network to classify fake news data.

Requirement: Split the provided data into training and test sets at an 8:2 ratio; the final test-set accuracy should exceed 70%. You are free to choose the text preprocessing methods, feature extraction techniques, and the structure of the deep neural network.

{exercise-end}

The challenge uses the fake news data provided in the experiment:

https://cdn.aibydoing.com/aibydoing/files/wsdm_mini.csv  # Fake news data
{solution-start} chapter09_05_1
:class: dropdown
wget -nc "https://cdn.aibydoing.com/aibydoing/files/wsdm_mini.csv"  # Fake news data
wget -nc "https://cdn.aibydoing.com/aibydoing/files/stopwords.txt"  # Stopword dictionary
import pandas as pd

df = pd.read_csv("wsdm_mini.csv")
df['title_zh'] = df[['title1_zh', 'title2_zh']].apply(
    lambda x: ''.join(x), axis=1)  # Combine text data columns
df.head()
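
The Dense(3, softmax) output layer used later assumes the label column contains three classes (agreed, disagreed, and unrelated in the WSDM data). A quick check of the class distribution confirms this assumption:

df.label.value_counts()  # Expect three label values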
import jieba
from tqdm.notebook import tqdm  # tqdm_notebook is deprecated; tqdm.notebook.tqdm is the current API

def load_stopwords(file_path):
    # Read the stopword dictionary, one word per line
    with open(file_path, 'r', encoding='utf-8') as f:
        stopwords = [line.strip('\n') for line in f]
    return stopwords

stopwords = set(load_stopwords('stopwords.txt'))  # A set makes membership checks fast

corpus = []
for line in tqdm(df['title_zh']):
    words = []
    seg_list = list(jieba.cut(line))  # Tokenize
    for word in seg_list:
        if word in stopwords:  # Remove stopwords
            continue
        words.append(word)
    corpus.append(" ".join(words))
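
Before fitting the tokenizer, it is worth previewing a couple of cleaned entries to verify that tokenization and stopword removal behaved as expected:

corpus[:2]  # First two cleaned, space-joined titles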
import tensorflow as tf

# Keep only the 10,000 most frequent words in the vocabulary
tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=10000)
tokenizer.fit_on_texts(corpus)
X_ = tokenizer.texts_to_sequences(corpus)

# Map the first sequence back to words as a sanity check
for seq in X_[:1]:
    print([tokenizer.index_word[idx] for idx in seq])

# Pad or truncate every sequence to a fixed length of 20
X = tf.keras.preprocessing.sequence.pad_sequences(X_, maxlen=20)
X.shape
from sklearn.preprocessing import OneHotEncoder

# One-hot encode the labels; .toarray() densifies the sparse output
# so it can be passed to Keras directly
encoder = OneHotEncoder()
y_onehot = encoder.fit_transform(df.label.values.reshape(len(df), -1)).toarray()
y_onehot
from sklearn.model_selection import train_test_split

# Hold out 20% of the samples as the test set (the required 8:2 split)
X_train, X_test, y_train, y_test = train_test_split(X, y_onehot, test_size=0.2)

model = tf.keras.Sequential()
model.add(tf.keras.layers.Embedding(10000, 16, input_length=20))
model.add(tf.keras.layers.Flatten())
model.add(tf.keras.layers.Dense(3, activation='softmax'))

model.summary()

model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])
model.fit(X_train, y_train, batch_size=64, epochs=10,
          validation_data=(X_test, y_test))
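
Since the requirement is test-set accuracy above 70%, a final check on the held-out split makes the result explicit; model.evaluate returns the loss together with the accuracy metric declared at compile time:

# Score the trained model on the 20% test split
loss, acc = model.evaluate(X_test, y_test, verbose=0)
print(f"Test accuracy: {acc:.4f}")  # Should exceed the 0.70 target

If the flat embedding model falls short of the target, one option, sketched here rather than prescribed, is to replace the Flatten layer with an LSTM layer, drawing on the recurrent-network material referenced in the introduction:

# A possible alternative architecture: Embedding -> LSTM -> Dense
model_rnn = tf.keras.Sequential([
    tf.keras.layers.Embedding(10000, 16, input_length=20),
    tf.keras.layers.LSTM(32),  # Recurrent layer in place of Flatten
    tf.keras.layers.Dense(3, activation='softmax')
])
model_rnn.compile(optimizer='adam', loss='categorical_crossentropy',
                  metrics=['accuracy'])
model_rnn.fit(X_train, y_train, batch_size=64, epochs=10,
              validation_data=(X_test, y_test))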
{solution-end}
