76. Google BERT Pre-training Technology#
76.1. Introduction#
In 2018, one of the biggest stories in natural language processing was the release of Google BERT. Google describes BERT as the most advanced NLP pre-training technology to date, with support for Chinese and many other languages. In the accompanying paper, BERT achieved state-of-the-art results on 11 NLP tasks, including the Stanford Question Answering Dataset (SQuAD v1.1).
76.2. Key Points#
Google BERT
NLP Pre-training Technology
BERT is an NLP pre-training technology open-sourced by Google. Its full name is Bidirectional Encoder Representations from Transformers.
BERT builds on recent work in pre-training contextual language representations, including Semi-supervised Sequence Learning, Generative Pre-Training, ELMo, and ULMFiT. Unlike these previous models, however, BERT is the first deeply bidirectional, unsupervised language representation model, pre-trained using only an unlabeled text corpus (in this case, Wikipedia).
Because of its large model size and training corpus, BERT can take months to train even on the most powerful commercial GPUs currently available. The good news is that Google has released pre-trained BERT-Base models, including versions that support Chinese and multiple languages. With the help of a pre-trained model, we can fine-tune BERT to complete a variety of NLP tasks with relative ease.
Exercise 76.1
Open Challenge
Challenge: Use the Chinese pre-trained language model released by Google BERT to complete a text classification task on the fake news data. We recommend splitting the provided data into training and test sets at an 8:2 ratio and reporting the accuracy on the test set.
Hint: Carefully read the instructions on the official open-source page and the known issues on the repository’s Issues page to solve the problems you encounter.
The challenge requires the fake news data and the pre-trained model available at the links below:
https://cdn.aibydoing.com/aibydoing/files/wsdm_mini.csv # Fake news data
https://storage.googleapis.com/bert_models/2018_11_03/chinese_L-12_H-768_A-12.zip # Download link provided by Google
76.2.1. Instructions#
This challenge is not suitable for completion in an online Notebook; we recommend working on it locally. Because the challenge is fairly difficult, some instructions are provided here. Note that the example code below is not meant to be executed as-is: read it carefully and adapt it to your own setup.
Here we use the fine-tuning approach to complete text classification with BERT. First, download the pre-trained language model provided by Google. Then clone the official BERT repository so that the model training source code Google provides can be used directly:
# Clone the official BERT repository
git clone https://github.com/google-research/bert.git
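If you prefer to script the setup, the pre-trained Chinese model can also be fetched and unpacked with a few lines of Python. This is a minimal sketch, assuming it is run from the working directory where you will later launch run_classifier.py; the URL is the official download link listed above.
import urllib.request
import zipfile

# Official download link for the Chinese BERT-Base model (see above)
MODEL_URL = "https://storage.googleapis.com/bert_models/2018_11_03/chinese_L-12_H-768_A-12.zip"

# Download the archive into the current directory
urllib.request.urlretrieve(MODEL_URL, "chinese_L-12_H-768_A-12.zip")

# Unpack it; this creates the chinese_L-12_H-768_A-12/ folder containing
# vocab.txt, bert_config.json and the bert_model.ckpt.* files
with zipfile.ZipFile("chinese_L-12_H-768_A-12.zip") as zf:
    zf.extractall(".")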
In the training source code provided by Google, the text classification logic lives in run_classifier.py. It includes example Processors for four benchmark datasets: XnliProcessor, MnliProcessor, MrpcProcessor, and ColaProcessor. To run our own text classification task, we only need to imitate these and write a similar class.
Rewriting the Processor class is straightforward: you simply feed in your own data in the expected format. A good reference is the XnliProcessor class in the source code, since its task is quite similar to this classification task. For example, here we define our own text classification class WSDMProcessor, which contains the following four methods:
# This snippet goes inside run_classifier.py; pandas and scikit-learn are
# extra dependencies that the original script does not import.
import os
import pandas as pd
from sklearn.model_selection import train_test_split


class WSDMProcessor(DataProcessor):
    # Build the training examples
    def get_train_examples(self, data_dir):
        # Path to the training data
        file_path = os.path.join(data_dir, 'train.csv')
        # Read the data with Pandas
        df = pd.read_csv(file_path)
        # Split the data into 80% training set and 20% validation set
        df_train, self.df_dev = train_test_split(df, test_size=0.2)
        examples = []
        # Convert the data into the InputExample format BERT expects
        for index, row in df_train.iterrows():
            guid = 'train-%d' % index  # Unique example index
            text_a = tokenization.convert_to_unicode(str(row[0]))  # Text 1
            text_b = tokenization.convert_to_unicode(str(row[1]))  # Text 2
            label = row[2]  # Text label
            examples.append(InputExample(guid=guid, text_a=text_a,
                                         text_b=text_b, label=label))
        return examples

    # Build the validation examples
    def get_dev_examples(self, data_dir):
        examples = []
        for index, row in self.df_dev.iterrows():
            guid = 'dev-%d' % index
            text_a = tokenization.convert_to_unicode(str(row[0]))
            text_b = tokenization.convert_to_unicode(str(row[1]))
            label = row[2]
            examples.append(InputExample(guid=guid, text_a=text_a,
                                         text_b=text_b, label=label))
        return examples

    # Build the test examples (for prediction)
    def get_test_examples(self, data_dir):
        file_path = os.path.join(data_dir, 'test.csv')
        df_test = pd.read_csv(file_path)
        examples = []
        for index, row in df_test.iterrows():
            guid = 'test-%d' % index
            text_a = tokenization.convert_to_unicode(str(row[0]))
            text_b = tokenization.convert_to_unicode(str(row[1]))
            # Placeholder label; prediction ignores it, but it must be one of
            # the values returned by get_labels()
            label = self.get_labels()[0]
            examples.append(InputExample(guid=guid, text_a=text_a,
                                         text_b=text_b, label=label))
        return examples

    def get_labels(self):
        # Must list the label values that actually appear in your data;
        # ['A', 'B', 'C'] is just a placeholder for a three-class task
        return ['A', 'B', 'C']
You will notice that the three example-building methods are essentially the same; the key is simply to convert your data into the InputExample format that BERT requires. Also note that this implementation stores the validation split on self inside get_train_examples, so validation only works in a run where training is enabled as well.
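The Processor above reads train.csv and test.csv from the data directory, so the downloaded wsdm_mini.csv needs to be split and saved first. Below is a minimal preparation sketch, assuming the first two columns of the file hold the sentence pair and the third holds the label (check the actual file and adjust the column selection if necessary); the ./dataset folder name is simply chosen to match the --data_dir used later.
import os
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the fake news data downloaded from the link above
df = pd.read_csv('wsdm_mini.csv')

# Keep only the columns the Processor reads positionally:
# text 1, text 2 and the label (adjust if your file differs)
df = df.iloc[:, :3]

# 8:2 split into training and test sets, as required by the challenge
df_train, df_test = train_test_split(df, test_size=0.2, random_state=42)

# Write the files that WSDMProcessor expects under the data directory
os.makedirs('dataset', exist_ok=True)
df_train.to_csv(os.path.join('dataset', 'train.csv'), index=False)
df_test.to_csv(os.path.join('dataset', 'test.csv'), index=False)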
Next, we also need to register the custom WSDMProcessor we just created in the processors dictionary inside the main() function:
def main(_):
    tf.logging.set_verbosity(tf.logging.INFO)

    processors = {
        "cola": ColaProcessor,
        "mnli": MnliProcessor,
        "mrpc": MrpcProcessor,
        "xnli": XnliProcessor,
        "self": WSDMProcessor,  # Add the custom Processor
    }
After completing the above work, you can start the text classification task. Launch run_classifier.py with the same kind of command used in the BERT repository examples, for instance (the inline comments below are explanations only and must be removed before actually running the command):
python run_classifier.py \
--task_name=self \ # The key of the custom Processor registered above
--do_train=true \ # Enable model training
--do_eval=true \ # Enable model validation
--do_predict=true \ # Enable model testing
--data_dir=./dataset \ # Data path
--vocab_file=./chinese_L-12_H-768_A-12/vocab.txt \ # Files from the downloaded pre-trained model
--bert_config_file=./chinese_L-12_H-768_A-12/bert_config.json \
--init_checkpoint=./chinese_L-12_H-768_A-12/bert_model.ckpt \
--max_seq_length=128 \ # Model training parameters
--train_batch_size=32 \
--learning_rate=5e-5 \
--num_train_epochs=1.0 \
--output_dir=./dataset/output # Output file path
Note that an option can only be enabled when the corresponding method is defined. For example, if the Processor does not implement the validation-set method, set --do_eval=false. After training completes, the trained model and the test results are saved under the output directory.
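Finally, to report the test-set accuracy required by the challenge, compare the predictions with the true labels yourself. The sketch below is one possible way to do this, assuming the predictions are written to test_results.tsv in the output directory as rows of per-class probabilities ordered like get_labels() (the behavior of the official run_classifier.py), and that the true labels are still available in ./dataset/test.csv as prepared earlier.
import numpy as np
import pandas as pd

# Label order must match what WSDMProcessor.get_labels() returns
labels = ['A', 'B', 'C']

# Per-class probabilities written by run_classifier.py during prediction
probs = pd.read_csv('./dataset/output/test_results.tsv', sep='\t', header=None)

# Take the most probable class for each test example
pred = [labels[i] for i in np.argmax(probs.values, axis=1)]

# True labels from the test split prepared earlier (third column)
true = pd.read_csv('./dataset/test.csv').iloc[:, 2]

# Test-set accuracy
accuracy = np.mean(np.array(pred) == true.values)
print('Test accuracy: %.4f' % accuracy)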
Now give it a try yourself. Carefully read the instructions on the official open-source page and refer to the known issues on the repository's Issues page to solve any problems you encounter.