
76. Google BERT Pre-training Technology#

76.1. Introduction#

In 2018, one of the biggest developments in natural language processing was the release of Google BERT. Google describes BERT as the most advanced NLP pre-training technology of its time, with support for Chinese and many other languages. In the accompanying paper, BERT achieved state-of-the-art results on 11 NLP tasks, including the Stanford Question Answering Dataset (SQuAD v1.1).

76.2. Key Points#

  • Google BERT

  • NLP Pre-training Technology

BERT is an NLP pre-training technology open-sourced by Google. Its full name is Bidirectional Encoder Representations from Transformers.

BERT builds on recent work on pre-training context-dependent language representations, including Semi-supervised Sequence Learning, Generative Pre-Training, ELMo, and ULMFiT. Unlike previous models, however, BERT is the first deeply bidirectional, unsupervised language representation model pre-trained using only an unlabeled text corpus (in this case, Wikipedia).

Because of its complex model structure and large training corpus, training BERT from scratch can take months even on the most powerful commercial GPUs available today. The good news is that Google has released pre-trained BERT-Base models, including ones for Chinese and for multiple languages. With a pre-trained model, we can complete a variety of NLP tasks with relatively little effort.

Exercise 76.1

Open Challenge

Challenge: Use the Chinese pre-trained language model provided by Google BERT to complete the fake news text classification task. We recommend splitting the provided data 8:2 into training and test sets and reporting the final accuracy on the test set (a splitting sketch follows the download links below).

Hint: Carefully read the instructions on the official open-source page and the known issues on the repository’s Issues page to solve the problems you encounter.

The challenge uses the fake news data and the pre-trained model available from the links below.

https://cdn.aibydoing.com/aibydoing/files/wsdm_mini.csv  # Fake news data
https://storage.googleapis.com/bert_models/2018_11_03/chinese_L-12_H-768_A-12.zip  # Chinese pre-trained model (BERT-Base, Chinese) provided by Google
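Before fine-tuning, the raw wsdm_mini.csv needs to be split into the train.csv and test.csv files used later in this challenge. Below is a minimal sketch with pandas and scikit-learn, assuming the CSV has been downloaded into a local ./dataset folder; the folder name and the 8:2 split are choices made here for illustration, and the column names of the CSV are left untouched.

import os
import pandas as pd
from sklearn.model_selection import train_test_split

# Read the full fake news dataset (wsdm_mini.csv downloaded into ./dataset)
df = pd.read_csv(os.path.join('dataset', 'wsdm_mini.csv'))

# Hold out 20% of the rows as the final test set, as required by the challenge
df_train, df_test = train_test_split(df, test_size=0.2, random_state=42)

# The custom Processor shown later reads these two files
df_train.to_csv(os.path.join('dataset', 'train.csv'), index=False)
df_test.to_csv(os.path.join('dataset', 'test.csv'), index=False)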

76.2.1. Instructions#

This challenge is not suited to an online notebook; it is recommended that you complete it locally. Because the challenge is fairly difficult, some guidance is provided below. Note that the example code is not meant to be executed as-is: read it carefully and adapt it to your setup.

Here, text classification with BERT is done by fine-tuning. First, download the pre-trained language model provided by Google. Then clone the official BERT repository so that you can reuse Google's training source code directly:

# Clone the official BERT repository
git clone https://github.com/google-research/bert.git
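The pre-trained Chinese model from the link above also needs to be downloaded and unpacked, for example on the command line. Placing it next to the cloned repository keeps the paths consistent with the run command shown later.

# Download and unpack the Chinese pre-trained model (BERT-Base, Chinese)
wget https://storage.googleapis.com/bert_models/2018_11_03/chinese_L-12_H-768_A-12.zip
unzip chinese_L-12_H-768_A-12.zip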

In Google's training source code, the text classification logic lives in run_classifier.py. It ships with example processors for 4 benchmark datasets: XnliProcessor, MnliProcessor, MrpcProcessor, and ColaProcessor. To handle our own text classification task, we only need to write a similar Processor class by imitating one of these.

Writing the Processor class is straightforward: you only need to feed in the data in the form your task requires. The XnliProcessor class in the source code is a good reference, since its task is quite similar to this one. For example, here we define our own text classification class WSDMProcessor, which contains the following four methods:

# Add these imports near the top of run_classifier.py; DataProcessor,
# InputExample, and tokenization are already defined or imported there
import pandas as pd
from sklearn.model_selection import train_test_split


class WSDMProcessor(DataProcessor):
    # Pass in the training data
    def get_train_examples(self, data_dir):
        # Read the training data path
        file_path = os.path.join(data_dir, 'train.csv')
        # Use Pandas to read the data
        df = pd.read_csv(file_path)
        # Split the training data into 80% training set and 20% validation set
        df_train, self.df_dev = train_test_split(df, test_size=0.2)
        examples = []
        # Process the data in the format recommended by BERT
        for index, row in df_train.iterrows():
            guid = 'train-%d' % index  # Index
            text_a = tokenization.convert_to_unicode(str(row[0]))  # Text 1
            text_b = tokenization.convert_to_unicode(str(row[1]))  # Text 2
            label = row[2]  # Text label
            examples.append(InputExample(guid=guid, text_a=text_a,
                                         text_b=text_b, label=label))
        return examples

    # Pass in the validation data
    def get_dev_examples(self, data_dir):
        examples = []
        for index, row in self.df_dev.iterrows():
            guid = 'dev-%d' % index
            text_a = tokenization.convert_to_unicode(str(row[0]))
            text_b = tokenization.convert_to_unicode(str(row[1]))
            label = row[2]
            examples.append(InputExample(guid=guid, text_a=text_a,
                                         text_b=text_b, label=label))
        return examples

    # Pass in the test data (for prediction)
    def get_test_examples(self, data_dir):
        file_path = os.path.join(data_dir, 'test.csv')
        df_test = pd.read_csv(file_path)
        examples = []
        for index, row in df_test.iterrows():
            guid = 'test-%d' % index
            text_a = tokenization.convert_to_unicode(str(row[0]))
            text_b = tokenization.convert_to_unicode(str(row[1]))
            label = '0'  # Arbitrarily specify the test data label
            examples.append(InputExample(guid=guid, text_a=text_a,
                                         text_b=text_b, label=label))
        return examples

    def get_labels(self):
        # Labels for the example three-class task; replace them with the
        # actual label values that appear in your data
        return ['A', 'B', 'C']

You will notice that the methods that load the training, validation, and test data are essentially the same; the key is packaging the data into the InputExample format that BERT expects. Also make sure that get_labels() returns exactly the label values that appear in your data, since run_classifier.py maps each example's label through this list.

Next, we also need to register the custom WSDMProcessor we just created in the processors dictionary of the main() function:

def main(_):
    tf.logging.set_verbosity(tf.logging.INFO)

    processors = {
        "cola": ColaProcessor,
        "mnli": MnliProcessor,
        "mrpc": MrpcProcessor,
        "xnli": XnliProcessor,
        "self": WSDMProcessor, # Add the custom Processor
    }

After completing the above work, you can start the text classification task. Run run_classifier.py following the examples in the BERT open-source repository, for example:

# Flag meanings: --task_name selects the processor registered above;
# --do_train / --do_eval / --do_predict enable training, validation, and prediction;
# --data_dir is the folder containing train.csv and test.csv;
# --vocab_file, --bert_config_file, and --init_checkpoint point to the unzipped pre-trained model;
# --max_seq_length, --train_batch_size, --learning_rate, --num_train_epochs are training hyperparameters;
# --output_dir is where the fine-tuned model and predictions are written
python run_classifier.py \
  --task_name=self \
  --do_train=true \
  --do_eval=true \
  --do_predict=true \
  --data_dir=./dataset \
  --vocab_file=./chinese_L-12_H-768_A-12/vocab.txt \
  --bert_config_file=./chinese_L-12_H-768_A-12/bert_config.json \
  --init_checkpoint=./chinese_L-12_H-768_A-12/bert_model.ckpt \
  --max_seq_length=128 \
  --train_batch_size=32 \
  --learning_rate=5e-5 \
  --num_train_epochs=1.0 \
  --output_dir=./dataset/output

Note that an option can only be enabled when the corresponding method is defined. For example, if the Processor does not implement the validation-set method, set --do_eval=false. After training finishes, the fine-tuned model and the test predictions are saved under the output directory.
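To obtain the test-set accuracy that the challenge asks for, compare the predictions in test_results.tsv (written by run_classifier.py under the output directory, with one row of class probabilities per test example in the column order of get_labels()) against the true labels held out in test.csv. Below is a minimal sketch, assuming the label column in test.csv is named label and that the label list matches what get_labels() returns; adjust both to your data.

import os
import pandas as pd

# Must match WSDMProcessor.get_labels(), in the same order
labels = ['A', 'B', 'C']

# One row of class probabilities per test example
probs = pd.read_csv(os.path.join('dataset', 'output', 'test_results.tsv'),
                    sep='\t', header=None)
predicted = probs.values.argmax(axis=1)

# True labels held out earlier; 'label' is an assumed column name
df_test = pd.read_csv(os.path.join('dataset', 'test.csv'))
true = df_test['label'].map({name: i for i, name in enumerate(labels)}).values

print('Test accuracy: %.4f' % (predicted == true).mean())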

Now give it a try yourself. Read the instructions on the official open-source page carefully and check the known issues on the repository's Issues page when you run into problems.

