Improving Mini BERT’s predictions using Multilingual Knowledge Distillation

Objective of this work:

Created a model which is able to predict COVID-19 related fake news in multiple languages without any need for COVID specific parallel data.
Have a model which is much smaller in size, therefore has the ability to do fast inference (even on a smartphone).

Approach Finetuned a smaller architecture (1) BERT model (L=4, H=512) using labels from a larger model (Sentence BERT) using Knowledge Distillation Ensured (2)shared embedding space between languages and (3)also learn the rich semantics of a larger SBERT model, which are very effective in sentence level classification tasks.</p>

Results

Because of (1) that we had a much faster inference time: 93.7% reduction (0.495 sec to 0.031 sec)
Because of (2) the accuracy in hindi was increased by ~35% (0.63 -> 0.845)
Because of (3) we got a side benefit - accuracy in english was increased by ~6%

This work got accepted in the AI4SG Workshop in IJCAI 2021

Slides - https://docs.google.com/presentation/d/1GNdTsntJjyAI1lX052_lIkPFLJ2UemTUjUgVrq-eGws/edit#slide=id.p1