Improving the Performance of Semantic Text Similarity Tasks on Short Text Pairs
Abstract
Training a semantic similarity model to detect duplicate text pairs is challenging because almost all available datasets are imbalanced: by the nature of the data, positive samples are fewer than negative samples, which can easily bias the model. Traditional pairwise loss functions such as pairwise binary cross-entropy or contrastive loss are prone to this bias on imbalanced data, whereas triplet loss showed improved performance compared to the other loss functions. In triplet-loss-based models, each input consists of an anchor sentence, a positive sentence, and a negative sentence, so the original data is permuted to follow this structure. The original training set contains 363,861 samples (90% of the data), distributed as 134,336 positive samples and 229,524 negative samples. Restructuring the data as triplets generated a much larger, balanced training set of 456,219 samples. We fine-tuned a pre-trained RoBERTa model using the triplet loss approach, and it achieved higher accuracy and F1 scores in testing. The best model scored an F1 of 89.51 and an accuracy of 91.45, compared to an F1 of 86.74 and an accuracy of 87.45 for the second-best model, a contrastive-loss-based BERT model.
Department
Computer Science
Publisher
IEEE
DOI
10.1109/ESOLEC54569.2022
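The abstract describes permuting labelled sentence pairs into (anchor, positive, negative) triplets and training with a triplet loss. The record does not give the authors' exact procedure, so the following is only a minimal sketch under stated assumptions: the function names `build_triplets` and `triplet_loss`, the rule of pairing every duplicate of an anchor with every known non-duplicate of the same anchor, the Euclidean distance, and the margin value are all illustrative choices, not details from the paper.

```python
import math
from collections import defaultdict


def build_triplets(pairs):
    """Turn (text_a, text_b, is_duplicate) pairs into triplets.

    Assumption: a triplet is formed whenever the same anchor text has
    at least one known duplicate and one known non-duplicate.
    """
    positives = defaultdict(list)
    negatives = defaultdict(list)
    for a, b, is_dup in pairs:
        bucket = positives if is_dup else negatives
        # Record the relation in both directions so either text can be an anchor.
        bucket[a].append(b)
        bucket[b].append(a)

    triplets = []
    for anchor, pos_list in positives.items():
        for pos in pos_list:
            for neg in negatives.get(anchor, []):
                triplets.append((anchor, pos, neg))
    return triplets


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss on embedding vectors.

    Penalises the model unless the negative is at least `margin`
    farther from the anchor than the positive is.
    """
    d_pos = math.dist(anchor, positive)
    d_neg = math.dist(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


if __name__ == "__main__":
    # Hypothetical toy data; the identifiers stand in for sentences.
    pairs = [("q1", "q2", 1), ("q1", "q3", 0), ("q4", "q5", 0)]
    print(build_triplets(pairs))
    print(triplet_loss((0.0, 0.0), (0.0, 1.0), (3.0, 4.0)))
```

Because one anchor can be combined with several positives and negatives, a construction of this kind can yield more training samples than the original pair count, and every sample carries both a positive and a negative example, which is consistent with the larger, balanced training set reported in the abstract.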