Show simple item record

dc.contributor.author	ElKafrawy, Passent
dc.contributor.author	Gamal, Mohamed Taher
dc.date.accessioned	2023-03-16T05:04:35Z
dc.date.available	2023-03-16T05:04:35Z
dc.date.issued	2023-01
dc.identifier.citation	M. T. Gamal and P. M. El-Kafrawy, "Improving The Performance of Semantic Text Similarity Tasks on Short Text Pairs," 2022 20th International Conference on Language Engineering (ESOLEC), Cairo, Egypt, 2022, pp. 50-52.	en_US
dc.identifier.doi	10.1109/ESOLEC54569.2022	en_US
dc.identifier.uri	http://hdl.handle.net/20.500.14131/688
dc.description.abstract	Training a semantic similarity model to detect duplicate text pairs is a challenging task because almost all such datasets are imbalanced: by the nature of the data, positive samples are far fewer than negative samples, which can easily bias the model. Traditional pairwise loss functions such as pairwise binary cross-entropy or contrastive loss may lead to model bias on imbalanced data, whereas triplet loss showed improved performance compared to the other loss functions. In triplet-loss-based models, data is fed to the model as follows: an anchor sentence, a positive sentence, and a negative sentence; the original data is permuted to fit this input structure. The default training set contains 363,861 samples (90% of the data), distributed as 134,336 positive and 229,524 negative samples. Restructuring the data into triplets generated a much larger, balanced set of 456,219 training samples, and testing showed higher accuracy and F1 scores. We fine-tuned a pre-trained RoBERTa model using the triplet loss approach; the best model scored an F1 of 89.51 and an accuracy of 91.45, compared with an F1 of 86.74 and an accuracy of 87.45 for the second-best model, a contrastive-loss-based BERT.	en_US
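The abstract describes the triplet objective: each training example is an (anchor, positive, negative) sentence triple, and the loss penalizes the model when the anchor sits closer to the negative than to the positive by less than a margin. The paper fine-tunes a RoBERTa encoder; as a minimal dependency-free sketch of the loss itself (not the authors' code, and with toy vectors standing in for sentence embeddings):

```python
import math

def euclidean(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge loss on the gap between the anchor-positive and
    anchor-negative distances: zero once the negative is at least
    `margin` farther from the anchor than the positive is."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

# Toy 2-D embeddings: the positive is a near-duplicate, the negative unrelated.
anchor, positive, negative = [1.0, 0.0], [0.9, 0.1], [-1.0, 0.0]
well_separated = triplet_loss(anchor, positive, negative)   # 0.0 (margin satisfied)
violating = triplet_loss(anchor, negative, positive)        # > 0 (triple violated)
```

In practice the distances would be computed between encoder outputs (e.g. RoBERTa sentence embeddings) and the loss backpropagated through the encoder; the margin value here is an illustrative assumption, not the paper's setting.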
dc.publisher	IEEE	en_US
dc.subject	Training; Semantics; Bit error rate; Distributed databases; Data models; Entropy; Task analysis	en_US
dc.title	Improving the Performance of Semantic Text Similarity Tasks on Short Text Pairs	en_US
dc.contributor.researcher	External Collaboration	en_US
dc.contributor.lab	Artificial Intelligence & Cyber Security Lab	en_US
dc.subject.KSA	ICT	en_US
dc.source.index	Scopus	en_US
dc.contributor.department	Computer Science	en_US
dc.contributor.firstauthor	Gamal, Mohamed Taher
dc.conference.location	Egypt	en_US
dc.conference.name	2022 20th International Conference on Language Engineering (ESOLEC)	en_US
dc.conference.date	2022-10-12

