Mohammad Rashedul Hasan
Deuja, Rojina (2021). Semantically Meaningful Sentence Embeddings (Master's thesis, University of Nebraska-Lincoln).
Text embedding is an approach used in Natural Language Processing (NLP) to represent words, phrases, sentences, and documents as numeric vectors (arrays of numbers) that can be fed into machine learning models. One of the biggest challenges in text embedding is representing longer text segments such as sentences. These representations should capture the meaning of the segment and the semantic relationships among its constituents. Such representations are known as semantically meaningful embeddings. In this thesis, we seek to improve the quality of sentence embeddings that capture semantic information.
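As an illustration of what "semantically meaningful" implies in practice, the sketch below compares sentence vectors with cosine similarity, a common way to score relatedness between embeddings. The three 4-dimensional vectors are made-up values chosen for illustration, not output from any model described in this thesis, and the function name is hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical sentence embeddings (made-up numbers, illustration only).
emb_cat = [0.9, 0.1, 0.3, 0.0]     # "A cat sits on the mat."
emb_kitten = [0.8, 0.2, 0.4, 0.1]  # "A kitten is sitting on a rug."
emb_stock = [0.0, 0.9, 0.1, 0.8]   # "Stock prices fell sharply."

# A semantically meaningful embedding should score related
# sentences higher than unrelated ones.
sim_close = cosine_similarity(emb_cat, emb_kitten)
sim_far = cosine_similarity(emb_cat, emb_stock)
print(sim_close > sim_far)
```

With embeddings of good quality, the paraphrase pair should receive a noticeably higher score than the unrelated pair.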
The current state-of-the-art models are based on transformer networks that use attention mechanisms. Such networks use encoders that generate dense vectors to represent input sentences. While most of these models combine the dense vectors into fixed-size embeddings through heuristic pooling, there is no evidence that such pooling techniques work best for capturing semantic relationships. We argue that processing the encoder output in this way incorporates unwanted information into the embeddings. To capture the semantic relationships between words in a sentence and remove linguistic noise, we propose a modified version of the DeBERTa model with a novel pooling technique. Our encoder model uses pooling based on a fully connected neural network (FCNN) to reduce the size of the encoder output while enriching the expressiveness of semantic information in the embeddings. Our experiments show that the proposed model achieves significant improvement over existing sentence embedding methods on two datasets: STS Benchmark (STS-B) and SICK-Relatedness (SICK-R). We also create a semantic search engine that encodes an input sentence and returns the N most similar sentences.
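The contrast between heuristic pooling and learned pooling, and the top-N retrieval step, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the architecture used in the thesis: the layer sizes, the pad-and-flatten step, the tanh activation, and all names (mean_pool, fcnn_pool, top_n) are hypothetical, and the weights are random rather than trained.

```python
import math
import random

def mean_pool(token_vectors):
    """Heuristic pooling: average the token vectors dimension by dimension."""
    n = len(token_vectors)
    return [sum(col) / n for col in zip(*token_vectors)]

def dense(x, weights, bias):
    """One fully connected layer: y = W x + b."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def fcnn_pool(token_vectors, max_len, layers):
    """Learned pooling sketch (assumed design): pad or truncate to max_len
    tokens, flatten, then apply dense layers with tanh between them."""
    dim = len(token_vectors[0])
    padded = token_vectors[:max_len] + [[0.0] * dim] * (max_len - len(token_vectors))
    x = [value for vec in padded for value in vec]
    for i, (w, b) in enumerate(layers):
        x = dense(x, w, b)
        if i < len(layers) - 1:
            x = [math.tanh(v) for v in x]
    return x

def random_layer(n_in, n_out, rng):
    """Untrained weights; in a real model these would be learned."""
    w = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return w, [0.0] * n_out

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def top_n(query_vec, corpus, n):
    """Return the n corpus sentences whose embeddings are closest to the query."""
    ranked = sorted(corpus, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:n]]

rng = random.Random(0)
MAX_LEN, DIM, HIDDEN, OUT = 4, 3, 5, 2  # hypothetical sizes
layers = [random_layer(MAX_LEN * DIM, HIDDEN, rng), random_layer(HIDDEN, OUT, rng)]

# Made-up encoder output: three tokens, three dimensions each.
tokens = [[0.2, -0.1, 0.5], [0.7, 0.3, -0.4], [0.0, 0.9, 0.1]]
pooled_mean = mean_pool(tokens)                   # same size as one token vector
pooled_fcnn = fcnn_pool(tokens, MAX_LEN, layers)  # reduced to OUT dimensions

# Toy search index of (sentence, embedding) pairs with made-up embeddings.
corpus = [("a", [1.0, 0.0]), ("b", [0.0, 1.0]), ("c", [0.9, 0.1])]
results = top_n([1.0, 0.0], corpus, 2)
```

Mean pooling weights every token equally, whereas the dense layers can, once trained, learn which parts of the encoder output to emphasize or suppress, which is the intuition behind replacing a fixed heuristic with a learned pooling network.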
Advisers: Stephen Scott and Mohammad Rashedul Hasan