Department of Computer Science and Engineering


First Advisor

Stephen Scott

Second Advisor

Mohammad Rashedul Hasan

Date of this Version

Fall 12-2-2021

Citation

Deuja, Rojina. (2021). Semantically Meaningful Sentence Embeddings (Master's thesis, University of Nebraska-Lincoln).

Comments

A thesis presented to the Faculty of the Graduate College at the University of Nebraska in partial fulfillment of requirements for the degree of Master of Science, Major: Computer Science, under the supervision of Professors Stephen Scott and Mohammad Rashedul Hasan. Lincoln, Nebraska: December 2021.

Copyright 2021 Rojina Deuja

Abstract

Text embedding is an approach used in Natural Language Processing (NLP) to represent words, phrases, sentences, and documents. It is the process of obtaining numeric representations of text to feed into machine learning models as vectors (arrays of numbers). One of the biggest challenges in text embedding is representing longer text segments like sentences. These representations should capture the meaning of the segment and the semantic relationship between its constituents. Such representations are known as semantically meaningful embeddings. In this thesis, we seek to improve upon the quality of sentence embeddings that capture semantic information.
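To make the idea of a fixed-size sentence representation concrete, here is a minimal sketch. The token vectors are random placeholders standing in for a transformer encoder's output, and the pooling shown is the common mean-pooling heuristic, not the thesis's proposed method:

```python
import numpy as np

# Toy token embeddings for a 4-token sentence with hidden dimension 6.
# In a real model these would come from an encoder's last hidden layer;
# random values are used here purely for illustration.
rng = np.random.default_rng(0)
token_vectors = rng.normal(size=(4, 6))  # shape: (num_tokens, hidden_dim)

# Mean pooling: collapse the variable-length token axis into a single
# fixed-size vector that represents the whole sentence.
sentence_embedding = token_vectors.mean(axis=0)

print(sentence_embedding.shape)  # (6,)
```

Whatever the sentence length, the pooled embedding has the same dimensionality, which is what lets downstream models consume sentences as fixed-size inputs.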

The current state-of-the-art models are based on transformer networks that utilize attention mechanisms. Such networks use encoders that generate dense vectors to represent input sentences. While most of these models combine the dense vectors into fixed-size embeddings, there is no evidence that such heuristic pooling techniques work best for capturing semantic relationships. We argue that processing the encoder output in such a way incorporates unwanted information into the embeddings. To capture the semantic relationship between words in a sentence and remove linguistic noise, we propose a modified version of the DeBERTa model with a novel pooling technique. Our encoder model uses FCNN-based pooling to reduce the size of the encoder output while enriching the expressiveness of semantic information in the embeddings. Our experiments show that the proposed model achieves significant improvement over existing sentence embedding methods on two different datasets: STS Benchmark (STS-B) and SICK-Relatedness (SICK-R). We also create a semantic search engine that encodes an input sentence and returns the N most similar sentences.
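The semantic search engine described above can be sketched as follows. The corpus sentences are real-looking examples invented for illustration, and their embeddings are random placeholders; in the thesis they would instead come from the proposed DeBERTa-based encoder. Ranking is by cosine similarity, a standard choice for comparing sentence embeddings:

```python
import numpy as np

# Tiny candidate corpus. The embeddings are random stand-ins; a real
# system would encode each sentence with the trained encoder model.
corpus = [
    "A man is playing a guitar.",
    "Someone is strumming an instrument.",
    "The stock market fell sharply today.",
]
rng = np.random.default_rng(1)
corpus_embeddings = rng.normal(size=(len(corpus), 8))

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def top_n(query_embedding, n=2):
    """Return the n corpus sentences most similar to the query."""
    scores = [cosine_similarity(query_embedding, e) for e in corpus_embeddings]
    ranked = np.argsort(scores)[::-1][:n]  # indices, best first
    return [corpus[i] for i in ranked]

# The query embedding would likewise come from the encoder; random here.
query = rng.normal(size=8)
results = top_n(query, n=2)
print(results)
```

In practice the corpus embeddings are computed once and cached, so each query costs only one encoder pass plus a similarity scan over the stored vectors.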

Advisers: Stephen Scott and Mohammad Rashedul Hasan
