Computer Science and Engineering, Department of

Department of Computer Science and Engineering: Dissertations, Theses, and Student Research

First Advisor

Stephen Scott

Second Advisor

Mohammad Rashedul Hasan

Date of this Version

Fall 12-2-2021

Document Type

Article

Citation

Deuja, Rojina. (2021). Semantically Meaningful Sentence Embeddings (Master's thesis, University of Nebraska-Lincoln).

Comments

A THESIS Presented to the Faculty of The Graduate College at the University of Nebraska In Partial Fulfillment of Requirements For the Degree of Master of Science, Major: Computer Science, Under the Supervision of Professors Stephen Scott and Mohammad Rashedul Hasan. Lincoln, Nebraska: December, 2021

Abstract

Text embedding is an approach used in Natural Language Processing (NLP) to represent words, phrases, sentences, and documents. It is the process of obtaining numeric representations of text to feed into machine learning models as vectors (arrays of numbers). One of the biggest challenges in text embedding is representing longer text segments like sentences. These representations should capture the meaning of the segment and the semantic relationship between its constituents. Such representations are known as semantically meaningful embeddings. In this thesis, we seek to improve upon the quality of sentence embeddings that capture semantic information.

The current state-of-the-art models are based on transformer networks that utilize attention mechanisms. Such networks use encoders that generate dense vectors to represent input sentences. While most of these models combine the dense vectors into fixed-size embeddings, there is no evidence that such heuristic pooling techniques work best for capturing semantic relationships. We argue that processing the encoder output in such a way incorporates unwanted information into the embeddings. To capture the semantic relationship between words in a sentence and remove linguistic noise, we propose a modified version of the DeBERTa model with a novel pooling technique. Our encoder model uses FCNN-based pooling to reduce the size of the encoder output while enriching the expressiveness of semantic information in the embeddings. Our experiments show that the proposed model achieves significant improvement over existing sentence embedding methods on two datasets: STS Benchmark (STS-B) and SICK-Relatedness (SICK-R). We also create a semantic search engine that encodes an input sentence and returns the N most similar sentences.

Advisers: Stephen Scott and Mohammad Rashedul Hasan
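
As an illustration of the ideas in the abstract, the sketch below shows the kind of heuristic pooling that the thesis argues against (plain mean pooling of per-token vectors into a fixed-size sentence vector), followed by a top-N semantic search using cosine similarity. It is not the thesis's FCNN-based pooling or its DeBERTa encoder. The function names, the two-dimensional toy vectors, and the example sentences are all assumptions made for illustration, and a real system would obtain the token vectors from a trained encoder.

```python
import math


def mean_pool(token_vectors):
    """Average per-token vectors into one fixed-size sentence vector.

    This is a common heuristic pooling baseline, NOT the FCNN-based
    pooling proposed in the thesis.
    """
    dim = len(token_vectors[0])
    count = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / count for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_n_similar(query_vec, corpus, n):
    """Return the n corpus sentences most similar to query_vec.

    corpus maps each sentence to its (already pooled) embedding.
    """
    scored = [(cosine_similarity(query_vec, emb), sent)
              for sent, emb in corpus.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [sent for _, sent in scored[:n]]


# Hypothetical token vectors standing in for real encoder output.
toy_token_vectors = {
    "a cat sits on the mat": [[1.0, 0.0], [0.8, 0.2]],
    "a kitten rests on a rug": [[0.9, 0.1], [0.7, 0.3]],
    "stock prices fell sharply": [[0.0, 1.0], [0.1, 0.9]],
}
corpus = {sent: mean_pool(vecs) for sent, vecs in toy_token_vectors.items()}

# Hypothetical query, pooled the same way as the corpus sentences.
query = mean_pool([[1.0, 0.0], [0.9, 0.1]])
results = top_n_similar(query, corpus, n=2)
print(results)
```

With these toy vectors, the two sentences about cats rank above the unrelated finance sentence. Swapping `mean_pool` for a learned pooling layer would change only how each sentence vector is produced; the search step would stay the same.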

Included in

Computer Engineering Commons, Computer Sciences Commons
