TITLE:
Word Embeddings and Semantic Spaces in Natural Language Processing
AUTHORS:
Peter J. Worth
KEYWORDS:
Natural Language Processing, Vector Space Models, Semantic Spaces, Word Embeddings, Representation Learning, Text Vectorization, Machine Learning, Deep Learning
JOURNAL NAME:
International Journal of Intelligence Science, Vol. 13, No. 1, January 17, 2023
ABSTRACT: One of the critical hurdles, and breakthroughs, in the field of Natural Language Processing
(NLP) over the last two decades has been the development of text representation
techniques that address the so-called curse of dimensionality, a problem that plagues NLP
generally because the feature set for learning begins as a function of the
size of the language in question, typically upwards of hundreds of thousands of terms.
Much of the research and development in NLP over this period has therefore
focused on finding and optimizing solutions to this problem, that is, on
effective feature selection for NLP. This paper traces the development of
these techniques, which leverage a variety of statistical methods
resting on linguistic theories advanced in the middle of the last
century, chiefly the distributional hypothesis, which holds that
words found in similar contexts generally have similar meanings. In
this survey we examine some of the most popular of
these techniques from both a mathematical and a data structure perspective,
from Latent Semantic Analysis and Vector Space Models to their more modern
variants, typically referred to as word embeddings. Through this
review of algorithms such as Word2Vec, GloVe, ELMo and BERT, we also explore the idea
of semantic spaces more generally, beyond their applicability to NLP.