Word embedding

Method in natural language processing
Part of a series on
Machine learning
and data mining
Scatterplot featuring a linear support vector machine's decision boundary (dashed line)
  • Classification
  • Clustering
  • Regression
  • Anomaly detection
  • Data Cleaning
  • AutoML
  • Association rules
  • Reinforcement learning
  • Structured prediction
  • Feature engineering
  • Feature learning
  • Online learning
  • Semi-supervised learning
  • Unsupervised learning
  • Learning to rank
  • Grammar induction
Learning with humans
Model diagnostics
  • Learning curve
Machine-learning venues
  • v
  • t
  • e

In natural language processing (NLP), word embedding is a term used for the representation of words for text analysis, typically in the form of a real-valued vector that encodes the meaning of the word such that the words that are closer in the vector space are expected to be similar in meaning.[1] Word embeddings can be obtained using a set of language modeling and feature learning techniques where words or phrases from the vocabulary are mapped to vectors of real numbers.

Methods to generate this mapping include neural networks,[2] dimensionality reduction on the word co-occurrence matrix,[3][4][5] probabilistic models,[6] explainable knowledge base method,[7] and explicit representation in terms of the context in which words appear.[8]

Word and phrase embeddings, when used as the underlying input representation, have been shown to boost the performance in NLP tasks such as syntactic parsing[9] and sentiment analysis.[10]

Development and history of the approach

In Distributional semantics, a quantitative methodological approach to understanding meaning in observed language, word embeddings or semantic vector space models have been used as a knowledge representation for some time.[11] Such models aim to quantify and categorize semantic similarities between linguistic items based on their distributional properties in large samples of language data. The underlying idea that "a word is characterized by the company it keeps" was proposed in a 1957 article by John Rupert Firth,[12] but also has roots in the contemporaneous work on search systems[13] and in cognitive psychology.[14]

The notion of a semantic space with lexical items (words or multi-word terms) represented as vectors or embeddings is based on the computational challenges of capturing distributional characteristics and using them for practical application to measure similarity between words, phrases, or entire documents. The first generation of semantic space models is the vector space model for information retrieval.[15][16][17] Such vector space models for words and their distributional data implemented in their simplest form results in a very sparse vector space of high dimensionality (cf. Curse of dimensionality). Reducing the number of dimensions using linear algebraic methods such as singular value decomposition then led to the introduction of latent semantic analysis in the late 1980s and the Random indexing approach for collecting word cooccurrence contexts.[18][19][20][21] In 2000 Bengio et al. provided in a series of papers the "Neural probabilistic language models" to reduce the high dimensionality of words representations in contexts by "learning a distributed representation for words".[22][23]

A study published in NeurIPS (NIPS) 2002 introduced the use of both word and document embeddings applying the method of kernel CCA to bilingual (and multi-lingual) corpora, also providing an early example of self-supervised learning of word embeddings [24]

Word embeddings come in two different styles, one in which words are expressed as vectors of co-occurring words, and another in which words are expressed as vectors of linguistic contexts in which the words occur; these different styles are studied in (Lavelli et al., 2004).[25] Roweis and Saul published in Science how to use "locally linear embedding" (LLE) to discover representations of high dimensional data structures.[26] Most new word embedding techniques after about 2005 rely on a neural network architecture instead of more probabilistic and algebraic models, since some foundational work by Yoshua Bengio and colleagues.[27][28]

The approach has been adopted by many research groups after advances around year 2010 had been made on theoretical work on the quality of vectors and the training speed of the model and hardware advances allowed for a broader parameter space to be explored profitably. In 2013, a team at Google led by Tomas Mikolov created word2vec, a word embedding toolkit that can train vector space models faster than the previous approaches. The word2vec approach has been widely used in experimentation and was instrumental in raising interest for word embeddings as a technology, moving the research strand out of specialised research into broader experimentation and eventually paving the way for practical application.[29]


Traditionally, one of the main limitations of word embeddings (word vector space models in general) is that words with multiple meanings are conflated into a single representation (a single vector in the semantic space). In other words, polysemy and homonymy are not handled properly. For example, in the sentence "The club I tried yesterday was great!", it is not clear if the term club is related to the word sense of a club sandwich, baseball club, clubhouse, golf club, or any other sense that club might have. The necessity to accommodate multiple meanings per word in different vectors (multi-sense embeddings) is the motivation for several contributions in NLP to split single-sense embeddings into multi-sense ones.[30][31]

Most approaches that produce multi-sense embeddings can be divided into two main categories for their word sense representation, i.e., unsupervised and knowledge-based.[32] Based on word2vec skip-gram, Multi-Sense Skip-Gram (MSSG)[33] performs word-sense discrimination and embedding simultaneously, improving its training time, while assuming a specific number of senses for each word. In the Non-Parametric Multi-Sense Skip-Gram (NP-MSSG) this number can vary depending on each word. Combining the prior knowledge of lexical databases (e.g., WordNet, ConceptNet, BabelNet), word embeddings and word sense disambiguation, Most Suitable Sense Annotation (MSSA)[34] labels word-senses through an unsupervised and knowledge-based approach considering a word's context in a pre-defined sliding window. Once the words are disambiguated, they can be used in a standard word embeddings technique, so multi-sense embeddings are produced. MSSA architecture allows the disambiguation and annotation process to be performed recurrently in a self-improving manner.

The use of multi-sense embeddings is known to improve performance in several NLP tasks, such as part-of-speech tagging, semantic relation identification, semantic relatedness, named entity recognition and sentiment analysis.[35][36]

As of the late 2010s, contextually-meaningful embeddings such as ELMo and BERT have been developed. These embeddings use a word's context to disambiguate polysemes. They do so using LSTM and Transformer neural network architectures.

For biological sequences: BioVectors

Word embeddings for n-grams in biological sequences (e.g. DNA, RNA, and Proteins) for bioinformatics applications have been proposed by Asgari and Mofrad.[37] Named bio-vectors (BioVec) to refer to biological sequences in general with protein-vectors (ProtVec) for proteins (amino-acid sequences) and gene-vectors (GeneVec) for gene sequences, this representation can be widely used in applications of deep learning in proteomics and genomics. The results presented by Asgari and Mofrad[37] suggest that BioVectors can characterize biological sequences in terms of biochemical and biophysical interpretations of the underlying patterns.

Sentence embeddings

The idea has been extended to embeddings of entire sentences or even documents, e.g. in form of the thought vectors concept. In 2015, some researchers suggested "skip-thought vectors" as a means to improve the quality of machine translation.[38]


Software for training and using word embeddings includes Tomas Mikolov's Word2vec, Stanford University's GloVe,[39] GN-GloVe,[40] Flair embeddings,[35] AllenNLP's ELMo,[41] BERT,[42] fastText, Gensim,[43] Indra[44] and Deeplearning4j. Principal Component Analysis (PCA) and T-Distributed Stochastic Neighbour Embedding (t-SNE) are both used to reduce the dimensionality of word vector spaces and visualize word embeddings and clusters.[45]

Examples of application

For instance, the fastText is also used to calculate word embeddings for text corpora in Sketch Engine that are available online.[46]

See also


  1. ^ Jurafsky, Daniel; H. James, Martin (2000). Speech and language processing : an introduction to natural language processing, computational linguistics, and speech recognition. Upper Saddle River, N.J.: Prentice Hall. ISBN 978-0-13-095069-7.
  2. ^ Mikolov, Tomas; Sutskever, Ilya; Chen, Kai; Corrado, Greg; Dean, Jeffrey (2013). "Distributed Representations of Words and Phrases and their Compositionality". arXiv:1310.4546 [cs.CL].
  3. ^ Lebret, Rémi; Collobert, Ronan (2013). "Word Emdeddings through Hellinger PCA". Conference of the European Chapter of the Association for Computational Linguistics (EACL). Vol. 2014. arXiv:1312.5542.
  4. ^ Levy, Omer; Goldberg, Yoav (2014). Neural Word Embedding as Implicit Matrix Factorization (PDF). NIPS.
  5. ^ Li, Yitan; Xu, Linli (2015). Word Embedding Revisited: A New Representation Learning and Explicit Matrix Factorization Perspective (PDF). Int'l J. Conf. on Artificial Intelligence (IJCAI).
  6. ^ Globerson, Amir (2007). "Euclidean Embedding of Co-occurrence Data" (PDF). Journal of Machine Learning Research.
  7. ^ Qureshi, M. Atif; Greene, Derek (2018-06-04). "EVE: explainable vector based embedding technique using Wikipedia". Journal of Intelligent Information Systems. 53: 137–165. arXiv:1702.06891. doi:10.1007/s10844-018-0511-x. ISSN 0925-9902. S2CID 10656055.
  8. ^ Levy, Omer; Goldberg, Yoav (2014). Linguistic Regularities in Sparse and Explicit Word Representations (PDF). CoNLL. pp. 171–180.
  9. ^ Socher, Richard; Bauer, John; Manning, Christopher; Ng, Andrew (2013). Parsing with compositional vector grammars (PDF). Proc. ACL Conf.
  10. ^ Socher, Richard; Perelygin, Alex; Wu, Jean; Chuang, Jason; Manning, Chris; Ng, Andrew; Potts, Chris (2013). Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank (PDF). EMNLP.
  11. ^ Sahlgren, Magnus. "A brief history of word embeddings".
  12. ^ Firth, J.R. (1957). "A synopsis of linguistic theory 1930–1955". Studies in Linguistic Analysis: 1–32. Reprinted in F.R. Palmer, ed. (1968). Selected Papers of J.R. Firth 1952–1959. London: Longman.
  13. ^ }{cite journal|last=Luhn|first=H.P.|year=1953|title=A New Method of Recording and Searching Information|journal=American Documentation|pages=14–16|doi=doi:10.1002/asi.5090040104}}
  14. ^ Osgood, C.E.; Suci, G.J.; Tannenbaum, P.H. (1957). The Measurement of Meaning. University of Illinois Press.
  15. ^ Salton, Gerard (1962). "Some experiments in the generation of word and document associations". Proceeding AFIPS '62 (Fall) Proceedings of the December 4–6, 1962, Fall Joint Computer Conference. AFIPS '62 (Fall): 234–250. doi:10.1145/1461518.1461544. ISBN 9781450378796. S2CID 9937095. Retrieved 18 October 2020.
  16. ^ Salton, Gerard; Wong, A; Yang, C S (1975). "A Vector Space Model for Automatic Indexing". Communications of the Association for Computing Machinery (CACM). 18 (11): 613–620. doi:10.1145/361219.361220. hdl:1813/6057. S2CID 6473756.
  17. ^ Dubin, David (2004). "The most influential paper Gerard Salton never wrote". Retrieved 18 October 2020.
  18. ^ Kanerva, Pentti, Kristoferson, Jan and Holst, Anders (2000): Random Indexing of Text Samples for Latent Semantic Analysis, Proceedings of the 22nd Annual Conference of the Cognitive Science Society, p. 1036. Mahwah, New Jersey: Erlbaum, 2000.
  19. ^ Karlgren, Jussi; Sahlgren, Magnus (2001). Uesaka, Yoshinori; Kanerva, Pentti; Asoh, Hideki (eds.). "From words to understanding". Foundations of Real-World Intelligence. CSLI Publications: 294–308.
  20. ^ Sahlgren, Magnus (2005) An Introduction to Random Indexing, Proceedings of the Methods and Applications of Semantic Indexing Workshop at the 7th International Conference on Terminology and Knowledge Engineering, TKE 2005, August 16, Copenhagen, Denmark
  21. ^ Sahlgren, Magnus, Holst, Anders and Pentti Kanerva (2008) Permutations as a Means to Encode Order in Word Space, In Proceedings of the 30th Annual Conference of the Cognitive Science Society: 1300–1305.
  22. ^ Bengio, Yoshua; Ducharme, Réjean; Vincent, Pascal; Jauvin, Christian (2003). "A Neural Probabilistic Language Model" (PDF). Journal of Machine Learning Research. 3: 1137–1155.
  23. ^ Bengio, Yoshua; Schwenk, Holger; Senécal, Jean-Sébastien; Morin, Fréderic; Gauvain, Jean-Luc (2006). A Neural Probabilistic Language Model. Studies in Fuzziness and Soft Computing. Vol. 194. pp. 137–186. doi:10.1007/3-540-33486-6_6. ISBN 978-3-540-30609-2.
  24. ^ Vinkourov, Alexei; Cristianini, Nello; Shawe-Taylor, John (2002). Inferring a semantic representation of text via cross-language correlation analysis (PDF). Advances in neural information processing systems 15.
  25. ^ Lavelli, Alberto; Sebastiani, Fabrizio; Zanoli, Roberto (2004). Distributional term representations: an experimental comparison. 13th ACM International Conference on Information and Knowledge Management. pp. 615–624. doi:10.1145/1031171.1031284.
  26. ^ Roweis, Sam T.; Saul, Lawrence K. (2000). "Nonlinear Dimensionality Reduction by Locally Linear Embedding". Science. 290 (5500): 2323–6. Bibcode:2000Sci...290.2323R. CiteSeerX doi:10.1126/science.290.5500.2323. PMID 11125150.
  27. ^ Morin, Fredric; Bengio, Yoshua (2005). "Hierarchical probabilistic neural network language model" (PDF). In Cowell, Robert G.; Ghahramani, Zoubin (eds.). Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics. Proceedings of Machine Learning Research. Vol. R5. pp. 246–252.
  28. ^ Mnih, Andriy; Hinton, Geoffrey (2009). "A Scalable Hierarchical Distributed Language Model". Advances in Neural Information Processing Systems 21 (NIPS 2008). Curran Associates, Inc. 21: 1081–1088.
  29. ^ "word2vec". Google Code Archive. Retrieved 23 July 2021.
  30. ^ Reisinger, Joseph; Mooney, Raymond J. (2010). Multi-Prototype Vector-Space Models of Word Meaning. Vol. Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics. Los Angeles, California: Association for Computational Linguistics. pp. 109–117. ISBN 978-1-932432-65-7. Retrieved October 25, 2019.
  31. ^ Huang, Eric. (2012). Improving word representations via global context and multiple word prototypes. OCLC 857900050.
  32. ^ Camacho-Collados, Jose; Pilehvar, Mohammad Taher (2018). "From Word to Sense Embeddings: A Survey on Vector Representations of Meaning". arXiv:1805.04032 [cs.CL].
  33. ^ Neelakantan, Arvind; Shankar, Jeevan; Passos, Alexandre; McCallum, Andrew (2014). "Efficient Non-parametric Estimation of Multiple Embeddings per Word in Vector Space". Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Stroudsburg, PA, USA: Association for Computational Linguistics: 1059–1069. arXiv:1504.06654. doi:10.3115/v1/d14-1113. S2CID 15251438.
  34. ^ Ruas, Terry; Grosky, William; Aizawa, Akiko (2019-12-01). "Multi-sense embeddings through a word sense disambiguation process". Expert Systems with Applications. 136: 288–303. arXiv:2101.08700. doi:10.1016/j.eswa.2019.06.026. hdl:2027.42/145475. ISSN 0957-4174. S2CID 52225306.
  35. ^ a b Akbik, Alan; Blythe, Duncan; Vollgraf, Roland (2018). "Contextual String Embeddings for Sequence Labeling". Proceedings of the 27th International Conference on Computational Linguistics. Santa Fe, New Mexico, USA: Association for Computational Linguistics: 1638–1649.
  36. ^ Li, Jiwei; Jurafsky, Dan (2015). "Do Multi-Sense Embeddings Improve Natural Language Understanding?". Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Stroudsburg, PA, USA: Association for Computational Linguistics: 1722–1732. arXiv:1506.01070. doi:10.18653/v1/d15-1200. S2CID 6222768.
  37. ^ a b Asgari, Ehsaneddin; Mofrad, Mohammad R.K. (2015). "Continuous Distributed Representation of Biological Sequences for Deep Proteomics and Genomics". PLOS ONE. 10 (11): e0141287. arXiv:1503.05140. Bibcode:2015PLoSO..1041287A. doi:10.1371/journal.pone.0141287. PMC 4640716. PMID 26555596.
  38. ^ Kiros, Ryan; Zhu, Yukun; Salakhutdinov, Ruslan; Zemel, Richard S.; Torralba, Antonio; Urtasun, Raquel; Fidler, Sanja (2015). "skip-thought vectors". arXiv:1506.06726 [cs.CL].
  39. ^ "GloVe".
  40. ^ Zhao, Jieyu; et al. (2018) (2018). "Learning Gender-Neutral Word Embeddings". arXiv:1809.01496 [cs.CL].
  41. ^ "Elmo".
  42. ^ Pires, Telmo; Schlinger, Eva; Garrette, Dan (2019-06-04). "How multilingual is Multilingual BERT?". arXiv:1906.01502 [cs.CL].
  43. ^ "Gensim".
  44. ^ "Indra". GitHub. 2018-10-25.
  45. ^ Ghassemi, Mohammad; Mark, Roger; Nemati, Shamim (2015). "A Visualization of Evolving Clinical Sentiment Using Vector Representations of Clinical Notes" (PDF). Computing in Cardiology.
  46. ^ "Embedding Viewer". Embedding Viewer. Lexical Computing. Retrieved 7 Feb 2018.
  • v
  • t
  • e
General terms
Text analysis
Text segmentation
Automatic summarization
Machine translation
Distributional semantics models
Language resources,
datasets and corpora
Types and
Automatic identification
and data capture
Topic model
Natural language
user interface
Other software