What Is Lemmatization?

Written by Coursera Staff

Lemmatization is a common technique in natural language processing (NLP) that allows programmers to train algorithms efficiently and in a way that mirrors human language acquisition. Explore it in more detail to better understand its uses.


Lemmatization is the reduction of inflected words to their common root, or lemma (plural lemmas). Programmers often use lemmatization when training complex artificial intelligence (AI) systems, notably those that rely on natural language processing (NLP).

Words change form in a process linguists call inflection. Inflection allows words in a sentence to convey a variety of meanings, such as: 

  • Tense

  • Case

  • Person

  • Number

  • Gender

  • Mood

  • Voice

A lemma is the non-inflected or “root” form of a word. For a noun, the lemma is usually the singular form (dog for dogs); for a verb, it is the base form, so swims and swimming group under the lemma swim, along with the irregular past participle swum. Handling irregularities like swum illustrates the degree of sophistication an advanced NLP algorithm needs.
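
For a concrete, minimal illustration, the sketch below reduces these example words with NLTK’s WordNetLemmatizer, one common lemmatization tool discussed later in this article. The explicit part-of-speech hints and the noted outputs are assumptions based on typical WordNet behavior.

```python
# Minimal sketch: reducing inflected forms to their lemmas with NLTK.
# Assumes NLTK is installed and the WordNet data has been downloaded,
# e.g., via nltk.download("wordnet").
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# Nouns reduce to their singular form.
print(lemmatizer.lemmatize("dogs", pos="n"))      # expected: dog

# Verbs reduce to their base form, including irregular participles.
for word in ["swims", "swimming", "swum"]:
    print(lemmatizer.lemmatize(word, pos="v"))    # expected: swim
```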

What is lemmatization? Continue reading to explore the technique, including its benefits and limitations, in more detail.

What is lemmatization used for?

The discipline of NLP shares certain terms with the field of computational linguistics. This is no coincidence: NLP is an interdisciplinary field that draws on computational linguistics, statistical modeling, deep learning, and machine learning (ML). Combining these approaches allows programmers to develop AI that “understands” human language and can respond to it in a human-like way.

Computer programmers use lemmatization to improve how ML and NLP programs comprehend text. By reducing words to their respective lemmas, programmers give NLP programs fewer distinct word forms to learn from. Instead of memorizing every surface form, the programs learn in a pattern-based way, a method similar to how humans acquire language.

Understanding lemmatization in natural language processing (NLP)

Lemmatization offers you a way to program an NLP algorithm using words morphologically rather than syntactically. In other words, via lemmatization, an NLP algorithm understands words in and of themselves as discrete units (including suffixes, prefixes, and inflections) rather than as parts of a sentence that accrue meaning with the addition of further words.

This approach is more straightforward for an NLP model to learn and build on. At its core, an NLP model resembles a highly complex autocomplete feature: it works by determining the statistical probability of a particular word following another. By breaking words down to lemmas, that is, by discovering the root pattern beneath morphological inflection, the model has fewer distinct words to learn.

You can program an NLP model more efficiently and accurately by working with simpler, more consistent data: lemmas rather than chains of similar words differentiated only by inflection. Such reduced training dimensionality (illustrated in the sketch after this list) improves tasks such as:

  • Text mining

  • Text classification

  • Clustering

  • Indexing
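
As a rough sketch of that dimensionality reduction, the snippet below counts distinct word forms in a tiny toy corpus before and after lemmatization with NLTK’s WordNetLemmatizer. The corpus, the whitespace tokenization, and the verb-then-noun fallback are all simplifying assumptions for illustration.

```python
# Sketch: lemmatization shrinks the vocabulary a model has to learn.
# Assumes NLTK with the WordNet data downloaded (nltk.download("wordnet"));
# the toy corpus and whitespace tokenization are purely illustrative.
from nltk.stem import WordNetLemmatizer

corpus = [
    "the dog swims",
    "the dogs swam",
    "a dog was swimming",
]

lemmatizer = WordNetLemmatizer()
raw_vocab = set()
lemma_vocab = set()

for sentence in corpus:
    for token in sentence.split():
        raw_vocab.add(token)
        # Try the verb reading first, then fall back to the noun reading;
        # a real pipeline would use POS tags to pick the right reading.
        lemma = lemmatizer.lemmatize(token, pos="v")
        if lemma == token:
            lemma = lemmatizer.lemmatize(token, pos="n")
        lemma_vocab.add(lemma)

print(len(raw_vocab), sorted(raw_vocab))      # more distinct surface forms
print(len(lemma_vocab), sorted(lemma_vocab))  # fewer distinct lemmas
```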

While lemmatization primarily reduces words to their roots to avoid ambiguity in NLP programming, it also has a bearing on part-of-speech (POS) tagging. POS tagging assigns a grammatical role to each word in a sentence. That is, POS tags tell you what words are doing in context by labeling them as nouns, verbs, adjectives, prepositions, and so on.
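
The short sketch below, again using NLTK, shows why POS matters to lemmatization: the same surface form maps to different lemmas depending on whether it is tagged as a verb or a noun. The example sentence is invented, and the tagger output described is only the typical result.

```python
# Sketch: POS tags change which lemma a word maps to.
# Assumes NLTK with its tokenizer, tagger, and WordNet data downloaded.
import nltk
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# The same surface form, read as a verb versus a noun, yields different lemmas.
print(lemmatizer.lemmatize("meeting", pos="v"))  # expected: meet
print(lemmatizer.lemmatize("meeting", pos="n"))  # expected: meeting

# A POS tagger supplies those labels in context.
tokens = nltk.word_tokenize("She is meeting clients before the meeting")
print(nltk.pos_tag(tokens))
# Typical output tags the first "meeting" as a verb (VBG)
# and the second as a noun (NN).
```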

How lemmatization differs from stemming

Both lemmatization and stemming use root morphemes to train an NLP algorithm. However, lemmatization is different from stemming in a few key ways. 

Stemming eliminates suffixes from word tokens, producing a pseudo-lemma that isn’t necessarily a word you’d find in a dictionary. Lemmatization, by contrast, reduces words to an intelligible common morpheme that is part of a standard lexicon.

You can think of stemming as a cruder form of lemmatization. Stemming operates through a sort of trial-and-error method: it attempts to identify and group words by root simply by lopping off their endings. Sometimes this succeeds. Other times, the result is a word token of just a single letter, or an ambiguous root that doesn’t accurately reflect the meanings of the words grouped under it. For example, a stemmer might group the words sophisticated and sophistry together based on a shared root such as sophis, even though this does not reflect any definitional commonality between those words. Lemmatization, on the other hand, would not group these words together; it’s a more accurate, albeit more painstaking, approach than stemming.
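
To make the contrast concrete, the sketch below runs a handful of words through NLTK’s Porter stemmer and WordNet lemmatizer side by side. The word list is arbitrary, and the noted outputs reflect typical behavior of these tools rather than a guarantee.

```python
# Sketch: stemming chops endings; lemmatization maps to dictionary words.
# Assumes NLTK with the WordNet data downloaded.
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["studies", "studying", "was", "business", "busy"]:
    print(word, stemmer.stem(word), lemmatizer.lemmatize(word, pos="v"))

# Typical results: "studies" and "studying" stem to the non-word "studi"
# but lemmatize to "study"; "was" stems to "wa" but lemmatizes to "be";
# "business" and "busy" both stem to "busi" despite being unrelated,
# while their lemmas remain distinct.
```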

Algorithms used for lemmatization

Programmers use various tools and algorithms for lemmatization; spaCy and WordNet are two of the most popular. Each has its strengths.

WordNet 

The WordNet lemmatization tool is based on the free, publicly available WordNet lexical database. WordNet groups English words in a couple of ways: first by synonymy, or similarity of meaning (not unlike a thesaurus), and then by semantics, or the senses of words in a larger context. WordNet also arranges words into interrelated hierarchies: from specific to general (chair falls under the broader category furniture), from general to specific (furniture is a superset that includes chair), and, for verbs, by increasing specificity, from compose to write to print, for instance.
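
You can explore those hierarchies directly through NLTK’s interface to WordNet, as in the brief sketch below; the exact synsets and definitions returned depend on the WordNet version you have installed.

```python
# Sketch: walking WordNet's hypernym ("is a kind of") hierarchy.
# Assumes NLTK with the WordNet corpus downloaded (nltk.download("wordnet")).
from nltk.corpus import wordnet as wn

# Synonym groupings ("synsets") for a word, roughly one per sense.
for synset in wn.synsets("chair")[:3]:
    print(synset.name(), "-", synset.definition())

# Climb from a specific sense toward more general categories; the chain
# for the "seat" sense of chair is expected to lead up toward furniture.
chair = wn.synset("chair.n.01")
print(chair.hypernyms())
```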

spaCy 

Written in Python and Cython, spaCy is an open-source NLP library that can process enormous data sets quickly and capably. Featuring a variety of plugins and workflows, spaCy is adaptable and may be what you need if you’re developing your own NLP model. Plus, its training system features lemmatization capabilities as standard.
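
A minimal spaCy sketch might look like the following. It assumes you have installed spaCy and downloaded the small English pipeline (en_core_web_sm), and the sample sentence and expected lemmas are illustrative rather than guaranteed.

```python
# Sketch: lemmatization with spaCy's small English pipeline.
# Assumes: pip install spacy  and  python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The children were swimming while the dogs swam")

for token in doc:
    print(token.text, "->", token.lemma_)
# Expected lemmas include child, be, swim, and dog.
```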

Benefits of lemmatization

By formally grouping inflected words, lemmatization improves search engine function: you don’t need to search for every form of a word to get a result. Stemming doesn’t allow for this level of precision. In fact, stemming might produce irrelevant matches. For example, if you search for business, a stemming algorithm may also return results for words like busy, a distinct and not necessarily related term that a crude stemmer reduces to the same stem (such as busi).

Limitations of lemmatization

Lemmatization is more time- and resource-intensive than stemming because it is also more computationally sophisticated. Lemmatization recognizes prefixes and can sort irregular verbs under dictionary-intelligible lemmas, whereas stemming only deals with suffixes and produces crude pseudo-lemmas.

However, lemmatization works best for languages whose words reliably break down into related prefix, root, and suffix morphemes. Some languages, such as Arabic, build words through root-and-pattern morphology rather than the suffix-based inflection English relies on; as a result, lemmatization programs designed for English-style inflection struggle to train NLP algorithms on Arabic text.

Applications of lemmatization

From text and sentiment analysis to indexing content, lemmatization can help you perform various functions. A few applications include the following.

Text analysis and information retrieval

Lemmatization is a good way to train chatbots. Chatbots need to understand human language in a fairly sophisticated way to offer customer service to a user. Via lemmatization training, they develop knowledge of word groupings by root, more or less the way people do. This speeds up information retrieval, as a lemmatization-trained NLP system doesn’t have to treat every inflected form of a word as a separate item to learn and look up.

Sentiment analysis and opinion mining

Via lemmatization, an NLP algorithm can better complete sentiment analysis (also known as opinion mining)—that is, it can group customer reviews into positive, negative, or neutral by reducing key emotion words (love, hate, etc.) to their respective lemmas rather than reading the entire piece and attempting to analyze it as a whole. For instance, the words exceeded, exceeds, and exceeding all lemmatize back to exceed, which would likely indicate a positive review. 
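
As a toy illustration of lemma-level sentiment scoring, the sketch below lemmatizes a review and looks each lemma up in a tiny hand-made lexicon. The lexicon, the scoring scheme, and the sample review are all assumptions for illustration, not a production approach.

```python
# Sketch: lemma-level sentiment scoring with a toy, made-up lexicon.
# Assumes NLTK with the WordNet data downloaded.
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# Hypothetical lexicon keyed by lemma rather than by every inflected form.
sentiment = {"exceed": 1, "love": 1, "hate": -1, "disappoint": -1}

review = "The delivery exceeded expectations and I loved the packaging"

score = 0
for token in review.lower().split():
    lemma = lemmatizer.lemmatize(token, pos="v")  # verb reading for simplicity
    score += sentiment.get(lemma, 0)

print(score)  # expected: 2 (exceeded -> exceed, loved -> love)
```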

Search engine optimization (SEO) and content indexing

Lemmatization is also helpful in powering search engines. It allows a search for a specific term to match related content without requiring you to search each conjugation or inflection of the term: you can search the word sing and get results for singing, sings, and so on, without having to mount a separate search. This improves the user experience and makes it easier for a business to have its content indexed, that is, to have its text show up in a Google search and rank in search results.
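
A simplified sketch of this idea appears below: documents are indexed by lemma, so a query for sing also matches documents containing sings or singing. The sample documents and the lemma_of helper are assumptions for illustration.

```python
# Sketch: a tiny lemma-keyed index so one query form matches all inflections.
# Assumes NLTK with the WordNet data downloaded; the documents are made up.
from collections import defaultdict
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def lemma_of(token):
    """Hypothetical helper: try the verb reading, then fall back to the noun."""
    word = token.lower()
    lemma = lemmatizer.lemmatize(word, pos="v")
    if lemma == word:
        lemma = lemmatizer.lemmatize(word, pos="n")
    return lemma

documents = {
    1: "She sings in a choir",
    2: "Singing lessons for beginners",
    3: "Guitar chords and tabs",
}

# Build an inverted index keyed by lemma rather than by surface form.
index = defaultdict(set)
for doc_id, text in documents.items():
    for token in text.split():
        index[lemma_of(token)].add(doc_id)

print(index[lemma_of("sing")])  # expected: {1, 2}
```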

Continue learning about lemmatization on Coursera.

Lemmatization is an integral part of NLP. If you plan to work in the field of AI, you’ll want to know all you can about how you can use lemmatization to create more sophisticated, human-like NLP programs.

Learn more on Coursera with options like the IBM Machine Learning Professional Certificate, which can help you master artificial neural networks, machine learning, and deep learning. DeepLearning.AI also offers a Deep Learning Specialization that can help you build and train neural networks and use standard techniques for testing training sets to optimize algorithms. 


This content has been made available for informational purposes only. Learners are advised to conduct additional research to ensure that courses and other credentials pursued meet their personal, professional, and financial goals.