Stemming
Example of stemming. Source: Fox 2018.

Consider different forms of a word, such as organize, organizes, and organizing. Consider also words that are closely related, such as democracy, democratic, and democratization. In many applications, it may be inefficient to handle all the variations individually. Stemming reduces them to a common form. Algorithms that do this are called stemmers. The output of a stemmer is called the stem, which is the root word.

Stemming may be seen as a crude heuristic process that simply chops off ends of words. Unlike lemmatization, stemming doesn’t involve dictionary lookup or morphological analysis. It’s not even required that the stem be a valid word or identical to its morphological root. The goal is to reduce related words to the same stem.
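This "chop off the ends" idea can be sketched in a few lines of Python. The suffix list and the minimum stem length below are illustrative assumptions, not the rules of any real stemmer:

```python
# Toy stemmer: strip the first matching suffix, longest first.
# Suffix list and minimum stem length of 3 are arbitrary choices.
SUFFIXES = ["izing", "izes", "ize", "ing", "ed", "es", "s"]

def naive_stem(word):
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[: -len(suf)]
    return word

print(naive_stem("organize"))   # organ
print(naive_stem("organizes"))  # organ
print(naive_stem("organizing")) # organ
```

All three forms reduce to the common stem 'organ', even though 'organ' is not the morphological root 'organize'. This is exactly the point above: the stem only needs to be a shared form, not a valid root.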


  • What are some applications of stemming?
    Variations of verb forms to include for better SEO. Source: Ives 2011, table 3.

    Stemming is a pre-processing task that’s done before other application-specific tasks are invoked. One application of stemming is to count the use of emotion words and perform basic sentiment analysis. When used with a dictionary or spell checker such as Hunspell, stemmers can be used to suggest corrections when a misspelled word is encountered.

    One of the first applications of stemming was in Information Retrieval (IR). Searching with keyword “explosion” would fail to retrieve documents indexed by the word “explosives”. Stemming solves this problem since indexing would be done using stem words.
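The IR use case above can be sketched as a toy inverted index keyed on stems, so that a query for "explosion" also retrieves documents containing "explosives". The suffix list here is an illustrative assumption, chosen so both words reduce to the same stem:

```python
# Toy suffix stripper, tuned only to illustrate the IR example.
SUFFIXES = ["ives", "ive", "ion", "es", "s"]

def stem(word):
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[: -len(suf)]
    return word

def build_index(docs):
    # Inverted index: stem -> set of document ids containing a variant.
    index = {}
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index.setdefault(stem(word), set()).add(doc_id)
    return index

docs = {1: "controlled explosives", 2: "a gas explosion"}
index = build_index(docs)
print(index[stem("explosion")])  # {1, 2}: both documents match
```

Because indexing and querying both go through the stemmer, 'explosion' and 'explosives' meet at the stem 'explos'.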

    Online search engines such as Google Search use stemming. A 2009 analysis of Google Search showed that some suffixes such as ‘-s’, ‘-ed’, and ‘-ing’ are considered to be strongly correlated with the stems. Suffixes ‘-able’, ‘-tive’, ‘-ly’, and ‘-ness’ are considered less correlated. For better SEO, forms that are poorly correlated could be added to improve search ranking. However, Google may penalize the page if it appears unnatural.

  • What are some essential terms concerning stemming?
    Comparing derivational and inflectional morphemes. Source: Liberman and Prince 1998.

    Stemming is based on the assumption that words have a structure, based on a root word and modifications of the root. The study of words and their parts is called morphology. In IR systems, given a word, stemming is really about finding morphological variants. The term conflation indicates the combining of variants to a common stem.

    Words may contain prefixes and suffixes, which generally are called affixes. Stemming usually concerns itself with suffixes. Suffixes themselves are of two types:

    • Inflectional: Form is varied to express some grammatical feature such as singular/plural or present/past/future tense. Inflections don’t change part of speech or meaning. For example, ‘boy’ and ‘boys’; ‘big’, ‘bigger’ and ‘biggest’.
    • Derivational: New forms are created from words. New ones have a different part of speech or meaning. For example, ‘creation’ from ‘create’. They can occur between the stem and an inflectional suffix, such as ‘governments’, where ‘-ment’ precedes ‘-s’. Another example is ‘rationalizations’, where ‘-al’, ‘-iz’ and ‘-ation’ are derivational, and ‘-s’ is inflectional.
  • What are the typical errors while stemming?
    Search for 'Withings' wrongly shows items containing 'with'. Source: Larochelle 2014.

    Errors occur because rules fail for some special cases. The worst error is when two different concepts conflate to the same stem. For example, when ‘Withings’ is stemmed to ‘with’ on a web portal, wrong items are presented to the user. The Porter stemmer stems ‘meanness’ and ‘meaning’ to ‘mean’, though they relate to different concepts. Conversely, ‘goose’ and ‘geese’ are equivalent but are stemmed to ‘goos’ and ‘gees’ respectively.

    Typical errors of stemming are the following:

    • Overstemming: Happens when too much is removed. For example, ‘wander’ becomes ‘wand’; ‘news’ becomes ‘new’; or ‘universal’, ‘universe’, ‘universities’, and ‘university’ are all reduced to ‘univers’. A better result would be ‘univers’ for the first two and ‘universi’ for the last two.
    • Understemming: Happens when words are from the same root but are not seen that way. For example, ‘knavish’ remains as ‘knavish’; ‘data’ and ‘datum’ become ‘dat’ and ‘datu’ although they’re from the same root.
    • Misstemming: Usually not a problem unless it leads to false conflations. For example, ‘relativity’ becomes ‘relative’.
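A deliberately crude stemmer makes these failure modes easy to reproduce. The two-suffix rule set below is an illustrative assumption, chosen to trigger the errors described above:

```python
# Toy rule set: strip '-er' or '-s' if at least 3 characters remain.
SUFFIXES = ["er", "s"]

def toy_stem(word):
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[: -len(suf)]
    return word

# Overstemming: too much is removed, unrelated words collapse.
print(toy_stem("wander"))   # wand -- conflated with the noun 'wand'
print(toy_stem("news"))     # new  -- conflated with the adjective 'new'

# Understemming: no rule fires, related words never conflate.
print(toy_stem("knavish"))  # knavish -- will never match 'knave'
```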
  • Could you give an overview of algorithms for stemming?
    Types of stemming algorithms. Source: Singh and Gupta 2017, fig. 1.

    Stemming algorithms broadly fall into one of two categories:

    • Rule-based: Table lookup is a brute-force approach where inflected or derived forms are mapped to stems. For a language such as Turkish, with its many inflected forms, this leads to large tables. Another approach is to apply a series of rules that strip affixes to reach the stem. Yet another approach is to use word morphology, including part of speech. All these approaches require specific knowledge of the language.
    • Statistical: By training a model, suffix-stripping rules are implicitly derived. This is good for languages with complex morphology. No expert knowledge of the language is needed. Models can be trained from a lexicon, a corpus or character-based n-grams of the language’s words.

    Among the rule-based stemmers are the Lovins, Dawson, Porter, Paice/Husk (aka Lancaster), Krovetz and Xerox stemmers. The Porter stemmer, in its Snowball implementation, is commonly used. The Lancaster stemmer is more aggressive, which can lead to overstemming.

  • Could you explain Porter’s algorithm?
    Computing word measure, resulting in m=4. Source: Snowball 2019a.

    Porter stemmer applies a set of rules in five steps involving 51 suffixes and 60 rules.

    Each rule has the form (condition) S1 → S2, where S1 and S2 are suffixes. If the condition is satisfied, or is null, S1 is replaced with S2. When multiple rules are applicable in a given sub-step, only the rule with the longest matching S1 is applied. For example, step 1a has four rules with null condition:

    $$\begin{aligned} \text{SSES} &\to \text{SS} \\ \text{IES} &\to \text{I} \\ \text{SS} &\to \text{SS} \\ \text{S} &\to \text{(null)} \end{aligned}$$

    In this case, ‘caresses’ becomes ‘caress’ since the first rule gives the longest match for S1; ‘caress’ remains ‘caress’ (third rule); ‘cares’ becomes ‘care’ (fourth rule).
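Step 1a is small enough to implement directly. This sketch covers only that one sub-step, with the four rules ordered so the longest S1 wins:

```python
# Porter step 1a: four null-condition rules, longest S1 first.
STEP1A = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]

def step1a(word):
    for s1, s2 in STEP1A:
        if word.endswith(s1):
            return word[: len(word) - len(s1)] + s2
    return word

print(step1a("caresses"))  # caress  (rule SSES -> SS)
print(step1a("caress"))    # caress  (rule SS -> SS)
print(step1a("cares"))     # care    (rule S -> null)
print(step1a("ponies"))    # poni    (rule IES -> I)
```

The seemingly useless rule SS → SS exists to shield words ending in ‘-ss’ from the final S → (null) rule.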

    Porter also defined the generic structure of a word as [C](VC)^m[V], where [ ] denotes optional presence, C is a run of one or more consonants, V is a run of one or more vowels, and m is the word’s measure. The measure is computed on what precedes the rule’s suffix, and can be used to skip rules when the stem is too short. For example, the rule (m>0) EED → EE changes ‘agreed’ to ‘agree’ (m=1) but leaves ‘feed’ unchanged (m=0).
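The measure can be computed by mapping the stem to a string of C and V runs and counting VC pairs. This is a minimal sketch, with a simplified treatment of ‘y’ (a vowel only when it follows a consonant, as in Porter’s definition):

```python
import re

def measure(stem):
    """Porter's measure m for a stem matching [C](VC)^m[V]."""
    vowels = set("aeiou")
    form = ""
    for ch in stem:
        # 'y' counts as a vowel only after a consonant; initial 'y' is C.
        if ch in vowels or (ch == "y" and form.endswith("C")):
            form += "V"
        else:
            form += "C"
    # m is the number of vowel-run / consonant-run pairs.
    return len(re.findall(r"V+C+", form))

# Rule (m>0) EED -> EE is checked against what precedes 'eed':
print(measure("agr"))  # 1, so 'agreed' -> 'agree'
print(measure("f"))    # 0, so 'feed' stays 'feed'
```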

  • What are the common approaches used by stochastic stemmers?

    One approach is to segment words into roots and suffixes and select the best possible segmentation, based on counts of letter successor varieties. The Minimum Description Length (MDL) stemmer likewise finds the optimal split between root and suffix. MDL was found to be computationally intensive and gave 83% accuracy for both English and French.

    Another approach is to discover suffixes based on their frequencies. Frequencies are adjusted for partial suffixes so that ‘-ng’ doesn’t include ‘-ing’.

    Hidden Markov Model (HMM) has been applied to stemming. Word letters are considered as hidden states and the transition from a root state to a suffix state is taken as the split point.

    Yet Another Suffix Stripper (YASS) treats stemming as a word clustering problem. Two words having long common prefix are considered similar. Threshold value must be selected carefully to produce meaningful clusters. YASS was shown to perform similar to rule-based stemmers.
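The clustering idea can be illustrated with a much-simplified sketch. YASS itself uses more refined string distance measures; here, as an assumption for illustration, two words are grouped when their common prefix covers a threshold fraction of the longer word:

```python
def common_prefix_len(a, b):
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def cluster(words, threshold=0.6):
    """Greedy prefix-based clustering (toy stand-in for YASS)."""
    clusters = []
    for w in sorted(words):
        for c in clusters:
            if all(common_prefix_len(w, v) >= threshold * max(len(w), len(v))
                   for v in c):
                c.append(w)
                break
        else:
            clusters.append([w])
    return clusters

print(cluster(["astronomer", "astronomy", "astronaut", "baking", "baked"]))
# [['astronaut', 'astronomer', 'astronomy'], ['baked'], ['baking']]
```

At threshold 0.6, ‘baked’ and ‘baking’ fail to merge while the ‘astro-’ words form one cluster; lowering the threshold to 0.5 merges ‘baked’ and ‘baking’ too. This is the threshold sensitivity noted above.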

    Other approaches make use of graphs with weighted edges, word co-occurrences, distribution similarity, and queries along with context.

  • How does stemming perform?

    Algorithmic stemmers are typically fast. For example, a million words can be stemmed in 6 seconds on a 500 MHz personal computer. It’s more efficient not to use a dictionary. Rather than complicate the algorithm, it’s simpler to ignore irregular forms.

    Lemmatization uses a dictionary such as WordNet to get to the correct base forms. However, lemmatization has its limitations too. For example, it leaves ‘happiness’ and ‘happy’ unchanged, whereas the Porter stemmer conflates both to ‘happi’.

    In IR systems, stemming reduces storage requirements by over 50% since we store stems rather than full terms.

  • What are some tools to do stemming?

    For a quick demo of stemming, online tools are available.

    For developers, R has the corpus package for stemming, which is based on Snowball and supports multiple natural languages. In Python, NLTK and TextBlob are two packages that support stemming.

    Martin Porter has shared a list of implementations of the Porter stemmer in many programming languages. For testing purposes, he also shares a sample vocabulary and the expected output.


Sample Code

  • R

    # Source: # Accessed: 2019-09-24
    library(corpus)

    # A simple example
    text <- "love loving lovingly loved lover lovely love"
    text_tokens(text, stemmer = "en")  # english stemmer

    # Emotion word usage in the text of Wizard of Oz
    data <- gutenberg_corpus(55, verbose = FALSE)
    text_filter(data)$stemmer <-
        with(affect_wordnet,
             new_stemmer(term, interaction(category, emotion),
                         default = NA, duplicates = "omit"))
    print(term_stats(data), -1)
    text_sample(data, "Joy.Positive")


  1. Butt, Miriam. 2003. “Porter Stemmer.” SlidePlayer. Accessed 2019-09-24.
  2. Fairweather, John. 2007. “Language independent stemming.” U.S. patent US8015175B2, March 16. Granted 2011-09-06. Accessed 2019-09-24.
  3. Fortney, Kendall. 2017. “Pre-Processing in Natural Language Machine Learning.” Towards Data Science, November 28. Accessed 2019-09-24.
  4. Fox, Trevor. 2018. “What is keyword stemming? Use it … Wisely.” July 30. Accessed 2019-09-24.
  5. Frakes, W. B. 1992. “Stemming Algorithms.” Chapter 8 in: Information Retrieval: Data Structures and Algorithms, William B. Frakes and Ricardo Baeza-Yates (eds), Prentice Hall. Accessed 2019-09-24.
  6. Heidenreich, Hunter. 2018. “Stemming? Lemmatization? What?” Blog, December 21. Accessed 2019-09-24.
  7. Ives, Ted. 2011. “Stemming for SEO: The Complete Guide.” Coconut Headphones, December 02. Accessed 2019-09-24.
  8. Larochelle, David. 2014. “The Problems With Stemmming: A Practical Example.” Blog, March 30. Accessed 2019-09-24.
  9. Liberman, Mark and Ellen Prince. 1998. “Morphology II.” LING 001: Introduction to Linguistics, University of Pennsylvania, September. Accessed 2019-09-24.
  10. Lovins, Julie Beth. 1968. “Development of a Stemming Algorithm.” Mechanical Translation and Computational Linguistics, vol. 11, no. 1 and 2, pp. 22-31, March and June. Accessed 2019-09-24.
  11. Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schütze. 2008. “Introduction to Information Retrieval.” Cambridge University Press. Accessed 2019-09-24.
  12. Mitosystems. 2014a. “Accurate cross language search – a solution at last!” Mitosystems, January 23. Updated 2015-03-31. Accessed 2019-09-24.
  13. Mitosystems. 2014b. “A framework for combining stemming algorithms to improve accuracy.” Mitosystems, January 23. Updated 2015-03-31. Accessed 2019-09-24.
  14. Paice, Chris D. 1994. “An Evaluation Method for Stemming Algorithms.” In: Croft B.W., van Rijsbergen C.J. (eds) SIGIR ’94. Springer, London. Accessed 2019-09-24.
  15. Paice, Chris D. 2016. “Stemming.” In: Liu L., Özsu M. (eds) Encyclopedia of Database Systems, Springer, New York, NY. Accessed 2019-09-24.
  16. Perry, Patrick O. 2017. “Stemming Words.” In: corpus: Text Corpus Analysis, v0.10.0, CRAN, December 12. Accessed 2019-09-24.
  17. Porter, M. F. 1980. “An algorithm for suffix stripping.” Program, vol. 14, no. 3, pp. 130-137, July. Accessed 2019-09-24.
  18. Porter, M. F. 2001. “Snowball: A language for stemming algorithms.” October. Accessed 2019-09-24.
  19. Porter, M. F. 2006. “The Porter Stemming Algorithm.” Accessed 2019-09-24.
  20. Singh, Jasmeet, and Vishal Gupta. 2017. “A systematic review of text stemming techniques.” Artificial Intelligence Review, no. 2, pp. 157-217, August. Accessed 2019-09-24.
  21. Slegg, Jennifer. 2018. “Google: No Need to Include All Variations of Keywords.” TheSEMPost, April 04. Accessed 2019-09-24.
  22. Snowball. 2019a. “The Porter stemming algorithm.” Snowball. Accessed 2019-09-24.
  23. Snowball. 2019b. “Homepage.” Accessed 2019-09-25.
  24. Soma, Jonathan. 2017. “Counting and stemming.” Accessed 2019-09-24.
  25. Tunkelang, Daniel. 2017. “Stemming and Lemmatization.” Medium, February 06. Accessed 2019-09-24.
  26. Uyar, Ahmet. 2009. “Google stemming mechanisms.” Journal of Information Science, vol. 35, no. 5, pp. 499–514, SAGE Journals, October 01. Accessed 2019-09-24.
  27. Wikipedia. 2019. “Stemming.” Wikipedia, August 18. Accessed 2019-09-24.

Further Reading

  1. Jabeen, Hafsa. 2018. “Stemming and Lemmatization in Python.” DataCamp Community, October 23. Accessed 2019-09-24.
  2. Porter, M.F. 1980. “An algorithm for suffix stripping.” Program, vol. 14, no. 3, pp. 130-137, July. Accessed 2019-09-24.
  3. Porter, M. F. 2001. “Snowball: A language for stemming algorithms.” October. Accessed 2019-09-24.
  4. Singh, Jasmeet, and Vishal Gupta. 2017. “A systematic review of text stemming techniques.” Artificial Intelligence Review, no. 2, pp. 157-217, August. Accessed 2019-09-24.
  5. Bitext. 2018. “What is the difference between stemming and lemmatization?” Blog, Bitext, February 28. Accessed 2019-09-24.

Cite As

Devopedia. 2019. “Stemming.” Version 3, September 28. Accessed 2022-06-15.
