Implementing the Porter Stemming Algorithm in C#

Elemar Júnior

In a previous post, I started an elementary search library using C#. Now, I will improve it adding a Porter Stemming filter to the indexing and searching processes. The resulting source code is available on my Github.

Note: I’ve been learning about it from the “Introduction to Information Retrieval” book.

Stemming?

From the “Introduction to Information Retrieval” book:

For grammatical reasons, documents are going to use different forms of a word, such as organize, organizes, and organizing. Addiotionaly there are families of derivationally related words with similar meaning, such as democracy, democratic, democratization. In many situations, it seems as if it would be useful to search for one of these words to renturn documents that contain another word in the set.

The goal of stemming is to reduce inflectional forms and sometimes derivationally related forms of word to a common base form.

Porter Stemming?

From the “Introduction to Information Retrieval” book:

The most common algorithm for stemming English, and one that has repeatedly been shown to be empirically very effective is the Porter Algorithm. It consists of several phases of word reductions applied sequentially.

Implementing it

Recall the significant steps in the inverted index construction

  1. Collect the documents to be indexed – I will use simple strings for while;
  2. Tokenize the text, turning each document into a list of tokens
  3. Do linguistic preprocessing, producing a list of indexing terms
  4. Index the documents that each term occurs in by creating an inverted index, consisting of a dictionary and postings.

The Porter Stemming fits the step 3. Right?

public class PorterStemmerFilter : IFilter
{
    public bool Process(TokenSource source)
    {
        PerformStep1(source);
        PerformStep2(source);
        PerformStep3(source);
        PerformStep4(source);
        PerformStep5(source);
        PerformStep6(source);
        return source.Size > 0;
    }

// ..

I used this implementation as a starting point to write my own.

Step 1

In step 1 we remove common suffices and pluralizations.

[Theory]
[InlineData("caresses", "caress")]
[InlineData("ponies", "poni")]
[InlineData("ties", "ti")]
[InlineData("caress", "caress")]
[InlineData("cats", "cat")]
[InlineData("feed", "feed")]
[InlineData("agreed", "agree")]
[InlineData("disabled", "disable")]
[InlineData("matting", "mat")]
[InlineData("mating", "mate")]
[InlineData("meeting", "meet")]
[InlineData("milling", "mill")]
[InlineData("messing", "mess")]
[InlineData("meetings", "meet")]
public void Step1(string from, string to)
{
    var filter = new PorterStemmerFilter();
    using (var reader = new StringReader(from))
    {
        var tokenSource = new TokenSource(reader);
        tokenSource.Next();
        filter.PerformStep1(tokenSource);
        Assert.Equal(to, tokenSource.ToString());
    }
}

To do that:

public void PerformStep1(TokenSource source)
{
    if (source.EndsWith('s'))
    {
        if (source.EndsWith("sses") || source.EndsWith("ies"))
        {
            source.Size -= 2;
        }
        else if (source.Buffer[source.Size - 2] != 's')
        {
            source.Size -= 1;
        }
    }

    if (source.EndsWith("eed"))
    {
        var limit = source.Size - 3; // source.Length
        if (source.NumberOfConsoantSequences(limit) > 0)
        {
            source.Size -= 1;
        }
    }
    else
    {
        var limit = 0;
        if (source.EndsWith("ed"))
        {
            limit = source.Size - 2;
        }
        else if (source.EndsWith("ing"))
        {
            limit = source.Size - 3;
        }

        if (limit != 0 && source.ContainsVowel(limit))
        {
            source.Size = limit;
            if (
                source.EndsWith("at") ||
                source.EndsWith("bl") ||
                source.EndsWith("iz")
            )
            {
                source.InsertIntoBuffer('e');
            }
            else if (source.EndsWithDoubleConsonant())
            {
                var ch = source.LastChar;
                if (ch != 'l' && ch != 's' && ch != 'z')
                {
                    source.Size--;
                }
            }
            else if (
                source.NumberOfConsoantSequences(source.Size - 1) == 1 &&
                source.HasCvcAt(source.Size - 1)
            )
            {
                source.InsertIntoBuffer('e');
            }
        }
    }
}

Notes:

  • The EndsWith method checks if the end of current token matches with the specified string/char.
  • The Buffer is a plain old fixed size char array.
  • The Size is an integer with the used length of Buffer used to store the current token.
  • The NumberOfConsoantsSequences retrieves how many consonants sequences are present in the specified portion of the buffer. It is util to check if we are not doing oversimplification.
  • HasCvcAt verifies if there is a sequence of Consonant-Vowel-Consonant before ending in the specified position. Again, it is relevant to guarantee that we are not doing oversimplification.

Step 2

Step 2 turns terminal -y to -i when there is another vowel in the stem.

public void PerformStep2(TokenSource source)
{
    if (source.EndsWith('y')
        && source.ContainsVowel(source.Size - 2)
    )
    {
        source.Buffer[source.Size - 1] = 'i';
    }
}

Step 3

Step 3 maps double suffices to single ones. So -ization maps to -ize, etc.

Testing Step3 in isolation:

[Theory]
[InlineData("international", "internate")]
[InlineData("rational", "rational")]
[InlineData("constitutional", "constitution")]
[InlineData("energizer", "energize")]
[InlineData("internacionalization", "internacionalize")]
[InlineData("enumeration", "enumerate")]
[InlineData("consolidator", "consolidate")]
[InlineData("tropicalism", "tropical")]
[InlineData("vandalism", "vandal")]
[InlineData("activeness", "active")]
[InlineData("remorsefulness", "remorseful")]
public void Step3(string from, string to)
{
    var filter = new PorterStemmerFilter();
    using (var reader = new StringReader(from))
    {
        var tokenSource = new TokenSource(reader);
        tokenSource.Next();
        filter.PerformStep3(tokenSource);
        Assert.Equal(to, tokenSource.ToString());
    }
}

And now step3’s implementation.

public void PerformStep3(TokenSource source)
{
    if (source.Size == 0)
        return;

    switch (source.Buffer[source.Size - 2])
    {
        case 'a':
            if (source.ChangeSuffix("ational", "ate")) break;
            source.ChangeSuffix("tional", "tion");
            break;
        case 'c':
            if (source.ChangeSuffix("enci", "ence")) break;
            source.ChangeSuffix("anci", "ance");
            break;
        case 'e':
            source.ChangeSuffix("izer", "ize");
            break;
        case 'l':
            if (source.ChangeSuffix("bli", "ble")) break;
            if (source.ChangeSuffix("alli", "al")) break;
            if (source.ChangeSuffix("entli", "ent")) break;
            if (source.ChangeSuffix("eli", "e")) break;
            source.ChangeSuffix("ousli", "ous");
            break;
        case 'o':
            if (source.ChangeSuffix("ization", "ize")) break;
            if (source.ChangeSuffix("ation", "ate")) break;
            source.ChangeSuffix("ator", "ate");
            break;
        case 's':
            if (source.ChangeSuffix("alism", "al")) break;
            if (source.ChangeSuffix("iveness", "ive")) break;
            if (source.ChangeSuffix("fulness", "ful")) break;
            source.ChangeSuffix("ousness", "ous");
            break;
        case 't':
            if (source.ChangeSuffix("aliti", "al")) break;
            if (source.ChangeSuffix("iviti", "ive")) break;
            source.ChangeSuffix("biliti", "ble");
            break;
        case 'g':
            source.ChangeSuffix("logi", "log");
            break;
    }
}

Notes:

  • ChangeSuffix will replace the specified suffix if the prefix has a sequence of consonants.

Step 4

Step 4 deals with -ic-, -full, -ness etc. In a similar fashion of step3.

public void PerformStep4(TokenSource source)
{
    if (source.Size == 0)
        return;

    switch (source.LastChar)
    {
        case 'e':
            if (source.ChangeSuffix("icate", "ic")) break;
            if (source.RemoveSuffix("ative")) break;
            source.ChangeSuffix("alize", "al");
            break;
        case 'i':
            source.ChangeSuffix("iciti", "ic");
            break;
        case 'l':
            if (source.ChangeSuffix("ical", "ic")) break;
            source.RemoveSuffix("ful");
            break;
        case 's':
            source.RemoveSuffix("ness");
            break;
    }
}

Notes:

  • RemoveSuffix will remove the specified suffix if the prefix has a sequence of consonants.

Step 5

Step 5 takes off -ant, -ence, etc.

public void PerformStep5(TokenSource source)
{
    if (source.Size == 0)
        return;

    switch (source.Buffer[source.Size - 2])
    {
        case 'a':
            source.RemoveSuffix("al");
            return;
        case 'c':
            if (source.RemoveSuffix("ance")) return;
            source.RemoveSuffix("ence");
            return;
        case 'e':
            source.RemoveSuffix("er");
            return;
        case 'i':
            source.RemoveSuffix("ic");
            return;
        case 'l':
            if (source.RemoveSuffix("able")) return;
            source.RemoveSuffix("ible");
            return;
        case 'n':
            if (source.RemoveSuffix("ant")) return;
            if (source.RemoveSuffix("ement")) return;
            if (source.RemoveSuffix("ment")) return;
            source.RemoveSuffix("ent");
            return;
        case 'o':
            if (source.ChangeSuffix("tion", "t")) return;
            if (source.ChangeSuffix("sion", "s")) return;
            source.RemoveSuffix("ou");
            return;
        case 's':
            source.RemoveSuffix("ism");
            return;
        case 't':
            if (source.RemoveSuffix("ate")) return;
            source.RemoveSuffix("iti");
            return;
        case 'u':
            source.RemoveSuffix("ous");
            return;
        case 'v':
            source.RemoveSuffix("ive");
            return;
        case 'z':
            source.RemoveSuffix("ize");
            return;
        default:
            return;
    }
}

Step 6

Step 6 removes the final -e and -l.

public void PerformStep6(TokenSource source)
{
    switch (source.LastChar)
    {
        case 'e':
            var a = source.NumberOfConsoantSequences(source.Size - 1);
            if (a > 1 || a == 1 && !source.HasCvcAt(source.Size - 2))
                source.Size --;
            break;
        case 'l' when source.ContainsDoubleConsonantAt(source.Size - 1) && source.NumberOfConsoantSequences(source.Size - 2) > 0:
            source.Size--;
            break;
    }
}

How it affects the indexing and the search results

Adopting the Porter Stemming Algorithm will reduce the number of the distinct tokens and increase the posting list sizes.

Let’s see what happens with the queries.

[Theory]
[InlineData(
    new[]
    {
        "Human cannibalism is the act or practice of humans eating the flesh or internal organs of other human beings. ",
        "There are cannibals in some primitive communities.",
        "In marketing strategy, cannibalization refers to a reduction in sales volume, sales revenue,... ",
    },
    "Cannibalization",
    new[] { 0, 1, 2 }
)]
public void SearchByQuery_TestingPorterStemming(
    string[] documents,
    string query,
    int[] expectedResults
)
{
    var index = new StringIndexer().CreateIndex(documents);
    var searcher = new Searcher(index);
    var results = searcher.Search(query, DefaultAnalyzer.Instance);
    Assert.Equal(expectedResults, results);
}

In the test query, all documents are returned. Why? Cannibalization, Cannibals and Cannibalism are all stemmed to  “cannib.”

Final Words

In this post, I shared the implementation of the famous Porter Stemming Algorithm. Needless to say how much I enjoy it.

Again, I am still studying Information Retrieval. So, if you found any errors in my implementation, let me know.

Compartilhe este insight:

Comentários

Participe deixando seu comentário sobre este artigo a seguir:

Subscribe
Notify of
guest
1 Comentário
Oldest
Newest Most Voted
Inline Feedbacks
View all comments
ras
ras
1 ano atrás

where TokenSource come from ? it is not .net built in class, pls give implementation of TokenSource

AUTOR

Elemar Júnior
Fundador e CEO da EximiaCo atua como tech trusted advisor ajudando empresas e profissionais a gerar mais resultados através da tecnologia.

NOVOS HORIZONTES PARA O SEU NEGÓCIO

Nosso time está preparado para superar junto com você grandes desafios tecnológicos.

Entre em contato e vamos juntos utilizar a tecnologia do jeito certo para gerar mais resultados.

Insights EximiaCo

Confira os conteúdos de negócios e tecnologia desenvolvidos pelos nossos consultores:

Engenharia de Software

Três vantagens reais de utilizar orquestradores BPM para serviços

Arquiteto de software e solução com larga experiência corporativa
Desenvolvimento de Software

Os principais desafios ao adotar testes

Especialista em Testes e Arquitetura de Software
Arquitetura de Dados

Insights de um DBA na análise de um plano de execução

Especialista em performance de Bancos de Dados de larga escala

Acesse nossos canais

Simplificamos, potencializamos e aceleramos resultados usando a tecnologia do jeito certo

EximiaCo 2022 – Todos os direitos reservados

1
0
Queremos saber a sua opinião, deixe seu comentáriox
()
x

Implementing the Porter Stemming Algorithm in C#

Para se candidatar nesta turma aberta, preencha o formulário a seguir:

Condição especial de pré-venda: R$ 14.000,00 - contratando a mentoria até até 31/01/2023 e R$ 15.000,00 - contratando a mentoria a partir de 01/02/2023, em até 12x com taxas.

Tenho interesse nessa capacitação

Para solicitar mais informações sobre essa capacitação para a sua empresa, preencha o formulário a seguir:

Tenho interesse em conversar

Se você está querendo gerar resultados através da tecnologia, preencha este formulário que um de nossos consultores entrará em contato com você:

O seu insight foi excluído com sucesso!

O seu insight foi excluído e não está mais disponível.

O seu insight foi salvo com sucesso!

Ele está na fila de espera, aguardando ser revisado para ter sua publicação programada.

Tenho interesse em conversar

Se você está querendo gerar resultados através da tecnologia, preencha este formulário que um de nossos consultores entrará em contato com você:

Tenho interesse nessa solução

Se você está procurando este tipo de solução para o seu negócio, preencha este formulário que um de nossos consultores entrará em contato com você:

Tenho interesse neste serviço

Se você está procurando este tipo de solução para o seu negócio, preencha este formulário que um de nossos consultores entrará em contato com você:

× Precisa de ajuda?