Implementing the Porter Stemming Algorithm in C#

Conteúdo

In a previous post, I started an elementary search library using C#. Now, I will improve it adding a Porter Stemming filter to the indexing and searching processes. The resulting source code is available on my Github.

Note: I’ve been learning about it from the “Introduction to Information Retrieval” book.

Stemming?

From the “Introduction to Information Retrieval” book:

For grammatical reasons, documents are going to use different forms of a word, such as organize, organizes, and organizing. Addiotionaly there are families of derivationally related words with similar meaning, such as democracy, democratic, democratization. In many situations, it seems as if it would be useful to search for one of these words to renturn documents that contain another word in the set.

The goal of stemming is to reduce inflectional forms and sometimes derivationally related forms of word to a common base form.

Porter Stemming?

From the “Introduction to Information Retrieval” book:

The most common algorithm for stemming English, and one that has repeatedly been shown to be empirically very effective is the Porter Algorithm. It consists of several phases of word reductions applied sequentially.

Implementing it

Recall the significant steps in the inverted index construction

Collect the documents to be indexed – I will use simple strings for while;
Tokenize the text, turning each document into a list of tokens
Do linguistic preprocessing, producing a list of indexing terms
Index the documents that each term occurs in by creating an inverted index, consisting of a dictionary and postings.

The Porter Stemming fits the step 3. Right?

public class PorterStemmerFilter : IFilter
{
    public bool Process(TokenSource source)
    {
        PerformStep1(source);
        PerformStep2(source);
        PerformStep3(source);
        PerformStep4(source);
        PerformStep5(source);
        PerformStep6(source);
        return source.Size > 0;
    }

// ..

I used this implementation as a starting point to write my own.

Step 1

In step 1 we remove common suffices and pluralizations.

[Theory]
[InlineData("caresses", "caress")]
[InlineData("ponies", "poni")]
[InlineData("ties", "ti")]
[InlineData("caress", "caress")]
[InlineData("cats", "cat")]
[InlineData("feed", "feed")]
[InlineData("agreed", "agree")]
[InlineData("disabled", "disable")]
[InlineData("matting", "mat")]
[InlineData("mating", "mate")]
[InlineData("meeting", "meet")]
[InlineData("milling", "mill")]
[InlineData("messing", "mess")]
[InlineData("meetings", "meet")]
public void Step1(string from, string to)
{
    var filter = new PorterStemmerFilter();
    using (var reader = new StringReader(from))
    {
        var tokenSource = new TokenSource(reader);
        tokenSource.Next();
        filter.PerformStep1(tokenSource);
        Assert.Equal(to, tokenSource.ToString());
    }
}

To do that:

public void PerformStep1(TokenSource source)
{
    if (source.EndsWith('s'))
    {
        if (source.EndsWith("sses") || source.EndsWith("ies"))
        {
            source.Size -= 2;
        }
        else if (source.Buffer[source.Size - 2] != 's')
        {
            source.Size -= 1;
        }
    }

    if (source.EndsWith("eed"))
    {
        var limit = source.Size - 3; // source.Length
        if (source.NumberOfConsoantSequences(limit) > 0)
        {
            source.Size -= 1;
        }
    }
    else
    {
        var limit = 0;
        if (source.EndsWith("ed"))
        {
            limit = source.Size - 2;
        }
        else if (source.EndsWith("ing"))
        {
            limit = source.Size - 3;
        }

        if (limit != 0 && source.ContainsVowel(limit))
        {
            source.Size = limit;
            if (
                source.EndsWith("at") ||
                source.EndsWith("bl") ||
                source.EndsWith("iz")
            )
            {
                source.InsertIntoBuffer('e');
            }
            else if (source.EndsWithDoubleConsonant())
            {
                var ch = source.LastChar;
                if (ch != 'l' && ch != 's' && ch != 'z')
                {
                    source.Size--;
                }
            }
            else if (
                source.NumberOfConsoantSequences(source.Size - 1) == 1 &&
                source.HasCvcAt(source.Size - 1)
            )
            {
                source.InsertIntoBuffer('e');
            }
        }
    }
}

Notes:

The EndsWith method checks if the end of current token matches with the specified string/char.
The Buffer is a plain old fixed size char array.
The Size is an integer with the used length of Buffer used to store the current token.
The NumberOfConsoantsSequences retrieves how many consonants sequences are present in the specified portion of the buffer. It is util to check if we are not doing oversimplification.
HasCvcAt verifies if there is a sequence of Consonant-Vowel-Consonant before ending in the specified position. Again, it is relevant to guarantee that we are not doing oversimplification.

Step 2

Step 2 turns terminal -y to -i when there is another vowel in the stem.

public void PerformStep2(TokenSource source)
{
    if (source.EndsWith('y')
        && source.ContainsVowel(source.Size - 2)
    )
    {
        source.Buffer[source.Size - 1] = 'i';
    }
}

Step 3

Step 3 maps double suffices to single ones. So -ization maps to -ize, etc.

Testing Step3 in isolation:

[Theory]
[InlineData("international", "internate")]
[InlineData("rational", "rational")]
[InlineData("constitutional", "constitution")]
[InlineData("energizer", "energize")]
[InlineData("internacionalization", "internacionalize")]
[InlineData("enumeration", "enumerate")]
[InlineData("consolidator", "consolidate")]
[InlineData("tropicalism", "tropical")]
[InlineData("vandalism", "vandal")]
[InlineData("activeness", "active")]
[InlineData("remorsefulness", "remorseful")]
public void Step3(string from, string to)
{
    var filter = new PorterStemmerFilter();
    using (var reader = new StringReader(from))
    {
        var tokenSource = new TokenSource(reader);
        tokenSource.Next();
        filter.PerformStep3(tokenSource);
        Assert.Equal(to, tokenSource.ToString());
    }
}

And now step3’s implementation.

public void PerformStep3(TokenSource source)
{
    if (source.Size == 0)
        return;

    switch (source.Buffer[source.Size - 2])
    {
        case 'a':
            if (source.ChangeSuffix("ational", "ate")) break;
            source.ChangeSuffix("tional", "tion");
            break;
        case 'c':
            if (source.ChangeSuffix("enci", "ence")) break;
            source.ChangeSuffix("anci", "ance");
            break;
        case 'e':
            source.ChangeSuffix("izer", "ize");
            break;
        case 'l':
            if (source.ChangeSuffix("bli", "ble")) break;
            if (source.ChangeSuffix("alli", "al")) break;
            if (source.ChangeSuffix("entli", "ent")) break;
            if (source.ChangeSuffix("eli", "e")) break;
            source.ChangeSuffix("ousli", "ous");
            break;
        case 'o':
            if (source.ChangeSuffix("ization", "ize")) break;
            if (source.ChangeSuffix("ation", "ate")) break;
            source.ChangeSuffix("ator", "ate");
            break;
        case 's':
            if (source.ChangeSuffix("alism", "al")) break;
            if (source.ChangeSuffix("iveness", "ive")) break;
            if (source.ChangeSuffix("fulness", "ful")) break;
            source.ChangeSuffix("ousness", "ous");
            break;
        case 't':
            if (source.ChangeSuffix("aliti", "al")) break;
            if (source.ChangeSuffix("iviti", "ive")) break;
            source.ChangeSuffix("biliti", "ble");
            break;
        case 'g':
            source.ChangeSuffix("logi", "log");
            break;
    }
}

Notes:

ChangeSuffix will replace the specified suffix if the prefix has a sequence of consonants.

Step 4

Step 4 deals with -ic-, -full, -ness etc. In a similar fashion of step3.

public void PerformStep4(TokenSource source)
{
    if (source.Size == 0)
        return;

    switch (source.LastChar)
    {
        case 'e':
            if (source.ChangeSuffix("icate", "ic")) break;
            if (source.RemoveSuffix("ative")) break;
            source.ChangeSuffix("alize", "al");
            break;
        case 'i':
            source.ChangeSuffix("iciti", "ic");
            break;
        case 'l':
            if (source.ChangeSuffix("ical", "ic")) break;
            source.RemoveSuffix("ful");
            break;
        case 's':
            source.RemoveSuffix("ness");
            break;
    }
}

Notes:

RemoveSuffix will remove the specified suffix if the prefix has a sequence of consonants.

Step 5

Step 5 takes off -ant, -ence, etc.

public void PerformStep5(TokenSource source)
{
    if (source.Size == 0)
        return;

    switch (source.Buffer[source.Size - 2])
    {
        case 'a':
            source.RemoveSuffix("al");
            return;
        case 'c':
            if (source.RemoveSuffix("ance")) return;
            source.RemoveSuffix("ence");
            return;
        case 'e':
            source.RemoveSuffix("er");
            return;
        case 'i':
            source.RemoveSuffix("ic");
            return;
        case 'l':
            if (source.RemoveSuffix("able")) return;
            source.RemoveSuffix("ible");
            return;
        case 'n':
            if (source.RemoveSuffix("ant")) return;
            if (source.RemoveSuffix("ement")) return;
            if (source.RemoveSuffix("ment")) return;
            source.RemoveSuffix("ent");
            return;
        case 'o':
            if (source.ChangeSuffix("tion", "t")) return;
            if (source.ChangeSuffix("sion", "s")) return;
            source.RemoveSuffix("ou");
            return;
        case 's':
            source.RemoveSuffix("ism");
            return;
        case 't':
            if (source.RemoveSuffix("ate")) return;
            source.RemoveSuffix("iti");
            return;
        case 'u':
            source.RemoveSuffix("ous");
            return;
        case 'v':
            source.RemoveSuffix("ive");
            return;
        case 'z':
            source.RemoveSuffix("ize");
            return;
        default:
            return;
    }
}

Step 6

Step 6 removes the final -e and -l.

public void PerformStep6(TokenSource source)
{
    switch (source.LastChar)
    {
        case 'e':
            var a = source.NumberOfConsoantSequences(source.Size - 1);
            if (a > 1 || a == 1 && !source.HasCvcAt(source.Size - 2))
                source.Size --;
            break;
        case 'l' when source.ContainsDoubleConsonantAt(source.Size - 1) && source.NumberOfConsoantSequences(source.Size - 2) > 0:
            source.Size--;
            break;
    }
}

How it affects the indexing and the search results

Adopting the Porter Stemming Algorithm will reduce the number of the distinct tokens and increase the posting list sizes.

Let’s see what happens with the queries.

[Theory]
[InlineData(
    new[]
    {
        "Human cannibalism is the act or practice of humans eating the flesh or internal organs of other human beings. ",
        "There are cannibals in some primitive communities.",
        "In marketing strategy, cannibalization refers to a reduction in sales volume, sales revenue,... ",
    },
    "Cannibalization",
    new[] { 0, 1, 2 }
)]
public void SearchByQuery_TestingPorterStemming(
    string[] documents,
    string query,
    int[] expectedResults
)
{
    var index = new StringIndexer().CreateIndex(documents);
    var searcher = new Searcher(index);
    var results = searcher.Search(query, DefaultAnalyzer.Instance);
    Assert.Equal(expectedResults, results);
}

In the test query, all documents are returned. Why? Cannibalization, Cannibals and Cannibalism are all stemmed to “cannib.”

Final Words

In this post, I shared the implementation of the famous Porter Stemming Algorithm. Needless to say how much I enjoy it.

Again, I am still studying Information Retrieval. So, if you found any errors in my implementation, let me know.

Compartilhe este insight:

Comentários

Participe deixando seu comentário sobre este artigo a seguir:

1 Comentário

Oldest

Newest Most Voted

Inline Feedbacks

View all comments

ras

2 anos atrás

where TokenSource come from ? it is not .net built in class, pls give implementation of TokenSource

Responder

AUTOR

Elemar Júnior

Fundador e CEO da EximiaCo atua como tech trusted advisor ajudando empresas e profissionais a gerar mais resultados através da tecnologia.

NOVOS HORIZONTES PARA O SEU NEGÓCIO

Nosso time está preparado para superar junto com você grandes desafios tecnológicos.

Entre em contato e vamos juntos utilizar a tecnologia do jeito certo para gerar mais resultados.

Insights EximiaCo

Confira os conteúdos de negócios e tecnologia desenvolvidos pelos nossos consultores:

Arquivo

Pós-pandemia, trabalho remoto e a retenção dos profissionais de TI

Fernando Paiva

CTO Consulting e Especialista em Execução em TI

Implementing the Porter Stemming Algorithm in C#

Stemming?

Porter Stemming?

Implementing it

Step 1

Step 2

Step 3

Step 4

Step 5

Step 6

How it affects the indexing and the search results

Final Words

Comentários

Insights EximiaCo

Pós-pandemia, trabalho remoto e a retenção dos profissionais de TI

Fernando Paiva

Muito obrigado!