A First Take at Building an Inverted Index and Querying Using C#

Conteúdo

1 Major steps to build an inverted index

In this post, I would like to share my first attempt to create a (still) elementary search library. The source code is available on my GitHub.

I am learning how to do it from the “Introduction to Information Retrieval” book (which is part of my current reading list). Also, all my code is inspired by the Ayende’s Corax project.

Major steps to build an inverted index

Following what I read, I would need:

Collect the documents to be indexed – I will use simple strings for while;
Tokenize the text, turning each document into a list of tokens
Do linguistic preprocessing, producing a list of indexing terms
Index the documents that each term occurs in by creating an inverted index, consisting of a dictionary and postings.

So, let’s do it.

Collecting and tokenizing documents

Starting simple, I decided that my documents would be simple strings. So, here is a test to do it.

public class TokenSourceTests
{
    [Theory]
    [InlineData(
        "This is a simple string.",
        new[] {"This", "is", "a", "simple", "string"}
    )]
    public void TokenizationWorks(string input, string[] expected)
    {
        using (var reader = new StringReader(input))
        {
            var tokenSource = new TokenSource(reader);
            var result = tokenSource.ReadAll().ToArray();
            Assert.Equal(expected, result);
        }
    }
}

I will ignore punctuation and white spaces.

public sealed class TokenSource
{
    private readonly TextReader _reader;
        
    public TokenSource(TextReader reader)
    {
        _reader = reader;
    }

    public char[] Buffer { get; } = new char[256];
    public int Size { get; private set; } = 0;

    public bool Next()
    {
        Size = 0;
        int r;
        while ((r = _reader.Read()) != -1) // EOF
        {
            var ch = (char)r;

            if (ch == 'r' || ch == 'n' || char.IsWhiteSpace(ch) || char.IsPunctuation(ch))
            {
                if (Size > 0)
                {
                    return true;
                }
            }
            else
            {
                Recognize(ch);
            }
        }

        return Size > 0;
    }

    private void Recognize(char curr)
    {
        Buffer[Size++] = curr;
    }

    public IEnumerable<string> ReadAll()
    {
        while (Next())
        {
            yield return ToString();
        }
    }

    public IEnumerable<string> ReadAll(Func<TokenSource, bool> filter)
    {
        while (Next())
        {
            if (filter(this))
            {
                yield return ToString();
            }
        }
    }

    public override string ToString()
    {
        return new string(Buffer, 0, Size);
    }
}

The idea of using this array of chars looks excellent (that comes from Ayende). The code does not generate “garbage” strings.

Doing linguistic preprocessing

Having tokens, I need to normalize them (I will merely use lower cases) and to ignore stop words.

public class DefaultAnalyzerTests
{
    [Theory]
    [InlineData(
        "This is a simple string.",
        new[] { "simple", "string" }
    )]
    public void TokenizationWorks(string input, string[] expected)
    {
        using (var reader = new StringReader(input))
        {
            var tokenSource = new TokenSource(reader);
            var result = tokenSource
                .ReadAll(DefaultAnalyzer.Instance.Process)
                .ToArray();
            Assert.Equal(expected, result);
        }
    }
}

As you can see, DefaultAnalyzer class do all the filtering/normalizing process.

using System;
using Tupi.Indexing.Filters;

namespace Tupi.Indexing
{
    public class DefaultAnalyzer : IAnalyzer
    {
        private readonly IFilter[] _filters =
        {
            new ToLowerFilter(), 
            new StopWordsFilter()
        };

        public bool Process(TokenSource source)
        {
            for (var i = 0; i < _filters.Length; i++)
            {
                if (!_filters[i].Process(source))
                    return false;
            }
            return true;
        }

        private static readonly Lazy<DefaultAnalyzer> LazyInstance = 
            new Lazy(() => new DefaultAnalyzer());
        public static DefaultAnalyzer Instance => LazyInstance.Value;
    }
}

The analyzer uses two filters. One to make all tokens lower case.

public class ToLowerFilter : IFilter
{
    public bool Process(TokenSource source)
    {
        for (var i = 0; i < source.Size; i++)
        {
            source.Buffer[i] = char.ToLowerInvariant(source.Buffer[i]);
        }
        return true;
    }
}

One to ignore stop words.

using System.Collections.Generic;

namespace Tupi.Indexing.Filters
{
    // based on:
    // https://github.com/ayende/Corax/blob/master/Corax/Indexing/Filters/StopWordsFilter.cs
    public class StopWordsFilter : IFilter
    {
        private static readonly string[] DefaultStopWords =
        {
            "a", "an", "and", "are", "as", "at", "be", "but", "by", "for",
            "if", "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that", "the",
            "their", "then", "there", "these", "they", "this", "to", "was", "will", "with"
        };

        private readonly HashSet<ArraySegmentKey> _stopWords
            = new HashSet<ArraySegmentKey>();

        public StopWordsFilter()
        {
            foreach (var word in DefaultStopWords)
            {
                _stopWords.Add(new ArraySegmentKey(word.ToCharArray()));
            }
        }

        public bool Process(TokenSource source)
        {
            var term = new ArraySegmentKey(source.Buffer, source.Size);
            return !_stopWords.Contains(term);
        }
    }
}

Creating an inverted index

Now that we have all the components, the indexing process is simple.

public class StringIndexer
{
    public static InvertedIndex CreateIndex(
        params string[] documents
    )
    {
        var result = new InvertedIndex();

        for (var i = 0; i < documents.Length; i++)
        {
            using (var reader = new StringReader(documents[i]))
            {
                var tokenSource = new TokenSource(reader);

                var tokens = tokenSource
                    .ReadAll(DefaultAnalyzer.Instance.Process)
                    .Distinct()
                    .ToArray();

                foreach (var token in tokens)
                {
                    result.Append(token, i);
                }
            }
        }

        return result;
    }
}

I tokenize all the documents, updating a simple data structure.

public class InvertedIndex
{
    private readonly IDictionary<string, List> _data = 
        new Dictionary<string, List>();
    internal void Append(string term, int documentId)
    {
        if (_data.ContainsKey(term))
        {
            _data[term].Add(documentId);
        }
        else
        {
            var postings = new List {documentId};
            _data.Add(term, postings);
        }
    }
    // ...
}

As you can see, an inverted index is just a dictionary of terms, where the values are lists of documents containing these terms.

Processing Boolean Queries

The technique for boolean queries is simple. Considering that I have to terms to query, I would need:

Locate the first term in the dictionary
Retrieve its postings
Locate the second term in the dictionary
Retrieve its postings
Intersect the two posting lists.

Here is my test:

[Theory]
[InlineData(
    new[]
    {
        "Elemar is learning how to create an inverted index",
        "It is cool to use csharp.",
        "This is a simple string document created by Elemar",
        "This string will be indexed in an inverted index generated by Elemar"
    },
    "Elemar inverted",
    new[] { 0, 3 }
)]
public void SearchByQuery(
    string[] documents,
    string query,
    int[] expectedResults
)
{
    string[] terms;
    using (var reader = new StringReader(query))
    {
        terms = new TokenSource(reader)
            .ReadAll(DefaultAnalyzer.Instance.Process)
            .ToArray();
    }

    var index = StringIndexer.CreateIndex(documents);
    var results = index.Search(terms);
    Assert.Equal(expectedResults, results);
}

I start splitting the query into terms and then using these terms to search.

Here is my implementation:

public IEnumerable<int> Search(params string[] terms)
{
    // no terms
    if (terms.Length == 0)
    {
        return Enumerable.Empty();
    }

    // only one term
    if (terms.Length == 1)
    {
        return _data.ContainsKey(terms[0])
            ? _data[terms[0]]
            : Enumerable.Empty();
    }

    var postings = new List<List>();
    for (var i = 0; i < terms.Length; i++)
    {
        // there is a term with no results
        if (!_data.ContainsKey(terms[i]))
        {
            return Enumerable.Empty();
        }
        postings.Add(_data[terms[i]]);
    }

    // 
    postings.Sort((l1, l2) => l1.Count.CompareTo(l2.Count));

    var results = postings[0];

    for (var i = 1; i < postings.Count; i++)
    {
        results = results.Intersect(postings[1]).ToList();
    }

    return results;
}

That’s it.

Conclusions

It is just my first attempt in years to work with inverted indexes. It’s not that hard, but it is not that simple too.

Ayende’s Corax project was an excellent reference for tokenizing and analyzing documents. Also, the “Information Retrieval” book that I have been reading is straightforward to follow and understand.

Please, share your thoughts about the code. Let’s improve it!

Compartilhe este insight:

Comentários

Participe deixando seu comentário sobre este artigo a seguir:

0 Comentários

Oldest

Newest Most Voted

Inline Feedbacks

View all comments

AUTOR

Elemar Júnior

Fundador e CEO da EximiaCo atua como tech trusted advisor ajudando empresas e profissionais a gerar mais resultados através da tecnologia.

Arquitetura Corporativa

Ajudamos empresas a estruturar e fortalecer sua capacidade interna em Arquitetura Corporativa, promovendo alinhamento estratégico entre TI e Negócios, eficiência operacional e sustentação da transformação digital.

NOVOS HORIZONTES PARA O SEU NEGÓCIO

Nosso time está preparado para superar junto com você grandes desafios tecnológicos.

Entre em contato e vamos juntos utilizar a tecnologia do jeito certo para gerar mais resultados.

Insights EximiaCo

Confira os conteúdos de negócios e tecnologia desenvolvidos pelos nossos consultores:

Arquivo

Pós-pandemia, trabalho remoto e a retenção dos profissionais de TI

15/07/2021

Fernando Paiva