What are n-grams and how do I use them?

By Edwin Dijk, on 13 January 2021

What are n-grams?

N-grams are a way of modeling sequences in all sorts of (textual or numeric) information. The best way to describe this is by looking at a couple of examples. Let's take the sentence "What are n-grams". We can look at this sentence as a sequence of words, separated by spaces. Each individual word is an n-gram, and to be precise, a unigram. If we take all the two-word sequences, we get bigrams, and lastly, if we take the three-word sequences, we get trigrams.

To visualise this, we can define the collections Cn of n-grams as:

Unigram C1 = {what, are, n-grams}
Bigram C2 = {what are, are n-grams}
Trigram C3 = {what are n-grams}


We can of course expand this to a 10-gram if we wanted to.
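To make this concrete, here is a minimal sketch in Python (the ngrams function is my own illustration, not a library call) that generates these collections:

```python
def ngrams(text, n):
    # Split the text into words, then join every run of n
    # consecutive words back into a single n-gram string.
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

sentence = "what are n-grams"
print(ngrams(sentence, 1))  # ['what', 'are', 'n-grams']
print(ngrams(sentence, 2))  # ['what are', 'are n-grams']
print(ngrams(sentence, 3))  # ['what are n-grams']
```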


Using n-grams for statistical natural language processing

Originally, n-grams were used as a method to predict the statistical probability of the next element in a sequence. For example, based on the dataset {ab, ab, ac}, one could say that the probability of having b after a is 2/3 and the probability of having c after a is 1/3. This is where n-grams come in handy, because they make this approach work on a larger scale and for more complicated comparisons and predictions.

By examining a big corpus (i.e. a large dataset of text), you are able to make pretty accurate predictions about which word will come after a specific sequence of words. It is, however, important to note that this is a purely statistical method and does not help distill meaning from text.
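To sketch the idea in code (a toy example of my own, not a standard library): you can estimate these probabilities by simply counting which elements follow which. Running it on the {ab, ab, ac} dataset from above gives exactly the 2/3 and 1/3 we calculated by hand.

```python
from collections import Counter, defaultdict

def next_element_probabilities(sequences):
    # For every element, count which elements directly follow it.
    counts = defaultdict(Counter)
    for seq in sequences:
        for current, following in zip(seq, seq[1:]):
            counts[current][following] += 1
    # Turn the raw counts into relative frequencies.
    return {elem: {nxt: c / sum(ctr.values()) for nxt, c in ctr.items()}
            for elem, ctr in counts.items()}

print(next_element_probabilities(["ab", "ab", "ac"]))
# {'a': {'b': 0.6666666666666666, 'c': 0.3333333333333333}}
```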

This method of calculating probability can be used for pretty much every type of sequence, whether it concerns natural language, musical notes, DNA sequences, or any other ordered data.

Using n-grams in a web crawler

A different way of using n-grams when developing a search engine is to save storage space. Certain sequences of words are very rare, while others are not. When two sequences overlap, the probability increases that they are part of the same longer sequence. Let's illustrate that with another example sequence, S = "this is an example sentence for this article".

We can now turn this into a collection of n-grams. A collection of unigrams wouldn't be very useful, as single words don't give us any meaningful overlap. So let's make a collection of bigrams first:

C = {this is, is an, an example, example sentence, sentence for, for this, this article}

What we are looking for is an overlap between two elements. (You might see this as an intersection between two elements that represent collections, but in this case the specific order matters and that complicates things, so let's set that aside for now.) If we can find an overlap between two elements, the probability increases that these elements are part of the same sequence.

Okay, let's simplify further by dividing the sequence up into 7-grams (the total length of the sequence is |S| = 8 words). We get these two 7-grams: "this is an example sentence for this" and "is an example sentence for this article". These two 7-grams overlap in 6 elements of the sequence. When you see these two 7-grams in the wild, you can be pretty sure they're part of the same 8-word sequence.

That means you don't have to store every possible 8-word sequence in order to search for perfect matches in your database. You could instead store only every 7-word sequence and use two queries to check whether both sequences occur in the same document. If they do, you've established with very high confidence that the document actually contains the 8-word sequence. Although, technically, you're not 100% sure!
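Here's a minimal sketch of that overlap check (overlap_and_merge is just a name I made up for illustration):

```python
def overlap_and_merge(gram_a, gram_b):
    # If the last n-1 words of gram_a equal the first n-1 words of
    # gram_b, the two n-grams very likely come from the same longer
    # sequence, which we reconstruct by appending gram_b's last word.
    a, b = gram_a.split(), gram_b.split()
    if a[1:] == b[:-1]:
        return " ".join(a + b[-1:])
    return None

print(overlap_and_merge("this is an example sentence for this",
                        "is an example sentence for this article"))
# 'this is an example sentence for this article'
```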

Practical use

This way of storing data has profound implications for both your storage requirements and the number of database queries you make. Lowering n requires less storage, but increases the number of queries needed when searching for longer strings.

I have used this methodology in my own search engine project with trigrams. This means that every search query longer than 3 words needs to be divided up into a number of trigrams: for a 5-word search phrase, the search engine has to query the database for 3 different trigrams and compute the matches at runtime. This may or may not be the best way to go in your specific situation.
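Roughly sketched (lookup_trigram here is hypothetical; in a real application it would be a database query returning the set of documents containing a given trigram):

```python
def ngrams(text, n=3):
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def search(phrase, lookup_trigram):
    # A document can only contain the full phrase if it contains
    # every trigram of that phrase, so intersect the result sets.
    candidate_sets = [lookup_trigram(t) for t in ngrams(phrase)]
    return set.intersection(*candidate_sets)

print(ngrams("five word search phrase here"))
# ['five word search', 'word search phrase', 'search phrase here']
```

Note that the intersection only tells you a document contains all three trigrams; as said above, you're dealing with probability, not certainty.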

It is perfectly okay to remove all punctuation from the strings you process. As long as you use the same method of removing punctuation at search time, the influence of this practice is not enormous in most cases. There are always exceptions, though, so when creating real-life applications you might want to keep a close watch on the data processing to catch any oddities.
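A minimal sketch of such a normalization step (my own example; the important part is that the exact same function runs at both indexing time and search time):

```python
import string

def normalize(text):
    # Strip all punctuation and lowercase the text, so the indexed
    # n-grams and the search-time n-grams always line up.
    return text.translate(str.maketrans("", "", string.punctuation)).lower()

print(normalize("This is an example sentence, for this article!"))
# 'this is an example sentence for this article'
```

Note that this also turns "n-grams" into "ngrams", which is exactly the kind of oddity you'll want to keep an eye on.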

Using n-grams is very helpful when full-text search is not an option, but it's also dangerous territory. That brings me to one last piece of advice: always remember you're dealing with probability!

About Edwin Dijk:
Edwin is a real search engine technology enthusiast. In daily life he works as a software engineer and runs a small software development company.

More about Edwin? Check out www.researchgate.net/profile/Edwin_Dijk2

