Monday, April 2, 2018

What are N-Grams?

N-grams are all the contiguous sequences of n adjacent words or letters that you can find in your source text.
For example, given the word fox, all 2-grams (or “bigrams”) are fo and ox.
You may also count the word boundary – that would expand the list of 2-grams to #f, fo, ox, and x#, where # denotes a word boundary.
You can do the same on the word level. As an example, the text hello, world! contains the following word-level bigrams: # hello, hello world, and world #.
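The examples above can be reproduced with a few lines of Python. This is only a minimal sketch: the function names char_ngrams and word_ngrams, the optional boundary parameter, and the choice to lowercase the text and drop punctuation before splitting it into words are all assumptions made for illustration, not part of any particular library.

```python
import re


def char_ngrams(word, n, boundary=None):
    """Return all character n-grams of `word`, left to right.

    If `boundary` is given (e.g. "#"), it is added to both ends of the
    word first, so n-grams that touch the word boundary are included.
    """
    if boundary is not None:
        word = boundary + word + boundary
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def word_ngrams(text, n, boundary=None):
    """Return all word-level n-grams of `text` as tuples of words.

    Assumption: words are runs of letters/digits, lowercased, and
    punctuation is ignored.
    """
    tokens = re.findall(r"\w+", text.lower())
    if boundary is not None:
        tokens = [boundary] + tokens + [boundary]
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


print(char_ngrams("fox", 2))                  # ['fo', 'ox']
print(char_ngrams("fox", 2, boundary="#"))    # ['#f', 'fo', 'ox', 'x#']
print(word_ngrams("hello, world!", 2, boundary="#"))
# [('#', 'hello'), ('hello', 'world'), ('world', '#')]
```

If the input is shorter than n, both functions simply return an empty list.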
The basic point of n-grams is that they capture the structure of a language statistically, for example which letter or word is likely to follow a given one.
  • The longer the n-gram (the higher the n), the more context you have to work with. 
  • The optimal length depends on the application – if your n-grams are too short, you may fail to capture important differences.
  • On the other hand, if they are too long, you may fail to capture the “general knowledge” and only capture particular cases.
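
To make the statistical view and the trade-off in n a bit more concrete, here is a rough sketch that counts which character follows each context of n-1 characters. The function name follower_counts and the sample sentence are made up for illustration; real applications would use a much larger text.

```python
from collections import Counter, defaultdict


def follower_counts(text, n):
    """Map each (n-1)-character context to a Counter of the characters
    that follow it, based on the character n-grams of `text`."""
    counts = defaultdict(Counter)
    for i in range(len(text) - n + 1):
        context = text[i:i + n - 1]      # first n-1 characters of the n-gram
        following = text[i + n - 1]      # last character of the n-gram
        counts[context][following] += 1
    return counts


sample = "the cat sat on the mat"   # hypothetical toy corpus

bigram_stats = follower_counts(sample, 2)
fourgram_stats = follower_counts(sample, 4)

# With n = 2 the context is a single letter: "t" is followed by "h" or a space.
print(bigram_stats["t"])
# With n = 4 the context is three letters: "the" is always followed by a space here.
print(fourgram_stats["the"])
# Longer n-grams produce more distinct contexts, each seen fewer times.
print(len(bigram_stats), len(fourgram_stats))
```

On this toy sentence the short contexts are observed often but say little, while the longer contexts are more specific but each one is seen only once or twice, which is the trade-off described in the list above.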