Mobile phone search
Discuss inverted index technology and fundamentals
You can read the speaker notes by pressing
Consulting Software Engineer, Red Hat
Open Source coder:
Debezium, Infinispan, …
Podcasts: Les Cast Codeurs
B-tree vs inverted index
Building & using an inverted index
Mobile phone search
more like this / classification
SELECT * FROM Speakers s WHERE s.bio = 'Open Source'
How is the index built?
Self-balancing tree data structure
Search, insert and delete in O(log n)
Good when access time >> process time
Replace the numbers in the example with letters to index text.
Only keys in the non-leaf nodes
Leaf nodes linked with one another for efficient ascending reading
Data can be just pointer to the real data
Not only is that terrible in general, but you just KNOW Billy’s going to open the root present first, and then everyone will have to wait while the heap is rebuilt.
SELECT * FROM Speakers s WHERE s.bio LIKE 'Open Source%'
With LIKE we can match more text after the prefix
Still using indices
SELECT * FROM Speakers s WHERE s.bio LIKE '%Open Source%'
Find word anywhere in the text
Table or index scan :(
SELECT * FROM Speakers s WHERE s.bio LIKE '%open source%' OR s.bio LIKE '%Open Source%' OR s.bio LIKE '%opan surce%'
Can’t anticipate the casing
Can’t anticipate all typos
SELECT * FROM Speakers s WHERE s.bio LIKE '%source open%' OR s.bio LIKE '%source%' OR s.bio LIKE '%open%' ORDER BY best??
Words could be in any order
I want the most interesting result first
Some have powerful indexing techniques
Some even full-text related
Tend to have less flexibility than dedicated inverted index tools
Let’s not index column values but words
Let’s not query values but words
doc1: I am your father Luke
doc2: Yes he is your father
doc3: I am gonna make him an offer he can not refuse.
doc4: I love the smell of napalm in the morning.
doc5: One morning I shot an elephant in my pajamas. How he got in my pajamas, I do not know.
query: father napalm
Apply the same word splitting logic
Matching documents: 1, 2 and 4
Apply the same logic to both document and query content
Each token is the entry in the inverted index pointing to documents
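The index built from the five example documents can be sketched in a few lines (a toy sketch, not Lucene's implementation; the naive word splitting stands in for a real analyzer):

```python
# Toy inverted index over the five example documents.
docs = {
    1: "I am your father Luke",
    2: "Yes he is your father",
    3: "I am gonna make him an offer he can not refuse.",
    4: "I love the smell of napalm in the morning.",
    5: ("One morning I shot an elephant in my pajamas. "
        "How he got in my pajamas, I do not know."),
}

def analyze(text):
    # Same splitting + lowercasing logic for documents and queries.
    return [w.strip(".,!?").lower() for w in text.split() if w.strip(".,!?")]

index = {}
for doc_id, text in docs.items():
    for token in analyze(text):
        index.setdefault(token, set()).add(doc_id)

def search(query):
    # OR semantics: union of the posting lists of each query token.
    result = set()
    for token in analyze(query):
        result |= index.get(token, set())
    return sorted(result)

print(search("father napalm"))  # → [1, 2, 4]
```

The key point: the same `analyze` function runs on both documents and queries, so both sides meet on identical tokens.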
Remove unnecessary characters
e.g. remove HTML tags
<p>This is <strong>awesome</strong>.</p> → This is awesome.
Split sentence into words called tokens
Split at spaces, dots and other punctuation (with exceptions)
A.B.C., and many other rules
One tokenizer per language, but many languages are similar
Operate on the stream of tokens
Change, remove or even add tokens
Sentence: This is AWESOME Peter!
Tokens: |This|is|AWESOME|Peter|
After stopword filter: |AWESOME|Peter|
After lowercase filter: |awesome|peter|
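The tokenizer-plus-filter chain above can be sketched as follows (a minimal sketch; the stopword list and the splitting regex are illustrative, real analyzers have many more rules):

```python
import re

# A minimal analysis chain: tokenizer followed by token filters.
STOPWORDS = {"this", "is", "the", "a", "an"}  # tiny illustrative list

def tokenize(sentence):
    # Split on spaces and punctuation (a real tokenizer has exceptions,
    # e.g. for acronyms like A.B.C.).
    return [t for t in re.split(r"[\s.,!?]+", sentence) if t]

def stopword_filter(tokens):
    # Drop very common words that carry little signal.
    return [t for t in tokens if t.lower() not in STOPWORDS]

def lowercase_filter(tokens):
    return [t.lower() for t in tokens]

tokens = tokenize("This is AWESOME Peter!")   # ['This', 'is', 'AWESOME', 'Peter']
tokens = stopword_filter(tokens)              # ['AWESOME', 'Peter']
tokens = lowercase_filter(tokens)             # ['awesome', 'peter']
print(tokens)
```

Filters compose: each one consumes the previous filter's token stream, which is why they can change, remove or add tokens independently.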
When the text mentions a "car" but the search is for "automobile" or "vehicle"
We need a synonym dictionary
Put all synonyms in the index for each word
Use a reference synonym ("automobile" for "car", "compact", "auto", "S.U.V."…)
Index normally, use synonyms when building the query
That would make for lots of synonyms…
Let’s use a stemming algorithm
Porter stemming algorithm
French algorithm explained
Index/query the stem when the word is found
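The idea of stemming can be sketched with a toy suffix stripper (this is NOT the Porter algorithm, whose rules are far more careful; it only illustrates mapping inflected forms to a shared stem at both index and query time):

```python
# Toy suffix-stripping stemmer — an illustration only, not Porter.
SUFFIXES = ["ation", "ing", "ies", "ed", "es", "s"]

def stem(word):
    word = word.lower()
    for suffix in SUFFIXES:
        # Strip the first matching suffix, keeping at least 3 chars of stem.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

# Indexed word and query word reduce to the same stem.
print(stem("indexing"), stem("indexes"))  # → index index
```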
People make mistakes
In the text or in the query
They make thaipo and other mystakes
Same logic as stemming, convert word into phonetic approximation
Soundex, RefinedSoundex, Metaphone, DoubleMetaphone
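The simplest of these, classic American Soundex, fits in a few lines (a sketch of the standard algorithm: keep the first letter, encode the rest by phonetic group, collapse repeats, pad to 4 characters):

```python
def soundex(word):
    # Phonetic groups of the American Soundex algorithm.
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    word = word.lower()
    result = word[0].upper()
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        code = codes.get(ch, "")
        if code and code != prev:   # collapse adjacent duplicates
            result += code
        if ch not in "hw":          # h/w do not separate duplicate codes
            prev = code
    return (result + "000")[:4]

# Different spellings, same phonetic approximation in the index.
print(soundex("Robert"), soundex("Rupert"))  # → R163 R163
```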
Split a word into a sliding window of n characters
Index each n-gram
Building 3-grams:
mystake: mys yst sta tak ake
mistake: mis ist sta tak ake
Low n means more false positives
High n means less forgiving
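The sliding window is a one-liner; the typo and the correct word share 3 of their 5 trigrams, enough overlap to match:

```python
def ngrams(word, n=3):
    # Sliding window of n characters over the word.
    return [word[i:i + n] for i in range(len(word) - n + 1)]

print(ngrams("mystake"))  # → ['mys', 'yst', 'sta', 'tak', 'ake']
print(ngrams("mistake"))  # → ['mis', 'ist', 'sta', 'tak', 'ake']
```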
Based on Damerau-Levenshtein distance
insert, update, delete and transposition
Pure query time operation
Compute distance between word and all words in index
Compute a distance state machine for word
Use it to check specific terms in the index
State nᵉ: n consumed chars, e errors
horizontal transition: unmodified chars
vertical transition: addition
diagonal transition: substitution
ε diagonal transition: deletion
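The distance itself can be computed with dynamic programming (a sketch of the restricted Damerau-Levenshtein / optimal string alignment variant, not the automaton: the automaton exists precisely to avoid running this against every indexed term):

```python
def dl_distance(a, b):
    # Restricted Damerau–Levenshtein: insertions, deletions,
    # substitutions, and adjacent transpositions.
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

print(dl_distance("mystake", "mistake"))  # → 1 (one substitution)
print(dl_distance("ab", "ba"))            # → 1 (one transposition)
```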
Apply different indexing approaches to the same data
It’s term queries all the way down!
All queries (synonyms, phonetic, n-gram, fuzzy) are a (set of) term queries
Term, wildcard, prefix, fuzzy, phrase, range, boolean, all, spatial, more like this, spell checking
Find exact sentences
or find words near one another (sloppiness)
"Laurel and Hardy"
PhraseQuery uses positional information
Shingles uses n-grams but per tokens not per chars
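How a positional index answers a phrase query can be sketched like this (a toy sketch without sloppiness: a match requires each following term exactly one position later):

```python
# Positional index: term → {doc_id: [positions]}.
docs = {1: "laurel and hardy movies", 2: "hardy and laurel"}

pos_index = {}
for doc_id, text in docs.items():
    for pos, token in enumerate(text.split()):
        pos_index.setdefault(token, {}).setdefault(doc_id, []).append(pos)

def phrase_search(phrase):
    terms = phrase.split()
    postings = [pos_index.get(t, {}) for t in terms]
    # Only documents containing every term can match the phrase.
    common = set.intersection(*(set(p) for p in postings))
    hits = []
    for doc_id in common:
        for start in postings[0][doc_id]:
            # Each following term must sit exactly one position later.
            if all(start + k in postings[k][doc_id]
                   for k in range(1, len(terms))):
                hits.append(doc_id)
                break
    return sorted(hits)

print(phrase_search("laurel and hardy"))  # → [1], not doc 2 (wrong order)
```

A sloppy phrase query would relax the `start + k` check to allow positions within an edit distance of the expected slot.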
Found results but in random order…
We want the most relevant results first
This is relative
Several approaches, none perfect
How often does the term appear in this document?
More is better
How often does the term appear in all documents in the collection?
Common words are less important
How long is the field?
Long documents would be favored otherwise
If a document contains multiple query terms, it’s a better fit.
Vector space model
Term Frequency / Inverse Document Frequency
Positional (phrase query) or similarity (fuzzy) information
Feedback function (external or internal)
Your custom scoring function (or a tweak of)
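The classic TF-IDF weighting behind the first model can be sketched as follows (a sketch only; Lucene's actual similarity adds field-length normalization, coordination and boost factors):

```python
import math

# TF rewards repeated terms in a document; IDF discounts terms that are
# common across the whole collection.
docs = {
    1: "open source code and open standards",
    2: "closed source product",
    3: "open bar",
}
N = len(docs)
tokenized = {d: text.split() for d, text in docs.items()}

def tf(term, doc_id):
    return tokenized[doc_id].count(term)

def idf(term):
    df = sum(1 for toks in tokenized.values() if term in toks)
    return math.log(N / df) if df else 0.0

def score(query, doc_id):
    return sum(tf(t, doc_id) * idf(t) for t in query.split())

ranked = sorted(docs, key=lambda d: score("open source", d), reverse=True)
print(ranked[0])  # → 1: it matches both terms, and 'open' twice
```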
A Lucene example
Search engine library
Used by many, including
When you need write throughput
B-tree requires lots of updates in place
Sequential reads are much faster than random reads
kinda also in memory (L1-3 caches)
Append operations in a file
Reading requires scanning the whole log
Per batch of writes, create a file storing the sorted key/value pairs
On read, check for the key on each file
Regularly merge files together (e.g. make bigger files)
Immutable (lock-free) and file cache friendly
Fast on write, decent on read
Sequential read/write friendly
Read time decays with number of files ⇒ merge
Page index in memory
Limit the number of files to read
One file per level to be consulted
Compact to the higher levels
Files within a level have non-overlapping key ranges
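The write path, read path and compaction above can be sketched with a toy single-level LSM (a sketch under simplifying assumptions: dicts stand in for on-disk sorted files, and there is no write-ahead log):

```python
# LSM-tree sketch: writes go to an in-memory table; when it fills up it is
# flushed as an immutable sorted segment. Reads check the memtable first,
# then segments from newest to oldest.
class TinyLSM:
    def __init__(self, memtable_limit=2):
        self.memtable = {}
        self.segments = []          # newest first, each a sorted dict
        self.limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.limit:
            # Flush: one sequential write, never an update in place.
            self.segments.insert(0, dict(sorted(self.memtable.items())))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for seg in self.segments:   # newest value wins
            if key in seg:
                return seg[key]
        return None

    def compact(self):
        # Merge all segments; newer values override older ones.
        merged = {}
        for seg in reversed(self.segments):
            merged.update(seg)
        self.segments = [dict(sorted(merged.items()))]

db = TinyLSM()
for k, v in [("a", 1), ("b", 2), ("a", 3), ("c", 4)]:
    db.put(k, v)
db.compact()
print(db.get("a"))  # → 3 (the newest write wins)
```

Compaction is what keeps read time from decaying: without it, `get` degrades linearly with the number of segments.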
Everything is computed upfront
Each segment is a mini index
Denormalize differently depending on access pattern
term index (like a ToC for the dictionary)
term dictionary (points to posting list offset)
posting list (list of matching document id per term)
stored field index (sparse doc id + offset)
stored field data (list of field values per document id)
Term index provides offset to the dictionary
Based on finite state transducers
Gives one ordinal per prefix
We know where to look in the term dictionary
FST for mop, moth, pop, star, stop and top
mop=0 moth=1 pop=2 star=3 stop=4 top=5
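What the FST computes can be sketched with a plain sorted list (same term-to-ordinal mapping as the mop/moth/pop/star/stop/top example; the real FST shares prefixes and suffixes to fit the mapping in far less memory):

```python
import bisect

# Term → ordinal (rank in sorted order), the FST's output.
terms = sorted(["mop", "moth", "pop", "star", "stop", "top"])

def ordinal(term):
    i = bisect.bisect_left(terms, term)
    return i if i < len(terms) and terms[i] == term else None

print([(t, ordinal(t)) for t in terms])
# → [('mop', 0), ('moth', 1), ('pop', 2), ('star', 3), ('stop', 4), ('top', 5)]
```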
From a given offset (& prefix)
Sorted list of suffixes
For each, frequency and offset to posting list
[prefix=top]
_null_, freq=27, offset=234
ography, freq=1, offset=298
ology, freq=6, offset=306
onyms, freq=1, offset=323
List of document ids
Variable int encoding
Encoded as delta of ids (good for variable int encoding)
4,6,9,30,33,39,45 => 4,2,3,21,3,6,6
Bigger in size but fewer CPU branch mispredictions
Also pulse optimization for frequency=1
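Delta and variable-int encoding of a posting list can be sketched as follows (a sketch of the general technique; Lucene's actual postings format is more elaborate):

```python
def delta_encode(doc_ids):
    # Store the first id, then gaps; small gaps need fewer varint bytes.
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def delta_decode(deltas):
    out, acc = [], 0
    for d in deltas:
        acc += d
        out.append(acc)
    return out

def varint(n):
    # Variable-length int: 7 payload bits per byte, high bit = "more bytes".
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        out.append(byte | (0x80 if n else 0))
        if not n:
            return bytes(out)

posting = [4, 6, 9, 30, 33, 39, 45]
deltas = delta_encode(posting)
print(deltas)                            # → [4, 2, 3, 21, 3, 6, 6]
print(delta_decode(deltas) == posting)   # → True
print(len(varint(300)), len(varint(3)))  # → 2 1
```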
Stored field index in memory: doc id + offset for every 16k of data
Stored values stored as blocks of 16k and compressed
You said segments are immutable
What about deleted documents?
Deleted document file
1 bit per doc
sparse list of cleared docs
2 disk seeks per field search (binary search)
1 disk seek per doc for stored fields
But things likely fit in file system cache
Warning: this is a simplified view :)
Called doc values
Used for aggregation or sorting or faceting
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
XKCD images are licensed under Creative Commons Attribution-NonCommercial 2.5 License.
A couple of drawings are copyright of their respective author (linked in the references).
B-tree and LSM
Lucene file and memory structure