Inverted Index and Ruby
Inverted Index and YOU!
Yahoo, Google, Microsoft and basically all other search engines had to start
somewhere and so we start here with an Inverted Index. What is it? Heck! why do we
even use it?
In the opening of this article i brought up a few search engines, they use large, billion and billions
of records that users can navigate though and search through, hell i even think at this point there are more
unique records in all three data sets then there are starts..pssh. So with all this data how in the world
does google get you the data super fast in a nice little package called a search result? Well its called
an inverted index. Im pretty sure they use much more advanced algorithms then this basic one but like I
said before we have to start somewhere.
For this example we have a database with 1,000,000,000 records and we need to find the words,”my dog is cool” in this database. We also need to locate where the combination of those words appear in. Lucky i know, they appear in row, 1, 8, 5000,999999, and 999999999. If we tried to run a search on this database we would have to search through ALL 1 billion rows in the database, comparing all the words in each row to the words that Im trying to serach for. Not cool.
This isnt good, not optimal, because it will take a very long time to find the results that we want. Got it? If you dont here is a very simple equation to remember.
“very big database” + “words to search for” = long time to wait for results if we dont have a good algorithm.
So how can we speed this up? What we can do is use, yes you guessed it, an “inverted index”. The inverted index allows us to run through each row, find where the combination of words occures, and finally keep one single row containing the Ids of the rows to where the combination of words appear in. This will allow us to look at one table and read one 1 row instead of 1,000,000,000 to identify where the combination of words appears. Once we identify which rows contain the word combination using that single row, we can present the results for the user. Much fater isnt it 🙂 Graphically this would look something like this.
Example #1 – Initial database without the inverted index. Table has rows from 1 to 1,000,000,000
TABLEX
ID TEXT
1 “on hey dude my dog is cool”
2 “i hate your dog”
3 “damn it! he pooped on my head”
4 “what? he did what?”
5 “he pooped on my head man!”
6 “oh..that…yea he does that”
7 “your ok with that?”
8 “yea my dog is cool like that”
.
.
.
.
.
1000000000 “yea i guess when you said my dog is cool…”
Example #1
With the Inverted Index. New table in the database.
TABLEY
TID KEY ROWS
1 “my dog is cool” {1, 8, 5000, 999999, 999999999}
2 “bleh…i hate…” {20, 26, 27, 28, 29, 10000, 202234, 99000}
Looking at the inverted index array you can see how
we set things up. We have a primary key like always, the unique phrase (this can be anything but has to be unique), and we have an array or some type of delimeter seperated string contaning all the rows from TableX where the unique key appears in.
So concluding, this section is on why we need the inverted index to save the world and possibly solve world hunger. No really we need it to reduce the wait time on mining for information.
By the way, I went ahead and implemented ruby code that you can use to mess with and test out the Inverted index concept with. Enjoy.
##############
# Inverted Index – Ruby Code
#
# Data Sets from:
# Book Data Mining, Concepts and techniques
#
# @author: Armando Padilla, [email protected]
##############
###
# INITIAL DATASET TO USE
# Tuple set from Page 181
#
# [tuple id, value 1, value 2, value 3, value 4, value 5]
#
###
@tuples = Array.new()
@tuples = [[‘1’, ‘a1’, ‘b1’, ‘c1’, ‘d1’, ‘e1’],
[‘2’, ‘a1’, ‘b2’, ‘c1’, ‘d2’, ‘e1’],
[‘3’, ‘a1’, ‘b2’, ‘c1’, ‘d1’, ‘e2’],
[‘4’, ‘a2’, ‘b1’, ‘c1’, ‘d1’, ‘e2’],
[‘5’, ‘a2’, ‘b1’, ‘c1’, ‘d1’, ‘e3’]
]
###
# Holds the inverted array
###
invertedIndex = Hash.new()
###
# FOREACH ROW
###
i = 0
until i == @tuples.size
#Get the row
tuple = @tuples[i]
#Get the number of attributes
size = tuple.size()
###
# FOR EACH ATTRIBUTE IN THE ROW
####
j=1
until j == size
#Get the TID
tid = tuple[0]
#Get the current attribute
attribute = tuple[j]
#Check if we have already visited this attribute.
if !invertedIndex.has_key?(attribute)
###
# If we havent look down the column and
# get the TID of each row the attribute appears
# and add it to a array container. When done
# add teh array container to the main invertedIndex Hash.
##
x = Array.new()
k=0
until k == @tuples.size
if @tuples[k][j] == attribute
#puts “Attribute: “+attribute+” TID: “+@tuples[k][0]
x.push(@tuples[k][0])
end
k = k+1
end
invertedIndex[attribute] = x
end #end – if the attribute has already been visited.
j = j+1
end #end – loop through each attribute
i = i+1
end #end – loop through the rows
#######################################
# Print out the resulting Inverted Index
#######################################
invertedIndex.each{|key, value| puts key+ ” “+value.to_s}
Armando Padilla