Forty Six and Two: Back at Uni

So yesterday was my first day back at Uni. Also I should have a flat sorted by the end of the week. Now it's time for me to type up my notes so I can remember them when it comes time for the exams. It's like super secret study.

Multimedia Information Access - Tuesday 28th September 2010:

Information retrieval method. Indexing.

Searches that simply match a character string are slow and are not guaranteed to return relevant results. They also do not recognise words, so they cannot check for synonyms to help with your search. There are several methods for making searches faster. The one discussed in the lecture today was indexing, which is a way of representing text so that it can be searched quickly.

Indexing is a three-step process. The major steps are tokenisation, stop word removal and stemming.

Tokenisation involves splitting the document up into separate words and removing punctuation and capitalisation. This puts the words into a format so that they are
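To help me remember the tokenisation step, here is a minimal Python sketch. The function name `tokenise` and the rule "a word is a run of letters or digits" are my own assumptions, not something given in the lecture.

```python
import re

def tokenise(text):
    """Lowercase the text and split it into runs of letters and digits."""
    # Anything that is not a letter or digit (punctuation, whitespace)
    # acts as a separator, so it never appears in the output.
    return re.findall(r"[a-z0-9]+", text.lower())

print(tokenise("Indexing is FAST, isn't it?"))
```

With this simple rule an apostrophe also splits a word, so "isn't" comes out as two tokens. A real tokeniser would need a decision about cases like that.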

Stop word removal involves removing words that carry little or no meaning. The words with the highest frequency of use do not help when trying to narrow down a search, and removing them also reduces the space required to store the indexed version of the documents. Similarly, the words with the lowest frequency may not be useful either, as these may include spelling mistakes or people's names. For languages that follow the Zipf distribution this is very easy to do.
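A rough sketch of frequency-based stop word removal, assuming the documents have already been tokenised into lists of words. The function name and the `top_n` and `min_count` cut-offs are hypothetical values I picked for illustration; the lecture did not give specific thresholds.

```python
from collections import Counter

def remove_stop_words(documents, top_n=1, min_count=2):
    """Drop the top_n most frequent words and any word seen fewer than min_count times."""
    # Count every word across the whole collection.
    counts = Counter(word for doc in documents for word in doc)
    # High-frequency words: too common to narrow down a search.
    most_common = {word for word, _ in counts.most_common(top_n)}
    # Low-frequency words: possibly typos or names.
    rare = {word for word, count in counts.items() if count < min_count}
    dropped = most_common | rare
    return [[word for word in doc if word not in dropped] for doc in documents]

docs = [
    ["the", "cat", "sat", "the", "mat"],
    ["the", "cat", "ate", "the", "fish"],
    ["a", "cat", "and", "a", "fish"],
]
print(remove_stop_words(docs))
```

Here "the" goes because it is the most frequent word, and "sat", "mat", "ate" and "and" go because they only appear once.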

Stemming involves removing suffixes to take the word back to its root, so that words like book, booking, booked and books all become the same, 'book'. This helps to ensure that the search will recognise a common concept and will return results that would not have been returned had a stemming algorithm not been applied.
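To make the idea concrete, here is a deliberately naive suffix stripper. It is not a real stemming algorithm, just a toy: the suffix list and the "keep at least three letters" rule are assumptions I made up so that the book example works.

```python
SUFFIXES = ("ing", "ed", "s")  # hypothetical, deliberately tiny suffix list

def naive_stem(word):
    """Strip the first matching suffix, keeping at least three letters of stem."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for word in ["book", "booking", "booked", "books"]:
    print(word, "->", naive_stem(word))
```

All four words come out as "book". The length check stops short words like "sing" from being chopped down to nothing, but a toy like this will still get plenty of real words wrong.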

The most common stemming algorithm in use today is Porter's Stemming Algorithm. It has an implementation in C and is very fast. However, it does still make mistakes. It does not always reduce related words to the same root, for example Europe vs European. It also does not recognise proper nouns.

After these steps have been applied, the words are replaced by id codes, so every document is now represented as a list of id codes. For each id code there is a list of the documents which contain that id. This approach to searching is known as an Inverted File, because you no longer have a document with a list of words; you have a word with a list of documents.
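Here is a small sketch of how I picture the inverted file, assuming each document is already a list of processed words. The names `build_inverted_file` and `search`, and the choice to number documents and ids from 0, are my own.

```python
from collections import defaultdict

def build_inverted_file(documents):
    """Give each word an id, then map each id to the documents that contain it."""
    word_ids = {}                # word -> id code
    postings = defaultdict(set)  # id code -> set of document numbers
    for doc_number, words in enumerate(documents):
        for word in words:
            # A new word gets the next unused id; a known word keeps its id.
            word_id = word_ids.setdefault(word, len(word_ids))
            postings[word_id].add(doc_number)
    inverted_file = {word_id: sorted(docs) for word_id, docs in postings.items()}
    return word_ids, inverted_file

def search(word, word_ids, inverted_file):
    """Return the document numbers containing the word, or an empty list."""
    return inverted_file.get(word_ids.get(word), [])

docs = [["book", "flight"], ["book", "hotel"], ["hotel", "review"]]
word_ids, inverted_file = build_inverted_file(docs)
print(search("book", word_ids, inverted_file))
```

The point is that a search becomes a single lookup by id rather than a scan through every document.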

A record of the position of each word can also be kept so that you can run 'proximity type' searches.
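A sketch of what storing positions might look like, with a simple proximity check on top. I have keyed the index by the words themselves rather than id codes to keep it readable, and the function names and the `max_distance` parameter are assumptions of mine rather than anything from the lecture.

```python
from collections import defaultdict

def build_positional_index(documents):
    """Map each word to {document number: [positions of the word in that document]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_number, words in enumerate(documents):
        for position, word in enumerate(words):
            index[word][doc_number].append(position)
    return index

def within(index, first, second, max_distance):
    """Return documents where the two words occur at most max_distance words apart."""
    first_docs = index.get(first, {})
    second_docs = index.get(second, {})
    matches = []
    # Only documents containing both words can match.
    for doc_number in sorted(first_docs.keys() & second_docs.keys()):
        if any(abs(p - q) <= max_distance
               for p in first_docs[doc_number]
               for q in second_docs[doc_number]):
            matches.append(doc_number)
    return matches

docs = [
    ["cheap", "flight", "to", "paris"],
    ["paris", "has", "many", "cheap", "hotels"],
    ["cheap", "hotels"],
]
index = build_positional_index(docs)
print(within(index, "cheap", "hotels", 1))
```

Comparing every pair of positions is fine for notes like these, though it would be slow for very common words in long documents.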

That may not make much sense but it's just me writing down what is in my head so that I can remember it more clearly later.

Wednesday, 29 September 2010
