Zend Lucene Search – Part 1 – Creating an Index

From ganeshhs.com

In this article I will discuss creating an index with Zend Lucene search (Zend_Search_Lucene).

Conventionally, most site search is driven by database queries.

Consider my blog: when a visitor searches for a keyword, I have to look in the articles table and the comments table to produce results. Running SQL queries against two tables is acceptable. In an e-commerce application, though, a search may have to cover many categories and products, and since database queries are costly, this consumes far more resources. Another important point is that plain database queries give us no easy way to rank results, so we cannot show the most relevant ones first.

Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java, and it is used by many Web 2.0 websites. Zend_Search_Lucene, the Zend Framework search component written in PHP, was derived from the Apache Lucene project.
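To make the ranking point concrete, here is a toy sketch in plain PHP of the idea behind a full-text index: each word points to the documents that contain it, together with a count that can be used for ordering. This is only an illustration. The function names (toyTokenize, buildToyIndex, searchToyIndex) and the sample texts are made up for this example, and Lucene's real analysis and scoring are far more sophisticated than a raw word count.

```php
<?php
// Toy inverted index: word => (document id => number of occurrences).
// Illustration only; this is NOT how Lucene analyzes or scores text.

function toyTokenize($text)
{
    // Lowercase the text and split on anything that is not a letter.
    return preg_split('/[^a-z]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
}

function buildToyIndex(array $documents)
{
    $index = array();
    foreach ($documents as $docId => $text) {
        foreach (toyTokenize($text) as $term) {
            if (!isset($index[$term][$docId])) {
                $index[$term][$docId] = 0;
            }
            $index[$term][$docId]++; // term frequency in this document
        }
    }
    return $index;
}

function searchToyIndex(array $index, $term)
{
    $term = strtolower($term);
    $hits = isset($index[$term]) ? $index[$term] : array();
    arsort($hits);            // highest term frequency first
    return array_keys($hits); // document ids, best match first
}

// Made-up sample documents, keyed by document id.
$documents = array(
    1 => 'Google suggest: pick the right search keyword',
    2 => 'Zend Framework tutorial: an introduction',
    3 => 'Zend Auth: Zend Auth provides an API for authentication',
);

$index = buildToyIndex($documents);

// Document 3 mentions "zend" twice and document 2 once, so 3 comes first.
print_r(searchToyIndex($index, 'zend'));
```

A lookup like this touches only the entry for the searched word instead of scanning every row, and the stored counts give a simple basis for ordering the hits.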

  <?php
  // Index the blog articles
  require_once 'Zend/Search/Lucene.php';

  // Sample data; in a real site this would come from the database
  $articlesData = array(
      0 => array(
          "url"            => "http://ganeshhs.com/url-1",
          "title"          => "Google suggest : pick right search keyword",
          "contents"       => "Picking the right keywords for the websites is the success of search engine marketing. When i started search engine optimization, i used to use overture keyword selector tool and check the search counts what other users have searched.",
          "category"       => "Google",
          "postedDateTime" => "2007-12-26 12:20:00",
          "articleId"      => 1),
      1 => array(
          "url"            => "http://ganeshhs.com/url-2",
          "title"          => "zend framework tutorial | part 9 Zend Auth",
          "contents"       => "Zend Auth is easy to set up and provides a system that secures our site with an easy to use authentication mechanism. Zend Auth(Zend_Auth) provides an API for authentication.",
          "category"       => "zend-framework",
          "postedDateTime" => "2007-12-26 12:20:00",
          "articleId"      => 2));

  if (is_array($articlesData) && count($articlesData))
  {
      // Create the index at the given path
      $index = Zend_Search_Lucene::create('/var/www/lucene-data/blog-index');
      foreach ($articlesData as $articleData)
      {
          // One Lucene document per article
          $doc = new Zend_Search_Lucene_Document();
          $doc->addField(Zend_Search_Lucene_Field::Keyword('url',
              $articleData["url"]));
          $doc->addField(Zend_Search_Lucene_Field::UnIndexed('articleId',
              $articleData["articleId"]));
          $doc->addField(Zend_Search_Lucene_Field::UnIndexed('postedDateTime',
              $articleData["postedDateTime"]));
          $doc->addField(Zend_Search_Lucene_Field::Text('title',
              $articleData["title"]));
          $doc->addField(Zend_Search_Lucene_Field::UnStored('contents',
              $articleData["contents"]));
          $doc->addField(Zend_Search_Lucene_Field::Text('category',
              $articleData["category"]));
          echo "\nAdding: " . $articleData["title"] . "\n";
          $index->addDocument($doc);
      }
      // Write pending changes, then optimize the index
      $index->commit();
      $index->optimize();
  }
  ?>

  $index = Zend_Search_Lucene::create('/var/www/lucene-data/blog-index');

This line specifies the path of the Zend Lucene index where the documents will be stored.
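One thing to keep in mind: create() starts a new index at that path, so running the script again builds the index from scratch. Below is a minimal sketch for the case where you would rather reuse an existing index when there is one. The helper name openOrCreateIndex is made up for this example, and it assumes Zend Framework 1 is on the include path and that Zend_Search_Lucene::open() throws a Zend_Search_Lucene_Exception when no index exists at the path.

```php
<?php
// Assumes Zend Framework 1 is available on the include_path.
require_once 'Zend/Search/Lucene.php';

// Hypothetical helper: open the index at $path if it exists,
// otherwise create a new, empty one.
function openOrCreateIndex($path)
{
    try {
        // Reuse the existing index so earlier documents are kept.
        return Zend_Search_Lucene::open($path);
    } catch (Zend_Search_Lucene_Exception $e) {
        // No usable index at this path yet: start a new one.
        return Zend_Search_Lucene::create($path);
    }
}

// Usage with the path from this article:
// $index = openOrCreateIndex('/var/www/lucene-data/blog-index');
```

With this approach the script would also need a way to avoid adding the same article twice, which is outside the scope of this sketch.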

In each iteration, we create a document:

  $doc = new Zend_Search_Lucene_Document();

Once the document is created, we need to add the fields and their contents to it.

Since the URL is unique to each article, we index it as a Keyword field.

We may need the article ID and the posted date/time when displaying results, but they won't be used for searching, so we store them as UnIndexed fields.

The title is stored as a Text field, and so is the category.

The content/description is indexed but not stored in the index. Stored text takes up space and creates a larger index on disk, so when we need to search a field but not redisplay it, the UnStored field type is preferred.

  $doc->addField(Zend_Search_Lucene_Field::Keyword('url',
      $articleData["url"]));
  $doc->addField(Zend_Search_Lucene_Field::UnIndexed('articleId',
      $articleData["articleId"]));
  $doc->addField(Zend_Search_Lucene_Field::UnIndexed('postedDateTime',
      $articleData["postedDateTime"]));
  $doc->addField(Zend_Search_Lucene_Field::Text('title',
      $articleData["title"]));
  $doc->addField(Zend_Search_Lucene_Field::UnStored('contents',
      $articleData["contents"]));
  $doc->addField(Zend_Search_Lucene_Field::Text('category',
      $articleData["category"]));

Once the document is created and its fields are added, we add the document to the index:

  $index->addDocument($doc);

After all the iterations, we commit the index:

  $index->commit();

The following call optimizes the index:

  $index->optimize();
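To confirm that the documents actually landed in the index, the document counts can be checked after commit() and optimize(). A small sketch: summarizeIndex is a made-up helper name, and it relies only on the count() and numDocs() methods of the index object.

```php
<?php
// Hypothetical helper: one-line summary of an index.
// $index is expected to be the object returned by
// Zend_Search_Lucene::create() or Zend_Search_Lucene::open().
function summarizeIndex($index)
{
    $total = $index->count();   // all documents, including deleted ones
    $live  = $index->numDocs(); // documents that are not deleted
    return sprintf(
        '%d documents in index, %d live, %d deleted',
        $total,
        $live,
        $total - $live
    );
}

// Usage, assuming $index is the index built by the script in this article:
// echo summarizeIndex($index) . "\n";
```

For the two sample articles, and assuming nothing was deleted, both counts should come out as 2.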

Understanding Field Types –

  • Keyword fields are stored and indexed, meaning that they can be searched as well as displayed in search results. They are not split up into separate words by tokenization. Enumerated database fields usually translate well to Keyword fields in Zend_Search_Lucene.
  • UnIndexed fields are not searchable, but they are returned with search hits. Database timestamps, primary keys, file system paths, and other external identifiers are good candidates for UnIndexed fields.
  • Binary fields are not tokenized or indexed, but are stored for retrieval with search hits. They can be used to store any data encoded as a binary string, such as an image icon.
  • Text fields are stored, indexed, and tokenized. Text fields are appropriate for storing information like subjects and titles that need to be searchable as well as returned with search results.
  • UnStored fields are tokenized and indexed, but not stored in the index. Large amounts of text are best indexed using this type of field. Storing data creates a larger index on disk, so if you need to search but not redisplay the data, use an UnStored field. UnStored fields are practical when using a Zend_Search_Lucene index in combination with a relational database. You can index large data fields with UnStored fields for searching, and retrieve them from your relational database by using a separate field as an identifier.

  Field Type   Stored   Indexed   Tokenized   Binary
  Keyword      yes      yes       no          no
  UnIndexed    yes      no        no          no
  Binary       yes      no        no          yes
  Text         yes      yes       yes         no
  UnStored     no       yes       yes         no
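
The table can also be written down as a small lookup in plain PHP, which makes it easy to pick a field type from what you need (stored, indexed, tokenized). This is only an illustration: $fieldTypes and chooseFieldType() are names invented here and are not part of Zend_Search_Lucene. The keys match the factory method names on Zend_Search_Lucene_Field used earlier.

```php
<?php
// The field type table as a lookup. The flags mirror the table above.
$fieldTypes = array(
    'Keyword'   => array('stored' => true,  'indexed' => true,  'tokenized' => false, 'binary' => false),
    'UnIndexed' => array('stored' => true,  'indexed' => false, 'tokenized' => false, 'binary' => false),
    'Binary'    => array('stored' => true,  'indexed' => false, 'tokenized' => false, 'binary' => true),
    'Text'      => array('stored' => true,  'indexed' => true,  'tokenized' => true,  'binary' => false),
    'UnStored'  => array('stored' => false, 'indexed' => true,  'tokenized' => true,  'binary' => false),
);

// Hypothetical helper: return the name of the non-binary field type that
// matches the requested flags, or null if no type in the table matches.
function chooseFieldType(array $fieldTypes, $stored, $indexed, $tokenized)
{
    foreach ($fieldTypes as $name => $flags) {
        if (!$flags['binary']
            && $flags['stored'] === $stored
            && $flags['indexed'] === $indexed
            && $flags['tokenized'] === $tokenized) {
            return $name;
        }
    }
    return null;
}

// A title should be searchable word by word and shown in the results:
echo chooseFieldType($fieldTypes, true, true, true), "\n";  // Text

// A long article body should be searchable but not kept in the index:
echo chooseFieldType($fieldTypes, false, true, true), "\n"; // UnStored
```

Combinations that are not in the table, such as a field that is neither stored nor indexed, return null, which matches the intuition that such a field would serve no purpose.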
