This is a SilverStripe 3 compatible fork of the SilverStripe Lucene plugin hosted at Google Code.
This plugin for the SilverStripe framework allows you to harness the power of
the Lucene search engine on your site.
Using a variety of tools, you can also search PDF, Word, Excel, PowerPoint and
plain text files.
It is easy to set up and use.
This plugin uses Zend_Search_Lucene from Zend, StandardAnalyzer by Kenny
Katzgrau, and pdf-to-text by Joeri Stegeman for PDF scanning.
Zend_Search_Lucene is a PHP port of the Apache project's Lucene search engine.
This extension is inspired by the wpSearch plugin for WordPress.
http://codefury.net/projects/wpSearch/
Graeme Smith
<gs78 (at) me (dot) com>
####To Do:
Tests not working - Status column and Temporary Tables fault
Darren Inwood
<darren (dot) inwood (at) chrometoaster (dot) com>
SilverStripe 3.0 or newer
'Queued Jobs' module for SilverStripe 3.0 or newer - see: https://github.com/nyeholt/silverstripe-queuedjobs
This module is currently only tested on LAMP - Windows and Mac OS X should work,
but are untested.
http://code.google.com/p/lucene-silverstripe-plugin/
There is also phpdoc generated documentation in the docs directory.
INSTALLATION
Check out the archive into the root directory of your project. This should be
the same folder as the 'framework' directory.
Via Git:
git submodule add https://github.com/Instagraeme/silverstripe-lucene lucene
This will create a directory called 'lucene' containing the plugin files.
You will need to have the 'Queued Jobs' module installed in order to use Lucene:
Via Git:
git submodule add https://github.com/nyeholt/silverstripe-queuedjobs queuedjobs
To get queued jobs to run, you also need to add $_FILE_TO_URL_MAPPING to your
_ss_environment.php file as described in the SilverStripe docs:
http://doc.silverstripe.org/sapphire/en/topics/commandline
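As a rough sketch of what that mapping looks like, the fragment below maps a project folder on disk to the site's base URL so that command-line runs can work out which URL they belong to. The path and URL are placeholders, not values from this module; substitute your own.

```php
<?php
// _ss_environment.php (sketch only)
// Both the filesystem path and the URL below are placeholder values.
global $_FILE_TO_URL_MAPPING;
$_FILE_TO_URL_MAPPING['/var/www/mysite'] = 'http://www.example.com';
```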
Run /dev/build?flush=1 to tell SilverStripe about your new module, and your
new search engine is installed! (You still need to enable it - see below.)
To enable PDF scanning using the pdftotext utility on Linux, ensure that the
command-line utility is installed. If you are using Debian or Ubuntu, either
of the poppler-utils or xpdf-utils packages will provide this utility:
apt-get install poppler-utils
If you are on another Linux, Mac OS X, or Windows, the Xpdf program includes
pdftotext:
If you do not have the pdftotext utility installed, Lucene will use the
PHP-based PDF2Text class by Joeri Stegeman instead. However, this class is
more limited than pdftotext.
Word, Excel and PowerPoint scanning all require the 'zip' PHP module to be
installed. If you don't have it, newer docx, xlsx and pptx documents won't be
scanned.
To get scanning of older doc, xls and ppt documents working, you need to install
the catdoc command-line utility. There are Windows and Mac OS X ports also.
http://wagner.pp.ru/~vitus/software/catdoc/
http://blog.brush.co.nz/2009/09/catdoc-windows/
http://catdoc.darwinports.com/
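If you are unsure which of these tools your server has, a small stand-alone script can report on them. This is a minimal sketch and not part of the module: tool_available() is a hypothetical helper, it assumes a Unix-like shell and that shell_exec() is not disabled, and it only checks that the tools exist, not that the module will find them.

```php
<?php
// Hypothetical helper, not part of the module: reports whether a
// command-line tool can be found on the PATH (Unix-like systems only).
function tool_available($name) {
    $path = shell_exec('command -v ' . escapeshellarg($name) . ' 2>/dev/null');
    return is_string($path) && trim($path) !== '';
}

$report = array(
    'pdftotext' => tool_available('pdftotext'), // PDF scanning
    'catdoc'    => tool_available('catdoc'),    // older doc, xls and ppt files
    'zip'       => extension_loaded('zip'),     // docx, xlsx and pptx files
);

foreach ($report as $name => $ok) {
    echo str_pad($name, 10) . ($ok ? 'found' : 'missing') . "\n";
}
```

Run it from the command line with the same PHP binary your site uses, since the web server and the CLI can have different extensions loaded.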
If you just want to get up and running as quickly as possible with your Lucene
search engine, install it as per above, and then add the following line to your
project's _config.php file:
ZendSearchLuceneSearchable::enable();
If you're using the Black Candy theme, or another theme that supports the
standard SilverStripe Fulltext Search, your search will now run using Lucene,
indexing all Pages and indexable Files (PDF, Word, Excel, PowerPoint and HTML).
To get the most out of your new search engine, continue reading.
ENABLING THE SEARCH ENGINE
By default, the Lucene Search engine is not enabled. To enable it, you need to
add the following into your _config.php file:
ZendSearchLuceneSearchable::enable();
This will configure all SiteTree and File objects by adding the
'ZendSearchLuceneSearchable' extension to those classes. The following fields
will be indexed whenever an object of this class is written to the database:
'SiteTree' => 'Title,MenuTitle,Content,MetaTitle,MetaDescription,MetaKeywords',
'File' => 'Filename,Title,Content'
After enabling the search engine, you will need to build the index for the first
time. There is a new button marked 'Rebuild search index' on the SiteConfig
page, which is the page at the top of the left-hand column, named after the site.
Clicking it adds a new job to the 'Jobs' list, which shows how far the
reindexing of your site has progressed.
If you just want to get Lucene up and running as quickly as possible, you can
skip down to the 'Usage Overview' section below - that's all the configuration
you need to do!
INDEXING CLASSES
If you wish to enable the search engine, but not automatically add the extension
to SiteTree and/or File, pass in an array containing the classes to index:
(This only accepts SiteTree and File; see below for indexing other classes.)
// Use one of these lines to control which classes to extend
ZendSearchLuceneSearchable::enable(array('SiteTree', 'File'));
ZendSearchLuceneSearchable::enable(array('SiteTree'));
ZendSearchLuceneSearchable::enable(array('File'));
// Do not automatically add the extension to any classes
ZendSearchLuceneSearchable::enable(array());
In order to index classes other than the defaults, you need to add the
ZendSearchLuceneSearchable extension with a list of which fields to index.
For instance, to index your custom Page class, which has custom Summary and
Intro fields added:
Object::add_extension(
    'Page',
    "ZendSearchLuceneSearchable('"
        ."Title,MenuTitle,MetaTitle,MetaDescription,MetaKeywords,"
        ."Summary,Intro,Content')"
);
You can also index custom functions that return strings. If your indexed object
has a method called 'getFoo()' that returns a string representing some special
state you want to index, adding 'getFoo' into the field list will index this
state.
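To make the getter case concrete, here is a sketch under stated assumptions: getReadingLevel() is a made-up method, and the code presumes a SilverStripe 3 project with this module installed, so it will not run on its own.

```php
<?php
// Sketch for mysite/code/Page.php; assumes SilverStripe 3 and this module.
class Page extends SiteTree {
    // Hypothetical getter: the string it returns is what gets indexed.
    public function getReadingLevel() {
        $words = str_word_count(strip_tags($this->Content));
        return $words > 500 ? 'long read' : 'short read';
    }
}

// In _config.php: list the getter by name alongside the ordinary fields.
Object::add_extension(
    'Page',
    "ZendSearchLuceneSearchable('Title,Content,getReadingLevel')"
);
```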
There are four types of indexing used in Lucene:
Keyword - Data that is searchable and stored in the index, but not broken up
into tokens for indexing. This is useful for being able to search on non-textual
data such as IDs or URLs.
UnIndexed - Data that isn't available for searching, but is stored with our
document (eg. article teaser, article URL and timestamp of creation)
UnStored - Data that is available for search, but isn't stored in the index
in full (eg. the document content)
Text - Data that is available for search and is stored in full (eg. title and
author)
The MenuTitle, MetaTitle, MetaDescription and MetaKeywords fields will be
indexed as UnStored.
LastEdited and Created fields will be UnIndexed.
ID and ClassName fields will be indexed as Keyword types.
All other fields will be indexed as Text.
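The default rules above can be restated as a small lookup. This is only an illustration of the rules as described here, not the module's own code; default_index_type() is a hypothetical name.

```php
<?php
// Illustration only: restates the documented defaults as a lookup.
// default_index_type() is a hypothetical helper, not a module function.
function default_index_type($fieldName) {
    $unstored  = array('MenuTitle', 'MetaTitle', 'MetaDescription', 'MetaKeywords');
    $unindexed = array('LastEdited', 'Created');
    $keyword   = array('ID', 'ClassName');

    if (in_array($fieldName, $unstored, true)) {
        return 'unstored';
    }
    if (in_array($fieldName, $unindexed, true)) {
        return 'unindexed';
    }
    if (in_array($fieldName, $keyword, true)) {
        return 'keyword';
    }
    // Everything else, including Title and Content, defaults to Text.
    return 'text';
}

foreach (array('Title', 'MetaKeywords', 'Created', 'ID') as $field) {
    echo $field . ' => ' . default_index_type($field) . "\n";
}
```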
INDEXING RELATIONS
You can index has_one, has_many and many_many relations, using dot notation to
indicate the fields to read on the related object.
If we have a has_one relation between Page and our custom class Foo, and Foo
has a text field called Bar, we can index it by adding Foo.Bar into the field
list when we add the extension to the Page type:
Object::add_extension(
    'Page',
    "ZendSearchLuceneSearchable('"
        ."Title,MenuTitle,MetaTitle,MetaDescription,MetaKeywords,"
        ."Content,Foo.Bar')"
);
You can nest relations several layers deep if necessary, eg.
Foo.Bar.Baz.Buz - remember that the names used are the names of the relation
fields, NOT the names of the classes being indexed.
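The difference between relation names and class names is easiest to see when the two differ. In this sketch the Author class, the Writer relation and the Biography field are all made up for illustration, and the code assumes a SilverStripe 3 project with this module installed.

```php
<?php
// Sketch only; every class, relation and field name here is hypothetical.
// Written with public statics as in SilverStripe 3.0; later 3.x releases
// prefer private statics.
class Author extends DataObject {
    public static $db = array('Biography' => 'Text');
}

class Page extends SiteTree {
    // The relation is called "Writer"; the related class is "Author".
    public static $has_one = array('Writer' => 'Author');
}

// In _config.php: use the relation name (Writer), not the class name (Author).
Object::add_extension(
    'Page',
    "ZendSearchLuceneSearchable('Title,Content,Writer.Biography')"
);
```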
INDEXING FILES
When indexing 'File' DataObjects, this module will detect the file type using
the file extension. Detected types are .txt, .xls, .doc, .ppt, .xlsx, .docx,
.htm, .html, .pptx, and .pdf.
See the 'Installation' section above for details on getting file scanning
working for various file types.
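Detection by file extension can be pictured as follows. This is a sketch rather than the module's code: indexable_extension() is a hypothetical helper, and treating the extension case-insensitively is an assumption.

```php
<?php
// Illustration only: indexable_extension() is a hypothetical helper.
// Returns the lower-cased extension if it is one of the listed types,
// or null otherwise. Case-insensitive matching is an assumption.
function indexable_extension($filename) {
    $supported = array(
        'txt', 'xls', 'doc', 'ppt', 'xlsx', 'docx',
        'htm', 'html', 'pptx', 'pdf',
    );
    $extension = strtolower(pathinfo($filename, PATHINFO_EXTENSION));
    return in_array($extension, $supported, true) ? $extension : null;
}

var_dump(indexable_extension('assets/Annual-Report.PDF'));
var_dump(indexable_extension('assets/photo.jpg'));
```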
ADVANCED FIELD-LEVEL INDEXING OPTIONS
You can get more fine-grained control over how your classes are indexed by
adding the ZendSearchLuceneSearchable extension with a JSON-encoded object as
the argument.
Your object should be arranged as key-value pairs, the key being the name of the
property, method or relation you wish to index, and the value being another
object containing key-value pairs indicating the options for that field.
Object::add_extension(
    'Page',
    "ZendSearchLuceneSearchable('
        {
            \"Title\" : true,
            \"CreatedDate\" : {
                \"name\" : \"Title\",
                \"type\" : \"text\",
                \"content_filter\" : \"strtotime\"
            },
            \"Intro\" : true,
            \"Content\" : {
                \"name\" : \"Content\",
                \"type\" : \"unstored\"
            },
            \"Foo.Bar\" : {
                \"name\" : \"Baz\"
            },
            \"Images\" : {
                \"content_filter\" : [\"HelperClass\",\"countImages\"]
            }
        }
    ')"
);
Any omitted config options will use the defaults. Available config options for
each field are:
name
The name to store this as in the document. Default is the same as
the field name. The field name of 'ID' is a special case - this should always
use a name of 'ObjectID', as this is used internally.
type
The type of indexing to use. Default is "text", legal options are "text",
"keyword", "unstored" and "unindexed".
content_filter
A callback used to transform the field value before it is indexed. The
callback is called with one argument, the field value as a string, and should
return the transformed field value, also as a string. This could be useful
for eg. turning date strings into unix timestamps prior to indexing. A value
of false indicates that there should be no content filtering, which is the
default.
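As an example of a content_filter callback, the sketch below turns a date string into a unix timestamp. DateFilters and toTimestamp are hypothetical names. PHP's strtotime() returns an integer (or false on failure), so the wrapper casts the result to keep to the string-in, string-out contract described above.

```php
<?php
// Hypothetical helper class for use as a content_filter callback.
class DateFilters {
    // Takes the field value as a string and returns a string, as the
    // content_filter contract requires. Unparseable dates become ''.
    public static function toTimestamp($value) {
        $timestamp = strtotime($value);
        return $timestamp === false ? '' : (string) $timestamp;
    }
}

echo DateFilters::toTimestamp('2012-06-01 12:00:00 UTC'), "\n";
```

In the field options this would be referenced the same way as the HelperClass example, ie. as the two-element callback ["DateFilters","toTimestamp"].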
ADVANCED CLASS-LEVEL INDEXING OPTIONS
You can also provide a second JSON-encoded argument when initialising a class
using Object::add_extension. This should contain key-value pairs indicating
your class-level configuration.
Object::add_extension(
    'Foo',
    "ZendSearchLuceneSearchable('Foo,Far,Faz','
        {
            \"index_filter\" : \"\\\"ID\\\" IN ( SELECT \\\"ID\\\" FROM \\\"Foo\\\" LEFT JOIN \\\"Other\\\" ON \\\"Foo\\\".\\\"ID\\\" = \\\"Other\\\".\\\"FooID\\\" WHERE \\\"Other\\\".\\\"FooID\\\" IS NOT NULL )\"
        }
    ')"
);
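Currently there is only one configuration option, index_filter. In the examples
in this section it holds an SQL condition that restricts which objects of the
class are indexed.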
Note that the config can get a bit messy with all the nested escaped quotes.
You may prefer to create PHP objects, json encode them and insert them that way:
$fields = array(
    'Foo' => array(
        'name' => 'Foo',
    ),
    'Bar' => array(
        'name' => 'Bar',
        'type' => 'unstored',
        'content_filter' => array('HelperClass','filterFunction')
    )
);
$class = array(
    'index_filter' => '
        "ID" IN (
            SELECT "ID"
            FROM "Foo"
            LEFT JOIN "Other"
            ON "Foo"."ID" = "Other"."FooID"
            WHERE "Other"."FooID" IS NOT NULL
        )'
);
Object::add_extension(
    'Foo',
    "ZendSearchLuceneSearchable('".json_encode($fields)."', '".json_encode($class)."')"
);
REBUILDING THE SEARCH INDEX
The search index is rebuilt on every /dev/build. In case you want to disable
this, for example if your site is quite large and rebuilding the search index
takes a while, you can add the following to your _config.php:
ZendSearchLuceneSearchable::$reindexOnDevBuild = false;
To manually rebuild the search index, go to the SiteConfig page (at the very
top of the LHS site tree in the CMS, with the world icon) and there will be a
'Rebuild Search Index' button at the bottom of the page. Clicking this button
will start a Queued Job, which deletes the current index, scans the site for all
content which should be indexed, and reindexes everything.
You can view reindex progress on the 'Jobs' tab, at the top of the CMS. It will
display when the job was started, how long it has run for, how many items there
are to be indexed, and how many have been indexed so far. If there are any
errors, these will also show up here.
PAGINATION
There are some pagination settings that allow you to control the pagination
functions: (Put these in your _config.php to change them)
// Number of results to show on each page
ZendSearchLuceneSearchable::$pageLength = 10;
// Maximum number of pages to show in the pagination
ZendSearchLuceneSearchable::$maxShowPages = 10;
// Always show this number of pages at the start of the pagination
ZendSearchLuceneSearchable::$alwaysShowPages = 3;
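To see how the page length relates to the numbers shown on a results page, here is a sketch of the arithmetic. It is one plausible reading, not the module's implementation; pagination_summary() is a hypothetical helper, and its array keys simply mirror the results-page template tokens.

```php
<?php
// Illustration only: pagination_summary() is a hypothetical helper.
function pagination_summary($totalResults, $page, $pageLength = 10) {
    // At least one page, even when there are no results.
    $totalPages = max(1, (int) ceil($totalResults / $pageLength));
    // Clamp the requested page into the valid range.
    $page = max(1, min($page, $totalPages));

    return array(
        'TotalResults' => $totalResults,
        'TotalPages'   => $totalPages,
        'ThisPage'     => $page,
        'StartResult'  => $totalResults > 0 ? ($page - 1) * $pageLength + 1 : 0,
        'EndResult'    => min($page * $pageLength, $totalResults),
    );
}

print_r(pagination_summary(25, 3, 10));
```

With 25 hits and a page length of 10, the third page covers results 21 to 25.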
INDEX DIRECTORY
You can also set where to store the index:
// These are the defaults.
ZendSearchLuceneSearchable::$cacheDirectory = TEMP_FOLDER;
ZendSearchLuceneWrapper::$indexName = 'Silverstripe';
With the default settings, the index will be created in the SilverStripe temp
folder, and will be called 'Silverstripe'.
ADVANCED INDEX CONFIGURATION
You can use advanced configuration functions directly on the index:
$index = ZendSearchLuceneWrapper::getIndex();
// Retrieving index size
$indexSize = $index->count();
$documents = $index->numDocs();
// Index optimisation
$index->optimize();
You can also specify operations to be run on newly created indexes using
ZendSearchLuceneWrapper::addCreateIndexCallback(). On creation, any callbacks
registered using this function are run. This allows you to set up any
optimisation options you require on your index. The Zend defaults are used if
no callbacks are registered.
To use a callback, you can put something like this in your _config.php:
function create_index_callback() {
    $index = ZendSearchLuceneWrapper::getIndex();
    $index->setMaxBufferedDocs(20);
}
ZendSearchLuceneWrapper::addCreateIndexCallback('create_index_callback');
USAGE OVERVIEW
Once you have configured and enabled the plugin, you can add a new token into
your template files to output the search form:
$ZendSearchLuceneForm
This will post to the action ZendSearchLuceneResults, which will display the
Search Results page.
This module will also take over the $SearchForm token - this is for convenience,
to get users up and running quickly using the out-of-the-box themes. If you're
planning on customising the form markup, use $ZendSearchLuceneForm instead.
CUSTOM SEARCH FORM
To customise your search form, override this method (or create a new one) and
output a Form object containing a field called 'Search' and an action of
ZendSearchLuceneResults.
/* Custom search form */
class Your_Controller extends Page_Controller {
    // . . .
    function ZendSearchLuceneForm() {
        $form = parent::ZendSearchLuceneForm();
        // Customise the form
        return $form;
    }
}
If you are using $ZendSearchLuceneForm in your templates, you can create a
custom template for the search form called ZendSearchLuceneForm.ss - it can go
in either your root template folder, or in your Includes/ folder. Copying
sapphire/templates/SearchForm.ss is a good starting point.
CUSTOM SEARCH RESULTS PAGE
In the templates/Layout folder of the plugin, you will find the
Lucene_results.ss file. Copy this file into your own theme's Layout folder, and
alter to your heart's content.
Available templating tokens in this file are:
$Query - The string that was searched for
$TotalResults - Total number of hits for the search
$TotalPages - Total number of pages for the query
$ThisPage - The page number currently being viewed
$StartResult - The number of the first result on this page
$EndResult - The number of the last result on this page
$PrevUrl - URL to the previous page of search results
$NextUrl - URL to the next page of results
<% control Results %>
$score (relevance rating assigned by the search engine)
$Number (which number in the set this result is)
$Link (URL to this resource)
You can also use any fields that have been indexed, eg. $Content
<% end_control %>
<% control SearchPages %>
$IsEllipsis (whether this entry is a blank ellipsis to indicate more pages)
$PageNumber
$Link (URL to this page of search results)
$Current (Boolean indicating whether this is the current page)
<% end_control %>
A useful extra function is the SearchTextHighlight string modifier. If you use
eg. $Content.SearchTextHighlight in your template, this will output an HTML
paragraph containing 25 words surrounding your search terms, with the search
terms highlighted with tags.
This modifier takes one optional argument, the number of words to display. So
to display a 50 word summary you would use:
$Content.SearchTextHighlight(50)
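The idea behind this kind of highlighting can be sketched in plain PHP. The function below is not the module's implementation: highlight_excerpt() is a hypothetical name, it handles a single search term only, and the strong tag is an arbitrary stand-in for whatever markup the module emits.

```php
<?php
// Illustration only; NOT the module's SearchTextHighlight implementation.
// Returns an HTML paragraph of up to $wordCount words around the first
// occurrence of $term, with each occurrence wrapped in <strong>.
function highlight_excerpt($text, $term, $wordCount = 25) {
    $words = preg_split('/\s+/', trim(strip_tags($text)), -1, PREG_SPLIT_NO_EMPTY);

    // Find the first word containing the term (case-insensitive).
    $hit = 0;
    if ($term !== '') {
        foreach ($words as $i => $word) {
            if (stripos($word, $term) !== false) {
                $hit = $i;
                break;
            }
        }
    }

    // Centre the window on the hit, clamped to the bounds of the text.
    $half = (int) floor($wordCount / 2);
    $start = max(0, min($hit - $half, count($words) - $wordCount));
    $excerpt = htmlspecialchars(implode(' ', array_slice($words, $start, $wordCount)));

    if ($term === '') {
        return '<p>' . $excerpt . '</p>';
    }
    $pattern = '/' . preg_quote(htmlspecialchars($term), '/') . '/i';
    return '<p>' . preg_replace($pattern, '<strong>$0</strong>', $excerpt) . '</p>';
}

echo highlight_excerpt('The quick brown fox jumps', 'fox', 3), "\n";
```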
CUSTOMISE SEARCH FUNCTION
Lucene is a very powerful search engine, and you can do a lot with it. If you
have a more advanced search function you want to implement, you can build your
own form and submit it to your own action. Check the Zend docs on building
queries for how to build the query you want from the form fields you've
received.
http://zendframework.com/manual/en/zend.search.lucene.searching.html
class Your_Controller extends Page_Controller {
    /**
     * Use $AdvancedSearchForm in your template to output this form.
     */
    function AdvancedSearchForm() {
        // FieldList replaces the older FieldSet class in SilverStripe 3
        $fields = new FieldList(
            new TextField('Query','First search query'),
            new TextField('Subquery', 'Second search query')
        );
        $actions = new FieldList(
            new FormAction('AdvancedSearchResults', 'Search')
        );
        // Inside a controller subclass, the controller is $this
        $form = new Form($this, 'AdvancedSearchForm', $fields, $actions);
        $form->disableSecurityToken();
        return $form;
    }
    /**
     * Processes the search form
     */
    function AdvancedSearchResults($data, $form, $request) {
        // Assemble your custom query
        $query = Zend_Search_Lucene_Search_QueryParser::parse(
            $form->dataFieldByName('Query')->dataValue()
        );
        $subquery = Zend_Search_Lucene_Search_QueryParser::parse(
            $form->dataFieldByName('Subquery')->dataValue()
        );
        $search = new Zend_Search_Lucene_Search_Query_Boolean();
        // true marks a subquery as required, false marks it as prohibited
        $search->addSubquery($query, true);
        $search->addSubquery($subquery, false);
        // Get hits from the Lucene search engine.
        $hits = ZendSearchLuceneWrapper::find($search);
        // Convert these into a data array containing pagination info etc
        $data = $this->getDataArrayFromHits($hits, $request);
        // Display the results page
        return $this->customise($data)->renderWith(array('Advanced_results', 'Page'));
    }
}
wpSearch plugin for WordPress
http://codefury.net/projects/wpSearch/
Zend_Search_Lucene documentation
http://zendframework.com/manual/en/zend.search.lucene.html
Queued Jobs module
http://www.silverstripe.org/queued-jobs-module/
Xpdf (pdftotext PDF text extraction utility)
http://www.foolabs.com/xpdf/
catdoc (MS Office text extraction utility)
http://wagner.pp.ru/~vitus/software/catdoc/
http://blog.brush.co.nz/2009/09/catdoc-windows/
http://catdoc.darwinports.com/