Sphinx is an open source full text search server, designed from the ground up with performance, relevance (aka search quality), and integration simplicity in mind. It's written in C++ and works on Linux (RedHat, Ubuntu, etc.), Windows, MacOS, Solaris, FreeBSD, and a few other systems.
Sphinx lets you either batch index and search data stored in an SQL database, NoSQL storage, or just files quickly and easily, or index and search data on the fly, working with Sphinx pretty much as with a database server. A variety of text processing features enable fine-tuning Sphinx for your particular application requirements, and a number of relevance functions ensure you can tweak search quality as well. Searching via SphinxAPI is as simple as 3 lines of code, and querying via SphinxQL is even simpler, with search queries expressed in good old SQL.
Sphinx clusters scale up to tens of billions of documents and hundreds of millions of search queries per day, powering top websites such as Craigslist, Living Social, MetaCafe and Groupon. To view a complete list of known users, please visit our Powered-by page. And last but not least, it's open-sourced under GPLv2, and the community edition is free to use.
You will also need to get the PHP client API. The hardest part is finding the one that matches your Sphinx version. I ran into several problems on that point, and I strongly suggest that you take the time to pick the release that corresponds to your version: http://code.google.com/p/sphinxsearch/source/browse/#svn%2Ftags.
For Debian 7 and the Sphinx version that ships with it, you need to take this one:
#
# Sphinx configuration for MediaWiki
#
# Based on examples by Paul Grinberg at http://www.mediawiki.org/wiki/Extension:SphinxSearch
# and Hank at http://www.ralree.info/2007/9/15/fulltext-indexing-wikipedia-with-sphinx
# Modified by Svemir Brkic for http://www.newworldencyclopedia.org/
#
# Released under GNU General Public License (see http://www.fsf.org/licenses/gpl.html)
#
# Latest version available at http://www.mediawiki.org/wiki/Extension:SphinxSearch

# data source definition for the main index
source src_wiki_main
{
    # data source
    type            = mysql
    sql_host        = localhost
    sql_db          = wikidb
    sql_user        = user
    sql_pass        = password
    # these two are optional
    #sql_port = 3306
    #sql_sock = /var/lib/mysql/mysql.sock

    # pre-query, executed before the main fetch query
    sql_query_pre   = SET NAMES utf8

    # main document fetch query - change the table names if you are using a prefix
    sql_query       = SELECT page_id, page_title, page_namespace, page_is_redirect, old_id, old_text FROM wiki_page, wiki_revision, wiki_text WHERE rev_id=page_latest AND old_id=rev_text_id

    # attribute columns
    sql_attr_uint   = page_namespace
    sql_attr_uint   = page_is_redirect
    sql_attr_uint   = old_id

    # collect all category ids for category filtering
    sql_attr_multi  = uint category from query; SELECT cl_from, page_id AS category FROM wiki_categorylinks, wiki_page WHERE page_title=cl_to AND page_namespace=14

    # used by command-line search utility to display document information
    sql_query_info  = SELECT page_title, page_namespace FROM wiki_page WHERE page_id=$id
}

# data source definition for the incremental index
source src_wiki_incremental : src_wiki_main
{
    # adjust this query based on the time you run the full index
    # in this case, full index runs at 7 AM UTC
    sql_query       = SELECT page_id, page_title, page_namespace, page_is_redirect, old_id, old_text FROM wiki_page, wiki_revision, wiki_text WHERE rev_id=page_latest AND old_id=rev_text_id AND page_touched>=DATE_FORMAT(CURDATE(), '%Y%m%d070000')

    # all other parameters are copied from the parent source
}

# main index definition
index wiki_main
{
    # which document source to index
    source          = src_wiki_main

    # this is path and index file name without extension
    # you may need to change this path or create this folder
    path            = /var/lib/sphinxsearch/data/wiki_main

    # docinfo (ie. per-document attribute values) storage strategy
    docinfo         = extern

    # morphology (comment it if your wiki is not full english)
    # morphology = stem_en

    # stopwords file
    #stopwords = /var/data/sphinx/stopwords.txt

    # minimum word length
    min_word_len    = 1

    # allow wildcard (*) searches
    min_infix_len   = 1
    enable_star     = 1

    # charset encoding type
    charset_type    = utf-8

    # charset definition and case folding rules "table"
    charset_table   = 0..9, A..Z->a..z, a..z, \
        U+C0->a, U+C1->a, U+C2->a, U+C3->a, U+C4->a, U+C5->a, U+C6->a, \
        U+C7->c, U+E7->c, U+C8->e, U+C9->e, U+CA->e, U+CB->e, U+CC->i, \
        U+CD->i, U+CE->i, U+CF->i, U+D0->d, U+D1->n, U+D2->o, U+D3->o, \
        U+D4->o, U+D5->o, U+D6->o, U+D8->o, U+D9->u, U+DA->u, U+DB->u, \
        U+DC->u, U+DD->y, U+DE->t, U+DF->s, \
        U+E0->a, U+E1->a, U+E2->a, U+E3->a, U+E4->a, U+E5->a, U+E6->a, \
        U+E7->c, U+E7->c, U+E8->e, U+E9->e, U+EA->e, U+EB->e, U+EC->i, \
        U+ED->i, U+EE->i, U+EF->i, U+F0->d, U+F1->n, U+F2->o, U+F3->o, \
        U+F4->o, U+F5->o, U+F6->o, U+F8->o, U+F9->u, U+FA->u, U+FB->u, \
        U+FC->u, U+FD->y, U+FE->t, U+FF->s,
}

# incremental index definition
index wiki_incremental : wiki_main
{
    path            = /var/lib/sphinxsearch/data/wiki_incremental
    source          = src_wiki_incremental
}

# indexer settings
indexer
{
    # memory limit (default is 32M)
    mem_limit       = 64M
}

# searchd settings
searchd
{
    # IP address and port on which search daemon will bind and accept
    listen          = 127.0.0.1:9312

    # searchd run info is logged here - create or change the folder
    log             = /var/log/sphinxsearch/searchd.log

    # all the search queries are logged here
    query_log       = /var/log/sphinxsearch/query.log

    # client read timeout, seconds
    read_timeout    = 5

    # maximum amount of children to fork
    max_children    = 30

    # a file which will contain searchd process ID
    pid_file        = /var/run/sphinxsearch/searchd.pid

    # maximum amount of matches this daemon would ever retrieve
    # from each index and serve to client
    max_matches     = 1000

    # Remove warning of deprecated function
    compat_sphinxql_magics = 0
}

# --eof--
Other things you need to know:
Adapt the table names in the SQL queries if you use a prefix (such as 'wiki_' here)
I've also modified all paths to match Debian's
All highlighted lines are important; I've added a comment on each one that needed additional information
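Changing the table prefix by hand is easy to get wrong, because 'wiki_' also appears in the source and index names (src_wiki_main, wiki_main, wiki_incremental), and those should stay as they are. Here is a minimal sketch, assuming GNU sed (the one Debian ships) and a helper name, `retarget_prefix`, that is only made up for this illustration. It rewrites the four table names used in the queries of the configuration and nothing else:

```shell
# Hypothetical helper (not part of Sphinx or MediaWiki): read a config on
# standard input and replace the 'wiki_' table prefix with the one given as
# first argument. Only the four tables used by the queries are touched, so
# names such as wiki_main or src_wiki_main are left alone.
# Assumes GNU sed, for the \b word-boundary extension.
retarget_prefix() {
    new_prefix=$1
    sed -E "s/\\bwiki_(page|revision|text|categorylinks)\\b/${new_prefix}\\1/g"
}
```

For example, `retarget_prefix mw_ < sphinx.conf > sphinx.conf.new` would write a copy for a wiki whose tables start with 'mw_'. The prefix is pasted into a sed expression as-is, so it must not contain '/' or '&'.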
# Settings for the sphinxsearch searchd daemon
# Please read /usr/share/doc/sphinxsearch/README.Debian for details.
#
# Should sphinxsearch run automatically on startup? (default: no)
# Before doing this you might want to modify /etc/sphinxsearch/sphinx.conf
# so that it works for you.
START=yes
Indexation
Index
Once your application is configured, you need to build the indexes a first time. To prepare Sphinx to search:
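A minimal sketch of that first build, assuming the Debian layout where the configuration lives in /etc/sphinxsearch/sphinx.conf and using the standard `--config` and `--all` options of indexer. The `first_index` wrapper and its `DRY_RUN` switch are only there for illustration, so that the command can be checked before it is run:

```shell
# Build every index declared in the config (wiki_main and wiki_incremental).
# Usage: first_index [path-to-sphinx.conf]
# Set DRY_RUN=1 to print the command instead of running it.
# If searchd is already running and serving these indexes, add --rotate
# so that it picks up the new files.
first_index() {
    conf=${1:-/etc/sphinxsearch/sphinx.conf}
    set -- indexer --config "$conf" --all
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}
```

Run `first_index` as a user that can write to /var/lib/sphinxsearch/data, then start the daemon (on Debian 7 that should be `service sphinxsearch start`, but check the name of your init script).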
There are several ways to test your index, but be aware that the search binary has bugs. If it crashes, that does not necessarily mean something is wrong with your setup. Anyway, here is how to test:
As you can see, we have results here :-). The word "test" has been found 20 times.
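The test described above can be sketched like this, assuming the command-line `search` tool from the same package with its `--config` and `--index` options. The `test_search` wrapper and `DRY_RUN` are illustrative names, and the output you get depends on your own wiki content:

```shell
# Query one index with the command-line search tool.
# Usage: test_search <index> <word>...
# Set DRY_RUN=1 to print the command instead of running it.
test_search() {
    index=$1
    shift
    set -- search --config /etc/sphinxsearch/sphinx.conf --index "$index" "$@"
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}
```

Calling `test_search wiki_main test` looks up the word "test" in the main index; a crash here is not proof of a broken index, as noted above.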
Incremental updates
We need to set up the incremental updates. Change the schedule to a shorter interval if you need more frequent indexing. For my own usage, once an hour is more than enough. I've added the MediaWiki entry here (/etc/cron.d/sphinxsearch):
# Rebuild all indexes daily and notify searchd.
@daily root . /etc/default/sphinxsearch && if [ "$START" = "yes" ] && [ -x /usr/bin/indexer ]; then /usr/bin/indexer --quiet --rotate --all >/dev/null 2>&1; fi

# Example for rotating only specific indexes (usually these would be part of
# a larger combined index).
# */5 * * * * root [ -x /usr/bin/indexer ] && /usr/bin/indexer --quiet --rotate postdelta threaddelta >/dev/null 2>&1

# Mediawiki
0 */1 * * * root [ -x /usr/bin/indexer ] && indexer wiki_incremental --quiet --rotate >/dev/null 2>&1
Debug
What if you don't see any results, or you want to be sure that Sphinx receives the search requests? There is a console mode:
I see my test search here :-). All is good! If you don't see anything, the problem is probably the application API or a configuration mismatch. You can also check with tcpdump whether network connections arrive on port 9312.
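Besides the console mode, the query_log file set in the configuration gives another way to confirm that requests reach the daemon. A small sketch, where `last_queries` is a made-up helper name and the default path is the query_log value from the configuration; the console and tcpdump commands in the closing comment use standard options but are meant to be run by hand:

```shell
# Query log path, as set by query_log in sphinx.conf.
QUERY_LOG=${QUERY_LOG:-/var/log/sphinxsearch/query.log}

# Print the most recent logged queries that contain a given word.
# Usage: last_queries <word> [count]   (count defaults to 5)
# No output after a search from the wiki suggests the request
# never reached searchd.
last_queries() {
    word=$1
    grep -F -- "$word" "$QUERY_LOG" | tail -n "${2:-5}"
}

# To run by hand as root (stop the service first so the port is free):
#   searchd --console --config /etc/sphinxsearch/sphinx.conf
#   tcpdump -n -i lo tcp port 9312
```

The tcpdump line listens on the loopback interface because the configuration binds searchd to 127.0.0.1; use another interface if your wiki runs on a different host.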
FAQ
I don't see any search result on MediaWiki, why?
You most likely have a problem with your PHP API. Select another version, one that matches your Sphinx version. Also check the Debug section, which will help you see what's wrong.
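To pick the matching PHP API you first need the exact version of the installed Sphinx. A sketch, assuming the Sphinx tools print a banner whose first line looks like "Sphinx X.Y.Z-release (...)"; that format is an assumption to verify on your own system, and `sphinx_version` is a name chosen for this example:

```shell
# Extract "X.Y.Z" from a Sphinx banner read on standard input, for
# example from a first line such as "Sphinx 2.0.4-release (r3135)".
# Prints nothing if the first line does not look like a banner.
sphinx_version() {
    sed -n '1s/^Sphinx \([0-9][0-9.]*\).*/\1/p'
}
```

Something like `searchd --help 2>&1 | sphinx_version` should then give the number to look for among the tags at the URL mentioned earlier. On Debian, `dpkg -s sphinxsearch` shows the package version as a cross-check.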