Typo3 search solutions
The following short article attempts to outline a few of the search solutions for Typo3 that I am currently familiar with. They fit more or less into two groups: RDBMS "full-text" extensions (to MySQL) and outboard extensions using a full-text library (Lucene, IB). To my knowledge only IB4Typo3 supports breadcrumbs in the results.
"Famous" indexed_search extension
This is the typical standard search installed on most Typo3 installations. It's quite easy to install and there are a handful of extensions that, in turn, build upon it to provide some slight enhancements to usability. It's a mature product and at this point well understood.
It is, however, officially considered suspect: see Known problems:
- Currently the extension is under observation because instances of heavy server load/unstability has been reported. It is not yet clear if THIS extension has anything to do with. So it's only under suspicion at this point until further data has been collected. But for now it is adviced to be careful with the application of the extension for mission critical, high-load environments.
- It's still uncertain how performance is under heavy load conditions and when MANY pages are indexed. Currently benchmarks has been done only up to 2000 pages indexed/approx. 400.000 relation records. It is probably that some parts has to be optimized for such scenarios.
There are some serious security design flaws, but given the extension's maturity most of these (such as SQL embedding and cross-site scripting) have been discovered and worked around. Security, however, is on the whole less than satisfactory, and versions prior to and including 4.2.3 are well known to be easy targets for "script kiddies".
"The Indexed Search Engine (indexed_search) system extension in TYPO3 4.0.0 through 4.0.9, 4.1.0 through 4.1.7, and 4.2.0 through 4.2.3 allows remote attackers to execute arbitrary commands via a crafted filename containing shell metacharacters, which is not properly handled by the command-line indexer." — vulnerability CVE-2009-0258
General characteristics of the extension:
- White space is used to split words.
- Words are limited to a minimum of 2 characters and a maximum of 200 characters in length.
- Boolean expressions are not supported, only the concepts of "all words" (AND), "any words" (OR) or "none of these words" (NOT).
- Only rendered pages are indexed and ONLY those that are cacheable. Pages where the cache is disabled are not indexed.
- Each page is uniquely identified by an ID for that page.
- Pages in more than one language must be indexed as different pages since they are identified as id/type/language/cHashParams.
- While the same page may have different content based on the user-groups (and so must be indexed once for each) such pages may just as well present the SAME content regardless of usergroups.
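The word and matching rules above can be sketched in a few lines. This is a minimal illustration, not the extension's own code: the function names, the lowercasing of words and the tuple used as a page key are assumptions of mine.

```python
MIN_LEN, MAX_LEN = 2, 200  # word length limits listed above


def tokenize(text):
    """Split on white space and keep words of 2..200 characters.

    Lowercasing is an assumption made here to keep matching simple.
    """
    return [w.lower() for w in text.split() if MIN_LEN <= len(w) <= MAX_LEN]


def matches(page_words, terms, mode="all"):
    """Apply the three supported modes: 'all' (AND), 'any' (OR), 'none' (NOT)."""
    words = set(page_words)
    wanted = set(tokenize(" ".join(terms)))
    if mode == "all":
        return wanted <= words
    if mode == "any":
        return bool(wanted & words)
    if mode == "none":
        return not (wanted & words)
    raise ValueError(f"unknown mode: {mode}")


def page_key(page_id, page_type, language, chash_params):
    """Hypothetical composite key mirroring id/type/language/cHashParams."""
    return (page_id, page_type, language, chash_params)


words = tokenize("A quick brown fox")       # the one-letter word "A" is dropped
print(matches(words, ["quick", "fox"], "all"))
print(matches(words, ["quick", "cat"], "any"))
print(page_key(42, 0, 0, "") != page_key(42, 0, 1, ""))  # two languages, two entries
```

Since the language is part of the key, a translated page ends up as a separate index entry, which is why multilingual sites index every page once per language.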
The indexed words are inserted into a DB table and the MySQL search is used. The MySQL fulltext search is available ONLY for MyISAM tables, and both index and data should fit in memory (with key_buffer_size sized appropriately). Since it uses a variation of BTREE, each word in the index is an index entry. This means that updating a text with 1000 words demands 1000 updates to key entries. If the index is not in memory this generates a lot of I/O and can run a server into the ground. Intel/AMD based servers are particularly prone to extreme system degradation due to I/O.
- The search itself offers acceptable but non-stellar performance as long as the data and index are in memory.
- Word positions are not stored. Only word frequencies.
- Basic ranking and sorting: frequency based ranking.
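To make the cost and the ranking model concrete, here is a small in-memory sketch of a frequency-only index. It is my own illustration and not how MySQL or indexed_search store things internally: the class name, the `key_writes` counter and the tie-breaking by page key are assumptions. It counts one key write per distinct word of a text, which is the effect described above.

```python
from collections import Counter, defaultdict


class FrequencyIndex:
    """Toy index: stores word frequencies per page, no word positions."""

    def __init__(self):
        self.postings = defaultdict(dict)  # word -> {page key: frequency}
        self.key_writes = 0                # index entries written so far

    def add_page(self, page_key, text):
        """Index (or re-index) a page; returns the number of key writes."""
        counts = Counter(text.lower().split())
        for word, freq in counts.items():
            self.postings[word][page_key] = freq
            self.key_writes += 1
        return len(counts)

    def search(self, word):
        """Rank pages by how often the word occurs, highest first."""
        hits = self.postings.get(word.lower(), {})
        return sorted(hits, key=lambda k: (-hits[k], str(k)))


index = FrequencyIndex()
index.add_page("p1", "search search index")
index.add_page("p2", "search")
print(index.search("search"))   # p1 ranks first: the word occurs twice there

# A text with 1000 distinct words costs 1000 key writes on every update.
print(index.add_page("big", " ".join(f"w{i}" for i in range(1000))))
```

Because only frequencies are kept, phrase or proximity queries are out of reach for such an index; there is simply no position data to consult.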
For small sites with a limited number of text blocks, not too much traffic and no real demands on search or sorting, however, it's quite acceptable given its cost (FREE) and ease of installation and management.
mnoGoSearch SQL extension
One of the most popular SQL extensions that tries to provide some fulltext functionality is mnoGoSearch. It features:
- external indexer (runs as a cron job at night)
- reindexes only modified pages (no need to crawl the whole site, extension tracks modified pages)
- supports word forms (do/does/did/done - all will be found when searching for "do" or "did") using Aspell
- correctly works with "index" flag for content elements (indexed search ignores it)
- search results are internally cached, so the same query returns quickly
There are limitations in the current version of the TYPO3 extension:
- needs a database per site or will return results from all sites
- requires one-time compiling on the server
- requires a PHP extension
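Two of the features listed above, reindexing only modified pages and caching results per query, can be sketched as follows. This is a generic illustration and not mnoGoSearch's actual mechanism: the class, the use of modification times to detect changes and the cache invalidation rule are all assumptions.

```python
class IncrementalIndexer:
    """Toy model of 'reindex only what changed' plus a per-query cache."""

    def __init__(self):
        self.indexed_at = {}   # page id -> modification time at last indexing
        self.query_cache = {}  # query string -> cached result list

    def pages_to_reindex(self, pages):
        """pages maps page id -> last modification time."""
        return sorted(pid for pid, mtime in pages.items()
                      if self.indexed_at.get(pid) != mtime)

    def run(self, pages):
        """One nightly run: index changed pages, drop stale cached results."""
        todo = self.pages_to_reindex(pages)
        for pid in todo:
            self.indexed_at[pid] = pages[pid]
        if todo:
            self.query_cache.clear()
        return todo

    def search(self, query, backend):
        """Ask the (hypothetical) backend only on a cache miss."""
        if query not in self.query_cache:
            self.query_cache[query] = backend(query)
        return self.query_cache[query]


indexer = IncrementalIndexer()
print(indexer.run({1: 100, 2: 100}))  # first run: both pages are new
print(indexer.run({1: 100, 2: 150}))  # second run: only page 2 changed
```

Clearing the whole cache after any reindex is the crudest possible invalidation rule; it is chosen here only to keep the sketch short.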
mnoGoSearch indexes are stored in the RDBMS and indexing speed is extremely slow, significantly slower than the "default" MySQL. Benchmarks performed by MySQL Inc on a dump of English Wikipedia (3400617 articles and a total of 5 GB) did not complete indexing within 24 hours on an AMD Sempron based machine with 1 GB RAM using Linux and SATA disks.
Indexing and search performance for small collections is acceptable.
Zend (PHP) / PowerSearch
The Zend toolkit extensions use a Java-PHP bridge to try to slap Lucene into the picture but that's just a big drain (aside from the limitations on search).
Around the Zend toolkit the most popular current search extensions for Typo3 are:
- powersearch (Basic Extension)
- powersearchui (Frontend Plugin)
As well as various variations.
Power Search Index Lucene (extension)
Compared to the indexed_search and MySQL extensions, "PowerSearch" is faster, more powerful, safer and more robust. It is, however, much more difficult to install and manage. Indexing is also comparatively slow and the total system footprint is bloated. Under 32-bit Linux this can pose serious problems, as the combination of Typo3/PHP, Java and Lucene can map or access too much memory and expose the system to the dreaded "OOM Killer", which starts shooting processes down, seemingly at random, and can quickly take the server down despite large amounts of UNUSED memory available. Since the Linux kernel uses low memory to track allocations of all memory (including memory mapping), the more memory installed or available (including memory effectively added by memory mapping), the more low memory is needed. YES, you read that right:
The more RAM is installed in the system and the more virtual memory is available (via disk mapping or via memory mapped I/O), the faster the system will run out of low memory and trigger the OOM killer, which in turn will start to kill processes and eventually take a server down. This is particularly ugly with Java since the Java VM tends to grab memory from the system but not return any.
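To put rough numbers on this, here is a back-of-the-envelope sketch. The constants are assumptions on my part and not measurements: 4 KB pages, a 32-byte per-page descriptor and about 896 MB of low memory are figures commonly quoted for 32-bit x86 kernels, and the real values depend on kernel version and configuration. It only counts bookkeeping for physical RAM, not for memory-mapped regions.

```python
PAGE_SIZE = 4096             # assumed bytes per page frame (32-bit x86)
STRUCT_PAGE_BYTES = 32       # assumed size of the per-page descriptor
LOWMEM_BYTES = 896 * 2**20   # assumed size of the low-memory zone

GIB = 2**30
MIB = 2**20


def lowmem_for_page_tracking(ram_bytes):
    """Low memory consumed just to describe every page frame of RAM."""
    return (ram_bytes // PAGE_SIZE) * STRUCT_PAGE_BYTES


def lowmem_share(ram_bytes):
    """Fraction of the low-memory zone used up by that bookkeeping."""
    return lowmem_for_page_tracking(ram_bytes) / LOWMEM_BYTES


for ram_gib in (1, 4, 16):
    used = lowmem_for_page_tracking(ram_gib * GIB)
    share = lowmem_share(ram_gib * GIB)
    print(f"{ram_gib:2d} GiB RAM -> {used // MIB} MiB of low memory ({share:.1%})")
```

Under these assumptions the bookkeeping grows linearly with installed RAM while the low-memory zone stays fixed, which is the squeeze described above.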
IB4Typo3: IB client/server solution for Typo3
User features:
- DB offloaded (client/server).
- Client side (Typo3 end) as a Typo3 extension which talks to a number of distributed (federated) servers for high availability.
- Uses powerful and highly performant IB search engine.
- Allows for the display in results of breadcrumbs etc.
Performance is higher than all of the above extensions: see a comparison between Lucene and IB.
The IB solution is client/server and offloads fragments of DB content from behind the CMS into the IB engine. Mirroring searchable DB content (whatever in the DB one wants to search, allowing for selective exclusion of information) into a search engine improves query capacity and user experience. It also:
- Avoids expensive high-end server hardware upgrades providing over the course of typical project development significant cost savings.
- Scales very well: Backend search servers can be load distributed over a cluster.
- Helps avoid costs for planning, developing and implementing improved information retrieval performance on current RDBMS architectures.
- Avoids expensive re-indexing and rebuilding of data tables.
- Improves RDBMS performance: no need to have indexed fields (inserts into indexed fields on an RDBMS are generally very expensive since the index needs to be constantly rebuilt).
- Offers more flexible and powerful search.
- Enables the creation of user groups allowing different groups to have different views to both search and information presentation.
- Allows the integration of page attachments (and other external information objects such as PDFs, Word Files, Audio/Video) into search.
- Provides a higher level of information security:
- Removes the need for SQL queries to directly access data in search.
- A clear functional and even physical separation between search and storage.
- Full control over information provided for search: Nothing is there other than what is intended for others to see.
- The protocol itself is so designed that there is little one can accomplish with a hijacked search server.
- The identity of the search server is private to the outside world, so only insiders or those that have already broken the integrity of the platform could even know where it is.
- The search server does not need to be accessible to the outside world (not even through a firewall).
- The text information in the protocol itself is parsed by the client (Web server) and converted into a form for presentation. There is no means to embed foreign code.
- Other "better" (but more complicated) protocols such as SRW/U, on which we are active developers in the ZIG, are, of course, available since they plug right into the design of the IB engine, which was developed for SGML/XML and ISO23950 interoperability.
- The IB engine also offers a more generic OO design that allows one to connect different services into search to offer a kind of search Swiss Army Knife (jack of all trades). Search can even transparently access other DBs and networked distributed objects.
The client side code (what's running in Typo3) is a few lines of PHP code. The server is built around the "nano http[d]" program. Nano_http[d] is written in Python and uses IB as a module.
Nano_http[d] is open source "freeware" covered by a liberal BSD inspired license.
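The client/server split can be illustrated with a short sketch. The real client is PHP and the real wire format is not described here, so everything below is an assumption: the `/search?q=...` URL, the JSON response with `title` and `breadcrumbs` fields and the function names are hypothetical. What it shows is the idea: try the federated servers in turn, and treat whatever comes back as plain text that is escaped before presentation.

```python
import html
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def http_fetch(url, timeout=2.0):
    """Plain HTTP GET; swapped out for a fake in tests."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def search(servers, query, fetch=http_fetch):
    """Return the hits from the first search server that answers."""
    params = urlencode({"q": query})
    last_error = None
    for base in servers:
        try:
            return json.loads(fetch(f"{base}/search?{params}"))
        except (OSError, ValueError) as exc:  # server down or garbled reply
            last_error = exc
    raise ConnectionError(f"no search server answered: {last_error}")


def render_hit(hit):
    """Escape every field so the server's text cannot inject markup."""
    title = html.escape(hit["title"])
    crumbs = " / ".join(html.escape(c) for c in hit.get("breadcrumbs", []))
    return f"{crumbs}: {title}" if crumbs else title
```

A call such as `search(["http://search1.internal", "http://search2.internal"], "typo3")` (hypothetical host names) falls through to the second server when the first is unreachable. Since the web server only ever emits escaped text, even a hijacked search server has little to work with.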
Pre-requisites
- python 2.6
[sources available from python.org]
- MX extensions
- MX experimental extensions
[available from http://www.egenix.com/products/python/]
- IB libs and Python loadable module
see also IB4Typo3