"A newspaper is not only a collective propagandist and a collective agitator, it is also a collective organizer." -- Lenin

How Google directly charges for inclusion: PPC (Pay Per Click) advertising and ranking.

PPC schemes (such as Google AdWords) do build link popularity and are counted in the so-called "organic" Google index. They are counted in two ways:
  • Google crawls JavaScript links on Web sites. These outbound links are, it seems, handled by Google just like any other outbound link, and Google AdWords uses JavaScript for its links.
  • In the cached pages: the Google robot stores pages together with the PPC ad campaigns shown on them. The outbound ad links present at the moment the page is gathered are used by Google in its link analysis. The in-bound text of the advertisement that placed the link on the page will, in turn, affect the ranking of the link. The selection of costly words (and inclusion on highly ranked sites) will drive up ranking and visibility, and this can be shown.
Intentional? See: "Are PPC Ads Now Counting in Google Organic Backlinks?" (SearchEngineWatch.com)

Brief Fast ESP comparison

coming soon.

XML:DB API

Compatibility with the XML:DB API paradigm
General Requirements
Language Independence - The API MUST NOT preclude the usage with more than one language binding. IB provides a large number of language bindings (including C++, Tcl, Python, Perl, PHP, Ruby, Java and C#).
Textual Interface - The API MUST provide a textual representation of XML result sets. Yes.
XML-API Interface - The API SHOULD provide a SAX or DOM based representation of XML result sets. A SAX or DOM based representation is available by loading the XML representation of the result sets into a DOM.
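
As a rough illustration of loading a textual result set into a DOM, the following Python sketch uses the standard library's xml.dom.minidom. The result format (the results, hit and title elements and their attributes) is invented for this example and is not the actual IB output.

```python
from xml.dom.minidom import parseString

# Hypothetical result set: the element and attribute names are assumptions
# made for this example, not the actual IB output format.
RESULT_XML = """<?xml version="1.0"?>
<results query="fulltext" total="2">
  <hit rank="1" score="0.92"><title>PHP Fulltext search</title></hit>
  <hit rank="2" score="0.57"><title>XML:DB API</title></hit>
</results>"""


def hits_from_xml(xml_text):
    """Parse the textual result set into a DOM and return (rank, score, title) tuples."""
    dom = parseString(xml_text)
    hits = []
    for node in dom.getElementsByTagName("hit"):
        title = node.getElementsByTagName("title")[0].firstChild.data
        hits.append((int(node.getAttribute("rank")),
                     float(node.getAttribute("score")),
                     title))
    return hits


if __name__ == "__main__":
    for rank, score, title in hits_from_xml(RESULT_XML):
        print(rank, score, title)
```

The same text could be fed to a SAX parser instead when result sets are too large to hold in memory as a tree.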

PHP Fulltext search

IB fulltext solutions for PHP
  • PHP development using the PHP loadable module
  • PHP interface using Rain.
Rain: IB client/server solution for PHP

Rain is what's behind this site (IBU News). Interfacing PHP (or any generic Web application) with Rain takes just a few lines of PHP code that talk to the Nano_http server. User features:
  • DB offloaded (client/server).
  • Client side (Drupal) which talks to a number of distributed (federated) servers for high availability.
  • Uses powerful and highly performant IB search engine.
  • Supports AJAX for a host of features including Scan
  • Clickless search: selecting some text on a page runs a search in a layer.
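
The idea of a client that talks to several federated servers for high availability can be sketched as a simple failover loop. This is a minimal sketch in Python rather than PHP, and it is not the Rain client itself: the /search?q= path, the server URLs and the function names are assumptions made for illustration.

```python
import urllib.parse
import urllib.request


def http_fetch(url, timeout=2.0):
    """Fetch a URL and return the response body as text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def federated_search(servers, query, fetch=http_fetch):
    """Try each search server in turn; return (server, body) from the first that answers.

    `servers` is a list of base URLs. The `/search?q=` path is a hypothetical
    endpoint chosen for this sketch.
    """
    errors = []
    for base in servers:
        url = base.rstrip("/") + "/search?q=" + urllib.parse.quote_plus(query)
        try:
            return base, fetch(url)
        except OSError as exc:  # URLError, timeouts and refused connections are OSErrors
            errors.append((base, exc))
    raise RuntimeError("no search server reachable: %r" % errors)
```

Passing the fetch function as a parameter keeps the failover logic independent of the transport, so the loop can be exercised without a running server.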

IB4J

IB4J: IB client/server solution for Java

User features:
  • DB offloaded (client/server).
  • Client side (Java) which talks to a number of distributed (federated) servers for high availability.
  • Uses powerful and highly performant IB search engine.

Raining content: IB for Drupal

Rain: IB client/server solution for Drupal

Rain is what's behind this site (IBU News). User features:
  • DB offloaded (client/server).
  • Client side (Drupal) which talks to a number of distributed (federated) servers for high availability.
  • Uses powerful and highly performant IB search engine.
  • Supports AJAX for a host of features including Scan
  • Clickless search: selecting some text on a page runs a search in a layer.

IB4Typo3: IB client/server solution for Typo3

User features:
  • DB offloaded (client/server).
  • Client side (Typo3 end) as a Typo3 extension which talks to a number of distributed (federated) servers for high availability.
  • Uses powerful and highly performant IB search engine.
  • Allows for the display in results of breadcrumbs etc.

A comparison of Typo3 search solutions

Typo3 search solutions

The following short article attempts to outline a few of the search solutions for Typo3 that I am currently familiar with. They fit more or less into two groups: RDBMS "full-text" extensions (to MySQL) and outboard extensions using a fulltext library (Lucene, IB). To my knowledge, only IB4Typo3 supports breadcrumbs in the results.

"Famous" indexed_search extension

This is the standard search found on most Typo3 installations. It's quite easy to install, and there are a handful of extensions that, in turn, build upon it to provide some slight enhancements to usability. It's a mature product and, at this point, well understood.

It is, however, officially considered suspect: see Known problems:

  • Currently the extension is under observation because instances of heavy server load/unstability has been reported. It is not yet clear if THIS extension has anything to do with. So it's only under suspicion at this point until further data has been collected. But for now it is adviced to be careful with the application of the extension for mission critical, high-load environments.
  • It's still uncertain how performance is under heavy load conditions and when MANY pages are indexed. Currently benchmarks has been done only up to 2000 pages indexed/approx. 400.000 relation records. It is probably that some parts has to be optimized for such scenarios.
There are some serious security design flaws, but given the extension's maturity most of these (such as SQL injection and cross-site scripting) have been discovered and worked around. Security is, however, on the whole less than satisfactory, and versions up to and including 4.2.3 are well known to be easy targets for "script kiddies".

"The Indexed Search Engine (indexed_search) system extension in TYPO3 4.0.0 through 4.0.9, 4.1.0 through 4.1.7, and 4.2.0 through 4.2.3 allows remote attackers to execute arbitrary commands via a crafted filename containing shell metacharacters, which is not properly handled by the command-line indexer." — vulnerability CVE-2009-0258

General characteristics of the extension:

  • White space is used to split words.
  • Words are limited to a minimum of 2 characters and a maximum of 200 characters in length.
  • Boolean expressions are not supported, only the concepts of "all words" (AND), "any words" (OR) and "none of these words" (NOT).
  • Only rendered pages are indexed and ONLY those that are cacheable. Pages where the cache is disabled are not indexed.
  • Each page is uniquely identified by an ID for that page.
  • Pages in more than one language must be indexed as different pages since they are identified by id/type/language/cHashParams.
  • The same page may have different content depending on the user group, and so must be indexed once for each group, even though such pages may just as well present the SAME content regardless of user group.
These are inserted into a DB table and the MySQL search is used. MySQL fulltext search is available ONLY for MyISAM tables, and both index and data should fit in memory (with key_buffer_size sized accordingly). Since it uses a variation of a B-tree, each word in the index is an index entry. This means that updating a text with 1000 words demands 1000 updates to key entries. If the index is not in memory this generates a lot of I/O and can run a server into the ground. Intel/AMD based servers are particularly prone to extreme system degradation due to I/O.
  • The search itself offers acceptable but non-stellar performance as long as the data and index are in memory.
  • Word positions are not stored. Only word frequencies.
  • Basic ranking and sorting: frequency based ranking.
For small sites with a limited number of text blocks, not too much traffic and no real demands on search or sorting, however, it's quite acceptable given its cost (FREE) and ease of installation and management.
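
The characteristics listed above (white-space splitting, the 2 to 200 character limits, frequencies without positions, "all words"/"any words" matching and frequency-based ranking) can be sketched as a toy in-memory index. This is only an illustration in Python, not the extension's code or its MySQL schema; the class and method names are made up, and the sketch writes one entry per distinct word of a page.

```python
from collections import Counter, defaultdict

MIN_LEN, MAX_LEN = 2, 200  # word length limits quoted above


def tokenize(text):
    """Split on white space and keep words within the length limits."""
    return [w.lower() for w in text.split() if MIN_LEN <= len(w) <= MAX_LEN]


class FrequencyIndex:
    """Toy word -> {page_id: frequency} index; no word positions are kept."""

    def __init__(self):
        self.postings = defaultdict(dict)
        self.key_updates = 0  # counts index entries written

    def add_page(self, page_id, text):
        # Every distinct word of the page costs one write to the index.
        for word, freq in Counter(tokenize(text)).items():
            self.postings[word][page_id] = freq
            self.key_updates += 1

    def score(self, page_id, words):
        """Frequency-based rank: sum of the frequencies of the query words."""
        return sum(self.postings.get(w.lower(), {}).get(page_id, 0) for w in words)

    def search(self, words, mode="all"):
        """mode is 'all' (AND) or 'any' (OR); best-scoring pages come first."""
        sets = [set(self.postings.get(w.lower(), {})) for w in words]
        if not sets:
            return []
        pages = set.intersection(*sets) if mode == "all" else set.union(*sets)
        return sorted(pages, key=lambda p: (-self.score(p, words), p))
```

The key_updates counter makes the indexing cost visible: it grows with the number of distinct words on every page that is added or re-indexed, which is why an index that does not fit in memory turns each update into a burst of I/O.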

mnoGoSearch SQL extension

One of the most popular SQL extensions to try to provide some fulltext functionality is mnoGoSearch. It features
  • external indexer (runs as a cron job at night)
  • reindexes only modified pages (no need to crawl the whole site, extension tracks modified pages)
  • supports word forms (do/does/did/done - all will be found when searching for "do" or "did") using Aspell
  • correctly works with "index" flag for content elements (indexed search ignores it)
  • search results are internally cached, so the same query returns quickly
  • There are limitations in the current version of the TYPO3 extension:
    • needs database per site or will return results from all sites
  • requires one time compiling on the server
  • requires PHP extension
mnoGoSearch indexes are stored in the RDBMS and indexing is extremely slow, significantly slower than the "default" MySQL. Benchmarks performed by MySQL Inc on a dump of English Wikipedia (3,400,617 articles, 5 GB in total) did not complete indexing within 24 hours on an AMD Sempron based machine with 1 GB RAM running Linux with SATA disks.

Indexing and search performance for small collections is acceptable.

Zend (PHP) / PowerSearch

The Zend toolkit extensions use a Java-PHP bridge to try to slap Lucene into the picture but that's just a big drain (aside from the limitations on search).

Around the Zend toolkit the most popular current search extensions for Typo3 are:

  • powersearch (Basic Extension)
  • powersearchui (Frontend Plugin)
There are also various variations, such as Power Search Index Lucene (extension).

Compared to the indexed_search and MySQL extensions, PowerSearch is faster, more powerful, safer and more robust. It is, however, much more difficult to install and manage, indexing is comparatively slow and the total system footprint is bloated. Under 32-bit Linux this can pose serious problems: the combination of Typo3/PHP, Java and Lucene can map or access too much memory and expose the system to the dreaded "OOM killer", which starts shooting processes down, seemingly at random, and can quickly take the server down despite large amounts of UNUSED memory being available.

Since the Linux kernel uses low memory to track allocations of all memory (including memory mappings), the more memory is installed or available (including memory effectively added by memory mapping), the more low memory is needed. YES, you read that right: the more RAM is installed in the system and the more virtual memory is available (via disk mapping or via memory-mapped I/O), the faster the system will run out of low memory and trigger the OOM killer, which in turn will start to kill processes and eventually take a server down! This is particularly ugly with Java since the Java VM tends to grab memory from the system but not return any.
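
The low-memory effect can be made concrete with some rough arithmetic. The figures in this Python sketch are assumptions for illustration only: 4 KiB pages, roughly 32 bytes of kernel bookkeeping per page, and about 896 MiB of low memory on a 32-bit kernel. Actual values depend on the kernel version and configuration, and the sketch ignores other consumers of low memory such as page tables.

```python
PAGE_SIZE = 4096            # bytes per page on x86
PAGE_DESCRIPTOR = 32        # ASSUMED bytes of kernel bookkeeping per page
LOW_MEMORY = 896 * 2**20    # ASSUMED size of low memory on a 32-bit kernel


def low_memory_for_bookkeeping(ram_bytes):
    """Return the bytes of low memory needed just to track `ram_bytes` of RAM."""
    return (ram_bytes // PAGE_SIZE) * PAGE_DESCRIPTOR


def report(ram_gib):
    """Return (bytes used, share of low memory) for a machine with `ram_gib` GiB."""
    used = low_memory_for_bookkeeping(ram_gib * 2**30)
    return used, used / LOW_MEMORY


if __name__ == "__main__":
    for gib in (1, 4, 16, 64):
        used, share = report(gib)
        print("%2d GiB RAM -> %4d MiB of low memory (%.0f%%)"
              % (gib, used // 2**20, 100 * share))
```

Under these assumptions the bookkeeping alone grows linearly with installed RAM while the low-memory pool stays fixed, which is the mechanism behind the behavior described above.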

IB4Typo3: IB client/server solution for Typo3

User features:
  • DB offloaded (client/server).
  • Client side (Typo3 end) as a Typo3 extension which talks to a number of distributed (federated) servers for high availability.
  • Uses powerful and highly performant IB search engine.
  • Allows for the display in results of breadcrumbs etc.
Performance is higher than all of the above extensions: See a comparison between Lucene and IB.

The IB solution is client/server and offloads fragments of DB content from behind the CMS into the IB engine. Mirroring the searchable DB content (whatever in the DB one wants to search, allowing for selective exclusion of information) into a search engine improves not just query capacity and user experience. It also:

  • Avoids expensive high-end server hardware upgrades providing over the course of typical project development significant cost savings.
  • Scales very well: Backend search servers can be load distributed over a cluster.
  • Helps avoid costs for planning, developing and implementing improved information retrieval performance on current RDBMS architectures.
  • Avoids expensive re-indexing and rebuilding of data tables.
  • Improves RDBMS performance: no need to have indexed fields (inserts into indexed fields on an RDBMS are generally very expensive since the index needs to be constantly rebuilt).
  • Offers more flexible and powerful search.
  • Enables the creation of user groups allowing different groups to have different views to both search and information presentation.
  • Allows the integration of page attachments (and other external information objects such as PDFs, Word Files, Audio/Video) into search.
  • Provides a higher level of information security:
    • Removes the need for SQL queries to directly access data in search.
    • A clear functional and even physical separation between search and storage.
    • Full control over information provided for search: Nothing is there other than what is intended for others to see.
    • The protocol itself is so designed that there is little one can accomplish with a hijacked search server.
    • The identity of the search server is private to the outside world, so only insiders or those that have already broken the integrity of the platform could even know where it is.
    • The search server does not need to be accessible to the outside world (not even through a firewall).
    • The text information in the protocol itself is parsed by the client (Web server) and converted into a form for presentation. There is no means to embed foreign code.
  • Other "better" (but more complicated) protocols such as SRW/U, on which we are active developers in the ZIG, are, of course, available since they plug right into the design of the IB engine, which was developed for SGML/XML and ISO 23950 interoperability.
  • The IB engine offers also a more generic OO design that allows one to connect different services into search to offer a kind of search Swiss Army Knife (jack of all trades). Search can even transparently access other DBs and networked distributed objects.
The client side code (what's running in Typo3) is a few lines of PHP code. The server is built around the "nano http[d]" program. Nano_http[d] is written in Python and uses IB as a module.
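
The overall shape of such a setup can be sketched with Python's standard library alone. This is not the Nano_http[d] source or its protocol: the /search?q= path, the JSON response and the in-memory DOCUMENTS table standing in for the IB module are all assumptions for illustration, and the client is written in Python here rather than PHP. The client converts the returned text for presentation and escapes it, so markup in the data is never passed through as code.

```python
import html
import json
import threading
import urllib.parse
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Stand-in for the search engine: a real server would call the IB module here.
DOCUMENTS = {1: "Rain: IB for Drupal", 2: "IB4Typo3 <b>extension</b>", 3: "XML:DB API"}


def search(query):
    """Return (id, title) pairs whose title contains the query, ignoring case."""
    q = query.lower()
    return [(i, t) for i, t in sorted(DOCUMENTS.items()) if q in t.lower()]


class SearchHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        params = urllib.parse.parse_qs(urllib.parse.urlparse(self.path).query)
        body = json.dumps(search(params.get("q", [""])[0])).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):  # keep the example quiet
        pass


def fetch_hits(base_url, query):
    """Client side: ask the search server and decode its answer."""
    url = base_url + "/search?q=" + urllib.parse.quote_plus(query)
    with urllib.request.urlopen(url, timeout=5) as response:
        return json.loads(response.read().decode("utf-8"))


def hits_to_html(hits):
    """Convert hits for presentation; escaping means no foreign code gets through."""
    return ["<li>%s</li>" % html.escape(title) for _doc_id, title in hits]


if __name__ == "__main__":
    server = HTTPServer(("127.0.0.1", 0), SearchHandler)  # port 0: pick a free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    base = "http://127.0.0.1:%d" % server.server_port
    print(hits_to_html(fetch_hits(base, "typo3")))
    server.shutdown()
    server.server_close()
```

Because the client only ever asks for search results and renders text, the search server in a layout like this can sit on an internal address that is not reachable from the outside.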

Nano HTTP[D] License

Copyright 1994-2009 NONMONOTONIC LAB of Basis Systeme Netzwerk, Muenchen.
     Zimmermann und Poellmann GbR. All rights reserved.
     http://www.nonmonotonic.net
     http://www.bsn.com    http://www.bsn.de
The files in this directory (/opt/BSN/nano_http) are covered by an open source license: Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
  1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
  2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
Licensee shall provide the NONMONOTONIC LAB with all enhancements and modifications made by Licensee to the Software Materials, for a period of three years from the date of execution of this License. BSN shall have the right to use and/or redistribute the modifications and enhancements without accounting to Licensee.

Enhancements and Modifications shall be defined as follows:

  1. Changes to the source code, support files or documentation.
  2. Documentation directly related to Licensee's distribution of the software.
THIS SOFTWARE IS PROVIDED BY THE NONMONOTONIC LAB OF BSN ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE NONMONOTONIC LAB, BSN OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
