Brief Lucene comparison

To compare Lucene with IB is really to compare potatoes with fish. They have quite different histories, design considerations and goals. The following short sketch attempts, however, to outline a few points since we've been asked.


  1. Design target
    • Lucene was designed to be a reasonably flexible fulltext search engine with support for fields. Its a more or less traditional unstructured text search system using traditional algorithms (inverted index).
    • IB was designed to be an object (neither text nor data centric) oriented search engine for abstract objects and structural paths (including overlays). Its been designed to try to manage wildly heterogeneous formats and extract also implicit structure. IB supports XML, SGML and other formats as-if native.

  2. Java

    • Lucene is pure Java. Its 100% written in Java. Its more or less Java thread safe but not completely.
    • IB is written in C++. Java is provided via a SWIG created JNI (Java Native Interface) module. Since IB is written in C++ and interfaces via SWIG, application development is not limited to Java but Python (perhaps the most popular choice and by far the best developed IB interface), Tcl, Ruby and a number of other languages.

    • Lucene needs Java to run (although there are a few attempts to re-write its algorithms into other languages). Java is, in our educated opinion, a fine language for developing many kinds of applications but is less than ideal for database, search and retrieval.
    • IB allows for applications to use its algorithms to be written in Java but does not require the use of Java. Our favorite language for writing applications to use IB is, in fact, Python and not Java.

  3. Portability

    • Since Lucene is pure Java its portable to platforms with suitable JVMs. Packages should just run from platform to platform. No need, in theory, for specific binaries beyond the JVM.
    • IB is written in extremely portable C++. It can be targeted to most platforms (Win32, Linux, Solaris, BSD etc. are available) but demands a package for each platform. Applications, of course, written in IB's Java (or other language API) are portable to any platform.

  4. Threads

    • Lucene is pure Java and with the exception of the query parser and a few other bits its thread and more or less process safe. Some of it is at the cost of S/R (search and retrieval) concurrency: searches are in memory (either in whole or segments) to make them thread robust (and avoid file system and I/O deadlocks) and so don't include changes to the index during the lifespan of the class (or segment read).
    • IB is designed to handle S/R concurrency. In IBU News, for instance, the indexer is running nearly continuously and search is always in-step with the index. Its built using the POSIX threads model. IB, however, isn't 100% 'thread-safe' in the sense that the programmer can use code indiscriminately from threads. Its been designed with reasonable process safety in mind to allow for robust development of search and retrieval applications. IB is typically used in Web server environments as a service (via protocols). IB can be via JNI embedded into application servers such as Tomcat but its not advised— and especially not on Linux 32-bit systems (due to the design of the 32-bit kernel and how it uses low memory to manage memory mapped pages which together with the size of Tomcat+Apache and the max. address space leaves very little room despite available memory in RAM and swap) highly discouraged. De-coupling search from the application server increases the resilience of the application server to traffic. Java, after all, is about modularization and client-server objects.

    Do threads make sense for search? See: Should one run search in threads?. In a nutshell: not really and especially not with lower cost servers as their AMD Athlon™ and Hyper-Threaded Intel Pentium™ 4 processors. They have single data ports which limit their input and output to their serial flow through the pipeline. Disks too are limited to reads within a sector. Their relatively small disk buffers favor sector-to-sector serial reads for highest throughput. Disk access will tend to use the cache less and demand more head movements (slower, more heat, more power needed etc.). In comparing queues to concurrent threads the later take more CPU and produce response times not better than the last in a sequential queue— Search performance, after all, is driven more by memory access speed and I/O latency than by CPU speed.
    See also: Theads

  5. Cost

    • Lucene is "free software" provided by the Apache Foundation. Support and integration is by external providers and can vary in quality from excellent to mediocre at a wide range of prices. There are a number of commercial packages built around the Lucene core. They are, however, not free but tend to also bundle some support.
    • IB is commercial software developed by BSn's NONMONTONIC Lab. Licenses are available from BSn and authorized and highly trained partners. Fees include a high level of support.

  6. Searching multiple indexes, JOINs etc.
    • With lucene you can only search 1 index with a query. There is no means to create virtual indexes.
    • IB supports virtual indexes and index import. One can create on demand virtual collections of indexes and transparently search them with a single query.
    • With Lucene there is no means to import and merge multiple indexes into a single index.
    • With IB can also import multiple indexes into a single index.
    • IB supports also JOINs and via the object system these joins can be to RDBMSs.

  7. Permitted document size and speed

    • Lucene normally indexes only the first 10,000 words of a document. When increasing this default out-of-memory errors can occur. Despite its memory demands it still seems to index quite slow. Its slow since it indexes on a per document basis. Eating a document, spitting out its index and then merging or optimizing into the main index.
    • IB indexes each and every word of a document. It does not matter how many words a document has. The amount of memory an indexing process uses is defined by the configuration and is not related to the size of the documents being indexed, the number or frequency of there terms. The more memory (as long as the system does not start constantly swaping) one gives a process, up to the total size of all the documents indexed, the faster the indexer will run but it can also run in a tiny memory footprint. The minimum memory an index process needs is the size of the largest record (it self-adjusts for this). IB can typically index significantly faster than one could copy the documents into a tar archive. The more and faster the I/O (memory, disk etc.) the faster the indexing process.

  8. Field length

    • Lucene sets (by default) the max. field length by default to 10000 terms. This is to set an upper bound for the amount of memory used for indexing a single document. Since this still can lead to OOM (Out Of Memory)— especially on 32-bit Linux platforms— one is often better off reducing it to half that value.
    • IB places no limitation of max. field length. Just as a document can contain an number of terms, a field can contain any number of terms— and one can have any number of fields as well.

  9. Memory Demands

    • Lucene (including Java) needs a lot of memory to run. RAM memory consumption is more or less constant at a high level during both indexing and searching activities.
    • IB can be configured to use a specific amount of memory during index. It can also self-configure itself to run in a portion of the free RAM available on the system (determined at indexing start).

    • Lucene has a high fixed memory demand for search since segments of indexes are copied into memory.
    • IB has a low fixed memory demand. The amount of memory needed by a search is related to the size of the resulting result. The larger the number of records found and the more hits they have, the larger the memory used. Once the result set is no longer used its disposed of.

    • Since Lucene is Java it is dependent upon the Java memory management (garbage collector) to manage system memory. Java tends to "hog" memory: normally takes but returns little memory to the operating system.
    • IB uses memory mapping and allocates and deallocates memory to the operating system. The model has been designed to try to run in limited resources and create a minimal impact on total system performance (other programs running).

  10. Indexing Performance
    • IB indexing the Reuters 21578 collection:

             Machine: FreeBSD 5.4 AMD Athlon 64-bit CPU single core 2.4 GHz.
             Elapsed "real" time:    36 seconds
             CPU time:               27.23 seconds
                 User time:          15.0993 seconds
                 System time:        12.1307 seconds
             CPU/Real:               75.6388%
             Max resident size:      68760k
             Shared text memory:     434k
             Unshared data:          9030k
             Unshared stack:         3472k
             Page reclaims:          14701
                    faults:          29
             Swaps:                  0
             File system in events:  10
                        out event:   825
             Context switches Vol.:  109
                            Invol.:  9685
             Total records added:    21578
             Total words added:      3579801
        

      That's under 1/2 a minute in 68 MB of memory (max total). Lucene (again not indexing all the fields/paths not parsing the dates etc.) takes over 2 times that amount of CPU time and consumes no less than 3 times the memory.

    • Running is less memory?

             Machine: FreeBSD 5.4 AMD Athlon 64-bit CPU single core 2.4 GHz.
             Elapsed "real" time:    40 seconds
             CPU time:               26.7424 seconds
                 User time:          14.799 seconds
                 System time:        11.9433 seconds
             CPU/Real:               66.8559%
             Max resident size:      42808k
             Shared text memory:     432k
             Unshared data:          9005k
             Unshared stack:         3432k
             Page reclaims:          17069
                    faults:          0
             Swaps:                  0
             File system in events:  5
                        out event:   2339
             Context switches Vol.:  361
                            Invol.:  12157
             Total records added:    21578
             Total words added:      3579801
        

      Not terribly slower (again its a tiny collection) and using a max. of only 42 MB virtual memory.

    • A larger known collection? IB Indexing the "large reuters collection":

      
            Machine: FreeBSD 5.4 AMD Athlon 64-bit CPU single core 2.4 GHz.
            Elapsed "real" time:    15074 seconds
            CPU time:               4317.21 seconds
                User time:          2112.4 seconds
                System time:        2204.8 seconds
            CPU/Real:               28.6401%
            Max resident size:      349136k
            Shared text memory:     69016k
            Unshared data:          2550694k
            Unshared stack:         552133k
            Page reclaims:          6752069
                   faults:          2149767
            Swaps:                  0
            File system in events:  173002
                       out event:   1098098
            Context switches Vol.:  2328337
                           Invol.:  1999956
            Total records added:    806791
            Total words added:      283291747
          

      That's under 72 minutes CPU (half of it system calls to open the files etc.) to index 1 year of Reuters Newswire including parsing the XML etc. with a total memory demand (entire process) of 349136k.


  11. Exclused terms / Stop words

    • Lucene normally excludes "common" words, so-called "stop words", from the index. The general use is to have these stop words also automatically chosen on the basis of their frequency in the index. The default value of 0.4 means that words that are more common than 4 per 1000 words are automatically excluded from the index.
    • IB does not demand but can allow for the use of stop words. IB supports stop words on a per-language basis (language of document) and allows for distinct lists for use during indexing (exclude from the index) and search (exclude as ordinary term for search). Common practice is to index each and every term and only use stop words, if at all, during search. The term "war", for example, might not be too significant in German ("was" in English) but means quite something else in English (conflict, name of a 1960s funk band etc.).

  12. Term length

    • Lucene places limits on the lengths of terms
    • IB is designed to handle terms/words of any length.

  13. Search Terms/Wild cards/Truncated search terms

    • Lucene expands wildcards to terms before even searching. Queries are re-written into a more basic form consisting of a set of logically combined TermQuery's. The standard limit on the total number of terms in a query is 1024. For example, if the indexed documents contain the terms "car" and "cars" the query "ca*" will be expanded to "car OR cars" before the search takes place.Lucene "pre-processes" steps:
      1. A sorted term enumeration is started from the term alphabetically closest to/after the given prefix (the term characters on the left). This enumerates all terms from all existing fields in the index.
      2. For each term, Lucene checks if the term actually starts with the prefix and belongs to the given field. If so the term is added to a BooleanQuery as a TermQuery with OR logic.
      3. The process produced a constructed BooleanQuery which contains exactly as many clauses as there were terms matching the prefix.
      For a WildcardQuery, the process is similar in that the term value string containing wildcard(s) is also expanded to all matching terms for the given field and OR-combined using a BooleanQuery.
    • IB places no similar limits but allows for wildcards in paths, terms etc. Its really a different model. The query ""ca*" will search for all the terms that start with "ca". If there is a limit defined in the search for time (which may be set to a limit or allowed to be unlimited) or number of records (which can be set to an absolute number, a number as a proportion of the total number of records in the index or unlimited) the search will stop when that mark is passed (default is unlimited number of records).
      1. IB does not expand the query into a boolean OR'd expression but searches directly for the query. This is more direct and allows for query structures to be re-used. With Lucene for each change in an index the Query must be re-parsed (Note: the Lucene query parser is NOT thread safe).
      2. IB supports not just wildcards but full glob (POSIX 1003.2, 3.13) including some extensions. These can be applied to the entire search expression including path and term components.

    • Lucene normally supports only wildcards to the right (prefix queries).
    • IB supports both right and left truncation as well as generic glob expressions.

    • Lucene does not normally allow for wildcards in field names
    • IB allows for wildcards (glob expressions) in field names and paths.

    • Lucene does not support combinations of phrase and wildcard as in the right truncated phrase search "search optim"*
    • IB allows for wildcards.


  14. Proximity

    • Lucene does not really support proximity but a concept of "phrase query slop": the maximum number of full word "moves" required to get all the words next to each other in the proper order. For simple paired expressions like "dog cat"~10 (the ~10 specifies the moves) it comes out as within 10 (or whatever positive integer specified) words.
    • IB has a concept of proximity. Distance, however, is not measured in words but bytes as in the original indexed document. This makes sense since in marked-up documents what's a word? IB also has heuristic concepts of near and can also restrict proximity to within a common field instance.

  15. Boolean operations

    • Lucene does not use the pure boolean information retrieval model or support boolean operators but simulates some of the basic user-side functionality for inclusion and exclusion via a modal prefix model. Lucene has two term prefix ops: "+" (for "must contain"), "-" (for "must not contain"). Terms without a prefix are "can contain". It supports via query parsers an emulation of "AND" "OR" "ANDNOT". The emulation, however, is somewhat quirky and often interprets queries differently than most (other than those familiar with the quirks) would ever expect.
    • IB is overloaded with operators (probably more than most people have ever heard of).


      IB Binary Operators
      OperatorSemantic
      ORUnion, the set of all elements in either of two sets
      ANDIntersection, the set of all elements in both sets
      AND:fieldElements in the same node instance of field
      JOINSimilar to AND but applicable across different indexes. Returns those elements in both sets which have common primary keys.
      JOINrJoin to the right: return a set consisting of those elements in the set to the right that has common primar keys to both
      JOINlJoin to the left: as above but from the set to the left.
      ANDNOTElements in the one set but NOT in the other
      NOTANDA NOTAND B is equivalent to B ANDNOT A
      XORExclusive Union, elements in either but not both
      ADJMatching terms are adjacent to one another (as stored on the file system)
      NEARMatching terms are "near" one another. In fielded indexes this is interpreted as "in same container" (e.g. PEER) but when there are no fields (plain fulltext) it is a (preset) minimal byte offset (as stored on the file system) of one another.
      NEAR:numThe matching terms in the sets are within "num" elements as stored on the file system (file offsets). If the value num is an integer its interpreted as bytes (octets). It its a fraction of 1 its % of the length of record
      PEERElements in the same (unnamed) final tree leaf node (container). Its like doing an AND:<Anonymous field what I don't know the name or path of>.
      PEERaPEER After
      PEERbPEER Before
      XPEERHits are elements in the intersection (both sets) but not in the same container.
      BEFORE, AFTERIn fielded records its like an ordered PEER (PERRb, resp., PEERa).
      BEFORE:field, AFTER:fieldWith a named field its like an ordered AND:field.
      BEFORE:num, AFTER:numlike NEAR:num but ordered
      FOLLOWS, PRECEDESWithin some ordered elements of one another
      ONEAR is as above but uses the query order
      FARElements a "good distance" away from each other. In fielded indexes its the same as XPEER. In non-fielded indexes (plain fulltext) its a (preset) minimal byte offset (as stored on the file system) from each another.

    • Since Lucene does not understand boolean a query parser needs to map queries it into the modal model it understands. This often is different from what one might exepect. The boolean query: "A" AND "B" OR "C", returns the same set as the search for "A". The query: "A" OR "B" AND "C" returns the same set as "B" AND "C". See the Lucene Wiki:
      http://wiki.apache.org/lucene-java/BooleanQuerySyntax
    • IB runs RPN. In IB the same infix query ("A" AND "B" OR "C") is read left to right and parsed into the RPN query: "A" "B" AND "C" OR. The implicit grouping is as-if ("A" AND "B") or "C". The query "A" OR "B" AND "C" is similarly parsed into (the RPN query) "A" "B" OR "C" AND.


  16. Unary operators

    • Lucene has effectively no unary operators. The closest to unary operations are term boost (weight) and "fuzzy" but they are limited to use as term modifiers.
    • IB has a full-blown set of operators (again from different design considerations) and includes not just a rich set of binary but also unary operators including complement, operators to sort and manipulate sets, boost weight of the expression (according to a number of models) restrict to given fields/paths etc.

      IB Unary Operators
      NOTSet compliment
      WITHIN[:field]Records with elements within the specified field. RPN queries "term WITHIN:field" and "field/term" are equivalent. (for performance the query "field/term" is prefered to "term WITHIN:field")
      WITHIN[:daterange]Record dates within the range
      INCLUSIVE[:field]Inclusive Within: ALL Hits (and ONLY THOSE) are elements that are in the specified field. Matching records are those that have their hits absolutely in the specified field and nowhere else.
      XWITHIN[:field]Absolutely NOT in the specified field
      Special Unary Operators
      BOOST:fff.ffBoost the score of the set by fff.ff (Score = Score * fff.ff)
      REDUCE[:nnn]Reduce set to those records with nnn matching terms.
      NOTE: REDUCE:metric is a special kind of unary operator that trims the result
      TRIM:fff.ffTrim to the set to contain a max. number of records. If fff.fff is an integer then its the maximum number. If fff.fff is a floating point number between 0 and 1 it is taken as a per-cent of the total number of records in the index. An index with 1 million records and TRIM:0.1 would mean max. 100000. A floating point number > 1 is taken as the integer component + the percentage of the number of records. Example: 1000.01 for above = 1000 + 10000 = 11000
      HITCOUNT:nnnTrim the set to contain only those records with, when nnn is positive, at least nnn hits. When nnn is a negative number then the set it to contain those records with no more than -nnn hits.

      Example: HITCOUNT:10 would return those with no less than 10 hits. HITCOUNT:-10 would return those records with up to 10 hits but not more. The combination HITCOUNT:-10 HITCOUNT:10 creates the set of records with exactly 10 hits.

      One may specify this as HITCOUNT>nnn (HITCOUNT>10 is equivalent to HITCOUNT:11), HITCOUNT>=nnn (same as HITCOUNT:nnn), HITCOUNT
      SORTBY:<keyword>Sort the set (reserved names for <keyword>: Key, Hits, Date, Index, Score, AuxCount, Newsrank, Function, Category, ReverseHits, ReverseDate, etc.)

      NOTE: The default sort **MUST** be set to unsorted for the query sort to propagate into the final set.
      Unary Neo-Operators (Sets)
      FILE:<filespec>The set of all records whose input file path match <filespec> (example: FILE:shakesp*.xml).
      EXTENSION:<ext>The set of all records whose input file has the extension <ext> (example: EXTENSION:cgm).
      KEY:<keyspec>The set of all records whose key match the <keyspec>.
      DOCTYPE:<doctype>The set of all records whose doctype (index format) matches <doctype>.
      Note: The above specifications fully support wildcards. They may not be used alone but only as part of a query sentence with at least 1 term and a binary operator. Example (Infix):

      "hedgehog " and FILE:shakespeare.*



  17. Query Languages/Interfaces

    • Lucene does not per say have a query language since it contains only terms and modifiers (+,-, weight). These may be processed in any order. There are a number of classes to try to convert other languages into Lucene's model.
    • IB uses a vector/boolean information retrieval model. Queries are driven by an object oriented automata. These may be created as program objects or parses from any of a number of query languages. At the heart of IB's model is an RPN stack based language and there are a number a number of classes to convert from other query languages into RPN.

    • Lucene's boolean query class limits the number of clauses (typ. 1024). This makes sense since Lucene too limits the number of terms in a query (typ. 1024).
    • IB's boolean query class places not limits on the number of clauses, terms, operators etc.

    • Lucene is best when not finding records. Since there are no operators and just terms and predicates that apply to them (either weight or modality) they can without worry be easily rearranged. The query: A -B C D +E for instance can be quickly optimized by the constraints "must have" E and "can't have B.
    • IB is run by a query automata. It can perform many optimizations but it can't just build upon short circuits and programs will need to run their course except in some simple cases such as all terms "ANDed".


  18. Does the position of the matches in the text affect the scoring?

    • In Lucene: No, the position of matches within a field does not affect ranking.
    • In IB: Depends upon the score normalization model selected at search time. The CosineMetric normalization model, for instance, does use the position of the matches in text to affect the scoring. This is all search time and user selectable.

  19. Field differences

    • Lucene lacks diagnostics. Searching even in a field that does not exist just returns no results but without reason. Since fields are Case Sensitive in Lucene this is a frequent source of error.
    • IB contains diagnostics.

    • Lucene's fields are Case Sensitive. There is, to our knowledge, no way to switch it.
    • IB by default makes field names and paths case insensitive (as the case for SGML, SQL etc). Even through XML is case sensitive (and we were among those that opposed it) we are familiar with no productive XML document types and valid instances with two (or more) siblings differing only in case of their names. While possible in XML

      <organization><name>BSn</name><NAME>Edward C. Zimmermann</NAME></organization>

      its poor design just as there are reasons why domain names and email addresses too are not case dependent.


  20. Structure search

    • Lucene is a traditional inverted index fulltext engine. Its quite good at handling a limited number of fields but is inappropriate for use with arbitary trees.
    • IB is not based on "Inverted file indexes" and uses other algorithms. It has no limits on the length of terms, their frequency and and can support arbitary structures and paths, including overlap.

    • The granularity of Lucene (unit of retrieval) is the record as defined at the time of indexing.
    • IB allows for search-time dynamic granularity. The scale of grain (sentence, paragraph, document, chapter, book, collection, source, community, hub, inter-hub bridge, sphere, inter-sphere bridge) is defined by the result of the search and by the query.

    • The product of a search in Lucene is a identification for the record. To extract elements one must load the document (parse etc.) into an object model that supports addressing elements.
    • IB has no no need for a “middle layer” of content manipulation code. Instead of getting IDs, fetching documents, parsing them, and navigating the DOMs to find required elements, IB lets you simply request the elements you need and they are returned directly.