Indexing

IB indexing


The core of the engine platform is our innovative indexing design. It is built around the goal of providing abstract search into a wide range of explicitly and implicitly structured information sources, with advanced query capabilities for high-performance search and information access across the entire spectrum of structured and unstructured enterprise data, information resources and services (such as databases).

The IB engine is designed for and supports heterogeneous data sources. Fields, many of them implicit, are detected automatically according to the document format and treated as if they had been tagged (implicit auto-tagging). Examples are lines, sentences, paragraphs and pages in plain text, or subject, author, references and email addresses in email. In PDF, for example, the fields include not only the document properties (the metadata or info section) that PDF documents define, with handling of their types such as dates, but also the implicit textual structure of the content: sentences, paragraphs and pages.
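The idea of implicit auto-tagging can be illustrated with a small sketch. This is not IB's code or API: the function name implicit_fields, the field names and the naive regular expressions are all assumptions made for illustration. It only shows how line, sentence and paragraph fields could be derived as character spans from untagged plain text.

```python
import re
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def implicit_fields(text: str) -> Dict[str, List[Span]]:
    """Derive line, sentence and paragraph spans from untagged plain text.

    Hypothetical helper for illustration only; the rules are deliberately naive.
    """
    fields: Dict[str, List[Span]] = {"line": [], "sentence": [], "paragraph": []}

    # A line is a run of characters without a newline.
    for m in re.finditer(r"[^\n]+", text):
        fields["line"].append(m.span())

    # A paragraph is a run of consecutive non-empty lines.
    for p in re.finditer(r"[^\n]+(?:\n[^\n]+)*", text):
        fields["paragraph"].append(p.span())
        # A sentence runs up to the next '.', '!' or '?' inside the paragraph,
        # so it may continue across a line break.
        for s in re.finditer(r"[^.!?\s][^.!?]*[.!?]?", p.group()):
            fields["sentence"].append((p.start() + s.start(), p.start() + s.end()))

    return fields


sample = "First line of text. Second sentence\ncontinues here.\n\nNew paragraph."
for name, spans in implicit_fields(sample).items():
    print(name, [sample[a:b] for a, b in spans])
```

In this sample the second sentence continues across a line break, so a sentence span and a line span cross each other even though the text contains no tags at all.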

Indexed information may come from a local document (such as PDF, Word, email folders etc.) or from a remote source, or the field (container) itself may be "linked" into another service.

Since IB uses mathematical models and algorithms other than the all-too-common inverted index design, it has no limits on the frequency of words, term length (IB has been used, for example, in genomic search applications), number of fields or complexity of structured data. It can even support overlap, where fields or structures cross each other's boundaries (common examples are quotes, lines/sentences, biblical verses, annotations).
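Overlap is easiest to picture when fields are treated as independent spans over the text rather than as nodes of a strictly nested tree. The following sketch makes that assumption explicit; FieldSpan and fields_containing are hypothetical names and say nothing about IB's internal data structures.

```python
from typing import List, NamedTuple


class FieldSpan(NamedTuple):
    name: str
    start: int
    end: int  # exclusive


def fields_containing(spans: List[FieldSpan], start: int, end: int) -> List[str]:
    """Return the names of all fields that fully contain the hit [start, end)."""
    return [s.name for s in spans if s.start <= start and end <= s.end]


text = 'He said: "It is late.\nWe should go." Then he left.'
newline = text.index("\n")

# Two line fields and one quote field; the quote crosses the line boundary.
spans = [
    FieldSpan("line", 0, newline),
    FieldSpan("line", newline + 1, len(text)),
    FieldSpan("quote", text.index('"'), text.rindex('"') + 1),
]

for word in ("late", "go", "left"):
    pos = text.index(word)
    print(word, fields_containing(spans, pos, pos + len(word)))
```

Because each span is checked on its own, the quote can start in one line and end in the next without any special handling.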

While IB does not require stop words, it can take them into account. IB supports stop words on a per-language basis (the language of the document) and allows distinct lists for indexing (terms excluded from the index) and for search (terms excluded as ordinary search terms). Common practice is to index each and every term and to use stop words, if at all, only during search. The term "war", for example, might not be very significant in German (where it means "was") but means something quite different in English (conflict, the name of a 1960s funk band etc.).
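A minimal sketch of that practice, assuming simple in-memory word lists: the dictionaries INDEX_STOPWORDS and SEARCH_STOPWORDS, their contents and both function names are invented for illustration and are not IB configuration.

```python
from typing import Dict, List, Set

# Assumed example lists. Nothing is dropped at index time; stop words are
# applied only at search time and depend on the document language.
INDEX_STOPWORDS: Dict[str, Set[str]] = {"de": set(), "en": set()}
SEARCH_STOPWORDS: Dict[str, Set[str]] = {
    "de": {"war", "der", "die", "das"},
    "en": {"the", "was", "a", "and"},
}


def index_terms(tokens: List[str], lang: str) -> List[str]:
    """Terms that would be written to the index for a document in `lang`."""
    stop = INDEX_STOPWORDS.get(lang, set())
    return [t for t in tokens if t.lower() not in stop]


def query_terms(tokens: List[str], lang: str) -> List[str]:
    """Terms that would be searched as ordinary terms for language `lang`."""
    stop = SEARCH_STOPWORDS.get(lang, set())
    return [t for t in tokens if t.lower() not in stop]


print(index_terms(["Der", "Krieg", "war", "lang"], "de"))  # every term is indexed
print(query_terms(["Der", "Krieg", "war", "lang"], "de"))  # "war" dropped in German
print(query_terms(["war", "and", "peace"], "en"))          # "war" kept in English
```

Keeping the index complete means the search-time list can be changed later without re-indexing.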

IB is not just textual but supports a large number of object types: numerical, range, geospatial etc. IB is unique among full-text systems in that it provides numerous object types with their own search methods and allows these to be viewed in parallel as text: a date field, for instance, can be searched as a date but also as text, by searching for the words in the field. These objects don't even have to be part of any document but may be made available through interface glue into other systems via ODBC, CORBA or object embedding. This allows indexed content (for example from RSS/XML) to be stored in and searched from other systems, which is useful in many dynamic applications in commerce and trading (keeping live counts of goods on hand, selling prices, etc.). Objects don't always have to be explicitly defined either, as various doctypes (document handlers) can automatically detect a number of field datatypes at index time (for example, that something is a telephone number or a date), provided the detection is enabled or, where it is on by default, not disabled.
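The dual view of a field, as a typed object and as text, might be sketched as follows. The Field class, parse_date and the single "day month-name year" date pattern are assumptions for illustration; IB's doctypes recognise far more formats, and their real interfaces are not shown here.

```python
import re
from datetime import date
from typing import Optional

MONTHS = ["january", "february", "march", "april", "may", "june", "july",
          "august", "september", "october", "november", "december"]
_DATE_RE = re.compile(r"(\d{1,2})\s+([A-Za-z]+)\s+(\d{4})")


def parse_date(value: str) -> Optional[date]:
    """Detect a 'day month-name year' date; return None for anything else."""
    m = _DATE_RE.fullmatch(value.strip())
    if not m or m.group(2).lower() not in MONTHS:
        return None
    try:
        return date(int(m.group(3)), MONTHS.index(m.group(2).lower()) + 1,
                    int(m.group(1)))
    except ValueError:  # for example 31 February
        return None


class Field:
    """A field kept both as raw text and, when detected, as a typed date."""

    def __init__(self, name: str, raw: str) -> None:
        self.name = name
        self.raw = raw
        self.as_date = parse_date(raw)  # typed view, None if not a date

    def matches_word(self, word: str) -> bool:
        """Text view: does the field contain this word?"""
        return word.lower() in re.findall(r"\w+", self.raw.lower())

    def in_date_range(self, lo: date, hi: date) -> bool:
        """Object view: does the date fall within [lo, hi]?"""
        return self.as_date is not None and lo <= self.as_date <= hi


published = Field("published", "12 March 2004")
print(published.matches_word("march"))
print(published.in_date_range(date(2004, 1, 1), date(2004, 12, 31)))
```

The same stored value answers both a word query and a range query, which is the behaviour described above for date fields.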

Performance

Despite all these features, IB indexing is lean and very fast. Take, for instance, IB indexing the "large Reuters collection":

      Machine: FreeBSD 5.4 AMD Athlon 64-bit CPU single core 2.4 GHz.
      Elapsed "real" time:    15074 seconds
      CPU time:               4317.21 seconds
          User time:          2112.4 seconds
          System time:        2204.8 seconds
      CPU/Real:               28.6401%
      Max resident size:      349136k
      Shared text memory:     69016k
      Unshared data:          2550694k
      Unshared stack:         552133k
      Page reclaims:          6752069
             faults:          2149767
      Swaps:                  0
      File system in events:  173002
                 out event:   1098098
      Context switches Vol.:  2328337
                     Invol.:  1999956
      Total records added:    806791
      Total words added:      283291747

That's under 72 minutes of CPU time (about half of it in system calls to open files etc.) to index one year of Reuters newswire, including parsing the XML, with a peak resident memory size for the entire process of 349136k.
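The summary figures can be re-derived from the reported numbers. The short calculation below uses only values from the benchmark listing; the throughput and words-per-record figures are plain arithmetic on those values and were not part of the original report.

```python
# Values copied from the benchmark listing.
real_s = 15074        # elapsed "real" time in seconds
cpu_s = 4317.21       # total CPU time in seconds
user_s = 2112.4       # user time in seconds
sys_s = 2204.8        # system time in seconds
records = 806791      # total records added
words = 283291747     # total words added

cpu_share = cpu_s / real_s          # fraction of elapsed time spent on the CPU
cpu_minutes = cpu_s / 60            # CPU time in minutes
system_share = sys_s / cpu_s        # fraction of CPU time spent in system calls
words_per_cpu_s = words / cpu_s     # derived indexing throughput per CPU second
words_per_record = words / records  # derived average record length in words

print(f"CPU/Real:          {cpu_share:.4%}")
print(f"CPU minutes:       {cpu_minutes:.2f}")
print(f"System share:      {system_share:.1%}")
print(f"Words per CPU s:   {words_per_cpu_s:.0f}")
print(f"Words per record:  {words_per_record:.1f}")
```

The low CPU/Real ratio shows that for most of the elapsed time the process was waiting rather than computing.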

The faster the storage (disk access and read performance) and the faster the I/O (hardware and O/S design), the faster the indexing. [NOTE: the older hardware used for the above benchmark was intentionally chosen to represent what is probably the absolute lowest end of performance. Using more contemporary flash disks in more modern PCs with their high bus and memory clock rates, we have seen the above reference index run in a fraction of the time.]