IB Search Engine Design // Beyond XML full-text search and beyond native XML databases

Search functionality (inclusive of ranking) is handled by the embedded IB sub-system. IB is a development of BSn's NONMONOTONIC Lab in Munich.

Background

Organizations of all sizes and within all industries generally distribute their corporate knowledge amid a variety of heterogeneous database applications: from customer relationship systems, staff directories, content management systems (CMS), electronic document and records management systems (EDRMS) to library catalogues.

Mainstream search engines are about finding any information: "a list of all documents containing a specific word or phrase”. Because of this, search engines paradoxically return both too much information (i.e., long lists of links) and too little information (i.e., links to content, not content itself). IB, by contrast, is about exploiting document structure, both implicit (XML and other markup) and explicit (visual groupings such as paragraph), to zero in on relevant sections of documents, not just links to documents.

IB intelligently provides information search and retrieval services to text, data (a large number of standard types such as numerical, ranges, dates etc), images/video/audio, geographic information, network objects and databases. It exploits document structure, both implicit (XML and other markup) and explicit (visual groupings such as paragraph), to zero in on and retrieve relevant information.

IB Key Features and Benefits:

  • Cost effective access to a heterogeneous mix of XML and other data of any shape and size
  • Sophisticated extendable type system allowing for numerical, date, geospatial and other search strategies parallel to textual methods: "Universal Indexing"
  • Clever and extendable ranking methods.
  • Distributed and highly scalable information retrieval and information discovery solution
  • No advance setup or preprocessing.
  • Easy maintenance
  • Rapid creation of scalable (XML) warehouse
  • API Language Independence— usage with multiple language bindings.
  • Among others: C++, Tcl, Python, Perl, PHP, Ruby, Java, C# APIs
  • Full ability to search specific structure/context in information without even knowing their details (such as tag or field names).
  • IB lets you simply request structural/contexual elements you need and they are returned directly.
  • User defined "search time" unit of retrieval: the structure of documents is exploited to identify which document elements (such as the appropriate chapter or page) to retrieve.
[More Key Features and Benefits]

The default modus is to index all the words and all the structure of documents. It provides powerful and fast search without prior knowledge about the content yet enables arbitrarily complex questions across all the content and from different perspectives. Not bound by the constraints of "records" as unit of information, one can immediately derive value from content with the flexibility to enhance content and the application incrementally over time without "breaking anything".

IB was designed from the ground up to address three key goals: universal SGML/XML (and other document formats) hierarchical/context search, distributed objects (transparent integrated views to other sources of information such as relational DBs, search services and object brokers) and to provide optimal support for features (current and future) of the ISO 23950 (ANSI/NISO Z39.50) Information Retrieval Protocol services standard to allow for standard interoperable interfaces. [More on engine design concept and motivations]

IB Search Sub-System

The IB engine (written in C++) provides extremely sophisticated contextual content search to heterogeneous (mix format) information including XML (with all the functionality of a native XML database) and SGML. It goes well beyond most XML full-text search solutions to support alongside text also other objects. Not limited to the XML paradigm and designed upon a more abstract model it can go beyond the more commonplace hierarchical text model (volumes, chapters, sections, paragraphs, sentences and even lines), fields and paths (XPaths and XQuery model) to (abstract hierarchical) path expressions and overlapping structures. In the IB engine a single index can have mixed document content--- and mechanisms are provided to even unify fields from different formats-- and these can be on the fly mixed with others to provide dynamic collections. While most all search engines demand that the unit of retrieval be defined prior to indexing as "record design" we not only can index a diverse collection of heterogeneously formated and structured information but allow also the "unit of retrieval" to be quasi-defined per search via abstract paths. Together these features enable state of the art context querying of a sort not found in any of the so-called XML search, full-text or IR engines.

Despite its speed, power and flexibility IB is tiny. It runs well on everything from low power embedded processors (as in our BLAU "information router" based on AMD's Geode) to supercomputer clusters and from traditional Unix environments (Solaris, True64, HP-UX, BSD) to Linux to even Microsoft's Windows (XP/Vista/Windows-7).

IB is embedded (as dynamically linked shared libs) into language interpretors such as Python, TCL, Perl, Ruby, PHP, Java and others via SWIG integration. In this application we have an interface (Drupal module) in PHP talking to (remote) HTTP based servers. The search server was written in IB embedded Python. On the search page there is a link request: which shows the server in action.

Relational enhanced hierarchical full text IR.

Trend over the past years has been to try to enhance RDBMs with some fulltext search functionality to try to provide the needed mining and information retrieval functionality. These products work but are sub-optimal: neither fish nor fowl. Their functionality is limited. Their performance tends to be mediocre. They are inflexible (needing to adhere to the relational table model) and they are large and bloated. IB goes the other way and extends the full text information retrieval paradigm to hierarchical models extended with objects, datatypes and relations.

Powerful Search

IB is, as one might expect, extremely powerful and supports:
  • heterogeneous collections of diverse information in different record formats, characters sets and languages.
  • Dynamic on-the-fly collections ("on demand" virtual indexes).
  • Index collection binding: multiple indexes can be imported into an index.
  • Exact, left and right truncated as well as general wild-card matching (glob),
  • Query by example (relevant feedback)
  • "More of same"/Similar records
  • Multiple object types (text, numbers, dates, bounded boxes and beyond)
  • Object oriented polymorphic operations (>,>=, <,<=, !=)
  • Proximity (within a container and by document order)
  • Term weighting
  • Search by named field (tag instance anywhere in a document tree) or path (in a specific location in the tree) and/or path type (a combination of "tag" and complex object attributes about the tag) but also unnamed paths (in the same field instance)
  • Fuzzy, phonetic matching
  • Polymorphic matching. A numeric field, for instance, can be searched as number (by number range) or as a series of characters (as if text).
  • Multi-lingual and multinational specialties
  • Thesauri
  • Term aliasing
  • Multiple query languages (among others CQL and its own flavours of Infix and RPN)
  • Search within the record and search within its extracorporal metadata including categorization
  • Join (a kind of intersection somewhat like AND) between result sets from different "physical" indexes (among records with the same key but in different collections).
  • Multiple models of ranking and results sorting
  • Priority (tuning some records to be more or less important in the ranking mix)
  • Search for not just results (records) but also for terms ("scan")

Query Model

Since XML (the underlying model for RSS, Atom and CAPS) has an explicit ancestry of nodes and paths we exploit them for contextual search. This allows for some interesting and powerful searches:
  • Find stories with words in the same unnamed node (for instance in the same Title or same Description).
  • Find stories with words in the same named node (for instance in the same explicit Title or same Description).
  • Find stories with words in the same named explicit node path (for instance in the same explicit Title of an image of a ...).
  • Select content of named nodes, named paths or object attributes.
  • Select content of descendants of a named ancestor of record hits. Descendants or ancestors may be specified as the name of a node or a path or even a sub-path. This allows one to dynamically at query time define a view and model (unit of recall) to content.
In RSS this means that one can search for items in the same, for example, item description and request the title and URL of the items that match. Just what one needs given the multiple items per feed!

Lets look a bit closer by example of XML fragments (from SGML/XML markup of Shakespeare's works by Jon Bosak):

<SPEECH>
 <SPEAKER>LADY MACBETH</SPEAKER>
 <LINE>Out, damned spot! out, I say!--One: two: why,</LINE>
 <LINE>then, 'tis time to do't.--Hell is murky!--Fie, my</LINE>
 <LINE>lord, fie! a soldier, and afeard? What need we</LINE>
 <LINE>fear who knows it, when none can call our power to</LINE>
 <LINE>account?--Yet who would have thought the old man</LINE>
 <LINE>to have had so much blood in him.</LINE>
</SPEECH>
First off I we have an idea of "nearness": being in the same leaf. The words "out" and "spot" are in the same node (with path ...\SPEECH\LINE ). Its named SPEECH ancestor is the above speech--- the only speech in all of Shakespeare's works to have the words "out" and "spot" in the same LINE. The SPEAKER descendant of that SPEECH is "LADY MACBETH". The word "spot" is said within the works, by contrast, in many other speeches by speakers in addition to Lady Macbeth: SALISBURY in `The Life and Death of King John', BRUTUS as well as ANTONY in `The Tragedy of Julius Caesar', MISTRESS QUICKLY in `The Merry Wives of Windsor', VALERIA in `The Tragedy of Coriolanus', ROSALIND in `As You Like It' and MARK ANTONY in `The Tragedy of Antony and Cleopatra'.
<Videos>
  <Video ASIN="B0000DJ7G9">
 <Title Screenplay="William Shakespeare" Alt="Othello">The Tragedy of Othello: The Moor of Venice</Title>
    <Director>Orson Welles
    <Length>90 minutes</Length>
    <Format>DVD</Format>
    <Certificationg>U</Certification>
  </Video>
<Videos>
In the above the value of the Screenplay attribute to Title is William Shakespeare. In IB one can search for the content of the tag (TITLE), the content of an attribute (named mapped with a default of TITLE@SCREENPLAY in this example) but also for information like films with Venice in the title whose screenplay was from Shakespeare. Here we appeal to an implicit direct parent (an abstract virtual container that contains the both attribute and field data). In XML we not only have a parent/child ancestry of nodes but we also have within nodes a linear ordered relationship. One letter follows the next and one word follows the other in a container. In the above example "Yet" precedes "here's" and "a" follows after and finishing with "spot". We have order and at at least a qualitative (intuitive) notion of distance. In XML we do not, however, have any well-defined order among the siblings (different LINEs). The XML 1.0 well-formedness definition specifically states that attributes are unordered and says also nothing about elements. Document order (how they are marked-up) and the order a conforming XML parser might decide to report the child elements of SPEECH might not be the same. Most systems handling XML from a disk and using popular parsers typically deliver it in the same order but the standard DOES NOT specify that it need be--- and for good reason. See also: Presentation Methods and Elements.

Order

Lady Macheth says "spot" in another speech too..
<SPEECH>
  <SPEAKER>LADY MACBETH<<SPEAKER>
  <LINE>Yet here's a spot.</LINE>
</SPEECH>
These "spot"s are in "PLAY\ACT\SCENE\SPEECH\LINE" In XML we not only have a parent/child ancestry of nodes but we also have within nodes a linear ordered relationship. One letter follows the next and one word follows the other. In the above example "Yet" precedes "here's" and "a" follows after and finishing with "spot". We have order and at at least a qualitative (intuitive) notion of distance. One could then specify an inclusion (within the same unnamed or named field or path), an order and even a character (octet) metric. We have not attempted to implement a word metric as the concept of word is more complicated then commonly held. Is edz@bsn.com a single word? Two words? One word? Maybe even 3? What about a URL? Hyphenation as in "auto-mobile"? Two words? On the other hand what does such a distance mean? Our metric of distance is defined as the file offsets (octets) as the record is stored on the file system. This too is less than clear cut as the render of Überzeugung and &Uuml;berzeugung are equivalent but their lengths are different.
zum Thema Präsentieren und Überzeugen
The file system offsets between the word "Thema" and "und" depends upon how "Präsentieren" was encoded. As UTF-7, as UCS-2, Latin-1 or with entities &uml; and &am;;Uuml; or &#nnn form. Does one, instead, treat the rendered level? This too is misleading since different output devices might have quite different layouts. What about columns? And the tags? Tags are, of course, even worse with words since we have started to associate a semantics for distance. Look at an XML fragment like
<name>Edward C. Zimmermann</name><company>NONMONOTOMIC Lab</company>
What is the word distance between "Lab" and "Edward" keeping in mind that XML markup is equivalent if NAME is specified before or after COMPANY.
<company>NONMONOTOMIC Lab</company><<name>Edward C. Zimmermann</name>
These are equivalent content and the word distance? That's right.. its not well defined! Order and inclusion do make sense, even some rough guide to distance but words? We are, of course, open to convincing examples!

Counting containers

IB can also count containers.
GetNodeOffsetCount(GPTYPE HitGp, STRING FieldNameorPath= "", FC *Fc= NULL)
where HitGp is the address of a hit, FieldNameofPath the name, resp. path, of the container we are interested in and Fc (optional) is its calculated start/end coordinates.

This is useful to be able to determine which page (paragraph or line) a hit is on but may also be used in an abstract sense. Example:

  <ingredients>
    <item>Chocolate</item>
    <item>Flour</item>
    <item>Butter</item>
    <item>eggs</item>
  </ingredients>
The oder is the order in the mark-up. Chocolate might be the 25th item in the document but its the first item in the particular instance of INGREDIENTS. This is possible since we can check the count of the instance of INGREDIENTS and can look at the previous we can count also the offset of ITEM.

Granularity of Search Results (ex-ante neo-Records)

IB provides the effective functionality of search time definition of unit of recall.

In conventional search/retrieval systems the fundamental unit of recall is the "record" defined prior to indexing. If, for example, one indexed the complete works of Shakespeare on a per-play basis then searching for terms would only return the unit of recall play (or an element thereof, for example, its TITLE).

In IB, by, contrast, one can define in a search query the unit of recall as ancestor/descendant of found (matched) elements. One can easily request (in the Shakespeare, index per play, example) the name of the acts (and the text of acts) that match a query just as one can ask for the speech (and its speaker) or even the lines as the unit of recall.

This functionality is mission critical to, among other fields, the search of literature and legal texts.

In IBU News its used to distinguish between stories (<item>s in RSS) and the feeds in which they belong. Indexing on a per feed basis (and not item) also allows us to search for inter-relationships between different items in the same feed to ask questions like "What news feed had both an article on Putin and on Ahmadinedjad and what were the stories". Sometimes the story is told in the context of multiple articles in a feed and not just in one.

Queries a la Nicolas Bourbaki

In IB with work with sets.

Result Sets

The result of a query is a set. Sets may be combined and operations performed upon them. IB supports many binary and unary operators.

The most primitive set is the set of records that contain a single term (word or phrase). IB supports also fielded data with extremely powerful and flexible methods. In its most basic form (when performing a search) you may specify a field, or a field path--- If no field is specified it means ANYWHERE in the record. Fields are specified with a / as in

field/term
Membership of a field too can be viewed as an operation acting upon the set of records that contain a term.

Terms may contain wild cards. IB supports the so-called " Glob Expression Syntax":

SyntaxSematics
*Match zero or more characters. term*, for example, is equivalent to right truncated search
?Match one character
[...]Match any of the characters (set) enclosed by the brackets. Characters '*" and '?" are interpreted in the set as normal characters and not as wildcards.
[!...]Any character NOT in the set is matched.
[.-.]A '-' between two characters denotes a range. The set [A-C], for example would match any character between A and C: namely 'A', 'B' or 'C'.
{.,.}..{.,.}Match, for example, {1,2}{a,b} to 1a 1b 2a 2b. The term "L{e,i}banon would match "Lebanon" as well as "Libanon".
\The character '\' is an escape. When used with wildcards or other special characters it means that the character should match itself and not have its special sematics. \*, for instance, matches '*'.

Operator overload and overloaded operators

IB's query languages also provide a set of operators with sematics specific to the object type of a field:
OperatorSemantics
< <=Objects like date, numbers etc. have an order and <, resp. <=, applies as one might expect. For general "string" fields its interpreted as the equivalent to "*" (right truncation)
=For general "string" fields its interpreted just as "/", viz. a member of the field.
> >=As </<= above. For general "string" types the sematics are left truncation (as in *term)

Query Languages

The IB engines supports multiple query languages and includes a well defined class structure to extend the engine with syntax for other languages. Out of the box it supports queries expressed in both Infix and RPN notations.

Infix Notation

Infix notation is the common arithmetic and logical formula notation, in which operators are written infix-style between the operands they act on (e.g. 2 + 2). See: IB Infix Notation.

RPN Notation

Reverse Polish notation (RPN), also known as postfix notation, was invented by Australian philosopher and computer scientist Charles Hamblin in the mid-1950s. It is derived from the Polish notation, which was introduced in 1920 by the Polish mathematician Jan Łukasiewicz.

In RPN the operands precede the operator, removing the need for parentheses. For example, the expression 3 * ( 4 + 7) would be written as 3 4 7 + *. See: IB RPN Notation.

Normalization and Ranking

In search we are looking for relevance. How do we rank relevance? The first step is to normalize a score for different records and the second step is to have a model as to how to compare these scores in their context.

See Normalization and ranking.

Theme by Danetsoft and Danang Probo Sayekti inspired by Maksimer