IB Engine Q&A

The following questions are typical of those frequently asked (their order depends upon who we are speaking with: software developer, university researcher, potential application user etc.):


What is IB?
IB is a search engine for heterogeneous (mixed-format) information, including SGML/XML. It is not limited to the XML paradigm: designed upon a more abstract model, it can go beyond the commonplace hierarchical text model (volumes, chapters, sections, paragraphs, sentences), fields and even paths (the XPath and XQuery model) to abstract path expressions, which allow search in more generic models than XML itself can express.
Can IB index PDF? Word? Excel?
Yes. Unlike databases, IB does not require conversion into a "common format". It supports W3C XML, ISO Standard 8879:1986 SGML and a wide range of common file formats (such as Word, Excel, RTF, PDF, PostScript, HTML, Mail, News), citation formats (such as BibTeX, EndNote, Medline, Papyrus, Refer, Reference Manager/RIS, Dialog, etc.), and scientific, ISO and industry formats, including standards such as USPTO Green Book (patents), DIF (Directory Interchange Format), CAP (Common Alerting Protocol) and many more.
Is it like Google?
What they have in common is that they both are about search and can run on cheap hardware. Google is, however, about finding something, anything, in a large collection of "other people's junk" while IB is about searching specific context in a more moderate collection of more controlled information. IB is about searching for specifics and not "anything about". IB tries to model the structure and context of information as best it can. Some forms of documents are already structured (such as those marked up in SGML/XML) but many other documents have well defined implicit structures (such as email messages and even most plain text) which we can exploit.
Can IB search document "zones"?
Of course. "Zones" are just fields. Since many full-text systems use inverted indexes and other models where fields and paths are expensive they try to keep their numbers down to bare necessity. IB, by contrast, allows for unlimited term length (suitable for genetic sequences) and unlimited fields, paths and structures. IB provides unlimited query flexibility without having to know in advance what questions users are going to ask.
Can IB handle natural language queries?
Yes. Matching records are, typically by default, sorted in relevance rank order. There are many settings to tune scores (normalization) and rankings.
Can I type some words like in Google and have them "And"ed?
Yes, if you wish, but AND is really boring. What does searching in the works of William Shakespeare for "love" and "hate" tell me? Not much more than that 33 plays used both words (or looking the other way, of the standard corpus, 4 plays did not use both). More interesting is to ask which plays used "love" and "hate" in the same speech. That's still 21 plays. In the same line? That's 14 plays (for example in `The Tragedy of Coriolanus', the line "Coriolanus neither to care whether they love or hate"). AND on a document (or record) level does not really tell me much; it only gets interesting on a structural, contextual level. On a record level, OR with a reasonable form of term-frequency-based relevance ranking is nearly always better.
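To make the difference between record-level AND and AND within a structural container concrete, here is a minimal Python sketch. It is not IB code and does not use the IB API: the toy corpus, the function names and the choice of "line" as the container are all assumptions made purely for illustration.

```python
from typing import Dict, List, Set

# Hypothetical toy corpus: each play is a list of lines (the "containers").
PLAYS: Dict[str, List[str]] = {
    "Play A": ["to love or hate is human", "nothing else here"],
    "Play B": ["all you need is love", "hate comes later"],
    "Play C": ["no matching words at all"],
}


def words(text: str) -> Set[str]:
    """Lower-cased set of whitespace-separated words in a piece of text."""
    return set(text.lower().split())


def and_record_level(plays: Dict[str, List[str]], terms: List[str]) -> List[str]:
    """Plays in which every term occurs somewhere in the play."""
    wanted = set(terms)
    return sorted(
        play for play, lines in plays.items()
        if wanted <= set().union(*(words(line) for line in lines))
    )


def and_same_container(plays: Dict[str, List[str]], terms: List[str]) -> List[str]:
    """Plays in which every term occurs together in at least one line."""
    wanted = set(terms)
    return sorted(
        play for play, lines in plays.items()
        if any(wanted <= words(line) for line in lines)
    )
```

On this toy data the record-level AND matches both "Play A" and "Play B", while the same-line AND matches only "Play A": the narrower the container, the more the match says about context.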
What other binary operations can you do?
A lot! See: Binary operators. We also have quite a few unary operators.
Do I have to type an RPN or infix query?
Of course not. We provide all kinds of search interfaces, including some that don't even require typing. We also have a search interface that we call "smart" since it can not only figure out whether a query is in infix or RPN notation or is just some words, but also search for those words in a manner that we've empirically found to best support intuitive search.
What does smart search do?
It first searches for the words as a phrase. Should that fail to find anything, it searches in context, looking for the words in the same structural container. If that too finds nothing, it does an interesting kind of OR that we call reduced OR.
What is "reduction" or reduced OR?
It's best, I think, to give an example: search for three words "apples" "oranges" "pears". If some records contain all 3 terms then we return only the set where all 3 terms were found. It's like an AND but we searched OR. If no records had all 3 but some records have 2 terms and others only 1 or none, we return those with 2 words. The actual algorithm is a bit trickier but that's just "little details". The significant observation is that smart only returns results as "AND" when it makes intuitive sense!
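The basic idea of reduction can be sketched in a few lines of Python. This is a simplified illustration only, leaving out the "little details" mentioned above; the function name, the record layout (an id mapped to plain text) and the whitespace tokenization are assumptions for the example and not part of IB.

```python
from typing import Dict, Iterable, List


def reduced_or(records: Dict[str, str], terms: Iterable[str]) -> List[str]:
    """Return the ids of the records matching the largest number of distinct terms."""
    wanted = {term.lower() for term in terms}
    matched: Dict[str, int] = {}
    for record_id, text in records.items():
        count = len(wanted & set(text.lower().split()))
        if count:
            matched[record_id] = count
    if not matched:
        return []
    best = max(matched.values())
    # Keep only the records that reach the best coverage found anywhere.
    return sorted(rid for rid, count in matched.items() if count == best)


# Hypothetical sample records.
RECORDS = {
    "r1": "apples oranges",
    "r2": "pears",
    "r3": "bananas",
}
```

With these sample records, a search for "apples", "oranges" and "pears" returns only "r1" (two of three terms); add a record containing all three words and only that record comes back.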
Can you give me an example of a smart search, say applied to the Shakespeare corpus?
The search expression "I am a jew" does not return all the records where "I" or "am" or "a" or "jew" occur but precisely two plays: `The Merchant of Venice' (where the phrase is in the line "enemies; and what's his reason? I am a Jew. Hath" as spoken by Shylock and in "me in heaven, because I am a Jew's daughter: and he" by Jessica) and `Much Ado about Nothing' (in "love her, I am a Jew. I will go get her picture." spoken by Benedick).
The search, however, for "love scorn" returns: `Much Ado about Nothing' and `The Two Gentlemen of Verona' since while there is no "love scorn" as a phrase the words "love" and "scorn" do appear in the same line in those plays (for example "of his own scorn by failing in love: and such a man" spoken by Benedick).
The model does not depend upon a specific known container such as line but is entirely generic. The search for "enter Philo" finds the stage direction "Enter DEMETRIUS and PHILO" in SCENE I/Act I of `The Tragedy of Antony and Cleopatra'.
What ranking models do you support?
Ranking by score, by date, or by a mix of both, as well as sorting by other "features" such as key, with various other "factors" such as priority and record category.
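The cascade behind these examples (phrase, then same container, then reduced OR) can be sketched as follows. This is a toy illustration and not how IB is implemented: the corpus layout (a record mapped to a list of container texts), the tokenizer and all names are assumptions, and the sample lines are the ones quoted in this Q&A.

```python
from typing import Dict, List, Tuple

PUNCTUATION = ".,;:!?'\""


def tokens(text: str) -> List[str]:
    """Lower-case words with surrounding punctuation stripped."""
    stripped = (word.strip(PUNCTUATION).lower() for word in text.split())
    return [word for word in stripped if word]


def has_phrase(container: List[str], phrase: List[str]) -> bool:
    """True if the phrase tokens occur consecutively in the container."""
    n = len(phrase)
    return any(container[i:i + n] == phrase for i in range(len(container) - n + 1))


def smart_search(corpus: Dict[str, List[str]], query: str) -> Tuple[str, List[str]]:
    """Return (stage that matched, sorted record ids)."""
    q = tokens(query)
    if not q:
        return ("none", [])
    tokenized = {rid: [tokens(c) for c in containers] for rid, containers in corpus.items()}

    # Stage 1: the words as a phrase.
    found = sorted(rid for rid, cs in tokenized.items() if any(has_phrase(c, q) for c in cs))
    if found:
        return ("phrase", found)

    # Stage 2: all words within one structural container.
    wanted = set(q)
    found = sorted(rid for rid, cs in tokenized.items() if any(wanted <= set(c) for c in cs))
    if found:
        return ("container", found)

    # Stage 3: reduced OR, keeping the records that match the most distinct words.
    counts = {rid: len(wanted & {w for c in cs for w in c}) for rid, cs in tokenized.items()}
    best = max(counts.values(), default=0)
    if best == 0:
        return ("none", [])
    return ("reduced-or", sorted(rid for rid, n in counts.items() if n == best))


CORPUS = {
    "The Merchant of Venice": ["enemies; and what's his reason? I am a Jew. Hath"],
    "Much Ado about Nothing": [
        "love her, I am a Jew. I will go get her picture.",
        "of his own scorn by failing in love: and such a man",
    ],
    "Antony and Cleopatra": ["Enter DEMETRIUS and PHILO"],
}
```

On this three-play toy corpus, "I am a jew" is resolved at the phrase stage, while "love scorn" and "enter Philo" fall through to the container stage. The container here happens to be a line or a stage direction, but nothing in the sketch depends on that.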
How is score ranked and compared?
IB provides a choice of several score normalization models including Cosine and adapted metric Cosine (where the distance between terms is considered) and other more esoteric score normalization models.
What's behind "adapted metric Cosine normalization"?
We observe that standard Cosine normalization produces weights that are too large for short documents and too small for long documents. Adjusting for length is, however, insufficient, as we also observe that terms can occur in quite different sections of a large document and be less relevant there than in a small, more concise document (compare an article on a specific topic to a book covering a wider field). In adapted metric Cosine normalization we adjust for both document size and the distance in the document between matching terms.
You also have something called Newsrank?
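The distance part of this idea can be illustrated with a small Python sketch. To be clear, this is not IB's formula: the way proximity is turned into a factor, the `proximity_weight` parameter and all function names are assumptions chosen only to show how two documents with identical term counts can score differently once the distance between matching terms is considered.

```python
import math
from collections import Counter
from typing import List, Optional


def cosine_score(doc_tokens: List[str], query_terms: List[str]) -> float:
    """Term-frequency score divided by the Euclidean norm of the document vector."""
    tf = Counter(doc_tokens)
    norm = math.sqrt(sum(count * count for count in tf.values()))
    if norm == 0:
        return 0.0
    return sum(tf[term] for term in set(query_terms)) / norm


def min_span(doc_tokens: List[str], query_terms: List[str]) -> Optional[int]:
    """Length of the smallest token window containing every query term, or None."""
    wanted = set(query_terms)
    last_seen = {}
    best = None
    for position, token in enumerate(doc_tokens):
        if token in wanted:
            last_seen[token] = position
            if len(last_seen) == len(wanted):
                span = position - min(last_seen.values()) + 1
                if best is None or span < best:
                    best = span
    return best


def proximity_adjusted_score(doc_tokens: List[str], query_terms: List[str],
                             proximity_weight: float = 1.0) -> float:
    """Cosine score boosted when all query terms sit close together (assumed form)."""
    base = cosine_score(doc_tokens, query_terms)
    span = min_span(doc_tokens, query_terms)
    if span is None:
        return base  # not all terms present: no proximity boost
    proximity = len(set(query_terms)) / span  # 1.0 when the terms are adjacent
    return base * (1.0 + proximity_weight * proximity)
```

Two six-word documents that each contain "apples" and "oranges" once get the same plain cosine score, but the one where the words are adjacent scores higher after the proximity adjustment than the one where they sit at opposite ends.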
Yes. It throws in a non-linear model of time and significance into the adapted metric Cosine normalization mix. The basic idea is that of two similar stories (in News) the newer one should be considered more relevant.
What about favored record ranking such as "Page Rank"?
IB has several models that combine the normalized score and a priority scalar in a set of linear equations to produce a "resolved score". This can be useful to "tune" the score of specific documents to be more or less relevant. The priority scalar is defined (or calculated) a priori as a measure for each record in the index. The priority scalars are typically derived from link counts, popularity metrics and/or as the result of editorial review. It's a number de-coupled from the actual content and search, and it can have a history.
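As a rough illustration of how a priority scalar can shift a ranking, here is a minimal Python sketch of one linear combination. The single equation, the weight values and the names (`Hit`, `resolve`, `rerank`) are assumptions for the example; IB's actual resolver models are not described here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hit:
    record_id: str
    score: float     # normalized relevance score from the search
    priority: float  # a priori priority scalar stored with the record


def resolve(hit: Hit, score_weight: float = 1.0, priority_weight: float = 0.5) -> float:
    """Hypothetical resolved score: a weighted sum of score and priority."""
    return score_weight * hit.score + priority_weight * hit.priority


def rerank(hits: List[Hit], **weights: float) -> List[Hit]:
    """Sort hits by resolved score, highest first."""
    return sorted(hits, key=lambda hit: resolve(hit, **weights), reverse=True)


# Hypothetical result list: "b" is slightly less relevant but editorially favored.
HITS = [Hit("a", 0.8, 0.0), Hit("b", 0.7, 0.5)]
```

With the default weights, record "b" overtakes "a"; with `priority_weight=0.0` the ordering falls back to pure relevance.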
How else can ranking be "tuned"?
By category. Each record has a category, and the score resolver has further linear equations that, on the basis of some scalars (called "magnetism"), effect micro-movements of score (within a list) to bring records of similar category closer together (or move them farther apart).

Does IB support federated search?
YES. IB was designed from the ground up to support ISO 23950 (ANSI/NISO Z39.50) and its developers have been active core members of the ZIG (Z39.50 Implementors Group) since the early 1990s. We've also long been involved at core level in numerous federated search projects including GILS (Global Information Locator Service), ASF (Advanced Search Facility) and SRW/U (Search/Retrieve via the Web or URL).
What is a federated search?
A federated search consists of a search through a disparate group of databases connected via a network (such as "the Internet"). Queries (searches) are broadcast to the selected databases; the results are collated and brought together in a unified format.

Federated search is distinct from the functioning of traditional search engines, which are centralized depots that use robots to continuously index and crawl through web-based content, retrieving results from previously cached documents that match the query terms.
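The broadcast-and-collate pattern can be sketched in Python as below. This is a generic illustration and not IB or Z39.50 code: the targets are stand-in functions created by a hypothetical `make_target` helper, and merging by a single "score" field is a simplification, since scores from independent systems are not directly comparable in practice.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple

Target = Callable[[str], List[dict]]


def make_target(name: str, docs: List[Tuple[str, float]]) -> Target:
    """Build a stand-in search target over (title, score) pairs."""
    def search(query: str) -> List[dict]:
        q = query.lower()
        return [
            {"source": name, "title": title, "score": score}
            for title, score in docs
            if q in title.lower()
        ]
    return search


def federated_search(targets: Dict[str, Target], query: str) -> List[dict]:
    """Broadcast the query to every target in parallel and collate the results."""
    merged: List[dict] = []
    with ThreadPoolExecutor(max_workers=max(1, len(targets))) as pool:
        futures = {name: pool.submit(search, query) for name, search in targets.items()}
        for name, future in futures.items():
            try:
                merged.extend(future.result(timeout=5))
            except Exception:
                # A failing or slow target should not sink the whole search.
                continue
    # Present everything in one unified, ordered list.
    return sorted(merged, key=lambda record: record["score"], reverse=True)


TARGETS = {
    "library": make_target("library", [("XML primer", 0.9), ("SGML handbook", 0.4)]),
    "lab": make_target("lab", [("Parsing XML fast", 0.7)]),
}
```

Calling `federated_search(TARGETS, "xml")` returns one list containing hits from both targets, each record tagged with the source it came from.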

What are centroids?
One of the strengths of federated search is that it's distributed across multiple targets (machines etc.). It scales well to moderate sized collections of machines (100s or even 1000s) in a search federation but not to the kind of endless sea possible on the Internet. The idea behind centroids (which goes back to DNS or domain name service models) is to be able to define dynamic federated networks as the product of search: "Who has relevant information to my query?"

Which database (RDBMS) does IB use?
It does not use an RDBMS but its own technology (it can, however, access RDBMSs as external object procedures).
What about using an RDBMS?
Databases solve a different problem. They usually don't have many of the traditional search engine features (ranking, linguistics) and are not designed to handle typical search engine queries.
Many relational databases now offer a full-text-search feature. How do they work?
These systems have glued text-indexing into their relational model via foreign key joins. While it kind-of works, it's extremely inefficient and best suited to relatively small collections and a very limited number of fields.
Can IB do joins?
Yes. IB can do a kind of join across different indexes (akin to joining two or more tables together in a relational db). The difference, however, between IB joins and typical RDBMS joins is that IB requires common a priori (index time) keys, while with an RDBMS one can select at search time any field to use as a key.
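Conceptually, a join on a key fixed at index time looks like the following Python sketch. The record layout (dictionaries carrying a "key" field) and the function name are assumptions for illustration only; this is not the IB join API.

```python
from typing import Dict, List, Tuple


def join_on_key(left_hits: List[dict], right_hits: List[dict]) -> List[Tuple[dict, dict]]:
    """Pair up hits from two indexes that share the same pre-assigned key."""
    right_by_key: Dict[str, List[dict]] = {}
    for record in right_hits:
        right_by_key.setdefault(record["key"], []).append(record)
    return [
        (left, right)
        for left in left_hits
        for right in right_by_key.get(left["key"], [])
    ]


# Hypothetical hits from two separate indexes; "key" was assigned at index time.
PATENT_HITS = [{"key": "P-1", "title": "Widget"}, {"key": "P-2", "title": "Gadget"}]
CITATION_HITS = [{"key": "P-2", "cited_by": "Paper X"}, {"key": "P-3", "cited_by": "Paper Y"}]
```

Only the record with key "P-2" appears in both result sets, so the join yields a single pair. Because the key is known in advance, no search-time scan over arbitrary fields is needed.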
Why can't I choose any field at search time for use as a key?
Aside from the observation that generic keys are extremely expensive and inefficient operations, we've not yet seen an appropriate information retrieval application that would require them. Relational databases demand them since without them they are no longer relational. Relational databases are about creating and managing relations between columns and tables of data. IR systems are, by contrast, about exploiting these defined relations as structure for discovery.
Are documents stored within IB, or as separate files?
IB indexes documents on storage (disk or networks). It is not a database but an information retrieval system. The files and documents can stay where they are.
Does IB need access to these documents during search?
Yes (and no). IB is designed to require access to the indexed files, but we have several customers that use IB without them, exploiting some of the failsafe and caching features of the index.
Can IB index relational databases?
Of course. Data from databases are easily exported as reports. Since IB supports fields and structure, these reports can have rich structure representing the underlying relational models. Together with IB's capability for a dynamic, search-time unit of retrieval, one can gain many search possibilities (not just performance and flexibility) beyond those that were inherent in the database.
Can one glue IB together with relational databases?
Yes. IB is not just an excellent database accelerator (indexing a database) but provides models to allow data to be routed to and from relational systems. A common use of IB, for instance, is to have volatile e-commerce fields (such as price and available seats or quantity) transparently query the relevant databases (each field or path can be individually directed to a system or object broker). These objects don't need to be exported from their systems or imported into IB, but can stay live in their own systems for complete synchronicity: having one's metaphorical cake and eating it too!
How long does it take to import data into IB?
As fast as the data can be indexed, and even while it's being indexed, and the indexer is pretty fast. Functional Append/Delete/Modify and transaction-consistent revision information in IB deliver consistent and up-to-date information without the time lag typical of many search engines. The latency common to all popular search engines (and their appliances) when a new document is added to a system is fully unacceptable. As in an RDBMS, as soon as new information is indexed (committed) it is included in all results.
Can I get all the answers to my query as I do with an RDBMS?
Of course! What's the point of being told that there are 1 million results for a query and only being able to see the first 1000 or so of them (chosen by the system using "secret" or "black box" means)? That's perhaps fine when looking for "anything" (and nothing particular) in a large collection of rubbish but it's fully unacceptable when looking for specific information.


What platforms does it run on?
Various flavours of Unix including FreeBSD and Solaris (Intel, AMD64 and SPARC), Apple OS X, Linux and MS Windows (XP, Vista, Win7).
Do you have an appliance version?
YES. That's our BLAU appliance. Totally silent and drawing only 10 watts of power, yet sufficient to handle GB of information, it's, we think, the perfect workgroup search and retrieval device.

Is the IB core open-source freeware?
No. We're strong and long-standing open source developers. Over the past decades we've been major contributors to a large number of open source projects, including popular software packages such as wxWindows/wxWidgets, Isite/Isearch and wxPython, to name just a few. IB started off life as Open Source but we took it out following wide-ranging abuse of our idealism and intellectual property by a handful of Fortune 50 corporations and DOT COM darlings. What remains "Open Source" are our applications of IB.
Is IB expensive?
We're still idealists and so pricing is on the basis of "ability to pay". There is no such thing as "can't afford" (and that includes sometimes even the hardware).
What is the relationship between IB and Isearch?
IB started off life in a cooperation between BSn and CNIDR as Isearch.
It was the first search engine to be designed from the ground up to support SGML and ISO 23950 search and retrieval. It included many innovations including the "document type" model and was one of the first engines (if not the first) to ever support XML. Development diverged to special versions for the USPTO and IB as a commercial split-off to provide a higher quality, significantly higher performance "Isearch". The IB API is a mega-super-set of Isearch and is fully compatible. IB has gone through a number of significant quantum leaps and made numerous technological breakthroughs but its lineage remains.


Can IB be licensed for OEM products?
Yes. It's already embedded into some products.
Is it network aware?
Of course. The engine was built from the ground up for use in networked applications (ISO 23950 / ANSI/NISO Z39.50, to be precise) and is fully conforming.
What kind of resources does it need?
It's a small application designed to use whatever resources are available. It can run well on anything from a small embedded processor to a big Blue Gene.
How big are the indexes?
The disk space required for indexes depends upon a lot of factors, including the structure of documents, their word distribution and frequency, the size of documents and a host of other factors. We have, in fact, a mathematical model to estimate the requirements, but as a simple rough ballpark guide it's anywhere between 1/3 of and slightly larger than the space for the originals. We don't, however, use compression ourselves as that would, we fear, interfere with some of the embedded file system and sub-system compressions already in place on many systems. Our indexes are compressible by up to a factor of 3.

In what language is IB written?
The core IB engine is written in C++ with hooks to standard C for customer extension. Many IB applications, however, tend to use one of several interpreted languages such as Python.
How is IB used in Python?
It's loaded as a Python extension. IB provides several loadable modules for an assortment of interpreted languages.
What other interpreted languages are supported?
The most advanced module (and the one we use ourselves most) is Python but Tcl (by far the second most popular script language for IB), Ruby, Perl and even Java and PHP are available.
What is the interface design?
We use SWIG (to which we have also been an early contributor).
Does the Python module work under wxPython?
Of course (as one of the initial developers)!