IB Engine Key Features
IB Key Features and Benefits:
In a nutshell
| Feature | Description |
|---|---|
| Easy to use | Automatically detects and indexes a large number of document formats, both structured (such as XML) and unstructured, and these can be mixed. Automatic structure recognition and identification for "unstructured" textual formats (e.g., alongside metadata, the lines, sentences, paragraphs and pages in PDF documents). |
| Advanced Linguistics | Match against synonyms and thesauri. Support for per-language stopwords etc. |
| Advanced Query Processing | Sophisticated natural and boolean query language supporting wildcards, query term weighting, parentheses and ranges of values, paths etc. Support for term auto-completion, spell correction (in real time against the words in the index, not against some dictionary) etc. |
| Advanced Results Processing | Sophisticated, flexible, configurable and extendable scoring and ranking system (including support for geospatial ranking). |
| Secure | Content-level security and multiple search views to the same index. |
| High Performance | Fast, low footprint and scalable. |
- Cost effective access to a heterogeneous mix of XML and other data of any shape and size. Allows for the rapid creation of scalable (XML) warehouses.
- All the capabilities you can ever expect in an enterprise search solution and then some: including phrase, boolean, proximity, wildcard, parametric, range, phonetic, fuzzy, thesauri, polymorphism, datatypes (including numeric, dates, geospatial, ranges etc.) and object capabilities.
- Relevance ranking by a number of models, including spatial score for geospatial queries, date, term frequency, match distribution etc.
- Extendable ranking system with a large number of scoring and sorting methods.
- Object-oriented document model: Supports W3C XML, ISO Standard 8879:1986 SGML and a wide range of common file formats (such as Word, Excel, RTF, PDF, PostScript, HTML, Mail, News), citation formats (such as BibTeX, EndNote, Medline, Papyrus, Refer, Reference Manager/RIS, Dialog, etc.), and scientific, ISO and industry formats, including standards such as USPTO Green Book (patents), DIF (Directory Interchange Format), CAP (Common Alerting Protocol) and many more.
- Automatic structure recognition and identification for "unstructured" textual formats (e.g., alongside metadata, the lines, sentences, paragraphs and pages in PDF documents).
- Sophisticated extendable type system allowing for numerical, date, geospatial and other search strategies, including external datastores and brokers parallel to textual methods: "Universal Indexing".
- Synchronized information: As soon as content is indexed (appended) it is available. Functional Append/Delete/Modify and transaction-consistent revision information deliver consistent, up-to-date information without the time lag typical of many search engines.
- True search term highlighting (exactly what the query found, structure etc.) including Adobe Acrobat PDF Highlighting.
- Extendable/Embeddable/Programmable: Java, Python, Tcl, C++ and other language APIs.
- Support for a number of information retrieval protocols including ISO 23950 / ANSI NISO Z39.50, SRW/U and OpenSearch.
- Runs on a wide range of hardware and operating systems.
- Easy to maintain, tiny, scalable and fast. Energy efficient: one can start off with inexpensive low-power hardware. Our Blau Appliance, for example, draws only 10 watts of power and is sufficient to handle several concurrent user sessions searching gigabytes of data while still delivering search performance measured in fractions of a second.
- Does not demand advance setup or preprocessing.
- Unlike most search engines IB is not based on "Inverted file indexes". Because of the limitation of "inverted indexes" most search engines typically index text (excluding common words and long terms as "stop words") and only a fixed and limited number of pre-determined, additional fields (since they are expensive). IB, by contrast, allows for unlimited term length (suitable for genetic sequences) and unlimited fields, paths and structures. IB provides unlimited query flexibility without having to know in advance what questions users are going to ask.
- Unlike databases IB does not require conversion into a "common format".
- Unlike XML databases IB also supports non-hierarchical structures and overlap.
- Unlike search engines IB indexes all elements, their structure, and their contents. This means that one can quickly evaluate text queries, structural queries, and queries that combine text, objects (numerical, geospatial etc.) and structural constraints (e.g., find diagram captions that mention engine in articles whose title contains Airbus).
- Virtual "indexes" allow for the design of logically segmented information indexes and fast on-demand search of arbitrary combinations thereof. Via the field and path mapping architecture this can be implemented in a way that is completely transparent to search.
- Index collection binding: multiple indexes can be imported into an index. This allows for the custom creation of indexes on the basis of a large catalog of indexes, which is highly relevant to publishers, as their customers tend to subscribe to only a subset of products (e.g. journals).
- Full ability to search specific structure/context in information without even knowing their details (such as tag or field names).
- User defined "search time" unit of retrieval: the structure of documents is exploited to identify which document elements (such as the appropriate chapter or page) to retrieve. No need for intermediate documents or re-indexing.
- No need for a “middle layer” of content manipulation code. Instead of getting URLs from a search engine, fetching documents, parsing them, and navigating the DOMs to find required elements, IB lets you simply request the elements you need and they are returned directly.
- "Any-to-Any" architecture: On-the-fly XML and other formats.
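To make the combined text and structural query from the list above concrete ("find diagram captions that mention engine in articles whose title contains Airbus"), here is a minimal sketch in plain Python. It is not the IB API and says nothing about how IB evaluates such queries; it only spells out the query's meaning using the standard library's `xml.etree.ElementTree`. The sample documents and the element names (`article`, `title`, `diagram`, `caption`) are assumptions made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data: element names and contents are assumptions
# chosen for this illustration, not taken from any real collection.
DOCS = """
<collection>
  <article>
    <title>Airbus A380 maintenance notes</title>
    <diagram><caption>Engine pylon attachment</caption></diagram>
    <diagram><caption>Cabin layout</caption></diagram>
  </article>
  <article>
    <title>Rail traction systems</title>
    <diagram><caption>Diesel engine cooling loop</caption></diagram>
  </article>
</collection>
"""


def captions_matching(root, title_term, caption_term):
    """Return diagram caption elements mentioning caption_term, restricted
    to articles whose title mentions title_term (case-insensitive)."""
    hits = []
    for article in root.iter("article"):
        title = article.findtext("title", default="")
        if title_term.lower() not in title.lower():
            continue  # structural constraint on the enclosing article failed
        for caption in article.findall(".//diagram/caption"):
            if caption_term.lower() in (caption.text or "").lower():
                hits.append(caption)  # the element itself is the unit of retrieval
    return hits


root = ET.fromstring(DOCS)
results = [c.text for c in captions_matching(root, "Airbus", "engine")]
print(results)
```

Note that the function returns the matching caption elements rather than whole articles. This is the hand-written "middle layer" of parsing and navigation that the list above says an engine indexing structure and content together makes unnecessary.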
The default mode is to index all the words and all the structure of documents. It provides powerful and fast search without prior knowledge about the content, yet enables arbitrarily complex questions across all the content and from different perspectives. Not bound by the constraints of "records" as the unit of information, one can immediately derive value from content, with the flexibility to enhance content and the application incrementally over time without "breaking anything".
IB was designed from the ground up to address two key goals: to provide universal hierarchical/context search of SGML/XML (and other document formats), and to provide optimal support for the features (current and future) of the ISO 23950 (ANSI/NISO Z39.50) Information Retrieval Protocol services standard.
Available Platforms
The IB engine, systems and development toolkits are available on multiple platforms. Currently supported platforms (2007):

| Hardware | Operating System |
|---|---|
| AMD/Intel x86 | Microsoft Windows (WIN32): Windows 9x/NT/2000/XP/Vista |
| AMD 64-bit | Microsoft Windows (WIN32): Windows 9x/NT/2000/XP/Vista |
| AMD/Intel x86 | Linux (most distributions) |
| AMD 64-bit | Linux (most distributions) |
| AMD x86 | FreeBSD, OpenBSD |
| AMD 64-bit | FreeBSD, OpenBSD |
| SPARC 64 | Solaris 8 (Solaris 10: Oct 2007) |
| AMD x86 | Solaris 10 |

A significant number of other platforms are available on a customer-by-customer OEM basis. These include mobile appliance ("cell phone") devices and vanguard processors such as the IBM/Toshiba/Sony Cell (found in, among other hardware platforms, Sony's PlayStation 3).
On 32-bit platforms there are two distinct versions available:
- 32-bit addressing (standard edition)
  - Intended to index up to 4 GB of text per segment, for a maximum total of 1 TB.
- 64-bit addressing (32/64 edition)
  - For 16384 TB per segment and a total of approx. 4.1943e+06 TB in up to 31 million records.
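The per-segment and total capacities quoted for the two editions can be cross-checked with a little arithmetic. The sketch below assumes binary units (1 GB = 2^30 bytes, 1 TB = 2^40 bytes) and reads "4.1943e+06" as 4194304 (2^22); both are assumptions, since the text does not say so. Under those assumptions both editions work out to the same number of segments, a figure the text itself never states.

```python
GB = 2**30  # assumption: binary gigabyte
TB = 2**40  # assumption: binary terabyte

# Figures as quoted for the two editions.
std_segment = 4 * GB        # 32-bit addressing, per segment
std_total = 1 * TB          # 32-bit addressing, maximum total
ext_segment = 16384 * TB    # 64-bit addressing, per segment
ext_total = 4194304 * TB    # assumption: "4.1943e+06" read as 2**22

# Implied number of segments in each edition.
std_segments = std_total // std_segment
ext_segments = ext_total // ext_segment

# Address width implied by each per-segment limit.
std_bits = std_segment.bit_length() - 1
ext_bits = ext_segment.bit_length() - 1

print(std_segments, ext_segments)
print(std_bits, ext_bits)
```

The segment counts are ratios, so they come out the same with decimal units. The per-segment limit of the standard edition matches a 32-bit address space exactly, while the 32/64 edition's quoted limit corresponds to 54 bits rather than the full 64.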
The IB API is currently available and actively supported for the following languages: Python, Tcl, Ruby, Java (JINI), Perl. OEM customers can access the full C++ API.
Like nearly all databases and search engines, IB isn't "thread-safe" in the sense that the programmer can call its code indiscriminately from multiple threads. It has, however, been designed with reasonable process safety in mind to allow for robust development of search and retrieval applications. Because IB is bound heavily by I/O and less by CPU, the haphazard use of threads (an increasingly bad habit among Java developers) is not only perilous but also yields poor performance, due to the serial nature of both disk and memory I/O. See threads
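Where an application cannot avoid threads, one conservative pattern is to serialize every call into the non-thread-safe component behind a single lock. The sketch below illustrates that general pattern only. `FakeIndex` is a hypothetical stand-in, not the IB API, and the source does not say which approach IB itself recommends beyond its emphasis on process safety.

```python
import threading


class FakeIndex:
    """Hypothetical stand-in for a search handle that is not thread-safe."""

    def __init__(self, docs):
        self._docs = docs

    def search(self, term):
        return [d for d in self._docs if term.lower() in d.lower()]


class SerializedIndex:
    """Wraps a handle so that only one thread at a time can call into it."""

    def __init__(self, index):
        self._index = index
        self._lock = threading.Lock()

    def search(self, term):
        with self._lock:  # all callers queue here; searches run one by one
            return self._index.search(term)


shared = SerializedIndex(FakeIndex(["Airbus engine", "Rail engine", "Cabin layout"]))
results = {}


def worker(term):
    results[term] = shared.search(term)


threads = [threading.Thread(target=worker, args=(t,)) for t in ("engine", "cabin")]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)
```

Because the searches execute one at a time, extra threads add no search throughput here, which is consistent with the point above about the serial nature of disk and memory I/O.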