NavigationUser login |
IB Search Engine Design // Beyond XML full-text search and beyond native XML databasesSearch functionality (inclusive of ranking) is handled by the embedded IB sub-system. IB is a development of BSn's NONMONOTONIC Lab in Munich. BackgroundOrganizations of all sizes and within all industries generally distribute their corporate knowledge amid a variety of heterogeneous database applications: from customer relationship systems, staff directories, content management systems (CMS), electronic document and records management systems (EDRMS) to library catalogues.
The default modus is to index all the words and all the structure of documents. It provides powerful and fast search without prior knowledge about the content yet enables arbitrarily complex questions across all the content and from different perspectives. Not bound by the constraints of "records" as unit of information, one can immediately derive value from content with the flexibility to enhance content and the application incrementally over time without "breaking anything". IB was designed from the ground up to address three key goals: universal SGML/XML (and other document formats) hierarchical/context search, distributed objects (transparent integrated views to other sources of information such as relational DBs, search services and object brokers) and to provide optimal support for features (current and future) of the ISO 23950 (ANSI/NISO Z39.50) Information Retrieval Protocol services standard to allow for standard interoperable interfaces.
IB Search Sub-SystemThe IB engine (written in C++) provides extremely sophisticated contextual content search to heterogeneous (mix format) information including XML (with all the functionality of a native XML database) and SGML. It goes well beyond most XML full-text search solutions to support alongside text also other objects. Not limited to the XML paradigm and designed upon a more abstract model it can go beyond the more commonplace hierarchical text model (volumes, chapters, sections, paragraphs, sentences and even lines), fields and paths (XPaths and XQuery model) to (abstract hierarchical) path expressions and overlapping structures. In the IB engine a single index can have mixed document content--- and mechanisms are provided to even unify fields from different formats-- and these can be on the fly mixed with others to provide dynamic collections. While most all search engines demand that the unit of retrieval be defined prior to indexing as "record design" we not only can index a diverse collection of heterogeneously formated and structured information but allow also the "unit of retrieval" to be quasi-defined per search via abstract paths. Together these features enable state of the art context querying of a sort not found in any of the so-called XML search, full-text or IR engines.
Relational enhanced hierarchical full text IR.Trend over the past years has been to try to enhance RDBMs with some fulltext search functionality to try to provide the needed mining and information retrieval functionality. These products work but are sub-optimal: neither fish nor fowl. Their functionality is limited. Their performance tends to be mediocre. They are inflexible (needing to adhere to the relational table model) and they are large and bloated. IB goes the other way and extends the full text information retrieval paradigm to hierarchical models extended with objects, datatypes and relations. Powerful SearchIB is, as one might expect, extremely powerful and supports:
Query ModelSince XML (the underlying model for RSS, Atom and CAPS) has an explicit ancestry of nodes and paths we exploit them for contextual search. This allows for some interesting and powerful searches:
<SPEECH> <SPEAKER>LADY MACBETH</SPEAKER> <LINE>Out, damned spot! out, I say!--One: two: why,</LINE> <LINE>then, 'tis time to do't.--Hell is murky!--Fie, my</LINE> <LINE>lord, fie! a soldier, and afeard? What need we</LINE> <LINE>fear who knows it, when none can call our power to</LINE> <LINE>account?--Yet who would have thought the old man</LINE> <LINE>to have had so much blood in him.</LINE> </SPEECH> First off I we have an idea of "nearness": being in the same leaf. The words "out" and "spot" are in the same node (with path ...\SPEECH\LINE ). Its named SPEECH ancestor is the above speech--- the only speech in all of Shakespeare's works to have the words "out" and "spot" in the same LINE. The SPEAKER descendant of that SPEECH is "LADY MACBETH". The word "spot" is said within the works, by contrast, in many other speeches by speakers in addition to Lady Macbeth: SALISBURY in `The Life and Death of King John', BRUTUS as well as ANTONY in `The Tragedy of Julius Caesar', MISTRESS QUICKLY in `The Merry Wives of Windsor', VALERIA in `The Tragedy of Coriolanus', ROSALIND in `As You Like It' and MARK ANTONY in `The Tragedy of Antony and Cleopatra'. <Videos>
<Video ASIN="B0000DJ7G9">
<Title Screenplay="William Shakespeare" Alt="Othello">The Tragedy of Othello: The Moor of Venice</Title>
<Director>Orson Welles
<Length>90 minutes</Length>
<Format>DVD</Format>
<Certificationg>U</Certification>
</Video>
<Videos>In the above the value of the Screenplay attribute to Title is William Shakespeare. In IB one can search for the content of the tag (TITLE), the content of an attribute (named mapped with a default of TITLE@SCREENPLAY in this example) but also for information like films with Venice in the title whose screenplay was from Shakespeare. Here we appeal to an implicit direct parent (an abstract virtual container that contains the both attribute and field data). In XML we not only have a parent/child ancestry of nodes but we also have within nodes a linear ordered relationship. One letter follows the next and one word follows the other in a container. In the above example "Yet" precedes "here's" and "a" follows after and finishing with "spot". We have order and at at least a qualitative (intuitive) notion of distance. In XML we do not, however, have any well-defined order among the siblings (different LINEs). The XML 1.0 well-formedness definition specifically states that attributes are unordered and says also nothing about elements. Document order (how they are marked-up) and the order a conforming XML parser might decide to report the child elements of SPEECH might not be the same. Most systems handling XML from a disk and using popular parsers typically deliver it in the same order but the standard DOES NOT specify that it need be--- and for good reason. See also: Presentation Methods and Elements. OrderLady Macheth says "spot" in another speech too.. <SPEECH> <SPEAKER>LADY MACBETH<<SPEAKER> <LINE>Yet here's a spot.</LINE> </SPEECH> These "spot"s are in "PLAY\ACT\SCENE\SPEECH\LINE" In XML we not only have a parent/child ancestry of nodes but we also have within nodes a linear ordered relationship. One letter follows the next and one word follows the other. In the above example "Yet" precedes "here's" and "a" follows after and finishing with "spot". We have order and at at least a qualitative (intuitive) notion of distance. One could then specify an inclusion (within the same unnamed or named field or path), an order and even a character (octet) metric. We have not attempted to implement a word metric as the concept of word is more complicated then commonly held. Is edz@bsn.com a single word? Two words? One word? Maybe even 3? What about a URL? Hyphenation as in "auto-mobile"? Two words? On the other hand what does such a distance mean? Our metric of distance is defined as the file offsets (octets) as the record is stored on the file system. This too is less than clear cut as the render of Überzeugung and Überzeugung are equivalent but their lengths are different. zum Thema Präsentieren und Überzeugen The file system offsets between the word "Thema" and "und" depends upon how "Präsentieren" was encoded. As UTF-7, as UCS-2, Latin-1 or with entities ¨ and &am;;Uuml; or &#nnn form. Does one, instead, treat the rendered level? This too is misleading since different output devices might have quite different layouts. What about columns? And the tags? Tags are, of course, even worse with words since we have started to associate a semantics for distance. Look at an XML fragment like <name>Edward C. Zimmermann</name><company>NONMONOTOMIC Lab</company> What is the word distance between "Lab" and "Edward" keeping in mind that XML markup is equivalent if NAME is specified before or after COMPANY. <company>NONMONOTOMIC Lab</company><<name>Edward C. Zimmermann</name> These are equivalent content and the word distance? That's right.. its not well defined! Order and inclusion do make sense, even some rough guide to distance but words? We are, of course, open to convincing examples! Counting containersIB can also count containers. GetNodeOffsetCount(GPTYPE HitGp, STRING FieldNameorPath= "", FC *Fc= NULL) where HitGp is the address of a hit, FieldNameofPath the name, resp. path, of the container we are interested in and Fc (optional) is its calculated start/end coordinates. This is useful to be able to determine which page (paragraph or line) a hit is on but may also be used in an abstract sense. Example: <ingredients>
<item>Chocolate</item>
<item>Flour</item>
<item>Butter</item>
<item>eggs</item>
</ingredients>
The oder is the order in the mark-up. Chocolate might be the 25th item in the document but its the first item in the particular instance of INGREDIENTS. This is possible since we can check the count of the instance of INGREDIENTS and can look at the previous we can count also the offset of ITEM. Granularity of Search Results (ex-ante neo-Records)IB provides the effective functionality of search time definition of unit of recall. In conventional search/retrieval systems the fundamental unit of recall is the "record" defined prior to indexing. If, for example, one indexed the complete works of Shakespeare on a per-play basis then searching for terms would only return the unit of recall play (or an element thereof, for example, its TITLE). In IB, by, contrast, one can define in a search query the unit of recall as ancestor/descendant of found (matched) elements. One can easily request (in the Shakespeare, index per play, example) the name of the acts (and the text of acts) that match a query just as one can ask for the speech (and its speaker) or even the lines as the unit of recall. This functionality is mission critical to, among other fields, the search of literature and legal texts. In IBU News its used to distinguish between stories (<item>s in RSS) and the feeds in which they belong. Indexing on a per feed basis (and not item) also allows us to search for inter-relationships between different items in the same feed to ask questions like "What news feed had both an article on Putin and on Ahmadinedjad and what were the stories". Sometimes the story is told in the context of multiple articles in a feed and not just in one.
Queries a la Nicolas BourbakiIn IB with work with sets. Result SetsThe result of a query is a set. Sets may be combined and operations performed upon them. IB supports many binary and unary operators. The most primitive set is the set of records that contain a single term (word or phrase). IB supports also fielded data with extremely powerful and flexible methods. In its most basic form (when performing a search) you may specify a field, or a field path--- If no field is specified it means ANYWHERE in the record. Fields are specified with a / as in field/term Membership of a field too can be viewed as an operation acting upon the set of records that contain a term. Terms may contain wild cards. IB supports the so-called " Glob Expression Syntax":
Operator overload and overloaded operatorsIB's query languages also provide a set of operators with sematics specific to the object type of a field:
Query LanguagesThe IB engines supports multiple query languages and includes a well defined class structure to extend the engine with syntax for other languages. Out of the box it supports queries expressed in both Infix and RPN notations. Infix NotationInfix notation is the common arithmetic and logical formula notation, in which operators are written infix-style between the operands they act on (e.g. 2 + 2). See: IB Infix Notation. RPN NotationReverse Polish notation (RPN), also known as postfix notation, was invented by Australian philosopher and computer scientist Charles Hamblin in the mid-1950s. It is derived from the Polish notation, which was introduced in 1920 by the Polish mathematician Jan Łukasiewicz.
See: IB RPN Notation. Normalization and RankingIn search we are looking for relevance. How do we rank relevance? The first step is to normalize a score for different records and the second step is to have a model as to how to compare these scores in their context. By Edward C. Zimmermann at 2010-07-08 14:34 SGML | XML fulltext engine | Z39.50 | full-text | full-text search | native XML database | native XML database
|