Relational enhanced hierarchical full text IR.

Relational enhanced hierarchical full text IR.

Trend over the past years has been to try to enhance RDBMs with some fulltext search functionality to try to provide the needed mining and information retrieval functionality. These products work but are sub-optimal: neither fish nor fowl. Their functionality is limited. Their performance tends to be mediocre. They are inflexible (needing to adhere to the relational table model), demand costly indices and they are large and bloated. IB goes the other way and extends the full text information retrieval paradigm to hierarchical models extended with objects, datatypes and relations.

Relational? hierarchical? IB?

The difference between relational and hierarchical database models is the structure of the database and the methods used to locate data. Relational database are quite generic but for performance they use indices to link information from one table to another. These indices, although expensive, can be generated at will. Hierarchical databases by contrast stores data according to a set and rigid design (its hierachy) through which data is accessed. Hierarchical databases can be fast, don't have the high cost of updating indices with data updates but are mission specific: cast in stone. IB, by contrast, allows for the rapid addressing of data through a hierachy but its the one defined by the structure of the information and data.

Query Model

Since XML (the underlying model for RSS, Atom and CAPS) has an explicit ancestry of nodes and paths we exploit them for contextual search. This allows for some interesting and powerful searches:
  • Find stories with words in the same unnamed node (for instance in the same Title or same Description).
  • Find stories with words in the same named node (for instance in the same explicit Title or same Description).
  • Find stories with words in the same named explicit node path (for instance in the same explicit Title of an image of a ...).
  • Select content of named nodes, named paths or object attributes.
  • Select content of descendants of a named ancestor of record hits. Descendants or ancestors may be specified as the name of a node or a path or even a sub-path. This allows one to dynamically at query time define a view and model (unit of recall) to content.
In RSS this means that one can search for items in the same, for example, item description and request the title and URL of the items that match. Just what one needs given the multiple items per feed!

Lets look a bit closer by example of XML fragments (from SGML/XML markup of Shakespeare's works by Jon Bosak):

<SPEECH>
 <SPEAKER>LADY MACBETH</SPEAKER>
 <LINE>Out, damned spot! out, I say!--One: two: why,</LINE>
 <LINE>then, 'tis time to do't.--Hell is murky!--Fie, my</LINE>
 <LINE>lord, fie! a soldier, and afeard? What need we</LINE>
 <LINE>fear who knows it, when none can call our power to</LINE>
 <LINE>account?--Yet who would have thought the old man</LINE>
 <LINE>to have had so much blood in him.</LINE>
</SPEECH>
First off I we have an idea of "nearness": being in the same leaf. The words "out" and "spot" are in the same node (with path ...\SPEECH\LINE ). Its named SPEECH ancestor is the above speech--- the only speech in all of Shakespeare's works to have the words "out" and "spot" in the same LINE. The SPEAKER descendant of that SPEECH is "LADY MACBETH". The word "spot" is said within the works, by contrast, in many other speeches by speakers in addition to Lady Macbeth: SALISBURY in `The Life and Death of King John', BRUTUS as well as ANTONY in `The Tragedy of Julius Caesar', MISTRESS QUICKLY in `The Merry Wives of Windsor', VALERIA in `The Tragedy of Coriolanus', ROSALIND in `As You Like It' and MARK ANTONY in `The Tragedy of Antony and Cleopatra'.
<Videos>
  <Video ASIN="B0000DJ7G9">
 <Title Screenplay="William Shakespeare" Alt="Othello">The Tragedy of Othello: The Moor of Venice</Title>
    <Director>Orson Welles
    <Length>90 minutes</Length>
    <Format>DVD</Format>
    <Certificationg>U</Certification>
  </Video>
<Videos>
In the above the value of the Screenplay attribute to Title is William Shakespeare. In IB one can search for the content of the tag (TITLE), the content of an attribute (named mapped with a default of TITLE@SCREENPLAY in this example) but also for information like films with Venice in the title whose screenplay was from Shakespeare. Here we appeal to an implicit direct parent (an abstract virtual container that contains the both attribute and field data). In XML we not only have a parent/child ancestry of nodes but we also have within nodes a linear ordered relationship. One letter follows the next and one word follows the other in a container. In the above example "Yet" precedes "here's" and "a" follows after and finishing with "spot". We have order and at at least a qualitative (intuitive) notion of distance. In XML we do not, however, have any well-defined order among the siblings (different LINEs). The XML 1.0 well-formedness definition specifically states that attributes are unordered and says also nothing about elements. Document order (how they are marked-up) and the order a conforming XML parser might decide to report the child elements of SPEECH might not be the same. Most systems handling XML from a disk and using popular parsers typically deliver it in the same order but the standard DOES NOT specify that it need be--- and for good reason. See also: Presentation Methods and Elements.

Order

Lady Macheth says "spot" in another speech too..
<SPEECH>
  <SPEAKER>LADY MACBETH<<SPEAKER>
  <LINE>Yet here's a spot.</LINE>
</SPEECH>
These "spot"s are in "PLAY\ACT\SCENE\SPEECH\LINE" In XML we not only have a parent/child ancestry of nodes but we also have within nodes a linear ordered relationship. One letter follows the next and one word follows the other. In the above example "Yet" precedes "here's" and "a" follows after and finishing with "spot". We have order and at at least a qualitative (intuitive) notion of distance. One could then specify an inclusion (within the same unnamed or named field or path), an order and even a character (octet) metric. We have not attempted to implement a word metric as the concept of word is more complicated then commonly held. Is edz@bsn.com a single word? Two words? One word? Maybe even 3? What about a URL? Hyphenation as in "auto-mobile"? Two words? On the other hand what does such a distance mean? Our metric of distance is defined as the file offsets (octets) as the record is stored on the file system. This too is less than clear cut as the render of Überzeugung and &Uuml;berzeugung are equivalent but their lengths are different.
zum Thema Präsentieren und Überzeugen
The file system offsets between the word "Thema" and "und" depends upon how "Präsentieren" was encoded. As UTF-7, as UCS-2, Latin-1 or with entities &uml; and &am;;Uuml; or &#nnn form. Does one, instead, treat the rendered level? This too is misleading since different output devices might have quite different layouts. What about columns? And the tags? Tags are, of course, even worse with words since we have started to associate a semantics for distance. Look at an XML fragment like
<name>Edward C. Zimmermann</name><company>NONMONOTOMIC Lab</company>
What is the word distance between "Lab" and "Edward" keeping in mind that XML markup is equivalent if NAME is specified before or after COMPANY.
<company>NONMONOTOMIC Lab</company><<name>Edward C. Zimmermann</name>
These are equivalent content and the word distance? That's right.. its not well defined! Order and inclusion do make sense, even some rough guide to distance but words? We are, of course, open to convincing examples!

Counting containers

IB can also count containers.
GetNodeOffsetCount(GPTYPE HitGp, STRING FieldNameorPath= "", FC *Fc= NULL)
where HitGp is the address of a hit, FieldNameofPath the name, resp. path, of the container we are interested in and Fc (optional) is its calculated start/end coordinates.

This is useful to be able to determine which page (paragraph or line) a hit is on but may also be used in an abstract sense. Example:

  <ingredients>
    <item>Chocolate</item>
    <item>Flour</item>
    <item>Butter</item>
    <item>eggs</item>
  </ingredients>
The oder is the order in the mark-up. Chocolate might be the 25th item in the document but its the first item in the particular instance of INGREDIENTS. This is possible since we can check the count of the instance of INGREDIENTS and can look at the previous we can count also the offset of ITEM.

Granularity of Search Results (ex-ante neo-Records)

IB provides the effective functionality of search time definition of unit of recall.

In conventional search/retrieval systems the fundamental unit of recall is the "record" defined prior to indexing. If, for example, one indexed the complete works of Shakespeare on a per-play basis then searching for terms would only return the unit of recall play (or an element thereof, for example, its TITLE).

In IB, by, contrast, one can define in a search query the unit of recall as ancestor/descendant of found (matched) elements. One can easily request (in the Shakespeare, index per play, example) the name of the acts (and the text of acts) that match a query just as one can ask for the speech (and its speaker) or even the lines as the unit of recall.

This functionality is mission critical to, among other fields, the search of literature and legal texts.

In IBU News its used to distinguish between stories (<item>s in RSS) and the feeds in which they belong. Indexing on a per feed basis (and not item) also allows us to search for inter-relationships between different items in the same feed to ask questions like "What news feed had both an article on Putin and on Ahmadinedjad and what were the stories". Sometimes the story is told in the context of multiple articles in a feed and not just in one.

Theme by Danetsoft and Danang Probo Sayekti inspired by Maksimer