Result Set Size

Date: Mon, 17 Nov 2008 18:14:03 +0100
From: "Edward C. Zimmermann"
To: "SRU (Search and Retrieve Via URL) Implementors" , ZNG@sun8.LOC.GOV
Subject: Re: result size precision

On Mon, 17 Nov 2008 11:13:28 -0500, Ray Denenberg, Library of Congress wrote
> Based on discussion the past several years over the topic of result size
> precision, the OASIS Search Web Service Technical Committee proposes to define
> in SRU 2.0 an optional response element whose definition
> would be something like:
>

Confidence as a 0-100 integer does not make any sense. What does 30 mean?
30% confidence? Confidence in what? What does 10% or 90% mean? We can
debate what these numbers mean in statistics or in empirical data quality
but here it's screaming "wrong model".

Servers typically don't have any kind of "confidence" beyond "know" and
"don't know". Between these two poles it's more or less "what I think at
this moment but I know it's probably not right but could be.."..

> "The server's confidence in the precision of the result count reported. A
> non-negative integer from zero to 100. A value of zero means the server has no
> idea what the size of the result set is. '100' means the server guarantees
> that the value of result count is accurate. A value in between means the
> result count is an estimate, where a higher value means that the server has
> more confidence in the precision than a lower value."
>
> [Note: the committee debated an alternative, where there would be three
> values: 'exact', 'estimate', 'no idea'. However, the committee felt that
> might be inflexible, and there might eventually be implementors who would want
> four levels, five, etc. With the zero to 100 approach, a convention could be
> recommended to use zero, 50, and 100 for the three-level representation.]
>

"four levels, five etc." seems like private conversations to me..
We can, of course, think about a model and define a few different "fits"..
Beyond "exact", "no idea" are a few levels of "estimates".. Like "feeling
good estimate", "feeling not too good estimate", "best estimate"..
But we're still missing some important states.. like "volatile"..
Or "minimum" (at least this many) or maximum (probably no more than..)...

Nothing quantitative much less "linear" here...

A real application:

We've been doing some work in distributed p2p search networks.. and here
the longer a search runs the larger the set can perhaps be but it can also
shrink as we dynamically adjust the granularity of information.. converging
upon some size in unknown time--- the search is given fuel and like a motorcar
can be re-fueled.. Now.. we have "an idea".. we don't know the limit but at
any given moment we have a certain size of the set we have at that moment..
which is, of course, a different set the next bat of an eye..
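
A rough sketch of the "fuel" idea, under loud assumptions: the function
name `run_search`, the per-step `deltas`, and the one-unit-of-fuel-per-step
accounting are all made up for illustration and are not how any real
engine does it. It only shows that each reported size is a snapshot, that
the size can shrink as well as grow, and that a search can be re-fueled
and resumed.

```python
from typing import Iterator


def run_search(deltas: Iterator[int], size: int, fuel: int) -> Iterator[int]:
    """Yield the current set size after each step until the fuel runs out.

    `deltas` is a hypothetical stand-in for what the network reports per
    step: positive when peers add hits, negative when a change of
    granularity shrinks the set.
    """
    while fuel > 0:
        try:
            delta = next(deltas)
        except StopIteration:
            return  # the network has nothing more to report
        fuel -= 1  # assumption: every step costs one unit of fuel
        size = max(0, size + delta)
        yield size


reports = iter([5, 7, -4, 3, -2])
first_run = list(run_search(reports, size=0, fuel=3))
# Re-fuel: resume from the last snapshot with two more units of fuel.
second_run = list(run_search(reports, size=first_run[-1], fuel=2))
```

No snapshot in either run is "the" result size; each is only the size at
that moment.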

What I'm suggesting is instead of this pseudo analytical 0-100 stuff we
have nice qualitative words as a minimum public vocabulary such as
"exact", "unknown", "minimum" (it's at least this many), "maximum" (it's no
more than this) etc. and allow for "personal extensions" as any term other
than these (or whatever magic words we define).. together with a
controlled core list of modalities.. (such as shall be, is, was etc.)

Clients would only "need to" understand the 0 (don't know) and 1 (exact)..
but could grasp more..
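
To make the suggestion concrete, here is a minimal sketch of how a client
might handle such a vocabulary. The names `Precision`, `ResultCount` and
`describe` are hypothetical and not part of SRU or any proposal; the four
core terms are the ones named above, and any other term is treated as a
"personal extension" the client does not understand.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Precision(Enum):
    """Core public vocabulary (assumed spelling of the terms)."""
    EXACT = "exact"
    UNKNOWN = "unknown"
    MINIMUM = "minimum"  # at least this many
    MAXIMUM = "maximum"  # no more than this


@dataclass(frozen=True)
class ResultCount:
    count: Optional[int]  # None when the server reports no number at all
    precision: str        # a core term or a server-specific extension


# Wording for the terms this particular client chooses to understand.
_TEMPLATES = {
    Precision.EXACT.value: "{} records",
    Precision.MINIMUM.value: "at least {} records",
    Precision.MAXIMUM.value: "at most {} records",
}


def describe(result: ResultCount) -> str:
    """Render a count; anything not understood degrades to "don't know"."""
    template = _TEMPLATES.get(result.precision)
    if template is None or result.count is None:
        return "unknown number of records"
    return template.format(result.count)
```

A client that only knows "exact" and "unknown" would simply keep a shorter
table; an extension term such as "volatile" falls through to the
"don't know" branch instead of breaking the client.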

> That's the server side, comments are welcome.
>
> At the other end is the client side. Should the client indicate that it does
> or does not care about result size precision? It might want 10 records, any
> 10, and beyond that it doesn't care if there are 10 or 10 billion, and it does
> not want the client to bother to even try to determine or estimate the result
> size, as that may be an expensive process.
>
> The TC is inclined not to address this, the client end, unless someone can
> cite a real requirement (not just "it seems useful"). So we are soliciting
> feedback on this question from SRU implementors. Can someone assert that if
> a request parameter were to be defined pertaining to result size precision,
> you would implement it?
>
> Ray Denenberg
>
>

-------------------------------------------------------------

Date: Tue, 2 Dec 2008 20:19:36 +0100
From: "Edward C. Zimmermann"
To: "SRU (Search and Retrieve Via URL) Implementors" , ZNG@sun8.LOC.GOV
Subject: Re: result size precision // mining for gold

On Tue, 18 Nov 2008 16:56:54 -0500, Ray Denenberg, Library of Congress wrote
> ----- Original Message -----
> From: "Edward C. Zimmermann"
> > But we're still missing some important states.. like "volatile"..
> > Or "minimum" (at least this many) or maximum (probably no more than..)...
>
> Good point. I agree these are important states to incorporate somehow.

Descriptive or qualitative features are always best addressed as descriptive
or qualitative features and not pressed into ill-fitting and inappropriate
quantitative suits.

>
> > We've been doing some work in distributed p2p search networks.. and here
> > the longer a search runs the larger the set can perhaps be but it can also
> > shrink as we dynamically adjust the granularity of information.. converging
> > upon some size in unknown time--- the search is given fuel and like a motorcar
> > can be re-fueled..
>
> Could you elaborate or give a concrete example? The set shrinking
> as you dynamically adjust the granularity of information; the search

Since information structure gives us facilities for a "search time"
(dynamic, either user specified or model driven) unit of retrieval, decoupled
from the record as fragment, we can exploit it to identify which document
elements (such as the appropriate chapter or page) or fragments are
"appropriate" to retrieve.

Books might be fixed containers but increasingly information is not limited
to these physical objects but is entirely digital. Even with books the
question is not just "which book" but where to look in the book... Or the
question is simply: where to look.

Retrieval granularity may be on the level of sub-structures of a given
document or page such as line, paragraph but may also be as part of a larger
collection since information structure is not only defined by markup or
implicit (such as sentences, paragraphs in text) but also with respect to the
relationships between a record (object) within a collection of objects.

When searching, for example, the collected works of Shakespeare for love
and war it makes, I'd suggest, little sense to list those plays with the
words "love" and "war" in them. We'd find on a document level all 33 records
(plays). From a retrieval viewpoint when looking at Shakespeare and not
literature as a whole we are really asking: "What are the relevant bits of
Shakespeare's plays?".

Looking, however, at a collection that includes Shakespeare's works alongside
a host of other books (such as found in a typical library) when looking for
"love" and "war" we're first asking: "What's the relevant class of works that
addresses the concepts of 'love' and 'war'?". Here the retrieval granularity
can be "Shakespeare's works".

What's the point of searching the Internet for pages with the term "war"
using the record driven paradigms and sorted by some ranking?
http://www.isea2008singapore.org/exhibitions/air_exodus.html
(an Art project that showed a bit of the ideas)

How does this shift over time in the search in a network of servers?

As an example: searching for "war" in a network.. "War" might first appear to
be a term for conflict but may over the course of a search appear in clusters
of information about the band War (as in Eric Burdon and War), the game
Warhammer (W.A.R. for Warhammer Age of Reckoning), "war" as the past tense of
"to be" in German.. each of these bits themselves within their own "spheres"..
for example "war" as in http://www.dictionaryofwar.org/

Inclusion in a collection, references, linkages are all relationships that
works may carry on.

Search is searching for information and retrieval may be a relevant part of
a document or a collection of relevant documents (such as a Journal,
Newspaper, Encyclopedia, Social Network etc.).

It's like mining for gold with pans..

> given fuel: fascinating ideas, but what do they mean?
>
> > What I'm suggesting is instead of this pseudo analytical 0-100 stuff we
> > have nice qualitative words as a minimum public vocabulary such as
> > "exact", "unknown", "minimum" (it's at least this many), "maximum" (it's no
> > more than this) etc. and allow for "personal extensions" as any term other
> > than these (or whatever magic words we define).. together with a
> > controlled core list of modalities.. (such as shall be, is, was etc.)
> >
> > Clients would only "need to" understand the 0 (don't know) and 1 (exact)..
> > but could grasp more..
>
> In my opinion this is a good suggestion well-worth consideration. It
> is important to note, you should post this suggestion to the OASIS public
> comment list for this activity.
>
> See
> http://www.oasis-open.org/committees/comments/index.php?wg_abbrev=search-ws
>
> (To subscribe, send a blank email message to
> search-ws-comment-subscribe@lists.oasis-open.org. Once subscribed,
> email to search-ws-comment@lists.oasis-open.org.)
>
> Please note, I'm not saying that discussion is to be carried out on that
> list, however that list needs to officially record any comments.
>
> Thanks.
>
> --Ray

--