Ranking
IB supports many models of sorting result sets:
- By Record Key
- By order as indexed
- By an external sort
- By Score
- By Adjuncted Score
- By number of different term matches
- By Date (either forward or reverse)
- By Category
- By Newsrank (a heuristic that combines "score" with a function of chronology). The idea behind "Newsrank" is that newer stories are more significant than older ones.
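To make the Newsrank idea concrete, here is a minimal sketch in Python. It is not IB's actual formula, which this section does not give. The function name `newsrank`, the exponential decay, and the `half_life_days` parameter are all assumptions chosen only to show how a relevance score could be combined with a function of chronology so that newer stories rank higher.

```python
from datetime import datetime, timedelta, timezone


def newsrank(score, published, now, half_life_days=7.0):
    """Hypothetical Newsrank: damp a relevance score by the story's age.

    The exponential half-life decay is an assumption made for illustration;
    the real heuristic may use a different function of chronology.
    """
    # Age in days; stories dated in the future are treated as brand new.
    age_days = max((now - published).total_seconds() / 86400.0, 0.0)
    # The score halves every `half_life_days` days.
    return score * 0.5 ** (age_days / half_life_days)


# Usage: an older story with a higher raw score can fall behind a fresh one.
now = datetime(2024, 1, 8, tzinfo=timezone.utc)
stories = [
    ("old but strong match", 0.9, now - timedelta(days=14)),
    ("fresh, weaker match", 0.5, now - timedelta(days=1)),
]
ranked = sorted(stories, key=lambda s: newsrank(s[1], s[2], now), reverse=True)
```

With these sample values the fresh story sorts first, which is the behaviour the heuristic is meant to produce.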
Score
A record's score is obtained by normalizing its hits. Several normalization models are available.
Spatial Ranking
Spatial searches are scored according to the work "A spatial overlay ranking method for a geospatial search of text objects", Lanfear, Kenneth J. & U.S. Geological Survey, 2006, USGS Reston, Va.: http://pubs.usgs.gov/of/2006/1279/2006-1279.pdf . The spatial overlay score measures how well an object's footprint matches the search's spatial extent (defined by a bounding box).
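The following Python sketch shows one plausible way to score how well a footprint matches a query bounding box. It is an illustration only and should not be read as the exact formula from the cited report. The `BBox` type, the function names, and the choice to multiply the two overlap ratios are assumptions. Boxes that cross the antimeridian are ignored for simplicity.

```python
from typing import NamedTuple


class BBox(NamedTuple):
    """Hypothetical axis-aligned bounding box in degrees."""
    west: float
    south: float
    east: float
    north: float

    def area(self) -> float:
        return max(self.east - self.west, 0.0) * max(self.north - self.south, 0.0)


def overlap_area(a: BBox, b: BBox) -> float:
    """Area of the intersection of two boxes (0.0 if they do not overlap)."""
    width = min(a.east, b.east) - max(a.west, b.west)
    height = min(a.north, b.north) - max(a.south, b.south)
    if width <= 0 or height <= 0:
        return 0.0
    return width * height


def overlay_score(footprint: BBox, query: BBox) -> float:
    """Assumed score in [0, 1]: 1.0 when the boxes coincide, 0.0 when disjoint.

    Multiplying both ratios penalizes footprints that are much larger or
    much smaller than the query extent.
    """
    footprint_area, query_area = footprint.area(), query.area()
    if footprint_area == 0 or query_area == 0:
        return 0.0  # degenerate (point or line) boxes are not handled here
    shared = overlap_area(footprint, query)
    return (shared / footprint_area) * (shared / query_area)
```

For example, two 2x2 boxes offset by one unit in each direction share a quarter of each other's area, so this sketch scores them at 0.25 * 0.25.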
Normalization
The IB engine supports many models of normalization. These may be chosen at search time:
- Cosine Normalization: Despite being many decades old, it still seems to be among the best. Its main drawback (beyond the need to create full sets) is a bias towards shorter records: it finds them more frequently than the linear distribution of lengths alone might suggest.
- Log Normalization: Normalized according to the log of document length.
- Max Normalization: Normalized to favor those with more different hits.
- Bytes Normalization: Various document length normalization models have been proposed to address the bias of Cosine Normalization towards shorter records. However, they nearly always penalize long records too heavily. Bytes Normalization takes a middle ground: the cosine model is slightly modified to also take the document length distribution into consideration.
- Cosine Metric Normalization: Yet another variant of Cosine Normalization. The byte metric distance between "hits" (multiple term searches) is used to favor records where these are closer. Since hits are always closer in shorter records than in longer ones, we use the minimum of all the distances rather than an average, and adjust accordingly.
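The models above can be illustrated with a short Python sketch. These are textbook-style forms, not IB's internal formulas, which this section does not spell out. The function names, the `term_freqs` mapping, and the `log(e + length)` damping are assumptions. The last helper shows only the "minimum of all distances" idea behind Cosine Metric Normalization, not how the distance is folded into the score.

```python
import math


def cosine_normalize(raw_score, term_freqs):
    """Divide a raw score by the Euclidean norm of the record's term frequencies."""
    norm = math.sqrt(sum(f * f for f in term_freqs.values()))
    return raw_score / norm if norm else 0.0


def log_normalize(raw_score, doc_length):
    """Divide a raw score by the log of the document length.

    Adding e keeps the divisor >= 1 for empty records (an assumption).
    """
    return raw_score / math.log(math.e + doc_length)


def min_hit_distance(positions_a, positions_b):
    """Smallest byte distance between any hit of term A and any hit of term B."""
    return min(abs(a - b) for a in positions_a for b in positions_b)


# Usage: the same raw score of 10 in a short and a long record.
short_record = cosine_normalize(10.0, {"ranking": 3, "score": 4})
long_record = cosine_normalize(10.0, {"ranking": 3, "score": 4, "other": 12})
closest = min_hit_distance([10, 200], [50, 190])
```

The shorter record ends up with the higher normalized score, which is the bias towards short records described above.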
Score Bias
Scores can in turn be "biased" according to "priority" (a per-record scalar), "category" (a per-record predicate) and "temporal" (date metadata). The skew can be used to "boost" or downgrade scores. In IBU News, priority is calculated as a value that represents a number of features of the information source (such as data quality).
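As a rough illustration of biasing, the Python sketch below skews a normalized score by a per-record priority and a category predicate. The multiplicative form, the function name `apply_bias`, and the `category_boost` parameter are assumptions made for this example. They are not IB's API, and the temporal bias is left out for brevity.

```python
def apply_bias(score, priority=1.0, in_category=False, category_boost=1.0):
    """Hypothetical bias: scale a score by priority and an optional category boost.

    priority       -- per-record scalar; > 1 boosts, < 1 downgrades
    in_category    -- per-record predicate (does the record match the category?)
    category_boost -- factor applied only when the predicate is true
    """
    factor = priority
    if in_category:
        factor *= category_boost
    return score * factor


# Usage: a record from a high-quality source overtakes a slightly better match.
trusted = apply_bias(0.5, priority=2.0, in_category=True, category_boost=1.5)
untrusted = apply_bias(0.75, priority=0.5)
```

A multiplicative skew keeps zero scores at zero and preserves the relative order of records that share the same bias, which is one reason to prefer it over an additive offset in a sketch like this.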