Document base classes ("doctypes")

The IB engine is designed for and supports heterogeneous data sources. Fields, including many implicit ones, are detected automatically according to each document's format and treated as if they had been tagged (implicit auto-tagging): lines, sentences, paragraphs, pages etc. in plain text; subject, author, references, email addresses etc. in email; and so on. For PDF, for example, the fields include not only the document properties (the metadata or info section) that PDF documents define, with handling of their types such as dates, but also the implicit textual structure of the content: sentences, paragraphs and pages. Several doctypes also detect a number of field datatypes (objects) at index time, when this is enabled (or not disabled), and set them accordingly (telephone number, date, number etc.).

The following document base classes are provided:

Available Built-in Document Base Classes (v28):
AOLLIST ATOM AUTODETECT BIBCOLON
BIBTEX BINARY CAP COLONDOC
COLONGRP DIALOG-B DIF DVBLINE
ENDNOTE EUROMEDIA FILMLINE FILTER2HTML
FILTER2MEMO FILTER2TEXT FILTER2XML FIRSTLINE
FTP GILS GILSXML HARVEST
HTML HTML-- HTMLCACHE HTMLHEAD
HTMLMETA HTMLREMOTE HTMLZERO IAFADOC
IKNOWDOC IRLIST LISTDIGEST MAILDIGEST
MAILFOLDER MEDLINE MEMO METADOC
MISMEDIA NEWSFOLDER NEWSML OCR
ONELINE OZSEARCH PAPYRUS PARA
PDF PLAINTEXT PS PTEXT
RDF REFERBIB RIS ROADS++
RSS.9x RSS1 RSS2 RSSARCHIVE
RSSCORE SGML SGMLNORM SGMLTAG
SIMPLE SOIF TSLDOC TSV
XBINARY XFILTER XML XMLBASE
YAHOOLIST

Extensibility

Via the various FILTER2 doctypes (FILTER2MEMO, FILTER2TEXT, FILTER2XML), other document formats, custom data cleansing and/or content enrichment packages can be inserted into the indexing data pipeline to easily provide best-of-breed 3rd party functionality. Via the object type system, access can be provided to and from proprietary information and database systems as required.

System developers and integrators can also develop their own custom doctype plugins with the Doctype Development Kit. The standard delivery includes the following doctype plugins:

External Base Classes ("Plugin Doctypes"):
NULL: // Empty plugin
MSWORD: // M$ Word Plugin
MSRTF: // M$ RTF (Rich Text Format) Plugin [XML]
MSOLE: // M$ OLE type detector Plugin
MSEXCEL: // M$ Excel (XLS) Plugin
RTF: // "Rich Text Format" (RTF) Plugin
USPAT: // US Patents (Green Book)
ESTAT: // EUROSTAT CSL Plugin
ISOTEIA: // ISOTEIA project (GILS Metadata) XML format locator records
ADOBE_PDF: // Adobe PDF Plugin
PDFDOC: // OLD Adobe PDF Plugin
TEXT: // Plain Text Plugin

By Edward C. Zimmermann at 2010-05-07 13:10
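
To make the idea of implicit auto-tagging and index-time datatype detection more concrete, here is a minimal Python sketch. It is not the IB engine's code or API: the function names (`implicit_fields`, `detect_datatype`), the field names (`line`, `paragraph`) and the detection rules are all assumptions chosen for illustration, and the real doctypes almost certainly use more elaborate rules.

```python
import re
from datetime import datetime


def implicit_fields(text):
    """Return (field_name, start, end) spans for implicit plain-text structure.

    Hypothetical illustration only: field names and rules are assumptions.
    """
    spans = []
    # Lines: every maximal run of characters that contains no newline.
    for m in re.finditer(r"[^\n]+", text):
        spans.append(("line", m.start(), m.end()))
    # Paragraphs: consecutive non-empty lines, separated by empty lines.
    for m in re.finditer(r"[^\n]+(?:\n[^\n]+)*", text):
        spans.append(("paragraph", m.start(), m.end()))
    return spans


def detect_datatype(value):
    """Guess a datatype for a field value (deliberately naive rules)."""
    v = value.strip()
    # Plain integer or decimal number.
    if re.fullmatch(r"[+-]?\d+(?:\.\d+)?", v):
        return "number"
    # ISO-style calendar date (an assumption; real data uses many formats).
    try:
        datetime.strptime(v, "%Y-%m-%d")
        return "date"
    except ValueError:
        pass
    # Very loose telephone pattern: digits with common separators.
    if re.fullmatch(r"\+?\d[\d\s()./-]{5,}\d", v):
        return "telephone"
    return "text"


if __name__ == "__main__":
    sample = "first line\nsecond line\n\nnew paragraph"
    for name, start, end in implicit_fields(sample):
        print(name, repr(sample[start:end]))
    for value in ("42", "2010-05-07", "+49 89 1234567", "hello"):
        print(value, "->", detect_datatype(value))
```

The sketch records fields as character offsets rather than copies of the text, so several overlapping fields (a line inside a paragraph) can describe the same content without duplicating it. Whether the engine stores its fields this way is not stated in the text above.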