Wednesday, December 24, 2008

ORACLE TEXT: Text Queries on Document Collections

In a basic text query application, users enter query words or phrases and expect the application to return a list of documents that best match the query. Such an application involves creating a CONTEXT index and querying it with CONTAINS.


Create your Text Table

The following example creates a table called docs with two columns, id and text, by using the CREATE TABLE statement. This example makes the id column the primary key. The text column is VARCHAR2.

CREATE TABLE docs (id NUMBER PRIMARY KEY, text VARCHAR2(200));

Load Documents into Table: You can use the SQL INSERT statement to load text to a table. To populate the docs table, use the INSERT statement as follows:

INSERT INTO docs VALUES(1, 'California is a state in the US.');
INSERT INTO docs VALUES(2, 'Paris is a city in France.');
INSERT INTO docs VALUES(3, 'France is in Europe.');



Create the CONTEXT index

Index the HTML files by creating a CONTEXT index on the text column as follows. Since you are indexing HTML, this example uses the NULL_FILTER preference type for no filtering and uses the HTML_SECTION_GROUP type:

CREATE INDEX idx_docs ON docs(text) INDEXTYPE IS CTXSYS.CONTEXT PARAMETERS ('FILTER CTXSYS.NULL_FILTER SECTION GROUP CTXSYS.HTML_SECTION_GROUP');

Use the NULL_FILTER because you do not need to filter HTML documents during indexing. However, if you index PDF, Microsoft Word, or other formatted documents, use the CTXSYS.INSO_FILTER (the default) as your FILTER preference. This example also uses the HTML_SECTION_GROUP section group which is recommended for indexing HTML documents. Using HTML_SECTION_GROUP enables you to search within specific HTML tags, and eliminates from the index unwanted markup such as font information.


Querying Your Table with CONTAINS

You query the table with the SELECT statement with CONTAINS to retrieve the document ids that satisfy the query. Before doing so, set the format of the SELECT statement's output so that it is easily readable. To do so, set the width of the text column to 40 characters:

COLUMN text FORMAT a40;
SELECT SCORE(1), id, text FROM docs WHERE CONTAINS(text, 'France', 1) > 0;


Present the Document

In a real application, you might want to present the selected document to the user with query terms highlighted. Oracle Text enables you to mark up documents with the CTX_DOC package. We can demonstrate HTML document markup with an anonymous PL/SQL block in SQL*Plus. However, in a real application you might present the document in a browser. This PL/SQL example uses the in-memory version of CTX_DOC.MARKUP to highlight the word France in document 3. It allocates a temporary CLOB (Character Large Object datatype) to store the markup text and reads it back to the standard output. The CLOB is then de-allocated before exiting:

SET SERVEROUTPUT ON;

DECLARE
mklob CLOB;
amt NUMBER := 40;
line VARCHAR2(80);
BEGIN

CTX_DOC.MARKUP('idx_docs','3','France', mklob);
DBMS_LOB.READ(mklob, amt, 1, line);
DBMS_OUTPUT.PUT_LINE('FIRST 40 CHARS ARE:'||line); DBMS_LOB.FREETEMPORARY(mklob);

END;
/

Synchronize the Index After Data Manipulation

When you create a CONTEXT index, you need to explicitly synchronize your index to keep it up to date with any inserts, updates, or deletes to the text table. Oracle Text enables you to do so with the CTX_DDL.SYNC_INDEX procedure. Add some rows to the docs table:

INSERT INTO docs VALUES(4, 'Los Angeles is a city in California.');
INSERT INTO docs VALUES(5, 'Mexico City is big.');

Since the index is not synchronized, these new rows are not returned with a query on city :

SELECT SCORE(1), id, text FROM docs WHERE CONTAINS(text, 'city', 1) > 0;

Therefore, synchronize the index with 2Mb of memory, and reexecute the query:

EXEC CTX_DDL.SYNC_INDEX('idx_docs', '2M');

COLUMN text FORMAT a50;
SELECT SCORE(1), id, text FROM docs WHERE CONTAINS(text, 'city', 1) > 0;


Related Topics:

  1. Example of Queries on Catalog Information.
  2. Example of Document Classification.

No comments: