2. Februar 2011

Using Oracle Text Search

Oracle offers with "Oracle Text" a strong and highly adjustable Text Search to their database customers. We have used Oracle for years and 2 years ago we needed a Text-Search which has the ability to create a better sort order for search results.

Requirement: It should be possible to specify different emphases in the search query.

1. Object definition
A 'Product' in our case has the following attributes:
  1. Title
  2. Alternate title
  3. Feature
    1. Feature-Title
    2. Feature-Content
  4. Tags
  5. External links
  6. Link description

2. Table defintion
We created a help table in the database. All tags assigned to the product are written space-delimited into the PRODUCT_TAG column, the same for

  • PRODUCT_ALTERNATE_TITLE 
  • PRODUCT_FEATURE
  • FEATURE_CONTENT
  • EXTERNAL_LINK
  • LINK_DESCRIPTION

The CREATE TABLE statement looks like:


CREATE TABLE PRODUCT (
    ID NUMBER(38,0) NOT NULL,
    PRODUCT_TITLE VARCHAR2(256),
    PRODUCT_ALTERNATE_TITLE clob,
    PRODUCT_DESCRIPTION VARCHAR2(4000),
    PRODUCT_TAG clob,
    PRODUCT_FEATURE clob,
    FEATURE_CONTENT clob,
    EXTERNAL_LINK clob,
    LINK_DESCRIPTION clob
);


3. Index creation
First of all we need to grant the ctxapp to our user:

grant ctxapp to MY_USER;

Then we need to create a 'Lexer' to use the extended possibilities Oracle Text delivers:
"base-letter conversion, composite word indexing, case-sensitive indexing and alternate spelling for whitespace-delimited languages that have extended character sets."


begin
    ctx_ddl.create_preference(
        'german_lexer','basic_lexer'
    );
    ctx_ddl.set_attribute(
        'german_lexer','composite','german'
    );
    ctx_ddl.set_attribute (
        'german_lexer', 'SKIPJOINS', '-'
    );
    ctx_ddl.create_preference (
        'german_wordlist', 'BASIC_WORDLIST'
    );
    ctx_ddl.set_attribute (
        'german_wordlist', 'STEMMER', 'GERMAN'
    );
end;
/


We have to create a preference for the index so that we can create an index over more than one column (MULTI_COLUMN_DATASTORE).


begin
    ctx_ddl.create_preference(
        preference_name => 'PRODUCT_DATA_STORE',
        object_name => 'MULTI_COLUMN_DATASTORE'
    );
    ctx_ddl.set_attribute(
        preference_name => 'PRODUCT_DATA_STORE',
        attribute_name => 'COLUMNS',
        attribute_value => '
            PRODUCT_TITLE,
            PRODUCT_DESCRIPTION,
            PRODUCT_ALTERNATE_TITLE,
            PRODUCT_TAG,
            PRODUCT_FEATURE,
            FEATURE_CONTENT,
            EXTERNAL_LINK,LINK_DESCRIPTION'
    );
end;
/

The creation of the index looks like:


CREATE INDEX PRODUCT_FT_IDX ON PRODUCT(PRODUCT_TITLE)
    INDEXTYPE IS CTXSYS.CONTEXT
    PARAMETERS('
        datastore PRODUCT_DATA_STORE
        section group CTXSYS.AUTO_SECTION_GROUP
        LEXER german_lexer
        WORDLIST german_wordlist
        STOPLIST CTXSYS.EMPTY_STOPLIST
        SYNC (ON COMMIT)');


The 'index column' you have to specify is the first column from our MULTI_COLUMN_DATASTORE preference (here: PRODUCT_TITLE). The 'german_lexer' we created before handles all the special character stuff for us.

The 'german_wordlist' enables some extra functionality provided by oracle text:
"Use the wordlist preference to enable the query options such as stemming, fuzzy matching for your language. You can also use the wordlist preference to enable substring and prefix indexing, which improves performance for wildcard queries with CONTAINS and CATSEARCH."
The stoplist parameter specifies which words ahouldn't be indexed by Oracle; We don't want it so we specify a empty stoplist (CTXSYS.EMPTY_STOPLIST).

4. Search Query
Now the the index is ready and waits for better queries to use it. Our query looks like this:


SELECT
    score(1) as SCORE_VALUE,
    ID_PRODUCT ,
    PRODUCT_TITLE,
    PRODUCT_DESCRIPTION
FROM PRODUCT
WHERE CONTAINS(PRODUCT_TITLE,'
    (((autos) within (PRODUCT_TITLE))*10
    ACCUM (($autos) within (PRODUCT_TITLE))*9
    ACCUM ((%autos%) within (PRODUCT_TITLE))*2)
    ACCUM (((autos) within (PRODUCT_DESCRIPTION))*9
    ACCUM (($autos) within (PRODUCT_DESCRIPTION))*8
    ACCUM ((%autos%) within (PRODUCT_DESCRIPTION))*2)
    ACCUM (((autos) within (PRODUCT_ALTERNATE_TITLE))*8
    ACCUM (($autos) within (PRODUCT_ALTERNATE_TITLE))*7
    ACCUM ((%autos%) within (PRODUCT_ALTERNATE_TITLE))*2)
    ACCUM (((autos) within (PRODUCT_TAG))*7
    ACCUM (($autos) within (PRODUCT_TAG))*6
    ACCUM ((%autos%) within (PRODUCT_TAG))*2)
    ACCUM (((autos) within (PRODUCT_FEATURE))*6
    ACCUM (($autos) within (PRODUCT_FEATURE))*5
    ACCUM ((%autos%) within (PRODUCT_FEATURE))*1)
    ACCUM (((autos) within (FEATURE_CONTENT))*5
    ACCUM (($autos) within (FEATURE_CONTENT))*4
    ACCUM ((%autos%) within (FEATURE_CONTENT))*1)
    ACCUM (((autos) within (EXTERNAL_LINK))*4
    ACCUM (($autos) within (EXTERNAL_LINK))*3
    ACCUM ((%autos%) within (EXTERNAL_LINK))*1)
    ACCUM (((autos) within (LINK_DESCRIPTION))*3
    ACCUM (($autos) within (LINK_DESCRIPTION))*2.5
    ACCUM ((%autos%) within (LINK_DESCRIPTION))',1)>0


Now it is possible to adjust the criterias:
  1. Is the searchstring found as the 'complete phrase': (autos) within (COLUMN)
  2. Is the baseword of the searchstring found as the 'complete phrase': ($auto) within (COLUMN); [This means: Searching for 'cars' should also find all occurrences of 'car']
  3. Is the searchstring anywhere inside the text: (%autos%) within (COLUMN)

For all of this possiblities for every column we now can adjust the index score value: ((autos) within (PRODUCT_TITLE))*10. Oracle Text uses this value to calculate the score value (there are more criteria like: amount of occurrences).

Now we can sort the resultset after score(1) or SCORE_VALUE to get a better sort order for the search.

Keine Kommentare:

Kommentar veröffentlichen