Setting text indexing parameters
After checking the current text settings in Oracle and consulting the
Oracle Text Reference, you may want to customize the settings that will be used to index your XML documents. This requires creating the appropriate preferences in Oracle and then referencing those parameters with the ArcSDE DBTUNE parameter XML_IDX_INDEX_TEXT. Then, use the DBTUNE keyword containing the custom XML_IDX_INDEX_TEXT parameter when creating an XML column. For an ArcIMS Metadata Service you can reference a DBTUNE parameter in its ArcXML configuration file.
- Using a custom lexer
Depending on the language of your XML documents and the settings at the time Oracle was installed, the DEFAULT_LEXER parameter may not be configured to use the best lexer for your situation. A lexer preference must be set correctly for your language to successfully index XML documents.
For example, if the language you are working in is Thai, Oracle does not provide a Thai lexer and the DEFAULT_LEXER preference may be set to use the BASIC_LEXER lexer as a result. The BASIC_LEXER indexes and searches text using whitespace only, which may not offer the best results with a language such as Thai. If your XML documents are encoded as UTF-8, you may get better results using the WORLD_LEXER in this case if you are using Oracle 10g or a newer version. All XML documents published to an ArcSDE XML column by ArcIMS Metadata Services are encoded as UTF-8.
It may be helpful to review the system-provided text objects before creating custom preferences, for example, to be sure that the lexer you want to use is available with your Oracle installation.
select * from ctx_objects;
If the WORLD_LEXER is available, you can create a custom text preference referencing this lexer as follows:
exec ctx_ddl.create_preference('WORLDLEXER', 'WORLD_LEXER');
To use this preference when creating an XML column, reference it in the XML_IDX_INDEX_TEXT DBTUNE parameter. Always preface the preference name with the user who created the preference. For example, if the geodatabase administrator logged in as the user "sde" when they created the preference, reference the preference in the XML_IDX_INDEX_TEXT parameter as follows. Anyone can use a text preference created by another user.
LEXER sde.WORLDLEXER
If you can use the BASIC_LEXER to index your XML documents you may be interested in creating a custom lexer to merge text that includes non-alphanumeric characters into one word, for example. This can be accomplished using the printjoins attribute of the BASIC_LEXER, and would be useful to standardize indexing and searching of place names that use punctuation such as Coeur d'Alene.
Creating a custom lexer using the BASIC_LEXER's skipjoin attribute with the apostrophe character would index Couer d'Alene as Couer dAlene. Also, any incorrect spellings of this place using extra apostrophes such as C'ouer d'Alene would also be indexed as Couer dAlene which could produce better search results. For example, you would create a text preference like this as follows:
begin
ctx_ddl.create_preference('BASIC_LEX_SKIP', 'BASIC_LEXER');
ctx_ddl.set_attribute('BASIC_LEX_SKIP', 'skipjoins', '''');
end;
/
Use care when changing how punctuation is indexed. The same preference will be used when indexing all text in your XML documents. Values in one XML element may use punctuation differently than in others.
A custom lexer preference may also be required in cases where the DEFAULT_LEXER is correctly set to use the BASIC_LEXER, but the attributes of the BASIC_LEXER are not correct for your language. The BASIC_LEXER attributes for index stems, composite word indexing, alternate spelling, and new German spelling must be set appropriately for your language to correctly index XML documents. This is particularly important for Western European languages. After checking the appropriate values for these attributes for your language in the
Oracle Text Reference, create a custom lexer preference following the example above and then reference it in the XML_IDX_INDEX_TEXT DBTUNE parameter.
Using a custom wordlist or stoplist
Depending on the language of your XML documents and the settings at the time Oracle was installed, the DEFAULT_WORDLIST and DEFAULT_STOPLIST preferences may not have the correct configuration for your language.
The wordlist preference must be set correctly for your language to successfully search XML documents. This is particularly important for Western European languages, where the BASIC_WORDLIST preference must have the correct language set for stemming and fuzzy matching.
The stoplist preference controls which words in the XML documents will not be indexed and, therefore, not available for searching. Stoplists are available for several languages. You can also customize the stoplists if this is appropriate for your situation.
After checking the appropriate values for these settings for your language in the
Oracle Text Reference, create a custom preference and then add it to the XML_IDX_INDEX_TEXT DBTUNE parameter's value.
Setting storage parameters for XML columns
Many different DBTUNE parameters can be used to define the location in which objects associated with an XML column should be stored.
An XML column will always have an XML document table, which stores the XML documents. A text index is always created on all content in these documents to support finding all documents containing a given word. Storage parameters for this table can be defined as follows.
- Storage parameters for the XML document table are defined with the XML_DOC_STORAGE parameter.
- Storage parameters for indexes on the XML document table's columns are provided with the XML_DOC_INDEX parameter.
- XML documents are stored in this table as large objects. Storage parameters for these large objects are defined with the XML_DOC_LOB_STORAGE and XML_DOC_VAL_LOB_STORAGE parameters, which are discussed in more detail in the section 'Configuring LOB storage'.
An XML column may optionally have an XML index table, which stores the information in individual XML elements that can be searched. A text index is always created for this table to support searching descriptive text. Storage parameters for this table can be defined as follows.
- Storage parameters for the XML index table are defined with the XML_IDX_STORAGE parameter.
- Storage parameters for indexes on the XML index table's columns are provided with the XML_IDX_INDEX_PK, XML_IDX_INDEX_ID, XML_IDX_INDEX_TAG, XML_IDX_INDEX_DOUBLE, and XML_IDX_INDEX_STRING parameters.
- Information from the XML documents that is indexed as text is stored as large objects. Storage parameters for these are defined with the XML_IDX_TEXT_TAG_STORAGE parameter, which are discussed in more detail in the section 'Configuring LOB storage'.
Typical storage parameters can be defined for all of the above tables and indexes with the exception of the specialized storage parameters for the large objects. However, storage parameters for an XML column's text indexes must be defined differently.
With the default DBTUNE parameters, four tables are created to manage each text index. An Oracle text preference must be created to define storage parameters for these tables, then the storage preference must be referenced in the XML_IDX_INDEX_TEXT parameter. Both text indexes must use the same storage preference.
The following example creates a text preference named "XML_STORAGE" that places any tables created for an XML column's text indexes into a tablespace named "TXTIDX".
BEGIN
CTX_DDL.CREATE_PREFERENCE('XML_STORAGE', 'BASIC_STORAGE');
CTX_DDL.SET_ATTRIBUTE('XML_STORAGE', 'I_TABLE_CLAUSE',
'TABLESPACE TXTIDX STORAGE (INITIAL 1K)');
CTX_DDL.SET_ATTRIBUTE('XML_STORAGE', 'K_TABLE_CLAUSE',
'TABLESPACE TXTIDX STORAGE (INITIAL 1K)');
CTX_DDL.SET_ATTRIBUTE('XML_STORAGE', 'R_TABLE_CLAUSE',
'TABLESPACE TXTIDX STORAGE (INITIAL 1K) LOB
(DATA) STORE AS (CACHE)');
CTX_DDL.SET_ATTRIBUTE('XML_STORAGE', 'N_TABLE_CLAUSE',
'TABLESPACE TXTIDX STORAGE (INITIAL 1K)');
CTX_DDL.SET_ATTRIBUTE('XML_STORAGE', 'I_INDEX_CLAUSE',
'TABLESPACE TXTIDX STORAGE (INITIAL 1K) COMPRESS 2');
CTX_DDL.SET_ATTRIBUTE('XML_STORAGE', 'P_TABLE_CLAUSE',
'TABLESPACE TXTIDX STORAGE (INITIAL 1K)');
END;
/
To use this storage preference when creating an XML column, add it to the XML_IDX_INDEX_TEXT parameter. For example, if the geodatabase administrator logged in as the user "sde" when they created the storage preference, reference it in the XML_IDX_INDEX_TEXT parameter as follows.
LEXER sde.WORLDLEXER STORAGE sde.XML_STORAGE
Consult the
Oracle Text Reference for more information about creating a custom storage preference. Be sure to always include the storage clauses recommended by Oracle for text indexes.
Configuring LOB storage
XML documents are stored as large objects (LOBs) in the XML document table in the xml_doc and xml_doc_val columns, and in the XML index table in the text_tag column. It is important to configure these columns accurately to achieve the best possible search performance.
Using in-line or out-of-line storageLOBs are stored inline if the LOB data is stored in the same block as the rest of the data in the row. However, in-line storage is only possible if the LOB data is less than 4KB in size. With out-of-line storage, the data is stored in the LOB segment and only the LOB locator is stored with the rest of the data in the row.
By default, Oracle stores LOB data inline. You can specify whether LOB data associated with an XML column is stored inline or out of line using the ArcSDE DBTUNE parameters XML_DOC_LOB_STORAGE and XML_DOC_VAL_LOB_STORAGE for the XML document table and XML_IDX_TEXT_TAG_STORAGE for the XML index table.
When LOB data is stored out of line for an XML column, by default, ArcSDE places that data in the same tablespace as the XML document table. The LOB data may be moved to a different tablespace than the one containing the XML document table by using the XML DBTUNE parameters XML_DOC_STORAGE, XML_DOC_VAL_STORAGE, and XML_IDX_TEXT_TAG_STORAGE.
A typical XML document that contains metadata describing a GIS resource will be greater than 4KB in size. Tests show XML columns associated with ArcIMS Metadata Services perform best when the LOB data is stored out of line in a separate tablespace from the XML document table. Specify out-of-line storage for the XML column by setting these parameters to "DISABLE STORAGE IN ROW". After configuring another tablespace to store the LOB data, add that tablespace's name to the parameter values.
An ArcIMS Metadata Service may contain gazetteer data instead of typical metadata XML documents; this place name information is very small, typically less than 100 bytes in size. XML columns containing small documents will perform best when the LOB data is stored inline. The default DBTUNE keyword IMS_GAZETTEER specifies in-line storage for LOB data by setting the XML_DOC_STORAGE and XML_DOC_VAL_STORAGE parameters to "ENABLE STORAGE IN ROW".
When defining storage for XPath indexes, you should store the data out of line. This is because these indexes can be of varying sizes and, therefore, some might be stored inline and others out of line. Having to search both inline and out of line can slow down the indexes. Only store the data inline if you are certain all data is of the same size and small.
The XML_IDX_TEXT_TAG_STORAGE parameter does not have a default value in the DEFAULTS keyword and is not included in the IMS_GAZETTEER keyword. Specify the following storage information for the XML_IDX_TEXT_TAG_STORAGE parameter in the DBTUNE table so XPath index storage is out of line:
NOCACHE NOLOGGING CHUNK 4K PCTVERSION 5 DISABLE STORAGE IN ROW
When appropriate, use a similar value with "ENABLE STORAGE IN ROW" to store the data inline.
For more information about inline and out of line storage, see Oracle's LOB performance guidelines.
- LOB caching and logging
LOB data may be read and written using the buffer cache. When the LOB data isn't accessed using the buffer cache, it is read and written by directly accessing the data each time. Caching can improve search performance for XML columns for which LOB data is stored out of line, particularly for frequently accessed LOBs. By default, Oracle will not cache LOB data that is stored out of line.
If logging is specified, changes to out-of-line LOB data will be added to the redo logs. Logging is enabled when LOB data is cached. This may decrease performance for publishing XML documents.
With ArcSDE XML columns, LOB data will initially be stored as NOCACHE NOLOGGING. However, XML columns that store LOB data out of line should be set to CACHE LOGGING to optimize search performance. For XML columns associated with an ArcIMS metadata service, both LOB columns in the XML document table will be set to CACHE LOGGING when the aimsmetaindx command is run.
The caching and logging properties won't affect LOB data that is stored inline such as with XML columns that contain gazetteer data.
For more information about caching and logging, see Oracle's LOB performance guidelines.
- Tablespace block size and buffer cache configuration
When out-of-line LOB data is not present in the buffer cache, the block size can affect the amount of time required to read data from disk. You can improve search performance against an XML column if you know the data's distribution, and you set the tablespace and buffer cache to use the correct block size.
For example, if most of the LOB data is less than 8 KB in size and the tablespace block size is 16 KB, more time will be spent reading data from the disk than necessary. In this situation, you would improve performance if you used Oracle's multi-block-size configuration, storing the out-of-line LOB data in a tablespace that was set to use 8 KB blocks.
Setting the tablespace block size must happen in combination with allocating memory for the buffer cache. The size of the buffer cache depends on the block size of the tablespace used to store the out-of-line LOB data. If the tablespace uses an 8 KB block size, memory must be allocated for an 8 KB buffer cache. You must be sure enough system memory has been allocated to support the cache.
Oracle uses a tablespace block size of 16 KB by default. Initially, you should use a tablespace with the default block size and buffer cache settings to store out-of-line LOB data. Then, once XML documents have been published to the XML column, you can analyze the distribution of that data using the xml_lob_block_distribution script that is available in the SDEHOME\tools\oracle directory.
ESRI tests show that typical metadata XML documents published to an XML column will be less than 8 KB in size. To optimize search performance for this type of XML column, you can take advantage of Oracle's multi-block-size configuration to store the out-of-line LOB data in a tablespace with an 8 KB block size, while other tablespaces can continue storing data in 16 KB blocks. First, the Oracle instance must be configured to allow a block size of 8 KB, then you can create a new tablespace that uses the 8 KB block size and 8 KB buffer cache to store the out-of-line LOB data. Modify the ArcSDE DBTUNE XML_DOC_STORAGE, XML_DOC_VAL_STORAGE, and XML_IDX_TEXT_TAG_STORAGE parameters to use the new tablespace and create a new XML column using those parameters. Finally, republish your XML documents to the new XML column.
Ideally, all out-of-line LOB data blocks would be stored in the buffer cache to optimize search performance. You can use the xml_lob_cache_size script in the SDEHOME\tools\oracle directory to estimate the cache size required to store all the LOB blocks. Before running this script, change the block_size variable at the beginning of the script to the actual block size for the tablespace storing the LOB data. Depending on available physical memory, you could adjust the buffer cache to ensure that as many LOB data blocks will be stored in the buffer cache as possible.
See the Oracle documentation for more information about the tablespace db_block_size parameter and the buffer cache.
Updating an XML column's text indexes
By default, when XML documents are stored in an XML column the text indexes associated with the column will not be automatically updated to include text from the new documents. Until the text indexes have been updated, any XML documents that have not been indexed can't be found by a search.
ArcSDE does not provide any utilities that can be used to manually trigger updates to an XML column's text indexes. For XML columns associated with an ArcIMS Metadata Service, ArcIMS provides configuration options and a command-line utility can be used to update the text indexes.
Set the optimizer parameters for Oracle9i
For Oracle9
i, a few optimizer parameters should be set in the database for searches to perform well. Setting these parameters will not adversely affect other client software that accesses your ArcSDE database. You do not need to set these parameters for newer versions of Oracle.
For Oracle9
i, the following parameter must be added to your Oracle server's init.ora file:
OPTIMIZER_MODE=CHOOSE
Only the optimizer mode must be set; the other parameters used with Oracle8
i aren't required.
NOTE: Beginning with ArcSDE 9.2, Oracle 8i is not supported. However, if you are using ArcSDE 9.1 or lower with Oracle 8i, the following parameters must be added to your Oracle server's init.ora file:
OPTIMIZER_INDEX_CACHING = 90
OPTIMIZER_INDEX_COST_ADJ = 10
OPTIMIZER_MODE=CHOOSE
Refer to the Oracle8
i documentation for a complete discussion of these parameters. In general, setting these parameters as specified directs the Oracle8
i optimizer to favor index scans and nested loop execution plans over a full table scan or a sort merge.