How Lucene Indexing Works in the Geoportal, Under the Hood
Indexing determines which search results are returned when a user submits search criteria to the Geoportal. When a metadata document is published, certain content from the document is submitted to the search engine for indexing. To support the more advanced features of Lucene, this content is assigned a particular 'meaning', which determines how Lucene indexes the content and how it can be used in searching.
Before a 'meaning' value can be used, it has to be defined in a file called "property-meanings.xml", located in the \\geoportal\WEB-INF\classes\gpt\metadata folder. Lucene references "property-meanings.xml" to index the metadata value for search and retrieval. We strongly suggest using the existing meanings rather than adding new ones; this minimizes the effort of migrating to future versions of the Geoportal extension. The existing meanings should satisfy most search needs.
Assign a meaning to a relevant parameter by setting the meaning attribute on its <parameter> element in the definition.xml files in your \\geoportal\WEB-INF\classes\gpt\metadata folder. An example parameter for indexing keywords is shown below:
<parameter key="keywordinfo" meaning="title">
When a user searches by 'title', Lucene searches its 'title' index for the search term. That index is populated from every metadata parameter in the definition.xml files that is defined with the meaning="title" attribute. Note that after you modify a parameter in a definition.xml file, records that are already published in your Geoportal extension must be re-approved so that they are indexed under the new meaning.
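To check which parameters feed a given meaning, a definition file can be scanned for meaning attributes. The following is a minimal Python sketch, not part of the Geoportal itself. The XML fragment is hypothetical (only the keywordinfo parameter comes from the example above), and real definition.xml files may nest parameters differently or carry additional attributes.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical fragment; only the "keywordinfo" parameter is taken from the text.
DEFINITION_XML = """
<schema>
  <section key="identification">
    <parameter key="keywordinfo" meaning="title"/>
    <parameter key="summary" meaning="abstract"/>
    <parameter key="contact"/>
  </section>
</schema>
"""

def parameters_by_meaning(xml_text):
    """Map each meaning name to the keys of the parameters that declare it."""
    root = ET.fromstring(xml_text)
    result = defaultdict(list)
    for param in root.iter("parameter"):
        meaning = param.get("meaning")
        if meaning:  # skip parameters that declare no meaning
            result[meaning].append(param.get("key"))
    return dict(result)

print(parameters_by_meaning(DEFINITION_XML))
```

A listing like this makes it easier to see which elements a fielded search such as 'title' would draw on before you republish or re-approve records.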
Open the property-meanings.xml file. Notice that there are several "property-meaning" elements. Most have a name, meaningType, valueType, and comparisonType. These attributes are described below.
| Attribute Name | Description |
| --- | --- |
| name | Unique name for the meaning in this file; it should match the meaning="" attribute in the definition.xml file. The name becomes a Lucene field that can be used for advanced searches, as per the Lucene documentation. For example, designating a name of 'title' and then typing "title: water" on your GPT search page will only return items with "water" in the index Lucene has associated with the property-meaning 'title'. |
| meaningType | Used to flag metadata elements that are tied to functionality within the Portal. It is good practice to avoid altering the meaningType of a property-meaning. |
| valueType | Data type of the attribute. Examples are String, Timestamp, and geometry. |
| comparisonType | Indicates how Lucene will analyze the terms in the element for that meaning. Three options are defined in the property-meanings.xml file: term, keyword, and value. |

The three comparisonType options behave as follows:
- term: phrases associated with this attribute are tokenized. For example, if "San Diego" is stored under a meaning that has a comparisonType of 'term', it is stored as two separate words, "San" and "Diego". Terms are also stored in lowercase form, e.g., "san" and "diego".
- keyword: phrases associated with this attribute are not tokenized. For example, if "San Diego" is stored under a meaning that has a comparisonType of 'keyword', it is stored as one phrase. A search for "San" will not return the record; only a search for "San Diego" will. Phrases are also stored in lowercase form, e.g., "san diego".
- value: items associated with this attribute are stored as values, not phrases or words. Items are case-sensitive. An example would be the fileIdentifier meaning. Parameters with a meaning="fileIdentifier" likely hold unique identification strings, such as "{F56408D6-4325-484C-B753-5E8FD4421E31}". Searching for part of the string, such as "E31", will not retrieve the record because the string is stored as a complete value and not parsed. Searching for the string "{f56408d6-4325-484c-b753-5e8fd4421e31}" will also not return the record because the stored value is case-sensitive.
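The difference between the three comparisonType options can be illustrated with a small Python model. This is a simplified sketch under stated assumptions, not Lucene's actual analysis chain: it splits on whitespace only, and the function names index_tokens and matches are invented for illustration.

```python
def index_tokens(text, comparison_type):
    """Return the tokens a value would be stored as in this simplified model."""
    if comparison_type == "term":
        return text.lower().split()   # tokenized and lowercased
    if comparison_type == "keyword":
        return [text.lower()]         # one lowercased phrase
    if comparison_type == "value":
        return [text]                 # stored exactly as given, case-sensitive
    raise ValueError(f"unknown comparisonType: {comparison_type}")

def matches(query, stored_text, comparison_type):
    """True if the query would match the stored text in this simplified model."""
    tokens = index_tokens(stored_text, comparison_type)
    if comparison_type == "value":
        return query in tokens                 # exact, case-sensitive match
    if comparison_type == "keyword":
        return query.lower() in tokens         # whole phrase, case-insensitive
    # term: every word of the query must be among the stored tokens
    return all(word in tokens for word in query.lower().split())

print(index_tokens("San Diego", "term"))
print(index_tokens("San Diego", "keyword"))
print(matches("San", "San Diego", "term"))
print(matches("San", "San Diego", "keyword"))
```

In this model a search for "San" matches the 'term' field but not the 'keyword' field, and a 'value' field only matches the complete string in its original case, which mirrors the behavior described above.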
Some property-meanings have one or two additional tags, <dc> and <consider>.
- The <dc> tag stands for "Dublin Core". If a meaning in the property-meanings.xml file corresponds to elements in the Dublin Core metadata standard, it is indicated in this tag.
- The <consider> tag defines other property-meanings that should be included when a search for that attribute is conducted through CS-W. For example, the property-meaning for anytext lists four other property-meanings in its <consider> tag, so when a CS-W request searches for 'anytext', elements with the title, abstract, keywords, and body meaning attributes are searched.
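The <consider> behavior described above can be pictured as a simple lookup. The Python sketch below is only an illustration: the CONSIDER dictionary and the fields_to_search function are hypothetical names, the four meanings listed for anytext are the ones named in the text, and the fallback for meanings without a <consider> tag is an assumption.

```python
# Hypothetical mapping: meaning name -> meanings named in its <consider> tag.
CONSIDER = {
    "anytext": ["title", "abstract", "keywords", "body"],
}

def fields_to_search(meaning):
    """Return the meanings searched when a CS-W request targets `meaning`.

    Assumption: a meaning with no <consider> entry is searched on its own.
    """
    return CONSIDER.get(meaning, [meaning])

print(fields_to_search("anytext"))
print(fields_to_search("title"))
```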
Note: The anytext meaning is a special case. If you set the meaning of a parameter to anytext, that parameter will be indexed as a body meaning. A search for anytext:searchTerm (where searchTerm is the word or phrase for which you are searching) will not retrieve results. However, body:searchTerm will retrieve results with search phrases in the elements where the meaning is set to anytext. The anytext index itself is reserved for CS-W alone.
Working Example
In this example we will alter the form for ISO 19115/19139 Datasets to make the element called "data type" searchable from the basic search field.
- Open the iso19139-coregeog-definition.xml file from the \\geoportal\WEB-INF\classes\gpt\metadata folder in a text editor.
- Find where Data Type is defined:
- Add a "meaning" attribute to this parameter. Because we are not mapping data type to a property-meaning that should only have one value per document (such as abstract, title, or fileIdentifier), let's set the meaning to "body".
- Save the iso19139-coregeog-definition.xml file. Restart Tomcat for the changes to take effect. Conceptually, any ISO 19139 dataset document published to the Geoportal will now have its data type value searchable from the search page.
- Now, find in the iso19139-coregeog-definition.xml file where the abstract parameter is defined. Notice that it has a meaning attribute set to meaning="abstract". This means that if a user types "abstract:whateverPhrase" in the search field on the search page, the Geoportal will search all elements with a meaning of 'abstract' for the phrase 'whateverPhrase' and return matching records as search results.
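The edit in the steps above is normally made by hand in a text editor. For readers who want to script or verify it, the following Python sketch shows the same change on a stand-in fragment. The key "dataType" and the surrounding <section> element are hypothetical; the real parameter in iso19139-coregeog-definition.xml may use a different key and contain child elements.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the Data Type parameter definition.
FRAGMENT = '<section><parameter key="dataType"/></section>'

def set_meaning(xml_text, key, meaning):
    """Set the meaning attribute on every parameter whose key matches."""
    root = ET.fromstring(xml_text)
    for param in root.iter("parameter"):
        if param.get("key") == key:
            param.set("meaning", meaning)
    return ET.tostring(root, encoding="unicode")

updated = set_meaning(FRAGMENT, "dataType", "body")
print(updated)
```

Note that round-tripping a file through ElementTree discards XML comments and can change formatting, so on a real definition file a manual edit, as described in the steps, is usually the safer choice.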
Note: The Geoportal can be customized so that it automatically indexes all metadata content, regardless of which parameter it is associated with in the metadata. To enable this customization, see Index All Metadata Content Overview.