The indexing process


This section documents the basic process by which text is extracted from an HTML source file. It may be useful if you are investigating why the Indexer is not indexing certain content, or if you are attempting to make advanced changes to the script.

1. Remove all text between PHP markers ( <? ... ?> )
2. Remove all text between ASP markers ( <% ... %> )
3. Remove all text between Zoom comments ( <!--ZOOMSTOP--> ... <!--ZOOMRESTART--> )
4. Remove all text between HTML comments ( <!-- ... --> )
5. Remove all script sections, such as JavaScript ( <script> ... </script> )
6. Remove all style sections ( <style> ... </style> )
7. Extract the title and the description meta-tag information from the file header.
8. Remove all HTML tags ( <...> )
9. Convert HTML character entities and numerical entities back to plain text.
10. Index all remaining text.
11. Index any additional keywords found in the “ZOOMWORDS” and “Keywords” meta-tags.
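
The steps above can be approximated with a short regular-expression sketch, which may help when working out why a given piece of content does or does not reach the index. This is an illustration only and not the Indexer's actual implementation. The names extract_index_text, _meta_content and STRIP_PATTERNS are hypothetical. The sketch assumes that each meta-tag lists its name attribute before its content attribute, and it reads the keyword meta-tags before step 8 removes the tags that hold them.

```python
import html
import re

FLAGS = re.DOTALL | re.IGNORECASE

# Steps 1-6: regions removed, in the documented order.
STRIP_PATTERNS = [
    r"<\?.*?\?>",                             # 1. PHP markers
    r"<%.*?%>",                               # 2. ASP markers
    r"<!--ZOOMSTOP-->.*?<!--ZOOMRESTART-->",  # 3. Zoom comments
    r"<!--.*?-->",                            # 4. HTML comments
    r"<script\b[^>]*>.*?</script\s*>",        # 5. script sections
    r"<style\b[^>]*>.*?</style\s*>",          # 6. style sections
]


def _meta_content(source, name):
    """Return the content of <meta name="NAME" content="...">, or ""."""
    # Assumption: the name attribute comes before the content attribute.
    pattern = (
        r'<meta\s+name\s*=\s*["\']' + re.escape(name)
        + r'["\']\s+content\s*=\s*(["\'])(.*?)\1'
    )
    match = re.search(pattern, source, FLAGS)
    return html.unescape(match.group(2)) if match else ""


def extract_index_text(source):
    """Approximate the documented text-extraction steps for one HTML page."""
    for pattern in STRIP_PATTERNS:
        source = re.sub(pattern, " ", source, flags=FLAGS)

    # Step 7: title and description from the file header.
    match = re.search(r"<title\b[^>]*>(.*?)</title\s*>", source, FLAGS)
    title = html.unescape(match.group(1)).strip() if match else ""
    description = _meta_content(source, "description")

    # Step 11 input: read the keyword meta-tags while the tags still exist.
    keywords = []
    for name in ("ZOOMWORDS", "keywords"):
        keywords += re.split(r"[,\s]+", _meta_content(source, name))
    keywords = [word for word in keywords if word]

    # Step 8: remove all remaining HTML tags.
    text = re.sub(r"<[^>]*>", " ", source)
    # Step 9: convert entities only after the tags are gone, so that an
    # escaped "&lt;b&gt;" is not mistaken for a real tag.
    text = html.unescape(text)

    # Steps 10-11: the words and keywords that would be passed to the index.
    return {
        "title": title,
        "description": description,
        "words": text.split(),
        "keywords": keywords,
    }
```

In this sketch, text inside a PHP block, a script section or a ZOOMSTOP/ZOOMRESTART pair never appears in the returned words, which mirrors why such content is absent from the index. The title text stays in words because step 8 removes only the tags themselves, not the text between them.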