More progress on tokenizing/parsing etc.
I now have the XSLT breaking down each text node into a series of components: whitespace (passed through as plain text), punctuation sequences (tagged with <pc>), or word[-fragment]s (tagged with <w>, with much more tagging due in subsequent phases).
My current problem is the requirement to record the offset and length of each word in the original text node, so that a search engine can find its way from the modernized source back to the original text. Length is easy, but offset is proving difficult. I have posted a question on the XSLT list in the hope of some help, but it may be that we have to go in two stages. The first is a pre-process that creates the <ab> element and stores it in a variable. The second is a post-process in which the <ab> element and its contents are re-analyzed and additional tagging is added based on that analysis, before the resulting enriched <ab> is output.
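
A minimal sketch of how that two-stage approach might look, assuming XSLT 2.0 (for xsl:analyze-string and temporary trees). Several details here are assumptions rather than the actual project code: punctuation is taken to be anything matching \p{P}, elements are in no namespace, offsets are zero-based character counts, and the attribute names offset and length are hypothetical.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Copy elements and their attributes through unchanged. -->
  <xsl:template match="*">
    <xsl:copy>
      <xsl:copy-of select="@*"/>
      <xsl:apply-templates/>
    </xsl:copy>
  </xsl:template>

  <!-- Assumption: every text node with non-whitespace content is tokenized. -->
  <xsl:template match="text()[normalize-space()]">

    <!-- Stage 1: tokenize into a temporary <ab> held in a variable. -->
    <xsl:variable name="tokenized">
      <ab>
        <!-- regex is an attribute value template, so braces are doubled.
             Group 1 is whitespace; group 2 is punctuation (assumed \p{P}). -->
        <xsl:analyze-string select="." regex="(\s+)|(\p{{P}}+)">
          <xsl:matching-substring>
            <xsl:choose>
              <xsl:when test="regex-group(1) != ''">
                <!-- Whitespace passes through as plain text. -->
                <xsl:value-of select="."/>
              </xsl:when>
              <xsl:otherwise>
                <pc><xsl:value-of select="."/></pc>
              </xsl:otherwise>
            </xsl:choose>
          </xsl:matching-substring>
          <xsl:non-matching-substring>
            <w><xsl:value-of select="."/></w>
          </xsl:non-matching-substring>
        </xsl:analyze-string>
      </ab>
    </xsl:variable>

    <!-- Stage 2: re-process the temporary tree to add the extra tagging. -->
    <xsl:apply-templates select="$tokenized/ab" mode="enrich"/>
  </xsl:template>

  <xsl:template match="ab" mode="enrich">
    <ab><xsl:apply-templates mode="enrich"/></ab>
  </xsl:template>

  <!-- Offset = total length of everything before this <w> inside the <ab>,
       which equals its position in the original text node. -->
  <xsl:template match="w" mode="enrich">
    <w offset="{string-length(string-join(preceding-sibling::node(), ''))}"
       length="{string-length(.)}">
      <xsl:value-of select="."/>
    </w>
  </xsl:template>

  <xsl:template match="pc" mode="enrich">
    <xsl:copy-of select="."/>
  </xsl:template>

</xsl:stylesheet>
```

Under these assumptions, an input of <p>Hello, world!</p> should come out as <p><ab><w offset="0" length="5">Hello</w><pc>,</pc> <w offset="7" length="5">world</w><pc>!</pc></ab></p>. The offset works in the second stage because the temporary <ab> contains every character of the original text node in order, so the combined length of a word's preceding siblings is its offset. Re-measuring the preceding siblings for every word is quadratic in the number of tokens, which may matter for very long text nodes.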