More progress on tokenizing/parsing etc.
Posted by mholmes on 16 Mar 2012 in Activity log
I've had to resort to a second pass through the data to count offsets, and that's now working reliably. I've also got the reconstitution of hyphenated words at linebreaks working, but only most of the time; for some reason, when the linebreak precedes a <fw> element, the reconstitution fails. I'm still working on that, but it's very mysterious. I'll probably have to create some test data rather than working on real files until I get it sorted out.
All in all, though, very promising progress.
This entry was posted by Martin and filed under Activity log.