With MH, LG and CC, did the second interview, which went well; a good hour of material. Then had small project meeting (minutes to go on GitHub).
The search now sorts hits in order of relevance more effectively, prioritizing documents which contain more of the search terms, and within those sets sorting by total number of instances. Tweaked the styling a bit too, and fixed a couple of oddities.
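The ranking logic amounts to something like this sketch (the property names `termsMatched` and `totalHits` are my shorthand here, not necessarily what's in the actual code):

```js
// Sketch of the two-level relevance sort: documents containing more of the
// distinct search terms come first, and within each of those sets we sort
// by the total number of instances of the terms.
function rankResults(results) {
  return results.slice().sort(function (a, b) {
    if (b.termsMatched !== a.termsMatched) {
      return b.termsMatched - a.termsMatched; // more distinct terms wins
    }
    return b.totalHits - a.totalHits;         // then more total hits wins
  });
}
```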
It's mighty simple and it's fast. More features I'm thinking about:
- In addition to the stemmed version of a word, we could also index the unstemmed version; that would give documents which have the unstemmed version a higher score than those which only have a related form, at the cost of probably doubling (or more) the number of JSON files.
- When I rank documents for relevance right now, I'm just counting occurrences of any of the words; we could give extra points to documents which include more of the distinct search terms.
- It would actually be possible to support + and - (must include and must not include) with the current infrastructure; is it worth it?
- It is theoretically possible to build an ngram index in the same way, but of course the number and size of the JSON files would increase hugely. If we're not going to support quoted phrases, we should warn users of that, or perhaps simply strip quotes out before doing the search.
- Should we warn about stopwords, or strip them out? (There's a rough preprocessing sketch for these last two ideas after this list.)
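If we go the stripping route for both quotes and stopwords, the query preprocessing might look something like this (the stopword list here is a tiny illustrative sample, not a real one):

```js
// Rough sketch: strip quotes rather than supporting phrase search, and
// silently drop stopwords before looking anything up in the index.
const STOPWORDS = new Set(['the', 'a', 'an', 'and', 'or', 'of', 'to', 'in']);

function preprocessQuery(query) {
  return query
    .replace(/["\u201C\u201D]/g, '')   // strip straight and curly quotes
    .toLowerCase()
    .split(/\s+/)
    .filter(function (term) {
      return term.length > 0 && !STOPWORDS.has(term);
    });
}

// e.g. preprocessQuery('"the negative capability"') => ['negative', 'capability']
```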
Did some work over the weekend, and a lot more today, and I'm almost there. I had a working version and then I broke it towards the end of the day. The difficult thing was handling chained Promise objects, mixed in with object methods and variable binding, and ensuring that a failed retrieval is treated as fine, because it just means the token doesn't exist in the index. Should get there some time tomorrow, hopefully.
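The core of what I'm aiming for looks roughly like this (the path and file naming are illustrative, not the actual layout):

```js
// Sketch of the tricky bit: retrieve one JSON file per token, and treat a
// failed retrieval as "this token isn't in the index" rather than an error.
function getTokenHits(token) {
  return fetch('idx/' + token + '.json')
    .then(function (response) {
      if (!response.ok) {
        // No file for this token: it simply doesn't occur anywhere.
        return { token: token, documents: [] };
      }
      return response.json();
    })
    .catch(function () {
      // Network or parse failure is treated the same way for now.
      return { token: token, documents: [] };
    });
}

// e.g. Promise.all(tokens.map(getTokenHits)).then(results => { /* unify and display */ });
```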
One of our project goals is to investigate the practicality of a local search "engine" which requires no server-side support: is it possible at all, and if so, how large can a site be before it becomes impractical? Today I did the first half of this work, using the Keats site as a pilot (because it's modern(ish) English, it's of a size which is neither huge nor trivial, and it doesn't have any back-end and probably shouldn't). This is what I've got so far:
- XSLT tokenizes all the content files, duplicating some bits to create simplistic weighting. I attempt to preserve proper names by retaining capitalization for all words which don't appear in a small English word-list (40,000 words).
- A Python Porter stemmer stems all the non-proper-name tokens.
- XSLT amalgamates all the token-counts and their source documents.
- XSLT generates a separate JSON file for each token, containing a list of all documents containing it, and how many hits there are in that document. (A hypothetical example of the structure follows this list.)
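For illustration, one token's JSON file might look something like this (the field names and document ids are my own shorthand, not necessarily the real schema):

```js
// Hypothetical shape of a per-token JSON file: the documents containing the
// token, each with a hit count.
const exampleTokenFile = {
  "token": "autumn",
  "documents": [
    { "id": "ode_to_autumn.html", "hits": 12 },
    { "id": "letters_1819.html", "hits": 3 }
  ]
};
```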
Next, we write a search engine interface in which we use JavaScript to:
- stem each search term (unless it's a proper name). I've found a JS implementation of the Porter stemmer.
- retrieve the JSON file for each of the search terms
- unify them to get hit counts for each individual document (sketched below)
- display (paged?) results
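The unification step might look something like this, assuming the per-token JSON shape illustrated above:

```js
// Sketch of unification: merge the per-term document lists into one map of
// document id => total hit count, then sort by that count for display.
function unifyResults(perTermResults) {
  const hitCounts = new Map();
  perTermResults.forEach(function (result) {
    result.documents.forEach(function (doc) {
      hitCounts.set(doc.id, (hitCounts.get(doc.id) || 0) + doc.hits);
    });
  });
  return Array.from(hitCounts.entries())
    .map(function (pair) { return { id: pair[0], hits: pair[1] }; })
    .sort(function (a, b) { return b.hits - a.hits; });
}
```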
Should be doable in a few hours.
With LG, EC, and MH, did the first Endings interview, and got a good hour's rich material. Made two backup recordings.
Project meeting; wrote up the notes afterwards and added them to the repo.
With JT, working on the article for submission to DSH. I'll submit tomorrow.
During the meeting we brainstormed the plan for the DHSI course, and then JJ and I drafted it, including a day-by-day schedule. It's now with the group for feedback; we'll submit at the end of the week.