I continued to spend a lot of my time working on the Speak For Yersel project this week. We had a team meeting on Monday at which we discussed the outstanding tasks and particularly how I was going to tackle converting the quiz questions into dynamic answers. Previously the quiz question answers were static, which will not work well as the maps the users will reference in order to answer a question are dynamic, meaning the correct answer may evolve over time. I had proposed a couple of methods that we could use to ensure that the answers were dynamically generated based on the currently available data and we finalised our approach today.
Although I’d already made quite a bit of progress with my previous test scripts, there was still a lot to do to actually update the site. I needed to update the structure of the database, the script that outputs the data for use in the site, the scripts that handle the display of questions and the evaluation of answers, and the scripts that store a user’s selected answers.
Changes to the database allow for dynamic quiz questions to be stored (non-dynamic ones have fixed ‘answer options’ but dynamic ones don’t). Changes also allow for references to the relevant answer option of the survey question the quiz question is about to be stored (e.g. that the quiz is about the ‘mother’ map and specifically about the use of ‘mam’). I made significant updates to the script that outputs data for use in the site to integrate the functions from my earlier test script that calculated the correct answer. I updated these functions to change the logic somewhat. They now only use ‘method 1’ as mentioned in an earlier post. This method also now has a built-in check to filter out regions that have the highest percentage of usage but only a limited amount of data. Currently this is set to a minimum of 10 answers for the option in question (e.g. ‘mam’) rather than total number of answers in a region. Regions are ordered by their percentage usage (highest first) and the script iterates down through the regions and will pick as ‘correct’ the first one that has at least 10 answers. I’ve also added in a contingency in cases where none of the regions have at least 10 answers (currently the case for the ‘rocket’ question). In such cases the region marked as ‘correct’ will be the one that has the highest raw count of answers for the answer option rather than the highest percentage.
With the ‘correct’ region picked out, the script then picks out all other regions where the usage percentage is at least 10% lower than the correct percentage. This is to ensure that there isn’t an ‘incorrect’ answer that is too similar to the ‘correct’ one. If this results in fewer than three regions (as regions are only returned if they have clicks for the answer option) then the system goes through the remaining regions and adds these in with a zero percentage. These ‘incorrect’ regions are then shuffled and three are picked out at random. The ‘correct’ answer is then added to these three and the options are shuffled again to ensure the ‘correct’ option is randomly positioned. The dynamically generated output is then plugged into the output script that the website uses.
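The selection logic described above could be sketched roughly as follows. This is only an illustration in Python, not the site's actual script: the function name, the data shape (region mapped to a count for the option and a total for the region) and the reading of ‘10% lower’ as ten percentage points are all assumptions.

```python
import random

MIN_ANSWERS = 10  # minimum answers for the option (e.g. 'mam') in a region
MIN_GAP = 10.0    # assumed to mean ten percentage points below the correct region


def pick_quiz_regions(usage, all_regions, rng=random):
    """usage maps region -> (answers_for_option, total_answers_in_region).

    Hypothetical helper: assumes at least one region has answers for the option.
    """
    # Percentage usage for every region that has clicks for the option.
    pct = {r: 100.0 * c / t for r, (c, t) in usage.items() if c > 0 and t > 0}
    ranked = sorted(pct, key=pct.get, reverse=True)

    # Walk down the ranking and take the first region with enough answers.
    correct = next((r for r in ranked if usage[r][0] >= MIN_ANSWERS), None)
    if correct is None:
        # Contingency: no region has enough answers, so use the highest raw count.
        correct = max(pct, key=lambda r: usage[r][0])

    # Distractors must sit clearly below the correct region's percentage.
    wrong = [r for r in ranked if r != correct and pct[r] <= pct[correct] - MIN_GAP]
    if len(wrong) < 3:
        # Pad with regions that have no clicks for the option (zero percentage).
        wrong += [r for r in all_regions if r not in pct and r != correct]

    rng.shuffle(wrong)
    options = wrong[:3] + [correct]
    rng.shuffle(options)  # so the correct option is randomly positioned
    return correct, options
```

With invented figures such as Glasgow at 30 of 40 answers and Lothian at 3 of 3, Lothian tops the ranking at 100% but is skipped for having fewer than 10 answers, so Glasgow is picked as ‘correct’ and Lothian is also excluded as a distractor because it is not at least ten points lower.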
I then updated the front-end to work with this new data. This also required me to create a new database table to hold the user’s answers, storing the region the user presses on and whether their selection was correct, along with the question ID and the person ID. Non-dynamic answers store the ID of the ‘answer option’ that the user selected, but these dynamic questions don’t have static ‘answer options’ so the structure needed to be different.
I then implemented the dynamic answers for the ‘most of Scotland’ questions. For these questions the script needs to evaluate whether a form is used throughout Scotland or not. The algorithm gets all of the answer options for the survey question (e.g. ‘crying’ and ‘greetin’) and for each region works out the percentage of responses for each option. The team had previously suggested a fixed percentage threshold of 60%, but I reckoned it might be better for the threshold to change depending on how many answer options there are. Currently I’ve set the threshold to be 100 divided by the number of options. So where there are two options the threshold is 50%. Where there are four options (e.g. the ‘wean’ question) the threshold is 25% (i.e. if 25% or more of the answers in a region are for ‘wean’ it is classed as present in the region). Where there are three options (e.g. ‘clap’) the threshold is 33%. Where there are 5 options (e.g. ‘clarty’) the threshold is 20%.
The algorithm counts the number of regions that meet the threshold, and if the number is 8 or more then the term is considered to be found throughout Scotland and ‘Yes’ is the correct answer. If not then ‘No’ is the correct answer. I also had to update the way answers are stored in the database so these yes/no answers can be saved (as they have no associated region like the other questions).
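As a rough illustration of the ‘most of Scotland’ check, here is a short Python sketch. The function name and data shape are assumptions, and it uses the exact value of 100 divided by the number of options (so 33.33...% rather than a rounded 33% for three options).

```python
def used_across_scotland(region_counts, option, options, regions_needed=8):
    """region_counts maps region -> {answer_option: number_of_answers}.

    Hypothetical helper: returns 'Yes' if the option meets the threshold
    in at least `regions_needed` regions, otherwise 'No'.
    """
    # The threshold depends on how many answer options the question has.
    threshold = 100.0 / len(options)
    present = 0
    for counts in region_counts.values():
        total = sum(counts.get(o, 0) for o in options)
        # A region with no answers cannot meet the threshold.
        if total and 100.0 * counts.get(option, 0) / total >= threshold:
            present += 1
    return "Yes" if present >= regions_needed else "No"
```

For a two-option question where nine regions each have 6 answers for ‘greetin’ and 4 for ‘crying’, this returns ‘Yes’ for ‘greetin’ (60% against a 50% threshold) and ‘No’ for ‘crying’.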
I then moved onto tackling the non-standard (in terms of structure) questions to ensure they are dynamically generated as well. These were rather tricky to do as they each had to be handled differently as they were asking different things of the data (e.g. a question like ‘What are you likely to call the evening meal if you live in Tayside and Angus (Dundee) and didn’t go to Uni?’). I also made the ‘Sounds about right’ quiz dynamic.
I then moved onto tackling the ‘I would never say that’ quiz, which has been somewhat tricky to get working as the structure of the survey questions and answers is very different. Quizzes for the other surveys involved looking at a specific answer option but for this survey the answer options are different rating levels that each need to be processed and handled differently.
For this quiz, for each region the system returns the number of times each rating level has been selected and works out the percentages for each. It then adds the ‘I’ve never heard this’ and ‘people elsewhere say this’ percentages together as a ‘no’ percentage and adds the ‘people around me say this’ and ‘I’d say this myself’ percentages together as a ‘yes’ percentage. Currently there is no weighting but we may want to consider this (e.g. ‘I’d say this’ would be worth more than ‘people around me’).
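A minimal Python sketch of this unweighted ‘yes’/‘no’ calculation for a single region might look like the following. The function name and the dictionary keys are illustrative assumptions, not the identifiers used in the real database.

```python
# Rating levels grouped as described above (labels used as illustrative keys).
YES_LEVELS = ("people around me say this", "I'd say this myself")
NO_LEVELS = ("I've never heard this", "people elsewhere say this")


def yes_no_percentages(level_counts):
    """level_counts maps rating level -> times selected in one region."""
    total = sum(level_counts.values())
    if total == 0:
        return None  # no data for this region
    pct = {level: 100.0 * n / total for level, n in level_counts.items()}
    # No weighting: every rating level counts equally towards its group.
    return {
        "yes": sum(pct.get(level, 0.0) for level in YES_LEVELS),
        "no": sum(pct.get(level, 0.0) for level in NO_LEVELS),
        "total": total,
    }
```

Adding a weighting later would only mean multiplying each level's percentage by a factor before summing.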
With these ratings stored, the script handles each question type differently. For the ‘select a region’ type of question the system works in a similar way to the other quizzes: it sorts the regions by ‘yes’ percentage with the biggest first. It then iterates through the regions and picks as the correct answer the first it comes to where the total number of responses for the region is the same as or greater than the minimum allowed (currently set to 10). Note that this is different to the other quizzes, where this check for 10 is made against the specific answer option rather than the number of responses in the region as a whole.
If no region passes the above check then the region with the highest ‘yes’ percentage is chosen as the correct answer, without any minimum check. The system then picks out all other regions with data where the ‘yes’ percentage is at least 10% lower than the correct answer, adds in regions with no data if fewer than three have data, shuffles the regions and picks out three. These are then added to the ‘correct’ region and the answers are shuffled again.
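The part of this that differs from the other quizzes is how the ‘correct’ region is chosen, which could be sketched in Python like this. The function name and input shape are assumptions; the input is taken to hold only regions that have data.

```python
MIN_RESPONSES = 10  # checked against all responses in the region


def pick_correct_region(ratings):
    """ratings maps region -> (yes_percentage, total_responses)."""
    # Highest 'yes' percentage first.
    ranked = sorted(ratings, key=lambda r: ratings[r][0], reverse=True)
    for region in ranked:
        if ratings[region][1] >= MIN_RESPONSES:
            return region
    # Fallback: no region has enough responses, so take the top percentage.
    return ranked[0] if ranked else None
```

With invented figures, a region at 100% from only 2 responses is passed over in favour of one at 90% from 25 responses.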
I changed the questions that had an ‘all over Scotland’ answer option so that these are now ‘yes/no’ questions, e.g. ‘Is ‘Are you wanting to come with me?’ heard throughout most of Scotland?’. For these questions the system uses 8 regions as the threshold, as with the other quizzes. However, the percentage threshold for ‘yes’ is fixed. I’ve currently set this to 60% (i.e. at least 60% of all answers in a region are either ‘people around me say this’ or ‘I’d say this myself’). There is currently no minimum number of responses limit for this question type, so a region with a single answer that’s ‘people around me say this’ will have a 100% ‘yes’ and the region will be included. This is also the case for the ‘most of Scotland’ questions in the other quizzes, and we may need to tweak this.
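To make the single-answer caveat concrete, here is a small Python sketch of the yes/no check with the fixed 60% threshold. The function name, data shape and figures are invented for illustration.

```python
YES_THRESHOLD = 60.0  # fixed percentage threshold for this quiz
REGIONS_NEEDED = 8


def heard_throughout_scotland(region_ratings):
    """region_ratings maps region -> (yes_count, total_count)."""
    qualifying = [
        region
        for region, (yes, total) in region_ratings.items()
        # No minimum number of responses: one 'yes' answer gives 100%.
        if total > 0 and 100.0 * yes / total >= YES_THRESHOLD
    ]
    answer = "Yes" if len(qualifying) >= REGIONS_NEEDED else "No"
    return answer, qualifying
```

If seven regions each have 12 ‘yes’ ratings out of 20, the answer is ‘No’, but adding an eighth region with a single ‘yes’ rating tips it to ‘Yes’.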
As we’re using percentages rather than exact number of dots the questions can sometimes be a bit tricky. For example the first question currently has Glasgow as the correct answer because all but two of the markers in this region are ‘people around me say this’ or ‘I’d say this myself’. But if you turn off the other two categories and just look at the number of dots you might surmise that the North East is the correct answer as there are more dots there, even though proportionally fewer of them are the high ratings. I don’t know if we can make it clearer that we’re asking which region has proportionally more higher ratings without confusing people further, though.
I also spent some time this week working on the Book and Borrowing project. I had to make a few tweaks to the Chambers map of borrowers to make the map work better on smaller screens. I ensured that both the ‘Map options’ section on the left and the ‘map legend’ on the right are given a fixed height that is shorter than the map and the areas become scrollable, as I’d noticed that on short screens both these areas could end up longer than the map and therefore their lower parts were inaccessible. I’ve also added a ‘show/hide’ button to the map legend, enabling people to hide the area if it obscures their view of the map.
I also sent on some renamed library register files from St Andrews to Gerry for him to align with existing pages in the CMS, replaced some of the page images for the Dumfries register and renamed and uploaded images for a further St Andrews register that already existed in the CMS, ensuring the images became associated with the existing pages.
I started to work on the images for another St Andrews register that already exists in the system, but for this one the images are a double page spread so I need to merge two pages into one in the CMS. The script needs to find all odd numbered pages then move the records on these to the preceding even numbered page, and at the same time regenerate the ‘page order’ for each record so they follow on from the existing records. Then the even page needs its folio number updated to add in the odd number (e.g. so folio number 2 becomes ‘2-3’). Then I need to delete the odd page record and after all that is done I need to regenerate the ‘next’ and ‘previous’ page links for all pages. I completed everything except the final task, but I really need to test the script out on a version of the database running on my local PC first, as if anything goes wrong data could very easily be lost. I’ll need to tackle this next week as I ran out of time this week.
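The merge steps can be tried out safely on in-memory copies before touching any database. The Python sketch below does this with plain dictionaries; the field names (‘folio’, ‘records’, ‘page_order’, ‘previous’, ‘next’) are hypothetical and do not reflect the CMS's real schema.

```python
def merge_spreads(pages):
    """pages: list of {"folio": int, "records": [{"id": ..., "page_order": int}]}.

    Returns new page dicts and leaves the input untouched.
    """
    by_folio = {p["folio"]: p for p in pages}
    merged = []
    for page in sorted(pages, key=lambda p: p["folio"]):
        n = page["folio"]
        if n % 2 == 1 and n - 1 in by_folio:
            continue  # odd page: folded into the preceding even page
        records = [dict(r) for r in page["records"]]
        label = str(n)
        odd = by_folio.get(n + 1) if n % 2 == 0 else None
        if odd is not None:
            # Continue the page order on from the even page's existing records.
            start = max((r["page_order"] for r in records), default=0)
            moved = sorted(odd["records"], key=lambda r: r["page_order"])
            for offset, r in enumerate(moved, start=1):
                records.append({**r, "page_order": start + offset})
            label = f"{n}-{n + 1}"  # e.g. folio 2 becomes '2-3'
        merged.append({"folio": label, "records": records})
    # Regenerate the previous/next links across the merged pages.
    for i, p in enumerate(merged):
        p["previous"] = merged[i - 1]["folio"] if i > 0 else None
        p["next"] = merged[i + 1]["folio"] if i < len(merged) - 1 else None
    return merged
```

Because the function builds new dictionaries rather than editing in place, the original data is still available for comparison if the result looks wrong.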
I also participated in our six-monthly formal review meeting for the Dictionaries of the Scots Language where we discussed our achievements in the past six months and our plans for the next. I also made some tweaks to the DSL website, such as splitting up the ‘Abbreviations and symbols’ buttons into two separate links, updating the text found on a couple of the old maps pages and considering future changes to the bibliography XSLT to allow links in the ‘oral sources’
Finally this week I made a start on the Burns manuscript database for Craig Lamont. I wrote a script that extracts the data from Craig’s spreadsheet and imports it into an online database. We will be able to rerun this whenever I’m given a new version of the spreadsheet. I then created an initial version of a front-end for the database within the layout for the Burns Correspondence and Poetry site. Currently the front-end only displays the data in one table with columns for type, date, content, physical properties, additional notes and locations. The latter contains the location name, shelfmark (if applicable) and condition (if applicable) for all locations associated with a record, each on a separate line with the location name in bold. Currently it’s possible to order the columns by clicking on them. Clicking a second time reverses the order. I haven’t had a chance to create any search or filter options yet but I’m intending to continue with this next week.