Before settling on the project to implement a full accessibility cache in Firefox, I investigated several other alternatives.
In my last post, I discussed asynchronous accessibility APIs.
Another alternative I considered is switching to UI Automation and having the UIA tree accessed directly in web content processes, rather than communicating via the main UI process.
While this is not possible with IAccessible2 because of browser sandboxes, it could theoretically be possible with UIA because it has an intermediary Windows component called UIAutomationCore which sits between the client and the server.
Because UIAutomationCore is an operating system component, there are fewer concerns about allowing it in the browser sandbox.
Unfortunately, this proved to be infeasible then and I think it is still infeasible now:
- First and foremost, UIA would need to be enhanced to support this.
- UIA calls would need to arrive on a background worker thread.
Providers can specify ProviderOptions_UseComThreading to ensure that calls come in on the correct COM thread, which both Firefox and Chromium do in the main process so that calls hit the main (STA) thread.
However, Firefox (and Chromium) content processes are sandboxed and do not have a UI.
In Win32 terminology, they don’t have an HWND, and in fact the sandbox blocks Win32k calls altogether these days.
That means that there is no COM STA in the main thread, so all UIA calls have to be handled from the background worker thread.
- That is problematic for both browsers because accessibility needs to be called on the main thread.
Proxying individual calls from the UIA thread to the main thread isn’t workable because UIA might be handling a bulk request such as a tree search or caching request which makes many calls to the provider in quick succession.
To fix this, UIA would need to introduce some mechanism to allow a provider to marshal and run the entire batch in a different context and then marshal the result back to UIACore.
Such a mechanism does exist internally to some extent in order to support ProviderOptions_UseComThreading, but this would need to be extended further.
- While it’s reasonable for users to see tabs as entirely separate, users perceive iframes as just another part of the content.
The iframe isn’t a separate document as far as the user is concerned, even though it is very separate in the web stack.
With site isolation (where iframes are rendered in separate processes), that is extremely problematic for UIA because each iframe is sandboxed in its own process.
While a UIA client can happily unify the tree structure across processes, there are many things it cannot handle across processes.
For example, the text pattern has no mechanism for walking across process boundaries, and remote operations are also explicitly constrained to a single process.
That means that assistive technology products would have to treat iframes as entirely separate entities, which gets really messy (if not impossible) for things like quick navigation or text searching.
- Assuming we could solve all of the above, there’s still the problem that UIA isn’t really suitable for the web.
That would all need to be fixed in order for AT to seriously consider switching to UIA, which it would need to do in order for browsers to even consider dropping IA2.
- Even if we could get NVDA to do this - and it does already have some usable support for UIA on the web - we would need to convince other AT vendors to do this and wait for it to be done.
I think the chances of convincing other vendors are extremely low.
- Even if we wanted to switch to this approach now, we still couldn’t drop the full accessibility cache unless we could get other platforms to adopt an equivalent mechanism.
Mac and ATK both have mechanisms for cross-process accessibility embedding, but I believe they (Mac at least) suffer from the same problems regarding isolated iframes; see the point about iframes above.
Accessibility API queries are generally synchronous: each query blocks both the client and the server until the query completes and returns its result.
This causes significant challenges for modern, multi-process web browsers, requiring them to cache the accessibility trees from all other processes in the main UI process.
This raises the question: why not make accessibility APIs asynchronous?
As usual, there is a theoretical, in-principle answer and a pragmatic answer.
Let’s start with theory.
For a browser, synchronous accessibility APIs are a real problem.
So much of the complexity in multi-process browser accessibility architecture comes back to the need to support synchronous APIs.
If they were async, we might not need to maintain a cache in the main process at all, which would save us a lot of pain, complexity and performance problems.
From that perspective, async APIs would be great.
Async APIs would likely also help assistive technology products to avoid hangs due to queries taking a long time or apps which stop responding, which is a real source of pain for users.
But here’s where pragmatism comes in.
Between 2017 and early 2023, Firefox didn’t have a full cache in the parent process.
In Windows, we used a lot of obscure COM magic to facilitate a partial cache.
It worked better than I ever thought it would, but it still suffered from serious performance and stability problems.
Ultimately, we took the same route as Chromium and implemented a full cache.
But before that, I considered a lot of other options, including looking into the possibility of creating an entirely new async API.
In the end, I gave up on all other options, as much as it pained me.
I came to understand that it would just require too much change across the entire industry to pull this off:
- Getting this to work in Windows alone would require significant new code in all browsers, all assistive technology products (including commercial ones) and likely even Windows itself.
- It would fundamentally change the way just about everything works in AT, since a lot of AT code is rather centred around the idea of sync queries.
- Worse, all of the software would still need to support the old APIs while the new ones developed and became adopted, so we’d all be trying to support sync and async paradigms simultaneously, as if supporting async by itself wasn’t hard enough.
- Commercial AT users often don’t update for years because the software is so expensive, so that would mean a further massive delay to reasonable adoption.
- For providers other than browsers, there isn’t really an incentive to adopt this.
For most providers, sync APIs work just fine.
Browsers are somewhat unique in their multi-process complexity.
- Even if we got it working in Windows, we’d still have to work hard to get other operating system vendors (Apple, Google, Linux folks, etc.) on board, and even if they agreed, they would have to get around to doing that work.
- We couldn’t get rid of any of the complex architecture in browsers until everyone had made the switch.
- Even assuming all of that happened, it could take many years.
- Meanwhile, as far as I’m aware, accessibility everywhere is significantly under-resourced as it is and we have so, so many problems to solve that impact real users every day.
A massive effort like this would make that other work even harder to fit in, even if it made some of it slightly easier.
None of this is to suggest that I am fundamentally against the idea of async APIs.
I’d potentially even be happy to be involved in such a project and perhaps we can try to make progress in small ways.
But at this point, I’m skeptical about the return on investment.
UI Automation (UIA) is Microsoft’s recommended accessibility framework for Windows, replacing the earlier Microsoft Active Accessibility (MSAA) framework.
Despite this, to access web content, screen readers such as NVDA and JAWS continue to use IAccessible2, an open source API based on MSAA.
This is not just because of the significant effort involved in switching to a different API, although that is certainly a factor.
More importantly, it is because UIA is currently insufficient as an accessibility API for the web.
Here are some of the reasons:
- UIA relies on LocalizedControlType to distinguish many controls, rather than adding more proper semantic ControlTypes.
This means that aside from using a different message to report the control, clients can’t reliably identify the types of these controls in order to choose different behaviour for them.
- This also extends to landmark types that UIA doesn’t support.
For these, LandmarkType is specified as custom and the client can only use LocalizedLandmarkType to distinguish them.
This conflation of semantics extends to properties like aria-errormessage.
In UIA, aria-errormessage gets mapped to ControllerFor, with no indication that this is an error message.
- The LabeledBy property only allows the provider to expose a single static text target.
This is problematic on the web, where an element can be labelled by multiple targets which may not even contain static text or may contain multiple static text nodes.
- There is a single live region changed event.
This doesn’t provide sufficient granularity to determine what text actually changed in the live region, which is problematic for non-atomic live regions.
While the eventual goal is to replace live regions with ariaNotification, live regions are still going to be around for a while.
- Ideally, ARIA would map cleanly to each OS API, appearing just like a native application, with the API being enhanced as new concepts are needed.
Instead, UIA has properties like AriaProperties and AriaRole, which smush all of the ARIA stuff that doesn’t fit UIA properly into a separate world.
I understand the need for pragmatism - this is probably never going to be 100% clean - but some of this feels like a bolted-on solution, treating the web as a second class citizen, rather than trying to harmoniously integrate it into the rest of the ecosystem.
As an example, being able to determine whether a text box is multi line or single line feels like something that should be possible everywhere, but UIA instead shoves it into AriaProperties.
- Historically, Microsoft has been unwilling (or at least very reluctant) to consider changes to UIA to improve this situation.
For example, for many years, UIA didn’t provide any way to identify a dialog or to retrieve group position information, despite me requesting this multiple times right back to 2009.
These two deficiencies were eventually addressed years later with the addition of the IsDialog, PositionInSet and SetSize properties.
Even so, Microsoft is going to need to be a lot more open (or at least faster to respond) to improvements like these if UIA is to become fully sufficient for the web, especially given the rapid pace of web technology.
All of this is rather unfortunate because UIA has a relatively clean design and has the potential to provide significant advantages for accessibility in performance, security, etc.
Sometimes, it is necessary to move a row up or down in a Google Sheet.
For example, I might be maintaining a list of tasks ordered from highest to lowest priority and realise that a task later in the sheet is actually higher priority than earlier tasks.
This Google support article says you can do this by selecting the row and then choosing Move row up (or down) from the Edit menu.
However, when I selected the cells in the row, these Move row items still didn’t appear in the Edit menu.
The solution is to press shift+space twice to select the row, after which the Move sub-menu will appear in the Edit menu, from which you can choose Row up or Row down.
Shift+space is the keyboard shortcut to select a row, but the first press only selects the cells in the row.
To select the row itself, which you must do in order to move it, you need to press it again.
Putting it all together, the entire keyboard sequence to move a row up in Firefox on Windows is shift+space, shift+space, alt+shift+e, m, k.
The keyboard shortcuts may be slightly different in other browsers or on Mac.
Moving columns is similar, except that you press control+space twice to select a column.
I couldn’t find this documented or discussed anywhere, so I thought it worth posting here in the hopes that someone else might find it useful.
Thanks to Josh Grams on Mastodon for figuring this out.
Everyone loves paying bills, right?
One of the best parts about paying bills is surely that they usually provide all the amounts, billing codes, reference numbers, etc. with punctuation and spaces; e.g. amount $1,234, reference code 123 456-789.
But banking and payment portals - sometimes even the payment portal for the organisation issuing the bill! - won’t accept spaces and punctuation!
This means you can’t simply copy and paste the numbers, which is really annoying.
Wouldn’t it be awesome if there were some easy, automatic way to strip out everything except numbers and the decimal point before pasting?
I recently started using Ditto, an absolutely superb clipboard manager for Windows.
It turns out you can do this with Ditto using a paste script.
I originally thought that paste scripts always ran, which I wouldn’t want in this case, but it turns out that they actually appear as items in the Special Paste menu.
To set this up, go to Ditto -> Options -> General -> Advanced -> On Paste Scripts and add this script:
clip.AsciiTextReplaceRegex("[^\\d.]", "");
return false;
The name you specify for the script is what appears in the Special Paste menu.
You can even assign a shortcut key to that Special Paste item.
Then, you just have to copy the problematic number to the clipboard, open Ditto, go to the item you want to paste (which will be the top item if you just copied it) and press the shortcut key (or use the Special Paste menu).
Done!
Now you have 123456789 instead of 123 456-789, and your payment portal (and you) are much happier.
The Firefox accessibility engine is responsible for providing assistive technologies like screen readers with the information they need to access web page content.
For the past couple of years, the Firefox accessibility team have been working on a major re-architecture of the accessibility engine to significantly improve its speed, reliability and maintainability.
We call this project “Cache the World”.
In this post, I explain the reasons for such a massive undertaking and describe how the new architecture solves these issues.
The need for speed
The biggest motivation for this project is to make Firefox faster when used with screen readers and other assistive technologies, particularly on Windows.
Let’s start by taking a look at some numbers.
The table below provides the approximate time taken to perform various tasks with Firefox and the NVDA screen reader, both before and after this re-architecture.
| Task | Before (no cache) | After (with cache) |
| --- | --- | --- |
| Load nsCSSFrameConstructor.cpp on Searchfox, which contains a table with over 12000 rows | 128 sec | 6 sec |
| Load the WHATWG HTML spec, a very large document | 175 sec | 15 sec |
| Open a Gmail message from the inbox | 200 ms | 100 ms |
| Close a Gmail message, returning to the inbox | 410 ms | 150 ms |
| Switch Slack channels | 620 ms | 330 ms |
These times will differ widely depending on the computer used, whether the page has been loaded before, network speed, etc.
However, the relative comparison should give you some idea of the performance improvements provided by the new architecture.
So, why were things so slow in the first place?
To understand that, we must first take a little trip through browser history.
Note that I’ve glossed over some minor details below for the sake of brevity and simplicity.
In the beginning
Once upon a time, browsers were much simpler than they are now.
The browser was a single operating system process.
Even if there were multiple tabs or documents with iframes, everything still happened within a single process.
This worked reasonably well for assistive technologies, which use the accessibility tree to get information about the user interface and web content.
Operating system accessibility APIs were already being used to expose and query accessibility trees in other applications.
Although these APIs had to be extended somewhat to expose the rich semantics and complex structure of web content, browsers used them in fundamentally the same way as any other application: a single accessibility tree exposed from a single process.
Assistive technologies sometimes need to make large numbers of queries to perform a task; e.g. locating the next heading on a page.
However, making many queries across processes can become very slow due to the overhead of context switching, copying and serialising data, etc.
To make this faster, some assistive technologies and operating system frameworks ran their own code inside the browser process, known as in-process code.
This way, large batches of queries could be executed very fast.
In particular, Windows screen readers query the entire accessibility tree and build their own representation of the document called a virtual buffer.
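To make the cost concrete, here is a deliberately simplified, illustrative sketch (JavaScript pseudocode, not any browser's or screen reader's real code; the node methods are invented) of why a task like finding the next heading generates so many queries: a naive walk asks for the role of every node it visits, and each of those calls stands in for a separate synchronous accessibility API query.

```js
// Illustrative only: each getChildren()/getRole() call represents one
// synchronous accessibility API query, which is cheap in-process but
// expensive when it has to cross a process boundary.
function findNextHeading(node) {
  for (const child of node.getChildren()) {
    if (child.getRole() === "heading") {
      return child;
    }
    // Recurse into the subtree, querying every descendant along the way.
    const found = findNextHeading(child);
    if (found) {
      return found;
    }
  }
  return null; // No heading in this subtree.
}
```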
Multiple processes
As the web grew rapidly in usage and complexity, so too did the risk of security exploits.
To improve performance, stability and security, browsers started to move different web pages into separate processes.
Internet Explorer 8 used different processes for different tabs, but a web page was still drawn in the same process in which the page was loaded.
The accessibility tree was also exposed from that same process and assistive technologies could still inject code into that process.
This meant that there was no change for assistive technologies, which could still get direct, fast access to the content.
To further improve security, Chrome took a stricter approach as part of its foundational design.
Web content processes were sandboxed so that they had as little access as possible, delegating tasks requiring more privileges to other processes through tightly controlled communication channels.
This meant that assistive technologies could not access the web content process containing the accessibility tree, nor could they inject code into that process.
Several years later, Firefox adopted a similar design, resulting in similar problems for accessibility.
The discovery of the Meltdown and Spectre attacks led both browsers to go even further and isolate iframes in their own processes, which made the situation even more complicated for accessibility.
Chrome’s solution
At first, Chrome experimented with handling accessibility queries in the main UI process and relaying them to the appropriate web content process.
Because accessibility API queries are synchronous, the entire UI and the web content process were blocked until each accessibility query completed and returned its result.
This made things unacceptably slow, especially for large batches of queries as described above.
This also caused some obscure stability and reliability issues.
Chrome abandoned that approach in favour of caching the accessibility trees from all other processes in the main UI process.
Rather than synchronous queries between processes, Chrome asynchronously pushes the accessibility trees from each web content process.
This does require some additional time and processor power when pages load and update, as well as using extra memory for the cache.
On the other hand, it means that assistive technologies have direct, in-process, fast access to the content as they did before in other browsers.
Firefox’s solution, take 1
Firefox was designed long before Chrome and long before the complex world which necessitated multiple, sandboxed processes.
This meant that re-architecting Firefox to use multiple processes was a massive undertaking which took years and a great deal of resources.
Great care had to be taken to ensure that Firefox remained reliable for the hundreds of millions of users who depended on it every day.
Firefox built a very minimal cache in the main process containing only the tree structure and the role (button, heading, etc.) of each node.
All other queries were relayed synchronously to the appropriate web content process.
On Linux and Mac, where large batches of queries are far less common and virtual buffers aren’t used, this was acceptable for the most part.
On Windows, as Chrome discovered, this was completely unacceptable.
Not only was it unusably slow, it was very unstable due to the fact that COM (the Windows communication mechanism used by accessibility) allows re-entry; i.e. another call can be handled while an earlier call is still running.
The Firefox multi-process communication framework wasn’t designed to handle re-entry.
Thus, another approach was required on Windows.
The accessibility team considered implementing a full cache of all accessibility trees in the main process.
However, Mozilla needed to ship multi-process Firefox as soon as possible.
It was believed that it would take too long to implement a full cache and get it right.
Getting anything wrong could result in the wrong information being communicated to assistive technologies, which could be potentially disastrous for users who already depended on Firefox.
There are also other downsides to a full cache as outlined earlier.
Instead, Firefox used some advanced (and somewhat obscure) features of COM to allow assistive technologies to communicate with the accessibility tree in content processes.
To mitigate the performance problems caused by large batches of queries, a partial cache was provided for each node.
Querying a node still required a synchronous, cross-process call, but instead of just returning one piece of information, the cache was populated with other commonly retrieved information for that single node.
This meant that some subsequent queries for that node were very fast, since they were answered from the cache.
All of this was done using a COM lightweight client-side handler.
The entire cache for all nodes was invalidated whenever anything changed.
While naive, this reduced the risk of stale information.
The performance with assistive technologies took a massive step backwards when this was first released in Firefox 57.
Over time, we were able to improve this significantly by extending the COM handler cache.
Eventually, we reached a point where we could not improve the speed any further with the current architecture.
Because software other than assistive technology uses accessibility APIs (e.g. Windows touch, East Asian input methods and enterprise SSO tools), this was even impacting users without disabilities in some cases.
Furthermore, COM was never designed to handle the massive number of objects in many web accessibility trees, resulting in severe stability problems that are difficult or even impossible to fix.
The complexity of this architecture and the need for different implementations on different operating systems made the accessibility engine overly complex and difficult to maintain.
This is particularly important given the small size of our team.
When we revamped our Android and Mac implementations in 2019 and 2020, we had to implement more operating system specific tweaks to ensure decent performance, which took time and further complicated the code.
This wouldn’t have been necessary with the full cache.
Of course, maintaining the caching code has its own cost.
However, this work can be more easily distributed across the entire team, rather than relying on the specific expertise of individual team members in particular operating systems.
Enter Cache the World
Our existing architecture served us well for a few years.
However, as the problems began to mount, we decided to go back to the drawing board.
We concluded that the downsides of the full cache were far outweighed by the growing problems with our existing architecture and that careful design could help us mitigate those downsides.
Thus, the Cache the World project was born to re-architect the accessibility engine.
In the new architecture, similar to Chrome, Firefox asynchronously pushes the accessibility trees from each web content process to the main UI process.
When assistive technologies query the accessibility tree, all queries are answered from the cache without any calls between Firefox processes.
When a page updates, the content process asynchronously pushes a cache update to the main process.
The speed improvement has far exceeded our expectations, and unlike the old architecture, we still have a great deal of room to improve further, since we have complete control over how and when the cache is updated.
As for code maintenance, once this is fully released, we will be able to remove around 20,000 lines of code, with the majority of that being operating system specific.
The journey to the world of caching
Aside from the code needed to manage the cache and update it for many different kinds of changes, this project required a few other major pieces of work worth mentioning.
First, Firefox’s desktop UI is largely written using web technologies (HTML, CSS and JavaScript), but this must be rendered in the main process.
The cache isn’t needed for this, but we wanted to share as much code as possible between the cached and non-cached implementations.
In particular, there is a layer of code to support the accessibility APIs specific to each operating system and we didn’t want to maintain two completely separate versions of this.
So, we created a unified accessibility tree, with a base Accessible class providing an interface and functionality common to both implementations (LocalAccessible and RemoteAccessible).
Other code, especially operating system specific code, then had to be updated accordingly to use this unified tree.
Second, the Windows specific accessibility code was previously entangled with the core accessibility code.
Rather than being a separate class hierarchy, Windows functionality was implemented in subclasses of what is now called LocalAccessible.
This made it impossible for the Windows code to support the separate cached implementation.
Fixing this involved splitting the Windows implementation into its own class hierarchy (MsaaAccessible).
Third, the code which provided access to text (words, lines, formatting, spelling errors, etc.) depended heavily on being able to query Firefox’s layout engine directly.
It dealt with text containers rather than individual chunks of text, which was not ideal for efficient caching.
There were also a lot of bugs causing asymmetric and inconsistent results.
We replaced this with a completely new implementation based on text ranges called TextLeafRange.
It still needs to use the layout engine to determine line boundaries, but it can do this for individual chunks of text and it provides symmetric, consistent results.
TextLeafRange is far better suited to the Mac text API and will make the Windows UI Automation text pattern much easier to implement when we get to that.
We also replaced the code for handling tables, which similarly depended heavily on the layout engine, with a new implementation called CachedTableAccessible.
Fourth, our Android accessibility code required significant re-design.
On Android, unlike other operating systems, the Firefox browser engine lives in a separate thread from the Android UI.
Since accessibility queries arrive on the UI thread, we had to provide thread-safe access to the accessibility cache on Android.
Finally, screen coordinates and hit testing, which is used to figure out what node is at a particular point on the screen, were an interesting challenge.
Screen positioning on the modern web can be very complicated, involving scrolling, multiple layers, floating content, transforms (repositioning/translation, scaling, rotation, skew), etc.
We cache the coordinates and size of each node relative to its parent and separately cache scroll positions and transforms.
This minimises cache updates when content is scrolled or moved.
Using this data, we then calculate the absolute coordinates on demand when an assistive technology asks for them.
For hit testing, we use the layout engine to determine which elements are visible on screen and sort them from the top layer to the bottom layer.
We cache this as a flat list of nodes called the viewport cache.
When an assistive technology asks for the node at a given point on the screen, we walk that list, returning the first node which contains the given point.
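As an illustration only (the real implementation is C++ inside the Firefox accessibility engine, and the names below are hypothetical), the coordinate and hit testing scheme described above boils down to something like this:

```js
// Hypothetical sketch of the scheme described above, not Firefox's actual code.
// Each cached node stores bounds relative to its parent, plus that parent's
// cached scroll position; absolute coordinates are derived on demand by
// walking up the tree.
function absoluteBounds(node) {
  let { x, y, width, height } = node.relativeBounds;
  for (let parent = node.parent; parent; parent = parent.parent) {
    x += parent.relativeBounds.x - parent.scrollX;
    y += parent.relativeBounds.y - parent.scrollY;
    // A real implementation would also apply any cached transforms here.
  }
  return { x, y, width, height };
}

// The viewport cache is a flat list of on-screen nodes sorted from the top
// layer to the bottom layer, so the first hit wins.
function hitTest(viewportCache, screenX, screenY) {
  for (const node of viewportCache) {
    const b = absoluteBounds(node);
    if (screenX >= b.x && screenX < b.x + b.width &&
        screenY >= b.y && screenY < b.y + b.height) {
      return node;
    }
  }
  return null; // The point doesn't hit any cached node.
}
```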
How is Firefox’s cache different to Chrome’s?
While Firefox’s cache is similar to (and inspired by) Chrome’s, there are some interesting differences.
First, to keep its cache up to date, Chrome has a cache serialiser which is responsible for sending cache updates.
When something changes, Chrome notifies the serialiser that the affected node has changed.
The specific change is mostly irrelevant to the serialiser; it just re-serialises the entire node.
The serialiser keeps track of what nodes have already been sent.
When walking the tree, it sends any new nodes it encounters and ignores any nodes that were already sent and haven’t been changed.
In contrast, Firefox uses its existing accessibility events and specific cache update requests to determine what changes to send.
When a node is added or removed, Firefox fires a show or hide event.
This event is used to send information about a subtree insertion or removal to the main process.
The web content process doesn’t specifically track what nodes have been sent, but rather, it relies on the correctness of the show and hide events.
For other changes to nodes, Firefox uses existing events to trigger cache updates where possible.
Where it doesn’t make sense to have an event, code has been added to trigger specific cache updates.
The cache updates only include the specific information that changed.
We’ve spent years refining the events we fire, and incorrect events tend to cause problems for assistive technologies and thus need to be fixed regardless, so we felt this was the best approach for Firefox.
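As a rough sketch of the difference (illustrative only; the function and field names are invented and the real code is C++), Firefox's approach amounts to sending just the changed fields for a node, keyed by its ID, over an asynchronous IPC channel:

```js
// Invented names; this only illustrates the "send just what changed" idea.
function onTextChanged(node) {
  // Only the text field is re-sent; the rest of the cached node is untouched.
  sendCacheUpdateAsync({ id: node.id, fields: { text: node.text } });
}

function onScrollPositionChanged(node) {
  sendCacheUpdateAsync({ id: node.id, fields: { scroll: node.scrollPosition } });
}

// A Chrome-style serialiser would instead mark the node dirty and re-serialise
// the entire node (plus any new descendants) on the next update pass.
```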
Second, Chrome includes information about word boundaries in its cache.
In contrast, Firefox calculates word boundaries on demand, which saves memory and reduces cache update complexity.
We can do this because we have access to the code which calculates word boundaries in both our main process and our content processes, since our main process renders web content for our UI.
Third, hit testing is implemented differently.
I described how Firefox implements hit testing earlier.
Rather than maintaining a viewport cache, Chrome first gets an approximate result using just the cached coordinates in the tree.
It then sends an asynchronous request to cache a more accurate result for the next query at a nearby point on the screen.
Our hope is that the viewport cache will make initial hit testing more accurate in Firefox, though this strategy may well need some refinement over time.
So, when can I use this awesomeness?
I’m glad you asked!
The new architecture is already enabled in Firefox Nightly.
So far, we’ve received very positive feedback from users.
Assuming all continues to go well, we plan to enable this for Windows and Linux users in Firefox 110 beta in January 2023.
After that, we will roll this out in stages to Windows and Linux users in Firefox 111 or 112 release.
There is still a little work to do on Mac to fully benefit from the cache, particularly for text interaction, but we hope to release this for Mac soon after Windows.
Now, go forth and enjoy a cached world!
I recently got a new laptop: a Dell XPS 15 9510.
While this is a pretty nice machine overall, its audio drivers are an abomination.
Among other things, the Waves MaxxAudio software it ships with eventually leaks all of your system memory if you use audio constantly for hours, which is the case for screen reader users.
I eventually got fed up and disabled the Waves crap, but this makes it impossible for me to use the headset mic on my EarPods.
To work around that, I bought an Apple USB-C to 3.5-mm Headphone Jack Adapter.
As well as supporting the mic on the EarPods, this adapter also supports the volume and play/pause buttons!
However, play/pause only plays or pauses.
In contrast, on the iPhone, pressing it twice skips to the next track and pressing it thrice skips to the previous track.
I discovered that these buttons simply get sent as media key presses.
So, I wrote a little AutoHotkey script to intercept the play/pause button and translate double and triple presses into next track and previous track, respectively.
It also supports holding the button.
Single press and hold currently does nothing, but you could adjust this to do whatever you want.
On the iPhone, double/triple press and hold fast forwards/rewinds, respectively.
Support for triggering fast forward/rewind in Windows apps is sketchy - I couldn’t get it to work anywhere - so these are currently mapped to shift+nextTrack and shift+previousTrack.
This way, you have the option of binding those keys in whatever app you’re using.
You can get the code or download a build.
I’m finally learning to cook some decent food, so I need to be able to read recipes.
For a while, I was reading them out of Simplenote on my iPhone.
However, I encountered several frustrations with this approach (and this applies to any notes or text app really):
- When you’re not editing, Simplenote shows the note such that each line is an item for VoiceOver; i.e. you flick right to read the next line.
However, if you bump the screen or perform the wrong gesture accidentally, you can easily lose your position.
If the screen locks or you have to switch apps, you lose your position completely, since VoiceOver doesn’t restore focus to the last focused item in apps.
- When editing, you can review the note line by line using the rotor.
The advantage here is that the cursor doesn’t get lost when you switch apps or the screen locks.
However, lines can be smaller than is ideal due to the screen size, so one recipe instruction might get split across multiple lines.
Also, moving the editing cursor with VoiceOver is notoriously buggy, often getting stuck, etc.
Finally, again, if you bump the screen or perform the wrong gesture, you can lose your position (or worse, accidentally type text into the document).
- The screen lock problem could be solved by disabling auto lock, but that obviously has an impact on battery.
- Having to repeatedly take my phone out of my pocket to read the next instruction was impractical, especially given the risk of losing my spot in the recipe.
Leaving it on a bench somewhere meant I had to keep walking back to wherever my phone was located, which was similarly tedious.
This might seem simple enough, but when you’re moving around a lot, using your hands for other things, getting your hands dirty, etc., it just isn’t efficient.
I considered a couple of solutions:
- I tried looking for an iOS app that could read recipes using Siri.
If such an app exists, I couldn’t find it.
Any normal recipe app would likely have the same problems as above for VoiceOver users.
- Google Home and Amazon Alexa can read recipes interactively using voice commands.
I don’t own either of those, but I was willing to consider the purchase.
However, they can only read recipes from partner sites.
This means you can’t read recipes from other sources or recipes you’ve customised… and I tend to tweak recipes quite a bit for my own convenience.
So, I resigned myself to developing my own solution to read recipes with Siri.
This isn’t specific to recipes.
It can be any line based text.
For example, it could be equally useful for other kinds of instructions where you need to be able to move step by step, but might have delays (maybe many minutes) between reading each step.
How it Works
As explained above, I need to be able to edit and read customised recipes.
I find it much easier to edit long text on my laptop.
So, my solution takes the text from a simple text file stored on iCloud Drive.
This way, I can edit the text on my laptop, save it directly to iCloud Drive and have it reflected almost immediately in my reader solution without any extra effort.
Recipes usually have at least two sections (e.g. Ingredients and Method).
It’s sometimes helpful to be able to jump between those.
The solution allows me to use a Markdown style heading (# heading text) to mark section headings.
The solution can be used while the phone is locked.
An added advantage of this is that it can even be triggered from a HomePod, with the responses read on the HomePod, though I usually prefer to use my AirPods.
I can then use these Siri commands (i.e. after saying “Hey Siri”):
- Read next: Read the next line of text.
- Read previous: Read the previous line of text.
- Read repeat: Repeat the line of text that was last read.
- Read next section: Jump to the next section heading and read it.
- Read previous section: Jump to the previous section heading and read it.
In all cases, the solution keeps track of the last line that was read until you next use a command.
Even if I wait an hour, I’ll still be exactly where I last left it.
Sometimes, I want to be able to quickly review many instructions; e.g. if I’m looking for multiple ingredients or reading ahead to see what’s coming up.
In this case, I can use the Siri command “read browse” while the phone is unlocked.
This presents the instructions in a WebView so I can flick right and left between them with VoiceOver.
I can also use the headings rotor to move between headings.
When it opens, it focuses the line I last read.
Furthermore, if I double tap one of the lines, it sets that as the “bookmark”; i.e. the last line read.
For example, if I double tap the second instruction in the method section of a recipe and later say “Hey Siri, read next”, Siri will read me the third instruction in the method section.
While in this browsing view, each line occupies almost the entire screen.
This might be useful if you’re reading notes for a live talk you’re giving and don’t want to risk losing your spot if you bump the screen.
Reading on Apple Watch
A few weeks ago, I bought an Apple Watch.
I began to wonder: could I somehow make use of the Apple Watch for something similar to “read browse”?
The Apple Watch has three nice advantages here:
- It’s on your wrist, so you don’t have to worry about locating it, picking it up, accidentally dropping it, etc.
- Although it does go to sleep, you don’t have to unlock it again once it’s unlocked and on your wrist.
- It’s much more water resistant than phones, so I’m less worried about sticking my grubby hands all over it.
Now, the solution works on Apple Watch too.
For technical reasons, it unfortunately can’t focus the last line I read with Siri and doesn’t support heading navigation.
When the Apple Watch wakes up after going to sleep, VoiceOver doesn’t restore focus to the line I last read.
However, because the screen is so small and the scroll position is kept, I can just tap the screen to read the last line (or at least one very nearby), so this isn’t a real problem on the watch.
Interestingly, I find I now use the watch to read recipes far more than Siri.
Before I got the watch, I’d been using the Siri solution with my phone for a few months and was reasonably happy with it.
However, speaking Siri commands can be slow if you’re reading several instructions in quick succession.
Also, Siri would sometimes misunderstand my commands; e.g. trying to read text messages instead of “read next”.
Also, I found I wanted to read ahead more often with some recipes and having to find my phone, pick it up, unlock it, etc. to use “read browse” was slightly annoying.
That said, I suspect I’ll still use Siri in some cases.
It’s useful being able to interactively read instructions in multiple ways depending on what I need at the time.
Implementation
The current solution is implemented using iOS Shortcuts and Scriptable.
Scriptable is an absolutely fantastic iOS app that allows you to “automate iOS using JavaScript”.
It exposes many iOS APIs in JavaScript and allows integration with Siri shortcuts, among many other features.
Here’s the code for the Scriptable script.
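The full script does more than I can show here, but to give a feel for the core idea, here is a heavily simplified sketch of the "read next" path. The bookmark file and several names below are my own invention for illustration, not necessarily what the real script uses:

```js
// Simplified sketch only; the real SiriInteractiveReader script differs.
const fm = FileManager.iCloud();
const dir = fm.documentsDirectory(); // The Scriptable folder on iCloud Drive.
const textPath = fm.joinPath(dir, "SiriInteractiveReader.txt");
const statePath = fm.joinPath(dir, "SiriInteractiveReader.state.json"); // Invented bookmark file.

await fm.downloadFileFromiCloud(textPath);
const lines = fm.readString(textPath).split("\n").filter(l => l.trim() !== "");
let state = fm.fileExists(statePath)
  ? JSON.parse(fm.readString(statePath))
  : { line: -1 };

// The command ("nextLine", "previousLine", etc.) arrives via the Texts
// parameter of the Shortcuts "Run Script" action.
const command = args.plainTexts[0];
if (command === "nextLine") {
  state.line = Math.min(state.line + 1, lines.length - 1);
} else if (command === "previousLine") {
  state.line = Math.max(state.line - 1, 0);
} else if (command === "nextSection") {
  // Section headings are Markdown style lines beginning with "#".
  for (let i = state.line + 1; i < lines.length; i++) {
    if (lines[i].startsWith("#")) {
      state.line = i;
      break;
    }
  }
} // "repeatLine" leaves state.line unchanged.

fm.writeString(statePath, JSON.stringify(state));
// Returning the text as shortcut output lets the "Show Result" action speak it
// without Siri adding "That's done" (see Learnings below).
Script.setShortcutOutput(lines[state.line]);
Script.complete();
```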
Unfortunately, getting this set up is pretty tedious because you have to manually create a bunch of Siri shortcuts.
- Import the script into Scriptable.
The easiest way to do this is to copy the file into the Scriptable folder on iCloud Drive.
- In the Shortcuts app, create a shortcut for “Read next”:
- Add the Scriptable “Run Script” action and choose the SiriInteractiveReader script.
- Tap the “Show More” button for that action.
- Under “Texts”, add a new item and enter the text: nextLine
- Ensure “Run In App” and “Show when Run” are both off.
- Add the “Show Result” action.
- Name the shortcut “Read next”.
- Duplicate this shortcut for the rest of the Siri reading commands.
Aside from the name of the shortcut, the difference in each shortcut will be the text entered under “Texts” in the “Run Script” action:
- For Read previous: previousLine
- For Read repeat: repeatLine
- For Read next section: nextSection
- For Read previous section: previousSection
- For the “Read browse” shortcut, the text under “Texts” should be “browse”.
“Run In App” must be on.
The “Show Result” action should be removed (so there’s only the “Run Script” action).
- If you have an Apple Watch, you can add a shortcut to support this.
- Again, you need the “Run Script” action with the script set to SiriInteractiveReader.
- Under Texts, enter the text: list
- Add the “Get Dictionary Value” action.
- Set the key for that action to: list
- Add the “Choose Item from List” action.
- Name the shortcut “Read watch” and ensure “Show on Apple Watch” is on.
- Note that you should run this action from the Shortcuts app on the watch, not from Siri.
The text you want to read should be placed in a file called SiriInteractiveReader.txt in the Scriptable folder on iCloud Drive.
Learnings
I learned a great deal throughout the process of implementing this.
Here are some learnings that might be of interest to others working with iOS Shortcuts and Scriptable.
- If you want to have Siri read a response without also saying “That’s done” or similar, you need to turn off “Show when Run”, return the text as output from your Scriptable script and use the “Show Result” action in Shortcuts.
The intuitive way to speak text using Siri is to use Scriptable’s Speech.speak function.
If you do this, Siri seems to want to speak “That’s done” or similar.
In contrast, this doesn’t happen when you use “Show Result”.
The added advantage is that the shortcut will display the text on screen if run outside of Siri.
- If you have Scriptable present a WebView with “Run In App” turned off, you won’t be able to activate anything in the WebView.
- The alert and window.close functions don’t work in Scriptable WebViews.
- Scriptable’s WebView.evaluateJavaScript function doesn’t work correctly while the WebView is being presented.
I was hoping to use this to handle moving the reading bookmark when the user taps a line.
Instead, I used a scriptable:// URL to open the script with a specific parameter, which also dismisses the WebView.
- There is no Scriptable app for Apple Watch (yet).
However, as long as you don’t try to present any UI whatsoever from within Scriptable, you can still make use of Scriptable in shortcuts run on the watch.
The script will run on the phone.
You can still present UI, but you have to do it with Shortcuts actions, which are able to run on the watch.
You can get creative here to present UI based on output from a Scriptable script, as I do using the “Choose Item from List” in the watch shortcut above.
The Future
Ideally, it’d be good to develop this into an app so it’s not so tedious (probably impossible for many users) to install.
I considered this, but I’m one of these strange developers that still uses a text editor and prefers to design GUI using markup languages or code.
The prospect of learning and using Xcode and having to use a GUI builder is not something I’m at all motivated to do in my spare time.
I’ve read you can design iOS GUI in code to some extent, but it looks super painful.
Introduction
An awesome feature in Firefox that has existed forever is the ability to assign keywords to bookmarks.
For example, I could assign the word “bank” to take me directly to the login page for my bank.
Then, all I have to do is type “bank” into the address bar and press enter, and I’m there.
Another awesome feature in Firefox is the ability to use the address bar to switch to open tabs.
For example, if I want to switch to my Twitter tab, I can type “% twitter” into the address bar, then press down arrow and enter, and I’m there.
Inspired by these two features, I started to wonder: what if you could have tab keywords to quickly switch to tabs you use a lot?
If you only have 8 tabs you use a lot, you can switch to the first 8 tabs with control+1 through control+8.
If you have more than that, you can search them with the address bar, but that gets messy if you have multiple pages with similar titles or a page title doesn’t contain keywords that are quick to search.
For example, if you have both Facebook Messenger and Twitter Direct Messages open, you can’t just type “% mes” because that will match both.
If you’re on bug triage duty and have a bug list open, the list might not have a useful title.
Wouldn’t it be nice to just type “tm” to switch to Twitter Direct Messages or “fm” to switch to Facebook Messenger?
Implementation
Now you can!
Trying to integrate this into the Firefox address bar seemed pretty weird.
Among other things, it wasn’t clear what a good user experience would be for setting a keyword for a tab.
So, I decided to do this in an add-on.
Rather than writing my own from scratch, I found Fast Tab Switcher and contributed the feature to that.
Version 2.7.0 of Fast Tab Switcher has now been released which includes this feature.
How it Works
First, install Fast Tab Switcher if you haven’t already.
Then, to assign a keyword to a tab:
- Switch to the tab.
- Press control+space to open Fast Tab Switcher.
- Type the = (equals) character into the text box.
- Type the keyword to assign and press enter.
For example, to assign the keyword “fm”, you would press control+space, type “=fm” and press enter.
To switch to a tab using its keyword, press control+space, type the keyword and press enter.
Note that the keyword must be an exact match.
Keywords stay assigned to tabs even if you close Firefox, as long as you have Firefox set to restore tabs.
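For the curious, that persistence across restarts is the kind of thing the WebExtension sessions API makes possible. The following is only an illustrative sketch under that assumption, not Fast Tab Switcher's actual code:

```js
// Illustrative sketch; "keyword" is a hypothetical storage key.
// sessions.setTabValue stores data with the tab's session, so it survives a
// restart when Firefox restores the tab.
async function assignKeyword(tabId, keyword) {
  await browser.sessions.setTabValue(tabId, "keyword", keyword);
}

async function findTabByKeyword(keyword) {
  for (const tab of await browser.tabs.query({})) {
    const stored = await browser.sessions.getTabValue(tab.id, "keyword");
    if (stored === keyword) {
      return tab;
    }
  }
  return null;
}

async function switchToKeyword(keyword) {
  const tab = await findTabByKeyword(keyword);
  if (tab) {
    await browser.tabs.update(tab.id, { active: true });
    await browser.windows.update(tab.windowId, { focused: true });
  }
}
```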
Enjoy super fast tab switching!
At CSUN this year, I attended the open source math accessibility sprint co-hosted by the Shuttleworth Foundation and Benetech, where major players in the field gathered to discuss and hack on various aspects of open source math accessibility. My team, which also included Kathi Fletcher, Volker Sorge and Derek Riemer, tackled reading of mainstream math with open source tools.
Last year, NVDA introduced support for reading and interactive navigation of math content in web browsers and in Microsoft Word and PowerPoint. To facilitate this, NVDA uses MathPlayer 4 from Design Science. While MathPlayer is a great, free solution that is already helping many users, it is closed source, proprietary software, which severely limits its future potential. Thus, there is a great need for a fully open source alternative.
Some time ago, Volker Sorge implemented support for math in ChromeVox and later forked this into a separate project called Speech Rule Engine (SRE). There were two major pieces to our task:
- SRE is a JavaScript library and NVDA is written in Python, so we needed to create a "bridge" between NVDA and SRE. We did this by having NVDA run Node.js and writing code in Python and JavaScript which communicated via stdin and stdout (a minimal sketch of the Node.js side appears after this list).
- One of the things that sets MathPlayer above other math accessibility solutions is its use of the more natural ClearSpeak speech style. In contrast, MathSpeak, the speech style used by SRE and others, was designed primarily for dictation and is not well suited to efficient understanding of math, at least without a great deal of training. So, we needed to implement ClearSpeak in SRE. Because this is a massive task that would take months to complete (and this was a one day hackathon!), we chose to implement just a few ClearSpeak rules, just enough to read the quadratic equation.
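Here is a minimal sketch of what the Node.js side of such a stdin/stdout bridge can look like. The speech-rule-engine npm package does expose a toSpeech() helper, but the exact setup and protocol we used at the hackathon differed, so treat this purely as an illustration:

```js
// Illustrative only: read one MathML expression per line from NVDA on stdin
// and write the generated speech back on stdout.
const readline = require("readline");
const sre = require("speech-rule-engine");

const rl = readline.createInterface({ input: process.stdin });
rl.on("line", mathml => {
  // Convert the MathML string to speech text and send it back to NVDA.
  process.stdout.write(sre.toSpeech(mathml) + "\n");
});
```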
Our goal for the end of the day was to present NVDA and SRE reading and interactively navigating the quadratic equation in Microsoft Word using ClearSpeak, including one pause in speech specified by ClearSpeak. (ClearSpeak uses pauses to make reading easier and to naturally communicate information about the math expression.) I'm pleased to say we were successful! Obviously, this was very much a "proof of concept" implementation and there is a great deal of further work to be done, both in NVDA and SRE. Thanks to my team for their excellent work and to Benetech and the Shuttleworth Foundation for hosting the event and inviting me!
As a result of this work, I was subsequently nominated by Kathi Fletcher for a Shuttleworth Foundation Flash Grant. In short, this is a small grant I can put towards a project of my choice, with the only condition being to "live openly" and share it with the world. And I figured polishing NVDA's integration with SRE was a fitting project for this grant. So, in the coming months, I plan to release an NVDA add-on package which allows users to easily install and use this solution. Thanks to Kathi for nominating me and to the Shuttleworth Foundation for supporting this! Watch this space for more details.