More Text-to-Speech (TTS) Flexibility and Cross-Brand Predictability Needed

Making web pages accessible through text-to-speech (TTS) is much harder than it should be, because how each major TTS program will render what we write is kept secret. Most of us don’t have time to test several programs, and we don’t even have the several platforms we’d need to test them on. The result is not that we spend much more time on accessibility development but that we don’t provide that level of accessibility at all. People with visual impairments do what they can but probably don’t get what we intended to give them.

Even TTS that complies with some open standards does not let website designers and page authors specify several important characteristics. It is possible to code every string separately, but that can require so much work as to be prohibitive. If individualized coding is done through *.pls (pronunciation lexicon) files and those files are large or numerous, reading them repeatedly slows page rendering for either visualization or TTS. That could be a concern, as a slowdown could hurt positioning in Google search results. It would be helpful to code once for a page or a website. Some support is already available through CSS for speech, such as the voice-family property. That support could be expanded.
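
As a rough illustration of coding once for a whole website, the Python sketch below builds a single pronunciation lexicon in the W3C PLS format from one mapping of written strings to spoken replacements. The function name `build_lexicon` and the sample entries are hypothetical, and whether any particular TTS program will load or honor such a file is an assumption, not something this sketch can guarantee.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the W3C Pronunciation Lexicon Specification (PLS) 1.0.
PLS_NS = "http://www.w3.org/2005/01/pronunciation-lexicon"
XML_NS = "http://www.w3.org/XML/1998/namespace"

# Serialize PLS elements without a prefix (as the default namespace).
ET.register_namespace("", PLS_NS)


def build_lexicon(entries, lang="en-US"):
    """Return PLS XML text mapping each written string to a spoken alias.

    `entries` is a hypothetical site-wide dict such as
    {"2213": "twenty-two thirteen"}.
    """
    root = ET.Element(
        f"{{{PLS_NS}}}lexicon",
        {"version": "1.0", "alphabet": "ipa", f"{{{XML_NS}}}lang": lang},
    )
    for grapheme, alias in entries.items():
        lexeme = ET.SubElement(root, f"{{{PLS_NS}}}lexeme")
        ET.SubElement(lexeme, f"{{{PLS_NS}}}grapheme").text = grapheme
        ET.SubElement(lexeme, f"{{{PLS_NS}}}alias").text = alias
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Sample entries are illustrative only.
    site_wide = {
        "2213": "twenty-two thirteen",
        "John II": "John the Second",
    }
    print(build_lexicon(site_wide))
```

A site could regenerate this one file whenever the mapping changes instead of marking up every string by hand.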

Consistency across brands could mean that we who code websites do not need to address each TTS program differently or code for the lowest common denominator, which is especially hard when we don’t even know what that denominator is in TTS. We need to be able to predict accessibility regardless of which TTS program a user chooses.

More functionality should be supported for content where it matters. What matters may vary by context, such as a dialogue in which people in different roles tend to speak the same visually rendered text differently:

— Numbers in different contexts are spoken in different ways. The digit “0” could be /zero/ or /oh/. Should “2213” be said as /two thousand two hundred thirteen/, /two thousand two hundred and thirteen/, /twenty-two hundred thirteen/, /twenty-two thirteen/, /twenty-two one three/, /two two one three/, or some other way? Is “2.5” /two point five/ or /two and five tenths/? At least it would not be /two and a half/ because we don’t want the TTS rendering to disagree with the visual rendering, and they would disagree even though the ultimate meaning is the same. But “2 1/2” could be /two and a half/ or /two and one half/ and, depending on spacing and other context, could be either one number or two. Other number systems, including advanced and ancient, must be recognized. Ruling governors, usually monarchs, in many countries like to have numbers after their names, but, depending on the country, is “John II” /John the Second/, /John Second/, or some other form? Is “20/20” /twenty twenty/ as in vision or /twenty twentieths/ as a fraction reducible to 1? If “94304-1112” is a U.S. Zip code, it could be rendered as /9 4 3 0 4 1 1 1 2/ or /9 43 0 4 dash 11 12/; if it’s arithmetic, as /94,304 minus 1,112/; those are only examples.
— Superscripts can be ambiguous. Consider this sentence: “The speed until yesterday in miles per hour was 10² but now it’s less.” Is “10²” supposed to mean /ten squared/, /ten (footnote 2)/, /ten (endnote 2)/, or /ten (note 2)/?
— Symbol fonts cannot be rendered as if nonsymbolic. This is a challenge, in that a character in a symbol font is often mapped to a character that has the same pattern of bits as a character in a text font, especially when 7- or 8-bit representations are used rather than 32-bit Unicode representations. Thus, a TTS program must recognize the applicable font before rendering and, if the font thus discovered is a symbol font, it must comprehend which symbol must be rendered. This probably means describing every single glyph in words and that quickly gets slow and inconsistent. Related to symbol fonts are characters that are easily confused or have multiple meanings, such as fences (parentheses, brackets or square brackets, braces, and angle brackets), various dashes, and a mix of list item markers. If a symbol font does not contain wordy descriptions of characters, such as “eagle in flight”, then TTS has to deduce the description or be told it by another source, making consistency and accuracy less likely.
— Unnumbered lists, especially in outlines that have multiple levels, can benefit from detailing the TTS rendering (the beginning and the end of each list item and what level of depth it’s at) and from being searchable by level.
— Styles, at least some styles, must be accommodated. This is increasingly important as a website may apply one style in one set of circumstances and another style in another set of circumstances and different styles might expose different content for visualization. (For example, a website may expose different text in narrower viewports than it does in wider viewports. Even that raises a second issue: The TTS could be applied to any current viewport, but because the rendering is aural a viewport’s width may have little relevance and another criterion may be more logical to apply for TTS.) Thus, TTS rendering, to be accurate, must determine which style applies before rendering the resulting content through TTS.
— Formality of enunciation varies in life. Is the speaker remembering their third-grade teacher’s instructions on precision of pronunciation or is the speaker in a rush? Rushed speakers tend to drop a few sounds that usually are not absolutely necessary to comprehension. Even speakers not in a rush may normally drop the /r/ or the hard /g/. Eye dialect for colloquial pronunciation is generally discouraged in print because its use tends to reflect a biased view of speakers that is rejected by linguists (an unbiased approach, if one would write one speaker’s pronunciation of the first person singular pronoun as “Ah”, would be to write another speaker’s as “Eye” and no one’s as “I”). Writing biased eye dialect on a page is therefore usually not an option for a page author, and TTS needs another way to identify what would otherwise be that eye dialect and to render it accurately.
— Tonal pitch is common not only in Chinese but also, as noted in a book on acting, in English, where it may be more subtle. In World War II, an American broadcast news reporter had his scripts censored by Nazis who applied the language standard of British English. He read those scripts on air with American English nuances that were not clear from the written text alone but that led the Nazi embassy in Washington to try to get him out of Germany. Sarcasm is often communicated through aural nuance as a substitute for rewording. (Example: “You went to the moon yesterday, right?” “Riiiiight.”) Visual writing will usually not show nuances of sound, but some method is needed.
— Poetry and musical lyrics may unavoidably need individualized markup. I’m not a poet or a lyricist, but perhaps one could suggest ways of easing support.
— Computer languages are the basis of many statements that do not conform to the structure of a normal sentence in a natural language, yet must be rendered in TTS in a way that is both accurate and understandable to a listener, including one who wants just a quick idea of what the page says and another who wants to program exactly what the statement says. For some languages, and maybe all widely-used languages, there are multiple conventions for how they should be read aloud to students, experienced programmers, and others.
— Specialties often bring their own vocabularies, sometimes bring their own meanings for components found in a natural language, and occasionally bring their own syntactical conventions, and these need recognition for TTS.
— Foreign languages and foreign scripts, especially in quotations embedded in the text of another language, need accommodation for TTS.
— Font information can be rich in visual renderings. Sizes vary. Particular fonts are chosen for their visual effect on readers, starting with a choice as basic as between serif and sans serif. Underlining is meaningful, but it has more than one meaning, and that doesn’t consider multiple styles of underscoring with their different meanings. I don’t even know (as I draft this) how bold and italic are rendered in TTS; the obvious choices are louder and slower. If that’s the case, how should a larger size be rendered in TTS? And with how many gradations? Gradations must be distinct enough so that listeners are aware of a change from one to the next, but wider distinctions require a wide end-to-end range for possible TTS renderings. How about smaller sizes: should legal fine print be almost inaudible? I doubt the website’s lawyer would approve of that. Font size judgments should be relative to a page norm, but would that norm be the size given to an unclassified p element (which is often unknown because a website cannot access a user’s browser’s built-in stylesheets), the most common size on the page, or an average of several of the most common sizes? Should a fancy font, like the Old English used for some newspaper mastheads, be voiced differently? TTS normally uses a single voice for a single string, but should a headline be rendered by a chorus? If all members of the chorus were speaking with identical timing, and a computer can probably unify the voices’ timings better than a human manager can, how could a listener ever know that there is a chorus? If the timing is slightly off, would that be confusing? Is a compromise possible at all? In olden days, fonts came in several sizes, each as a separate font file, and they could include some unusual characters that were entirely different glyph-wise according to the size-specific files; how should they be rendered? How should case be rendered in TTS? We have lower case, all capitals (upper case), small capitals, title case, sentence case used for a sentence, sentence case used for a title (as librarians often do), and maybe we don’t want to mention titling capitals quite yet (oops, that cat’s out of the bag). Font style can also be used to indicate what exact text is to be deleted or inserted at a place; those two styles are often used to show amendments to a text.
— Emoticons and ASCII art would often be coded individually, but sometimes a general recognition of them would be helpful.
— Colors have names, but should the color vocabulary be more approximate or less? Should the colors be described as falling under only about a dozen or two dozen color names, should a thousand names be invoked as needed, or should a color name be replaceable on command by a numerical identifier sufficient for millions of colors? (In theory, that could be billions of colors, and perhaps a computer language already relies on that possibility, but I doubt any computer monitor is capable of displaying that many, except perhaps a specialized model built for scientific use.)
— Contrast is a related issue. While TTS either renders low-contrast text or does not, the visualization may have been designed for subtlety and TTS should have some way to indicate subtleness from contrast being low.
— Clickable image maps are generally considered bad unless specially designed for accessibility, but maps without special adaptations are still in use and if anyone can figure out how to make accessibility good for those constructs, then hooray, and we can address them at that time.
— Menus, I assume, are already rendered through TTS, even though technically a few kinds of menus exist. If they’re not all okay, then TTS needs to handle them. If TTS programs cannot handle a menu of a certain kind of design, the design standards need to require a TTS-compatible alternative.
— For disappearing content, such as some tooltips and entire pages, the TTS rendering must be of the full content, ignoring the disappearance. If the disappearance cannot be ignored, such as if the page is auto-redirected too soon, the redirection should be delayed until TTS is finished. Fast auto-redirection already is generally against good accessibility design, but it’s still common, so maybe a firmer standard is needed, perhaps one that senses that TTS is in use for a page.
— For content not to be rendered except visually, notice through TTS must clearly tell the user that it is not being rendered except visually, whether it is mandatory that it not be rendered except visually, and, within legal limits, the nature of the content that is not to be rendered except visually. Examples include movies and examinations. Movies can be examples even when the TTS rendering would be optional because, other than for the very shortest movies, a TTS equivalent would often be prohibitively expensive to produce and may not be artistically approved by the creator of the movie. Visual art of all kinds generally has the same problems. Laws on accessibility may include an exemption on an economic ground. Examinations, such as for school graduation or website access control, may have bona fide requirements specified by a test preparer or administrator. If accessibility via TTS is protected by law, then a legal challenge may have to be pursued and resolved in favor of TTS rendering before the TTS rendering can be provided.
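
The number item in the list above can be made concrete. The Python sketch below produces several of the spoken readings listed there for “2213” from a named style, which is the kind of choice a page author might want to declare once. The style names and function names are hypothetical, the code handles only whole numbers from 0 to 9999, and it does not describe how any existing TTS program works.

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def two_digits(n):
    """Spell out 0..99, e.g. 22 -> 'twenty-two'."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("-" + ONES[ones] if ones else "")


def cardinal(n, use_and=False):
    """Spell out 0..9999 as a quantity, optionally with 'and'."""
    parts = []
    thousands, rest = divmod(n, 1000)
    hundreds, low = divmod(rest, 100)
    if thousands:
        parts.append(ONES[thousands] + " thousand")
    if hundreds:
        parts.append(ONES[hundreds] + " hundred")
    if low or not parts:
        if use_and and parts:
            parts.append("and")
        parts.append(two_digits(low))
    return " ".join(parts)


def hundreds_style(n):
    """Spell out 1000..9999 in hundreds, e.g. 'twenty-two hundred thirteen'."""
    high, low = divmod(n, 100)
    return two_digits(high) + " hundred" + (" " + two_digits(low) if low else "")


def pairs(text):
    """Read digits two at a time (naive), e.g. 'twenty-two thirteen'."""
    return " ".join(two_digits(int(text[i:i + 2])) for i in range(0, len(text), 2))


def digits(text, zero="zero"):
    """Read digit by digit; `zero` may be 'zero' or 'oh'."""
    return " ".join(zero if ch == "0" else ONES[int(ch)] for ch in text)


# Hypothetical style names a page author might choose among.
STYLES = {
    "cardinal": lambda t: cardinal(int(t)),
    "cardinal-and": lambda t: cardinal(int(t), use_and=True),
    "hundreds": lambda t: hundreds_style(int(t)),
    "pairs": pairs,
    "digits": digits,
    "digits-oh": lambda t: digits(t, zero="oh"),
}


def say(text, style):
    """Return the spoken form of a digit string in the chosen style."""
    return STYLES[style](text)


if __name__ == "__main__":
    for style in STYLES:
        print(f"{style:13} {say('2213', style)}")
```

The point of the sketch is only that the same four characters yield six different utterances, so a TTS program has to be told, or has to guess, which one the author meant.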

I doubt I’ve included everything.

If I’ve included too much because a feature is already widely available, was that publicly announced with the clarity that programmers and page authors need? There should be a single repository of these announcements or a unified feature list, so we don’t have to search through a dozen or more websites of various TTS producers. We have other things we have to do, too.

Topping all this, a market problem exists: While producers of TTS programs may be sensitive to the demands of users who have visual impairments, website designers who make their sites accessible tend not to do the harder stuff that’s time-consuming and expensive, so they tend not to demand better TTS support. TTS programmers may not even be aware of what site designers and their clients who have 20/20 vision would like in order to attract and keep wider audiences.