Text-to-speech in Vim (for proofreading)

November 10, 2020
Tags: vim
Length: medium

Update 2020-11-17 — spelling corrections (ironically)

I am terrible at proofreading my own texts, which many of my colleagues can attest to. I was therefore happy to discover that macOS has quite a good text-to-speech feature, and that it can be accessed from the command line. This meant that I could quite easily integrate it into my Vim workflow, a prerequisite for me using it at all. This has turned out to be incredibly useful, and is something I now use every day for all the texts I write. When the text is read back to me by the friendly lady inside my computer, I can easily hear if something is spelled incorrectly, since she, unlike me, doesn’t know how to gloss over misspelled or repeated words. Below I describe how I use the speech synthesis in Vim, and thereafter provide the code I have in my .vimrc to get this functionality.

The basic functionality of this implementation is that text marked in visual mode is read aloud by pressing z. The language of the speech synthesis is based on the spelllang setting (in my case Swedish or English). The reading is stopped with the ESC key. I typically do something like vipz to have a paragraph read aloud, stop it with ESC to edit something, and then have the rest of the paragraph read aloud with v}z.

I mostly write text in pandoc-flavored markdown, and the speech synthesis reads some of the markup aloud as well, which is annoying. I therefore have several substitution commands that the text is sent through before it is passed to the speech synthesis. These substitutions either remove things or convert them to things that make more sense when read aloud. They make the speech synthesis:

  • ignore all *, <, >, and $
  • read citation keys (for me formatted as <author>_<noun>_<year>) as “citation” and the year of the publication
  • read markdown footnotes as “footnote: <label>” or “footnote text:” followed by the footnote content, depending on formatting
  • ignore the link target in [<link text>](<target>) type links

It also does some other more general things, like reading URLs as simply “URL” instead of reciting the entire address.
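To give a feel for what these substitutions do, here is a minimal sketch of a few of them as a stand-alone shell filter that can be tried without Vim or speech synthesis. The function name clean_for_speech and the sample sentence are made up for this example, it only covers a subset of the rules, and it strips * as well, following the list above.

```shell
# Sketch: clean markdown markup from stdin before it is read aloud.
# clean_for_speech is a hypothetical name, not part of the .vimrc code.
clean_for_speech() {
    # 1) drop *, <, > and $   2) citation key -> ", citation: <year>"
    # 3) [^label] -> " footnote: <label>."   4) [text](target) -> text
    # 5) any remaining http(s) address -> "URL "
    sed -E \
        -e 's/[*<>$]//g' \
        -e 's/@[a-z-]+_[a-z-]+_([0-9]{4})/, citation: \1/g' \
        -e 's/\[\^([a-z]+)\]/ footnote: \1./g' \
        -e 's/\[([^]]+)\]\([^)]+\)/\1/g' \
        -e 's/https?[^ ]+/URL /g'
}

printf '%s\n' 'Cats *purr*[^loud] a lot @smith_cats_2019.' | clean_for_speech
# prints: Cats purr footnote: loud. a lot , citation: 2019.
```

The order matters: the link rule has to run before the URL rule, otherwise the link target would already have been turned into "URL" and the link rule would no longer match cleanly.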

After some experimentation, I have set the reading rate to 250 words per minute, which is quite fast. I have to focus to keep up, and I stop it whenever I want to make an edit.
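If you want to do the same kind of experimentation, something like the following sketch may help. It plays one sentence at a few rates using the -r flag of say, and it also estimates how long a text takes to listen to if the rate is taken at face value. The sample sentence, the rates other than 250, and the helper name listening_seconds are assumptions for this example.

```shell
# Sketch: hear the same sentence at a few speaking rates (words per minute).
# The guard makes this a no-op on systems without the macOS `say` command.
sample='The quick brown fox jumps over the lazy dog.'
for rate in 175 250 325; do
    if command -v say >/dev/null 2>&1; then
        say -r "$rate" "$sample"
    fi
done

# Rough listening time in whole seconds for the text on stdin.
# listening_seconds is a hypothetical helper; the rate defaults to 250.
listening_seconds() {
    rate="${1:-250}"
    words=$(wc -w | tr -d ' ')
    echo $(( words * 60 / rate ))
}
```

For example, a 500 word text piped into listening_seconds 250 gives 120, so about two minutes of listening.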

The code below is what I have in my .vimrc to get this functionality. It was written with a lot of help from this thread in r/vim. I am not a programmer, and I am sure this can be done more elegantly. It has nevertheless worked well for me thus far. If you use the code below, you will most likely want to adapt it to your needs and tastes, especially the mapping and the list of sed substitutions. There are also several similar open source alternatives one could use instead of the say command, such as espeak.

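" Read the contents of register x aloud, in a voice matching the spell language.
function! TTS()
    if &spelllang == 'sv'
      let s:voice = 'Alva'
    else
      let s:voice = 'Allison'
    endif
    " Strip or rewrite markup with sed, then speak in the background.
    call system('echo '. shellescape(@x) .'
         \ | sed -E "s/[<>$]//g"
         \ | sed -E "s/@[a-z-]+_[a-z-]+_([0-9]{4,4})/, citation: \\1/g"
         \ | sed -E "s/\\[\\^([a-z]+)\\]/ footnote: \\1./g"
         \ | sed -E "s/\\]{(\\.[^}]+)}//g"
         \ | sed -E "s/\\^\\[([^]]+)\\]/ ... footnote text: \\1. /g"
         \ | sed -E "s/\\[([^]]+)\\]\\([^)]+\\)/\\1/g"
         \ | sed -E "s/https?[^ ]+/URL /g"
         \ | sed -E "s/&nbsp;/ /g"
         \ | say --voice='. s:voice . ' -r 250 &')
    " Stop the speech with ESC in normal mode (in this buffer only).
    nnoremap <buffer><silent> <esc> :call system('killall say')<CR>
endfunction

" Yank the visual selection into register x and read it aloud.
vnoremap z "xy:call TTS()<cr>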
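
For those not on macOS, here is a rough sketch of how the last stage of the pipeline could fall back to espeak, mentioned above as an alternative. I have not tested this as part of my own setup. The function names tts_backend and speak_stdin are invented for this example, and it assumes espeak's -s (speed in words per minute) and --stdin options.

```shell
# Sketch: pick whichever speech synthesizer is installed.
# tts_backend and speak_stdin are hypothetical names.
tts_backend() {
    if command -v say >/dev/null 2>&1; then
        echo say
    elif command -v espeak >/dev/null 2>&1; then
        echo espeak
    else
        echo none
    fi
}

# Read the text on stdin aloud at the given rate (default 250).
speak_stdin() {
    rate="${1:-250}"
    case "$(tts_backend)" in
        say)    say -r "$rate" ;;
        espeak) espeak -s "$rate" --stdin ;;
        *)      cat >/dev/null
                echo 'no speech synthesizer found' >&2
                return 1 ;;
    esac
}
```

With espeak, the ESC mapping would presumably also need to kill espeak instead of say, and the voice names would have to be replaced with ones espeak knows.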