The graph above is based on such data from Statistics Sweden (Statistikmyndigheten). Countries of origin are ordered top to bottom by their size in 2023. The graph shows that Arabic as a minority language in Sweden, despite now being Sweden's second-largest mother tongue (Parkvall 2018), is a relatively recent phenomenon. The data contains figures from 1900 onwards, but before 1960 there were only a handful of people in Sweden born in Arabic-speaking countries (7 people in 1900 and 57 in 1950, 50 of whom were from Egypt). The period 1900–1960 is therefore excluded from the graph. The corresponding figure for 1960, the first year in the graph, is 197. In 2023, the last year in the graph, there were according to Statistics Sweden 452,327 foreign-born people from Arabic-speaking countries in Sweden, corresponding to 4.3% of the population.
These figures are not an exact reflection of the number of Arabic speakers, for two main reasons:
People from Arabic-speaking countries often have first languages other than Arabic. For the countries in question, this mainly concerns Kurdish speakers from Iraq and Syria and Amazigh speakers from Morocco and Algeria. These speakers are, however, often bilingual and then also speak the Arabic dialect of their respective countries.
There are many people born in Sweden who have grown up in Arabic-speaking homes and who therefore speak Arabic as their first language, so-called heritage speakers. Since they are not foreign-born, they are not included in these statistics. If they were counted, the total number of Arabic speakers would be somewhat higher, especially among the dialect groups that have been in Sweden longer and where the first generation has had children.
With these caveats, the figures still give a fairly good picture of the number of Arabic speakers and of the representation of the Arabic dialects in Sweden. We can see that the largest group is Syrians, with 197,000 people in 2023, followed by Iraqis with 146,000. The Iraqi dialect, however, has a longer historical presence in Sweden, dating from the 1990s, whereas almost all Syrians arrived after 2010. The number of foreign-born Lebanese and Moroccans has been stable since the 1990s. The smaller national/dialect groups have grown somewhat since 2010. Several of these are not named in the graph for lack of space.
In summary, the Syrian and Iraqi dialects are by far the largest among Arabic speakers in Sweden. The Iraqi dialect has had a significant presence in the country since the 1990s, while the Syrian dialect is today probably the largest. The third-largest dialect is Lebanese, with a number of speakers that has been relatively stable since the 1990s.
Parkvall, M. (2018). Arabiska Sveriges näst största modersmål. Svenska Dagbladet. https://www.svd.se/arabiska-sveriges-nast-storsta-modersmal/av/mikael-parkvall
Parkvall, M. (2009). Sveriges språk: Vem talar vad och var? Institutionen för lingvistik, Stockholms universitet. https://urn.kb.se/resolve?urn=urn:nbn:se:su:diva-28743
The compendium has now been used for one round of the introductory Arabic course at GU, and a number of small errors and unclear passages have been corrected and hammered out. I will continuously update and improve the material as it is used in future courses. The most immediate plan is to add more exercises. I gladly welcome feedback from any other readers.
The book consists of seven chapters:
The chapters are fully self-contained, so you can read individual chapters according to interest and in any order. If you have no prior knowledge of Arabic, it is best to read all the chapters in order. Readers studying Arabic at beginner level will benefit most from chapters 2, 3, 4, and 6. For teachers with Arabic-speaking students, chapters 2, 3, and 7 contain information that can be useful for understanding the students' linguistic background and how it may affect their learning of Swedish. People who speak Arabic but have not studied modern linguistics will find new ways of thinking about the language in chapter 6, and will probably find chapter 5 a fun read.
Each chapter ends with a short annotated bibliography under the heading Further reading, organized by the themes discussed in the chapter.1 These bibliographies serve two purposes. First, they are a way to formally give references for the information presented in the chapter and to lead readers on to more detailed information on anything they found especially interesting (or are skeptical of). Second, students writing a thesis on Arabic can find suggestions here for literature in various areas of Arabic linguistics.
I have allowed myself some typographic extravagance, such as example words in Arabic script in the margins and gray boxes around letters to explain grammar. Together with the many short lists of example words, illustrations, and thematic fact boxes, this makes for visually lively and varied pages that I hope invite browsing. (It also provides some contrast to the dry cover.)
I hope you will like it.
The format of these is stolen from The Language Hoax by John H. McWhorter (Oxford University Press, 2014), which has an innovative and interesting way of handling references in academic text. ↩
This alternative approach relies on linearly sequenced key presses without the use of modifier keys. It exploits the fact that some character sequences, such as aa, .d, and _t, are rare or non-existent in English and in many other languages. These sequences can therefore be used to insert transcription-specific characters without interfering with other typing. The entire scheme can be described as follows:
Long vowels with macron (ā, ī, ū) are typed by doubling the corresponding letter.2
a+a = ā, etc.
Dotted versions of letters (ḍ, ṭ, ġ, etc.) are typed with a dot followed by the letter. The dot is placed above or below as appropriate.
.+d = ḍ,
.+g = ġ, etc.
Underlined letters are typed with an underscore followed by the letter.
_+d = ḏ, etc.3
š is typed with v followed by s.
v+s
All of these also have corresponding uppercase versions, typed as you'd expect: .+D gives Ḍ, for example.
For ʿayn and hamza I have not figured out a good combination so I (somewhat hesitantly) keep the Alt-Latin chording:
Altp = ʿ
AltP = ʾ
The code below is what I have in my .vimrc to provide this functionality. It is toggled on and off for the current buffer with :EALLToggle. The code is rather primitive, just enabling and disabling a bunch of insert mappings, but it is simple, easy to modify for other transcription systems or user preferences, and it gets the job done.
Overall, I have found this scheme to offer much more comfortable and ergonomic typing of Arabic transcription than Alt-Latin style key-chording does.
function! EALLToggle()
if !exists("b:eallmappings")
let b:eallmappings = 0
endif
if b:eallmappings == 0
let b:eallmappings = 1
echo "EALL mappings activated for this buffer"
inoremap <buffer> <M-p> ʿ
inoremap <buffer> <M-P> ʾ
inoremap <buffer> aa ā
inoremap <buffer> ii ī
inoremap <buffer> uu ū
inoremap <buffer> AA Ā
inoremap <buffer> II Ī
inoremap <buffer> UU Ū
inoremap <buffer> .d ḍ
inoremap <buffer> .D Ḍ
inoremap <buffer> .t ṭ
inoremap <buffer> .T Ṭ
inoremap <buffer> .s ṣ
inoremap <buffer> .S Ṣ
inoremap <buffer> .r ṛ
inoremap <buffer> .R Ṛ
inoremap <buffer> .z ẓ
inoremap <buffer> .Z Ẓ
inoremap <buffer> .h ḥ
inoremap <buffer> .H Ḥ
inoremap <buffer> .g ġ
inoremap <buffer> .G Ġ
inoremap <buffer> vs š
inoremap <buffer> vS Š
inoremap <buffer> _d ḏ
inoremap <buffer> _D Ḏ
inoremap <buffer> _t ṯ
inoremap <buffer> _T Ṯ
elseif b:eallmappings == 1
let b:eallmappings = 0
echo "EALL mappings deactivated for this buffer"
" Remove every mapping set above (iunmap takes no right-hand side)
iunmap <buffer> <M-p>
iunmap <buffer> <M-P>
iunmap <buffer> aa
iunmap <buffer> ii
iunmap <buffer> uu
iunmap <buffer> AA
iunmap <buffer> II
iunmap <buffer> UU
iunmap <buffer> .d
iunmap <buffer> .D
iunmap <buffer> .t
iunmap <buffer> .T
iunmap <buffer> .s
iunmap <buffer> .S
iunmap <buffer> .r
iunmap <buffer> .R
iunmap <buffer> .z
iunmap <buffer> .Z
iunmap <buffer> .h
iunmap <buffer> .H
iunmap <buffer> .g
iunmap <buffer> .G
iunmap <buffer> vs
iunmap <buffer> vS
iunmap <buffer> _d
iunmap <buffer> _D
iunmap <buffer> _t
iunmap <buffer> _T
endif
endfunction
command! EALLToggle call EALLToggle()
Arabic transcription of entire paragraphs is generally a bad idea, but may be required for academic publishing in certain journals. ↩
Of course, if you want to extend this to non-standard long vowels like ō, it will sōn run into trouble. ↩
This is not optimal, but -d interferes with the hyphenated article, and double dd etc. are too common. ↩
This post is a practically oriented introduction to Unicode for people regularly writing in or about Arabic. In Arabic digital text, a lot of work is done under the hood to rearrange and connect letters for correct display. Quite often, however, this system produces undesired results, such as punctuation jumping around or words appearing in the incorrect order. Understanding these problems, and solving them, often requires some basic understanding of Unicode in order to engage with the text on the level of digital encoding, rather than on the level of visual display.
Vim: Throughout this post I have included boxes with tips on how to do things in Vim/Neovim, my editor of choice. If you are not a Vim user, these boxes can be ignored, and I hope they are not too distracting.
Unicode has been the standard for digital text encoding since the early 2000s. It provides one coherent system for encoding virtually all forms of written language in current use (as well as many no longer in use) and replaces the plethora of different encoding systems that were used previously. If you are typing letters other than those in the English alphabet, or reading non-English text on screen, the text is almost guaranteed to be encoded in Unicode.
For a more detailed yet accessible explanation of Unicode, see What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text. I also highly recommend having a look at The Unicode Standard (2019), the official documentation. It is a highly readable and accessible document, even for a non-specialist. I recommend reading or skimming sections 1 Introduction and 2 General structure (around 70 pages), which give a good general understanding of the system on a conceptual level, and then reading the section on the specific language you are interested in.
The Unicode standard for text encoding replaced the various extensions of ASCII (American Standard Code for Information Interchange) that had been in use since its creation in 1963. ASCII encodes 128 characters: the upper- and lowercase Latin letters used in English, digits, basic punctuation, various non-printable control characters (which for our purposes can be ignored), as well as some mathematical and other symbols. These 128 characters have come to form the backbone of computer text. Programming languages, for example, generally use only these characters.
On the most basic level, computers store information in binary form, as ones and zeroes. Any ASCII character can be expressed as a series of seven ones and zeroes, seven bits. In the table below, the top row shows the first three bits of each character and the leftmost column shows the last four. The first three bits can more conveniently be expressed as the digits 0–7. The last four bits can combine in 16 different ways. Rather than labeling these combinations with the numbers 0–15, we label them with hexadecimals, 0–F (like the normal decimal system of 0–9 but extended with A–F to get a total of sixteen: A=10, B=11, etc.). This hexadecimal system will be important later.
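As a concrete check of the arithmetic above, here is a short Python sketch (the variable names are mine, for illustration) that splits the letter A into its three-bit column and four-bit row:

```python
c = "A"
code = ord(c)               # 65, the character's code point
bits = format(code, "07b")  # '1000001': the seven ASCII bits
column = int(bits[:3], 2)   # first three bits -> column 4 in the table
row_hex = format(int(bits[3:], 2), "X")  # last four bits -> row '1'
print(bits, column, row_hex)  # A sits in column 4, row 1, i.e. 0x41
```

Reading off column 4, row 1 in the table indeed gives A, and `hex(65)` is `0x41`.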
 | | 000 | 001 | 010 | 011 | 100 | 101 | 110 | 111
 | | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7
0000 | 0 | NUL | DLE | space | 0 | @ | P | ` | p
0001 | 1 | SOH | DC1 | ! | 1 | A | Q | a | q
0010 | 2 | STX | DC2 | " | 2 | B | R | b | r
0011 | 3 | ETX | DC3 | # | 3 | C | S | c | s
0100 | 4 | EOT | DC4 | $ | 4 | D | T | d | t
0101 | 5 | ENQ | NAK | % | 5 | E | U | e | u
0110 | 6 | ACK | SYN | & | 6 | F | V | f | v
0111 | 7 | BEL | ETB | ' | 7 | G | W | g | w
1000 | 8 | BS | CAN | ( | 8 | H | X | h | x
1001 | 9 | HT | EM | ) | 9 | I | Y | i | y
1010 | A | LF | SUB | * | : | J | Z | j | z
1011 | B | VT | ESC | + | ; | K | [ | k | {
1100 | C | FF | FS | , | < | L | \ | l | |
1101 | D | CR | GS | - | = | M | ] | m | }
1110 | E | SO | RS | . | > | N | ^ | n | ~
1111 | F | SI | US | / | ? | O | _ | o | DEL
ASCII, as the name implies, was developed as an American system. It is a very efficient way to store text digitally—if you only need to write text in English. To be able to write non-English letters, people started to devise extensions of ASCII to allow for more characters to be included. Most of these extensions used an additional eighth bit, doubling the number of code points to 256. Imagine the table above twice, next to one another. The old ASCII characters were typically retained in positions 0–127, as above, with the new positions 128–255 used for new characters. For example, the Multinational Character Set extends ASCII to include letters required by many European languages, and ISCII extends ASCII to write various Indic languages.
The problem with these extensions was that there were soon a number of different standards floating around, and when opening a file you needed to know in which standard it was encoded; otherwise the characters would be jumbled and incomprehensible. If a text was written in the Indian ISCII and decoded with the Multinational Character Set, the bit sequence 1111010 would be displayed as § rather than the intended उ, and similar things would happen to all the characters in the file. Opening text files with all characters jumbled used to be quite a common experience. And, of course, writing in several languages in the same text was quite complicated. Overall, multilingual digital text was a bit of a mess.
Enter Unicode. The idea behind Unicode is to create a scheme for character encoding in which all languages are represented on an even keel in one and the same scheme. ASCII is neat in that it has a very small and carefully selected inventory, expressed in a mere seven bits, giving 2⁷ (128) code points. Unicode, by contrast, was originally designed around sixteen bits, giving 2¹⁶ (65,536) code points, and has since been extended well beyond that; in the UTF encodings, a code point is stored as a variable-length sequence of up to 32 bits. In total Unicode has 1,114,112 code points; that is, over a million different characters can be encoded. This is a lot. It is more than enough space to encode all characters from all written languages that have ever been in use. As of 2021 (Unicode v14), 144,697 of these code points are actually assigned to characters. The biggest chunk of these, 92,865 code points, is assigned to Chinese characters. And there is lots of space to spare for future expansions. Crucially, Unicode incorporates ASCII: the first 128 code points of Unicode are identical to ASCII, so any text written in ASCII is also readable with Unicode encoding.
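This backward compatibility is easy to verify: in UTF-8, the most common Unicode encoding, the first 128 code points are stored as the very same single bytes as in ASCII, while other characters take multi-byte sequences. A small Python sketch:

```python
# ASCII characters keep their single-byte encoding in UTF-8.
assert ord("A") == 65                       # same code point as in ASCII
assert "A".encode("utf-8") == b"\x41"       # one byte, bit-identical to ASCII

# Non-ASCII characters use longer byte sequences instead.
jeem = "ج"                                  # ARABIC LETTER JEEM, U+062C
assert jeem.encode("utf-8") == b"\xd8\xac"  # two bytes in UTF-8
print(len("A".encode("utf-8")), len(jeem.encode("utf-8")))  # 1 2
```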
Unicode currently encodes 159 scripts. Note that scripts are not the same as languages. For example, Swedish and English both use the same Latin script, and Arabic and Farsi use the same Arabic script, or slightly different sets of characters from the same script. The number of languages fully representable in Unicode is therefore far higher than the number of scripts. Indeed, I dare you to find a language that cannot be written in Unicode. In addition to languages of the normal sort, a large number of other sign systems are covered: forms of musical notation, emojis, astrological and alchemical symbols, typographical ornaments, and what have you. Here is a small tasting, just to give a sense of the breadth of things covered:
To familiarize yourself with the Unicode world, and to get some sense of its vastness, it is a nice little exercise to casually explore the Unicode inventory. This is a good place to start. On Mac, if you have enabled Show keyboard and emoji viewers in menu bar in the keyboard settings, you can do this by browsing the symbols in Show Emoji & Symbols from that menu. On Windows, you can find similar functionality under System Tools/Character Map.
As explained above, any given character from any script has a unique code point, essentially its number in the Unicode inventory. Most characters fit in 16 bits (16 ones and zeroes), and every set of four bits can be expressed with one hexadecimal digit. A code point in Unicode, which references a character, can therefore typically be expressed with four hexadecimals. For example, the letter Æ has the code point 00C6. This is preceded by U+ to indicate that the number refers to a Unicode code point: U+00C6. The Arabic letter ج has the code point U+062C. Having access to these code points comes in handy, as we will see below.
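The mapping between characters and hexadecimal code points is directly accessible in most programming languages; in Python, for example:

```python
import unicodedata

# Code point of a character, in the U+XXXX hexadecimal notation:
print(f"U+{ord('Æ'):04X}")    # U+00C6
# And the other way around, from code point to character:
print(chr(0x00C6))            # Æ
# The official Unicode name is also available:
print(unicodedata.name("Æ"))  # LATIN CAPITAL LETTER AE
```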
Vim: With the excellent unicode.vim plugin you can do :UnicodeTable to open the entire Unicode inventory as a massive table in plain text to browse through at your convenience.
Unicode thus provides a framework to digitally encode text from any given language. A Unicode text (like text in any other encoding scheme) is thus a series of numbers, normally expressed in hexadecimals, each referencing a character. When the computer reads this in order to display it on screen, it looks up matches to those numbers in the font you are using and displays those matches as human-readable text. Naturally, no font has all 1,114,112 characters represented. If the file contains a character that is not represented in the font you are using, the computer will instead show a replacement character, often �. Some software will look for the character in other fonts available on the system and, if found, display that one character in the other font. MS Word does this, for example. While this does show the character, the result is often not very pleasing.
In Unicode, each character (each code point) is associated with a set of properties that determine, among other things, how the character interacts with other characters. These properties are not stored in the file itself, which is just a list of numbers, but are referenced in a separate database. The most important of these properties for our purposes are:
Category. Whether the character is punctuation, letter, digit, upper or lower case, a control character, etc.
The latter, control character, is a particularly important category for our purposes. These are characters that have no visual appearance and take up no horizontal space. They are thus invisible. As shown below, you may want to be able to manipulate them, which can be tricky in commonly used word processors.
Writing direction, most commonly left-to-right (LTR), right-to-left (RTL), or neutral.
We will return to these properties below.
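Python's unicodedata module, a thin wrapper around this property database, lets you look these properties up directly:

```python
import unicodedata

# General category: Lo = letter (other), Mn = nonspacing (combining) mark,
# Cf = invisible format/control character.
print(unicodedata.category("ب"))       # Lo
print(unicodedata.category("\u0651"))  # Mn (the shadda diacritic)
print(unicodedata.category("\u200D"))  # Cf (ZERO WIDTH JOINER)

# Bidirectional class: L = left-to-right, AL = Arabic letter (right-to-left),
# CS etc. = neutral characters that adapt to their surroundings.
print(unicodedata.bidirectional("a"))  # L
print(unicodedata.bidirectional("ب"))  # AL
print(unicodedata.bidirectional("."))  # CS
```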
The true power of Unicode is that it gives you access to some 100,000 characters in one unified framework. However, no (practically useful) keyboard has 100,000 keys, and you only ever need a very small subset of all these characters, even in complex multilingual text.
There are three basic ways to access these characters in order to type or otherwise insert them in a file:
With the keyboard. This is, of course, the most basic and everyday way to enter characters into a file. Every key on the keyboard is assigned a Unicode code point associated with a character. Which character is assigned to which key is essentially arbitrary, allowing different keyboard layouts to be used for different languages and purposes on the same physical keyboard. When using a Swedish layout, some keys will produce different characters than when using an American layout. This method, however, only provides access to a small set of characters (those that fit on the keyboard) and typically only from one language at a time. Often this is all you need.
Manual selection. Many programs allow you to browse through or search the Unicode inventory for a character that you can then copy or otherwise insert into a document. Most operating systems ship with applications that do this (see above). A simple, low-tech way is to do an internet search for Unicode and the name of the character you are looking for and then copy and paste the character from the browser.
Vim: With the unicode.vim plugin, you can type part of the name of a Unicode character and, while still in insert mode, do Ctrl+x Ctrl+z to get a list of characters with names matching that string. One of these can then be selected and inserted.
By code-point. Many applications allow you to do some keyboard shortcut in combination with the hexadecimal code-point to insert a character. Windows, for example, has a nice feature where you can type out the hexadecimal code in a document, e.g., 1F63C, highlight this string, and then do Alt+x to convert the string to the respective Unicode character (😼 CAT FACE WITH A WRY SMILE).
Vim: CTRL-v followed by u and then the hexadecimal code point in insert mode inserts the character.
It is also very useful to be able to easily identify a character that you come across in a file. There is, however, no built-in way of easily doing this in the common OSs as far as I know.
Vim: ga in normal mode displays information on the character under the cursor.
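A few lines of Python can also serve this purpose and identify any character you paste in (the function name is mine, for illustration):

```python
import unicodedata

def identify(text: str) -> None:
    """Print the code point and official Unicode name of each character."""
    for ch in text:
        name = unicodedata.name(ch, "<unnamed>")
        print(f"U+{ord(ch):04X}  {name}")

identify("šج")
# U+0161  LATIN SMALL LETTER S WITH CARON
# U+062C  ARABIC LETTER JEEM
```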
Arabists often find themselves writing bidirectional text, for example a text in English with a few words in the Arabic script. This can be a real hassle because the software displaying the text reorders the letters you type, so that the Arabic parts are displayed RTL while the English parts remain LTR. It does not always get this right, or it does it in a way you would not expect. If you regularly work with bidirectional text, it is worth taking the time to understand how this works, so you can control and manipulate it.
Digital text is stored as one simple long list of characters (including spaces, line breaks, paragraph switches, etc.). These characters only have a certain order but no inherent direction. If the list contains letters that are all meant to be read in the same direction, say LTR, the computer can just list all the characters on the screen in that same direction. If the text contains parts meant to be read in different directions, e.g., English LTR and Arabic RTL, there are two options for how it may be displayed on the screen: logical order and visual order.
The logical order is the most basic (but less common) way to display bidirectional text. Here the characters are simply spewed out in the order in which they are stored in the file. You can think of this as the order in which the letters were typed. This can be in either direction, either
left-to-right
Hello, hello. اسمي اندرياس. Hello again.
or right-to-left
Hello, hello. اسمي اندرياس. Hello again.
This is, clearly, not how the text is intended to be read by humans. Either the Arabic or the English is incorrectly displayed. Nevertheless, displaying text in this way often makes editing it much more convenient.
Most software displaying bidirectional text will instead rearrange the letters to display them in the visual order, that is, how the text is intended to be read by humans. Here, the computer uses the directionality property of each character, as specified in Unicode, to rearrange the characters for human consumption. This is quite complex, as it has to account for punctuation, word boundaries, etc. The exact procedure is specified in the Unicode Bidirectional Algorithm (which I have tried, and failed, to fully understand). The same line as above rearranged with this algorithm looks like this:
Hello, hello. اسمي اندرياس. Hello again.
The letters are now reordered so that both scripts are displayed in their visually correct direction (which is not their order in the file). Your word processor, or in this case your browser, does this automatically. (In order to prevent this rearranging in the previous examples, I inserted the control characters LEFT-TO-RIGHT OVERRIDE (U+202D) and RIGHT-TO-LEFT OVERRIDE (U+202E) at the start of the line to force a specific, consistent display direction.)
This reordering often requires some manual tweaking, most commonly to deal with punctuation. In the visually reordered example above, you may have noticed that the period associated with the Arabic segment is to its right, rather than at the end of the segment to its left. The period and most other forms of punctuation have their directionality property set to neutral, meaning that they adapt to the main directionality of the paragraph, in this case LTR. The rearranging mechanism, in effect, sees a series of Arabic letters, rearranges them to run RTL, then sees the period and places it after the Arabic segment as an LTR character. Punctuation jumping around seemingly uncontrollably is one of the most common problems when typing in Arabic.
You can control the placement of characters with neutral directionality with the control characters LEFT-TO-RIGHT EMBEDDING (U+202A), RIGHT-TO-LEFT EMBEDDING (U+202B), and POP DIRECTIONAL FORMATTING (U+202C). The first two introduce an embedded segment that is to be displayed LTR or RTL, and the latter ends this segment, going back to whatever directionality is the main one of the paragraph. The following example is the same line as above, but with RIGHT-TO-LEFT EMBEDDING just before the first Arabic word اسمي and POP DIRECTIONAL FORMATTING just after the second dot:
Hello, hello. اسمي اندرياس. Hello again.
Note how your browser now places the dot in accordance with the Arabic visual ordering. If this same line is displayed in an editor that shows control characters and displays the line in logical order, it looks something like this:
Hello, hello. <202b>اسمي اندرياس.<202c> Hello again.
Displaying text like this is very helpful when editing bidirectional text in that you can see everything going on under the hood and don’t have to wrestle with the computer rearranging the text.
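In the stored character sequence, these embedding controls are ordinary (if invisible) characters that you can insert and inspect programmatically; a sketch in Python:

```python
RLE = "\u202B"  # RIGHT-TO-LEFT EMBEDDING
PDF = "\u202C"  # POP DIRECTIONAL FORMATTING

line = f"Hello, hello. {RLE}اسمي اندرياس.{PDF} Hello again."

# The controls are real characters in the logical order, but take up no
# visual space, so the string is two characters longer than it looks.
controls = [f"U+{ord(c):04X}" for c in line if c in (RLE, PDF)]
print(controls)  # ['U+202B', 'U+202C']
```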
Vim: See a previous post on how to work with bidirectional displayed in logical order in Vim.
A class of Unicode characters of particular importance for Arabists is the combining characters. These are characters that a) take up no horizontal space and b) modify or add to the preceding character. Arabic vowel signs (fatḥa, ḍamma, kasra, etc.) and the letter diacritics used in linguistic transcription (ḥ, š, etc.) are of this class. Combining characters essentially stack on the preceding character.
Unicode has no restrictions on how combining characters can be combined with one another or with non-combining characters (typically a letter) from any script. This means that you can do silly things such as ḍ̣̣͑͑͑, a d with three COMBINING LEFT RING ABOVE and three COMBINING DOT BELOW, have an Arabic letter with a bunch of fatḥas and kasras تََََََِِِِِِِ, or do something more creative: ( ▀ ͜͞ʖ▀). Since combining characters are encoded as separate characters, even though they do not appear as such, hitting Backspace with the cursor after a letter with a diacritic will only remove the diacritic, the last character before the cursor in the logical order.
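This follows from the fact that a letter plus a combining mark really is two characters, even when a precomposed single character exists. Unicode normalization converts between the two representations; in Python:

```python
import unicodedata

composed = "\u1E0D"     # ḍ as one precomposed character
decomposed = "d\u0323"  # d + COMBINING DOT BELOW: two characters
print(len(composed), len(decomposed))  # 1 2

# The two representations are canonically equivalent and can be
# converted into one another by normalization:
assert unicodedata.normalize("NFC", decomposed) == composed
assert unicodedata.normalize("NFD", composed) == decomposed
```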
Combining characters, by their nature, are not meant to be displayed in isolation, without a letter to serve as their base. If you need to show them in isolation for purposes of demonstration, the convention is to use ◌ DOTTED CIRCLE (U+25CC) as a placeholder letter. A lone fatḥa is thus displayed like this: ◌َ. (Personally, I don't like the look of this for Arabic vowel diacritics and often prefer to use the Arabic taṭwīl character as a placeholder letter: ـَ, ـِ, ـّ, etc.)
The combining characters that are used in modern typography (taškīl, vowel diacritics) are accessible with the Shift-layer of standard Arabic keyboard layouts. The basic ones are
So, for example, ب directly followed by ◌َ will show up as بَ, both occupying the same horizontal space. An effect of this is that you cannot place the cursor between the letter and the diacritic and hit Backspace to erase only the letter, nor can you highlight the diacritic without also highlighting the letter, and vice versa. Editing vowel diacritics thus takes some getting used to. Furthermore, the letter and the diacritic must be directly adjacent to show up correctly. This is why you cannot, for example, show a letter and its diacritic in different colors. The (typically invisible) code that ends the color-marking segment for the letter would have to be placed between the letter (ب) and the following diacritic (◌َ), breaking the sequence. Such effects, having different font features for a letter and its diacritic, can only be achieved in very complex and roundabout ways.
Arabic vowel diacritics should not be confused with the dots used to distinguish letters (ʾiʿjām), and the two are treated very differently in Unicode. The Arabic dotted letters are, of course, their own characters (ت U+062A, etc.), not combinations of characters like letters with vowel diacritics. However, Unicode does provide dotless forms of the dotted letter shapes, e.g., ٮ (ARABIC LETTER DOTLESS BEH U+066E). This is useful for transcribing historical texts where letter dots are used inconsistently or not at all, as well as for pedagogical purposes. These dotless characters make it possible to show, for example, how the word فبقى 'and he stayed' is written before dots are added: ڡٮٯى. There are also characters for writing the letter dots in isolation, e.g., ﮴ (ARABIC SYMBOL TWO DOTS ABOVE U+FBB4). These are, however, not combining characters and cannot be combined with the dotless forms to "reconstruct" the dotted letter.
A nice, if somewhat obscure, feature of Unicode's Arabic support is the set of characters used in traditional Arabic typesetting to mark the end of ayas, page numbers, years, etc. These are enclosing combining marks: they enclose any Arabic digits directly following them. There are five of these characters: ARABIC NUMBER SIGN (U+0600), ARABIC SIGN SANAH (U+0601), ARABIC FOOTNOTE MARKER (U+0602), ARABIC SIGN SAFHA (U+0603), and ARABIC END OF AYAH (U+06DD). Below they are written with and without a space between the enclosing mark and the following digits.
٣٤ ٣٤
٣٤ ٣٤
٣٤ ٣٤
٣٤ ٣٤
٣٤ ٣٤
(This enclosing may not show up properly in your browser, depending on your font setup. This page tries to show these examples, as well as the Quranic examples below, in the Amiri font. If this is unsuccessful, some characters may not appear in their intended shape. As a last resort, you can always copy lines from this page into a word processor and play around with different Arabic fonts until you find one that can display them properly.)
The Quran has a number of orthographic features that are not used in other texts. The ARABIC END OF AYAH character above is an example of this. For these characters, too, Unicode has you covered. To achieve correct Quranic orthography you do, however, need to dig a little deeper into the Arabic section of the Unicode inventory, beyond what is found on the regular Arabic keyboard layout. Many Arabic fonts lack glyphs for these characters, and you may need a specialized or advanced font to display all of them.
Consider the following two examples:
أَنَّ ٱللهَ بَرِىٓءࣱ مِّنَ ٱلۡمشۡرِكِينَ وَرَسُولُهُ
إِنَّمَا يَخۡشَى ٱللهَ مِنۡ عِبَادِهِ ٱلۡعُلَمَـٰۤؤُاْ
(For a real-life use of these examples, see Hallberg 2016: 73.)
Note how the sukūn does not have the normal circular form of modern typography but the open form used in the Quran; how the double ḍamma in بَرِىٓءࣱ is two visually separated signs; and how there is a small alif with madda on top of a letter mīm in ٱلۡعُلَمَـٰۤؤُاْ. All of these specialized characters can be inserted with reference to their code points or by manual lookup. There are a whole bunch of such Quran-specific characters in Unicode. Unfortunately, they are a bit spread out in the code space and do not have names that explicitly mention the Quran, so it is a bit difficult to locate them all. These are the ones I have identified:
◌ࣰ | U+08F0 | ARABIC OPEN FATHATAN |
◌ࣱ | U+08F1 | ARABIC OPEN DAMMATAN |
◌ࣲ | U+08F2 | ARABIC OPEN KASRATAN |
◌ࣿ | U+08FF | ARABIC MARK SIDEWAYS NOON GHUNNA |
◌٘ | U+0658 | ARABIC MARK NOON GHUNNA |
◌ۖ | U+06D6 | ARABIC SMALL HIGH LIGATURE SAD WITH LAM WITH ALEF MAKSURA |
◌ۗ | U+06D7 | ARABIC SMALL HIGH LIGATURE QAF WITH LAM WITH ALEF MAKSURA |
◌ۘ | U+06D8 | ARABIC SMALL HIGH MEEM INITIAL FORM |
◌ۙ | U+06D9 | ARABIC SMALL HIGH LAM ALEF |
◌ۚ | U+06DA | ARABIC SMALL HIGH JEEM |
◌ۛ | U+06DB | ARABIC SMALL HIGH THREE DOTS |
◌ۜ | U+06DC | ARABIC SMALL HIGH SEEN |
۝ | U+06DD | ARABIC END OF AYAH |
۞ | U+06DE | ARABIC START OF RUB EL HIZB |
◌ۡ | U+06E1 | ARABIC SMALL HIGH DOTLESS HEAD OF KHAH |
◌ۢ | U+06E2 | ARABIC SMALL HIGH MEEM ISOLATED FORM |
◌ۣ | U+06E3 | ARABIC SMALL LOW SEEN |
◌ۤ | U+06E4 | ARABIC SMALL HIGH MADDA |
ۥ | U+06E5 | ARABIC SMALL WAW |
ۦ | U+06E6 | ARABIC SMALL YEH |
◌ۧ | U+06E7 | ARABIC SMALL HIGH YEH |
◌ۨ | U+06E8 | ARABIC SMALL HIGH NOON |
۩ | U+06E9 | ARABIC PLACE OF SAJDAH |
◌۪ | U+06EA | ARABIC EMPTY CENTRE LOW STOP |
◌۫ | U+06EB | ARABIC EMPTY CENTRE HIGH STOP |
◌ۭ | U+06ED | ARABIC SMALL LOW MEEM |
﴾ | U+FD3E | ORNATE LEFT PARENTHESIS |
﴿ | U+FD3F | ORNATE RIGHT PARENTHESIS |
◌ٓ | U+0653 | ARABIC MADDAH ABOVE |
◌ٔ | U+0654 | ARABIC HAMZA ABOVE |
The last two also appear in modern non-Quranic Arabic orthography, but only as parts of complete letter forms (آ, أ, ؤ, etc.). In the Quran, however, they are used more freely and therefore also appear in Unicode as combining characters. ORNATE LEFT PARENTHESIS and ORNATE RIGHT PARENTHESIS are not used in the Quran per se, but to delimit citations from the Quran in other texts.
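As a concrete illustration, the characters in the table above can be entered by code point in most programming languages. Here is a minimal Python sketch (the variable names are mine, chosen for illustration):

```python
# A sketch of inserting Quran-specific characters by their code points.
# The code points are taken from the table above.
end_of_ayah = "\u06DD"    # ARABIC END OF AYAH
rub_el_hizb = "\u06DE"    # ARABIC START OF RUB EL HIZB
open_dammatan = "\u08F1"  # ARABIC OPEN DAMMATAN (a combining character)

# chr() and ord() convert between characters and code points
assert ord(rub_el_hizb) == 0x06DE
assert chr(0x08F1) == open_dammatan

# A combining character is placed directly after its base letter;
# the two code points render as one visual unit
word = "\u0628" + open_dammatan  # ARABIC LETTER BEH + OPEN DAMMATAN
assert len(word) == 2
```

Whether the resulting characters display correctly is, as noted above, up to the font.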
In digital Arabic typesetting, letters change form automatically to bind correctly with adjacent letters. Typing م then ا will produce ما. If you want to show bound forms in isolation, such as ه, you need to be able to manipulate this behavior. This is useful, for example, in pedagogical contexts and in descriptions of Arabic typography or the Arabic writing system.
Unicode features a number of control characters for manipulating letter-binding. The most useful is ZERO WIDTH JOINER (U+200D). As its name implies, it takes no space but causes adjacent letters to bind with it. With careful placement of this character, you can show all four forms of a letter the way they are typically displayed in Arabic textbooks:
ه ه ه ه
This line, displayed in logical order with control characters visible, shows as
ه ه<200d> <200d>ه<200d> <200d>ه
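The placement of the joiners can also be sketched in code. This is a hypothetical Python illustration of the sequence above; whether the four forms actually render as intended depends on the font and rendering engine:

```python
ZWJ = "\u200D"  # ZERO WIDTH JOINER: invisible, but letters bind to it
heh = "\u0647"  # ARABIC LETTER HEH

isolated = heh            # no joiner: isolated form
final = ZWJ + heh         # joiner before: final (right-binding) form
initial = heh + ZWJ       # joiner after: initial (left-binding) form
medial = ZWJ + heh + ZWJ  # joiners on both sides: medial form

# Joined into one line, mirroring the textbook-style display above
line = " ".join([isolated, final, initial, medial])
assert len(medial) == 3  # the letter plus two invisible joiners
```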
A similar effect can be achieved with the taṭwīl/kašīda character ـ (ARABIC TATWEEL U+0640), a horizontal line at the baseline that is normally used to elongate the connection between letters (تطويـــــل). This character can be accessed on the standard Arabic keyboard layout with Shift+J. The taṭwīl, however, adds a horizontal line, which may or may not be what you want. (Compare ه and هـ.)
The ZERO WIDTH JOINER can also be used to disable unwanted ligatures introduced by the typeface. Depending on the typeface, ligatures kick in when two or more specific letters appear next to one another in a given sequence. A ZERO WIDTH JOINER between these letters breaks the sequence, negating the ligature, while still allowing the letters to connect. To be precise, both letters connect to the intervening ZERO WIDTH JOINER, but visually they appear to connect to one another. Here are a few examples, with the words to the right containing a ZERO WIDTH JOINER to negate ligatures:
(These are intended to be displayed in the Amiri font, which has a large number of ligatures. If they are displayed in another font with fewer ligatures, the left- and right-hand sides may be identical for some words.)
لا | لا |
لله | لله |
يحب | يحب |
محل | محل |
المسلم | المسلم |
كما | كما |
There is also a somewhat less useful (for Arabic) ZERO WIDTH NON-JOINER (U+200C) that can be inserted between letters to prevent connections without using a word space: مرحباً.
Specialized characters are also needed for Latinate transcription of Arabic. This is primarily done with diacritics that are added to letters from the Latin alphabet: ḥ, ā, š, etc. These are handled in one of two ways in Unicode. The first is the same way as the Arabic vowel diacritics: with combining characters, independent characters that take up no horizontal space of their own but attach to the preceding letter. Thus, an a directly followed by ◌̄ (COMBINING MACRON U+0304) is displayed as ā. This ā is two separate characters displayed in one and the same letter position. If you place the cursor after this character and hit Backspace, it will only delete the last character in the sequence, i.e., the macron, leaving a lone a.
These are the combining characters you would need for the system of Arabic transcription most commonly used in Arabic linguistics:
As mentioned above, combining characters can be freely combined, to produce, for example, the ḏ̣ used in some transcription systems.
Note that all these characters also have “MODIFIER LETTER” versions that are identical in shape but are not combining characters. They take up their own horizontal space like normal letters (e.g., ˉ MODIFIER LETTER MACRON U+02C9: aˉ).
The second way this type of diacritic is handled in Unicode is with precombined characters. Continuing with our example, there is a precombined ā (LATIN SMALL LETTER A WITH MACRON U+0101). Since this is one single character, hitting Backspace after it deletes the whole thing.
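The relationship between the two encodings can be checked with Python’s unicodedata module: NFC normalization composes a base letter plus combining character into the precombined character, and NFD decomposes it again.

```python
import unicodedata

combining = "a\u0304"   # LATIN SMALL LETTER A + COMBINING MACRON
precombined = "\u0101"  # LATIN SMALL LETTER A WITH MACRON

# Visually identical, but different code point sequences
assert combining != precombined
assert len(combining) == 2 and len(precombined) == 1

# NFC composes, NFD decomposes
assert unicodedata.normalize("NFC", combining) == precombined
assert unicodedata.normalize("NFD", precombined) == combining
```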
Then you only need ʿ (MODIFIER LETTER LEFT HALF RING U+02BF) for ʿayn and ʾ (MODIFIER LETTER RIGHT HALF RING U+02BE) for hamza and you’re set.
Now, if you type a lot of transcribed text, inserting these characters one by one with manual look-up, copying and pasting, or typing character codes is tedious. For typing Arabic transcription, Mamlūk Studies Review provides the Alt-Latin keyboard layout. This layout extends the American QWERTY layout to include these extra characters via key combinations. It comes highly recommended.
The system is neatly and clearly explained on the webpage. The layout can be downloaded for Mac or Windows and is easy to install. It can then be accessed with the operating system’s keyboard-switching functionality. The characters produced by the Alt-Latin layout are the precombined versions of these letters, i.e., single characters rather than an underlying letter plus a combining character.
Vim: See this previous post on how to implement the same functionality internally in Vim.
A final (mildly pedantic) comment on Arabic transcription concerns the use of the hyphen to delineate the definite article al- and other morphemes. The hyphen in transcription is functionally different from the hyphen used in normal text. The normal hyphen (HYPHEN-MINUS U+002D) allows for line-breaks and is used for compound words or inserted to break up long words at the end of a line in justified text (text with a straight right-hand margin). It is entered by pressing the hyphen key on the keyboard. Using this normal hyphen in transcription may produce line-breaks within transcribed Arabic words. The problem with this is that the hyphen then reads as being inserted for line-breaking rather than as part of the word. The following two examples (from Versteegh 1983: 140 and Suleiman 2011: 20) illustrate this:
To avoid this, you can instead use ‑ (NON-BREAKING HYPHEN U+2011). This character is visually identical to the normal hyphen but, as the name suggests, does not allow for line-breaking and thus ensures that the entire transcribed word always stays on the same line.
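If the text already exists with ordinary hyphens, the swap is a simple character replacement. A small Python sketch (the example word is my own, for illustration):

```python
NON_BREAKING_HYPHEN = "\u2011"  # visually identical to HYPHEN-MINUS U+002D

# Hypothetical transcribed word typed with an ordinary hyphen
word = "al-kit\u0101b"  # al-kitāb

fixed = word.replace("\u002D", NON_BREAKING_HYPHEN)
assert "-" not in fixed             # no breakable hyphen left
assert NON_BREAKING_HYPHEN in fixed
```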
Unicode, it’s pretty awesome.
Hallberg, A. (2016). Case endings in Spoken Standard Arabic: Statistics, norms, and diversity in unscripted formal speech [Doctoral dissertation, Lund University]. https://lup.lub.lu.se/record/8524489
Suleiman, Y. (2011). Ideology, grammar-making and the standardization of Arabic. In B. Orfali (Ed.), In the shadow of Arabic: The centrality of language to Arabic culture. Brill. https://doi.org/10.1163/9789004216136_002
The Unicode Standard: Version 12.0 — core specifications. (2019). Unicode Consortium. http://www.unicode.org/versions/Unicode12.0.0/UnicodeStandard-12.0.pdf
Versteegh, K. (1983). Arabic grammar and corruption of speech. Al-Abhath, 31, 139–160.
Writing down advanced and complicated ideas is not simply a matter of putting your thoughts to text from start to finish; it entails carefully weighing and testing formulations, word choices, text structures, and what to cite and how. You rarely make the right choice the first time around, and you typically have to write something down to find out if it works. Choices you make also affect other parts of the paper; you don’t want to repeat yourself unnecessarily or use inconsistent terminology, for example. And since you cannot hold the entire paper in your head at once, these things will pop up in later editing passes. Also, the logically exact and unambiguous formulations required in good scientific writing, where everything is made completely explicit, are in many ways a counterintuitive and unfamiliar way of using language, quite different from everyday and literary modes of expression. For all these reasons, academic writing inevitably involves a lot of test-writing to see what works, and a lot of editing to make it work. This process is labor-intensive, but also highly rewarding and interesting.
Writing a thesis or a first longer research paper is for many students an emotionally taxing and stressful task. This stress often comes from not being familiar with the process and the intermediate steps between the initial idea and the finished text, and from not knowing how much time and effort to expect to put into it. This makes it difficult to evaluate one’s progress, and it is easy to (erroneously) assume that since you have to do so much rewriting, you must be doing something wrong. The more times you go through this entire process from start to finish, the more familiar you become with it and with its different phases. You learn that numerous cycles of rewriting and editing are a natural part of good academic writing. With this familiarity you also become more relaxed about the whole process, and it gets more enjoyable. I hope this post can be a shortcut to reaching that familiarity.
My process for writing academic papers includes multiple passes of printing and editing by hand. I find reading the printed text, as opposed to reading it on screen, puts it in a different light, making it easier to spot things that need to be changed. I also like the physicality of it, feeling the resistance of the pen against the paper and seeing the margins being filled out with notes, arrows, and doodles. Working with pen and paper also gives some welcome time away from the computer screen.
After the initial idea, my writing of a paper can be described as five steps, all described in detail below:
Some of these are repeated several times. Step 4 is the most involved in terms of writing and the most cyclical, and accordingly it gets the most space here.
These steps are not intended as a handbook or a set of prescriptive rules for how best to write a paper. They are only an example of one way to go about it. Other researchers may have developed other strategies and routines that work as well for them, or better, as this one does for me.
When I have only the idea for the paper, before even doing the research, I write down the main structure of the paper in the form of section headings, sometimes with short notes in the form of bullet lists under each heading of what that section should include. Crucially, these notes include the research question or the stated aim of the paper. This gives me an opportunity to think through and form a mental image of what needs to be done for me to answer these questions. This document, containing only headings and loosely structured notes, will eventually grow and evolve into the finished paper.
This stage involves data collection or other information gathering, reading up on related research, and doing experiments and analysis. Exactly what this step entails and how much time it takes depends on the nature of the paper. It is not the focus of this post since we are here concerned specifically with writing. The important point I want to make here is that during this research phase I take a lot of notes that I write under the appropriate headings in the document described above. Again, these notes are very simple: short abbreviated sentences or lists. These notes include how I actually do the research, so that I can later describe it in detail in the method section, and what the findings are that I want to present. If I get ideas about how to write the introduction or the conclusion, or randomly come up with some clever formulation that I might want to use somewhere in the paper, I also jot that down in the appropriate section.
Having done the research, I go through my notes, move them around if necessary to their appropriate parts in the structure, and start connecting them with text, fleshing it out to something resembling continuous readable prose. Note that I do not call this step drafting, because that might imply that the text is then in a state where it could be read by someone else. Most of the text produced in this step is still really bad prose, and intentionally so. The point here is to collect a bunch of coherent text that can later, in the next step, be reworked into good prose. The text should, however, in this stage contain all the core parts of the paper in one form or another.
Some people find it difficult to start writing a new text. I have never experienced this. Having taken plenty of organized notes during the research phase in Step 2 means that I don’t start with a clean document trying to find the first word. Rather, I start with a bunch of statements and try to connect them. I don’t even start at the beginning, but let the document grow from inside out instead of from beginning to end. If I don’t find ways of connecting some of the notes now, I leave it for later.
I think of this step of connecting the notes and writing them up in continuous text as me explaining the idea of the paper to myself. This typically reveals things that I hadn’t realized were lacking in the research, or things that I hadn’t thought of before that are needed for some explanation to make sense and for some reasoning to follow a logical sequence. This could, for example, be some method I should have tried out, some additional material that would logically fit in the research, or a concept I realize I don’t know well enough to apply or to explain in a concise and accurate manner. I then go back and redo steps 2 and 3 again for these parts.
This step also involves a lot of moving things around, since you tend to see more clearly where certain statements or explanations fit once you have everything laid out in text.
The introduction and the conclusion may still be missing at this point, or be in a very rough state. They are for me the most difficult parts to write, and they depend heavily on how the rest of the paper turns out, setting up or concluding things yet to be written. They also, more so than other parts, need to be adapted to the audience and the journal I submit to, which I may not have decided at this point.
After Step 3 (with potential iterations of steps 2 and 3 for some parts) I end up with a coherent but poorly written text. Now it is time to print it and go at it with a pen. I prefer a pen with colored ink because I like how it looks on the page. I like to do this at a café or, depending on the season, outside in a park or by the sea. Being able to work in a beautiful environment is one of the perks of the trade.
In going through the text from start to finish I, among other things,
For a paper of twenty or so pages, it takes the better part of a working day to do one editing pass, but I find I often do not have the energy to do it all in one day.
The image below shows a typical example of a page of text after one pass of editing.
On the surface, this editing is a process of improving the language and style of the paper. While this is certainly part of it, it is also a method for thinking more deeply about the material. Organizing the text, even at the clause or word level, and choosing the right wording to express an idea as exactly and unambiguously as possible (all parts of editing) require deep and intense thought about the material. In my experience, editing is therefore deeply intertwined with and an inescapable part of systematic thought, and a core part of the intellectual labor of any research project.
When I have edited the entire text with pen in hand, I then insert the edits into the electronic document. To some extent this is a simple data-entry activity, but it is also an opportunity for further edits and for reevaluating the edits done on paper, now that I see how they read when inserted in the text. However, even the more mechanical entry is oddly satisfying, as I see my handwritten notes get integrated into the text and watch it grow and develop in front of me.
For a typical paper I repeat Step 4 around five times. My printer runs warm. This might seem like a lot of boring work, but seen from the perspective of editing as the expression of thought, I find it enjoyable, and often challenging. It involves a lot of interesting text-structural and linguistic problem-solving and frequent micro-reviewing of the literature.
Somewhere between editing passes I go through the text to format it to the journal’s requirements, changing the font, paragraph formatting, putting tables in a separate file, and such. This can be a nice break from the demanding thought-intensive editing.
At some point, after many passes of editing, I reach a threshold of diminishing returns. Also, after having read through the same text (or different versions of it) multiple times, I start to become blind to it and can no longer evaluate it or see it from the perspective of a person reading it for the first time. If at this point I am happy with it, I go to Step 5. If not, there are two options. The first is to let it rest for some time, preferably a couple of weeks, before going at it again with fresh eyes. The second option is to have a colleague read it and give comments. This latter option may result in me re-evaluating and rethinking some major aspects of it, which requires more passes, perhaps focused on the specific parts the colleague was critical of. After this, I move to Step 5.
Proof-reading is a special kind of final editing pass, at which I am spectacularly bad. But it needs to get done. I find it helps to print the text in a different font, to read it aloud slowly while focusing on articulation, and to have the computer read the text aloud back to me. It is best to also have someone else, preferably a professional proof-reader, give it a final check. In this step I also carefully go through the generated bibliography to check that it has been rendered correctly.
When proof-reading, I try not to do any other types of edits, but I inevitably end up doing some. These are typically small things. Any larger edits, while possibly improving the text, will at this stage likely only have a small or questionable benefit, and are for the most part not worth the effort.
Then I’m done. (Well, until the reviewers and editors have their say.)
To drive this point home, the image below shows the first sixteen pages, of a total of around twenty, after the first editing pass. I did five editing passes for this paper, including one pass after reviewer feedback, so roughly one hundred edited pages in total.
Looking at edits like these after the fact, I always find them strangely intriguing. Part of it is probably that they represent all the work I have put into the text and are therefore something to be proud of. Part of it is also that they present a visual representation of all the accumulated thought, as it was filtered, condensed, and molded into the final text.
If you want to know more about the results of the research or the method used in it, you can download the full dissertation (written in English) here.
For skilled readers, reading text with diacritics is slightly slower than reading text without diacritics. While this is not a scientific test of this principle, you can see here that the gaze moves faster across the lines in the undiacritized Text 1. In fact, when first encountering a word, this reader looks at it for on average 465 ms in Text 1 and 519 ms in Text 2. Note that only content words are diacritized in Text 2, so this difference would probably be larger if Text 2 were completely diacritized.
Note also how the eyes move in quick, stepwise motions, even though when we read we experience it as our eyes moving smoothly and evenly across the text. (For more details on this, see this graphic.)
Text 1: without diacritics
Text 2: with diacritics
I have set up Vim to read marked text aloud using speech synthesis, with a few lines in my .vimrc to get this functionality.
The basic functionality of this implementation is that I can mark text in visual mode and have it read aloud by pressing z. The language of the speech synthesis is based on the spelllang setting (in my case Swedish or English). The reading aloud is stopped with the ESC key. I typically do something like vip to mark a paragraph and then z to have it read aloud. I stop it with ESC if I hear something read wrong, and then re-mark the rest of the paragraph and have it read aloud with v}z.
I mostly write text in pandoc-flavored markdown, and the speech synthesis reads some of the markup aloud as well, which is annoying. I therefore have several substitution commands that the text is sent through before it is passed to the speech synthesis. These substitutions either remove things or convert them to things that make more sense when read aloud. Among other things, they make the speech synthesis do the following:
- Skip the characters *, <, >, and $.
- Read pandoc citations (@<author>_<noun>_<year>) as “citation” followed by the year of the publication.
- Read footnotes as “footnote: <label>” or “footnote text:” followed by the footnote content, depending on the formatting.
- Read only the link text of [<link text>](<target>)-type links.

After some experimentation, I have set the reading rate to 250 words per minute. This is quite fast, and I have to focus to keep up and to stop it whenever I want to do an edit.
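To illustrate what one of these substitutions does, here is the citation rule reproduced in Python’s re module and applied to a hypothetical pandoc citation key (@smith_book_2020 is an invented example):

```python
import re

# The same regular expression as in the sed command, in Python form:
# a pandoc citation key is replaced by ", citation: <year>"
text = "as argued in @smith_book_2020, the form varies."
spoken = re.sub(r"@[a-z-]+_[a-z-]+_([0-9]{4})", r", citation: \1", text)
assert spoken == "as argued in , citation: 2020, the form varies."
```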
The code below is what I have in my .vimrc to get this functionality. It was written with a lot of help from this thread in r/vim. I am not a programmer, and I am sure the code could be made more elegant. It has nevertheless worked well for me thus far. If you use this code, you will most likely want to adapt it to your needs and tastes, especially the mapping and the list of sed substitutions. There are also several similar open-source alternatives one could use instead of macOS’s say, such as espeak.
function! TTS()
  " Pick a voice to match the current spell-checking language
  if &spelllang == 'sv'
    let s:voice = 'Alva'
  else
    let s:voice = 'Allison'
  endif
  " Pipe the marked text (register x) through a series of sed
  " substitutions that strip or rewrite markup, then hand it to say
  call system('echo '. shellescape(@x) .'
      \ | sed -E "s/[<>$]//g"
      \ | sed -E "s/@[a-z-]+_[a-z-]+_([0-9]{4,4})/, citation: \\1/g"
      \ | sed -E "s/\\[\\^([a-z]+)\\]/ footnote: \\1./g"
      \ | sed -E "s/\\]{(\\.[^}]+)}//g"
      \ | sed -E "s/\\^\\[([^]]+)\\]/ ... footnote text: \\1. /g"
      \ | sed -E "s/\\[([^]]+)\\]\\([^)]+\\)/\\1/g"
      \ | sed -E "s/https?[^ ]+/URL /g"
      \ | sed -E "s/ / /g"
      \ | say --voice='. s:voice . ' -r 250 &')
  " Let ESC kill the running speech synthesis
  nnoremap <buffer><silent> <esc> :call system('killall say')<CR>
endfunction
vnoremap z "xy:call TTS()<cr>