The graph above is based on such data from Statistics Sweden (Statistikmyndigheten). Countries of origin are ordered top to bottom by their size in 2023. The graph shows that Arabic as a minority language in Sweden, despite now being Sweden's second-largest mother tongue (Parkvall 2018), is a relatively recent phenomenon. The data contains figures from 1900 onwards, but before 1960 there were only a handful of people in Sweden born in Arabic-speaking countries (7 people in 1900 and 57 in 1950, 50 of whom were from Egypt). The period 1900–1960 is therefore excluded from the graph. The corresponding figure for 1960, the first year in the graph, is 197. In 2023, the last year in the graph, there were according to Statistics Sweden 452,327 foreign-born people from Arabic-speaking countries in Sweden, corresponding to 4.3% of the population.
These figures are not an exact reflection of the number of Arabic speakers, for two main reasons:
People from Arabic-speaking countries often have first languages other than Arabic. For the countries in question, this mainly concerns Kurdish speakers from Iraq and Syria and Amazigh speakers from Morocco and Algeria. These speakers are, however, often bilingual and then also speak the Arabic dialect of their respective countries.
There are many people born in Sweden who have grown up in Arabic-speaking homes and who therefore speak Arabic as their first language, so-called heritage speakers. Since they are not foreign-born, they are not included in these statistics. If they were counted, the total number of Arabic speakers would be somewhat higher, especially among the dialect groups that have been in Sweden longer and where the first generation has had children.
With these caveats, the figures still give a fairly good picture of the number of Arabic speakers and of the representation of the Arabic dialects in Sweden. We can see that the largest group is Syrians, with 197,000 people in 2023, followed by Iraqis with 146,000. The Iraqi dialect, however, has a longer historical presence in Sweden, dating from the 1990s, whereas almost all Syrians arrived after 2010. The number of foreign-born Lebanese and Moroccans has been stable since the 1990s. The smaller national/dialect groups have grown somewhat since 2010. Several of these are not named in the graph for lack of space.
In summary, the Syrian and Iraqi dialects are by far the largest among Arabic speakers in Sweden. The Iraqi dialect has had a significant presence in the country since the 1990s, while the Syrian dialect is today probably the largest. The third-largest dialect is Lebanese, with a number of speakers that has been relatively stable since the 1990s.
Parkvall, M. (2018). Arabiska Sveriges näst största modersmål. Svenska Dagbladet. https://www.svd.se/arabiska-sveriges-nast-storsta-modersmal/av/mikael-parkvall
Parkvall, M. (2009). Sveriges språk: Vem talar vad och var? Institutionen för lingvistik, Stockholms universitet. https://urn.kb.se/resolve?urn=urn:nbn:se:su:diva-28743
The compendium has now been used for one round of the introductory Arabic course at GU, and a number of small errors and unclear passages have been corrected and hammered out. I will continuously update and improve the material as it is used in future courses. The most immediate plan is to add more exercises. I gladly welcome feedback from any other readers.
The book consists of seven chapters:
The chapters are fully self-contained, so you can read individual chapters according to interest and in any order. If you have no prior knowledge of Arabic, it is best to read all the chapters in order. Readers studying Arabic at beginner level will benefit most from chapters 2, 3, 4, and 6. For teachers with Arabic-speaking students, chapters 2, 3, and 7 contain information that can be useful for understanding the students' linguistic background and how it may affect their learning of Swedish. People who speak Arabic but have not studied modern linguistics will find new ways of thinking about the language in chapter 6, and will probably find chapter 5 a fun read.
Each chapter ends with a short annotated bibliography under the heading Further reading, organized by the themes discussed in the chapter.1 These bibliographies serve two purposes. First, they are a way to formally give references for the information presented in the chapter and to lead readers on to more detailed information on anything they found especially interesting (or are skeptical of). Second, students writing a thesis on Arabic can find suggestions here for literature in various areas of Arabic linguistics.
I have allowed myself some typographic extravagance, such as example words in Arabic script in the margins and gray boxes around letters to explain grammar. Together with the many short lists of example words, illustrations, and thematic fact boxes, this makes for visually lively and varied pages that I hope invite browsing. (It also provides some contrast to the dry cover.)
I hope you will like it.
The format of these is stolen from The Language Hoax by John H. McWhorter (Oxford University Press, 2014), which has an innovative and interesting way of handling references in academic text. ↩
This alternative approach relies on linearly sequenced key presses without the use of modifier keys. It exploits the fact that some character sequences, such as aa, .d, and _t, are rare or non-existent in English and in many other languages. These sequences can therefore be used to insert transcription-specific characters without interfering with other typing. The entire scheme can be described as follows:
Long vowels with macron (ā, ī, ū) are typed by doubling the corresponding letter.2
a+a = ā, etc.
Dotted versions of letters (ḍ, ṭ, ġ, etc.) are typed with a dot followed by the letter. The dot is placed above or below as appropriate.
.+d = ḍ,
.+g = ġ, etc.
Underlined letters are typed with an underscore followed by the letter.
_+d = ḏ, etc.3
š is typed with v followed by s.
v+s
All of these also have corresponding uppercase versions, typed as you'd expect: .+D gives Ḍ, for example.
For ʿayn and hamza I have not figured out a good combination so I (somewhat hesitantly) keep the Alt-Latin chording:
Altp = ʿ
AltP = ʾ
The code below is what I have in my .vimrc to provide this functionality. It is toggled on and off for the current buffer with :EALLToggle. The code is rather primitive, just enabling and disabling a bunch of insert mappings, but it is simple, easy to modify for other transcription systems or user preferences, and it gets the job done.
Overall, I have found this scheme to offer much more comfortable and ergonomic typing of Arabic transcription than Alt-Latin style key-chording does.
function! EALLToggle()
if !exists("b:eallmappings")
let b:eallmappings = 0
endif
if b:eallmappings == 0
let b:eallmappings = 1
echo "EALL mappings activated for this buffer"
inoremap <buffer> <M-p> ʿ
inoremap <buffer> <M-P> ʾ
inoremap <buffer> aa ā
inoremap <buffer> ii ī
inoremap <buffer> uu ū
inoremap <buffer> AA Ā
inoremap <buffer> II Ī
inoremap <buffer> UU Ū
inoremap <buffer> .d ḍ
inoremap <buffer> .D Ḍ
inoremap <buffer> .t ṭ
inoremap <buffer> .T Ṭ
inoremap <buffer> .s ṣ
inoremap <buffer> .S Ṣ
inoremap <buffer> .r ṛ
inoremap <buffer> .R Ṛ
inoremap <buffer> .z ẓ
inoremap <buffer> .Z Ẓ
inoremap <buffer> .h ḥ
inoremap <buffer> .H Ḥ
inoremap <buffer> .g ġ
inoremap <buffer> .G Ġ
inoremap <buffer> vs š
inoremap <buffer> vS Š
inoremap <buffer> _d ḏ
inoremap <buffer> _D Ḏ
inoremap <buffer> _t ṯ
inoremap <buffer> _T Ṯ
elseif b:eallmappings == 1
let b:eallmappings = 0
echo "EALL mappings deactivated for this buffer"
" Remove every mapping set above (iunmap takes no right-hand side)
iunmap <buffer> <M-p>
iunmap <buffer> <M-P>
iunmap <buffer> aa
iunmap <buffer> ii
iunmap <buffer> uu
iunmap <buffer> AA
iunmap <buffer> II
iunmap <buffer> UU
iunmap <buffer> .d
iunmap <buffer> .D
iunmap <buffer> .t
iunmap <buffer> .T
iunmap <buffer> .s
iunmap <buffer> .S
iunmap <buffer> .r
iunmap <buffer> .R
iunmap <buffer> .z
iunmap <buffer> .Z
iunmap <buffer> .h
iunmap <buffer> .H
iunmap <buffer> .g
iunmap <buffer> .G
iunmap <buffer> vs
iunmap <buffer> vS
iunmap <buffer> _d
iunmap <buffer> _D
iunmap <buffer> _t
iunmap <buffer> _T
endif
endfunction
command! EALLToggle call EALLToggle()
Arabic transcription of entire paragraphs is generally a bad idea, but may be required for academic publishing in certain journals. ↩
Of course, if you want to extend this to non-standard long vowels like ō, it will sōn run into trouble. ↩
This is not optimal, but -d interferes with the hyphenated article, and double dd etc. are too common. ↩
This post is a practically oriented introduction to Unicode for people regularly writing in or about Arabic. In Arabic digital text, a lot of work is done under the hood to rearrange and connect letters for correct display. Quite often, however, this system produces undesired results, such as punctuation jumping around or words appearing in the incorrect order. Understanding these problems, and solving them, often requires some basic understanding of Unicode in order to engage with the text on the level of digital encoding, rather than on the level of visual display.
Vim: Throughout this post I have included boxes with tips on how to do things in Vim/Neovim, my editor of choice. If you are not a Vim user, these boxes can be ignored, and I hope they are not too distracting.
Unicode has been the standard for digital text encoding since the early 2000s. It provides one coherent system for encoding virtually all forms of written language in current use (as well as many no longer in use) and replaces the plethora of different encoding systems that were used previously. If you are typing letters other than those in the English alphabet, or reading non-English text on screen, the text is almost guaranteed to be encoded in Unicode.
For a more detailed yet accessible explanation of Unicode, see What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text. I also highly recommend having a look at The Unicode Standard (2019), the official documentation. It is a highly readable and accessible document, even for a non-specialist. I recommend reading or skimming sections 1 Introduction and 2 General structure (around 70 pages), which give a good general understanding of the system on a conceptual level, and then reading the section on the specific language you are interested in.
The Unicode standard for text encoding replaced the various extensions of ASCII (American Standard Code for Information Interchange) that had been in use since its creation in 1963. ASCII encodes 128 characters: the upper- and lowercase Latin letters used in English, digits, basic punctuation, various non-printable control characters (which for our purposes can be ignored), as well as some mathematical and other symbols. These 128 characters have come to form the backbone of computer text. Programming languages, for example, generally use only these characters.
On the most basic level, computers store information in binary form, as ones and zeroes. Any ASCII character can be expressed as a series of seven ones and zeroes, seven bits. In the table below, the top row shows the first three bits of each character and the leftmost column shows the last four. The first three bits can more conveniently be expressed as the digits 0–7. The last four bits can combine in 16 different ways. Rather than labeling these combinations with the numbers 0–15, we label them with hexadecimals, 0–F (like the normal decimal system of 0–9 but extended with A–F to get a total of sixteen: A=10, B=11, etc.). This hexadecimal system will be important later.
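As a concrete check of the arithmetic above, here is a short Python sketch (the variable names are mine, for illustration) that splits the letter A into its three-bit column and four-bit row:

```python
c = "A"
code = ord(c)               # 65, the character's code point
bits = format(code, "07b")  # '1000001': the seven ASCII bits
column = int(bits[:3], 2)   # first three bits -> column 4 in the table
row_hex = format(int(bits[3:], 2), "X")  # last four bits -> row '1'
print(bits, column, row_hex)  # A sits in column 4, row 1, i.e. 0x41
```

Reading off column 4, row 1 in the table indeed gives A, and `hex(65)` is `0x41`.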
 | | 000 | 001 | 010 | 011 | 100 | 101 | 110 | 111
 | | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7
0000 | 0 | NUL | DLE | space | 0 | @ | P | ` | p
0001 | 1 | SOH | DC1 | ! | 1 | A | Q | a | q
0010 | 2 | STX | DC2 | " | 2 | B | R | b | r
0011 | 3 | ETX | DC3 | # | 3 | C | S | c | s
0100 | 4 | EOT | DC4 | $ | 4 | D | T | d | t
0101 | 5 | ENQ | NAK | % | 5 | E | U | e | u
0110 | 6 | ACK | SYN | & | 6 | F | V | f | v
0111 | 7 | BEL | ETB | ' | 7 | G | W | g | w
1000 | 8 | BS | CAN | ( | 8 | H | X | h | x
1001 | 9 | HT | EM | ) | 9 | I | Y | i | y
1010 | A | LF | SUB | * | : | J | Z | j | z
1011 | B | VT | ESC | + | ; | K | [ | k | {
1100 | C | FF | FS | , | < | L | \ | l | |
1101 | D | CR | GS | - | = | M | ] | m | }
1110 | E | SO | RS | . | > | N | ^ | n | ~
1111 | F | SI | US | / | ? | O | _ | o | DEL
ASCII, as the name implies, was developed as an American system. It is a very efficient way to store text digitally—if you only need to write text in English. To be able to write non-English letters, people started to devise extensions of ASCII to allow for more characters to be included. Most of these extensions used an additional eighth bit, doubling the number of code points to 256. Imagine the table above twice, next to one another. The old ASCII characters were typically retained in positions 0–127, as above, with the new positions 128–255 used for new characters. For example, the Multinational Character Set extends ASCII to include letters required by many European languages, and ISCII extends ASCII to write various Indic languages.
The problem with these extensions was that there were soon a number of different standards floating around, and when opening a file you needed to know in which standard it was encoded; otherwise the characters would be jumbled and incomprehensible. If a text was written in the Indian ISCII and decoded with the Multinational Character Set, the bit sequence 1111010 would be displayed as § rather than the intended उ, and similar things would happen to all the characters in the file. Opening text files with all characters jumbled used to be quite a common experience. And, of course, writing in several languages in the same text was quite complicated. Overall, multilingual digital text was a bit of a mess.
Enter Unicode. The idea behind Unicode is to create a scheme for character encoding in which all languages are represented on an even keel in one and the same scheme. ASCII is neat in that it has a very small and carefully selected inventory, expressed in a mere seven bits, giving 2⁷ (128) code points. Unicode, by contrast, was originally designed around sixteen bits, giving 2¹⁶ (65,536) code points, and has since been extended well beyond that; in the UTF encodings, a code point is stored as a variable-length sequence of up to 32 bits. In total Unicode has 1,114,112 code points; that is, over a million different characters can be encoded. This is a lot. It is more than enough space to encode all characters from all written languages that have ever been in use. As of 2021 (Unicode v14), 144,697 of these code points are actually assigned to characters. The biggest chunk of these, 92,865 code points, is assigned to Chinese characters. And there is lots of space to spare for future expansions. Crucially, Unicode incorporates ASCII: the first 128 code points of Unicode are identical to ASCII, so any text written in ASCII is also readable with Unicode encoding.
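This backward compatibility is easy to verify: in UTF-8, the most common Unicode encoding, the first 128 code points are stored as the very same single bytes as in ASCII, while other characters take multi-byte sequences. A small Python sketch:

```python
# ASCII characters keep their single-byte encoding in UTF-8.
assert ord("A") == 65                       # same code point as in ASCII
assert "A".encode("utf-8") == b"\x41"       # one byte, bit-identical to ASCII

# Non-ASCII characters use longer byte sequences instead.
jeem = "ج"                                  # ARABIC LETTER JEEM, U+062C
assert jeem.encode("utf-8") == b"\xd8\xac"  # two bytes in UTF-8
print(len("A".encode("utf-8")), len(jeem.encode("utf-8")))  # 1 2
```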
Unicode currently encodes 159 scripts. Note that scripts are not the same as languages. For example, Swedish and English both use the same Latin script, and Arabic and Farsi use the same Arabic script, or slightly different sets of characters from the same script. The number of languages fully representable in Unicode is therefore far higher than the number of scripts. Indeed, I dare you to find a language that cannot be written in Unicode. In addition to languages of the normal sort, a large number of other sign systems are covered: forms of musical notation, emojis, astrological and alchemical symbols, typographical ornaments, and what have you. Here is a small tasting, just to give a sense of the breadth of things covered:
To familiarize yourself with the Unicode world, and to get some sense of its vastness, it is a nice little exercise to casually explore the Unicode inventory. This is a good place to start. On Mac, if you have enabled Show keyboard and emoji viewers in menu bar in the keyboard settings, you can do this by browsing the symbols in Show Emoji & Symbols from that menu. On Windows, you can find similar functionality under System Tools/Character Map.
As explained above, any given character from any script has a unique code point, essentially its number in the Unicode inventory. Most characters fit in 16 bits (16 ones and zeroes), and every set of four bits can be expressed with one hexadecimal digit. A code point in Unicode, which references a character, can therefore typically be expressed with four hexadecimals. For example, the letter Æ has the code point 00C6. This is preceded by U+ to indicate that the number refers to a Unicode code point: U+00C6. The Arabic letter ج has the code point U+062C. Having access to these code points comes in handy, as we will see below.
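The mapping between characters and hexadecimal code points is directly accessible in most programming languages; in Python, for example:

```python
import unicodedata

# Code point of a character, in the U+XXXX hexadecimal notation:
print(f"U+{ord('Æ'):04X}")    # U+00C6
# And the other way around, from code point to character:
print(chr(0x00C6))            # Æ
# The official Unicode name is also available:
print(unicodedata.name("Æ"))  # LATIN CAPITAL LETTER AE
```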
Vim: With the excellent unicode.vim plugin you can do :UnicodeTable to open the entire Unicode inventory as a massive table in plain text to browse through at your convenience.
Unicode thus provides a framework to digitally encode text from any given language. A Unicode text (like text in any other encoding scheme) is thus a series of numbers, normally expressed in hexadecimals, each referencing a character. When the computer reads this in order to display it on screen, it looks up matches to those numbers in the font you are using and displays those matches as human-readable text. Naturally, no font has all 1,114,112 characters represented. If the file contains a character that is not represented in the font you are using, the computer will instead show a replacement character, often �. Some software will look for the character in other fonts available on the system and, if found, display that one character in the other font. MS Word does this, for example. While this does show the character, the result is often not very pleasing.
In Unicode, each character (each code point) is associated with a set of properties that determine, among other things, how the character interacts with other characters. These properties are not stored in the file itself, which is just a list of numbers, but are referenced in a separate database. The most important of these properties for our purposes are:
Category. Whether the character is punctuation, letter, digit, upper or lower case, a control character, etc.
The latter, control character, is a particularly important category for our purposes. These are characters that have no visual appearance and take up no horizontal space. They are thus invisible. As shown below, you may want to be able to manipulate them, which can be tricky in commonly used word processors.
Writing direction, most commonly left-to-right (LTR), right-to-left (RTL), or neutral.
We will return to these properties below.
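Python's unicodedata module, a thin wrapper around this property database, lets you look these properties up directly:

```python
import unicodedata

# General category: Lo = letter (other), Mn = nonspacing (combining) mark,
# Cf = invisible format/control character.
print(unicodedata.category("ب"))       # Lo
print(unicodedata.category("\u0651"))  # Mn (the shadda diacritic)
print(unicodedata.category("\u200D"))  # Cf (ZERO WIDTH JOINER)

# Bidirectional class: L = left-to-right, AL = Arabic letter (right-to-left),
# CS etc. = neutral characters that adapt to their surroundings.
print(unicodedata.bidirectional("a"))  # L
print(unicodedata.bidirectional("ب"))  # AL
print(unicodedata.bidirectional("."))  # CS
```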
The true power of Unicode is that it gives you access to some 100,000 characters in one unified framework. However, no (practically useful) keyboard has 100,000 keys, and you only ever need a very small subset of all these characters, even in complex multilingual text.
There are three basic ways to access these characters in order to type or otherwise insert them in a file:
With the keyboard. This is, of course, the most basic and everyday way to enter characters into a file. Every key on the keyboard is assigned a Unicode code point associated with a character. Which character is assigned to which key is essentially arbitrary, allowing different keyboard layouts to be used for different languages and purposes on the same physical keyboard. When using a Swedish layout, some keys will produce different characters than when using an American layout. This method, however, only provides access to a small set of characters (those that fit on the keyboard) and typically only from one language at a time. Often this is all you need.
Manual selection. Many programs allow you to browse through or search the Unicode inventory for a character that you can then copy or otherwise insert into a document. Most operating systems ship with applications that do this (see above). A simple, low-tech way is to do an internet search for Unicode and the name of the character you are looking for and then copy and paste the character from the browser.
Vim: With the unicode.vim plugin, you can type part of the name of a Unicode character and, while still in insert mode, do Ctrl+x Ctrl+z to get a list of characters with names matching that string. One of these can then be selected and inserted.
By code-point. Many applications allow you to do some keyboard shortcut in combination with the hexadecimal code-point to insert a character. Windows, for example, has a nice feature where you can type out the hexadecimal code in a document, e.g., 1F63C, highlight this string, and then do Alt+x to convert the string to the respective Unicode character (😼 CAT FACE WITH A WRY SMILE).
Vim: CTRL-v followed by u and then the hexadecimal code point in insert mode inserts the character.
It is also very useful to be able to easily identify a character that you come across in a file. There is, however, no built-in way of easily doing this in the common OSs as far as I know.
Vim: ga in normal mode displays information on the character under the cursor.
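A few lines of Python can also serve this purpose and identify any character you paste in (the function name is mine, for illustration):

```python
import unicodedata

def identify(text: str) -> None:
    """Print the code point and official Unicode name of each character."""
    for ch in text:
        name = unicodedata.name(ch, "<unnamed>")
        print(f"U+{ord(ch):04X}  {name}")

identify("šج")
# U+0161  LATIN SMALL LETTER S WITH CARON
# U+062C  ARABIC LETTER JEEM
```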
Arabists often find themselves writing bidirectional text, for example a text in English with a few words in the Arabic script. This can be a real hassle because the software displaying the text reorders the letters you type, so that the Arabic parts are displayed RTL while the English parts remain LTR. It does not always get this right, or it does it in a way you would not expect. If you regularly work with bidirectional text, it is worth taking the time to understand how this works, so you can control and manipulate it.
Digital text is stored as one simple long list of characters (including spaces, line breaks, paragraph switches, etc.). These characters only have a certain order but no inherent direction. If the list contains letters that are all meant to be read in the same direction, say LTR, the computer can just list all the characters on the screen in that same direction. If the text contains parts meant to be read in different directions, e.g., English LTR and Arabic RTL, there are two options for how it may be displayed on the screen: logical order and visual order.
The logical order is the most basic (but less common) way to display bidirectional text. Here the characters are simply spewed out in the order in which they are stored in the file. You can think of this as the order in which the letters were typed. This can be in either direction, either
left-to-right
Hello, hello. اسمي اندرياس. Hello again.
or right-to-left
Hello, hello. اسمي اندرياس. Hello again.
This is, clearly, not how the text is intended to be read by humans. Either the Arabic or the English is incorrectly displayed. Nevertheless, displaying text in this way often makes editing it much more convenient.
Most software displaying bidirectional text will instead rearrange the letters to display them in the visual order, that is, how the text is intended to be read by humans. Here, the computer uses the directionality property of each character, as specified in Unicode, to rearrange the characters for human consumption. This is quite complex, as it has to account for punctuation, word boundaries, etc. The exact procedure is specified in the Unicode Bidirectional Algorithm (which I have tried, and failed, to fully understand). The same line as above rearranged with this algorithm looks like this:
Hello, hello. اسمي اندرياس. Hello again.
The letters are now reordered so that both scripts are displayed in their visually correct direction (which is not their order in the file). Your word processor, or in this case your browser, does this automatically. (In order to prevent this rearranging in the previous examples, I inserted the control characters LEFT-TO-RIGHT OVERRIDE (U+202D) and RIGHT-TO-LEFT OVERRIDE (U+202E) at the start of the line to force a specific, consistent display direction.)
This reordering often requires some manual tweaking, most commonly to deal with punctuation. In the visually reordered example above, you may have noticed that the period associated with the Arabic segment is to its right, rather than at the end of the segment to its left. The period and most other forms of punctuation have their directionality property set to neutral, meaning that they adapt to the main directionality of the paragraph, in this case LTR. The rearranging mechanism, in effect, sees a series of Arabic letters, rearranges them to run RTL, then sees the period and places it after the Arabic segment as an LTR character. Punctuation jumping around seemingly uncontrollably is one of the most common problems when typing in Arabic.
You can control the placement of characters with neutral directionality with the control characters LEFT-TO-RIGHT EMBEDDING (U+202A), RIGHT-TO-LEFT EMBEDDING (U+202B), and POP DIRECTIONAL FORMATTING (U+202C). The first two introduce an embedded segment that is to be displayed LTR or RTL, and the latter ends this segment, going back to whatever directionality is the main one of the paragraph. The following example is the same line as above, but with RIGHT-TO-LEFT EMBEDDING just before the first Arabic word اسمي and POP DIRECTIONAL FORMATTING just after the second dot:
Hello, hello. اسمي اندرياس. Hello again.
Note how your browser now places the dot in accordance with the Arabic visual ordering. If this same line is displayed in an editor that shows control characters and displays the line in logical order, it looks something like this:
Hello, hello. <202b>اسمي اندرياس.<202c> Hello again.
Displaying text like this is very helpful when editing bidirectional text in that you can see everything going on under the hood and don’t have to wrestle with the computer rearranging the text.
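In the stored character sequence, these embedding controls are ordinary (if invisible) characters that you can insert and inspect programmatically; a sketch in Python:

```python
RLE = "\u202B"  # RIGHT-TO-LEFT EMBEDDING
PDF = "\u202C"  # POP DIRECTIONAL FORMATTING

line = f"Hello, hello. {RLE}اسمي اندرياس.{PDF} Hello again."

# The controls are real characters in the logical order, but take up no
# visual space, so the string is two characters longer than it looks.
controls = [f"U+{ord(c):04X}" for c in line if c in (RLE, PDF)]
print(controls)  # ['U+202B', 'U+202C']
```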
Vim: See a previous post on how to work with bidirectional displayed in logical order in Vim.
A class of Unicode characters of particular importance for Arabists is the combining characters. These are characters that a) take up no horizontal space and b) modify or add to the preceding character. Arabic vowel signs (fatḥa, ḍamma, kasra, etc.) and the letter diacritics used in linguistic transcription (ḥ, š, etc.) are of this class. Combining characters essentially stack on the preceding character.
Unicode has no restrictions on how combining characters can be combined with one another or with non-combining characters (typically a letter) from any script. This means that you can do silly things such as ḍ̣̣͑͑͑, a d with three COMBINING LEFT RING ABOVE and three COMBINING DOT BELOW, have an Arabic letter with a bunch of fatḥas and kasras تََََََِِِِِِِ, or do something more creative: ( ▀ ͜͞ʖ▀). Since combining characters are encoded as separate characters, even though they do not appear as such, hitting Backspace with the cursor after a letter with a diacritic will only remove the diacritic, the last character before the cursor in the logical order.
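This follows from the fact that a letter plus a combining mark really is two characters, even when a precomposed single character exists. Unicode normalization converts between the two representations; in Python:

```python
import unicodedata

composed = "\u1E0D"     # ḍ as one precomposed character
decomposed = "d\u0323"  # d + COMBINING DOT BELOW: two characters
print(len(composed), len(decomposed))  # 1 2

# The two representations are canonically equivalent and can be
# converted into one another by normalization:
assert unicodedata.normalize("NFC", decomposed) == composed
assert unicodedata.normalize("NFD", composed) == decomposed
```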
Combining characters, by their nature, are not meant to be displayed in isolation, without a letter to serve as their base. If you need to show them in isolation for purposes of demonstration, the convention is to use ◌ DOTTED CIRCLE (U+25CC) as a placeholder letter. A lone fatḥa is thus displayed like this: ◌َ. (Personally, I don't like the look of this for Arabic vowel diacritics and often prefer to use the Arabic taṭwīl character as a placeholder letter: ـَ, ـِ, ـّ, etc.)
The combining characters that are used in modern typography (taškīl, vowel diacritics) are accessible with the Shift-layer of standard Arabic keyboard layouts. The basic ones are
So, for example, ب directly followed by ◌َ will show up as بَ, both occupying the same horizontal space. An effect of this is that you cannot place the cursor between the letter and the diacritic and hit Backspace to erase only the letter, nor can you highlight the diacritic without also highlighting the letter, and vice versa. Editing vowel diacritics thus takes some getting used to. Furthermore, the letter and the diacritic must be directly adjacent to show up correctly. This is why you cannot, for example, show a letter and its diacritic in different colors. The (typically invisible) code that ends the color-marking segment for the letter would have to be placed between the letter (ب) and the following diacritic (◌َ), breaking the sequence. Such effects, having different font features for a letter and its diacritic, can only be achieved in very complex and roundabout ways.
Arabic vowel diacritics should not be confused with the dots used to distinguish letters (ʾiʿjām), and the two are treated very differently in Unicode. The Arabic dotted letters are, of course, their own characters (ت U+062A, etc.), not combinations of characters like letters with vowel diacritics. However, Unicode does provide dotless forms of the dotted letter shapes, e.g., ٮ (ARABIC LETTER DOTLESS BEH U+066E). This is useful for transcribing historical texts where letter dots are used inconsistently or not at all, as well as for pedagogical purposes. These dotless characters make it possible to show, for example, how the word فبقى 'and he stayed' is written before dots are added: ڡٮٯى. There are also characters for writing the letter dots in isolation, e.g., ﮴ (ARABIC SYMBOL TWO DOTS ABOVE U+FBB4). These are, however, not combining characters and cannot be combined with the dotless forms to "reconstruct" the dotted letter.
A nice, if somewhat obscure, feature of Unicode's Arabic support is the set of characters used in traditional Arabic typesetting to mark the end of ayas, page numbers, years, etc. These are enclosing combining marks: they enclose any Arabic digits directly following them. There are five of these characters: ARABIC NUMBER SIGN (U+0600), ARABIC SIGN SANAH (U+0601), ARABIC FOOTNOTE MARKER (U+0602), ARABIC SIGN SAFHA (U+0603), and ARABIC END OF AYAH (U+06DD). Below they are written with and without a space between the enclosing mark and the following digits.
٣٤ ٣٤
٣٤ ٣٤
٣٤ ٣٤
٣٤ ٣٤
٣٤ ٣٤
(This enclosing may not show up properly in your browser, depending on your font setup. This page tries to show these examples, as well as the Quranic examples below, in the Amiri font. If this is unsuccessful, some characters may not appear in their intended shape. As a last resort, you can always copy lines from this page into a word processor and play around with different Arabic fonts until you find one that can display them properly.)
The Quran has a number of orthographic features that are not used in other texts. The ARABIC END OF AYAH character above is an example of this. For these characters, too, Unicode has you covered. To achieve correct Quranic orthography you do, however, need to dig a little deeper into the Arabic section of the Unicode inventory, beyond what is found on the regular Arabic keyboard layout. Many Arabic fonts lack glyphs for these characters, and you may need a specialized or advanced font to display all of them.
Consider the following two examples:
أَنَّ ٱللهَ بَرِىٓءࣱ مِّنَ ٱلۡمشۡرِكِينَ وَرَسُولُهُ
إِنَّمَا يَخۡشَى ٱللهَ مِنۡ عِبَادِهِ ٱلۡعُلَمَـٰۤؤُاْ
(For a real-life use of these examples, see Hallberg 2016: 73.)
Note how the sukūn does not have the normal circular form of modern typography but the open form used in the Quran; how the double ḍamma in بَرِىٓءࣱ is two visually separated signs; and how there is a small alif with madda on top of a letter mīm in ٱلۡعُلَمَـٰۤؤُاْ. All of these specialized characters can be inserted with reference to their code points or by manual lookup. There are a whole bunch of such Quran-specific characters in Unicode. Unfortunately, they are a bit spread out in the code space and do not have names that explicitly mention the Quran, so it is a bit difficult to locate them all. These are the ones I have identified:
◌ࣰ | U+08F0 | ARABIC OPEN FATHATAN |
◌ࣱ | U+08F1 | ARABIC OPEN DAMMATAN |
◌ࣲ | U+08F2 | ARABIC OPEN KASRATAN |
◌ࣿ | U+08FF | ARABIC MARK SIDEWAYS NOON GHUNNA |
◌٘ | U+0658 | ARABIC MARK NOON GHUNNA |
◌ۖ | U+06D6 | ARABIC SMALL HIGH LIGATURE SAD WITH LAM WITH ALEF MAKSURA |
◌ۗ | U+06D7 | ARABIC SMALL HIGH LIGATURE QAF WITH LAM WITH ALEF MAKSURA |
◌ۘ | U+06D8 | ARABIC SMALL HIGH MEEM INITIAL FORM |
◌ۙ | U+06D9 | ARABIC SMALL HIGH LAM ALEF |
◌ۚ | U+06DA | ARABIC SMALL HIGH JEEM |
◌ۛ | U+06DB | ARABIC SMALL HIGH THREE DOTS |
◌ۜ | U+06DC | ARABIC SMALL HIGH SEEN |
۝ | U+06DD | ARABIC END OF AYAH |
۞ | U+06DE | ARABIC START OF RUB EL HIZB |
◌ۡ | U+06E1 | ARABIC SMALL HIGH DOTLESS HEAD OF KHAH |
◌ۢ | U+06E2 | ARABIC SMALL HIGH MEEM ISOLATED FORM |
◌ۣ | U+06E3 | ARABIC SMALL LOW SEEN |
◌ۤ | U+06E4 | ARABIC SMALL HIGH MADDA |
ۥ | U+06E5 | ARABIC SMALL WAW |
ۦ | U+06E6 | ARABIC SMALL YEH |
◌ۧ | U+06E7 | ARABIC SMALL HIGH YEH |
◌ۨ | U+06E8 | ARABIC SMALL HIGH NOON |
۩ | U+06E9 | ARABIC PLACE OF SAJDAH |
◌۪ | U+06EA | ARABIC EMPTY CENTRE LOW STOP |
◌۫ | U+06EB | ARABIC EMPTY CENTRE HIGH STOP |
◌ۭ | U+06ED | ARABIC SMALL LOW MEEM |
﴾ | U+FD3E | ORNATE LEFT PARENTHESIS |
﴿ | U+FD3F | ORNATE RIGHT PARENTHESIS |
◌ٓ | U+0653 | ARABIC MADDAH ABOVE |
◌ٔ | U+0654 | ARABIC HAMZA ABOVE |
The last two also appear in modern non-Quranic Arabic orthography, but only as parts of complete letter forms (آ, أ, ؤ, etc.). In the Quran, however, they are used more freely and therefore also appear in Unicode as combining characters. ORNATE LEFT PARENTHESIS and ORNATE RIGHT PARENTHESIS are not used in the Quran per se, but to delimit citations from the Quran in other texts.
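As a concrete illustration, the characters in the table above can be entered by code point in most programming languages. Here is a minimal Python sketch (the variable names are mine, chosen for illustration):

```python
# A sketch of inserting Quran-specific characters by their code points.
# The code points are taken from the table above.
end_of_ayah = "\u06DD"    # ARABIC END OF AYAH
rub_el_hizb = "\u06DE"    # ARABIC START OF RUB EL HIZB
open_dammatan = "\u08F1"  # ARABIC OPEN DAMMATAN (a combining character)

# chr() and ord() convert between characters and code points
assert ord(rub_el_hizb) == 0x06DE
assert chr(0x08F1) == open_dammatan

# A combining character is placed directly after its base letter;
# the two code points render as one visual unit
word = "\u0628" + open_dammatan  # ARABIC LETTER BEH + OPEN DAMMATAN
assert len(word) == 2
```

Whether the resulting characters display correctly is, as noted above, up to the font.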
In digital Arabic typesetting, letters change form automatically to bind correctly with adjacent letters. Typing م then ا will produce ما. If you want to show bound forms in isolation, such as ه, you need to be able to manipulate this behavior. This is useful, for example, in pedagogical contexts and in descriptions of Arabic typography or the Arabic writing system.
Unicode features a number of control characters for manipulating letter-binding. The most useful is ZERO WIDTH JOINER (U+200D). As its name implies, it takes no space but causes adjacent letters to bind with it. With careful placement of this character, you can show all four forms of a letter the way they are typically displayed in Arabic textbooks:
ه ه ه ه
This line, displayed in logical order with control characters visible, shows as
ه ه<200d> <200d>ه<200d> <200d>ه
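The placement of the joiners can also be sketched in code. This is a hypothetical Python illustration of the sequence above; whether the four forms actually render as intended depends on the font and rendering engine:

```python
ZWJ = "\u200D"  # ZERO WIDTH JOINER: invisible, but letters bind to it
heh = "\u0647"  # ARABIC LETTER HEH

isolated = heh            # no joiner: isolated form
final = ZWJ + heh         # joiner before: final (right-binding) form
initial = heh + ZWJ       # joiner after: initial (left-binding) form
medial = ZWJ + heh + ZWJ  # joiners on both sides: medial form

# Joined into one line, mirroring the textbook-style display above
line = " ".join([isolated, final, initial, medial])
assert len(medial) == 3  # the letter plus two invisible joiners
```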
A similar effect can be achieved with the taṭwīl/kašīda character ـ (ARABIC TATWEEL U+0640), a horizontal line at the baseline that is normally used to elongate the connection between letters (تطويـــــل). This character can be accessed on the standard Arabic keyboard layout with Shift+J. The taṭwīl, however, adds a horizontal line, which may or may not be what you want. (Compare ه and هـ.)
The ZERO WIDTH JOINER can also be used to disable unwanted ligatures introduced by the typeface. Depending on the typeface, ligatures kick in when two or more specific letters appear next to one another in a given sequence. A ZERO WIDTH JOINER between these letters breaks the sequence, negating the ligature, while still allowing the letters to connect. To be precise, both letters connect to the intervening ZERO WIDTH JOINER, but visually they appear to connect to one another. Here are a few examples, with the words to the right containing a ZERO WIDTH JOINER to negate ligatures:
(These are intended to be displayed in the Amiri font, which has a large number of ligatures. If they are displayed in another font with fewer ligatures, the left- and right-hand sides may be identical for some words.)
لا | لا |
لله | لله |
يحب | يحب |
محل | محل |
المسلم | المسلم |
كما | كما |
There is also a somewhat less useful (for Arabic) ZERO WIDTH NON-JOINER (U+200C) that can be inserted between letters to prevent connections without using a word space: مرحباً.
Specialized characters are also needed for Latinate transcription of Arabic. This is primarily done with diacritics that are added to letters from the Latin alphabet: ḥ, ā, š, etc. These are handled in one of two ways in Unicode. The first is the same way as the Arabic vowel diacritics: with combining characters, independent characters that take up no horizontal space of their own but attach to the preceding letter. Thus, an a directly followed by ◌̄ (COMBINING MACRON U+0304) is displayed as ā. This ā is two separate characters displayed in one and the same letter position. If you place the cursor after this character and hit Backspace, it will only delete the last character in the sequence, i.e., the macron, leaving a lone a.
These are the combining characters you would need for the system of Arabic transcription most commonly used in Arabic linguistics:
As mentioned above, combining characters can be freely combined, to produce, for example, the ḏ̣ used in some transcription systems.
Note that all these characters also have “MODIFIER LETTER” versions that are identical in shape but are not combining characters. They take up their own horizontal space like normal letters (e.g., ˉ MODIFIER LETTER MACRON U+02C9: aˉ).
The second way this type of diacritic is handled in Unicode is with precombined characters. Continuing with our example, there is a precombined ā (LATIN SMALL LETTER A WITH MACRON U+0101). Since this is one single character, hitting Backspace after it deletes the whole thing.
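The relationship between the two encodings can be checked with Python’s unicodedata module: NFC normalization composes a base letter plus combining character into the precombined character, and NFD decomposes it again.

```python
import unicodedata

combining = "a\u0304"   # LATIN SMALL LETTER A + COMBINING MACRON
precombined = "\u0101"  # LATIN SMALL LETTER A WITH MACRON

# Visually identical, but different code point sequences
assert combining != precombined
assert len(combining) == 2 and len(precombined) == 1

# NFC composes, NFD decomposes
assert unicodedata.normalize("NFC", combining) == precombined
assert unicodedata.normalize("NFD", precombined) == combining
```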
Then you only need ʿ (MODIFIER LETTER LEFT HALF RING U+02BF) for ʿayn and ʾ (MODIFIER LETTER RIGHT HALF RING U+02BE) for hamza and you’re set.
Now, if you type a lot of transcribed text, inserting these characters one by one with manual look-up, copying and pasting, or typing character codes is tedious. For typing Arabic transcription, Mamlūk Studies Review provides the Alt-Latin keyboard layout. This layout extends the American QWERTY layout to include these extra characters via key combinations. It comes highly recommended.
The system is neatly and clearly explained on the webpage. The layout can be downloaded for Mac or Windows and is easy to install. It can then be accessed with the operating system’s keyboard-switching functionality. The characters produced by the Alt-Latin layout are the precombined versions of these letters, i.e., single characters rather than an underlying letter plus a combining character.
Vim: See this previous post on how to implement the same functionality internally in Vim.
A final (mildly pedantic) comment on Arabic transcription concerns the use of the hyphen to delineate the definite article al- and other morphemes. The hyphen in transcription is functionally different from the hyphen used in normal text. The normal hyphen (HYPHEN-MINUS U+002D) allows for line-breaks and is used for compound words or inserted to break up long words at the end of a line in justified text (text with a straight right-hand margin). It is entered by pressing the hyphen key on the keyboard. Using this normal hyphen in transcription may produce line-breaks within transcribed Arabic words. The problem with this is that the hyphen then reads as being inserted for line-breaking rather than as part of the word. The following two examples (from Versteegh 1983: 140 and Suleiman 2011: 20) illustrate this:
To avoid this, you can instead use ‑ (NON-BREAKING HYPHEN U+2011). This character is visually identical to the normal hyphen but, as the name suggests, does not allow for line-breaking and thus ensures that the entire transcribed word always stays on the same line.
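If the text already exists with ordinary hyphens, the swap is a simple character replacement. A small Python sketch (the example word is my own, for illustration):

```python
NON_BREAKING_HYPHEN = "\u2011"  # visually identical to HYPHEN-MINUS U+002D

# Hypothetical transcribed word typed with an ordinary hyphen
word = "al-kit\u0101b"  # al-kitāb

fixed = word.replace("\u002D", NON_BREAKING_HYPHEN)
assert "-" not in fixed             # no breakable hyphen left
assert NON_BREAKING_HYPHEN in fixed
```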
Unicode, it’s pretty awesome.
Hallberg, A. (2016). Case endings in Spoken Standard Arabic: Statistics, norms, and diversity in unscripted formal speech [Doctoral dissertation, Lund University]. https://lup.lub.lu.se/record/8524489
Suleiman, Y. (2011). Ideology, grammar-making and the standardization of Arabic. In B. Orfali (Ed.), In the shadow of Arabic: The centrality of language to Arabic culture. Brill. https://doi.org/10.1163/9789004216136_002
The Unicode Standard: Version 12.0 — core specifications. (2019). Unicode Consortium. http://www.unicode.org/versions/Unicode12.0.0/UnicodeStandard-12.0.pdf
Versteegh, K. (1983). Arabic grammar and corruption of speech. Al-Abhath, 31, 139–160.
Writing down advanced and complicated ideas is not simply a matter of putting your thoughts to text from start to finish; it entails carefully weighing and testing formulations, word choices, text structures, and what to cite and how. You rarely make the right choice the first time around, and you typically have to write something down to find out if it works. Choices you make also affect other parts of the paper; you don’t want to repeat yourself unnecessarily or use inconsistent terminology, for example. And since you cannot hold the entire paper in your head at once, these things will pop up in later editing passes. Also, the logically exact and unambiguous formulations required in good scientific writing, where everything is made completely explicit, are in many ways a counterintuitive and unfamiliar way of using language, quite different from everyday and literary modes of expression. For all these reasons, academic writing inevitably involves a lot of test-writing to see what works, and a lot of editing to make it work. This process is labor-intensive, but also highly rewarding and interesting.
Writing a thesis or a first longer research paper is for many students an emotionally taxing and stressful task. This stress often comes from not being familiar with the process and the intermediate steps between the initial idea and the finished text, and from not knowing how much time and effort to expect to put into it. This makes it difficult to evaluate one’s progress, and it is easy to (erroneously) assume that since you have to do so much rewriting, you must be doing something wrong. The more times you go through this entire process from start to finish, the more familiar you become with it and with its different phases. You learn that numerous cycles of rewriting and editing are a natural part of good academic writing. With this familiarity you also become more relaxed about the whole process, and it gets more enjoyable. I hope this post can be a shortcut to reaching that familiarity.
My process for writing academic papers includes multiple passes of printing and editing by hand. I find reading the printed text, as opposed to reading it on screen, puts it in a different light, making it easier to spot things that need to be changed. I also like the physicality of it, feeling the resistance of the pen against the paper and seeing the margins being filled out with notes, arrows, and doodles. Working with pen and paper also gives some welcome time away from the computer screen.
After the initial idea, my writing of a paper can be described as five steps, all described in detail below:
Some of these are repeated several times. Step 4 is the most involved in terms of writing and the most cyclical, and accordingly it gets the most space here.
These steps are not intended as a handbook or a set of prescriptive rules for how best to write a paper. They are only an example of one way to go about it. Other researchers may have developed other strategies and routines that work as well for them, or better, as this one does for me.
When I have only the idea for the paper, before even doing the research, I write down the main structure of the paper in the form of section headings, sometimes with short notes in the form of bullet lists under each heading of what that section should include. Crucially, these notes include the research question or the stated aim of the paper. This gives me an opportunity to think through and form a mental image of what needs to be done for me to answer these questions. This document, containing only headings and loosely structured notes, will eventually grow and evolve into the finished paper.
This stage involves data collection or other information gathering, reading up on related research, and doing experiments and analysis. Exactly what this step entails and how much time it takes depends on the nature of the paper. It is not the focus of this post since we are here concerned specifically with writing. The important point I want to make here is that during this research phase I take a lot of notes that I write under the appropriate headings in the document described above. Again, these notes are very simple: short abbreviated sentences or lists. These notes include how I actually do the research, so that I can later describe it in detail in the method section, and what the findings are that I want to present. If I get ideas about how to write the introduction or the conclusion, or randomly come up with some clever formulation that I might want to use somewhere in the paper, I also jot that down in the appropriate section.
Having done the research, I go through my notes, move them around if necessary to their appropriate parts in the structure, and start connecting them with text, fleshing it out to something resembling continuous readable prose. Note that I do not call this step drafting, because that might imply that the text is then in a state where it could be read by someone else. Most of the text produced in this step is still really bad prose, and intentionally so. The point here is to collect a bunch of coherent text that can later, in the next step, be reworked into good prose. The text should, however, in this stage contain all the core parts of the paper in one form or another.
Some people find it difficult to start writing a new text. I have never experienced this. Having taken plenty of organized notes during the research phase in Step 2 means that I don’t start with a clean document trying to find the first word. Rather, I start with a bunch of statements and try to connect them. I don’t even start at the beginning, but let the document grow from inside out instead of from beginning to end. If I don’t find ways of connecting some of the notes now, I leave it for later.
I think of this step of connecting the notes and writing them up in continuous text as me explaining the idea of the paper to myself. This typically reveals things that I hadn’t realized were lacking in the research, or things that I hadn’t thought of before that are needed for some explanation to make sense and for some reasoning to follow a logical sequence. This could, for example, be some method I should have tried out, some additional material that would logically fit in the research, or a concept I realize I don’t know well enough to apply or to explain in a concise and accurate manner. I then go back and redo steps 2 and 3 again for these parts.
This step also involves a lot of moving things around, since you tend to see more clearly where certain statements or explanations fit once you have everything laid out in text.
The introduction and the conclusion may still be missing at this point, or be in a very rough state. They are for me the most difficult parts to write, and they depend heavily on how the rest of the paper turns out, setting up or concluding things yet to be written. They also, more so than other parts, need to be adapted to the audience and the journal I submit to, which I may not have decided at this point.
After Step 3 (with potential iterations of steps 2 and 3 for some parts) I end up with a coherent but poorly written text. Now it is time to print it and go at it with a pen. I prefer a pen with colored ink because I like how it looks on the page. I like to do this at a café or, depending on the season, outside in a park or by the sea. Being able to work in a beautiful environment is one of the perks of the trade.
In going through the text from start to finish I, among other things,
For a paper of twenty or so pages, it takes the better part of a working day to do one editing pass, but I find I often do not have the energy to do it all in one day.
The image below shows a typical example of a page of text after one pass of editing.
On the surface, this editing is a process of improving the language and style of the paper. While this is certainly part of it, it is also a method for thinking more deeply about the material. Organizing the text, even at the clause or word level, and choosing the right wording to express an idea as exactly and unambiguously as possible (all parts of editing) require deep and intense thought about the material. In my experience, editing is therefore deeply intertwined with and an inescapable part of systematic thought, and a core part of the intellectual labor of any research project.
When I have edited the entire text with pen in hand, I then insert the edits into the electronic document. To some extent this is a simple data-entry activity, but it is also an opportunity for further edits and for reevaluating the edits done on paper, now that I see how they read when inserted in the text. However, even the more mechanical entry is oddly satisfying, as I see my handwritten notes get integrated into the text and watch it grow and develop in front of me.
For a typical paper I repeat Step 4 around five times. My printer runs warm. This might seem like a lot of boring work, but seen from the perspective of editing as the expression of thought, I find it enjoyable, and often challenging. It involves a lot of interesting text-structural and linguistic problem-solving and frequent micro-reviewing of the literature.
Somewhere between editing passes I go through the text to format it to the journal’s requirements, changing the font, paragraph formatting, putting tables in a separate file, and such. This can be a nice break from the demanding thought-intensive editing.
At some point, after many passes of editing, I reach a threshold of diminishing returns. Also, after having read through the same text (or different versions of it) multiple times, I start to become blind to it and can no longer evaluate it or see it from the perspective of a person reading it for the first time. If at this point I am happy with it, I go to Step 5. If not, there are two options. The first is to let it rest for some time, preferably a couple of weeks, before going at it again with fresh eyes. The second option is to have a colleague read it and give comments. This latter option may result in me re-evaluating and rethinking some major aspects of it, which requires more passes, perhaps focused on the specific parts the colleague was critical of. After this, I move to Step 5.
Proof-reading is a special kind of final editing pass, at which I am spectacularly bad. But it needs to get done. I find it helps to print the text in a different font, to read it aloud slowly while focusing on articulation, and to have the computer read the text aloud back to me. It is best to also have someone else, preferably a professional proof-reader, give it a final check. In this step I also carefully go through the generated bibliography to check that it has been rendered correctly.
When proof-reading, I try not to do any other types of edits, but I inevitably end up doing some. These are typically small things. Any larger edits, while possibly improving the text, will at this stage likely only have a small or questionable benefit, and are for the most part not worth the effort.
Then I’m done. (Well, until the reviewers and editors have their say.)
To drive this point home, the image below shows the first sixteen pages, of a total of around twenty, after the first editing pass. I did five editing passes for this paper, including one pass after reviewer feedback, so roughly one hundred edited pages in total.
Looking at edits like these after the fact, I always find them strangely intriguing. Part of it is probably that they represent all the work I have put into the text and are therefore something to be proud of. Part of it is also that they present a visual representation of all the accumulated thought, as it was filtered, condensed, and molded into the final text.
If you want to know more about the results of the research or the method used in it, you can download the full dissertation (written in English) here.
For skilled readers, reading text with diacritics is slightly slower than reading text without diacritics. While this is not a scientific test of this principle, you can see here that the gaze moves faster across the lines in the undiacritized Text 1. In fact, when first encountering a word, this reader looks at it for on average 465 ms in Text 1 and 519 ms in Text 2. Note that only content words are diacritized in Text 2, so this difference would probably be larger if Text 2 were completely diacritized.
Note also how the eyes move in quick, stepwise motions, even though when we read we experience it as our eyes moving smoothly and evenly across the text. (For more details on this, see this graphic.)
Text 1: without diacritics
Text 2: with diacritics
I have set up Vim to read marked text aloud using speech synthesis, with a few lines in my .vimrc to get this functionality.
The basic functionality of this implementation is that I can mark text in visual mode and have it read aloud by pressing z. The language of the speech synthesis is based on the spelllang setting (in my case Swedish or English). The reading aloud is stopped with the ESC key. I typically do something like vip to mark a paragraph and then z to have it read aloud. I stop it with ESC if I hear something read wrong, and then re-mark the rest of the paragraph and have it read aloud with v}z.
I mostly write text in pandoc-flavored markdown, and the speech synthesis reads some of the markup aloud as well, which is annoying. I therefore have several substitution commands that the text is sent through before it is passed to the speech synthesis. These substitutions either remove things or convert them to things that make more sense when read aloud. Among other things, they make the speech synthesis do the following:
- Skip the characters *, <, >, and $.
- Read pandoc citations (@<author>_<noun>_<year>) as “citation” followed by the year of the publication.
- Read footnotes as “footnote: <label>” or “footnote text:” followed by the footnote content, depending on the formatting.
- Read only the link text of [<link text>](<target>)-type links.

After some experimentation, I have set the reading rate to 250 words per minute. This is quite fast, and I have to focus to keep up and to stop it whenever I want to do an edit.
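To illustrate what one of these substitutions does, here is the citation rule reproduced in Python’s re module and applied to a hypothetical pandoc citation key (@smith_book_2020 is an invented example):

```python
import re

# The same regular expression as in the sed command, in Python form:
# a pandoc citation key is replaced by ", citation: <year>"
text = "as argued in @smith_book_2020, the form varies."
spoken = re.sub(r"@[a-z-]+_[a-z-]+_([0-9]{4})", r", citation: \1", text)
assert spoken == "as argued in , citation: 2020, the form varies."
```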
The code below is what I have in my .vimrc to get this functionality. It was written with a lot of help from this thread in r/vim. I am not a programmer, and I am sure the code could be made more elegant. It has nevertheless worked well for me thus far. If you use this code, you will most likely want to adapt it to your needs and tastes, especially the mapping and the list of sed substitutions. There are also several similar open-source alternatives one could use instead of macOS’s say, such as espeak.
function! TTS()
  " Pick a voice to match the current spell-checking language
  if &spelllang == 'sv'
    let s:voice = 'Alva'
  else
    let s:voice = 'Allison'
  endif
  " Pipe the marked text (register x) through a series of sed
  " substitutions that strip or rewrite markup, then hand it to say
  call system('echo '. shellescape(@x) .'
      \ | sed -E "s/[<>$]//g"
      \ | sed -E "s/@[a-z-]+_[a-z-]+_([0-9]{4,4})/, citation: \\1/g"
      \ | sed -E "s/\\[\\^([a-z]+)\\]/ footnote: \\1./g"
      \ | sed -E "s/\\]{(\\.[^}]+)}//g"
      \ | sed -E "s/\\^\\[([^]]+)\\]/ ... footnote text: \\1. /g"
      \ | sed -E "s/\\[([^]]+)\\]\\([^)]+\\)/\\1/g"
      \ | sed -E "s/https?[^ ]+/URL /g"
      \ | sed -E "s/ / /g"
      \ | say --voice='. s:voice . ' -r 250 &')
  " Let ESC kill the running speech synthesis
  nnoremap <buffer><silent> <esc> :call system('killall say')<CR>
endfunction
vnoremap z "xy:call TTS()<cr>