http://andreasmhallberg.github.io/oslo
Preliminaries
- Terminal
- Plain text (txt, html, markdown)
- Vim
Unicode
Principles
- Standard since early 2000s
-
Replaced various extensions of ASCII (1963) for different languages
- In Unicode all scripts are equal
- 1,114,112 available code points (137,929 encoded, v. 12, 2019)
- Seeks to represent all written languages that have ever existed
- The manual! (The Unicode Standard, 2019)
- Read introduction + language of interest
- A font has glyph for subset of all these
- Softwares substitute font for missing glyphs
- Arbitrarily mapped to keyboard
Encoding
A Æ 我 𐎄 ﷺ 𓀍 🤷 😼
- For each character:
- Code point, conveniently expressed in hexadecimal (0–F), e.g. U+00C6 Æ
- Name, e.g.
LATIN CAPITAL LETTER AE
- Properties (writing direction, combining, etc.)
- Explore character inventory
- OS application
:UnicodeTable
(unicode.vim)- Inserting characters
- Keyboard layouts
- With code-point
CTRL+V u [code point]
- Insert with search
CTRL+X CTRL+Z
(unicode.vim plugin) egyp
- Inspect character:
ga
(Vim)
(Bi-)Directionality
- Digital text as list of characters (including line breaks, etc.)
- Order of characters
-
Logical order: as digitally stored, as read by computers
Hello, hello. اسمي اندرياس. Hello again.
(Invisible RTL-OVERRIDE character added at start of line)
-
Visual order: as visually displayed, as read by humans
Hello, hello. اسمي اندرياس. Hello again.
- Direction properties
- RTL (Latin letters)
- LTR (Arabic letters)
- Neutral (Most punctuation, space)
- Direction properties
-
Swap direction and keyboard in Vim.
http://andreasmhallberg.github.io/typing-arabic-in-vim/
-
- Control characters
- U+200E LEFT-RIGHT EMBEDDING
- U+200F RIGHT-LEFT EMBEDDING
- U+202C POP DIRECTIONAL FORMATTING
As above with Arabic text enclosed in RTL embedding and POP:
Hello, hello. اسمي اندرياس. Hello again.
- Override
- U+202E RIGHT-TO-LEFT OVERRIDE
- U+202D LEFT-TO-RIGHT OVERRIDE
Combining characters
- Takes no horizontal space
- Added to preceding character
- E.g, diacritics (ḥ, š) and Arabic vowel signs
- No limit to combinations
- ḍ̣̣͑: d with U+0351 COMBINING LEFT RING ABOVE and 3 x U+0323 COMBINING DOT BELOW
- بَََََََِِِِِِِ: baa with a bunch of fatḥas
̿̿ ̿̿ ̿̿ ̿'̿'\̵͇̿̿\з= ( ▀ ͜͞ʖ▀) =ε/̵͇̿̿/’̿’̿ ̿ ̿̿ ̿̿ ̿̿
-
Quranic orthography
In plain text:
أَنَّ ٱللهَ بَرِىٓءࣱ مِّنَ ٱلۡمشۡرِكِينَ وَرَسُولُهُ
إِنَّمَا يَخۡشَى ٱللهَ مِنۡ عِبَادِهِ ٱلۡعُلَمَـٰۤؤُاْ
- Enclosing combining marks
- ٢٠٢٢
- ٣٤٥
Searching with regular expressions
Method for Hallberg (in press).
- Plain Unicode characters
rg -P "في ب" 79729368.epub.html
Better in logical order:
`rg -P "في ال" 79729368.epub.html`
`rg -P "ُو[^ًٌٍَُِّْ]" 79729368.epub.html`
rg -oP "ُو[^ًٌٍَُِّْ]" 79729368.epub.html | wc -l
- Unicode code points
rg -oP "\x{064f}و[^\x{064b}\x{064b}\x{064c}\x{064d}\x{064e}\x{064f}\x{0650}\x{0651}\x{0652}]" 79729368.epub.html | wc -l
- Unicode code points with ranges
rg -oP "\x{064f}و[^\x{064b}-\x{0652}]" 79729368.epub.html | wc -l
Letter binding control
(Demonstrate with Amiri)
-
U+200D ZERO WIDTH JOINER
لم لم
لا لا
الله الله
-
U+200C ZERO WIDTH NON-JOINER
لمل لم
References
Hallberg, A. (in press). Principles of variation in the use of diacritics (taškīl) in Arabic books. Language Sciences.
Hallberg, A. (2016). Case endings in Spoken Standard Arabic: Statistics, norms, and diversity in unscripted formal speech [Doctoral dissertation, Lund University]. https://lup.lub.lu.se/record/8524489
The Unicode Standard: Version 12.0 — core specifications. (2019). Unicode Consortium. http://www.unicode.org/versions/Unicode12.0.0/UnicodeStandard-12.0.pdf