a8

Information about a8

Published on January 21, 2008

Author: Quintilliano

Source: authorstream.com

Content

Surrogate Support in Microsoft Products
Michael S. Kaplan, Software Design Engineer, Trigeminal Software, Inc.

What are surrogates?
"a coded character representation for a single abstract character that consists of a sequence of two code units, where the first unit of the pair is a high surrogate and the second is a low surrogate"

High/low surrogate?
High: U+D800 - U+DBFF
Low: U+DC00 - U+DFFF
Terminology: "surrogate pair" is preferred over "surrogate character"

Conversion example #1
Convert the first pair in the surrogate range (D800, DC00) to UTF-32:
1. D800: binary 1101100000000000 (lower ten bits: 0000000000)
2. DC00: binary 1101110000000000 (lower ten bits: 0000000000)
3. Concatenate 0000000000 + 0000000000 = x00000
4. Add x10000
Result: U+10000. This makes sense, since the first character in the surrogate range follows immediately after the last character in the 16-bit Unicode range (U+FFFF).

Conversion example #2
You have a Unicode character such as U+2040A (a CJK character in Plane 2) and wish to encode it in UTF-16:
1. Subtract x10000. Result: x1040A
2. Split into two ten-bit pieces: 0001000001 0000001010
3. Add 1101100000000000 (D800) to the high ten-bit piece (0001000001). Result: 1101100001000001 (D841)
4. Add 1101110000000000 (DC00) to the low ten-bit piece (0000001010). Result: 1101110000001010 (DC0A)
Your surrogate pair: D841, DC0A

UTF-8 conversions
Illegal conversion: six-byte UTF-8 (the two surrogate code points of UTF-16, converted separately)
Legal conversion: four-byte UTF-8 (one UTF-32 code point)

UTF-8 example
Converting each half of a Unicode surrogate pair separately:
aaaabbbbbbcccccc, zzzzyyyyyyxxxxxx
produces incorrect UTF-8 totaling 6 bytes:
1110aaaa 10bbbbbb 10cccccc 1110zzzz 10yyyyyy 10xxxxxx
Instead, take the Unicode surrogate pair:
110110wwwwzzzzyy, 110111yyyyxxxxxx
and convert it to UTF-8 totaling 4 bytes (below, uuuuu = wwww + 1):
11110uuu 10uuzzzz 10yyyyyy 10xxxxxx

Encoding choices for MS
UTF-16, mostly
Occasionally UTF-8
Even more occasionally, UTF-32
Reasons:
There was already an existing, well-tested set of APIs that support UCS-2, which is a strict subset of UTF-16.
A completely new API set was not required.
A move to UTF-32 would require twice as much space for all characters.
A move to UTF-8 would require even more than twice as much space in many cases.

The products...
Mostly the new generation of products:
Windows 2000/XP
Office XP (some support in Office 2000)
Most of these products supported Unicode already
A little bit of extra work was needed for surrogate pairs
Usually just UTF-8 support was needed

Windows 2000/XP
Uniscribe/GDI+ support for rendering
Each surrogate pair is a single grapheme
APIs like CharPrev/CharNext not changed
Extensions to fallback fonts in XP
Font CMAP extensions in XP
Lots of UTF-8 issues fixed in XP
No specific surrogate font/IME (yet)

Collation for supplementary characters
All Plane 1 (non-ideographic) characters sort after all the other non-ideographic scripts but before the ideographs.
All Plane 2 (ideographic) characters sort after all the ideographs on the BMP.
All Plane 3-14 characters (currently not assigned) are treated like any other unassigned characters. (This includes the Plane 14 language tags.)
All characters encoded in Plane 15-16 (private use) sort after all other characters.

Other system components
MLang
Internet Explorer
IIS 5.0/6.0

The downlevel story
No good support for Unicode, let alone supplementary characters
Uniscribe/RichEdit does improve the downlevel story, at least for display purposes
Officially, no surrogate support on Win9x

The Office suite
Word
FrontPage
Excel/Access
Outlook
RichEdit 4.0

Specific features
Insertion/deletion of text - All
Cursor movement - All
Font linking/fallback - All (Word's is best)
UTF-8 issues fixed - All
Enhanced word breaking - All (Word/RichEdit)
Vertical text - Word/PowerPoint/Publisher/RichEdit
Direct entry (Alt+nnnnnn, hhhhh + Alt+x) - Word/RichEdit

CHS/CHT/CHP Office
The product and the language packs support an extended Unicode IME that handles supplementary characters
An Extension B font is also included

Visual Studio[.NET]
String class and globalization namespace
StringInfo
GetTextElementEnumerator
Handles supplementary characters
Also handles composite characters
GDI+
IDE support

SQL Server
Past - no support
Present - surrogate "safe" (neutral)
Future - surrogate aware

Items not supported
Character Map
Graph 10
Outlook 10 mail headers
Collations for supplementary characters
Fonts/IMEs

Questions?
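
The two conversion examples and the four-byte UTF-8 layout from the slides can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not code from the presentation: the function names (`to_surrogate_pair`, `from_surrogate_pair`, `utf8_from_surrogate_pair`) and constants are hypothetical and chosen for this sketch.

```python
from typing import Tuple

# Names below are illustrative assumptions, not part of any Microsoft API.
HIGH_START = 0xD800          # first high surrogate
LOW_START = 0xDC00           # first low surrogate
SUPPLEMENTARY_BASE = 0x10000 # first code point beyond the BMP


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Encode a supplementary code point as a UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - SUPPLEMENTARY_BASE   # subtract x10000 (20-bit value)
    high = HIGH_START + (offset >> 10)         # add D800 to the high ten bits
    low = LOW_START + (offset & 0x3FF)         # add DC00 to the low ten bits
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Decode a UTF-16 surrogate pair to a single code point (UTF-32 value)."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    # Concatenate the two ten-bit payloads, then add x10000.
    return SUPPLEMENTARY_BASE + ((high - HIGH_START) << 10) + (low - LOW_START)


def utf8_from_surrogate_pair(high: int, low: int) -> bytes:
    """Produce the legal four-byte UTF-8 form by decoding the pair first."""
    cp = from_surrogate_pair(high, low)
    return bytes([
        0xF0 | (cp >> 18),           # 11110uuu
        0x80 | ((cp >> 12) & 0x3F),  # 10uuzzzz
        0x80 | ((cp >> 6) & 0x3F),   # 10yyyyyy
        0x80 | (cp & 0x3F),          # 10xxxxxx
    ])


high, low = to_surrogate_pair(0x2040A)
print(f"{high:04X} {low:04X}")                    # D841 DC0A
print(f"U+{from_surrogate_pair(0xD800, 0xDC00):X}")  # U+10000
print(utf8_from_surrogate_pair(high, low).hex())  # f0a0908a
```

Decoding the pair to one code point before emitting UTF-8 is what avoids the illegal six-byte form, which arises when each surrogate is encoded on its own as a three-byte sequence.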
