72, 101, 108, 108, 111
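Each encoding has its own code unit size and layout. Note that the UTF-16LE example carries an extra leading code unit: the byte order mark (U+FEFF), which shows up byte-swapped as 0xfffe, just like the rest of the little-endian values.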
UTF-8: 0x48, 0x65, 0x6c, 0x6c, 0x6f
UTF-16: 0x0048, 0x0065, 0x006c, 0x006c, 0x006f
UTF-16LE: 0xfffe, 0x4800, 0x6500, 0x6c00, 0x6c00, 0x6f00
UTF-32: 0x00000048, 0x00000065, 0x0000006c, 0x0000006c, 0x0000006f
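If you'd like to reproduce the lists above, here is a minimal Python sketch. The helper name `code_units` is mine, not part of any library, and the little-endian line assumes the values were printed by reading BOM-prefixed little-endian bytes as big-endian 16-bit integers, which is what the table appears to show.

```python
import struct

text = "Hello"

# Code points: one integer per character.
code_points = [ord(ch) for ch in text]

def code_units(data, fmt):
    """Unpack raw bytes into integers, one per code unit (hypothetical helper)."""
    size = struct.calcsize(fmt)
    return [struct.unpack_from(fmt, data, i)[0] for i in range(0, len(data), size)]

# UTF-8 code units are single bytes.
utf8_units = list(text.encode("utf-8"))

# UTF-16 and UTF-32 code units, read in big-endian order.
utf16_units = code_units(text.encode("utf-16-be"), ">H")
utf32_units = code_units(text.encode("utf-32-be"), ">I")

# Little-endian bytes with a leading BOM, deliberately read as big-endian
# so the byte swap is visible (assumption about how the table was produced).
le_bytes = "\ufeff".encode("utf-16-le") + text.encode("utf-16-le")
utf16le_swapped = code_units(le_bytes, ">H")

for name, units, width in [
    ("UTF-8", utf8_units, 2),
    ("UTF-16", utf16_units, 4),
    ("UTF-16LE", utf16le_swapped, 4),
    ("UTF-32", utf32_units, 8),
]:
    print(name + ": " + ", ".join(f"0x{u:0{width}x}" for u in units))
```

Read on a little-endian machine with native 16-bit integers, those same UTF-16LE bytes would come out as 0xfeff, 0x0048, and so on.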
So... what? Well, someone had a thinking cap on when they defined UTF-8. It's optimal for storing American English (one byte per character, identical to ASCII) and safe for legacy software libraries that expect ASCII. My favorite part is how the leading bits of the first byte tell you how many bytes the encoded code point occupies.
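To make the lead-byte trick concrete, here is a small sketch. The function name `utf8_sequence_length` is my own, and it only inspects the bit pattern of the first byte; it does not validate the rest of the sequence or reject the few lead byte values that well-formed UTF-8 never uses.

```python
def utf8_sequence_length(lead_byte):
    """Return how many bytes a UTF-8 sequence occupies, judging by its first byte."""
    if lead_byte < 0x80:           # 0xxxxxxx: plain ASCII, one byte
        return 1
    if 0xC0 <= lead_byte < 0xE0:   # 110xxxxx: two bytes
        return 2
    if 0xE0 <= lead_byte < 0xF0:   # 1110xxxx: three bytes
        return 3
    if 0xF0 <= lead_byte < 0xF8:   # 11110xxx: four bytes
        return 4
    # 10xxxxxx is a continuation byte; 11111xxx never appears in UTF-8.
    raise ValueError(f"0x{lead_byte:02x} is not a UTF-8 lead byte")

for ch in ["H", "\u00e9", "\u20ac", "\U0001F600"]:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X}: lead byte 0x{encoded[0]:02x}, "
          f"{utf8_sequence_length(encoded[0])} byte(s)")
```

The number of leading 1 bits in the first byte equals the total length of a multi-byte sequence, and every following byte starts with 10, so a decoder can always tell where a character begins.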
UTF-16 is kinda strange when you really look at it. Storing American English in it takes 100% more space than UTF-8 does. It's big (network) endian by default, meaning most computers have to swap the bytes of every code unit before using them. And it's totally incompatible with old ASCII libraries. Finally, you can't rely on one UTF-16 code unit per code point, because code points above U+FFFF are stored as surrogate pairs.
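Here is a quick sketch of the surrogate problem. Both helper names are mine, and I picked U+1F600 as an arbitrary example of a code point above U+FFFF.

```python
def utf16_code_units(s):
    """Return the UTF-16 code units of s as a list of integers (big-endian read)."""
    data = s.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]

def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into its high and low surrogates."""
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low

hello = "Hello"
emoji = "\U0001F600"

# Five code points, five code units.
print(len(hello), len(utf16_code_units(hello)))

# One code point, two code units.
print(len(emoji), [hex(u) for u in utf16_code_units(emoji)])
print([hex(u) for u in to_surrogate_pair(0x1F600)])
```

Anything that counts or indexes UTF-16 code units as if they were characters will miscount text like this.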
The Unicode 5.0 spec is gigantic. It's always stocked on the top shelf at Borders with the other oversized books. Read it and you may believe that ZWNBSP means something. :)