UTF-8 Rocks
Something I found out recently: the guys who invented UTF-8 (a Unicode encoding) and Shift-JIS (a Japanese encoding) were pretty smart. They made sure it would be nearly impossible to have problems using either of them in a system that mostly assumes ASCII but otherwise doesn't care what's in a string.

By that I mean that, for example, Japanese encodings and Unicode have all kinds of characters that take 2 or more bytes to represent. This is just an example, but let's say the code for 恋 was 9122 (hex). If you put that in an ASCII program that was looking for quote marks ("), it would see the 22 as an ASCII quote and mess up. As another example, let's say the code for 愛 was 943C. That 3C would be a <, and < is used in HTML for web pages. It would really mess up your pages.
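Here's a rough Python sketch of that failure. The code 9122 is the made-up value from the paragraph above, not a real encoding of 恋, and `find_quote_positions` is just a name I picked for a naive byte-by-byte scanner.

```python
# Made-up two-byte code from the text above (NOT a real code for 恋).
HYPOTHETICAL_KOI = bytes([0x91, 0x22])

def find_quote_positions(data: bytes) -> list:
    """Naive ASCII-style scan: report the index of every byte equal to 0x22 (")."""
    return [i for i, b in enumerate(data) if b == 0x22]

# A quoted string holding one character: " + 91 22 + "
data = b'"' + HYPOTHETICAL_KOI + b'"'

# The scanner expects two quotes but finds three, because the second
# byte of the character looks exactly like an ASCII quote.
print(find_quote_positions(data))  # [0, 2, 3]
```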

Well, in both cases, UTF-8 and Shift-JIS, the designers made sure those two examples could never happen. UTF-8 uses only bytes in the range 80-FF for any character with a code greater than 7F. That means all ASCII is unaffected, and it also means there is no possible way for misinterpreted punctuation to be hidden inside the codes for other characters. Shift-JIS uses only 40-7E and 80-FC for the second byte of a two-byte character, so it avoids most punctuation (including " and <), although there are still a few in there, such as \ (5C) and | (7C).
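To make the byte ranges concrete, here's a small Python sketch (my own illustration; the helper name `non_ascii_bytes_are_high` is made up). It checks that every byte of a non-ASCII character's UTF-8 encoding is 80 or above, and then shows one of the Shift-JIS leftovers: the second byte of 表 is 5C, the ASCII backslash.

```python
def non_ascii_bytes_are_high(text: str) -> bool:
    """True if every byte of every non-ASCII character's UTF-8 form is >= 0x80."""
    for ch in text:
        if ord(ch) > 0x7F and any(b < 0x80 for b in ch.encode("utf-8")):
            return False
    return True

sample = '<p class="love">恋と愛</p>'
encoded = sample.encode("utf-8")

print(non_ascii_bytes_are_high(sample))             # True
# A byte-level search finds exactly the real '<' and '"' characters.
print(encoded.count(b"<") == sample.count("<"))     # True
print(encoded.count(b'"') == sample.count('"'))     # True

# Shift-JIS is not quite as clean: the second byte of 表 is 0x5C.
sjis = "表".encode("shift_jis")
print(sjis.hex())        # 955c
print(b"\\" in sjis)     # True
```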

As for UTF-8, every character with a code above 7F starts with a lead byte that says how many bytes the character takes, followed by continuation bytes. That means basically all Japanese, Korean and Chinese gets turned into at least 3 bytes per character in UTF-8 instead of 2. I know some people have a slight issue with that, since it means, for example, that Japanese text stored in UTF-8 will take 50% more space than in an older Japanese-only format. But UTF-8 solves a huge problem: most non-Unicode-aware programs should be able to pass UTF-8 text through their systems. Any punctuation, control characters or keywords they are looking for will never be mistaken for part of another character. What a relief!
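A quick Python sketch of the size difference, using a sample string I made up. It also includes a small helper (my own, named `utf8_sequence_length`) that reads the sequence length off the first byte, which is how a UTF-8 decoder knows how many bytes to consume.

```python
def utf8_sequence_length(lead: int) -> int:
    """Number of bytes in a UTF-8 sequence, judged from its first byte."""
    if lead < 0x80:
        return 1            # plain ASCII
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3            # most Japanese, Korean and Chinese characters
    if 0xF0 <= lead <= 0xF4:
        return 4
    raise ValueError(f"0x{lead:02X} is not a valid UTF-8 lead byte")

text = "日本語のテキスト"   # 8 characters, all two-byte in Shift-JIS

utf8 = text.encode("utf-8")
sjis = text.encode("shift_jis")

print(len(utf8), len(sjis))           # 24 16
print(len(utf8) / len(sjis))          # 1.5
print(utf8_sequence_length(utf8[0]))  # 3
```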

For more info here's a good place to start.




Comments:
Anyone doing any work with Asian fonts etc. should have a look at CJKV Information Processing by Ken Lunde. Lots of interesting background information and history explaining why things are the way they are, and pretty much everything you need to know about the topic. It's satisfyingly weighty, too.
posted by veg, October 24, 2004 at 5:46
http://tronweb.super-nova.co.jp/unicoderevisited.html

for TRON people dumping on unicode. The surrogate pair thing bit me on the ass not long ago, so I'm kinda sympathetic...
posted by anonymousTroy, October 28, 2004 at 8:34
TRON

Very interesting but it sounds like they will never make it.  It's not just an encoding standard it's an entire OS standard.  They could get lucky but that basically comes down to a "boil the ocean" type of problem at the moment.

Also note that that article was written before the author knew Unicode covered more than 65536 codes.  He revised it later but the first half of the article is still left completely false.

posted by greggman, October 28, 2004 at 13:13
UTF8 / MBCS etc.
Yeah, I can second the plug for CJKV Information Processing. Reading that got me the basics on Asian language support, and I've used it to support Korean (KS C 5601 / KS X 1001:1992), Taiwanese (traditional Chinese, Big5), PRC (simplified Chinese, GB2312-80) and Japanese (JIS X 0208:1997, Shift-JIS encoding) in PC games that only had ASCII char* engines, none of this unicode crap. He's a good guy to chat to, and once I'd figured out support for other Asian languages like Thai (TIS 620-2533) etc., I got back to him with the new info. All the Asian MBCS schemes are well thought out and require only minimal decoding in-game; you just have to get people out of the habit of assuming 7-bit ASCII, using functions like strchr(), or trying to parse strings backwards.
posted by Dalroi, December 3, 2004 at 18:42


I am currently studying Japanese. If you find any mistakes, please let me know.