Be Developers' Conference
Unicode and International Issues
Hiroshi Lockheimer
Hiroshi Lockheimer: My name is Hiroshi. This session is about Unicode specifically -- I'm sure a lot of you have used Unicode or know of it. It's an international character encoding standard. Basically it solves, or tries to solve, the problem of having to deal with the world's many different encodings, such as ISO Latin-1, or Shift-JIS just for Japanese. There are many encoding standards in the world. Unicode tries to solve that by unifying all of them into one big code space, so you only have to know how to deal with one encoding. It's fixed-width, which means 16 bits, different from what you're used to.
But there are some tradeoffs, and I'll explain later that we don't actually use straight Unicode. We use what's called UTF-8, which tries to preserve the old style of dealing with strings, which is ASCII. Unicode is 16-bit, so there are 65,536 code positions available. About 20,000 of them are actually unassigned right now, so there's still a lot of room for expansion.
There's also a private use area that will remain unassigned, so people can put whatever characters they want in there. I doubt we'll fill up the Unicode code space any time soon, but if we do, there is a way to extend it using the surrogate extension mechanism. That basically says that two Unicode values together mean a different character, so it effectively becomes a 32-bit encoding. The BeOS does deal with that, although it's untested, because surrogates aren't actually used anywhere yet. But we do handle them in theory.
UTF-8 is a form of Unicode, a variable-length encoding of Unicode, which is a little trickier to use than straight Unicode. What it is, is a sequence of 8-bit, single-byte values that encodes straight 16-bit Unicode. Typically the characters are 1, 2, or 3 bytes in length, so it is a multibyte encoding, which is why it's a little trickier. There's a big reason we chose it: it's backwards compatible with ASCII. That was a big win.
Actually, the person who came up with the idea of us using UTF-8 was Dominic, who was in here earlier. He wrote the file system, and he didn't want to rewrite any code to deal with Unicode specifically.
The problem with straight Unicode, obviously, is that because it is 16-bit, you can't parse for a known character easily. If you have the character A, for example, its value starts with a 0x00 byte followed by the hex value of the letter. If you're parsing byte by byte for NULL without knowing that, that zero byte would be considered a NULL character, and your string would be terminated incorrectly.
With UTF-8, you can parse for the NULL character and it will mean what you're used to: just end of string. So it's compatible with C functions such as strlen(). There are some subtle semantic differences, though. strlen() actually gives you a byte length, not really a character length. You're not going to get the count of characters, because a UTF-8 character can be multibyte. So if you get 4 from strlen(), that doesn't necessarily mean there are 4 different characters in that string; it means there are 4 bytes in there, which is actually what you want most of the time.
Because when you're copying a string, you want the correct amount of memory, not the number of characters; you want to allocate 4 bytes, not 4 characters. The way UTF-8 works, as you can see from this chart, is that the first byte always indicates how many more bytes there are to follow. If the first bit is a zero, that by definition means it's just a 7-bit ASCII value. If the first byte starts with two ones, that means it's a 2-byte sequence, and likewise for 3 and 4.
So it's very easy to forward-parse UTF-8. You look at the first byte, and to jump to the next character, you jump by the number of leading ones, in essence. It's a little trickier than simply indexing an array of bytes, or of shorts in the case of straight Unicode, but given that simple rule it's pretty easy to deal with.
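As a rough illustration of that lead-byte rule, here is a minimal C++ sketch; utf8_char_len() is a hypothetical helper written for this example, not a BeOS call.

    #include <stdio.h>
    #include <string.h>

    // Number of bytes in the UTF-8 character that starts at p,
    // determined purely from the lead byte.
    static int utf8_char_len(const char *p)
    {
        unsigned char c = (unsigned char)*p;
        if ((c & 0x80) == 0x00) return 1;   // 0xxxxxxx: plain 7-bit ASCII
        if ((c & 0xE0) == 0xC0) return 2;   // 110xxxxx: 2-byte sequence
        if ((c & 0xF0) == 0xE0) return 3;   // 1110xxxx: 3-byte sequence
        return 4;                           // 11110xxx: 4-byte sequence
    }

    int main()
    {
        const char *s = "Be\xE6\x97\xA5";          // "Be" plus one 3-byte Japanese character
        printf("%ld bytes\n", (long)strlen(s));    // 5 bytes -- what you allocate
        int chars = 0;
        for (const char *p = s; *p != '\0'; p += utf8_char_len(p))
            chars++;
        printf("%d characters\n", chars);          // 3 characters -- what the user sees
        return 0;
    }

strlen() reports the byte count you would use for allocation, while walking the lead bytes gives the character count.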
For example, for our text editing class, BTextView, I added UTF-8 support in a weekend. It's not that hard.
The BeOS API is primarily UTF-8 based. For example, to set the title of a window, there's a function called BWindow::SetTitle() that expects a UTF-8 string. You can pass it the usual Roman characters, which are ASCII, and since ASCII is already valid UTF-8, you really don't have to think about it in terms of UTF-8 at all. If you want an unusual character in it, or a Japanese title, for example, you just pass it in and the app_server knows how to deal with it.
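A minimal sketch of what that means in practice, assuming only the standard BWindow::SetTitle() call; the Japanese bytes are just an example UTF-8 literal.

    #include <Window.h>

    void RetitleWindow(BWindow *window)
    {
        // Plain ASCII is already valid UTF-8, so nothing special is needed...
        window->SetTitle("My Document");

        // ...and a multibyte UTF-8 title goes through the same call; the
        // app_server draws it as long as the current font has the glyphs.
        window->SetTitle("\xE6\x96\x87\xE6\x9B\xB8");   // "document" in Japanese
    }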
The same with BTextView; it deals exclusively in UTF-8. There are some conversion functions, which I'll explain a little later, that help you convert to and from UTF-8 and the other encodings. The data types are standard; there are no new special types. A lot of systems use wide characters for this; we don't have to deal with that. Consequently, when we added Unicode support, UTF-8 support, into the kits, the only function that had to change to accommodate it was BView::KeyDown(). Before, it took a single byte; now it takes a string, which is what the user has typed on the keyboard, along with the number of bytes being passed to you. That was pretty much the only change that had to occur for us to support UTF-8 from an API standpoint.
So in a typical implementation of a KeyDown() function, a lot of the time you just want to look for a tab key or enter to invoke an action. Because those characters fall well within the 7-bit ASCII range, you can simply dereference the first byte of the UTF-8 string, treat it as an ASCII character, and switch on it.
If it's a multibyte UTF-8 character, its first byte won't be an ASCII value. So if you case on B_TAB, which is basically the tab character, a multibyte character won't jump to that case. You can always assume that an ASCII comparison like that will work. It's very simple. A lot of our kit classes, such as BButton or BCheckBox, process keys this way; this slide is cut and pasted from BButton, I think. So that's the typical way of dealing with common characters.
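A hedged sketch of that pattern, not the actual BButton source; it assumes the standard B_ENTER, B_SPACE, and B_TAB constants from InterfaceDefs.h and a hypothetical MyControl class.

    #include <InterfaceDefs.h>
    #include <View.h>

    class MyControl : public BView {
    public:
        MyControl(BRect frame)
            : BView(frame, "my control", B_FOLLOW_NONE, B_WILL_DRAW) {}

        virtual void KeyDown(const char *bytes, int32 numBytes)
        {
            // A multibyte UTF-8 character never begins with a 7-bit ASCII
            // value, so switching on the first byte is always safe.
            switch (bytes[0]) {
                case B_ENTER:
                case B_SPACE:
                    // invoke this control's action here
                    break;
                default:
                    BView::KeyDown(bytes, numBytes);
                    break;
            }
        }
    };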
The conversion functions I mentioned earlier are declared in the UTF8.h header file in the Support Kit, and they live in libbe. There's convert_to_utf8() and convert_from_utf8(). They're plain C functions, not C++, and a little tricky to deal with; there are a lot of parameters you have to pass. But they do the job. Basically you specify what sort of encoding you want to convert to or from. The flavors are listed here: B_ISO1 through B_ISO10, basically, plus the Macintosh Roman character set and a table for Shift-JIS. The functions use those tables to convert the buffer you pass in, given its length, into the buffer you supply along with its length. Converting from UTF-8 is the same thing in the reverse direction.
I urge you to use these functions because the tables are in libbe, which reduces the code size of your app; you don't have to include those tables yourself. And if there are any modifications we need to make, we can just do it in one place. So it saves a lot on code size.
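A hedged sketch of calling these routines; convert_to_utf8() is the function he mentions, but the exact parameter list shown here is an assumption from memory, so check support/UTF8.h on your release for the real prototype.

    #include <UTF8.h>
    #include <string.h>

    // Convert a Shift-JIS buffer into UTF-8 using the table built into libbe.
    void SjisToUtf8Example(const char *sjisText)
    {
        char utf8[256];
        int32 srcLen = strlen(sjisText);
        int32 dstLen = sizeof(utf8) - 1;
        int32 state  = 0;   // conversion state, used by multibyte encodings

        if (convert_to_utf8(B_SJIS_CONVERSION, sjisText, &srcLen,
                            utf8, &dstLen, &state) == B_OK) {
            utf8[dstLen] = '\0';
            // utf8 now holds the same text, ready for SetTitle(), BTextView, etc.
        }
    }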
I also wrote a simple command line tool called "xtou" which uses those functions to do text file conversions from the command line. It would be a good exercise to write a Tracker app that does the same thing; it would be a useful thing to have.
Fonts. Since this is a discussion of international issues, we'll talk a little bit about fonts. They're a very important piece.
We deal with Windows-format TrueType fonts only. Most fonts that work under NT will work without any modifications under the BeOS. A lot of times I buy Japanese fonts for NT or Windows 95, load them onto the BeOS, and they work without any modifications.
We have some global font objects: be_plain_font, be_bold_font, and be_fixed_font. They conform to the settings in this preference panel, called Fonts. These are the plain, bold, and fixed fonts, settable by the user, as you can see, and I encourage you to use those objects instead of hardcoding font names in your source code.
Hardcoding is obviously a very bad thing to do for internationalization. If you hardcode, you're assuming a specific font with the specific set of glyphs that it defines. If you hardcode a font which only has Roman glyphs, and a Japanese, Korean, Chinese, or any other user wants to use that application in their language, which is not covered by that font, they're pretty much stuck. There's nothing they can do about it.
But if you use these proxy objects, which point to a font the user can specify, the user can easily pick a font they like that contains the characters they need, and your application will work flawlessly, because UTF-8 is just one character standard that you know how to deal with. So it makes internationalizing your app a lot easier, a lot easier than having to deal with different encodings, which you would have to code for specifically with BFont.
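A minimal sketch of the difference, assuming only the global be_plain_font object and standard BView drawing calls:

    #include <Font.h>
    #include <View.h>

    void DrawGreeting(BView *view)
    {
        // Instead of hardcoding a family like "Times New Roman", which pins
        // the app to that font's glyph repertoire, use the user's choice.
        view->SetFont(be_plain_font);
        view->DrawString("Hello", BPoint(10.0, 20.0));
    }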
As you saw yesterday in the general session, we have some plans in this area. There's a lot of stuff we still need to do. Basically we've only taken one step, which is to display various characters using UTF-8. From here we want to be able to do input, so for R4 we'll have Japanese input. It's still to be decided for the most part; I haven't actually started yet.
The core conversion engine for the actual Japanese input method has been started, but for the input method framework, what will happen is that I'm coming up with a generic API that's not specifically tailored for Japanese but for input methods in general, so that third parties will easily be able to write add-ons that do Korean input, for example. I might even try to make it generic enough that you could do pen tablet input. Any sort of input that isn't standard direct Roman keyboard input will go through this input method architecture.
It will probably consist of a server that loads these add-ons, plus APIs that talk to the server. So if there are any experts out there who have suggestions, please contact me, definitely. This is a good time for that.
After input methods we'll work on localization. What I mean by localization here is detaching the graphical elements, the design of your application, from the source code. Currently, a lot of BeOS applications are written in a very old-fashioned way, which is to define the positions of buttons, the locations of windows, which button goes where, the text of the radio buttons, the labels of the widgets, all in the source code, compiled into the executable. That makes localization and internationalization very difficult, because you basically need the source code to modify an application to display itself in a different language. So we want to detach all of that from the source code and put it into resources. I'll probably end up using the archiving capabilities of the BeOS. A lot of the widgets we supply, actually all of them, have an Archive() method where they basically freeze-dry themselves into a BMessage.
That BMessage will contain the current state of that control: all the text associated with it, its value, its position, its size, everything. You can save that BMessage to disk, or into the executable's resources, and later unflatten it, load it at run time, and display it from there. That way you can simply modify these resources, these BMessages, with a ResEdit type of tool. If you're a Mac developer, I'm sure you know of ResEdit.
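A hedged sketch of that freeze-dry round trip, assuming the standard Archive(), Flatten()/Unflatten(), and instantiate_object() calls; the file path is just an example.

    #include <Archivable.h>
    #include <Button.h>
    #include <File.h>
    #include <Message.h>

    void SaveButton(BButton *button)
    {
        BMessage archive;
        button->Archive(&archive);      // label, frame, state all go into the message

        BFile file("/boot/home/button.msg", B_WRITE_ONLY | B_CREATE_FILE);
        archive.Flatten(&file);         // a localizer could now edit the label here
    }

    BButton *LoadButton()
    {
        BMessage archive;
        BFile file("/boot/home/button.msg", B_READ_ONLY);
        if (archive.Unflatten(&file) != B_OK)
            return NULL;
        return dynamic_cast<BButton *>(instantiate_object(&archive));
    }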
So modifying those resources outside of the source code would make localizing very easy. BeatWare, one of our third parties, has AppSketcher, which does that sort of work. They also do a lot of linking of widgets with actual functions; that's more the NeXT Interface Builder style of thinking. But we'll probably just end up defining a set of standards for where to store the resources that describe your interface, and from there third parties can come up with tools that modify these GUI elements.
To summarize, localization comes after input methods, which probably means it will not make it into R4. But R4 will definitely have input methods. We're soliciting a lot of input from you on how you think we should deal with this, so if you have any suggestions or comments, definitely contact me. Or if you have any questions now, I'm ready for them.
A Speaker: What type of support is here or planned for other text directions, like Chinese or Arabic? Would you have a Be string which knows which direction the text is supposed to be justified for the language selected?
Hiroshi Lockheimer: Justification or the actual direction of the text? Those are different things.
A Speaker: Both of them; any type of support for text that isn't left to right.
Hiroshi Lockheimer: Right to left. Okay.
A Speaker: That text as well.
Hiroshi Lockheimer: Sure. Currently there are no plans for that. Our BFont object actually has a method which returns whether a font is left-to-right or right-to-left, so we're definitely preparing for it, but our first priority right now is left-to-right scripts. But definitely, as we add these new international features into the OS, BTextView, for example, will take advantage of them. So when I add input methods, BTextView will take advantage of that, including in-line input.
A Speaker: What about vertical text?
Hiroshi Lockheimer: I don't think we'll do that. That's easily done by drawing the characters yourself; you can easily write your own draw-string routine which draws the characters one by one.
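A rough sketch of that idea, reusing the hypothetical utf8_char_len() helper from the earlier example; it just steps the pen down instead of across.

    #include <View.h>

    void DrawStringVertically(BView *view, const char *utf8,
                              BPoint where, float lineHeight)
    {
        for (const char *p = utf8; *p != '\0'; ) {
            int len = utf8_char_len(p);        // 1-4 bytes for this character
            view->DrawString(p, len, where);   // draw just this one character
            where.y += lineHeight;             // advance downward, not across
            p += len;
        }
    }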
A Speaker: What was the question? Vertical text?
Hiroshi Lockheimer: Yes.
A Speaker: It's better to have OS support. Things are laid out a little bit differently.
Hiroshi Lockheimer: Sure, but right now we don't have any plans for that. Our font renderer has the ability to rotate text, but that's obviously different. I guess, yeah, it would require significant additions to the font engine, which I don't think we're prepared to do right now. Maybe in the long term.
Yeah?
A Speaker: Is there a UTF-8 version of upper and lower case conversion?
Hiroshi Lockheimer: Not in our kits. That's a very tricky thing; you need a mapping table, and for certain languages, like Japanese, upper case and lower case don't make sense. A lot of times what I end up doing for convenience is use the POSIX functions, which work for characters in the 7-bit ASCII range, and otherwise just pass the character through.
A Speaker: So that would eliminate all the European upper-case characters?
Hiroshi Lockheimer: I am planning on adding, alongside the conversion functions, more UTF-8 functions that do upper case to lower case conversion and so forth. That's planned.
A Speaker: The file system is UTF-8; correct?
Hiroshi Lockheimer: For the file system, nothing had to change in order to support UTF-8, because it's backwards compatible with ASCII. So I could show you here --
A Speaker: Actually, my question is, what about the compiler?
Hiroshi Lockheimer: Yes, the Metrowerks compiler works, because UTF-8 is compatible with ASCII. And the editor does handle UTF-8, so you can edit UTF-8. I can show you here: a Japanese file name, for example, right here, and Japanese text. This is a BTextView here. I can actually show you a Japanese version of StyledEdit. All of our interface elements, if you pass in a Japanese string, will display it given the correct fonts. The same thing with the buttons.
We are, I think, at the level of -- I might as well show you NetPositive. I actually use the BeOS full-time to do Web browsing. BeMail, our standard mail client, knows how to decode Japanese messages as well, so I use that for mail. I do have a prototype input method for in-house use; I use that for writing mail. It's at the level of usability, but we still don't ship a standard input method with the product yet.
A Speaker: So your Japanese version of StyledEdit there, it used a Japanese font to render the menus and that sort of thing?
Hiroshi Lockheimer: Right.
A Speaker: I assume it couldn't use the system default font, because the system default font doesn't have the Japanese glyphs. So is there a facility to ask the operating system what the user-preferred font is for rendering a given language?
Hiroshi Lockheimer: Sure. That's something that will be added along with input methods. Applications are going to be written for specific languages, so there will be a way for an app to say, "Hey, I know Japanese, English, and French; get me the right fonts." We'll have different plain, bold, and fixed fonts per language we support, so this settings panel will probably change to reflect that; you'll have these settings per language. Given an application's preferred language, we'll launch it passing in those fonts. That way you can mix; you won't have to set the global font for the whole system to one language. You can have it on a per-application or per-language basis.
A Speaker: Terminal doesn't support Japanese.
Hiroshi Lockheimer: Yes. Terminal is one of the few applications in our world that doesn't support UTF-8. I've been bugging Rico about that; I suggest you do, too. I think Terminal is one place where he assumed he could index directly into a buffer he has, given the size of the window, and he just hasn't done the work yet to upgrade that.
A Speaker: How does the choice of UTF-8 affect Java, where everything is Unicode?
Hiroshi Lockheimer: I didn't actually do the Java implementation, so I don't know, but because Java uses Unicode internally and uses UTF-8 when it saves files to disk, it just happens to work out. If I'm wrong, please correct me, but that's my understanding.
Anything else? Okay. A short subject.