World Wide Web | 17 Nov 2006 03:53 pm

The following are my notes and summaries from sessions at the 30th Internationalization and Unicode Conference (IUC30) in Washington DC from Nov. 15th to 17th, 2006.

What is language? Language is not static. You can rebuild the history of a language and get a history of humanity. Language is alive.

There are about 6,000 languages in the world today. Less than 25% are written. Languages are disappearing; there are funds for the preservation of endangered languages.

New languages are being born all the time. There was a book published in Spanglish a few years ago. Hebrew was considered dead 100 years ago; it is an example of how languages can be revived.

English has gone from 45% to 35% of web content. Some of the hardest languages to support are among the most popular.

Language in a product is not just a feature; it is an architectural dimension. Language affects the whole process: requirements, development, documentation, testing, and customer support.

You may translate into multiple languages, but what happens when you get a support call in another language?

Words have different sizes in different languages. Grammars differ between languages. Different languages use different characters, and some languages have more characters than others. The numbers are different. The punctuation is different. There is a Greek question mark that looks like a semicolon.
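
As a quick illustration of that last point (my example, in Python 3), the Greek question mark is a distinct character that Unicode canonically maps to the ordinary semicolon:

```python
import unicodedata

# U+037E looks identical to a semicolon; NFC normalization even
# canonically replaces it with U+003B SEMICOLON.
greek_question_mark = "\u037e"
print(unicodedata.name(greek_question_mark))                     # GREEK QUESTION MARK
print(unicodedata.normalize("NFC", greek_question_mark) == ";")  # True
```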

Scripts (Latin, Greek, Cyrillic) or writing systems control what text looks like and how it is handled. The Latin script is used by hundreds of languages. Some scripts have case and some don’t: Latin has case; Chinese does not.
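
A small Python 3 illustration of the case point (the examples are mine, not from the session):

```python
# Latin-script text case-maps, sometimes even changing length;
# Han characters have no case and pass through unchanged.
print("straße".upper())  # STRASSE  (German ß uppercases to SS)
print("中文".upper())     # 中文     (no case in Chinese)
```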

In CJK, there are both phonetic and ideographic scripts.

Ruby in CJK is used for small phonetic annotations. Chinese phone books have phonetic and non-phonetic sections.


World Wide Web | 17 Nov 2006 02:49 pm

The following are my notes and summaries from sessions at the 30th Internationalization and Unicode Conference (IUC30) in Washington DC from Nov. 15th to 17th, 2006.

Google faces all the same challenges that anyone has with globalization, with the extra challenge that they deal with lots of data, often on a short production cycle.

When Google brings in content, they convert it to Unicode right away.

Internally they use UTF-8 in C++. In Java and on Windows they use UTF-16.

It is really important to them to have stable identifiers for the different countries.

Locale = Language + (possibly) other info.
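
A minimal sketch of what that means in practice, splitting a locale identifier into its parts (my own toy parser, not any particular library’s):

```python
# Toy parser: language, then an optional 4-letter script and 2-letter region.
def parse_locale(identifier: str) -> dict:
    parts = identifier.replace("-", "_").split("_")
    result = {"language": parts[0].lower()}
    for part in parts[1:]:
        if len(part) == 4:
            result["script"] = part.capitalize()
        elif len(part) == 2:
            result["region"] = part.upper()
    return result

print(parse_locale("fr_CA"))       # {'language': 'fr', 'region': 'CA'}
print(parse_locale("zh-Hant-TW"))  # {'language': 'zh', 'script': 'Hant', 'region': 'TW'}
```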

Anyone can help contribute new localizations to Google. They use professional vendors for top-tier languages.

Sometimes they will delay a release to make sure that they can get all their translations in. Sometimes they disable features until they are fully translated.

About 4% of their total search corpus has bad encodings or character corruption.

Markup is the most common script in the corpus (by a lot). Next is Latin, then common text like numbers and spaces.

Often they take in bad HTML source: mixed encodings, doubly encoded markup, corrupted data. Servers are often bad too: a server misidentifies the encoding, and if there is no auto-detection, the result is random junk.
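
To make “doubly encoded” concrete, here is a Python sketch (mine) of how it happens and one way to undo it, assuming you know the exact bad round trip the data went through:

```python
original = "café"
utf8_bytes = original.encode("utf-8")  # b'caf\xc3\xa9'

# A buggy pipeline decodes the UTF-8 bytes as Latin-1, then re-encodes
# the result as UTF-8: the text is now doubly encoded.
double = utf8_bytes.decode("latin-1").encode("utf-8")
print(double.decode("utf-8"))  # cafÃ©  <- classic mojibake

# Repair by reversing the bad round trip.
repaired = double.decode("utf-8").encode("latin-1").decode("utf-8")
print(repaired)  # café
```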

Everyone has seen the “did you mean…?” in Google. This helps with misspellings. It does not use any kind of dictionary; they just use their large corpus of data and see how people correct their own search queries.
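
Here is a hedged sketch of the corpus-driven idea as I understood it (definitely not Google’s actual system): mine the query logs for queries that users quickly reissue in corrected form.

```python
from collections import Counter, defaultdict

# correction_counts["recieve mail"]["receive mail"] counts how often users
# who typed the first query reissued the second one.
correction_counts = defaultdict(Counter)

def observe(first_query: str, reissued_query: str) -> None:
    correction_counts[first_query][reissued_query] += 1

def did_you_mean(query: str, min_count: int = 2):
    candidates = correction_counts.get(query)
    if not candidates:
        return None
    suggestion, count = candidates.most_common(1)[0]
    return suggestion if count >= min_count else None

observe("recieve mail", "receive mail")
observe("recieve mail", "receive mail")
print(did_you_mean("recieve mail"))  # receive mail
```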

In Google Maps they do a lot of free-form parsing, like when you search for “pizza in menlo park.” It can pull out “pizza” and “Menlo Park,” but it has to know the language and grammar.
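
A toy version of that parse (my own illustration; the preposition lists are hypothetical and obviously per-language, which is the speaker’s point):

```python
# Split a free-form geographic query into a what/where pair by looking
# for a location preposition in the query's language.
LOCATION_PREPOSITIONS = {
    "en": [" in ", " near "],
    "es": [" en ", " cerca de "],
}

def split_query(query: str, lang: str):
    for prep in LOCATION_PREPOSITIONS.get(lang, []):
        if prep in query:
            what, _, where = query.partition(prep)
            return what.strip(), where.strip()
    return query.strip(), None

print(split_query("pizza in menlo park", "en"))  # ('pizza', 'menlo park')
```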


World Wide Web | 17 Nov 2006 12:19 pm

The following are my notes and summaries from sessions at the 30th Internationalization and Unicode Conference in Washington DC from Nov. 15th to 17th, 2006.

Notes:
We need to have an understanding of the different locales that we are going to have to communicate with, and we need to understand the locale data. This locale data includes numbers, currency, dates, times, collation, the names of the locales in different languages, and much more.
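
As a quick taste of how much varies per locale, here is a stdlib Python example (mine; it assumes the de_DE.UTF-8 locale is installed on the machine, and locale names vary by platform):

```python
import datetime
import locale

locale.setlocale(locale.LC_ALL, "de_DE.UTF-8")

# Grouping and decimal separators flip relative to English.
print(locale.format_string("%.2f", 1234567.891, grouping=True))  # 1.234.567,89
# Currency symbol, placement, and spacing are locale data too.
print(locale.currency(9.99))                                     # 9,99 €
# So are day and month names.
print(datetime.date(2006, 11, 17).strftime("%A, %d. %B %Y"))     # Freitag, 17. November 2006
```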

CLDR is a repository of this information in an XML format. It is sponsored by the Unicode Consortium, and its format is controlled by Unicode Technical Report 35 (the Locale Data Markup Language specification).

CLDR 1.4 was released July 17th, 2006. There are 360 locales, 17,000 new or modified items, and over 100 contributors (e.g., Google, Apple, and Sun).

There is the main data (current locale data), collation (sorting), supplemental data (not tied to a locale), and data in POSIX format.

Locale data is inherited from higher levels of the structure, e.g., fr_CA -> fr -> root.
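
A minimal sketch of that inheritance in Python (the data values here are made up; real CLDR data is far richer):

```python
DATA = {
    "root":  {"decimal_sep": "."},
    "fr":    {"decimal_sep": ",", "currency_symbol_position": "after"},
    "fr_CA": {"short_date_format": "yy-MM-dd"},
}

def fallback_chain(locale_id: str) -> list:
    chain = []
    while locale_id:
        chain.append(locale_id)
        locale_id = locale_id.rpartition("_")[0]  # fr_CA -> fr -> ""
    chain.append("root")
    return chain

def lookup(locale_id: str, key: str):
    # Walk fr_CA, then fr, then root, returning the first hit.
    for loc in fallback_chain(locale_id):
        if key in DATA.get(loc, {}):
            return DATA[loc][key]
    return None

print(lookup("fr_CA", "short_date_format"))  # defined on fr_CA itself
print(lookup("fr_CA", "decimal_sep"))        # inherited from fr
```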

It will display the alternatives for naming the locale in different languages. It will show what direction text should go in. It will show you the exemplar characters. And much, much more…

There need to be tools that make use of CLDR. It is already the foundation for ICU, a POSIX locale generator, and OpenOffice.

There was discussion on the further expansion and scope of CLDR.


World Wide Web | 16 Nov 2006 05:49 pm

The following are my notes and summaries from sessions at the 30th Internationalization and Unicode Conference in Washington DC from Nov. 15th to 17th, 2006.

Notes:
We communicate with a wide audience, potentially the whole world. We need to be sensitive to what our users want. Everything we perceive goes through the filters of culture and language.

People communicate differently. The words we use determine how we think about what we think about (the Sapir-Whorf hypothesis).

Culture and communication also change over time. We need to be aware of this when developing our user interfaces.

We need to build a good team. There need to be Subject Matter Experts (SMEs) from as many different regions as possible. Create a diverse base of usability testers.

Make sure that the technology infrastructure is robust enough to do the job. There is usually a tradeoff between power and flexibility.

Know what data you have on your site. What file formats are things in? Take a survey of all your data. This will help you know what to do in the future.

They did card sorting with various people around the world. They put the cards and instructions together into a packet and shipped it to people in different countries. The people sent the cards back along with how they had sorted them.

Figure out what the users’ needs are. Use a survey. Get to know your users better.

Search logs provide an unbiased view of where the information architecture (IA) is weak.

Develop wireframes for the different character sets. In some languages, words are longer than in others. Make sure that images work both ways, RTL and LTR.

Different taxonomies are used in different cultures and countries. They created controlled vocabularies.

Brand is communicated differently in each country. Create a culture guide at the corporate level. The brand should be globally accepted but also culturally relevant.

Symbols change between cultures, and so does the use of words.


World Wide Web | 16 Nov 2006 04:06 pm

The following are my notes and summaries from sessions at the 30th Internationalization and Unicode Conference in Washington DC from Nov. 15th to 17th, 2006.

Notes:
Tex decided to have some fun with this session and themed it all around the CBS TV show CSI.

This session will introduce strategies for fixing character corruption.

Often you will go to a web site that shows all these garbage, corrupted, or wrong characters. To investigate, we figure out what the environment was, which characters are corrupted, and collect screen caps.

We need to figure out what it was supposed to show. What were users expecting? What fonts were they using? What was the data source?

How could we have gotten this corrupted data? Can I reproduce it?

We need to understand the different technologies and how each handles characters. There are different protocols and negotiation tactics.

Understand the data flow and how it goes through the different pieces of middleware. You could have one system using one encoding and another system using a different one.
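
A concrete example of that mismatch (mine, in Python): correct UTF-8 bytes read under the wrong charset come out as recognizable mojibake, and identifying which wrong pairing was used is half the forensic work.

```python
data = "日本語".encode("utf-8")  # b'\xe6\x97\xa5\xe6\x9c\xac\xe8\xaa\x9e'

print(data.decode("cp1252"))  # æ—¥æœ¬èªž  (UTF-8 misread as Windows-1252)
print(data.decode("utf-8"))   # 日本語      (the intended text)
```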

A lot of character encodings are similar, and it’s easy to get the names interchanged. This can cause problems.

Sometimes it’s not an encoding problem. It can be just a problem with the font.

Correct conversion is not always obvious. Some people just assume that the conversion is going to work. It doesn’t always.

Unicode is an evolving standard. Over time the conversions should change, because we can get more accurate with the characters.


World Wide Web | 16 Nov 2006 03:05 pm

The following are my notes and summaries from sessions at the 30th Internationalization and Unicode Conference in Washington DC from Nov. 15th to 17th, 2006.

Notes:
Do you want a database at all? There is a lot of work being done with flat files and wikis. Search on them is very fast. In a wiki, people are able to comment on the soundness of the data.

If you need Chinese, Japanese, or Korean, there is an advantage to using UTF-16, and if you are doing a lot of linguistic processing it may be better still: the processor doesn’t have to pull multiple 8-bit units together into one character. If you are using more obscure CJK characters, you may want Unicode 4.0 or 5.0.

If you have properly labeled data, the choice between UTF-8 and UTF-16 matters less. If you have data being brought in from the Internet, then there can be some conversion problems. A lot of databases are optimized for UTF-8; Oracle is one of them. A lot of HTML text is going to be 8-bit, so UTF-8 is better there. Choose by database and/or language.
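
One hedged data point on that tradeoff, comparing byte counts in Python (my example):

```python
# ASCII-heavy markup doubles in size under UTF-16; CJK text is smaller
# in UTF-16 (2 bytes/char vs. 3 in UTF-8).
html = '<p class="note">Hello</p>'
cjk = "日本語のテキスト"

for label, text in [("html", html), ("cjk", cjk)]:
    print(label,
          "utf-8:", len(text.encode("utf-8")),
          "utf-16:", len(text.encode("utf-16-le")))
# html utf-8: 25 utf-16: 50
# cjk utf-8: 24 utf-16: 16
```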

When you have a multilingual database, some databases are going to search some languages better than others. It is also worth looking into how you index for search.

We need to standardize on language codes. We now have bigger meta-languages like Arabic: there is Iranian Arabic and Iraqi Arabic. This needs to be standardized.

Different database vendors use different Unicode character sets.

We need to think about all the tiers in a multi-tier database architecture and what their encodings may be.

When you convert between databases, know which encodings you are converting between, e.g., UTF-16 to UTF-8.

Look at the DB vendors and see what they offer for stemming, tokenization, and segmentation of data.
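
For segmentation, one common fallback when no dictionary-based segmenter is available is indexing overlapping character bigrams; a naive sketch of mine:

```python
# Index CJK text as overlapping character bigrams instead of words.
def cjk_bigrams(text: str) -> list:
    return [text[i:i + 2] for i in range(len(text) - 1)]

print(cjk_bigrams("日本語処理"))  # ['日本', '本語', '語処', '処理']
```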

Character conversion needs to be factored into performance. It will take a significant amount of processing time.


World Wide Web | 16 Nov 2006 02:20 pm

The following are my notes and summaries from sessions at the 30th Internationalization and Unicode Conference in Washington DC from Nov. 15th to 17th, 2006.

Richard Gillam of IBM just gave a really interesting talk about how computers parse personal names.

It seems like so many companies depend on their large databases of personal information to get in contact with their customers. How many millions of dollars are probably wasted because of duplicate entries or incorrect information?

Especially today, those databases of names are going to be in multiple languages, and those languages have different writing systems. Understanding the different names will be more of a challenge.

Apparently a lot of software has been developed to deal with the differences in names and how they are handled. You can learn a lot from a name: gender, social standing, culture, location, or who the parents are.

Fascinating stuff.


World Wide Web | 16 Nov 2006 12:13 pm

The following are my notes and summaries from sessions at the 30th Internationalization and Unicode Conference in Washington DC from Nov. 15th to 17th, 2006.

(Had some really nice notes but the hotel wifi decided to eat them.)

Richard gave a really great overview of what some of the needs are with different writing systems from around the world.

I don’t think we as Americans fully understand all the different writing systems in use around the world. If we really want to make the World Wide Web world wide, then we need to support those writing systems and styles.

Cascading Style Sheets 3.0 is moving in that direction in great ways. Examples are things like list styles, vertical text, text grids, text justification, and ruby annotated text.

The W3C is working on these CSS 3.0 features in modules. They’re looking for more people with expertise in these various areas.

We really need to kick browser developers in the butt and tell them that we need these features implemented. (Will write more about this later.)

Richard showed some great demos. I will post the URL later.

W3C has these really great Internationalization Quick Tips Cards.


Computers | 16 Nov 2006 11:51 am

The following are my notes and summaries from sessions at the 30th Internationalization and Unicode Conference in Washington DC from Nov. 15th to 17th, 2006.

Note: I will post my notes from yesterday later on today.

This morning Nicholas Negroponte, Chairman of One Laptop Per Child gave a great keynote at the Internationalization and Unicode Conference.

He described their amazing initiative to spread laptops to children around the world: the laptop uses under two watts of power, runs on a wifi mesh network, and has a hand crank to power the battery. He even passed around a prototype of the laptop. It is pretty amazing.

He announced that the actual laptops were just coming off the lines at the factories in China, and he even showed us a photo of one coming off the line. One is getting delivered to his office in Boston today.

They are apparently shooting for five countries to launch the laptops.

They are going to develop some cheap peripherals for the laptop. Even maybe a $100 digital projector. WHOA!!!

Just think how many more people in the world will be able to enjoy the content which we all develop and provide once these laptops get around the world.


World Wide Web | 16 Nov 2006 01:17 am

My co-worker Michelle and I were driving to the Dulles Hilton for the Internationalization and Unicode Conference. We wanted to call the hotel to make sure that we were going in the right direction.

I pulled out my mobile phone. I pulled up the hotel’s Web site in Opera Mini. I found the phone number.

Opera Mini recognized the phone number when I got to it on the page and asked me if I wanted to dial it. This is awesome!!! Yes, I wanted to dial the number. I was expecting to have to copy and paste using my stylus.

Cheers to the Opera Mini Team! I am very impressed by your product and will continue to tell all my friends about it.

