ASPToday - Your Just-In-Time Resource for ASP Code and Techniques


The ASPToday Article
July 2, 2002

Handling Japanese Data on the Web

by M Sitaraman

CATEGORIES: .NET Framework, Site Design
ARTICLE TYPE: In-Depth

ABSTRACT

There are many issues involved in handling Japanese data in a Web environment, including encoding, multiple script handling, size checking and displaying dates. This article suggests ways of tackling these day-to-day problems and illustrates the solutions in two .NET applications.

Requirements:

Japanese OS/Language Support

Knowledge of ASP.NET and VB.NET



ARTICLE

Introduction

Handling Japanese data on the Web can be a very challenging task, given the complexity of the language. The issues that arise include:

Encoding to be used in the Web Server for the Request and Response objects
Handling various Scripts like Hiragana, Katakana, Kanji and Romaji (English Alpha-Numeric Characters), in addition to Kana-Majiri
Handling Hankaku (Half-Width/Single Byte) and Zenkaku (Full-Width/Double Byte) data
Data Size issues
Japanese Date formats (Japanese Eras)

These problems can be tackled by agreeing on a common encoding standard that is applied across the various tiers of a system, and by providing a class library with functions that handle the problems listed above.

System Requirements

The system and knowledge requirements for implementing such a Japanese-based solution are:

Japanese IME on the Developer/Debugger machines (please note that to use the sample application/class library and type Japanese characters you will need Japanese language (IME) support on your machine; the steps for installing Japanese language support are described in the Application Setup section at the end of this article)
Japanese OS/Language Support on the Web Server machine
Japanese OS/Language Support on the browser machine
Knowledge of ASP.NET and VB.NET

Defining the Problem

Let's now discuss the various problems outlined above.

Encoding

Encoding is a key factor when developing a non-English Web application. The Web server uses the configured encoding both to interpret the data received in the Request object and to render the screen in the Response object. Several encodings are available for a Japanese Web application, namely Shift_JIS (SJIS), EUC-JP and utf-8; with any of them you can input Japanese data in the different scripts described below.

The impact of the encoding is easy to see. Enter some data on the screen with, for example, Shift_JIS in effect, change the encoding to utf-8 in the web.config file (without stopping the Web application) and refresh the screen. All the characters are now garbled, in spite of the fact that the new encoding also supports them. This is because the bytes used to store a character depend entirely on the encoding, and the bytes of one encoding are generally not valid in another. Conversion from one encoding to another has to be done explicitly, in code; otherwise the characters will be garbled.
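The effect described above can be reproduced outside the browser. The following is a minimal sketch, assuming the .NET Framework on a machine where the Shift_JIS code page is available; the module and function names (EncodingMismatchDemo, DecodeWith, ConvertBytes) are illustrative only and are not part of the article's download.

```vbnet
Imports System
Imports System.Text

Module EncodingMismatchDemo

    ' Decodes a byte array using the named encoding, whether or not the
    ' bytes were actually produced with that encoding.
    Public Function DecodeWith(ByVal p_bData As Byte(), ByVal p_sEncodingName As String) As String
        Return Encoding.GetEncoding(p_sEncodingName).GetString(p_bData)
    End Function

    ' Explicit, programmatic conversion of bytes from one encoding to another.
    Public Function ConvertBytes(ByVal p_bData As Byte(), ByVal p_sFrom As String, ByVal p_sTo As String) As Byte()
        Return Encoding.Convert(Encoding.GetEncoding(p_sFrom), Encoding.GetEncoding(p_sTo), p_bData)
    End Function

    Sub Main()
        Dim sText As String = "東京"
        Dim bSjis As Byte() = Encoding.GetEncoding("shift_jis").GetBytes(sText)

        ' Shift_JIS bytes read as utf-8 do not give back the original text.
        Console.WriteLine(DecodeWith(bSjis, "utf-8") = sText)   ' False

        ' Converting the bytes explicitly first preserves the text.
        Dim bUtf8 As Byte() = ConvertBytes(bSjis, "shift_jis", "utf-8")
        Console.WriteLine(DecodeWith(bUtf8, "utf-8") = sText)   ' True
    End Sub

End Module
```

The same string survives only when the bytes are converted explicitly, which is the situation the web.config experiment exposes.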

Different Scripts

The Japanese language, unlike English, uses multiple scripts. There are basically four: Hiragana, Katakana (Kana is the name given to the characters of these two scripts), Kanji and Romaji. The same word can therefore be written in any of the scripts. For example, the word "Tokyo" is written as とうきょう in Hiragana, トウキョウ in Katakana and 東京 in Kanji. To complicate things further, the characters can be either Hankaku (single-byte/half-width), in the case of Katakana, or Zenkaku (double-byte/full-width), in the case of Hiragana, Katakana or Kanji. Romaji is normally not used in business application development and will not be discussed in detail in this article. Let us look at these scripts in brief:

Hiragana: Hiragana characters are syllables derived from a form of Kanji. This script was developed in the ninth century and is used in writing letters and Waka (traditional Japanese poetry). Although anything in Japanese can be written in Hiragana alone, in practice a mix of scripts called Kana-Majiri (discussed below) is used. Hiragana is typically used for particles, verb endings, adjectives, pronouns and interjections.
Katakana: Katakana characters are syllables formed from parts of Kanji, often the hen or tsukuri radicals. Although they are a different set of characters, they represent the same syllables as Hiragana and have the same sounds. Formerly Kanji and Katakana were used together in official publications and in scientific and literary essays, but nowadays Katakana alone is used mainly in children's books, telegrams and the like. Katakana characters can be either Hankaku or Zenkaku (discussed below). In IT, Katakana is typically used for terms of non-Japanese origin, which includes, but is not limited to, the whole world of computer terminology (database, commit and so on).
Kanji: Kanji, the main script of Japan, was borrowed from Chinese; it is highly pictorial and extremely complex. There are almost 44,000 Kanji characters, though Japanese uses only about 3,000-4,000 of them. Kanji characters are typically ideographs and pictographs. The main difference between Kanji and Kana is that Kanji represent ideas and words (each Kanji character can be a word by itself), whereas Kana characters represent syllables (sounds).
Romaji: The word Romaji is derived from the word "Roman" and denotes Roman (English) alphanumeric and other characters (for example, ABC123?><).
Kana-Majiri: Kana-Majiri is the practice of mixing Hiragana, Katakana, Kanji and other special characters in the same text (note that these special characters are Zenkaku characters and NOT Romaji (English) alphanumerics).

To add to this, the Japanese characters can be either Hankaku or Zenkaku:

Hankaku: Hankaku characters are single-byte characters (also called half-width characters). All English alphanumeric characters (e.g. ABC123) are treated as Hankaku characters, and Katakana characters can also be Hankaku characters (e.g. ｱｲｳ).
Zenkaku: Zenkaku characters are double-byte characters (also called full-width characters). Kanji (e.g. 東京), Hiragana (e.g. とうきょう) and Katakana (e.g. トウキョウ) characters can be Zenkaku characters.

Now you might ask, "How does all this affect me?" It does, because most applications need functions for validating whether a String entered by the Web user on the screen is:

Hankaku
Zenkaku
Hiragana
Katakana
Kanji
Kana-Majiri
Romaji

There is also, typically, a requirement for functions that convert:

Hiragana to Katakana and vice versa
Hankaku To Zenkaku and vice versa

(Conversion from and to Kanji is not possible).

Size Checking

Though this appears to be trivial functionality, it plays an important role in Web development. With ordinary English data, we can check the size of the data by getting the length of the string itself. But as Japanese data can contain Hankaku characters, Zenkaku characters or a combination of the two, we need to check the byte count of the data instead of its length. The byte count also depends on the encoding used to store the data: the Kanji word for Tokyo ("東京") has a byte count of 6 in utf-8, but a byte count of 4 in Shift_JIS or EUC-JP. So we need a centralized function that returns the correct size of a string.
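The difference between string length and byte count can be checked with a few lines of code. This is only a sketch, assuming the .NET Framework with the Japanese code pages installed; GetSizeSketch is a hypothetical name and is not the GetSize() method of the class library.

```vbnet
Imports System
Imports System.Text

Module SizeCheckSketch

    ' Returns the number of bytes p_sData occupies in the named encoding.
    Public Function GetSizeSketch(ByVal p_sData As String, ByVal p_sEncodingName As String) As Integer
        Return Encoding.GetEncoding(p_sEncodingName).GetByteCount(p_sData)
    End Function

    Sub Main()
        Dim sTokyo As String = "東京"
        Console.WriteLine(sTokyo.Length)                        ' 2 (characters)
        Console.WriteLine(GetSizeSketch(sTokyo, "utf-8"))       ' 6
        Console.WriteLine(GetSizeSketch(sTokyo, "shift_jis"))   ' 4
        Console.WriteLine(GetSizeSketch(sTokyo, "euc-jp"))      ' 4
    End Sub

End Module
```

Passing the encoding name in as a parameter keeps the size check in one place, so it can follow whatever encoding the database or external system uses.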

Japanese Date

One more interesting aspect of Japanese Web sites is that dates are often represented differently. For example, the date 2001/01/01 (note that dates are always written in YYYY/MM/DD format in Japanese applications) will be shown as 平成13年01月01日. This is because the Japanese date system is based on the reign of the emperor of Japan. Whenever a new emperor ascends the throne, a new era starts and the year count restarts from that day onwards. For example, the current era is the Heisei era of Emperor Akihito.

Defining the Solution

Now that we have been through the problems, let's look into the solutions:

Encoding

First of all, we need to decide on which encoding to use when developing such an application. The various factors that will play a definite role in this decision process are:

Multi-Lingual Support: If the system handles other languages in addition to Japanese, then utf-8 is the preferable choice. utf-8 supports all languages and is the de facto standard adopted even by Microsoft, as is clear from the fact that when you create an application in Visual Studio.NET all the files have their encoding set to utf-8, including the response and request encoding attributes in the web.config file.
Data Passing across Systems: If the system interacts with external systems/servers such as a queue server (IBM MQ Series/MSMQ) or UNIX hosts, then it is recommended that the encoding of the Microsoft .NET Web application be set to match the encoding of the other systems. It is easier to configure and handle the encoding at the Microsoft .NET Web application level than to fiddle around with the encoding of other systems.
Data Size: The downside of utf-8 is that it uses one byte for an English (ASCII) character but three bytes for a typical Japanese character, against two bytes in Shift_JIS or EUC-JP. This can waste space in a database if the system is to handle only Japanese data.

So the rule of thumb is: if the system is to have multi-lingual support, use utf-8. If only Japanese data is to be handled, use Shift_JIS or EUC-JP; which of the two depends on external system interaction. If the system does not interact with any external system whose encoding is EUC-JP, then Shift_JIS is recommended, as Japanese versions of Microsoft operating systems default to this encoding, and uniformity is therefore ensured. The setting is simple and needs to be made in the following places:

Web tier: Settings for this tier can be changed in the web.config file. In the Globalization section of the web.config file make the following entry:

<!--  GLOBALIZATION
This section sets the globalization settings of the application. -->
<!-- Use this encoding if a Shift_JIS encoding is needed (Japanese data) -->
<globalization requestEncoding="Shift_JIS" responseEncoding="Shift_JIS" />

If you want to use a different encoding, change the values of the requestEncoding and responseEncoding attributes accordingly. This ensures that, at the Web tier, all characters are received and rendered as per the encoding specified.

Business Tier: At the business facade and business rules level you don't need any encoding setting, as long as the system has Japanese language support. (Internally, Japanese versions of Microsoft operating systems use the MS932 encoding, which corresponds to Microsoft code page 932 (CP932); Shift_JIS is mapped internally to this same encoding, so for practical purposes the two are one and the same.)

Database Tier: At the Database level you need to specify the Database Collation Name for Japanese support. Here you can specify whether you want to have Case Sensitivity, Accent Sensitivity, Kana Sensitivity and Width Sensitivity. For more details on the exact values, you can check out the SQL Server Books Online, but a safe bet would be Japanese_CI_AS (which is Japanese, Case Insensitive and Accent sensitive). Other values for this would depend on the business/database requirements.

Multiple Script Handling and Size Checking

Handling the various scripts involves validating and converting the data received. The key to solving this is the fact that a character is, after all, just a character, and has a value associated with it. As in any other language, every character of the Japanese scripts has a Unicode value, and in most cases the values for a script are contiguous. So in the validation routines we check the Unicode value of each character in a string and, from that, determine which script the character belongs to. The Unicode ranges (in decimal) used for the Japanese scripts are:

Hiragana characters: 12354 to 12435 (82 characters)
Katakana Zenkaku characters: 12450 to 12531 (82 characters)
Katakana Hankaku characters: 65382 to 65439, a separate block that does not line up one-for-one with the Zenkaku range
Kanji characters: 19968 to 64045 (44,078 code points, not all of which are Kanji)

Based on these Unicode values we can find out whether a String is Hiragana, Katakana or Kanji. The catch is that Hankaku Katakana characters cannot be checked against the Zenkaku range, and a voiced sound that is one Zenkaku character is written as two Hankaku characters. This problem is solved by first converting the received string to Zenkaku Katakana and then checking the Unicode values. For example, in the class library, it is handled as follows:

To check whether a String is hiragana or not:
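A range check of this kind might look like the sketch below. It is not the code from the download; IsHiraganaSketch is a hypothetical name, and the bounds are simply the decimal Hiragana range quoted in this article.

```vbnet
Imports System
Imports Microsoft.VisualBasic

Module HiraganaCheckSketch

    Private Const HIRAGANA_FIRST As Integer = 12354
    Private Const HIRAGANA_LAST As Integer = 12435

    ' True only if every character of p_sData lies in the Hiragana range.
    Public Function IsHiraganaSketch(ByVal p_sData As String) As Boolean
        If p_sData Is Nothing OrElse p_sData.Length = 0 Then Return False
        Dim i As Integer
        For i = 0 To p_sData.Length - 1
            Dim iCode As Integer = AscW(p_sData.Chars(i))
            If iCode < HIRAGANA_FIRST OrElse iCode > HIRAGANA_LAST Then Return False
        Next
        Return True
    End Function

End Module
```

With these bounds, IsHiraganaSketch("とうきょう") is True, while the Katakana and Kanji spellings of the same word are rejected.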

To Check whether a String is katakana or not:
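The Katakana check can follow the same pattern once the string has been widened. The sketch below assumes Windows and the Visual Basic runtime function StrConv with the Japanese locale ID (1041) for the Hankaku-to-Zenkaku step; HanToZenSketch and IsKatakanaSketch are hypothetical names, and the sketch leaves out the Strict handling of the real class.

```vbnet
Imports System
Imports Microsoft.VisualBasic

Module KatakanaCheckSketch

    Private Const JAPANESE_LCID As Integer = 1041   ' Locale ID for ja-JP
    Private Const KATAKANA_FIRST As Integer = 12450
    Private Const KATAKANA_LAST As Integer = 12531

    ' Widens Hankaku characters to Zenkaku through the VB runtime.
    Public Function HanToZenSketch(ByVal p_sData As String) As String
        Return StrConv(p_sData, VbStrConv.Wide, JAPANESE_LCID)
    End Function

    ' True only if, after widening, every character lies in the
    ' Zenkaku Katakana range.
    Public Function IsKatakanaSketch(ByVal p_sData As String) As Boolean
        If p_sData Is Nothing OrElse p_sData.Length = 0 Then Return False
        Dim sWide As String = HanToZenSketch(p_sData)
        Dim i As Integer
        For i = 0 To sWide.Length - 1
            Dim iCode As Integer = AscW(sWide.Chars(i))
            If iCode < KATAKANA_FIRST OrElse iCode > KATAKANA_LAST Then Return False
        Next
        Return True
    End Function

End Module
```

Widening first means that one range test covers both Hankaku and Zenkaku input.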

The checks in the class library also consult the Strict property, which is described later in this article. Similar checks can be done for the characters of the Kanji script as well.

We can check whether the String is Hankaku or Zenkaku by checking the byte Size of each character:
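One way to write such a check is sketched below, assuming Shift_JIS is available on the machine. The function names are hypothetical and are not the isHankaku() and isZenkaku() methods of the download.

```vbnet
Imports System
Imports System.Text

Module WidthCheckSketch

    Private ReadOnly m_objSjis As Encoding = Encoding.GetEncoding("shift_jis")

    ' Byte size of the single character at position p_iIndex, in Shift_JIS.
    Public Function CharByteSize(ByVal p_sData As String, ByVal p_iIndex As Integer) As Integer
        Return m_objSjis.GetByteCount(p_sData.Substring(p_iIndex, 1))
    End Function

    ' True if every character occupies exactly one byte.
    Public Function IsHankakuSketch(ByVal p_sData As String) As Boolean
        Dim i As Integer
        For i = 0 To p_sData.Length - 1
            If CharByteSize(p_sData, i) <> 1 Then Return False
        Next
        Return p_sData.Length > 0
    End Function

    ' True if every character occupies exactly two bytes.
    Public Function IsZenkakuSketch(ByVal p_sData As String) As Boolean
        Dim i As Integer
        For i = 0 To p_sData.Length - 1
            If CharByteSize(p_sData, i) <> 2 Then Return False
        Next
        Return p_sData.Length > 0
    End Function

End Module
```

In this sketch a Kanji character such as 東 gives a byte size of 2, and an English letter or a Hankaku Katakana character gives 1.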

(The Encoding Class is available in the System.Text Namespace)

If we had used a utf-8 encoding in the code block instead, the return value for a Zenkaku character would have been 3. Note, however, that even if utf-8 is the encoding in the Web tier, you can still use Shift_JIS in the validation function: strings are held as Unicode inside .NET, so the byte count is calculated for whichever encoding the function asks for, and it will return 2 (remember, it is easier to handle the encoding in MS systems).

The LangHelper class available for download with this article provides all this functionality through the methods isHankaku(), isZenkaku(), isHiragana(), isKatakana(), isKanji(), isKanaMajiri(), HanToZen(), ZenToHan(), HiraganaToKatakana(), KatakanaToHiragana() and GetSize().
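The Hiragana and Katakana ranges given earlier are the same length and lie exactly 96 code points apart (12450 - 12354), so a Kana conversion can be sketched as a simple offset. The code below illustrates that idea only; the function names are hypothetical and the implementation in the download may differ.

```vbnet
Imports System
Imports System.Text
Imports Microsoft.VisualBasic

Module KanaConversionSketch

    Private Const HIRAGANA_FIRST As Integer = 12354
    Private Const HIRAGANA_LAST As Integer = 12435
    Private Const KANA_OFFSET As Integer = 96   ' 12450 - 12354

    ' Shifts every character whose code lies between p_iFirst and p_iLast
    ' by p_iOffset; all other characters are copied unchanged.
    Private Function ShiftRange(ByVal p_sData As String, ByVal p_iFirst As Integer, _
                                ByVal p_iLast As Integer, ByVal p_iOffset As Integer) As String
        Dim sb As New StringBuilder(p_sData.Length)
        Dim i As Integer
        For i = 0 To p_sData.Length - 1
            Dim iCode As Integer = AscW(p_sData.Chars(i))
            If iCode >= p_iFirst AndAlso iCode <= p_iLast Then
                sb.Append(ChrW(iCode + p_iOffset))
            Else
                sb.Append(p_sData.Chars(i))
            End If
        Next
        Return sb.ToString()
    End Function

    Public Function HiraganaToKatakanaSketch(ByVal p_sData As String) As String
        Return ShiftRange(p_sData, HIRAGANA_FIRST, HIRAGANA_LAST, KANA_OFFSET)
    End Function

    Public Function KatakanaToHiraganaSketch(ByVal p_sData As String) As String
        Return ShiftRange(p_sData, HIRAGANA_FIRST + KANA_OFFSET, HIRAGANA_LAST + KANA_OFFSET, -KANA_OFFSET)
    End Function

End Module
```

For example, HiraganaToKatakanaSketch("とうきょう") returns "トウキョウ", and any Kanji in the input passes through untouched.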

Japanese Dates

The trick in converting English (Gregorian) dates to Japanese dates is to understand the Japanese eras. As mentioned above, a new era begins when a new emperor ascends the throne. So, just like the character scripts, each era has a range:

Taisho ("大正") Era: 1912/07/30 - 1926/12/24
Showa ("昭和") Era: 1926/12/25 - 1989/01/07
Heisei ("平成") Era: 1989/01/08 - Till Date

So, given an English date, we determine the era that it belongs to and, based on the era, convert it to a Japanese date. For example, the date 2002/12/25 (all dates in YYYY/MM/DD format) would be 平成14年12月25日 (note that the month (12) and day (25) remain unchanged). Here 平成 is the Kanji for Heisei, 年 is the Kanji for 'Year', 月 is the Kanji for 'Month' and 日 is the Kanji for 'Day'. How did we achieve this conversion?

Public Function getJapaneseDate(ByVal p_sDate As String, ByRef p_sErrorData As String) As String
    ' The Kanji symbol variables (m_sTaishoKanjiSymbol and so on), Strict,
    ' isHankaku and ZenToHan are assumed to be members of the enclosing class.
    Dim m_sTAISHOSTARTDAY As String = "30" ' The Start Day of the Taisho Era
    Dim m_sTAISHOSTARTMONTH As String = "07" ' The Start Month of the Taisho Era
    Dim m_sTAISHOSTARTYEAR As String = "1912" ' The Start Year of the Taisho Era
    Dim m_sTAISHOENDDAY As String = "24" ' The End Day of the Taisho Era
    Dim m_sTAISHOENDMONTH As String = "12" ' The End Month of the Taisho Era
    Dim m_sTAISHOENDYEAR As String = "1926" ' The End Year of the Taisho Era

    Dim m_sSHOWASTARTDAY As String = "25" ' The Start Day of the Showa Era
    Dim m_sSHOWASTARTMONTH As String = "12" ' The Start Month of the Showa Era
    Dim m_sSHOWASTARTYEAR As String = "1926" ' The Start Year of the Showa Era

    Dim m_sSHOWAENDDAY As String = "07" ' The End Day of the Showa Era
    Dim m_sSHOWAENDMONTH As String = "01" ' The End Month of the Showa Era
    Dim m_sSHOWAENDYEAR As String = "1989" ' The End Year of the Showa Era

    Dim m_sHEISEISTARTDAY As String = "08" ' The Start Day of the Heisei Era
    Dim m_sHEISEISTARTMONTH As String = "01" ' The Start Month of the Heisei Era
    Dim m_sHEISEISTARTYEAR As String = "1989" ' The Start Year of the Heisei Era

    Dim m_sHEISEIENDDAY As String = "31" ' The End Day of the Heisei Era
    Dim m_sHEISEIENDMONTH As String = "12" ' The End Month of the Heisei Era
    Dim m_sHEISEIENDYEAR As String = "2099" ' The End Year of the Heisei Era

    Dim sJapaneseDate As String

    If IsDate(p_sDate) = False Then
        p_sErrorData = "Invalid Date. Enter Date in 'YYYY/MM/DD' format"
        Return ""
    End If

    If Len(p_sDate) <> 10 Then
        p_sErrorData = "Enter Date in 'YYYY/MM/DD' format"
        Return ""
    End If

    ' Both separators must be in place, so reject the date if either is not
    If p_sDate.IndexOf("/") <> 4 OrElse p_sDate.LastIndexOf("/") <> 7 Then
        p_sErrorData = "Enter Date in 'YYYY/MM/DD' format"
        Return ""
    End If

    ' Convert any Zenkaku digits to Hankaku, leaving other characters alone
    If Not isHankaku(p_sDate) Then
        Dim bTmpStrict As Boolean
        bTmpStrict = Strict
        Strict = False
        p_sDate = ZenToHan(p_sDate, "")
        Strict = bTmpStrict
    End If

    p_sDate = p_sDate.Replace("/", "") ' Remove the "/", leaving YYYYMMDD

    ' YYYYMMDD strings of equal length compare in date order
    Dim iJapaneseYear As Integer
    If (p_sDate >= (m_sTAISHOSTARTYEAR & m_sTAISHOSTARTMONTH & m_sTAISHOSTARTDAY)) AndAlso _
       (p_sDate <= (m_sTAISHOENDYEAR & m_sTAISHOENDMONTH & m_sTAISHOENDDAY)) Then
        iJapaneseYear = CInt(Mid(p_sDate, 1, 4)) - CInt(m_sTAISHOSTARTYEAR) + 1
        sJapaneseDate = m_sTaishoKanjiSymbol & iJapaneseYear & m_sJapaneseYearSymbol
    ElseIf (p_sDate >= (m_sSHOWASTARTYEAR & m_sSHOWASTARTMONTH & m_sSHOWASTARTDAY)) AndAlso _
           (p_sDate <= (m_sSHOWAENDYEAR & m_sSHOWAENDMONTH & m_sSHOWAENDDAY)) Then
        iJapaneseYear = CInt(Mid(p_sDate, 1, 4)) - CInt(m_sSHOWASTARTYEAR) + 1
        sJapaneseDate = m_sShowaKanjiSymbol & iJapaneseYear & m_sJapaneseYearSymbol
    ElseIf (p_sDate >= (m_sHEISEISTARTYEAR & m_sHEISEISTARTMONTH & m_sHEISEISTARTDAY)) AndAlso _
           (p_sDate <= (m_sHEISEIENDYEAR & m_sHEISEIENDMONTH & m_sHEISEIENDDAY)) Then
        iJapaneseYear = CInt(Mid(p_sDate, 1, 4)) - CInt(m_sHEISEISTARTYEAR) + 1
        sJapaneseDate = m_sHeiseiKanjiSymbol & iJapaneseYear & m_sJapaneseYearSymbol
    Else
        p_sErrorData = "Not a Valid Date or predates the Taisho Era"
        Return ""
    End If

    ' Month and day are carried over unchanged
    sJapaneseDate = sJapaneseDate & Mid(p_sDate, 5, 2) & m_sJapaneseMonthSymbol & _
                    Mid(p_sDate, 7, 2) & m_sJapaneseDaySymbol
    Return sJapaneseDate
End Function

This functionality is also available through the GetJapaneseDate() function of the LangHelper Class, which is provided for download with this article.

Component Design

All this functionality is available through the methods of a single class, LangHelper. As this is only a helper/utility class, it is a single-tier component. From the design point of view, it should preferably be placed inside the Utility namespace of your Web application, if you have one.

Component Coding

The VB.NET based class, LangHelper, exposes its functionality through the following methods:

Validation Functions

IsHankaku(p_sDate As String) As Boolean
IsZenkaku(p_sDate As String) As Boolean
IsHiragana(p_sDate As String) As Boolean
IsKatakana(p_sDate As String) As Boolean
IsKanji(p_sDate As String) As Boolean
isKanaMajiri(p_sDate As String) As Boolean

Conversion Functions

HanToZen(p_sDate As String, ByRef p_sErrorData As String) as string
ZenToHan(p_sDate As String, ByRef p_sErrorData As String) as string
HiraganaToKatakana(p_sDate As String, ByRef p_sErrorData As String) as string
KatakanaToHiragana(p_sDate As String, ByRef p_sErrorData As String) as string

Misc. Functions

GetSize(p_sData as string) as Integer
GetStringCharacterestics(p_sData as String) as string
GetJapaneseDate(p_sDate As String, ByRef p_sErrorData As String) as string

There is also a Strict property in this class, whose functionality is interesting. This property is typically used in the conversion functions. When I pass a pure Zenkaku String (all characters Zenkaku and convertible) to the ZenToHan function, it returns the Hankaku equivalent. If I pass a non-Zenkaku String, it returns an error in the second parameter, which is passed ByRef. But what if I pass a mixed String, in which one part is Zenkaku Katakana, and therefore convertible, while the other part is Kanji, and therefore not convertible to Hankaku? The result is determined by the value of the Strict property. If it is set to True, the function returns "" and the error description is placed in the ByRef parameter, p_sErrorData. If it is False, all convertible characters are converted and non-convertible characters are left as they are, so the Katakana part comes back as Hankaku Katakana and the Kanji part comes back intact.

The same functionality applies to all conversion functions.

The Client Code can call these functions as follows:
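The calling pattern can be sketched as follows. Because the real LangHelper class is only available in the download, the sketch uses a hypothetical stand-in, LangHelperStandIn, with just the Strict property and a simplified ZenToHan; it assumes Windows and the VB runtime function StrConv with the Japanese locale ID (1041), and the default value of Strict shown here is an assumption.

```vbnet
Imports System
Imports System.Text
Imports Microsoft.VisualBasic

' Hypothetical stand-in for LangHelper, for illustration only.
Public Class LangHelperStandIn
    Private Const JAPANESE_LCID As Integer = 1041
    Private m_bStrict As Boolean = True   ' assumed default

    Public Property Strict() As Boolean
        Get
            Return m_bStrict
        End Get
        Set(ByVal Value As Boolean)
            m_bStrict = Value
        End Set
    End Property

    ' Narrows each character in turn; a character that narrowing leaves
    ' unchanged is treated as non-convertible.
    Public Function ZenToHan(ByVal p_sData As String, ByRef p_sErrorData As String) As String
        Dim sb As New StringBuilder()
        Dim i As Integer
        For i = 0 To p_sData.Length - 1
            Dim sChar As String = p_sData.Substring(i, 1)
            Dim sNarrow As String = StrConv(sChar, VbStrConv.Narrow, JAPANESE_LCID)
            If sNarrow = sChar AndAlso m_bStrict Then
                p_sErrorData = "Character " & (i + 1).ToString() & " cannot be converted to Hankaku"
                Return ""
            End If
            sb.Append(sNarrow)
        Next
        Return sb.ToString()
    End Function
End Class

Module ClientSketch
    Sub Main()
        Dim objLangHelper As New LangHelperStandIn()
        Dim sError As String = ""
        Dim sMixed As String = "トウキョウ東京"   ' Zenkaku Katakana followed by Kanji

        objLangHelper.Strict = True
        Dim sResult As String = objLangHelper.ZenToHan(sMixed, sError)
        Console.WriteLine(sResult = "")   ' True: Strict mode rejects the Kanji part
        Console.WriteLine(sError)

        objLangHelper.Strict = False
        sError = ""
        sResult = objLangHelper.ZenToHan(sMixed, sError)
        Console.WriteLine(sResult.EndsWith("東京"))   ' True: Kanji is left intact
    End Sub
End Module
```

Client code written against the real class would follow the same shape: create an instance, set Strict, call the conversion function and inspect the ByRef error string.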

The Sample Application

The download, in addition to the Class Library, contains two Client applications, which demonstrate the functionalities exposed by the LangHelper class. The first one is a Windows Forms based client, LangHelperWinClient, and the second is a web based aspx form, LangHelperAspxClient. These two differ only from the presentation point of view and do not handle any business logic. The LangHelper Class itself provides all the above-mentioned conversion/validation/misc. functionalities and these two presentation tier components just make function calls to the instance of LangHelper Class.

LangHelperWinClient

LangHelperAspxClient

Application Setup

The download contains a Visual Studio.NET solution (LangHelperLibrarySolution) with three projects: LangHelperLibrary, LangHelperAspxClient and LangHelperWinClient.

LangHelperLibrary Project: This is the core of the application and contains the class LangHelper.vb. This class has the functions described in this article.
LangHelperAspxClient Project: This is the sample client application in Web Forms (aspx) mode. The LangHelperAspxClientScreen.aspx file is the startup page in this project. The files can be used as they are; the only file that needs modification (and only if you want to test with a different encoding) is the web.config file.
LangHelperWinClient Project: This is a sample client application in Windows Forms mode. It mimics the presentation tier of the LangHelperAspxClient project, but as a desktop application. The startup form for this project is LangHelperWinClientScreen and needs no modification.

Installation Instructions:

Do remember that the Web Server and the Client browser machines need Japanese Language support for this application to work.

Download the Support Material Zip File to a temporary location
Extract the files from the Zip file; this will result in two directories, LangHelperSolution and LangHelperAspxClient.
Move/Copy the LangHelperAspxClient directory to the WWW root directory (...\Inetpub\wwwroot\).
Open ...\LangHelperSolution\Solutions\LangHelperSolution\LangHelperLibrarySolution.sln in Visual Studio.NET
This will open up the three projects. The default startup project is LangHelperWinClient and you can start testing the Class Library in this mode. If you want the Web based mode, then change the startup project to LangHelperAspxClient project.

Japanese Language Support:

To run the sample application, type Japanese characters and develop Japanese-language Web systems, it is essential to have language support at the operating system (OS) level. This can be provided either by:

Installing the Japanese version of the operating system (Windows 2000 Japanese version)
Installing an English version of Windows 2000 and then adding Japanese language support. In Windows 2000, installing the language support is quite simple and can be done through

Start->Settings->Control Panel->Regional Options->Your Locale (Location )->Select the Japanese Checkbox->OK.

Please note that all this is needed only for you to go through the application and see its capabilities. The core part of the application is the LangHelper class; later, at development time, you just need to copy the LangHelper file to your own project and use the functions. The other two projects (LangHelperAspxClient and LangHelperWinClient) are provided to demonstrate the capabilities of the class library and deal only with the presentation tier (UI tier).

Any Limitations or Further Work

As such, there is no limitation in the class library provided. All the functions take a String as a parameter, which also lets you pass single characters by storing them in a String variable. If you want to customize the functionality, it should be easy to do, as each function is atomic and modular. For example, to write a customized function that accepts only Hiragana and Kanji characters, you can write a wrapper function that picks up each character of the string passed to it and calls the isHiragana and isKanji functions:

' Customised function to extend the validation functionality: checks
' whether every character of a String is either Hiragana or Kanji
Public Function isHiraganaOrKanji(ByVal p_sData As String) As Boolean
    Dim iCtr As Integer
    Dim objLangHelper As New LangHelper()
    For iCtr = 1 To Len(p_sData)
        ' Reject the string as soon as one character is neither script
        If Not objLangHelper.isHiragana(Mid(p_sData, iCtr, 1)) AndAlso _
           Not objLangHelper.isKanji(Mid(p_sData, iCtr, 1)) Then
            Return False
        End If
    Next
    Return True
End Function

Conclusion

That should solve the common problems that typically arise when implementing a Web-based solution for Japanese customers. These requirements are very common in a Japanese environment and can be met by the single-point validation and conversion routines provided by the class library. Also, once the encoding issue is resolved at the system architecture stage, there should be no surprises at development or deployment time.


Article Information

Author M Sitarman

Chief Technical Editor John R. Chapman

Project Manager Helen Cuthill

Reviewers Sean Schade, Saurahb Nandu
