Wrox Press Ltd
The ASPToday Article
July 2, 2002

Handling Japanese Data on the Web
by M Sitaraman

CATEGORIES: .NET Framework, Site Design
ARTICLE TYPE: In-Depth

ABSTRACT

There are many issues involved in handling Japanese data in a Web environment, including encoding, multiple script handling, size checking and displaying dates. This article suggests ways of tackling these day-to-day problems and illustrates the solutions in two .NET applications.





Requirements:

Japanese OS/Language Support
Knowledge of ASP.NET and VB.NET

   
                   
    ARTICLE

Introduction

Handling Japanese data on the Web can be a very challenging task, given the complexity of the language. The issues that arise range from choosing an encoding and handling multiple scripts to checking the size of the data and displaying dates.

These problems can be tackled by adopting a common encoding standard across the various tiers of the system and by providing a class library with functions that handle the aforementioned problems.

System Requirements

To implement such a Japanese-based solution you need Japanese language support at the operating system level and a working knowledge of ASP.NET and VB.NET.

Defining the Problem

Let's now discuss the various problems outlined above.

Encoding

Encoding is an important factor when developing a non-English Web application. The web server uses the encoding to process the data received in the request object and to render the screen in the response object. Several encodings can be used for a Japanese Web application, namely Shift_JIS (SJIS), EUC-JP and utf-8; with any of them you can input Japanese data in the different scripts described below. To see the impact of the encoding, enter some data on the screen in, say, Shift_JIS, change the encoding to utf-8 in the web.config file (without stopping the web application) and refresh the screen. All the characters are garbled, even though the new encoding also supports them. This is because the byte sequence that represents a character depends entirely on the encoding used and is generally not compatible with other encodings. Converting from one encoding to another has to be done explicitly in code; otherwise the characters will be garbled.
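
The effect can be reproduced outside a web page. The following is a minimal sketch, not part of the class library: it encodes a string as Shift_JIS, decodes the bytes as utf-8 without conversion, and then converts explicitly with Encoding.Convert. The module and function names are made up for illustration.

```vbnet
Imports System
Imports System.Text

Module EncodingMismatchDemo
    ' Returns the Shift_JIS encoding object
    Function GetSjis() As Encoding
        ' Needed on .NET Core / .NET 5+; on the classic .NET Framework this
        ' line can be dropped because Shift_JIS is available out of the box
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance)
        Return Encoding.GetEncoding("shift_jis")
    End Function

    ' Decodes Shift_JIS bytes with the wrong encoding (utf-8)
    Function DecodeWrongly(ByVal p_abSjis As Byte()) As String
        Return Encoding.UTF8.GetString(p_abSjis)
    End Function

    ' Converts Shift_JIS bytes to utf-8 bytes explicitly, then decodes them
    Function DecodeAfterConversion(ByVal p_abSjis As Byte()) As String
        Dim abUtf8 As Byte() = Encoding.Convert(GetSjis(), Encoding.UTF8, p_abSjis)
        Return Encoding.UTF8.GetString(abUtf8)
    End Function

    Sub Main()
        Dim sOriginal As String = "東京"
        Dim abSjis As Byte() = GetSjis().GetBytes(sOriginal)
        Console.WriteLine(DecodeWrongly(abSjis) = sOriginal)         ' False
        Console.WriteLine(DecodeAfterConversion(abSjis) = sOriginal) ' True
    End Sub
End Module
```

The bytes themselves never change in the first case; only the interpretation does, which is why the text cannot survive a silent switch of encoding.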

Different Scripts

The Japanese language, unlike English, uses multiple scripts. There are four: Hiragana, Katakana (Kana is the collective name for the characters of these two scripts), Kanji and Romaji. The same word can be written in any of them. For example, the word "Tokyo" is written as とうきょう in Hiragana, トウキョウ in Katakana and 東京 in Kanji. To complicate things further, characters can be either Hankaku (single-byte, half-width), which applies to Katakana, or Zenkaku (double-byte, full-width), which applies to Hiragana, Katakana and Kanji. Romaji is normally not used in business application development and will not be discussed in this article. To handle these scripts, let us discuss them in brief:

To add to this, the Japanese characters can be either Hankaku or Zenkaku:

Now you might ask, "How does all this affect me?" It does, because most of the time you need functions that validate whether a string entered by the web user is Hankaku or Zenkaku, and whether it is Hiragana, Katakana, Kanji or Kana-Majiri.

There is also, typically, a requirement for functions that convert between Hankaku and Zenkaku, and between Hiragana and Katakana.

(Conversion from and to Kanji is not possible.)

Size Checking

Though this appears trivial, size checking plays an important role in web development. For plain English data, we can check the size of the data by taking the length of the string. But as Japanese data can contain Hankaku characters, Zenkaku characters or a combination of the two, we need to check the byte count of the data instead of its length. The byte count also depends on the encoding used to store the data: the Kanji word for Tokyo (東京) has a byte count of 6 in utf-8 but 4 in Shift_JIS or EUC-JP. So we need a centralized function that returns the correct size of a string.
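
As a rough illustration of such a centralized function, the sketch below counts bytes with Encoding.GetByteCount. The name GetByteSize and its parameters are assumptions made for this example; the GetSize() method of the class library may be implemented differently.

```vbnet
Imports System
Imports System.Text

Module SizeCheckDemo
    ' Hypothetical helper: byte count of p_sData in the named encoding
    Function GetByteSize(ByVal p_sData As String, ByVal p_sEncodingName As String) As Integer
        ' Needed on .NET Core / .NET 5+ for Shift_JIS and EUC-JP;
        ' can be dropped on the classic .NET Framework
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance)
        Return Encoding.GetEncoding(p_sEncodingName).GetByteCount(p_sData)
    End Function

    Sub Main()
        Dim sTokyo As String = "東京"
        Console.WriteLine(sTokyo.Length)                     ' 2 characters
        Console.WriteLine(GetByteSize(sTokyo, "utf-8"))      ' 6 bytes
        Console.WriteLine(GetByteSize(sTokyo, "shift_jis"))  ' 4 bytes
        Console.WriteLine(GetByteSize(sTokyo, "euc-jp"))     ' 4 bytes
    End Sub
End Module
```

When validating against a database column, the encoding passed in should be the one the column actually stores, since that is the byte count that matters.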

Japanese Date

Another interesting aspect of Japanese websites is that dates are often represented differently. For example, the date 2001/01/01 (dates are always written in YYYY/MM/DD format in Japanese applications) will be shown as 平成13年01月01日. This is because the Japanese date system is based on the reigning emperor of Japan. Whenever a new emperor ascends to the throne, a new era starts and the year count restarts from that day. For example, the current era is the Heisei era of Emperor Akihito.

Defining the Solution

Now that we have been through the problems, let's look into the solutions:

Encoding

First of all, we need to decide on which encoding to use when developing such an application. The various factors that will play a definite role in this decision process are:

So the rule of thumb is: if the system needs multilingual support, use utf-8. If only Japanese data is to be handled, Shift_JIS or EUC-JP will do. Which of these two to use depends on interaction with external systems. If the system does not interact with any external system whose encoding is EUC-JP, Shift_JIS is the recommended choice, because Microsoft operating systems default to it for Japanese and uniformity is therefore ensured. The setting is simple and needs to be made in the following places:

<!--  GLOBALIZATION
This section sets the globalization settings of the application. -->
<!-- Use this setting if Shift_JIS encoding is needed (Japanese data) -->
<globalization requestEncoding="Shift_JIS" responseEncoding="Shift_JIS" />

If you want to use a different encoding, change the values of the requestEncoding and responseEncoding attributes accordingly. This ensures that, at the web tier, all characters are received and rendered in the specified encoding.

Multiple Script Handling and Size Checking

Handling the various scripts involves validating and converting the data received. The key is that a character is, after all, just a character with a value associated with it. As in any other language, every character of the Japanese scripts has a Unicode value, and within each script these values are contiguous (the exception is Hankaku Katakana, whose characters sit in a separate range from Zenkaku Katakana). So the validation routines check the Unicode value of each character in a string and use it to determine which script the character belongs to. The Unicode ranges of the characters of the Japanese scripts are:

Based on these Unicode values we can find out whether a string is Hiragana, Katakana or Kanji. The catch is that Hankaku Katakana characters fall outside the Zenkaku Katakana range. This can be solved by first converting the received string to Zenkaku Katakana and then checking the Unicode values. For example, in the class library it is handled as follows:

To check whether a string is Hiragana:

To check whether a string is Katakana:

The Strict property used here is described later in the article. Similar checks can be done for the characters of the Kanji script as well.
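
The range checks described above might look roughly like the following. This is a sketch under stated assumptions rather than the library's code: the boundaries are the Unicode block ranges for Hiragana (U+3040 to U+309F), Katakana (U+30A0 to U+30FF) and the main CJK ideograph block (U+4E00 to U+9FFF), the function names are invented, and the Hankaku-to-Zenkaku pre-conversion and the Strict property are left out.

```vbnet
Imports System

Module ScriptCheckDemo
    ' Unicode block boundaries (assumed; the class library may use narrower ranges)
    Const HIRAGANA_START As Integer = &H3040
    Const HIRAGANA_END As Integer = &H309F
    Const KATAKANA_START As Integer = &H30A0
    Const KATAKANA_END As Integer = &H30FF
    Const KANJI_START As Integer = &H4E00
    Const KANJI_END As Integer = &H9FFF

    ' True when every character of p_sData lies between p_iStart and p_iEnd
    Function AllInRange(ByVal p_sData As String, ByVal p_iStart As Integer, ByVal p_iEnd As Integer) As Boolean
        If String.IsNullOrEmpty(p_sData) Then Return False
        For Each c As Char In p_sData
            Dim iCode As Integer = AscW(c)
            If iCode < p_iStart OrElse iCode > p_iEnd Then Return False
        Next
        Return True
    End Function

    Function IsHiraganaString(ByVal p_sData As String) As Boolean
        Return AllInRange(p_sData, HIRAGANA_START, HIRAGANA_END)
    End Function

    Function IsKatakanaString(ByVal p_sData As String) As Boolean
        Return AllInRange(p_sData, KATAKANA_START, KATAKANA_END)
    End Function

    Function IsKanjiString(ByVal p_sData As String) As Boolean
        Return AllInRange(p_sData, KANJI_START, KANJI_END)
    End Function

    Sub Main()
        Console.WriteLine(IsHiraganaString("とうきょう"))  ' True
        Console.WriteLine(IsKatakanaString("トウキョウ"))  ' True
        Console.WriteLine(IsKanjiString("東京"))           ' True
        Console.WriteLine(IsHiraganaString("東京"))        ' False
    End Sub
End Module
```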

We can check whether a string is Hankaku or Zenkaku by checking the byte size of each character:

(The Encoding class is available in the System.Text namespace.)

If we had used utf-8 encoding in the code block:

then the return value would have been 3. Note, however, that even if the web tier uses utf-8, the validation function can still use Shift_JIS: .NET strings are held as Unicode internally, so the byte count is computed for whichever encoding the function asks for, and it will return 2 (remember, it is easier to handle the encoding in MS systems).
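
A minimal sketch of this byte-size test, assuming Shift_JIS as the measuring encoding. The function names are invented for illustration and are not the library's isHankaku() and isZenkaku().

```vbnet
Imports System
Imports System.Text

Module WidthCheckDemo
    Function GetSjis() As Encoding
        ' Needed on .NET Core / .NET 5+; can be dropped on the classic .NET Framework
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance)
        Return Encoding.GetEncoding("shift_jis")
    End Function

    ' Byte size of one character in Shift_JIS: 1 = Hankaku, 2 = Zenkaku.
    ' Caveat: a character that Shift_JIS cannot represent is replaced by "?"
    ' and is therefore counted as 1 byte.
    Function CharByteSize(ByVal p_cChar As Char) As Integer
        Return GetSjis().GetByteCount(p_cChar.ToString())
    End Function

    ' True when every character occupies exactly p_iSize bytes
    Function AllOfSize(ByVal p_sData As String, ByVal p_iSize As Integer) As Boolean
        If String.IsNullOrEmpty(p_sData) Then Return False
        For Each c As Char In p_sData
            If CharByteSize(c) <> p_iSize Then Return False
        Next
        Return True
    End Function

    Function IsHankakuString(ByVal p_sData As String) As Boolean
        Return AllOfSize(p_sData, 1)
    End Function

    Function IsZenkakuString(ByVal p_sData As String) As Boolean
        Return AllOfSize(p_sData, 2)
    End Function

    Sub Main()
        Console.WriteLine(CharByteSize("東"c))                          ' 2 in Shift_JIS
        Console.WriteLine(Encoding.UTF8.GetByteCount("東"))             ' 3 in utf-8
        Console.WriteLine(IsHankakuString("ｱｲｳ"))                       ' True
        Console.WriteLine(IsZenkakuString("アイウ"))                    ' True
    End Sub
End Module
```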

The LangHelper class available for download with this article provides all of this functionality through the methods isHankaku(), isZenkaku(), isHiragana(), isKatakana(), isKanji(), isKanaMajiri(), HanToZen(), ZenToHan(), HiraganaToKatakana(), KataKanaToHiragana() and GetSize().
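
Because the two Kana scripts are laid out in parallel in Unicode, a Hiragana-to-Katakana conversion can be sketched as a fixed offset. The code below is an illustration only, not the library's HiraganaToKatakana(): it assumes the Hiragana letters U+3041 to U+3096 map to Katakana by adding &H60, and it leaves every other character untouched.

```vbnet
Imports System
Imports System.Text

Module KanaShiftDemo
    Const HIRAGANA_FIRST As Integer = &H3041  ' small A
    Const HIRAGANA_LAST As Integer = &H3096   ' small KE
    Const KANA_OFFSET As Integer = &H60       ' distance to the Katakana equivalent

    ' Converts Hiragana letters to Katakana; other characters pass through
    Function HiraToKata(ByVal p_sData As String) As String
        If p_sData Is Nothing Then Return ""
        Dim sb As New StringBuilder(p_sData.Length)
        For Each c As Char In p_sData
            Dim iCode As Integer = AscW(c)
            If iCode >= HIRAGANA_FIRST AndAlso iCode <= HIRAGANA_LAST Then
                sb.Append(ChrW(iCode + KANA_OFFSET))
            Else
                sb.Append(c)
            End If
        Next
        Return sb.ToString()
    End Function

    Sub Main()
        Console.WriteLine(HiraToKata("とうきょう") = "トウキョウ")  ' True
        Console.WriteLine(HiraToKata("東京") = "東京")              ' True, Kanji is left alone
    End Sub
End Module
```

Subtracting the same offset gives the reverse direction for the Katakana letters U+30A1 to U+30F6.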

Japanese Dates

The trick in converting English dates to Japanese dates is to understand the Japanese eras. As mentioned above, a new era begins when a new emperor ascends to the throne. So, just like characters, the eras have ranges. The ranges of the eras are:

Taisho: 1912/07/30 to 1926/12/24
Showa: 1926/12/25 to 1989/01/07
Heisei: 1989/01/08 onwards

So, based on an English date, we determine the era it belongs to, and based on the era we convert it to a Japanese date. For example, the date 2002/12/25 (all dates in YYYY/MM/DD format) becomes 平成14年12月25日 (note that the month (12) and day (25) remain unchanged). Here 平成 is the Kanji for Heisei, 年 is the Kanji for 'Year', 月 is the Kanji for 'Month' and 日 is the Kanji for 'Day'. How did we achieve this conversion?

Public Function getJapaneseDate(ByVal p_sDate As String, ByRef p_sErrorData As String) As String
    ' The Kanji symbols used below (m_sTaishoKanjiSymbol, m_sShowaKanjiSymbol,
    ' m_sHeiseiKanjiSymbol, m_sJapaneseYearSymbol, m_sJapaneseMonthSymbol and
    ' m_sJapaneseDaySymbol) are class-level fields of LangHelper.
    Dim m_sTAISHOSTARTDAY As String = "30"     ' The start day of the Taisho era
    Dim m_sTAISHOSTARTMONTH As String = "07"   ' The start month of the Taisho era
    Dim m_sTAISHOSTARTYEAR As String = "1912"  ' The start year of the Taisho era
    Dim m_sTAISHOENDDAY As String = "24"       ' The end day of the Taisho era
    Dim m_sTAISHOENDMONTH As String = "12"     ' The end month of the Taisho era
    Dim m_sTAISHOENDYEAR As String = "1926"    ' The end year of the Taisho era

    Dim m_sSHOWASTARTDAY As String = "25"      ' The start day of the Showa era
    Dim m_sSHOWASTARTMONTH As String = "12"    ' The start month of the Showa era
    Dim m_sSHOWASTARTYEAR As String = "1926"   ' The start year of the Showa era

    Dim m_sSHOWAENDDAY As String = "07"        ' The end day of the Showa era
    Dim m_sSHOWAENDMONTH As String = "01"      ' The end month of the Showa era
    Dim m_sSHOWAENDYEAR As String = "1989"     ' The end year of the Showa era

    Dim m_sHEISEISTARTDAY As String = "08"     ' The start day of the Heisei era
    Dim m_sHEISEISTARTMONTH As String = "01"   ' The start month of the Heisei era
    Dim m_sHEISEISTARTYEAR As String = "1989"  ' The start year of the Heisei era

    Dim m_sHEISEIENDDAY As String = "31"       ' The end day of the Heisei era
    Dim m_sHEISEIENDMONTH As String = "12"     ' The end month of the Heisei era
    Dim m_sHEISEIENDYEAR As String = "2099"    ' The end year of the Heisei era (open-ended upper bound)

    Dim sJapaneseDate As String

    If IsDate(p_sDate) = False Then
        p_sErrorData = "Invalid Date. Enter Date in 'YYYY/MM/DD' format"
        Return ""
    End If

    If Len(p_sDate) <> 10 Then
        p_sErrorData = "Enter Date in 'YYYY/MM/DD' format"
        Return ""
    End If

    ' Both separators must be in the expected positions
    If p_sDate.IndexOf("/") <> 4 OrElse p_sDate.LastIndexOf("/") <> 7 Then
        p_sErrorData = "Enter Date in 'YYYY/MM/DD' format"
        Return ""
    End If

    ' Convert Zenkaku digits to Hankaku, leaving other characters as they are
    If Not isHankaku(p_sDate) Then
        Dim bTmpStrict As Boolean
        bTmpStrict = Strict
        Strict = False
        p_sDate = ZenToHan(p_sDate, "")
        Strict = bTmpStrict
    End If

    p_sDate = p_sDate.Replace("/", "") ' Remove the "/" to get "YYYYMMDD"

    ' "YYYYMMDD" strings of equal length compare in date order
    Dim iJapaneseYear As Integer
    If (p_sDate >= (m_sTAISHOSTARTYEAR & m_sTAISHOSTARTMONTH & m_sTAISHOSTARTDAY)) AndAlso _
       (p_sDate <= (m_sTAISHOENDYEAR & m_sTAISHOENDMONTH & m_sTAISHOENDDAY)) Then
        iJapaneseYear = CInt(Mid(p_sDate, 1, 4)) - CInt(m_sTAISHOSTARTYEAR) + 1
        sJapaneseDate = m_sTaishoKanjiSymbol & iJapaneseYear & m_sJapaneseYearSymbol
    ElseIf (p_sDate >= (m_sSHOWASTARTYEAR & m_sSHOWASTARTMONTH & m_sSHOWASTARTDAY)) AndAlso _
           (p_sDate <= (m_sSHOWAENDYEAR & m_sSHOWAENDMONTH & m_sSHOWAENDDAY)) Then
        iJapaneseYear = CInt(Mid(p_sDate, 1, 4)) - CInt(m_sSHOWASTARTYEAR) + 1
        sJapaneseDate = m_sShowaKanjiSymbol & iJapaneseYear & m_sJapaneseYearSymbol
    ElseIf (p_sDate >= (m_sHEISEISTARTYEAR & m_sHEISEISTARTMONTH & m_sHEISEISTARTDAY)) AndAlso _
           (p_sDate <= (m_sHEISEIENDYEAR & m_sHEISEIENDMONTH & m_sHEISEIENDDAY)) Then
        iJapaneseYear = CInt(Mid(p_sDate, 1, 4)) - CInt(m_sHEISEISTARTYEAR) + 1
        sJapaneseDate = m_sHeiseiKanjiSymbol & iJapaneseYear & m_sJapaneseYearSymbol
    Else
        p_sErrorData = "Not a Valid Date or predates the Taisho Era"
        Return ""
    End If

    ' Month and day are carried over unchanged
    sJapaneseDate = sJapaneseDate & Mid(p_sDate, 5, 2) & m_sJapaneseMonthSymbol & _
                    Mid(p_sDate, 7, 2) & m_sJapaneseDaySymbol
    Return sJapaneseDate
End Function

This functionality is also available through the GetJapaneseDate() function of the LangHelper Class, which is provided for download with this article.
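
As a cross-check on a hand-written routine like this, the .NET Framework also ships a JapaneseCalendar class in System.Globalization that knows the era boundaries. The sketch below is an alternative technique, not what LangHelper does: it attaches that calendar to a ja-JP culture and formats with a custom pattern. The function name and the pattern are choices made for this example, and the era names produced come from the runtime's culture data.

```vbnet
Imports System
Imports System.Globalization

Module JapaneseCalendarDemo
    ' Formats a date as era + year + month + day using the built-in calendar
    Function ToJapaneseEraDate(ByVal p_dtDate As DateTime) As String
        Dim ci As New CultureInfo("ja-JP")
        ' Switch the culture from the Gregorian to the Japanese era calendar
        ci.DateTimeFormat.Calendar = New JapaneseCalendar()
        ' gg = era name, yy = year within the era; Kanji literals are quoted
        Return p_dtDate.ToString("ggyy'年'MM'月'dd'日'", ci)
    End Function

    Sub Main()
        Console.WriteLine(ToJapaneseEraDate(New DateTime(2002, 12, 25)))
    End Sub
End Module
```

One design difference is worth noting: the built-in calendar is updated with the runtime when a new era begins, whereas hard-coded boundaries such as the 2099 upper bound above have to be maintained by hand.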

Component Design

All of this functionality is available through the methods of a single class, LangHelper. As it is only a helper/utility class, it is a single-tier component. From a design point of view, it should preferably be placed in the Utility namespace of your Web application, if you have one.

Component Coding

The VB.NET class LangHelper exposes its functionality through the following methods:

Validation Functions

Conversion Functions

Misc. Functions

There is also a Strict property in this class, whose behaviour is interesting. It is typically used in the conversion functions. When I pass a pure Zenkaku string (all characters Zenkaku) to the ZenToHan function, it returns the Hankaku equivalent. If I pass a non-Zenkaku string, it returns an error in the second parameter, which is passed ByRef. But what if I pass a mixed string, in which one part is Zenkaku Katakana and therefore convertible, while the other part is Kanji and therefore not convertible to Hankaku? The result is determined by the value of the Strict property. If it is set to True, the function returns "" and the ByRef parameter, p_sErrorData, holds the error description. If it is False, all convertible characters are converted and non-convertible characters are left as they are: the Zenkaku Katakana part comes back as Hankaku Katakana and the Kanji part is left intact.

The same functionality applies to all conversion functions.

The Client Code can call these functions as follows:

The Sample Application

The download, in addition to the class library, contains two client applications that demonstrate the functionality exposed by the LangHelper class. The first is a Windows Forms client, LangHelperWinClient, and the second is a web-based aspx form, LangHelperAspxClient. The two differ only in presentation and contain no business logic: the LangHelper class provides all of the conversion, validation and miscellaneous functions described above, and the two presentation-tier components simply call methods on an instance of it.

LangHelperWinClient

LangHelperAspxClient

Application Setup

The Download contains a VisualStudio.NET Solution ( LangHelperLibrarySolution) with three projects, LangHelperLibrary, LangHelperAspxClient and LangHelperWinClient.

Installation Instructions:

Do remember that the Web Server and the Client browser machines need Japanese Language support for this application to work.

Japanese Language Support:

To run the sample application, to type Japanese characters, and to develop Japanese language web systems, language support is essential at the operating system (OS) level. This can be done either by:

Start->Settings->Control Panel->Regional Options->Your Locale (Location )->Select the Japanese Checkbox->OK.

Please note that all this is needed only so that you can go through the application and see its capabilities. The core of the application is the LangHelper class; at development time you just need to copy the LangHelper file into your own project and use its functions. The other two projects (LangHelperAspxClient and LangHelperWinClient) demonstrate the capabilities of the class library and deal only with the presentation tier (UI tier).

Any Limitations or Further Work

As such, the class library provided has no particular limitations. All the functions take a String as a parameter, so you can also pass single characters by storing them in a String variable. If you want to customize the functionality, it should be easy to do, as each function is atomic and modular. For example, to write a customized function that accepts only Hiragana and Kanji characters, you can write a wrapper function that picks up each character of the string passed to it and calls the isHiragana and isKanji functions:

' Customised function extending the validation functionality: checks
' whether every character of a string is either Hiragana or Kanji
Public Function isHiraganaOrKanji(ByVal p_sData As String) As Boolean
    Dim iCtr As Integer
    Dim sChar As String
    Dim objLangHelper As New LangHelper
    For iCtr = 1 To Len(p_sData)
        sChar = Mid(p_sData, iCtr, 1)
        ' Reject the string as soon as a character is neither Hiragana nor Kanji
        If Not objLangHelper.isHiragana(sChar) AndAlso _
           Not objLangHelper.isKanji(sChar) Then
            Return False
        End If
    Next
    Return True
End Function

Conclusion

That should solve the common problems that typically arise when implementing a web-based solution for Japanese customers. These requirements are very common in a Japanese environment and can be met with the single-point validation and conversion routines provided by the class library. And once the encoding issue is resolved at the system architecture stage, there should be no surprises at development or deployment time.


Article Information
Author M Sitaraman
Chief Technical Editor John R. Chapman
Project Manager Helen Cuthill
Reviewers Sean Schade, Saurahb Nandu

Copyright © 2002 Wrox Press. All Rights Reserved.