ASPToday Article, July 2, 2002
Handling Japanese data on the Web can be challenging, given the complexity of the language. The issues that arise range from choosing a character encoding and handling the language's multiple scripts to measuring the size of the data and displaying dates in the Japanese era format.
These problems can be tackled by adopting a common encoding standard across the tiers of a system, and by providing a class library with functions that handle the problems mentioned above.
The system knowledge requirements to implement such a Japanese based solution are:
Let's now discuss the various problems outlined above.
Encoding is an important factor when developing a non-English Web application. The web server uses the encoding to interpret the data received in the request object and to render the screen in the response object. Several encodings are possible for a Japanese Web application, namely Shift_JIS (SJIS), EUC-JP and utf-8; with any of these you can input Japanese data in the different scripts described below. The impact of the encoding can be seen if we enter data on the screen in, say, Shift_JIS, change the encoding to utf-8 in the web.config file (without stopping the web application) and refresh the screen. All the characters turn into junk, even though the new encoding also supports them. This is because the bytes stored for each character depend entirely on the encoding used, and are generally not compatible with other encodings. Any conversion from one encoding to another has to be done explicitly in code; otherwise the characters will be garbled.
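The effect can be reproduced outside ASP.NET. The following is a minimal Python sketch (not the article's VB.NET code) that stores a string as Shift_JIS bytes, reads those bytes back as utf-8 without converting, and then performs the explicit conversion. The variable names are illustrative only.

```python
# Sketch: the same bytes read under the wrong encoding turn into junk.
text = "東京"  # "Tokyo" in Kanji

# Bytes as they would be stored under Shift_JIS.
sjis_bytes = text.encode("shift_jis")

# Reading Shift_JIS bytes as utf-8 without converting garbles them;
# errors="replace" substitutes a marker for each undecodable byte.
garbled = sjis_bytes.decode("utf-8", errors="replace")

# Explicit conversion: decode with the original encoding, then re-encode.
utf8_bytes = sjis_bytes.decode("shift_jis").encode("utf-8")
restored = utf8_bytes.decode("utf-8")

print(repr(garbled))   # junk characters
print(restored)        # the original text
```

The point of the sketch is only that the byte sequences differ between encodings, so a change of encoding must always go through a decode and re-encode step.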
The Japanese language, unlike English, uses multiple scripts. There are basically four: Hiragana, Katakana (Kana is the name given to the characters of these two scripts), Kanji and Romaji. The same word can therefore be written in any of the scripts. For example, the word "Tokyo" is written as とうきょう in Hiragana, トウキョウ in Katakana and 東京 in Kanji. To complicate things further, the characters can be either Hankaku (single-byte, half-width), in the case of Katakana, or Zenkaku (double-byte, full-width), in the case of Hiragana, Katakana or Kanji. Romaji is not normally used in business application development and will not be discussed in this article. Let us look at these scripts in brief:
To add to this, the Japanese characters can be either Hankaku or Zenkaku:
Now you might ask, "How does all this affect me?" It does, because most of the time there is a necessity to have functions for validating whether a String entered by the web user on the screen is:
There is also, typically, a requirement for functions that will convert:
(Conversion from and to Kanji is not possible).
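As an illustration of why Kana conversion is mechanical while Kanji conversion is not, here is a small Python sketch (an assumption about one possible approach, not the LangHelper implementation). It relies on the fact that Unicode lays out Hiragana (U+3041 to U+3096) and Zenkaku Katakana (U+30A1 to U+30F6) in the same order, 0x60 code points apart. The function names are hypothetical.

```python
# Hiragana U+3041..U+3096 and Katakana U+30A1..U+30F6 run in parallel,
# exactly 0x60 code points apart, so conversion is a fixed offset.
KANA_OFFSET = 0x60

_HIRA_TO_KATA = {cp: cp + KANA_OFFSET for cp in range(0x3041, 0x3097)}
_KATA_TO_HIRA = {cp + KANA_OFFSET: cp for cp in range(0x3041, 0x3097)}


def hiragana_to_katakana(text: str) -> str:
    """Shift every Hiragana character to Katakana; leave the rest as is."""
    return text.translate(_HIRA_TO_KATA)


def katakana_to_hiragana(text: str) -> str:
    """Shift every Zenkaku Katakana character to Hiragana; leave the rest as is."""
    return text.translate(_KATA_TO_HIRA)


print(hiragana_to_katakana("とうきょう"))
print(katakana_to_hiragana("トウキョウ"))
```

Kanji characters have no such positional relationship to the Kana blocks, so they pass through both functions unchanged.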
Though checking the size of the data appears trivial, it plays an important role in web development. With English data, we can check the size by taking the length of the string. But as Japanese data can contain Hankaku characters, Zenkaku characters, or a combination of the two, we need to check the byte count of the data instead of its length. The byte count also differs with the encoding used to store the data: the Kanji word for Tokyo ("東京") has a byte count of 6 in utf-8, but 4 in Shift_JIS or EUC-JP. So we need a centralized function that returns the correct size of a string.
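A centralized size function can be very small. The Python sketch below is a stand-in for the idea (the name get_size and its default encoding are assumptions, not the LangHelper signature): it encodes the string and counts the resulting bytes.

```python
def get_size(text: str, encoding: str = "shift_jis") -> int:
    """Return the number of bytes `text` occupies in the given encoding."""
    return len(text.encode(encoding))


tokyo = "東京"
print(len(tokyo))                    # character count
print(get_size(tokyo, "utf-8"))      # byte count in utf-8
print(get_size(tokyo, "shift_jis"))  # byte count in Shift_JIS
print(get_size(tokyo, "euc_jp"))     # byte count in EUC-JP
```

Keeping the encoding as a parameter means a column limit defined in bytes can be checked against whichever encoding the storage tier actually uses.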
One more interesting aspect of Japanese websites is that dates are often represented differently. For example, the date 2001/01/01 (dates are always written in YYYY/MM/DD format in Japanese applications) will be shown as 平成13年01月01日. This is because the Japanese date system is based on the reigning emperor of Japan. Whenever a new emperor ascends the throne, a new era starts and the year count restarts from that day. For example, the current era is the Heisei era of Emperor Akihito.
Now that we have been through the problems, let's look into the solutions:
First of all, we need to decide on which encoding to use when developing such an application. The various factors that will play a definite role in this decision process are:
So the rule of thumb is: if the system needs multilingual support, use utf-8. If only Japanese data is to be handled, Shift_JIS or EUC-JP will do. Which of these two to use depends on interaction with external systems. If the system does not interact with any external system whose encoding is EUC-JP, Shift_JIS is the recommended choice, as Japanese editions of Microsoft operating systems default to this encoding, which ensures uniformity. The setting is simple and needs to be made in the following places:
<!-- GLOBALIZATION
     This section sets the globalization settings of the application. -->
<!-- Use this encoding if a Shift_JIS encoding is needed (Japanese data) -->
<globalization requestEncoding="Shift_JIS" responseEncoding="Shift_JIS" />
If you want to use a different encoding, change the values of the requestEncoding and responseEncoding attributes accordingly. This ensures that, at the web tier, all characters are received and rendered in the specified encoding.
Handling the various scripts involves validating and converting the data received. The key is to remember that a character is, after all, just a character, with a value associated with it. As in any other language, every character of the Japanese scripts has a Unicode value, and in most cases the values for a script are contiguous (Hankaku Katakana characters are the exception, as they sit apart from the rest of the Katakana script). So in the validation routines we check the Unicode value of each character in a string, and from that determine which script the character belongs to. The Unicode ranges of the characters of the Japanese scripts are:
Based on these Unicode values we can find out whether a String is Hiragana, Katakana or Kanji. The catch is that Hankaku Katakana characters are not contiguous with the Zenkaku Katakana range. This problem can be solved by first converting the received string to Zenkaku Katakana and then checking the Unicode value. For example, in the Class library, it is handled as follows:
To check whether a String is hiragana or not:
To check whether a String is katakana or not:
The Strict property used here is described later in the article. Similar checks can be done for the characters of the Kanji script as well.
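The range checks described above can be sketched as follows. This is a Python illustration under stated assumptions, not the LangHelper code: the ranges are taken from the Unicode code charts (Hiragana letters U+3041 to U+3096, Zenkaku Katakana U+30A1 to U+30FC including the long-vowel mark, and the main CJK ideograph block U+4E00 to U+9FFF), and Unicode NFKC normalization stands in for the "convert to Zenkaku first" step.

```python
import unicodedata

# Code point ranges assumed for this sketch (from the Unicode code charts).
HIRAGANA = (0x3041, 0x3096)
KATAKANA = (0x30A1, 0x30FC)   # letters, middle dot and long-vowel mark
KANJI = (0x4E00, 0x9FFF)      # main CJK Unified Ideographs block only


def _all_in(text: str, low: int, high: int) -> bool:
    """True if text is non-empty and every code point lies in [low, high]."""
    return bool(text) and all(low <= ord(ch) <= high for ch in text)


def is_hiragana(text: str) -> bool:
    return _all_in(text, *HIRAGANA)


def is_katakana(text: str) -> bool:
    # NFKC folds Hankaku Katakana into Zenkaku, so a single range check works.
    return _all_in(unicodedata.normalize("NFKC", text), *KATAKANA)


def is_kanji(text: str) -> bool:
    return _all_in(text, *KANJI)


print(is_hiragana("とうきょう"), is_katakana("トウキョウ"), is_kanji("東京"))
```

Rare Kanji outside the main block (for example in the extension blocks) would be rejected by this sketch; widening the KANJI range is a matter of adding further range checks.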
We can check whether the String is Hankaku or Zenkaku by checking the byte Size of each character:
(The Encoding Class is available in the System.Text Namespace)
If we had used a utf-8 encoding in the code block:
then the return value would have been 3. Note, however, that even if the Web tier uses utf-8, you can still specify Shift_JIS in the validation function. .NET strings are held internally as Unicode and are converted to Shift_JIS by the Encoding object when the bytes are counted, so the function will still return 2.
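The byte-size test can be sketched in Python as below (an illustration of the idea, not the LangHelper code; the function names are hypothetical). Each character is encoded to Shift_JIS on its own: one byte means Hankaku, two bytes means Zenkaku. The check is pinned to Shift_JIS on purpose, because in utf-8 a Hankaku Katakana character takes three bytes and the distinction is lost.

```python
def _sjis_width(ch: str) -> int:
    """Bytes needed for one character in Shift_JIS.

    Raises UnicodeEncodeError for characters Shift_JIS cannot represent.
    """
    return len(ch.encode("shift_jis"))


def is_hankaku(text: str) -> bool:
    """True if every character is half-width (one byte in Shift_JIS)."""
    return bool(text) and all(_sjis_width(ch) == 1 for ch in text)


def is_zenkaku(text: str) -> bool:
    """True if every character is full-width (two bytes in Shift_JIS)."""
    return bool(text) and all(_sjis_width(ch) == 2 for ch in text)


print(is_zenkaku("東京"), is_hankaku("ABC123"), is_zenkaku("A東"))
```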
The LangHelper Class available for download with this article provides all these functionalities through the methods, isHankaku(), isZenkaku(), isHiragana(), isKatakana(), isKanji(), isKanaMajiri(), HanToZen(), ZenToHan(), HiraganaToKatakana(), KataKanaToHiragana() and GetSize().
The trick in converting English dates to Japanese dates is to understand the Japanese eras. As mentioned above, a new era begins when a new emperor ascends the throne. So, just like characters, the eras have ranges. The ranges of the eras handled here are: Taisho, 1912/07/30 to 1926/12/24; Showa, 1926/12/25 to 1989/01/07; and Heisei, from 1989/01/08 onwards.
So, based on an English date, we determine the era it belongs to, and based on the era we convert it to a Japanese date. For example, the date 2002/12/25 (all dates in YYYY/MM/DD format) would be 平成14年12月25日 (note that the month (12) and day (25) remain unchanged). Here 平成 is the kanji symbol for Heisei, 年 is the kanji for the string literal 'Year', 月 is the kanji for the string literal 'Month', and 日 is the kanji for the string literal 'Day'. How did we achieve this conversion?
Public Function getJapaneseDate(ByVal p_sDate As String, ByRef p_sErrorData As String) As String
    ' Era boundaries. The m_s...KanjiSymbol and m_sJapanese...Symbol values used
    ' below are assumed to be class-level members of LangHelper.
    Dim m_sTAISHOSTARTDAY As String = "30"      ' The Start Day of the Taisho Era
    Dim m_sTAISHOSTARTMONTH As String = "07"    ' The Start Month of the Taisho Era
    Dim m_sTAISHOSTARTYEAR As String = "1912"   ' The Start Year of the Taisho Era
    Dim m_sTAISHOENDDAY As String = "24"        ' The End Day of the Taisho Era
    Dim m_sTAISHOENDMONTH As String = "12"      ' The End Month of the Taisho Era
    Dim m_sTAISHOENDYEAR As String = "1926"     ' The End Year of the Taisho Era
    Dim m_sSHOWASTARTDAY As String = "25"       ' The Start Day of the Showa Era
    Dim m_sSHOWASTARTMONTH As String = "12"     ' The Start Month of the Showa Era
    Dim m_sSHOWASTARTYEAR As String = "1926"    ' The Start Year of the Showa Era
    Dim m_sSHOWAENDDAY As String = "07"         ' The End Day of the Showa Era
    Dim m_sSHOWAENDMONTH As String = "01"       ' The End Month of the Showa Era
    Dim m_sSHOWAENDYEAR As String = "1989"      ' The End Year of the Showa Era
    Dim m_sHEISEISTARTDAY As String = "08"      ' The Start Day of the Heisei Era
    Dim m_sHEISEISTARTMONTH As String = "01"    ' The Start Month of the Heisei Era
    Dim m_sHEISEISTARTYEAR As String = "1989"   ' The Start Year of the Heisei Era
    Dim m_sHEISEIENDDAY As String = "31"        ' The End Day of the Heisei Era
    Dim m_sHEISEIENDMONTH As String = "12"      ' The End Month of the Heisei Era
    Dim m_sHEISEIENDYEAR As String = "2099"     ' The End Year of the Heisei Era
    Dim sJapaneseDate As String

    ' Validate the input: a real date, ten characters, slashes at positions 4 and 7
    If IsDate(p_sDate) = False Then
        p_sErrorData = "Invalid Date. Enter Date in 'YYYY/MM/DD' format"
        Return ""
    End If
    If Len(p_sDate) <> 10 Then
        p_sErrorData = "Enter Date in 'YYYY/MM/DD' format"
        Return ""
    End If
    ' Either slash being out of place is an error, hence OrElse rather than And
    If p_sDate.IndexOf("/") <> 4 OrElse p_sDate.LastIndexOf("/") <> 7 Then
        p_sErrorData = "Enter Date in 'YYYY/MM/DD' format"
        Return ""
    End If

    ' Convert any Zenkaku characters to Hankaku, leaving the rest untouched
    If Not isHankaku(p_sDate) Then
        Dim bTmpStrict As Boolean
        bTmpStrict = Strict
        Strict = False
        p_sDate = ZenToHan(p_sDate, "")
        Strict = bTmpStrict
    End If

    p_sDate = p_sDate.Replace("/", "")   ' Remove the "/" to get YYYYMMDD

    ' Find the era by comparing YYYYMMDD strings, then number the year within it
    Dim iJapaneseYear As Integer
    If (p_sDate >= (m_sTAISHOSTARTYEAR & m_sTAISHOSTARTMONTH & m_sTAISHOSTARTDAY)) And _
       (p_sDate <= (m_sTAISHOENDYEAR & m_sTAISHOENDMONTH & m_sTAISHOENDDAY)) Then
        iJapaneseYear = CInt(Mid(p_sDate, 1, 4)) - CInt(m_sTAISHOSTARTYEAR) + 1
        sJapaneseDate = m_sTaishoKanjiSymbol & iJapaneseYear & m_sJapaneseYearSymbol
    ElseIf (p_sDate >= (m_sSHOWASTARTYEAR & m_sSHOWASTARTMONTH & m_sSHOWASTARTDAY)) And _
           (p_sDate <= (m_sSHOWAENDYEAR & m_sSHOWAENDMONTH & m_sSHOWAENDDAY)) Then
        iJapaneseYear = CInt(Mid(p_sDate, 1, 4)) - CInt(m_sSHOWASTARTYEAR) + 1
        sJapaneseDate = m_sShowaKanjiSymbol & iJapaneseYear & m_sJapaneseYearSymbol
    ElseIf (p_sDate >= (m_sHEISEISTARTYEAR & m_sHEISEISTARTMONTH & m_sHEISEISTARTDAY)) And _
           (p_sDate <= (m_sHEISEIENDYEAR & m_sHEISEIENDMONTH & m_sHEISEIENDDAY)) Then
        iJapaneseYear = CInt(Mid(p_sDate, 1, 4)) - CInt(m_sHEISEISTARTYEAR) + 1
        sJapaneseDate = m_sHeiseiKanjiSymbol & iJapaneseYear & m_sJapaneseYearSymbol
    Else
        p_sErrorData = "Not a Valid Date or predates the Taisho Era"
        Return ""
    End If

    ' Month and day are carried over unchanged
    sJapaneseDate = sJapaneseDate & Mid(p_sDate, 5, 2) & m_sJapaneseMonthSymbol & _
                    Mid(p_sDate, 7, 2) & m_sJapaneseDaySymbol
    Return sJapaneseDate
End Function
This functionality is also available through the GetJapaneseDate() function of the LangHelper Class, which is provided for download with this article.
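The same era lookup can also be written in a table-driven style. The Python sketch below is an alternative illustration, not the LangHelper code: the function name and the ERAS table are assumptions, and it uses only the three era start dates given in the article, ordered newest first.

```python
from datetime import date

# Era start dates from the article, newest first. Each era runs until the
# next one begins, so only the start date is needed.
ERAS = [
    (date(1989, 1, 8), "平成"),    # Heisei
    (date(1926, 12, 25), "昭和"),  # Showa
    (date(1912, 7, 30), "大正"),   # Taisho
]


def to_japanese_date(text: str) -> str:
    """Convert 'YYYY/MM/DD' to era notation, e.g. 平成14年12月25日."""
    year, month, day = (int(part) for part in text.split("/"))
    d = date(year, month, day)  # raises ValueError for impossible dates
    for start, era_name in ERAS:
        if d >= start:
            era_year = d.year - start.year + 1  # the first year of an era is 1
            return f"{era_name}{era_year}年{d.month:02d}月{d.day:02d}日"
    raise ValueError("date predates the Taisho era")


print(to_japanese_date("2002/12/25"))
```

With the eras held as data, supporting an era that began after the article was written (Reiwa, 令和, from 2019/05/01) would mean adding one row at the top of the table rather than another branch.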
All these functionalities are available through the methods of a single class, LangHelper. As this is only a helper/utility class, it is a single-tier component. From a design point of view, it should preferably be placed inside the Utility namespace of your Web application, if you have one.
The VB.NET-based class LangHelper exposes its functionality through the following methods:
There is also a Strict property in this Class, whose functionality is interesting. This property is typically used in the conversion functions. When I pass a pure Zenkaku String (all characters Zenkaku) to the ZenToHan function, it returns the Hankaku equivalent. If I pass a non-Zenkaku String, it returns an error in the second parameter, which is passed ByRef. But what if I pass a mixed String, in which one part is Zenkaku Katakana and is convertible, but the other part is Kanji and therefore cannot be converted to Hankaku? The result is determined by the value of the Strict property. If this property is set to True, the function returns an empty string ("") and the ByRef parameter, p_sErrorData, carries the error description. If the Strict property is False, all convertible characters are converted and the non-convertible characters are left as they are. In the case above, with Strict set to False, the ZenToHan function returns the Katakana part converted to Hankaku and the Kanji part left intact.
So:
The same functionality applies to all conversion functions.
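The Strict behaviour can be illustrated with a short Python sketch. This is an assumption about how such a function might be built, not the LangHelper implementation: the conversion table is derived with Unicode normalization, the function name is hypothetical, and the (result, error) tuple stands in for the ByRef error parameter.

```python
import unicodedata

# Zenkaku -> Hankaku table, built by normalizing each Hankaku Katakana
# character (U+FF61..U+FF9F) to its Zenkaku form and inverting the mapping.
_ZEN_TO_HAN = {
    unicodedata.normalize("NFKC", chr(cp)): chr(cp)
    for cp in range(0xFF61, 0xFFA0)
}
# Full-width ASCII (U+FF01..U+FF5E) sits 0xFEE0 above its half-width form.
_ZEN_TO_HAN.update({chr(cp): chr(cp - 0xFEE0) for cp in range(0xFF01, 0xFF5F)})


def zen_to_han(text: str, strict: bool = True):
    """Return (converted_text, error_message).

    strict=True:  any non-convertible character yields ("", error).
    strict=False: non-convertible characters are left as they are.
    """
    out = []
    for ch in text:
        # NFD splits a voiced Katakana into base + voicing mark, each of
        # which has its own Hankaku character.
        pieces = unicodedata.normalize("NFD", ch)
        if all(p in _ZEN_TO_HAN for p in pieces):
            out.append("".join(_ZEN_TO_HAN[p] for p in pieces))
        elif strict:
            return "", f"cannot convert {ch!r} to Hankaku"
        else:
            out.append(ch)
    return "".join(out), ""


mixed = "トウキョウ東京"          # Zenkaku Katakana followed by Kanji
print(zen_to_han(mixed, strict=True))
print(zen_to_han(mixed, strict=False))
```

In strict mode the caller gets an all-or-nothing answer, which suits validation; in lenient mode the function acts as a best-effort normalizer, which is how the date routine above uses it.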
The Client Code can call these functions as follows:
The download, in addition to the Class Library, contains two Client applications, which demonstrate the functionalities exposed by the LangHelper class. The first one is a Windows Forms based client, LangHelperWinClient, and the second is a web based aspx form, LangHelperAspxClient. These two differ only from the presentation point of view and do not handle any business logic. The LangHelper Class itself provides all the above-mentioned conversion/validation/misc. functionalities and these two presentation tier components just make function calls to the instance of LangHelper Class.
The download contains a Visual Studio .NET solution (LangHelperLibrarySolution) with three projects: LangHelperLibrary, LangHelperAspxClient and LangHelperWinClient.
Do remember that the Web Server and the Client browser machines need Japanese Language support for this application to work.
Japanese Language Support:
To run the sample application, type Japanese characters, and develop Japanese-language web systems, language support is essential at the operating system (OS) level. This can be added either by:
Start->Settings->Control Panel->Regional Options->Your Locale (Location )->Select the Japanese Checkbox->OK.
Please note that all this is needed only for you to go through the application and see its capabilities. The core of the application is the LangHelper Class; at development time, you just need to copy the LangHelper file into your own project and use the functions. The other two projects (LangHelperAspxClient and LangHelperWinClient) are provided to demonstrate the capabilities of the Class Library and deal only with the presentation tier (UI tier).
The Class Library has no particular limitations. All the functions take a String as a parameter, so you can also pass a single character by storing it in a String variable. Customizing the functionality should be easy, as each function is atomic and modular. For example, to write a customized function that accepts only Hiragana and Kanji characters, you can write a wrapper function that picks up each character of the string passed to it and calls the isHiragana and isKanji functions:
' Customised function to extend the validation functionality to check
' whether every character of a String is either Hiragana or Kanji
Public Function isHiraganaOrKanji(ByVal p_sData As String) As Boolean
    Dim iCtr As Integer
    Dim objLangHelper As New LangHelper()
    For iCtr = 1 To Len(p_sData)
        ' Take one character at a time and test it against both scripts
        Dim sChar As String = Mid(p_sData, iCtr, 1)
        If Not objLangHelper.isHiragana(sChar) AndAlso Not objLangHelper.isKanji(sChar) Then
            Return False
        End If
    Next
    Return True
End Function
That should solve the common problems in implementing a web-based solution for Japanese customers. These requirements are very common in Japanese environments and can be met with the single-point validation and conversion routines provided by the class library. Also, once the encoding issue is resolved at the system architecture stage, there should be no surprises at development or deployment time.
Article Information | |
---|---|
Author | M Sitarman |
Chief Technical Editor | John R. Chapman |
Project Manager | Helen Cuthill |
Reviewers | Sean Schade, Saurahb Nandu |
ASPToday is brought to you by Wrox Press. Copyright © 2002 Wrox Press. All Rights Reserved.