UCS2 Formats on SIM Cards

 

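Chinese text on a SIM card is always stored in UCS2 format. UCS2 and "Unicode" (as the term is used on Windows, i.e. wide-character strings) differ only in byte order: Unicode is little-endian, while UCS2 on the SIM is big-endian.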
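A quick way to see the byte-order difference is to encode one character both ways. The sketch below uses Python's built-in `utf-16-le` and `utf-16-be` codecs as stand-ins for Windows-style "Unicode" and SIM-style UCS2; this substitution is an assumption that holds for characters in the Basic Multilingual Plane, where UTF-16 and UCS2 produce the same bytes.

```python
# U+4E2D is the character "中".
ch = "\u4e2d"

little = ch.encode("utf-16-le")  # low byte first, as in Windows wide strings
big = ch.encode("utf-16-be")     # high byte first, as stored on the SIM

print(little.hex().upper())  # 2D4E
print(big.hex().upper())     # 4E2D
```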

 

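Conversion between UCS2 and GB2312 can be done with the WideCharToMultiByte and MultiByteToWideChar functions available in VC (Visual C++). Because those functions work with little-endian wide characters, the two bytes of each character have to be swapped when moving to or from SIM data.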
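Outside Visual C++, the same round trip can be sketched with Python's standard codecs. This is an illustrative analogue, not the Win32 API: `gb2312_to_ucs2` and `ucs2_to_gb2312` are hypothetical helper names, and the `gb2312` and `utf-16-be` codecs stand in for the code-page arguments that the Win32 functions take.

```python
def gb2312_to_ucs2(data: bytes) -> bytes:
    """Decode GB2312 bytes and re-encode them as big-endian UCS2."""
    return data.decode("gb2312").encode("utf-16-be")


def ucs2_to_gb2312(data: bytes) -> bytes:
    """Decode big-endian UCS2 bytes and re-encode them as GB2312."""
    return data.decode("utf-16-be").encode("gb2312")


gb = bytes.fromhex("D6D0B9FA")      # "中国" in GB2312
ucs2 = gb2312_to_ucs2(gb)
print(ucs2.hex().upper())           # 4E2D56FD
print(ucs2_to_gb2312(ucs2) == gb)   # True
```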

 

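UCS2 itself comes in three formats on the SIM. The most common is the 80 format: the field starts with 80, and every two bytes represent one character. There are also the 81 and 82 formats, both of which can represent a Chinese character with a single byte. Under certain conditions, text can be converted among the 80, 81, 82 and GB2312 encodings.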

 

The following is a brief explanation of the specification.

 

Annex B (normative):

Coding of Alpha fields in the SIM for UCS2

 

If 16 bit UCS2 characters as defined in ISO/IEC 10646 [31] are being used in an alpha field, the coding can take one of three forms. If the ME supports UCS2 coding of alpha fields in the SIM, the ME shall support all three coding schemes for character sets containing 128 characters or less; for character sets containing more than 128 characters, the ME shall at least support the first coding scheme. If the alpha field record contains GSM default alphabet characters only, then none of these schemes shall be used in that record. Within a record, only one coding scheme, either GSM default alphabet, or one of the three described below, shall be used.

 

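In short: an alpha field that holds 16-bit UCS2 characters, as defined in ISO/IEC 10646 [31], can be coded in one of three ways. An ME (mobile equipment) that supports UCS2 alpha fields must support all three schemes for character sets of 128 characters or fewer, and at least the first scheme for larger character sets. A record that contains only GSM default alphabet characters uses none of these schemes, and any single record uses exactly one coding: either the GSM default alphabet or one of the three schemes described below.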

 

1)     If the first octet in the alpha string is '80', then the remaining octets are 16 bit UCS2 characters, with the more significant octet (MSO) of the UCS2 character coded in the lower numbered octet of the alpha field, and the less significant octet (LSO) of the UCS2 character is coded in the higher numbered alpha field octet, i.e. octet 2 of the alpha field contains the more significant octet (MSO) of the first UCS2 character, and octet 3 of the alpha field contains the less significant octet (LSO) of the first UCS2 character (as shown below).  Unused octets shall be set to 'FF', and if the alpha field is an even number of octets in length, then the last (unusable) octet shall be set to 'FF'.

 

Example 1

 

Octet 1 | Octet 2 | Octet 3 | Octet 4 | Octet 5 | Octet 6 | Octet 7 | Octet 8 | Octet 9
'80'    | Ch1MSO  | Ch1LSO  | Ch2MSO  | Ch2LSO  | Ch3MSO  | Ch3LSO  | 'FF'    | 'FF'

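In other words, a field that begins with 80 is in UCS2 format, with the high byte first and the low byte second, and unused bytes are padded with FF.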

 

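For example, take the Chinese characters "中国":

GB2312 internal code:      D6D0B9FA

UCS2 80-scheme encoding:   4E2D56FD   (in the alpha field this follows the 80 marker octet: 80 4E2D 56FD)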
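The 80 scheme can be sketched in a few lines of Python. `encode_80` and `decode_80` are hypothetical helper names, not part of any SIM library, and the code assumes every character lies in the Basic Multilingual Plane so that `utf-16-be` yields exactly two bytes per character.

```python
def encode_80(text: str, field_len: int) -> bytes:
    """Build an alpha field of field_len octets using the '80' scheme."""
    if any(ord(c) > 0xFFFF for c in text):
        raise ValueError("UCS2 cannot represent characters above U+FFFF")
    body = text.encode("utf-16-be")          # MSO first, then LSO
    if 1 + len(body) > field_len:
        raise ValueError("text does not fit in the alpha field")
    # Every unused octet is set to FF.
    return (b"\x80" + body).ljust(field_len, b"\xff")


def decode_80(field: bytes) -> str:
    """Decode an '80'-scheme alpha field, stopping at the FF padding."""
    if not field or field[0] != 0x80:
        raise ValueError("not an '80'-scheme alpha field")
    chars = []
    # Walk the field two octets at a time; a trailing single octet is unusable.
    for i in range(1, len(field) - 1, 2):
        pair = field[i:i + 2]
        if pair == b"\xff\xff":
            break
        chars.append(pair.decode("utf-16-be"))
    return "".join(chars)


field = encode_80("\u4e2d\u56fd", 9)     # "中国" in a 9-octet field
print(field.hex().upper())               # 804E2D56FDFFFFFFFF
print(decode_80(field) == "\u4e2d\u56fd")  # True
```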

 

 

 

2)     If the first octet of the alpha string is set to '81', then the second octet contains a value indicating the number of characters in the string, and the third octet contains an 8 bit number which defines bits 15 to 8 of a 16 bit base pointer, where bit 16 is set to zero, and bits 7 to 1 are also set to zero.  These sixteen bits constitute a base pointer to a "half-page" in the UCS2 code space, to be used with some or all of the remaining octets in the string. The fourth and subsequent octets in the string contain codings as follows; if bit 8 of the octet is set to zero, the remaining 7 bits of the octet contain a GSM Default Alphabet character, whereas if bit 8 of the octet is set to one, then the remaining seven bits are an offset value added to the 16 bit base pointer defined earlier, and the resultant 16 bit value is a UCS2 code point, and completely defines a UCS2 character.

 

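In other words: when the first octet of the alpha string is '81', the second octet gives the number of characters in the string, and the third octet supplies bits 15 to 8 of a 16-bit base pointer whose bit 16 and bits 7 to 1 are all zero. That pointer selects a "half-page" of the UCS2 code space. From the fourth octet onward, an octet with bit 8 set to zero carries a GSM default alphabet character in its remaining 7 bits, while an octet with bit 8 set to one carries a 7-bit offset that is added to the base pointer to give a complete UCS2 code point.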

 

Example 2

 

Octet 1 | Octet 2 | Octet 3 | Octet 4 | Octet 5 | Octet 6 | Octet 7 | Octet 8 | Octet 9
'81'    | '05'    | '13'    | '53'    | '95'    | 'A6'    | 'XX'    | 'FF'    | 'FF'

 

 

   In the above example;

 

-  Octet 2 indicates there are 5 characters in the string.

 

-  Octet 3 indicates bits 15 to 8 of the base pointer, and indicates a bit pattern of 0hhh hhhh h000 0000 as the 16 bit base pointer number. Bengali characters for example start at code position 0980 (0000 1001 1000 0000), which is indicated by the coding '13' in octet 3 (shown by the italicised digits).

 

-  Octet 4 indicates GSM Default Alphabet character '53', i.e. "S".

 

-  Octet 5 indicates a UCS2 character offset to the base pointer of '15', expressed in binary as follows 001 0101, which, when added to the base pointer value results in a sixteen bit value of 0000 1001 1001 0101, i.e. '0995', which is the Bengali letter KA.

 

- Octet 8 contains the value 'FF', but as the string length is 5, this is a valid character in the string, where the bit pattern 111 1111 is added to the base pointer, yielding a sixteen bit value of 0000 1001 1111 1111 for the UCS2 character (i.e. '09FF').
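The decoding rules for the '81' scheme can be sketched as follows. `decode_81` is a hypothetical helper name; to avoid assuming a GSM default alphabet table, it returns tagged values rather than text: `("gsm", code)` for a GSM default alphabet character and `("ucs2", code_point)` for a UCS2 character.

```python
def decode_81(field: bytes) -> list:
    """Decode an '81'-scheme alpha field into tagged character values."""
    if len(field) < 3 or field[0] != 0x81:
        raise ValueError("not an '81'-scheme alpha field")
    count = field[1]              # octet 2: number of characters
    base = field[2] << 7          # octet 3: 0hhh hhhh h000 0000
    data = field[3:3 + count]     # the length, not FF, marks the end
    if len(data) < count:
        raise ValueError("field is shorter than its declared length")
    result = []
    for octet in data:
        if octet & 0x80:
            # Bit 8 set: the low 7 bits are an offset from the base pointer.
            result.append(("ucs2", base + (octet & 0x7F)))
        else:
            # Bit 8 clear: the low 7 bits are a GSM default alphabet code.
            result.append(("gsm", octet))
    return result


# A three-character string using octets from the example: '53', '95', 'FF'.
print(decode_81(bytes.fromhex("8103135395FFFFFF")))
# [('gsm', 83), ('ucs2', 2453), ('ucs2', 2559)]
```

Here 2453 is 0x0995 (Bengali letter KA) and 2559 is 0x09FF, which shows that 'FF' inside the declared length is a character and not padding.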

 

 

The text above is long and messy, and I translated a few sentences of it, but it really boils down to one point.

 

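What the passage says is this: the 81 format defines a base pointer, and each byte then represents one UCS2 character relative to that base. To display the text as UCS2, first compute the base pointer, then turn each byte into a 16-bit UCS2 code as used in the 80 format.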

 

Once you have the 80-format codes, the rest is easy.

 

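In terms of layout, 81 is the marker, followed by a one-byte length and then the base-pointer byte. To obtain the 16-bit base pointer, shift that byte left by 7 bits, which leaves the top bit and the low 7 bits at 0 (see the English text above for details). The data bytes come last.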
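The shift-by-7 relationship between the base-pointer byte and the 16-bit base pointer can be checked directly. `base_from_octet3` and `octet3_from_base` are hypothetical names used only for this sketch.

```python
def base_from_octet3(octet3: int) -> int:
    """Expand the 8-bit value in octet 3 into the 16-bit base pointer."""
    return (octet3 & 0xFF) << 7


def octet3_from_base(base: int) -> int:
    """Compress a base pointer into octet 3, if the 81 scheme can express it."""
    # Bit 16 (0x8000) and bits 7 to 1 (0x007F) must all be zero.
    if base < 0 or base > 0xFFFF or base & 0x807F:
        raise ValueError("base pointer not representable in the 81 scheme")
    return base >> 7


print(hex(base_from_octet3(0x13)))    # 0x980, the Bengali block
print(hex(octet3_from_base(0x4E00)))  # 0x9c
```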

 

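Because of how the value ranges are defined, each data byte in the 81 format can take at most 256 values, and the one-byte length limits a string to 255 characters. Values 00-7F denote GSM default alphabet characters such as English letters and digits, and values 80-FF denote offsets from the base pointer. A string can therefore use at most 128 distinct Chinese characters, and their UCS2 80-format codes must all fall within the same block of 128 consecutive code points. This is also why the values 30 and 80 are handled differently: 30 directly represents '0', whereas 80 has to be resolved through the base pointer. (The same applies to the 82 format.)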
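Whether a given string satisfies this constraint can be tested by collecting the 128-code-point blocks it touches. This is a sketch with hypothetical names (`half_pages`, `fits_81`), and it makes the simplifying assumption that every character below U+0080 is sent as a GSM default alphabet character.

```python
def half_pages(text: str) -> set:
    """Return the 128-aligned UCS2 blocks ("half-pages") that text touches."""
    return {ord(c) & ~0x7F for c in text if ord(c) >= 0x80}


def fits_81(text: str) -> bool:
    """True if all non-GSM characters share one half-page below 0x8000."""
    pages = half_pages(text)
    return len(pages) == 1 and next(iter(pages)) < 0x8000


consecutive = "".join(chr(0x4E00 + i) for i in range(10))  # U+4E00..U+4E09
print(fits_81(consecutive))     # True: everything lies in the 0x4E00 block
print(fits_81("\u4e2d\u56fd"))  # False: 0x4E2D and 0x56FD are in different blocks
```

A string that fails this test still fits the 80 format, which has no such restriction.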

 

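For example, take the Chinese characters 一丁丂七丄丅丆万丈三:

GB2312 internal code:  D2BB B6A1 8140 C6DF 8141 8142 8143 CDF2 D5C9 C8FD   (丂, 丄, 丅 and 丆 are outside GB2312 proper; 8140 to 8143 are their GBK codes)

80-format encoding:    4E00 4E01 4E02 4E03 4E04 4E05 4E06 4E07 4E08 4E09   (consecutive)

81 encoding:           0A 9C 80 81 82 83 84 85 86 87 88 89   (consecutive; these bytes follow the 81 marker octet)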
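The 81 encoding of these ten characters can be reproduced with a short encoder. `encode_81` is a hypothetical name, and the sketch assumes that only ASCII letters, digits and the space are passed as GSM default alphabet characters, since those have the same code in both alphabets; any other character below U+0080 is rejected rather than guessed.

```python
import string

# ASSUMPTION: restrict GSM characters to those whose GSM code equals ASCII.
GSM_SAFE = set(string.ascii_letters + string.digits + " ")


def encode_81(text: str) -> bytes:
    """Encode text in the '81' scheme: marker, length, base byte, data."""
    high = [ord(c) for c in text if ord(c) >= 0x80]
    if not high:
        raise ValueError("GSM-only text should not use the 81 scheme")
    if len(text) > 255:
        raise ValueError("length must fit in one octet")
    base = high[0] & ~0x7F                   # 128-aligned half-page
    if base > 0x7F80 or any((cp & ~0x7F) != base for cp in high):
        raise ValueError("characters do not share one 81-scheme half-page")
    out = bytearray([0x81, len(text), base >> 7])
    for c in text:
        cp = ord(c)
        if cp >= 0x80:
            out.append(0x80 | (cp - base))   # bit 8 set: offset from base
        elif c in GSM_SAFE:
            out.append(cp)                   # bit 8 clear: GSM character
        else:
            raise ValueError("character needs a real GSM alphabet table")
    return bytes(out)


chars = "".join(chr(0x4E00 + i) for i in range(10))  # U+4E00..U+4E09
print(encode_81(chars).hex().upper())  # 810A9C80818283848586878889
```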

 

3) If the first octet of the alpha string is set to '82', then the second octet contains a value indicating the number of characters in the string, and the third and fourth octets contain a 16 bit number which defines the complete 16 bit base pointer to a "half-page" in the UCS2 code space, for use with some or all of the remaining octets in the string. The fifth and subsequent octets in the string contain codings as follows; if bit 8 of the octet is set to zero, the remaining 7 bits of the octet contain a GSM Default Alphabet character, whereas if bit 8 of the octet is set to one, the remaining seven bits are an offset value added to the base pointer defined in octets three and four, and the resultant 16 bit value is a UCS2 code point, and defines a UCS2 character.

 

Example 3

 

Octet 1 | Octet 2 | Octet 3 | Octet 4 | Octet 5 | Octet 6 | Octet 7 | Octet 8 | Octet 9
'82'    | '05'    | '05'    | '30'    | '2D'    | '82'    | 'D3'    | '2D'    | '31'

 

 

   In the above example

 

-  Octet 2 indicates there are 5 characters in the string.

 

-  Octets 3 and 4 contain a sixteen bit base pointer number of '0530', pointing to the first character of the Armenian character set.

 

-  Octet 5 contains a GSM Default Alphabet character of '2D', which is a dash "-".

 

-  Octet 6 contains a value '82', which indicates it is an offset of '02' added to the base pointer, resulting in a UCS2 character code of '0532', which represents Armenian character Capital BEN.

 

-  Octet 7 contains a value 'D3', an offset of '53', which when added to the base pointer results in a UCS2 code point of '0583', representing Armenian Character small PIWR.
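The '82' rules differ from '81' only in how the base pointer is read, which the following sketch makes explicit. `decode_82` is a hypothetical helper name; as a simplification it returns tagged values, `("gsm", code)` or `("ucs2", code_point)`, instead of assuming a GSM default alphabet table.

```python
def decode_82(field: bytes) -> list:
    """Decode an '82'-scheme alpha field into tagged character values."""
    if len(field) < 4 or field[0] != 0x82:
        raise ValueError("not an '82'-scheme alpha field")
    count = field[1]                    # octet 2: number of characters
    base = (field[2] << 8) | field[3]   # octets 3 and 4: full 16-bit base
    data = field[4:4 + count]
    if len(data) < count:
        raise ValueError("field is shorter than its declared length")
    result = []
    for octet in data:
        if octet & 0x80:
            result.append(("ucs2", base + (octet & 0x7F)))
        else:
            result.append(("gsm", octet))
    return result


decoded = decode_82(bytes.fromhex("820505302D82D32D31"))
print([(kind, hex(value)) for kind, value in decoded])
# [('gsm', '0x2d'), ('ucs2', '0x532'), ('ucs2', '0x583'), ('gsm', '0x2d'), ('gsm', '0x31')]
```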

 

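The 82 format is similar to the 81 format. The difference is that the 81 format uses one byte for the base pointer, whereas the 82 format uses two bytes.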

 

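For example, take the Chinese characters 一丁丂七丄丅丆万丈三 again:

GB2312 internal code:  D2BB B6A1 8140 C6DF 8141 8142 8143 CDF2 D5C9 C8FD   (8140 to 8143 are GBK codes)

80-format encoding:    4E00 4E01 4E02 4E03 4E04 4E05 4E06 4E07 4E08 4E09   (consecutive)

81 encoding:           0A 9C 80 81 82 83 84 85 86 87 88 89   (consecutive; follows the 81 marker octet)

82 encoding:           0A 4E00 80 81 82 83 84 85 86 87 88 89   (consecutive; follows the 82 marker octet)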
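An encoder for the 82 scheme can be sketched the same way. `encode_82` is a hypothetical name, and choosing the lowest non-GSM code point as the base pointer is one possible policy, not something the specification mandates. As before, the sketch assumes only ASCII letters, digits and the space are passed as GSM default alphabet characters.

```python
import string

# ASSUMPTION: restrict GSM characters to those whose GSM code equals ASCII.
GSM_SAFE = set(string.ascii_letters + string.digits + " ")


def encode_82(text: str) -> bytes:
    """Encode text in the '82' scheme: marker, length, 2-byte base, data."""
    high = [ord(c) for c in text if ord(c) >= 0x80]
    if not high:
        raise ValueError("GSM-only text should not use the 82 scheme")
    if len(text) > 255:
        raise ValueError("length must fit in one octet")
    base = min(high)                        # full 16-bit base pointer
    if max(high) > 0xFFFF or max(high) - base > 0x7F:
        raise ValueError("characters span more than 128 code points")
    out = bytearray([0x82, len(text), base >> 8, base & 0xFF])
    for c in text:
        cp = ord(c)
        if cp >= 0x80:
            out.append(0x80 | (cp - base))  # bit 8 set: offset from base
        elif c in GSM_SAFE:
            out.append(cp)                  # bit 8 clear: GSM character
        else:
            raise ValueError("character needs a real GSM alphabet table")
    return bytes(out)


chars = "".join(chr(0x4E00 + i) for i in range(10))  # U+4E00..U+4E09
print(encode_82(chars).hex().upper())  # 820A4E0080818283848586878889
```

Because the base pointer is a full 16-bit value, it does not have to sit on a 128-code-point boundary. For instance, `encode_82("\u4e7f\u4e80")` succeeds with base 4E7F, although those two characters fall in different aligned blocks and so could not share an 81-format base.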