libc 之 locales
Table of Contents
1 Locales
软件的国际化,意味着使软件符合用户的习惯。 ISO C 中,通过 locale 来实现这一目的。
每一台机器可以支持多个 locales , 用户可以通过环境变量来设置程序将要使用的 locale.
1.1 Locale 的作用
每个 locale 均由若干为不同目的而定义的规范构成。 这些规范包括:
- 什么样的宽字符序列是合法的,以及如何来解释他们。
- 如何对字符进行分类。
- 本地语言和字符的对照表。
- 如何格式化数字的显示。
- 输出以及错误提示使用何种语言。
- 使用何种语言来回答 yes-or-no questions。
- 使用何种语言来应对复杂的用户输入。
1.2 Locale 的选择
选择 (设置) Locale 的最简方法是设置环境变量: LANG , 该方法将会选择这个 locale 的所有规范。例如:
[yyc@localhost ~]$ locale LANG=en_US.UTF-8 LC_CTYPE="en_US.UTF-8" LC_NUMERIC="en_US.UTF-8" LC_TIME="en_US.UTF-8" LC_COLLATE="en_US.UTF-8" LC_MONETARY="en_US.UTF-8" LC_MESSAGES="en_US.UTF-8" LC_PAPER="en_US.UTF-8" LC_NAME="en_US.UTF-8" LC_ADDRESS="en_US.UTF-8" LC_TELEPHONE="en_US.UTF-8" LC_MEASUREMENT="en_US.UTF-8" LC_IDENTIFICATION="en_US.UTF-8" LC_ALL=
同时,我们也可以单独设置一个 locale 中的某个单独的规范, 例如早期的 fcitx (Linux 下的中文输入法), 要求 LC_CTYPE 必须为 GB2312 , 则可以进行如下设置:
[yyc@localhost ~]$ export LC_CTYPE="zh_CN.GB2312" [yyc@localhost ~]$ locale LANG=en_US.UTF-8 LC_CTYPE=zh_CN.GB2312 LC_NUMERIC="en_US.UTF-8" LC_TIME="en_US.UTF-8" LC_COLLATE="en_US.UTF-8" LC_MONETARY="en_US.UTF-8" LC_MESSAGES="en_US.UTF-8" LC_PAPER="en_US.UTF-8" LC_NAME="en_US.UTF-8" LC_ADDRESS="en_US.UTF-8" LC_TELEPHONE="en_US.UTF-8" LC_MEASUREMENT="en_US.UTF-8" LC_IDENTIFICATION="en_US.UTF-8" LC_ALL=
一个系统不一定支持所有的 locales , 但所有的系统都需要支持一个标准的 Locale —— "C" 或者 "POSIX" 。
1.3 Locales 影响到的 Activities 的类别
locale 定义的规范可以分为若干类别,这些类别如下, 其中,每个类别的名字既可以作为环境变量名而在环境变量中找到, 也可以作为宏名在函数 setlocale 中作为参数。
- LC_COLLATE
影响字符串的校对。
- LC_TYPE
影响字符的分类,以及将字符转换成多字节和宽字符。
- LC_MONETARY
影响估计货币的格式化输出。
- LC_NUMERIC
影响数字的格式化输出。
- LC_TIME
影响日期和时间的格式化输出。
- LC_MESSAGES
影响用户接口中消息中使用的语言及用于匹配 yes-or-no questions 答案的正则表达式。
- LC_ALL
该符号并非环境变量,用在 setlocale() 中,用于设置上述所有的类别。
- LANG
如果设置了该环境变量,则该环境变量的值会影响上述所有的类别, 除非用户又显示地、重新设置了上述类别中的某一个。
1.4 Locale 的设置
由 C Family 编写的应用程序启动时可以自动继承通过环境变量设置的 locale , 但这种继承仅限于应用程序本身,对应用程序所使用的库不起作用 —— 这些库提供的函数将默认使用标准库中的 C Locale 。
我们可以通过 setlocale() 来通知库函数使用由环境变量指定的 locale:
setlocale(LC_ALL, "");
setlocale() 还可以用来指定 locale 中的某个单独的规范:
char * setlocale (int CATEGORY, const char *LOCALE);
该函数用于将当前 Locale 中的 CATEGORY 设置为 LOCALE 。
- 如果 *LOCALE 为 NULL, 则返回当前使用的 LOCALE;
- 如果 *LOCALE 不为 NULL且合法, 则返回当设置成功后使用的 LOCALE;
- 如果 *LOCALE 不为 NULL且不合法, 则当前 locale 不变,函数返回 NULL。
1.5 标准 Locales
前面提到,并非所有的系统都支持所有的 locales , 但是所有的系统都必须支持若干标准的 locales, 这些标准 Locales 包括:
- C:
由标准 C 指定的 locale , 其属性和行为均符合 ISO C 标准。
- POSIX:
POSIX locale,Linux 下的 POSIX locale 当前与 C 完全一样。 - ""
空 locale ,使用该 locale 的程序会自动使用环境变量中规定的 locale 。locales 的定义和安装通常是由系统管理员完成的。
1.6 Locale 信息的获取
有多种方式可以用于获取 locale 信息, 其中最简单的方法是让 C library 自己去获取, 很多 Library 都可以这样去做。 以 strftime() 为例,同样的代码,在不同的 locale 下,输出会随 locale 而变。
但 有时程序无法自动完成 locale 信息的获取, 此时我们足要自己去做。 用来完成这个目的的函数有两个 localeconv() 和 nl_langinfo() 。 其中,前者是 标准C 提供的,可移植性好,但借口超烂。后者是 Unix 接口, 只要系统遵循 Unix 标准,就可以使用。
1.6.1 蹩脚的 localeconv
localeconv() 同 setlocale() 一样,是由标准 C 提供的,可移植, 但使用代价昂贵,可拓展性差。并且,它接提供了访问 locale 中的 LC_MONETARY 和 LC_NUMERIC , 通用性差。
localeconv() 原型为:
struct lconv * localeconv (void);
该函数返回一个 lconv 结构的指针, lconv 结构中的元素包含了如何在当前 locale 中格式化输出数字和货币的一些信息。 Glibc 中,其定义如下:
/* Structure giving information about numeric and monetary notation. */ struct lconv { /* Numeric (non-monetary) information. */ char *decimal_point; /* Decimal point character. */ char *thousands_sep; /* Thousands separator. */ /* Each element is the number of digits in each group; elements with higher indices are farther left. An element with value CHAR_MAX means that no further grouping is done. An element with value 0 means that the previous element is used for all groups farther left. */ char *grouping; /* Monetary information. */ /* First three chars are a currency symbol from ISO 4217. Fourth char is the separator. Fifth char is '\0'. */ char *int_curr_symbol; char *currency_symbol; /* Local currency symbol. */ char *mon_decimal_point; /* Decimal point character. */ char *mon_thousands_sep; /* Thousands separator. */ char *mon_grouping; /* Like `grouping' element (above). */ char *positive_sign; /* Sign for positive values. */ char *negative_sign; /* Sign for negative values. */ char int_frac_digits; /* Int'l fractional digits. */ char frac_digits; /* Local fractional digits. */ /* 1 if currency_symbol precedes a positive value, 0 if succeeds. */ char p_cs_precedes; /* 1 iff a space separates currency_symbol from a positive value. */ char p_sep_by_space; /* 1 if currency_symbol precedes a negative value, 0 if succeeds. */ char n_cs_precedes; /* 1 iff a space separates currency_symbol from a negative value. */ char n_sep_by_space; /* Positive and negative sign positions: 0 Parentheses surround the quantity and currency_symbol. 1 The sign string precedes the quantity and currency_symbol. 2 The sign string follows the quantity and currency_symbol. 3 The sign string immediately precedes the currency_symbol. 4 The sign string immediately follows the currency_symbol. */ char p_sign_posn; char n_sign_posn; #ifdef __USE_ISOC99 /* 1 if int_curr_symbol precedes a positive value, 0 if succeeds. */ char int_p_cs_precedes; /* 1 iff a space separates int_curr_symbol from a positive value. */ char int_p_sep_by_space; /* 1 if int_curr_symbol precedes a negative value, 0 if succeeds. */ char int_n_cs_precedes; /* 1 iff a space separates int_curr_symbol from a negative value. */ char int_n_sep_by_space; /* Positive and negative sign positions: 0 Parentheses surround the quantity and int_curr_symbol. 1 The sign string precedes the quantity and int_curr_symbol. 2 The sign string follows the quantity and int_curr_symbol. 3 The sign string immediately precedes the int_curr_symbol. 4 The sign string immediately follows the int_curr_symbol. */ char int_p_sign_posn; char int_n_sign_posn; #else char __int_p_cs_precedes; char __int_p_sep_by_space; char __int_n_cs_precedes; char __int_n_sep_by_space; char __int_p_sign_posn; char __int_n_sign_posn; #endif };
具体含义,参考其中注释。
1.6.2 优雅、迅捷的 nl_langinfo
char *nl_langinfo(ln_item ITEM);
nl_langinfo() 用于访问 locale 中的细节,粒度细,速度快。 其中, ITEM 定义在头文件 langinfo.h 中,解释如下:
`CODESET' `nl_langinfo' returns a string with the name of the coded character set used in the selected locale. `ABDAY_1' `ABDAY_2' `ABDAY_3' `ABDAY_4' `ABDAY_5' `ABDAY_6' `ABDAY_7' `nl_langinfo' returns the abbreviated weekday name. `ABDAY_1' corresponds to Sunday. `DAY_1' `DAY_2' `DAY_3' `DAY_4' `DAY_5' `DAY_6' `DAY_7' Similar to `ABDAY_1' etc., but here the return value is the unabbreviated weekday name. `ABMON_1' `ABMON_2' `ABMON_3' `ABMON_4' `ABMON_5' `ABMON_6' `ABMON_7' `ABMON_8' `ABMON_9' `ABMON_10' `ABMON_11' `ABMON_12' The return value is abbreviated name of the month. `ABMON_1' corresponds to January. `MON_1' `MON_2' `MON_3' `MON_4' `MON_5' `MON_6' `MON_7' `MON_8' `MON_9' `MON_10' `MON_11' `MON_12' Similar to `ABMON_1' etc., but here the month names are not abbreviated. Here the first value `MON_1' also corresponds to January. `AM_STR' `PM_STR' The return values are strings which can be used in the representation of time as an hour from 1 to 12 plus an am/pm specifier. Note that in locales which do not use this time representation these strings might be empty, in which case the am/pm format cannot be used at all. `D_T_FMT' The return value can be used as a format string for `strftime' to represent time and date in a locale-specific way. `D_FMT' The return value can be used as a format string for `strftime' to represent a date in a locale-specific way. `T_FMT' The return value can be used as a format string for `strftime' to represent time in a locale-specific way. `T_FMT_AMPM' The return value can be used as a format string for `strftime' to represent time in the am/pm format. Note that if the am/pm format does not make any sense for the selected locale, the return value might be the same as the one for `T_FMT'. `ERA' The return value represents the era used in the current locale. Most locales do not define this value. An example of a locale which does define this value is the Japanese one. In Japan, the traditional representation of dates includes the name of the era corresponding to the then-emperor's reign. Normally it should not be necessary to use this value directly. Specifying the `E' modifier in their format strings causes the `strftime' functions to use this information. The format of the returned string is not specified, and therefore you should not assume knowledge of it on different systems. `ERA_YEAR' The return value gives the year in the relevant era of the locale. As for `ERA' it should not be necessary to use this value directly. `ERA_D_T_FMT' This return value can be used as a format string for `strftime' to represent dates and times in a locale-specific era-based way. `ERA_D_FMT' This return value can be used as a format string for `strftime' to represent a date in a locale-specific era-based way. `ERA_T_FMT' This return value can be used as a format string for `strftime' to represent time in a locale-specific era-based way. `ALT_DIGITS' The return value is a representation of up to 100 values used to represent the values 0 to 99. As for `ERA' this value is not intended to be used directly, but instead indirectly through the `strftime' function. When the modifier `O' is used in a format which would otherwise use numerals to represent hours, minutes, seconds, weekdays, months, or weeks, the appropriate value for the locale is used instead. `INT_CURR_SYMBOL' The same as the value returned by `localeconv' in the `int_curr_symbol' element of the `struct lconv'. `CURRENCY_SYMBOL' `CRNCYSTR' The same as the value returned by `localeconv' in the `currency_symbol' element of the `struct lconv'. `CRNCYSTR' is a deprecated alias still required by Unix98. `MON_DECIMAL_POINT' The same as the value returned by `localeconv' in the `mon_decimal_point' element of the `struct lconv'. `MON_THOUSANDS_SEP' The same as the value returned by `localeconv' in the `mon_thousands_sep' element of the `struct lconv'. `MON_GROUPING' The same as the value returned by `localeconv' in the `mon_grouping' element of the `struct lconv'. `POSITIVE_SIGN' The same as the value returned by `localeconv' in the `positive_sign' element of the `struct lconv'. `NEGATIVE_SIGN' The same as the value returned by `localeconv' in the `negative_sign' element of the `struct lconv'. `INT_FRAC_DIGITS' The same as the value returned by `localeconv' in the `int_frac_digits' element of the `struct lconv'. `FRAC_DIGITS' The same as the value returned by `localeconv' in the `frac_digits' element of the `struct lconv'. `P_CS_PRECEDES' The same as the value returned by `localeconv' in the `p_cs_precedes' element of the `struct lconv'. `P_SEP_BY_SPACE' The same as the value returned by `localeconv' in the `p_sep_by_space' element of the `struct lconv'. `N_CS_PRECEDES' The same as the value returned by `localeconv' in the `n_cs_precedes' element of the `struct lconv'. `N_SEP_BY_SPACE' The same as the value returned by `localeconv' in the `n_sep_by_space' element of the `struct lconv'. `P_SIGN_POSN' The same as the value returned by `localeconv' in the `p_sign_posn' element of the `struct lconv'. `N_SIGN_POSN' The same as the value returned by `localeconv' in the `n_sign_posn' element of the `struct lconv'. `INT_P_CS_PRECEDES' The same as the value returned by `localeconv' in the `int_p_cs_precedes' element of the `struct lconv'. `INT_P_SEP_BY_SPACE' The same as the value returned by `localeconv' in the `int_p_sep_by_space' element of the `struct lconv'. `INT_N_CS_PRECEDES' The same as the value returned by `localeconv' in the `int_n_cs_precedes' element of the `struct lconv'. `INT_N_SEP_BY_SPACE' The same as the value returned by `localeconv' in the `int_n_sep_by_space' element of the `struct lconv'. `INT_P_SIGN_POSN' The same as the value returned by `localeconv' in the `int_p_sign_posn' element of the `struct lconv'. `INT_N_SIGN_POSN' The same as the value returned by `localeconv' in the `int_n_sign_posn' element of the `struct lconv'. `DECIMAL_POINT' `RADIXCHAR' The same as the value returned by `localeconv' in the `decimal_point' element of the `struct lconv'. The name `RADIXCHAR' is a deprecated alias still used in Unix98. `THOUSANDS_SEP' `THOUSEP' The same as the value returned by `localeconv' in the `thousands_sep' element of the `struct lconv'. The name `THOUSEP' is a deprecated alias still used in Unix98. `GROUPING' The same as the value returned by `localeconv' in the `grouping' element of the `struct lconv'. `YESEXPR' The return value is a regular expression which can be used with the `regex' function to recognize a positive response to a yes/no question. The GNU C library provides the `rpmatch' function for easier handling in applications. `NOEXPR' The return value is a regular expression which can be used with the `regex' function to recognize a negative response to a yes/no question. `YESSTR' The return value is a locale-specific translation of the positive response to a yes/no question. Using this value is deprecated since it is a very special case of message translation, and is better handled by the message translation functions (*note Message Translation::). The use of this symbol is deprecated. Instead message translation should be used. `NOSTR' The return value is a locale-specific translation of the negative response to a yes/no question. What is said for `YESSTR' is also true here. The use of this symbol is deprecated. Instead message translation should be used.