Monday 1 November 2010

Internationalization and Localization

This article explains how to configure a Perforce Server to run in internationalization mode and how to configure Perforce clients to work with different character sets. It also discusses problems you might encounter when handling Unicode or non-ASCII data in Perforce, and how to remedy them.


In Perforce there are several ways to work with multiple character sets, depending on your requirements:

With the addition of "utf16" as another standard Perforce filetype (see p4 help filetypes for details), it is possible to continue running your Perforce server in "non-unicode" mode and still safely manage your unicode files.

If your filenames or Perforce metadata contain non-ASCII characters, then your Perforce administrator might need to consider switching your Perforce Server into unicode mode as described below. When running in unicode mode, all non-file data (identifiers, descriptions, and so on), as well as the content of all files of type "unicode", are translated between the character set specified by the P4CHARSET variable on the client and UTF8 in the server.

Before switching to unicode mode, verify that the character set you want to work with is supported. If the goal is to manage files that contain unicode characters, consider standardizing on either UTF8 or UTF16 encoding. Note that, starting with the 2007.2 release, Perforce adds a new UTF16 filetype (see the Release Notes) specifically to support UTF16 files in both non-unicode and unicode modes. To benefit from UTF16 support, all of your Perforce users need to be running 2007.2 versions of Perforce client programs.

If you need to work on unicode files that contain characters saved in the users directory, syncing or submitting such files from a single client machine can become cumbersome, because extra steps (switching between different P4CHARSET settings, installing additional code pages, and so on) are required to complete the task.


If the above option is not appropriate for your situation, then the unicode files can be added as binary files. This does make diffing such files more difficult, because by default Perforce does not support diffing true binary files. However, if your binary files are true UTF8 files, then the default diff/merge tool in P4V correctly diffs them. In addition, P4Win/P4V users can also specify a third-party diff/merge tool for such files. Likewise, command line users can force the diff using the "-t" flag.

Before you use Perforce in a unicode environment, you must first instruct your Perforce Server to run in unicode mode. To set up your server to run in this mode, stop the server, and then run this command from within your Perforce server root directory:

p4d -xi

This command verifies that all existing metadata is valid UTF8 and sets a protected unicode counter to make sure that future invocations of p4d operate in unicode mode. Once set on the server, unicode mode cannot be deactivated (that is, you cannot return to non-unicode mode). After p4d -xi switches your server into unicode mode, you can invoke p4d with your usual flags.

Important:

Occasionally, when trying to switch the server to unicode mode with the p4d -xi command,
the server responds with:

Table db.user has 14 rows with invalid UTF8.
Table db.domain has 1 rows with invalid UTF8.
...
Perforce server error:
Database has 14 tables with non-UTF8 text and can't be switched to Unicode mode.

To fix this problem, do the following:

1. Take a checkpoint.
2. Save the new checkpoint as a UTF8 file. For Windows users, you can use Notepad (2GB limit), or another editor of your choice if your checkpoint is greater than 2GB, to convert the new checkpoint to UTF8. The editor itself is not important (though word processors should be avoided), as long as it can save your new checkpoint as UTF8. On Unix, consider using iconv.
3. Remove all db.* files.
4. Restore from the UTF8 checkpoint.
5. Verify.
6. Try p4d -xi again.

Or, you might also try:

1. Take a checkpoint.
2. Try to find all of the high-ASCII characters in each table (rows in the checkpoint).
3. Fix those.
4. Save the modified checkpoint.
5. Restore from the checkpoint.
6. Try p4d -xi again.
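Finding the offending rows by hand is tedious in a large checkpoint. The following is a minimal Python sketch, not a Perforce tool, that lists the lines of a file that are not valid UTF8. The function name find_invalid_utf8_lines is hypothetical, and the sketch assumes the checkpoint is an uncompressed text file.

```python
def find_invalid_utf8_lines(path):
    """Return (line_number, raw_bytes) for every line that is not valid UTF-8."""
    bad = []
    # Binary mode: read the raw bytes so that nothing is decoded implicitly.
    with open(path, "rb") as f:
        for lineno, raw in enumerate(f, start=1):
            try:
                raw.decode("utf-8")  # strict decoding raises on invalid bytes
            except UnicodeDecodeError:
                bad.append((lineno, raw))
    return bad
```

Splitting on newlines is safe for this check because a newline byte never occurs inside a multi-byte UTF8 sequence. The line numbers tell you where to look in the checkpoint; they do not tell you which encoding the bad bytes were originally written in.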

To convert to proper UTF8, you can use any of the available character set conversion tools. The "iconv" converter is a good choice, and it is available for both Unix and Windows. Note that "iconv" might miss some German umlaut characters; use it carefully and run p4 verify immediately after you use this tool.
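As an alternative to iconv, the conversion can be sketched in Python. This is only an illustration under the assumption that you know the single encoding the checkpoint was written in (for example "cp1252"); the function name convert_to_utf8 and its parameters are hypothetical. Decoding is strict, so a byte that does not belong to the source encoding raises an error instead of being silently dropped.

```python
def convert_to_utf8(src_path, dst_path, src_encoding, chunk_chars=1 << 20):
    """Re-encode a text file from src_encoding to UTF-8, one chunk at a time."""
    # newline="" disables newline translation so line endings pass through unchanged.
    with open(src_path, "r", encoding=src_encoding, newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        while True:
            chunk = src.read(chunk_chars)  # reads up to chunk_chars characters
            if not chunk:
                break
            dst.write(chunk)
```

Reading in chunks keeps memory use bounded for checkpoints larger than 2GB. If the metadata mixes several legacy encodings, a single pass like this cannot be correct for all rows, and the affected rows have to be fixed individually.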

To use Perforce in a unicode environment, you must also set the P4CHARSET environment variable on your client machines. If it is not set, users of P4V or P4SCC.DLL are asked to choose their encoding when they first connect to a Unicode enabled server, and other users get a "Unicode server permits only unicode enabled clients" message.

Important:
Be aware that mixing different encodings and, consequently, P4CHARSET settings on the same computer is likely to cause file corruption and/or translation problems.

The following table lists a few of the most used (in the USA) P4CHARSET values:

The P4CHARSET value "none" is also worth mentioning: (a) it overrides any existing P4CHARSET setting when passed with the "-C" switch, and (b) it allows you to connect to a (non-)unicode enabled server. For the complete list of supported P4CHARSET values, run p4 help charset or visit: http://www.perforce.com/perforce/doc.current/user/i18nnotes.txt

* utf8 is untranslated, but the file content is validated.

** utf16 requires that P4COMMANDCHARSET be set to a different (non-utf16) charset for the p4 command line client to function, for example:

p4 -C utf16 -Q utf8 sync some_files

where "-C" is a command line flag for P4CHARSET and "-Q" is for P4COMMANDCHARSET.

Note that both P4V and P4Win have a field in the Preferences dialog to reset P4CHARSET.

1. Log in to Windows and open an MS-DOS command prompt.
2. Confirm that you have a True Type (TT) or Open Type font.
3. Display your active code page by issuing the chcp command. Windows displays a message like the following:

Active code page: 1252

Select the character set based on the active code page as follows:


To set P4CHARSET for all users on this workstation, you need Administrator privileges. Issue the following command:

p4 set -s P4CHARSET=[character_set]

If you do not have Administrator privileges, you can use:

p4 set P4CHARSET=[character_set]

to set P4CHARSET for the user currently logged in. Other users on the same machine have to set P4CHARSET independently.

On Unix, set P4CHARSET to the proper value from a command shell or in a startup script such as .kshrc, .cshrc, or .profile. You can determine the proper value for P4CHARSET by examining the current setting of the LANG or LOCALE environment variable.
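As an illustration of reading the codeset out of LANG, here is a small Python sketch. The function name suggest_p4charset and the three-entry mapping are assumptions made for this example, not a complete or official table; check any suggested value against the output of p4 help charset before using it.

```python
# Hypothetical, deliberately incomplete mapping from normalized locale
# codesets to P4CHARSET values. Verify each value with "p4 help charset".
CODESET_TO_P4CHARSET = {
    "utf8": "utf8",
    "iso88591": "iso8859-1",
    "eucjp": "eucjp",
}

def suggest_p4charset(lang):
    """Suggest a P4CHARSET for a LANG value such as 'en_US.UTF-8', or None."""
    # LANG looks like language_TERRITORY.codeset@modifier; keep only the codeset.
    codeset = lang.partition(".")[2].partition("@")[0]
    normalized = codeset.lower().replace("-", "").replace("_", "")
    return CODESET_TO_P4CHARSET.get(normalized)
```

For example, suggest_p4charset("en_US.UTF-8") returns "utf8", while a LANG value with no codeset part, such as "C", returns None and has to be resolved by hand.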

On Mac OS X, set P4CHARSET to the proper value either in a command shell, for example:

$ export P4CHARSET=utf8

or in the "environment.plist" file, which resides in the ~/.MacOSX directory.

Note that the first form will be valid for your running shell session only. To make the change permanent, P4CHARSET should be set in your shell's startup scripts or in the environment.plist file. The default shell is bash and the startup script is ~/.bashrc.

If P4CHARSET is not set in an environment, P4V users are prompted to select a setting from the drop down list when establishing their first connection with the Unicode enabled server.

If you try to connect to a unicode mode enabled server without P4CHARSET set, most commands return an error:

$ p4 counters
Unicode server permits only unicode enabled clients.

If unicode is enabled, the output of p4 counters includes a 'unicode' counter with a value of '1'.

Example:

$ p4 counters
change = 1
unicode = 1
upgrade = 21

If you do not have P4CHARSET set, or cannot run p4 counters, you can use tagged output with p4 info. The tagged output generated by p4 -Ztag info has a "unicode" field that is set to "enabled".

Example:

$ p4 -Ztag info
[...]
... clientAddress 127.0.0.1:50936
... unicode enabled
... serverAddress localhost:9988
... serverRoot introot/
... serverDate 2010/10/21 11:36:37 -0700 PDT
... serverUptime 02:46:52
... caseHandling sensitive
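If you want to check the unicode field from a script, the tagged output shown above is easy to parse. The following Python sketch works on captured output text only and does not invoke p4 itself; the function names parse_tagged and server_is_unicode are hypothetical, and the sample text is taken from the example above.

```python
SAMPLE = """\
... clientAddress 127.0.0.1:50936
... unicode enabled
... serverAddress localhost:9988
... serverDate 2010/10/21 11:36:37 -0700 PDT
"""

def parse_tagged(output):
    """Turn lines of the form '... key value' into a dict; skip other lines."""
    fields = {}
    for line in output.splitlines():
        if not line.startswith("... "):
            continue
        # The key is the first word after the marker; the rest is the value.
        key, _, value = line[4:].partition(" ")
        fields[key] = value
    return fields

def server_is_unicode(output):
    """True if the tagged info output reports 'unicode enabled'."""
    return parse_tagged(output).get("unicode") == "enabled"
```

The sketch assumes that a server not in unicode mode reports either no unicode field or a value other than "enabled"; in both cases server_is_unicode returns False.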

The "Cannot translate..." message is displayed if your client machine is configured with a character set that does not include characters being sent to it by the Perforce Server. Your client machine cannot display unmapped characters.

For example, if your client machine is configured to use the shift-JIS character set and your depot contains files named using characters from the Japanese EUC character set that do not have mappings in shift-JIS, you see the "Cannot translate..." error message when you execute a p4 files or p4 changes command that lists those files.
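The same kind of failure can be reproduced outside Perforce. The Python sketch below reports which characters of a string have no mapping in a given character set, using Python's own codec tables. Those tables are an assumption here and are not guaranteed to match the translation tables Perforce uses; the function name unmappable_chars is hypothetical.

```python
def unmappable_chars(text, charset):
    """Return the characters of text that cannot be encoded in charset."""
    bad = []
    for ch in text:
        try:
            ch.encode(charset)
        except UnicodeEncodeError:
            # This character has no mapping in the target character set.
            bad.append(ch)
    return bad
```

A Japanese file name such as "日本語.txt" passes this check for "shift_jis", whereas a name containing the Korean character "한" does not, which is the situation in which a client would be unable to display the name.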

The Perforce Server has internal limits on the lengths of strings used to index job descriptions, specify filenames, control view mappings, and identify client names, label names, and other objects.

The most common limit is 1024 bytes. Because some characters in Unicode can expand to more than one byte, it is possible for certain Unicode entries to exceed Perforce internal limits.

Because no basic Unicode character expands to more than three bytes, dividing the Perforce internal limit by three ensures that no Unicode sequence exceeds the limit.

To ensure that no Unicode sequence exceeds the Perforce limit, do not create client names or view patterns that exceed 341 Unicode characters.
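The arithmetic behind the 341-character figure can be checked with a short Python sketch. The names are hypothetical, and the sketch assumes that a name of exactly 1024 bytes is still accepted, which the text above does not state explicitly.

```python
LIMIT_BYTES = 1024
# Integer division: 341 characters of at most three bytes each fit in 1024 bytes.
MAX_SAFE_CHARS = LIMIT_BYTES // 3

def utf8_length(name):
    """Number of bytes the name occupies once encoded as UTF-8."""
    return len(name.encode("utf-8"))

def fits_limit(name, limit=LIMIT_BYTES):
    """True if the UTF-8 encoding of name is within the byte limit."""
    return utf8_length(name) <= limit
```

A name made of 341 three-byte characters occupies 1023 bytes and fits, while 342 such characters occupy 1026 bytes and do not. Names that are mostly ASCII can be much longer than 341 characters, because each ASCII character takes only one byte.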

Under normal usage conditions, this length limit is not expected to pose a significant limitation.

With a server not running in internationalized mode, the Perforce "unicode" filetype behaves much differently.
The client and server both assume that a file is valid UTF8 and store it as such. The server does not attempt to translate or verify the content of the file in any way. It is imperative that the files be saved using an editor that can save as UTF8 prior to submitting such files to Perforce. Outside of this requirement, users can access the Perforce server normally. There is no need to set P4CHARSET on the client.

The file was checked in as UTF16 instead of UTF8 by a user. Roll back to an old revision or resave the file as UTF8.
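One way to catch this before submitting is to look for a byte order mark (BOM) at the start of the file. The Python sketch below is a heuristic only: the function name sniff_bom and its return labels are hypothetical, and a UTF16 file saved without a BOM is not detected.

```python
import codecs

def sniff_bom(data):
    """Classify the leading bytes of a file by their byte order mark, if any."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf8-bom"
    # Note: a UTF-32 little-endian BOM also begins with the UTF-16 LE BOM
    # bytes, so such a file would be reported as utf16 by this check.
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf16"
    return None
```

Pass it the first few bytes of the file, read in binary mode. A result of "utf16" for a file that is meant to be stored as UTF8 means the file should be resaved as UTF8 before it is submitted.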
