No matter how distinct they may seem, all localization frameworks have one thing in common: they need a way to store localized text. For a localization effort that’s entirely self-contained, it doesn’t matter how your translations are stored. However, if you plan to integrate your localization effort with outside tools or translators, the format your translations are stored in can quickly become a barrier to progress.
In this post, we’ll look at some of the more common localization file formats and established best practices. Whether you’re starting a new project or looking to integrate an existing localization effort, knowing which file formats to target can save you time and effort.
Gettext (.po)
Gettext is a popular internationalization framework used across a wide variety of programming languages and operating systems. In addition to handling a wide range of languages, language rules, and locale-specific settings, it’s supported by many tools, such as Poedit, gted, and Virtaal.
A key benefit of gettext is standardization. GNU gettext is one of the more popular open-source implementations of gettext and has been ported to PHP, Python, Perl, and more. WordPress, Ubuntu, and LibreOffice use gettext to provide translations.
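To make the gettext workflow concrete, here is a minimal sketch using Python's standard `gettext` module. The domain name `messages`, the language code `fr`, and the helper `load_translator` are assumptions made for illustration. A real project would point `localedir` at a directory laid out as `<localedir>/<language>/LC_MESSAGES/<domain>.mo`, where the `.mo` file is compiled from a translator's `.po` file.

```python
import gettext
import tempfile


def load_translator(domain, localedir, language):
    """Return a translation function for `language` (hypothetical helper).

    With fallback=True, a missing catalog yields a pass-through translator
    instead of raising an error.
    """
    catalog = gettext.translation(
        domain, localedir=localedir, languages=[language], fallback=True
    )
    return catalog.gettext


# An empty temporary directory stands in for a real locale directory here,
# so no compiled catalog is found and the source string comes back unchanged.
with tempfile.TemporaryDirectory() as empty_locale_dir:
    _ = load_translator("messages", empty_locale_dir, "fr")
    greeting = _("Hello world!")
```

Because the source string doubles as the lookup key, untranslated text degrades to the original language rather than to a missing-key error.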
XML Localization Interchange File Format (.xliff)
XLIFF is an industry standard format based on XML. XLIFF was designed specifically for the localization industry as a bridge between platforms and tools, such as your application and a localization service. XLIFF also serves to standardize the way information is transferred throughout the localization process, ensuring interoperability across different workflows. Even applications such as Microsoft SharePoint rely on XLIFF to transfer localization information to and from translators.
Like gettext, XLIFF defines a standard format for storing localized text. Your team can use tools such as the Translate Toolkit to generate and convert XLIFF files. Not only does this help speed up the localization process, but it helps ensure your XLIFF files will be complete and parsable by services such as Transifex.
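As a rough illustration of what an XLIFF file carries, the sketch below reads a small, hand-written XLIFF 1.2 document with Python's standard library. The sample content and the `read_units` helper are assumptions for this example; real files produced by a localization tool usually include more metadata.

```python
import xml.etree.ElementTree as ET

# Namespace used by XLIFF 1.2 documents.
XLIFF_NS = {"x": "urn:oasis:names:tc:xliff:document:1.2"}

# Hand-written sample: one translated unit and one still awaiting translation.
SAMPLE_XLIFF = """\
<xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2">
  <file original="messages" source-language="en" target-language="fr" datatype="plaintext">
    <body>
      <trans-unit id="hello_world">
        <source>Hello world!</source>
        <target>Bonjour le monde !</target>
      </trans-unit>
      <trans-unit id="goodbye">
        <source>Goodbye</source>
      </trans-unit>
    </body>
  </file>
</xliff>
"""


def read_units(xliff_text):
    """Map each trans-unit id to a (source, target) pair; target may be None."""
    root = ET.fromstring(xliff_text)
    units = {}
    for unit in root.iterfind(".//x:trans-unit", XLIFF_NS):
        source = unit.findtext("x:source", default="", namespaces=XLIFF_NS)
        target = unit.findtext("x:target", default=None, namespaces=XLIFF_NS)
        units[unit.get("id")] = (source, target)
    return units


units = read_units(SAMPLE_XLIFF)
```

Keeping the source and target text side by side in each unit is what lets a translator or service see exactly which strings still need work.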
Extensible Markup Language (.xml)
Unlike most of the other formats on this list, XML is a general-purpose markup language rather than a localization-specific format. Whereas most formats impose a rigid structure on your localization data, the elements and attributes of an XML document are chosen by whoever designs the document. This means one XML document can have a drastically different structure from another, even if both files are syntactically valid.
Because of this, XML forms the base of many formats such as Windows resource files (.resx, .resw), Android string resource files (.xml), and the XML Localization Interchange File Format (.xliff). Although these formats use the same language, they implement it in slightly different ways. For example, the following XML defines the string “Hello world!” with an ID of “hello_world” in Android:
<string name="hello_world">Hello world!</string>
The following XML shows the same concept implemented in a Windows desktop application:
<data name="hello_world" xml:space="preserve"> <value>Hello world!</value> </data>
Although any platform capable of interpreting XML can parse these files, the platform has to know how the XML document is structured. This is why some localization platforms support some XML-based formats but not others.
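The sketch below shows why structure matters: the same XML parser reads both snippets, but each format needs its own extraction logic. The wrapping `<resources>` and `<root>` elements are simplified stand-ins for complete files, and the function names are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Simplified Android string resource file.
ANDROID_XML = """<resources>
  <string name="hello_world">Hello world!</string>
</resources>"""

# Simplified Windows .resx file (real files also carry headers and a schema).
RESX_XML = """<root>
  <data name="hello_world" xml:space="preserve">
    <value>Hello world!</value>
  </data>
</root>"""


def parse_android(text):
    """Android keeps the text directly inside each <string> element."""
    root = ET.fromstring(text)
    return {el.get("name"): el.text or "" for el in root.iter("string")}


def parse_resx(text):
    """.resx keeps the text in a <value> child of each <data> element."""
    root = ET.fromstring(text)
    return {
        el.get("name"): el.findtext("value", default="")
        for el in root.iter("data")
    }


android_strings = parse_android(ANDROID_XML)
resx_strings = parse_resx(RESX_XML)
```

Both functions end up with the same dictionary, but neither would find anything if handed the other format's file.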
JavaScript Object Notation (.json)
As with XML, JSON is a general-purpose format for transmitting data between applications. JSON offers many of the same benefits as XML, such as a flexible and dynamic structure that can be read by any JSON parser (of which there are many). JSON is also arguably more human-readable than XML, making it easier for developers and translators to work with.
A common problem that many localization teams run into with JSON is invalid data types. Values stored in a JSON file can be of several types, including strings, numbers, and empty (or null) values. For localization purposes, we recommend storing only non-empty string values, each enclosed in double quotes.
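A quick check like the following can catch these values before a file is handed to translators. It is a minimal sketch that assumes a flat JSON object of key-value pairs; the function name and sample data are hypothetical.

```python
import json


def find_invalid_entries(json_text):
    """Return {key: reason} for every value that is not a non-empty string."""
    data = json.loads(json_text)
    problems = {}
    for key, value in data.items():
        if not isinstance(value, str):
            problems[key] = "not a string: " + type(value).__name__
        elif not value.strip():
            problems[key] = "empty string"
    return problems


# Hypothetical sample with one valid entry and three problematic ones.
SAMPLE = (
    '{"hello_world": "Hello world!", "item_count": 3, '
    '"title": null, "subtitle": ""}'
)

problems = find_invalid_entries(SAMPLE)
```

Running a check of this kind in a build or pre-commit step is one way to keep numbers and null values out of translation files.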
Lastly, complex, nested JSON objects can cause problems for certain parsers. A value stored in JSON can be a string, a number, a null value, an array of values, or even another JSON object. Some localization frameworks may not support nested structures unless you first define additional parsing rules.
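One common workaround is to flatten nested objects into a single level of keys before handing the file to a tool that expects flat key-value pairs. The sketch below joins nested keys with a dot; that separator and the `flatten` helper are assumptions for illustration, and keys that already contain dots would need a different separator.

```python
import json


def flatten(node, prefix=""):
    """Collapse nested objects into one level of dotted keys."""
    flat = {}
    for key, value in node.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested objects, extending the key path.
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


# Hypothetical nested translation file.
NESTED = json.loads(
    '{"menu": {"file": {"open": "Open", "save": "Save"}}, "title": "Editor"}'
)

FLAT = flatten(NESTED)
```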
JSON is famously used by MediaWiki to store over 23,000 translations.
Java Properties (.properties)
Properties files are commonly used in Java applications to store configuration settings. They’re also popular for localization because they are simple and readable. When used for localization, each entry in a properties file is a key-value pair: an identifier followed by the localized text. Variants of the format (such as Mozilla localization files) may provide different features, but they all follow the same basic structure.
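One quirk of properties files is that they have traditionally required ISO-8859-1 (Latin-1) encoding, as opposed to the UTF-8 encoding common to most other formats. Characters outside Latin-1 have to be written as Unicode escapes such as \u20AC. Newer Java versions (Java 9 and later) read properties-based resource bundles as UTF-8 by default, but older runtimes and some tools still expect Latin-1, so it is something for developers to be aware of.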
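Text that falls outside Latin-1 can still be stored in a properties file by writing it as `\uXXXX` Unicode escapes. The sketch below shows one way to produce those escapes; the function name is hypothetical, and for simplicity it escapes every non-ASCII character, which is stricter than Latin-1 requires but still valid.

```python
def escape_for_properties(text):
    """Replace each non-ASCII character with \\uXXXX escapes.

    Characters beyond the Basic Multilingual Plane are written as a UTF-16
    surrogate pair, which is the form Java expects.
    """
    out = []
    for ch in text:
        if ord(ch) < 0x80:
            out.append(ch)
            continue
        units = ch.encode("utf-16-be")
        for i in range(0, len(units), 2):
            code_unit = int.from_bytes(units[i:i + 2], "big")
            out.append("\\u{:04X}".format(code_unit))
    return "".join(out)


# Build one properties line from a key and a German greeting.
line = "greeting=" + escape_for_properties("Grüße")
```

The escaped file stays pure ASCII, so it reads the same whether a tool treats it as Latin-1 or UTF-8.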
Comma-Separated Values (.csv)
The CSV format is perhaps the simplest format for storing localization information. Each line holds a pair of strings separated by a comma, typically the source text (or an identifier) followed by the translation. Text that itself contains a comma needs to be wrapped in double quotes. CSV files are straightforward and easy to parse and edit, although they are less flexible than other formats. Magento uses CSV files to manage localized strings.
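Reading such a file takes only a few lines with Python's standard `csv` module, which also handles quoted values containing commas. The sample rows and the `read_translations` helper below are made up for illustration.

```python
import csv
import io

# Hypothetical translation file: source text, then translation.
SAMPLE_CSV = (
    'Hello world!,Bonjour le monde !\n'
    '"Add to Cart","Ajouter au panier"\n'
    '"Hello, %1","Bonjour, %1"\n'
)


def read_translations(csv_text):
    """Return {source: translation}, skipping rows with fewer than two columns."""
    reader = csv.reader(io.StringIO(csv_text))
    return {row[0]: row[1] for row in reader if len(row) >= 2}


translations = read_translations(SAMPLE_CSV)
```

Using a real CSV parser rather than splitting on commas matters here, because the third row contains a comma inside the quoted text.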
While the file format you use is most likely to be determined by your localization framework, it’s good to know what’s out there. Transifex supports over 25 localization file formats in addition to those listed above. You may also be able to convert one format to another using tools such as the Translate Toolkit. For more information, contact us with your localization questions.