A story of jQuery, CLDR JSON and the Community at the Unicode Conf, Santa Clara – CA

On November, I gave a talk at the Internationalization & Unicode Conference #38, Santa Clara – CA. There’s no video this time. But, here are the slides, a picture thanks to John Huân (#paypal) and its transcription (slightly changed to sound better for a post).

jQuery is very popular for its core project create by John Resig that made web development simple and reliable in the dark era of browsers. Developers could use the magical dollar sign and not worry about compatibility anymore.

It happens that we have other projects as well, which are all under the same jQuery foundation umbrella. A really quick intro. UI, which holds a curated set of user interface components and is famous for its calendar widget: datepicker. Mobile, which extends UI with responsiveness and accessibility for smartphone and tablet devices. QUnit, which is a testing framework, we happen to use it ourselves to test our projects. Sizzle, the CSS selector engine inlcuded in jQuery. and Globalize, which is where I am.

My name is Rafael Xavier and I’m Globalize project lead. So, enough introduction, what happened to Globalize?

Everything starts with web developers trying to create a simple web application that needs globalization support. What do we do? Simple. We include libraries that allows us to do so and job’s done.

Everything looks fantastic until we go show our amazing progress to our boss and he says: “this is wrong…”.

When that happens (when we spot a bug) well, we get the chance to know our libraries a little better. Anyway we don’t have much choice other than either (a) figure out what’s wrong and fix it, or (b) file a bug.

On Globalize, we had a bunch of bugs. Interestingly, one third have all the same root cause: i18n content, or more precisely, having wrong or outdated i18n content embedded into our code. For example, a wrong month name or wrong day name. This bug affects all sorts of locales and all sorts of functionalities. The list goes on and on. Reviewing those bugs was really tricky, because we at jQuery know JavaScript. But, very little about the several other languages.

At some point, the reviewing process was to look & use what’s in CLDR. So, we asked ourselves why not to replace our database and use CLDR instead? This is when I got first introduced to Globalize. By that time, this bug was opened for 8 months and it wasn’t one, but two big issues we were trying to solve: (a) we were managing the content ourselves, and (b) the source initially chosen, which was .NET, turned out to have big problems.

When I was first digging into this bug, discovering what CLDR was and how we could change Globalize to use it, my initial concern was how that change would affect our locale coverage and our current functionality. But, for my happiness I figured that by adopting it, we would actually double the locale coverage and we would also fix all the functionality issues we had so far. It was simply
amazing.

Ok! In order to address that whole bunch of bugs at once, we are convinced to rewrite the whole library to conform with CLDR (with TR#35). But, we are also convinced we have a second challange to solve, embedded content. Or tomorrow, we’d end up with unhappy developers again complaining about outdated CLDR or wrong CLDR. We didnt want to simply rename our problem. But, we wanted so solve it.

Before creating something new. Let’s research. What about other libraries? After all, we are not alone in the world. Do they have this problem? If not, how do they solve it?

Let’s start with twitter-cldr. It’s based on CLDR obviously as its name suggests. But, it embeds CLDR data in the code as we unfortunately did as well. Do they get the same sorta bugs we had? Let’s see an example. Developer filed a bug about wrong date formatting in the Italian language, which has been fixed in CLDR 26. But, hasn’t been fixed in twitter-cldr (by the time of this talk). To give a rough idea of the timeline we’re talking about, after 43 days since developer has first reported this bug, it was fixed and published by CLDR. At the day of the talk, it was 91 days and the bug was still opened.

Next, angular.js. It’s based on CLDR. Actually, it pulls data in from Google Closure, which in turn is based on CLDR. Looking its master code (at the day of this talk), it presented the same Italian bug that we just saw.

Here, I’ve picked one existing bug as an example. But, there are more, an easy way to find them is by looking at CLDR changelogs.

Not to mention the bugs that are still open on CLDR. For example, the plural one I used as an example earlier on (shown in my slide). It seems so basic. But, it’s real and it happens today on CLDR 26. But, for the Brazillian Portuguese language.

The surprising fact is that all the mainstream globalization libraries for JavaScript run into the same problems as we, jQuery, have.

What usually happens is that we, again speaking as a developer, cannot wait indefinitely for those bugs to get fixed and land upstream ~~boss yelling~~. Occasionally, we fix them locally. Yay. Is it ideal? Some people may say “it works for me”. Although, keeping a local custom modified library has its drawbacks. As soon as a new update pops up and we download it. We lose our previous fixes. Unless we
keep a list of patches that we can re-apply over and over and, hopefully, not get any conflicts while doing so.

This is definitely not a good maintenance perspective over time. So, what could be done instead? At Globalize, we thought a solution should cover three simple points:

1. It should leverage the official CLDR JSON. Because, (a) processing JSON on
JavaScript is lighter. (b) It’s distributed by Unicode. It’s official. We believe developer should be able to say: “Globalize (or google Closure, format.js), here is my CLDR data, use it”.

2. It should allow developers to load as much or as little data as they need. For example, when formatting numbers, we don’t need the plural rules obviously. We don’t even need lots of the number fields themselves. It depends on a case by case basis.

3. Avoid duplicating data between libraries. If a developer uses two number libraries, one for formatting and another for parsing. For example, ecma-402 polyfil and Globalize. Both should be able to use the same shared number data.

The intriguing question is: if this is so good and obvious, why hasn’t anyone done this yet?

I don’t know the answer to this question. But, there’s one fact I haven’t mentioned yet. The first appearance of a CLDR in the JSON format actually happened on CLDR v23, which happened on 2013. And, by that time all these libraries already existed. So, all have an excuse. But, no more.

Welcome to the new Globalize. It has been rewritten to address all the previous three points. It’s designed to work both in the browser and in Node.js. At jQuery, we systematically test it against desktop and mobile browsers (versions listed in the slide). It supports AMD and CommonJS module loaders.

On the rewrite, we’ve split the former monolithic library into individual modules. On date module, we find date formating/parsing. On number module, we find number formating/parsing. And, so on… All the functional verticals lie onto the Globalize core, which works as a base layer. Note we have no more content embedded into our library. The CLDR content is treated as a peer dependency. Also, note that throughout the Globalize code, we manipulate CLDR: we instantiate locales, we traverse CLDR paths. Some of these operations are not that trivial <likelySubtags example?>. Therefore, we’ve wrapped that code into cldr.js. So, we (a) keep Globalize focused on the i18n functions only, (b) allow other libraries to scaffold and build themselves on top of the same foundation we do. ~~layers topology~~.

Cldr.js is a low level library, whose only purpose is to help to manipulate CLDR. So, it’s really cool to develop your i18n library not needing to worry about that.

Like Globalize, it’s designed to work both in the browser, or in Node.js. It support AMD and CommonJS.

It’s unopiniated on how user should load CLDR data. There are dozen of ways. All we expect is the JSON. So, developer can fetch the data dynamically if he will. He can use AMD plugins. Node.js require. Or any imaginable way.

Globalize uses Cldrjs load method under the hoods, so Globalize works the same way.

It automatically deduces subtags using CLDR algorithms specified by CLDR docs. These variables then are automatically used when traversing the tree to get an item.

Those are the most important features, but there are more.

We’ve talked about the benefits of having CLDR as a separate thing. But, this approach inevitably introduces one extra step for developers: to download the CLDR data themselves. Who likes extra steps?

How can we ease this initial ramp up? More importantly, how does developer know he has the right CLDR version compatible with the libraries he’s using?

Libraries need a way to declare its CLDR peer dependency.

We at jQuery use bower and npm to manage our library dependencies. Can we use this same tools to manage CLDR peer dependency as well?

Yes, we created a package called `cldr-data` for both: npm and bower. It works the lightest possible way we could imagine it. It doesn’t actually mirror any data. But, it manages the zip urls, and as a post-processing step it downloads the right link and unpack it.

If you develop a globalization library, or plan to, consider scaffolding it on top of cldr.js and cldr-data.

Let’s start to see some action. The demo I’m about to show you is going to do three things: (a) fetch globalize, (b) fetch cldr data in the json format and (c) run some Globalize code on Node.js.

We can use npm to install all that. If you are not familiar with npm, it’s the Node.js package manager and is commonly used by Node developers. Installing globalize also installs its both dependencies: cldrjs (the base library) and cldr-data (the node module that pulls in the CLDR data in the json format from the Unicode servers)

We’re all set. Let’s use Globalize. Start using right ahead makes it complain about missing CLDR data.

We developers need to feed Globalize with the proper CLDR JSON files. Here, I’m using cldr-data package. It’s optional. We pass in only the content we need. Oh, but which ones do we need? If you need help figuring it out, see the table in our docs. Then, use it.

Let me show you how easy it is to manage the CLDR content. “User has the power”. Do you remember the italian bug we saw earlier on, which affected twitter-cldr and angular.js? Let’s reproduce it by using CLDR version 25. Note the problem is the usage of `/` (wrong) instead of ` ` (correct).

Do you want to fix it and move to the next level? Yes. Note that using CLDR version 26 is enough to fix the problem. We don’t need to update our JS library or application code.

Do you remember the Brazillian Portuguese plural bug, which is currently present on CLDR 26? Let’s reproduce and fix it? In Brazillian Portuguese, the plural form of 0.5 is not other, but one.

There’s no CLDR 27 yet. But, let’s set the right rules ourselves and retry. Tchara! Fixed! Note, we were able to fix it dynamically on user code.

The same thing works on browsers as well. We can use bower package manager, which is commonly used on client applications.

Please, find examples using these various environments (bower, AMD or node.js) in Globalize docs.

What about performance? I’m advocating a library that traverses the CLDR tree and parse rules dynamically at runtime, which is totally fine during development. But, what about production? Is there any optimization?

When formatting a number, there’s actually a two step process: (a) setup and (b) execution, where setup takes considerably more time (is a lot more expensive than execution). The runtime difference is an order of magnitude.

Note that the same is also true for date formatting, message formatting, or even parsing all of them.

So, an obvious way to speed up iterations in your code is to generate the formatter outside the loop. The same idea is valid for server applications, we probably want our formatters to be created in advance, so when requests arrive, we can process them quickly by simply executing the formatter. Obviously, this is a simple demonstration. Usually, an ICU Message Format goes here instead in a real world application, a number format would be an input for a templating engine like mustache, handlebars. But, all follows the same idea.

Another distinction between the setup and execution is that all CLDR manipulation happens during setup, which can be really handy. For example, any missing CLDR error will be thrown as soon as the server is started in this example. So, all subsequent client requests are safe from any CLDR manipulation error.

Let’s talk a bit about client application. On client side, performance is also about how fast our page loads. So, size matters. How to get the smallest and leanest bundle for production?

Let’s see an example. Suppose we need a plural function for English. This is what we need. A function that given a number, outputs the plural form. But, in order to get that, we need the English plural rules and the library code to parse that rules in order to generate this simple function. In the end, this tiny thing is what matters. What if we could precompile that at build time and deploy only this tiny function. The good news is that it’s possible. All our formatters/parsers output the precompiled function. So, by managing that at
build time, we can remove the need for most of the library at runtime.

The plural function is an extreme example. But, we often can save bytes formatting or parsing other stuff too.

If you are familiar with templating engines like Handlebars, the idea of deploying precompiled bundles may sound familiar.

You know, what’s most important about this whole story is that we didn’t do it all ourselves. We’ve been systematically collecting feedback on every aspects of our idea and our implemention. Part of the solution is contribution from Wikipedia, part from Alex Sexton, part from people all over the internet.

The jQuery foundation has a mission that goes beyond supporting the development of our products. It’s about supporting three things.

Improve the open web, which means, having open standards publicly available and free to implement. ~~An open web~~

Make web accessible for everyone, it means including people with disabilities and people living in poor conditions. ~~An accessible web~~

And, ~~Collaboration with the development community~~.

“One thing that really struck me during my limited research was how many overlapping libraries there are especially for number and date-time formatting”

Norbert Lindenberg, the creator of ECMA-402 (the standard Internationalization API for JavaScript) said an year ago. It happens this is still true.

We have been in contact with lots of people in different organizations. As an effort of improving the coordination and potentially the collaboration between projects, we’ve created a common channel for communication. Hopefully, with the joint effort of this group, we’ll make this whole farm more productive.

It’s a very recent initiative. But, it’s already been used for annoucements of new accomplishments and we’re collaboratively producing a comparison grid to help clarify the differences, the gaps, the strenghts and weekness of each different projects.

All libraries that I’ve showed are functional. They work. So, feel free to use it. I hope you enjoy it. File bugs if you find any trouble. We really listen to what you say. Help us to design, implement, and test it. Collaborate with us. Join us.

Thank you

Leave a comment Cancel reply