December 17th, 2011

I don’t know why. It’s a Saturday morning, my wife has gone off to inform Father Christmas of what gifts to bring my daughter, my daughter is taking a nap (life is simple when you’re not yet nine months old), I’m feeling a bit ragged in that ‘almost the end of semester’ kind of way, I’m still caffeinating, and I had just started reading this when I suddenly thought: wouldn’t the relative sparseness and general lack of polysyllabic words make Classical Chinese ideal for microblogging?

Now, that’s not even close to hypothesis quality. Not even a random thought, barely makes it to random thot level. Really, it’s just a thotikin. But how to test this wee thotikin? Being the least diligent student of Chinese in all of recorded history, legend and myth, I’m certainly not going to embarrass myself by attempting to actually write anything in Classical Chinese. But glancing at the shelf above me, I see a few bilingual – Classical and Modern Chinese – editions of a few of the Chinese classics. Surely the obvious method would be to find one or two passages of about microbloggable length and compare the original Classical text with the Modern translation for length. A glance in the Hanfeizi reveals a lot of rather long passages, not really microbloggable. Ah, but the Shanhai Jing – surely there’s a book that should have been published on Weibo! So here it is:



And putting that into Weibo, including the title and a colon to distinguish it from the text, leaves me 72 characters spare, so I only used 68.

Now, the modern Chinese translation by, er, somebody not me. Can’t find the translator’s name in the book:


And that, with the same 3 character title and colon, leaves me with only 26 characters, so that’s 114 characters all up. So the Classical Chinese uses only 60% of the characters of the Modern Chinese version.
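For anyone who wants to check the arithmetic, here is a minimal sketch of the subtraction method, assuming Weibo's 140-character allowance and the two "characters remaining" figures quoted above. The function name `chars_used` is just a label made up for illustration.

```python
WEIBO_LIMIT = 140  # character allowance per post, as used in the text

def chars_used(remaining: int, limit: int = WEIBO_LIMIT) -> int:
    """Characters consumed, given the allowance reported as still free."""
    return limit - remaining

classical = chars_used(72)  # Classical Chinese passage, title and colon included
modern = chars_used(26)     # Modern Chinese translation, same title and colon

ratio = classical / modern
print(classical, modern, f"{ratio:.0%}")  # 68 114 60%
```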

So how does it compare with English? I’m not the only one to have gotten the impression that one can squeeze a lot more information into 140 Chinese characters than 140 English characters – although it must be said that’s not necessarily true. I was sure I had a trilingual (Classical and Modern Chinese and English) copy of the Daode Jing lying around, but I guess it must be helping clutter up my parents’ house in New Zealand. And in any case, like the Hanfeizi it’s not really a microbloggable book. I do have a similarly trilingual copy of the Zhuangzi with me, but again, not really microbloggable. But I do have a bilingual Classical Chinese and English copy of the Analects. So let’s try some randomly chosen passage.

Book 2, 1:


A mere 24 characters, punctuation included.

Arthur Waley’s translation:

The Master said, He who rules by moral force is like the pole-star, which remains in its place while all the lesser stars do homage to it.

69 characters (not the 75 I first wrote; I misread my own handwriting, would you believe. Thanks, Jean, for catching that error), and that’s with the little translator’s note (te) removed. The original needs only 35% (not the 32% I first wrote) of the number of characters of the translation.

So there you go, through what are obviously two super rigorous experiments of great scientific virtue I have proved that in fact, one could, by using Classical Chinese, squeeze into one’s microblog of choice almost twice as much information as by using Modern Chinese and nearly three times as much information as by using English. Therefore, because verbosity is a virtue, we must all rebel against the character limits imposed on us by the likes of Weibo and Twitter and do all our microblogging in Classical Chinese.

    Actually Weibo counts only one character for two English letters. Your English sample is 138 characters long in Twitter (and 69 in Weibo, I am not sure how you arrived at 75).

    I prefer the Weibo way as it is closer to the underlying representation and size taken by the character, at least in an abstract way. The alphabet can fit in a byte (8 bits so 2^8=256 values) while encodings of Chinese characters (like GB2312) required two bytes (with a maximum of 2^16=65536 values). Of course, now we are not that stingy with disk space and Unicode is here to save the World from these pesky encoding issues. Still, if everyone in Twitter started to use 140 Chinese characters instead of 140 honest-to-God-without-even-a-diacritic roman letters, somewhere a few hard-drives would start crying.

    OK, not sure what I am trying to prove here, but as your experiment is super rigorous, I thought it should be mentioned.
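The two ways of counting described in the comment above can be sketched roughly as follows. This is only an approximation: it assumes an ASCII character costs half a Weibo character, anything else costs a full one, and halves are rounded up. The real Weibo counter may differ in detail, and `weibo_length` is a name made up for illustration.

```python
import math

def weibo_length(text: str) -> int:
    """Approximate Weibo count: ASCII costs half a character, the rest cost one.

    Assumption: a trailing half is rounded up.
    """
    half_units = sum(1 if ch.isascii() else 2 for ch in text)
    return math.ceil(half_units / 2)

waley = ("The Master said, He who rules by moral force is like the pole-star, "
         "which remains in its place while all the lesser stars do homage to it.")

print(len(waley))           # 138, one per character as on Twitter
print(weibo_length(waley))  # 69

# 24 is the character count given in the post for the Classical original.
print(f"{24 / weibo_length(waley):.0%}")  # 35%
```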

    Jean, thanks for that, although I must admit the encoding issues are a bit beyond me.

    I seem to remember a few years ago somebody experimenting to see exactly how many Chinese characters Twitter would allow by typing out 一二三四五六七八九十 repeatedly until Twitter squealed “Stop!” and then counting the result, and the result was 80. I don’t know if that’s still the case.

    I got my character counts by pasting those passages into Weibo then subtracting the remaining character allowance from 140. Now, maths is not my strong point, and neither is handwriting. Glancing at the notepad next to me (yes, I did the subtraction the old-fashioned way. I used a calculator for the percentages, y’know, maths not being my strong point) I see that I misread my own 0 for a 6. Permit me a Homer Simpson moment. So, yes, that should be 69 for the English sample, and as soon as I’ve finished this comment I’ll go and correct the post.

    So, another reason Weibo is better than Twitter is that it lets you write twice as much English? If that’s true, then I’m quite happy to make Twitter’s hard drives cry.

    Heh, loved Brendan’s comment! And interesting comparison.

    When Obama won the 2008 election, his victory speech was translated into 文言文 on the Chinese web! Using that as a sample, link below, the English version is 10556 bytes, whereas the 文言文 version is merely 10752 bytes, based on the byte count in my vim editor. Given the encoding discussion above, that really shows how compact 文言文 can be.

    I remember 和菜头 wrote a 本-拉登列传 in the style of 司马迁’s 《史记》, 文言文 all the way. It was hilarious. I think he did another one similar to this on his other blog, perhaps about Jobs, but I can’t find it now.


    How on earth did you get held up for moderation, Mr Ji? Sorry about that.

    Awesome links, thanks.

    So, glancing back at Jean’s comment, a Hanzi takes 2 bytes whereas a Latin letter needs only 1, correct? So a text written in Hanzi should need twice as many bytes as a text written in Latin script with an identical character count. Is that how it works? If so, then 文言文 coming in at only slightly more bytes (196 more, assuming I’m not misreading my own handwriting again) is indeed compact.

    So I guess my winter holiday plans are sorted then – when I’m not on baby care duty it’ll be nose to the 文言文 grindstone.

    Actually, the size calculations are not so simple. If we want to compare byte counts, we need to know the encoding. GB2312 and Big5, the old encodings used respectively for simplified and traditional Chinese, will use two bytes per character. UTF-8 will use three bytes for almost all Hanzi and only one for normal ASCII letters (UTF-8 is a variable-length encoding, so not all characters will require the same number of bits).

    UTF-8 is the best technical solution, but there are still a lot of documents and websites using GB2312 and Big5.
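The byte sizes mentioned above are easy to check with Python's built-in codecs. A small sketch, using the ten-Hanzi test string from the earlier comment and ten plain Latin letters for comparison:

```python
hanzi = "一二三四五六七八九十"  # ten Hanzi, the test string mentioned in the comments
latin = "abcdefghij"            # ten unaccented Latin letters

for encoding in ("gb2312", "big5", "utf-8"):
    hanzi_bytes = len(hanzi.encode(encoding))
    latin_bytes = len(latin.encode(encoding))
    print(encoding, hanzi_bytes, latin_bytes)

# gb2312 20 10
# big5 20 10
# utf-8 30 10
```

So the same ten Hanzi weigh two or three times as much as ten Latin letters, depending on which encoding the file was saved in.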

    Thanks, Jean. The technicalities are way beyond me, though. But we’re still able to say that the byte counts Mr Ji gave us show that with Classical Chinese we’re getting much more information into fewer characters than with English, even if it does cause a few hard drives to cry, right?
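As a rough way of turning those byte counts back into character counts, here is a back-of-the-envelope sketch. It rests on loud assumptions: the encoding of the two files is not known, so both the two-byte and three-byte cases are tried; every byte of the 文言文 file is treated as part of a Hanzi (punctuation and line breaks are ignored); and the English file is taken to be one byte per character. The helper `estimated_hanzi` is a name made up for illustration.

```python
ENGLISH_BYTES = 10556    # byte count reported for the English speech
CLASSICAL_BYTES = 10752  # byte count reported for the 文言文 version

def estimated_hanzi(total_bytes: int, bytes_per_hanzi: int) -> int:
    """Rough character count, pretending every byte belongs to a Hanzi."""
    return total_bytes // bytes_per_hanzi

for name, width in (("GB2312", 2), ("UTF-8", 3)):
    count = estimated_hanzi(CLASSICAL_BYTES, width)
    # Share of the English character count, assuming one byte per English character.
    print(name, count, f"{count / ENGLISH_BYTES:.0%}")

# GB2312 5376 51%
# UTF-8 3584 34%
```

Under either assumption the 文言文 version would need far fewer characters than the English, somewhere between a third and a half.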