If you have been learning Chinese, you have no doubt wondered, “How many characters do I know?” Perhaps you would like to see if you’ve made a dent in the roughly 4,000 characters it usually takes to be considered literate. Or perhaps you want to see some measure of all that effort you’ve put into studying. Or maybe you want to impress your mom. Whatever your reason, you would rather not run through the roughly 10,000 characters in modern use to count how many you know.
Fear not! Statistics to the rescue
Fortunately, it only takes a relatively small sample of characters to accurately estimate how many characters you know overall. So we at WordSwing put together a little gadget to help you out. This gadget will sample characters and ask you a simple yes-or-no question: do you know the character? The more answers you give, the more accurate the estimate.
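To get a feel for why more answers sharpen the estimate, here is a deliberately simplified sketch. It is not how the gadget works: it assumes characters are sampled uniformly at random from a pool of 10,000, and it uses the textbook standard error of a proportion. The function name and the 95% margin are our own choices for illustration.

```python
import math

def estimate_with_margin(yes_count, sample_size, total_characters=10_000):
    """Simplified estimate assuming uniform random sampling (an assumption,
    not the gadget's actual method).

    Returns (estimated characters known, approximate 95% margin of error).
    """
    p = yes_count / sample_size                 # fraction of sampled characters known
    se = math.sqrt(p * (1 - p) / sample_size)   # standard error of that fraction
    estimate = total_characters * p
    margin = 1.96 * total_characters * se       # normal-approximation 95% margin
    return estimate, margin

# Same 30% "yes" rate, but four times as many answers.
small = estimate_with_margin(30, 100)
large = estimate_with_margin(120, 400)
print(f"100 answers: {small[0]:.0f} +/- {small[1]:.0f}")
print(f"400 answers: {large[0]:.0f} +/- {large[1]:.0f}")
```

Under these assumptions, both runs estimate 3,000 characters, but quadrupling the number of answers cuts the margin of error in half.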
But isn’t this question fraught with peril? you might ask. What does it mean to know a character? Well, luckily it can mean whatever you want it to mean. If you want it to mean the character is vaguely familiar, then the resulting estimate will be how many characters are vaguely familiar to you. If you only answer yes if you’ve mastered the character, then the estimate will be for how many characters you’ve mastered. If you want to try it several ways, just clear the estimate and start over.
Curious how it works?
As with any estimate, some assumptions go into it. We assume that the characters you know are a sample from the overall frequency distribution of Chinese characters (based on a large corpus of modern Chinese), and that you know each character with a probability proportional to its frequency. Each answer is then a binomial draw, and our goal is simply to estimate the constant of proportionality that scales the integral of the frequency distribution to the number of characters you know, which we do with a variant of logistic regression. This model is about as simple as one can make it, and the modeling assumptions are undoubtedly not exactly correct. Still, the law of large numbers should help us out, and we believe the result is a fairly accurate estimate of how many characters you know.
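For the curious, the idea can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the gadget's actual code: the frequencies are synthetic Zipf-like values standing in for a real corpus, the probability of knowing a character is taken as `c * frequency` capped at 1, and the constant `c` is fit by a plain grid search over the binomial likelihood rather than the logistic-regression variant mentioned above. All function names are hypothetical.

```python
import math
import random

def zipf_frequencies(n):
    """Synthetic stand-in for corpus frequencies: f_r proportional to 1/r."""
    weights = [1.0 / rank for rank in range(1, n + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def known_probability(c, freq):
    """Probability of knowing a character: proportional to frequency, capped at 1."""
    return min(1.0, c * freq)

def log_likelihood(c, answers):
    """Log-likelihood of yes/no answers, each treated as a single binomial draw."""
    eps = 1e-12  # keeps log() finite when the probability hits 0 or 1
    total = 0.0
    for freq, known in answers:
        p = min(max(known_probability(c, freq), eps), 1.0 - eps)
        total += math.log(p) if known else math.log(1.0 - p)
    return total

def fit_scale(answers, lo=1.0, hi=1e6, steps=600):
    """Grid search over log-spaced candidates for the constant of proportionality."""
    best_c, best_ll = lo, float("-inf")
    for k in range(steps + 1):
        c = lo * (hi / lo) ** (k / steps)
        ll = log_likelihood(c, answers)
        if ll > best_ll:
            best_c, best_ll = c, ll
    return best_c

def estimate_known_count(c, freqs):
    """Expected number of known characters: sum of per-character probabilities."""
    return sum(known_probability(c, f) for f in freqs)

def simulate_answers(freqs, true_c, sample_size, rng):
    """Simulate a learner answering yes/no on uniformly sampled characters."""
    picked = rng.sample(range(len(freqs)), sample_size)
    return [(freqs[i], rng.random() < known_probability(true_c, freqs[i]))
            for i in picked]

# Toy run: a simulated learner with a made-up "true" constant.
rng = random.Random(42)
freqs = zipf_frequencies(10_000)
answers = simulate_answers(freqs, true_c=3000.0, sample_size=300, rng=rng)
c_hat = fit_scale(answers)
print(f"estimated characters known: {estimate_known_count(c_hat, freqs):.0f}")
print(f"simulated true value:       {estimate_known_count(3000.0, freqs):.0f}")
```

The sum in `estimate_known_count` plays the role of the scaled integral of the frequency distribution. A real implementation would use actual corpus frequencies and would likely sample characters more cleverly than uniformly at random.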