Recently I came across a meme posted by a friend of mine. He is an English teacher and runs his own language school. The meme suggests that the Polish language relies heavily on letters located towards the end of the alphabet, somewhat in opposition to the English language. Since I already had both Polish and English dictionaries in digital form, as I needed them for my AI Hangman project, I figured I would do a quick check to verify this claim. Here we go!
I used an English dictionary containing around 466K entries and a Polish one containing around 4.3M (yes, four point three million) words. The mind-blowing number comes from the fact that we express tenses (past, present and future) by modifying the headword itself, so the number of possible forms of each word is much higher than in English.
Using Python, I iterated through all the words and calculated the probability of a letter occurring in a randomly selected word from each dictionary. The formula for each letter is simple:
Probability(letter) = Number of words that contain a given letter / number of all words in the dictionary.
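The formula above can be sketched in a few lines of Python. This is a minimal illustration rather than my original script: the function name `letter_probabilities` and the four-word sample list are made up for the example, and it assumes a word counts once for a letter no matter how many times that letter repeats in it.

```python
from collections import Counter


def letter_probabilities(words):
    """Return {letter: share of words that contain the letter at least once}."""
    # Normalise: strip whitespace, lowercase, drop empty entries.
    cleaned = [w.strip().lower() for w in words if w.strip()]
    counts = Counter()
    for word in cleaned:
        # set() makes a repeated letter count only once per word.
        counts.update({ch for ch in word if ch.isalpha()})
    total = len(cleaned)
    return {letter: n / total for letter, n in counts.items()}


# Toy word list, purely illustrative (NOT the real dictionaries).
sample = ["zamek", "woda", "kot", "zima"]
probs = letter_probabilities(sample)
print(probs["z"])  # 2 of 4 words contain "z"
```

With a real dictionary you would pass in the lines of the word list file instead of `sample`; the file layout (one word per line) is an assumption here.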
I used Tableau Public to visualize my data, as I like this software very much. The first graph shows the letters alphabetically. You can see there are some letters at the end that don’t exist in English. Also, the letters X, Q and V are not native to the Polish language. Only foreign words contain them, and there aren’t that many of those in Polish. Otherwise the shapes of the distributions are roughly similar.
The second graph shows which letters are most likely to appear in any randomly selected word. The top 5 in Polish are I, A, O, E and N. Not really letters from the end of the alphabet. But you can see that W, Z and Y are far more prevalent in Polish than in English.
To get a full view of the differences between English and Polish, I calculated the LIFT measure. It tells you how a number compares to a baseline. For instance, the probability of an O appearing in an English word is 49% (0.4899 on the graph above), but in Polish it is 67% (0.6670). So a Polish word is 36% more likely to contain an O than an English one. The graph below shows the differences with English as the baseline.
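As a quick sanity check of the arithmetic, here is the lift for the letter O computed from the two probabilities quoted above. It is a minimal sketch that assumes lift is the plain ratio of the Polish probability to the English one; the `lift` helper and the variable names are made up for the example.

```python
def lift(p_target, p_baseline):
    """Ratio of the target probability to the baseline probability."""
    return p_target / p_baseline


# Probabilities for the letter O quoted in the text.
p_o_english = 0.4899  # baseline
p_o_polish = 0.6670   # target

o_lift = lift(p_o_polish, p_o_english)
pct_change = (o_lift - 1) * 100
print(f"lift = {o_lift:.2f}, i.e. {pct_change:.0f}% more likely")
# prints: lift = 1.36, i.e. 36% more likely
```

A lift above 1 means the letter is more common in Polish than in English, and a lift below 1 means the opposite.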
It’s clear that we use more W, Y and Z, and these letters are indeed at the end of the alphabet. But we also use far more J and K, which sit closer to the middle. This is also a good example of the baseline fallacy: since the probability of a Z appearing in an English word is really small (only 3.5%), the percentage increase in probability seems enormous. If you look at the second graph again, you will see that W, Z and Y are not at the top of the most used letters but are 7th, 8th and 9th, respectively.
The much more liberal use of W, Y and Z in Polish creates the impression that we tend to use only the letters from the end of the alphabet, which does not seem to be supported by the numbers. Admittedly, if we took into consideration only the most frequently used words and not all dictionary entries, the result might be different. But with the available data we can only say that the meme is an exaggeration.