Monday, April 12, 2021

Zipf's law

Zipf’s (basic) law states that, across a corpus of natural language, the frequency of any word in that corpus is inversely proportional to its rank in the frequency table.

So the most frequent word, ranking first in the frequency table, sets the frequency for all the other, less frequent words. The second most frequent word is half (1/2) as common as that, the third is one-third (1/3) as common, and so on. This is readily seen in the two graphs: the first uses linear scales on both axes, and the second uses logarithmic scales, which turn the curve into a straight line.

Zipf's (basic) law shown in word frequencies in a corpus of written or spoken language, with linear axis scales.
The same values, this time with both axes being logarithmic in scale, transform the curve into a straight line.
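
As a minimal sketch of the rank vs. frequency rule, the Python snippet below generates ideal Zipf frequencies and checks that they fall on a straight line of slope -1 when both axes are logarithmic. The function names and the top frequency of 60,000 are illustrative assumptions, not values taken from a real corpus.

```python
import math


def zipf_frequencies(top_frequency, n_ranks):
    """Ideal Zipf frequencies: the word at rank r occurs top_frequency / r times."""
    return [top_frequency / rank for rank in range(1, n_ranks + 1)]


def loglog_slope(freqs):
    """Slope of the line through the first and last points on log-log axes."""
    first_rank, last_rank = 1, len(freqs)
    rise = math.log(freqs[-1]) - math.log(freqs[0])
    run = math.log(last_rank) - math.log(first_rank)
    return rise / run


# Hypothetical corpus in which the most frequent word occurs 60,000 times.
freqs = zipf_frequencies(top_frequency=60000, n_ranks=5)
print(freqs)                          # [60000.0, 30000.0, 20000.0, 15000.0, 12000.0]
print(round(loglog_slope(freqs), 6))  # -1.0
```

Real word counts only approximate this ideal, so a slope measured on actual text would be near -1 rather than exactly -1.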


"Zipf's law, the rank vs. frequency rule, also works if you apply it to the sizes of cities. The city with the largest population in any country is generally twice as large as the next-biggest, and so on. Incredibly, Zipf's law for cities has held true for every country in the world, for the past century."


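Zipf's law holds roughly even for the areas of the largest US states (in square miles). The last column is Alaska's area divided by the state's area, which the law predicts should equal the state's rank.

Rank  State        Area (sq mi)  Alaska / state
1     Alaska       570,641       1.00
2     Texas        261,914       2.18
3     California   155,973       3.66
4     Montana      145,556       3.92
5     New Mexico   121,365       4.70
6     Arizona      113,642       5.02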
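
The ratios listed with the state areas can be reproduced with a short Python sketch. It assumes the final number on each row is Alaska's area divided by that state's area, which matches the figures given. The helper name `ratios_to_largest` is hypothetical.

```python
# Areas in square miles, as listed above, largest first.
areas = [
    ("Alaska", 570641),
    ("Texas", 261914),
    ("California", 155973),
    ("Montana", 145556),
    ("New Mexico", 121365),
    ("Arizona", 113642),
]


def ratios_to_largest(items):
    """Return (name, rank, ratio) where ratio = largest value / this value."""
    largest = items[0][1]
    return [
        (name, rank, round(largest / value, 2))
        for rank, (name, value) in enumerate(items, start=1)
    ]


for name, rank, ratio in ratios_to_largest(areas):
    # Under an ideal Zipf distribution the ratio would equal the rank.
    print(f"{rank}  {name:<11} ratio {ratio:.2f}  (Zipf predicts {rank})")
```

The closer each ratio is to its rank, the better the fit. Here the match is near for Texas, Montana and New Mexico and looser for California and Arizona.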

