For learning purposes, I downloaded the 1-gram corpus and the total_counts data from http://ift.tt/ScU7eA for English, version 20090715.
When I query http://ift.tt/1j4HPCb for the word "finish" (case-insensitive) from 2007 to 2008 in the English (2009) corpus, it reports a value of 0.002169%.
But in the downloaded 1-gram corpus, the entry for "finish" in 2008 shows 3285 occurrences in 2682 pages from 678 books. For 2007, it shows 2233 occurrences in 1926 pages from 497 books.
In the total_counts data, the 2008 entry for 1-grams shows 13598879452 occurrences in 38478867 pages from 149373 books. For 2007, it shows 12552104939 occurrences in 31451897 pages from 112252 books.
Initially, I tried dividing the occurrences of "finish" (3285) by the total 1-gram occurrences (13598879452), but the result is different from 0.002169%.
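To make the mismatch concrete, here is a minimal sketch of the naive division described above, using only the counts quoted in this question. The names `finish_counts`, `total_counts`, and `naive_percent` are made up for illustration, and this is not claimed to be how the viewer computes its value.

```python
# Counts copied from the question (1-gram corpus and total_counts, English 20090715).
finish_counts = {2007: 2233, 2008: 3285}
total_counts = {2007: 12552104939, 2008: 13598879452}


def naive_percent(year):
    """Occurrences of the word divided by all 1-gram occurrences that year, as a percentage."""
    return 100.0 * finish_counts[year] / total_counts[year]


# Print the naive relative frequency for each year in the queried range.
for year in sorted(finish_counts):
    print(year, f"{naive_percent(year):.7f}%")
```

Both years come out on the order of 0.00002%, about two orders of magnitude below the 0.002169% shown by the viewer, so a missing percent conversion alone does not explain the gap.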
I searched the internet but could not find a formula or steps that would let me derive the value shown in the Ngram Viewer.
How exactly is the calculation done?
How to calculate the value that the Google Ngram Viewer shows for 1-grams from the unigram corpus