Sunday, September 21, 2014

WordCount: how inefficient is McIlroy's solution?



Long story short: in 1986 Jon Bentley asked Donald Knuth to write a program that takes a text and a number N as input and lists the N most frequently used words, sorted by frequency. Knuth produced a ten-page Pascal program, to which Douglas McIlroy replied with the following six-line shell pipeline:



tr -cs A-Za-z '\n' |
tr A-Z a-z |
sort |
uniq -c |
sort -rn |
sed ${1}q


Read the full story at http://ift.tt/rBYk3G .
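To see the pipeline in action, it can be pasted into a shell and fed a small sample directly. Here it is annotated, with a literal 2 standing in for the ${1} argument (the sample sentence is made up):

```shell
# Annotated version of McIlroy's pipeline, asked for the top 2 words.
printf 'The quick brown fox jumps over the lazy dog. The dog sleeps.\n' |
tr -cs A-Za-z '\n' |   # squeeze every run of non-letters into one newline: one word per line
tr A-Z a-z |           # fold upper case to lower, so "The" and "the" count together
sort |                 # bring identical words next to each other...
uniq -c |              # ...so uniq can collapse each run into "count word"
sort -rn |             # numeric sort on the count, descending
sed 2q                 # print the first 2 lines, then quit early
# prints "3 the" then "2 dog" (uniq -c adds leading padding)
```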


Of course they had very different goals: Knuth was demonstrating his concept of literate programming and built everything from scratch, while McIlroy composed a few common UNIX utilities to get the shortest possible source code.


My question is: how bad is that?

(Purely from a runtime-speed point of view, since I'm pretty sure we all agree that six lines of code are easier to understand and maintain than ten pages, literate programming or not.)


I can understand that sort -rn | sed ${1}q may not be the most efficient way to extract the top words, but what's wrong with tr -cs A-Za-z '\n' | tr A-Z a-z? It looks pretty good to me. And about sort | uniq -c: is that a terribly slow way to determine the frequencies?
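For comparison: counting via sort is O(n log n) in the number of words, while an associative array counts in expected linear time, leaving only the much smaller table of distinct words to sort. A sketch of that alternative using awk (sample input invented):

```shell
# One-pass, hash-based replacement for the "sort | uniq -c" counting step:
# awk tallies each word in an associative array, so the input is never
# sorted; only the distinct-word table goes through sort -rn.
printf 'the dog the cat the\n' |
tr -cs A-Za-z '\n' |
tr A-Z a-z |
awk '{ count[$0]++ } END { for (w in count) print count[w], w }' |
sort -rn |
sed 1q
# prints "3 the"
```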


A few considerations:



  • tr should be linear time (?)

  • sort I'm not sure of, but I'm assuming it's O(n log n), which isn't that bad

  • uniq should be linear time too

  • spawning processes should be linear time (in the number of processes)
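These guesses can also be checked empirically. A rough sketch (the file path and corpus size are arbitrary choices) that builds a synthetic input of 100,000 two-letter words and times the pipeline on it:

```shell
# Build a synthetic corpus: 100,000 two-letter words, 26 distinct
# ("wa" .. "wz"), so the distinct-word table stays small.
awk 'BEGIN { for (i = 0; i < 100000; i++) printf "w%c\n", 97 + i % 26 }' > /tmp/words.txt

# Time the whole pipeline end to end; output is discarded so only the
# counting work is measured, not terminal I/O. Timing goes to stderr.
time sh -c "tr -cs A-Za-z '\n' < /tmp/words.txt |
            tr A-Z a-z | sort | uniq -c | sort -rn | sed 10q > /dev/null"
```

Swapping individual stages (for example the awk counter above in place of sort | uniq -c) and re-timing gives a per-stage picture.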










