Sunday, September 21, 2014

WordCount: how inefficient is McIlroy's solution?



Long story short: in 1986 Jon Bentley asked Donald Knuth to write a program that takes a text and a number N as input and lists the N most frequently used words, sorted by frequency. Knuth produced a ten-page Pascal program, to which Douglas McIlroy replied with the following six-line shell pipeline:



tr -cs A-Za-z '\n' |
tr A-Z a-z |
sort |
uniq -c |
sort -rn |
sed ${1}q


Read the full story at http://ift.tt/rBYk3G .
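To see the pipeline in action, it can be pasted into a shell and fed a small sample directly. Here it is annotated, with a literal 2 standing in for the ${1} argument (the sample sentence is made up):

```shell
# Annotated version of McIlroy's pipeline, asked for the top 2 words.
printf 'The quick brown fox jumps over the lazy dog. The dog sleeps.\n' |
tr -cs A-Za-z '\n' |   # squeeze every run of non-letters into one newline: one word per line
tr A-Z a-z |           # fold upper case to lower, so "The" and "the" count together
sort |                 # bring identical words next to each other...
uniq -c |              # ...so uniq can collapse each run into "count word"
sort -rn |             # numeric sort on the count, descending
sed 2q                 # print the first 2 lines, then quit early
# prints "3 the" then "2 dog" (uniq -c adds leading padding)
```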


Of course they had very different goals: Knuth was demonstrating his concept of literate programming and built everything from scratch, while McIlroy composed a few common UNIX utilities to get the shortest possible source code.


My question is: how bad is that?

(Purely from a runtime-speed point of view, since I'm pretty sure we all agree that six lines of code are easier to understand and maintain than ten pages, literate programming or not.)


I can understand that sort -rn | sed ${1}q may not be the most efficient way to extract the top words, but what's wrong with tr -cs A-Za-z '\n' | tr A-Z a-z? It looks pretty good to me. And about sort | uniq -c: is that a terribly slow way to determine the frequencies?
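For comparison: counting via sort is O(n log n) in the number of words, while an associative array counts in expected linear time, leaving only the much smaller table of distinct words to sort. A sketch of that alternative using awk (sample input invented):

```shell
# One-pass, hash-based replacement for the "sort | uniq -c" counting step:
# awk tallies each word in an associative array, so the input is never
# sorted; only the distinct-word table goes through sort -rn.
printf 'the dog the cat the\n' |
tr -cs A-Za-z '\n' |
tr A-Z a-z |
awk '{ count[$0]++ } END { for (w in count) print count[w], w }' |
sort -rn |
sed 1q
# prints "3 the"
```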


A few considerations:



  • tr should be linear time (?)

  • sort I'm not sure of, but I'm assuming it's O(n log n), which isn't that bad

  • uniq should be linear time too

  • spawning processes should be linear time (in the number of processes)
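These guesses can also be checked empirically. A rough sketch (the file path and corpus size are arbitrary choices) that builds a synthetic input of 100,000 two-letter words and times the pipeline on it:

```shell
# Build a synthetic corpus: 100,000 two-letter words, 26 distinct
# ("wa" .. "wz"), so the distinct-word table stays small.
awk 'BEGIN { for (i = 0; i < 100000; i++) printf "w%c\n", 97 + i % 26 }' > /tmp/words.txt

# Time the whole pipeline end to end; output is discarded so only the
# counting work is measured, not terminal I/O. Timing goes to stderr.
time sh -c "tr -cs A-Za-z '\n' < /tmp/words.txt |
            tr A-Z a-z | sort | uniq -c | sort -rn | sed 10q > /dev/null"
```

Swapping individual stages (for example the awk counter above in place of sort | uniq -c) and re-timing gives a per-stage picture.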










