jeudi 16 avril 2015

Efficiently counting non-NA elements in data.table


Vote count:

0




Sometimes I need to count the number of non-NA elements in one or another column in my data.table. What is the best data.table-tailored way to do so?


For concreteness, let's work with this:



dt<-data.table(id=(1:100)[sample(10,size=1e6,replace=T)],var=c(1,0,NA)[sample(3,size=1e6,replace=T)],key="id")


The first thing that comes to my mind works like this:



dt[!is.na(var),N:=.N,by=id]


But this has the unfortunate shortcoming that N does not get assigned to any row where var is missing, i.e. dt[is.na(var),N]=NA.


So I work around this by appending:



dt[!is.na(var),N:=.N,by=id][,N:=max(N,na.rm=T),by=id]


However, I'm not sure this is the best approach; another option I thought of and one suggested by the analog to this question for data.frames would be:



dt[,N:=length(var[!is.na(var)]),by=id]


and



dt[,N:=sum(!is.na(var)),by=id]


Comparing computation time of these (average over 100 trials), the last seems to be the fastest:



time 1 | time 2 | time 3
.075 | .065 | .043


Does anyone know a speedier way for data.table?



asked 35 secs ago







Efficiently counting non-NA elements in data.table

Aucun commentaire:

Enregistrer un commentaire