I'm working with a large number (1,983) of CSV files. Posts on Stack Overflow have said that lists are easier to work with, so I've approached my task that way. I have read the CSVs in and accomplished the first part of my task: what is the maximum number of concurrent users of the application? (Answer: 203.) Here's that code:
# get a list of the files (pattern is a regex, so the dot is escaped)
files <- list.files("my_path_here", pattern = "\\.CSV$", recursive = TRUE, full.names = TRUE)
# read in the CSVs and store them as a list of data frames
tables <- lapply(files, read.csv)
# store the count of users (rows) for each file here
counts <- rep(NA, length(tables))
# loop through the files to find the count and store that value
for (i in seq_along(tables)) {
  counts[i] <- nrow(tables[[i]])
}
# what's the largest number?
max(counts)
# 203
The second part of the task is to show the count of each title for each file. The contents of each file will be something like this:
compute_0001 compute_0002
[1] 3/26/2015 6:00:00 Business System Manager;Lead CoPath Analyst
[2] Regional Histotechnologist;Hist Tech - Ht
[3] Regional Histotechnologist;Tissue Tech
[4] SDX Histotechnologist;Histology Tech
[5] SDX Histotechnologist;Histology Tech
[6] Regional Histotechnologist;Lab Asst II Histology
[7] CytoPrep Tech;Histo Tech - Ht
[8] Regional Histotechnologist;Tissue Tech
[9] Histology Supervisor;Supv Reg Lab Unit
[10] Histotech/FC Tech/PA/Diener;Pathology Tissue Technician;;CONTRACT
What differs from file to file is the timestamp in compute_0001, the name of the file, and the number of users (i.e. the length of the file). My approach was to try this:
>col2 <- sapply(tables,summary, maxsum=300) # gives me a list of 1983 elements that is 23.6Mb
(I noticed that when doing a summary() on the files I would get something like this - which is why I was trying it)
>col2[[1]]
compute_0001 compute_0002
[1] Business System Manager;Lead CoPath Analyst :1
[2] Regional Histotechnologist;Hist Tech - Ht :1
[3] Regional Histotechnologist;Tissue Tech :1
[4] SDX Histotechnologist;Histology Tech :1
[5] SDX Histotechnologist;Histology Tech :1
[6] Regional Histotechnologist;Lab Asst II Histology :2
[7] CytoPrep Tech;Histo Tech - Ht :4
[8] Regional Histotechnologist;Tissue Tech :1
[9] Histotech/FC Tech/PA/Diener;Pathology Tissue Technician;;CONTRACT :1
The above is actually many different people. For my purposes, [2], [3], [6] and [8] are the same title (even though the stuff after the ";" is different). The truth is that even [4] and [5] could also be considered the same as [2], [3], [6] and [8].
That ":1" (or generally ":#") is the number of users with that title at that particular time. I was hoping to grab that character, make it numeric and add them up to get a count of the users with each title for each file. Each file is an observation at a particular datetime.
I tried something like this:
>for (s in 1:length(col2)) {
>split <- strsplit(col2[[s]][,2], ":")
>#... make it numeric so I can do addition with it
>num <- as.numeric(split[[s]][2])
>#... and put it in the correct df
>tables[[s]]$count <- num
# After dealing with the ":" I was going to handle splitting on the first ";"
>}
But I couldn't get the loop to iterate more than a single time or past the first element of col2.
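One likely reason the loop stops early is that `split[[s]][2]` reuses the file index `s` to pick a row inside the current file's split result, so it errors as soon as `s` is larger than the number of rows in that summary. A minimal sketch, assuming strings shaped like the summary output above (the `summary_col` vector is made up for illustration), that takes the count from every row instead:

```r
# Toy stand-in for col2[[s]][, 2]: strings of the form "title :count"
summary_col <- c("Business System Manager;Lead CoPath Analyst :1",
                 "Regional Histotechnologist;Lab Asst II Histology :2",
                 "CytoPrep Tech;Histo Tech - Ht :4")

# Split every row on the literal ":" (fixed = TRUE means no regex)
parts <- strsplit(summary_col, ":", fixed = TRUE)

# Take the second piece of each row and convert it to a number
counts <- vapply(parts, function(p) as.numeric(p[2]), numeric(1))
```

This only works when the title itself contains no ":"; counting the raw column with `table()` avoids parsing the printed summary at all.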
A more experienced useR suggested something like this:
>strsplit(x = as.character(compute2[[s]]),split=";",fixed=TRUE)
He said "However this results in a messy list also, since there are multiple ";" in some lines. What I would suggest is to use grep() with a regex that returns the text before the first ";"- use that with sapply(compute2,grep()) and then you can run sapply(??,table) on the list that is returned to tally the job titles."
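As far as I can tell, `grep()` returns the matching elements or their indices rather than a piece of each string, so the sketch below swaps in `sub()`, which is the base R function for replacing a matched part. It is only an illustration of the "text before the first `;`" idea, and the `titles_raw` vector is a made-up sample taken from the rows shown earlier:

```r
# Made-up sample of the second column of one file
titles_raw <- c("Business System Manager;Lead CoPath Analyst",
                "Regional Histotechnologist;Hist Tech - Ht",
                "Regional Histotechnologist;Tissue Tech",
                "Histotech/FC Tech/PA/Diener;Pathology Tissue Technician;;CONTRACT")

# ";.*" matches the first ";" and everything after it; replacing that
# match with "" leaves only the text before the first ";"
titles <- sub(";.*", "", titles_raw)

# Tally how many rows share each collapsed title
title_counts <- table(titles)
```

The regex here is a single short pattern, so the approach does not require much regex knowledge.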
I'd prefer not to get into regex but, following his advice, I tried:
>for (s in 1:length(tables)){
>+ split <- strsplit(x = as.character(compute2[[s]]),split=";",fixed=TRUE)
>+ }
split is a list of only 122 elements, not nearly long enough, so it's not iterating through the loop either. So, I figured I'd skip the loop and try:
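A possible explanation is that the loop does run, but each pass overwrites `split`, so only the last file's result survives (122 may simply be the number of rows in that last file). A minimal sketch that keeps one result per file, assuming `compute2` is a list holding the second column of each file (the toy `compute2` and the name `title_list` are hypothetical):

```r
# Hypothetical stand-in for compute2: one character vector of titles per file
compute2 <- list(
  c("Regional Histotechnologist;Tissue Tech", "CytoPrep Tech;Histo Tech - Ht"),
  c("SDX Histotechnologist;Histology Tech")
)

# Preallocate one slot per file so each iteration keeps its own result
title_list <- vector("list", length(compute2))

for (s in seq_along(compute2)) {
  pieces <- strsplit(as.character(compute2[[s]]), split = ";", fixed = TRUE)
  # Keep only the first piece (the text before the first ";") of each row
  title_list[[s]] <- vapply(pieces, function(p) p[1], character(1))
}
```

Because `fixed = TRUE` is used, this version splits on the literal ";" and involves no regex.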
>title_split<- sapply(compute2, strsplit, x = as.character(compute2[[1]]),split=";",fixed=TRUE)
But that gave me more than 50 warnings and a matrix that had 105,000+ elements that was 20.2Mb in size.
Like I said, I'd prefer to not venture into the world of regex, since I think I should be able to split on the ":" first and then the first of the ";" and return the string that precedes the ";". I'm just not sure why the loop is failing.
What I eventually want is a table that shows the count of each title (collapsed for duplicates like [2],[3], [6] and [8] above) for each file (which represents an observation at a particular datetime). I'm pretty agnostic as to approach, so if I have to do it via regex, then so be it.
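One way such a table could be built is sketched below. It assumes every data frame has a `compute_0002` column as in the sample above; the toy `tables`, the file names in `files`, and the names `long` and `counts_by_file` are all made up for illustration:

```r
# Toy stand-ins for the real list of data frames and their file names
tables <- list(
  data.frame(compute_0001 = "3/26/2015 6:00:00",
             compute_0002 = c("Regional Histotechnologist;Tissue Tech",
                              "Regional Histotechnologist;Hist Tech - Ht",
                              "CytoPrep Tech;Histo Tech - Ht"),
             stringsAsFactors = FALSE),
  data.frame(compute_0001 = "3/26/2015 7:00:00",
             compute_0002 = c("CytoPrep Tech;Histo Tech - Ht",
                              "Histology Supervisor;Supv Reg Lab Unit"),
             stringsAsFactors = FALSE)
)
files <- c("file_a.CSV", "file_b.CSV")  # hypothetical file names

# Build one long data frame with a (file, title) row per user,
# keeping only the text before the first ";" as the title
long <- do.call(rbind, lapply(seq_along(tables), function(i) {
  data.frame(file  = files[i],
             title = sub(";.*", "", as.character(tables[[i]]$compute_0002)),
             stringsAsFactors = FALSE)
}))

# Cross-tabulate: one row per file, one column per collapsed title
counts_by_file <- table(long$file, long$title)
```

Titles that differ before the ";" (such as the SDX and Regional variants) would still need an explicit mapping before the `table()` call if they should be merged.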
Sorry for the lengthy post, but I suspect that part of my problem (besides being brand new to Stack Overflow and R, and not understanding regex well) is that I'm not well versed in list manipulation, and I wanted you to have the context.
Many thanks for reading.
How do I parse the contents of hundreds of CSVs that are in a list of data frames and split on ":" and ";" in loops?