r - Split a single column into a multiple columns dataframe -
so i've solved initial problem using suggestion g. grothendieck, again, clean way of doing after. initial post here. reality file little more subtle .
it looks (see data section @ end of post reproducible format):
a1 100 200 txt a2 string 300 400 txt txt txt txt txt txt a3 string string 150 250 a2 . . .
preliminary data wrangling looks :
type <- cumsum(raw_data[[1]] %in% c("a1","a2","a3")) v <- tapply(raw_data[[1]], type, c, simplify = false) m <- t(do.call(cbind, lapply(v, ts))) raw_data<- as.data.frame(m, stringsasfactors = false) raw_data[] <- lapply(raw_data, type.convert, as.is = true) raw_data$occurences <- 0
giving :
v1 v2 v3 v4 v5 v6 v7 1 a1 100 200 txt <na> <na> <na> 2 a2 string 300 400 txt txt txt txt txt txt 3 a3 string string 150 250 <na> <na> 4 a2 <na> <na> <na> <na> <na> <na>
the issue df[3,4] should in df[3,2], , should state "2" in new column. same goes on line 2, df[2,3] should in df[2,2] , state "1" in same addiotional column. in other words, i'm chasing :
v1 v2 v3 v4 v5 v6 occurences 1 a1 100 200 txt <na> <na> 0 2 a2 300 400 txt txt txt txt txt txt 1 3 a3 150 250 <na> <na> <na> 2 4 a2 na na <na> <na> <na> 0
the string right after something, not occur , 1 time or several occurences. did handle problem :
#count "string" occurences , readjust values in expected columns formatstring <- function(df) { z <<- which(df[,2] %in% "string") if (length(z) > 0){ (i in z){ df$occurences = df$occurences + 1 (j in 2:ncol(df)-1){ if (is.na(df[i,j]) | is.na(df[i,j+1])){ df[i,j] = na } else { df[i,j] = df[i,j+1] } } } } z <<- which(df[,2] %in% "string") if(length(z) > 0){formatstring(df)} }
this function supposed process exclusively lines string found in column 2, increment last column (occurences) , shift values 1 column left go expected be. is.na thing try , stop loop when we're starting see nas. once we've processed these lines, again if string in column 2, , if call again function.
now problem function looks processing (takes 20 sec on 19k observations , 261 columns, not sure it's best in terms of processing time), expect dataframe not updated @ end of loop. z got updated seems working way should.
what missing ?
the data
data in reproductible form :
df <- structure(list(v1 = c("a1", "100", "200", "txt ", "a2", "string", "300","400", "txt txt", "txt", "txt txt txt", "a3", "string", "string", "150", "250", "a2")), .names = "v1", row.names = c(na, -14l), class = "data.frame")
opposed op's approach, general idea of suggestion data cleaning in long form , reshape data long wide form last step.
it appears main objective have integer values aligned in columns v2
, v3
of final wide format table while keeping track of number of removed string
rows between group header , appearence of first integer row within each group.
therefore, data.table
(development version 1.9.7) approach below looking first row containing integer value within each group rather remove rows containing explicitely character string string
. thus, approach more flexible.
in addition, assumed same group header may appear multiple times.
library(data.table) # read data (to make reproducible example) dt <- fread("a1 100 200 txt a2 string 300 400 txt txt txt txt txt txt a3 string string 150 250 a2 ", header = false, sep = "\n") # identify group headers regular expression , push them down dt <- dt[v1 %like% "^a[1-3]$", grp := v1][, grp := zoo::na.locf(grp)] # count groups in case of multiple appearances of same group headers dt[v1 == grp, grp_cnt := .i][, grp_cnt := zoo::na.locf(grp_cnt)] # remove "string" rows # add row count within each individual group dt[, id := seq_len(.n), = grp_cnt] # find first occurrence of integer in each group regex first_int <- dt[v1 %like% "^\\d+$", .(min_id = min(id)), = grp_cnt] # non-equi join start each group row containing first integer # (requires data.table development version 1.9.7) dt <- dt[first_int, on = c("grp_cnt", "id>=min_id")] # compute occurences number of dropped "string" rows dt[, occurences := id - 2l] print(dt) # grp_cnt v1 grp id # 1: 1 100 a1 2 # 2: 1 200 a1 3 # 3: 1 txt a1 4 # 4: 2 300 a2 3 # 5: 2 400 a2 4 # 6: 2 txt txt a2 5 # 7: 2 txt a2 6 # 8: 2 txt txt txt a2 7 # 9: 3 150 a3 4 #10: 3 250 a3 5 # prepare dcast: add column names each group # (one added have column names in line q) dt[, col := paste0("v", seq_len(.n) + 1), = grp] # reshape long wide form z <- dcast(dt, grp_cnt + grp + occurences ~ col, value.var = "v1")[, grp_cnt := null] # type conversion on new columns new_cols <- dt[, unique(col)] z[, (new_cols) := lapply(.sd, type.convert, as.is = true), .sdcols = new_cols] print(z) # grp occurences v2 v3 v4 v5 v6 #1: a1 0 100 200 txt na na #2: a2 1 300 400 txt txt txt txt txt txt #3: a3 2 150 250 na na na str(z) #classes ‘data.table’ , 'data.frame': 3 obs. of 7 variables: # $ grp : chr "a1" "a2" "a3" # $ occurences: int 0 1 2 # $ v2 : int 100 300 150 # $ v3 : int 200 400 250 # $ v4 : chr "txt" "txt txt" na # $ v5 : chr na "txt" na # $ v6 : chr na "txt txt txt" na # - attr(*, ".internal.selfref")=<externalptr>
note, second appearance of a2
has been dropped because there aren't rows below second a2
(in original one-column file) contain integer value.
if production data contain other group headers a1
, a2
, , a3
regular expression identify group headers has amended accordingly.
column names in line op's expected result (except v1
called grp
clarity). column order different shouldn't relevant.
Comments
Post a Comment