r - Split a single column into a multiple columns dataframe -


so i've solved initial problem using suggestion g. grothendieck, again, clean way of doing after. initial post here. reality file little more subtle .

it looks (see data section @ end of post reproducible format):

a1 100 200 txt  a2 string 300 400 txt txt txt txt txt txt a3 string string 150 250 a2 . . . 

preliminary data wrangling looks :

type <- cumsum(raw_data[[1]] %in% c("a1","a2","a3")) v <- tapply(raw_data[[1]], type, c, simplify = false) m <- t(do.call(cbind, lapply(v, ts)))  raw_data<- as.data.frame(m, stringsasfactors = false) raw_data[] <- lapply(raw_data, type.convert, as.is = true)  raw_data$occurences <- 0 

giving :

  v1     v2     v3   v4      v5   v6          v7 1 a1    100    200 txt     <na> <na>        <na> 2 a2 string    300  400 txt txt  txt txt txt txt 3 a3 string string  150     250 <na>        <na> 4 a2   <na>   <na> <na>    <na> <na>        <na> 

the issue df[3,4] should in df[3,2], , should state "2" in new column. same goes on line 2, df[2,3] should in df[2,2] , state "1" in same addiotional column. in other words, i'm chasing :

  v1  v2  v3      v4   v5          v6 occurences 1 a1 100 200    txt  <na>        <na>          0 2 a2 300 400 txt txt  txt txt txt txt          1 3 a3 150 250    <na> <na>        <na>          2 4 a2  na  na    <na> <na>        <na>          0 

the string right after something, not occur , 1 time or several occurences. did handle problem :

#count "string" occurences , readjust values in expected columns formatstring <- function(df) {   z <<- which(df[,2] %in% "string")   if (length(z) > 0){     (i in z){       df$occurences = df$occurences + 1       (j in 2:ncol(df)-1){             if (is.na(df[i,j]) | is.na(df[i,j+1])){           df[i,j] = na         } else {           df[i,j] = df[i,j+1]         }       }     }   }   z <<- which(df[,2] %in% "string")   if(length(z) > 0){formatstring(df)} } 

this function supposed process exclusively lines string found in column 2, increment last column (occurences) , shift values 1 column left go expected be. is.na thing try , stop loop when we're starting see nas. once we've processed these lines, again if string in column 2, , if call again function.

now problem function looks processing (takes 20 sec on 19k observations , 261 columns, not sure it's best in terms of processing time), expect dataframe not updated @ end of loop. z got updated seems working way should.

what missing ?

the data

data in reproductible form :

df <- structure(list(v1 = c("a1", "100", "200", "txt ", "a2", "string", "300","400", "txt txt", "txt", "txt txt txt", "a3", "string", "string", "150", "250", "a2")), .names = "v1", row.names = c(na, -14l), class = "data.frame") 

opposed op's approach, general idea of suggestion data cleaning in long form , reshape data long wide form last step.

it appears main objective have integer values aligned in columns v2 , v3 of final wide format table while keeping track of number of removed string rows between group header , appearence of first integer row within each group.

therefore, data.table (development version 1.9.7) approach below looking first row containing integer value within each group rather remove rows containing explicitely character string string. thus, approach more flexible.

in addition, assumed same group header may appear multiple times.

library(data.table)  # read data (to make reproducible example) dt <- fread("a1             100             200             txt              a2             string             300             400             txt txt             txt             txt txt txt             a3             string             string             150             250             a2             ", header = false, sep = "\n")  # identify group headers regular expression , push them down dt <- dt[v1 %like% "^a[1-3]$", grp := v1][, grp := zoo::na.locf(grp)] # count groups in case of multiple appearances of same group headers dt[v1 == grp, grp_cnt := .i][, grp_cnt := zoo::na.locf(grp_cnt)]  # remove "string" rows # add row count within each individual group dt[, id := seq_len(.n), = grp_cnt]  # find first occurrence of integer in each group regex first_int <- dt[v1 %like% "^\\d+$", .(min_id = min(id)), = grp_cnt]  # non-equi join start each group row containing first integer # (requires data.table development version 1.9.7) dt <- dt[first_int, on = c("grp_cnt", "id>=min_id")]  # compute occurences number of dropped "string" rows dt[, occurences := id - 2l]  print(dt) #    grp_cnt          v1 grp id # 1:       1         100  a1  2 # 2:       1         200  a1  3 # 3:       1         txt  a1  4 # 4:       2         300  a2  3 # 5:       2         400  a2  4 # 6:       2     txt txt  a2  5 # 7:       2         txt  a2  6 # 8:       2 txt txt txt  a2  7 # 9:       3         150  a3  4 #10:       3         250  a3  5  # prepare dcast: add column names each group # (one added have column names in line q) dt[, col := paste0("v", seq_len(.n) + 1), = grp]  # reshape long wide form z <- dcast(dt, grp_cnt + grp + occurences ~ col, value.var = "v1")[, grp_cnt := null]  # type conversion on new columns new_cols <- dt[, unique(col)] z[, (new_cols) := lapply(.sd, type.convert, as.is = true), .sdcols = new_cols]  print(z) #   grp occurences  v2  v3      v4  v5          v6 #1:  a1          0 100 200     txt  na          na #2:  a2          1 300 400 txt txt txt txt txt txt #3:  a3          2 150 250      na  na          na  str(z) #classes ‘data.table’ , 'data.frame': 3 obs. of  7 variables: # $ grp       : chr  "a1" "a2" "a3" # $ occurences: int  0 1 2 # $ v2        : int  100 300 150 # $ v3        : int  200 400 250 # $ v4        : chr  "txt" "txt txt" na # $ v5        : chr  na "txt" na # $ v6        : chr  na "txt txt txt" na # - attr(*, ".internal.selfref")=<externalptr>  

note, second appearance of a2 has been dropped because there aren't rows below second a2 (in original one-column file) contain integer value.

if production data contain other group headers a1, a2, , a3 regular expression identify group headers has amended accordingly.

column names in line op's expected result (except v1 called grp clarity). column order different shouldn't relevant.


Comments

Popular posts from this blog

php - How to display all orders for a single product showing the most recent first? Woocommerce -

asp.net - How to correctly use QUERY_STRING in ISAPI rewrite? -

angularjs - How restrict admin panel using in backend laravel and admin panel on angular? -