R: Find optimal sub-matrix
I have a large symmetric matrix and want to reduce it to a much smaller matrix matrix_small with nrow(matrix_small) = n, such that the mean of matrix_small is maximized. Is there a way to achieve this goal in R with a better algorithm than the one I have? Better means either faster with the same mean, or a higher mean at the same speed.
I feel there should be a smarter way than searching for the minimum so often. I'm not aware of a way to set up an SQL-like index on a matrix in R to increase performance.
library(microbenchmark)

set.seed(2016)
sym_matrix <- matrix(runif(1e+06), ncol = 1000)
sym_matrix[lower.tri(sym_matrix)] <- t(sym_matrix)[lower.tri(sym_matrix)]
diag(sym_matrix) <- NA
rownames(sym_matrix) <- 1:1000
colnames(sym_matrix) <- 1:1000

findnrows <- function(sym_matrix, nrows){
  # Return a sub-matrix with nrow(matrix) = nrows whose mean should be maximized
  set.seed(2017)
  k <- nrow(sym_matrix)
  for (i in nrows:(k-1)) {
    # eliminate a row/column containing the current minimum value
    min_rows <- arrayInd(which.min(sym_matrix), dim(sym_matrix))
    choose_row <- sample(min_rows, 1)
    sym_matrix <- sym_matrix[-choose_row, -choose_row]
  }
  sym_matrix
}

microbenchmark(findnrows(sym_matrix = sym_matrix, nrows = 10), times = 25L)
mean(findnrows(sym_matrix = sym_matrix, nrows = 10), na.rm = TRUE)
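To see what the minimum-elimination step is doing, here is the lookup on a tiny symmetric matrix (a minimal illustration; the 3 x 3 values are made up and are not part of the benchmark above):

m <- matrix(c(NA, 0.2, 0.8,
              0.2, NA, 0.5,
              0.8, 0.5, NA), ncol = 3)
# which.min ignores NAs; arrayInd converts the linear index to a (row, column) pair
arrayInd(which.min(m), dim(m))
# the loop above then drops one of the two coordinates, chosen at random, as the next row/column to remove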
The problem is one of finding the nrows rows (and likewise columns) of the symmetric matrix that maximize the sum of the elements in the selected sub-matrix. Unlike the so-called maximum subarray problem in 2D, which has a solution using Kadane's algorithm, the key issue here is that the chosen rows need not be contiguous. As a result, the problem seems much harder as a combinatorial optimization. A brute-force approach that searches over all combinations of nrows rows (here 10) out of N rows (here 1000) is impractical. However, a simple approach that is different from the OP's algorithm is a random search in the space of combinations: randomly select nrows rows (and likewise columns) of the symmetric matrix at each trial, and keep the best set of nrows rows across sequential trials:
findnrows.random <- function(sym_matrix, nrows, ntrials){
  set.seed(2017)
  # start from one random selection of nrows rows/columns
  s.rows <- sample.int(nrow(sym_matrix), nrows)
  s <- sym_matrix[s.rows, s.rows]
  for (i in 1:ntrials) {
    # draw a new random selection and keep it if its sum is larger
    t.rows <- sample.int(nrow(sym_matrix), nrows)
    t <- sym_matrix[t.rows, t.rows]
    if (sum(s, na.rm = TRUE) < sum(t, na.rm = TRUE)) {
      s.rows <- t.rows
      s <- t
    }
  }
  return(s)
}
This algorithm, even implemented in R, is fast for a large number of trials, and 1000 trials produces a result (for this particular data set and seed) that is surprisingly on par with the OP's result while being about 500 times faster. This speaks more towards the sub-optimality of the OP's algorithm than the optimality of random search, because 1000 samples cover only a tiny portion of the overall search space. In addition, by construction, the performance in terms of the mean value of the selected sub-matrix can only improve (never get worse) as the number of trials increases. Therefore, for the same compute time, the simple random search will outperform the OP's algorithm.
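To put the size of that search space in perspective, the number of distinct 10-row subsets of a 1000-row matrix can be computed directly:

choose(1000, 10)
## roughly 2.6e23 possible subsets, so 1000 trials barely scratch the surface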
## OP results
microbenchmark(findnrows(sym_matrix = sym_matrix, nrows = 10), times = 25L)
##Unit: seconds
##                                           expr      min       lq     mean  median       uq      max neval
## findnrows(sym_matrix = sym_matrix, nrows = 10) 11.67548 11.69193 11.70937 11.6997 11.71076 11.87105    25
mean(findnrows(sym_matrix = sym_matrix, nrows = 10), na.rm = TRUE)
##[1] 0.6256406

## Random search
microbenchmark(findnrows.random(sym_matrix = sym_matrix, nrows = 10, ntrials = 1000), times = 25L)
##Unit: milliseconds
##                                                                   expr      min       lq     mean   median       uq      max neval
## findnrows.random(sym_matrix = sym_matrix, nrows = 10, ntrials = 1000) 21.81462 23.20069 27.06079 23.89368 26.25163 46.77016    25
mean(findnrows.random(sym_matrix = sym_matrix, nrows = 10, ntrials = 1000), na.rm = TRUE)
##[1] 0.6374652
Now, instead of simply throwing away the previous set of selected nrows rows whenever the next set has a bigger sum, we can seek to improve the performance of the random search by trying to improve the previous set of selected rows using the newly selected set. The heuristic employed here uses the rowSums of the resulting sub-matrices. That is, at each trial we seek to replace rows in the current sub-matrix with rows from the newly selected sub-matrix that have larger rowSums (or, equivalently, larger rowMeans). This seems reasonable given that the values are uniformly distributed over the full matrix, because a row in the selected sub-matrix with a higher rowMean will, on average, also have higher-valued elements across the full row. Of course, after we replace rows (if any) in the current sub-matrix with rows from the newly selected sub-matrix to form a new sub-matrix, we still check whether this new sub-matrix is better (i.e., has a bigger sum) than the current sub-matrix before making it the current best sub-matrix for the next trial. The code follows:
findnrows.faster <- function(sym_matrix, nrows, ntrials){
  set.seed(2017)
  s.rows <- sample.int(nrow(sym_matrix), nrows)
  s.means <- rowSums(sym_matrix[s.rows, s.rows], na.rm = TRUE)
  for (i in 1:ntrials) {
    t.rows <- sample.int(nrow(sym_matrix), nrows)
    t.means <- rowSums(sym_matrix[t.rows, t.rows], na.rm = TRUE)
    st.rows <- c(s.rows, t.rows)
    st.means <- c(s.means, t.means)
    ## make sure there are no duplicate rows before choosing the best nrows
    dups <- duplicated(st.rows)
    st.rows <- st.rows[!dups]
    st.means <- st.means[!dups]
    ## keep the nrows rows with the largest rowSums and recompute their sums
    new.rows <- st.rows[order(st.means, decreasing = TRUE)[1:nrows]]
    new.means <- rowSums(sym_matrix[new.rows, new.rows], na.rm = TRUE)
    if (sum(s.means) < sum(new.means)) {
      s.rows <- new.rows
      s.means <- new.means
    }
  }
  sym_matrix[s.rows, s.rows]
}
This algorithm is somewhat slower, but the result is much better than plain random search. Note that the comparison with findnrows.random is apples to apples, since the same number of trials is used and the same seed randomly selects the same rows at each trial. However, we would expect an optimal algorithm to select a sub-matrix with a mean over 0.9 (a 10 x 10 sub-matrix needs only 45 distinct off-diagonal values, chosen from roughly half a million distinct pairs in the full matrix), so this algorithm is still far from optimal.
## Improved random search
microbenchmark(findnrows.faster(sym_matrix = sym_matrix, nrows = 10, ntrials = 1000), times = 25L)
##Unit: milliseconds
##                                                                   expr      min       lq     mean   median       uq      max neval
## findnrows.faster(sym_matrix = sym_matrix, nrows = 10, ntrials = 1000) 135.0531 136.3961 137.1123 136.7667 137.0439 143.0155    25
mean(findnrows.faster(sym_matrix = sym_matrix, nrows = 10, ntrials = 1000), na.rm = TRUE)
##[1] 0.7797313
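If you need the indices of the selected rows rather than the sub-matrix itself, they can be read off the row names, since the full matrix was given row names 1:1000 and subsetting preserves them (a minimal usage sketch):

best <- findnrows.faster(sym_matrix = sym_matrix, nrows = 10, ntrials = 1000)
# the dimnames of the returned sub-matrix carry the original row/column indices
selected <- as.integer(rownames(best))
selected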