R: Find optimal sub-matrix
I have a large symmetric matrix and want to reduce it to a much smaller matrix matrix_small with nrow(matrix_small) = n, such that the mean of matrix_small is maximized. Is there a way to achieve this goal in R with a better algorithm than the one I have? Better means either faster with the same mean, or a higher mean at the same speed.
I feel there should be a smarter way than searching for the minimum so often. I'm not aware of a way to set up an SQL-like index on a matrix in R to increase performance.
library(microbenchmark)

set.seed(2016)
sym_matrix <- matrix(runif(1e+06), ncol = 1000)
sym_matrix[lower.tri(sym_matrix)] <- t(sym_matrix)[lower.tri(sym_matrix)]
diag(sym_matrix) <- NA
rownames(sym_matrix) <- 1:1000
colnames(sym_matrix) <- 1:1000

findnrows <- function(sym_matrix, nrows){
  # Return a sub-matrix with nrow(matrix) = nrows whose mean should be maximized
  set.seed(2017)
  k <- nrow(sym_matrix)
  for (i in nrows:(k-1)) {
    # eliminate a row/column containing the current minimum value
    min_rows <- arrayInd(which.min(sym_matrix), dim(sym_matrix))
    choose_row <- sample(min_rows, 1)
    sym_matrix <- sym_matrix[-choose_row, -choose_row]
  }
  sym_matrix
}

microbenchmark(findnrows(sym_matrix = sym_matrix, nrows = 10), times = 25L)
mean(findnrows(sym_matrix = sym_matrix, nrows = 10), na.rm = TRUE)
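To see what the minimum-elimination step is doing, here is the lookup on a tiny symmetric matrix (a minimal illustration; the 3 x 3 values are made up and are not part of the benchmark above):

m <- matrix(c(NA, 0.2, 0.8,
              0.2, NA, 0.5,
              0.8, 0.5, NA), ncol = 3)
# which.min ignores NAs; arrayInd converts the linear index to a (row, column) pair
arrayInd(which.min(m), dim(m))
# the loop above then drops one of the two coordinates, chosen at random, as the next row/column to remove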
The problem is one of finding the nrows rows (and likewise columns) of the symmetric matrix that maximize the sum of the elements in the selected sub-matrix. Unlike the so-called maximum subarray problem in 2D, which has a solution using Kadane's algorithm, the key issue here is that the chosen rows need not be contiguous. As a result, the problem seems much harder as a combinatorial optimization. A brute-force approach that searches over all combinations of nrows rows (here 10) out of N rows (here 1000) is impractical. However, a simple approach that is different from the OP's algorithm is a random search in the space of combinations: randomly select nrows rows (and likewise columns) of the symmetric matrix at each trial, and keep the best set of nrows rows across sequential trials:
findnrows.random <- function(sym_matrix, nrows, ntrials){
  set.seed(2017)
  # start from one random selection of nrows rows/columns
  s.rows <- sample.int(nrow(sym_matrix), nrows)
  s <- sym_matrix[s.rows, s.rows]
  for (i in 1:ntrials) {
    # draw a new random selection and keep it if its sum is larger
    t.rows <- sample.int(nrow(sym_matrix), nrows)
    t <- sym_matrix[t.rows, t.rows]
    if (sum(s, na.rm = TRUE) < sum(t, na.rm = TRUE)) {
      s.rows <- t.rows
      s <- t
    }
  }
  return(s)
}
This algorithm, even implemented in R, is fast for a large number of trials, and 1000 trials produces a result (for this particular data set and seed) that is surprisingly on par with the OP's result while being about 500 times faster. This speaks more towards the sub-optimality of the OP's algorithm than the optimality of random search, because 1000 samples cover only a tiny portion of the overall search space. In addition, by construction, the performance in terms of the mean value of the selected sub-matrix can only improve (never get worse) as the number of trials increases. Therefore, for the same compute time, the simple random search will outperform the OP's algorithm.
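To put the size of that search space in perspective, the number of distinct 10-row subsets of a 1000-row matrix can be computed directly:

choose(1000, 10)
## roughly 2.6e23 possible subsets, so 1000 trials barely scratch the surface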
## OP results
microbenchmark(findnrows(sym_matrix = sym_matrix, nrows = 10), times = 25L)
##Unit: seconds
##                                           expr      min       lq     mean  median       uq      max neval
## findnrows(sym_matrix = sym_matrix, nrows = 10) 11.67548 11.69193 11.70937 11.6997 11.71076 11.87105    25
mean(findnrows(sym_matrix = sym_matrix, nrows = 10), na.rm = TRUE)
##[1] 0.6256406

## Random search
microbenchmark(findnrows.random(sym_matrix = sym_matrix, nrows = 10, ntrials = 1000), times = 25L)
##Unit: milliseconds
##                                                                   expr      min       lq     mean   median       uq      max neval
## findnrows.random(sym_matrix = sym_matrix, nrows = 10, ntrials = 1000) 21.81462 23.20069 27.06079 23.89368 26.25163 46.77016    25
mean(findnrows.random(sym_matrix = sym_matrix, nrows = 10, ntrials = 1000), na.rm = TRUE)
##[1] 0.6374652
Now, instead of simply throwing away the previous set of selected nrows rows whenever the next set has a bigger sum, we can seek to improve the performance of the random search by trying to improve the previous set of selected rows using the newly selected set. The heuristic employed here uses the rowSums of the resulting sub-matrices. That is, at each trial we seek to replace rows in the current sub-matrix with rows from the newly selected sub-matrix that have larger rowSums (or, equivalently, larger rowMeans). This seems reasonable given that the values are uniformly distributed over the full matrix, because a row in the selected sub-matrix with a higher rowMean will, on average, also have higher-valued elements across the full row. Of course, after we replace rows (if any) in the current sub-matrix with rows from the newly selected sub-matrix to form a new sub-matrix, we still check whether this new sub-matrix is better (i.e., has a bigger sum) than the current sub-matrix before making it the current best sub-matrix for the next trial. The code follows:
findnrows.faster <- function(sym_matrix, nrows, ntrials){
  set.seed(2017)
  s.rows <- sample.int(nrow(sym_matrix), nrows)
  s.means <- rowSums(sym_matrix[s.rows, s.rows], na.rm = TRUE)
  for (i in 1:ntrials) {
    t.rows <- sample.int(nrow(sym_matrix), nrows)
    t.means <- rowSums(sym_matrix[t.rows, t.rows], na.rm = TRUE)
    st.rows <- c(s.rows, t.rows)
    st.means <- c(s.means, t.means)
    ## make sure there are no duplicate rows before choosing the best nrows
    dups <- duplicated(st.rows)
    st.rows <- st.rows[!dups]
    st.means <- st.means[!dups]
    ## keep the nrows rows with the largest rowSums and recompute their sums
    new.rows <- st.rows[order(st.means, decreasing = TRUE)[1:nrows]]
    new.means <- rowSums(sym_matrix[new.rows, new.rows], na.rm = TRUE)
    if (sum(s.means) < sum(new.means)) {
      s.rows <- new.rows
      s.means <- new.means
    }
  }
  sym_matrix[s.rows, s.rows]
}
This algorithm is somewhat slower, but the result is much better than plain random search. Note that the comparison with findnrows.random is apples to apples, since the same number of trials is used and the same seed randomly selects the same rows at each trial. However, we would expect an optimal algorithm to select a sub-matrix with a mean over 0.9 (a 10 x 10 sub-matrix needs only 45 distinct off-diagonal values, chosen from roughly half a million distinct pairs in the full matrix), so this algorithm is still far from optimal.
## Improved random search
microbenchmark(findnrows.faster(sym_matrix = sym_matrix, nrows = 10, ntrials = 1000), times = 25L)
##Unit: milliseconds
##                                                                   expr      min       lq     mean   median       uq      max neval
## findnrows.faster(sym_matrix = sym_matrix, nrows = 10, ntrials = 1000) 135.0531 136.3961 137.1123 136.7667 137.0439 143.0155    25
mean(findnrows.faster(sym_matrix = sym_matrix, nrows = 10, ntrials = 1000), na.rm = TRUE)
##[1] 0.7797313
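If you need the indices of the selected rows rather than the sub-matrix itself, they can be read off the row names, since the full matrix was given row names 1:1000 and subsetting preserves them (a minimal usage sketch):

best <- findnrows.faster(sym_matrix = sym_matrix, nrows = 10, ntrials = 1000)
# the dimnames of the returned sub-matrix carry the original row/column indices
selected <- as.integer(rownames(best))
selected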