Ioannis Tsamardinos, Giorgos Borboudakis, Pavlos Katsogridakis, Polyvios Pratikakis, Vassilis Christophides

We present the Parallel, Forward-Backward with Pruning (PFBP) algorithm forfeature selection (FS) in Big Data settings (high dimensionality and/or samplesize). To tackle the challenges of Big Data FS PFBP partitions the data matrixboth in terms of rows (samples, training examples) as well as columns(features). By employing the concepts of $p$-values of conditional independencetests and meta-analysis techniques PFBP manages to rely only on computationslocal to a partition while minimizing communication costs. Then, it employspowerful and safe (asymptotically sound) heuristics to make early, approximatedecisions, such as Early Dropping of features from consideration in subsequentiterations, Early Stopping of consideration of features within the sameiteration, or Early Return of the winner in each iteration. PFBP providesasymptotic guarantees of optimality for data distributions faithfullyrepresentable by a causal network (Bayesian network or maximal ancestralgraph). Our empirical analysis confirms a super-linear speedup of the algorithmwith increasing sample size, linear scalability with respect to the number offeatures and processing cores, while dominating other competitive algorithms inits class.