Klimasauskas GroupTM
Leading the Edge in Emerging Technologies

HomeApplicationsNewsFeedbackSite MapAboutArticles & PapersRecipes

Home > News > Press Release
 |  Press Release June 18, 2003

Breakthrough in Small Data Set Classifiers


June 18, 2003, Sewickley, PA - Klimasauskas Group announced today a breakthrough in small data set classifiers. There are thousands of situations where data is expensive to collect, yet where the ability to model this data could have tremendous value. Examples range from pharmacological research, to classifying oceanic strata, to predicting behavior. In many of these cases, not only is the data expensive to collect, but there are potentially dozens or even hundreds of possible inputs. The challenge for both statistics and neural technologies has been developing meaningful models that predict effectively in new situations. Klimasauskas Group has developed an approach that is statistically sound, and produces highly effective models.

The basic insights that led to this development were:

  • Variable selection is one of the most important factors in the performance of linear models. This was reaffirmed in this research in which it was possible to exceed the performance of the best neural network models using all of the input data with models that used much smaller subsets of data.
  • Limiting the number of input variables in a model to 1/3 or fewer variables than cases ensures some degree of confidence in the predictive capability of a model.
  • Evaluating each set of variables using n-fold cross-validation (leave-one-out) effectively multiplies the size of the data set for testing purposes.
  • Creating voting ensembles improves stability and reliability of outputs. In lieu of using different subsets of data (boosting or bagging), high-performing solutions with minimal overlap in terms of input variables are used. This is much like taking a poll of experts, each making a prediction from a very different perspective. The voting combination actually induces a non-linear decision boundary. The composite models make good use of a far larger portion of the inputs than otherwise statistically reasonable.

This approach to solving was developed under a project funded by the Defense Advanced Research Projects Agency (DARPA). In the DARPA research, the typical problem had 25-100 cases and 40-90 input variables. A genetic algorithm is used to find synergistic subsets of variables that are highly predictive of the target outcomes. To address the issue of small data set sizes, solutions are ranked based on a full n-fold cross-validation of a linear, sigmoid, or softmax single layer neural network. Special genetic operators were developed that act not only on individuals, but all possible nearby (in a Hamming sense) solutions, as well as trans-generational operators that are capable to performing hybrid cross-over like operations on ensembles of individuals across multiple generations. In an approach similar to bootstrapping or bagging, multiple networks are collected together into ensembles to produce a final prediction. However, unlike bootstrapping or bagging, solutions are selected based on minimally overlapping (maximally spanning) subsets of variables.

"From our research, this approach represents a real break-through in building effective classifiers using small data sets" according to Casey Klimasauskas, President. "We are looking forward to working with several companies in applying this to their data-modeling problems in the future."

Contact:

Casey Klimasauskas
Klimasauskas Group
164 Saint Ives Way
Zelienople, PA 16063
Telephone: 412-916-9476
Email:
 

Send mail to with questions or comments about this web site.
All rights reserved. Updated: 02/25/2007 .