Klimasauskas GroupTM
Leading the Edge in Emerging Technologies

HomeApplicationsNewsFeedbackSite MapAboutArticles & PapersRecipes

Home > Applications > Small Data Sets
 |  Reliable Models for Small Data Sets

Modeling small data sets is always a challenge. If there are only a few inputs and few examples, techniques like Probabilistic Neural Network (classification) or Generalized Regression Neural Network (regression) work well. However, what do you do when you have 25 or 50 examples and 50 or 100 inputs? Klimasauskas Group has developed an effective method for constructing statistically relevant non-linear classifiers for this kind of application. We are looking for other opportunities to license this technology, or work with others to adapt it to new situation. Through the DARPA project in which this was developed, we have staff with security clearances (currently inactive). If you have this kind of problem, contact us to see how we can help. If your application meets the following criteria, call me, , today at 412-916-9476.

  • Classification application
  • Small Data Sets - typically less than 500 examples
  • Many Input - 25-200 (or more)

Otherwise, please visit our applications section.


Reliable Classifier Based on a Small Data Set.

This technology grew out of a project for DARPA for predicting human behavior. There were typically 25-100 incidents which identify the type or absence of a specific behavior. Each incident included 40-90 features of the context. Klimasauskas group developed an approach to building high performance statistically reliable models that maximized the use of input data, despite the small numbers of training cases. This same approach is applicable to a large number of situations which are input rich, data poor, and where accurate predictions have a high value.

The basic approach is to use a Genetic Algorithm to select synergistic sets of inputs. In order to get this to work effectively, a special mutation operator was created that considers all solutions within a unit distance of the current solution. A second cross-over like operator was developed that transcends generations. Together these led to the evolution of very high performing sets of solutions. Each solution was further tested to determine the degree to which it could be further reduced without reducing performance. Performance was measured by a full n-fold cross-validation in which for each element of the data set, a model was trained on all but that item. The model was tested on the one item left out. This was repeated for every element of the training set. Statistically, this has the impact of substantively multiplying the effective size of the data set and producing highly relevant measures of performance. Robust regression techniques were used that provided optimal solutions.

Typically after 300-500,000 solutions were tested, the top-performing solutions were analyzed. Any which exhibited idiosyncratic behavior were eliminated. Of those remaining, a small subset was selected for inclusion in an ensemble model. The solutions were selected in such a way as to maximally span the set of inputs. As an example, three models were selected, containing 11, 10, and 14 inputs spanning a total 27 unique inputs. This is similar to identifying three experts, each which make their decision on substantively different information, each voting with one vote for the outcome. The interaction of the voting ensemble induces a piece-wise non-linear decision boundary in the solution space.

Deployment is done in three ways. A Java source module is produced which is a stand-alone classifier. Second, a Java run-time is available which is capable of loading and executing an ASCII file representing the model. Finally, a Visual Basic routine allows the model to be run from Excel.

Klimasauskas Group is interested in working with other companies to apply this technology to their applications.


Send mail to with questions or comments about this web site.
All rights reserved. Updated: 05/19/2007 .