Combining nearest neighbour classifiers based on small subsamples for big data analytics

Contemporary machine learning systems must be able to deal with ever-growing volumes of data. However, most canonical classifiers are not well suited to big data analytics. This is especially evident for distance-based classifiers, whose classification time becomes prohibitive. Recently, many methods for adapting the nearest neighbour classifier to big data have been proposed. We investigate a simple yet efficient technique based on random under-sampling of the dataset. Because we deal with stationary data, one may assume that a subset of objects sufficiently captures the properties of a given dataset. We propose to build distance-based classifiers on very small subsamples and then combine them into an ensemble. In this way, one needs to aggregate only the local decisions of the classifiers, not the datasets themselves. Experimental results show that such an approach can achieve results comparable to those of a nearest neighbour classifier built on the entire dataset, but with a significantly reduced classification time. We also investigate the number of subsamples (ensemble members) required to capture the properties of each dataset. Finally, we propose to apply our subsampling-based ensemble in a distributed environment, which allows a further reduction of the computational complexity of the nearest neighbour rule for big data.
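The idea of combining nearest neighbour classifiers built on small random subsamples can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the value of k, the distance measure, or the combination rule, so the sketch assumes 1-NN, squared Euclidean distance, and majority voting. The function names (`nearest_label`, `build_ensemble`, `predict`), the toy two-class data, and the parameter values are hypothetical.

```python
import random
from collections import Counter


def nearest_label(sample, x):
    """Return the label of the object in `sample` closest to x (1-NN, assumed)."""
    # Squared Euclidean distance is an assumption; the source names no metric.
    closest = min(sample, key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], x)))
    return closest[1]


def build_ensemble(data, n_members, subsample_size, seed=0):
    """Draw one small random subsample (without replacement) per ensemble member."""
    rng = random.Random(seed)
    return [rng.sample(data, subsample_size) for _ in range(n_members)]


def predict(ensemble, x):
    """Combine the members' local decisions by majority vote (assumed rule)."""
    votes = Counter(nearest_label(sample, x) for sample in ensemble)
    return votes.most_common(1)[0][0]


# Toy data (hypothetical): two well-separated classes, 1000 objects each.
rng = random.Random(42)
data = [((rng.random(), rng.random()), 0) for _ in range(1000)]
data += [((5 + rng.random(), 5 + rng.random()), 1) for _ in range(1000)]

# 11 members, each holding only 20 of the 2000 objects.
ensemble = build_ensemble(data, n_members=11, subsample_size=20, seed=1)
print(predict(ensemble, (0.5, 0.5)), predict(ensemble, (5.5, 5.5)))
```

In this sketch each query is compared against 11 × 20 = 220 stored objects rather than all 2000, and the members share only their votes, never their subsamples. Because each member is independent of the others, the per-member predictions could in principle be computed on separate nodes and only the votes collected.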