Hi, I've been scanning the literature trying to find an adequate approximate k-neighbour for my outlandish data set, but I remain stymied. Perhaps someone can help?

The dataset is huge, both in cardinality and in the number of dimensions, although the former is orders of magnitude higher than the latter. The vectors are of binary values, and we're minimizing the Hamming distance. Let's assume I've already done the obvious procedures, such as the typical techniques for dimensionality reduction, random sampling a subset, etc, and I'm now dealing with what's left. Which is still gigantic.

Formally, let T be the training data set (subset of in {0,1}^d) and n the cardinality #T.

For the training phase, the algorithm mustn't exceed o(d*n) for complexity and storage (give or take some extra Log factors). For the classification phase, complexity shouldn't exceed o(ln(n)*d). I'm well aware that with such stringent requirements, the approximate solution can't be very good. I can live with that.

thanks all for your time! cheers