2016-04-11

study

高维海量数据快速检索库

falconn

这个库使用的是局部敏感哈希构建索引。效率较高。

python实现

官方的api文档https://falconn-lib.org/pdoc/falconn/其实非常详细，这里把其中关键的地方再提炼一下。
安装非常简单

pip install falconn

数据准备，需要将所有的索引特征存储到一个numpy的二维array中，其中每行代表一个数据，行数表示数据点数，列表示特征维度。注意，最好把数据转换为float32数据类型，提高算法性能，并将数据归一化减去均值，提高算法性能。具体操作如下

dataset = dataset.astype(numpy.float32)
dataset -= numpy.mean(dataset, axis=0)

构建哈希超参数，将数据维度和点数传入get_default_parameters获得合适的超参数，得到的超参数可以通过手动微调或者compute_number_of_hash_functions来对其中细节进行微调。

num_points, dim = dataset.shape
parms = falconn.get_default_parameters(num_points, dim)
falconn.compute_number_of_hash_functions(7, parms)

创建索引实例LSHIndex，并通过方法setup构建索引。

lsh_index = falconn.LSHIndex(parms)
lsh_index.setup(dataset)

查询的函数有很多，可以根据需要进行选择

find_k_nearest_neighbors()
find_near_neighbors()
find_nearest_neighbor()
get_candidates_with_duplicates()
get_unique_candidates()

C++实现

TBD

Panns

https://github.com/ryanrhymes/panns这是一个用python写的针对高维数据的approximate k-nearest neighbors算法包，目前支持欧式距离和余弦距离两种度量。对500维以上的高维特征进行了优化。虽然性能弱于Annoy，但是更加轻量，比FLANN和scikit-learn更为好用

安装

sudo pip install panns --upgrade

使用示例

from panns import *

# create an index of Euclidean distance
p = PannsIndex(dimension=100, metric='euclidean')

# generate a 1000 x 100 dataset
for i in xrange(1000):
    v = gaussian_vector(100)
    p.add_vector(v)

p.parallelize(True)        # 多核并行，默认是False
# build an index of 128 trees and save to a file
p.build(128)    # 64 is default

build之后调用save函数保存生成的索引，会生成两个文件.idx保存索引树和.idx.npy保存数据库向量矩阵，这个向量矩阵可以保存成内存文件或者mmap文件，根据如下注释选取保存类型。这两个文件要放在同一个目录下面，加载模型的时候才不会出错。

# save the index as an in-memory file if the raw dataset is small or medium size
# later panns will load the entire .npy file in to the physical memory
p.save('test.idx', mmap=False)

# save the index as mmap file if the raw dataset is huge
# usually, your OS will handle the dynamic loading
p.save('test.idx', mmap=True)

加载模型，查询索引，返回最近邻的前10个查询结果

p1 = PannsIndex(metric='euclidean')
p1.load('test.idx')
v1 = a_new_vector(100)
n = p1.query(v1, 10)

Annoy

https://github.com/spotify/annoy这是一个用C++编写的python打包的最近邻搜索算法包，应该是目前性能最好的算法包。

安装

pip install annoy

实例

tbd

参考资料

机器视觉：十亿规模的深度描述子如何有效索引

杨现的个人博客

分享计算机视觉、算法、生活累积的点滴