Pitfalls when using UMAP for dimensionality reduction in Python

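The Python UMAP library (umap-learn) provides a newer method for reducing feature dimensionality. When it is combined with machine learning libraries such as scikit-learn, however, data type problems can occur, as in the following code: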

import numpy as np
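import sklearn.cluster  # import the submodule explicitly; a bare "import sklearn" does not guarantee sklearn.cluster is loaded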
import umap
 
# Generate features
arrFeatures = GenerateFeatures()
 
# Run dimensionality reduction with UMap
umpDimReductionFitter = umap.umap_.UMAP(n_components=nComponentCount)
umpDimReductionFitter.fit(arrFeatures)
arrFeaturesTransformed = umpDimReductionFitter.transform(arrFeatures)
 
# Other preprocessing
arrFeaturesTransformed = np.array(PreprocessData(arrFeaturesTransformed))
 
# Clustering with K-Means
# Fitting
clsCluster = sklearn.cluster.KMeans(n_clusters=nClusterCount)
clsCluster.fit(arrFeaturesTransformed)
# Clustering
arrFeaturesForPredicting = GenerateFeaturesForPredicting()
arrFeaturesForPredictingTransformed = umpDimReductionFitter.transform(arrFeaturesForPredicting)
arrFeaturesForPredictingTransformed = np.array(PreprocessData(arrFeaturesForPredictingTransformed))
arrResults = clsCluster.predict(arrFeaturesForPredictingTransformed)

At this point, the call to clsCluster.predict may fail with the following error:

ValueError: Buffer dtype mismatch, expected 'const float' but got 'double'
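A mismatch like this can appear without any explicit cast in the script. The following NumPy-only sketch shows one way it can happen: np.array() keeps the dtype of an existing array but converts a list of Python floats to float64. The float32 stand-in array and the variable names are assumptions made for illustration; they are not taken from the script above.

```python
import numpy as np

# Stand-in for a reduced feature array; float32 is assumed here for illustration.
embedding = np.zeros((4, 2), dtype=np.float32)

# np.array() keeps the dtype of an existing ndarray ...
kept = np.array(embedding)

# ... but a nested list of Python floats is converted to float64.
from_list = np.array(embedding.tolist())

print(kept.dtype.name, from_list.dtype.name)
```

If a preprocessing step returns an ndarray in one place and a plain list in another, the two results can therefore end up with different dtypes.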

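The error indicates a data type problem in the data processed earlier; replacing umap.umap_.UMAP with sklearn.decomposition.PCA makes the script work. To investigate, check the data type of arrFeaturesTransformed right after the np.array() call with the following code: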

print(arrFeaturesTransformed.dtype.name)

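The output was float64, whereas the error message shows that the fitted model expects float32 ('const float'). The arrays passed to fit and predict therefore did not end up with the same data type. To make them consistent, change both np.array() calls to specify the dtype explicitly: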

arrFeaturesTransformed = np.array(PreprocessData(arrFeaturesTransformed), dtype=np.float64)
# ...
arrFeaturesForPredictingTransformed = np.array(PreprocessData(arrFeaturesForPredictingTransformed), dtype=np.float64)

With this change, the script runs normally.
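The same fix can be collected in one place so that every array handed to the clusterer passes through the same conversion. This is only a sketch: as_float64 is a hypothetical helper name, and the sample inputs are made up for illustration.

```python
import numpy as np

def as_float64(data):
    """Hypothetical helper: return data as a C-contiguous float64 ndarray."""
    return np.ascontiguousarray(data, dtype=np.float64)

# An existing float32 array and a plain nested list both come out as float64.
fit_input = as_float64(np.ones((3, 2), dtype=np.float32))
predict_input = as_float64([[0.5, 1.5], [2.0, 3.0]])

print(fit_input.dtype.name, predict_input.dtype.name)
```

The point is that the data used for fit and for predict share one dtype; float64 is simply the choice made in the fix above.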
