1. mltools机器学习工具包应用示例

mltools是从项目实战中提炼出来的一套机器学习工具包,其主要目标是加快数据探索、数据抽取、清洗转换、模型训练,让机器学习工程师专注数据分析和模型选择与评估。依赖的包主要有:numpy、pandas、sklearn、seaborn,统计部分有一点scipy.stats、 statsmodels.stats内容。

示例部分内容为演示如果正确使用mltools机器学习工具包。

2. 加载库和数据

[1]:
# Data manipulation
import pandas as pd
import numpy as np
from mltools import explore,feature,plot,mlcluster
from sklearn.datasets import make_blobs

# More Data Preprocessing & Machine Learning
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import SimpleImputer,KNNImputer,IterativeImputer
from sklearn import cluster
#from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler,StandardScaler,RobustScaler
from sklearn.preprocessing import PowerTransformer,QuantileTransformer
from sklearn.mixture import GaussianMixture

2.1. 数据产生

sklearn中的示例数据没有很好反映项目分析过程中的实际情况,因此自己产生数据尽量反映实际情况。选用make_blobs产生样本数n_samples=500,特征数n_features=5的数据集,并将其转换成DataFrame格式。

[2]:
X, y = make_blobs(n_samples=500,n_features=5,random_state=1)
data=pd.DataFrame(X)
data=data.add_prefix('col_')
data
[2]:
col_0 col_1 col_2 col_3 col_4
0 -0.141771 3.654010 -5.561710 6.749957 -10.852914
1 0.153257 5.490891 -6.294500 8.674817 -9.285383
2 -0.533027 3.889833 -4.095322 6.606851 -9.233252
3 -8.643535 -5.900542 -1.883994 -2.157399 2.746535
4 -8.789110 -6.779138 -1.622573 -2.022040 0.066330
... ... ... ... ... ...
495 -0.031133 3.026263 -9.972184 -2.564078 -6.760700
496 -2.183581 2.950700 -9.002360 -4.301135 -8.456502
497 -7.936597 -6.994555 -3.890593 -2.240007 2.246613
498 -8.286480 -6.189705 -1.910791 -2.266651 -0.153064
499 -8.544259 -6.542836 -2.444902 -1.410030 1.076530

500 rows × 5 columns

2.2. 加入Nan

[3]:
#加入Nan
np.random.seed(100)
choice_list=[False,True]
p=[0.98,0.02]
mask = np.random.choice(choice_list, size=data.shape,p=p,replace=True)
data=data.mask(mask)
data[data.isnull().any(axis=1)]
[3]:
col_0 col_1 col_2 col_3 col_4
7 -1.286570 NaN -6.176012 8.124907 -10.120067
16 -7.883550 -5.089507 -2.704176 NaN 1.533754
21 -7.131833 -5.573617 NaN -1.741799 -0.128209
27 -1.381526 5.096522 NaN NaN NaN
29 -1.937638 NaN -9.959491 -2.463272 -4.467091
46 -7.750437 -5.205726 NaN -3.258757 -0.313626
56 NaN -5.085142 -3.509735 -0.551581 -0.409423
73 -6.963466 -6.963849 -3.420619 NaN 0.137765
89 -0.923590 3.055128 -8.873487 -4.539752 NaN
90 -8.934573 NaN -3.604700 -1.409286 2.194878
95 0.833859 NaN -7.228685 7.163602 -7.794042
117 -6.913303 -7.804767 NaN -2.670003 1.175212
127 -7.618104 -5.807176 -1.887277 NaN -0.398539
132 -0.408670 3.963432 -6.710602 NaN -10.945749
138 -0.939700 5.784000 -9.912916 -4.464145 NaN
139 -0.203479 4.152881 -6.715668 NaN -8.827473
140 -0.576282 NaN -5.975531 7.387683 -10.071748
160 -8.328621 -6.772762 -2.714645 -3.689514 NaN
164 -0.584305 4.802059 -10.227054 -4.497832 NaN
166 -1.949085 NaN -5.111942 7.527439 -10.178813
175 -1.731877 NaN -5.210435 9.771156 -9.591718
188 -0.347004 4.217912 NaN 8.964273 -10.266166
192 NaN 5.831704 -6.950209 6.876678 -10.238876
219 -9.234538 -7.610726 -3.759727 NaN 1.507948
221 -0.903702 NaN -6.410009 8.215268 -11.519422
226 -9.188071 NaN -3.444390 -2.188615 NaN
242 -2.857065 1.680626 -8.593365 -4.073770 NaN
251 -9.074263 -6.421705 -2.793038 NaN -0.936332
265 NaN -7.659325 -1.195496 -1.668099 1.495399
293 -8.888966 -6.142600 -4.033103 -1.893275 NaN
316 -7.193460 -7.067357 NaN -2.906014 2.840359
346 NaN 5.371377 -10.739167 -2.634757 -7.374552
364 -6.876723 -7.914889 NaN -1.589900 2.075189
373 -3.176953 NaN -5.125969 8.706169 -10.143616
379 NaN 4.141529 -8.890430 -5.302085 -6.473979
381 -8.301266 -7.224314 -4.448187 NaN 1.395000
391 -8.128794 -8.077865 -3.902213 NaN 1.371146
395 -9.424622 -7.716436 -3.994138 -3.361358 NaN
397 -0.614359 5.416865 NaN -3.235759 -7.315587
399 -8.976534 -6.099075 NaN -2.172709 1.261013
402 -0.192786 NaN -8.981948 -3.237669 -5.332705
421 -0.885676 NaN -9.750204 -5.669639 -6.477809
443 -1.351757 NaN -8.470801 -4.297299 -8.402905
449 -7.586178 -5.888475 -1.585598 NaN 1.088758
489 -7.704753 NaN -3.396437 -1.353495 2.916061

2.3. 加入重复数据

[4]:
''' Repeat without index  '''
ds=data.sample(n=10,random_state=0)
data = pd.concat([data]+[ds]*3, ignore_index=True)
data
[4]:
col_0 col_1 col_2 col_3 col_4
0 -0.141771 3.654010 -5.561710 6.749957 -10.852914
1 0.153257 5.490891 -6.294500 8.674817 -9.285383
2 -0.533027 3.889833 -4.095322 6.606851 -9.233252
3 -8.643535 -5.900542 -1.883994 -2.157399 2.746535
4 -8.789110 -6.779138 -1.622573 -2.022040 0.066330
... ... ... ... ... ...
525 -2.003885 5.311449 -9.693497 -2.498029 -7.332512
526 -7.193460 -7.067357 NaN -2.906014 2.840359
527 -7.704753 NaN -3.396437 -1.353495 2.916061
528 -8.731049 -6.014554 -3.033054 -2.248544 -0.737050
529 -1.722663 4.089757 -11.366473 -5.031462 -7.357258

530 rows × 5 columns

3. 数据整理(探索)

[5]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 530 entries, 0 to 529
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   col_0   525 non-null    float64
 1   col_1   510 non-null    float64
 2   col_2   518 non-null    float64
 3   col_3   519 non-null    float64
 4   col_4   521 non-null    float64
dtypes: float64(5)
memory usage: 20.8 KB

3.1. 空缺值分析

[6]:
explore.null_count(data)
[6]:
total_missing missing_percent
col_1 20 3.77
col_2 12 2.26
col_3 11 2.08
col_4 9 1.70
col_0 5 0.94
[7]:
len(data[data.isnull().any(axis=1)])
[7]:
54

本数据集共有530个样本,含有空值的样本为54个,占比为10.2%,可以考虑补充空缺数据。实际分析中是删除或补充或两种方法都用,要根据项目的具体情况决定。

3.2. 重复值分析

[8]:
data.duplicated().value_counts()
[8]:
False    500
True      30
dtype: int64

本数据集共有530个样本,重复数据30条。实际分析中是删除或保留,要根据项目的具体情况决定。

3.3. 唯一值分析

[9]:
explore.unique_count(data)
[9]:
nunique

唯一值是某一列中只有一个值。当前数据集没有唯一值出现。实际分析中出现唯一值的列需要删除,因为这个属性对模型选择没有任何帮助。

3.4. 负值分析

[10]:
explore.negative_count(data)
[10]:
negative_count negative_percent
col_2 518 97.74
col_0 506 95.47
col_4 387 73.02
col_3 352 66.42
col_1 176 33.21

当前数据集有负值,此处只是演示negative_count的用法。实际项目中某些列是不能出现负值的,负值意味着异常需要删除。

3.5. 异常数据分析

异常值分析有两种方法:IQR和均值标准差方法,阈值可以任意设定。对IQR方法,通常设定method=‘IQR’, threshold=1.5;极端异常值分析threshold=3。均值标准差方法通常设定method=‘mean_std’, threshold=3。 outlier_df表示各列异常值情况。异常值的处理在特征清洗与转换进行。

[11]:
#IQR探索
outlier_count,outlier_df=explore.outlier_detect(data,method='IQR',threshold=1.5)
[12]:
outlier_count
[12]:
lower_fence upper_fence lower_outlier upper_outlier total_outlier total_percent
col_0 -16.91 7.86 0 0 0 0.0
col_1 -20.60 19.24 0 0 0 0.0
col_2 -17.83 5.08 0 0 0 0.0
col_3 -18.69 22.18 0 0 0 0.0
col_4 -22.27 13.49 0 0 0 0.0
[13]:
outlier_df
[13]:
col_0 col_1 col_2 col_3 col_4

3.6. 可视化

[14]:
plot.hist_plot(data=data,ncol=5,figsize_x=20)
../_images/examples_mltools_example_30_0.png

4. 数据清洗及特征选择

数据清洗比较简单,此处不讲。特征选择是本节的主要内容。

这一部分的特征选择主要是解决特征之间的多重共线性问题,解决方法是VIF和Corr方法。

4.1. VIF特征选择

VIF方法不能包含NaN值,因此需要处理。此处只是简单的删除,实际项目如何处理要根据具体情况再做决定。

[15]:
df_notnull=data[~data.isnull().any(axis=1)]
[16]:
VIF=feature.get_VIF(df_notnull)
VIF
[16]:
VIF Column
0 4.841115 col_3
1 10.261113 col_0
2 12.944953 col_1
3 24.882742 col_2
4 26.979943 col_4

说明col_2、col_4列出现了高度相关,删除高度相关的列。

[17]:
feature.get_VIF(df_notnull.drop(['col_2','col_4'],axis=1))
[17]:
VIF Column
0 1.095365 col_3
1 1.308480 col_0
2 1.410221 col_1

从上面VIF结果知道,col_0、col_1、col_3的VIF都小于5,选以上3列作为特征是合适的。

实际项目中特征数量众多,需要综合考虑VIF、corr分析结果,与业务专家讨论确认特征!

5. 特征转换与可视化

将特征转换作为聚类参数,详细内容在模型选择与评估部分讨论

5.1. 特征转换及其正态检验

[18]:
data=data.drop(labels=['col_2','col_4'],axis=1)
trans_method_list = ['cbrt', 'reciprocal', 'square', 'cube', 'minmax', 'zscore', 'robust', 'quant', 'yeo']
feature.transformer(data,trans_method_list=trans_method_list,test_method='ks')
[18]:
orginal cbrt reciprocal square cube minmax zscore robust quant yeo
col_0 - - - - - - - - Normal -
col_1 - - - - - - - - Normal -
col_3 - - - - - - - - Normal -

5.2. 特征转换可视化

[19]:
trans_method_list = ['cbrt', 'reciprocal', 'square', 'cube', 'minmax', 'zscore', 'robust', 'quant', 'yeo']
plot.trans_plot(data=data,trans_method_list=trans_method_list)
../_images/examples_mltools_example_47_0.png
../_images/examples_mltools_example_47_1.png
../_images/examples_mltools_example_47_2.png

6. 模型选择及评估

mlcluster.ClusterScorePlot继承了pipeline,参数定义中的cls为pipeline的steps,本例中清洗、转换、聚类作为参数设置。 mlcluster.ClusterScorePlot还可以设置清洗、转换、聚类方法的参数,对于复杂项目可以构建更复杂的管道。

6.1. 模型初评

开始时我们不知道有多少个聚类,假设选取聚类数n_clusters=2,选取GaussianMixture、KMeans聚类,看看效果。从下图可以看出n_clusters应该设为3。

[20]:
#setup steps:
n_clusters=2
cls =[('imputer', SimpleImputer(strategy='median')),
      ('scaler', StandardScaler()),
      ('cluster',cluster.AgglomerativeClustering(n_clusters=n_clusters))]
#setup param_grid
param_grid = [
    { 'imputer': [None],
     'scaler': [None],
     'cluster':[cluster.KMeans(n_clusters=n_clusters),
                GaussianMixture(n_components=n_clusters)]}
    ]
#setup scoring:
scoring = ['da', 'si', 'ca']
clustersearch=mlcluster.ClusterScorePlot(steps=cls,param_grid=param_grid,scoring=scoring)
clustersearch.plot(data,ncol=1)
../_images/examples_mltools_example_52_0.png

6.2. 模型调参

6.2.1. 参数定义

imputer、scaler、cluster三个阶段的任务以列表的方式在cls中定义,其参数在param_grid中以字典的形式定义,聚类评价在scoring中定义。

[21]:
#setup steps:
n_clusters=3
cls =[('imputer', SimpleImputer(strategy='median')),
      ('scaler', StandardScaler()),
      ('cluster',cluster.AgglomerativeClustering(n_clusters=n_clusters))]
#setup param_grid
lr = LinearRegression()
imp = IterativeImputer(estimator=lr,missing_values=np.nan, max_iter=10, verbose=2, imputation_order='roman',random_state=0)
imp1 = IterativeImputer(random_state=0)
knn = KNNImputer(n_neighbors=2, add_indicator=True)

param_grid = [
    { 'imputer': [SimpleImputer(strategy='median'),imp,imp1,knn, None],
     'scaler': [MinMaxScaler(),StandardScaler(),RobustScaler(),PowerTransformer(),None],
     'cluster':[cluster.AgglomerativeClustering(n_clusters=n_clusters),
                 cluster.DBSCAN(eps=0.30, min_samples=10),
                cluster.KMeans(n_clusters=n_clusters),
                GaussianMixture(n_components=n_clusters)]},

    { 'imputer': [SimpleImputer(strategy='median'),imp,imp1,knn, None],
     'scaler': [QuantileTransformer(n_quantiles=100,random_state=0)],
     'cluster':[cluster.AgglomerativeClustering(n_clusters=n_clusters),
                 cluster.DBSCAN(eps=0.30, min_samples=10),
                cluster.KMeans(n_clusters=n_clusters),
                GaussianMixture(n_components=n_clusters)],
    'scaler__output_distribution':['uniform', 'normal']}]
#setup scoring:
scoring = ['da', 'si', 'ca']

6.2.2. 参数网格搜索

通过参数网格搜索得到最佳参数

[22]:
clustersearch=mlcluster.ClusterScorePlot(steps=cls,param_grid=param_grid,scoring=scoring)
df_score=clustersearch.get_score(data)

df_best_score=clustersearch.get_best_score(df_score,n_best_score=5)
pd.set_option('display.max_rows',200)
pd.set_option('display.max_colwidth',200)
df_best_score.reset_index(drop=True)

[IterativeImputer] Completing matrix with shape (530, 3)
[IterativeImputer] Ending imputation round 1/10, elapsed time 0.00
[IterativeImputer] Change: 7.620284332034443, scaled tolerance: 0.011335356993707154
[IterativeImputer] Ending imputation round 2/10, elapsed time 0.01
[IterativeImputer] Change: 0.11316669398578494, scaled tolerance: 0.011335356993707154
[IterativeImputer] Ending imputation round 3/10, elapsed time 0.01
[IterativeImputer] Change: 0.00012243243720222452, scaled tolerance: 0.011335356993707154
[IterativeImputer] Early stopping criterion reached.
[IterativeImputer] Completing matrix with shape (530, 3)
[IterativeImputer] Ending imputation round 1/10, elapsed time 0.00
[IterativeImputer] Change: 7.620284332034443, scaled tolerance: 0.011335356993707154
[IterativeImputer] Ending imputation round 2/10, elapsed time 0.00
[IterativeImputer] Change: 0.11316669398578494, scaled tolerance: 0.011335356993707154
[IterativeImputer] Ending imputation round 3/10, elapsed time 0.01
[IterativeImputer] Change: 0.00012243243720222452, scaled tolerance: 0.011335356993707154
[IterativeImputer] Early stopping criterion reached.
[IterativeImputer] Completing matrix with shape (530, 3)
[IterativeImputer] Ending imputation round 1/10, elapsed time 0.00
[IterativeImputer] Change: 7.620284332034443, scaled tolerance: 0.011335356993707154
[IterativeImputer] Ending imputation round 2/10, elapsed time 0.00
[IterativeImputer] Change: 0.11316669398578494, scaled tolerance: 0.011335356993707154
[IterativeImputer] Ending imputation round 3/10, elapsed time 0.01
[IterativeImputer] Change: 0.00012243243720222452, scaled tolerance: 0.011335356993707154
[IterativeImputer] Early stopping criterion reached.
[IterativeImputer] Completing matrix with shape (530, 3)
[IterativeImputer] Ending imputation round 1/10, elapsed time 0.00
[IterativeImputer] Change: 7.620284332034443, scaled tolerance: 0.011335356993707154
[IterativeImputer] Ending imputation round 2/10, elapsed time 0.00
[IterativeImputer] Change: 0.11316669398578494, scaled tolerance: 0.011335356993707154
[IterativeImputer] Ending imputation round 3/10, elapsed time 0.01
[IterativeImputer] Change: 0.00012243243720222452, scaled tolerance: 0.011335356993707154
[IterativeImputer] Early stopping criterion reached.
[IterativeImputer] Completing matrix with shape (530, 3)
[IterativeImputer] Ending imputation round 1/10, elapsed time 0.00
[IterativeImputer] Change: 7.620284332034443, scaled tolerance: 0.011335356993707154
[IterativeImputer] Ending imputation round 2/10, elapsed time 0.00
[IterativeImputer] Change: 0.11316669398578494, scaled tolerance: 0.011335356993707154
[IterativeImputer] Ending imputation round 3/10, elapsed time 0.01
[IterativeImputer] Change: 0.00012243243720222452, scaled tolerance: 0.011335356993707154
[IterativeImputer] Early stopping criterion reached.
[IterativeImputer] Completing matrix with shape (530, 3)
[IterativeImputer] Ending imputation round 1/10, elapsed time 0.00
[IterativeImputer] Change: 7.620284332034443, scaled tolerance: 0.011335356993707154
[IterativeImputer] Ending imputation round 2/10, elapsed time 0.01
[IterativeImputer] Change: 0.11316669398578494, scaled tolerance: 0.011335356993707154
[IterativeImputer] Ending imputation round 3/10, elapsed time 0.01
[IterativeImputer] Change: 0.00012243243720222452, scaled tolerance: 0.011335356993707154
[IterativeImputer] Early stopping criterion reached.
[IterativeImputer] Completing matrix with shape (530, 3)
[IterativeImputer] Ending imputation round 1/10, elapsed time 0.00
[IterativeImputer] Change: 7.620284332034443, scaled tolerance: 0.011335356993707154
[IterativeImputer] Ending imputation round 2/10, elapsed time 0.00
[IterativeImputer] Change: 0.11316669398578494, scaled tolerance: 0.011335356993707154
[IterativeImputer] Ending imputation round 3/10, elapsed time 0.01
[IterativeImputer] Change: 0.00012243243720222452, scaled tolerance: 0.011335356993707154
[IterativeImputer] Early stopping criterion reached.
[IterativeImputer] Completing matrix with shape (530, 3)
[IterativeImputer] Ending imputation round 1/10, elapsed time 0.00
[IterativeImputer] Change: 7.620284332034443, scaled tolerance: 0.011335356993707154
[IterativeImputer] Ending imputation round 2/10, elapsed time 0.00
[IterativeImputer] Change: 0.11316669398578494, scaled tolerance: 0.011335356993707154
[IterativeImputer] Ending imputation round 3/10, elapsed time 0.01
[IterativeImputer] Change: 0.00012243243720222452, scaled tolerance: 0.011335356993707154
[IterativeImputer] Early stopping criterion reached.
[IterativeImputer] Completing matrix with shape (530, 3)
[IterativeImputer] Ending imputation round 1/10, elapsed time 0.00
[IterativeImputer] Change: 7.620284332034443, scaled tolerance: 0.011335356993707154
[IterativeImputer] Ending imputation round 2/10, elapsed time 0.00
[IterativeImputer] Change: 0.11316669398578494, scaled tolerance: 0.011335356993707154
[IterativeImputer] Ending imputation round 3/10, elapsed time 0.01
[IterativeImputer] Change: 0.00012243243720222452, scaled tolerance: 0.011335356993707154
[IterativeImputer] Early stopping criterion reached.
[IterativeImputer] Completing matrix with shape (530, 3)
[IterativeImputer] Ending imputation round 1/10, elapsed time 0.00
[IterativeImputer] Change: 7.620284332034443, scaled tolerance: 0.011335356993707154
[IterativeImputer] Ending imputation round 2/10, elapsed time 0.01
[IterativeImputer] Change: 0.11316669398578494, scaled tolerance: 0.011335356993707154
[IterativeImputer] Ending imputation round 3/10, elapsed time 0.01
[IterativeImputer] Change: 0.00012243243720222452, scaled tolerance: 0.011335356993707154
[IterativeImputer] Early stopping criterion reached.
[IterativeImputer] Completing matrix with shape (530, 3)
[IterativeImputer] Ending imputation round 1/10, elapsed time 0.00
[IterativeImputer] Change: 7.620284332034443, scaled tolerance: 0.011335356993707154
[IterativeImputer] Ending imputation round 2/10, elapsed time 0.01
[IterativeImputer] Change: 0.11316669398578494, scaled tolerance: 0.011335356993707154
[IterativeImputer] Ending imputation round 3/10, elapsed time 0.01
[IterativeImputer] Change: 0.00012243243720222452, scaled tolerance: 0.011335356993707154
[IterativeImputer] Early stopping criterion reached.
[IterativeImputer] Completing matrix with shape (530, 3)
[IterativeImputer] Ending imputation round 1/10, elapsed time 0.00
[IterativeImputer] Change: 7.620284332034443, scaled tolerance: 0.011335356993707154
[IterativeImputer] Ending imputation round 2/10, elapsed time 0.00
[IterativeImputer] Change: 0.11316669398578494, scaled tolerance: 0.011335356993707154
[IterativeImputer] Ending imputation round 3/10, elapsed time 0.01
[IterativeImputer] Change: 0.00012243243720222452, scaled tolerance: 0.011335356993707154
[IterativeImputer] Early stopping criterion reached.
[IterativeImputer] Completing matrix with shape (530, 3)
[IterativeImputer] Ending imputation round 1/10, elapsed time 0.00
[IterativeImputer] Change: 7.620284332034443, scaled tolerance: 0.011335356993707154
[IterativeImputer] Ending imputation round 2/10, elapsed time 0.01
[IterativeImputer] Change: 0.11316669398578494, scaled tolerance: 0.011335356993707154
[IterativeImputer] Ending imputation round 3/10, elapsed time 0.01
[IterativeImputer] Change: 0.00012243243720222452, scaled tolerance: 0.011335356993707154
[IterativeImputer] Early stopping criterion reached.
[IterativeImputer] Completing matrix with shape (530, 3)
[IterativeImputer] Ending imputation round 1/10, elapsed time 0.00
[IterativeImputer] Change: 7.620284332034443, scaled tolerance: 0.011335356993707154
[IterativeImputer] Ending imputation round 2/10, elapsed time 0.00
[IterativeImputer] Change: 0.11316669398578494, scaled tolerance: 0.011335356993707154
[IterativeImputer] Ending imputation round 3/10, elapsed time 0.01
[IterativeImputer] Change: 0.00012243243720222452, scaled tolerance: 0.011335356993707154
[IterativeImputer] Early stopping criterion reached.
[IterativeImputer] Completing matrix with shape (530, 3)
[IterativeImputer] Ending imputation round 1/10, elapsed time 0.00
[IterativeImputer] Change: 7.620284332034443, scaled tolerance: 0.011335356993707154
[IterativeImputer] Ending imputation round 2/10, elapsed time 0.01
[IterativeImputer] Change: 0.11316669398578494, scaled tolerance: 0.011335356993707154
[IterativeImputer] Ending imputation round 3/10, elapsed time 0.01
[IterativeImputer] Change: 0.00012243243720222452, scaled tolerance: 0.011335356993707154
[IterativeImputer] Early stopping criterion reached.
[IterativeImputer] Completing matrix with shape (530, 3)
[IterativeImputer] Ending imputation round 1/10, elapsed time 0.00
[IterativeImputer] Change: 7.620284332034443, scaled tolerance: 0.011335356993707154
[IterativeImputer] Ending imputation round 2/10, elapsed time 0.01
[IterativeImputer] Change: 0.11316669398578494, scaled tolerance: 0.011335356993707154
[IterativeImputer] Ending imputation round 3/10, elapsed time 0.01
[IterativeImputer] Change: 0.00012243243720222452, scaled tolerance: 0.011335356993707154
[IterativeImputer] Early stopping criterion reached.
[IterativeImputer] Completing matrix with shape (530, 3)
[IterativeImputer] Ending imputation round 1/10, elapsed time 0.00
[IterativeImputer] Change: 7.620284332034443, scaled tolerance: 0.011335356993707154
[IterativeImputer] Ending imputation round 2/10, elapsed time 0.00
[IterativeImputer] Change: 0.11316669398578494, scaled tolerance: 0.011335356993707154
[IterativeImputer] Ending imputation round 3/10, elapsed time 0.01
[IterativeImputer] Change: 0.00012243243720222452, scaled tolerance: 0.011335356993707154
[IterativeImputer] Early stopping criterion reached.
[IterativeImputer] Completing matrix with shape (530, 3)
[IterativeImputer] Ending imputation round 1/10, elapsed time 0.00
[IterativeImputer] Change: 7.620284332034443, scaled tolerance: 0.011335356993707154
[IterativeImputer] Ending imputation round 2/10, elapsed time 0.01
[IterativeImputer] Change: 0.11316669398578494, scaled tolerance: 0.011335356993707154
[IterativeImputer] Ending imputation round 3/10, elapsed time 0.01
[IterativeImputer] Change: 0.00012243243720222452, scaled tolerance: 0.011335356993707154
[IterativeImputer] Early stopping criterion reached.
[IterativeImputer] Completing matrix with shape (530, 3)
[IterativeImputer] Ending imputation round 1/10, elapsed time 0.00
[IterativeImputer] Change: 7.620284332034443, scaled tolerance: 0.011335356993707154
[IterativeImputer] Ending imputation round 2/10, elapsed time 0.00
[IterativeImputer] Change: 0.11316669398578494, scaled tolerance: 0.011335356993707154
[IterativeImputer] Ending imputation round 3/10, elapsed time 0.01
[IterativeImputer] Change: 0.00012243243720222452, scaled tolerance: 0.011335356993707154
[IterativeImputer] Early stopping criterion reached.
[IterativeImputer] Completing matrix with shape (530, 3)
[IterativeImputer] Ending imputation round 1/10, elapsed time 0.00
[IterativeImputer] Change: 7.620284332034443, scaled tolerance: 0.011335356993707154
[IterativeImputer] Ending imputation round 2/10, elapsed time 0.00
[IterativeImputer] Change: 0.11316669398578494, scaled tolerance: 0.011335356993707154
[IterativeImputer] Ending imputation round 3/10, elapsed time 0.01
[IterativeImputer] Change: 0.00012243243720222452, scaled tolerance: 0.011335356993707154
[IterativeImputer] Early stopping criterion reached.
[IterativeImputer] Completing matrix with shape (530, 3)
[IterativeImputer] Ending imputation round 1/10, elapsed time 0.00
[IterativeImputer] Change: 7.620284332034443, scaled tolerance: 0.011335356993707154
[IterativeImputer] Ending imputation round 2/10, elapsed time 0.00
[IterativeImputer] Change: 0.11316669398578494, scaled tolerance: 0.011335356993707154
[IterativeImputer] Ending imputation round 3/10, elapsed time 0.01
[IterativeImputer] Change: 0.00012243243720222452, scaled tolerance: 0.011335356993707154
[IterativeImputer] Early stopping criterion reached.
[IterativeImputer] Completing matrix with shape (530, 3)
[IterativeImputer] Ending imputation round 1/10, elapsed time 0.00
[IterativeImputer] Change: 7.620284332034443, scaled tolerance: 0.011335356993707154
[IterativeImputer] Ending imputation round 2/10, elapsed time 0.00
[IterativeImputer] Change: 0.11316669398578494, scaled tolerance: 0.011335356993707154
[IterativeImputer] Ending imputation round 3/10, elapsed time 0.01
[IterativeImputer] Change: 0.00012243243720222452, scaled tolerance: 0.011335356993707154
[IterativeImputer] Early stopping criterion reached.
[IterativeImputer] Completing matrix with shape (530, 3)
[IterativeImputer] Ending imputation round 1/10, elapsed time 0.00
[IterativeImputer] Change: 7.620284332034443, scaled tolerance: 0.011335356993707154
[IterativeImputer] Ending imputation round 2/10, elapsed time 0.01
[IterativeImputer] Change: 0.11316669398578494, scaled tolerance: 0.011335356993707154
[IterativeImputer] Ending imputation round 3/10, elapsed time 0.01
[IterativeImputer] Change: 0.00012243243720222452, scaled tolerance: 0.011335356993707154
[IterativeImputer] Early stopping criterion reached.
[IterativeImputer] Completing matrix with shape (530, 3)
[IterativeImputer] Ending imputation round 1/10, elapsed time 0.00
[IterativeImputer] Change: 7.620284332034443, scaled tolerance: 0.011335356993707154
[IterativeImputer] Ending imputation round 2/10, elapsed time 0.01
[IterativeImputer] Change: 0.11316669398578494, scaled tolerance: 0.011335356993707154
[IterativeImputer] Ending imputation round 3/10, elapsed time 0.02
[IterativeImputer] Change: 0.00012243243720222452, scaled tolerance: 0.011335356993707154
[IterativeImputer] Early stopping criterion reached.
[IterativeImputer] Completing matrix with shape (530, 3)
[IterativeImputer] Ending imputation round 1/10, elapsed time 0.00
[IterativeImputer] Change: 7.620284332034443, scaled tolerance: 0.011335356993707154
[IterativeImputer] Ending imputation round 2/10, elapsed time 0.01
[IterativeImputer] Change: 0.11316669398578494, scaled tolerance: 0.011335356993707154
[IterativeImputer] Ending imputation round 3/10, elapsed time 0.01
[IterativeImputer] Change: 0.00012243243720222452, scaled tolerance: 0.011335356993707154
[IterativeImputer] Early stopping criterion reached.
[IterativeImputer] Completing matrix with shape (530, 3)
[IterativeImputer] Ending imputation round 1/10, elapsed time 0.00
[IterativeImputer] Change: 7.620284332034443, scaled tolerance: 0.011335356993707154
[IterativeImputer] Ending imputation round 2/10, elapsed time 0.00
[IterativeImputer] Change: 0.11316669398578494, scaled tolerance: 0.011335356993707154
[IterativeImputer] Ending imputation round 3/10, elapsed time 0.01
[IterativeImputer] Change: 0.00012243243720222452, scaled tolerance: 0.011335356993707154
[IterativeImputer] Early stopping criterion reached.
[IterativeImputer] Completing matrix with shape (530, 3)
[IterativeImputer] Ending imputation round 1/10, elapsed time 0.00
[IterativeImputer] Change: 7.620284332034443, scaled tolerance: 0.011335356993707154
[IterativeImputer] Ending imputation round 2/10, elapsed time 0.00
[IterativeImputer] Change: 0.11316669398578494, scaled tolerance: 0.011335356993707154
[IterativeImputer] Ending imputation round 3/10, elapsed time 0.01
[IterativeImputer] Change: 0.00012243243720222452, scaled tolerance: 0.011335356993707154
[IterativeImputer] Early stopping criterion reached.
[IterativeImputer] Completing matrix with shape (530, 3)
[IterativeImputer] Ending imputation round 1/10, elapsed time 0.00
[IterativeImputer] Change: 7.620284332034443, scaled tolerance: 0.011335356993707154
[IterativeImputer] Ending imputation round 2/10, elapsed time 0.00
[IterativeImputer] Change: 0.11316669398578494, scaled tolerance: 0.011335356993707154
[IterativeImputer] Ending imputation round 3/10, elapsed time 0.01
[IterativeImputer] Change: 0.00012243243720222452, scaled tolerance: 0.011335356993707154
[IterativeImputer] Early stopping criterion reached.
[22]:
param da si ca
0 {'cluster': AgglomerativeClustering(n_clusters=3), 'imputer': KNNImputer(add_indicator=True, n_neighbors=2), 'scaler': None} 0.266248 0.813408 5384.130887
1 {'cluster': AgglomerativeClustering(n_clusters=3), 'imputer': None, 'scaler': None} 0.263954 0.814534 5035.913689
2 {'cluster': KMeans(n_clusters=3), 'imputer': IterativeImputer(estimator=LinearRegression(), imputation_order='roman', random_state=0, verbose=2), 'scaler': None} 0.270683 0.809068 5133.544989
3 {'cluster': KMeans(n_clusters=3), 'imputer': IterativeImputer(random_state=0), 'scaler': None} 0.270805 0.808884 5123.920033
4 {'cluster': KMeans(n_clusters=3), 'imputer': KNNImputer(add_indicator=True, n_neighbors=2), 'scaler': None} 0.266248 0.813408 5384.130887
5 {'cluster': KMeans(n_clusters=3), 'imputer': None, 'scaler': None} 0.263954 0.814534 5035.913689
6 {'cluster': GaussianMixture(n_components=3), 'imputer': KNNImputer(add_indicator=True, n_neighbors=2), 'scaler': None} 0.266248 0.813408 5384.130887
7 {'cluster': GaussianMixture(n_components=3), 'imputer': None, 'scaler': None} 0.263954 0.814534 5035.913689

6.2.3. 根据最佳参数绘图

[23]:
#setup param_grid
param_grid=list(df_best_score['param'])

clustersearch=mlcluster.ClusterScorePlot(steps=cls,param_grid=param_grid,scoring=scoring)
clustersearch.plot(data,param_type='',ncol=1)
[IterativeImputer] Completing matrix with shape (530, 3)
[IterativeImputer] Ending imputation round 1/10, elapsed time 0.00
[IterativeImputer] Change: 7.620284332034443, scaled tolerance: 0.011335356993707154
[IterativeImputer] Ending imputation round 2/10, elapsed time 0.00
[IterativeImputer] Change: 0.11316669398578494, scaled tolerance: 0.011335356993707154
[IterativeImputer] Ending imputation round 3/10, elapsed time 0.01
[IterativeImputer] Change: 0.00012243243720222452, scaled tolerance: 0.011335356993707154
[IterativeImputer] Early stopping criterion reached.
../_images/examples_mltools_example_59_1.png