Scikit-Learn
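Scikit-learn is an open-source machine learning library for Python. It provides interfaces for data preprocessing, cross-validation, learning algorithms, and visualization.
Basic Example
from sklearn import neighbors, datasets, preprocessing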
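The listing above seems to be cut off after its import line. As a rough sketch of the kind of workflow those imports suggest (not the original author's code), the following loads a data set, splits it, scales it, and trains a KNN classifier. The iris data set, the two-feature slice, `random_state=33`, and `n_neighbors=5` are all illustrative assumptions.

```python
from sklearn import neighbors, datasets, preprocessing
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the iris data set and keep only the first two features (illustrative choice).
iris = datasets.load_iris()
X, y = iris.data[:, :2], iris.target

# Hold out 25% of the samples (the default); random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=33)

# Fit the scaler on the training data only, then apply it to both splits.
scaler = preprocessing.StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

# Train a k-nearest-neighbors classifier and score it on the held-out data.
knn = neighbors.KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(accuracy)
```

Fitting the scaler on the training split alone keeps information from the test split out of the model.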
Data Loading and Splitting
Data is usually stored in NumPy arrays or Pandas DataFrames:
import numpy as np
Scikit-learn also provides a convenient interface for splitting the data into training and test sets:
from sklearn.model_selection import train_test_split
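A minimal sketch of the split, assuming a small random feature matrix and made-up labels (neither comes from the original post):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 12 samples with 5 features each, plus one label per sample.
X = np.random.random((12, 5))
y = np.array(['M', 'F'] * 6)

# Reserve 25% of the samples for testing; random_state fixes the shuffle.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

print(X_train.shape, X_test.shape)
```

Note that `train_test_split` lives in `sklearn.model_selection`; the older `sklearn.cross_validation` module was removed in scikit-learn 0.20.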
Model
Model Creation
Supervised Learning
Linear Regression
from sklearn.linear_model import LinearRegression
# The `normalize` argument was removed in scikit-learn 1.2; scale the inputs with StandardScaler instead.
lr = LinearRegression()
Support Vector Machines
from sklearn.svm import SVC
svc = SVC(kernel='linear')
Naive Bayes
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
KNN
from sklearn import neighbors
knn = neighbors.KNeighborsClassifier(n_neighbors=5)
Unsupervised Learning
Principal Component Analysis
from sklearn.decomposition import PCA
pca = PCA(n_components=0.95)
KMeans
from sklearn.cluster import KMeans
k_means = KMeans(n_clusters=3, random_state=0)
Model Fitting
Supervised Learning
lr.fit(X, y)
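To make the `fit` call concrete, here is a small sketch on synthetic data. The data (points on the line y = 2x + 1) is an assumption chosen so that the learned parameters are easy to check.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data lying exactly on y = 2x + 1 (illustrative, not from the post).
X = np.arange(10, dtype=float).reshape(-1, 1)  # shape (n_samples, n_features)
y = 2.0 * X.ravel() + 1.0

lr = LinearRegression()
lr.fit(X, y)  # supervised estimators take both features and targets

print(lr.coef_, lr.intercept_)
```

Supervised estimators always receive the targets `y` in `fit`; that is the main difference from the unsupervised case.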
Unsupervised Learning
k_means.fit(X_train)
# fit_transform returns the projected data, not the fitted estimator
pca_model = pca.fit_transform(X_train)
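The following sketch fits both unsupervised estimators on synthetic data. The three well-separated blobs and the explicit `n_init=10` are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Hypothetical data: three well-separated blobs of 20 points in 4 dimensions.
rng = np.random.RandomState(0)
centers = np.array([[0, 0, 0, 0], [10, 10, 10, 10], [-10, 10, -10, 10]], dtype=float)
X_train = np.vstack([c + rng.normal(scale=0.5, size=(20, 4)) for c in centers])

# Unsupervised estimators are fitted on the features alone; no labels are passed.
k_means = KMeans(n_clusters=3, n_init=10, random_state=0)
k_means.fit(X_train)

# A float n_components keeps just enough components to explain 95% of the variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_train)

print(k_means.cluster_centers_.shape, X_reduced.shape)
```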
Model Prediction
Supervised Prediction
y_pred = svc.predict(np.random.random((2, 5)))
Unsupervised Prediction
y_pred = k_means.predict(X_test)
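A self-contained sketch of both prediction calls, assuming random training data with 5 features and alternating labels (both invented for illustration):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.cluster import KMeans

# Hypothetical training data: 20 samples, 5 features, two alternating classes.
rng = np.random.RandomState(0)
X_train = rng.random_sample((20, 5))
y_train = np.array([0, 1] * 10)

# Supervised: predict returns one class label per input row.
svc = SVC(kernel='linear').fit(X_train, y_train)
y_pred = svc.predict(rng.random_sample((2, 5)))

# Unsupervised: predict returns the index of the nearest cluster center.
k_means = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_train)
cluster_ids = k_means.predict(rng.random_sample((4, 5)))

print(y_pred, cluster_ids)
```

In both cases the input must have the same number of features as the data the estimator was fitted on.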
Model Evaluation
Classification Metrics
Accuracy Score
knn.score(X_test, y_test)
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)
Classification Report
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
Confusion Matrix
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, y_pred))
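The three classification metrics can be tried on a tiny hand-made example. The label vectors below are invented so that the results are easy to verify by hand.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Hypothetical ground truth and predictions: one class-1 sample is misclassified.
y_test = [0, 1, 1, 0, 1, 1]
y_pred = [0, 1, 0, 0, 1, 1]

accuracy = accuracy_score(y_test, y_pred)        # fraction of correct predictions
report = classification_report(y_test, y_pred)   # per-class precision, recall, F1
cm = confusion_matrix(y_test, y_pred)            # rows: true class, columns: predicted

print(accuracy)
print(report)
print(cm)
```

Here 5 of 6 predictions are correct, and the confusion matrix is `[[2, 0], [1, 3]]`: the single off-diagonal entry is the class-1 sample predicted as class 0.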
Regression Metrics
Mean Absolute Error
from sklearn.metrics import mean_absolute_error
y_true = [3, -0.5, 2]
mean_absolute_error(y_true, y_pred)
Mean Squared Error
from sklearn.metrics import mean_squared_error
mean_squared_error(y_test, y_pred)
R2 Score
from sklearn.metrics import r2_score
r2_score(y_true, y_pred)
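The snippets above never define `y_pred`. The sketch below supplies a hypothetical prediction vector so the three regression metrics can be run end to end; only `y_true` comes from the original.

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = [3, -0.5, 2]
y_pred = [2.5, -0.5, 2]  # hypothetical predictions, chosen for illustration

mae = mean_absolute_error(y_true, y_pred)  # mean of |error|    = 0.5 / 3
mse = mean_squared_error(y_true, y_pred)   # mean of error ** 2 = 0.25 / 3
r2 = r2_score(y_true, y_pred)              # 1 - SS_res / SS_tot = 1 - 0.25 / 6.5

print(mae, mse, r2)
```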
Clustering Metrics
Adjusted Rand Index
from sklearn.metrics import adjusted_rand_score
adjusted_rand_score(y_true, y_pred)
Homogeneity
from sklearn.metrics import homogeneity_score
homogeneity_score(y_true, y_pred)
V-measure
from sklearn.metrics import v_measure_score
v_measure_score(y_true, y_pred)
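These clustering metrics compare two labelings up to a renaming of the cluster ids. A small sketch with invented labels shows this: the predicted ids are swapped, yet the partition is identical, so every score is perfect.

```python
from sklearn.metrics import adjusted_rand_score, homogeneity_score, v_measure_score

# Hypothetical labels: same grouping, but the cluster ids 0 and 1 are swapped.
y_true = [0, 0, 1, 1]
y_pred = [1, 1, 0, 0]

ari = adjusted_rand_score(y_true, y_pred)
homogeneity = homogeneity_score(y_true, y_pred)
v_measure = v_measure_score(y_true, y_pred)

print(ari, homogeneity, v_measure)
```

This is why accuracy is not a suitable measure for clustering output: cluster ids carry no meaning of their own.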
Cross-Validation
from sklearn.model_selection import cross_val_score
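A short sketch of `cross_val_score`, assuming a KNN classifier on the iris data set with 4 folds (all three choices are illustrative):

```python
from sklearn import datasets, neighbors
from sklearn.model_selection import cross_val_score

iris = datasets.load_iris()
X, y = iris.data, iris.target

knn = neighbors.KNeighborsClassifier(n_neighbors=5)

# The estimator is cloned and refitted once per fold; one score is returned per fold.
scores = cross_val_score(knn, X, y, cv=4)

print(scores, scores.mean())
```

The estimator passed in does not need to be fitted beforehand.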
Data Preprocessing
Standardization
from sklearn.preprocessing import StandardScaler
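A minimal sketch of standardization, using small made-up arrays. Each feature is shifted to zero mean and scaled to unit variance, with the statistics learned from the training data only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: two features on very different scales.
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_test = np.array([[2.0, 40.0]])

scaler = StandardScaler().fit(X_train)             # learns per-feature mean and std
standardized_X = scaler.transform(X_train)
standardized_X_test = scaler.transform(X_test)     # reuses the training statistics

print(standardized_X)
print(standardized_X_test)
```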
Normalization
from sklearn.preprocessing import Normalizer
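Unlike standardization, `Normalizer` works row by row: each sample is rescaled to unit norm (L2 by default). A sketch with invented values:

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Hypothetical samples; the first row has L2 norm 5.
X = np.array([[3.0, 4.0], [1.0, 0.0]])

normalizer = Normalizer().fit(X)   # fit is stateless here, kept for API consistency
normalized_X = normalizer.transform(X)

print(normalized_X)
```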
Binarization
from sklearn.preprocessing import Binarizer
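`Binarizer` maps each value to 1 if it is strictly greater than the threshold and to 0 otherwise. A sketch with a made-up row and an assumed threshold of 0.0:

```python
import numpy as np
from sklearn.preprocessing import Binarizer

X = np.array([[-1.0, 0.0, 2.5]])  # hypothetical values

binarizer = Binarizer(threshold=0.0).fit(X)
binary_X = binarizer.transform(X)

print(binary_X)
```

The value equal to the threshold maps to 0, because the comparison is strict.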
Encoding Categorical Labels
from sklearn.preprocessing import LabelEncoder
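`LabelEncoder` turns class labels into integers, numbered in sorted order of the distinct labels. A sketch with invented string labels:

```python
from sklearn.preprocessing import LabelEncoder

y = ['M', 'F', 'M', 'F']  # hypothetical class labels

enc = LabelEncoder()
y_encoded = enc.fit_transform(y)          # 'F' -> 0, 'M' -> 1
y_decoded = enc.inverse_transform([0, 1])  # maps integers back to labels

print(enc.classes_, y_encoded, y_decoded)
```

It is meant for target labels; for input features, `OneHotEncoder` or `OrdinalEncoder` is usually the better fit.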
Imputing Missing Values
from sklearn.impute import SimpleImputer
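`SimpleImputer` replaces the old `sklearn.preprocessing.Imputer`, which was removed in scikit-learn 0.22. A minimal sketch, assuming missing entries are marked with NaN and filled with the column mean:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical data with one missing entry in the first column.
X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, 6.0]])

imp = SimpleImputer(missing_values=np.nan, strategy='mean')
X_filled = imp.fit_transform(X)  # NaN becomes the mean of 1 and 7, which is 4

print(X_filled)
```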
Generating Polynomial Features
from sklearn.preprocessing import PolynomialFeatures
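`PolynomialFeatures` expands the inputs into all monomials up to the given degree. A sketch for one invented sample with two features and an assumed degree of 2:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0, 3.0]])  # hypothetical sample with features x0 = 2, x1 = 3

poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)  # columns: 1, x0, x1, x0^2, x0*x1, x1^2

print(X_poly)
```

The number of output columns grows quickly with the degree and the number of input features, so high degrees are rarely practical.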
Model Tuning
Grid Search
from sklearn.model_selection import GridSearchCV
params = {"n_neighbors": np.arange(1, 3), "metric": ["euclidean", "cityblock"]}
grid = GridSearchCV(estimator=knn,
                    param_grid=params)
grid.fit(X_train, y_train)
print(grid.best_score_)
print(grid.best_estimator_.n_neighbors)
Randomized Parameter Optimization
from sklearn.model_selection import RandomizedSearchCV
params = {"n_neighbors": range(1, 5), "weights": ["uniform", "distance"]}
rsearch = RandomizedSearchCV(estimator=knn,
                             param_distributions=params,
                             cv=4,
                             n_iter=8,
                             random_state=5)
rsearch.fit(X_train, y_train)
print(rsearch.best_score_)
Bayesian Classifier
# Import the base libraries
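This section breaks off after its first comment, so what the author went on to show is unknown. As a stand-in, here is a minimal sketch of a Gaussian naive Bayes classifier; the iris data set and the split settings are assumptions, not the original example.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0)

# GaussianNB models each feature as a per-class normal distribution.
gnb = GaussianNB()
gnb.fit(X_train, y_train)

y_pred = gnb.predict(X_test)        # most probable class per sample
proba = gnb.predict_proba(X_test)   # class probabilities; each row sums to 1
accuracy = gnb.score(X_test, y_test)

print(accuracy)
```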