2025-03-14 04:58:59 +08:00

---
title: db1's Machine Learning Notes
createTime: 2025/02/24 03:13:20
permalink: /learning-notes/ml/personal/db1/
icon: /avatar/db1.jpg
---
Author: [@db1](/friends/persons/)
## Code Examples
**Code examples and steps for supervised machine learning (using the scikit-learn framework)**
### 1. Data Preprocessing
**Importing the dataset**
```python
import pandas as pd
df = pd.read_excel('NHANES.xlsx')
```
**Splitting the dataset** Split the data into a training set and a test set using the common 80/20 ratio.
```python
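from sklearn.model_selection import train_test_split

y = df['GDM']  # target: categorical variable
X = df[['Age.Class', 'BMI.Class', 'Pregnancy.times', 'Parturition.times', 'UA', 'ALT', 'AST', 'GGT', 'PLT', 'Ca', 'NPAR']]
# Hold out 20% of the samples as the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)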
```
**Whether missing values need to be imputed**
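The notes give no code for this step, so the following is only a minimal sketch of one possible approach: median imputation with scikit-learn's `SimpleImputer`. The toy frames `X_train_demo` and `X_test_demo` are hypothetical stand-ins for `X_train` and `X_test` (they are not the NHANES data), and the median strategy is an assumption rather than the author's choice.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data standing in for X_train / X_test (not the NHANES data)
X_train_demo = pd.DataFrame({"UA": [300.0, np.nan, 280.0, 320.0],
                             "ALT": [20.0, 25.0, np.nan, 15.0]})
X_test_demo = pd.DataFrame({"UA": [np.nan, 310.0],
                            "ALT": [18.0, np.nan]})

# Count missing values per column to decide whether imputation is needed
print(X_train_demo.isna().sum())

# Fit the imputer on the training set only, then apply it to both sets,
# so that no information from the test set leaks into the fitted medians
imputer = SimpleImputer(strategy="median")
X_train_imputed = pd.DataFrame(imputer.fit_transform(X_train_demo),
                               columns=X_train_demo.columns)
X_test_imputed = pd.DataFrame(imputer.transform(X_test_demo),
                              columns=X_test_demo.columns)
```

If no column reports missing values, this step can simply be skipped.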
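**Whether the classes need to be balanced**
At this step, check whether the classes in the training set are balanced and whether rebalancing is needed. For example, an imbalanced dataset can be oversampled with SMOTE.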
```python
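from collections import Counter
from imblearn.over_sampling import SMOTE

# Oversample the minority class in the training set only
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)
print("Class distribution after balancing:", Counter(y_resampled))
X_train = X_resampled
y_train = y_resampled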
```
**Normalizing the data**
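The notes do not include code for normalization. As a hedged sketch, min-max scaling with scikit-learn's `MinMaxScaler` could look like the following; the arrays `X_train_demo` and `X_test_demo` are hypothetical placeholders for the real feature matrices, and choosing min-max scaling (rather than, say, `StandardScaler`) is an assumption.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy features standing in for X_train / X_test
X_train_demo = np.array([[20.0, 100.0],
                         [30.0, 300.0],
                         [40.0, 200.0]])
X_test_demo = np.array([[25.0, 150.0],
                        [50.0, 400.0]])

# Learn each column's min and max from the training set only
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train_demo)

# Reuse the training-set min/max on the test set;
# test values outside the training range fall outside [0, 1]
X_test_scaled = scaler.transform(X_test_demo)
print(X_test_scaled)
```

Scaling matters most for distance-based or margin-based models such as `KNeighborsClassifier` and `SVC`; tree-based models are largely insensitive to it.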
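### 2. Model Selection
Taking a classification problem and several common models as an example, this step computes the accuracy and AUC of each model on the training and test sets and collects the results in a table.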
```python
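import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Model creation
List_of_models = [LogisticRegression(random_state=42),
                  KNeighborsClassifier(),
                  GaussianNB(),
                  DecisionTreeClassifier(random_state=42),
                  SVC(probability=True),  # probability=True enables predict_proba
                  RandomForestClassifier(random_state=42)]
List_of_models_for_graph = ["LogisticRegression", "KNeighborsClassifier", "GaussianNB", "DecisionTreeClassifier", "SVM", "RandomForestClassifier"]
df_results = pd.DataFrame(index=["train accuracy", "test accuracy", "train auc", "test auc"])
for i in range(len(List_of_models)):
    model_class = List_of_models[i]
    model_class.fit(X_train, y_train)
    y_scores = model_class.predict_proba(X_train)[:, 1]  # predicted probability of the positive class
    auc1 = roc_auc_score(y_train, y_scores)
    y_scores = model_class.predict_proba(X_test)[:, 1]  # predicted probability of the positive class
    auc2 = roc_auc_score(y_test, y_scores)
    results_classification = np.array([model_class.score(X_train, y_train), model_class.score(X_test, y_test), auc1, auc2])
    df_results[List_of_models_for_graph[i]] = results_classification
df_graph = df_results.transpose()  # transpose so that each row is one model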
```
### 3. Hyperparameter Tuning
Common tuning approaches include grid search and Optuna. The code below tunes a random forest with grid search.
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Candidate values for each hyperparameter; every combination is evaluated
RandomForestClassifier_parameters = {
    'criterion': ['gini', 'entropy'],
    'n_estimators': [1, 10, 20, 30, 100, 200, 400],
    'min_samples_split': [7, 9, 10, 20, 30],
    'min_samples_leaf': [1, 2, 5, 7, 9, 10],
    'max_features': ['sqrt', 'log2'],
}
RandomForestClassifier_GridSearchCV = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid=RandomForestClassifier_parameters,
    cv=5, n_jobs=-1, verbose=0,
)
RandomForestClassifier_GridSearchCV.fit(X_train, y_train)
List_of_models[5] = RandomForestClassifier_GridSearchCV.best_estimator_  # update the model list
print(RandomForestClassifier_GridSearchCV.best_params_)
```
### 4. Displaying Results
For example, the results can be presented as ROC curves:
```python
import matplotlib.pyplot as plt
from sklearn.metrics import auc, roc_curve

models = {
    "KNeighbors Classifier": List_of_models[1],
    "GaussianNB": List_of_models[2],
    "DecisionTree Classifier": List_of_models[3],
    "SVM": List_of_models[4],
    "RF": List_of_models[5],
}
# Plot the ROC curves
plt.figure(figsize=(10, 10))
for name, model in models.items():
    y_scores = model.predict_proba(X_test)[:, 1]  # predicted probability of the positive class
    fpr, tpr, _ = roc_curve(y_test, y_scores)
    roc_auc = auc(fpr, tpr)
    plt.plot(fpr, tpr, label=f'{name} (AUC = {roc_auc:.3f})')
plt.plot([0, 1], [0, 1], 'k--', label='Random Guess')
plt.title('ROC Curve for Testing Set')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend(loc='lower right')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.grid()
plt.show()
```
------
### 5. Download the Complete Code
<RepoCard repo="db1-bot/mL-learning-notes"></RepoCard>