How to find only the true positives in a DataFrame when ground truth is available?

Problem description

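First, apologies for the long description, but I want everyone to understand what I am doing.

I am working on a detection model that predicts 14 different pathologies, and I have written an inference script that makes predictions for any new test image. I ran it on a dataset of more than 25k test images and saved the predictions in a DataFrame like the one below.

Each row of this DataFrame contains the following (a short sample, just enough to explain my situation):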

image_name: 00000003_000.png
label: [[[0.0, 0.0, 1024.0, 1024.0], [0.0, 0.0, 1024.0, 1024.0], [119.195767195767, 339.166137566138, 470.281481481481, 511.458201058202]], ['Cardiomegaly', 'Edema', 'Infiltration']]
Bounding_Box: True/False
Atelectasis: 0.172639399766922
Cardiomegaly: 0.064461663365364
Consolidation: 0.436323910951614
Edema: 0.152604594826698
Effusion: 0.077432356774807
Emphysema: 0.569778263568878
Fibrosis: 0.333310723304749
Hernia: 0.219542726874351
Infiltration: 0.240452200174332
Mass: 0.291741400957108
Nodule: 0.076222963631153
Pleural_Thickening: 0.294208467006683
Pneumonia: 0.281939893960953
Pneumothorax: 0.386653006076813

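What I want: there are two ways to approach this. For example, get the rows for each individual class, i.e. first find all rows whose label contains Cardiomegaly, whether single-label or multi-label.

Then apply the method below, or any other suitable method, to find the TPs.

I want to take images that have ground truth such as ['Cardiomegaly', 'Edema', 'Infiltration'] together with the probabilities for the 14 pathologies, and find the True Positives, i.e. whether these actual labels have the highest probability values.

For example, for Cardiomegaly: if it has the highest probability, create a new column and put True in it. I do not know how to handle multiple labels: after checking the first label, I would need to check the second label and whether it has the next-highest probability, and I am not sure how to do that. With the help of @tlentali I made one more attempt (thanks for the help). Here is what I did: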

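import pandas as pd

df = pd.read_csv('/home/ali/Desktop/CX/sample.csv')
# name of the pathology column with the highest predicted probability in each row
df["best_score"] = df.drop(['file', 'set', 'label', 'bbx'], axis=1).idxmax(axis=1)
# True when that top-scoring class name occurs in the ground-truth label string
df['evaluation'] = df.apply(lambda x: x["best_score"] in x["label"], axis=1)
df.groupby('best_score')['evaluation'].mean()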

This gives me:

best_score
Atelectasis           0.452465
Cardiomegaly          0.250000
Consolidation         0.123164
Edema                 0.029520
Effusion              0.555459
Emphysema             0.068618
Fibrosis              0.066116
Hernia                0.032258
Infiltration          0.400000
Mass                  0.177524
Nodule                0.604167
Pleural_Thickening    0.188482
Pneumonia             0.049133
Pneumothorax          0.108156
Name: evaluation, dtype: float64

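This is not what I want, and it only works for a single label, not for multiple labels. Please help. Sorry again for the long description; I just wanted everyone to understand what I need. Thank you.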

Recommended answer

Starting from your DataFrame:

>>> import pandas as pd

>>> df
                file    set     label                                        bbx    Atelectasis Cardiomegaly    Consolidation   Edema   Effusion    Emphysema   Fibrosis    Hernia  Infiltration    Mass    Nodule  Pleural_Thickening  Pneumonia   Pneumothorax
0   00000003_000.png    Test    [[[0.0, 0.0, 1024.0, 1024.0]], ['Hernia']]  False   0.145712    0.028958    0.205006    0.055228    0.115680    0.376638    0.349124    0.357694    0.122496    0.202218    0.075018    0.118994    0.195345    0.215577
1   00000003_001.png    Test    [[[0.0, 0.0, 1024.0, 1024.0]], ['Hernia']]  False   0.132639    0.046136    0.169713    0.092743    0.285383    0.614464    0.311035    0.344040    0.117032    0.447748    0.152327    0.094364    0.174125    0.316022
2   00000003_002.png    Test    [[[0.0, 0.0, 1024.0, 1024.0]], ['Hernia']]  False   0.233026    0.042541    0.227911    0.047988    0.116835    0.595102    0.330304    0.367272    0.117985    0.298624    0.109354    0.133473    0.185444    0.379627
3   00000003_003.png    Test    [[[0.0, 0.0, 1024.0, 1024.0], [0.0, 0.0, 1024....   False   0.298693    0.022646    0.237977    0.035348    0.143645    0.487804    0.384509    0.379062    0.083205    0.625744    0.102377    0.207353    0.184517    0.354402
4   00000003_004.png    Test    [[[0.0, 0.0, 1024.0, 1024.0]], ['Hernia']]  False   0.522152    0.052897    0.237475    0.082139    0.200029    0.473421    0.377468    0.336104    0.106339    0.488078    0.088047    0.146686    0.200919    0.313684

First, we eval the label column and extract the classes we expect the model to predict:

>>> df['label'] = df['label'].apply(eval)
>>> df['class'] = df.label.apply(lambda x: x[1])
>>> df['class']
0                                              [Hernia]
1                                              [Hernia]
2                                              [Hernia]
3                                [Hernia, Infiltration]
4                                              [Hernia]
5                                              [Hernia]
6                                              [Hernia]
7                                              [Hernia]
8                                          [No Finding]
9                             [Emphysema, Pneumothorax]
10                            [Emphysema, Pneumothorax]
11                                 [Pleural_Thickening]
12    [Effusion, Emphysema, Infiltration, Pneumothorax]
13    [Emphysema, Infiltration, Pleural_Thickening, ...
14                             [Effusion, Infiltration]
15                                       [Infiltration]
Name: class, dtype: object
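If the CSV comes from a source you do not fully control, `ast.literal_eval` may be a safer substitute for `eval`, since it only accepts Python literals. The sketch below is an illustration on a hypothetical two-row frame named `toy`, with strings that mimic the label format above; it is not part of the original answer.

```python
import ast

import pandas as pd

# Hypothetical two-row frame; the strings mimic the 'label' column format above.
toy = pd.DataFrame({
    "label": [
        "[[[0.0, 0.0, 1024.0, 1024.0]], ['Hernia']]",
        "[[[0.0, 0.0, 1024.0, 1024.0], [0.0, 0.0, 1024.0, 1024.0]], "
        "['Hernia', 'Infiltration']]",
    ]
})

# literal_eval parses lists, numbers and strings but refuses arbitrary expressions.
parsed = toy["label"].apply(ast.literal_eval)

# The second element of each parsed label holds the class names.
toy["class"] = parsed.apply(lambda x: x[1])
print(toy["class"].tolist())
```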

Then we explode the class column so that each expected class gets its own row, as shown below:

>>> df = df.explode('class')
>>> df = df.reset_index(drop=True)
>>> df['class']
0                 Hernia
1                 Hernia
2                 Hernia
3                 Hernia
4           Infiltration
5                 Hernia
6                 Hernia
7                 Hernia
8                 Hernia
9             No Finding
10             Emphysema
11          Pneumothorax
12             Emphysema
13          Pneumothorax
14    Pleural_Thickening
15              Effusion
16             Emphysema
17          Infiltration
18          Pneumothorax
19             Emphysema
20          Infiltration
21    Pleural_Thickening
22          Pneumothorax
23              Effusion
24          Infiltration
25          Infiltration
Name: class, dtype: object
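To make the effect of `explode` concrete, here is a minimal sketch on a hypothetical two-image frame (the `toy` data and its values are made up for illustration). Each list element becomes its own row, and the other columns, including the predicted scores, are repeated for every label of the same image.

```python
import pandas as pd

# Hypothetical frame: one single-label image and one two-label image.
toy = pd.DataFrame({
    "file": ["a.png", "b.png"],
    "class": [["Hernia"], ["Emphysema", "Pneumothorax"]],
    "Hernia": [0.36, 0.02],  # one made-up score column
})

# One row per (image, expected class); scores are duplicated per label.
exploded = toy.explode("class").reset_index(drop=True)
print(exploded)
```

Keep in mind that after this step an image with several labels contributes several rows, so row counts no longer equal image counts.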

Then we convert the data to dummy (one-hot) format:

>>> classes = ['Atelectasis', 
...            'Cardiomegaly',
...            'Consolidation', 
...            'Edema', 
...            'Effusion', 
...            'Emphysema', 
...            'Fibrosis', 
...            'Hernia',
...            'Infiltration', 
...            'Mass', 
...            'Nodule', 
...            'Pleural_Thickening', 
...            'Pneumonia',
...            'Pneumothorax',
...            'No Finding']
>>> s = df['class']
>>> # sum(level=0) was removed in pandas 2.0; groupby(level=0).sum() is the equivalent
>>> df_classes = pd.get_dummies(s.apply(pd.Series).stack()).groupby(level=0).sum()
>>> df_classes.head()
    Effusion    Emphysema   Hernia  Infiltration    No Finding  Pleural_Thickening  Pneumothorax
0   0           0           1       0               0           0                   0
1   0           0           1       0               0           0                   0
2   0           0           1       0               0           0                   0
3   0           0           1       0               0           0                   0
4   0           0           0       1               0           0                   0

Since we are working with a toy dataset here, a few adjustments are needed so that every required class is present in the dummy format:

>>> df_classes['Atelectasis'] = 0 
>>> df_classes['Cardiomegaly'] = 0 
>>> df_classes['Consolidation'] = 0 
>>> df_classes['Edema'] = 0 
>>> df_classes['Fibrosis'] = 0 
>>> df_classes['Mass'] = 0 
>>> df_classes['Nodule'] = 0 
>>> df_classes['Pneumonia'] = 0 
>>> # the model outputs no 'No Finding' score, so df gets a constant score column for it
>>> df['No Finding'] = 0
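As an alternative to adding the missing columns one by one, `DataFrame.reindex` can add every absent class in a single call. This is a minimal sketch with a shortened, hypothetical class list and a made-up `s`; the name `df_classes_toy` is only for illustration.

```python
import pandas as pd

# Shortened, hypothetical class list; the real one has 15 entries.
classes = ["Atelectasis", "Hernia", "Infiltration", "No Finding"]

# Made-up exploded class column.
s = pd.Series(["Hernia", "Infiltration", "Hernia"], name="class")

# One-hot encode, casting to int because newer pandas returns booleans.
dummies = pd.get_dummies(s).astype(int)

# reindex adds each class missing from the data as an all-zero column
# and puts the columns in the same order as `classes`.
df_classes_toy = dummies.reindex(columns=classes, fill_value=0)
print(df_classes_toy)
```

This also guarantees that the column order matches `classes`, which matters when the frame is later converted with `to_numpy()`.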

Now we can use sklearn to get the TPR and, finally, the AUC:

from sklearn.metrics import roc_curve, auc


n_classes = len(classes)
y_test = df_classes[classes].to_numpy()
y_score = df[classes].to_numpy()

# Compute ROC curve and ROC area for each class
fpr = dict()
tpr = dict()
roc_auc = dict()
for i in range(n_classes):
    fpr[i], tpr[i], _ = roc_curve(y_test[:, i], y_score[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])
# Compute micro-average ROC curve and ROC area
fpr["micro"], tpr["micro"], _ = roc_curve(y_test.ravel(), y_score.ravel())
roc_auc["micro"] = auc(fpr["micro"], tpr["micro"])

Now we can look at the roc_auc values. The nan entries appear because not all classes are present in the toy dataset:

>>> roc_auc
{0: nan,
 1: nan,
 2: nan,
 3: nan,
 4: 0.3125,
 5: 0.7613636363636364,
 6: nan,
 7: 0.9479166666666666,
 8: 0.6190476190476191,
 9: nan,
 10: nan,
 11: 0.30208333333333337,
 12: nan,
 13: 0.7840909090909091,
 14: 0.5,
 'micro': 0.66562764158918}
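One way to avoid nan entries is to compute the AUC only for classes whose ground truth contains both positives and negatives. The sketch below assumes hypothetical arrays `y_true` and `y_prob` and uses `roc_auc_score` rather than the `roc_curve` plus `auc` pair above; it is an illustration, not the original answer's code.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical ground truth and scores for two classes over four rows.
names = ["Hernia", "Mass"]
y_true = np.array([[0, 0], [0, 0], [1, 0], [1, 0]])
y_prob = np.array([[0.1, 0.3], [0.4, 0.2], [0.35, 0.6], [0.8, 0.1]])

auc_by_class = {}
for i, name in enumerate(names):
    col = y_true[:, i]
    # ROC AUC is undefined when only one label value is present, so skip it.
    if col.min() == col.max():
        continue
    auc_by_class[name] = roc_auc_score(col, y_prob[:, i])

print(auc_by_class)
```

Here "Mass" never occurs in the ground truth, so it is skipped instead of producing nan.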

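We can now plot the ROC curve for each class from its TPR and FPR (pay attention to the classe variable here: because we are working on a toy dataset, some classes are empty):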

import matplotlib.pyplot as plt


plt.figure()
lw = 2
classe = 7
plt.plot(fpr[classe], tpr[classe], color='darkorange',
         lw=lw, label='ROC curve (area = %0.2f)' % roc_auc[classe])
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend(loc="lower right")
plt.show()
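Since the question asks for true positives specifically, per-class TP counts can be read from `sklearn.metrics.multilabel_confusion_matrix` once the probabilities are turned into 0/1 predictions. The sketch below is an assumption-laden illustration: the arrays are made up, the ground truth is in multi-hot form (one row per image, not exploded), and the 0.5 threshold is arbitrary. In practice a per-class threshold could be picked from the ROC curve.

```python
import numpy as np
from sklearn.metrics import multilabel_confusion_matrix

# Hypothetical multi-hot ground truth and predicted probabilities.
names = ["Hernia", "Infiltration", "Mass"]
y_true = np.array([[1, 0, 0],
                   [1, 1, 0],
                   [0, 0, 1],
                   [0, 1, 0]])
y_prob = np.array([[0.7, 0.2, 0.1],
                   [0.4, 0.6, 0.3],
                   [0.2, 0.1, 0.9],
                   [0.6, 0.8, 0.2]])

threshold = 0.5  # assumption: an arbitrary cut-off for illustration
y_pred = (y_prob >= threshold).astype(int)

# Shape (n_classes, 2, 2); each matrix is [[TN, FP], [FN, TP]].
mcm = multilabel_confusion_matrix(y_true, y_pred)
tp_by_class = {name: int(m[1, 1]) for name, m in zip(names, mcm)}

# Boolean mask marking every (row, class) cell that is a true positive.
tp_mask = (y_true == 1) & (y_pred == 1)
print(tp_by_class)
```

The `tp_mask` array can be attached to the DataFrame as extra columns if a per-image True/False flag per class is needed.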
