IMPLEMENTING RANDOM FOREST (CLASSIFICATION): MACHINE LEARNING ALGORITHM FROM SCRATCH WITH REAL DATASETS

1. Understanding the dataset

Dataset information:

The dataset contains cases from a study conducted between 1958 and 1970 at the University of Chicago's Billings Hospital on the survival of patients who had undergone surgery for breast cancer.

Attribute information:

X1: Age of the patient at the time of operation (numerical)
X2: Year of the patient's operation (year minus 1900, numerical)
X3: Number of positive axillary nodes detected (numerical)
Y: Survival status (class attribute)
    1 = the patient survived 5 years or longer
    2 = the patient died within 5 years

2. Importing Datasets

In [1]:

import numpy as np
import pandas as pd
df = pd.read_csv("survival.csv")
print(df)
X1  X2  X3  Y
0    30  64   1  1
1    30  62   3  1
2    30  65   0  1
3    31  59   2  1
4    31  65   4  1
5    33  58  10  1
6    33  60   0  1
7    34  59   0  2
8    34  66   9  2
9    34  58  30  1
10   34  60   1  1
11   34  61  10  1
12   34  67   7  1
13   34  60   0  1
14   35  64  13  1
15   35  63   0  1
16   36  60   1  1
17   36  69   0  1
18   37  60   0  1
19   37  63   0  1
20   37  58   0  1
21   37  59   6  1
22   37  60  15  1
23   37  63   0  1
24   38  69  21  2
25   38  59   2  1
26   38  60   0  1
27   38  60   0  1
28   38  62   3  1
29   38  64   1  1
..   ..  ..  .. ..
276  67  66   0  1
277  67  61   0  1
278  67  65   0  1
279  68  67   0  1
280  68  68   0  1
281  69  67   8  2
282  69  60   0  1
283  69  65   0  1
284  69  66   0  1
285  70  58   0  2
286  70  58   4  2
287  70  66  14  1
288  70  67   0  1
289  70  68   0  1
290  70  59   8  1
291  70  63   0  1
292  71  68   2  1
293  72  63   0  2
294  72  58   0  1
295  72  64   0  1
296  72  67   3  1
297  73  62   0  1
298  73  68   0  1
299  74  65   3  2
300  74  63   0  1
301  75  62   1  1
302  76  67   0  1
303  77  65   3  1
304  78  65   1  2
305  83  58   2  2
[306 rows x 4 columns]
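Before training, it can help to check how the two classes are distributed. The classification report in section 8 lists 225 rows of class 1 and 81 rows of class 2, so the data is imbalanced. The sketch below shows one way to check this with pandas. It is only an illustration: it uses the first ten rows printed above instead of the full survival.csv, and the names `sample`, `class_counts` and `class_share` are made up for this example.

```python
import pandas as pd

# First ten rows as printed above (a stand-in for the full survival.csv)
rows = [
    (30, 64, 1, 1), (30, 62, 3, 1), (30, 65, 0, 1), (31, 59, 2, 1), (31, 65, 4, 1),
    (33, 58, 10, 1), (33, 60, 0, 1), (34, 59, 0, 2), (34, 66, 9, 2), (34, 58, 30, 1),
]
sample = pd.DataFrame(rows, columns=["X1", "X2", "X3", "Y"])

# Number of rows per class label
class_counts = sample["Y"].value_counts()
# Fraction of rows per class label
class_share = sample["Y"].value_counts(normalize=True)

print(class_counts)
print(class_share)
```

On the full file, the same two calls on `df["Y"]` would show the class balance of all 306 rows.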

3. Splitting the data for training

In [2]:

# All 306 rows are used for training; no rows are held out
X_train = df[['X1', 'X2', 'X3']].values   # shape (306, 3)
y_train = df[['Y']].values                # shape (306, 1)
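The cell above puts all 306 rows into the training set, so nothing is left over to test the model on unseen data. One common alternative is a stratified hold-out split with scikit-learn's `train_test_split`. The sketch below is an assumption about how that could look, not part of the original notebook. It uses the first ten and last ten rows printed in section 2 as a stand-in for the full file, and the 80/20 ratio and `random_state=0` are arbitrary choices.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# First ten and last ten rows printed in section 2 (columns: X1, X2, X3, Y)
data = np.array([
    [30, 64, 1, 1], [30, 62, 3, 1], [30, 65, 0, 1], [31, 59, 2, 1], [31, 65, 4, 1],
    [33, 58, 10, 1], [33, 60, 0, 1], [34, 59, 0, 2], [34, 66, 9, 2], [34, 58, 30, 1],
    [72, 67, 3, 1], [73, 62, 0, 1], [73, 68, 0, 1], [74, 65, 3, 2], [74, 63, 0, 1],
    [75, 62, 1, 1], [76, 67, 0, 1], [77, 65, 3, 1], [78, 65, 1, 2], [83, 58, 2, 2],
])
X, y = data[:, :3], data[:, 3]

# stratify=y keeps the class proportions similar in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
print(X_tr.shape, X_te.shape)
```

With a split like this, the model would be fit on `X_tr`, `y_tr` and evaluated on `X_te`, `y_te`.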

In [3]:

print("Training data - Input")
print(X_train)
print("\n\nTraining data - Output")
print(y_train)
Training data - Input
[[30 64  1]
 [30 62  3]
 [30 65  0]
 [31 59  2]
 [31 65  4]
 [33 58 10]
 [33 60  0]
 [34 59  0]
 [34 66  9]
 [34 58 30]
 [34 60  1]
 [34 61 10]
 [34 67  7]
 [34 60  0]
 [35 64 13]
 [35 63  0]
 [36 60  1]
 [36 69  0]
 [37 60  0]
 [37 63  0]
 [37 58  0]
 [37 59  6]
 [37 60 15]
 [37 63  0]
 [38 69 21]
 [38 59  2]
 [38 60  0]
 [38 60  0]
 [38 62  3]
 [38 64  1]
 [38 66  0]
 [38 66 11]
 [38 60  1]
 [38 67  5]
 [39 66  0]
 [39 63  0]
 [39 67  0]
 [39 58  0]
 [39 59  2]
 [39 63  4]
 [40 58  2]
 [40 58  0]
 [40 65  0]
 [41 60 23]
 [41 64  0]
 [41 67  0]
 [41 58  0]
 [41 59  8]
 [41 59  0]
 [41 64  0]
 [41 69  8]
 [41 65  0]
 [41 65  0]
 [42 69  1]
 [42 59  0]
 [42 58  0]
 [42 60  1]
 [42 59  2]
 [42 61  4]
 [42 62 20]
 [42 65  0]
 [42 63  1]
 [43 58 52]
 [43 59  2]
 [43 64  0]
 [43 64  0]
 [43 63 14]
 [43 64  2]
 [43 64  3]
 [43 60  0]
 [43 63  2]
 [43 65  0]
 [43 66  4]
 [44 64  6]
 [44 58  9]
 [44 63 19]
 [44 61  0]
 [44 63  1]
 [44 61  0]
 [44 67 16]
 [45 65  6]
 [45 66  0]
 [45 67  1]
 [45 60  0]
 [45 67  0]
 [45 59 14]
 [45 64  0]
 [45 68  0]
 [45 67  1]
 [46 58  2]
 [46 69  3]
 [46 62  5]
 [46 65 20]
 [46 62  0]
 [46 58  3]
 [46 63  0]
 [47 63 23]
 [47 62  0]
 [47 65  0]
 [47 61  0]
 [47 63  6]
 [47 66  0]
 [47 67  0]
 [47 58  3]
 [47 60  4]
 [47 68  4]
 [47 66 12]
 [48 58 11]
 [48 58 11]
 [48 67  7]
 [48 61  8]
 [48 62  2]
 [48 64  0]
 [48 66  0]
 [49 63  0]
 [49 64 10]
 [49 61  1]
 [49 62  0]
 [49 66  0]
 [49 60  1]
 [49 62  1]
 [49 63  3]
 [49 61  0]
 [49 67  1]
 [50 63 13]
 [50 64  0]
 [50 59  0]
 [50 61  6]
 [50 61  0]
 [50 63  1]
 [50 58  1]
 [50 59  2]
 [50 61  0]
 [50 64  0]
 [50 65  4]
 [50 66  1]
 [51 59 13]
 [51 59  3]
 [51 64  7]
 [51 59  1]
 [51 65  0]
 [51 66  1]
 [52 69  3]
 [52 59  2]
 [52 62  3]
 [52 66  4]
 [52 61  0]
 [52 63  4]
 [52 69  0]
 [52 60  4]
 [52 60  5]
 [52 62  0]
 [52 62  1]
 [52 64  0]
 [52 65  0]
 [52 68  0]
 [53 58  4]
 [53 65  1]
 [53 59  3]
 [53 60  9]
 [53 63 24]
 [53 65 12]
 [53 58  1]
 [53 60  1]
 [53 60  2]
 [53 61  1]
 [53 63  0]
 [54 60 11]
 [54 65 23]
 [54 65  5]
 [54 68  7]
 [54 59  7]
 [54 60  3]
 [54 66  0]
 [54 67 46]
 [54 62  0]
 [54 69  7]
 [54 63 19]
 [54 58  1]
 [54 62  0]
 [55 63  6]
 [55 68 15]
 [55 58  1]
 [55 58  0]
 [55 58  1]
 [55 66 18]
 [55 66  0]
 [55 69  3]
 [55 69 22]
 [55 67  1]
 [56 65  9]
 [56 66  3]
 [56 60  0]
 [56 66  2]
 [56 66  1]
 [56 67  0]
 [56 60  0]
 [57 61  5]
 [57 62 14]
 [57 64  1]
 [57 64  9]
 [57 69  0]
 [57 61  0]
 [57 62  0]
 [57 63  0]
 [57 64  0]
 [57 64  0]
 [57 67  0]
 [58 59  0]
 [58 60  3]
 [58 61  1]
 [58 67  0]
 [58 58  0]
 [58 58  3]
 [58 61  2]
 [59 62 35]
 [59 60  0]
 [59 63  0]
 [59 64  1]
 [59 64  4]
 [59 64  0]
 [59 64  7]
 [59 67  3]
 [60 59 17]
 [60 65  0]
 [60 61  1]
 [60 67  2]
 [60 61 25]
 [60 64  0]
 [61 62  5]
 [61 65  0]
 [61 68  1]
 [61 59  0]
 [61 59  0]
 [61 64  0]
 [61 65  8]
 [61 68  0]
 [61 59  0]
 [62 59 13]
 [62 58  0]
 [62 65 19]
 [62 62  6]
 [62 66  0]
 [62 66  0]
 [62 58  0]
 [63 60  1]
 [63 61  0]
 [63 62  0]
 [63 63  0]
 [63 63  0]
 [63 66  0]
 [63 61  9]
 [63 61 28]
 [64 58  0]
 [64 65 22]
 [64 66  0]
 [64 61  0]
 [64 68  0]
 [65 58  0]
 [65 61  2]
 [65 62 22]
 [65 66 15]
 [65 58  0]
 [65 64  0]
 [65 67  0]
 [65 59  2]
 [65 64  0]
 [65 67  1]
 [66 58  0]
 [66 61 13]
 [66 58  0]
 [66 58  1]
 [66 68  0]
 [67 64  8]
 [67 63  1]
 [67 66  0]
 [67 66  0]
 [67 61  0]
 [67 65  0]
 [68 67  0]
 [68 68  0]
 [69 67  8]
 [69 60  0]
 [69 65  0]
 [69 66  0]
 [70 58  0]
 [70 58  4]
 [70 66 14]
 [70 67  0]
 [70 68  0]
 [70 59  8]
 [70 63  0]
 [71 68  2]
 [72 63  0]
 [72 58  0]
 [72 64  0]
 [72 67  3]
 [73 62  0]
 [73 68  0]
 [74 65  3]
 [74 63  0]
 [75 62  1]
 [76 67  0]
 [77 65  3]
 [78 65  1]
 [83 58  2]]

Training data - Output
[[1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [2]
 [2]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [2]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [2]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [2]
 [2]
 [2]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [2]
 [2]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [2]
 [2]
 [2]
 [2]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [2]
 [2]
 [2]
 [1]
 [1]
 [1]
 [1]
 [2]
 [2]
 [2]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [2]
 [2]
 [2]
 [2]
 [1]
 [1]
 [1]
 [2]
 [2]
 [2]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [2]
 [2]
 [2]
 [1]
 [1]
 [1]
 [1]
 [2]
 [2]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [2]
 [2]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [2]
 [2]
 [1]
 [1]
 [1]
 [1]
 [2]
 [2]
 [2]
 [2]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [2]
 [2]
 [2]
 [2]
 [2]
 [2]
 [1]
 [1]
 [1]
 [1]
 [1]
 [2]
 [2]
 [2]
 [2]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [2]
 [2]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [2]
 [2]
 [1]
 [1]
 [1]
 [1]
 [1]
 [2]
 [2]
 [2]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [2]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [2]
 [2]
 [1]
 [1]
 [1]
 [1]
 [2]
 [2]
 [2]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [2]
 [2]
 [2]
 [1]
 [1]
 [1]
 [1]
 [2]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [2]
 [2]
 [2]
 [2]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [2]
 [2]
 [1]
 [1]
 [1]
 [2]
 [2]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [2]
 [1]
 [1]
 [1]
 [2]
 [2]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [2]
 [1]
 [1]
 [1]
 [1]
 [1]
 [2]
 [1]
 [1]
 [1]
 [1]
 [2]
 [2]]

4. Implementing RANDOM FOREST CLASSIFICATION

In [4]:

from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier()
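This notebook uses scikit-learn's `RandomForestClassifier` rather than building the forest by hand. As a rough illustration of the idea behind it, the sketch below trains several decision trees on bootstrap samples, lets each tree consider a random subset of features at every split, and combines them by majority vote. This is a simplified sketch, not scikit-learn's actual implementation, which averages the trees' predicted class probabilities instead of counting hard votes. The helpers `fit_forest` and `predict_forest` are hypothetical names, and the data is the first ten rows printed in section 2.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_forest(X, y, n_trees=10, seed=0):
    """Train n_trees decision trees, each on its own bootstrap sample."""
    rng = np.random.default_rng(seed)
    n = len(X)
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)  # draw n rows with replacement
        tree = DecisionTreeClassifier(
            max_features="sqrt",  # random feature subset at each split
            random_state=int(rng.integers(0, 2**31 - 1)),
        )
        tree.fit(X[idx], y[idx])
        trees.append(tree)
    return trees

def predict_forest(trees, X):
    """Majority vote over the predictions of the individual trees."""
    votes = np.stack([t.predict(X) for t in trees])  # (n_trees, n_samples)
    preds = []
    for column in votes.T:  # all votes for one sample
        labels, counts = np.unique(column, return_counts=True)
        preds.append(labels[np.argmax(counts)])
    return np.array(preds)

# First ten rows printed in section 2 (columns: X1, X2, X3, Y)
data = np.array([
    [30, 64, 1, 1], [30, 62, 3, 1], [30, 65, 0, 1], [31, 59, 2, 1], [31, 65, 4, 1],
    [33, 58, 10, 1], [33, 60, 0, 1], [34, 59, 0, 2], [34, 66, 9, 2], [34, 58, 30, 1],
])
X, y = data[:, :3], data[:, 3]

trees = fit_forest(X, y, n_trees=10, seed=0)
preds = predict_forest(trees, X)
print(preds)
```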

5. Fitting the datasets

In [5]:

clf.fit(X_train,y_train.ravel())

Out[5]:

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
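The repr above shows the defaults of the scikit-learn version this notebook was run with, including `n_estimators=10` and `random_state=None`. With `random_state=None` the fitted forest, and therefore the predictions, can differ from run to run. More recent scikit-learn releases default to `n_estimators=100`. A minimal sketch of setting both values explicitly follows. The values 100 and 42 are illustrative choices, not taken from the original, and `clf_fixed` is a name made up for this example.

```python
from sklearn.ensemble import RandomForestClassifier

# Explicit settings make the run reproducible and independent of version defaults
clf_fixed = RandomForestClassifier(n_estimators=100, random_state=42)

# get_params() returns the constructor arguments as a dict
params = clf_fixed.get_params()
print(params["n_estimators"], params["random_state"])
```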

6. Predicting Sample Data

In [6]:

print(clf.predict([[83,58,2]]))
[2]
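`predict` returns only the winning class label. To see how confident the forest is, `predict_proba` returns the estimated probability of each class, in the order given by `classes_`. The sketch below is a hedged illustration. It fits a separate model on the first ten and last ten rows printed in section 2, so its numbers will not match those of the `clf` trained above, and `n_estimators=50` and `random_state=0` are arbitrary choices.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# First ten and last ten rows printed in section 2 (columns: X1, X2, X3, Y)
data = np.array([
    [30, 64, 1, 1], [30, 62, 3, 1], [30, 65, 0, 1], [31, 59, 2, 1], [31, 65, 4, 1],
    [33, 58, 10, 1], [33, 60, 0, 1], [34, 59, 0, 2], [34, 66, 9, 2], [34, 58, 30, 1],
    [72, 67, 3, 1], [73, 62, 0, 1], [73, 68, 0, 1], [74, 65, 3, 2], [74, 63, 0, 1],
    [75, 62, 1, 1], [76, 67, 0, 1], [77, 65, 3, 1], [78, 65, 1, 2], [83, 58, 2, 2],
])
X, y = data[:, :3], data[:, 3]

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# One row per input sample, one column per class in model.classes_
proba = model.predict_proba([[83, 58, 2]])
print(model.classes_)
print(proba)
```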

7. Predicting on the training data

In [7]:

# Predict on the same rows the model was trained on; used for the report in the next step
y_pred = clf.predict(X_train)
print(y_pred)
[1 1 1 1 1 1 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1
 1 1 1 1 1 1 2 1 2 1 1 1 1 1 1 1 2 2 1 1 1 1 1 1 1 2 1 2 2 1 1 1 1 1 1 1 2
 2 2 1 1 1 1 2 2 2 1 1 1 1 1 2 2 2 2 2 1 1 1 2 2 2 1 1 1 1 1 1 1 1 2 2 2 1
 1 1 1 1 2 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 1 1 2 1 2 2 1 1
 1 1 1 1 1 1 1 1 2 2 2 2 2 2 1 1 1 1 1 2 2 2 2 1 1 1 1 1 1 1 1 1 1 2 1 1 1
 1 1 1 1 1 2 2 1 1 1 1 1 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1
 1 2 2 1 1 1 1 2 2 2 1 1 1 1 1 1 2 2 2 1 1 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 2
 2 2 2 2 1 1 1 1 1 2 2 2 1 1 2 2 1 1 1 1 1 1 2 1 1 1 2 2 1 1 1 1 1 1 2 1 1
 1 1 1 2 1 1 1 1 2 2]

8. Report Generation

In [8]:

from sklearn.metrics import classification_report
report = classification_report(y_train, y_pred)
print(report)
             precision    recall  f1-score   support

          1       0.97      0.98      0.98       225
          2       0.95      0.91      0.93        81

avg / total       0.96      0.96      0.96       306
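Note that this report is computed on the same 306 rows the model was fit on, so the scores are likely optimistic and say little about how the model handles new patients. One way to get a more realistic estimate is stratified k-fold cross-validation. The sketch below is an assumption about how that could be done, not part of the original notebook. It again uses the first ten and last ten rows printed in section 2 as a stand-in for the full file, and the three folds, `n_estimators=50` and `random_state=0` are arbitrary choices.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# First ten and last ten rows printed in section 2 (columns: X1, X2, X3, Y)
data = np.array([
    [30, 64, 1, 1], [30, 62, 3, 1], [30, 65, 0, 1], [31, 59, 2, 1], [31, 65, 4, 1],
    [33, 58, 10, 1], [33, 60, 0, 1], [34, 59, 0, 2], [34, 66, 9, 2], [34, 58, 30, 1],
    [72, 67, 3, 1], [73, 62, 0, 1], [73, 68, 0, 1], [74, 65, 3, 2], [74, 63, 0, 1],
    [75, 62, 1, 1], [76, 67, 0, 1], [77, 65, 3, 1], [78, 65, 1, 2], [83, 58, 2, 2],
])
X, y = data[:, :3], data[:, 3]

# Each fold is held out once while the model is trained on the other folds
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
scores = cross_val_score(
    RandomForestClassifier(n_estimators=50, random_state=0), X, y, cv=cv
)  # accuracy per fold (the default scoring for classifiers)
print(scores, scores.mean())
```

On the full dataset, `X_train` and `y_train.ravel()` from section 3 could be passed in place of `X` and `y`.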