Decision Tree Module

Decision Tree module includes three core classes:

These classes all together allow to train Decision Tree model on any dataset and then perform classification.

class scripts.models.decision_tree.DecisionTree(algorithm='greedy', criterion='gini', max_depth=None, max_features='auto', min_samples_split=2, random_state=None)

Bases: object

DecisionTree data structure which can be utilized for Classification or Regression.

This class serves as a core data structure for implementing decision tree algorithm. Its main functionality is to build the tree and then allow to traverse it in order to make predictions.

Parameters
  • criterion (str, optional) – How to calculate impurity of nodes.

  • algorithm (str, optional) – How to approach building process of the tree.

  • max_depth (int, optional) – Maximum depth of the tree.

  • max_features (int or float or str, optional) – How many features to consider during each split of the node.

  • min_samples_split (int, optional) – Minimum samples required to be present within the node in order to be able to further split it.

  • random_state (int, optional) – If using random alforithm, it is useful to specify this parameter in order to ensure reproducibility of results.

root

Root node of this tree.

Type

Node

num_nodes

Number of nodes that this tree has.

Type

int

num_leaf_nodes

Number of leaf nodes that this tree has.

Type

int

_best_split(X, y)

Find best split from given feature space.

This methods finds best split from given feature space using given criterion (e.g. gini).

Parameters
  • X (2d-array) – Subset of the original dataset (self.X)

  • y (1d-array) – Subset of the original targets (self.y)

Returns

  • loss (float) – Measure of how good the split of this node is.

  • p (int) – Index of the feature according to which the split is supposed to be done.

  • val (float or int) – Value according which to do the split.

_check_criterion(node)

Check criteria neccessary to split the given node.

Parameters

node (Node) – Leaf node which you want to evaluate.

Returns

Can you split the given node.

Return type

bool

_evaluate_leaf(node)

Return predicted class and probabilities of all classes.

Parameters

node (Node) – Leaf node which you want to evaluate.

Returns

  • predict (int) – Predicted class.

  • predict_proba (list) – Probabilities of all classes

_get_child_info(node, X)

Get info about given node’s children.

By info, it is precisely meant size of both children and relevant indices.

Parameters
  • node (Node) – Parent node.

  • X (2d array) – Subset of the original training dataset (self.X)

Returns

  • sizes (list) – List with size of both children.

  • left_indices (1d array) – Indices of samples in the left node.

  • right_indices (1d array) – Indices of samples in the right node.

_is_pure(node)

Return if given node is pure or not.

Pure means that the given node only contains samples with the same class.

Parameters

node (Node) – Leaf node which you want to evaluate.

Returns

Is given node pure or not.

Return type

bool

_run_init_setup()

Run neccessary proccesses for initialization of the class.

Notes

The neccessary processes are:

  • Assignment of relevant criterion function

_run_training_setup(X, y)

Run preliminary setup before the start of training.

Parameters
  • X (2d-array) – Training dataset

  • y (1d-array) – Target values.

Notes

The following things are done:

  • Validation of input tensors X and y - X must be 2 dimensional. If it is 1 dimensional, transformation to 2D is attempted. y must be 1 dimensional.

  • X and y must have identical first dimension

  • Parse number of training records n, size of feature space p and # of unique classes k

  • Map provided unique classes to internal normalized format, i.e., 0, 1, 2, …

  • Determination of max features variable

_split(curr)

Split provided node.

High level overview of algorithm (see further explanation in Notes):

  1. Compute optimal split for the given Node.

  2. Compute info about two child nodes which is needed to initiliaze them

  3. Initialize child nodes and for each

  1. IF ALLOWED split it further

  2. ELSE Make it leaf node

Repeat the same process recursively.

Parameters

curr (Node) – Current node which should be splitted.

Notes

STEP 1. In order to find optimal split of the node, the following pseudo algorithm is used:

  1. Define proper feature space, i.e. features to be considered for the split. This is determined based on the parameter self.max_features.

  2. For each feature within the proper feature space:

    1. Sort its values and make sure they are all unique

    2. For each pair of neighboring values, compute the split

    3. For this split compute the total score - weighted average of given criterion

    4. Return the best split based on the total score

  3. From the best splits for each feature, choose the overall best split and return it

STEP 2. Following steps are done:

  • Get boolean array which represents the split of data - True = left node, False = right node

  • Get indices of samples which should belong to left and right child nodes respectively

STEP 3. There are two possible scenario for the node in question:

  1. split it further

  2. make it leaf node

Split it further if the node:

  • is NOT pure (i.e. does not contain just one type of class)

  • its depth is not equal to maximum depth specified

  • Contains more samples than the specified minimum

  • Has more than 1 unique sample

Otherwise, turn it into a leaf node.

fit(X, y)

Build decision tree.

Parameters
  • X (2d-array) – Training dataset

  • y (1d-array) – Target values.

predict(X)

Predict classes for given dataset X.

Parameters

X (2d-array) – Dataset based on which you want to make a prediction.

Returns

Returns array with predictions.

Return type

1d-array

predict_proba(X)

Return 2d array with probabilities for given classes.

Traverse the tree until you reach a leaf node. Then, return the probabilites associated with all classes.

Parameters

X (2d array) – Samples based on which to predict corresponding class probabilities.

Returns

n x k array where n is number of provided samples and k is number of classes.

Return type

2d array

class scripts.models.decision_tree.DecisionTreeClassifier(algorithm='greedy', criterion='gini', max_depth=None, max_features='auto', min_samples_split=2, random_state=None)

Bases: scripts.base._base_classifier.BaseClassifier, scripts.models.decision_tree._decision_tree.DecisionTree

Class allowing to specify parameters of the model.

This class serves as a high level client facing API. It allows to specify all hyper parameters important for training process.

Parameters
  • criterion (str, optional) – How to calculate impurity of nodes.

  • algorithm (str, optional) – How to approach building process of the tree.

  • max_depth (int, optional) – Maximum depth of the tree.

  • max_features (int or float or str, optional) – How many features to consider during each split of the node.

  • min_samples_split (int, optional) – Minimum samples required to be present within the node in order to be able to further split it.

  • random_state (int, optional) – If using random alforithm, it is useful to specify this parameter in order to ensure reproducibility of results.

class scripts.models.decision_tree.Node(size=None, values=None, depth=None, _type='internal')

Bases: object

Core building block for a Decision Tree.

The main purpose of a Node class is to hold information about decisions of the given decision tree. See what is meant by information in the below description of attributes.

Parameters
  • size (int, optional) – Number of samples which are part of this node.

  • values (iterable, optional) – Row indices of the samples which are part of this node.

  • depth (int, optional) – Depth of this node in decision tree.

  • _type (str, optional) – What type of node this node is - root, internal, leaf.

p

Index of a feature within the given training dataset. Split of a node is made according to this feature.

Type

int

val

Split the incoming dataset according to the p-th feature and its value (val).

Type

float or int

loss

Measure of how good the split of this node is. For example, when using gini impurity, this would be a weighted average of gini impurities of its child nodes.

Type

float or int

split

Amount of samples in left and right child of this node.

Type

list

left

Left child of this node.

Type

Node

right

Right child of this node.

Type

Node

predict

Predicted class based on the frequency of this class within this node. (Most common)

Type

int or str

predict_proba

Probabilites of the classes in question.

Type

list

decision(X)

Decide whether the given sample(s)’ p-th feature value is smaller than given threshold.

Parameters

x (nd-array) – Samples present within this node. (Actual values)

Returns

Boolean array.

Return type

1d-array

is_leaf()

Return if the given node is a leaf node or not.

Returns

Return type

bool