Decision Tree Module¶

Decision Tree module includes three core classes:

DecisionTreeClassifier,
DecisionTree
Node

These classes all together allow to train Decision Tree model on any dataset and then perform classification.

class scripts.models.decision_tree.DecisionTree(algorithm='greedy', criterion='gini', max_depth=None, max_features='auto', min_samples_split=2, random_state=None)¶

Bases: object

DecisionTree data structure which can be utilized for Classification or Regression.

This class serves as a core data structure for implementing decision tree algorithm. Its main functionality is to build the tree and then allow to traverse it in order to make predictions.

Parameters

criterion (str, optional) – How to calculate impurity of nodes.
algorithm (str, optional) – How to approach building process of the tree.
max_depth (int, optional) – Maximum depth of the tree.
max_features (int or float or str, optional) – How many features to consider during each split of the node.
min_samples_split (int, optional) – Minimum samples required to be present within the node in order to be able to further split it.
random_state (int, optional) – If using random alforithm, it is useful to specify this parameter in order to ensure reproducibility of results.

root¶

Root node of this tree.

Type: Node

num_nodes¶

Number of nodes that this tree has.

Type: int

num_leaf_nodes¶

Number of leaf nodes that this tree has.

Type: int

_best_split(X, y)¶

Find best split from given feature space.

This methods finds best split from given feature space using given criterion (e.g. gini).

Parameters

X (2d-array) – Subset of the original dataset (self.X)
y (1d-array) – Subset of the original targets (self.y)

Returns

loss (float) – Measure of how good the split of this node is.
p (int) – Index of the feature according to which the split is supposed to be done.
val (float or int) – Value according which to do the split.

_check_criterion(node)¶

Check criteria neccessary to split the given node.

Parameters: node (Node) – Leaf node which you want to evaluate.
Returns: Can you split the given node.
Return type: bool

_evaluate_leaf(node)¶

Return predicted class and probabilities of all classes.

Parameters

node (Node) – Leaf node which you want to evaluate.

Returns

predict (int) – Predicted class.
predict_proba (list) – Probabilities of all classes

_get_child_info(node, X)¶

Get info about given node’s children.

By info, it is precisely meant size of both children and relevant indices.

Parameters

node (Node) – Parent node.
X (2d array) – Subset of the original training dataset (self.X)

Returns

sizes (list) – List with size of both children.
left_indices (1d array) – Indices of samples in the left node.
right_indices (1d array) – Indices of samples in the right node.

_is_pure(node)¶

Return if given node is pure or not.

Pure means that the given node only contains samples with the same class.

Parameters: node (Node) – Leaf node which you want to evaluate.
Returns: Is given node pure or not.
Return type: bool

_run_init_setup()¶

Run neccessary proccesses for initialization of the class.

Notes

The neccessary processes are:

Assignment of relevant criterion function

_run_training_setup(X, y)¶

Run preliminary setup before the start of training.

Parameters

X (2d-array) – Training dataset
y (1d-array) – Target values.

Notes

The following things are done:

Validation of input tensors X and y - X must be 2 dimensional. If it is 1 dimensional, transformation to 2D is attempted. y must be 1 dimensional.
X and y must have identical first dimension
Parse number of training records n, size of feature space p and # of unique classes k
Map provided unique classes to internal normalized format, i.e., 0, 1, 2, …
Determination of max features variable

_split(curr)¶

Split provided node.

High level overview of algorithm (see further explanation in Notes):

Compute optimal split for the given Node.
Compute info about two child nodes which is needed to initiliaze them
Initialize child nodes and for each

IF ALLOWED split it further

ELSE Make it leaf node

Repeat the same process recursively.

Parameters: curr (Node) – Current node which should be splitted.

Notes

STEP 1. In order to find optimal split of the node, the following pseudo algorithm is used:

Define proper feature space, i.e. features to be considered for the split. This is determined based on the parameter self.max_features.
For each feature within the proper feature space:
1. Sort its values and make sure they are all unique
2. For each pair of neighboring values, compute the split
3. For this split compute the total score - weighted average of given criterion
4. Return the best split based on the total score
From the best splits for each feature, choose the overall best split and return it

STEP 2. Following steps are done:

Get boolean array which represents the split of data - True = left node, False = right node
Get indices of samples which should belong to left and right child nodes respectively

STEP 3. There are two possible scenario for the node in question:

split it further
make it leaf node

Split it further if the node:

is NOT pure (i.e. does not contain just one type of class)
its depth is not equal to maximum depth specified
Contains more samples than the specified minimum
Has more than 1 unique sample

Otherwise, turn it into a leaf node.

fit(X, y)¶

Build decision tree.

Parameters

X (2d-array) – Training dataset
y (1d-array) – Target values.

predict(X)¶

Predict classes for given dataset X.

Parameters: X (2d-array) – Dataset based on which you want to make a prediction.
Returns: Returns array with predictions.
Return type: 1d-array

predict_proba(X)¶

Return 2d array with probabilities for given classes.

Traverse the tree until you reach a leaf node. Then, return the probabilites associated with all classes.

Parameters: X (2d array) – Samples based on which to predict corresponding class probabilities.
Returns: n x k array where n is number of provided samples and k is number of classes.
Return type: 2d array

class scripts.models.decision_tree.DecisionTreeClassifier(algorithm='greedy', criterion='gini', max_depth=None, max_features='auto', min_samples_split=2, random_state=None)¶

Bases: scripts.base._base_classifier.BaseClassifier, scripts.models.decision_tree._decision_tree.DecisionTree

Class allowing to specify parameters of the model.

This class serves as a high level client facing API. It allows to specify all hyper parameters important for training process.

Parameters

criterion (str, optional) – How to calculate impurity of nodes.
algorithm (str, optional) – How to approach building process of the tree.
max_depth (int, optional) – Maximum depth of the tree.
max_features (int or float or str, optional) – How many features to consider during each split of the node.
min_samples_split (int, optional) – Minimum samples required to be present within the node in order to be able to further split it.
random_state (int, optional) – If using random alforithm, it is useful to specify this parameter in order to ensure reproducibility of results.

class scripts.models.decision_tree.Node(size=None, values=None, depth=None, _type='internal')¶

Bases: object

Core building block for a Decision Tree.

The main purpose of a Node class is to hold information about decisions of the given decision tree. See what is meant by information in the below description of attributes.

Parameters

size (int, optional) – Number of samples which are part of this node.
values (iterable, optional) – Row indices of the samples which are part of this node.
depth (int, optional) – Depth of this node in decision tree.
_type (str, optional) – What type of node this node is - root, internal, leaf.

p¶

Index of a feature within the given training dataset. Split of a node is made according to this feature.

Type: int

val¶

Split the incoming dataset according to the p-th feature and its value (val).

Type: float or int

loss¶

Measure of how good the split of this node is. For example, when using gini impurity, this would be a weighted average of gini impurities of its child nodes.

Type: float or int

split¶

Amount of samples in left and right child of this node.

Type: list

left¶

Left child of this node.

Type: Node

right¶

Right child of this node.

Type: Node

predict¶

Predicted class based on the frequency of this class within this node. (Most common)

Type: int or str

predict_proba¶

Probabilites of the classes in question.

Type: list

decision(X)¶

Decide whether the given sample(s)’ p-th feature value is smaller than given threshold.

Parameters: x (nd-array) – Samples present within this node. (Actual values)
Returns: Boolean array.
Return type: 1d-array

is_leaf()¶

Return if the given node is a leaf node or not.

Returns
Return type: bool