Decision Tree Module¶
Decision Tree module includes three core classes:
These classes all together allow to train Decision Tree model on any dataset and then perform classification.
- class scripts.models.decision_tree.DecisionTree(algorithm='greedy', criterion='gini', max_depth=None, max_features='auto', min_samples_split=2, random_state=None)¶
Bases:
objectDecisionTree data structure which can be utilized for Classification or Regression.
This class serves as a core data structure for implementing decision tree algorithm. Its main functionality is to build the tree and then allow to traverse it in order to make predictions.
- Parameters
criterion (str, optional) – How to calculate impurity of nodes.
algorithm (str, optional) – How to approach building process of the tree.
max_depth (int, optional) – Maximum depth of the tree.
max_features (int or float or str, optional) – How many features to consider during each split of the node.
min_samples_split (int, optional) – Minimum samples required to be present within the node in order to be able to further split it.
random_state (int, optional) – If using random alforithm, it is useful to specify this parameter in order to ensure reproducibility of results.
- num_nodes¶
Number of nodes that this tree has.
- Type
int
- num_leaf_nodes¶
Number of leaf nodes that this tree has.
- Type
int
- _best_split(X, y)¶
Find best split from given feature space.
This methods finds best split from given feature space using given criterion (e.g. gini).
- Parameters
X (2d-array) – Subset of the original dataset (self.X)
y (1d-array) – Subset of the original targets (self.y)
- Returns
loss (float) – Measure of how good the split of this node is.
p (int) – Index of the feature according to which the split is supposed to be done.
val (float or int) – Value according which to do the split.
- _check_criterion(node)¶
Check criteria neccessary to split the given node.
- Parameters
node (
Node) – Leaf node which you want to evaluate.- Returns
Can you split the given node.
- Return type
bool
- _evaluate_leaf(node)¶
Return predicted class and probabilities of all classes.
- Parameters
node (
Node) – Leaf node which you want to evaluate.- Returns
predict (int) – Predicted class.
predict_proba (list) – Probabilities of all classes
- _get_child_info(node, X)¶
Get info about given node’s children.
By info, it is precisely meant size of both children and relevant indices.
- Parameters
node (
Node) – Parent node.X (2d array) – Subset of the original training dataset (self.X)
- Returns
sizes (list) – List with size of both children.
left_indices (1d array) – Indices of samples in the left node.
right_indices (1d array) – Indices of samples in the right node.
- _is_pure(node)¶
Return if given node is pure or not.
Pure means that the given node only contains samples with the same class.
- Parameters
node (
Node) – Leaf node which you want to evaluate.- Returns
Is given node pure or not.
- Return type
bool
- _run_init_setup()¶
Run neccessary proccesses for initialization of the class.
Notes
The neccessary processes are:
Assignment of relevant criterion function
- _run_training_setup(X, y)¶
Run preliminary setup before the start of training.
- Parameters
X (2d-array) – Training dataset
y (1d-array) – Target values.
Notes
The following things are done:
Validation of input tensors X and y - X must be 2 dimensional. If it is 1 dimensional, transformation to 2D is attempted. y must be 1 dimensional.
X and y must have identical first dimension
Parse number of training records n, size of feature space p and # of unique classes k
Map provided unique classes to internal normalized format, i.e., 0, 1, 2, …
Determination of max features variable
- _split(curr)¶
Split provided node.
High level overview of algorithm (see further explanation in Notes):
Compute optimal split for the given
Node.Compute info about two child nodes which is needed to initiliaze them
Initialize child nodes and for each
IF ALLOWED split it further
ELSE Make it leaf node
Repeat the same process recursively.
- Parameters
curr (
Node) – Current node which should be splitted.
Notes
STEP 1. In order to find optimal split of the node, the following pseudo algorithm is used:
Define proper feature space, i.e. features to be considered for the split. This is determined based on the parameter
self.max_features.For each feature within the proper feature space:
Sort its values and make sure they are all unique
For each pair of neighboring values, compute the split
For this split compute the total score - weighted average of given
criterionReturn the best split based on the total score
From the best splits for each feature, choose the overall best split and return it
STEP 2. Following steps are done:
Get boolean array which represents the split of data - True = left node, False = right node
Get indices of samples which should belong to left and right child nodes respectively
STEP 3. There are two possible scenario for the node in question:
split it further
make it leaf node
Split it further if the node:
is NOT pure (i.e. does not contain just one type of class)
its depth is not equal to maximum depth specified
Contains more samples than the specified minimum
Has more than 1 unique sample
Otherwise, turn it into a leaf node.
- fit(X, y)¶
Build decision tree.
- Parameters
X (2d-array) – Training dataset
y (1d-array) – Target values.
- predict(X)¶
Predict classes for given dataset X.
- Parameters
X (2d-array) – Dataset based on which you want to make a prediction.
- Returns
Returns array with predictions.
- Return type
1d-array
- predict_proba(X)¶
Return 2d array with probabilities for given classes.
Traverse the tree until you reach a leaf node. Then, return the probabilites associated with all classes.
- Parameters
X (2d array) – Samples based on which to predict corresponding class probabilities.
- Returns
n x k array where n is number of provided samples and k is number of classes.
- Return type
2d array
- class scripts.models.decision_tree.DecisionTreeClassifier(algorithm='greedy', criterion='gini', max_depth=None, max_features='auto', min_samples_split=2, random_state=None)¶
Bases:
scripts.base._base_classifier.BaseClassifier,scripts.models.decision_tree._decision_tree.DecisionTreeClass allowing to specify parameters of the model.
This class serves as a high level client facing API. It allows to specify all hyper parameters important for training process.
- Parameters
criterion (str, optional) – How to calculate impurity of nodes.
algorithm (str, optional) – How to approach building process of the tree.
max_depth (int, optional) – Maximum depth of the tree.
max_features (int or float or str, optional) – How many features to consider during each split of the node.
min_samples_split (int, optional) – Minimum samples required to be present within the node in order to be able to further split it.
random_state (int, optional) – If using random alforithm, it is useful to specify this parameter in order to ensure reproducibility of results.
- class scripts.models.decision_tree.Node(size=None, values=None, depth=None, _type='internal')¶
Bases:
objectCore building block for a Decision Tree.
The main purpose of a Node class is to hold information about decisions of the given decision tree. See what is meant by information in the below description of attributes.
- Parameters
size (int, optional) – Number of samples which are part of this node.
values (iterable, optional) – Row indices of the samples which are part of this node.
depth (int, optional) – Depth of this node in decision tree.
_type (str, optional) – What type of node this node is - root, internal, leaf.
- p¶
Index of a feature within the given training dataset. Split of a node is made according to this feature.
- Type
int
- val¶
Split the incoming dataset according to the
p-th featureand itsvalue (val).- Type
float or int
- loss¶
Measure of how good the split of this node is. For example, when using gini impurity, this would be a weighted average of gini impurities of its child nodes.
- Type
float or int
- split¶
Amount of samples in left and right child of this node.
- Type
list
- predict¶
Predicted class based on the frequency of this class within this node. (Most common)
- Type
int or str
- predict_proba¶
Probabilites of the classes in question.
- Type
list
- decision(X)¶
Decide whether the given sample(s)’ p-th feature value is smaller than given threshold.
- Parameters
x (nd-array) – Samples present within this node. (Actual values)
- Returns
Boolean array.
- Return type
1d-array
- is_leaf()¶
Return if the given node is a leaf node or not.
- Returns
- Return type
bool