Split metrics
This module provides functions that quantify the quality of a data split when building a decision tree. The following functions are available:
- scripts.metrics._split.entropy(y)
Compute entropy of the given vector.
Entropy is a measure of disorder: the higher the entropy, the more mixed the class labels are. For example, with binary classes where 50% of the samples are positive and the rest negative, the entropy is 1 (high); if all samples are positive, the entropy is 0 (low). The formula for entropy is as follows:
\[E(Y) = \sum_{i = 1}^{k} -p_i \log_2 p_i\]
where \(k\) is the number of classes and \(p_i\) is the fraction of samples belonging to the i-th class.
- Parameters
y (numpy.ndarray) – 1d array of class labels.
- Returns
Value between 0 and \(\log_2 k\), where \(k\) is the number of classes; the upper bound grows without limit as the number of classes increases.
- Return type
float
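As an illustration of the formula above, here is a minimal sketch of how such an entropy computation could look with NumPy. The name `entropy_sketch` is hypothetical and the handling of empty input is an assumption; this is not necessarily how `scripts.metrics._split.entropy` is implemented.

```python
import numpy as np


def entropy_sketch(y):
    """Illustrative base-2 entropy of a 1d array of class labels.

    Hypothetical helper for demonstration only; not the module's code.
    """
    y = np.asarray(y)
    if y.size == 0:
        # Assumption: an empty vector is treated as having zero entropy.
        return 0.0
    # Count how often each distinct label occurs.
    _, counts = np.unique(y, return_counts=True)
    # Class proportions p_i; every entry is > 0, so the log is defined.
    p = counts / counts.sum()
    # log2(1 / p) equals -log2(p); this form avoids returning -0.0
    # when only a single class is present.
    return float(np.sum(p * np.log2(1.0 / p)))


balanced = entropy_sketch(np.array([0, 1, 0, 1]))  # half positive, half negative
pure = entropy_sketch(np.array([1, 1, 1, 1]))      # a single class only
```

With the balanced binary vector the result matches the value of 1 described above, and the single-class vector gives 0.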
- scripts.metrics._split.gini(y)
Compute gini impurity score.
Gini impurity is commonly used in the context of decision trees. The value ranges from 0 up to, but not including, 1. A value of 0 means that the dataset contains only one class. A value greater than 0 means that there is some probability of misclassifying a sample from your dataset if it were labeled at random according to the class distribution.
It can be computed using the following formula:
\[G(Y) = \sum_{i = 1}^{k} P(i) (1 - P(i))\]
where \(k\) is the number of classes and \(P(i)\) is the probability of the i-th class.
- Parameters
y (numpy.ndarray) – 1d array of class labels.
- Returns
Float in the range [0, 1); for \(k\) classes the maximum is \(1 - 1/k\).
- Return type
float
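The Gini formula above can be sketched in a few lines of NumPy. The name `gini_sketch` is hypothetical and the treatment of empty input is an assumption; the actual `scripts.metrics._split.gini` may differ.

```python
import numpy as np


def gini_sketch(y):
    """Illustrative Gini impurity of a 1d array of class labels.

    Hypothetical helper for demonstration only; not the module's code.
    """
    y = np.asarray(y)
    if y.size == 0:
        # Assumption: an empty vector is treated as perfectly pure.
        return 0.0
    # Count how often each distinct label occurs.
    _, counts = np.unique(y, return_counts=True)
    # Estimated class probabilities P(i).
    p = counts / counts.sum()
    # Sum of P(i) * (1 - P(i)) over all classes.
    return float(np.sum(p * (1.0 - p)))


mixed = gini_sketch(np.array([0, 1, 0, 1]))  # two equally likely classes
pure = gini_sketch(np.array([1, 1, 1, 1]))   # a single class only
```

For two equally likely classes the impurity is 0.5, the maximum for the binary case, while a single-class vector gives 0.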