Split metrics

This module provides functions that quantify the quality of a data split when building a decision tree. The following functions are available:

scripts.metrics._split.entropy(y)

Compute entropy of the given vector.

Entropy is a measure of disorder: the higher the entropy, the more mixed the classes are. For example, with two classes where 50% of the samples are positive and the rest are negative, the entropy is 1 (high); if all samples are positive, the entropy is 0 (low). The formula for entropy is as follows:

\[E(Y) = \sum_{i=1}^{k} -p_i \log_2 p_i\]

where k is the number of classes and \(p_i\) is the proportion of samples in y that belong to the i-th class.
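The formula can be illustrated with a minimal pure-Python sketch. The name `entropy_sketch` is hypothetical and this is not the module's actual implementation; it assumes that y is any iterable of hashable labels (a 1d NumPy array also works) and that class proportions are estimated from label counts.

```python
from collections import Counter
from math import log2


def entropy_sketch(y):
    """Hypothetical stand-in for entropy(y); not the module's actual code."""
    counts = Counter(y)            # number of samples per class
    n = sum(counts.values())       # total number of samples
    # Each class contributes p * log2(1 / p), which equals -p * log2(p);
    # this form avoids returning -0.0 when only one class is present.
    return float(sum((c / n) * log2(n / c) for c in counts.values()))


balanced = entropy_sketch([0, 1, 0, 1])   # two classes, 50/50 split
pure = entropy_sketch([1, 1, 1])          # a single class
```

Under these assumptions, the balanced binary vector gives an entropy of 1 and the single-class vector gives 0, matching the description above.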

Parameters

y (numpy.ndarray) – 1d array of labels of classes.

Returns

Value between 0 and \(\log_2 k\), where k is the number of classes; the upper bound grows without limit as the number of classes increases.

Return type

float

scripts.metrics._split.gini(y)

Compute the Gini impurity score.

Gini impurity is commonly used when building decision trees. The value ranges from 0 to just under 1 (at most \(1 - 1/k\) for k classes). A value of 0 means that the dataset contains only one class. A value greater than 0 means there is some probability of misclassifying a sample drawn from the dataset.

It can be computed using the following formula:

\[G(Y) = \sum_{i=1}^{k} P(i) (1 - P(i))\]

where k is the number of classes and \(P(i)\) is the probability of the i-th class.
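As a rough illustration of the formula, here is a minimal pure-Python sketch. The name `gini_sketch` is hypothetical and this is not the module's actual implementation; it assumes that y is any iterable of hashable labels (a 1d NumPy array also works) and that \(P(i)\) is estimated as the fraction of samples in class i.

```python
from collections import Counter


def gini_sketch(y):
    """Hypothetical stand-in for gini(y); not the module's actual code."""
    counts = Counter(y)            # number of samples per class
    n = sum(counts.values())       # total number of samples
    # Sum P(i) * (1 - P(i)) over the classes that occur in y.
    return float(sum((c / n) * (1 - c / n) for c in counts.values()))


mixed = gini_sketch([0, 1, 0, 1])   # two classes, 50/50 split
pure = gini_sketch([1, 1, 1])       # a single class
```

Under these assumptions, the evenly mixed binary vector gives 0.5 and the single-class vector gives 0.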

Parameters

y (numpy.ndarray) – 1d array of labels of classes.

Returns

Float between 0 and 1.

Return type

float