Generating data with available g.t. feature importance explanations#

We are going to see the available options for generating data with ground truth (g.t.) feature importance explanations.

1. Generating artificial data with SenecaFI#

[1]:
from teex.featureImportance.data import SenecaFI

We are going to explore SenecaFI, a method from [Evaluating local explanation methods on ground truth, Riccardo Guidotti, 2021].

Note: this method was not originally conceived as a data generation procedure, but rather as a way to generate transparent classifiers (i.e., classifiers with available ground truth explanations). We use that generated classifier and some artificially generated data to return a dataset with observations, labels and ground truth explanations. The generated dataset contains numerical features and binary labels.

[4]:
# instantiate the data generator
dataGen = SenecaFI(nSamples=100, nFeatures=4, randomState=1)

# retrieve the generated observations
X, y, exps = dataGen[:]
[5]:
print(f'Observation: {X[0]} \nLabel: {y[0]} \nExplanation: {exps[0]}')
Observation: [ 0.34558419 -0.65128101  1.82843024 -0.59277453]
Label: 1
Explanation: [ 1.      1.      0.1897 -0.3012]
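
As a quick sanity check (a sketch, not part of the original cells; it assumes the generator returns NumPy arrays), we can inspect the shapes of the returned data:

[ ]:
# sketch: shapes of the generated observations, labels and explanations
print(f'X: {X.shape}, y: {y.shape}, explanations: {exps.shape}')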

The ground truth FI explanations are scaled to the range (-1, 1) per feature. That is, if a feature has a value of 1 in a particular observation, that observation is the one in the dataset where the feature is most (positively) important. Conversely, a value of -1 means that the feature contributes most negatively to that observation within the dataset.
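
A minimal sketch (assuming exps is a NumPy array) to verify that all explanation values lie within that range:

[ ]:
import numpy as np

# sketch: per-feature range of the g.t. explanations
print(f'Per-feature min: {exps.min(axis=0)}')
print(f'Per-feature max: {exps.max(axis=0)}')
print(f'All values in [-1, 1]: {bool(np.all((exps >= -1) & (exps <= 1)))}')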

One can specify the number of points to be generated (nSamples), the number of features (nFeatures), the names of the features (featureNames) and the random state (randomState).

[6]:
dataGen.featureNames  # automatically generated
[6]:
['a', 'b', 'c', 'd']
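
If custom names are preferred, they can be passed at construction time. A minimal sketch (featureNames is assumed to accept a list of strings, as described above):

[ ]:
# sketch: generate data with user-defined feature names
namedGen = SenecaFI(nSamples=100, nFeatures=4,
                    featureNames=['f1', 'f2', 'f3', 'f4'],
                    randomState=1)
namedGen.featureNames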

The explanations are generated by first creating a random collection of points, then building a random linear expression, and finally evaluating its derivative at the points closest to the original observations. The underlying model can be accessed:

[7]:
model = dataGen.transparentModel
model
[7]:
<teex.featureImportance.data.TransparentLinearClassifier at 0x12bfd0be0>

This structure follows the sklearn API (.fit, .predict, .predict_proba) and can be used, for example, to test explainer methods. An important method it contains is .explain, which, given an observation, explains its prediction. All observations passed to the object must be of shape (nObservations, nFeatures).

Compute predictions:

[8]:
print(f'Single observation: {model.predict(X[0].reshape(1, -1))} \nMultiple observations: {model.predict(X[:10])}')
Single observation: [1]
Multiple observations: [1 1 1 0 0 1 0 1 0 0]

Compute class probabilities:

[9]:
print(f'Single observation: \n{model.predict_proba(X[0].reshape(1, -1))} \n\nMultiple observations: \n{model.predict_proba(X[:10])}')
Single observation:
[[0. 1.]]

Multiple observations:
[[0.         1.        ]
 [0.05753863 0.94246137]
 [0.25390828 0.74609172]
 [1.         0.        ]
 [0.80108987 0.19891013]
 [0.         1.        ]
 [1.         0.        ]
 [0.         1.        ]
 [1.         0.        ]
 [1.         0.        ]]

Compute explanations:

[10]:
print(f'Single observation: \n{model.explain(X[0].reshape(1, -1))} \n\nMultiple observations: \n{model.explain(X[:10])}')
Single observation:
[[1. 1. 1. 1.]]

Multiple observations:
[[ 1.      1.      1.     -1.    ]
 [ 1.      1.      0.0879 -0.4518]
 [ 1.      1.      0.0706 -0.6003]
 [ 1.      1.      0.2038 -0.237 ]
 [ 1.      1.      0.1008 -0.7226]
 [ 1.      1.     -1.      1.    ]
 [ 1.      1.      0.1038 -0.6127]
 [ 1.      1.      0.1553 -0.4214]
 [ 1.      1.      0.1394 -0.7511]
 [ 1.      1.     -0.3793 -0.749 ]]

Note that the scaling to (-1, 1) is performed with respect to the set of observations being explained, so the same observation may receive different explanation values depending on the batch it is passed in with.
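
A minimal sketch illustrating this, reusing the calls from the cells above:

[ ]:
# sketch: the same observation explained alone vs. within a batch of 10
print(model.explain(X[0].reshape(1, -1)))   # scaled with respect to itself
print(model.explain(X[:10])[0])             # scaled with respect to the batch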