Evaluation of explanation quality: word importance vectors#

In this notebook, we explore how teex can be used to evaluate word importance explanations.

[12]:
from teex.wordImportance.data import Newsgroup
from teex.wordImportance.eval import word_importance_scores

The first step is to gather data with available word importance explanations. teex makes it simple:

[13]:
dataGen = Newsgroup()
X, y, exps = dataGen[:]
[14]:
X[1]
[14]:
b'From: aj2a@galen.med.Virginia.EDU (Amir Anthony Jazaeri)\nSubject: Re: Heat Shock Proteins\nOrganization: University of Virginia\nLines: 8\n\nby the way ms. olmstead dna is not degraded in the stomach nor\nunder pH of 2.  its degraded in the duodenum under approx.\nneutral pH by DNAase enzymes secreted by the pancreas.  my\npoint:  check your facts before yelling at other people for not\ndoing so.  just a friendly suggestion.\n\n\naaj 4/26/93\n'
[15]:
print(y[1], dataGen.classMap[y[1]])
1 medicine
[16]:
exps[1]
[16]:
{'shock': 0.5,
 'heat': 0.5,
 'proteins': 0.5,
 'dna': 0.5,
 'stomach': 0.5,
 'duodenum': 0.5,
 'dnaase': 0.5,
 'enzymes': 0.5,
 'pancreas': 0.5}

The second step is training an estimator and predicting explanations. Any system could be used for training and for generating the explanations; we are going to skip this step, as it is independent of teex and it is up to the user to decide how to generate the explanations. Instead, we are going to use the ground-truth explanations as if they were the predicted ones.
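
As a purely illustrative sketch of the skipped step (teex does not prescribe any model or explanation method), one could, for example, train a scikit-learn text classifier and build per-document word importance dictionaries from its coefficients. The vectorizer, model, and scoring choices below are assumptions for the sake of the example:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# X contains raw byte strings, so decode them first
texts = [doc.decode('utf-8', errors='ignore') for doc in X]
vect = TfidfVectorizer()
Xv = vect.fit_transform(texts)
clf = LogisticRegression(max_iter=1000).fit(Xv, y)

vocab = vect.get_feature_names_out()
preds = clf.predict(Xv)
predExps = []
for i, row in enumerate(Xv):
    # pick the coefficient row of the predicted class
    # (binary models expose a single coefficient row)
    classIdx = list(clf.classes_).index(preds[i]) if clf.coef_.shape[0] > 1 else 0
    coefs = clf.coef_[classIdx]
    # importance of each word present in the document: |coefficient| * tf-idf weight
    wordScores = {vocab[j]: abs(coefs[j]) * row[0, j] for j in row.nonzero()[1]}
    # normalize scores to [0, 1] so they resemble the ground-truth format
    if wordScores:
        top = max(wordScores.values())
        if top > 0:
            wordScores = {w: s / top for w, s in wordScores.items()}
    predExps.append(wordScores)

The resulting predExps could then be passed to word_importance_scores in place of the ground truths, exactly as done below.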

Then, we compare the predicted explanations with the ground-truth ones.

[19]:
metrics = ['prec', 'rec', 'fscore']
scores = word_importance_scores(exps, exps, metrics=metrics)
/usr/local/lib/python3.8/site-packages/sklearn/metrics/_classification.py:1248: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 due to no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
/usr/local/lib/python3.8/site-packages/sklearn/metrics/_classification.py:1248: UndefinedMetricWarning: Recall is ill-defined and being set to 0.0 due to no true samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
/usr/local/lib/python3.8/site-packages/sklearn/metrics/_classification.py:1495: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 due to no true nor predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(
[20]:
for i, metric in enumerate(metrics):
    print(f'{metric}: {scores[i]}')
prec: 0.9839572310447693
rec: 0.9839572310447693
fscore: 0.9839572310447693

We obtain near-perfect scores, as the ground truths are exactly the same as the predictions. The scores are not exactly 1 because of instances where only one feature is available, for which the metrics are not well defined (hence the UndefinedMetricWarning messages above).
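
To see how many instances are affected, one can simply count the explanations that contain a single word (an illustrative snippet, not part of the teex API):

singleWord = [i for i, e in enumerate(exps) if len(e) == 1]
print(f'{len(singleWord)} of {len(exps)} explanations contain a single word')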