gpclarity.DataInfluenceMap

Constructor

gpclarity.data_influence.__init__(model)

Initialize with a trained GP model.

Parameters: model (GPy.models.GPRegression) – Trained GP model with predict and kern attributes
Raises: ValueError – If model lacks the required attributes

Example:

import GPy
from gpclarity import DataInfluenceMap

kernel = GPy.kern.RBF(input_dim=2)
model = GPy.models.GPRegression(X_train, y_train, kernel)
model.optimize()

influence = DataInfluenceMap(model)

Methods

gpclarity.data_influence.compute_influence_scores(X_train, y_train=None, *, use_cache=True) → InfluenceResult

Compute influence scores from leverage scores (optimized, O(n³) in the number of training points).

Leverage scores are taken from the diagonal of the hat matrix, which is computed using a cached Cholesky decomposition.

Parameters:

X_train (np.ndarray) – Training input locations with shape (n_train, n_dims)
y_train (np.ndarray, optional) – Training outputs with shape (n_train,) or (n_train, 1). If provided, it is validated but not used in the computation.
use_cache (bool) – Whether to use internal cache for kernel matrix. Default: True

Returns:

InfluenceResult containing scores and metadata

Return type:

InfluenceResult

Raises:

ValueError – If X_train is not 2D array-like
InfluenceError – If computation fails

Example:

import numpy as np

result = influence.compute_influence_scores(X_train)

# Get most influential point
most_influential_idx = np.argmax(result.scores)
print(f"Point {most_influential_idx}: score={result.scores[most_influential_idx]:.3f}")
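
To illustrate how leverage scores can be read off the hat matrix with a single Cholesky factorization, here is a minimal NumPy sketch. It does not use gpclarity or GPy. The RBF kernel, the fixed noise_variance, the hat-matrix definition H = K (K + noise_variance * I)^-1, and the helper names rbf_kernel and leverage_scores are all assumptions made for this example; the library's internal formula may differ.

```python
import numpy as np


def rbf_kernel(X1, X2, variance=1.0, lengthscale=1.0):
    """Squared-exponential kernel between two sets of points (hypothetical helper)."""
    sq_dists = (
        np.sum(X1**2, axis=1)[:, None]
        + np.sum(X2**2, axis=1)[None, :]
        - 2.0 * X1 @ X2.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return variance * np.exp(-0.5 * np.maximum(sq_dists, 0.0) / lengthscale**2)


def leverage_scores(X, noise_variance=0.1):
    """Diagonal of H = K (K + noise * I)^-1 (assumed definition of leverage)."""
    n = X.shape[0]
    A = rbf_kernel(X, X) + noise_variance * np.eye(n)
    L = np.linalg.cholesky(A)  # A = L @ L.T
    L_inv = np.linalg.solve(L, np.eye(n))
    # diag(A^-1) = diag(L^-T @ L^-1) = column-wise sum of squares of L^-1.
    diag_A_inv = np.sum(L_inv**2, axis=0)
    # H = I - noise * A^-1, so only the diagonal of A^-1 is needed.
    return 1.0 - noise_variance * diag_A_inv


rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
scores = leverage_scores(X)
print("Highest-leverage point:", int(np.argmax(scores)))
```

Under this definition each score lies between 0 and 1, and points in sparsely sampled regions tend to score higher because the fit at those locations depends mostly on their own observation.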

gpclarity.data_influence.compute_loo_variance_increase(X_train, y_train, *, n_jobs=1, verbose=False) → Tuple[np.ndarray, np.ndarray]

Compute the exact leave-one-out (LOO) variance increase, with optional parallelization.

Parameters:

X_train (np.ndarray) – Training inputs with shape (n_train, n_dims)
y_train (np.ndarray) – Training outputs with shape (n_train,) or (n_train, 1)
n_jobs (int) – Number of parallel jobs. -1 for all cores, 1 for sequential. Default: 1
verbose (bool) – Whether to display progress bar. Requires tqdm. Default: False

Returns:

Tuple of (variance_increase, prediction_errors). Both arrays have shape (n_train,).

Return type:

tuple

Raises:

InfluenceError – If computation fails

Example:

var_increase, pred_errors = influence.compute_loo_variance_increase(
    X_train, y_train, n_jobs=-1, verbose=True
)

# Identify outliers: high variance increase AND high prediction error
outlier_mask = (
    (var_increase > np.percentile(var_increase, 95))
    & (pred_errors > np.percentile(pred_errors, 95))
)
outliers = np.where(outlier_mask)[0]
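
For intuition about what a leave-one-out variance increase measures, the sketch below refits a small zero-mean GP with each point removed and compares the posterior variance at that point against the full-data variance. This is a brute-force illustration in plain NumPy, not the library's implementation. The fixed hyperparameters, the use of latent (noise-free) variance, and the helper names rbf_kernel, gp_predict, and loo_variance_increase are assumptions made for the example.

```python
import numpy as np


def rbf_kernel(X1, X2, variance=1.0, lengthscale=1.0):
    """Squared-exponential kernel between two sets of points (hypothetical helper)."""
    sq_dists = (
        np.sum(X1**2, axis=1)[:, None]
        + np.sum(X2**2, axis=1)[None, :]
        - 2.0 * X1 @ X2.T
    )
    return variance * np.exp(-0.5 * np.maximum(sq_dists, 0.0) / lengthscale**2)


def gp_predict(X_tr, y_tr, X_new, noise_variance=0.1):
    """Posterior mean and latent variance of a zero-mean GP, hyperparameters fixed."""
    A = rbf_kernel(X_tr, X_tr) + noise_variance * np.eye(len(X_tr))
    K_s = rbf_kernel(X_tr, X_new)
    mean = K_s.T @ np.linalg.solve(A, y_tr)
    var = rbf_kernel(X_new, X_new).diagonal() - np.sum(
        K_s * np.linalg.solve(A, K_s), axis=0
    )
    return mean, var


def loo_variance_increase(X, y, noise_variance=0.1):
    """Brute-force LOO: drop each point, predict at it, compare with the full fit."""
    n = len(X)
    _, full_var = gp_predict(X, y, X, noise_variance)
    var_increase = np.empty(n)
    pred_errors = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        mean_i, var_i = gp_predict(X[keep], y[keep], X[i:i + 1], noise_variance)
        var_increase[i] = var_i[0] - full_var[i]  # extra uncertainty without point i
        pred_errors[i] = abs(y[i] - mean_i[0])    # how badly point i is predicted
    return var_increase, pred_errors


rng = np.random.default_rng(1)
X = rng.uniform(-3.0, 3.0, size=(15, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=15)
var_increase, pred_errors = loo_variance_increase(X, y)
print("Largest variance increase at index", int(np.argmax(var_increase)))
```

With fixed hyperparameters, removing a point can only raise the posterior variance at its location, so the increases in this sketch are non-negative up to round-off. Each iteration solves its own linear system, which is why the loop is a natural candidate for parallel execution.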

gpclarity.data_influence.get_influence_report(X_train, y_train, *, compute_loo=True, n_jobs=1) → Dict[str, Any]

Generate a comprehensive influence analysis report.

Parameters:

X_train (np.ndarray) – Training inputs
y_train (np.ndarray) – Training outputs
compute_loo (bool) – Whether to include LOO analysis (slow for large n). Default: True
n_jobs (int) – Parallel jobs for LOO computation. Default: 1

Returns:

Dictionary with influence statistics and diagnostics

Return type:

dict

Return structure:

{
    'computation_summary': {
        'total_time': float,
        'leverage_time': float,
        'n_points': int,
        'method': str
    },
    'influence_scores': {
        'mean': float,
        'std': float,
        'median': float,
        'max': float,
        'min': float,
        'p95': float,
        'p5': float
    },
    'most_influential_point': {
        'index': int,
        'location': List[float],
        'score': float
    },
    'least_influential_point': {
        'index': int,
        'location': List[float],
        'score': float
    },
    'diagnostics': {
        'high_leverage_count': int,
        'low_influence_count': int,
        'non_finite_scores': int
    },
    'loo_analysis': {  # Only if compute_loo=True
        'variance_increase': List[float],
        'prediction_errors': List[float],
        'mean_error': float,
        'max_error': float
    }
}

Example:

report = influence.get_influence_report(X_train, y_train, compute_loo=True)

# Check influence distribution
mean_score = report['influence_scores']['mean']
high_leverage = report['diagnostics']['high_leverage_count']

# Get most influential point details
most_inf = report['most_influential_point']
print(f"Most influential: Point {most_inf['index']} at {most_inf['location']}")
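
As a rough illustration of the kind of statistics listed under 'influence_scores', the sketch below computes the same keys from a plain array with NumPy. The function name summarize_scores and the choice to drop non-finite values before summarizing are assumptions made for this example, not confirmed library behavior.

```python
import numpy as np


def summarize_scores(scores):
    """Summary statistics matching the 'influence_scores' keys (hypothetical helper)."""
    scores = np.asarray(scores, dtype=float)
    finite = scores[np.isfinite(scores)]  # assumption: ignore NaN and inf values
    return {
        "mean": float(np.mean(finite)),
        "std": float(np.std(finite)),
        "median": float(np.median(finite)),
        "max": float(np.max(finite)),
        "min": float(np.min(finite)),
        "p95": float(np.percentile(finite, 95)),
        "p5": float(np.percentile(finite, 5)),
    }


summary = summarize_scores([0.1, 0.2, 0.3, 0.4, 0.5])
print(summary["mean"], summary["p95"])
```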

gpclarity.data_influence.plot_influence(X_train, influence_scores, ax=None, **scatter_kwargs) → plt.Axes

Visualize data point influence.

Delegates to gpclarity.plotting.plot_influence_map.

Parameters:

X_train (np.ndarray) – Training input locations
influence_scores (np.ndarray or InfluenceResult) – Computed scores or InfluenceResult
ax (plt.Axes, optional) – Matplotlib axes. Created if None.
scatter_kwargs – Additional arguments passed to ax.scatter()

Returns:

Matplotlib axes object

Return type:

plt.Axes

Raises:

ImportError – If matplotlib is not installed
ValueError – If the input has more than 2 dimensions

Example:

import matplotlib.pyplot as plt

fig, ax = plt.subplots()
influence.plot_influence(X_train, result, ax=ax, s=100, alpha=0.6, cmap='hot')
plt.show()

gpclarity.data_influence.clear_cache() → None

Clear internal computation cache to free memory.

Example:

influence.clear_cache()