Evaluation Module

The Evaluation module provides comprehensive tools to assess the quality of synthetic data along two critical dimensions: Utility and Privacy. This ensures that synthetic data maintains the statistical properties of the original data while protecting individual privacy.

Overview

When generating synthetic data, it is crucial to evaluate:

  1. Utility: How well the synthetic data preserves the statistical properties and usefulness of the original data

  2. Privacy: How effectively the synthetic data protects individual-level information from the original dataset

The Evaluation module provides metrics and visualizations for both aspects.

Part 1: Utility Evaluation

Utility evaluation assesses how well synthetic data can substitute for real data in downstream analyses.

Statistical Similarity Metrics

Univariate Statistics

Compare distributions of individual features:

  • Mean and Standard Deviation: Compare central tendency and spread

  • Kolmogorov-Smirnov Test: Test distribution similarity

  • Chi-Square Test: For categorical variables

  • Jensen-Shannon Divergence: Measure distribution distance

Multivariate Statistics

Assess relationships between features:

  • Correlation Matrices: Compare correlation structures

  • Covariance Matrices: Evaluate joint distributions

  • Principal Component Analysis: Compare PCA projections

  • Mutual Information: Measure feature dependencies

Machine Learning Utility

Evaluate synthetic data through ML model performance:

  • Train on Synthetic, Test on Real (TSTR): Train models on synthetic data and test on real data

  • Train on Real, Test on Synthetic (TRTS): Train models on real data and test on synthetic data

  • Feature Importance: Compare feature importance rankings

  • Model Performance: Classification/regression metrics (accuracy, F1, RMSE, etc.)

Visual Comparisons

Graphical methods for intuitive assessment:

  • Distribution Plots: Histograms and KDE plots

  • Box Plots: Compare quartiles and outliers

  • Scatter Plots: Visualize bivariate relationships

  • Heatmaps: Compare correlation matrices

  • PCA Plots: Visualize high-dimensional structure

Usage Example - Utility

from synomics.evaluation import UtilityEvaluator

# Initialize utility evaluator
evaluator = UtilityEvaluator()

# Compare distributions
stats_report = evaluator.statistical_similarity(
    real_data=real_data,
    synthetic_data=synthetic_data
)

# Evaluate ML utility
ml_report = evaluator.ml_efficiency(
    real_data=real_data,
    synthetic_data=synthetic_data,
    target_column='outcome',
    task='classification'
)

# Generate visual comparisons
evaluator.plot_distributions(
    real_data=real_data,
    synthetic_data=synthetic_data,
    columns=['age', 'gene_expression_1', 'gene_expression_2']
)

# Generate comprehensive report
utility_score = evaluator.overall_utility_score(
    real_data=real_data,
    synthetic_data=synthetic_data
)

print(f"Overall Utility Score: {utility_score:.3f}")

Part 2: Privacy Evaluation

Privacy evaluation ensures that synthetic data doesn’t leak sensitive information about individuals in the original dataset.

Distance-Based Privacy Metrics

Nearest Neighbor Distance Ratio (NNDR)

Measures how close synthetic records are to real records:

  • 5th Percentile Distance: Check if synthetic records are too similar to real ones

  • Distance Ratio: Compare distances to nearest real vs. nearest synthetic neighbors

Distance to Closest Record (DCR)

Evaluates the minimum distance between synthetic and real records:

  • Helps identify potential memorization

  • Lower distances indicate higher privacy risk

Membership Inference Attacks

Test if an attacker can determine whether a record was in the training data:

  • Attack Success Rate: Percentage of correctly identified members

  • Attack Advantage: Performance above random guessing

  • Confidence Scores: Measure attacker certainty

Attribute Inference Attacks

Evaluate if sensitive attributes can be inferred:

  • Attribute Disclosure Risk: Probability of inferring sensitive features

  • Feature Prediction Accuracy: How well features can be predicted from others

Re-identification Risk

Assess the risk of linking synthetic records back to real individuals:

  • k-Anonymity: Ensure k similar records exist

  • l-Diversity: Check diversity of sensitive attributes

  • t-Closeness: Measure distribution similarity of sensitive attributes

Differential Privacy Metrics

For methods with formal privacy guarantees:

  • Epsilon (ε) Budget: Track privacy budget consumption

  • Delta (δ) Parameter: Probability of privacy breach

  • Privacy Loss Distribution: Analyze worst-case privacy guarantees

Usage Example - Privacy

from synomics.evaluation import PrivacyEvaluator

# Initialize privacy evaluator
evaluator = PrivacyEvaluator()

# Distance-based privacy
dcr_score = evaluator.distance_to_closest_record(
    real_data=real_data,
    synthetic_data=synthetic_data
)

nndr_score = evaluator.nearest_neighbor_distance_ratio(
    real_data=real_data,
    synthetic_data=synthetic_data
)

# Membership inference attack
mia_results = evaluator.membership_inference_attack(
    real_data=real_data,
    synthetic_data=synthetic_data,
    holdout_data=holdout_data
)

# Attribute inference risk
aia_results = evaluator.attribute_inference_risk(
    real_data=real_data,
    synthetic_data=synthetic_data,
    sensitive_attributes=['diagnosis', 'genetic_marker']
)

# Re-identification risk
reid_score = evaluator.reidentification_risk(
    real_data=real_data,
    synthetic_data=synthetic_data,
    quasi_identifiers=['age', 'gender', 'location']
)

# Overall privacy score
privacy_score = evaluator.overall_privacy_score(
    real_data=real_data,
    synthetic_data=synthetic_data
)

print(f"Overall Privacy Score: {privacy_score:.3f}")

Comprehensive Evaluation

Combine utility and privacy for holistic assessment:

from synomics.evaluation import ComprehensiveEvaluator

# Initialize comprehensive evaluator
evaluator = ComprehensiveEvaluator()

# Run full evaluation
report = evaluator.evaluate(
    real_data=real_data,
    synthetic_data=synthetic_data,
    target_column='outcome',
    sensitive_attributes=['diagnosis'],
    quasi_identifiers=['age', 'gender']
)

# Generate detailed report
evaluator.generate_report(
    report=report,
    output_path='evaluation_report.html'
)

# Plot utility vs privacy tradeoff
evaluator.plot_utility_privacy_tradeoff(report)

Evaluation Metrics Summary

Utility Metrics

Utility Metrics

Metric

Description

Range

Statistical Similarity

Distribution comparison (KS, JSD)

0-1 (higher is better)

Correlation Preservation

Correlation matrix similarity

0-1 (higher is better)

ML Efficiency (TSTR)

Model performance ratio

0-1 (higher is better)

Feature Importance

Feature ranking similarity

0-1 (higher is better)

Privacy Metrics

Privacy Metrics

Metric

Description

Range

DCR (Distance to Closest Record)

Minimum distance to real records

0-∞ (higher is better)

NNDR

Nearest neighbor distance ratio

0-∞ (higher is better)

MIA Success Rate

Membership inference accuracy

0-1 (lower is better)

Re-identification Risk

Probability of re-identification

0-1 (lower is better)

Best Practices

  1. Evaluate Both Dimensions: Always assess both utility and privacy

  2. Use Multiple Metrics: No single metric captures all aspects

  3. Compare Across Methods: Evaluate different synthesis approaches

  4. Set Thresholds: Define acceptable utility and privacy levels

  5. Document Results: Keep detailed records of evaluation results

  6. Iterate: Use evaluation results to improve synthesis parameters

  7. Consider Context: Privacy requirements vary by use case and regulations

Interpreting Results

Utility-Privacy Tradeoff

There is often a tradeoff between utility and privacy:

  • High utility, low privacy: Synthetic data very similar to real data (potential privacy leaks)

  • Low utility, high privacy: Very private but less useful synthetic data

  • Balanced approach: Find optimal point for your application

See Also

  • preprocessing - Prepare data before synthesis

  • synthesizer - Generate synthetic data to evaluate