Evaluation Module
The Evaluation module provides tools to assess the quality of synthetic data along two dimensions: Utility and Privacy. Together, these checks help verify that synthetic data preserves the statistical properties of the original data while protecting individual privacy.
Overview
When generating synthetic data, it is crucial to evaluate:
Utility: How well the synthetic data preserves the statistical properties and usefulness of the original data
Privacy: How effectively the synthetic data protects individual-level information from the original dataset
The Evaluation module provides metrics and visualizations for both aspects.
Part 1: Utility Evaluation
Utility evaluation assesses how well synthetic data can substitute for real data in downstream analyses.
Statistical Similarity Metrics
Univariate Statistics
Compare distributions of individual features:
Mean and Standard Deviation: Compare central tendency and spread
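Kolmogorov-Smirnov Test: Test whether a continuous feature follows the same distribution in both datasets
Chi-Square Test: Compare category frequencies for categorical variables
Jensen-Shannon Divergence: Measure the distance between two distributions (0 means identical)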
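To make the univariate metrics above concrete, here is a minimal NumPy sketch of the two-sample KS statistic and a histogram-based Jensen-Shannon divergence. It does not use the synomics API. The helper names (`ks_statistic`, `js_divergence`), the bin count, and the toy data are assumptions made for illustration.

```python
import numpy as np

def ks_statistic(real, synth):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    real, synth = np.sort(real), np.sort(synth)
    grid = np.concatenate([real, synth])
    cdf_real = np.searchsorted(real, grid, side="right") / len(real)
    cdf_synth = np.searchsorted(synth, grid, side="right") / len(synth)
    return float(np.max(np.abs(cdf_real - cdf_synth)))

def js_divergence(real, synth, bins=20):
    """Jensen-Shannon divergence (log base 2, so in [0, 1]) over shared bins."""
    lo = min(real.min(), synth.min())
    hi = max(real.max(), synth.max())
    p, _ = np.histogram(real, bins=bins, range=(lo, hi))
    q, _ = np.histogram(synth, bins=bins, range=(lo, hi))
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0  # empty bins contribute nothing to the sum
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    return float(0.5 * kl(p, m) + 0.5 * kl(q, m))

# Toy data (assumption): one feature, a faithful and a shifted "synthetic" copy.
rng = np.random.default_rng(0)
real_age = rng.normal(50, 10, size=1000)
good_synth = rng.normal(50, 10, size=1000)
bad_synth = rng.normal(65, 10, size=1000)

ks_good = ks_statistic(real_age, good_synth)
ks_bad = ks_statistic(real_age, bad_synth)
js_good = js_divergence(real_age, good_synth)
js_bad = js_divergence(real_age, bad_synth)
print(f"KS good={ks_good:.3f} bad={ks_bad:.3f}; JSD good={js_good:.3f} bad={js_bad:.3f}")
```

Both quantities are distances, so smaller values indicate closer distributions. A score reported as "higher is better" would typically be derived as one minus the distance.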
Multivariate Statistics
Assess relationships between features:
Correlation Matrices: Compare correlation structures
Covariance Matrices: Compare pairwise covariances (second-order joint structure)
Principal Component Analysis: Compare PCA projections
Mutual Information: Measure feature dependencies
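As an illustration of comparing correlation structures, the sketch below scores how closely the synthetic Pearson correlation matrix matches the real one. The function name `correlation_similarity` and the rescaling to [0, 1] are assumptions for this example and may differ from how the library computes its score.

```python
import numpy as np

def correlation_similarity(real, synth):
    """1 minus the mean absolute off-diagonal correlation gap, rescaled to [0, 1]."""
    corr_real = np.corrcoef(real, rowvar=False)
    corr_synth = np.corrcoef(synth, rowvar=False)
    off_diag = ~np.eye(corr_real.shape[0], dtype=bool)
    mean_abs_diff = np.mean(np.abs(corr_real - corr_synth)[off_diag])
    return float(1.0 - mean_abs_diff / 2.0)  # two correlations differ by at most 2

# Toy data (assumption): three correlated features.
rng = np.random.default_rng(0)
cov = np.array([[1.0, 0.8, 0.3],
                [0.8, 1.0, 0.5],
                [0.3, 0.5, 1.0]])
real = rng.multivariate_normal(np.zeros(3), cov, size=2000)
synth_good = rng.multivariate_normal(np.zeros(3), cov, size=2000)
synth_bad = rng.normal(size=(2000, 3))  # correct marginals, no correlations

sim_good = correlation_similarity(real, synth_good)
sim_bad = correlation_similarity(real, synth_bad)
print(f"good={sim_good:.3f} bad={sim_bad:.3f}")
```

The second dataset shows why multivariate checks matter: its columns match the real marginals, so univariate tests would pass, yet the dependency structure is lost.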
Machine Learning Utility
Evaluate synthetic data through ML model performance:
Train on Synthetic, Test on Real (TSTR): Train models on synthetic data and test on real data
Train on Real, Test on Synthetic (TRTS): Train models on real data and test on synthetic data
Feature Importance: Compare feature importance rankings
Model Performance: Classification/regression metrics (accuracy, F1, RMSE, etc.)
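The TSTR idea can be sketched without any ML library. The example below uses a deliberately simple nearest-centroid classifier and toy data, both of which are assumptions for illustration. It compares accuracy when training on synthetic data against a train-on-real baseline, and reports their ratio. Whether the library's ML efficiency score is defined as exactly this ratio is not confirmed by the text above.

```python
import numpy as np

def fit_centroids(X, y):
    """Nearest-centroid 'model': one mean vector per class."""
    classes = np.unique(y)
    centroids = np.stack([X[y == c].mean(axis=0) for c in classes])
    return classes, centroids

def predict(model, X):
    classes, centroids = model
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return classes[np.argmin(dists, axis=1)]

def accuracy(model, X, y):
    return float(np.mean(predict(model, X) == y))

def make_data(rng, n, shift):
    """Toy binary task: class 1 is shifted away from class 0."""
    y = rng.integers(0, 2, size=n)
    X = rng.normal(0, 1, size=(n, 2)) + shift * y[:, None]
    return X, y

rng = np.random.default_rng(0)
X_train_real, y_train_real = make_data(rng, 500, shift=2.0)
X_test_real, y_test_real = make_data(rng, 500, shift=2.0)
X_synth, y_synth = make_data(rng, 500, shift=2.0)  # stand-in for synthesizer output

# Baseline: train on real, test on real. TSTR: train on synthetic, test on real.
trtr = accuracy(fit_centroids(X_train_real, y_train_real), X_test_real, y_test_real)
tstr = accuracy(fit_centroids(X_synth, y_synth), X_test_real, y_test_real)
ratio = tstr / trtr
print(f"TRTR={trtr:.3f} TSTR={tstr:.3f} ratio={ratio:.3f}")
```

A ratio near 1 suggests that models trained on the synthetic data generalize to real data about as well as models trained on real data. The real test set must be held out from whatever data trained the synthesizer.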
Visual Comparisons
Graphical methods for intuitive assessment:
Distribution Plots: Histograms and KDE plots
Box Plots: Compare quartiles and outliers
Scatter Plots: Visualize bivariate relationships
Heatmaps: Compare correlation matrices
PCA Plots: Visualize high-dimensional structure
Usage Example - Utility
from synomics.evaluation import UtilityEvaluator

# real_data and synthetic_data are assumed to be loaded already, with matching columns

# Initialize utility evaluator
evaluator = UtilityEvaluator()

# Compare distributions
stats_report = evaluator.statistical_similarity(
    real_data=real_data,
    synthetic_data=synthetic_data
)

# Evaluate ML utility
ml_report = evaluator.ml_efficiency(
    real_data=real_data,
    synthetic_data=synthetic_data,
    target_column='outcome',
    task='classification'
)

# Generate visual comparisons
evaluator.plot_distributions(
    real_data=real_data,
    synthetic_data=synthetic_data,
    columns=['age', 'gene_expression_1', 'gene_expression_2']
)

# Generate comprehensive report
utility_score = evaluator.overall_utility_score(
    real_data=real_data,
    synthetic_data=synthetic_data
)
print(f"Overall Utility Score: {utility_score:.3f}")
Part 2: Privacy Evaluation
Privacy evaluation checks that synthetic data does not leak sensitive information about individuals in the original dataset.
Distance-Based Privacy Metrics
Nearest Neighbor Distance Ratio (NNDR)
Measures how close synthetic records are to real records:
5th Percentile Distance: Check if synthetic records are too similar to real ones
Distance Ratio: Compare distances to nearest real vs. nearest synthetic neighbors
Distance to Closest Record (DCR)
Evaluates the minimum distance between synthetic and real records:
Helps identify potential memorization
Lower distances indicate higher privacy risk
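The distance-based checks above can be sketched with plain NumPy. This is an illustrative sketch, not the library's implementation: the min-max scaling, the Euclidean metric, and the toy data are all assumptions.

```python
import numpy as np

def distance_to_closest_record(real, synth):
    """For each synthetic row, Euclidean distance to its nearest real row."""
    # Min-max scale using the real data's range so no feature dominates.
    lo, hi = real.min(axis=0), real.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)
    real_s, synth_s = (real - lo) / span, (synth - lo) / span
    diffs = synth_s[:, None, :] - real_s[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=2)).min(axis=1)

# Toy data (assumption): fresh draws versus near-copies of the real rows.
rng = np.random.default_rng(0)
real = rng.normal(size=(300, 4))
fresh = rng.normal(size=(300, 4))
leaky = real + rng.normal(scale=1e-3, size=real.shape)  # simulated memorization

dcr_fresh = distance_to_closest_record(real, fresh)
dcr_leaky = distance_to_closest_record(real, leaky)
print(f"5th percentile DCR: fresh={np.percentile(dcr_fresh, 5):.4f} "
      f"leaky={np.percentile(dcr_leaky, 5):.4f}")
```

Looking at a low percentile rather than the mean keeps the focus on the riskiest synthetic records. The pairwise distance matrix grows with the product of the two dataset sizes, so larger datasets would need a nearest-neighbor index or batching.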
Membership Inference Attacks
Test if an attacker can determine whether a record was in the training data:
Attack Success Rate: Percentage of correctly identified members
Attack Advantage: Performance above random guessing
Confidence Scores: Measure attacker certainty
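One simple way to picture a membership inference attack is a distance-based attacker: records that sit unusually close to some synthetic record are guessed to be training members. The sketch below is a hedged illustration of that idea, not the attack used by the library. The median threshold, function names, and toy data are assumptions.

```python
import numpy as np

def nearest_synth_distance(records, synth):
    """Distance from each candidate record to its closest synthetic record."""
    diffs = records[:, None, :] - synth[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=2)).min(axis=1)

def membership_attack(train, holdout, synth):
    """Guess 'member' for candidates closer to the synthetic data than the median."""
    scores = np.concatenate([nearest_synth_distance(train, synth),
                             nearest_synth_distance(holdout, synth)])
    is_member = np.concatenate([np.ones(len(train), dtype=bool),
                                np.zeros(len(holdout), dtype=bool)])
    guess = scores < np.median(scores)
    success_rate = float(np.mean(guess == is_member))
    # Advantage: true positive rate minus false positive rate (0 = random guessing).
    advantage = float(np.mean(guess[is_member]) - np.mean(guess[~is_member]))
    return success_rate, advantage

# Toy data (assumption): a memorizing synthesizer versus an independent one.
rng = np.random.default_rng(0)
train = rng.normal(size=(200, 3))
holdout = rng.normal(size=(200, 3))  # real records never seen by the synthesizer
leaky_synth = train + rng.normal(scale=0.01, size=train.shape)
safe_synth = rng.normal(size=(200, 3))

success_leaky, adv_leaky = membership_attack(train, holdout, leaky_synth)
success_safe, adv_safe = membership_attack(train, holdout, safe_synth)
print(f"leaky: success={success_leaky:.2f} advantage={adv_leaky:.2f}")
print(f"safe:  success={success_safe:.2f} advantage={adv_safe:.2f}")
```

With equal numbers of members and non-members, a success rate of 0.5 corresponds to random guessing, which is why the advantage is often easier to interpret than the raw success rate.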
Attribute Inference Attacks
Evaluate if sensitive attributes can be inferred:
Attribute Disclosure Risk: Probability of inferring sensitive features
Feature Prediction Accuracy: How well features can be predicted from others
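Attribute inference can be illustrated with a nearest-neighbor attacker who knows some columns of a real record and reads the sensitive value off the closest synthetic record. The sketch below is an assumption-laden illustration: the function names, the toy data, and the majority-class baseline are not taken from the library.

```python
import numpy as np

def attribute_inference_accuracy(real_known, real_secret, synth_known, synth_secret):
    """Predict each real secret from the nearest synthetic row on the known columns."""
    diffs = real_known[:, None, :] - synth_known[None, :, :]
    nearest = np.sqrt((diffs ** 2).sum(axis=2)).argmin(axis=1)
    return float(np.mean(synth_secret[nearest] == real_secret))

def majority_baseline(secret):
    """Accuracy of always guessing the most common sensitive value."""
    _, counts = np.unique(secret, return_counts=True)
    return float(counts.max() / counts.sum())

def make_data(rng, n):
    """Toy records: the binary secret depends on the first known column."""
    known = rng.normal(size=(n, 2))
    secret = (known[:, 0] + rng.normal(scale=0.3, size=n) > 0).astype(int)
    return known, secret

rng = np.random.default_rng(0)
real_known, real_secret = make_data(rng, 400)
synth_known, synth_secret = make_data(rng, 400)
shuffled_secret = rng.permutation(synth_secret)  # breaks the known-to-secret link

acc_linked = attribute_inference_accuracy(real_known, real_secret, synth_known, synth_secret)
acc_decoupled = attribute_inference_accuracy(real_known, real_secret, synth_known, shuffled_secret)
baseline = majority_baseline(real_secret)
print(f"linked={acc_linked:.2f} decoupled={acc_decoupled:.2f} baseline={baseline:.2f}")
```

Accuracy above the baseline reflects what the synthetic data reveals about the relationship between known and sensitive columns. Note that some of this is population-level signal that useful synthetic data is expected to carry, so the result needs interpretation rather than a blanket target of zero.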
Re-identification Risk
Assess the risk of linking synthetic records back to real individuals:
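k-Anonymity: Check that each combination of quasi-identifier values is shared by at least k records
l-Diversity: Check that each such group contains at least l distinct values of the sensitive attribute
t-Closeness: Check that the distribution of the sensitive attribute within each group stays within distance t of its overall distribution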
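The first two measures above can be computed directly by grouping records on their quasi-identifiers. The following standard-library sketch shows the idea. The record layout and function names are assumptions for illustration, and t-closeness is omitted for brevity.

```python
from collections import defaultdict

def equivalence_classes(records, quasi_identifiers):
    """Group records that share the same quasi-identifier values."""
    groups = defaultdict(list)
    for row in records:
        groups[tuple(row[q] for q in quasi_identifiers)].append(row)
    return groups

def k_anonymity(records, quasi_identifiers):
    """Size of the smallest equivalence class."""
    groups = equivalence_classes(records, quasi_identifiers)
    return min(len(rows) for rows in groups.values())

def l_diversity(records, quasi_identifiers, sensitive):
    """Fewest distinct sensitive values found in any equivalence class."""
    groups = equivalence_classes(records, quasi_identifiers)
    return min(len({row[sensitive] for row in rows}) for rows in groups.values())

# Toy records (assumption): age is already generalized into bands.
records = [
    {"age": "30-39", "gender": "F", "diagnosis": "A"},
    {"age": "30-39", "gender": "F", "diagnosis": "B"},
    {"age": "30-39", "gender": "F", "diagnosis": "A"},
    {"age": "40-49", "gender": "M", "diagnosis": "C"},
    {"age": "40-49", "gender": "M", "diagnosis": "C"},
]

k = k_anonymity(records, ["age", "gender"])               # 2
l = l_diversity(records, ["age", "gender"], "diagnosis")  # 1
print(f"k={k}, l={l}")
```

The example shows why k-anonymity alone is not enough: the second group has two records, yet both share the same diagnosis, so membership in that group already reveals the sensitive value.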
Differential Privacy Metrics
For methods with formal privacy guarantees:
Epsilon (ε) Budget: Track privacy budget consumption
Delta (δ) Parameter: Small probability with which the ε guarantee is allowed to fail
Privacy Loss Distribution: Analyze worst-case privacy guarantees
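Tracking budget consumption can be sketched with basic sequential composition, under which the ε and δ values of successive releases simply add up. The `PrivacyBudget` class below is a hypothetical helper written for this example and is not part of the library. Tighter accounting methods exist and would report lower totals.

```python
from dataclasses import dataclass, field

@dataclass
class PrivacyBudget:
    """Tracks (epsilon, delta) spending under basic sequential composition."""
    epsilon_limit: float
    delta_limit: float
    spent: list = field(default_factory=list)

    @property
    def epsilon_spent(self):
        return sum(eps for eps, _ in self.spent)

    @property
    def delta_spent(self):
        return sum(delta for _, delta in self.spent)

    def charge(self, epsilon, delta=0.0):
        """Record one release, refusing it if either limit would be exceeded."""
        if (self.epsilon_spent + epsilon > self.epsilon_limit
                or self.delta_spent + delta > self.delta_limit):
            raise ValueError("privacy budget exceeded")
        self.spent.append((epsilon, delta))

budget = PrivacyBudget(epsilon_limit=1.0, delta_limit=1e-5)
budget.charge(0.5, 1e-6)  # e.g. first synthesizer training run
budget.charge(0.3, 1e-6)  # e.g. second run on the same data

try:
    budget.charge(0.3)    # would bring epsilon to 1.1, above the limit
    refused = False
except ValueError:
    refused = True

print(f"epsilon spent: {budget.epsilon_spent:.2f}, refused third charge: {refused}")
```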
Usage Example - Privacy
from synomics.evaluation import PrivacyEvaluator

# real_data and synthetic_data are assumed to be loaded already, with matching columns

# Initialize privacy evaluator
evaluator = PrivacyEvaluator()

# Distance-based privacy
dcr_score = evaluator.distance_to_closest_record(
    real_data=real_data,
    synthetic_data=synthetic_data
)
nndr_score = evaluator.nearest_neighbor_distance_ratio(
    real_data=real_data,
    synthetic_data=synthetic_data
)

# Membership inference attack
# holdout_data: real records that were not used to train the synthesizer
mia_results = evaluator.membership_inference_attack(
    real_data=real_data,
    synthetic_data=synthetic_data,
    holdout_data=holdout_data
)

# Attribute inference risk
aia_results = evaluator.attribute_inference_risk(
    real_data=real_data,
    synthetic_data=synthetic_data,
    sensitive_attributes=['diagnosis', 'genetic_marker']
)

# Re-identification risk
reid_score = evaluator.reidentification_risk(
    real_data=real_data,
    synthetic_data=synthetic_data,
    quasi_identifiers=['age', 'gender', 'location']
)

# Overall privacy score
privacy_score = evaluator.overall_privacy_score(
    real_data=real_data,
    synthetic_data=synthetic_data
)
print(f"Overall Privacy Score: {privacy_score:.3f}")
Comprehensive Evaluation
Combine utility and privacy for holistic assessment:
from synomics.evaluation import ComprehensiveEvaluator

# Initialize comprehensive evaluator
evaluator = ComprehensiveEvaluator()

# Run full evaluation
report = evaluator.evaluate(
    real_data=real_data,
    synthetic_data=synthetic_data,
    target_column='outcome',
    sensitive_attributes=['diagnosis'],
    quasi_identifiers=['age', 'gender']
)

# Generate detailed report
evaluator.generate_report(
    report=report,
    output_path='evaluation_report.html'
)

# Plot utility vs privacy tradeoff
evaluator.plot_utility_privacy_tradeoff(report)
Evaluation Metrics Summary
Utility Metrics
| Metric | Description | Range |
|---|---|---|
| Statistical Similarity | Distribution comparison (KS, JSD) | 0-1 (higher is better) |
| Correlation Preservation | Correlation matrix similarity | 0-1 (higher is better) |
| ML Efficiency (TSTR) | Model performance ratio | 0-1 (higher is better) |
| Feature Importance | Feature ranking similarity | 0-1 (higher is better) |
Privacy Metrics
| Metric | Description | Range |
|---|---|---|
| DCR (Distance to Closest Record) | Minimum distance to real records | 0-∞ (higher is better) |
| NNDR | Nearest neighbor distance ratio | 0-∞ (higher is better) |
| MIA Success Rate | Membership inference accuracy | 0-1 (lower is better) |
| Re-identification Risk | Probability of re-identification | 0-1 (lower is better) |
Best Practices
Evaluate Both Dimensions: Always assess both utility and privacy
Use Multiple Metrics: No single metric captures all aspects
Compare Across Methods: Evaluate different synthesis approaches
Set Thresholds: Define acceptable utility and privacy levels
Document Results: Keep detailed records of evaluation results
Iterate: Use evaluation results to improve synthesis parameters
Consider Context: Privacy requirements vary by use case and regulations
Interpreting Results
Utility-Privacy Tradeoff
There is often a tradeoff between utility and privacy:
High utility, low privacy: Synthetic data very similar to real data (potential privacy leaks)
Low utility, high privacy: Very private but less useful synthetic data
Balanced approach: Choose the operating point that satisfies both the utility and the privacy requirements of your application
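When several synthesis configurations have been evaluated, one way to narrow down the balanced candidates is to keep only those that no other configuration beats on both scores (the Pareto front). The sketch below is an illustration with made-up configuration names and scores, not output from the library.

```python
def pareto_front(candidates):
    """Keep candidates that no other candidate matches or beats on both scores."""
    front = []
    for name, (utility, privacy) in candidates.items():
        dominated = any(
            u2 >= utility and p2 >= privacy and (u2 > utility or p2 > privacy)
            for other, (u2, p2) in candidates.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return front

# Hypothetical (utility, privacy) scores for four synthesizer configurations.
candidates = {
    "config_a": (0.92, 0.55),  # high utility, low privacy
    "config_b": (0.80, 0.85),
    "config_c": (0.60, 0.95),  # low utility, high privacy
    "config_d": (0.70, 0.80),  # worse than config_b on both scores
}

front = pareto_front(candidates)
print(front)  # ['config_a', 'config_b', 'config_c']
```

The front only removes clearly inferior options. Choosing among the remaining ones still depends on the thresholds and regulatory context of the application.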
Recommended Thresholds
General guidelines (adjust based on your requirements):
Utility Score: > 0.7 for acceptable quality
Privacy Score: > 0.8 for sensitive data
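DCR: > 0.1 for adequate privacy (the meaningful scale depends on the distance metric and feature normalization)
MIA Success Rate: < 0.6 for acceptable privacy (0.5 corresponds to random guessing on a balanced set of members and non-members)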
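The guideline values above can be turned into an automated gate. The sketch below is a minimal illustration: the dictionary keys and the `check_thresholds` helper are hypothetical names chosen for this example, and the scores are made up.

```python
# Guideline thresholds from this section; the key names are assumptions.
THRESHOLDS = {
    "utility_score": (">", 0.7),
    "privacy_score": (">", 0.8),
    "dcr": (">", 0.1),
    "mia_success_rate": ("<", 0.6),
}

def check_thresholds(scores, thresholds=THRESHOLDS):
    """Return a pass/fail flag per metric, given a dict of measured scores."""
    results = {}
    for name, (op, limit) in thresholds.items():
        value = scores[name]
        results[name] = value > limit if op == ">" else value < limit
    return results

# Made-up evaluation results for illustration.
scores = {
    "utility_score": 0.82,
    "privacy_score": 0.75,
    "dcr": 0.15,
    "mia_success_rate": 0.55,
}

results = check_thresholds(scores)
failed = [name for name, ok in results.items() if not ok]
print(f"failed checks: {failed}")  # failed checks: ['privacy_score']
```

Keeping the thresholds in one dictionary makes it easy to tighten them for more sensitive datasets without touching the checking logic.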
See Also
preprocessing - Prepare data before synthesis
synthesizer - Generate synthetic data to evaluate