Evaluation Module
=================

The Evaluation module provides tools to assess the quality of synthetic data along two dimensions:
**Utility** and **Privacy**. Together, these checks help verify that synthetic data preserves the statistical
properties of the original data while protecting individual privacy.

Overview
--------

When generating synthetic data, it is crucial to evaluate:

1. **Utility**: How well the synthetic data preserves the statistical properties and usefulness of the original data
2. **Privacy**: How effectively the synthetic data protects individual-level information from the original dataset

The Evaluation module provides metrics and visualizations for both aspects.

Part 1: Utility Evaluation
---------------------------

Utility evaluation assesses how well synthetic data can substitute for real data in downstream analyses.

Statistical Similarity Metrics
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Univariate Statistics**

Compare distributions of individual features:

* **Mean and Standard Deviation**: Compare central tendency and spread
* **Kolmogorov-Smirnov Test**: Test distribution similarity
* **Chi-Square Test**: For categorical variables
* **Jensen-Shannon Divergence**: Measure distribution distance
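
As a rough illustration of the univariate comparisons listed above, the following sketch calls SciPy directly rather than the module's own API. The arrays ``real`` and ``synthetic`` and the helper ``js_divergence`` are hypothetical names chosen for this example, and the bin count of 20 is an arbitrary assumption.

.. code-block:: python

   import numpy as np
   from scipy.spatial.distance import jensenshannon
   from scipy.stats import ks_2samp


   def js_divergence(real, synthetic, bins=20):
       """Jensen-Shannon divergence (base 2, in [0, 1]) between two samples."""
       # Shared bin edges keep the two histograms comparable.
       edges = np.histogram_bin_edges(np.concatenate([real, synthetic]), bins=bins)
       p, _ = np.histogram(real, bins=edges)
       q, _ = np.histogram(synthetic, bins=edges)
       # SciPy returns the JS *distance*; squaring it gives the divergence.
       return float(jensenshannon(p, q, base=2) ** 2)


   rng = np.random.default_rng(0)
   real = rng.normal(loc=0.0, scale=1.0, size=1000)
   synthetic = rng.normal(loc=0.1, scale=1.1, size=1000)

   ks_stat, p_value = ks_2samp(real, synthetic)
   jsd = js_divergence(real, synthetic)
   print(f"KS statistic: {ks_stat:.3f} (p = {p_value:.3f})")
   print(f"JS divergence: {jsd:.3f}")

Both values are 0 for identical samples and grow as the distributions drift apart. The histogram-based divergence depends on the chosen binning.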

**Multivariate Statistics**

Assess relationships between features:

* **Correlation Matrices**: Compare correlation structures
* **Covariance Matrices**: Compare the pairwise (second-order) joint structure of features
* **Principal Component Analysis**: Compare PCA projections
* **Mutual Information**: Measure feature dependencies
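
One simple way to compare correlation structures is the mean absolute difference between the two correlation matrices. The sketch below uses NumPy only. The helper ``correlation_difference`` and the simulated datasets are assumptions made for illustration, not part of the module.

.. code-block:: python

   import numpy as np


   def correlation_difference(real, synthetic):
       """Mean absolute difference between two Pearson correlation matrices."""
       # rowvar=False treats columns as features and rows as records.
       corr_real = np.corrcoef(real, rowvar=False)
       corr_synth = np.corrcoef(synthetic, rowvar=False)
       # Only off-diagonal entries carry information; the diagonal is always 1.
       off_diagonal = ~np.eye(corr_real.shape[0], dtype=bool)
       return float(np.mean(np.abs(corr_real - corr_synth)[off_diagonal]))


   rng = np.random.default_rng(0)
   cov = [[1.0, 0.8], [0.8, 1.0]]
   real = rng.multivariate_normal([0.0, 0.0], cov, size=500)
   good_synth = rng.multivariate_normal([0.0, 0.0], cov, size=500)
   # A synthesizer that ignores dependencies would produce independent columns.
   poor_synth = rng.normal(size=(500, 2))

   print(f"Good synthesizer: {correlation_difference(real, good_synth):.3f}")
   print(f"Poor synthesizer: {correlation_difference(real, poor_synth):.3f}")

A value near 0 means the pairwise linear relationships are preserved. Pearson correlation does not capture nonlinear dependencies, which is where mutual information helps.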

Machine Learning Utility
~~~~~~~~~~~~~~~~~~~~~~~~

Evaluate synthetic data through ML model performance:

* **Train on Synthetic, Test on Real (TSTR)**: Train models on synthetic data and test on real data
* **Train on Real, Test on Synthetic (TRTS)**: Train models on real data and test on synthetic data
* **Feature Importance**: Compare feature importance rankings
* **Model Performance**: Classification/regression metrics (accuracy, F1, RMSE, etc.)
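
The TSTR idea can be sketched with scikit-learn. This is a minimal example under stated assumptions: ``make_data`` is a hypothetical generator, the "synthetic" data is simulated by adding noise to fresh samples, and logistic regression is an arbitrary choice of model. It does not show how the module computes ML utility.

.. code-block:: python

   import numpy as np
   from sklearn.linear_model import LogisticRegression
   from sklearn.metrics import accuracy_score


   def make_data(rng, n):
       """Hypothetical two-feature dataset whose label depends on the feature sum."""
       X = rng.normal(size=(n, 2))
       y = (X[:, 0] + X[:, 1] > 0).astype(int)
       return X, y


   rng = np.random.default_rng(0)
   X_real_train, y_real_train = make_data(rng, 500)
   X_real_test, y_real_test = make_data(rng, 500)
   # Stand-in for synthesizer output: same relationship plus feature noise.
   X_synth, y_synth = make_data(rng, 500)
   X_synth = X_synth + rng.normal(scale=0.3, size=X_synth.shape)

   # Baseline: train on real, test on held-out real data.
   baseline_model = LogisticRegression().fit(X_real_train, y_real_train)
   baseline_acc = accuracy_score(y_real_test, baseline_model.predict(X_real_test))

   # TSTR: train on synthetic, test on the same held-out real data.
   tstr_model = LogisticRegression().fit(X_synth, y_synth)
   tstr_acc = accuracy_score(y_real_test, tstr_model.predict(X_real_test))

   print(f"Baseline accuracy: {baseline_acc:.3f}")
   print(f"TSTR accuracy: {tstr_acc:.3f}")
   print(f"TSTR / baseline ratio: {tstr_acc / baseline_acc:.3f}")

Evaluating both models on the same held-out real data keeps the comparison fair. A ratio close to 1 suggests the synthetic data supports this task about as well as the real data.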

Visual Comparisons
~~~~~~~~~~~~~~~~~~

Graphical methods for intuitive assessment:

* **Distribution Plots**: Histograms and KDE plots
* **Box Plots**: Compare quartiles and outliers
* **Scatter Plots**: Visualize bivariate relationships
* **Heatmaps**: Compare correlation matrices
* **PCA Plots**: Visualize high-dimensional structure

Usage Example - Utility
~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   from synomics.evaluation import UtilityEvaluator
   
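   # real_data and synthetic_data are assumed to be already-loaded tables
   # with matching columns (for example, pandas DataFrames).
   # Initialize utility evaluator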
   evaluator = UtilityEvaluator()
   
   # Compare distributions
   stats_report = evaluator.statistical_similarity(
       real_data=real_data,
       synthetic_data=synthetic_data
   )
   
   # Evaluate ML utility
   ml_report = evaluator.ml_efficiency(
       real_data=real_data,
       synthetic_data=synthetic_data,
       target_column='outcome',
       task='classification'
   )
   
   # Generate visual comparisons
   evaluator.plot_distributions(
       real_data=real_data,
       synthetic_data=synthetic_data,
       columns=['age', 'gene_expression_1', 'gene_expression_2']
   )
   
   # Compute an overall utility score
   utility_score = evaluator.overall_utility_score(
       real_data=real_data,
       synthetic_data=synthetic_data
   )
   
   print(f"Overall Utility Score: {utility_score:.3f}")

Part 2: Privacy Evaluation
---------------------------

Privacy evaluation checks whether synthetic data leaks sensitive information about individuals in the original dataset.

Distance-Based Privacy Metrics
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Nearest Neighbor Distance Ratio (NNDR)**

Measures how close synthetic records are to real records:

* **5th Percentile Distance**: Check if synthetic records are too similar to real ones
* **Distance Ratio**: Compare distances to nearest real vs. nearest synthetic neighbors

**Distance to Closest Record (DCR)**

Evaluates the minimum distance between synthetic and real records:

* Helps identify potential memorization
* Lower distances indicate higher privacy risk
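
Both distance-based metrics can be sketched with NumPy. The example assumes Euclidean distance on numeric features and uses one common NNDR definition (distance to the nearest real record divided by distance to the second-nearest real record), which may differ from the definition the module implements. The helper ``nearest_real_distances`` is a hypothetical name.

.. code-block:: python

   import numpy as np


   def nearest_real_distances(synthetic, real):
       """Return each synthetic record's distances to its two nearest real records."""
       # Pairwise Euclidean distances, shape (n_synthetic, n_real).
       diffs = synthetic[:, None, :] - real[None, :, :]
       dists = np.sqrt((diffs ** 2).sum(axis=2))
       two_nearest = np.sort(dists, axis=1)[:, :2]
       return two_nearest[:, 0], two_nearest[:, 1]


   rng = np.random.default_rng(0)
   real = rng.normal(size=(200, 3))
   synthetic = rng.normal(size=(100, 3))
   # Simulate memorization: the first five synthetic rows copy real rows.
   synthetic[:5] = real[:5]

   dcr, second = nearest_real_distances(synthetic, real)
   # Guard against division by zero when the second-nearest distance is 0.
   nndr = dcr / np.maximum(second, 1e-12)

   print(f"Exact copies (DCR == 0): {int((dcr == 0).sum())}")
   print(f"5th percentile DCR: {np.percentile(dcr, 5):.3f}")
   print(f"Median NNDR: {np.median(nndr):.3f}")

Distances are only meaningful when features share a comparable scale, so standardize numeric columns before computing them.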

Membership Inference Attacks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Test if an attacker can determine whether a record was in the training data:

* **Attack Success Rate**: Fraction of records the attacker correctly classifies as members or non-members
* **Attack Advantage**: Performance above random guessing
* **Confidence Scores**: Measure attacker certainty
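
The sketch below shows how success rate and advantage can be computed for a simple distance-based attack. Everything here is an assumption for illustration: the attacker flags a record as a member when it lies within a fixed ``threshold`` of some synthetic record, and the overfitted synthesizer is simulated by adding small noise to the training rows.

.. code-block:: python

   import numpy as np


   def min_distance(records, synthetic):
       """Distance from each record to its nearest synthetic record."""
       diffs = records[:, None, :] - synthetic[None, :, :]
       return np.sqrt((diffs ** 2).sum(axis=2)).min(axis=1)


   rng = np.random.default_rng(0)
   train = rng.normal(size=(200, 4))     # members: used to fit the synthesizer
   holdout = rng.normal(size=(200, 4))   # non-members from the same population
   # Stand-in for an overfitted synthesizer: training rows plus small noise.
   synthetic = train + rng.normal(scale=0.05, size=train.shape)

   # Attack: guess "member" when a record lies unusually close to synthetic data.
   threshold = 0.2
   guess_train = min_distance(train, synthetic) < threshold
   guess_holdout = min_distance(holdout, synthetic) < threshold

   tpr = guess_train.mean()      # members correctly flagged
   fpr = guess_holdout.mean()    # non-members wrongly flagged
   success_rate = 0.5 * (tpr + (1.0 - fpr))   # accuracy on a balanced set
   advantage = tpr - fpr                      # 0 means no better than chance

   print(f"Success rate: {success_rate:.3f}, advantage: {advantage:.3f}")

With equal numbers of members and non-members, a success rate of 0.5 and an advantage of 0 correspond to random guessing.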

Attribute Inference Attacks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Evaluate if sensitive attributes can be inferred:

* **Attribute Disclosure Risk**: Probability of inferring sensitive features
* **Feature Prediction Accuracy**: How well features can be predicted from others

Re-identification Risk
~~~~~~~~~~~~~~~~~~~~~~

Assess the risk of linking synthetic records back to real individuals:

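* **k-Anonymity**: Check that each record shares its quasi-identifier values with at least k-1 other records
* **l-Diversity**: Check that each such group contains at least l distinct values of the sensitive attribute
* **t-Closeness**: Check that the sensitive-attribute distribution within each group stays within distance t of the overall distribution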
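
k-anonymity and l-diversity reduce to counting within groups of records that share the same quasi-identifier values. The following standard-library sketch uses hypothetical records and helper names (``k_anonymity``, ``l_diversity``) and is not the module's implementation.

.. code-block:: python

   from collections import Counter, defaultdict


   def k_anonymity(records, quasi_identifiers):
       """Smallest group size among records sharing quasi-identifier values."""
       groups = Counter(
           tuple(record[column] for column in quasi_identifiers)
           for record in records
       )
       return min(groups.values())


   def l_diversity(records, quasi_identifiers, sensitive):
       """Smallest number of distinct sensitive values within any group."""
       values = defaultdict(set)
       for record in records:
           key = tuple(record[column] for column in quasi_identifiers)
           values[key].add(record[sensitive])
       return min(len(group) for group in values.values())


   # Hypothetical records; ages are already generalized into bands.
   records = [
       {"age": "30-39", "gender": "F", "diagnosis": "A"},
       {"age": "30-39", "gender": "F", "diagnosis": "B"},
       {"age": "40-49", "gender": "M", "diagnosis": "A"},
       {"age": "40-49", "gender": "M", "diagnosis": "A"},
       {"age": "40-49", "gender": "M", "diagnosis": "C"},
   ]

   k = k_anonymity(records, ["age", "gender"])
   l = l_diversity(records, ["age", "gender"], "diagnosis")
   print(f"k = {k}, l = {l}")

Here the smallest group (30-39, F) has two records with two distinct diagnoses, so the dataset is 2-anonymous and 2-diverse with respect to these quasi-identifiers.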

Differential Privacy Metrics
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

For methods with formal privacy guarantees:

* **Epsilon (ε) Budget**: Track privacy budget consumption
* **Delta (δ) Parameter**: Upper bound on the probability that the ε guarantee fails to hold
* **Privacy Loss Distribution**: Characterize the privacy loss across possible outputs to derive tight (ε, δ) guarantees
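
Budget tracking can be illustrated with basic sequential composition, under which the ε and δ values of successive releases simply add up. The ``PrivacyBudget`` class below is a hypothetical sketch. Tighter accounting methods exist, and the module may use a different one.

.. code-block:: python

   class PrivacyBudget:
       """Tracks (epsilon, delta) spending under basic sequential composition."""

       def __init__(self, epsilon, delta=0.0):
           self.epsilon_total = epsilon
           self.delta_total = delta
           self.epsilon_spent = 0.0
           self.delta_spent = 0.0

       def spend(self, epsilon, delta=0.0):
           """Record one release, refusing it if the budget would be exceeded."""
           # Under basic composition, epsilons and deltas simply add up.
           if (self.epsilon_spent + epsilon > self.epsilon_total
                   or self.delta_spent + delta > self.delta_total):
               raise ValueError("privacy budget exceeded")
           self.epsilon_spent += epsilon
           self.delta_spent += delta

       @property
       def epsilon_remaining(self):
           return self.epsilon_total - self.epsilon_spent


   budget = PrivacyBudget(epsilon=1.0, delta=1e-5)
   budget.spend(epsilon=0.5)
   budget.spend(epsilon=0.25, delta=1e-6)
   print(f"Remaining epsilon: {budget.epsilon_remaining:.2f}")

Any further release that needs more than the remaining ε raises ``ValueError`` instead of silently overspending.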

Usage Example - Privacy
~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   from synomics.evaluation import PrivacyEvaluator
   
   # Initialize privacy evaluator
   evaluator = PrivacyEvaluator()
   
   # Distance-based privacy
   dcr_score = evaluator.distance_to_closest_record(
       real_data=real_data,
       synthetic_data=synthetic_data
   )
   
   nndr_score = evaluator.nearest_neighbor_distance_ratio(
       real_data=real_data,
       synthetic_data=synthetic_data
   )
   
   # Membership inference attack
   # (holdout_data is assumed to hold real records not used to fit the synthesizer)
   mia_results = evaluator.membership_inference_attack(
       real_data=real_data,
       synthetic_data=synthetic_data,
       holdout_data=holdout_data
   )
   
   # Attribute inference risk
   aia_results = evaluator.attribute_inference_risk(
       real_data=real_data,
       synthetic_data=synthetic_data,
       sensitive_attributes=['diagnosis', 'genetic_marker']
   )
   
   # Re-identification risk
   reid_score = evaluator.reidentification_risk(
       real_data=real_data,
       synthetic_data=synthetic_data,
       quasi_identifiers=['age', 'gender', 'location']
   )
   
   # Overall privacy score
   privacy_score = evaluator.overall_privacy_score(
       real_data=real_data,
       synthetic_data=synthetic_data
   )
   
   print(f"Overall Privacy Score: {privacy_score:.3f}")

Comprehensive Evaluation
-------------------------

Combine utility and privacy for holistic assessment:

.. code-block:: python

   from synomics.evaluation import ComprehensiveEvaluator
   
   # Initialize comprehensive evaluator
   evaluator = ComprehensiveEvaluator()
   
   # Run full evaluation
   report = evaluator.evaluate(
       real_data=real_data,
       synthetic_data=synthetic_data,
       target_column='outcome',
       sensitive_attributes=['diagnosis'],
       quasi_identifiers=['age', 'gender']
   )
   
   # Generate detailed report
   evaluator.generate_report(
       report=report,
       output_path='evaluation_report.html'
   )
   
   # Plot utility vs privacy tradeoff
   evaluator.plot_utility_privacy_tradeoff(report)

Evaluation Metrics Summary
---------------------------

Utility Metrics
~~~~~~~~~~~~~~~

.. list-table:: Utility Metrics
   :header-rows: 1
   :widths: 30 50 20

   * - Metric
     - Description
     - Range
   * - Statistical Similarity
     - Distribution comparison (KS, JSD)
     - 0-1 (higher is better)
   * - Correlation Preservation
     - Correlation matrix similarity
     - 0-1 (higher is better)
   * - ML Efficiency (TSTR)
     - Ratio of TSTR performance to train-on-real performance
     - 0-1 (higher is better)
   * - Feature Importance
     - Feature ranking similarity
     - 0-1 (higher is better)

Privacy Metrics
~~~~~~~~~~~~~~~

.. list-table:: Privacy Metrics
   :header-rows: 1
   :widths: 30 50 20

   * - Metric
     - Description
     - Range
   * - DCR (Distance to Closest Record)
     - Minimum distance to real records
     - 0-∞ (higher is better)
   * - NNDR
     - Nearest neighbor distance ratio
     - 0-∞ (higher is better)
   * - MIA Success Rate
     - Membership inference accuracy
     - 0-1 (lower is better)
   * - Re-identification Risk
     - Probability of re-identification
     - 0-1 (lower is better)

Best Practices
--------------

1. **Evaluate Both Dimensions**: Always assess both utility and privacy
2. **Use Multiple Metrics**: No single metric captures all aspects
3. **Compare Across Methods**: Evaluate different synthesis approaches
4. **Set Thresholds**: Define acceptable utility and privacy levels
5. **Document Results**: Keep detailed records of evaluation results
6. **Iterate**: Use evaluation results to improve synthesis parameters
7. **Consider Context**: Privacy requirements vary by use case and regulations

Interpreting Results
--------------------

Utility-Privacy Tradeoff
~~~~~~~~~~~~~~~~~~~~~~~~

There is often a tradeoff between utility and privacy:

* **High utility, low privacy**: Synthetic data very similar to real data (potential privacy leaks)
* **Low utility, high privacy**: Very private but less useful synthetic data
* **Balanced approach**: Find optimal point for your application

Recommended Thresholds
~~~~~~~~~~~~~~~~~~~~~~

General guidelines (adjust based on your requirements):

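* **Utility Score**: > 0.7 for acceptable quality
* **Privacy Score**: > 0.8 for sensitive data
* **DCR**: > 0.1 for adequate privacy (the value depends on how features are scaled)
* **MIA Success Rate**: < 0.6 for acceptable privacy (0.5 corresponds to random guessing on a balanced set of members and non-members)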
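
These guidelines can be turned into an automated check. The sketch below is plain Python. The ``THRESHOLDS`` table, the metric names, and the example values are hypothetical and do not correspond to a documented report format.

.. code-block:: python

   # Thresholds copied from the guidelines above; adjust them to your use case.
   THRESHOLDS = {
       "utility_score": (">", 0.7),
       "privacy_score": (">", 0.8),
       "dcr": (">", 0.1),
       "mia_success_rate": ("<", 0.6),
   }


   def check_thresholds(metrics):
       """Return the names of metrics that fail their recommended threshold."""
       failures = []
       for name, (direction, limit) in THRESHOLDS.items():
           value = metrics[name]
           passed = value > limit if direction == ">" else value < limit
           if not passed:
               failures.append(name)
       return failures


   # Hypothetical evaluation results.
   metrics = {
       "utility_score": 0.82,
       "privacy_score": 0.75,
       "dcr": 0.15,
       "mia_success_rate": 0.55,
   }
   print(check_thresholds(metrics))  # ['privacy_score']

Running such a check after every synthesis run makes it easier to see when a parameter change pushes a metric below its agreed level.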

See Also
--------

* :doc:`preprocessing` - Prepare data before synthesis
* :doc:`synthesizer` - Generate synthetic data to evaluate