Thesis

Causal discovery from observational tabular data with generative adversarial learning

Creator
Rights statement
Awarding institution
  • University of Strathclyde
Date of award
  • 2025
Thesis identifier
  • T17565
Person Identifier (Local)
  • 202092048
Qualification Level
Qualification Name
Department, School or Faculty
Abstract
  • Background: Causal knowledge is essential for understanding complex systems and revealing relationships between variables. It enables researchers to move beyond correlations, reason about cause and effect, and derive scientific insights. Although Randomized Controlled Trials (RCTs) remain the gold standard for causal inference, they are often infeasible due to ethical, logistical, or financial constraints and may lack real-world applicability. In contrast, observational data offer abundant, diverse samples, making them well suited to large-scale analysis. Despite their susceptibility to confounding, advances in structure learning from observations allow researchers to identify causal relationships without relying on randomized experiments.

    Research objectives: This thesis challenges conventional maximum likelihood estimation (MLE)-based methods by exploring adversarial causal discovery approaches. It leverages the Wasserstein Generative Adversarial Network with Gradient Penalty (WGAN-GP) framework to address four key limitations: (1) model overfitting caused by simplistic loss functions; (2) dependence on a single parametric assumption, which hinders recovery of a causal graph that reflects the true relationships in the data; (3) the high computational cost of Augmented Lagrangian optimization in the NOTEARS framework; and (4) the inability to perform causal discovery and tabular data synthesis simultaneously within a single framework.

    Methods: Three models were developed using the WGAN-GP framework. The first, DAG-WGAN, integrates WGAN-GP with variational inference, leveraging hybrid losses for improved causal modeling. The second, DAG-WGAN+, enhances continuous optimization with efficient structure learning techniques. The third, DAGAF, captures variable interdependencies under various causal assumptions to generate synthetic data that preserve causal relations.

    Results: All models target multivariate causal discovery and were rigorously evaluated using Structural Hamming Distance (SHD). Results show that they outperform leading methods in causal discovery across 97.47% of all test cases. In real-world experiments, the proposed models achieve superior accuracy (SHD = 8 vs. > 10 for state-of-the-art models). Findings further reveal that precise causal modeling enhances synthetic data quality by preserving underlying causal mechanisms.
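    The abstract reports results in terms of SHD but does not spell out how it is computed. As a rough illustration only, the sketch below uses one common convention: count the node pairs whose edge status differs between the true and estimated graphs, so that a missing, extra, or reversed edge each adds 1. The function name and the adjacency-matrix input format are assumptions made for this example; the thesis may use a different variant (for instance, counting a reversal as 2).

```python
def structural_hamming_distance(true_adj, est_adj):
    """Illustrative SHD between two directed graphs given as square
    adjacency matrices (nested lists), where adj[i][j] != 0 means i -> j.

    Assumed convention: each unordered node pair whose edge status differs
    (missing, extra, or reversed edge) contributes exactly 1.
    """
    n = len(true_adj)
    shd = 0
    for i in range(n):
        for j in range(i + 1, n):
            # Edge status of the pair {i, j}: (i -> j present?, j -> i present?)
            true_pair = (bool(true_adj[i][j]), bool(true_adj[j][i]))
            est_pair = (bool(est_adj[i][j]), bool(est_adj[j][i]))
            if true_pair != est_pair:
                shd += 1
    return shd


# Hypothetical 3-node example: the true graph is 0 -> 1 -> 2.
true_graph = [
    [0, 1, 0],
    [0, 0, 1],
    [0, 0, 0],
]
# The estimate reverses 0 -> 1, adds 0 -> 2, and misses 1 -> 2.
estimated_graph = [
    [0, 0, 1],
    [1, 0, 0],
    [0, 0, 0],
]

print(structural_hamming_distance(true_graph, estimated_graph))
```

    Under this convention a lower SHD means the estimated graph is closer to the reference graph, and an SHD of 0 means the two graphs are identical.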
Advisor / supervisor
  • Dong, Feng
  • Maguire, Roma
Resource Type
DOI
