Week 3

General Experience

This week was a roller-coaster. I had so many struggles installing packages, setting up environments, downloading tools, fixing code, refactoring, writing/rewriting/deleting, etc. However, I was able to create a graphical causal model and run experiments on the LUCAS lung cancer toy dataset. I learned about the networkx library for dealing with graphs, and I learned a lot about using args (command-line arguments). I also reviewed my OOP (Object-Oriented Programming) skills and brushed up my numpy, matplotlib, and pandas knowledge. I have also been working on research blogs, which are more or less a literature review. I still believe I need tons of practice, but I feel like I am getting there. It has been a very exhausting, laborious week, but it was enriching and interesting as well. I guess that's what this research is about!

Readings (Literature Review, Online Blog Posts, and Tutorials)

I have read Review of Causal Discovery Methods Based on Graphical Models. I have also reviewed the concept of (conditional) independence and read some parts of Causality for Machine Learning to learn more about causality.

Experiments & Algorithms

Experiment using Jupyter notebooks

I have re-created the graphical causal model of the LUCAS dataset. For my experiment this week, I followed these steps (a minimal code sketch follows the list):

  1. I created an SCM (structural causal model) using the LUCAS lung cancer toy dataset. Then, I generated dataframes by sampling from the SCM.
  2. Then, I chose a target outcome: Lung_Cancer for this dataset.
  3. I applied multiple causal discovery algorithms (e.g., PC and GES) – you can learn more about these algorithms from the readings I have posted.
  4. After that, I extracted the parents and children of the outcome from the causal graph. Then, I generated dataframes for the outcome, parents, and children; these will later be used for the ERM, causal ERM, and anticausal ERM learning frameworks. The dataframes look exactly like the general dataframe but contain only the parents+outcome, children+outcome, or parents+children+outcome columns.
  5. Then, I created a target dataset with different target domains, each with a different distribution.
  6. I ran my learning frameworks on the target dataset and generated a boxplot summarizing the error values for the anticausal+causal and causal models.
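
As a rough illustration of steps 1–4, here is a minimal sketch of how the graph-side bookkeeping can look with networkx and pandas. This is not my exact code: the edge list shows only a subset of the LUCAS ground truth, and `df` stands in for the dataframe sampled from the SCM.

```python
import networkx as nx
import pandas as pd

# Step 1: encode the (known) LUCAS causal graph as a DAG.
# Only a subset of the LUCAS edges is shown here for brevity.
G = nx.DiGraph([
    ("Anxiety", "Smoking"),
    ("Peer_Pressure", "Smoking"),
    ("Smoking", "Lung_Cancer"),
    ("Genetics", "Lung_Cancer"),
    ("Lung_Cancer", "Coughing"),
    ("Lung_Cancer", "Fatigue"),
])
assert nx.is_directed_acyclic_graph(G)

# Step 2: choose the target outcome.
target = "Lung_Cancer"

# Step 4: extract the parents and children of the outcome.
parents = list(G.predecessors(target))   # e.g. ['Smoking', 'Genetics']
children = list(G.successors(target))    # e.g. ['Coughing', 'Fatigue']

def make_frames(df: pd.DataFrame):
    """Build the column subsets used by the three learning frameworks."""
    causal = df[parents + [target]]            # causal ERM: parents + outcome
    anticausal = df[children + [target]]       # anticausal ERM: children + outcome
    plain = df[parents + children + [target]]  # ERM: parents + children + outcome
    return causal, anticausal, plain
```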

Results and Findings

Experiment results

The following figures show the error values for the causal and the anticausal models. Again, these experiments show that once we increase eta_transfer (the exogenous error of the transfer model), the anticausal model has a much larger error than the causal model. However, when the exogenous error is kept low, the anticausal model performs better than the causal model.
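
To make this concrete, here is a toy sketch (not my actual experiment code) of the same phenomenon on a three-variable chain X -> Y -> Z: the causal model predicts Y from its parent X, the anticausal model predicts Y from its child Z, and `eta_transfer` scales the exogenous noise on Z in the target domain.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

def sample(n, eta_transfer=1.0):
    # Toy SCM: X -> Y -> Z. eta_transfer scales the exogenous
    # noise of the child Z in the target domain.
    X = rng.normal(size=(n, 1))
    Y = 2.0 * X[:, 0] + rng.normal(size=n)
    Z = (1.5 * Y + eta_transfer * rng.normal(size=n)).reshape(-1, 1)
    return X, Y, Z

# Fit both models on the source domain (eta_transfer = 1).
X, Y, Z = sample(5000)
causal = LinearRegression().fit(X, Y)        # predict Y from its parent X
anticausal = LinearRegression().fit(Z, Y)    # predict Y from its child Z

# Evaluate on target domains with increasing exogenous noise on Z:
# the causal model's error stays flat, the anticausal model's blows up.
for eta in (1.0, 3.0, 10.0):
    Xt, Yt, Zt = sample(5000, eta_transfer=eta)
    mse_causal = np.mean((causal.predict(Xt) - Yt) ** 2)
    mse_anti = np.mean((anticausal.predict(Zt) - Yt) ** 2)
    print(f"eta_transfer={eta:4.1f}  causal MSE={mse_causal:.2f}  "
          f"anticausal MSE={mse_anti:.2f}")
```

At eta_transfer = 1 the anticausal model actually wins (the child Z is a very informative reading of Y), which matches the low-noise observation above; as eta_transfer grows, only the causal model stays stable.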

Research Findings

  • Causal structure search mainly relies on discovering causal relations by analyzing the statistical properties of purely observational data.
  • Since it is often too expensive, too time-consuming, or even impossible to discover causal relations using interventions or randomized experiments, revealing causal information by analyzing purely observational data (aka causal discovery) is highly valuable.
  • Structural Equation Models (SEMs), or FCMs, are a class of DGCMs (Directed Graphical Causal Models) that assume the value of each variable is a deterministic function of its direct causes in the graph and of unmeasured disturbances. Linear models are the most common (see the equations after this list).
  • The Markov condition holds when every variable X in a DAG (directed acyclic graph) is independent of its non-descendants conditional on its parents (the variables with edges directed into X). The Markov condition can be thought of as a generalization of a familiar principle in experimental inference: fixing the values of the variables that directly influence some variable of interest X screens off more remote causes that can only influence X via those more direct causes (see the d-separation check after this list).
  • Causal search methods are nothing but statistical estimation of parameters describing a graphical causal structure.
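
To pin down the SEM/FCM assumption from the bullet above, the generating equations take this form (with PA_i the parents of X_i):

```latex
% Each variable is a deterministic function of its direct causes
% (its parents \mathrm{PA}_i) and an unmeasured disturbance \varepsilon_i:
X_i = f_i(\mathrm{PA}_i, \varepsilon_i)

% Linear SEM, the most common special case:
X_i = \sum_{j \in \mathrm{PA}_i} \beta_{ij} X_j + \varepsilon_i
```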
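
And to see the Markov condition mechanically, here is a tiny check using networkx's d-separation utility (assuming a networkx version where `d_separated` is available, i.e. 2.8+): on the chain A -> B -> C, conditioning on the parent B screens off the non-descendant A from C.

```python
import networkx as nx

# Chain A -> B -> C: by the Markov condition, C is independent of its
# non-descendant A once we condition on its parent B.
G = nx.DiGraph([("A", "B"), ("B", "C")])

print(nx.d_separated(G, {"A"}, {"C"}, {"B"}))   # True: B screens off A from C
print(nx.d_separated(G, {"A"}, {"C"}, set()))   # False: unconditionally dependent
```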

Frustrations

This week I got really frustrated because I ran into some software issues and technicalities when installing R and its packages on my computer. I spent more than 10 hours trying to fix the R installation issue, maneuvering my way through warnings. I travelled from GitHub issues to Stack Overflow to God knows where to figure it out. In the end, I ended up with a Ph.D. in installation issues (just kidding!). In addition, I have also been tasked with making my code more modular and using the terminal instead of Jupyter notebooks. Therefore, I had some frustrations with refactoring, deleting, adding, replacing, etc. Ugh! Too much work! However, I believe it was all worth it, because now I have better coding skills and I am more familiar with the OOP style.

Plans for Next Week

For next week, I am going to read two papers: Invariant Risk Minimization and Adaptive Risk Minimization. I will be doing some analysis and taking notes on them. The main task is to look closely at the experiments and understand the general concept of both methods. Then, I will start looking at DomainBed and familiarizing myself with its algorithms and how it works.

Other things worth sharing

I am still happy and excited about the next steps. Something funny worth sharing: I have started questioning the relationships between many things around me, wondering whether the underlying reasons are invariant or just spurious correlations. Next week we start getting into the hefty details and doing more intense work, so I am trying to stay optimistic!

Written on June 20, 2021