Thursday, May 30, 2013

Comparing methods for Bayesian networks in R

I want to use Bayesian networks to look at the structure of the links between my variables. The idea behind this project is that we don't mind if it's based on data mining.

There are lots of packages available for R on CRAN that implement this sort of network-type analysis. I'm making a summary of what I gather some of the packages can do, the type of data they work on, and how they treat missing values.

This is a work-in-progress post, so I will update it as I do more reading and also make some notes as things go well or badly!

abn: Data Modelling with Additive Bayesian Networks

  • Determines the most robust empirical model of data from interdependent variables: structure discovery 
  • Equivalent to multivariate generalised linear modelling (including mixed models with random effects)
  • Families modelled: Gaussian, Poisson and Binomial. 
  • Missing data: cannot have missing entries (impute or remove them before analysis)
  • Can have a mix of binary and continuous variables in the data. It doesn't handle true categorical variables; it gets around this by instructing the user to dummy-code each level of a categorical variable as a yes/no binary outcome, then 'ban' links between the dummy variables in a 'banned matrix'. This greatly increases the number of variables modelled, though.
  • Model specification uses two matrices (ban and retain)
  • Outputs dot code for Graphviz
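
To make the dummy-coding step concrete, here is a minimal sketch in base R only. The data frame, variable names and values are made up for illustration, and the orientation abn expects for its ban matrix (which of rows and columns are parents) is not shown here, so check the package documentation. Because the links between dummy variables are banned in both directions, the block below is symmetric either way.

```r
# Hypothetical toy data: one categorical variable plus two others
df <- data.frame(
  breed  = factor(c("a", "b", "c", "a", "b", "c")),
  weight = c(3.1, 4.2, 5.0, 2.9, 4.4, 5.3),
  sick   = c(0, 1, 0, 0, 1, 1)
)

# One yes/no column per level of the categorical variable
# ("- 1" drops the intercept so every level gets its own column)
dummies <- model.matrix(~ breed - 1, data = df)

# Replace the categorical column with its dummy columns
coded <- data.frame(dummies, df[, c("weight", "sick")])
vars <- names(coded)

# Square 0/1 matrix with one row and one column per variable;
# a 1 marks a link that the search is not allowed to use
banned <- matrix(0, nrow = length(vars), ncol = length(vars),
                 dimnames = list(vars, vars))

# Ban every link between two dummies of the same categorical variable
dummy_names <- colnames(dummies)
banned[dummy_names, dummy_names] <- 1
diag(banned) <- 0  # no variable is linked to itself anyway

banned
```

One three-level variable has already become three columns here, which is how a modest data set ends up with 25 to 30 variables.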

Notes: I have quite a few variables (25 to 30, as this coding method leaves you with dummy binary variables) on 177 data points. This was just a test run: I have a lot more data, but for a quick and dirty run I used na.omit on the data frame, which took out a lot of the data. Imputation would be better for the real thing. A lot of the variables can hypothetically have many 'parent' nodes, and I believe having lots of parent nodes increases the computation time. I set a maximum of 4 parent nodes. After 6 hours the search had not finished, so I decided to kill the analysis. This was premature, but I got impatient, seeing as it was effectively meant to be a dummy run. This could be very time-consuming if you want to fit a few models or if something goes wrong. Must be more patient.
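
Before running na.omit, it is worth checking how many rows it will actually drop. This is a small base R sketch on a made-up data frame (the column names and values are hypothetical, not my real data):

```r
# Hypothetical data frame with a few missing entries
df <- data.frame(
  x = c(1, NA, 3, 4),
  y = c("a", "b", NA, "d"),
  z = c(1, 2, 3, 4)
)

# Missing values per column: shows which variables cost the most rows
na_per_column <- colSums(is.na(df))

# Rows with no missing value in any column, i.e. what na.omit keeps
n_complete <- sum(complete.cases(df))

reduced <- na.omit(df)

na_per_column
c(before = nrow(df), after = nrow(reduced))
```

If one or two columns account for most of the missing values, dropping or imputing just those may save far more rows than a blanket na.omit.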

bnlearn: Bayesian network structure learning, parameter learning and inference
  • Model specification for directed graphs is the same as for the deal package: it's like graph model notation.
  • Arcs between variables can be made directional using function set.arc() rather than the matrix layout of the abn package.
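
As a rough sketch of the two ways of specifying a directed graph, assuming the bnlearn package is installed: the node names A, B and C are placeholders, and the model string is my own toy example rather than anything from my data.

```r
library(bnlearn)  # assumes bnlearn is installed from CRAN

# Model-string notation: each node in brackets, parents after the "|"
dag <- model2network("[A][B|A][C|A:B]")
arcs(dag)

# The same structure built up one directed arc at a time
dag2 <- empty.graph(c("A", "B", "C"))
dag2 <- set.arc(dag2, from = "A", to = "B")
dag2 <- set.arc(dag2, from = "A", to = "C")
dag2 <- set.arc(dag2, from = "B", to = "C")

# Convert back to a model string to compare the two graphs
modelstring(dag2)
```

The string form is quicker to type for a whole graph, while set.arc() is handier for adjusting a single link in an existing one.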
To be continued...
