Thursday, May 30, 2013

Comparing methods for Bayesian networks in R

I want to use Bayesian networks to look at the structure of the links between my variables. The idea behind this project is that we don't mind if it's based on data mining.

There are lots of packages available for R on CRAN that will implement this sort of network-type analysis. I'm making a summary of what I gather some of the packages can do, the type of data they work on, and how they treat missing values.

This is a work-in-progress post, so I will update it as I do more reading and also make some notes as things go well or badly!

abn: Data Modelling with Additive Bayesian Networks

  • Determines the most robust empirical model of data from interdependent variables: structure discovery 
  • Equivalent to multivariate generalised linear modelling (including mixed models with random effects)
  • Families modelled: Gaussian, Poisson and Binomial. 
  • Missing data: cannot have missing entries (impute or remove them before analysis)
  • Can have a mix of binary and continuous variables in the data. It doesn't handle true categorical variables; it gets around this by instructing the user to dummy-code each level of a categorical variable as a yes/no binary outcome and then 'ban' links between the dummy variables in a 'banned matrix'. This greatly increases the number of variables modelled, though.
  • The model is specified using two matrices (ban and retain)
  • Outputs dot code for Graphviz
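
To make the dummy-coding and banning step concrete, here is a minimal sketch in base R. The data and variable names are made up, and it only builds the 0/1 matrix; how that matrix is handed to abn (argument names, and whether rows or columns are the 'child') should be checked against the package documentation. I ban both directions between the dummies so the orientation doesn't matter here.

```r
# Toy data: one binary, one continuous and one three-level categorical
# variable. All names and values are made up for illustration.
dat <- data.frame(
  woody  = c(1, 0, 1, 0, 1, 0),
  height = c(2.1, 0.3, 5.0, 0.4, 3.2, 0.2),
  grime  = factor(c("C", "S", "R", "C", "S", "R"))
)

# One yes/no column per level of the categorical variable (no intercept,
# so every level gets its own column).
dummies <- model.matrix(~ grime - 1, data = dat)
dat.abn <- cbind(dat[, c("woody", "height")], as.data.frame(dummies))

# Square 0/1 matrix with one row and one column per variable; 1 = banned arc.
vars <- names(dat.abn)
banned <- matrix(0, nrow = length(vars), ncol = length(vars),
                 dimnames = list(vars, vars))

# Ban arcs in both directions between dummies of the same original variable.
dummy.names <- colnames(dummies)
banned[dummy.names, dummy.names] <- 1
diag(banned) <- 0  # keep the diagonal at 0

print(banned)
```

With a three-level factor this already turns one variable into three, which is where the growth in the number of modelled variables comes from.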

Notes: I have quite a few variables (25 to 30, as this coding method leaves you with dummy binary variables) on 177 data points. This was just a test run: I have a lot more data, but for a quick and dirty run I used na.omit on the data frame, which took out a lot of the data. Imputation would be better for the real thing. A lot of the variables can hypothetically have many 'parent' nodes, and I believe having lots of parent nodes increases the computation time, so I set a maximum of 4 parent nodes. After 6 hours the search had not finished, so I decided to kill the analysis. This was premature, but I got impatient, seeing as it was effectively meant to be a dummy run. This could be very time consuming if you want to fit a few models or if something goes wrong. Must be more patient.
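
Before running na.omit it is worth checking how many rows it will actually drop and which variables are responsible. A small base R sketch, using a made-up data frame rather than my real data:

```r
# Toy data frame with scattered missing values (made-up numbers).
dat <- data.frame(
  height = c(2.1, NA, 5.0, 0.4, 3.2, 0.2),
  sla    = c(NA, 21, 18, NA, 25, 30),
  mrt    = c(120, 80, NA, 60, 150, 90)
)

n.all      <- nrow(dat)
n.complete <- sum(complete.cases(dat))  # rows that na.omit() would keep
na.per.var <- colSums(is.na(dat))       # missing values per variable

cat("rows:", n.all, " complete rows:", n.complete, "\n")
print(sort(na.per.var, decreasing = TRUE))
```

If one or two variables account for most of the missing values, leaving them out (or imputing only those) may keep far more rows than a blanket na.omit.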

bnlearn: Bayesian network structure learning, parameter learning and inference
  • Model specification for directed graphs is the same as for the deal package - it's like graph model notation.
  • Arcs between variables can be made directional using function set.arc() rather than the matrix layout of the abn package.
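
As a rough sketch of those two ways of specifying a graph in bnlearn: the first uses the model string notation and the second adds arcs one at a time with set.arc(). The node names are just placeholders borrowed from my own variables, and the code assumes bnlearn is installed.

```r
# Assumes the bnlearn package is installed (install.packages("bnlearn")).
library(bnlearn)

# Model string notation: [node|parent1:parent2]. Node names are placeholders.
dag1 <- model2network("[habitats][mrt|habitats][states|mrt:habitats]")

# The same structure built arc by arc from an empty graph.
dag2 <- empty.graph(c("habitats", "mrt", "states"))
dag2 <- set.arc(dag2, from = "habitats", to = "mrt")
dag2 <- set.arc(dag2, from = "habitats", to = "states")
dag2 <- set.arc(dag2, from = "mrt", to = "states")

print(modelstring(dag1))  # model string of the first graph
print(arcs(dag2))         # from/to table of the arcs in the second graph
```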
To be continued...

Monday, May 27, 2013

Drawing path diagrams - options part 2

Having said in my last post that I wasn't going to use R/Graphviz, I've spent the last couple of days investigating Graphviz! I like to change my mind sometimes.

After some thought, I decided that the way Graphviz automatically lays out and creates the path diagram from the specified links has to be worth looking into. The TikZ/LaTeX method is more labour intensive in some ways because it doesn't automatically decide things like arrow head placement.

I couldn't instantly figure out how to make the Graphviz GUI work! I think I was quite tired. But it's simple (on Windows at any rate): install it from this website, open it up, and then paste some dot code into the GUI. Then press the 'running man' icon. It should tell you if and where there are syntax errors.

You can get dot code from R using the sem package and the function pathDiagram(). However, I have found that the messing around involved in getting to that stage with sem in R is not worth my time (as I don't want to use sem for the analysis). I kept getting an error message at earlier stages in the process (as pathDiagram needs a fitted sem model object, so you need to get that done first). So I just looked at some example dot codes for structural equation models (SEM) and worked out by trial-and-error how to modify the dot code.
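
Since the dot code for this kind of diagram is little more than a list of "from -> to" lines, one way to skip sem entirely is to build the text from a table of links in base R. This is only a sketch: edges_to_dot is a hypothetical helper I made up for illustration, not a function from any package, and the links are a small made-up subset.

```r
# Hypothesised links as a plain edge list (a small illustrative subset).
links <- data.frame(
  from = c("mrt", "us.cult", "cz.cult", "habitats"),
  to   = c("states.pres", "states.pres", "mrt", "mrt"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: build dot code for a directed graph from an edge list.
edges_to_dot <- function(links, graph.name = "paths") {
  links <- unique(links)  # a repeated row would draw a second arrow
  edge.lines <- sprintf('  "%s" -> "%s";', links$from, links$to)
  c(sprintf('digraph "%s" {', graph.name),
    "  rankdir=LR;",
    "  node [fontname=\"Helvetica\" fontsize=10 shape=box];",
    edge.lines,
    "}")
}

dot.code <- edges_to_dot(links)
writeLines(dot.code)                 # paste this into the Graphviz GUI
# writeLines(dot.code, "paths.dot")  # or save it to a file instead
```

Keeping the links in a data frame also makes it easy to add or drop a hypothesised path and regenerate the diagram.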

For my needs the Graphviz syntax is quite simple. Here is a quickly made-up example of what I managed to achieve.

digraph "sem.wh.1" {
rankdir=LR;
size="24,24";
node [fontname="Helvetica" fontsize=10 shape=box];
edge [fontname="Helvetica" fontsize=10];
center=1;

"mrt" -> "states.pres" [label=""];
"us.cult" -> "states.pres" [label=""];
"flor.zone" -> "states.pres" [label=""];
"squares.cz" -> "states.pres" [label=""];
"flower.period" -> "states.pres" [label=""];
"dispersal.vectors2" -> "states.pres" [label=""];
"pollination.vectors2" -> "states.pres" [label=""];
"propagule length" -> "states.pres" [label=""];
"grime" -> "states.pres" [label=""];
"life.span.ingolf" -> "states.pres" [label=""];

"cz.cult" -> "mrt" [label=""];
"habitats" -> "mrt" [label=""];
"flor.zone" -> "mrt" [label=""];
"alt.range.cz" -> "mrt" [label=""];
"life.span.ingolf" -> "mrt" [label=""];
"growth.form" -> "mrt" [label=""];
"grime" -> "mrt" [label=""];
"flower.period" -> "mrt" [label=""];


"habitats" -> "cz.cult" [label=""];
"flor.zone" -> "cz.cult" [label=""];
"alt.range.cz" -> "cz.cult" [label=""];
"squares.cz" -> "cz.cult" [label=""];
"sla" -> "cz.cult" [label=""];
"life.span.ingolf" -> "cz.cult" [label=""];
"growth.form" -> "cz.cult" [label=""];

"cz.cult" -> "us.cult" [label=""];
"habitats" -> "us.cult" [label=""];
"flor.zone" -> "us.cult" [label=""];
"alt.range.cz" -> "us.cult" [label=""];
"squares.cz" -> "us.cult" [label=""];

"flor.zone" -> "habitats" [label=""];
"alt.range.cz" -> "habitats" [label=""];
"squares.cz" -> "habitats" [label=""];
"ssb.range" -> "habitats" [label=""];
"life.span.ingolf" -> "habitats" [label=""];
"growth.form" -> "habitats" [label=""];
"grime" -> "habitats" [label=""];
"height" -> "habitats" [label=""];
"flower.period" -> "habitats" [label=""];
"dispersal.vectors2" -> "habitats" [label=""];



}



It's ok. It's getting quite complicated already though. Doing all this has forced me to really think about what I'm trying to model here. There are so many hypothesised links that it might be better to use some kind of machine-learning network analysis to look at the strong links instead. Something to look into.

Wednesday, May 22, 2013

Drawing path diagrams - options part 1

One of my main projects at Charles University involves path analysis. A problem I came up against in my PhD work was with finding a time-efficient way to draw multiple path diagrams. I didn't put too much effort into finding a solution back then and just used PowerPoint and Inkscape to draw diagrams. However, when I had to make small modifications to the diagrams and produce numerous diagrams this method quickly became overly time consuming and annoying.

My current project is going to demand even more complex diagrams. I want to find a better method.

Some initial reading has led me to the idea of doing away with the manual drawing methods, and I now think that a command line programming method could be better in the long run, although there is usually a steep learning curve at the start, which is a bit off-putting. Luckily, I can already use LaTeX and R so it's not too intimidating and I'll give it a try.

Here is a summary of what I've learned today:

This post from Andrew Wheeler on Stack Exchange is where I started off, and I found it helpful because he describes the exact issues I've been having. The post steered me towards thinking that the TikZ/PGF drawing library in LaTeX will be the way to go rather than Graphviz. As I saw from another post, Graphviz has trouble drawing curved arrows and also can't handle some of the mathematical notation one might need.

I also rejected the idea of using the in-built visualisation functions from the sem package in R because I'm not planning on using sem to run the analysis in R [I want to use some method for confirmatory path analysis - to be decided whether it's going to be Bayesian or not. For now, I need it to be phylogenetic, so I'm thinking of phylogenetic confirmatory path analysis].

Here is a basic diagram. It's only taken me a couple of hours to get to this stage. It's not pretty yet and doesn't have all my variables, but it's an ok start. Mostly, it's adapted from this code by Ivan Griffin.

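The document class doesn't actually convert it into a .png for me. A thread says that this will be possible eventually. (As far as I can tell, the convert option only takes effect when pdflatex is run with -shell-escape and an external converter such as ImageMagick is installed, which may be why nothing happens here.) But having it like this does at least produce a cropped .pdf document.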

Latex code:


\documentclass[convert={density=300,size=1080x800,outext=.png}]{standalone}
\usepackage{tikz} % Load tikz package
\usetikzlibrary{fit,positioning,calc,backgrounds}
\begin{document}
\centering
\begin{tikzpicture} % Encloses drawings in tikz env.

% Styles for states, and state edges
\tikzstyle{state} = [draw, very thick, fill=blue!10, rectangle, minimum height=3em, minimum width=7em, node distance=8em, font={\sffamily\bfseries}]
\tikzstyle{stateEdgePortion} = [black,thick];
\tikzstyle{stateEdge} = [stateEdgePortion,->];
\tikzstyle{edgeLabel} = [pos=0.5, text centered, font={\sffamily\small}];
\tikzstyle{main}=[circle, minimum size = 10mm, thick, draw =red!80, node distance = 16mm]
\tikzstyle{connect}=[-latex, thick]
\tikzstyle{box}=[rectangle, draw=green!100]

% Position the nodes (boxes)
\node[state, name=mrt] {MRT};
\node[state, name=gridcells, below of=mrt, left of=mrt, xshift=-2em] {Grid cells};
\node[state, name=propagule, below of=gridcells] {Propagule length};
\node[state, name=dna, below of=propagule, right of=propagule, xshift=2em] {DNA};
\node[state, name=states, below of=gridcells, right of=gridcells, xshift=20em, node distance=4em] {States};

% Connect the nodes (boxes) via edges (arrows)
\draw ($(propagule.north) + (-.0em,0)$)
edge[stateEdge] node[edgeLabel, xshift=-0em]{\emph{}}
($(gridcells.south) + (-.0em,0)$);
\draw ($(gridcells.north) + (.0em,0)$)
edge[stateEdge, bend left=22.5] node[edgeLabel, xshift=-0em]{\emph{}}
($(mrt.west) + (.0em,0)$);
\draw ($(dna.west) + (-0em,0)$)
edge[stateEdge, bend left=22.5] node[edgeLabel, xshift=-0em, yshift=0em]{}
($(propagule.south) + (0,0em)$);
\draw ($(mrt.east) + (0em,0)$)
edge[stateEdge] node[edgeLabel, xshift=0em, yshift=0em]{\emph{}}
($(states.north) + (0,0em)$);

\end{tikzpicture}
\end{document}
% note - compiled with pdflatex