Thursday, July 24, 2014

The return of the blog

I've had a break for about a year: a break from the blog, and also a break from academic thinking. However, I feel another post brewing, inspired by some work I did recently on a massive postcode dataset (10,000+ data points). The aim was to look at childcare provision needs. Mapping people's locations was really useful for the department involved, and it was good to finally apply some sophisticated graphical and analytical software to the huge dataset at my disposal (all personnel information for a world-leading university).

So, to follow will be some of my musings on an ArcGIS innocent choosing the right GIS software when Arc is not an option (because of licensing), dealing with UK postcodes and assigning geo-locations to them, and finally getting them visualised in a user-friendly way.

Monday, August 12, 2013

Using abn: additive Bayesian networks

Since finishing all the imputation stuff I've mostly been working on applying additive Bayesian networks to my plant invasion dataset. I did some basic research and decided to try the abn package, for no particular reason other than that it seemed to do what I wanted best and looked to have good documentation.

Based on my not-at-all-exhaustive research, abn has two advantages over other Bayesian network packages: (1) I think it is the newest, and I like that; (2) it can model mixed data, in that it can take variables with Gaussian, Poisson or binomial distributions. The catch is that true categorical factors have to be turned into multiple yes (1) / no (0) binary factors.

The abn package seems to have come from epidemiological research, which is perhaps why I couldn't find any papers where it is used in ecology. I also liked the idea of trying something ecologists haven't used yet.

First, this package needs complete data, so (multiple) imputation or deletion of missing data (rarely ideal) is essential before you even start...but then I get the feeling that you could be layering models onto models with imputation...none of this is great, but you have to work things out as you go, I suppose. Also, because it takes so long to run and get out a network, re-running the abn analysis for each of the m imputed data sets becomes unappealing, meaning you're unlikely to go the extra mile to get the benefit of multiple imputation.

Second, splitting up your categorical factor variables into several binary factors will inflate the number of variables plugged into the model, possibly by a lot (see the small illustration below). I had 34 variables after splitting and a dataset of 466 species. This can greatly increase the time the network searches take to run, but there are ways around it (to do with the ban and retain matrices, as I will show).
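To make the splitting concrete, here is how a single three-level factor becomes one 0/1 column per level (a hypothetical example; the level names are made up):

# hypothetical example: a three-level growth-form factor
gf <- factor(c("annual", "shrub", "tree", "shrub"))
# model.matrix with the intercept dropped gives one 0/1 indicator column per level
ind <- model.matrix(~ gf - 1)
colnames(ind) # "gfannual" "gfshrub" "gftree"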

Third, you will need to install Graphviz to visualise the networks.

So, let's go. First, manipulate the data and set up the distribution list.

# load abn library
  library(abn)

# set working directory
  setwd("your directory")

# load an imputed data set
  imp <- read.csv("imputed_data_1.csv")

# subset and arrange the data frame to keep the variables you want
abn.nox <- imp[,c(4, 5:11, 13:27, 29, 32:41)]
abn.nox$dispersal.vector.no <- dat$dispersal.vector.no # dat is the original, unimputed data frame
# force factors to be factors if they aren't already
abn.nox$str.comepetitive <- as.factor(abn.nox$str.comepetitive)
abn.nox$str.stress <- as.factor(abn.nox$str.stress)
abn.nox$str.ruderal <- as.factor(abn.nox$str.ruderal)
abn.nox <- na.omit(abn.nox) # drop any rows that still contain missing values
# reorder and subset down to the 34 columns we will actually model
abn.sub <- abn.nox[,c(1:23, 28, 35, 25:27, 29:34)]

# setup distribution list for each node
  subdists <- list(states.nox="poisson",
                   mrt="gaussian",
                   cz.cult="poisson",
                   us.cult="poisson",
                   habitats="gaussian",
                   flor.zone="gaussian",
                   alt.range.cz="gaussian",
                   squares.cz="gaussian",
                   clonal.index="poisson",
                   ldmc="gaussian",
                   ssb.range="gaussian",
                   sla="gaussian",
                   monocarp="binomial",
                   polycarp="binomial",
                   annual="binomial",
                   shrub="binomial",
                   tree="binomial",
                   str.comepetitive="binomial",
                   str.stress="binomial",
                   str.ruderal="binomial",
                   height="poisson",
                   flower.period="poisson",
                   popagule.length="poisson",
                   pol.vector.no="gaussian",
                   dispersal.vector.no="gaussian",
                   self.pol="binomial",
                   insect.pol="binomial",
                   wind.popl="binomial",
                   ant.dispersed="binomial",
                   exoz.dispersed="binomial",
                   endoz.dispersed="binomial",
                   wind.dispersed="binomial",
                   water.dispersed="binomial",
                   self.dispersed="binomial");
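Before going any further, it's worth checking that the distribution list lines up with the data frame (a small sanity check of my own, not from the abn docs):

# sanity check: every modelled column needs a matching named distribution
stopifnot(identical(sort(names(subdists)), sort(names(abn.sub))))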

abn uses 'ban' and 'retain' matrices to allow or ban parent arcs (this is all really well explained on the website). These matrices can reflect prior beliefs about how some variables could interact with others, and can ban silly relationships you know aren't possible.

The idea is to start by allowing one 'parent' and then alter the ban and retain matrices based on the previous model to narrow down the search space, enabling more parents to be added.

TIP: make each matrix in Excel first so you can have your variables labelled as column and row headings, then save the matrix as a .txt file. Copy and paste the text from the .txt file into R and then add the needed commas to the ends of the rows. Interpretation of the matrix is that the variable in the column is allowed (0) or banned (1) from being a parent of the variable in the row (i.e. there is directionality coded in these matrices).
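If you'd rather skip the copy-and-paste step, you can read the labelled matrix straight from the saved file (a sketch; the file name and tab separator are my assumptions):

# alternative: read the labelled ban matrix directly from the .txt file
subban <- as.matrix(read.table("subban.txt", header=TRUE, row.names=1, sep="\t"))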

###1 Parent
  subban <- matrix(c(
0,0,1,0,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
1,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
1,1,1,1,1,1,1,1,1,1,1,1,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,1,1,1,1,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,1,0,0,0,0,0,0,
1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,1,0,0,0,0,0,0,
1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,1,0,0,0,0,0,0,0,
1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,1,1,1,1,
1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,1,1,1,1,
1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,1,0,1,1,1,
1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,1,1,0,1,1,
1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,1,1,1,0,1,
1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,1,1,1,1,0
),byrow=TRUE,ncol=34);
colnames(subban)<-rownames(subban)<-names(abn.sub); #names must be set, and must match the modelled data frame

  # nothing is retained at this stage, so the retain matrix is all zeros
  subretain <- matrix(0, nrow=34, ncol=34);
colnames(subretain)<-rownames(subretain)<-names(abn.sub); #names must be set

# parent limits set to one initially
maxparen <- 1

# Make cache
subcache1 <- buildscorecache(data.df=abn.sub, data.dists=subdists, dag.banned=subban, dag.retained=subretain, max.parents=maxparen);

# perform a heuristic search. The other option is an exhaustive search followed by (computationally intensive) bootstrapping to adjust for overfitting. The heuristic search can instead produce a 50% consensus network, which gets around overfitting and is less computationally intensive.
subheur1 <-search.hillclimber(score.cache=subcache1, num.searches=1000, seed=0, verbose=FALSE, trace=FALSE, timing.on=TRUE);

# dot output for the consensus network in Graphviz
tographviz(dag.m=subheur1$consensus, data.df=abn.sub, data.dists=subdists, outfile="abn1.dot");
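If Graphviz is installed and on your PATH, you can render the .dot file without leaving R (a one-liner sketch; the output file name is my own choice):

# render the dot file to PDF with Graphviz's command-line 'dot' tool
system("dot -Tpdf abn1.dot -o abn1.pdf")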

# fit model to 50% consensus directed acyclic graph (DAG)
cons.mat1 <- subheur1$consensus
cons.mod1 <- fitabn(dag.m = cons.mat1, data.df=abn.sub, data.dists=subdists,compute.fixed=TRUE);

# get the goodness of fit for the model (marginal log likelihood)
cons.mod1$mlik

Then you need to increase the number of parents allowed, at the same time adding any arcs the previous search didn't support to the 'ban' matrix.

##### 2 parents
subban <- matrix(c(
0,0,1,0,1,1,1,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,
1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,
1,1,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,
1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,
1,1,1,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,
1,1,1,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,
1,1,1,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,
1,1,1,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,
1,1,1,1,1,1,1,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,
1,1,1,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,
1,1,1,1,1,1,1,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,
1,1,1,1,1,1,1,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,
1,1,1,1,1,1,1,1,1,1,1,1,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,1,1,1,1,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,
1,1,1,1,1,1,1,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,
1,1,1,1,1,1,1,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,
1,1,1,1,1,1,1,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,
1,1,1,1,1,1,1,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,
1,1,1,1,1,1,1,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,1,1,0,0,0,0,0,1,
1,1,1,1,1,1,1,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,1,0,1,0,0,0,0,0,1,
1,1,1,1,1,1,1,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,1,1,0,0,0,0,0,0,1,
1,1,1,1,1,1,1,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,1,1,1,1,
1,1,1,1,1,1,1,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,1,1,1,1,
1,1,1,1,1,1,1,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,1,0,1,1,1,
1,1,1,1,1,1,1,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,1,1,0,1,1,
1,1,1,1,1,1,1,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,1,1,1,0,1,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0
),byrow=TRUE,ncol=34);
colnames(subban)<-rownames(subban)<-names(abn.sub); #names must be set

This time the retain matrix is simply the consensus network from the previous search:
subretain <- cons.mat1

# parent limits
  maxparen <- 2

# Make cache
subcache2 <- buildscorecache(data.df=abn.sub, data.dists=subdists, dag.banned=subban, dag.retained=subretain, max.parents=maxparen);

# perform heuristic search
subheur2 <-search.hillclimber(score.cache=subcache2, num.searches=1000, seed=0, verbose=FALSE, trace=FALSE, timing.on=TRUE, dag.retained=subretain);

# dot output for the consensus network
tographviz(dag.m=subheur2$consensus, data.df=abn.sub, data.dists=subdists, outfile="abn2.dot");

# fit model to consensus DAG and get the goodness of fit
cons.mat2 <- subheur2$consensus
cons.mod2 <- fitabn(dag.m = cons.mat2, data.df=abn.sub, data.dists=subdists, compute.fixed=TRUE);
cons.mod2$mlik

Then you just need to keep on upping the number of parents and appending links to the ban matrix (so, if you have allowed two parents but some variables only have one parent in the consensus, all their other candidate parents can be banned, because we know fewer than two parents are supported; see the sketch below). In my case, the optimal number was six parents.
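Here is a rough sketch of that ban-matrix update (my own interpretation of the rule above, assuming rows are children and columns are parents, as in the matrices earlier):

# for any node that used fewer parents than the current limit,
# ban all the candidate parents it didn't use
nparents <- rowSums(cons.mat2)
for (i in which(nparents < maxparen)) {
  subban[i, cons.mat2[i, ] == 0] <- 1
}
diag(subban) <- 0 # keep the diagonal at zero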

When no more parents can be added, OR there is no further improvement in the fitted model's goodness of fit, you have found the optimal number of parents.
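In practice that just means tracking the marginal log-likelihood as the parent limit goes up and stopping when it plateaus (a sketch; it assumes you have kept the fitted cons.mod objects up to the six-parent model):

# collect the mlik from each fitted consensus model and look for the plateau
mliks <- c(cons.mod1$mlik, cons.mod2$mlik) # ..., up to the six-parent model
plot(seq_along(mliks), mliks, type="b",
     xlab="maximum parents allowed", ylab="marginal log-likelihood")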

Now to the bit that you can't find on the abn website...the package has recently been changed a bit but the documentation hasn't been updated. Mostly, it's OK, but when plotting the posterior density estimates it all goes wrong. A quick, helpful email exchange with Fraser Lewis, the author of the package, and I had the right code (yes, I could have worked it out, but I am not good at writing loops!):

library(Cairo);
CairoPDF("abnNoxious6ParentsMarginals.pdf");
# loop over each node, then over each of its parameters,
# plotting the posterior marginal density for each
for(i in 1:length(cons.mod6nox$marginals)){
  onenode <- cons.mod6nox$marginals[[i]];
  for(j in 1:length(onenode)){
    plot(onenode[[j]], type="l", main=names(onenode)[j]);
  }
}
dev.off();

Then I modified the .dot file so that all the labels were as I liked them, with proper publication-ready names, and with the arcs labelled with median effect sizes: dashed to indicate negative medians, solid to indicate positive ones.


And this is the Graphviz code to make the diagram (NB the values on the arcs are not actually derived from the data; they're just an example).

digraph dag { 

"Number of states noxious"[shape=diamond, style=filled, fillcolor=grey];
"Minimum residence time"[shape=oval];
"Czech cultivation index"[shape=diamond];
"U.S. cultivation index"[shape=diamond];
"Habitats"[shape=oval];
"Floral zones"[shape=oval];
"Czech altitudinal range"[shape=oval];
"Czech grid squares"[shape=oval];
"Clonality index"[shape=diamond];
"Leaf dry matter content"[shape=oval];
"Seed bank longevity"[shape=oval];
"Specific leaf area"[shape=oval];
"Monocarp"[shape=square];
"Polycarp"[shape=square];
"Annual"[shape=square];
"Shrub"[shape=square];
"Tree"[shape=square];
"Competitive \n Grime strategy"[shape=square];
"Stress tolerant \n Grime strategy"[shape=square];
"Ruderal \n Grime strategy"[shape=square];
"Height"[shape=diamond];
"Flowering period"[shape=diamond];
"Popagule length"[shape=diamond];
"Number of \n pollination vectors"[shape=oval];
"Number of \n dispersal vectors"[shape=oval];
"Self pollinated"[shape=square];
"Insect pollinated"[shape=square];
"Wind pollinated"[shape=square];
"Ant dispersed"[shape=square];
"Exozoochory"[shape=square];
"Endozoochory"[shape=square];
"Wind dispersed"[shape=square];
"Water dispersed"[shape=square];
"Self dispersed"[shape=square];



"Minimum residence time"->"Number of states noxious" [label="0.48"];
"Czech cultivation index"->"U.S. cultivation index" [label="0.12"];
"U.S. cultivation index"->"Number of states noxious" [label="0.89"];
"U.S. cultivation index"->"Minimum residence time" [label="0.26"];
"Habitats"->"Czech altitudinal range" [arrowhead=none, arrowtail=none, label="0.25"];
"Habitats"->"Czech grid squares" [arrowhead=none, arrowtail=none, label="0.60"];
"Czech grid squares"->"Minimum residence time" [label="0.25"];
"Czech grid squares"->"U.S. cultivation index" [label="0.20"];
"Czech grid squares"->"Floral zones" [arrowhead=none, arrowtail=none, label="0.38"];
"Czech grid squares"->"Czech altitudinal range" [arrowhead=none, arrowtail=none, label="0.37"];
"Clonality index"->"Wind pollinated" [arrowhead=none, arrowtail=none, label="0.45"];
"Leaf dry matter content"->"Number of states noxious" [style=dashed, label="-0.45"];
"Leaf dry matter content"->"Specific leaf area" [arrowhead=none, arrowtail=none, style=dashed, label="-0.30"];
"Seed bank longevity"->"Habitats" [label="0.50"];
"Seed bank longevity"->"Czech grid squares" [label="0.27 "];
"Seed bank longevity"->"Popagule length" [style=dashed, label="-0.33"];
"Specific leaf area"->"Self pollinated" [arrowhead=none, arrowtail=none, label="0.46"];
"Polycarp"->"Clonality index" [label="0.84"];
"Polycarp"->"Height" [style=dashed, label="-1.05"];
"Annual"->"Czech cultivation index" [style=dashed, label="-1.93"];
"Annual"->"Clonality index" [style=dashed, label="-1.51"];
"Annual"->"Height" [style=dashed, label="-1.64"];
"Annual"->"Self pollinated" [label="1.73"];
"Annual"->"Ant dispersed" [style=dashed, label="-1.12"];
"Annual"->"Endozoochory" [style=dashed, label="-2.31"];
"Shrub"->"Czech cultivation index" [label="0.66"];
"Shrub"->"Leaf dry matter content" [label="0.79"];
"Shrub"->"Height" [label="0.64"];
"Shrub"->"Exozoochory" [style=dashed, label="-3.91"];
"Shrub"->"Endozoochory" [label="3.27"];
"Shrub"->"Wind dispersed" [style=dashed, label="-1.68"];
"Shrub"->"Water dispersed" [style=dashed, label="-22.77"];
"Tree"->"Number of states noxious" [style=dashed, label="-2.62"];
"Tree"->"Czech cultivation index" [label="1.35"];
"Tree"->"Leaf dry matter content" [label="0.76"];
"Tree"->"Height" [label="2.43"];
"Tree"->"Flowering period" [style=dashed, label="-0.64"];
"Tree"->"Self pollinated" [style=dashed, label="-1.94"];
"Tree"->"Exozoochory" [style=dashed, label="-3.00"];
"Tree"->"Water dispersed" [style=dashed, label="-22.55"];
"Competitive \n Grime strategy"->"Habitats" [label="0.71"];
"Competitive \n Grime strategy"->"Floral zones" [style=dashed, label="-0.54"];
"Competitive \n Grime strategy"->"Popagule length" [arrowhead=none, arrowtail=none, label="0.54"];
"Stress tolerant \n Grime strategy"->"Habitats" [style=dashed, label="-0.32"];
"Stress tolerant \n Grime strategy"->"Popagule length" [arrowhead=none, arrowtail=none, style=dashed, label="-0.47"];
"Height"->"Czech grid squares" [label="0.03"];
"Height"->"Competitive \n Grime strategy" [label="5.16"];
"Height"->"Stress tolerant \n Grime strategy" [arrowhead=none, arrowtail=none, style=dashed, label="-0.48"];
"Height"->"Ruderal \n Grime strategy" [arrowhead=none, arrowtail=none, style=dashed, label="-1.45"];
"Height"->"Popagule length" [arrowhead=none, arrowtail=none, label="0.03"];
"Height"->"Wind pollinated" [arrowhead=none, arrowtail=none, label="0.16"];
"Height"->"Ant dispersed" [arrowhead=none, arrowtail=none, style=dashed, label="-2.14"];
"Flowering period"->"Floral zones" [label="0.18"];
"Flowering period"->"Seed bank longevity" [arrowhead=none, arrowtail=none, label="0.23"];
"Flowering period"->"Insect pollinated" [arrowhead=none, arrowtail=none, label="0.41"];
"Number of \n dispersal vectors"->"Seed bank longevity" [arrowhead=none, arrowtail=none, label="0.22"];
"Self pollinated"->"Competitive \n Grime strategy" [arrowhead=none, arrowtail=none, style=dashed, label="-1.37"];
"Self pollinated"->"Ruderal \n Grime strategy" [arrowhead=none, arrowtail=none, label="0.66"];
"Self pollinated"->"Number of \n pollination vectors" [label="1.83"];
"Insect pollinated"->"Clonality index" [arrowhead=none, arrowtail=none, style=dashed, label="-0.32"];
"Insect pollinated"->"Number of \n pollination vectors" [label="1.74"];
"Insect pollinated"->"Exozoochory" [arrowhead=none, arrowtail=none, style=dashed, label="-2.03"];
"Insect pollinated"->"Water dispersed" [arrowhead=none, arrowtail=none, style=dashed, label="-1.25"];
"Wind pollinated"->"Number of states noxious" [label="0.89"];
"Wind pollinated"->"Leaf dry matter content" [arrowhead=none, arrowtail=none, label="0.79"];
"Wind pollinated"->"Number of \n pollination vectors" [label="1.75"];
"Ant dispersed"->"Competitive \n Grime strategy" [arrowhead=none, arrowtail=none, label="1.28"];
"Ant dispersed"->"Stress tolerant \n Grime strategy" [arrowhead=none, arrowtail=none, label="0.89"];
"Ant dispersed"->"Number of \n dispersal vectors" [label="1.17"];
"Ant dispersed"->"Self pollinated" [arrowhead=none, arrowtail=none, label="1.05"];
"Ant dispersed"->"Insect pollinated" [arrowhead=none, arrowtail=none, label="2.11"];
"Ant dispersed"->"Wind pollinated" [arrowhead=none, arrowtail=none, style=dashed, label="-2.70"];
"Exozoochory"->"Number of states noxious" [label="0.74"];
"Exozoochory"->"Popagule length" [arrowhead=none, arrowtail=none, label="0.38"];
"Exozoochory"->"Number of \n dispersal vectors" [label="1.03"];
"Exozoochory"->"Self pollinated" [arrowhead=none, arrowtail=none, style=dashed, label="-0.57"];
"Exozoochory"->"Wind pollinated" [arrowhead=none, arrowtail=none, label="2.30"];
"Endozoochory"->"Height" [arrowhead=none, arrowtail=none, style=dashed, label="-0.91"];
"Endozoochory"->"Popagule length" [arrowhead=none, arrowtail=none, label="0.43"];
"Endozoochory"->"Number of \n dispersal vectors" [label="0.78"];
"Endozoochory"->"Insect pollinated" [arrowhead=none, arrowtail=none, label="3.82"];
"Endozoochory"->"Wind pollinated" [arrowhead=none, arrowtail=none, style=dashed, label="-1.58"];
"Wind dispersed"->"Number of \n dispersal vectors" [label="0.85"];
"Water dispersed"->"Leaf dry matter content" [arrowhead=none, arrowtail=none, style=dashed, label="-0.58"];
"Water dispersed"->"Competitive \n Grime strategy" [arrowhead=none, arrowtail=none, style=dashed, label="-1.80"];
"Water dispersed"->"Number of \n dispersal vectors" [label="1.15"]; 

}



Friday, June 21, 2013

A sad week

The Charles University Department of Ecology and Institute of Botany lab group, and the invasion ecology community, were hit this week by the sad passing of Prof. Vojta Jarošík. A prolific ecologist: check out his publication record and re-read some of his work. I was not fortunate enough to get to know him well, because his illness began around the time I started working for him. Sad times.

Imputation of missing data with Amelia II - some notes

As I mentioned in my previous post on imputation of missing data, I found the Amelia II package for R to be user-friendly and effective. The procedure is nicely documented, but here is some of my R code and the outputs to illustrate.

# load Amelia package (I used the developer version)
  install.packages("Amelia", repos="http://r.iq.harvard.edu", type = "source")
  library(Amelia)

# load data
  dat <- read.csv("CEplants_database_13.05.13.csv")

# change categorical variables into factors not integers
  dat$cz.cult <- as.factor(dat$cz.cult)
  dat$us.cult <- as.factor(dat$us.cult)
  dat$dna.whole.cat <- as.factor(dat$dna.whole.cat)
  dat$dna.haploid.cat <- as.factor(dat$dna.haploid.cat)
  dat$alt.range.cz <- as.numeric(dat$alt.range.cz) # continuous, so numeric rather than factor
  dat$clonal.index <- as.factor(dat$clonal.index)
  dat$str.comepetitive <- as.factor(dat$str.comepetitive)
  dat$str.stress<- as.factor(dat$str.stress)
  dat$str.ruderal<- as.factor(dat$str.ruderal)

# drop some ID variables and others you don't need
  num <- dat[,c(1, 8:14, 16:57, 59:61)] 

At this point I did a test for multivariate normality (an assumption of Amelia) and visually inspected histograms of the continuous variables with either no transformation, a natural log transformation, or a square-root transformation (see the sketch below). If a transformation made the data more normal, I noted this for later.
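The visual inspection only needs a few lines (a sketch; 'sla' stands in for whichever continuous variable you are checking):

# compare raw, log- and square-root-transformed distributions side by side
par(mfrow = c(1, 3))
hist(dat$sla, main = "raw", xlab = "sla")
hist(log(dat$sla), main = "log", xlab = "log(sla)")
hist(sqrt(dat$sla), main = "sqrt", xlab = "sqrt(sla)")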

# check for highly correlated variables (these might need to be removed), load Hmisc
  library(Hmisc)
# Pull out numeric variables only and do a Spearman's correlation test with P-values
  numeric <- num[,c(2:5, 8:16, 29:31, 33:34, 52:53)]
  corR <- rcorr(as.matrix(numeric), type="spearman")
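To pull out the problem pairs rather than eyeballing the whole matrix, something like this works (a little helper of my own, using the |rho| > 0.9 cut-off I apply below):

# list variable pairs whose absolute Spearman correlation exceeds 0.9
high <- which(abs(corR$r) > 0.9 & upper.tri(corR$r), arr.ind = TRUE)
data.frame(var1 = rownames(corR$r)[high[, 1]],
           var2 = colnames(corR$r)[high[, 2]],
           rho = corR$r[high])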

# remove any highly correlated ones and any variables that are pseudo-replicated (i.e. any dummy variables derived from others in the data set)
  sub <- num[,c(1:3, 5:21, 25:27, 29:34, 37:41, 44:50)]

# tell Amelia which variables to log transform
 log <- c("states.pres", "state.nox", "sla", "height", "flower.period", "popagule.length")

# tell Amelia which variables to square-root transform
  sqrt <- c("mrt", "habitats", "flor.zone", "alt.range.cz", "ldmc")

# tell Amelia which variables are nominal categorical
nom <- c("monocarp", "polycarp","annual", "shrub", "tree",
"str.comepetitive","str.stress","str.ruderal","ploidy","sex.type","self.pol","insect.pol","wind.popl","ant.dispersed","exoz.dispersed","endoz.dispersed","wind.dispersed","water.dispersed","self.dispersed")


# tell Amelia which variables are ordered categorical
  ord <- c("cz.cult", "us.cult", "clonal.index")

# run a multiple imputation, making 5 new imputed data sets
  a.out <- amelia(sub,
                  m = 5,
                  p2s = 1,
                  idvars = "taxon",
                  logs = log,
                  sqrts = sqrt,
                  noms = nom,
                  ords = ord)

# run the multiple imputation again, this time with a ridge prior set to 1% of the
# total number of records (rows) in the dataset. This may improve the evenness of
# the number of EM cycles/iterations across the m = 5 imputation runs; ideally
# they should be quite similar.
  a.out1 <- amelia(sub,
                   m = 5,
                   p2s = 1,
                   idvars = "taxon",
                   logs = log,
                   sqrts = sqrt,
                   noms = nom,
                   ords = ord,
                   empri = .01*nrow(sub))

# Check the density plots
par(mfrow =c(2,2))
  compare.density(a.out, var = 4)
  compare.density(a.out, var = 8)
  compare.density(a.out, var = 13)
  compare.density(a.out, var = 14)

There are a bunch of other tests you can do to see how well the imputation has worked. These are all covered well in the documentation, so I'm not including them here. I was planning to show the plots and explain the interpretation, but time has moved on and I'm now working on something else, so I'm in a different head space this week.

One note: I found that using the ridge prior didn't make much difference to the number of iterations each 'run' was doing. The documentation says that the iterations should be similar for each imputation, but I found mine weren't. Not sure what else to do about it. May have to live with it.
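One thing worth trying: if I read the documentation right, the amelia output keeps an iteration history for each chain in iterHist, so something like this should show how uneven the runs were (hedged and untested):

# compare how many EM iterations each of the m chains took
# (iterHist is, I believe, a list with one entry per imputation)
sapply(a.out1$iterHist, nrow)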



Thursday, June 6, 2013

Imputation - a difficult few days

It's become clear that I'll need to impute missing values in my dataset in order to sensibly implement further analyses. Case-wise deletion is not a good option because there are many variables with missing values, so deleting down to complete cases reduced the dataset from some 467 incomplete entries to 177 complete entries - a stats-crime.

As always, there seem to be quite a lot of options for imputing in R. I'm not going to get into the methods of imputation that are implemented because there is a lot of information available on individual approaches to imputation at the click of a button. Instead I will just list the packages that I looked at (certainly not an exhaustive list), outline what they can implement and then also make any notes that I find relevant.


Hmisc: Harrell Miscellaneous - I found the documentation for these functions overwhelming - too much (14 LaTeX-typeset pages of instructions, including the R example code...14!). So I decided instead to continue my search in the hope of finding packages with shorter instructions. But here is what it can do:
  • impute: Single imputation. Simple imputation using constants, a specified single-valued function of the non-NAs, or a sample (with replacement) from the non-NA values. The print method places * after variable values that were imputed. Needs a vector or a transcan object as input. Can take a transcan object and use the imputed values created by transcan (with imputed=TRUE) to fill in NAs.
  • transcan: Single imputation. Needs (a) a matrix of continuous variables and dummy-coded categorical variables, or (b) a formula calling a data frame, in which case factor variables don't need to be dummy-coded, just specified as factors. Does best-guess imputation (medians, modes).
  • transcan specifying n.impute: Approximate multiple imputation (only approximate because it freezes the imputation model before drawing the multiple imputations rather than using different estimates of regression coefficients for each imputation). Continuous variables: sampling with replacement (Bayesian bootstrapping), doesn't assume normality or symmetry of residuals. Categorical variables: best guess. Can use fit.mult.impute to re-fit any model n.impute times based on n.impute completed datasets
  • aregImpute: Multiple imputation. Additive regression, bootstrapping and predictive mean matching. It draws predicted values from a full Bayesian predictive distribution, and uses predictive mean matching with optional weighted probability sampling of donors rather than using only the closest match. (A minimal usage sketch follows this list.)
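Here's roughly what the aregImpute route looks like (a sketch with variable names borrowed from my dataset; not code I actually ran):

library(Hmisc)
# multiple imputation via additive regression and predictive mean matching
f <- aregImpute(~ sla + height + mrt + habitats, data = num, n.impute = 5)
# re-fit a model on each completed dataset and pool the results
fit <- fit.mult.impute(sla ~ height + mrt, lm, f, data = num)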

imputation
  • Single imputation
  • Fills missing values in a data matrix (or a data frame) using mean, k-Nearest Neighbour (kNN), SVD, Singular Value Threshold, Boosted Trees Imputation, Locally weighted least squares.
  • Feels like the ability of this to implement boosted trees is advantageous because boosted trees can handle data sets with categorical and numerical variables. 
  • However, I ran this and it gave an error message that made me sad: Error in checkMissing(x, y) : Missing values are not allowed in the response (I'm not sure what it's using as a response variable). And: Warning messages: 1: In gbmImpute(dat, max.iters = 20, cv.fold = 0, n.trees = 100,  ... : Tree based imputation works best with larger data (> 1000 obs). So it looks like I can't use this on my data set...moving on.

mix: Estimation/multiple Imputation for Mixed Categorical and Continuous Data - two dark days spent messing with this on and off...
  • Involved tediously re-coding categorical and binary variables to be numeric, always starting at 1 and not 0 (all my binaries were coded as 0 and 1).
  • Also need to re-arrange data frame so that all categorical variables are in the first x columns.
  • I found the functions difficult to use (you need to work through two preparatory functions [prelim.mix, em.mix] before you can impute a data set) and it kept crashing my computer. I couldn't impute using the whole data set at once.
  • Actually...I'm probably being unfair here...I think there were multicollinearity and other issues in my data that stopped this from working...BUT the point is that mix's documentation and error messages didn't help me identify the problems. The Amelia II documentation did (see below).

Amelia II - a breath of fresh air compared to everything else I've tried. It can be run through R or via a standalone GUI (I didn't try this), which is a nice feature for non-R users. It also has intelligible documentation.

  • Multiple imputation (could do single by setting m=1)
  • Assumes multivariate normality
  • Assumes missing at random (can be made more plausible by including more variables)
  • EMB algorithm combines the classic EM algorithm with a bootstrap approach to take draws from the posterior (Honaker and King 2010)
  • Returns data in the scale and form in which it was input (you can tell Amelia to apply natural log or square root transformations to make data more multivariate normal, but these will be removed in the returned data set)
  • I had to remove two highly correlated variables (Spearman's correlation coefficient was >0.9; latitude range and longitude range) because it gave the error: "The resulting variance matrix was not invertible.  Please check your data for highly collinear variables."
  • I had to remove some nominal categorical variables because they were derived from a bunch of other binary variables (I was trying to reduce the number of binary variables, you see). Including both the original binary variables and the nominal variables derived from them means using the same information twice, which is presumably bad and creates issues. I think this double use was giving the error: "Error in La.svd(x, nu, nv) : error code 1 from Lapack routine 'dgesdd'", though I'm not 100% sure.
  • This method was computationally very fast. Also some diagnostics are available to check things are sensible.

All in all, this has taken a while, and there is still checking to do; it also looks like you can mess with the priors. But I now have 5 imputed data sets thanks to Amelia II.
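For the record, Amelia takes observation-level priors through its priors argument: a matrix with one row per prior, giving the row and column of the missing cell plus a prior mean and standard deviation (a hedged sketch; the cell indices and values here are invented):

# put a prior of mean 5.2, sd 0.5 on the missing cell at row 12, column 3
pr <- matrix(c(12, 3, 5.2, 0.5), nrow = 1, ncol = 4)
a.out.pr <- amelia(sub, m = 5, idvars = "taxon", priors = pr)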


Thursday, May 30, 2013

Comparing methods for Bayesian networks in R

I want to use Bayesian networks to look at the structure of the links between my variables. Given the idea behind this project, we don't mind that this is essentially data mining.

There are lots of packages available for R on CRAN that will implement this sort of network analysis. I'm making a summary of what I gather some of the packages can do, the type of data they work on, and how they treat missing values.

This is a work-in-progress post, so I will update it as I do more reading, and also make some notes as things go well or badly!

abn: Data Modelling with Additive Bayesian Networks

  • Determines the most robust empirical model of data from interdependent variables: structure discovery 
  • Equivalent to multivariate generalised linear modelling (including mixed models with random effects)
  • Families modelled: Gaussian, Poisson and Binomial. 
  • Missing data: cannot have missing entries (impute or remove them before analysis)
  • Can have a mix of binary and continuous variables in the data (it doesn't handle true categorical variables; it gets around this by instructing the user to dummy-code each level of a categorical variable as a yes/no binary outcome, then 'ban' links between the dummy variables in a 'ban' matrix. This greatly increases the number of variables modelled, though)
  • Model specification is using two matrices (ban and retain)
  • Outputs dot code for Graphviz

Notes: I have quite a few variables (25 to 30, as this coding method leaves you with dummy binary variables) on 177 data points (this was just a test run; I have a lot more data, but for a quick and dirty run I used na.omit on the data frame, which really took out a lot of the data - imputation would be better for the real thing). A lot of them can hypothetically have many 'parent' nodes, and I believe having lots of parent nodes increases the computation time. I set it to a maximum of 4 parent nodes. After 6 hours the search was not finished, so I decided to kill the analysis. This was premature, but I got impatient, seeing as it was effectively meant to be a dummy run. This could be very time consuming if you want to do a few models or if something goes wrong. Must be more patient.

bnlearn: Bayesian network structure learning, parameter learning and inference
  • Model specification for directed graphs is the same as for the deal package - it's like graph-model notation.
  • Arcs between variables can be made directional using the function set.arc(), rather than the matrix layout of the abn package (see the small sketch below).
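A tiny sketch of that style (node names borrowed from my data; not a fitted model):

library(bnlearn)
# build an empty graph over three nodes, then direct one arc by hand
dag <- empty.graph(c("mrt", "states.pres", "habitats"))
dag <- set.arc(dag, from = "mrt", to = "states.pres")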
To be continued...

Monday, May 27, 2013

Drawing path diagrams - options part 2

Having said in my last post that I wasn't going to use R/Graphviz, I've spent the last couple of days investigating Graphviz! I like to change my mind sometimes.

After some thought, I decided that the way Graphviz will automatically lay out and create the path diagram based on the specified links had to be worth looking into. The TikZ/LaTeX method is more labour-intensive in some ways because it doesn't automatically decide things like arrow-head placement.

I couldn't instantly figure out how to make the Graphviz GUI work! I think I was quite tired. But it's simple (on Windows at any rate): install it from this website, open it up, and then paste some dot code into the GUI. Then press the 'running man' icon. It should tell you if and where there are syntax errors.

You can get dot code from R using the sem package and the function pathDiagram(). However, I found that the messing around involved in getting to that stage with sem in R was not worth my time (as I don't want to use sem for the analysis); I kept getting an error message at earlier stages in the process (pathDiagram needs a fitted sem model object, so you have to get that working first). So I just looked at some example dot code for structural equation models (SEM) and worked out by trial and error how to modify it.

For my needs the Graphviz syntax is quite simple. Here is a quickly made-up example of what I managed to achieve.

digraph "sem.wh.1" {
rankdir=LR;
size="24,24";
node [fontname="Helvetica" fontsize=10 shape=box];
edge [fontname="Helvetica" fontsize=10];
center=1;

"mrt" -> "states.pres" [label=""];
"us.cult" -> "states.pres" [label=""];
"flor.zone" -> "states.pres" [label=""];
"squares.cz" -> "states.pres" [label=""];
"flower.period" -> "states.pres" [label=""];
"dispersal.vectors2" -> "states.pres" [label=""];
"pollination.vectors2" -> "states.pres" [label=""];
"propagule length" -> "states.pres" [label=""];
"grime" -> "states.pres" [label=""];
"life.span.ingolf" -> "states.pres" [label=""];

"cz.cult" -> "mrt" [label=""];
"habitats" -> "mrt" [label=""];
"flor.zone" -> "mrt" [label=""];
"alt.range.cz" -> "mrt" [label=""];
"life.span.ingolf" -> "mrt" [label=""];
"growth.form" -> "mrt" [label=""];
"grime" -> "mrt" [label=""];
"flower.period" -> "mrt" [label=""];
"flower.period" -> "mrt" [label=""];


"habitats" -> "cz.cult" [label=""];
"flor.zone" -> "cz.cult" [label=""];
"alt.range.cz" -> "cz.cult" [label=""];
"squares.cz" -> "cz.cult" [label=""];
"sla" -> "cz.cult" [label=""];
"life.span.ingolf" -> "cz.cult" [label=""];
"growth.form" -> "cz.cult" [label=""];

"cz.cult" -> "us.cult" [label=""];
"habitats" -> "us.cult" [label=""];
"flor.zone" -> "us.cult" [label=""];
"alt.range.cz" -> "us.cult" [label=""];
"squares.cz" -> "us.cult" [label=""];

"flor.zone" -> "habitats" [label=""];
"alt.range.cz" -> "habitats" [label=""];
"squares.cz" -> "habitats" [label=""];
"ssb.range" -> "habitats" [label=""];
"life.span.ingolf" -> "habitats" [label=""];
"growth.form" -> "habitats" [label=""];
"grime" -> "habitats" [label=""];
"height" -> "habitats" [label=""];
"flower.period" -> "habitats" [label=""];
"dispersal.vectors2" -> "habitats" [label=""];



}



It's ok. It's getting quite complicated already though. Doing all this has forced me to really think about what I'm trying to model here. There are so many hypothesised links that it might be better to use some kind of machine-learning network analysis to look at the strong links instead. Something to look into.