Monday, April 29, 2013

The importance of meta-data

Creating and maintaining accurate meta-data for a database is important, but often overlooked.

Meta-data is especially important when working on large databases with many collaborators. You may know what you've done, but to others it's not always obvious. Equally, come back to your database after a few months away and you may not remember what you did, what units the data are in, or where they came from.

I am working on a large collaborative project and received a database with relatively little meta-data. I am now spending the day deciphering what each variable is and will probably have to consult my collaborators for more information at some point.

If you're using Excel, it's often easiest to add another worksheet tab titled 'meta-data' to the file, rather than listing the meta-data in a separate document that could easily become detached from the database.

Here, then, is a basic outline of the meta-data that should always be included.

1. A few lines describing who created the database and when, who it was received from if it was emailed to you, and what it is about.

2. A list of variable names as they appear in the raw database; a brief description of each variable; the coding or units; the type of variable (categorical, continuous, binary, percentage, etc.); whether it's a response or explanatory variable; and the shortened name used for the variable in any R scripts.
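To make point 2 concrete, here is a minimal sketch of what such a meta-data table might look like if built in R rather than in a worksheet tab. The variable names and descriptions are made up purely for illustration; in practice the rows would mirror the columns of your own database.

```r
# Sketch of a meta-data table; every name here is a hypothetical example
meta <- data.frame(
  raw.name    = c("SppRich2011", "MAT_degC", "Grazed"),
  description = c("Plant species richness per plot, surveyed 2011",
                  "Mean annual temperature",
                  "Whether the plot was grazed"),
  units       = c("count per plot", "degrees C", "1 = grazed, 0 = ungrazed"),
  type        = c("count", "continuous", "binary"),
  role        = c("response", "explanatory", "explanatory"),
  r.name      = c("rich", "mat", "graz"),
  stringsAsFactors = FALSE
)

# write it out so it travels alongside the data
write.csv(meta, "meta-data.csv", row.names = FALSE)
```

Keeping the table in a flat file like this also makes it machine-readable, and it's only a small step from here to the kind of variable table you might put in a paper's supplementary material.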

I often find that this sort of meta-data is useful when writing a paper and rarely a waste of time to compile: you end up with an almost ready-to-go table that can be added to the paper or its supplementary information. When I read papers with lots of variables I find such tables very helpful.

Thursday, April 18, 2013

Growing your own creativity

Since starting to seriously use Twitter a couple of weeks ago to follow like-minded researchers and tweet my own interests, I've begun to feel better informed about ecology and also more creative. These outcomes alone are excellent reasons to be using social media and I'm glad I've taken that step.

Perhaps I shouldn't admit this, but coming up with creative ideas for research is something I've found challenging. Useful creativity requires a deep knowledge of your subject area as well as ideas to advance it in a novel way.

Why has this been a problem for me? When I was younger I enjoyed and even excelled at traditionally "creative" activities such as creative writing, music and art. Yet somehow through university I lost that natural creative feeling. I suspect it was because I felt I should focus all my energy on learning, understanding and memorising in order to get good grades, at the expense of taking time out to really think about things.

I've heard people say that your PhD is a time to really think about an area in depth and to enjoy it because you'll never get this chance again...implying perhaps that this is a time to be creative. However, I didn't find that this was how I tackled my PhD. With the pressure of having to complete it in three years due to funding constraints (everyone gets this one), coupled with the (totally correct) expectation that I should publish parts of my thesis before submission, I found myself feeling rather like I was on a treadmill.

I felt I had no time for "extraneous" things, which in my mind then included Twitter, blogging and time out to relax and contemplate. I used to stuff a sandwich into my face at my desk instead of taking a break and would make furtive dashes to Coffee Culture to grab a flat white, which I would get to take-away, because of the need to get back to work.

This probably squashed any creativity in me and it's totally my own doing!

I've decided to try to nurture my creativity and original thought. Even just admitting I need to do this has freed me to think in a less goal-orientated way for at least a little bit of the day.

Watch this space.

Thursday, April 4, 2013

Logistic regression in BUGS: model fit and performance using Chi-square and deviance

Having narrowly escaped being an April-fools baby, this week saw me celebrating my birthday (no, I won't say which one!). The great thing about being an Easter birthday is that I can legitimately scoff a lot of chocolate.

Project-wise, I've managed to code two full path models using my PhD data and test the independence claims in each model, as well as the overall models, following some methods in Clough (2012). This has been quite an achievement for me. The models run in R, though I can't say whether they're totally correct or not yet. This exercise has been useful because I've had to code logistic (Bernoulli), Poisson and regular Normal linear regression models in BUGS. I've also had to tackle some issues around data standardisation prior to modelling.
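On the standardisation point, a minimal sketch of the usual approach (centring each explanatory variable and scaling by its standard deviation, so coefficients are comparable and the sampler tends to mix better). The variable `x.raw` here is just an illustrative example, not one of my actual variables.

```r
# hypothetical explanatory variable, for illustration only
x.raw <- rnorm(50, mean = 20, sd = 5)

# standardise by hand: centre, then scale by the standard deviation
x.std <- (x.raw - mean(x.raw)) / sd(x.raw)

# or equivalently with R's built-in scale()
x.std2 <- as.numeric(scale(x.raw))
```

After fitting, the coefficients are on the standardised scale, so they need back-transforming if you want effects per original unit.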

Another issue I wanted to resolve was how to calculate something akin to Bayesian p-values, or a measure of fit, for a BUGS logistic regression model. Here is some code, provided by Richard Duncan, that does just this using a Chi-square goodness-of-fit test and a deviance measure. However, I haven't managed to get this working on my own data yet.

# load packages
  library(rjags)
  library(R2jags)
  library(mcmcplots)
  load.module("glm")          # loads the glm module for JAGS, which
                              # may help with convergence
  
# set working directory
setwd("c:\\pgrad\\kirsty mcgregor")

# generate some data
  unlogit <- function(x) exp(x)/(1+exp(x))
  
# an explanatory variable
  x <- seq(1, 10, 0.1)
# probability as a function of x on the logit scale  
  p <- unlogit(-1 + 0.3*x)
  plot(p ~ x)
  
# now generate some bernoulli data using this probability info
  y <- rbinom(length(p), 1, p)
  y
  
# logistic regression model in R
  summary(m1 <- glm(y ~ x, family=binomial))
  
  N <- length(y)
  
# in JAGS
  mod <- "model
  {
  for(i in 1:N) {
    y[i] ~ dbern(p[i])
    logit(p[i]) <- b0 + b1*x[i]

# calculate goodness of fit statistic for logistic regression model using data
    predicted[i] <- p[i]
    res.y[i] <- ((y[i] - predicted[i]) / sqrt(predicted[i]*(1-predicted[i])))^2

# calculate goodness of fit statistic for logistic regression model using new predicted data
    y.rep[i] ~ dbern(p[i])
    res.y.rep[i] <- ((y.rep[i] - predicted[i]) / sqrt(predicted[i]*(1-predicted[i])))^2
  }

  fit <- sum(res.y[])           # test statistic for data   
  fit.rep <- sum(res.y.rep[])   # test statistic for new predicted data   
  test <- step(fit.rep - fit)   # Test whether new data set more extreme
  bpvalue <- mean(test)   # Bayesian p-value 

  #priors
  b0 ~ dnorm(0, 0.0001)
  b1 ~ dnorm(0, 0.0001)
}"


  # write model
  write(mod, "modelRD.txt")
  
  set.seed(rnbinom(1, mu=200, size=1))   # note: a randomly drawn seed, so runs won't be reproducible
  mod <- jags(model.file = "modelRD.txt",
              data = list(N=N, y=y, x=x),
              inits = function() list(b0=rnorm(1), b1=rnorm(1)),
              parameters.to.save = c("b0", "b1", "bpvalue"),
              n.chains = 3,
              n.iter = 20000,
              n.burnin = 10000)
  

# put output into mcmc form and plot chains  
  out <- as.mcmc(mod)
  mcmcplot(out, parms=c("b0", "b1"))

# save and view output  
  all.sum <- mod$BUGSoutput$summary
  all.sum
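For the deviance side of things, R2jags monitors "deviance" automatically when DIC computation is on (the default), so the posterior deviance and DIC can be pulled straight from the fitted object without adding anything to the model code. This is a sketch that assumes the `mod` object from the run above:

```r
# posterior summary of the deviance monitored by R2jags
mod$BUGSoutput$summary["deviance", ]

# DIC, computed by R2jags from the monitored deviance
mod$BUGSoutput$DIC
```

Lower deviance (and DIC) indicates better fit, which is handy alongside the Bayesian p-value when comparing candidate models.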


Reading this week:
Clough, Y. 2012. A generalized approach to modeling and estimating indirect effects in ecology. Ecology 93:1809–1815. http://dx.doi.org/10.1890/11-1899.1

Imai, K., Keele, L., and Yamamoto, T. 2010. Identification, inference and sensitivity analysis for causal mediation effects. Statistical Science 25: 51–71. http://dx.doi.org/10.1214/10-STS321

Morin, L., Paini, D.R., and Randall, R.P. 2013. Can global weed assemblages be used to predict future weeds? PLoS ONE 8(2): e55547. http://dx.doi.org/10.1371/journal.pone.0055547

Bechara, F.C., Reis, A., Bourscheid, K., Vieira, N.K., and Trentin, B.E. 2013. Reproductive biology and early establishment of Pinus elliottii var. elliottii in Brazilian sandy coastal plain vegetation: implications for biological invasion. Scientia Agricola 70: 88-92. Available at: http://www.scielo.br/pdf/sa/v70n2/05.pdf