As I mentioned in my previous post on imputation of missing data, I found the Amelia II package for R to be user-friendly and effective. the procedure is documented in a nice way, but here is some of my R code and the outputs to illustrate.
# load Amelia package (I used the developer version)
install.packages("Amelia", repos="http://r.iq.harvard.edu", type = "source")
library(Amelia)
# load data
dat <- read.csv("CEplants_database_13.05.13.csv")
# change categorical variables into factors not integers
dat$cz.cult <- as.factor(dat$cz.cult)
dat$us.cult <- as.factor(dat$us.cult)
dat$dna.whole.cat <- as.factor(dat$dna.whole.cat)
dat$dna.haploid.cat <- as.factor(dat$dna.haploid.cat)
dat$alt.range.cz <- as.numeric(dat$alt.range.cz)
dat$clonal.index <- as.factor(dat$clonal.index)
dat$str.comepetitive <- as.factor(dat$str.comepetitive)
dat$str.stress<- as.factor(dat$str.stress)
dat$str.ruderal<- as.factor(dat$str.ruderal)
# drop some ID variables and others you don't need
num <- dat[,c(1, 8:14, 16:57, 59:61)]
At this point I did a test for multivariate normality (an assumption of Amelia) and visually inspected histograms of the continuous variables with either no transformation, natural log transformation, or square-root transformation. If the data were more normal through transformation, I noted this for later.
# check for highly correlated varaibles (these might need to be removed), load Hmisc
library(Hmisc)
# Pull out numeric variables only and do a Spearman's correlation test with P-values
numeric <- num[,c(2:5, 8:16, 29:31, 33:34, 52:53)]
corR <- rcorr(as.matrix(numeric), type="spearman")
# remove any highly correlated ones and any variables that are pseudo-replicated (i.e. any dummy variables derived from others in the data set)
sub <- num[,c(1:3, 5:21, 25:27, 29:34, 37:41, 44:50)]
# tell Amelia which variables to log transform
log <- c("states.pres", "state.nox", "sla", "height", "flower.period", "popagule.length")
# tell Amelia which variables to square-root transform
sqrt <- c("mrt", "habitats", "flor.zone", "alt.range.cz", "ldmc")
# tell Amelia which variables are nominal categorical
nom <- c("monocarp", "polycarp","annual", "shrub", "tree",
"str.comepetitive","str.stress","str.ruderal","ploidy","sex.type","self.pol","insect.pol","wind.popl","ant.dispersed","exoz.dispersed","endoz.dispersed","wind.dispersed","water.dispersed","self.dispersed")
# tell Amelia which variables are ordered categorical
ord <- c("cz.cult", "us.cult", "clonal.index")
# run a multiple imputation, making 5 new imputed data sets
a.out <- amelia(sub,
m = 5,
p2s = 1,
idvars = "taxon",
logs = log,
sqrts = sqrt,
noms = nom,
ords = ord
)
# run a multiple imputation, making 5 new imputed data sets, using a ridge prior which should be 1% of the total number of records (rows) in the dataset, and may improve the evenness of the number of cycles/iterations in each imputation (n=5) run. Ideally they should be quite similar.
a.out1 <- amelia(sub,
m = 5,
p2s = 1,
idvars = "taxon",
logs = log,
sqrts = sqrt,
noms = nom,
ords = ord,
empri = .01*nrow(sub)
)
# Check the density plots
par(mfrow =c(2,2))
compare.density(a.out, var = 4)
compare.density(a.out, var = 8)
compare.density(a.out, var = 13)
compare.density(a.out, var = 14)
There are a bunch of other tests you can do to see how well the imputation has worked. These are all covered well in the documentation, so I'm not including them here. I was planning to show the plots and explain the interpretation, but, I feel that time has moved on and I'm now working on something else, so in a different head space this week.
One note: I found that using the ridge prior didn't make much difference to the number of iterations each 'run' was doing. The documentation says that the iterations should be similar for each imputation, but I found mine weren't. Not sure what else to do about it. May have to live with it.
Hi,
ReplyDeleteThank you for the helpful post. Could you let me know what this: # change categorical variables into factors not integers does?
I am using Amelia and it detected factor variables in my data set on its own, so I put them into "ord" category without changing anything (i.e without coding them into factors beforehand).
Then I got an error:
"Error in amcheck(x = x, m = m, idvars = numopts$idvars, priors = priors, :
The variable q22g is perfectly collinear with another variable in the data.
The variable NA is perfectly collinear with another variable in the data.
The variable NA is perfectly collinear with another variable in the data."
So I did the Spearman test and I cannot see that any numeric variable is highly correlated.... (Additionally to run it I needed to remove NAs from all the rows...)
This comment has been removed by the author.
DeleteAsiack, did you ever find a solution to this problem? I am also having the same problem--perfect collinearity between all of my variables, though Spearman test reveals imperfect correlations. Frustratingly, I have very similar datasets that work with the exact same scripts. Any solution or hints would be very appreciated.
DeleteFYI, you can remove this error using the argument incheck = FALSE. See https://lists.gking.harvard.edu/pipermail/amelia/2015-January/001131.html for more information from the Amelia authors.
Delete