Friday, June 21, 2013
A sad week
The Charles University Department of Ecology, the Institute of Botany lab group, and the wider invasion ecology community were hit this week by the sad passing of Prof. Vojta Jarošík. He was a prolific ecologist - check out his publication record and re-read some of his work. I was not fortunate enough to get to know him well, because his illness began around the time I started working for him. Sad times.
Imputation of missing data with Amelia II - some notes
As I mentioned in my previous post on imputation of missing data, I found the Amelia II package for R to be user-friendly and effective. The procedure is nicely documented, but here is some of my R code and output to illustrate.
# load Amelia package (I used the developer version)
install.packages("Amelia", repos="http://r.iq.harvard.edu", type = "source")
library(Amelia)
# load data
dat <- read.csv("CEplants_database_13.05.13.csv")
# change categorical variables into factors not integers
dat$cz.cult <- as.factor(dat$cz.cult)
dat$us.cult <- as.factor(dat$us.cult)
dat$dna.whole.cat <- as.factor(dat$dna.whole.cat)
dat$dna.haploid.cat <- as.factor(dat$dna.haploid.cat)
dat$alt.range.cz <- as.numeric(dat$alt.range.cz) # this one is continuous, so make sure it is numeric
dat$clonal.index <- as.factor(dat$clonal.index)
dat$str.comepetitive <- as.factor(dat$str.comepetitive)
dat$str.stress <- as.factor(dat$str.stress)
dat$str.ruderal <- as.factor(dat$str.ruderal)
# drop some ID variables and others you don't need
num <- dat[,c(1, 8:14, 16:57, 59:61)]
At this point I did a test for multivariate normality (an assumption of Amelia) and visually inspected histograms of the continuous variables with either no transformation, a natural log transformation, or a square-root transformation. If a transformation made a variable look more normal, I noted this for later.
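For what it's worth, the visual check was nothing fancy - just histograms of each continuous variable under each candidate transformation. A minimal sketch, using height as the example variable (any positive continuous column in the data works the same way):
# compare raw, log and square-root versions of one variable side by side
par(mfrow = c(1, 3))
hist(num$height, main = "raw", xlab = "height")
hist(log(num$height), main = "log", xlab = "log(height)")
hist(sqrt(num$height), main = "sqrt", xlab = "sqrt(height)")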
# check for highly correlated variables (these might need to be removed); load Hmisc
library(Hmisc)
# Pull out numeric variables only and do a Spearman's correlation test with P-values
numeric <- num[,c(2:5, 8:16, 29:31, 33:34, 52:53)]
corR <- rcorr(as.matrix(numeric), type="spearman")
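Something like this lists the pairs of variables with an absolute correlation above a chosen cut-off (I'm using 0.9 here), so you know which columns to consider dropping in the next step:
# find variable pairs whose absolute Spearman correlation exceeds 0.9
high <- which(abs(corR$r) > 0.9 & upper.tri(corR$r), arr.ind = TRUE)
data.frame(var1 = rownames(corR$r)[high[, 1]],
           var2 = colnames(corR$r)[high[, 2]],
           rho  = corR$r[high])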
# remove any highly correlated ones and any variables that are pseudo-replicated (i.e. any dummy variables derived from others in the data set)
sub <- num[,c(1:3, 5:21, 25:27, 29:34, 37:41, 44:50)]
# tell Amelia which variables to log transform
# (named log.vars/sqrt.vars rather than log/sqrt so the base functions aren't masked)
log.vars <- c("states.pres", "state.nox", "sla", "height", "flower.period", "popagule.length")
# tell Amelia which variables to square-root transform
sqrt.vars <- c("mrt", "habitats", "flor.zone", "alt.range.cz", "ldmc")
# tell Amelia which variables are nominal categorical
nom <- c("monocarp", "polycarp", "annual", "shrub", "tree",
         "str.comepetitive", "str.stress", "str.ruderal", "ploidy", "sex.type",
         "self.pol", "insect.pol", "wind.popl", "ant.dispersed", "exoz.dispersed",
         "endoz.dispersed", "wind.dispersed", "water.dispersed", "self.dispersed")
# tell Amelia which variables are ordered categorical
ord <- c("cz.cult", "us.cult", "clonal.index")
# run a multiple imputation, making 5 new imputed data sets
a.out <- amelia(sub,
                m = 5,
                p2s = 1,
                idvars = "taxon",
                logs = log.vars,
                sqrts = sqrt.vars,
                noms = nom,
                ords = ord)
# run the same multiple imputation (5 imputed data sets) but add a ridge
# (empirical) prior, set to roughly 1% of the total number of records (rows)
# in the data set. This may improve the evenness of the number of EM
# cycles/iterations across the m = 5 imputation runs - ideally they should
# be quite similar.
a.out1 <- amelia(sub,
                 m = 5,
                 p2s = 1,
                 idvars = "taxon",
                 logs = log.vars,
                 sqrts = sqrt.vars,
                 noms = nom,
                 ords = ord,
                 empri = 0.01 * nrow(sub))
# Check the density plots
par(mfrow = c(2, 2))
compare.density(a.out, var = 4)
compare.density(a.out, var = 8)
compare.density(a.out, var = 13)
compare.density(a.out, var = 14)
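The imputed data sets themselves sit in the amelia output object, so you can inspect one directly or write them all to file. Something like this should do it (the file stem "outdata" is just an example name):
# each completed data set is a data frame in a.out$imputations (imp1 ... imp5)
head(a.out$imputations[[1]])
# write all five out as CSV files (outdata1.csv ... outdata5.csv)
write.amelia(a.out, file.stem = "outdata", format = "csv")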
There are a bunch of other checks you can do to see how well the imputation has worked. These are all covered well in the documentation, so I'm not including them here. I had planned to show the plots and explain their interpretation, but time has moved on and I'm now working on something else, so I'm in a different head space this week.
One note: I found that using the ridge prior didn't make much difference to the number of iterations each run took. The documentation says the iterations should be similar for each imputation, but mine weren't. I'm not sure what else to do about it - I may have to live with it.
Thursday, June 6, 2013
Imputation - a difficult few days
It's become clear that I'll need to impute missing values in my dataset in order to sensibly implement further analyses. Case-wise deletion is not a good option because many variables have missing values, so deleting down to complete cases reduced the dataset from some 467 entries to only 177 complete entries - a stats-crime.
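Counting the damage is a one-liner, for what it's worth (dat here being the full data frame):
# how many rows survive case-wise deletion?
nrow(dat)                  # 467 entries in total
sum(complete.cases(dat))   # only 177 with no missing values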
As always, there seem to be quite a lot of options for imputing in R. I'm not going to get into the methods of imputation that are implemented because there is a lot of information available on individual approaches to imputation at the click of a button. Instead I will just list the packages that I looked at (certainly not an exhaustive list), outline what they can implement and then also make any notes that I find relevant.
Hmisc: Harrell Miscellaneous - I found the documentation for these functions overwhelming - too much (14 LaTeX-typeset pages of instructions, including the R example code... 14!). So I decided instead to continue my search in the hope of finding packages with shorter instructions. But here is what it can do:
- impute: Single imputation. Simple imputation using constants, a specified single-valued function of the non-NAs, or a sample (with replacement) from the non-NA values. The print method places * after values that were imputed. Needs a vector or a transcan object as input. Can take a transcan object and use the imputed values created by transcan (with imputed=TRUE) to fill in NAs.
- transcan: Single imputation. Needs (a) a matrix of continuous variables and dummy-coded categorical variables, or (b) a formula calling a data frame, in which case factor variables don't need to be dummy coded and can just be specified as factors. Does best-guess imputation (medians, modes).
- transcan specifying n.impute: Approximate multiple imputation (only approximate because it freezes the imputation model before drawing the multiple imputations, rather than using different estimates of the regression coefficients for each imputation). Continuous variables: sampling with replacement (Bayesian bootstrapping), which doesn't assume normality or symmetry of residuals. Categorical variables: best guess. You can use fit.mult.impute to re-fit any model n.impute times on the n.impute completed datasets.
- aregImpute: Multiple imputation. Additive regression, bootstrapping and predictive mean matching. It draws predicted values from a full Bayesian predictive distribution, and uses predictive mean matching with optional weighted probability sampling of donors rather than using only the closest match. (A minimal sketch of this workflow is below.)
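I didn't go down this road in the end, but for completeness, my reading of the help pages is that the Hmisc multiple-imputation workflow looks roughly like this (a sketch only - the variables and the linear model are illustrative, not something I actually fitted):
library(Hmisc)
# impute with additive regression, bootstrapping and predictive mean matching
a <- aregImpute(~ height + sla + ldmc + habitats, data = dat, n.impute = 5)
# re-fit a model on each of the 5 completed data sets and pool the results
f <- fit.mult.impute(height ~ sla + ldmc + habitats, lm, a, data = dat)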
imputation:
- Single imputation
- Fills missing values in a data matrix (or a data frame) using mean, k-Nearest Neighbour (kNN), SVD, Singular Value Threshold, Boosted Trees Imputation, Locally weighted least squares.
- The ability to use boosted trees feels advantageous, because boosted trees can handle data sets with both categorical and numerical variables.
- However, when I ran this it gave an error message that made me sad: "Error in checkMissing(x, y) : Missing values are not allowed in the response" (I'm not sure what it's using as the response variable). And: "Warning messages: 1: In gbmImpute(dat, max.iters = 20, cv.fold = 0, n.trees = 100, ... : Tree based imputation works best with larger data (> 1000 obs". So it looks like I can't use this on my data set... moving on.
mix:
- Involved tediously re-coding categorical and binary variables to be numeric, always starting at 1 and not 0 (all my binaries were coded as 0 and 1).
- You also need to re-arrange the data frame so that all the categorical variables are in the first x columns.
- I found the workflow (you have to run two other functions [prelim.mix, em.mix] before you get to the function that actually imputes a data set) difficult to use, and it kept crashing my computer. I couldn't impute using the whole data set at once.
- Actually... I'm probably being unfair here... I think there were multicollinearity and other issues in my data that stopped this from working... BUT the point is that mix's documentation and error messages didn't help me identify the problems. The Amelia II documentation did (see below).
Amelia II - a breath of fresh air compared to everything else I've tried. It can also be run through a GUI without R (I didn't try this) or from within R, which is a nice feature for non-R users. It also has intelligible documentation.
- Multiple imputation (could do single by setting m=1)
- Assumes multivariate normality
- Assumes missing at random (can be made more plausible by including more variables)
- The EMB algorithm combines the classic EM algorithm with a bootstrap approach to take draws from the posterior; Honaker and King (2010)
- Returns data in the scale and form in which it was input (you can tell Amelia to apply natural log or square-root transformations to make the data more multivariate normal, but these are undone in the returned data sets)
- I had to remove two highly correlated variables (Spearman's correlation coefficient was >0.9; latitude range and longitude range) because it gave the error: "The resulting variance matrix was not invertible. Please check your data for highly collinear variables."
- I had to remove some nominal categorical variables because they were derived from a bunch of other binary variables (I was trying to reduce the number of binary variables, you see). Including both the original binary variables and the nominal variables derived from them uses the same information twice, which creates problems. I think this double-use was behind the error: "Error in La.svd(x, nu, nv) : error code 1 from Lapack routine 'dgesdd'", though I'm not 100% sure.
- This method was computationally very fast. Also some diagnostics are available to check things are sensible.
All in all, this has taken a while, there is still checking to do, and it looks like you can also play with the priors. But I now have 5 imputed data sets, thanks to Amelia II.