Thursday, June 6, 2013

Imputation - a difficult few days

It's become clear that I'll need to impute missing values in my dataset in order to sensibly implement further analyses. Case-wise deletion is not a good option because so many variables have missing values: deleting down to complete cases only reduced the dataset from some 467 entries to 177 complete entries - a stats-crime.
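For anyone wondering where those numbers come from, this is the quick check I did (a minimal sketch; dat stands in for my data frame):

    nrow(dat)                  # total entries (467 in my case)
    sum(complete.cases(dat))   # entries with no missing values (only 177)
    # complete-case analysis would keep just dat[complete.cases(dat), ]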

As always, there seem to be quite a lot of options for imputation in R. I'm not going to get into the methods themselves because there is plenty of information on individual approaches available at the click of a button. Instead I will just list the packages I looked at (certainly not an exhaustive list), outline what each can do, and add any notes I found relevant.


Hmisc: Harrell Miscellaneous - I found the documentation for these functions overwhelming - too much (14 LaTeX-typeset pages of instructions including the R example code...14!). So I decided instead to continue my search in the hope of finding packages with shorter instructions. But here is what it can do:
  • impute: Single imputation. Simple imputation using constants, a specified single-valued function of the non-NAs, or a sample (with replacement) from the non-NA values. The print method places * after variable values that were imputed. Needs a vector or a transcan object as input. Can take a transcan object and use the imputed values created by transcan (with imputed=TRUE) to fill in NAs.
  • transcan: Single imputation. Needs (a) a matrix of continuous variables and dummy-coded categorical variables or (b) a formula calling a data frame, in which case categorical variables don't need to be dummy-coded, just specified as factors. Does best-guess imputation (medians, modes).
  • transcan specifying n.impute: Approximate multiple imputation (only approximate because it freezes the imputation model before drawing the multiple imputations, rather than using different estimates of the regression coefficients for each imputation). Continuous variables: sampling with replacement (Bayesian bootstrapping), so it doesn't assume normality or symmetry of residuals. Categorical variables: best guess. You can use fit.mult.impute to re-fit any model n.impute times on the n.impute completed datasets.
  • aregImpute: Multiple imputation. Additive regression, bootstrapping and predictive mean matching. It draws predicted values from a full Bayesian predictive distribution. Uses predictive mean matching with optional weighted probability sampling of donors rather than using only the closest match. A minimal usage sketch follows this list.
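Despite giving up on the 14 pages, here is my rough sketch of how the Hmisc workflow seems to hang together. The data frame dat and the variables y, x1, x2 and x3 are placeholders, not my data:

    library(Hmisc)

    # Single imputation of one variable with a simple function of the non-NAs
    dat$x1 <- impute(dat$x1, median)   # printed values get a * if they were imputed

    # Multiple imputation: additive regression, bootstrapping, predictive mean matching
    a <- aregImpute(~ y + x1 + x2 + x3, data = dat, n.impute = 5)

    # Re-fit a model on each of the 5 completed datasets and pool the results
    fit <- fit.mult.impute(y ~ x1 + x2 + x3, lm, a, data = dat)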

imputation
  • Single imputation
  • Fills missing values in a data matrix (or a data frame) using mean, k-Nearest Neighbour (kNN), SVD, Singular Value Threshold, Boosted Trees Imputation, Locally weighted least squares.
  • The ability to use boosted trees feels advantageous because boosted trees can handle data sets with both categorical and numerical variables.
  • However, I ran it and it gave an error message that made me sad: "Error in checkMissing(x, y) : Missing values are not allowed in the response" (I'm not sure what it is using as a response variable). And a warning: "In gbmImpute(dat, max.iters = 20, cv.fold = 0, n.trees = 100, ... : Tree based imputation works best with larger data (> 1000 obs.)". So it looks like I can't use this on my data set (the call I tried is sketched below)...moving on.
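For the record, this is roughly the call I made, reconstructed from the warning message above (so take the argument values as approximate):

    library(imputation)

    # Boosted-trees imputation on my data frame 'dat'; arguments as echoed
    # back in the warning message
    imputed <- gbmImpute(dat, max.iters = 20, cv.fold = 0, n.trees = 100)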

mix: Estimation/multiple Imputation for Mixed Categorical and Continuous Data - two dark days spent messing with this on and off...
  • Involved tediously re-coding categorical and binary variables to be numeric, always starting at 1 and not 0 (all my binaries were coded as 0 and 1).
  • Also need to re-arrange data frame so that all categorical variables are in the first x columns.
  • I found the functions difficult to use (you have to work through two other functions [prelim.mix, em.mix] before you get to the one that actually imputes a data set - see the sketch after this list), and it kept crashing my computer. I couldn't impute using the whole data set at once.
  • Actually...I'm probably being unfair here...I think there were multicollinearity and other issues in my data that stopped this from working...BUT the point is that mix's documentation and error messages didn't help me identify the problems. The Amelia II documentation did (see below).
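For reference, the workflow I was attempting looks roughly like this. I'm assuming x is a numeric matrix with the p categorical columns first (coded 1, 2, ...), as the package requires, and the number of data augmentation steps is just illustrative; imp.mix is the function that finally produces an imputed data set:

    library(mix)

    p <- 4                                     # number of categorical columns (must be the first p columns of x)
    s <- prelim.mix(x, p)                      # preliminary sorting/summary of the data
    thetahat <- em.mix(s)                      # ML parameter estimates via the EM algorithm
    rngseed(1234)                              # must be called before any simulation
    theta <- da.mix(s, thetahat, steps = 100)  # data augmentation starting from the EM estimates
    ximp <- imp.mix(s, theta, x)               # one imputed version of x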

Amelia II - a breath of fresh air compared to everything else I've tried. It can be run through R or via a standalone GUI without R (I didn't try this), which is a nice feature for non-R users. It also has intelligible documentation.

  • Multiple imputation (could do single by setting m=1)
  • Assumes multivariate normality
  • Assumes missing at random (can be made more plausible by including more variables)
  • The EMB algorithm combines the classic EM algorithm with a bootstrap approach to take draws from the posterior; Honaker and King (2010)
  • Returns data in the scale and form it was input in (you can tell Amelia to apply natural log or square root transformations to make the data more multivariate normal, but these are undone in the returned data set)
  • I had to remove two highly correlated variables (Spearman's correlation coefficient was >0.9; latitude range and longitude range) because it gave the error: "The resulting variance matrix was not invertible.  Please check your data for highly collinear variables."
  • I had to remove some nominal categorical variables because they were derived from a bunch of other binary variables (I was trying to reduce the number of binary variables, you see). Including both the original binary variables and the derived nominal variables uses the same information twice, which creates problems. I think this duplication was what gave the error: "Error in La.svd(x, nu, nv) : error code 1 from Lapack routine 'dgesdd'", though I'm not 100% sure.
  • This method was computationally very fast. Also, some diagnostics are available to check things are sensible (see the sketch after this list).
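To give a flavour of the call, here is a rough sketch of the sort of thing I ran. The column names (species, habitat, range_area) are placeholders, not my real variable names:

    library(Amelia)

    a.out <- amelia(dat, m = 5,
                    idvars = "species",      # identifier columns carried along but not modelled
                    noms   = "habitat",      # nominal categorical variables
                    sqrts  = "range_area")   # transformed internally, returned on the original scale

    summary(a.out)
    plot(a.out)                      # observed vs imputed density diagnostics
    imp1 <- a.out$imputations[[1]]   # the first of the 5 completed data sets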

All in all, this has taken a while, there is still checking to do, and it looks like you can also mess with the priors (a couple of options are sketched below). But I now have 5 imputed data sets thanks to Amelia II.
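Two things I might come back to, assuming the a.out object from the sketch above: the empirical ridge prior for nearly collinear data, and writing the completed data sets to disk:

    # Empirical ridge prior; the value should be a small fraction of the number
    # of rows (see the Amelia documentation for guidance)
    a.ridge <- amelia(dat, m = 5, empri = 0.01 * nrow(dat))

    # Write the m completed data sets out as CSVs (outdata1.csv, ..., outdata5.csv)
    write.amelia(a.out, file.stem = "outdata")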

