Introduction

The foreign R package is useful for exporting data sets to all kinds of formats including files for the proprietary SPSS program from IBM. However, the default method for writing to SPSS doesn’t allow for variable labels. Instead, it defaults to labeling all of the variables by their name from the column headings. In this presentation, we will use the Hmisc R package for its variable labeling functionality and write a modification to the original SPSS export function from the foreign R package. If you want to test this code but don’t have SPSS, there is an open-source version available called PSPP.

This presentation will use the R programming language and assume the user is using Rstudio to compile their R markdown files into HTML (R Core Team 2019; RStudio Team 2016). If you want to follow along with the R markdown, it is downloadable here. The associated bibliography file is here.

Libraries

The libraries knitr, bookdown, and kableExtra generate the HTML output (Xie 2019; Xie 2018; Zhu 2019). We use the Hmisc library to set variable labels and the foreign library to export the data to SPSS (Harrell Jr, Charles Dupont, and others. 2019; R Core Team 2018).

package_loader <- function(x, ...) {
  if (x %in% rownames(installed.packages()) == FALSE) install.packages(x)
  library(x, ...)
}

packages <- c("knitr", "bookdown", "kableExtra", "Hmisc", "foreign")

invisible(sapply(X = packages, FUN = package_loader, character.only = TRUE))

Data Setup

There are two essential parts to setting up the data. First, make sure each variable is coded to reflect its class. Second, we want to add labels to each variable in the data set using the label function of the Hmisc library.

set.seed(123)
data(mpg)
mpg <- data.frame(mpg)

colnames(mpg)[which(colnames(mpg) == "manufacturer")] <- "manu"

mpg$manu <- factor(mpg$manu)
mpg$model <- factor(mpg$model)
mpg$displ <- as.numeric(mpg$displ)
mpg$year <- factor(mpg$year, levels = c("1999", "2008"), ordered = TRUE)

mpg$dp <- as.Date(NA, origin = "1970-01-01")
mpg$dp[which(mpg$year == "1999")] <- sample(seq(as.Date('1999-01-01', format = "%Y-%m-%d", origin = "1970-01-01"), as.Date('1999-12-25', format = "%Y-%m-%d", origin = "1970-01-01"), by="day"), dim(mpg)[1]/2)
mpg$dp[which(mpg$year == "2008")] <- sample(seq(as.Date('2008-01-01', format = "%Y-%m-%d", origin = "1970-01-01"), as.Date('2008-12-25', format = "%Y-%m-%d", origin = "1970-01-01"), by="day"), dim(mpg)[1]/2)

mpg$cyl <- factor(mpg$cyl, levels = c(4, 5, 6, 8), ordered = TRUE)
mpg$trans <- factor(mpg$trans)
mpg$drv <- factor(mpg$drv, levels = c("f", "r", "4"), labels = c("front-wheel drive", "rear wheel drive", "4wd"))
mpg$fl <- factor(mpg$fl)
mpg$class <- factor(mpg$class)

mpg$rn <- rnorm(dim(mpg)[1], mean = 10, sd = 5)
mpg$rn[sample(1:length(mpg$rn), size = 50)] <- NA

mpg$party <- factor(sample(c("republican", "democrat", "independent", NA), dim(mpg)[1], replace = TRUE), levels = c("republican", "democrat", "independent"))
mpg$comments <- sample(c("I like this car!", "Meh.", "This is the worst car ever!", "Does it come in green?", "want cheese flavoured cars.", "Does it also fly?", "Blah, Blah, Blah, Blah, Blah, Blah, Blah, Blah", NA), dim(mpg)[1], replace = TRUE)

label(mpg$manu) <- "manufacturer"
label(mpg$model) <- "model name"
label(mpg$displ) <- "engine displacement, in litres"
label(mpg$year) <- "year of manufacture"
label(mpg$dp) <- "date of purchase"
label(mpg$cyl) <- "number of cylinders"
label(mpg$trans) <- "type of transmission"
label(mpg$drv) <- "drive type"
label(mpg$cty) <- "city miles per gallon"
label(mpg$hwy) <- "highway miles per gallon"
label(mpg$fl) <- "fuel type"
label(mpg$class) <- "type of car"
label(mpg$rn) <- "some random numbers that are generated from a normal distrubtion with mean = 10 and sd = 5"
label(mpg$party) <- "some random political parties"
label(mpg$comments) <- "some random comments"

kable(head(mpg), caption = "Header of <b>mpg</b>.", booktabs = TRUE, escape = FALSE) %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
Table 1: Header of mpg.
manu model displ year cyl trans drv cty hwy fl class dp rn party comments
audi a4 1.8 1999 4 auto(l5) front-wheel drive 18 29 p compact 1999-06-28 18.98570 democrat I like this car!
audi a4 1.8 1999 4 manual(m5) front-wheel drive 21 29 p compact 1999-01-14 NA NA I like this car!
audi a4 2.0 2008 4 manual(m6) front-wheel drive 20 31 p compact 2008-02-08 19.50450 independent Does it also fly?
audi a4 2.0 2008 4 auto(av) front-wheel drive 21 30 p compact 2008-07-14 NA independent I like this car!
audi a4 2.8 1999 6 auto(l5) front-wheel drive 16 26 p compact 1999-07-14 13.68097 democrat Meh.
audi a4 2.8 1999 6 manual(m5) front-wheel drive 18 26 p compact 1999-11-02 16.82888 NA Meh.

Exporting the Data to SPSS

The original unexported function can be viewed with the command foreign:::writeForeignSPSS. We are going to create a new function called writeForeignSPSS2. Below is the code with comments where changes are made.

# This function below isn't exported from the foreign package, so we recreate it here.  Another option is the use the ::: operator but the documention for R discourages this use case.  We will take their word for it :).
adQuote <- function(x) paste("\"", x, "\"", sep = "")

# The original function doesn't deal with dates.  To address this we need to check if the variable is of class date to later store is a character string.
is.date <- function(x) inherits(x, 'Date')

# The original function uses the strwrap function that can sometime cause a split string to begin with a '*'.  This function prevents this problem.
spssSafeSplit <- function(x) {
  spssSplit <- strwrap(paste(x, collapse = " "), width = 70)
  
  spssInvalid <- which(sapply(X = spssSplit, FUN = substring, 1, 1) == "*")
  
  if (length(spssInvalid) == 0) return(spssSplit)
  
  for (i in 1:length(spssInvalid)) {
    spssSplit[spssInvalid[i]] <- gsub("* ", "", spssSplit[spssInvalid[i]], fixed = TRUE)
    spssSplit[spssInvalid[i] - 1] <- paste(spssSplit[spssInvalid[i] - 1], "*")
  }
  
  return(spssSplit)
}

writeForeignSPSS2 <- function (df, datafile, codefile, varnames = NULL, maxchars = 32L) {
  dfn <- lapply(df, function(x) if (is.factor(x)) as.numeric(x) else x)
  
  # write.table(dfn, file = datafile, row.names = FALSE, col.names = FALSE, sep = ",", quote = FALSE, na = "", eol = ",\n")
  write.table(dfn, file = datafile, row.names = FALSE, col.names = FALSE, sep = "\t", quote = FALSE, na = ".", eol = "\n") # The original code creates a comma-delimited file.  In many data sets, there are comment columns that include commas which creates unexpected results.  To avoid that I change it over to create tab-delimited data.
  
  # varlabels <- names(df)
  varlabels <- label(df) # We don't want the names of df but the labels.
  
  if (is.null(varnames)) {
    varnames <- abbreviate(names(df), maxchars)
    if (any(sapply(varnames, nchar) > maxchars)) stop("I cannot abbreviate the variable names to 'maxchars' or fewer chars")
    # if (any(varnames != varlabels)) warning("some variable names were abbreviated")
    if (any(varnames != names(varlabels))) warning("some variable names were abbreviated") # The original code has an error check for the max length of variable names.  We can still have the error check, but we need to use the names of varlables instead.  
  }
  
  varnames <- gsub("[^[:alnum:]_\\$@#]", "\\.", varnames)
  dl.varnames <- varnames
  
  chv <- sapply(df, is.character)
  fav <- sapply(df, is.factor)
  dav <- sapply(df, is.date) # This is added in to deal with date class columns.
  
  if (any(chv)) {
    lengths <- sapply(df[chv], function(v) max(c(nchar(v), 8), na.rm = TRUE))
    lengths <- paste0("(A", lengths, ")")
    dl.varnames[chv] <- paste(dl.varnames[chv], lengths)
  }
  
  # Store any date class columns as a character string in SPSS.  It is the same as above, but coerces the date to a character vector.
  if (any(dav)) {
    lengths <- sapply(df[dav], function(v) max(c(nchar(as.character(v)), 8), na.rm = TRUE))
    lengths <- paste0("(A", lengths, ")")
    dl.varnames[dav] <- paste(dl.varnames[dav], lengths)
  }
  
  if (any(fav)) {
    dl.varnames[fav] <- paste(dl.varnames[fav], "(F8.0)")
  }
  
  # if (any(chv) || any(fav)) {
  #     star <- ifelse(c(FALSE, diff(chv | fav) == 1)[chv | fav], " *", " ")
  #     dl.varnames[chv | fav] <- paste(star, dl.varnames[chv | fav])
  # }
  
  # In the code above, the star detection isn't setup for dates.  We can add this simple enough, as shown below.
  if (any(chv) || dav || any(fav)) {
    star <- ifelse(c(FALSE, diff(chv | dav | fav) == 1)[chv | dav | fav], " *", " ")
    dl.varnames[chv | dav | fav] <- paste(star, dl.varnames[chv | dav | fav])
  }

  cat("SET DECIMAL=DOT.\n\n", file = codefile)
  # cat("DATA LIST FILE=", adQuote(datafile), " free (\",\")\n", file = codefile, append = TRUE)
  cat("DATA LIST FILE=", adQuote(datafile), " free (TAB)\n", file = codefile, append = TRUE) # Since we changed to a tab-delmited file, we also need to change the import format to TAB.
  cat("ENCODING=\"Locale\"\n", file = codefile, append = TRUE)
  # cat("/", paste(strwrap(paste(dl.varnames, collapse = " "), width = 70), "\n"), " .\n\n", file = codefile, append = TRUE)
  cat("/", paste(spssSafeSplit(paste(dl.varnames, collapse = " ")), "\n"), " .\n\n", file = codefile, append = TRUE) # The original code forces a string wrap.  If a new line ends up begining with a '*" weird things happen so we use the spssSafeSplit function to prevent this.
  cat("VARIABLE LABELS\n", file = codefile, append = TRUE)
  cat(paste(varnames, adQuote(varlabels), "\n"), ".\n", file = codefile, append = TRUE)
  
  if (any(fav)) {
    cat("\nVALUE LABELS\n", file = codefile, append = TRUE)
    for (v in which(fav)) {
      cat("/\n", file = codefile, append = TRUE)
      cat(varnames[v], " \n", file = codefile, append = TRUE, sep = "")
      levs <- levels(df[[v]])
      cat(paste(seq_along(levs), adQuote(levs), "\n", sep = " "), file = codefile, append = TRUE)
    }
    cat(".\n", file = codefile, append = TRUE)
  }
  
  ord <- sapply(df, is.ordered)
  if (any(ord)) cat("VARIABLE LEVEL", paste(strwrap(paste(varnames[ord], collapse = ", "), width = 70), "\n"), "(ordinal).\n", file = codefile, append = TRUE)
  
  num <- sapply(df, is.numeric)
  
  if (any(num)) cat("VARIABLE LEVEL", paste(strwrap(paste(varnames[num], collapse = ", "), width = 70), "\n"), "(scale).\n", file = codefile, append = TRUE)
  
  cat("\nEXECUTE.\n", file = codefile, append = TRUE)
}

write.foreign(mpg, datafile = "export_r_to_spss_with_variable_labels.txt", codefile = "export_r_to_spss_with_variable_labels.sps", package = "SPSS2", maxchars = 64L)

If you are using a console you can create you .sav file with the following commands, otherwise if you are using the GUI, click away!

cat("\nSAVE OUTFILE = \"export_r_to_spss_with_variable_labels.sav\".", file = "export_r_to_spss_with_variable_labels.sps", append = TRUE)

system("pspp export_r_to_spss_with_variable_labels.sps")

If you want to run the output yourself the data file can be downloaded here and the code file can be downloaded here. If you want to download the final .sav data set, it is here.

Session Info

sessionInfo()
## R version 3.6.1 (2019-07-05)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.3 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
## 
## locale:
##  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
##  [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
##  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
## [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] foreign_0.8-72    Hmisc_4.2-0       ggplot2_3.1.1     Formula_1.2-3    
## [5] survival_2.44-1.1 lattice_0.20-38   kableExtra_1.1.0  bookdown_0.9     
## [9] knitr_1.22       
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_0.2.5    xfun_0.6            purrr_0.3.2        
##  [4] splines_3.6.1       colorspace_1.4-1    htmltools_0.3.6    
##  [7] viridisLite_0.3.0   yaml_2.2.0          base64enc_0.1-3    
## [10] rlang_0.3.4         pillar_1.3.1        glue_1.3.1         
## [13] withr_2.1.2         RColorBrewer_1.1-2  plyr_1.8.4         
## [16] stringr_1.4.0       munsell_0.5.0       gtable_0.3.0       
## [19] rvest_0.3.3         htmlwidgets_1.3     evaluate_0.13      
## [22] latticeExtra_0.6-28 htmlTable_1.13.1    Rcpp_1.0.1         
## [25] acepack_1.4.1       readr_1.3.1         backports_1.1.4    
## [28] checkmate_1.9.1     scales_1.0.0        webshot_0.5.1      
## [31] gridExtra_2.3       hms_0.4.2           digest_0.6.18      
## [34] stringi_1.4.3       dplyr_0.8.0.1       grid_3.6.1         
## [37] tools_3.6.1         magrittr_1.5        lazyeval_0.2.2     
## [40] tibble_2.1.1        cluster_2.1.0       crayon_1.3.4       
## [43] pkgconfig_2.0.2     Matrix_1.2-17       data.table_1.12.2  
## [46] xml2_1.2.0          assertthat_0.2.1    rmarkdown_1.12     
## [49] httr_1.4.0          rstudioapi_0.10     rpart_4.1-15       
## [52] R6_2.4.0            nnet_7.3-12         compiler_3.6.1

References

Harrell Jr, Frank E, with contributions from Charles Dupont, and many others. 2019. Hmisc: Harrell Miscellaneous. https://CRAN.R-project.org/package=Hmisc.

R Core Team. 2018. Foreign: Read Data Stored by ’Minitab’, ’S’, ’Sas’, ’Spss’, ’Stata’, ’Systat’, ’Weka’, ’dBase’, ... https://CRAN.R-project.org/package=foreign.

———. 2019. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.

RStudio Team. 2016. RStudio: Integrated Development Environment for R. Boston, MA: RStudio, Inc. http://www.rstudio.com/.

Xie, Yihui. 2018. Bookdown: Authoring Books and Technical Documents with R Markdown. https://github.com/rstudio/bookdown.

———. 2019. Knitr: A General-Purpose Package for Dynamic Report Generation in R. https://yihui.name/knitr/.

Zhu, Hao. 2019. KableExtra: Construct Complex Table with ’Kable’ and Pipe Syntax. https://CRAN.R-project.org/package=kableExtra.