In August 2020, Atorus and GSK entered a unique relationship. As two separate organizations both committed to the advancement of open source within pharma, we saw potential for collaboration. What makes this relationship unique is the intent; from start to end, our goal has been to create open-source software fit for consumption by the wider industry.
Over the last few months, we’ve been hard at work. As the summer of 2021 approaches, we are thrilled to announce the alpha release of three new R packages developed through our collaborative efforts. So, without further ado, let’s jump right in!
timber
The timber package generates a supplemental log file when used to execute an R script. The goal of the package is to create a permanent record of the execution environment at the time of execution, supporting reproducibility of the script’s results.
Why timber? Many users analyzing clinical trial data are accustomed to reviewing the SAS® execution log after executing a SAS® script. These logs are often useful to the user as a debugging tool, and thus tend to be filled with a great deal of information. The closest counterpart to the SAS® log in R is simply the console, which reports the code executed and any messages produced. We felt that the timber log did not need to replicate the SAS® log file in its entirety, but should instead capture the minimal information needed to enable reproducibility and traceability of the executed data analysis.
The current version of the timber log reports important information, such as:
- Date/time of execution
- Name of file being run as well as output file
- Program run time
- Executing username
- System information
- Imported packages/versions
- Namespace conflicts
- Generated errors or warnings
This smaller and more focused set of information allows the timber log to serve as an audit trail while still enabling automated tools to scan logs for issues, such as unacceptable messages or masked functions that could be problematic.
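As one illustration of the automated scanning mentioned above, here is a minimal Python sketch of a log checker. timber itself is an R package, and the log layout used here (lines beginning with `Error` or `Warning`) is a hypothetical format chosen for the example, not timber's actual output.

```python
import re

# Hypothetical pattern: lines that start with "Error" or "Warning".
# The real timber log layout may differ.
ISSUE_PATTERN = re.compile(r"^\s*(Error|Warning)\b", re.IGNORECASE)


def find_issues(log_text):
    """Return (line_number, text) pairs for lines that look like errors or warnings."""
    issues = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        if ISSUE_PATTERN.match(line):
            issues.append((number, line.strip()))
    return issues


# Made-up log content, used only to exercise the function
sample_log = """\
Date/time of execution: 2021-05-01 09:00:00
Executing username: jdoe
Warning: object 'filter' is masked from 'package:stats'
Program run time: 12 seconds
"""

for number, text in find_issues(sample_log):
    print(f"line {number}: {text}")
```

A check like this could run after every batch execution and fail the batch whenever the returned list is not empty.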
xportr
The xportr package contains tools to build CDISC-compliant data sets, enhancing the functionality already provided by SASxport. At the moment, clinical trial data is expected to be submitted to regulatory agencies as version 5 transport files. In R, packages already exist for reading and writing these transport files, including SASxport and the tidyverse package haven. The issue we identified is that files produced by these packages will not necessarily pass common compliance checks for a regulatory submission.
One of xportr’s major features allows the manual setting of variable lengths when writing to a transport file. Variable lengths do not exist within R in the same way that they exist in SAS®, the most common language used to generate submission-ready transport files. In R, character strings are not truncated unless you happen to hit a memory limit. Because fixed variable lengths are not a convention within R, neither haven nor SASxport contains this feature. Furthermore, the defaults these packages produce are not what typical compliance checks expect. For example, if a variable is empty (meaning no row contains values), the length will be set to 0 rather than 1.
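The length rule described above can be sketched in a few lines. The following Python function is purely illustrative (xportr is an R package, and this is not its API): it derives a character variable's length from its longest value and falls back to 1 when every value is missing or empty. Measuring length in bytes under a given encoding is an assumption made for the example.

```python
def character_length(values, encoding="ascii", minimum=1):
    """Length needed to store a character variable: longest value in bytes, never below `minimum`."""
    longest = 0
    for value in values:
        if value is None:
            continue  # None stands in for a missing value
        longest = max(longest, len(str(value).encode(encoding)))
    return max(longest, minimum)


print(character_length(["PLACEBO", "DRUG A"]))  # 7
print(character_length([None, "", None]))       # 1, not 0
```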
The development of the package xportr is also a wonderful example of one of our core principles within the Atorus-GSK open-source partnership:
If an existing open-source solution is close to what is required, contribute to that solution to enable it to meet requirements.
The first stage of xportr development was to research both haven and SASxport to see what could be done to address the issue of setting variable lengths manually. The team developed a solution and proposed it to the author of SASxport, who ultimately accepted it. This enabled SASxport to produce version 5 transport files that were capable of passing relevant compliance checks.
The next phase of development on xportr was to provide a simplified interface to the SASxport package. Users can provide the necessary metadata as a data frame or metacore object and that metadata will be appropriately set for use within SASxport. Furthermore, xportr runs a number of checks to ensure compliance with regulatory standards specific to the transport file. The xportr package also provides a functional messaging interface that informs the user about changes to the data frame, with different verbosity options for these messages.
With xportr, we are not targeting replacement of common tools for full regulatory compliance of the submission data package. However, if you are using R to create version 5 transport files, xportr provides tools to catch common dataset-level non-compliance issues at the time of generation, meaning fewer findings downstream.
metacore
The purpose of the package metacore is to be a container of metadata. Many in our industry recognize the power of leveraging metadata to automate or semi-automate the clinical data pipeline. The metacore package is intended to serve as a foundation for working with metadata in R across various aspects of clinical statistical programming. We have metadata for our datasets (tabulation, analysis-ready, and analysis results), our analysis displays, and even our quality control processes. Information (metadata) can come in from many sources in many formats and metacore standardizes the metadata structure for easier ingestion by downstream tools.
The development of metacore capitalized on the needs of xportr as a use case. When xportr writes out a dataset, metadata is necessary to identify which attributes should be set on the dataset and its variables. For example, the dataset label, variable labels, and variable lengths all must be set. This information is typically readily available within dataset specifications, and is commonly used within SAS® programming processes as well.
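To make the idea concrete, the sketch below applies a small specification to a dataset. It is written in Python for illustration only; the `spec` layout, the `apply_metadata` helper, and the example data are all hypothetical and do not reflect the metacore or xportr interfaces.

```python
def apply_metadata(columns, spec):
    """Attach the dataset label plus each variable's label and length from `spec`.

    `columns` maps variable name -> list of values.
    Raises KeyError if a variable in the data has no entry in the spec.
    """
    variables = {}
    for name, values in columns.items():
        meta = spec["variables"][name]  # KeyError flags an undocumented variable
        variables[name] = {
            "label": meta["label"],
            "length": meta["length"],
            "values": values,
        }
    return {"label": spec["label"], "variables": variables}


# Hypothetical specification for a tiny demographics dataset
spec = {
    "label": "Demographics",
    "variables": {
        "USUBJID": {"label": "Unique Subject Identifier", "length": 20},
        "AGE": {"label": "Age", "length": 8},
    },
}
data = {"USUBJID": ["01-001", "01-002"], "AGE": [34, 57]}

dataset = apply_metadata(data, spec)
print(dataset["label"], "/", dataset["variables"]["AGE"]["label"])
```

Failing loudly on a variable that is absent from the specification is a design choice for this sketch; a real tool might warn instead.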
Other aspects of SDTM and ADaM programming also leverage metadata, such as applying controlled terminology (value-level metadata), identifying the sort sequence of a dataset, applying variable labels, or even simply the order in which variables are presented in the dataset. Reporting programs can also benefit from the use of metadata, where code lists allow for the conversion of character to factor variables, which can be useful to supplement empty rows without the creation of dummy datasets. Other possibilities include the creation of compliance and consistency checks, study deliverable quality and completeness checks, and more, as metadata can serve several different purposes.
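The point about code lists and empty rows can be shown with a small example. In R, converting a character variable to a factor whose levels come from the code list keeps unobserved categories in a frequency table; the Python sketch below mimics that behaviour with plain data structures. The code list and the observed values are made up for the illustration.

```python
from collections import Counter


def count_by_codelist(values, codelist):
    """Count values in code list order, keeping categories with zero observations."""
    unexpected = set(values) - set(codelist)
    if unexpected:
        raise ValueError(f"values not in code list: {sorted(unexpected)}")
    counts = Counter(values)
    # Counter returns 0 for terms that never occurred
    return [(term, counts[term]) for term in codelist]


# Hypothetical severity code list; no SEVERE events were observed
severity_codelist = ["MILD", "MODERATE", "SEVERE"]
observed = ["MILD", "MODERATE", "MILD"]

for term, n in count_by_codelist(observed, severity_codelist):
    print(f"{term}: {n}")
```

The SEVERE row appears with a count of 0 even though it never occurs in the data, so no dummy dataset is needed to fill the table.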
Organizations will inevitably handle their metadata in different ways, be it in Excel® spreadsheets, SQL databases, or other formats. The schemas of these different sources of metadata will likely all be different – even when organizations are following CDISC standards. Packages designed for CDISC programming activities will need to leverage this metadata in different places. For example, xportr has an isolated purpose, and a package centered around SDTM programming techniques will have another purpose – but metadata can be leveraged within both packages.
The largest challenge of metacore is getting metadata from organization-specific formats into the standardized format of the container. Helper functions have been written to simplify the import of data from Excel® spreadsheets, and readers have additionally been created for define.xml files. If these helpers are not sufficient, then fortunately the creation of a reader is a one-time activity that can then be reused moving forward, allowing users to still benefit from the standardized container provided by metacore.
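At its simplest, a custom reader of this kind is a mapping from one organization's field names to the standardized ones. The Python sketch below shows the idea only; the source headers, the target field names, and the `read_spec_rows` helper are hypothetical and are not part of metacore.

```python
# Hypothetical mapping from one organization's spreadsheet headers
# to hypothetical standardized field names.
COLUMN_MAP = {
    "Variable Name": "variable",
    "Variable Label": "label",
    "Length": "length",
}


def read_spec_rows(rows, column_map=COLUMN_MAP):
    """Rename the fields of each source row to standardized names; fail on missing columns."""
    standardized = []
    for row in rows:
        missing = [source for source in column_map if source not in row]
        if missing:
            raise KeyError(f"source row is missing columns: {missing}")
        standardized.append({target: row[source] for source, target in column_map.items()})
    return standardized


rows = [{"Variable Name": "AGE", "Variable Label": "Age", "Length": 8}]
print(read_spec_rows(rows))
```

Once such a mapping exists for an organization's specification format, it can be reused for every study that follows the same layout.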
open source solutions
Each of these packages (timber, xportr, metacore) has been released under an MIT license, an important target for the Atorus-GSK collaboration. The objective is to accelerate the use of R to deliver the clinical data pipeline. The design and features you will find in these packages are the result of our understanding of the underlying problem, combining the perspectives of both organizations. While we also have planned enhancements in the backlog, you are part of the user base and your opinions matter!
Releasing our packages under a permissive license permits both the use and critique of the provided solutions. With these packages out in the open, we’re eager to hear from the community. We welcome comments, testing, and contributions through our GitHub repository, much like we provided to the maintainer of SASxport. What works? What doesn’t? Are we on point or off the mark? Feedback drives progress, and the beauty of open-source is that you all have a voice – and we’re happy to listen.