I’m thrilled to announce that the new package datasetjson, which allows you to read, write, and validate CDISC Dataset JSON files per the CDISC schema, has been released to CRAN!
History of Submission Data Formats
If you’ve worked on clinical trial submissions, you’re inevitably familiar with SAS Version 5 Transport files. This is the file format in which you need to package and send your SDTM and ADaM data when submitting to the FDA. Version 5 Transport files were developed in the late 1980s and, as such, come with some specific limitations: variable names can’t be more than 8 characters, labels can’t be more than 40 characters, and string variables can’t have values that exceed 200 characters, along with some other nuanced requirements. Because it is a regulatory requirement, this file format has had a strong influence on the rules and requirements we see in CDISC standards, as well as on the programming languages we use in regulatory submissions.
For me personally, this has always seemed like a ripe opportunity for change. There have been a lot of changes and improvements in the way we can store data since the late 1980s. But our continued use of Version 5 Transport isn’t for lack of trying: in 2015, a pilot was run with the FDA for use of Dataset XML, a standard created by CDISC to try to replace Version 5 Transport files. Unfortunately, a primary issue was that the XML format inflated file sizes significantly, and the pilot was ultimately unsuccessful.
Back in 2022, CDISC decided to run a hackathon with a new attempt at a Version 5 Transport file replacement: Dataset JSON. This is one that I’ve been excited about for some time now.
First off, why JSON? There are plenty of other efficient file formats out there, such as Parquet or Arrow. It’s a good question. The primary purpose of Dataset JSON is to serve as a data exchange format: it’s intended for interfacing with the agency, such as during your submission. Secondarily, JSON is naturally used for data exchange through APIs and is much less verbose than XML, the format used in the last attempt. Additionally, JSON makes it possible to store extra metadata directly alongside the data itself, which you can see in the Dataset JSON schema.
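To make the idea of metadata travelling alongside the data a little more concrete, here is a minimal R sketch using jsonlite. The field names (name, label, items, rows) are made up for illustration and are deliberately simpler than the real Dataset JSON schema, so treat this as a picture of the concept rather than a valid Dataset JSON file.

```r
library(jsonlite)

# Hypothetical, simplified layout. This is NOT the actual Dataset JSON schema;
# the field names below are assumptions chosen for illustration only.
payload <- list(
  name  = "DM",
  label = "Demographics",
  # Variable-level metadata lives in the same document as the data
  items = data.frame(
    name  = c("USUBJID", "AGE"),
    label = c("Unique Subject Identifier", "Age"),
    type  = c("string", "integer"),
    stringsAsFactors = FALSE
  ),
  # The data itself, one unnamed list per record
  rows = list(
    list("01-001", 34L),
    list("01-002", 51L)
  )
)

# Serialize to JSON text; auto_unbox keeps scalars from becoming arrays
json_text <- toJSON(payload, auto_unbox = TRUE, pretty = TRUE)
cat(json_text)

# Parse it back and pull a label straight out of the metadata
parsed <- fromJSON(json_text, simplifyVector = FALSE)
parsed$items[[2]]$label
```

Note that the labels here are not bound by the 40-character limit of Version 5 Transport, and a reader in any language with a JSON parser can recover both the records and their descriptions from the one file.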
So why am I excited? Because there’s a new energy around Dataset JSON that I haven’t seen for a while. With the increased usage of programming languages like R and Python, there’s an increased priority on a more language-agnostic format. Realistically, it just feels like more people are paying attention this time around.
datasetjson v0.0.1 Released to CRAN
To truly leverage the Dataset JSON format, we still need the tools to read and write the data with ease. As such, I’m excited to announce the release of the datasetjson R package. The package was co-developed by Atorus and Johnson & Johnson, and it’s now available on CRAN!
You can read all about the datasetjson package on the package website here. The datasetjson package allows you to read a Dataset JSON file into a data frame, and write a data frame out to a Dataset JSON file. This is done while allowing you to manage or maintain the extra metadata attached to the object.
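As a rough sketch of that round trip, the example below builds a Dataset JSON object from the built-in iris data, writes it to a temporary file, and reads it back. It assumes the v0.0.1 interface: the dataset_json(), write_dataset_json(), and read_dataset_json() functions, plus the iris_items example metadata shipped with the package. Argument names and order may differ in later releases, so check the package website before relying on it.

```r
library(datasetjson)

# Assumption: v0.0.1 signature dataset_json(.data, item_id, name, label, items).
# iris_items is assumed to be the example variable metadata bundled with the package.
ds_json <- dataset_json(iris, "IG.IRIS", "IRIS", "Iris", iris_items)

# Write the object out to a temporary Dataset JSON file
path <- tempfile(fileext = ".json")
write_dataset_json(ds_json, file = path)

# Read the file back into a data frame
dat <- read_dataset_json(path)
head(dat)
```

The items argument is what carries the variable-level metadata (names, labels, types) into the file, which is why it has to be supplied alongside the data frame itself.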
The package is currently at version 0.0.1, and this is our first release to CRAN. We’re eager for feedback, so if you have questions, issues, or suggestions for other interfaces to the data that you’d like to see, let us know right here in the GitHub issues.
Acknowledgements
Thank you to Nick Masel of Johnson & Johnson for developing this package with me.
Thank you to Ben Straub and Eric Simms (GSK) for help and input during the original CDISC Dataset JSON hackathon that motivated this work.
Thank you to Tilo Blenk (GSK) for suggestions that allowed us to use jsonlite exclusively for generation of the final JSON file.