This vignette describes version 1.0 of the ROCK project file format.
ROCK project files have extension .ROCKproject
and are ZIP archives. They contain two things:
Files containing the data, ideally in a deliberately designed set of sub-directories to facilitate tracing the data through different stages of processing and analysis;
Files containing settings and directives for applications and processing of the data.
The former are raw data files and ROCK files. ROCK files are plain text files with the .rock
extension.
The latter are YAML files. Of these, the only required one is the _ROCKproject.yml
file. This file must always be a regular YAML file that contains a map with key _ROCKproject
. This map in turn must contain maps with keys project
, codebook
, sources
, and workflow
.
The project
map contains project metadata, such as the project’s title
, its authors
, optional (but strongly recommended!) author identifiers in authorIds
, the project’s version
, the version of the ROCK standard used in the project (with key ROCK_version
), the version of the ROCK project file (with key ROCK_project_version
), the date the project was created (with key date_created
), and the date the project was last modified (with key date_modified
).
The codebook
map contains the project’s codebook, either embedded or by linking to it. The codebook
key can also have value ~
(NULL) if not codebook information is specified (or the codebook is embedded in the ROCK files). Valid keys to be specified with the codebook
map are urcid
, embedded
, and local
. The urcid
key can store the project’s Unique ROCK Codebook Identifier (i.e. its URCID) as a URL to a ROCK codebook in spreadsheet (.xlsx
or .ods
format) or YAML (.yml
or .rock
) format.
The sources
map specifies where the project’s data resides. This is specified in terms of regular expressions. The first valid key is extension
, which is not a regular expression but can be used to conveniently specify that files with a given extension must be imported. This is used if regex
is ~
(NULL, i.e. unspecified). However, if a value is specified for regex
, a program importing a ROCK project should ignore whatever is specified for extension
. The value stored in the dirsToIncludeRegex
key should be a regular expression indicating which directories contain the data (i.e. the ROCK files forming the project). The recursive
key can be true
or false
and indicates whether all subdirectories of matched directories should be imported too. The dirsToExcludeRegex
regular expression can be used to ignore directories. In addition, if filesToIncludeRegex
is specified, only files matching that regular expression should be imported; and if filesToExcludeRegex
is specified, files matching that regular expression should be ignored.
Finally, the workflow
map described the workflow and data management template used in this project. It consists of a pipeline
and actions
. The pipeline
is a sequence of stages, each with an identifier (in key stage
); the directory containing files in that stage (in key dirName
; note that this is a single directory name, not a regular expression!); and a sequence of one or more next stage (with key nextStages
). Each element in nextStages
has a nextStageId
key and a actionId
. The nextStageId
specifies to which stage files transfer (i.e. are saved) when the action with the corresponding actionId
is executed. These actions
are stored in a sequence where each element has an actionId
; a language
specified the programming language the action is specified in; one or more dependencies
(typically packages that need to be loaded in that programming environment before the script
can be executed), and a script
section specifying the commands to run to execute that action. In this script, two placeholders can be used: {currentStage::dirName}
will be replaced with the contents of dirName
for the current stage; and {nextStage::dirName}
will be replaced with the contents of dirName
for the next stage. The latter part of these expressions (dirName
in both of these examples) can be replaced by other keys specified in each stage to allow setting parameters in the pipeline specification.
An example of a _ROCKproject.yml
file is included below.
_ROCKproject:
project:
title: "The Alice Study" # Any character string
authors: "Author names as string" # Any character string
authorIds:
-
display_name: "Talea Cornelius" # Any character string
orcid: "0000-0001-7181-0981" # Any character string matching ^([0-9]{4}-){3}[0-9]{4}$
shorcid: "ip6b381" # Any character string matching ^([0-9a-zA-Z]+$
-
display_name: "Gjalt-Jorn Peters" # Any character string
orcid: "0000-0002-0336-9589" # Any character string matching ^([0-9]{4}-){3}[0-9]{4}$
shorcid: "it36ll9" # Any character string matching ^([0-9a-zA-Z]+$
version: "1.1" # Anything matching regex [0-9]+(\\.[0-9]+)*
ROCK_version: 1 # Anything matching regex [0-9]+(\\.[0-9]+)*
ROCK_project_version: 1 # Anything matching regex [0-9]+(\\.[0-9]+)*
date_created: "2023-03-01 20:03:51 UTC" # Anything matching that date format, preferably converted to UTC timezone
date_modified: "2023-03-08 20:03:51 UTC" # Anything matching that date format, preferably converted to UTC timezone
codebook:
urcid: ""
embedded: ~
local: ""
sources:
extension: ".rock" # Any valid extension
regex: ~ # Any regex or ~
dirsToIncludeRegex: data/ # Any regex or ~
recursive: true # true or false
dirsToExcludeRegex: ~ # Any regex or ~
filesToIncludeRegex: ~ # Any regex or ~
filesToExcludeRegex: ~ # Any regex or ~
workflow:
pipeline:
-
stage: raw # Anything matching regex [a-A-Z][a-zA-Z0-9_]*
dirName: "data/010---raw-sources" # Any valid directory name, using a forward slash as separator
nextStages:
-
nextStageid: clean # A different stage identifier or ~
actionId: cleanSource
-
nextStageid: uids # A different stage identifier or ~
actionId: addUIDs
-
stage: clean # Anything matching regex [a-A-Z][a-zA-Z0-9_]*
dirName: "data/020---cleaned-sources" # Any valid directory name, using a forward slash as separator
nextStages:
-
nextStageid: uids # A different stage identifier or ~
actionId: addUIDs
-
stage: uids # Anything matching regex [a-A-Z][a-zA-Z0-9_]*
dirName: "data/030---sources-with-uids" # Any valid directory name, using a forward slash as separator
nextStage: coded # A different stage identifier or ~
-
stage: coded # Anything matching regex [a-A-Z][a-zA-Z0-9_]*
dirName: "data/040---coded-sources" # Any valid directory name, using a forward slash as separator
nextStage: masked # A different stage identifier or ~
-
stage: masked # Anything matching regex [a-A-Z][a-zA-Z0-9_]*
dirName: "data/090---masked-sources" # Any valid directory name, using a forward slash as separator
nextStage: ~ # A different stage identifier or ~
actions:
-
actionId: addUIDs # String, referenced from the stages
language: R # Language, has to be matched to interpreter
dependencies: rock # Dependencies to be loaded before running the script
script: | # Literal block style string
rock::prepend_ids_to_sources(
input = {currentStage::dirName},
output = {nextStage::dirName}
);