This vignette describes version 1.0 of the ROCK project file format.

ROCK project files have the extension .ROCKproject and are ZIP archives. They contain two things:

  • Files containing the data, ideally in a deliberately designed set of sub-directories to facilitate tracing the data through different stages of processing and analysis;

  • Files containing settings and directives for applications and processing of the data.

The former are raw data files and ROCK files. ROCK files are plain text files with the .rock extension.

The latter are YAML files. Of these, the only required one is the _ROCKproject.yml file. This file must always be a regular YAML file that contains a map with key _ROCKproject. This map in turn must contain maps with keys project, codebook, sources, and workflow.
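
Because a ROCK project file is an ordinary ZIP archive, its contents can be inspected with any ZIP library. The following is a minimal sketch in Python (standard library only), not part of the ROCK tooling. The helper name find_manifest is hypothetical, and the sketch assumes the _ROCKproject.yml file may sit either at the archive root or inside a folder, since the text does not say which.

```python
import io
import zipfile
from typing import Optional

MANIFEST_NAME = "_ROCKproject.yml"  # the only required YAML file


def find_manifest(archive: zipfile.ZipFile) -> Optional[str]:
    """Return the archive path of the manifest, or None if it is absent.

    Assumption: the manifest is identified by its file name alone,
    wherever it sits in the archive.
    """
    for name in archive.namelist():
        if name.split("/")[-1] == MANIFEST_NAME:
            return name
    return None


# Build a tiny in-memory archive so the sketch runs without files on disk.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr(MANIFEST_NAME, "_ROCKproject:\n  project: {}\n")
    zf.writestr("data/010---raw-sources/interview-1.rock", "Hello.\n")

# Re-open the archive for reading and look for the manifest and ROCK files.
with zipfile.ZipFile(buffer) as zf:
    manifest_path = find_manifest(zf)
    rock_files = [n for n in zf.namelist() if n.endswith(".rock")]
```

Parsing the manifest itself requires a YAML parser, which is left out here to keep the sketch free of third-party dependencies.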

The project map contains project metadata, such as the project’s title, its authors, optional (but strongly recommended!) author identifiers in authorIds, the project’s version, the version of the ROCK standard used in the project (with key ROCK_version), the version of the ROCK project file (with key ROCK_project_version), the date the project was created (with key date_created), and the date the project was last modified (with key date_modified).
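
A consumer of the format may want to check these metadata fields before using them. The sketch below is one possible check in Python, assuming the project map has already been parsed into a dictionary. The patterns follow the comments in the example file further down; the function name check_project and the wording of the messages are illustrative assumptions.

```python
import re

# Patterns taken from the comments in the example _ROCKproject.yml file.
VERSION_RE = re.compile(r"[0-9]+(\.[0-9]+)*")
ORCID_RE = re.compile(r"([0-9]{4}-){3}[0-9]{4}")
SHORCID_RE = re.compile(r"[0-9a-zA-Z]+")


def check_project(project):
    """Return a list of human-readable problems found in a project map."""
    problems = []
    for key in ("version", "ROCK_version", "ROCK_project_version"):
        # YAML may parse `1` as an integer, so compare the string form.
        if not VERSION_RE.fullmatch(str(project.get(key, ""))):
            problems.append(f"{key}: not a dotted version number")
    # authorIds is optional, so a missing or empty value is not a problem.
    for author in project.get("authorIds") or []:
        name = author.get("display_name", "?")
        if not ORCID_RE.fullmatch(str(author.get("orcid", ""))):
            problems.append(f"{name}: orcid does not match the pattern")
        if not SHORCID_RE.fullmatch(str(author.get("shorcid", ""))):
            problems.append(f"{name}: shorcid does not match the pattern")
    return problems


example_project = {
    "title": "The Alice Study",
    "authorIds": [
        {"display_name": "Talea Cornelius",
         "orcid": "0000-0001-7181-0981", "shorcid": "ip6b381"},
    ],
    "version": "1.1",
    "ROCK_version": 1,
    "ROCK_project_version": 1,
}
problems = check_project(example_project)
```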

The codebook map contains the project’s codebook, either embedded or by linking to it. The codebook key can also have value ~ (NULL) if no codebook information is specified (or if the codebook is embedded in the ROCK files). Valid keys within the codebook map are urcid, embedded, and local. The urcid key can store the project’s Unique ROCK Codebook Identifier (i.e. its URCID) as a URL to a ROCK codebook in spreadsheet (.xlsx or .ods) or YAML (.yml or .rock) format.

The sources map specifies where the project’s data resides. This is specified in terms of regular expressions. The first valid key is extension, which is not a regular expression but can be used to conveniently specify that files with a given extension must be imported. This is used if regex is ~ (NULL, i.e. unspecified). However, if a value is specified for regex, a program importing a ROCK project should ignore whatever is specified for extension. The value stored in the dirsToIncludeRegex key should be a regular expression indicating which directories contain the data (i.e. the ROCK files forming the project). The recursive key can be true or false and indicates whether all subdirectories of matched directories should be imported too. The dirsToExcludeRegex regular expression can be used to ignore directories. In addition, if filesToIncludeRegex is specified, only files matching that regular expression should be imported; and if filesToExcludeRegex is specified, files matching that regular expression should be ignored.
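
The selection rules above can be sketched as a small filter over the paths in the archive. This is a hedged illustration in Python rather than the reference behavior: the function name select_sources is hypothetical, and several details are assumptions that the text leaves open, namely that patterns are applied with a "search" (unanchored) match, that directory patterns are tested against directory paths ending in a forward slash, and that with recursive set to false only a file's immediate parent directory is tested.

```python
import re


def select_sources(paths, sources):
    """Return the archive paths that the `sources` map selects."""
    selected = []
    for path in paths:
        directory, _, filename = path.rpartition("/")
        parts = directory.split("/") if directory else []
        # Ancestor directories, nearest first, each with a trailing slash,
        # e.g. "data/010/x.rock" -> ["data/010/", "data/"].
        ancestors = ["/".join(parts[:i]) + "/" for i in range(len(parts), 0, -1)]
        # Assumption: without `recursive`, only the immediate parent counts.
        candidates = ancestors if sources.get("recursive") else ancestors[:1]

        include_dirs = sources.get("dirsToIncludeRegex")
        if include_dirs is not None and not any(
            re.search(include_dirs, d) for d in candidates
        ):
            continue
        exclude_dirs = sources.get("dirsToExcludeRegex")
        if exclude_dirs is not None and ancestors and re.search(exclude_dirs, ancestors[0]):
            continue

        # `regex` takes precedence; `extension` is only used when regex is unset.
        regex = sources.get("regex")
        if regex is not None:
            if not re.search(regex, filename):
                continue
        else:
            extension = sources.get("extension")
            if extension is not None and not filename.endswith(extension):
                continue

        include_files = sources.get("filesToIncludeRegex")
        if include_files is not None and not re.search(include_files, filename):
            continue
        exclude_files = sources.get("filesToExcludeRegex")
        if exclude_files is not None and re.search(exclude_files, filename):
            continue
        selected.append(path)
    return selected


sources = {
    "extension": ".rock", "regex": None, "dirsToIncludeRegex": "data/",
    "recursive": True, "dirsToExcludeRegex": None,
    "filesToIncludeRegex": None, "filesToExcludeRegex": None,
}
paths = [
    "_ROCKproject.yml",
    "data/010---raw-sources/a.rock",
    "data/010---raw-sources/notes.txt",
    "docs/readme.rock",
]
selected = select_sources(paths, sources)
```

With the settings from the example file, only data/010---raw-sources/a.rock is selected. Setting regex to a value such as \.txt$ makes the sketch ignore extension, as the text requires.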

Finally, the workflow map describes the workflow and data management template used in this project. It consists of a pipeline and actions. The pipeline is a sequence of stages, each with an identifier (in key stage); the directory containing files in that stage (in key dirName; note that this is a single directory name, not a regular expression!); and a sequence of one or more next stages (with key nextStages). Each element in nextStages has a nextStageId key and an actionId key. The nextStageId specifies to which stage files transfer (i.e. are saved) when the action with the corresponding actionId is executed. These actions are stored in a sequence where each element has an actionId; a language key specifying the programming language the action is written in; one or more dependencies (typically packages that need to be loaded in that programming environment before the script can be executed); and a script section specifying the commands to run to execute that action. In this script, two placeholders can be used: {currentStage::dirName} will be replaced with the contents of dirName for the current stage; and {nextStage::dirName} will be replaced with the contents of dirName for the next stage. The latter part of these expressions (dirName in both of these examples) can be replaced by other keys specified in each stage to allow setting parameters in the pipeline specification.
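
The placeholder mechanism can be illustrated with a short substitution routine. This is a sketch in Python under stated assumptions, not the behavior of any particular ROCK implementation: the function name fill_placeholders is hypothetical, key names are assumed to consist of letters, digits, and underscores, and values are inserted verbatim (whether a consumer should add quoting for the target language is not specified in the text).

```python
import re

# Matches {currentStage::key} and {nextStage::key}.
PLACEHOLDER_RE = re.compile(
    r"\{(currentStage|nextStage)::([A-Za-z_][A-Za-z0-9_]*)\}"
)


def fill_placeholders(script, current_stage, next_stage):
    """Replace stage placeholders in an action script.

    Raises KeyError if a placeholder names a key the stage does not have.
    """
    stages = {"currentStage": current_stage, "nextStage": next_stage}

    def substitute(match):
        which, key = match.group(1), match.group(2)
        return str(stages[which][key])

    return PLACEHOLDER_RE.sub(substitute, script)


raw = {"stage": "raw", "dirName": "data/010---raw-sources"}
uids = {"stage": "uids", "dirName": "data/030---sources-with-uids"}
script = "input = {currentStage::dirName},\noutput = {nextStage::dirName}"
filled = fill_placeholders(script, raw, uids)
```

Because the key after the double colon is looked up in the stage itself, any additional key added to a stage becomes available to scripts in the same way as dirName.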

An example of a _ROCKproject.yml file is included below.


_ROCKproject:

  project:

    title: "The Alice Study"                     # Any character string
    authors: "Author names as string"            # Any character string
    authorIds:
      -
        display_name: "Talea Cornelius"          # Any character string
        orcid: "0000-0001-7181-0981"             # Any character string matching ^([0-9]{4}-){3}[0-9]{4}$
        shorcid: "ip6b381"                       # Any character string matching ^[0-9a-zA-Z]+$
      -
        display_name: "Gjalt-Jorn Peters"        # Any character string
        orcid: "0000-0002-0336-9589"             # Any character string matching ^([0-9]{4}-){3}[0-9]{4}$
        shorcid: "it36ll9"                       # Any character string matching ^[0-9a-zA-Z]+$

    version: "1.1"                               # Anything matching regex [0-9]+(\.[0-9]+)*
    ROCK_version: 1                              # Anything matching regex [0-9]+(\.[0-9]+)*
    ROCK_project_version: 1                      # Anything matching regex [0-9]+(\.[0-9]+)*
    date_created: "2023-03-01 20:03:51 UTC"      # Anything matching that date format, preferably converted to UTC timezone
    date_modified: "2023-03-08 20:03:51 UTC"     # Anything matching that date format, preferably converted to UTC timezone

  codebook:
    urcid: ""
    embedded: ~
    local: ""

  sources:

    extension: ".rock"                           # Any valid extension
    regex: ~                                     # Any regex or ~
    dirsToIncludeRegex: data/                    # Any regex or ~
    recursive: true                              # true or false
    dirsToExcludeRegex: ~                        # Any regex or ~
    filesToIncludeRegex: ~                       # Any regex or ~
    filesToExcludeRegex: ~                       # Any regex or ~

  workflow:

    pipeline:
      -
        stage: raw                               # Anything matching regex [a-zA-Z][a-zA-Z0-9_]*
        dirName: "data/010---raw-sources"        # Any valid directory name, using a forward slash as separator
        nextStages:
          -
            nextStageId: clean                   # A different stage identifier or ~
            actionId: cleanSource
          -
            nextStageId: uids                    # A different stage identifier or ~
            actionId: addUIDs
      -
        stage: clean                             # Anything matching regex [a-zA-Z][a-zA-Z0-9_]*
        dirName: "data/020---cleaned-sources"    # Any valid directory name, using a forward slash as separator
        nextStages:
          -
            nextStageId: uids                    # A different stage identifier or ~
            actionId: addUIDs
      -
        stage: uids                              # Anything matching regex [a-zA-Z][a-zA-Z0-9_]*
        dirName: "data/030---sources-with-uids"  # Any valid directory name, using a forward slash as separator
        nextStage: coded                         # A different stage identifier or ~
      -
        stage: coded                             # Anything matching regex [a-zA-Z][a-zA-Z0-9_]*
        dirName: "data/040---coded-sources"      # Any valid directory name, using a forward slash as separator
        nextStage: masked                        # A different stage identifier or ~
      -
        stage: masked                            # Anything matching regex [a-zA-Z][a-zA-Z0-9_]*
        dirName: "data/090---masked-sources"     # Any valid directory name, using a forward slash as separator
        nextStage: ~                             # A different stage identifier or ~

    actions:
      -
        actionId: addUIDs                        # String, referenced from the stages
        language: R                              # Language, has to be matched to interpreter
        dependencies: rock                       # Dependencies to be loaded before running the script
        script: |                                # Literal block style string
          rock::prepend_ids_to_sources(
            input = {currentStage::dirName},
            output = {nextStage::dirName}
          );
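

Since stages refer to other stages and to actions by identifier, a consumer may want to confirm that every reference resolves. The sketch below shows one way to do that in Python on an already-parsed workflow map. The function name find_dangling_references is hypothetical, the key is spelled nextStageId as in the description of the workflow map, and the workflow dictionary is an abridged copy of the example that keeps only stages using the nextStages form.

```python
def find_dangling_references(workflow):
    """List nextStageId and actionId values that are not defined."""
    pipeline = workflow.get("pipeline") or []
    stage_ids = {stage["stage"] for stage in pipeline}
    action_ids = {action["actionId"] for action in workflow.get("actions") or []}

    problems = []
    for stage in pipeline:
        for link in stage.get("nextStages") or []:
            next_id = link.get("nextStageId")
            # ~ (None) is allowed and means there is no next stage.
            if next_id is not None and next_id not in stage_ids:
                problems.append(
                    f"stage '{stage['stage']}': unknown nextStageId '{next_id}'"
                )
            action_id = link.get("actionId")
            if action_id is not None and action_id not in action_ids:
                problems.append(
                    f"stage '{stage['stage']}': unknown actionId '{action_id}'"
                )
    return problems


workflow = {
    "pipeline": [
        {"stage": "raw", "dirName": "data/010---raw-sources",
         "nextStages": [
             {"nextStageId": "clean", "actionId": "cleanSource"},
             {"nextStageId": "uids", "actionId": "addUIDs"},
         ]},
        {"stage": "clean", "dirName": "data/020---cleaned-sources",
         "nextStages": [{"nextStageId": "uids", "actionId": "addUIDs"}]},
        {"stage": "uids", "dirName": "data/030---sources-with-uids"},
    ],
    "actions": [{"actionId": "addUIDs", "language": "R"}],
}
dangling = find_dangling_references(workflow)
```

The example file defines only the addUIDs action, so this check reports cleanSource as a reference without a matching entry in actions.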