Introduction

Picard is built on the Ulysses framework, which follows the philosophy that real-world evidence (RWE) studies should be organized and versioned as software projects. Just as databases require a schema to organize data, RWE studies benefit from a standard directory structure to organize code, inputs, analyses, and outputs. This standardization enables:

  • Reproducibility: Anyone can understand the project structure at a glance
  • Collaboration: Multiple contributors follow the same conventions
  • Automation: Consistent organization enables reliable workflows and tooling
  • Version control: Clear separation of concerns makes git history more meaningful

Picard uses the Ulysses repository structure, adding specialized directories and configuration for cohort-based studies, Evidence Generation Plans, and results dissemination. This vignette describes the standard Ulysses repository structure created when you initialize a project using launchUlyssesRepo().

Pipeline Workflow and Folder Organization

The Ulysses repository organizes folders to match the flow of a real-world evidence study:

┌─────────────────────────────────────────────────────────────────┐
│ 1. META & CONFIG                                                │
│    (config.yml, README.md, NEWS.md, main.R, test_main.R)       │
│         ↓                                                        │
│ 2. INPUTS                                                       │
│    inputs/cohorts/ + inputs/conceptSets/                      │
│    (Define phenotypes, cohorts, covariates)                    │
│         ↓                                                        │
│ 3. ANALYSIS                                                     │
│    analysis/tasks/ + analysis/src/ + analysis/migrations/     │
│    (Execute analyses, generate statistics)                      │
│         ↓                                                        │
│ 4. EXECUTION OUTPUT                                             │
│    exec/results/[database]/[version]/[task]/                  │
│    (Raw results by task, database, version)                     │
│         ↓                                                        │
│ 5. DISSEMINATION                                                │
│    dissemination/export/ + dissemination/quarto/               │
│    (Format results, create Study Hub website)                   │
└─────────────────────────────────────────────────────────────────┘

Workflow sequence:

  1. Initialize with metadata (config.yml defines databases and credentials)
  2. Load or create inputs (cohorts and concept sets)
  3. Execute analysis tasks (code in analysis/tasks runs using inputs)
  4. Raw results are written to exec/results, organized by database and version
  5. Post-processing (migrations) and formatting (Excel, CSV)
  6. Dissemination via the Study Hub (Quarto website) and formatted exports

Ulysses Repository Outline

A newly initialized Picard study has the following high-level structure:

study-repository/
├── analysis/          # Study analysis code and workflows
├── inputs/            # Cohort definitions and concept sets
├── dissemination/     # Results and evidence outputs
├── exec/              # Execution artifacts and logs
├── docs/              # Generated documentation (pkgdown)
├── extras/            # Reference scripts and development files
├── config.yml         # Project configuration
├── main.R             # Production pipeline execution script
├── README.md          # Project overview
├── NEWS.md            # Release notes and changelog
├── .gitignore         # Git configuration
└── study.Rproj        # RStudio project file

Elements of the Ulysses Repository

Vital Files

These files are essential to the project and should be maintained throughout its lifecycle.

study.Rproj

An RStudio project file that configures the working directory and development environment. The Ulysses structure uses this to ensure consistent behavior across team members. When you open this file in RStudio, the working directory is automatically set to the project root. If you use VS Code instead, Picard supports project detection through .code-workspace files and agent instructions.

.gitignore

Prevents sensitive files and intermediate outputs from being committed to version control. The standard .gitignore for a Ulysses repository includes:

  • renv/ - Local package library snapshots (renv-specific files)
  • exec/ - Execution results and temporary files
  • .env - Environment variables and credentials
  • *.log - Log files
  • RStudio temporary files (.Rhistory, .RData, etc.)

This ensures that your git repository contains only source code and documentation, not generated outputs or sensitive credentials.

README.md

The README serves as the project’s front door, communicating key study information: description, objectives, key personnel, status, and links to vital resources. The README includes:

  • Study metadata: Title, ID, start/end dates, study type, therapeutic area
  • Status badges: Current version and project status
  • Tags: Keywords for searching similar studies within your organization
  • Links: References to ATLAS, protocols, publications, and repository

Ulysses auto-generates a README template when you launch a project. You should customize the Study Description section to clearly explain your research question and study design, following your organization’s documentation standards.

Example README.md structure (auto-generated):

# Cardiovascular Risk Assessment in Diabetes (Id: myStudyRepo)

<!-- badge: start -->

![Study Status: Started](https://img.shields.io/badge/Study%20Status-Started-blue.svg)
![Version: 0.0.1](https://img.shields.io/badge/Version-0.0.1-yellow.svg)

<!-- badge: end -->

## Study Information

- Study Id: myStudyRepo
- Study Title: Cardiovascular Risk Assessment in Diabetes
- Study Start Date: 2026-04-07
- Expected Study End Date: 2028-04-07
- Study Type: Cohort Study
- Therapeutic Area: Cardiovascular/Endocrinology

## Study Description

Add a short description about the study!

## Contributors

- Jane Doe, Institution Name

## Study Links

- [ATLAS Cohort Definitions](https://atlas.example.com/)
- [Study Repository](https://github.com/org/repo)

Update the badges and version number as your study progresses.

NEWS.md

Tracks changes across study versions. When you run a production pipeline with execStudyPipeline(), the Ulysses workflow automatically updates NEWS.md with version information and change summaries. This creates an audit trail of what changed in each release.

Format follows semantic versioning conventions:

## Version 1.2.0 (2026-04-07)

- Updated cohort definitions for primary population
- Fixed missing exclusion criteria in comparative arm
- Added sensitivity analyses for age groups

## Version 1.1.0 (2026-03-15)

- Added post-hoc subgroup analyses
- Expanded results dissemination

config.yml

Central configuration file specifying parameters needed to establish database connections. Uses YAML format with two section types:

  • default: Universal study settings (project name, version)
  • block headers: Database-specific configurations (dbms, credentials, schemas)

When you source a block header in a task file, the pipeline runs using only that block’s configuration, enabling multi-database studies.
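The block-selection behavior can be illustrated in plain R. The sketch below uses a hypothetical two-block list standing in for a parsed config.yml; the real lookup is done by the config package, which merges the default section into the selected block:

```r
# Hypothetical parsed config.yml as a nested list (values for illustration only)
cfg <- list(
  default        = list(projectName = "my_study", version = "1.0.0"),
  postgres_local = list(dbms = "postgresql", server = "localhost", port = 5432)
)

# Selecting a block: combine the universal defaults with one database block
getBlock <- function(cfg, block) {
  modifyList(cfg$default, cfg[[block]])
}

active <- getBlock(cfg, "postgres_local")
active$projectName  # inherited from the default section
active$dbms         # taken from the selected block
```

Because every block inherits the defaults, a task sourcing a single block name sees one complete, self-contained configuration, which is what makes multi-database execution possible.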

Important: Connection details vary by database system. The codebase distinguishes between:

  • Snowflake: Uses connectionString format (JDBC connection string)
  • PostgreSQL, SQL Server, MySQL, Oracle, Redshift: Use server and port fields

Protecting Credentials with !expr:

The !expr tag (from the config package) allows you to evaluate R code within the config file. This is critical for security: it enables pulling credentials from environment variables rather than storing them as plain text in config.yml.

user: !expr Sys.getenv('dbUser')         # evaluated as R code: retrieves dbUser from the environment
password: !expr Sys.getenv('dbPassword') # evaluated as R code: retrieves dbPassword from the environment

You can use any R function wrapped in !expr to retrieve credentials, including:

  • Environment variables: !expr Sys.getenv('VAR_NAME')
  • Keyring package: !expr keyring::key_get(service = 'picard', username = 'atlasUser')
  • Custom functions: !expr my_secure_fetch_credential('db_password')

Best practice: Always use !expr with a secure credential storage system. Never store passwords or connection strings as plain text in config.yml or commit them to git.

Common credentials:

  • dbms: Database type (snowflake, sql server, postgresql, mysql, oracle, redshift)
  • user: Database username (from environment variable via !expr Sys.getenv())
  • password: Database password (from environment variable)
  • databaseName: Internal reference name (snake_case with database + snapshot date)
  • databaseLabel: Pretty name for output formatting
  • cdmDatabaseSchema: Schema containing OMOP CDM tables (format: schema or database.schema)
  • vocabDatabaseSchema: Schema containing vocabulary tables (usually same as cdmDatabaseSchema)
  • workDatabaseSchema: Schema where user has write access (for cohort tables and intermediary work)
  • tempEmulationSchema: Optional schema for temp tables (snowflake, oracle)
  • cohortTable: Name of cohort table to create (default: {repoName}_{databaseName})

Example config.yml with Snowflake and PostgreSQL:

# Config File for my_study

default:
  projectName: my_study
  version: 1.0.0

# Snowflake: Uses connectionString format
snowflake_prod:
  dbms: snowflake
  connectionString: !expr Sys.getenv('dbConnectionString')
  user: !expr Sys.getenv('dbUser')
  password: !expr Sys.getenv('dbPassword')
  databaseName: snowflake_prod_20260101
  databaseLabel: Snowflake Production
  cdmDatabaseSchema: omop_schema
  vocabDatabaseSchema: omop_schema
  workDatabaseSchema: work_schema
  tempEmulationSchema: temp_schema
  cohortTable: my_study_cohorts

# PostgreSQL: Uses server/port format
postgres_local:
  dbms: postgresql
  server: localhost
  port: 5432
  user: !expr Sys.getenv('pgUser')
  password: !expr Sys.getenv('pgPassword')
  databaseName: postgres_local_20260101
  databaseLabel: PostgreSQL Local
  cdmDatabaseSchema: public
  vocabDatabaseSchema: public
  workDatabaseSchema: results
  cohortTable: my_study_pg_cohorts

Setting up environment variables:

In your .Renviron file (in project or home directory):

dbUser=your_db_username
dbPassword=your_db_password
dbConnectionString=jdbc:snowflake://account.snowflakecomputing.com:443
pgUser=postgres_user
pgPassword=postgres_password

Load before running pipeline: readRenviron("~/.Renviron")
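The retrieval step can be checked interactively. The snippet below uses Sys.setenv() to stand in for loading .Renviron (the variable values are placeholders):

```r
# Simulate what readRenviron("~/.Renviron") does: put credentials in the session
Sys.setenv(dbUser = "your_db_username", dbPassword = "your_db_password")

# This is what `!expr Sys.getenv('dbUser')` evaluates to inside config.yml
user <- Sys.getenv("dbUser")

# Guard against a missing variable: Sys.getenv() returns "" when a name is unset
stopifnot(nzchar(user))
```

A quick nzchar() check like this before running the pipeline catches a forgotten .Renviron entry early, rather than failing later at connection time.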

main.R

The primary execution script for running the study pipeline in production mode. This is the script team members run to execute the full study workflow. The Ulysses workflow generates this file based on your project configuration.

See Running the Pipeline for detailed information about main.R and the execution workflow.

Analysis Folder

Contains the study code organized into executable analysis tasks.

analysis/tasks/

Individual R scripts that perform analytical steps. Each task is a self-contained unit that:

  • Loads necessary inputs (cohorts, concept sets, configuration)
  • Performs a specific analytical step
  • Saves results to a standardized output location

Note: Cohort generation is a built-in Picard feature handled automatically when you run the pipeline. You do not create a cohort generation task file. Tasks start after cohorts are generated. See Running the Pipeline for details on pipeline execution and cohort generation.

Tasks are named sequentially (01_, 02_, etc.) and executed in order:

analysis/tasks/
├── 01_descriptiveStats.R
├── 02_primaryAnalysis.R
├── 03_sensitivityAnalysis.R

Each task is independent and can be tested individually during development using testStudyTask().

Task File Format and Validation:

Picard enforces a standardized task file structure to ensure consistency and enable add-on modules to import external task files. Use makeTaskFile() to create new tasks—it automatically generates files in the correct format.

Required task file sections (validated by validateStudyTask()):

  • A. Meta: Metadata about the task (title, author, description, purpose)
  • B. Dependencies: Input files or objects required by the task
  • C. Connection Settings: Configuration block name and pipeline version (uses template variables !||configBlock||! and !||pipelineVersion||!)
  • D. Task Settings: Setup section where you:
    • Create executionSettings <- createExecutionSettingsFromConfig(configBlock = configBlock)
    • Create outputFolder <- setOutputFolder(executionSettings = executionSettings, pipelineVersion = pipelineVersion, taskName = "task_name")
  • E. Script: Actual analysis code that performs the task

Validation Rules:

All task files must pass validation before execution:

  1. File must exist and be readable
  2. All five sections (A, B, C, D, E) must be present
  3. Template variables !||configBlock||! and !||pipelineVersion||! must be defined in section C
  4. Section D must create an executionSettings object via createExecutionSettingsFromConfig()
  5. Section D must create an outputFolder object via setOutputFolder()
  6. Section E must contain actual code (not just comments or template placeholders)

Why Validation Matters:

Standardized task files enable:

  • Consistency: All tasks follow the same structure across your study
  • Portability: Add-on modules and packages can import external tasks that conform to the format
  • Reproducibility: Clear documentation of dependencies and configuration
  • Automation: The pipeline can reliably execute tasks knowing they meet structural requirements

Example: If you have a task that follows the required format, you could import it from another Picard study or an add-on package rather than rewriting it.

analysis/src/

Reusable functions and helpers for custom logic specific to your study. When external packages (like CohortPrevalence, FeatureExtraction, SelfControlledCaseSeries) provide the required functionality, use those directly—no need for custom helpers. Use analysis/src/ only for functions you’ve written that don’t have a package namespace.

Organize by functionality:

analysis/src/
├── cohortHelpers.R       # Custom cohort manipulation or validation functions
├── analysisHelpers.R     # Custom statistical or analysis utilities
├── outputHelpers.R       # Custom table formatting or export functions
└── diagnostics.R         # Custom validation or diagnostic functions

Example:

  • ✅ Use library functions: library(CohortPrevalence) → provides computePrevalence()
  • ✅ Write custom helper: you create a customPrevalence() function for a bespoke prevalence calculation → save it in prevHelpers.R
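A helper saved in analysis/src/ is just an ordinary R function. Here is a minimal sketch of what the hypothetical customPrevalence() mentioned above might look like; the name, signature, and formula are all assumptions for illustration:

```r
# analysis/src/prevHelpers.R (hypothetical helper, for illustration)
# Crude prevalence per 1,000 persons from a case count and a denominator
customPrevalence <- function(nCases, nPersons, per = 1000) {
  stopifnot(nPersons > 0, nCases >= 0, nCases <= nPersons)
  round(nCases / nPersons * per, 2)
}

customPrevalence(nCases = 150, nPersons = 12000)  # 12.5 per 1,000
```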

Important: Functions in src/ must be sourced in the task files that use them. In your task’s section B (Dependencies), source the helper files:

# Section B. Dependencies
source(here::here("analysis/src/cohortHelpers.R"))
source(here::here("analysis/src/outputHelpers.R"))

This ensures all dependencies are explicit and documented, making task execution more transparent and reproducible.

analysis/migrations/

Post-processing scripts that clean and reshape pipeline results for dissemination. Migrations are numbered to correspond with their source analysis tasks (e.g., 02_migrate_surveillance.R cleans output from task 02_surveillance.R).

Purpose:

After orchestratePipelineExport() binds raw results with metadata, migrations handle data wrangling tasks that weren’t necessary during execution:

  • Aggregating: Combine results across subgroups or time periods
  • Standardizing: Apply demographic weighting or statistical adjustments
  • Pivoting: Reshape long format to wide for publication tables
  • Deriving: Calculate new metrics (confidence intervals, effect size categories, standardized rates)
  • Filtering: Remove rows below minimum cell counts or meeting exclusion criteria

Workflow:

Task 02: Pipeline Execution
    ↓
orchestratePipelineExport() → raw results to dissemination/export/merge/
    ↓
Migration 02: Data Wrangling (02_migrate_surveillance.R)
    ↓
dissemination/export/pretty/ → finalized, publication-ready results

Example Structure:

analysis/migrations/
├── 02_migrate_surveillance.R
│   # Aggregate surveillance counts by age/sex strata
│   # Apply census weighting for standardization
│   # Generate crude and standardized prevalence/incidence rates
├── 03_migrate_comparative.R
│   # Pivot comparative analysis results wide
│   # Calculate confidence intervals and p-values
└── 05_migrate_sensitivity.R
    # Combine sensitivity analysis variants into summary table

Each migration reads from dissemination/export/merge/ (raw exported results), performs transformations, and writes cleaned data to dissemination/export/pretty/ for final dissemination.
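The pattern can be sketched in base R. The file names, columns, and minimum cell count of 5 below are all hypothetical; the real raw files come from orchestratePipelineExport():

```r
# Hypothetical migration sketch: read merged results, suppress small cells, write pretty output
mergeDir  <- file.path(tempdir(), "dissemination", "export", "merge")
prettyDir <- file.path(tempdir(), "dissemination", "export", "pretty")
dir.create(mergeDir, recursive = TRUE, showWarnings = FALSE)
dir.create(prettyDir, recursive = TRUE, showWarnings = FALSE)

# Stand-in for a raw exported result file
write.csv(
  data.frame(stratum = c("18-44", "45-64", "65+"), n = c(3, 120, 87)),
  file.path(mergeDir, "task_02_results.csv"), row.names = FALSE
)

# Filtering step: drop rows below the minimum cell count before dissemination
raw   <- read.csv(file.path(mergeDir, "task_02_results.csv"))
clean <- raw[raw$n >= 5, ]
write.csv(clean, file.path(prettyDir, "surveillance_counts.csv"), row.names = FALSE)
```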

Inputs Folder

This folder stores cohort definitions and concept sets that define the study populations and other components for the analysis.

inputs/cohorts/

Cohort definitions defining the study populations. A cohort is a set of persons who satisfy one or more inclusion criteria for a duration of time. In OHDSI research, cohorts are the foundation of observational studies.

Cohort Definition Types:

  1. CIRCE-Based Definitions (JSON): Standard OHDSI approach
    • Uses CIRCE-BE Java library for standardized representation
    • Stored as JSON files in json/ folder
    • Ensures consistent serialization to SQL: same definition always generates identical populations
    • Typically imported from ATLAS
  2. Custom SQL Definitions: For specialized logic
    • Manual SQL queries in sql/ folder
    • Used when CIRCE cannot express the logic you need
  3. Dependency-Based Cohorts: Derived from existing cohorts
    • Subsets of existing cohorts (apply additional inclusion criteria)
    • Unions of multiple cohorts (combine populations)
    • Stored in sql/ folder or as derivative definitions

Folder Structure:

inputs/cohorts/
├── json/
│   ├── 001_primaryPopulation.json     # CIRCE-based from ATLAS
│   ├── 002_comparativeArm.json        # CIRCE-based from ATLAS
│   └── 003_outcomeDefinition.json     # CIRCE-based from ATLAS
├── sql/
│   ├── 004_primarySubset.sql          # Subset of cohort 001
│   └── 005_combinedPopulation.sql     # Union of cohorts 001 & 002
├── cohortsLoad.csv                    # Metadata index for cohort enrichment
└── cohortManifest.sqlite              # Provenance & metadata tracking database

cohortsLoad.csv:

A CSV file with metadata for each cohort. When loadCohortManifest() is called, this file is used to enrich CohortDef objects by matching file_name with actual cohort files. Used to track where cohorts came from and organize them with tags.

Columns:

  • atlasId: ATLAS cohort ID (integer, e.g., 1, 42)
  • label: Display name (character, e.g., "Type 2 Diabetes patients")
  • category: Broad grouping (character, e.g., "Disease Populations")
  • subCategory: Sub-grouping (character, optional, e.g., "Endocrine")
  • file_name: Relative path to the JSON file (character, e.g., "json/t2dm_patients.json")

Example:

atlasId,label,category,subCategory,file_name
1,Type 2 Diabetes,Disease Populations,Endocrine,json/t2dm_patients.json
2,Diabetes Complications,Disease Populations,Endocrine,json/diabetes_complications.json

When loaded, these metadata fields are converted to tags on each cohort for later querying (e.g., $getCohortsByTag(), $getCohortsByLabel()).
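The metadata-to-tag mapping can be sketched in base R. This is an illustration of the serialized tag format shown in cohortManifest.sqlite, not the actual enrichment code, which lives in loadCohortManifest():

```r
# Rows of cohortsLoad.csv as a data frame (values from the example above)
loadCsv <- data.frame(
  atlasId     = c(1, 2),
  label       = c("Type 2 Diabetes", "Diabetes Complications"),
  category    = c("Disease Populations", "Disease Populations"),
  subCategory = c("Endocrine", "Endocrine"),
  file_name   = c("json/t2dm_patients.json", "json/diabetes_complications.json")
)

# Serialize metadata columns into the pipe-delimited tag string format
makeTags <- function(row) {
  paste0("atlasId: ", row$atlasId,
         " | category: ", row$category,
         " | subCategory: ", row$subCategory)
}

makeTags(loadCsv[1, ])
# "atlasId: 1 | category: Disease Populations | subCategory: Endocrine"
```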

cohortManifest.sqlite:

SQLite database created/managed by CohortManifest class. Contains cohort_manifest table tracking all cohort metadata:

| Column     | Type                               | Purpose                                                                                                  |
|------------|------------------------------------|----------------------------------------------------------------------------------------------------------|
| id         | INTEGER PRIMARY KEY                | Sequential cohort ID assigned by Picard                                                                  |
| label      | TEXT NOT NULL                      | Cohort display name                                                                                      |
| tags       | TEXT                               | Serialized tags (e.g., "atlasId: 1 \| category: Disease Populations \| subCategory: Endocrine")          |
| filePath   | TEXT NOT NULL                      | Full path to the cohort definition file                                                                  |
| hash       | TEXT NOT NULL                      | MD5 hash of the SQL for change detection                                                                 |
| cohortType | TEXT DEFAULT 'circe'               | Type: 'circe' (ATLAS JSON), 'sql' (custom), 'subset', 'union', 'complement' (dependency-based)           |
| timestamp  | DATETIME DEFAULT CURRENT_TIMESTAMP | When the cohort was added to the manifest                                                                |
| status     | TEXT DEFAULT 'active'              | Status tracking: 'active', 'missing', 'archived'                                                         |
| deleted_at | DATETIME                           | Soft-delete timestamp if the cohort is removed                                                           |

Workflow:

  1. Create cohortsLoad.csv with metadata for your cohorts (use createBlankCohortsLoadFile())
  2. Import ATLAS cohort JSON definitions to json/ folder (use importAtlasCohorts())
  3. Add custom SQL or dependency-based cohorts to sql/ (e.g., subsets, unions)
  4. Call loadCohortManifest() to scan directories and enrich with cohortsLoad.csv metadata
  5. First load creates cohortManifest.sqlite database; subsequent loads verify file hashes
  6. Use CohortManifest methods to query cohorts: $getCohortById(1), $getCohortsByTag("category: Disease Populations")
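The hash verification in step 5 can be sketched with tools::md5sum() from base R's tools package; whether the manifest uses exactly this routine is an assumption, but the MD5-based change-detection idea is the same:

```r
# Sketch of hash-based change detection on a cohort definition file
cohortFile <- file.path(tempdir(), "001_primaryPopulation.json")
writeLines('{"ConceptSets": []}', cohortFile)

storedHash <- unname(tools::md5sum(cohortFile))  # hash recorded at first load

# Later load: recompute and compare; a mismatch means the definition changed
unchanged <- identical(unname(tools::md5sum(cohortFile)), storedHash)
unchanged  # TRUE until the file is edited
```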

See also: Loading Inputs for detailed guidance on creating, importing, and managing cohort manifests.

inputs/conceptSets/

Stores CIRCE-based concept set definitions for identifying cohorts and extracting covariates:

inputs/conceptSets/
├── json/
│   ├── exposure_antidiabetic.json        # Drug exposure concept set
│   ├── outcome_mi.json                   # Condition outcome concept set
│   └── covariate_hypertension.json       # Covariate measurement concept set
├── conceptSetsLoad.csv                   # Metadata index for concept set enrichment
└── conceptSetManifest.sqlite             # Provenance & metadata tracking database

conceptSetsLoad.csv:

Similar to cohortsLoad.csv, this CSV provides metadata for enriching ConceptSetDef objects. Columns:

  • atlasId: ATLAS concept set ID (integer, e.g., 456, 789)
  • label: Display name (character, e.g., "Antidiabetic medications")
  • category: Broad grouping (character, e.g., "Medications" or "Diagnoses")
  • subCategory: Optional sub-grouping (character, e.g., "Endocrine Drugs")
  • domain: OMOP clinical domain (required, character):
    • drug_exposure - Medication/drug concept sets
    • condition_occurrence - Diagnosis concept sets
    • measurement - Lab/test result concept sets
    • procedure - Medical procedure concept sets
    • observation - Observation concept sets
    • device_exposure - Device/equipment concept sets
    • visit_occurrence - Visit type concept sets
    • init - Not yet classified (placeholder)
  • sourceCode: Rarely used; TRUE if the concept set represents source codes instead of standard concepts (character: TRUE or FALSE)
  • file_name: Relative path to the JSON file (character, e.g., "json/hypertension.json")

Example:

atlasId,label,category,subCategory,domain,sourceCode,file_name
456,Antidiabetic Medications,Medications,Endocrine,drug_exposure,FALSE,json/exposure_antidiabetic.json
789,Acute MI,Diagnoses,Cardiovascular,condition_occurrence,FALSE,json/outcome_mi.json
1001,Hypertension,Diagnoses,Cardiovascular,condition_occurrence,FALSE,json/covariate_hypertension.json

When loaded, metadata fields are converted to tags on each concept set for querying (e.g., $getConceptSetsByTag(), $getConceptSetsByLabel()).

conceptSetManifest.sqlite:

SQLite database created/managed by ConceptSetManifest class. Contains concept_set_manifest table:

| Column     | Type                               | Purpose                                                                                         |
|------------|------------------------------------|-------------------------------------------------------------------------------------------------|
| id         | INTEGER PRIMARY KEY                | Sequential concept set ID assigned by Picard                                                    |
| label      | TEXT NOT NULL                      | Concept set display name                                                                        |
| tags       | TEXT                               | Serialized tags (e.g., "atlasId: 456 \| domain: drug_exposure \| category: Medications")        |
| filePath   | TEXT NOT NULL                      | Full path to the concept set JSON file                                                          |
| hash       | TEXT NOT NULL                      | MD5 hash of the JSON for change detection                                                       |
| timestamp  | DATETIME DEFAULT CURRENT_TIMESTAMP | When the concept set was added to the manifest                                                  |
| status     | TEXT DEFAULT 'active'              | Status: 'active', 'missing', 'archived'                                                         |
| deleted_at | DATETIME                           | Soft-delete timestamp if removed                                                                |

Workflow:

  1. Create or update conceptSetsLoad.csv with metadata (use createBlankConceptSetsLoadFile())
  2. Import ATLAS concept set JSON definitions to json/ folder (use importAtlasConceptSets())
  3. Call loadConceptSetManifest() to scan directories and enrich with conceptSetsLoad.csv metadata
  4. First load creates conceptSetManifest.sqlite database; subsequent loads verify file hashes
  5. Use ConceptSetManifest methods to query: $getConceptSetById(456), $getConceptSetsByTag("domain: drug_exposure")
  6. Extract source codes and dependencies using $extractSourceCodes() (requires ExecutionSettings)

See also: Loading Inputs for detailed guidance on creating, importing, and managing concept set manifests.

Dissemination Folder

This folder organizes results, evidence outputs, and documentation for sharing with stakeholders. Contains three main subdirectories.

dissemination/documents/

Static written reports and supplementary materials (PDFs, Word docs, etc.); anything that provides narrative context or detailed information for readers but isn’t generated directly from the pipeline. Examples include:

dissemination/documents/
├── mainReport.docx
├── supplementaryMaterial.pdf
└── analysisProtocol.md

dissemination/quarto/

In Picard, the Study Hub is the primary dissemination format for sharing results with stakeholders. A Study Hub is an interactive website that communicates study objectives, analytical assumptions, and final results in a unified, professional HTML format. Quarto is the tool used to construct these Study Hub websites, enabling you to weave together narrative text, analysis code, and results into linked HTML pages that are reproducible and automatically updated when data changes.

Quarto allows you to combine narrative text, R code, and results in documents that render to HTML, PDF, or Word. Quarto files (.qmd) contain markdown text interspersed with code chunks that execute when the document is rendered, automatically embedding results directly into the report. This ensures your documentation always reflects the latest data and findings.

Pre-formatted on initialization: When you initialize a Picard project, the dissemination/quarto/ folder is automatically set up with a standard Study Hub structure including template files:

dissemination/quarto/
├── _quarto.yml           # Quarto configuration for website
├── index.qmd             # Landing page (created from README.md)
├── news.qmd              # News/changelog page (created from NEWS.md)
├── egp.qmd               # Evidence Generation Plan template
├── results.qmd           # Results template for data integration
├── style.css             # Custom CSS styling for the website
├── R/                    # Helper functions directory
├── images/               # Images and figures directory
└── _site/                # Rendered HTML output (generated upon build)

Use buildStudyHub() to render your documentation into the _site/ folder, which creates the final interactive website.

dissemination/export/

Results exported from the pipeline via orchestratePipelineExport() and processed into various formats for dissemination:

dissemination/export/
├── merge/                 # Raw merged results from orchestratePipelineExport()
│   ├── cohortKey.csv      # Cohort definitions (generated)
│   ├── databaseInfo.csv   # Database metadata (generated)
│   ├── schema_review.csv  # Schema validation (generated)
│   ├── task_01_results.csv
│   ├── task_02_results.csv
│   └── task_03_results.csv
├── pretty/                # Formatted results from migration scripts (Excel, CSV, etc.)
│   ├── mainResults.xlsx
│   ├── sensitivity_analyses.xlsx
│   └── suppTable1_demographics.xlsx
└── studyHubOutput/        # Files sourced by the Study Hub for dynamic rendering
    ├── table1_demographics.csv
    ├── figure1_incidence.csv
    └── results_summary.json

Workflow:

  1. Merge phase: orchestratePipelineExport(pipelineVersion, dbIds) reads raw results from exec/results/ across all databases and tasks for a given version. It combines results into long-format CSV files in dissemination/export/merge/, along with reference files (cohortKey, databaseInfo, schema_review).

  2. Format phase: Migration scripts (e.g., 02_migrate_surveillance.R) read from dissemination/export/merge/ and perform data wrangling, reshaping, and formatting. Output depends on use case:

    • For publication/reports: Write formatted Excel/CSV files to dissemination/export/pretty/
    • For Study Hub: Write data tables and figures to dissemination/export/studyHubOutput/
  3. Dissemination:

    • Files in pretty/ are copied directly to publications or referenced by static Quarto reports
    • Files in studyHubOutput/ are sourced dynamically by Quarto files in dissemination/quarto/ to create interactive tables and figures in the Study Hub website
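The merge phase in step 1 amounts to stacking the same task's result files from every database into one long-format table. A base-R sketch with a hypothetical file layout (the real work, including the reference files, is done by orchestratePipelineExport()):

```r
# Hypothetical layout: the same task's results from two databases
resultsDir <- file.path(tempdir(), "exec", "results")
for (db in c("primary_db", "secondary_db")) {
  taskDir <- file.path(resultsDir, db, "1.0.0", "02_surveillance")
  dir.create(taskDir, recursive = TRUE, showWarnings = FALSE)
  write.csv(data.frame(cohortId = 1, count = sample(100, 1)),
            file.path(taskDir, "incidenceCounts.csv"), row.names = FALSE)
}

# Merge step: read each database's file, tag it with its database id, stack long
files  <- list.files(resultsDir, pattern = "incidenceCounts.csv",
                     recursive = TRUE, full.names = TRUE)
merged <- do.call(rbind, lapply(files, function(f) {
  # Database id is three levels up: {db}/{version}/{task}/file.csv
  cbind(databaseId = basename(dirname(dirname(dirname(f)))), read.csv(f))
}))
```

Keeping a databaseId column in the long format is what lets downstream migrations and the Study Hub compare results across databases.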

Exec Folder

This folder contains execution artifacts and logs from running the pipeline. Results are organized by database, pipeline version, and task to support parallel development and multi-database studies.

exec/results/

Raw results from each pipeline execution, organized hierarchically:

exec/results/
├── primary_db/                         # Created per config.yml database ID
│   ├── 1.0.0/                          # Versioned results (semantic versions)
│   │   ├── 00_buildCohorts/
│   │   │   ├── cohortCounts.csv
│   │   │   ├── cohortResults.csv
│   │   │   └── buildLog.txt
│   │   ├── 01_descriptiveStats/
│   │   │   ├── demographics.csv
│   │   │   └── flowchart.csv
│   │   ├── 02_surveillance/
│   │   │   └── incidenceCounts.csv
│   │   └── 03_primaryAnalysis/
│   │       └── modelResults.csv
│   ├── 1.1.0/                          # Another versioned run
│   │   ├── 00_buildCohorts/
│   │   └── ...
│   └── dev/                            # Development/test results (temporary)
│       ├── 00_buildCohorts/
│       └── ...
└── secondary_db/
    ├── 1.0.0/
    └── dev/

How results are organized:

  • Database folders: One folder per database configured in config.yml. The database folder name is the snake_case version of the databaseName from config (e.g., optum_dod/)
  • Version folders: Within each database, results are organized by pipelineVersion (e.g., 1.0.0/, 1.0.1/). A special dev/ folder holds temporary test results
  • Task folders: Within each version, one folder per task created by setOutputFolder() (e.g., 00_buildCohorts/, 01_descriptiveStats/)

Task file execution: Each task script in analysis/tasks/ creates an output folder via:

outputFolder <- setOutputFolder(
  executionSettings = executionSettings,
  pipelineVersion = pipelineVersion,
  taskName = "00_buildCohorts"
)

This creates: exec/results/{databaseName}/{pipelineVersion}/{taskName}/ where task results (CSVs, logs, etc.) are written.
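The path convention can be sketched in base R; this assumes setOutputFolder() resolves the path roughly as follows and is an illustration, not the package's implementation:

```r
# Sketch of the exec/results/{databaseName}/{pipelineVersion}/{taskName}/ convention
buildOutputPath <- function(databaseName, pipelineVersion, taskName,
                            root = "exec/results") {
  path <- file.path(root, databaseName, pipelineVersion, taskName)
  dir.create(path, recursive = TRUE, showWarnings = FALSE)
  path
}

buildOutputPath("primary_db", "1.0.0", "00_buildCohorts")
# "exec/results/primary_db/1.0.0/00_buildCohorts"
```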

Development vs. Production:

  • Development mode (test_main.R): Uses pipelineVersion = "dev", so results go to exec/results/{db}/dev/. Results are temporary and don't interfere with versioned runs.
  • Production mode (main.R): Uses semantic versioning (e.g., pipelineVersion = "1.0.0"), so results go to exec/results/{db}/1.0.0/. Results are retained for post-processing and archival.

exec/logs/

Pipeline execution logs with timestamps and error details:

exec/logs/
├── picard_log_1.0.0_20260407_093045.txt
├── picard_log_dev_20260407_094512.txt
├── task_run_history.csv
└── picard_log_1.0.1_20260408_140230.txt

Extras Folder

Reference scripts and development artifacts that support the study but aren’t part of the core pipeline.

extras/test_main.R

Development variant of main.R for rapid iteration during development. Uses testStudyPipeline() instead of execStudyPipeline(), skips validations, and places results in the dev/ output folder.

See Running the Pipeline for details on using test_main.R.

extras/setup.R

Optional setup script for configuring the local environment, installing dependencies, or initializing database connections. This provides a reference for team members setting up their development environment.

extras/

Store additional reference materials, exploration scripts, or documentation that team members create during development:

extras/
├── test_main.R
├── setup.R
├── cohortValidation.R           # Exploration and diagnosis
├── conceptSetReview.R
└── AnalysisNotes.md             # Development documentation

Next Steps

Now that you understand the repository structure, the next steps depend on where you are in your project:

  1. Just initialized a project? See Launching a Study for how to configure and launch a Picard study.

  2. Setting up cohorts and concept sets? See Loading Inputs for working with cohort and concept set manifests.

  3. Ready to execute analyses? See Running the Pipeline to learn about test mode vs. production execution.

  4. Processing results? See Post-Processing Steps for organizing and exporting results.