Post-Processing Steps

Note: This vignette is currently in development and subject to change.

Introduction

After your pipeline runs successfully via execStudyPipeline(), raw analytical outputs are saved in exec/results/[database]/[version]/ organized by task. Post-processing orchestrates the merge and quality control of these results across multiple databases and exports them to dissemination/export/merge/v{version}/.

The main post-processing step is orchestrating the pipeline export, which:

Merges result files across all databases for each task
Generates reference files (cohortKey, databaseInfo)
Reviews the schema of all exported files
Validates cohort result completeness
Generates execution metadata and QC reports

The Post-Processing Workflow

Step 1: Orchestrate the Pipeline Export

After your production pipeline completes, call orchestratePipelineExport() to merge and validate all results:

library(picard)

# Orchestrate export for version 1.0.0 across two databases
orchestratePipelineExport(
  pipelineVersion = "1.0.0",
  dbIds = c("omop_cdm", "another_cdm"),
  resultsPath = here::here("exec/results"),
  exportPath = here::here("dissemination/export/merge"),
  cohortsFolderPath = here::here("inputs/cohorts")
)

This function:

Discovers all tasks for the specified version
Calls importAndBind() for each task to merge results across databases
Creates reference files: cohortKey.csv, databaseInfo.csv
Reviews file schemas with reviewExportSchema()
Validates cohorts with validateCohortResults()
Generates QC reports: qc_cohortValidation.csv, qc_processMeta.csv

Output Folder Structure

After orchestration completes, results are organized in:

dissemination/export/merge/v1.0.0/
├── 00_buildCohorts.csv          # Cohort counts merged across databases
├── 01_descriptiveStats.csv      # Task 1 results merged
├── 02_primaryAnalysis.csv       # Task 2 results merged
├── cohortKey.csv                # Reference: cohort IDs and labels
├── databaseInfo.csv             # Reference: which databases included
├── schema_review.csv            # Column-level schema of all files
├── qc_cohortValidation.csv      # QC: completeness check
└── qc_processMeta.csv           # QC: execution metadata

Reference Files

cohortKey.csv

Maps cohort IDs to labels for interpretation. Pulled from your cohort manifest if available:

cohortId,cohortLabel,cohortTags
1,Type 2 Diabetes,phenotype
2,CVD Comparator,phenotype
3,MI Outcome,outcome

databaseInfo.csv

Documents which databases were included in the merge:

databaseId,databaseName,databaseLabel,cohortTable
omop_cdm,database_1,OMOP CDM - Site A,cohort
uk_biobank,database_2,UK Biobank Linked EHR,cohort_table

This helps identify which databases contributed to each result.

schema_review.csv

Inspects the structure of all exported CSV files. Useful for identifying:

Column naming inconsistencies
Unexpected data types
Columns that need transformation

fileName,columnName,dataType,rowCount
cohortCounts.csv,databaseId,character,5
cohortCounts.csv,cohortId,numeric,5
cohortCounts.csv,cohortEntries,numeric,5
descriptiveStats.csv,databaseId,character,120
descriptiveStats.csv,cohortId,numeric,120
descriptiveStats.csv,ageGroup,character,120

Quality Assurance

qc_cohortValidation.csv

Validates that all cohorts in your cohort key have corresponding results. Flags:

OK: Cohort found in results with non-zero counts
ZeroCount: Cohort found but has zero entries or subjects (non-enumerated)
Missing: Cohort in manifest but not found in results

cohortId,label,validationStatus,details
1,Type 2 Diabetes,OK,entries: 245897, subjects: 123456
2,CVD Comparator,OK,entries: 189234, subjects: 98765
3,MI Outcome,ZeroCount,entries: 0, subjects: 0

Use this to identify:

Cohorts that need investigation (why zero?)
Cohorts that didn’t enumerate (possible definition issues)

qc_processMeta.csv

Records execution metadata for reproducibility:

executionTimestamp,pipelineVersion,codeCommitSha,lockfileHash,databasesIncluded,databaseCount,tasksProcessed,totalFilesExported,totalRowsMerged,qcStatus
2024-03-15 14:32:00,1.0.0,abc1234def5678,hash123,OMOP CDM | UK Biobank,2,3,9,542870,OK

Tracks:

When the export ran
What version was exported
Code state (git commit SHA for reproducibility)
Environment (renv lockfile hash for dependency reproducibility)
Scope (which databases, how many tasks)
Results (files and rows merged)
QC Status (OK, HasWarnings, or other)

Advanced: Manual Import and Binding

If you need to merge results for a specific task only, use importAndBind():

library(picard)

# Merge just the descriptive statistics task across all databases
importAndBind(
  version = "1.0.0",
  taskName = "01_descriptiveStats",
  dbIds = c("omop_cdm", "another_cdm"),
  resultsPath = here::here("exec/results"),
  exportPath = here::here("dissemination/export/merge")
)

This combines all CSV files from that task across databases and adds a databaseId column to identify the source.

Advanced: Schema Review

To examine file structure without full orchestration:

library(picard)

# Review schema of exported files
schema <- reviewExportSchema(
  exportPath = here::here("dissemination/export/merge/v1.0.0")
)

# Check for specific data types
character_cols <- schema[schema$dataType == "character", ]

Advanced: Cohort Validation

To validate cohort results independently:

library(picard)

# Validate cohorts in exported results
validation <- validateCohortResults(
  exportPath = here::here("dissemination/export/merge/v1.0.0"),
  resultsFileName = "cohortCounts.csv"
)

# View validation results
print(validation)

# Check for issues
issues <- validation[validation$validationStatus != "OK", ]

Next Steps

Run orchestration: Call orchestratePipelineExport() after production pipeline completes
Review QC reports: Check qc_cohortValidation.csv and qc_processMeta.csv
Examine schema: Use schema_review.csv to understand data structure
Handle issues: If cohorts are missing or zero, investigate in developing_the_pipeline
Continue dissemination: Use exported results for publication or further analysis