Data Engineer
This role combines clinical data expertise with strong data engineering skills to build and manage data pipelines that ensure the integrity and accessibility of clinical trial data.
Responsibilities
Productionize and monitor pipelines and models; collaborate on CI/CD, testing, and user feedback.
Implement ETL patterns (medallion architecture), ensuring data provenance, validation, and versioning.
Participate in the continuous improvement and validation of existing pipelines.
Ensure clinical concepts are accurately represented and harmonized across various data models (CDISC SDTM/ADaM, OMOP, HL7); contribute to mapping and transformation logic.
Develop NLP models for entity and relation extraction (e.g., inclusion/exclusion criteria, demographics, endpoints, study design).
Build automated pipelines to ingest registry and publication data, converting it to tabular, queryable datasets.
Co-design the benchmarking data model with end users and map extracted information to standardized terminologies.
Integrate human-in-the-loop review, confidence scoring, and vocabulary/units normalization.
Key Requirements
Strong data engineering skills: Databricks, Spark, Delta Lake, SQL, ETL design and orchestration.
Familiarity with clinical trial concepts (inclusion/exclusion criteria, endpoints, demographics) and biomedical terminologies.
Practical experience with data modeling and working with end users to define requirements.
Experience with CI/CD, testing frameworks, and monitoring for data pipelines and ML models.
Experience with NLP for information extraction from scientific text (publications, registries).
Good communication and collaboration skills.
Nice to Have
Experience with OMOP / CDISC or other clinical data standards.
Familiarity with vocabulary services (OHDSI/Athena, UMLS) and NLP toolkits (spaCy, Hugging Face, AllenNLP).
Prior work in pharma or clinical research data environments.