Senior Data Engineer

Stanford University

Stanford, CA

ID: 7068373
Posted: July 28, 2020
Application Deadline: Open Until Filled

Job Description

We enter the world with a set of genes that determine many factors of our health. From that moment forward, every aspect of our lives affects our health. There is unequivocal evidence that social, environmental, and behavioral factors are far and away the strongest predictors of health and disease. Founded in 2015, the Stanford Center for Population Health Sciences (PHS) has a mission to improve population health by bringing together diverse disciplines and data to better understand and address the social and environmental determinants of health.

One of PHS’ strategic goals is to spur and support cutting-edge transdisciplinary population research that advances our mission. Toward this end, we have created a world-class data ecosystem that hosts over 100 datasets, supporting 800 researchers and ~1,500 research projects. PHS’ data team is responsible for managing PHS’ data ecosystem, advancing critical research projects and initiatives, developing novel data products and services, and acquiring new datasets that stimulate population health research. The Senior Data Engineer will be responsible for administering, monitoring, and maintaining the PHS Google Cloud environment. She/he will work as a technical bridge between PHS and our partners, who provide PHS researchers with the technical infrastructure that allows them to analyze high-risk data in a secure environment.

To advance PHS' goals, the data team is expanding its data portfolio to include additional high-value datasets (focusing on datasets that include social determinants). The Senior Data Engineer will manage a portfolio of high-value datasets, which will involve data cleaning, curation, annotation, and transformation into common data models. The integrity, transparency, and reproducibility of science increasingly depend on both the accuracy and precision of decisions made during data curation and the degree to which those decisions are annotated in a way that enables the data to be used and reused. The Senior Data Engineer must excel at curating the data and at communicating the decisions made, and how to use the data, to the hundreds of researchers who will be using the canonical datasets. PHS has a large portfolio of exciting research and data partnerships around the world, as well as working groups with different health focuses (e.g., air pollution and health, health disparities, mental health) that vary in scope, complexity, and topic. She/he will serve as the data lead and provide technical support to some of the working groups, particularly for teams requiring machine learning or NLP methods.

Duties include:
Install, configure, and support relational database management systems (RDBMS) and related software to resolve highly complex and/or unique issues without precedent and/or structure. Work on databases using more advanced database administration concepts and modern (cloud) tools.
Design, develop, and maintain highly available databases.
Take responsibility for database design and performance optimization, backup and recovery strategies and implementation, as well as monitoring the overall health of the database environment.
Execute complete database solutions from evaluation to implementation.
Provide ongoing system administration and technical infrastructure support.
Review the physical and logical design of databases for optimal database structures, performance tuning, security, and database backup/recovery. Plan and implement proactive and reactive performance analysis, monitoring, troubleshooting, and capacity planning.
Evaluate and test marketplace tools and utilities, including system integration and automation, which enhance server functionality and promote the development of RDBMS applications.
Develop and maintain efficient and appropriate connectivity solutions between various campus databases to ensure necessary data is available as needed.
Create data flow and data lifecycle documentation.
Develop and enforce database standards and/or new development protocols.
Build and maintain Extract, Transform, Load (ETL) processes and pipelines.

* - Other duties may also be assigned

5+ years of experience working with large datasets in a cloud environment
3+ years of experience with analytics projects on large electronic health records, medical claims, or datasets of comparable size, sensitivity, and complexity
Proficiency in the Google Cloud suite of tools, such as BigQuery, Compute Engine, Dataprep, Healthcare API, and AI Platform
Proficiency in Python packages for data science and statistics
Strong knowledge of relational databases and SQL
Strong skills in statistics, probability, and machine learning
Experience with Cloud Services and Cloud Automation solutions
Experience working with automated deployments and source code/configuration management tools (such as GitHub)
Great communication skills, and experience telling stories with data
Familiarity with medical terms and the healthcare billing systems
Previous experience with Natural Language Processing is a plus
Advanced cloud skills
Data annotation and data science
ML skills

