Job details

Lead Data Engineer

Headquartered in Silicon Valley, we are a newly established start-up, where a collective of visionary scientists, engineers, and entrepreneurs are dedicated to transforming the landscape of biology and medicine through the power of Generative AI. Our team comprises leading minds and innovators in AI and Biological Science, pushing the boundaries of what is possible. We are dreamers who reimagine a new paradigm for biology and medicine.

We are committed to decoding biology holistically and enabling the next generation of life-transforming solutions. As the first mover in pan-modal Large Biological Models (LBM), we are pioneering a new era of biomedicine, with our LBM training leading to ground-breaking advancements and a transformative approach to healthcare. Our exceptionally strong R&D team and leadership in LLM and generative AI position us at the forefront of this revolutionary field. With headquarters in Silicon Valley, California, and a branch office in Paris, we are poised to make a global impact. Join us as we embark on this journey to redefine the future of biology and medicine through the transformative power of Generative AI.

Key Responsibilities:

Lead the strategic design of a holistic solution to our large and diverse data usage needs.
Set the collaboration and reusability strategy for data consumption, including publicly available and partner-generated data.
Ensure the FAIR principles are followed in our data storage and retrieval strategy.
Build and maintain scalable, efficient, and reusable data products and codebases for large-scale foundation model training, adaptation, evaluation, and inference.
Collaborate closely with data engineers and research scientists to integrate models into production environments.
Ensure code quality, scalability, and performance through rigorous testing and code reviews.

Qualifications:

Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field. Experience in life sciences or healthcare is required.
Strong familiarity with at least some (the more the better) of the following biomedical data types: sequencing data, other high-throughput omics data, biological imaging data, and clinical and phenotypic data.
Experience using (developing is an advantage) large-scale data products and systems for biological or biomedical applications.
Strong programming skills in JavaScript, Python, and modern web development frameworks, and familiarity with GPU-accelerated tools (e.g., CUDA, cuDNN, Triton).
Knowledge of major deep learning frameworks such as PyTorch, HuggingFace Transformers & Accelerate, or Megatron-LM/DeepSpeed.
Familiarity with resource management and scheduling systems (e.g., SLURM, Kubernetes).
Proficiency in back-end frameworks like Django, Flask, or Node.js, and database technologies (e.g., PostgreSQL, MongoDB).
Expertise in distributed systems, cloud computing (AWS, GCP), and containerization tools (Docker, Kubernetes).

Preferred Qualifications:

Prior experience pre-training or serving large language models or large-scale foundation models.
Experience with deep learning workflows.
Knowledge of the challenges of, and experience with, bioinformatics tools.
Familiarity with version control systems like Git and CI/CD pipelines.
Strong understanding of RESTful APIs, authentication, and deployment pipelines.
Familiarity with machine learning workflows and biological datasets.

Join us as we embark on this journey to redefine the future of biology and medicine.

We are an equal opportunity employer. We celebrate diversity and are committed to creating an inclusive environment for all employees.

Average salary estimate

$140,000 / year (est.)
Min: $120,000
Max: $160,000
