Machine Learning Engineer (Training Infrastructure)
Full Time
Nous Research
This job is no longer accepting applications
Software Engineering, Other Engineering
Posted 6+ months ago
We’re looking for an MLE to scale training of large transformer-based models. You’ll work on distributed training infrastructure, focusing on performance optimization, parallelization, and fault tolerance for multi-GPU and multi-node training environments.
Responsibilities:
- Performance engineering of training infrastructure for large language models
- Implementing parallelization strategies across data, tensor, pipeline, and context dimensions
- Profiling distributed training runs and optimizing performance bottlenecks
- Building fault-tolerant training systems with checkpointing and recovery mechanisms
Qualifications:
- 3+ years training large neural networks in production
- Expert-level PyTorch or JAX for performant and fault-tolerant training code
- Multi-node, multi-GPU training experience with debugging skills
- Experience with distributed training frameworks and cluster management
- Deep understanding of GPU memory management and optimization techniques
Preferred:
- Experience with distributed training of large multi-modal models, including those with separate vision encoders
- Deep knowledge of NCCL (e.g. symmetric memory)
- Experience with mixture of experts architectures and expert parallelism
- Strong NVIDIA GPU programming experience (Triton, CUTLASS, or similar)
- Custom CUDA kernel development for training operations
- Proven ability to debug training instability and numerical issues
- Experience designing test runs to de-risk large-scale optimizations
- Hands-on experience with FP8 or FP4 training
- Track record of open-source contributions (e.g. DeepSpeed, TorchTitan, NeMo)