
AI Research Computing Infrastructure Engineer

Company: Frederick National Laboratory for Cancer Research
Location: Frederick, MD
Posted on: February 22, 2026

Job Description:

Job ID: req4426
Employee Type: exempt full-time
Division: Enterprise Information Technology
Facility: Frederick: Ft Detrick
Location: PO Box B, Frederick, MD 21702 USA

The Frederick National Laboratory is operated by Leidos Biomedical Research, Inc. The lab addresses some of the most urgent and intractable problems in the biomedical sciences in cancer and AIDS, drug development and first-in-human clinical trials, applications of nanotechnology in medicine, and rapid response to emerging threats of infectious diseases. Accountability, Compassion, Collaboration, Dedication, Integrity and Versatility; it's the FNL way.

PROGRAM DESCRIPTION

The mission of Enterprise Information Technology (EIT) is to develop an enterprise-level, consolidated information technology infrastructure that provides exceptional IT capabilities to the Frederick National Laboratory for Cancer Research (NCI-Frederick/FNLCR) in support of basic, translational, and clinical cancer and AIDS research.

The IT Operations Group (ITOG) is a part of Enterprise Information Technology (EIT) within Leidos Biomedical Research, Inc. ITOG is responsible for computational servers, storage servers, virtual machine infrastructure, and the FNLCR network. ITOG focuses on implementing enterprise IT best practices in the areas of computational services, storage, backup, and archiving; batch and application support; server consolidation and virtualization; network infrastructure; unification of voice, teleconferencing, and video communication technologies; and improved infrastructure for collocation of dedicated servers.

KEY ROLES/RESPONSIBILITIES:

The Research Computing Infrastructure Engineer will design, build, and operate next-generation high-performance computing (HPC) environments that support container-based workflows and GPU-accelerated research computing. The position will play a key role in evaluating, implementing, and maintaining scalable and secure computing architectures for advanced data analysis, AI/ML model training, and simulation workloads. The engineer will collaborate closely with researchers, IT professionals, and external partners to translate scientific requirements into reliable, high-performance computing solutions.

- Design and implement next-generation high-performance computing (HPC) environments that leverage container-driven workflows for GPU-accelerated research.
- Build and maintain container orchestration systems for batch and distributed workloads.
- Integrate containerized job workflows with existing HPC schedulers and storage systems.
- Develop and maintain job templates for batch GPU training and multi-node distributed computing.
- Automate deployment, configuration, and scaling through infrastructure-as-code and CI/CD practices.
- Monitor, benchmark, and optimize system performance, reliability, and resource utilization.
- Collaborate with researchers to containerize and optimize legacy workflows for scalable execution.
- Lead evaluation of emerging tools (e.g., Prefect, Ray, Airflow, Dagster) for workflow orchestration and distributed computing.
- Contribute to the development of tools and bridges between orchestration frameworks and traditional HPC environments.

BASIC QUALIFICATIONS

To be considered for this position, you must minimally meet the knowledge, skills, and abilities listed below:

- Possession of a Bachelor’s degree from an accredited college/university according to the Council for Higher Education Accreditation (CHEA) or four (4) years relevant experience in lieu of degree. Foreign degrees must be evaluated for U.S. equivalency.
- In addition to the education requirement, a minimum of eight (8) years of related experience.
- Strong Linux systems engineering and administration experience.
- Hands-on experience with container orchestration tools such as Kubernetes, Nomad, Run:AI, etc.
- Hands-on scripting/programming skills (Python, Bash, or Go) for automation, monitoring, and job orchestration.
- Experience with infrastructure-as-code / automation tooling (Terraform, Ansible, Packer, or equivalent).
- Familiarity with system performance analysis, monitoring, and tuning.
- Comfortable with small-team environments and taking end-to-end ownership of compute infrastructure.
- Ability to obtain and maintain a security clearance.

PREFERRED QUALIFICATIONS

Candidates with these desired skills will be given preferential consideration:

- Experience with multi-node distributed ML frameworks (PyTorch DDP, Ray, Horovod, TensorFlow, etc.).
- Familiarity with pipeline orchestration tools (Prefect, Airflow, Dagster, Kubeflow).
- Understanding of resource management and scheduling concepts (queues, allocations, GPU device plugins, gang scheduling, multi-node coordination).
- Understanding of storage integration with high-performance clusters (POSIX, object storage, VAST, or similar).
- Familiarity with cloud GPU environments (AWS, GCP, Azure) and hybrid workflows.
- Familiarity with workflow orchestration/pipeline tools (Argo, Kubeflow, Ray, MLFlow).
- Good communication and documentation skills, including the ability to make complex infrastructure understandable to researchers and other engineers.

EXPECTED COMPETENCIES:

- Expertise in Kubernetes, Nomad, or equivalent container orchestration systems for large-scale computing.
- Deep knowledge of Linux systems administration, performance tuning, and automation.
- Ability to translate research computing needs into scalable, reliable infrastructure designs.
- Commitment to documentation, reproducibility, and open science principles.
- Collaborative mindset and willingness to mentor peers in containerization and HPC best practices.

Commitment to Non-Discrimination

All qualified applicants will receive consideration for employment without regard to sex, race, ethnicity, color, age, national origin, citizenship, religion, physical or mental disability, medical condition, genetic information, pregnancy, family structure, marital status, ancestry, domestic partner status, sexual orientation, gender identity or expression, veteran or military status, or any other basis prohibited by law. Leidos will also consider for employment qualified applicants with criminal histories consistent with relevant laws.

Pay and Benefits

Pay and benefits are fundamental to any career decision. That's why we craft compensation packages that reflect the importance of the work we do for our customers. Employment benefits include competitive compensation, Health and Wellness programs, Income Protection, Paid Leave and Retirement.

Pay range: 123,800.00 - 207,125.00 USD

The posted pay range for this job is a general guideline and not a guarantee of compensation or salary. Additional factors considered in extending an offer include, but are not limited to, responsibilities of the job, education, experience, knowledge, skills, and abilities, as well as internal equity and alignment with market data. The salary range posted is a full-time equivalent salary and will vary depending on scheduled hours for part-time positions.


