DevOps Infrastructure Engineer

Posted:
7/24/2024, 9:11:06 AM

Location(s):
California, United States ⋅ Cupertino, California, United States

Experience Level(s):
Mid Level

Field(s):
DevOps & Infrastructure ⋅ Software Engineering

Workplace Type:
On-site

About Etched

Etched is building AI chips that are hard-coded for individual model architectures. Our first product (Sohu) only supports transformers, but has an order of magnitude more throughput and lower latency than a B200. With Etched ASICs, you can build products that would be impossible with GPUs, like real-time video generation models and extremely deep chain-of-thought reasoning.

DevOps Infrastructure Engineer

Designing and writing software for new ASICs is hard and requires a huge amount of supporting tooling. It is even more challenging for model-specific ASICs: they must hit the market at the right time, so moving fast is essential.

You will drive adoption of cutting-edge tooling to improve the speed and reliability of our toolchains. You will help us innovate to do better than the industry norm by running massively parallel CI jobs, specifying and building our own fully redundant SSD-only server infrastructure, and making sure these tools run automatically and reliably.

You will work with an IT contracting firm that handles day-to-day maintenance and installation. While you must be knowledgeable enough about IT to work with this firm, most of your time will be spent designing entirely new toolchains.

The scope and title of this role can be modified for exceptional candidates.

Representative projects

●  Spec out a server using a 6 GHz desktop CPU to speed up single-threaded workloads

●  Decide whether moving our servers to the cloud or a colo facility makes sense to improve uptime

●  Set up networking infrastructure to allow Jupyter notebook users to connect to our servers, without waiting for them to be restarted.

●  Parallelize our CI stack to run on dozens of different machines at once, designing a policy to avoid unnecessary CI failures if a machine goes down.

You may be a good fit if you

●  Are highly technical

●  Have strong knowledge of Linux, containerization, CI/CD, and programming languages such as Python/C++. You will be asked coding questions during your interview.

●  Have a proven ability to lead technical teams and mentor junior members

●  Have 4+ years of experience with either infrastructure engineering or software development

●  Have experience debugging complex hardware and software issues with server infrastructure

Strong candidates may also have

●  In-depth understanding of workflows used in the semiconductor industry, especially those involving Synopsys and Cadence EDA tooling and Verilator

●  Proficiency with cloud computing technology and experience working with a Big 3 cloud provider

●  Experience monitoring and installing datacenter hardware

We encourage you to apply even if you do not believe you meet every single qualification.

How we’re different:

Etched believes in the Bitter Lesson. We think most of the progress in the AI field has come from using more FLOPs to train and run models, and the best way to get more FLOPs is to build model-specific hardware. Larger and larger training runs encourage companies to consolidate around fewer model architectures, which creates a market for single-model ASICs.

We are a fully in-person team in Cupertino, and greatly value engineering skills. We do not have boundaries between engineering and research, and we expect all of our technical staff to contribute to both as needed.

Benefits:

  • Full medical, dental, and vision packages, with 100% of premiums covered (90% for dependents)
  • Housing subsidy of $2,000/month for those living within walking distance of the office
  • Daily lunch and dinner in our office
  • Relocation support for those moving to Cupertino