Tenstorrent leads the industry in cutting-edge AI technology, revolutionizing performance expectations, ease of use, and cost efficiency. The Site Reliability Engineer role involves ensuring the reliability and operational health of AI systems across internal clusters and customer deployments, troubleshooting complex issues, and partnering with engineering teams to resolve production incidents.
Responsibilities
- Ensure reliability and operational health of Tenstorrent systems across internal and customer environments
- Troubleshoot complex issues across compute, networking, and software layers
- Partner with engineering teams and customers to resolve production incidents
- Design and improve monitoring, observability, and alerting systems
- Build automation to reduce operational toil and improve system reliability
Skills
- Experienced in site reliability, infrastructure, or systems engineering in distributed environments
- Strong Linux systems knowledge with the ability to troubleshoot complex multi-layer issues
- Proficient with observability tools such as Prometheus, Grafana, and alerting systems
- Comfortable with scripting and automation using Python, Go, or similar languages
- Solid understanding of networking fundamentals and how systems behave at scale
Benefits
- Highly competitive compensation package and benefits
Company Overview
- Tenstorrent develops AI hardware and software solutions for data processing and machine learning applications. Founded in 2016, it is headquartered in Toronto, Ontario, Canada, and has 501-1000 employees. Its website is http://tenstorrent.com.