Large-Scale Training

Observability | Best Practices for Host Monitoring in Elastic Supercomputing Scenarios with Prometheus

This article explains how to build an accurate, fast, and reliable host monitoring system for elastic supercomputing scenarios with rapid auto-scaling.

A Journey into Alibaba Cloud's Large-scale Deep Learning Performance Optimization Practices

In this article, we'll introduce Alibaba's Apsara AI Acceleration (AIACC for short) and discuss how it topped DAWNBench in the ImageNet image classification category.