Incident began at 2023-03-03 21:56 (all times are US/Pacific).
Summary: Cloud AI Platform and Vertex AI Training are experiencing elevated error rates for GPU jobs in us-central1, us-east1, and europe-west3
Description: Our engineering team is currently working on mitigation.
At this time, we believe the issue has been resolved for the us-central1 region and are working to confirm.
We do not have an ETA for mitigation in us-east1 and europe-west3 at this point.
We will provide more information by Friday, 2023-03-03 23:30 US/Pacific.
Diagnosis: Cloud AI Platform and Vertex AI Training GPU jobs may experience elevated failure rates in us-central1, us-east1, and europe-west3.
Workaround: None at this time.
Affected products: Vertex AI Training, Cloud Machine Learning
Affected locations: Frankfurt (europe-west3), Iowa (us-central1), South Carolina (us-east1)