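Amazon Web Services published the cause of the April 21 EBS outage on its website today. The initial human operational error was rolled back quickly, but the domino effect it set off still left the whole system down (or at least unstable) for three days...
The article includes several key points for designing systems: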
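- When retrying after repeated failures, lengthen the interval between retries (back off aggressively).
- Find a way to make data recovery (semi-)automated.
- Limit the number of retries after a failure; once the limit is exceeded, temporarily isolate the failing component from the rest of the system.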
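The first and third points above can be sketched together in a few lines of Python. This is only an illustration, not AWS's implementation: the function name `retry_with_backoff`, the `Quarantined` exception, and all default values are hypothetical, and the random jitter is an extra assumption that the article summary above does not mention.

```python
import random
import time


class Quarantined(Exception):
    """Raised when an operation has used up its retry budget (hypothetical name)."""


def retry_with_backoff(operation, max_attempts=5, base_delay=1.0,
                       max_delay=60.0, sleep=time.sleep):
    """Call operation() until it succeeds or max_attempts is reached.

    The wait before each retry doubles, capped at max_delay. Random jitter
    (an assumption, not from the article) keeps many clients from retrying
    at the same moment.
    """
    last_error = None
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception as exc:
            last_error = exc
            if attempt == max_attempts - 1:
                break  # retry budget used up, do not sleep again
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay * random.uniform(0.5, 1.0))
    # The caller is expected to take the failing component out of rotation.
    raise Quarantined(f"gave up after {max_attempts} attempts") from last_error
```

A caller would catch `Quarantined` and mark the node or volume as isolated instead of letting it keep hammering the rest of the system.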
AWS also announced how it will compensate customers for this incident:
For customers with an attached EBS volume or a running RDS database instance in the affected Availability Zone in the US East Region at the time of the disruption, regardless of whether their resources and application were impacted or not, we are going to provide a 10 day credit equal to 100% of their usage of EBS Volumes, EC2 Instances and RDS database instances that were running in the affected Availability Zone. These customers will not have to do anything in order to receive this credit, as it will be automatically applied to their next AWS bill. Customers can see whether they qualify for the service credit by logging into their AWS Account Activity page.
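In short: whether or not a customer was actually affected, anyone who had an attached EBS volume or a running RDS database instance in the affected Availability Zone of the US East Region at the time of the disruption will be compensated for 10 full days of the EBS, EC2, and RDS resources they used there, in the form of a credit that is applied to a future bill.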
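The exact credit amount can be seen on the April bill. Note also that this compensation does not require a separate request (under the EC2 SLA, when service falls below the agreed Service Level, customers normally have to write in separately to claim compensation).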