Failed webhook notifications are now retried over the course of 24 hours, with some delay randomization.
We’re making some changes to the way the Webhooks API retries notifications in order to improve deliverability and reliability. The first and most notable change is that the retry schedule is changing from a 2^n second exponential backoff to a maximum of 10 retries spread over 24 hours.
Additionally, we’re introducing some randomization into the delay to prevent large numbers of concurrent failures from being repeatedly retried at the exact same interval. Delay randomization spreads retries more evenly across the retry window, allowing your system to process them more effectively.
Check out the Webhooks API documentation for more information.
There are two major changes in the retry behavior of webhooks: First, the timing of retries is shifting from a fixed exponential delay to a fixed time window with exponential delays designed to fit within that window. To better understand this shift, it’s helpful to compare the ‘old’ behavior and the new behavior:
- The old behavior is as follows: Failed webhook notifications will be retried a maximum of 10 times, with a 2^n second delay based on the attemptNumber. This means that the delay between retries is dependent only on the attemptNumber, and will only ever reach a total of ~13 minutes.
- The new behavior is different in that failed webhook notifications will be retried a maximum of 10 times, spread out over a 24-hour period. Instead of basing the delays on a fixed scale, notifications can be retried with various delays over the course of that 24-hour period.
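To make the contrast concrete, here is a minimal Python sketch of the two timing models. The old delay follows the 2^n rule described above. This post does not give the exact new schedule, so `fitted_delays` and its `growth` parameter are purely hypothetical: they show one way exponential delays could be scaled to fit a 24-hour window, not the actual implementation.

```python
from typing import List

WINDOW_SECONDS = 24 * 60 * 60  # 24-hour retry window, per the announcement
MAX_RETRIES = 10               # maximum number of retries, per the announcement


def old_delay_seconds(attempt_number: int) -> int:
    """Old behavior: a fixed 2^n second delay based only on attemptNumber."""
    return 2 ** attempt_number


def fitted_delays(window: float = WINDOW_SECONDS,
                  retries: int = MAX_RETRIES,
                  growth: float = 2.0) -> List[float]:
    """Hypothetical schedule: exponentially growing delays, scaled so that
    together they span the whole window. The growth factor is an assumption."""
    weights = [growth ** i for i in range(retries)]
    scale = window / sum(weights)
    return [weight * scale for weight in weights]


if __name__ == "__main__":
    for attempt, delay in enumerate(fitted_delays(), start=1):
        print(f"attempt {attempt}: old={old_delay_seconds(attempt)}s, "
              f"hypothetical new={delay:.0f}s")
```

The point of the sketch is only that the old delays depend on nothing but the attempt number, while the new delays are sized relative to the overall window.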
The second major change concerns how the delay between retries is determined: The actual length of each delay is changing from a fixed value dependent on attemptNumber to a random length between the previous delay and the maximum delay. This change is also best illustrated with a comparison:
- The old behavior is that the length of the delay is fixed; if a request fails for the first time, it will be retried exactly 2 seconds later. This can cause problems when volume gets extremely high, since many requests failing concurrently are retried after almost the same delay.
- The new behavior includes some delay randomization, which means that the actual delay before the next retry will be somewhere between the previous delay and the maximum delay. This prevents a large number of concurrent failed requests from being retried at the exact same time after a fixed delay.
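The effect of delay randomization can be sketched in a few lines of Python. The uniform distribution, the function names, and the example bounds below are assumptions made for illustration; the post only states that the actual delay falls somewhere between the previous delay and the maximum delay.

```python
import random
from typing import List


def jittered_delay(previous_delay: float, max_delay: float,
                   rng: random.Random) -> float:
    """Pick a delay between the previous delay and the maximum delay.
    A uniform draw is assumed here; the real distribution is not documented."""
    lower = min(previous_delay, max_delay)
    return rng.uniform(lower, max_delay)


def simulate_concurrent_failures(count: int, previous_delay: float,
                                 max_delay: float, seed: int = 0) -> List[float]:
    """Delays chosen for `count` notifications that all failed at once."""
    rng = random.Random(seed)
    return [jittered_delay(previous_delay, max_delay, rng) for _ in range(count)]


if __name__ == "__main__":
    # Example bounds (2s previous delay, 60s maximum) are made up for the demo.
    delays = simulate_concurrent_failures(1000, previous_delay=2.0, max_delay=60.0)
    print(f"earliest retry after {min(delays):.1f}s, latest after {max(delays):.1f}s")
```

With a fixed delay, all 1,000 notifications in this simulation would be retried at the same moment; with randomization they are spread across the interval between the two bounds.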
When is this happening?
This shouldn’t be a breaking change, and the deliverability and stability improvements should be immediate, so we’re rolling out these changes on a shorter-than-usual timeframe. The new Webhooks retry system will go live on December 17th, 2018.