Fun with Adaptive Rate Limiting
We recently had an issue where an OpenZiti network was overwhelmed with client requests after a user's change unintentionally caused the request rate to spike. The fundamental problem was that if a request took too long, the client gave up, but the request was still processed. The system ended up doing work whose results were ignored, which made new requests wait until they, too, timed out. Once the request rate passed a certain threshold, the system didn't degrade gracefully.
I had a fun day solving the problem, and while I'm sure that nothing here is new, I thought others might be interested in where I landed and some ideas that were rejected along the way.