penteract 5 hours ago [-]
This is the same calculation behind the observation "you spend much longer in front of red traffic lights than green ones".
It's an interesting observation, but it's playing games with the meaning of "mean latency" and I'm not sure this is a very helpful way to look at requests to a web service - slightly slowing the fastest responses to requests would improve your time-weighted mean latency.
It's a better metric for looking at outages - instantaneous outages don't affect anyone, and time-weighting correctly discards them. On the other hand, average outage length is a very suspect and gameable metric unless accompanied by uptime %.
trb 17 hours ago [-]
Considering metrics other than p99 for user impact is unwise. All users will at some point experience a <1% request. It's not like half of all users will only send requests that come in under your median latency; some of their requests will hit your worst case.
By focusing on the tail and optimizing worst cases you help users more than by improving your median latency.
the8472 5 hours ago [-]
If your frontend fires hundreds of requests (which isn't uncommon) then the p99 is merely what most users will experience. Ideally you want a cumulative distribution chart that goes up to the max.
And then that's just for the requests you measure. If something takes too long the user might do something that cancels the requests, which means the backend never completes its response and won't get the time-to-response sample, so you need to account for dropped requests too.
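A rough sketch of the "hundreds of requests" point, assuming every request independently has a 1% chance of landing beyond the p99 (independence is an assumption here and often does not hold in practice; the function name is made up for illustration):

```python
def prob_at_least_one_slow(requests_per_page: int, tail_fraction: float = 0.01) -> float:
    """Chance that at least one of n independent requests falls in the slowest tail_fraction."""
    return 1.0 - (1.0 - tail_fraction) ** requests_per_page


# Page loads that fire 1, 10, 100 or 300 backend requests.
for n in (1, 10, 100, 300):
    print(n, round(prob_at_least_one_slow(n), 3))
```

Under that assumption, a page that fires 100 requests sees at least one beyond-p99 response on roughly 63% of loads, which is why the p99 ends up describing the typical experience rather than the rare one.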
This is only true if your latency distribution is fully random, which is rarely the case. More often than not, it's the same small group of users hitting most of the p99 because their accounts are simply more resource intensive.
14 hours ago [-]
michaelt 3 hours ago [-]
> Alice says your service is slow. You tell Alice that the mean request to your service completes in 100ms, but Alice says that her mean wait time is 1s.
There are also plenty of situations where a service can have a bimodal performance distribution and the impact of that can fall on certain users disproportionately.
Imagine a retail website that serves images from a global CDN, with cache misses pulled from a server in the EU. Users who visit our homepage, or look at our bestselling products, get a cache hit from the CDN node close to them, in 50ms. But users who look at our long-tail products get a cache miss - and if they're not near Europe, they'll get a noticeable delay.
Hence our mean image load time is 100ms - but a customer browsing an obscure product category for their location can experience markedly worse performance. If Alice is the only person in Costa Rica looking at ski equipment in June, she's going to get a lot of cache misses.
uberex 13 hours ago [-]
> More technically, what’s going on here is the inspection paradox. Alex and Alice don’t experience your latency distribution, they experience a t-weighted version of it
Ooh I got pushed in the 2m end of the pool there. What is the intuition? The ten hundred most popular words sort of thing.
I am very interested in this article though. At first I assumed it would be about TTFB vs. time to render the page after all those async useEffects have run, but it isn't that. This is something else and I am very interested.
CrazyStat 12 hours ago [-]
Perhaps an easier to intuit version of it is how full airliners are.
An airline might report that their flights are on average 60% full, and that might be completely absolutely 100% true. But that's not what passengers experience. If we assume (for convenience) that a plane holds 100 people, when the plane is 20% full then 20 passengers experience that, but when the plane is 100% full then 100 passengers experience that. On average, from a passenger's point of view, the flights are much more than 60% full--it might be 70 or 80%--because a full flight is experienced by more passengers than an empty flight.
For a concrete example imagine two flights, one 20% full and one 100% full: the average is 60% from the airline's point of view, but 100 passengers experienced a full flight and only 20 experienced the 20% full flight, so from the passenger's point of view the average is 86.7% full.
The same logic applies to outages. If you have an outage that lasts one minute then only a few users will encounter it. If you have an outage that lasts one hour then many more users will encounter that. The longer the outage is, the more likely any given user is to encounter it, so from the user's point of view the "average" outage is much longer than the "true" average where you weight every outage equally.
Again we can consider a concrete example: imagine you run a website that gets 100 visitors per minute. You have one outage that lasts 1 minute, then later a second outage that lasts 9 minutes. Your average outage time is 5 minutes. But 100 visitors experienced the 1 minute outage, while 900 visitors experienced the 9 minute outage, so from the point of view of a visitor the average outage is (900*9 + 100*1)/1000 = 8.2 minutes.
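Both worked examples above are the same calculation: weight each flight (or outage) by its own size. A minimal sketch, with function names chosen only for illustration:

```python
def plain_mean(sizes):
    """Average where every flight or outage counts once (the operator's view)."""
    return sum(sizes) / len(sizes)


def size_weighted_mean(sizes):
    """Average where each item counts in proportion to its own size (the passenger's view)."""
    return sum(s * s for s in sizes) / sum(sizes)


flights = [20, 100]  # passengers on two 100-seat flights
outages = [1, 9]     # outage lengths in minutes, with a constant visitor rate

print(plain_mean(flights), size_weighted_mean(flights))
print(plain_mean(outages), size_weighted_mean(outages))
```

The constant visitor rate cancels out of the outage calculation, so only the outage lengths matter.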
exceptione 7 hours ago [-]
I think you got the `900*9` wrong if you talk about experienced downtime. If you calculate discrete minutes
(900*4.5 + 100*1)/1000 = 4.15 min
(Unless you manage to inform the user since how long the website has been down already.)
This could be made more accurate if we calculate it over seconds, which would drive the experienced downtime even lower!
When you measure latency, you’re measuring it based on requests. So in some bucket if you had a request take 2s and a request take 10s, you would say the average is 6s. This answers the question “how long should I expect a single request to take”.
But the article's point is that to the people, it’s not the number of requests that matters, it’s how long they are waiting for them. The question is “how much time am I sitting here waiting?” In that case the 10s request is 5x worse than a 2s request - it takes 5x of the “time spent”.
So you can change the weighting from 1/2 × 2s and 1/2 × 10s to 2/12 × 2s and 10/12 × 10s - that gives us weights of about 17% and 83%. And the average there is about 8.67s.
The difference is the question. If you think about it as road segments, let’s say you have a group of road segments with different lengths. You can ask the “average length of segment” - or you can ask “if you pick a random point among all the segments, how long is the segment that I landed in?” You’re picking very differently there - the second is proportional to the length!
Technically IMO the blog is slightly off - you want to use “mean residual life”. Ie if I pick a random TIME how long do I have to wait for my request to finish. But it’s reasonably close.
mungoman2 13 hours ago [-]
Similarly curious about this. The intuition I extracted:
Let’s say we have 10 requests, where 9 of them take 1 second to complete but one takes 100 seconds. The average time to complete a request is about 11 seconds, but if you experience the requests in series, at any given time you’re much more likely to be sitting and waiting in that 100 second request.
So if you imagine a long series of requests from this distribution and place yourself randomly in the series, the average time to completion is just a bit less than 50 seconds.
This is what is meant by t-weighted, that events with a large t take a larger place.
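To put numbers on this example, here is a small sketch. It assumes the requests run back to back and that you show up at a uniformly random instant, so on average you wait half of whatever request is in flight; the function and key names are made up for illustration.

```python
def wait_statistics(durations):
    """Summarise a back-to-back series of requests from a random observer's point of view."""
    n = len(durations)
    total = sum(durations)
    sum_sq = sum(d * d for d in durations)
    return {
        # What the operator reports: every request counts once.
        "mean_per_request": total / n,
        # Length of the request in flight at a random instant: E[X^2] / E[X].
        "time_weighted_mean": sum_sq / total,
        # Expected time left on that request: E[X^2] / (2 * E[X]).
        "mean_remaining_wait": sum_sq / (2 * total),
        # Fraction of wall-clock time spent inside the single longest request.
        "share_of_time_in_longest": max(durations) / total,
    }


stats = wait_statistics([1] * 9 + [100])
for name, value in stats.items():
    print(f"{name}: {value:.2f}")
```

For nine 1 s requests and one 100 s request this gives a per-request mean of 10.9 s, a time-weighted mean of about 91.8 s, and a mean remaining wait of about 45.9 s, with roughly 92% of the time spent inside the slow request.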
uberex 11 hours ago [-]
I see it is about a long series of requests. Makes sense. I'll start looking at latency at p99.9 and p99.99 more often now!
ssivark 9 hours ago [-]
When we measure the average experience, it's crucial what we are sampling/measuring uniformly to construct that experience.
The service provider is choosing to weight all requests uniformly, and average over requests -- some have 10s latency and some have 1s latency.
The user lives in time, and chooses to weight their time intervals equally. So a 10 second pause carries 10 times more weight for them than a 1 second pause -- because they experience it 10 times as much! So their average experience is a different weighted average.
The conceptual point is that averaging always needs a measure, and implicitly assumes one if you aren't explicit about the choice.
diatone 13 hours ago [-]
AIUI: my intuition is Alex and Alice are points in the distribution. They don’t think about their experience in terms of population statistics. They see their individual latency times, and use that as their sample. If t is low in their experience, great the distribution is low.
But for any t that goes high that they observe (which tends to be the case in a skewed distribution such as service latencies), it drags their impression of the distribution up, dominating the shape of that impression.
reinitctxoffset 11 hours ago [-]
Arithmetic mean is just really bad for latency conversations (ditto MTTR). Other averages have their place but for a legible, accessible chart that's 4 lines in anything: p50, p75, p95, p99.9 with the last having the SLA is IMHO the right thing to goal and alert on in a cross-functional setting that's attaching engineering outcomes to business outcomes.
There's better math for advanced introspection, but for stuff everyone in the room can intuit no matter their discipline, that's a really sweet spot.
And it's motivating: the p99.9 latency is a bunch of quick, high-impact wins if you haven't profiled it yet. A good time is had by all.
mdavid626 3 hours ago [-]
I don’t remember any service I used in the last couple of years, where I thought to myself: this service is really fast and responsive. Great experience.
Quite the contrary: feels like everything got worse. Sometimes painfully slow, buggy and unreliable.
rustybolt 17 hours ago [-]
This article contains very little substance. Show me the math!
AgentOrange1234 16 hours ago [-]
Yes I found this very hard to follow. I appreciate expressing ideas in math like E_a[X] as much as the next guy, but there is no definition or even description of what the heck E or E_a or Var(x) even mean, so how is anyone supposed to understand the reasoning here? All I get from this is a claim that experienced latency is different than the mean, which sounds important, but I still have no intuition as to why this is. Which is sad, because Booker's blog is often deeply amazing.
I was clicking around in response to this article and found this video that explains the inspection paradox nicely. https://youtu.be/Jd1wNizPjoE
glitchc 6 hours ago [-]
Thank you for writing this article, there's a deep and powerful insight illustrated here: An observer using the system experiences different statistics from the system operator. By extension, taking an average of observer experiences leads to different conclusions from taking an average of system performance. One must not confuse the two when designing systems.
pfdietz 3 hours ago [-]
This feels similar to how nuclear power is perceived, contrasting deaths per TWh and the long tail effect of a rare but serious accident.
reisse 4 hours ago [-]
There is a branch of math dedicated to (among other things) truthfully estimating the waiting time, called queueing theory. I wonder why it wasn't mentioned in the article.
amluto 10 hours ago [-]
I found this article to be a very poor explanation of what (I think) it’s trying to say.
I think the point the article ought to be making is much better handled entirely separately for request time and for outages. For outages, it goes something like this: if you have a 1 hour outage, and your user notices that outage, they think you had a 1 hour outage. [0] If you do statistics that observe that you also had ten thousand 1 second outages and thus had an MTTR of under two seconds, this does not excuse your 1 hour outage in the slightest. And the longer an outage is, the more likely that any given user interacts with your service during the outage.
But the article is oddly caught up in this t-weighting idea, without justification. What does the statement that “Alex and Alice experience E_a[X]” even mean? What’s X? Is it the distribution of request times? If so, then I don’t see the article’s point - if I, a user, sample a bunch of requests, I recover an approximation of X, not X^2. And I really hope that X is not intended to be the distribution of outage lengths because I think the conclusion is just wrong as I alluded to above. Sure, if I happen to sample your service during an outage, the probability that I sample any specific outage is proportional to the length of that outage, but what about all the times that you are (hopefully) not having an outage? What if you have two consecutive outages that are so close to each other in time that I don’t think you recovered?
It would be entertaining to make an outage website. I’d pick a distribution over outage lengths. At time 0 I would sample that distribution, get an outage length t, and declare myself down until time t. All requests during that “outage” would report “hey, I’m down, and my outage length is t”. After the “outage” the site would sample again and repeat. This would give the answer in the article. But this is, of course, absurd.
[0] To be pedantic, they may not notice the beginning of the outage. This is a constant factor correction.
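The "outage website" thought experiment above can be sketched without a server. This is only an illustration: the outage lengths are fixed rather than drawn from a distribution, the probes are evenly spaced (one per minute), and `reported_lengths` is a made-up helper name.

```python
from bisect import bisect_right
from itertools import accumulate


def reported_lengths(outage_lengths, probe_times):
    """For each probe time, return the length of the back-to-back outage that covers it."""
    ends = list(accumulate(outage_lengths))  # end time of each consecutive outage
    return [outage_lengths[bisect_right(ends, t)] for t in probe_times]


# The site is always "down": alternating 1 minute and 9 minute outages, 500 minutes total.
lengths = [1, 9] * 50
# One visitor per minute, arriving halfway through each minute.
probes = [minute + 0.5 for minute in range(sum(lengths))]

seen = reported_lengths(lengths, probes)
mean_per_outage = sum(lengths) / len(lengths)
mean_per_visitor = sum(seen) / len(seen)
print(mean_per_outage, mean_per_visitor)
```

Averaged over outages the length is 5 minutes, while averaged over visitors the reported length is 8.2 minutes, because nine out of ten visitors land inside a 9 minute outage.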
perching_aix 17 hours ago [-]
I've grown to dislike the typical tail measurements completely. What I usually look at these days is what share of unique users experience an "unacceptable experience" over a measurement period instead.
I find it much more inquisitive and visceral, to the extent that p99 now boggles my mind. 2N would be dreadful as an availability figure, yet for UX it's treated very differently. So much so that my measurements corroborate exactly that; good UX requires the same many-nines reliability as e.g. DCs, not one or two.
I wonder if it's p90 and p99 to blame for the shoddy services we have, in a way. It's pretty hard to argue for fixing something when it's presented as only going wrong 0.5% or less of the time after all. Even if at scale that means most of your users are experiencing it weekly.
AkshatM 14 hours ago [-]
How does one measure unique users here in a way different from classic p99? I usually associate p99 with an SLO of some kind, and each request as a "unique user" for the service, so at first it seems like the same thing - measuring p99 with a SLO would say 1% of users are allowed to experience a time longer than our acceptable minimum T, and you're measuring the percentage of requests ("users") experiencing T and trying to keep it below 1% (e.g.).
Is the difference more about measuring a request "across services"? That is, the total cumulative p99 across services must be small i.e. linking all requests to a user and then measuring that? Or is the difference elsewhere?
If the former: are you taking traces and graphing that? What's your methodology?
I think you got it, but let me maybe lay it out more explicitly with a specific example.
I visit HN, that's one request. But I visit HN multiple times a day. So for the operation that serves the homepage, if you took e.g. a past 24hr latency p99 chart, the number of requests analyzed would not be the same as the number of unique users involved in making those requests, potentially drastically so.
So you might see a p99 you're comfortable with, and conclude that since only 1% of requests were worse than that, it's fine. In practice though, depending on how "well-trodden" that operation is, you might very well be in a situation where all users experienced at least one such beyond-SLO event that day, since the mapping is many to one.
The cross operation version of this is important as well, yes. You can have users experience snags across common flows too for example, same idea.
Regarding methodology, it's nothing special, I just rely on user IDs and correlation IDs. It really is just a perspective shift, the underlying data is the same. You can even calculate back the "number of nines needed to get an acceptable UX" using this, as long as the general usage habits are stable. It's just gonna be a lot more nines than two in my experience.
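A minimal sketch of that perspective shift, using a made-up request log of (user ID, latency in ms) pairs and a hypothetical 1000 ms threshold for an "unacceptable" request:

```python
def slow_request_share(records, threshold_ms):
    """Fraction of requests slower than the threshold (the classic percentile view)."""
    slow = sum(1 for _, latency in records if latency > threshold_ms)
    return slow / len(records)


def affected_user_share(records, threshold_ms):
    """Fraction of unique users who saw at least one request slower than the threshold."""
    users = {user for user, _ in records}
    affected = {user for user, latency in records if latency > threshold_ms}
    return len(affected) / len(users)


# Hypothetical log: same underlying data, two ways of counting.
records = [
    ("u1", 120), ("u1", 90), ("u1", 2500),
    ("u2", 80), ("u2", 110),
    ("u3", 95), ("u3", 3000), ("u3", 100),
    ("u4", 70), ("u4", 85),
]

print(slow_request_share(records, 1000))   # share of requests over the threshold
print(affected_user_share(records, 1000))  # share of users who hit at least one
```

In this toy log 20% of requests are slow but 50% of users are affected; the gap grows as each user makes more requests per measurement period.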
frictasolver 2 hours ago [-]
[flagged]
17 hours ago [-]
13 hours ago [-]
zaik 17 hours ago [-]
Is the formula for E_a[X] trivial? I don't see it immediately...
kgwgk 15 hours ago [-]
E[X^2] weights each time with the time, giving the square, and the E[X] in the denominator is the normalisation factor (also required to fix the dimensions).
Say that there are two different waiting times, 1s and 3s, and they happen with probability 50% each. The average waiting time (1/2 × 1 + 1/2 × 3) is 2s. However, 75% of the time we are waiting on a 3s event and only 25% on a 1s event. The weighted average is 2.5s. E[X^2] = 1/2 × 1 + 1/2 × 9 = 5 (s^2) is not the right answer; it still has to be divided by E[X] = 2 (s) to get the correct answer.
ggm 15 hours ago [-]
Interesting you work at Amazon and show how end user experience weights to their pessimal experience.
So.. apply that to Amazon design heuristics like author name search on books, and how Amazon return "in the style of" and "not a book but this guy called Charles Dickens makes jigsaws" as high order matches and consider how the end user experience weights to the pessimal yet Amazon can show on average they make more money doing this..
(Understood that engineers and AWS don't influence UX in the storefront or search)
cowthulhu 14 hours ago [-]
Comments like these seem likely to discourage authors from making more interesting posts about niche topics they specialize in, without actually moving the needle on stuff the commenter is pestering about.
ggm 13 hours ago [-]
Yes, it has that risk. If I were to neutralise it I'd observe that any system which assesses UX will trip over this, and the majors are no exception. It serves as a useful reminder that tuning isn't optimised solely for your (as a user) benefit.
https://www.youtube.com/watch?v=lJ8ydIuPFeU
So 4.6min, i.e. 4:36
I'm pretty sure what the author is saying is:
E(X) =:= \sum_t(t * P(X = t)) is the definition
another important note is P(X^2 = t^2) = P(X = t) - because it's the same distribution.
E_a(X) is a bit sloppy, but consider X_a aka Alice's latency "experience" distribution. The argument is:
P(X_a = t) = t * P(X = t) / \sum_u(u * P(X = u)) - i.e. scale the probability up by t but make it sum to 1.
Then
E(X_a) = \sum_t(t * P(X_a = t)) = \sum_t(t * t * P(X = t)) / \sum_u(u * P(X = u))
aka
E(X^2) / E(X)
Then (from wikipedia)
Var(X) = E(X^2) - (E(X))^2
And we get
E(X_a) = (Var(X) + (E(X))^2) / E(X) = E(X) + Var(X) / E(X)
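A quick numeric check of the derivation above, using exact fractions and a two-point distribution (1s and 3s, each with probability 1/2) chosen purely as an example; the helper names are illustrative:

```python
from fractions import Fraction


def expectation(dist, f=lambda t: t):
    """E[f(X)] for a discrete distribution given as {value: probability}."""
    return sum(p * f(t) for t, p in dist.items())


def time_weighted(dist):
    """P(X_a = t) = t * P(X = t) / E(X): scale each probability by t, then renormalise."""
    ex = expectation(dist)
    return {t: t * p / ex for t, p in dist.items()}


dist = {Fraction(1): Fraction(1, 2), Fraction(3): Fraction(1, 2)}

ex = expectation(dist)                    # E(X)
ex2 = expectation(dist, lambda t: t * t)  # E(X^2)
var = ex2 - ex ** 2                       # Var(X)
weighted = time_weighted(dist)
e_a = expectation(weighted)               # E(X_a)

assert sum(weighted.values()) == 1
assert e_a == ex2 / ex == ex + var / ex
print(e_a)  # 5/2
```

Here E(X) = 2, E(X^2) = 5 and Var(X) = 1, so both forms give 5/2, i.e. 2.5s.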