Just figured this out, and I believe it is producing correct results.

You have to use recording rules because you cannot create a range vector from the instant vector result of a function in a single query, as you have already discovered (you get a parse error). So we record the function result (which will be an instant vector) as a new time series and use that as the metric name in a different query, where you can then add a range selector to select a range.

We run our tests multiple times per minute against all our services. Each service ("service" is a label whose value is the service's name) has a different number of tests associated with it, but if any of the tests for a given service fails, we consider that a "down moment". (The number of test failures for a given service is captured in the metrics with the status="failure" label value.) We clamp the number of failures to 1 so that we only have zeroes and ones for our values, and can therefore convert a "failure values time series" into a "success values time series" using an inequality operator and the bool modifier. (See this post for a discussion about the use of bool.) So the result of the first recording rule is 1 for every service where all its tests succeeded during that scrape interval, and 0 where there was at least one test failure for that service.

If the number of failures for a service is > 0 for all the values returned for any given minute, we consider that service to be "down" for that minute. (So if we have both a failure and a success in a given minute, that does not count as downtime.) That is why we have the second recording rule: it produces the actual "up for this minute" boolean values. The second rule builds on the first, which is OK since the Prometheus documentation says the rules in a group are run in series.

So "Uptime" for any given duration is the sum of the "up for this minute" values (i.e. 1 for each minute up) divided by the total number of minutes in the duration, whatever that duration happens to be. Since we have defined a recorded metric named "minute_up_bool", we can then create an uptime graph over whatever range we want.
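For illustration, the two recording rules described in this post could look roughly like the rule group below. This is only a sketch under assumptions, not the exact rules in use: the source metric name test_results and the intermediate rule name service:tests_ok:bool are hypothetical, and it assumes a status="failure" series exists (with value 0) even when no test failed. Only minute_up_bool is a name taken from the text.

```yaml
groups:
  - name: uptime_sketch            # hypothetical group name
    rules:
      # Rule 1: per service, 1 if no test failed at this evaluation, else 0.
      # clamp_max caps the failure count at 1; "!= bool 1" then flips it,
      # turning a "failure" series into a "success" series of 0/1 values.
      - record: service:tests_ok:bool      # hypothetical rule name
        expr: >
          clamp_max(sum by (service) (test_results{status="failure"}), 1)
          != bool 1

      # Rule 2: a service counts as "up" for the minute if at least one
      # sample in the last minute was a success. Rules in a group run
      # in order, so this can read the series recorded by rule 1.
      - record: minute_up_bool
        expr: max_over_time(service:tests_ok:bool[1m])

# Example uptime query over one day (sum of the 0/1 values / sample count):
#   avg_over_time(minute_up_bool[1d])
```

Because minute_up_bool is now an ordinary stored series, the range selector in the final query is legal, which is the whole point of recording it first. In Grafana the fixed [1d] could be swapped for whatever range the dashboard uses.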