Expose stats about finally failed AP deliveries in prometheus #882

Merged
Oneric merged 2 commits from Oneric/akkoma:telemetry-failed-deliveries into develop 2025-05-09 20:12:02 +00:00
Owner

And log about it. Makes it easier to tell when and what's going wrong.

If anyone knows how to make Grafana show this as a bar chart with one bar per target and bars segmented by reason, pls let me know.
I only managed to get a bar chart over target, losing the reason info.

The oban event types and states per failure/return type are:

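| fail/return | event | final state | retry state |
| --- | --- | --- | --- |
| `:ok` | `:stop` | `:success` | |
| `{:cancel, _}` | `:stop` | `:cancelled` | |
| `{:discard, _}` | `:stop` | `:discard` | |
| `{:error, _}` | `:exception` | `:discard` | `:failure` |
| exception | `:exception` | `:discard` | `:failure` |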
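
The table above can be read as a mapping from a job's return value to the telemetry event and state it produces. A minimal sketch of that reading, assuming a hypothetical helper `ObanOutcomeSketch.classify/2` (not part of Oban or Akkoma) and a boolean flag saying whether this was the last allowed attempt:

```elixir
defmodule ObanOutcomeSketch do
  # Hypothetical helper for illustration only; it restates the table and
  # does not call into Oban.
  #
  # Returns {event, state} for a given job return value.
  # `final_attempt?` selects between the "final state" and "retry state" columns.
  def classify(:ok, _final_attempt?), do: {:stop, :success}
  def classify({:cancel, _reason}, _final_attempt?), do: {:stop, :cancelled}
  def classify({:discard, _reason}, _final_attempt?), do: {:stop, :discard}
  def classify({:error, _reason}, true), do: {:exception, :discard}
  def classify({:error, _reason}, false), do: {:exception, :failure}
  # A raised exception is not a return value; per the table it behaves
  # like {:error, _} and is therefore not modelled separately here.
end
```

For example, `ObanOutcomeSketch.classify({:error, :timeout}, false)` gives `{:exception, :failure}`, while the same return on the last attempt gives `{:exception, :discard}`.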
Oneric added 2 commits 2025-03-14 19:43:05 +00:00
:discard marks jobs as "discarded", i.e. jobs which permanently failed
due to e.g. exhausting all retries or explicitly being discarded due to a
fatal error.
:cancel marks jobs as "cancelled" which does not imply failure.

While neither method counts as a job "exception" in the set of
telemetries we currently export via Prometheus, the different state
is visible in the (not-exported) metadata of oban job telemetry.
We can use handlers of those events to build bespoke statistics.
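
A handler of this kind could look roughly like the sketch below. It is not the code from this PR: the module name `FailedJobCounterSketch`, the Agent-based storage and the `{worker, state}` key are assumptions made for illustration. The function has the four-argument shape `:telemetry` handlers use, and only counts jobs whose metadata `state` is `:discard`.

```elixir
defmodule FailedJobCounterSketch do
  # Hypothetical module for illustration; names are not taken from the PR.
  use Agent

  def start_link(_opts \\ []) do
    Agent.start_link(fn -> %{} end, name: __MODULE__)
  end

  # Same shape as a :telemetry handler: (event, measurements, metadata, config).
  def handle_event([:oban, :job, event], _measurements, %{state: state} = meta, _config)
      when event in [:stop, :exception] do
    if final_failure?(state) do
      key = {meta[:worker], state}
      Agent.update(__MODULE__, &Map.update(&1, key, 1, fn n -> n + 1 end))
    end

    :ok
  end

  # Ignore anything that is not an Oban job stop/exception event.
  def handle_event(_event, _measurements, _metadata, _config), do: :ok

  # Current counters as a map of {worker, state} => count.
  def counts, do: Agent.get(__MODULE__, & &1)

  # Only discarded jobs count as permanently failed; cancelled jobs do not.
  defp final_failure?(state), do: state == :discard
end

# With the :telemetry library available, the handler would be attached like:
# :telemetry.attach_many(
#   "failed-job-counter-sketch",
#   [[:oban, :job, :stop], [:oban, :job, :exception]],
#   &FailedJobCounterSketch.handle_event/4,
#   nil
# )
```

Counting per `{worker, state}` is only one possible choice; any other field of the job metadata could serve as a label instead.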

Ideally we'd like to distinguish in the receiver worker between
"invalid" and "already present or delete of unknown" documents,
but this is cumbersome to get right with a list of
free-form, human-readable descriptions of the violated constraints.
For now, just count both as a fatal error.
telemetry: expose stats about failed deliveries
249876d1f0
Also log about it, which we didn't do so far.
Oneric force-pushed telemetry-failed-deliveries from 249876d1f0 to 1f6f5edf85 2025-04-15 17:41:33 +00:00 Compare
Author
Owner

rebased and cosmetic changes for elixir-1.18’s mix format

Oneric merged commit 13940a558a into develop 2025-05-09 20:12:02 +00:00
Oneric deleted branch telemetry-failed-deliveries 2025-05-09 20:12:02 +00:00