Lack of retry mechanisms

What is a one sentence summary of your feature request?

We have not found any built-in “retry on error” functionality: when an action fails, it is placed in a list that must be retried manually, which increases operational overhead.

Please describe your idea in detail. What is your problem, why do you feel this idea is the best solution, etc.

Currently, when an action fails, it is simply placed in an error list that must be reprocessed manually by the operations teams.
This lack of an automatic retry mechanism increases operational workload, raises the risk of oversight, and makes it more difficult to isolate recurring technical errors.

Problem / Pain point
No standard “retry on error” mechanism for failed actions
Proliferation of manual retry tasks
Difficulty in securing and stabilising processing, especially when too many actions are concentrated in a single orchestration flow
Increased risk of business incidents if a retry is forgotten

Desired functional behaviour
Ability to define, per job or flow type, a configurable retry policy (number of attempts, delay between attempts, maximum retry duration).
Differentiated handling of technical errors (e.g. temporary unavailability of a target system) versus functional errors (e.g. invalid or inconsistent data).
Clear logging of all retry attempts (timestamp, status, cause of the initial error, result of each attempt).
Centralised view of actions currently being retried and of those that have definitively failed after all attempts have been exhausted.
Integration with existing monitoring and alerting mechanisms (e.g. alerts when retry thresholds are exceeded).
Behaviour aligned with the planned job redesign and segmentation of orchestration flows, so that technical errors are easier to isolate and handle.
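
To make the desired behaviour more concrete, here is a minimal Python sketch of the retry policy, the technical versus functional error distinction, and the per-attempt logging described above. It is only an illustration of the request, not an existing product API: every name in it (RetryPolicy, TechnicalError, FunctionalError, AttemptRecord, run_with_retry) and every default value is an assumption made for this example.

```python
import time
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable, List


class TechnicalError(Exception):
    """Hypothetical: transient failure, e.g. target system temporarily unavailable."""


class FunctionalError(Exception):
    """Hypothetical: data problem, e.g. invalid or inconsistent data."""


@dataclass
class RetryPolicy:
    # Assumed defaults, for illustration only.
    max_attempts: int = 3
    delay_seconds: float = 5.0
    max_duration_seconds: float = 60.0


@dataclass
class AttemptRecord:
    attempt: int
    timestamp: str
    status: str  # "success", "technical_error" or "functional_error"
    cause: str


@dataclass
class RetryResult:
    succeeded: bool
    attempts: List[AttemptRecord] = field(default_factory=list)


def run_with_retry(
    action: Callable[[], None],
    policy: RetryPolicy,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> RetryResult:
    """Run `action`, retrying only technical errors within the policy limits."""
    result = RetryResult(succeeded=False)
    started = clock()
    for attempt in range(1, policy.max_attempts + 1):
        now = datetime.now(timezone.utc).isoformat()
        try:
            action()
        except FunctionalError as exc:
            # Retrying cannot fix bad data, so fail definitively at once.
            result.attempts.append(AttemptRecord(attempt, now, "functional_error", str(exc)))
            return result
        except TechnicalError as exc:
            result.attempts.append(AttemptRecord(attempt, now, "technical_error", str(exc)))
        else:
            result.attempts.append(AttemptRecord(attempt, now, "success", ""))
            result.succeeded = True
            return result
        # Stop when attempts are exhausted or the next wait would exceed
        # the maximum retry duration.
        if attempt == policy.max_attempts:
            break
        if clock() - started + policy.delay_seconds > policy.max_duration_seconds:
            break
        sleep(policy.delay_seconds)
    return result


if __name__ == "__main__":
    calls = {"n": 0}

    def flaky_action() -> None:
        # Fails twice with a technical error, then succeeds.
        calls["n"] += 1
        if calls["n"] < 3:
            raise TechnicalError("target system unavailable")

    outcome = run_with_retry(flaky_action, RetryPolicy(max_attempts=5, delay_seconds=0.0))
    for record in outcome.attempts:
        print(record)
```

The list of AttemptRecord entries is what a centralised view or an alerting rule could consume: an action whose result has succeeded set to False after the last attempt is one that has definitively failed.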

How do you currently solve the challenges you have by not having this feature?

Through manual actions, such as an operational bulk-retry script that reprocesses resources in error and reduces the manual workload for operations teams.

@EnaelleDcs Thank you for your idea submission. It seems that you are addressing several issues here beyond just a retry mechanism. The request also relates to how we audit and visualize these events. I’d like to discuss this in more detail with you.

Thanks,
Kate

Hello @EnaelleDcs,

I don’t seem to have the same issues.
I have a lot of provisioning reviews/errors, but that is due to specific configuration issues.
The only time I see the issue you raise is when the timeout is too short for processing a bulk action.
The issue may be at the network/session level. This is why it would be useful to see which of the errors you are getting a retry mechanism would solve.