Error handling and recovery
For a subscription in the `error` state, the client can initiate a recovery process to move the subscription back to the `active` state.
If the client leaves the subscription in the `error` state for an extended amount of time (approximately a month), then the server moves the subscription to the `off` state and recovery is no longer possible.
The recovery process relies on the status, events, and update operations of a subscription, which were introduced earlier. These operations need to be executed in the order described below:
- Identifying the issue: To get help with troubleshooting, the client can use the `$status` operation. The reason for the most recent notification failure is indicated by the `entry[0].resource.error[0].text` field of the response.
- Fixing the issue: The client needs to check and fix the issue that caused notifications to fail.
- Reactivating the subscription: Once the issue has been addressed, the client needs to set the status of the subscription back to `active`. To do this, an update needs to be posted with the `status` set to either `active` or `requested`. The server will send a handshake while serving the subscriber's request and will only respond once the handshake has succeeded or failed. The update operation will succeed or fail depending on the outcome of the handshake.
- Identifying missed events: After the reactivation of the subscription, the server will send new notifications but will not resend the old ones. Therefore, the subscriber needs to check for any events that it missed during the downtime of the subscription. To do this, a request should be made to the `$status` endpoint and the `eventsSinceSubscriptionStart` field needs to be consulted in the response. The client needs to compare this value with the number of the last event that it received to determine the range of missed events.
- Handling missed events: Finally, the client needs to fetch the missed events using the `$events` endpoint and process them.
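The procedure above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the `SubscriptionClient` interface and its method names are hypothetical placeholders for whatever HTTP layer the subscriber uses, and only the two response fields mentioned above are taken from the text. Fixing the underlying issue (the second step) is assumed to happen before `reactivate_and_catch_up` is called.

```python
from typing import Any, Dict, List, Optional, Protocol


class SubscriptionClient(Protocol):
    """Hypothetical transport interface; wire it to your own HTTP library."""

    def get_status(self, subscription_id: str) -> Dict[str, Any]: ...
    def update_subscription(self, subscription_id: str, status: str) -> bool: ...
    def get_events(self, subscription_id: str, first: int, last: int) -> List[Dict[str, Any]]: ...


def latest_error_text(status_bundle: Dict[str, Any]) -> Optional[str]:
    """Read entry[0].resource.error[0].text from a $status response."""
    errors = status_bundle["entry"][0]["resource"].get("error", [])
    return errors[0].get("text") if errors else None


def missed_event_range(status_bundle: Dict[str, Any], last_received: int) -> range:
    """Event numbers after last_received, up to eventsSinceSubscriptionStart."""
    resource = status_bundle["entry"][0]["resource"]
    # int() accepts either a JSON number or a numeric string.
    total = int(resource["eventsSinceSubscriptionStart"])
    return range(last_received + 1, total + 1)


def reactivate_and_catch_up(
    client: SubscriptionClient, subscription_id: str, last_received: int
) -> List[Dict[str, Any]]:
    """Reactivate first, then fetch whatever was missed (steps 3 to 5)."""
    # Reactivation: the update only succeeds if the handshake succeeds.
    if not client.update_subscription(subscription_id, "active"):
        raise RuntimeError("reactivation failed: handshake did not succeed")
    # Identify missed events from the event count reported by $status.
    missed = missed_event_range(client.get_status(subscription_id), last_received)
    if not missed:
        return []
    # Fetch the missed events; the caller is responsible for processing them.
    return client.get_events(subscription_id, missed.start, missed.stop - 1)
```

Before reactivating, a subscriber could log `latest_error_text(client.get_status(subscription_id))` to see why the most recent notification failed.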
Please note that the order of steps outlined above differs from the one specified by FHIR in its Recovering from Errors section. The standard specifies fetching missed events first and reactivating the subscription afterward, which could result in a missed event if a new event arrives between these two steps. The order recommended here avoids this issue but could instead result in the client getting the same event twice. Duplicate event delivery should be handled gracefully by the subscriber anyway.
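One way to tolerate duplicate delivery is to key processing on the event number, so that an event arriving both in a notification and in a later `$events` fetch is handled only once. The sketch below assumes event numbers are unique per subscription; the `EventDeduplicator` class is a hypothetical helper, and a real subscriber would likely persist the set of seen numbers rather than keep it in memory.

```python
from typing import Callable, Set


class EventDeduplicator:
    """Tracks processed event numbers so redelivered events are skipped."""

    def __init__(self) -> None:
        self._seen: Set[int] = set()

    def process(self, event_number: int, handler: Callable[[int], None]) -> bool:
        """Run handler once per event number; return False for duplicates."""
        if event_number in self._seen:
            return False
        handler(event_number)
        # Mark as seen only after the handler finished without raising.
        self._seen.add(event_number)
        return True


handled = []
dedup = EventDeduplicator()
# Event 7 arrives in a notification, then again in the $events catch-up.
for number in [7, 5, 6, 7]:
    dedup.process(number, handled.append)
print(handled)  # [7, 5, 6]
```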
⚠️ Known issue: A subscriber may unintentionally reactivate a subscription by updating it, leading to missed events that need to be handled. This can occur under the following sequence of events:
- The subscriber fetches an `active` subscription.
- A notification fails, causing the subscription to enter the `error` state.
- The subscriber sends an update request (HTTP PUT) with the desired subscription state (including the `active` status) and successfully acknowledges the handshake sent by the server.
Since the server no longer attempts to send old notifications after a reactivation and the subscriber is unaware of the need to do a full recovery, the subscriber will miss the events that were created while the subscription was in the `error` state.
The next time new events are created for such a subscription, a notification will be sent with those events. At this point, the subscriber can notice a gap in the event numbers it received and can thus fetch the missing events.
To reliably work around this issue, the subscriber should fetch the subscription status and check the event count after all update operations. If the event count is larger than expected, the subscriber needs to fetch the new events.
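Both checks described above come down to simple arithmetic on event numbers. The following sketch shows them under stated assumptions: event numbers are consecutive integers, a notification carries at least one event number, and the two callables passed to `update_then_check` are hypothetical stand-ins for the subscriber's own update and `$status` requests.

```python
from typing import Callable, Iterable


def gap_before_notification(last_received: int, event_numbers: Iterable[int]) -> range:
    """Event numbers skipped between the last processed event and a new notification."""
    first_new = min(event_numbers)  # assumes the notification is non-empty
    return range(last_received + 1, first_new)


def update_then_check(
    apply_update: Callable[[], None],
    fetch_event_count: Callable[[], int],
    last_received: int,
) -> range:
    """Apply an update, then report events that exist but were never received."""
    apply_update()
    # A count larger than last_received means events were created
    # that the subscriber has not seen and needs to fetch.
    return range(last_received + 1, fetch_event_count() + 1)
```

An empty range from either function means nothing needs to be fetched; otherwise the range gives the event numbers to request from the `$events` endpoint.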