ames: recover from mismatched message nums

We've seen issues where the message-num of the head of live.state is
less than current.state.  When this happens, we continually try to
resend message n-1, but we throw away any acknowledgment for n-1 because
current.state is already n.  This halts progress on that flow.

We don't know what causes us to get in this bad state, so this adds an
assert to the packet pump that we're in a good state, run every time
the packet pump is run.  When this crashes, we can turn on |ames-verb
and hopefully identify the cause.

This also adds logic to +on-wake in the packet pump to not try to resend
any messages that have already been acknowledged.  This is just to
rescue ships that currently have these stuck flows.

(Incidentally, I'd love to have a rr-style debugger for stuff like this.
Just run a command that says "replay my event log watching for this
specific condition and then stop and let me poke around".)
This commit is contained in:
Philip Monk 2019-12-09 23:14:06 -08:00
parent 96621c6062
commit e7c8a44e11
No known key found for this signature in database
GPG Key ID: B66E1F02604E44EC

View File

@ -768,7 +768,7 @@
$% [%hear =message-num =fragment-num]
[%done =message-num lag=@dr]
[%halt ~]
[%wake ~]
[%wake current=message-num]
==
:: $packet-pump-gift: effect from |packet-pump
::
@ -2222,6 +2222,7 @@
=~ (dispatch-task task)
feed-packets
(run-packet-pump %halt ~)
assert
[(flop gifts) state]
==
:: +dispatch-task: perform task-specific processing
@ -2232,7 +2233,7 @@
::
?- -.task
%memo (on-memo message-blob.task)
%wake (run-packet-pump task)
%wake (run-packet-pump %wake current.state)
%hear
?- -.ack-meat.task
%& (on-hear [message-num fragment-num=p.ack-meat]:task)
@ -2391,6 +2392,16 @@
=. message-pump (give i.packet-pump-gifts)
::
$(packet-pump-gifts t.packet-pump-gifts)
:: +assert: sanity checks to isolate error cases
::
++ assert
^+ message-pump
=/ top-live
(peek:packet-queue:*make-packet-pump live.packet-pump-state.state)
?. |(?=(~ top-live) (gte current.state message-num.key.u.top-live))
~| [%strange-current current=current.state key.u.top-live]
!!
message-pump
--
:: +make-packet-pump: construct |packet-pump core
::
@ -2420,12 +2431,13 @@
?- -.task
%hear (on-hear [message-num fragment-num]:task)
%done (on-done message-num.task)
%wake on-wake
%wake (on-wake current.task)
%halt set-wake
==
:: +on-wake: handle packet timeout
::
++ on-wake
|= current=message-num
^+ packet-pump
:: assert temporal coherence
::
@ -2439,13 +2451,14 @@
::
=- =* res -
=. live.state live.res
=. packet-pump (give %send static-fragment.res)
%- %+ trace snd.veb
=/ nums [message-num fragment-num]:static-fragment.res
|.("dead {<nums^show:gauge>}")
=? packet-pump ?=(^ static-fragment)
%- %+ trace snd.veb
=/ nums [message-num fragment-num]:u.static-fragment.res
|.("dead {<nums^show:gauge>}")
(give %send u.static-fragment.res)
packet-pump
::
=| acc=static-fragment
=| acc=(unit static-fragment)
^+ [static-fragment=acc live=live.state]
::
%^ (traverse:packet-queue _acc) live.state acc
@ -2454,12 +2467,18 @@
val=live-packet-val
==
^- [new-val=(unit live-packet-val) stop=? _acc]
:: if already acked later message, don't resend
::
?: (lth message-num.key current)
%- %- slog :_ ~
leaf+"ames: strange wake queue, expected {<current>}, got {<key>}"
[~ stop=%.n ~]
:: packet has expired; update it in-place, stop, and produce it
::
=. last-sent.val now.channel
=. retries.val +(retries.val)
::
[`val stop=%.y (to-static-fragment key val)]
[`val stop=%.y `(to-static-fragment key val)]
:: +feed: try to send a list of packets, returning unsent and effects
::
++ feed