Commit Graph

455 Commits

Author SHA1 Message Date
Philip Monk
2140e07a99
gall: properly track remote acknowledgments
outstanding.agents.state is a queue of what sort of message we sent to a
foreign app.  We use it so that when the acknowledgment comes back we
know whether to treat it as a watch-ack, poke-ack, or neither.  We used
to put this info in the wire, but this gave us a different ames flow,
which meant %leave and %watch didn't get associated (causing #2079).

The error was that when retrieving the item from the queue, we put
the new 1-item-shorter queue back in outstanding.agents.state at a
different wire than it came from, so the queues never actually got
shorter, and acknowledgments of the wrong sort were commonly produced.
This caused problems mainly in situations where we poke and peer on the
same wire, and possibly when a subscription was cancelled.

Possibly related to #2206 and #2176.  I would expect this bug to cause
those issues, but I haven't verified the converse.  Also possibly
related to #2153 and #2079.
2020-02-13 15:12:07 -08:00
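A minimal Python sketch of the failure mode described in 2140e07a99, assuming the outstanding state behaves like a map from a key to a queue of message kinds. The names `pop_buggy`, `pop_fixed`, and the string keys are hypothetical; the real state lives in Gall's Hoon source. It only illustrates why storing the shortened queue under a different key leaves the original queue at full length.

```python
from collections import deque

def pop_buggy(outstanding, key, other_key):
    """Take the oldest message kind, but store the shorter queue under the wrong key."""
    queue = deque(outstanding.get(key, ()))
    kind = queue.popleft() if queue else None
    outstanding[other_key] = queue  # bug: the entry at `key` is never shortened
    return kind

def pop_fixed(outstanding, key):
    """Take the oldest message kind and store the shorter queue where it came from."""
    queue = deque(outstanding.get(key, ()))
    kind = queue.popleft() if queue else None
    outstanding[key] = queue
    return kind
```

With the buggy version, repeated acknowledgments on the same key keep seeing the same head of the queue, so a later acknowledgment can be classified as the wrong kind.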
Philip Monk
41fd367bff
ames: make routing simpler 2020-02-10 17:49:18 -08:00
Ted Blackman
7dc499d438
ford: ignore spurious clay responses
Due to asynchronicity, Ford can receive responses from Clay to requests
that it has already attempted to cancel. This removes some overzealous
assertions that this wouldn't happen.
2020-01-29 15:11:36 +04:00
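The idea in 7dc499d438 can be sketched in Python as follows, assuming requests are tracked in a set of pending ids. The function `on_response` and its parameters are hypothetical and are not Ford's actual interface. A response whose request has already been cancelled is dropped instead of tripping an assertion.

```python
def on_response(pending, request_id, payload, results):
    """Record a response, ignoring ones for requests that are no longer pending."""
    if request_id not in pending:
        # Late response to a request we already tried to cancel: ignore it.
        return False
    pending.discard(request_id)
    results[request_id] = payload
    return True
```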
Ted Blackman
0d69031c72
ford: add +got-build helper
Replaced manual calls to (~(got by builds.state) build) with a new
+got-build helper function that prints a helpful error message on
failure.
2020-01-29 14:00:25 +04:00
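As a rough Python analogue of the +got-build helper from 0d69031c72: a lookup wrapper that fails with a descriptive message rather than a bare lookup error. The function name `got_build` and the message text are assumptions made for illustration; the real helper is Hoon and its output may differ.

```python
def got_build(builds, build):
    """Look up a build, raising an error that names the missing build."""
    try:
        return builds[build]
    except KeyError:
        # Hypothetical message text; the point is that the failure names the build.
        raise KeyError(f"got-build: missing build {build!r}") from None
```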
Jared Tobin
a3e682f596
Merge branch 'ford-orphans' (#2192)
* ford-orphans:
  ford: dequeue orphans

Signed-off-by: Jared Tobin <jared@tlon.io>
2020-01-28 17:39:36 +04:00
Ted Blackman
155ab60609
ford: dequeue orphans
@ixv recently uncovered a bug (#2180) in Ford that caused certain
rebuilds to crash. @Fang- and I believe this change should fix the bug,
and we have confirmed that the reproduction that used to fail about two
thirds of the time now has not failed at all in the ten or so times
we've run it since then. @Fang- is still running more tests to confirm
the fix with more certainty.

It turned out the cause was that (depending on the rebuild order, which
is unspecified and should not need to be specified), Ford could enqueue
a provisional sub-build to be run but then, later in the same +gather
call, discover that the sub-build was in fact an orphan and delete it
from builds.state accordingly. Then when Ford tried to run the
sub-build, it would have already been deleted from the state, so Ford
would crash when trying to process its result in +reduce.

The fix was to make sure that when we discover a provisional sub-build
is orphaned, we dequeue it from candidate-builds and next-builds so that
we never try to run it. I'm about 95% sure this fix completely solves
the bug.
2020-01-28 17:29:24 +04:00
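A small Python sketch of the fix described in 155ab60609, assuming builds.state is a dict and that candidate-builds and next-builds behave like sets. The function names and the dict layout are hypothetical. Removing an orphan from the state also removes it from both queues, so the run step never looks up a deleted build.

```python
def remove_orphan(state, build):
    """Delete an orphaned build and dequeue it everywhere it might be scheduled."""
    state["builds"].pop(build, None)
    state["candidate_builds"].discard(build)
    state["next_builds"].discard(build)

def run_next(state):
    """Run every queued build; a queued build missing from the state would raise KeyError."""
    ran = []
    for build in sorted(state["next_builds"]):
        _ = state["builds"][build]
        ran.append(build)
    state["next_builds"].clear()
    return ran
```

Without the two `discard` calls, `run_next` would hit the deleted build and raise, which mirrors the crash in +reduce.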
Ted Blackman
0bee77ce8e
/sys: use +harden on vane tasks
Uses Zuse's previously unused +harden helper function to streamline
+task unwrapping in vanes.

(Arguably, in landlocked vanes like Ford, we should crash if we get a
%soft task, since no events should be coming in directly from the
outside.)
2020-01-27 09:53:53 +04:00
Fang
f4ed3fe980
clay: document %t care 2020-01-22 21:23:14 -08:00
Jared Tobin
c182672b54
Merge branch 'ames-goof' (#2166)
* origin/ames-goof:
  ames: adjust route update logic

Signed-off-by: Jared Tobin <jared@tlon.io>
2020-01-22 13:14:39 +04:00
Ted Blackman
11c92e691d
ames: adjust route update logic
There was a typo in the routing logic that was comparing equality
against a value where it should have been doing a pattern match. The
value compared against contained the literal * gate, which would never
match route.peer-state, so this condition was always true, meaning the
fix that had added this extra condition (5406f06) did not actually
change the behavior from what it had been previously.
2020-01-22 12:50:18 +04:00
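The equality-versus-pattern mistake in 11c92e691d can be illustrated in Python. This is only an analogy: `ANY` stands in for the wildcard, and the route shape `(direct, lane)` is an assumption made for the example, not taken from the Ames source.

```python
ANY = object()  # stand-in for a wildcard meant as a pattern, not as a value

def is_direct_buggy(route):
    # Equality against a tuple holding the wildcard object itself:
    # no real route contains ANY, so this is never true.
    return route == (True, ANY)

def is_direct_fixed(route):
    # Pattern-style check: test the shape and the flag, accept any lane.
    return route is not None and route[0] is True
```

Because `is_direct_buggy` never returns true, a guard written as "not direct" built on it always holds, which matches the commit's observation that the extra condition changed nothing.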
Philip Monk
d578159791
ames: fix assertion bug and add debug info
If we receive the naxplanation before the nack, the assertion in the gte
direction fails.  The intent of the assertion is to make sure the top of
the live queue never falls behind current.state, so it was simply in the
wrong direction.
2020-01-14 08:34:12 -08:00
Jared Tobin
01afc2a143
Merge branch 'm/gall-gift-paths' (#2134)
* origin/m/gall-gift-paths:
  gall: (list path) in %fact and %kick

Signed-off-by: Jared Tobin <jared@tlon.io>
2020-01-07 04:17:32 +08:00
Jared Tobin
cd9624e097
Merge branch 'm/whitespace' (#2149)
* origin/m/whitespace:
  various: remove trailing whitespace
  ci: reject trailing whitespace

Signed-off-by: Jared Tobin <jared@tlon.io>
2020-01-06 10:55:13 +08:00
Jared Tobin
f94ba8ce9c
Merge branch 'm/xmas' (#2143)
* origin/m/xmas:
  xmas: remove, obsoleted by alef

Signed-off-by: Jared Tobin <jared@tlon.io>
2020-01-06 10:53:15 +08:00
Jared Tobin
6f7aae3574
Merge branch 'ames-clean' (#2127)
* origin/ames-clean:
  ames: update comment docs

Signed-off-by: Jared Tobin <jared@tlon.io>
2020-01-06 07:25:22 +08:00
Fang
fcf1846b6f
various: remove trailing whitespace 2020-01-03 22:06:42 +01:00
Fang
e005cefe77
xmas: remove, obsoleted by alef 2019-12-27 02:19:36 +01:00
Fang
ae8a57ca25
gall: (list path) in %fact and %kick
Instead of providing a (unit path), allows for (list path), which better
supports the "update to path and subpath" cases.

For example, if /things wants updates about everything, and
/things/specific wants updates about the specific thing, they'll both
need to receive a %fact when the specific thing changes.
Previously, these would have been two separate moves. Now, gall handles
the multi-targeting for you.
2019-12-23 13:37:32 +01:00
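The multi-targeting in ae8a57ca25 can be sketched in Python, assuming subscriptions are stored as a map from path to subscriber names. The function `give_fact` and the data layout are hypothetical and do not reflect Gall's real types. One call with a list of paths replaces what used to be one move per path.

```python
def give_fact(subscribers, paths, fact):
    """Deliver one fact to every subscriber on each of the listed paths."""
    return [
        (sub, path, fact)
        for path in paths
        for sub in subscribers.get(path, [])
    ]
```

For the /things and /things/specific case from the commit message, passing both paths in one list reaches both sets of subscribers.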
Fang
ea7c1db61c
various: use =/ in place of =+ ^-
Also faceless =; where appropriate.
2019-12-21 14:29:14 -03:30
Ted Blackman
895f1c069d
ames: update comment docs 2019-12-21 01:56:51 -05:00
Jared Tobin
103e375417
Merge branch 'ford-safe' (#2117)
* origin/ford-safe:
  ford: clear build results on +load

Signed-off-by: Jared Tobin <jared@tlon.io>
2019-12-20 13:22:23 -03:30
Fang
3808f02287
clay: implement %u care
Previously, it would always produce ~, regardless of the path asked
about.

Now, it produces a loobean, based on whether or not a file exists at the
specified path.
2019-12-18 21:02:38 +01:00
Jared Tobin
9b0582323c
Merge branch 'philip/eth-watcher' (#2113)
* philip/eth-watcher:
  ph: fix tests by spamming blocks regularly
  gaze: reflect changes to eth-watcher
  ames: better printfs
  jael: only advance lifes
  jael: stop ship-to-ship
  jael: add "eager" mode to avoid hitting nodes as much
  jael: properly store ship sources
  gen: add +azimuth-sources
  jael: re-enable ship-to-ship communication
  eth-watcher: actually stop pending thread when restarting

Signed-off-by: Jared Tobin <jared@tlon.io>
2019-12-18 12:13:27 -03:30
Philip Monk
0e876b3cd4
ames: better printfs 2019-12-18 11:31:17 -03:30
Philip Monk
16d98e5eda
jael: stop ship-to-ship 2019-12-18 11:19:41 -03:30
Philip Monk
18c3e7253b
jael: add "eager" mode to avoid hitting nodes as much 2019-12-18 10:58:00 -03:30
Philip Monk
15bd35301e
jael: properly store ship sources 2019-12-18 10:42:57 -03:30
Ted Blackman
9fb37543ec
ford: clear build results on +load 2019-12-18 00:25:27 -05:00
Philip Monk
7ca3d9624e
ames: handle misordered crashing boons
Two bugs fixed here: first, if the %done reentrancy triggered another
%boon, that wasn't getting translated to a %lost, even though it could
have been the reason the event crashed in the first place.

Second, the %done reentrancy needs to happen after we emit our move, so
that we don't invert the order of the %boons we produce.
2019-12-17 20:58:30 -08:00
Philip Monk
e5ac690fd3
jael: re-enable ship-to-ship communication
Also fix a bug in eth-watcher where outstanding threads weren't
cancelled when the config changes.

And set default rift for ourselves to 0.
2019-12-17 16:14:07 -08:00
Philip Monk
769a1c96af
eyre: turn sigpam into flog
This error is mostly harmless, but it does indicate we aren't cleaning
up our subscriptions properly.  This lets you silence it with |knob.

fixes #2088
2019-12-14 00:49:23 -08:00
Philip Monk
b14606660a
goad: recompile apps after changes to /sys
OTAs commonly end up in an inconsistent state if apps depend on changes
to /sys.  For example, the %sift changes break on OTA because %spider
needs to be reloaded so that it's aware of the new thread type.  This
adds a %goad app, which reloads all apps after every change to /sys.

Getting this to start OTA is nontrivial, but this pattern should work
for apps in the future.  The changes to clock shouldn't generally be
necessary; they are only necessary here because we can't rely on hood to
start goad, since hood fails to compile if it's run before zuse is
reloaded.  Once goad is active, this will cease to be a problem.
2019-12-13 17:14:51 -08:00
Jared Tobin
9ba4505086
Merge branch 'ames-sift' (#2081)
* ames-sift:
  ames: refactor +load
  ames: +send-blob better ship printing
  hood: |ames-sift generator to trace by ship
  ames: add %sift to trace by ship

Signed-off-by: Jared Tobin <jared@tlon.io>
2019-12-12 16:06:32 +08:00
Ted Blackman
35596ca7de
ames: refactor +load 2019-12-12 15:55:37 +08:00
Ted Blackman
d4574b5da4
ames: +send-blob better ship printing 2019-12-12 15:55:36 +08:00
Ted Blackman
d77fb0f685
ames: add %sift to trace by ship 2019-12-12 15:55:32 +08:00
Jared Tobin
85d447f173
Merge branch 'philip/gall-noop' (#2073)
* origin/philip/gall-noop:
  gall: no-op on duplicate watch-ack

Signed-off-by: Jared Tobin <jared@tlon.io>
2019-12-12 15:50:19 +08:00
Jared Tobin
2aa86e3121
Merge branch 'philip/stuck-flow' (#2071)
* origin/philip/stuck-flow:
  ames: recover from mismatched message nums

Signed-off-by: Jared Tobin <jared@tlon.io>
2019-12-12 15:49:53 +08:00
Jared Tobin
e4a7dae888
Merge branch 'philip/login-instructions' (#2039)
* origin/philip/login-instructions:
  eyre: add instructions to login page

Signed-off-by: Jared Tobin <jared@tlon.io>
2019-12-12 15:46:36 +08:00
Philip Monk
3b41a8be15
gall: no-op on duplicate watch-ack
fixes #2070
2019-12-10 18:49:50 -08:00
Philip Monk
29f078bb14
ames: don't forward up the sponsorship chain
This is *actually* why the galaxies are under so much load.  They're in
a forwarding loop with their stars, and this breaks the loop.
2019-12-10 16:20:12 -08:00
Philip Monk
68279d91e4
gall: remove message type from wire
%leave over the network didn't work because we included the message type
in the wire from gall, so the duct for the initial %watch and the %leave
were different.  We need to know the message type so we can route the
acknowledgment as %poke-ack, %watch-ack, or no-op.

This moves this piece of information to a piece of state, where we queue
up the message types per [duct wire].  Ames guarantees that
acknowledgments will come in order.

This also includes an easy state adapter.  The more interesting part of
the upgrade is that we likely have outstanding subscriptions with the
old wire format.  The disadvantage of storing information in wires is
that it can't be upgraded in +load.  So, here we listen for updates on
the old wire format, and when we get them we kill the old subscription,
so that it will be recreated with the new wire format.

As an aside, this is a good example of what we mean when we say
subscriptions may be killed at any time, so apps must handle this case.

Finally, this fixes the "attributing" ship to ~zod for agent requests.
This information was ignored for agent requests, but including it causes
spurious duct mismatches.
2019-12-10 19:32:26 +08:00
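A minimal Python sketch of the state change in 68279d91e4, assuming a queue of message kinds per (duct, wire) pair. The class `Outstanding`, its method names, and the mapping of a leave to a no-op are assumptions for illustration. Since acknowledgments arrive in order, the head of the queue says how to route each one, and the wire no longer needs to carry the message type.

```python
from collections import defaultdict, deque

# Assumed routing: a leave acknowledgment is treated as a no-op (None).
ACK_FOR = {"poke": "poke-ack", "watch": "watch-ack", "leave": None}

class Outstanding:
    def __init__(self):
        self.queues = defaultdict(deque)

    def sent(self, duct, wire, kind):
        """Remember what kind of message went out on this duct and wire."""
        self.queues[(duct, wire)].append(kind)

    def acked(self, duct, wire):
        """Route the next acknowledgment according to the oldest unacknowledged message."""
        kind = self.queues[(duct, wire)].popleft()
        return ACK_FOR[kind]
```

Because the watch and the leave share one key, they stay associated with the same flow.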
Philip Monk
e7c8a44e11
ames: recover from mismatched message nums
We've seen issues where the message-num of the head of live.state is
less than current.state.  When this happens, we continually try to
resend message n-1, but we throw away any acknowledgment for n-1 because
current.state is already n.  This halts progress on that flow.

We don't know what causes us to get in this bad state, so this adds an
assert to the packet pump that we're in a good state, run every time
the packet pump is run.  When this crashes, we can turn on |ames-verb
and hopefully identify the cause.

This also adds logic to +on-wake in the packet pump to not try to resend
any messages that have already been acknowledged.  This is just to
rescue ships that currently have these stuck flows.

(Incidentally, I'd love to have a rr-style debugger for stuff like this.
Just run a command that says "replay my event log watching for this
specific condition and then stop and let me poke around".)
2019-12-09 23:31:18 -08:00
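The two measures in e7c8a44e11 can be sketched in Python, assuming the live queue is a list of message numbers, oldest first, and that `current` is the next message number awaiting acknowledgment. The names `check_pump` and `to_resend` are hypothetical.

```python
def check_pump(live, current):
    """Assert the invariant: the head of the live queue never falls behind current."""
    assert not live or live[0] >= current, "live head fell behind current"

def to_resend(live, current):
    """On wake, skip messages below current, since those are already acknowledged."""
    return [num for num in live if num >= current]
```

The assertion surfaces the bad state as soon as it appears, while the filter lets a flow that is already stuck make progress again.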
Philip Monk
abde1d8aa9
ames: reduce load by increasing timer delays 2019-12-06 12:11:06 -08:00
Philip Monk
956a3c7420
eyre: add instructions to login page 2019-12-05 12:31:42 -08:00
Ted Blackman
bee0b5803a
ames: don't crash on missing queued larval event 2019-12-05 17:04:24 +08:00
Jared Tobin
41b64feb16
Merge branch 'philip/p2p' (#2025)
* philip/p2p:
  ames: don't overwrite lane if already direct

Signed-off-by: Jared Tobin <jared@tlon.io>
2019-12-05 16:08:01 +08:00
Philip Monk
5406f06092
ames: don't overwrite lane if already direct
This is why basically all packets are going through the galaxies right
now.  Most of the time, the flow right now is:

* talking to ~dopzod but don't know where it is, so ask ~zod to forward,
  which it does

* ~dopzod responds both directly (on the origin lane) and through ~zod

* (if NAT, the direct response doesn't get back, but the one through
  ~zod does. Then you respond directly to ~dopzod because their lane
  piggybacked on the response. ~dopzod responds both directly and
  through ~zod, and the story picks up the same as if you weren't behind a
  NAT)

* now you have a direct lane to ~dopzod, so all is well.

* now the duplicate response from ~dopzod through ~zod comes in (takes a
  little longer because it's bouncing off ~zod), resetting your lane to
  "provisional"

* since your lane is provisional, you send your next packet both
  directly and through ~zod

* GOTO 2

This change says "if I already have a direct lane, don't overwrite it
with a provisional one". This way, the only way the direct lane can be
overwritten is if they stop responding on it (cleared on "not
responding; still trying").

I also added |- to +send-blob to make |ames-verb %rot less confusing.
2019-12-05 16:05:06 +08:00
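The rule introduced in 5406f06092 can be sketched in Python, assuming a route is either None or a `(direct, lane)` pair. The function `update_route` and this representation are assumptions for illustration, not the Ames source.

```python
def update_route(current, incoming):
    """Keep an existing direct lane when the incoming one is only provisional."""
    if current is not None and current[0] and not incoming[0]:
        return current
    return incoming
```

Under this rule, the delayed duplicate response forwarded through ~zod no longer resets an established direct lane to provisional, which breaks the cycle listed in the commit message.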
Jared Tobin
75ca54ca24
Merge branch 'ames-sponsor-scry-2' (#2021)
* ames-sponsor-scry-2:
  ames: scry for sponsor and don't crash on jael response

Signed-off-by: Jared Tobin <jared@tlon.io>
2019-12-05 15:43:00 +08:00
Ted Blackman
a7e638ebab
ames: scry for sponsor and don't crash on jael response 2019-12-04 17:18:39 -05:00