Commit Graph

40 Commits

Author SHA1 Message Date
David Soria Parra
3e94bf58b6 worker: flush ui buffers before running the worker
a91c6275 introduces flushing ui buffers after a worker finished. If the ui was
not flushed before the worker was started, fork will copy the existing buffers
to the worker. This causes messages issued before the worker started to be
written to the terminal for each worker.

We are now flushing the ui before we start a worker and add an appropriate test
which will fail before this patch.
2017-03-28 10:21:38 -07:00
Martin von Zweigbergk
ef59d7b1fd merge with stable 2017-02-28 11:13:25 -08:00
Yuya Nishihara
ee998576d8 worker: flush messages written by child processes before exit
I found some child outputs were lost while testing the previous patch. Since
os._exit() does nothing special, we need to do that explicitly.
2017-02-25 12:48:50 +09:00
FUJIWARA Katsunori
47ba9fae77 worker: ignore meaningless exit status indication returned by os.waitpid()
Before this patch, worker implementation assumes that os.waitpid()
with os.WNOHANG returns '(0, 0)' for still running child process. This
is explicitly specified as below in Python API document.

    os.WNOHANG
        The option for waitpid() to return immediately if no child
        process status is available immediately. The function returns
        (0, 0) in this case.

On the other hand, POSIX specification doesn't define the "stat_loc"
value returned by waitpid() with WNOHANG for such child process.

    http://pubs.opengroup.org/onlinepubs/9699919799/functions/waitpid.html

CPython implementation for os.waitpid() on POSIX doesn't take any care
of this gap, and this may cause unexpected "exit status indication"
even on POSIX conformance platform.

For example, os.waitpid() with os.WNOHANG returns non-zero "exit
status indication" on FreeBSD. This implies os.kill() with own pid or
sys.exit() with non-zero exit code, even if no child process fails.

To ignore meaningless exit status indication returned by os.waitpid(),
this patch skips subsequent steps forcibly, if os.waitpid() returns 0
as pid.

This patch also arranges examination of 'p' value for readability.

FYI, there are some issues below about this behavior reported for
CPython.

    https://bugs.python.org/issue21791
    https://bugs.python.org/issue27808
2017-02-25 01:07:52 +09:00
Pulkit Goyal
07314d0686 py3: convert the mode argument of os.fdopen to unicodes (1 of 2)
os.fdopen() does not accepts bytes as its second argument which represent the
mode in which the file is to be opened. This patch makes sure unicodes are
passed in py3 by using pycompat.sysstr().
2017-02-13 20:06:38 +05:30
Pulkit Goyal
4780c32e4c py3: replace os.name with pycompat.osname (part 1 of 2)
os.name returns unicodes on py3 and we have pycompat.osname which returns
bytes. This series of 2 patches will change every ocurrence of os.name with
pycompat.osname.
2016-12-19 00:16:52 +05:30
Pulkit Goyal
1a4248666b py3: replace os.environ with encoding.environ (part 2 of 5) 2016-12-18 01:46:39 +05:30
Jun Wu
cdd5ade3da worker: use os._exit for posix worker in all cases
Like commandserver, the worker should never run other resource cleanup logic.

Previously this is not true for workers if they have exceptions other than
KeyboardInterrupt.

This actually caused a real-world deadlock with remotefilelog:

1. remotefilelog/fileserverclient creates a sshpeer. pipei/o/e get created.
2. worker inherits that sshpeer's pipei/o/e.
3. worker runs sshpeer.cleanup (only happens without os._exit)
4. worker closes pipeo/i, which will normally make the sshpeer read EOF from
   its stdin and exit. But the master process still have pipeo, so no EOF.
5. worker reads pipee (stderr of sshpeer), which never completes because
   the ssh process does not exit, does not close its stderr.
6. master waits for all workers, which never completes because they never
   complete sshpeer.cleanup.

This could also be addressed by closing these fds after fork, which is not
easy because Python 2.x does not have an official "afterfork" hook. Hacking
os.fork is also ugly. Besides, sshpeer is probably not the only troublemarker.

The patch changes _posixworker so all its code paths will use os._exit to
avoid running unwanted resource clean-ups.
2016-11-24 01:15:34 +00:00
Yuya Nishihara
71e47c621a worker: discard waited pid by anyone who noticed it first
This makes sure all waited pids are removed before calling killworkers()
even if waitpid()-pids.discard() sequence is interrupted by another SIGCHLD.
2016-11-17 20:57:09 +09:00
Yuya Nishihara
bb91ef72fa worker: kill workers after all zombie processes are reaped
Since we now wait child processes in non-blocking way (changed by 6c7588a50638
and 13c3aefdee29), we don't have to kill them in the middle of the waitpid()
loop. This change will help solving a possible race of waitpid()-pids.discard()
sequence and another SIGCHLD.

waitforworkers() is called by cleanup(), in which case we do killworkers()
beforehand so we can remove killworkers() from waitforworkers().
2016-11-17 21:08:58 +09:00
Yuya Nishihara
524706bec6 worker: make sure killworkers() never be interrupted by another SIGCHLD
killworkers() iterates over pids, which can be updated by SIGCHLD handler.
So we should either copy pids or prevent killworkers() from being interrupted
by SIGCHLD. I chose the latter as it is simpler and can make pids handling
more consistent.

This fixes a possible "set changed size during iteration" error at
killworkers() before cleanup().
2016-11-17 20:44:05 +09:00
Yuya Nishihara
4d599ed630 worker: fix missed break on successful waitpid()
Follow-up for 5414fcc0ba19.
2016-11-17 21:43:01 +09:00
Jun Wu
ea73f2efd0 worker: stop using a separate thread waiting for children
Now that we have a SIGCHLD hander, and it could get executed when waiting
for I/O. It's no longer necessary to have a separated waitpid thread. So
just remove it.
2016-11-12 03:06:07 +00:00
Jun Wu
483697646a worker: add a SIGCHLD handler to collect worker immediately
As planned by previous patches, add a SIGCHLD handler to get notifications
about worker exits, and deals with worker failure immediately.

Note that the SIGCHLD handler gets unregistered before killworkers(), so
SIGCHLD won't interrupt "killworkers" - making it harder to send kill
signals to waited processes.
2016-11-12 03:07:22 +00:00
Jun Wu
5b9ad89016 worker: make waitforworkers reentrant
We are going to use it in the SIGCHLD handler. The handler will be executed
in the main thread with the non-blocking version of waitpid, while the
waitforworkers thread runs the blocking version. It's possible that one of
them collects a worker and makes the other error out (no child to wait).
This patch handles these errors: ECHILD is ignored. EINTR needs a retry.

The "pids" set is designed to be only modifiable by "waitforworkers". And we
only remove items after a successful waitpid. Since a child process can only
be "waitpid"-ed once. It's guaranteed that "pids.remove(p)" won't be called
with duplicated "p"s. And once a "p" is removed from "pids", that "p" does
not need to be killed or waited any more.
2016-11-15 02:12:16 +00:00
Jun Wu
c6f4ebbf7e worker: change "pids" to a set
There is no need to keep any order of the "pids" array. A set is more
efficient for the "remove" operation. And the following patch will use that.
2016-11-15 02:10:40 +00:00
Jun Wu
30695d34e0 worker: allow waitforworkers to be non-blocking
This patch adds a boolean flag to waitforworkers and makes it non-blocking
if set to True.

This is to make it possible that we can reap our workers while keep other
unrelated children untouched, after receiving SIGCHLD.
2016-07-28 20:57:07 +01:00
Jun Wu
300350ed69 worker: wait worker pid explicitly
Before this patch, waitforworkers uses os.wait() to collect child workers, and
only wait len(pids) processes. This can have serious issues if other code
spawns new processes and does not reap them: 1. worker.py may get wrong exit
code and kill innocent workers. 2. worker.py may continue without waiting for
all workers to complete.

This patch fixes the issue by using waitpid to wait worker pid explicitly.

However, this patch introduces a new issue: worker failure may not be handled
immediately. The issue will be addressed in next patches.
2016-07-28 20:51:20 +01:00
Jun Wu
d66ec54835 worker: move killworkers and waitforworkers up
We need to use them in the SIGCHLD handler and SIGCHLD handler should be
installed before fork.
2016-07-28 20:49:57 +01:00
Jun Wu
2893e63c0a worker: migrate to util.iterfile 2016-11-14 23:12:11 +00:00
Gregory Szorc
bfe71a12f3 worker: document poor partitioning scheme impact
mpm isn't a fan of the existing or previous partitioning scheme. He
provided a fantastic justification for why on the mailing list.

This patch adds his words to the code so they aren't forgotten.
2016-02-27 21:43:17 -08:00
Gregory Szorc
4a0cbf9fe4 worker: change partition strategy to every Nth element
The only consumer of the worker pool code today is `hg update`.

Previously, the algorithm to partition work to each worker process
preserved input list ordering. We'd take the first N elements, then
the next N elements, etc. Measurements on mozilla-central demonstrate
this isn't an optimal partitioning strategy.

I added debug code to print when workers were exiting. When performing
a working copy update on a previously empty working copy of
mozilla-central, I noticed that process lifetimes were all over the
map. One worker would complete after 7s. Many would complete after
12s. And another worker would often take >16s. This behavior occurred
for many worker process counts and was more pronounced on some than
others.

What I suspect is happening is some workers end up with lots of
small files and others with large files. This is because the update
code passes in actions according to sorted filenames. And, directories
under tend to accumulate similar files. For example, test directories
often consist of many small test files and media directories contain
binary (often larger) media files.

This patch changes the partitioning algorithm to select every Nth
element from the input list. Each worker thus has a similar composition
of files to operate on.

The result of this change is that worker processes now all tend to exit
around the same time. The possibility of a long pole due to being
unlucky and receiving all the large files has been mitigated. Overall
execution time seems to drop, but not by a statistically significant
amount on mozilla-central. However, repositories with directories
containing many large files will likely show a drop.

There shouldn't be any regressions due to partial manifest decoding
because the update code already iterates the manifest to determine
what files to operate on, so the manifest should already be decoded.
2016-02-20 15:56:44 -08:00
Pierre-Yves David
30913031d4 error: get Abort from 'error' instead of 'util'
The home of 'Abort' is 'error' not 'util' however, a lot of code seems to be
confused about that and gives all the credit to 'util' instead of the
hardworking 'error'. In a spirit of equity, we break the cycle of injustice and
give back to 'error' the respect it deserves. And screw that 'util' poser.

For great justice.
2015-10-08 12:55:45 -07:00
Gregory Szorc
5880e9b671 worker: restore old countcpus code (issue4869)
This is a backout of 0f1a7b0ccb69. The stdlib implementation of
multiprocessing.cpu_count() attempts to invoke a process on BSD and
Darwin platforms (at least on 2.7). Under certain conditions (such as
cwd being removed) this could raise. Our old code was silently catching
the exception.

The old code was more robust, so restore it.
2015-10-08 10:57:03 -07:00
Gregory Szorc
8c2d6fb389 worker: use multiprocessing to find cpu count
The multiprocessing package was added in Python 2.6.
The implementation of worker.countcpus() is very similar to
multiprocessing.cpu_count(). Ditch our one-off code.

multiprocessing does result in a number of imports. However,
the lazy importer ensures that we don't import anything until
cpu_count() is called. Furthermore, if we are doing something
with multiple cores, chances are the time of that operation
will dwarf the import time, so module bloat isn't a concern
here.
2015-05-25 13:10:38 -07:00
Gregory Szorc
999425dca1 worker: use absolute_import 2015-08-08 18:44:41 -07:00
Gregory Szorc
5380dea2a7 global: mass rewrite to use modern exception syntax
Python 2.6 introduced the "except type as instance" syntax, replacing
the "except type, instance" syntax that came before. Python 3 dropped
support for the latter syntax. Since we no longer support Python 2.4 or
2.5, we have no need to continue supporting the "except type, instance".

This patch mass rewrites the exception syntax to be Python 2.6+ and
Python 3 compatible.

This patch was produced by running `2to3 -f except -w -n .`.
2015-06-23 22:20:08 -07:00
Mads Kiilerich
23da6c1d98 cleanup: avoid _ for local unused tmp variables - that is reserved for i18n
_ is usually used for i18n markup but we also used it for I-don't-care
variables.

Instead, name don't-care variables in a slightly descriptive way but use the _
prefix to designate unused variable.

This will mute some pyflakes "import '_' ... shadowed by loop variable"
warnings.
2014-08-15 16:20:47 +02:00
Augie Fackler
9f876f6c89 cleanup: move stdlib imports to their own import statement
There are a few warnings still produced by my import checker, but
those are false positives produced by modules that share a name with
stdlib modules.
2013-11-06 16:48:06 -05:00
Matt Mackall
75eb7e7c7e worker: properly report errors from worker processes (issue3982) 2013-07-16 15:18:12 -05:00
Matt Mackall
e4e0d4a087 worker: check problem state correctly (issue3982)
If a large update triggered an abort, it was possible for the main
thread to still update the dirstate.

This fix is incomplete, as the failing worker now doesn't generate a
proper error message. This is difficult in the fork-based framework,
which relies on exceptions propagating to the top of the dispatcher
for formatting.
2013-07-16 11:53:53 -05:00
Bryan O'Sullivan
7fb4e0bf12 worker: add missing import of errno
Found using Cython.
2013-04-12 17:16:37 -07:00
Bryan O'Sullivan
c538c00399 worker: catch all exceptions, try to exit usefully/safely 2013-04-11 13:30:31 -07:00
Bryan O'Sullivan
0aa6f05307 worker: handle worker failures more aggressively
We now wait for worker processes in a separate thread, so that we can
spot failures in a timely way, wihout waiting for the progress pipe
to drain.

If a worker fails, we recover the pre-parallel-update behaviour of
failing early by killing its peers before propagating the failure.
2013-02-20 11:31:34 -08:00
Bryan O'Sullivan
0da2636c99 worker: fix a race in SIGINT handling
This is almost impossible to trigger due to the tiny time window involved.
2013-02-20 11:31:31 -08:00
Bryan O'Sullivan
9f53401ffd worker: on error, exit similarly to the first failing worker
Previously, if a worker failed, we exited with status 1. We now exit
with the correct exit code (killing ourselves if necessary).
2013-02-20 11:31:27 -08:00
Bryan O'Sullivan
5d849a878a worker: allow a function to be run in multiple worker processes
If we estimate that it will be worth the cost, we run the function in
multiple processes. Otherwise, we run it in-process.

Children report progress to the parent through a pipe.

Not yet implemented on Windows.
2013-02-09 15:51:32 -08:00
Bryan O'Sullivan
8ef1da44b3 worker: partition a list (of tasks) into equal-sized chunks 2013-02-09 15:51:32 -08:00
Bryan O'Sullivan
998baaf4d8 worker: estimate whether it's worth running a task in parallel
Not implemented for Windows yet.
2013-02-09 15:51:26 -08:00
Bryan O'Sullivan
46bf2f5a6f worker: count the number of CPUs
This works on the major platforms, and falls back to a safe guess of
1 elsewhere.
2013-02-09 15:22:12 -08:00