the ports-* users. Currently it is not possible to delegate
management of ZFS filesystems to non-root users, so root privilege
is required to manipulate them. We validate the command passed on
a local domain socket and re-execute the build script with the requested
parameters.
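
A minimal sketch of that broker, assuming a hypothetical socket path and
sub-command whitelist (the real script names and protocol may differ):

    import os
    import re
    import socket
    import subprocess

    SOCK_PATH = "/var/portbuild/build.sock"     # hypothetical socket path
    ALLOWED = {"clone", "destroy", "snapshot"}  # hypothetical sub-command whitelist
    SAFE = re.compile(r"^[A-Za-z0-9._/-]+$")    # reject shell metacharacters

    def serve():
        if os.path.exists(SOCK_PATH):
            os.unlink(SOCK_PATH)
        srv = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        srv.bind(SOCK_PATH)
        os.chmod(SOCK_PATH, 0o660)              # writable by the ports-* group only
        srv.listen(5)
        while True:
            conn, _ = srv.accept()
            words = conn.recv(4096).decode("ascii", "replace").split()
            # Validate, then re-execute the build script with the parameters.
            if words and words[0] in ALLOWED and all(SAFE.match(w) for w in words[1:]):
                subprocess.call(["/var/portbuild/scripts/build"] + words)
            conn.close()

    if __name__ == "__main__":
        serve()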
ports and source trees. Since we have >=1 consumers of these trees
that run frequently but do not insist on up-to-the-second trees, it
makes sense to "pre-update" them regularly and then re-use the result
in all of the consumers, instead of potentially doing several updates
simultaneously or on demand. Consumers can clone the ZFS snapshot
into their local filesystem which takes a couple of seconds instead of
minutes or tens of minutes for the CVS update.
We update to a date stamp instead of "." because this avoids the
ambiguity of commits that happen while the tree update is in progress
(unfortunately the dated update is slower).
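
For illustration, the update/snapshot cycle might look roughly like
this; the dataset names and checkout path are made up, and only the
cvs -D and zfs snapshot/clone usage reflect the mechanism described:

    import subprocess
    import time

    now = time.gmtime()
    stamp = time.strftime("%Y%m%d%H%M%S", now)
    # Update to an explicit date stamp so commits that land mid-update
    # are excluded.
    subprocess.check_call(["cvs", "-q", "update", "-PdA",
                           "-D", time.strftime("%Y-%m-%d %H:%M:%S GMT", now)],
                          cwd="/snap/ports")
    subprocess.check_call(["zfs", "snapshot", "snap/ports@%s" % stamp])
    # A consumer clones the snapshot into its own filesystem in a couple
    # of seconds instead of re-running the CVS update:
    subprocess.check_call(["zfs", "clone", "snap/ports@%s" % stamp,
                           "portbuild/i386/8/ports"])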
listed filesystem we take a new snapshot each time the script is run
and, if the last full backup was not too long ago, do a compressed
incremental backup against the previous backup.
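
Sketched with hypothetical filesystem and snapshot names, the
incremental path is essentially:

    import subprocess
    import time

    fs = "snap/portbuild"               # hypothetical filesystem from the list
    prev = "20080101"                   # previous backup snapshot (hypothetical)
    cur = time.strftime("%Y%m%d")
    subprocess.check_call(["zfs", "snapshot", "%s@%s" % (fs, cur)])
    # Compressed incremental stream relative to the previous backup snapshot.
    with open("/backup/%s.zfs.gz" % cur, "wb") as out:
        send = subprocess.Popen(["zfs", "send", "-i", prev, "%s@%s" % (fs, cur)],
                                stdout=subprocess.PIPE)
        subprocess.check_call(["gzip", "-c"], stdin=send.stdout, stdout=out)
        send.stdout.close()
        if send.wait() != 0:
            raise RuntimeError("zfs send failed")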
* Catch up to build ID directory changes
* Support a meta-hostname of 'all' for setting up all clients at once.
This is better than the old way of running one copy of the script
for each client by hand, since it is easier and involves less
duplicated work.
* We copy in the per-build ports, src, and bindist .tbz files and .md5
  checksums, and refresh the build scripts and the
  bindist-$(hostname).tar customization tarball (see the sketch after
  this list).
* The -force switch forces copying of files and re-extraction of the
tarballs on the client. This is necessary in order to propagate
local changes to the tarballs after the initial client setup
(e.g. if you need to change a file in the ports tree, it must be
recompressed, redistributed, and re-extracted on the client).
* The -queue switch will poll the client's job queue after completion
of the setup. This is racy and should only be used when the machine
is not currently accepting jobs.
* For cleaning up a build, the 'build cleanup' command should now be
  used instead. It calls back into this command but also allows full
  cleanup of build-local files on the client.
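
The checksum-gated copy/extract step might look like the following
sketch; the hostnames, paths, and the .md5 stamp convention are
assumptions, not the script's actual layout:

    import hashlib
    import os
    import subprocess

    def md5_of(path):
        h = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    def push_tarball(host, tarball, dest_dir, force=False):
        name = os.path.basename(tarball)
        want = md5_of(tarball)
        # Ask the client which checksum it recorded at the last extraction.
        have = subprocess.run(["ssh", host, "cat %s/%s.md5" % (dest_dir, name)],
                              capture_output=True, text=True).stdout.strip()
        if not force and have == want:
            return                      # already current, skip the copy
        subprocess.check_call(["scp", tarball, "%s:%s/" % (host, dest_dir)])
        subprocess.check_call(["ssh", host,
                               "cd %s && tar xjf %s && echo %s > %s.md5"
                               % (dest_dir, name, want, name)])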
TODO: "all" setups are hard on the server since they may spawn dozens
of rsyncs at once. A better solution would be to have a worker pool
of setup tasks to limit the maximum load.
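
A bounded pool along the lines of this TODO could be as simple as the
following (the setup-client command name is hypothetical):

    from concurrent.futures import ThreadPoolExecutor
    import subprocess

    def setup_all(hosts, max_workers=4):
        # Bound the number of concurrent rsync/scp-heavy setups.
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            for host in hosts:
                pool.submit(subprocess.check_call, ["setup-client", host])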
* Catch up to build ID directory changes
* Make it easier to kill a build by not running dopackages in the background
where it is detached from shell job control. Now, sending a termination
signal to this process (e.g. ^C) will also kill off the dopackages script
  and in turn the processes created by it. Some background processes
  spawned by dopackages, pdispatch, etc., may still remain and need to
  be killed by hand.
* Catch up to build ID directory changes
* Improve usage()
* Fix a variety of small bugs
* Remove support for -ftp builds: we have not supported direct
uploading for many years due to the desire to manually inspect build
output for quality
* All data associated with a build is now localized in its own directory
named according to a build ID:
/var/portbuild/${arch}/${branch}/builds/${buildid}, where ${buildid}
is the creation time. These are actually ZFS filesystems.
* Tasks such as cloning a new build, updating a ZFS snapshot, and
cleaning up a build are exported to the "build" script, which can be
used independently.
* Creating a new build is done by ZFS cloning and takes a couple of
  seconds since it is copy-on-write (i.e. no data needs to be copied);
  see the sketch after this list.
* Ports and source trees are also cloned from pre-updated ZFS images
(updated regularly from the "updatesnap" cron job). In most cases
  we do not care if we are building a ports tree that is an hour or so
  old, since it will become outdated almost immediately anyway; no
  matter what we do, there will be times when a port has already been
  fixed by the time a client reports the build error.
* In case an up-to-the-second tree is desired, the -portscvs and
-srccvs switches update the existing ports tree via CVS.
* -noports and -nosrc can be used to prevent any automatic changes to
the ports tree. This is useful for dealing with local
modifications (e.g. for -exp builds), since the default when
creating a new build is to replace the previous trees with fresh,
  pristine trees. If you forget to use these switches, any local
  changes that are not also present in other trees will be lost.
* By default we keep two builds for each arch/branch pair. These
  build IDs may also be referred to via the "latest" and "previous"
  symlinks. When creating a new build, the old "previous" build is
destroyed by default, unless it was originally created using the
-keep switch. This prevents the build from being destroyed
automatically.
* By default when a build finishes all of the clients are completely
cleaned up (i.e. all build data such as ports trees, tarballs,
client chroots, etc are deleted). This is needed to save space on
  the clients. If you expect to *immediately* perform further builds
  after this one completes, the -nocleanup switch skips this step;
  otherwise the clients will just be set up again when further builds
  are scheduled.
* Try to parallelize build pre-processing as much as possible, by
running jobs in the background wherever possible. In several places
we operate on the same parts of the filesystem from multiple jobs,
so we can make good use of caching to improve performance
* Clients no longer need to be set up explicitly at the start of the
build, they will be set up on-demand when the first job is
dispatched to them. This allows fast clients or those that already
have been set up to begin building ports as soon as possible, while
slow clients are set up in the background. It also improves
robustness of client recovery, e.g. if the client was offline at the
time of build startup but later brought back online.
* Optimize copying back in the previous set of restricted packages by
hardlinking instead of copying.
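
Pulling the pieces above together, a hypothetical sketch of build
creation; the ZFS dataset names, paths, and the .keep marker
convention are assumptions, not the actual portbuild code:

    import os
    import subprocess
    import time

    def keep_requested(base, buildid):
        # Assume -keep drops a marker file into the build directory.
        return os.path.exists(os.path.join(base, buildid, ".keep"))

    def new_build(arch, branch):
        base = "/var/portbuild/%s/%s/builds" % (arch, branch)
        buildid = time.strftime("%Y%m%d%H%M%S")
        # Copy-on-write clone of the pre-updated tree: seconds, any size.
        subprocess.check_call(["zfs", "clone", "snap/ports@latest",
                               "portbuild/%s/%s/%s" % (arch, branch, buildid)])
        latest = os.path.join(base, "latest")
        previous = os.path.join(base, "previous")
        # Destroy the old "previous" build unless it was created with -keep.
        if os.path.islink(previous):
            old = os.readlink(previous)
            if not keep_requested(base, old):
                subprocess.check_call(["zfs", "destroy", "-r",
                                       "portbuild/%s/%s/%s" % (arch, branch, old)])
            os.unlink(previous)
        # Rotate: the old "latest" becomes "previous", the new build "latest".
        if os.path.islink(latest):
            os.rename(latest, previous)
        os.symlink(buildid, latest)
        return buildid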
TODO: The record of failed ports is still arch/branch-global. This is
the only thing preventing us from running concurrent builds of the
same arch/branch (e.g. while one is stuck building openoffice, the
next build can start to keep the cluster busy). The difficulty is
that one build from a later ports tree may signal that a build was
successful, then a phase 2 build from an earlier ports tree may
indicate that it was broken. The solution is probably to migrate this
to a real database instead of a flat file, and query it for the set of
broken ports as of a certain ports tree date.
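
One possible shape for such a database, sketched here with sqlite3;
the schema and the query are assumptions about what "broken as of a
tree date" would mean:

    import sqlite3

    db = sqlite3.connect("/var/portbuild/failures.db")  # hypothetical location
    db.execute("""CREATE TABLE IF NOT EXISTS results (
                      pkgname   TEXT,
                      arch      TEXT,
                      branch    TEXT,
                      tree_date TEXT,    -- date stamp of the ports tree built
                      broken    INTEGER  -- 1 = failed, 0 = built successfully
                  )""")

    def broken_as_of(arch, branch, tree_date):
        # A package is broken if its most recent verdict at or before the
        # given tree date was a failure; a phase 2 result from an earlier
        # tree can no longer override a success from a later tree.
        rows = db.execute("""
            SELECT DISTINCT pkgname FROM results r
            WHERE arch = ? AND branch = ? AND broken = 1
              AND tree_date = (SELECT MAX(tree_date) FROM results
                               WHERE pkgname = r.pkgname AND arch = r.arch
                                 AND branch = r.branch AND tree_date <= ?)
        """, (arch, branch, tree_date)).fetchall()
        return [r[0] for r in rows]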
* Clients no longer mount ports/src trees via NFS (even the FreeBSD.org
local clients). This was putting too much load on the server and
slowing down builds.
* Instead ports and src .tbz files are pushed to the clients and
unpacked. MD5 checksums are used to verify correctness
* -force forces re-extraction of the tarballs even if they exist and
  appear to have already been extracted
* Also unpack the compressed bindist
TODO: When we are not using md or ZFS builds it would be even faster
to keep an unpacked copy of the bindist on the scratch filesystem and
hardlink the files into the target directory
* Catch up to build ID directory changes
* Improved support for ZFS builds
* Improved robustness
* Report status verbosely to the caller: whether we succeeded in
  claiming a chroot, whether the caller first needs to set up the
  client, or whether a setup is already in progress.
* If we discover that the client has not been set up either because it
freshly booted and newfs'ed its filesystem, or because a particular
build has not yet been encountered, atomically claim a cookie and
report this to the caller to act on
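
The atomic claim can be as simple as exclusive file creation; the
cookie path is whatever per-build location the scripts agree on:

    import errno
    import os

    def claim_setup_cookie(path):
        # O_EXCL guarantees exactly one caller wins the race to run setup.
        try:
            fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
        except OSError as e:
            if e.errno == errno.EEXIST:
                return False    # setup already claimed by someone else
            raise
        os.close(fd)
        return True             # caller should initiate the client setup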
* Catch up to build ID directory changes
* Add helper functions for resolving a build ID symlink and
validating an arch/branch combination (centralize instead of doing it
in many scripts)
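
A possible shape for the centralized helpers, following the build ID
layout above (details illustrative):

    import os

    def resolve_buildid(arch, branch, buildid):
        # Map "latest"/"previous" (or a literal ID) to the concrete build ID.
        path = "/var/portbuild/%s/%s/builds/%s" % (arch, branch, buildid)
        return os.path.basename(os.path.realpath(path))

    def validate_arch_branch(arch, branch):
        return os.path.isdir("/var/portbuild/%s/%s" % (arch, branch))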
* Catch up to build ID directory changes
* Add support for ssh_cmd and scp_cmd to allow using HPN-SSH with the
none cipher where possible (for performance)
* Lazy client setup; claim-chroot will report if the client needs to be
set up with this buildid, and we initiate the setup and poll until
it is complete. This allows fast clients to begin building before
slow ones have finished setting up.
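
The dispatch-side logic is roughly the following; the status strings
and the setup-client command are hypothetical, not claim-chroot's real
output:

    import subprocess
    import time

    def claim_with_setup(host, buildid):
        def claim():
            out = subprocess.run(["ssh", host, "claim-chroot", buildid],
                                 capture_output=True, text=True)
            return out.stdout.strip()

        status = claim()
        if status == "setup-required":
            # Kick off the lazy setup in the background, then poll until it
            # completes; fast clients sail through while slow ones set up.
            subprocess.Popen(["setup-client", host, buildid])
            status = "setup-in-progress"
        while status == "setup-in-progress":
            time.sleep(30)
            status = claim()
        return status           # e.g. the path of the claimed chroot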
TODO: a better solution would be to avoid trying to dispatch jobs onto
clients that are in the process of setting up, since they often have low
loads and are picked preferentially by the job scheduler.
* Remove vestiges of archaic support for building bindists from FTP
snapshots; we haven't used this for years and building a world is no
longer a challenge
* Revert half-baked bindist generation number and make it per-buildid
instead. Compress and md5 it for distribution to the clients.
TODO: Merge with makeworld?
checkmachines script. Polls build machines for their status either
once-off or regularly as a daemon. Optionally it will update the
queue entries but this remains subject to race conditions.
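
The core loop is roughly as follows; the host list, probe command, and
interval are placeholders:

    import subprocess
    import time

    def poll(hosts, daemon=False, interval=300):
        while True:
            for host in hosts:
                up = subprocess.call(["ssh", "-o", "ConnectTimeout=10",
                                      host, "uptime"],
                                     stdout=subprocess.DEVNULL) == 0
                print("%s: %s" % (host, "up" if up else "DOWN"))
            if not daemon:
                break
            time.sleep(interval)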
TODO: Integrate with queue manager and forward machine status changes
to it
* Remove 5.x support
* Leave the archaic ftp snapshot support for now; it is not hurting
  anything but will not work
* Be more careful when removing files (use absolute paths)
* Switch to bindist/tmp for the tmp dir
* Fix the recording of the bindist.tar generation number
* Get rid of redundant or useless processing of the world image
* Record the CVS update stamp in some extra places and make sure to remove it
if the build is started with -noportscvs (since this probably means the
ports tree was updated by hand at some random time)
invocations). It also fixes some edge cases that were not handled in
the previous version.
TODO: Correctly report IPv6 sockets (already in use by the sparc64 build)
ordering, which had become too limited.
We now build packages ordered so that those that are part of the
longest dependency chains are built first. This has the effect of
building the deepest parts of the tree first and levelling out the
tree height, hopefully avoiding the situation we currently face where
bottlenecks appear late in the build and the cluster sits mostly idle
while waiting for a few long dependency chains to finish building
before it can become fully loaded again.
The algorithm is that we sort the list of remaining packages by
height (length of the longest dependency chain), then enqueue ready
leaf packages in that order until we have filled a queue of length
between 100 and 200, to amortise the cost of this queue rebalancing
while not losing the height-averaging property. Jobs are dispatched
from this queue into worker threads as machine slots become
available.
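
A minimal, illustrative sketch of this scheme (not the actual
portbuild code; the refill-when-drained driver in particular is a
simplification):

    import threading
    from queue import Queue

    QHIGH = 200                 # upper rebalancing threshold (100-200 above)

    def heights(deps):
        # deps maps every package to the set of packages it depends on;
        # the height of a package is its longest dependency chain length.
        memo = {}
        def h(pkg):
            if pkg not in memo:
                memo[pkg] = 1 + max((h(d) for d in deps[pkg]), default=0)
            return memo[pkg]
        for pkg in deps:
            h(pkg)
        return memo

    def refill(jobq, remaining, deps, done):
        # Sort what is left by height (deepest chains first), then enqueue
        # the ready leaves in that order until the queue is full again.
        order = heights(deps)
        for pkg in sorted(remaining, key=lambda p: order[p], reverse=True):
            if deps[pkg] <= done:
                jobq.put(pkg)
                remaining.discard(pkg)
            if jobq.qsize() >= QHIGH:
                break

    def build_package(pkg, slot):
        print("building %s on %s" % (pkg, slot))  # stands in for pdispatch

    def worker(jobq, machines, done, lock):
        while True:
            pkg = jobq.get()
            slot = machines.get()       # block until a machine slot frees up
            try:
                build_package(pkg, slot)
            finally:
                machines.put(slot)
                with lock:
                    done.add(pkg)
                jobq.task_done()

    def run(deps, nmachines=2):
        jobq, machines = Queue(), Queue()
        done, lock = set(), threading.Lock()
        for i in range(nmachines):
            machines.put("machine%d" % i)
            threading.Thread(target=worker, args=(jobq, machines, done, lock),
                             daemon=True).start()
        remaining = set(deps)
        while remaining:
            refill(jobq, remaining, deps, done)
            jobq.join()                 # simplification: refill when drained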
Unlike the make-based solution, which required a fixed -j concurrency
value and could not respond to the addition or removal of build
resources, we can now dynamically add new machines to the queue as
they become available.
The other advantage of using python is that we have more
customisability and visibility into the build status, e.g. we
periodically report the number of remaining packages, as well as the
list of deepest packages that we are working on.
TODO:
* Implement mtime checking for parent package staleness, so that
  parents are rebuilt if their dependencies are touched more recently.
  Currently packages will not be rebuilt if they exist, whether or not
  they are "stale" wrt their dependencies (see the sketch after this
  list).
* Offload the machine selection into an external queue manager.
Currently the queue manager used here doesn't interoperate with the
old one (getmachine/releasemachine) because it's not possible to use
the lockf()-based mutual exclusion within a multithreaded client.
Doing that will also allow for a more flexible job placement
algorithm as well as finer queue customization.
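
For the first TODO item, the check itself is simple; a sketch
comparing package file mtimes (the paths are whatever the package
directory layout provides):

    import os

    def is_stale(pkg_path, dep_paths):
        # Rebuild if the package is missing or any dependency is newer.
        if not os.path.exists(pkg_path):
            return True
        mtime = os.path.getmtime(pkg_path)
        return any(os.path.getmtime(dep) > mtime
                   for dep in dep_paths if os.path.exists(dep))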