Log lines which include an error should use the full term "error"
as the context key. This provides consistency across the codebase
and avoids a Go style which operators might not be aware of.
This PR causes the logmon task runner to acquire the binary of the
Nomad executable in an 'init' block, so as to almost certainly get
the name while the nomad file still exists.
This is an attempt at fixing the case where a deleted Nomad file
(e.g. during upgrade) may be getting renamed with a mysterious
suffix first.
If this doesn't work, as a last resort we can literally just trim
the mystery string.
Fixes: #14079
The `golang.org/x/net/context` package was merged into the stdlib as of go
1.7. Update the imports to use the identical stdlib version. Clean up import
blocks for the impacted files to remove unnecessary package aliasing.
* test: use `T.TempDir` to create temporary test directory
This commit replaces `ioutil.TempDir` with `t.TempDir` in tests. The
directory created by `t.TempDir` is automatically removed when the test
and all its subtests complete.
Prior to this commit, temporary directory created using `ioutil.TempDir`
needs to be removed manually by calling `os.RemoveAll`, which is omitted
in some tests. The error handling boilerplate e.g.
defer func() {
if err := os.RemoveAll(dir); err != nil {
t.Fatal(err)
}
}
is also tedious, but `t.TempDir` handles this for us nicely.
Reference: https://pkg.go.dev/testing#T.TempDir
Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>
* test: fix TestLogmon_Start_restart on Windows
Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>
* test: fix failing TestConsul_Integration
t.TempDir fails to perform the cleanup properly because the folder is
still in use
testing.go:967: TempDir RemoveAll cleanup: unlinkat /tmp/TestConsul_Integration2837567823/002/191a6f1a-5371-cf7c-da38-220fe85d10e5/web/secrets: device or resource busy
Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>
Fix a logmon leak causing high goroutine and memory usage when a task
restarts.
Logmon `FileRotator` buffers the task stdout/stderr streams and
periodically flushing them to log files. Logmon creates a new
FileRotator for each stream for each task run. However, the
`flushPeriodically` goroutine is leaked when a task restarts,
holding a reference to a no-longer-needed `FileRotator` instance
along with its 64kb buffer.
The cause is that the code assumed `time.Ticker.Stop()` closes the
ticker channel, thereby terminating the goroutine, but the documentation
says otherwise:
> Stop turns off a ticker. After Stop, no more ticks will be sent. Stop does not close the channel, to prevent a concurrent goroutine reading from the channel from seeing an erroneous "tick".
https://pkg.go.dev/time#Ticker.Stop
Add an RPC timeout for logmon. In
https://github.com/hashicorp/nomad/issues/6461#issuecomment-559747758 ,
`logmonClient.Stop` locked up and indefinitely blocked the task runner
destroy operation.
This is an incremental improvement. We still need to follow up to
understand how we got to that state, and the full impact of locked-up
Stop and its link to pending allocations on restart.
There's a bug in go1.11 that causes some io operations on windows to
return incorrect errors for some cases when Stat-ing files. To avoid
upgrading to go1.12 in a point release, here we loosen up the cases
where we will attempt to create fifos, and add some logging of
underlying stat errors to help with debugging.
On unix platforms, it is safe to re-open fifo's for reading after the
first creation if the file is already a fifo, however this is not
possible on windows where this triggers a permissions error on the
socket path, as you cannot recreate it.
We can't transparently handle this in the CreateAndRead handle, because
the Access Is Denied error is too generic to reliably be an IO error.
Instead, we add an explict API for opening a reader to an existing FIFO,
and check to see if the fifo already exists inside the calling package
(e.g logmon)
Currently, nomad "plugin" processes (e.g. executor, logmon, docker_logger) are started as CLI
commands to be handled by command CLI framework. Plugin launchers use
`discover.NomadBinary()` to identify the binary and start it.
This has few downsides: The trivial one is that when running tests, one
must re-compile the nomad binary as the tests need to invoke the nomad
executable to start plugin. This is frequently overlooked, resulting in
puzzlement.
The more significant issue with `executor` in particular is in relation
to external driver:
* Plugin must identify the path of invoking nomad binary, which is not
trivial; `discvoer.NomadBinary()` now returns the path to the plugin
rather than to nomad, preventing external drivers from launching
executors.
* The external driver may get a different version of executor than it
expects (specially if we make a binary incompatible change in future).
This commit addresses both downside by having the plugin invocation
handling through an `init()` call, similar to how libcontainer init
handler is done in [1] and recommened by libcontainer [2]. `init()`
will be invoked and handled properly in tests and external drivers.
For external drivers, this change will cause external drivers to launch
the executor that's compiled against.
There a are a couple of downsides to this approach:
* These specific packages (i.e executor, logmon, and dockerlog) need to
be careful in use of `init()`, package initializers. Must avoid having
command execution rely on any other init in the package. I prefixed
files with `z_` (golang processes files in lexical order), but ensured
we don't depend on order.
* The command handling is spread in multiple packages making it a bit
less obvious how plugin starts are handled.
[1] drivers/shared/executor/libcontainer_nsenter_linux.go
[2] eb4aeed24f/libcontainer (using-libcontainer)
This PR switches to using plain fifo files instead of golang structs
managed by containerd/fifo library.
The library main benefit is management of opening fifo files. In Linux,
a reader `open()` request would block until a writer opens the file (and
vice-versa). The library uses goroutines so that it's the first IO
operation that blocks.
This benefit isn't really useful for us: Given that logmon simply
streams output in a separate process, blocking of opening or first read
is effectively the same.
The library additionally makes further complications for managing state
and tracking read/write permission that seems overhead for our use,
compared to using a file directly.
Looking here, I made the following incidental changes:
* document that we do handle if fifo files are already created, as we
rely on that behavior for logmon restarts
* use type system to lock read vs write: currently, fifo library returns
`io.ReadWriteCloser` even if fifo is opened for writing only!
Noticed that the protobuf files are out of sync with ones generated by 1.2.0 protoc go plugin.
The cause for these files seem to be related to release processes, e.g. [0.9.0-beta1 preperation](ecec3d38de (diff-da4da188ee496377d456025c2eab4e87)), and [0.9.0-beta3 preperation](b849d84f2f).
This restores the changes to that of the pinned protoc version and fails build if protobuf files are out of sync. Sample failing Travis job is that of the first commit change: https://travis-ci.org/hashicorp/nomad/jobs/506285085
There were multiple bugs here:
1. Reattach unmarshalling always returned an error because you can't
unmarshal into a nil pointer.
2. The hook data wasn't being saved because it was put on the request
struct, not the response struct.
3. The plugin configuration should only have reattach *or* a command
set. Not both.
4. Setting Done=true meant the hook was never re-run on agent restart so
reattaching was never attempted.
Track current memory usage, `memory.usage_in_bytes`, in addition to
`memory.max_memory_usage_in_bytes` and friends. This number is closer
what Docker reports.
Related to https://github.com/hashicorp/nomad/issues/5165 .
As of now, FileRotator uses bufio.Write under the hood to write data to
configured output file. Due to the way how bufio handles any occurred io
error - saves it into `err` variable never resetting it automatically -
any operation like `Write`, `Flush` etc will become a no-op, returning the very same,
saved error (eg. Out of disk space) even when the problem is fixed (eg. disk
space is available again).
That automatically means that FileRotator will stop writing any logs,
reporting the same error over and over again, even if it's no longer
valid.
This PR fixes it by resetting the bufio Writer, which resets any errors
and tries to write requested data.
The logging package is used by logmon and the legacy mock_driver. Because the
legacy drivers are going away, I'm moving it here to signify its actual
ownership.