Failed requests due to API client errors are to be marked as DEBUG.
The Error log level should be reserved to signal problems with the
cluster and are actionable for nomad system operators. Logs due to
misbehaving API clients don't represent a system level problem and seem
spurius to nomad maintainers at best. These log messages can also be
attack vectors for deniel of service attacks by filling servers disk
space with spurious log messages.
Pipe http server log to hclog, so that it uses the same logging format
as rest of nomad logs. Also, supports emitting them as json logs, when
json formatting is set.
The http server logs are emitted as Trace level, as they are typically
repsent HTTP client errors (e.g. failed tls handshakes, invalid headers,
etc).
Though, Panic logs represent server errors and are relayed as Error
level.
* command/agent/csi_endpoint: support type filter in volumes & plugins
* command/agent/http: use /v1/volume/csi & /v1/plugin/csi
* api/csi: use /v1/volume/csi & /v1/plugin/csi
* api/nodes: use /v1/volume/csi & /v1/plugin/csi
* api/nodes: not /volumes/csi, just /volumes
* command/agent/csi_endpoint: fix ot parameter parsing
allow oss to parse sink duration
clean up audit sink parsing
ent eventer config reload
fix typo
SetEnabled to eventer interface
client acl test
rm dead code
fix failing test
Introduce limits to prevent unauthorized users from exhausting all
ephemeral ports on agents:
* `{https,rpc}_handshake_timeout`
* `{http,rpc}_max_conns_per_client`
The handshake timeout closes connections that have not completed the TLS
handshake by the deadline (5s by default). For RPC connections this
timeout also separately applies to first byte being read so RPC
connections with TLS enabled have `rpc_handshake_time * 2` as their
deadline.
The connection limit per client prevents a single remote TCP peer from
exhausting all ephemeral ports. The default is 100, but can be lowered
to a minimum of 26. Since streaming RPC connections create a new TCP
connection (until MultiplexV2 is used), 20 connections are reserved for
Raft and non-streaming RPCs to prevent connection exhaustion due to
streaming RPCs.
All limits are configurable and may be disabled by setting them to `0`.
This also includes a fix that closes connections that attempt to create
TLS RPC connections recursively. While only users with valid mTLS
certificates could perform such an operation, it was added as a
safeguard to prevent programming errors before they could cause resource
exhaustion.
Noticed that ACL endpoints return 500 status code for user errors. This
is confusing and can lead to false monitoring alerts.
Here, I introduce a concept of RPCCoded errors to be returned by RPC
that signal a code in addition to error message. Codes for now match
HTTP codes to ease reasoning.
```
$ nomad acl bootstrap
Error bootstrapping: Unexpected response code: 500 (ACL bootstrap already done (reset index: 9))
$ nomad acl bootstrap
Error bootstrapping: Unexpected response code: 400 (ACL bootstrap already done (reset index: 9))
```
Nomad web UI currently fails when querying client nodes for allocation
state end endpoints, due to CORS policy.
The issue is that CORS requests that are marked `withCredentials` need
the http server to include a `Access-Control-Allow-Credentials` [1].
But Nomad Task Logs and filesystem requests include authenticating
information and thus marked with `credentials=true`[2][3].
It's worth noting that the browser currently sends credentials and
authentication token to servers anyway; it's just that the response is
not made available to caller nomad ui javascript. For task logs
specifically, nomad ui retries again by querying the web ui address
(typically pointing to a nomad server) which will forward the request
to the nomad client agent appropriately.
[1] https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Access-Control-Allow-Credentials
[2] 101d0373ee/ui/app/components/task-log.js (L50)
[3] 101d0373ee/ui/app/services/token.js (L25-L39)
Adds new package that can be used by client and server RPC endpoints to
facilitate monitoring based off of a logger
clean up old code
small comment about write
rm old comment about minsize
rename to Monitor
Removes connection logic from monitor command
Keep connection logic in endpoints, use a channel to send results from
monitoring
use new multisink logger and interfaces
small test for dropped messages
update go-hclogger and update sink/intercept logger interfaces
Adds nomad monitor command. Like consul monitor, this command allows you
to stream logs from a nomad agent in real time with a a specified log
level
add endpoint tests
Upgrade go-hclog to latest version
The current version of go-hclog pads log prefixes to equal lengths
so info becomes [INFO ] and debug becomes [DEBUG]. This breaks
hashicorp/logutils/level.go Check function. Upgrading to the latest
version removes this padding and fixes log filtering that uses logutils
Check
This adds a websocket endpoint for handling `nomad exec`.
The endpoint is a websocket interface, as we require a bi-directional
streaming (to handle both input and output), which is not very appropriate for
plain HTTP 1.0. Using websocket makes implementing the web ui a bit simpler. I
considered using golang http hijack capability to treat http request as a plain
connection, but the web interface would be too complicated potentially.
Furthermore, the API endpoint operates against the raw core nomad exec streaming
datastructures, defined in protobuf, with json serializer. Our APIs use json
interfaces in general, and protobuf generates json friendly golang structs.
Reusing the structs here simplify interface and reduce conversion overhead.
Fix an issue in which the deployment watcher would fail the deployment
based on the earliest progress deadline of the deployment regardless of
if the task group has finished.
Further fix an issue where the blocked eval optimization would make it
so no evals were created to progress the deployment. To reproduce this
issue, prior to this commit, you can create a job with two task groups.
The first group has count 1 and resources such that it can not be
placed. The second group has count 3, max_parallel=1, and can be placed.
Run this first and then update the second group to do a deployment. It
will place the first of three, but never progress since there exists a
blocked eval. However, that doesn't capture the fact that there are two
groups being deployed.
The parse endpoint accepts a hcl jobspec body within a json object
and returns the parsed json object for the job. This allows users to
register jobs with the nomad json api without specifically needing
a nomad binary to parse their hcl encoded jobspec file.
Fixes#3697
The existing code and test case only covered the leader behavior. When
querying against non-leaders the error has an "rpc error: " prefix.
To provide consistency in HTTP error response I also strip the "rpc
error: " prefix for 403 responses as they offer no beneficial additional
information (and in theory disclose a tiny bit of data to unauthorized
users, but it would be a pretty weird bit of data to use in a malicious
way).
* Allow server TLS configuration to be reloaded via SIGHUP
* dynamic tls reloading for nomad agents
* code cleanup and refactoring
* ensure keyloader is initialized, add comments
* allow downgrading from TLS
* initalize keyloader if necessary
* integration test for tls reload
* fix up test to assert success on reloaded TLS configuration
* failure in loading a new TLS config should remain at current
Reload only the config if agent is already using TLS
* reload agent configuration before specific server/client
lock keyloader before loading/caching a new certificate
* introduce a get-or-set method for keyloader
* fixups from code review
* fix up linting errors
* fixups from code review
* add lock for config updates; improve copy of tls config
* GetCertificate only reloads certificates dynamically for the server
* config updates/copies should be on agent
* improve http integration test
* simplify agent reloading storing a local copy of config
* reuse the same keyloader when reloading
* Test that server and client get reloaded but keep keyloader
* Keyloader exposes GetClientCertificate as well for outgoing connections
* Fix spelling
* correct changelog style
This PR removes deepcopying of the job attached to the allocation in the
alloc runner. This operation is called very often so removing reflect
from the code path and the potentially large number of mallocs need to
create a job reduced memory and cpu pressure.
* -dev mode defaults bind & advertise to localhost
* Normal mode defaults bind to 0.0.0.0 & advertise to the resolved
hostname. If the hostname resolves to localhost it will refuse to
start and advertise must be manually set.
* Added the keygen command
* Added support for gossip encryption
* Changed the URL for keyring management
* Fixed the cli
* Added some tests
* Added tests for keyring operations
* Added a test for removal of keys
* Added some docs
* Fixed some docs
* Added general options