Previously, Nomad was using a hand-made lookup table for looking
up EC2 CPU performance characteristics (core count + speed = ticks).
This data was incomplete and incorrect depending on region. The AWS
API has the correct data but requires API keys to use (i.e. should not
be queried directly from Nomad).
This change introduces a lookup table generated by a small command line
tool in Nomad's tools module which uses the Amazon AWS API.
Running the tool requires AWS_* environment variables set.
$ # in nomad/tools/cpuinfo
$ go run .
Going forward, Nomad can incorporate regeneration of the lookup table
somewhere in the CI pipeline so that we remain up-to-date on the latest
offerings from EC2.
Fixes #7830
Fixes #7681
The current behavior of the CPU fingerprinter in AWS is that it
reads the **current** speed from `/proc/cpuinfo` (`cpu MHz` field).
This is because the max CPU frequency is not available by reading
anything on the EC2 instance itself. Normally on Linux one would
look at e.g. `/sys/devices/system/cpu/cpuN/cpufreq/cpuinfo_max_freq`
or perhaps parse the values from the `CPU max MHz` field in
`/proc/cpuinfo`, but those values are not available.
Furthermore, no metadata about the CPU is made available in the
EC2 metadata service.
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-categories.html
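For reference, the sysfs lookup mentioned above could look roughly like the sketch below. It is only an illustration: the helper name `parseMaxFreqMHz` is made up for this example, and it assumes the kernel reports `cpuinfo_max_freq` in kHz. On EC2 the read is expected to fail, which is the problem described here.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseMaxFreqMHz (hypothetical helper) converts the contents of a
// cpuinfo_max_freq file, which the kernel reports in kHz, into MHz.
func parseMaxFreqMHz(contents string) (float64, error) {
	khz, err := strconv.ParseFloat(strings.TrimSpace(contents), 64)
	if err != nil {
		return 0, fmt.Errorf("parsing cpuinfo_max_freq: %w", err)
	}
	return khz / 1000, nil
}

func main() {
	const path = "/sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq"

	raw, err := os.ReadFile(path)
	if err != nil {
		// Expected on EC2, where this file is not exposed.
		fmt.Println("max frequency unavailable:", err)
		return
	}

	mhz, err := parseMaxFreqMHz(string(raw))
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("cpu0 max frequency: %.0f MHz\n", mhz)
}
```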
Since `gopsutil` cannot determine the max CPU speed, it defaults to
the current CPU speed, which could be basically any number between
0 and the true max. This is particularly bad on large, powerful
reserved instances which often idle at ~800 MHz while Nomad does
its fingerprinting (typically IO bound), which Nomad then uses as
the max, which results in severe loss of available resources.
Since the CPU specification is unavailable programmatically (at least
not without sudo) use a best-effort lookup table. This table was
generated by going through every instance type in AWS documentation
and copy-pasting the numbers.
https://aws.amazon.com/ec2/instance-types/
This approach obviously is not ideal as future instance types will
need to be added as they are introduced to AWS. However, using the
table should only be an improvement over the status quo since right
now Nomad miscalculates available CPU resources on all instance types.
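A minimal sketch of what such a lookup could look like, assuming ticks are computed as core count times clock speed in MHz. The type and function names are invented for this example, and the instance types and numbers in the table are placeholders, not real EC2 data.

```go
package main

import "fmt"

// cpuSpec holds the advertised core count and clock speed of an instance type.
type cpuSpec struct {
	Cores int
	MHz   int
}

// specs stands in for the lookup table. The instance type names and the
// numbers are hypothetical placeholders, not real EC2 data.
var specs = map[string]cpuSpec{
	"example.large":  {Cores: 2, MHz: 2500},
	"example.xlarge": {Cores: 4, MHz: 3000},
}

// lookupTicks returns cores * MHz for a known instance type. The boolean is
// false for unknown types so the caller can fall back to the detected value.
func lookupTicks(instanceType string) (int, bool) {
	spec, found := specs[instanceType]
	if !found {
		return 0, false
	}
	return spec.Cores * spec.MHz, true
}

func main() {
	for _, name := range []string{"example.large", "example.unknown"} {
		if ticks, ok := lookupTicks(name); ok {
			fmt.Printf("%s: %d ticks\n", name, ticks)
		} else {
			fmt.Printf("%s: not in table, keep fingerprinted value\n", name)
		}
	}
}
```

Returning a boolean rather than a zero value keeps the "unknown instance type" case explicit, which matters because new instance types will be missing from the table until it is updated.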
Fix a regression where we accidentally started treating non-AWS
environments as AWS environments, resulting in bad networking settings.
Two factors are at play:
First, in [1], we accidentally switched the ultimate AWS test from
checking `ami-id` to `instance-id`. This means that Nomad started
treating more environments as AWS; e.g. Hetzner implements `instance-id`
but not `ami-id`.
Second, some of these environments return empty values instead of
errors! Hetzner returns an empty 200 response for `local-ipv4`, resulting
in a bad networking configuration.
This change fixes the situation by restoring the check to `ami-id` and
ensuring that we only set network configuration when the IP address is
non-empty. Also, be more defensive around whitespace in responses.
[1] https://github.com/hashicorp/nomad/pull/6779
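The checks described above can be sketched as follows. This is not the actual fingerprinter code: `metadataGetter`, `isAWS`, `localIPv4` and the two fake providers are hypothetical names for this example, and requiring a non-empty `ami-id` is an extra assumption beyond what this change states.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// metadataGetter fetches one metadata key. It is a hypothetical stand-in for
// the real metadata client.
type metadataGetter func(key string) (string, error)

// isAWS treats the environment as AWS only when `ami-id` can be fetched.
// Requiring a non-empty value is an assumption made for this sketch.
func isAWS(get metadataGetter) bool {
	v, err := get("ami-id")
	return err == nil && strings.TrimSpace(v) != ""
}

// localIPv4 returns the trimmed address and whether it is usable. An empty
// or whitespace-only response is reported as unusable.
func localIPv4(get metadataGetter) (string, bool) {
	v, err := get("local-ipv4")
	if err != nil {
		return "", false
	}
	ip := strings.TrimSpace(v)
	return ip, ip != ""
}

// otherCloud imitates a non-AWS provider that serves `instance-id`, has no
// `ami-id`, and answers `local-ipv4` with an empty body.
func otherCloud(key string) (string, error) {
	switch key {
	case "instance-id":
		return "12345", nil
	case "local-ipv4":
		return "", nil
	default:
		return "", errors.New("404 not found")
	}
}

// awsLike imitates AWS responses that carry stray whitespace.
func awsLike(key string) (string, error) {
	switch key {
	case "ami-id":
		return "ami-00000000\n", nil
	case "local-ipv4":
		return " 10.0.0.5\n", nil
	default:
		return "", errors.New("404 not found")
	}
}

func main() {
	ip, ok := localIPv4(otherCloud)
	fmt.Printf("other: isAWS=%v ip=%q usable=%v\n", isAWS(otherCloud), ip, ok)

	ip, ok = localIPv4(awsLike)
	fmt.Printf("aws-like: isAWS=%v ip=%q usable=%v\n", isAWS(awsLike), ip, ok)
}
```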
Previously, Nomad used hand-rolled HTTP requests to interact with the
EC2 metadata API. Recently, however, we switched to using the AWS SDK for
this fingerprinting.
The default behaviour of the AWS SDK is to perform retries with
exponential backoff when a request fails. This is problematic for Nomad,
because interacting with the EC2 API is in our client start path.
Here we revert to our pre-existing behaviour of not performing retries
in the fast path: if the metadata service is unavailable, it's likely
that Nomad is not running in AWS.
Some code cleanup:
* Use a field for setting EC2 metadata instead of env-vars in testing;
but keep environment variables for backward compatibility reasons
* Update tests to use testify
This removes a cyclical dependency when importing client/structs from
dependencies of the plugin_loader, specifically drivers, because
client/config also depends on the plugin_loader.
It also better reflects the ownership of fingerprint structs, as they
are fairly internal to the fingerprint manager.
Certain environments use WARN for serious logging; however, it's very
possible to have machines without some of the fingerprinted keys
(public-ipv4 and public-hostname specifically). Setting log level to
INFO seems more consistent with this possibility.
This PR:
* Makes AWS network speeds more granular
* Makes `network_speed` an override and not a default
* Adds a default of 1000 MBits if no network link speed is detected.
Fixes #1985
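The resulting precedence can be sketched as below. The function and constant names are hypothetical, and treating a value of 0 as "not set" or "not detected" is an assumption made for this example.

```go
package main

import "fmt"

// defaultNetworkSpeedMbits is the fallback used when nothing else is known.
const defaultNetworkSpeedMbits = 1000

// resolveLinkSpeed (hypothetical helper) picks the link speed in MBits:
// an explicit override wins, then the detected speed, then the default.
// A value of 0 or less means "not set" or "not detected" in this sketch.
func resolveLinkSpeed(override, detected int) int {
	if override > 0 {
		return override
	}
	if detected > 0 {
		return detected
	}
	return defaultNetworkSpeedMbits
}

func main() {
	fmt.Println(resolveLinkSpeed(0, 0))       // nothing known: default
	fmt.Println(resolveLinkSpeed(0, 10000))   // detected speed is used
	fmt.Println(resolveLinkSpeed(500, 10000)) // override beats detection
}
```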
Fix URL. It was printing an error message on startup:
```
2015/11/13 15:49:21 [ERR] fingerprint.env_aws: Error querying AWS Metadata URL, skipping
```
By the way, is it safe to use `latest`? Is there a chance that Amazon decides to change the format of the metadata? It could be safer to use a pinned version such as `http://169.254.169.254/2014-11-05/meta-data`.
GCE and AWS both expose metadata servers, and GCE's 404 response
includes the URL in the content, which matches the regex. So,
check the response code as well and if a 4xx code comes back,
take that to mean it's not AWS.
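A rough sketch of that check follows. The pattern `ami-id` and the names `awsMarker` and `looksLikeAWS` are placeholders chosen for this example; the actual regex used by the fingerprinter is not shown here.

```go
package main

import (
	"fmt"
	"regexp"
)

// awsMarker is a placeholder pattern for this sketch, not the real regex.
var awsMarker = regexp.MustCompile(`ami-id`)

// looksLikeAWS rejects any 4xx response before looking at the body, so an
// error page that merely echoes the requested URL cannot match.
func looksLikeAWS(statusCode int, body string) bool {
	if statusCode >= 400 && statusCode < 500 {
		return false
	}
	return awsMarker.MatchString(body)
}

func main() {
	// A 404 page that echoes the requested path, as described for GCE.
	notFound := "The requested URL /latest/meta-data/ami-id was not found"
	fmt.Println(looksLikeAWS(404, notFound))

	// A successful response whose body matches the pattern.
	fmt.Println(looksLikeAWS(200, "ami-id\nhostname"))
}
```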