Commit Graph

61 Commits

Author SHA1 Message Date
Tim Gross 2d65dc418c
metrics: prevent negative counter from iowait decrease (#18849)
The iowait metric obtained from `/proc/stat` can under some circumstances
decrease. The relevant condition is when an interrupt arrives on a different
core than the one that gets woken up for the IO, and a particular counter in the
kernel for that core gets interrupted. This is documented in the man page for
the `proc(5)` pseudo-filesystem, and considered an unfortunate behavior that
can't be changed for the sake of ABI compatibility.

In Nomad, we get the current "busy" time (everything except for idle) and
compare it to the previous busy time to get the counter incremeent. If the
iowait counter decreases and the idle counter increases more than the increase
in the total busy time, we can get a negative total. This previously caused a
panic in our metrics collection (see #15861) but that is being prevented by
reporting an error message.

Fix the bug by putting a zero floor on the values we return from the host CPU
stats calculator.

Backport-of: #18835
2023-10-24 10:37:46 -04:00
hc-github-team-nomad-core e5fb6fe687
backport of commit 615e76ef3c23497f768ebd175f0c624d32aeece8 (#17993)
This pull request was automerged via backport-assistant
2023-07-19 13:31:14 -05:00
Patric Stout ebb363d43e
metrics: add "total_ticks_count" for CPU metrics (#17579)
This counter tells you the total amount of ticks for that CPU
entry since the start of Nomad.
2023-07-05 10:28:55 -04:00
hashicorp-copywrite[bot] 005636afa0 [COMPLIANCE] Add Copyright and License Headers 2023-04-10 15:36:59 +00:00
Seth Hoenig 87f4b71df0
client/fingerprint: correctly fingerprint E/P cores of Apple Silicon chips (#16672)
* client/fingerprint: correctly fingerprint E/P cores of Apple Silicon chips

This PR adds detection of asymetric core types (Power & Efficiency) (P/E)
when running on M1/M2 Apple Silicon CPUs. This functionality is provided
by shoenig/go-m1cpu which makes use of the Apple IOKit framework to read
undocumented registers containing CPU performance data. Currently working
on getting that functionality merged upstream into gopsutil, but gopsutil
would still not support detecting P vs E cores like this PR does.

Also refactors the CPUFingerprinter code to handle the mixed core
types, now setting power vs efficiency cpu attributes.

For now the scheduler is still unaware of mixed core types - on Apple
platforms tasks cannot reserve cores anyway so it doesn't matter, but
at least now the total CPU shares available will be correct.

Future work should include adding support for detecting P/E cores on
the latest and upcoming Intel chips, where computation of total cpu shares
is currently incorrect. For that, we should also include updating the
scheduler to be core-type aware, so that tasks of resources.cores on Linux
platforms can be assigned the correct number of CPU shares for the core
type(s) they have been assigned.

node attributes before

cpu.arch                  = arm64
cpu.modelname             = Apple M2 Pro
cpu.numcores              = 12
cpu.reservablecores       = 0
cpu.totalcompute          = 1000

node attributes after

cpu.arch                  = arm64
cpu.frequency.efficiency  = 2424
cpu.frequency.power       = 3504
cpu.modelname             = Apple M2 Pro
cpu.numcores.efficiency   = 4
cpu.numcores.power        = 8
cpu.reservablecores       = 0
cpu.totalcompute          = 37728

* fingerprint/cpu: follow up cr items
2023-03-28 08:27:58 -05:00
Seth Hoenig 2631659551 ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
Tim Gross f820021f9e deps: bump gopsutil to v3.21.2 2021-03-30 16:02:51 -04:00
Mahmood Ali d59f149597
Update gopsutil code
Latest gosutil includes two backward incompatible changes:

First, it removed unused Stolen field in
cae8efcffa (diff-d9747e2da342bdb995f6389533ad1a3d)
.

Second, it updated the Windows cpu stats calculation to be inline with
other platforms, where it returns absolate stats rather than
percentages.  See https://github.com/shirou/gopsutil/pull/611.
2020-03-15 09:37:05 +01:00
Yoan Blanc f85cbddaf1
gopsutils: v2.20.2
Signed-off-by: Yoan Blanc <yoan@dosimple.ch>
2020-03-15 09:36:59 +01:00
Danielle Lancashire 4f2343e1c0
client: Return empty values when host stats fail
Currently, there is an issue when running on Windows whereby under some
circumstances the Windows stats API's will begin to return errors (such
as internal timeouts) when a client is under high load, and potentially
other forms of resource contention / system states (and other unknown
cases).

When an error occurs during this collection, we then short circuit
further metrics emission from the client until the next interval.

This can be problematic if it happens for a sustained number of
intervals, as our metrics aggregator will begin to age out older
metrics, and we will eventually stop emitting various types of metrics
including `nomad.client.unallocated.*` metrics.

However, when metrics collection fails on Linux, gopsutil will in many cases
(e.g cpu.Times) silently return 0 values, rather than an error.

Here, we switch to returning empty metrics in these failures, and
logging the error at the source. This brings the behaviour into line
with Linux/Unix platforms, and although making aggregation a little
sadder on intermittent failures, will result in more desireable overall
behaviour of keeping metrics available for further investigation if
things look unusual.
2019-09-19 01:22:07 +02:00
Mahmood Ali 63acda956c Add Client Device Stats structs in `api` package 2018-11-14 14:41:19 -05:00
Mahmood Ali b74ccc742c Expose Device Stats in /client/stats API endpoint 2018-11-14 14:41:19 -05:00
Alex Dadgar 9baa7402ef fix test compiling 2018-10-16 16:56:55 -07:00
Michael Schurter 9d1ea3b228 client: hclog-ify most of the client
Leaving fingerprinters in case that interface changes with plugins.
2018-10-16 16:53:30 -07:00
Alex Dadgar 300b1a7a15 Tests only use testlog package logger 2018-06-13 15:40:56 -07:00
Josh Soref 82221f9a2b spelling: represents 2018-03-11 18:42:29 +00:00
Josh Soref 7ad77f568b spelling: purposes 2018-03-11 18:39:35 +00:00
Alex Dadgar 9bc75f0ad4 Fix manager tests and make testagent recover from port conflicts 2018-02-15 13:59:01 -08:00
Alex Dadgar 1472b943d6 Stats Endpoint 2018-02-15 13:59:00 -08:00
Michael Schurter 2a81160dcd Fix GC'd alloc tracking
The Client.allocs map now contains all AllocRunners again, not just
un-GC'd AllocRunners. Client.allocs is only pruned when the server GCs
allocs.

Also stops logging "marked for GC" twice.
2017-11-01 15:16:38 -05:00
Alex Dadgar 4173834231 Enable more linters 2017-09-26 15:26:33 -07:00
Alex Dadgar 07ed83fdd5 Non-locked accessors to common Node fields
This PR removes locking around commonly accessed node attributes that do
not need to be locked. The locking could cause nodes to TTL as the
heartbeat code path was acquiring a lock that could be held for an
excessively long time. An example of this is when Vault is inaccessible,
since the fingerprint is run with a lock held but the Vault
fingerprinter makes the API calls with a large timeout.

Fixes https://github.com/hashicorp/nomad/issues/2689
2017-09-14 14:08:26 -07:00
Alex Dadgar 3ec7946b3e Fix invalid CPU stats on Windows
This PR fixes an issue introduced in Nomad 0.6.0 due to
https://github.com/shirou/gopsutil/issues/420. The issue arised from the
fact that the Windows stats from gopsutil reports CPUs in
percentages where we expected ticks.
2017-09-10 15:30:48 -07:00
James Nugent 448145872f client: Guard against "NaN" values from floats
This commit protects against finding `0.NaN` tokens in JSON streams
because of infinity representation on serialization.
2017-09-08 16:21:07 -05:00
Michael Schurter 78823d559b Squelch logspam when unable to get disk usage stats
To reproduce logspam:

```
$ docker plugin install --grant-all-permissions vieux/sshfs
$ nomad agent -dev
...
2017/08/25 17:09:03.282868 [WARN] client: error fetching host disk usage stats for /var/lib/docker/plugins/a8b4a69b07e5180f828d19e1e9e102ccc0e26f9c9939eaef85357260c30b20a7/rootfs/mnt/volumes: permission denied
... repeats every collection period ...
```
2017-08-28 12:04:32 -07:00
Alex Dadgar bb329977a4 Fix nil dereference 2017-01-10 14:14:58 -08:00
Diptanu Choudhury 6e6e0d364a Added comments 2016-12-20 10:49:48 -08:00
Diptanu Choudhury 36b5545d6b Making the gc allocator understand real disk usage 2016-12-16 18:34:59 -08:00
Diptanu Choudhury e855cd587b Refactored hoststats collector 2016-12-14 15:07:42 -08:00
Christoffer Kylvåg 6a1f32b8ba #1680: Continue after not being able to stat a mountpoint 2016-12-13 12:28:57 +01:00
Kenjiro Nakayama fe13453012 Update after the review 2016-08-11 10:53:33 +09:00
Kenjiro Nakayama c3b871e90d Return error when client failed to collect host stats 2016-08-11 09:38:28 +09:00
Alex Dadgar 3cd9c9590b guard against NaN 2016-06-20 10:29:46 -07:00
Alex Dadgar c4a819528a Merge pull request #1260 from hashicorp/f-alloc-stats-struct
Allocation resources returned in a struct
2016-06-12 11:18:57 -07:00
Alex Dadgar fdda90229f only support latest and remove ring buffer 2016-06-12 09:32:38 -07:00
Diptanu Choudhury 34f85baab0 Fix the calculation of total ticks for docker and exec 2016-06-12 18:08:35 +02:00
Diptanu Choudhury a5e81ebc3a Removing un-used code 2016-06-12 01:23:49 +02:00
Diptanu Choudhury 86e4f295da Fixed the calculation of the host node ticks 2016-06-12 01:14:51 +02:00
Diptanu Choudhury 7fb507e810 Moving the clkspeed code to helper 2016-06-11 17:31:49 +02:00
Diptanu Choudhury 59540c3e93 Extracted a method for getting clock speed 2016-06-11 02:07:28 +02:00
Diptanu Choudhury 01054db4fa Calculating total ticks consumed in the nomad client 2016-06-10 23:14:33 +02:00
Diptanu Choudhury 2d3798b076 Calculating the cpu ticks in nomad client 2016-06-10 22:22:32 +02:00
Alex Dadgar 3cf74e7fd8 Alloc-status only shows measured statistics and fixes to CPU calculations 2016-06-10 10:38:29 -07:00
Diptanu Choudhury c0dc6cfbf2 Changing the api of the stats endpoints 2016-05-28 19:59:20 -07:00
Diptanu Choudhury 02ceb6c697 Initializing the ring buffer with no cells 2016-05-28 19:59:20 -07:00
Diptanu Choudhury 2b37f08d49 creating the host cpu percent calculator lazily 2016-05-28 19:59:20 -07:00
Diptanu Choudhury c2da19bf11 Refactored the api for NewHostStatsCollector 2016-05-28 19:59:20 -07:00
Diptanu Choudhury 98016ec066 Incorporated review comments for executor 2016-05-28 19:59:20 -07:00
Diptanu Choudhury 05c221186b Added disk usage to node status 2016-05-28 19:59:20 -07:00
Diptanu Choudhury cf247c1309 Added uptime to node stats 2016-05-28 19:59:20 -07:00