open-nomad

Commit Graph

Author	SHA1	Message	Date
Tim Gross	2d65dc418c	metrics: prevent negative counter from iowait decrease (#18849) The iowait metric obtained from `/proc/stat` can under some circumstances decrease. The relevant condition is when an interrupt arrives on a different core than the one that gets woken up for the IO, and a particular counter in the kernel for that core gets interrupted. This is documented in the man page for the `proc(5)` pseudo-filesystem, and considered an unfortunate behavior that can't be changed for the sake of ABI compatibility. In Nomad, we get the current "busy" time (everything except for idle) and compare it to the previous busy time to get the counter increment. If the iowait counter decreases and the idle counter increases more than the increase in the total busy time, we can get a negative total. This previously caused a panic in our metrics collection (see #15861) but that is being prevented by reporting an error message. Fix the bug by putting a zero floor on the values we return from the host CPU stats calculator. Backport-of: #18835	2023-10-24 10:37:46 -04:00
hc-github-team-nomad-core	e5fb6fe687	backport of commit 615e76ef3c23497f768ebd175f0c624d32aeece8 (#17993) This pull request was automerged via backport-assistant	2023-07-19 13:31:14 -05:00
Patric Stout	ebb363d43e	metrics: add "total_ticks_count" for CPU metrics (#17579) This counter tells you the total amount of ticks for that CPU entry since the start of Nomad.	2023-07-05 10:28:55 -04:00
hashicorp-copywrite[bot]	005636afa0	[COMPLIANCE] Add Copyright and License Headers	2023-04-10 15:36:59 +00:00
Seth Hoenig	87f4b71df0	client/fingerprint: correctly fingerprint E/P cores of Apple Silicon chips (#16672) * client/fingerprint: correctly fingerprint E/P cores of Apple Silicon chips This PR adds detection of asymmetric core types (Power & Efficiency) (P/E) when running on M1/M2 Apple Silicon CPUs. This functionality is provided by shoenig/go-m1cpu which makes use of the Apple IOKit framework to read undocumented registers containing CPU performance data. Currently working on getting that functionality merged upstream into gopsutil, but gopsutil would still not support detecting P vs E cores like this PR does. Also refactors the CPUFingerprinter code to handle the mixed core types, now setting power vs efficiency cpu attributes. For now the scheduler is still unaware of mixed core types - on Apple platforms tasks cannot reserve cores anyway so it doesn't matter, but at least now the total CPU shares available will be correct. Future work should include adding support for detecting P/E cores on the latest and upcoming Intel chips, where computation of total cpu shares is currently incorrect. For that, we should also include updating the scheduler to be core-type aware, so that tasks of resources.cores on Linux platforms can be assigned the correct number of CPU shares for the core type(s) they have been assigned. node attributes before: cpu.arch = arm64 cpu.modelname = Apple M2 Pro cpu.numcores = 12 cpu.reservablecores = 0 cpu.totalcompute = 1000 node attributes after: cpu.arch = arm64 cpu.frequency.efficiency = 2424 cpu.frequency.power = 3504 cpu.modelname = Apple M2 Pro cpu.numcores.efficiency = 4 cpu.numcores.power = 8 cpu.reservablecores = 0 cpu.totalcompute = 37728 * fingerprint/cpu: follow up cr items	2023-03-28 08:27:58 -05:00
Seth Hoenig	2631659551	ci: swap ci parallelization for unconstrained gomaxprocs	2022-03-15 12:58:52 -05:00
Tim Gross	f820021f9e	deps: bump gopsutil to v3.21.2	2021-03-30 16:02:51 -04:00
Mahmood Ali	d59f149597	Update gopsutil code Latest gopsutil includes two backward incompatible changes: First, it removed unused Stolen field in `cae8efcffa (diff-d9747e2da342bdb995f6389533ad1a3d)`. Second, it updated the Windows cpu stats calculation to be in line with other platforms, where it returns absolute stats rather than percentages. See https://github.com/shirou/gopsutil/pull/611.	2020-03-15 09:37:05 +01:00
Yoan Blanc	f85cbddaf1	gopsutils: v2.20.2 Signed-off-by: Yoan Blanc <yoan@dosimple.ch>	2020-03-15 09:36:59 +01:00
Danielle Lancashire	4f2343e1c0	client: Return empty values when host stats fail Currently, there is an issue when running on Windows whereby under some circumstances the Windows stats APIs will begin to return errors (such as internal timeouts) when a client is under high load, and potentially other forms of resource contention / system states (and other unknown cases). When an error occurs during this collection, we then short circuit further metrics emission from the client until the next interval. This can be problematic if it happens for a sustained number of intervals, as our metrics aggregator will begin to age out older metrics, and we will eventually stop emitting various types of metrics including `nomad.client.unallocated.*` metrics. However, when metrics collection fails on Linux, gopsutil will in many cases (e.g. cpu.Times) silently return 0 values, rather than an error. Here, we switch to returning empty metrics in these failures, and logging the error at the source. This brings the behaviour into line with Linux/Unix platforms, and although making aggregation a little sadder on intermittent failures, will result in more desirable overall behaviour of keeping metrics available for further investigation if things look unusual.	2019-09-19 01:22:07 +02:00
Mahmood Ali	63acda956c	Add Client Device Stats structs in `api` package	2018-11-14 14:41:19 -05:00
Mahmood Ali	b74ccc742c	Expose Device Stats in /client/stats API endpoint	2018-11-14 14:41:19 -05:00
Alex Dadgar	9baa7402ef	fix test compiling	2018-10-16 16:56:55 -07:00
Michael Schurter	9d1ea3b228	client: hclog-ify most of the client Leaving fingerprinters in case that interface changes with plugins.	2018-10-16 16:53:30 -07:00
Alex Dadgar	300b1a7a15	Tests only use testlog package logger	2018-06-13 15:40:56 -07:00
Josh Soref	82221f9a2b	spelling: represents	2018-03-11 18:42:29 +00:00
Josh Soref	7ad77f568b	spelling: purposes	2018-03-11 18:39:35 +00:00
Alex Dadgar	9bc75f0ad4	Fix manager tests and make testagent recover from port conflicts	2018-02-15 13:59:01 -08:00
Alex Dadgar	1472b943d6	Stats Endpoint	2018-02-15 13:59:00 -08:00
Michael Schurter	2a81160dcd	Fix GC'd alloc tracking The Client.allocs map now contains all AllocRunners again, not just un-GC'd AllocRunners. Client.allocs is only pruned when the server GCs allocs. Also stops logging "marked for GC" twice.	2017-11-01 15:16:38 -05:00
Alex Dadgar	4173834231	Enable more linters	2017-09-26 15:26:33 -07:00
Alex Dadgar	07ed83fdd5	Non-locked accessors to common Node fields This PR removes locking around commonly accessed node attributes that do not need to be locked. The locking could cause nodes to TTL as the heartbeat code path was acquiring a lock that could be held for an excessively long time. An example of this is when Vault is inaccessible, since the fingerprint is run with a lock held but the Vault fingerprinter makes the API calls with a large timeout. Fixes https://github.com/hashicorp/nomad/issues/2689	2017-09-14 14:08:26 -07:00
Alex Dadgar	3ec7946b3e	Fix invalid CPU stats on Windows This PR fixes an issue introduced in Nomad 0.6.0 due to https://github.com/shirou/gopsutil/issues/420. The issue arose from the fact that the Windows stats from gopsutil reports CPUs in percentages where we expected ticks.	2017-09-10 15:30:48 -07:00
James Nugent	448145872f	client: Guard against "NaN" values from floats This commit protects against finding `0.NaN` tokens in JSON streams because of infinity representation on serialization.	2017-09-08 16:21:07 -05:00
Michael Schurter	78823d559b	Squelch logspam when unable to get disk usage stats To reproduce logspam: ``` $ docker plugin install --grant-all-permissions vieux/sshfs $ nomad agent -dev ... 2017/08/25 17:09:03.282868 [WARN] client: error fetching host disk usage stats for /var/lib/docker/plugins/a8b4a69b07e5180f828d19e1e9e102ccc0e26f9c9939eaef85357260c30b20a7/rootfs/mnt/volumes: permission denied ... repeats every collection period ... ```	2017-08-28 12:04:32 -07:00
Alex Dadgar	bb329977a4	Fix nil dereference	2017-01-10 14:14:58 -08:00
Diptanu Choudhury	6e6e0d364a	Added comments	2016-12-20 10:49:48 -08:00
Diptanu Choudhury	36b5545d6b	Making the gc allocator understand real disk usage	2016-12-16 18:34:59 -08:00
Diptanu Choudhury	e855cd587b	Refactored hoststats collector	2016-12-14 15:07:42 -08:00
Christoffer Kylvåg	6a1f32b8ba	#1680: Continue after not being able to stat a mountpoint	2016-12-13 12:28:57 +01:00
Kenjiro Nakayama	fe13453012	Update after the review	2016-08-11 10:53:33 +09:00
Kenjiro Nakayama	c3b871e90d	Return error when client failed to collect host stats	2016-08-11 09:38:28 +09:00
Alex Dadgar	3cd9c9590b	guard against NaN	2016-06-20 10:29:46 -07:00
Alex Dadgar	c4a819528a	Merge pull request #1260 from hashicorp/f-alloc-stats-struct Allocation resources returned in a struct	2016-06-12 11:18:57 -07:00
Alex Dadgar	fdda90229f	only support latest and remove ring buffer	2016-06-12 09:32:38 -07:00
Diptanu Choudhury	34f85baab0	Fix the calculation of total ticks for docker and exec	2016-06-12 18:08:35 +02:00
Diptanu Choudhury	a5e81ebc3a	Removing un-used code	2016-06-12 01:23:49 +02:00
Diptanu Choudhury	86e4f295da	Fixed the calculation of the host node ticks	2016-06-12 01:14:51 +02:00
Diptanu Choudhury	7fb507e810	Moving the clkspeed code to helper	2016-06-11 17:31:49 +02:00
Diptanu Choudhury	59540c3e93	Extracted a method for getting clock speed	2016-06-11 02:07:28 +02:00
Diptanu Choudhury	01054db4fa	Calculating total ticks consumed in the nomad client	2016-06-10 23:14:33 +02:00
Diptanu Choudhury	2d3798b076	Calculating the cpu ticks in nomad client	2016-06-10 22:22:32 +02:00
Alex Dadgar	3cf74e7fd8	Alloc-status only shows measured statistics and fixes to CPU calculations	2016-06-10 10:38:29 -07:00
Diptanu Choudhury	c0dc6cfbf2	Changing the api of the stats endpoints	2016-05-28 19:59:20 -07:00
Diptanu Choudhury	02ceb6c697	Initializing the ring buffer with no cells	2016-05-28 19:59:20 -07:00
Diptanu Choudhury	2b37f08d49	creating the host cpu percent calculator lazily	2016-05-28 19:59:20 -07:00
Diptanu Choudhury	c2da19bf11	Refactored the api for NewHostStatsCollector	2016-05-28 19:59:20 -07:00
Diptanu Choudhury	98016ec066	Incorporated review comments for executor	2016-05-28 19:59:20 -07:00
Diptanu Choudhury	05c221186b	Added disk usage to node status	2016-05-28 19:59:20 -07:00
Diptanu Choudhury	cf247c1309	Added uptime to node stats	2016-05-28 19:59:20 -07:00
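
The zero floor described in commit 2d65dc418c can be sketched as follows. This is a minimal illustration, not Nomad's actual code: the `cpuTimes` type, its fields, and the `busyDelta` helper are hypothetical names chosen for this example, and the counters are simplified to four fields. It shows how a decrease in iowait can make the busy-time difference negative between two samples, and how clamping at zero avoids reporting a negative counter increment.

```go
package main

import "fmt"

// cpuTimes is a hypothetical, simplified stand-in for the per-CPU
// counters read from /proc/stat. It is not a Nomad or gopsutil type.
type cpuTimes struct {
	User, System, Iowait, Idle float64
}

// busy returns everything except idle, matching the commit's
// description of "busy" time.
func (t cpuTimes) busy() float64 {
	return t.User + t.System + t.Iowait
}

// busyDelta returns the increase in busy time between two samples.
// Because iowait may decrease, the raw difference can be negative,
// so the result is floored at zero.
func busyDelta(prev, cur cpuTimes) float64 {
	delta := cur.busy() - prev.busy()
	if delta < 0 {
		return 0
	}
	return delta
}

func main() {
	prev := cpuTimes{User: 100, System: 50, Iowait: 30, Idle: 500}
	// iowait drops by 10 while user time grows by only 1,
	// so the raw busy delta would be -9.
	cur := cpuTimes{User: 101, System: 50, Iowait: 20, Idle: 520}
	fmt.Println(busyDelta(prev, cur)) // prints 0
}
```

Clamping discards the negative sample instead of propagating it, which matches the commit's stated goal of never returning a negative value from the host CPU stats calculator.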

61 Commits
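
The "node attributes after" values listed in commit 87f4b71df0 are consistent with total compute being the sum of core count times frequency for each core type: 4 × 2424 + 8 × 3504 = 37728. The sketch below reproduces that arithmetic. It is inferred from the numbers in the commit message and is not taken from Nomad's fingerprinter; `coreGroup` and `totalCompute` are hypothetical names used only for illustration.

```go
package main

import "fmt"

// coreGroup is a hypothetical type describing one class of cores:
// how many there are and their frequency in MHz.
type coreGroup struct {
	Count int
	MHz   int
}

// totalCompute sums Count * MHz over all core groups. This mirrors
// the arithmetic implied by the commit's attribute values; it is an
// assumption about the calculation, not Nomad's implementation.
func totalCompute(groups ...coreGroup) int {
	total := 0
	for _, g := range groups {
		total += g.Count * g.MHz
	}
	return total
}

func main() {
	efficiency := coreGroup{Count: 4, MHz: 2424} // cpu.numcores.efficiency, cpu.frequency.efficiency
	power := coreGroup{Count: 8, MHz: 3504}      // cpu.numcores.power, cpu.frequency.power
	fmt.Println(totalCompute(efficiency, power)) // prints 37728
}
```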