open-nomad

Author	SHA1	Message	Date
Michael Schurter	a54511b304	Merge pull request #5731 from hashicorp/b-ignore-dc client: drop unused DC field from servers list	2019-05-22 08:42:15 -07:00
Mahmood Ali	84419f08ce	client: synchronize client.invalidAllocs access invalidAllocs may be accessed and manipulated from different goroutines, so must be locked.	2019-05-22 09:37:49 -04:00
Danielle Lancashire	27583ed8c1	client: Pass servers contacted ch to allocrunner This fixes an issue where batch and service workloads would never be restarted due to indefinitely blocking on a nil channel. It also raises the restoration logging message to `Info` to simplify log analysis.	2019-05-22 13:47:35 +02:00
Mahmood Ali	b06e585713	Merge pull request #5739 from hashicorp/r-rm-logmon-syslog-deadcode logmon: remove syslog server deadcode	2019-05-21 11:46:48 -04:00
Mahmood Ali	eca23bf9c4	Merge pull request #5742 from hashicorp/b-test-fixes-20190520 Grab bag of (primarily race) test fixes	2019-05-21 11:46:36 -04:00
Mahmood Ali	e88bb61488	Merge pull request #5740 from hashicorp/b-nomad-exec-term-race exec: allow drivers to handle stream termination	2019-05-21 11:24:12 -04:00
Mahmood Ali	b475ccbe3e	client: synchronize access to ar.alloc `allocRunner.alloc` is protected by `allocRunner.allocLock`, so let's use `allocRunner.Alloc()` helper function to access it.	2019-05-21 09:55:05 -04:00
Mahmood Ali	2a7b073167	tests: fix fifo lib race Accidentally accessed outer `err` variable inside a goroutine	2019-05-21 09:49:56 -04:00
Mahmood Ali	296bd41c9e	tests: fix data race in client TestDriverManager_Fingerprint_Periodic	2019-05-21 09:49:56 -04:00
Mahmood Ali	d9e59eece0	tests: fix client TestFS_Stream data race Close is invoked in a different goroutine from test	2019-05-21 09:49:56 -04:00
Mahmood Ali	75e0a3f405	exec: allow drivers to handle stream termination Without this change, alloc_endpoint cancels the context passed to the handler when we detect EOF. This races with the driver in setting the exit code, and we run into a case where the exec process terminates cleanly yet we attempt to mark it as failed with a context error. Here, we rely on the driver to handle errors returned from Stream, without racing to set an error.	2019-05-21 09:40:25 -04:00
Mahmood Ali	974bcbecc9	logmon: remove syslog server deadcode Remove unused syslog server related code that got replaced by the docker logger in Nomad 0.9	2019-05-21 09:36:43 -04:00
Michael Schurter	d41abda957	client: drop unused DC field from servers list See #5730 for details.	2019-05-20 14:19:15 -07:00
Michael Schurter	2fe0768f3b	docs: changelog entry for #5669 and fix comment	2019-05-14 10:54:00 -07:00
Michael Schurter	af9096c8ba	client: register before restoring Registration and restoring allocs don't share state or depend on each other in any way (syncing allocs with servers is done outside of registration). Since restoring is synchronous, start the registration goroutine first. For nodes with lots of allocs to restore or close to their heartbeat deadline, this could be the difference between becoming "lost" or not.	2019-05-14 10:53:27 -07:00
Michael Schurter	e07f73bfe0	client: do not restart dead tasks until server is contacted (try 2) Refactoring of 104067bc2b2002a4e45ae7b667a476b89addc162 Switch the MarkLive method for a chan that is closed by the client. Thanks to @notnoop for the idea! The old approach called a method on most existing ARs and TRs on every runAllocs call. The new approach does a once.Do call in runAllocs to accomplish the same thing with less work. Able to remove the gate abstraction that did much more than was needed.	2019-05-14 10:53:27 -07:00
Michael Schurter	d7e5ace1ed	client: do not restart dead tasks until server is contacted Fixes #1795 Running restored allocations and pulling what allocations to run from the server happen concurrently. This means that if a client is rebooted, and has its allocations rescheduled, it may restart the dead allocations before it contacts the server and determines they should be dead. This commit makes tasks that fail to reattach on restore wait until the server is contacted before restarting.	2019-05-14 10:53:27 -07:00
Michael Schurter	3b1f8991a1	client: log when server list changes Stop logging in the happy path when nothing has changed.	2019-05-13 15:42:55 -07:00
Michael Schurter	48db8135da	Merge pull request #5492 from hashicorp/f-allocated-mem client: expose allocated memory per task	2019-05-13 13:31:22 -07:00
Lang Martin	1d03a43ce2	Merge pull request #5642 from hashicorp/b-network-fingerprinting-ipv4 network fingerprinting multiple IPs on the configured network device	2019-05-13 11:46:53 -04:00
Michael Schurter	1c4e585fa7	client: expose allocated memory per task Related to #4280 This PR adds `client.allocs.<job>.<group>.<alloc>.<task>.memory.allocated` as a gauge in bytes to metrics to ease calculating how close a task is to OOMing. ``` 'nomad.client.allocs.memory.allocated.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 268435456.000 'nomad.client.allocs.memory.cache.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 5677056.000 'nomad.client.allocs.memory.kernel_max_usage.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 0.000 'nomad.client.allocs.memory.kernel_usage.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 0.000 'nomad.client.allocs.memory.max_usage.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 8908800.000 'nomad.client.allocs.memory.rss.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 876544.000 'nomad.client.allocs.memory.swap.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 0.000 'nomad.client.allocs.memory.usage.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 8208384.000 ```	2019-05-10 11:12:12 -07:00
Lang Martin	f6bc45dd23	client improve a comment in updateNetworks	2019-05-10 11:25:04 -04:00
Mahmood Ali	919827f2df	Merge pull request #5632 from hashicorp/f-nomad-exec-parts-01-base nomad exec part 1: plumbing and docker driver	2019-05-09 18:09:27 -04:00
Mahmood Ali	ab2cae0625	implement client endpoint of nomad exec Add a client streaming RPC endpoint for processing nomad exec tasks, by invoking the relevant task handler for execution.	2019-05-09 16:49:08 -04:00
Preetha	1d02886bb6	Merge pull request #5654 from hashicorp/b-hearbeat-lockfix Remove unnecessary locking and serverlist syncing in heartbeats	2019-05-08 13:36:39 -05:00
Preetha Appan	3289e7f4a0	fix typo and add one more test scenario	2019-05-08 10:54:22 -05:00
Preetha Appan	db6b291a5a	code review feedback	2019-05-07 16:23:32 -05:00
Chris Baker	93ec1293be	stale allocation data leads to incorrect (and even negative) metrics (#5637) * client: was not using up-to-date client state in determining which allocs count towards allocated resources * Update client/client.go Co-Authored-By: cgbaker <cgbaker@hashicorp.com>	2019-05-07 15:54:36 -04:00
Preetha Appan	b063fc81a4	Remove unnecessary locking and serverlist syncing in heartbeats This removes an unnecessary shared lock between discovery and heartbeating which was causing heartbeats to be missed upon retries when a single server fails. Also made a drive by fix to call the periodic server shuffler goroutine.	2019-05-06 14:44:55 -05:00
Michael Schurter	8c7b3ff45a	Fix comment Co-Authored-By: preetapan <preetha@hashicorp.com>	2019-05-03 10:01:30 -05:00
Michael Schurter	e19fa33f9c	Remove unnecessary boolean clause Co-Authored-By: preetapan <preetha@hashicorp.com>	2019-05-03 10:00:17 -05:00
Preetha Appan	b99a204582	Update deployment health on failed allocations only if health is unset This fixes a confusing UX where a previously successful deployment's healthy/unhealthy count would get updated if any allocations failed after the deployment was already marked as successful.	2019-05-02 22:59:56 -05:00
Lang Martin	c32cce51f0	client fingerprinting can keep multi ips on a device	2019-05-02 18:11:28 -04:00
Lang Martin	94f23016a2	client_test new test fingerprinting can keep multi ips on a device	2019-05-02 18:11:28 -04:00
Mahmood Ali	7a32d3f3aa	client: handle 0.8 server network resources Fixes https://github.com/hashicorp/nomad/issues/5587 When a nomad 0.9 client is handling an alloc generated by a nomad 0.8 server, we should check the alloc.TaskResources for networking details rather than task.Resources. We check alloc.TaskResources for networking for other tasks in the task group [1], so it's a bit odd that we used the task.Resources struct here. TaskRunner also uses `alloc.TaskResources`[2]. The task.Resources struct in 0.8 was sparsely populated, resulting in 0 being stored in port mapping env vars: ``` vagrant@nomad-server-01:~$ nomad version Nomad v0.8.7 (21a2d93eecf018ad2209a5eab6aae6c359267933+CHANGES) vagrant@nomad-server-01:~$ nomad server members Name Address Port Status Leader Protocol Build Datacenter Region nomad-server-01.global 10.199.0.11 4648 alive true 2 0.8.7 dc1 global vagrant@nomad-server-01:~$ nomad alloc status -json 5b34649b | jq '.Job.TaskGroups[0].Tasks[0].Resources.Networks' [ { "CIDR": "", "Device": "", "DynamicPorts": [ { "Label": "db", "Value": 0 } ], "IP": "", "MBits": 10, "ReservedPorts": null } ] vagrant@nomad-server-01:~$ nomad alloc status -json 5b34649b | jq '.TaskResources' { "redis": { "CPU": 500, "DiskMB": 0, "IOPS": 0, "MemoryMB": 256, "Networks": [ { "CIDR": "", "Device": "eth1", "DynamicPorts": [ { "Label": "db", "Value": 21722 } ], "IP": "10.199.0.21", "MBits": 10, "ReservedPorts": null } ] } } ``` Also, updated the test values to mimic how Nomad 0.8 structs are represented, and made its result match the non compact values in `TestEnvironment_AsList`. [1] `24e9040b18/client/taskenv/env.go (L624-L639)` [2] https://github.com/hashicorp/nomad/blob/master/client/allocrunner/taskrunner/task_runner.go#L287-L303	2019-05-02 12:08:38 -04:00
Mahmood Ali	446f06721d	aux: helper method that returns token as well as ACL policy This helper returns the token as well as the ACL policy, to be used in a later commit for logging the token info associated with nomad exec invocation.	2019-04-30 10:23:56 -04:00
Lang Martin	371014b781	Merge pull request #5553 from hashicorp/b-fingerprinter-manual-config client fingerprinter doesn't overwrite manual configuration	2019-04-26 12:55:34 -04:00
Danielle	79515496cb	Merge pull request #5515 from hashicorp/dani/f-alloc-signal allocs: Add nomad alloc signal command	2019-04-26 14:21:05 +02:00
Danielle Lancashire	a8880f9643	alloc_signal: Add autocompletion and cmd tests	2019-04-26 12:47:53 +02:00
Mahmood Ali	bf0a09e270	retry grpc unavailable errors even if not shutting down	2019-04-25 18:39:17 -04:00
Mahmood Ali	81841e8528	try checking process status	2019-04-25 18:16:13 -04:00
Mahmood Ali	fc78521f29	add logging about attempts	2019-04-25 18:09:36 -04:00
Mahmood Ali	e6ca8641a8	try sleeping for stop signal to take effect	2019-04-25 17:16:29 -04:00
Mahmood Ali	ff3a095015	add a test that simulates logmon dying during Start() call	2019-04-25 16:41:17 -04:00
Mahmood Ali	bbac73883c	logmon: retry starting logmon if it exits Retry if we detect shutting down during Start() api call is started, locally.	2019-04-25 15:10:16 -04:00
Mahmood Ali	b51f00a7f3	logmon client to handle grpc closing errors	2019-04-25 14:32:24 -04:00
Danielle Lancashire	3409e0be89	allocs: Add nomad alloc signal command This command will be used to send a signal to either a single task within an allocation, or all of the tasks if <task-name> is omitted. If the sent signal terminates the allocation, it will be treated as if the allocation has crashed, rather than as if it was operator-terminated. Signal validation is currently handled by the driver itself and nomad does not attempt to restrict or validate them.	2019-04-25 12:43:32 +02:00
Chris Baker	91c4e1eabb	Merge pull request #5541 from hashicorp/b/5540-bad-client-alloc-metrics client/metrics: fixed stale metrics	2019-04-22 15:07:30 -04:00
Mahmood Ali	f515b93b5e	Merge pull request #5577 from hashicorp/dani/b-logmon-unrecoverable logging: Attempt to recover logmon failures	2019-04-22 14:40:24 -04:00
Michael Schurter	61f17a1043	tweak logging level for failed log line Co-Authored-By: notnoop <mahmood@notnoop.com>	2019-04-22 14:40:17 -04:00

3741 commits