Commit graph

3741 commits

Author SHA1 Message Date
Michael Schurter a54511b304
Merge pull request #5731 from hashicorp/b-ignore-dc
client: drop unused DC field from servers list
2019-05-22 08:42:15 -07:00
Mahmood Ali 84419f08ce client: synchronize client.invalidAllocs access
invalidAllocs may be accessed and manipulated from different goroutines,
so must be locked.
2019-05-22 09:37:49 -04:00
Danielle Lancashire 27583ed8c1 client: Pass servers contacted ch to allocrunner
This fixes an issue where batch and service workloads would never be
restarted due to indefinitely blocking on a nil channel.

It also raises the restoration logging message to `Info` to simplify log
analysis.
2019-05-22 13:47:35 +02:00
Mahmood Ali b06e585713
Merge pull request #5739 from hashicorp/r-rm-logmon-syslog-deadcode
logmon: remove syslog server deadcode
2019-05-21 11:46:48 -04:00
Mahmood Ali eca23bf9c4
Merge pull request #5742 from hashicorp/b-test-fixes-20190520
Grab bag of (primarily race) test fixes
2019-05-21 11:46:36 -04:00
Mahmood Ali e88bb61488
Merge pull request #5740 from hashicorp/b-nomad-exec-term-race
exec: allow drivers to handle stream termination
2019-05-21 11:24:12 -04:00
Mahmood Ali b475ccbe3e client: synchronize access to ar.alloc
`allocRunner.alloc` is protected by `allocRunner.allocLock`, so let's
use `allocRunner.Alloc()` helper function to access it.
2019-05-21 09:55:05 -04:00
Mahmood Ali 2a7b073167 tests: fix fifo lib race
Accidentally accessed outer `err` variable inside a goroutine
2019-05-21 09:49:56 -04:00
Mahmood Ali 296bd41c9e tests: fix data race in client TestDriverManager_Fingerprint_Periodic 2019-05-21 09:49:56 -04:00
Mahmood Ali d9e59eece0 tests: fix client TestFS_Stream data race
Close is invoked in a different goroutine from test
2019-05-21 09:49:56 -04:00
Mahmood Ali 75e0a3f405 exec: allow drivers to handle stream termination
Without this change, alloc_endpoint cancel the context passed to handler
when we detect EOF.  This races driver in setting exit code; and we run
into a case where the exec process terminates cleanly yet we attempt to
mark it as failed with context error.

Here, we rely on the driver to handle errors returned from Stream and
without racing to set an error.
2019-05-21 09:40:25 -04:00
Mahmood Ali 974bcbecc9 logmon: remove syslog server deadcode
Remove unused syslog server related code that got replaced by the docker
logger in Nomad 0.9
2019-05-21 09:36:43 -04:00
Michael Schurter d41abda957 client: drop unused DC field from servers list
See #5730 for details.
2019-05-20 14:19:15 -07:00
Michael Schurter 2fe0768f3b docs: changelog entry for #5669 and fix comment 2019-05-14 10:54:00 -07:00
Michael Schurter af9096c8ba client: register before restoring
Registration and restoring allocs don't share state or depend on each
other in any way (syncing allocs with servers is done outside of
registration).

Since restoring is synchronous, start the registration goroutine first.

For nodes with lots of allocs to restore or close to their heartbeat
deadline, this could be the difference between becoming "lost" or not.
2019-05-14 10:53:27 -07:00
Michael Schurter e07f73bfe0 client: do not restart dead tasks until server is contacted (try 2)
Refactoring of 104067bc2b2002a4e45ae7b667a476b89addc162

Switch the MarkLive method for a chan that is closed by the client.
Thanks to @notnoop for the idea!

The old approach called a method on most existing ARs and TRs on every
runAllocs call. The new approach does a once.Do call in runAllocs to
accomplish the same thing with less work. Able to remove the gate
abstraction that did much more than was needed.
2019-05-14 10:53:27 -07:00
Michael Schurter d7e5ace1ed client: do not restart dead tasks until server is contacted
Fixes #1795

Running restored allocations and pulling what allocations to run from
the server happen concurrently. This means that if a client is rebooted,
and has its allocations rescheduled, it may restart the dead allocations
before it contacts the server and determines they should be dead.

This commit makes tasks that fail to reattach on restore wait until the
server is contacted before restarting.
2019-05-14 10:53:27 -07:00
Michael Schurter 3b1f8991a1 client: log when server list changes
Stop logging in the happy path when nothing has changed.
2019-05-13 15:42:55 -07:00
Michael Schurter 48db8135da
Merge pull request #5492 from hashicorp/f-allocated-mem
client: expose allocated memory per task
2019-05-13 13:31:22 -07:00
Lang Martin 1d03a43ce2
Merge pull request #5642 from hashicorp/b-network-fingerprinting-ipv4
network fingerprinting multiple IPs on the configured network device
2019-05-13 11:46:53 -04:00
Michael Schurter 1c4e585fa7 client: expose allocated memory per task
Related to #4280

This PR adds
`client.allocs.<job>.<group>.<alloc>.<task>.memory.allocated` as a gauge
in bytes to metrics to ease calculating how close a task is to OOMing.

```
'nomad.client.allocs.memory.allocated.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 268435456.000
'nomad.client.allocs.memory.cache.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 5677056.000
'nomad.client.allocs.memory.kernel_max_usage.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 0.000
'nomad.client.allocs.memory.kernel_usage.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 0.000
'nomad.client.allocs.memory.max_usage.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 8908800.000
'nomad.client.allocs.memory.rss.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 876544.000
'nomad.client.allocs.memory.swap.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 0.000
'nomad.client.allocs.memory.usage.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 8208384.000
```
2019-05-10 11:12:12 -07:00
Lang Martin f6bc45dd23 client improve a comment in updateNetworks 2019-05-10 11:25:04 -04:00
Mahmood Ali 919827f2df
Merge pull request #5632 from hashicorp/f-nomad-exec-parts-01-base
nomad exec part 1: plumbing and docker driver
2019-05-09 18:09:27 -04:00
Mahmood Ali ab2cae0625 implement client endpoint of nomad exec
Add a client streaming RPC endpoint for processing nomad exec tasks, by invoking
the relevant task handler for execution.
2019-05-09 16:49:08 -04:00
Preetha 1d02886bb6
Merge pull request #5654 from hashicorp/b-hearbeat-lockfix
Remove unnecessary locking and serverlist syncing in heartbeats
2019-05-08 13:36:39 -05:00
Preetha Appan 3289e7f4a0
fix typo and add one more test scenario 2019-05-08 10:54:22 -05:00
Preetha Appan db6b291a5a
code review feedback 2019-05-07 16:23:32 -05:00
Chris Baker 93ec1293be
stale allocation data leads to incorrect (and even negative) metrics (#5637)
* client: was not using up-to-date client state in determining which alloc count towards allocated resources

* Update client/client.go

Co-Authored-By: cgbaker <cgbaker@hashicorp.com>
2019-05-07 15:54:36 -04:00
Preetha Appan b063fc81a4
Remove unnecessary locking and serverlist syncing in heartbeats
This removes an unnecessary shared lock between discovery and heartbeating
which was causing heartbeats to be missed upon retries when a single server
fails. Also made a drive by fix to call the periodic server shuffler goroutine.
2019-05-06 14:44:55 -05:00
Michael Schurter 8c7b3ff45a
Fix comment
Co-Authored-By: preetapan <preetha@hashicorp.com>
2019-05-03 10:01:30 -05:00
Michael Schurter e19fa33f9c
Remove unnecessary boolean clause
Co-Authored-By: preetapan <preetha@hashicorp.com>
2019-05-03 10:00:17 -05:00
Preetha Appan b99a204582
Update deployment health on failed allocations only if health is unset
This fixes a confusing UX where a previously successful deployment's
healthy/unhealthy count would get updated if any allocations failed after
the deployment was already marked as successful.
2019-05-02 22:59:56 -05:00
Lang Martin c32cce51f0 client fingerprinting can keep multi ips on a device 2019-05-02 18:11:28 -04:00
Lang Martin 94f23016a2 client_test new test fingerprinting can keep multi ips on a device 2019-05-02 18:11:28 -04:00
Mahmood Ali 7a32d3f3aa client: handle 0.8 server network resources
Fixes https://github.com/hashicorp/nomad/issues/5587

When a nomad 0.9 client is handling an alloc generated by a nomad 0.8
server, we should check the alloc.TaskResources for networking details
rather than task.Resources.

We check alloc.TaskResources for networking for other tasks in the task
group [1], so it's a bit odd that we used the task.Resources struct
here.  TaskRunner also uses `alloc.TaskResources`[2].

The task.Resources struct in 0.8 was sparsly populated, resulting to
storing of 0 in port mapping env vars:

```
vagrant@nomad-server-01:~$ nomad version
Nomad v0.8.7 (21a2d93eecf018ad2209a5eab6aae6c359267933+CHANGES)
vagrant@nomad-server-01:~$ nomad server members
Name                    Address      Port  Status  Leader  Protocol  Build  Datacenter  Region
nomad-server-01.global  10.199.0.11  4648  alive   true    2         0.8.7  dc1         global
vagrant@nomad-server-01:~$ nomad alloc status -json 5b34649b | jq '.Job.TaskGroups[0].Tasks[0].Resources.Networks'
[
  {
    "CIDR": "",
    "Device": "",
    "DynamicPorts": [
      {
        "Label": "db",
        "Value": 0
      }
    ],
    "IP": "",
    "MBits": 10,
    "ReservedPorts": null
  }
]
vagrant@nomad-server-01:~$ nomad alloc status -json 5b34649b | jq '.TaskResources'
{
  "redis": {
    "CPU": 500,
    "DiskMB": 0,
    "IOPS": 0,
    "MemoryMB": 256,
    "Networks": [
      {
        "CIDR": "",
        "Device": "eth1",
        "DynamicPorts": [
          {
            "Label": "db",
            "Value": 21722
          }
        ],
        "IP": "10.199.0.21",
        "MBits": 10,
        "ReservedPorts": null
      }
    ]
  }
}
```

Also, updated the test values to mimic how Nomad 0.8 structs are
represented, and made its result match the non compact values in
`TestEnvironment_AsList`.

[1] 24e9040b18/client/taskenv/env.go (L624-L639)
[2] https://github.com/hashicorp/nomad/blob/master/client/allocrunner/taskrunner/task_runner.go#L287-L303
2019-05-02 12:08:38 -04:00
Mahmood Ali 446f06721d aux: helper method that returns token as well as ACL policy
This helper returns the token as well as the ACL policy, to be used in a later
commit for logging the token info associated with nomad exec invocation.
2019-04-30 10:23:56 -04:00
Lang Martin 371014b781
Merge pull request #5553 from hashicorp/b-fingerprinter-manual-config
client fingerprinter doesn't overwrite manual configuration
2019-04-26 12:55:34 -04:00
Danielle 79515496cb
Merge pull request #5515 from hashicorp/dani/f-alloc-signal
allocs: Add nomad alloc signal command
2019-04-26 14:21:05 +02:00
Danielle Lancashire a8880f9643 alloc_signal: Add autcompletion and cmd tests 2019-04-26 12:47:53 +02:00
Mahmood Ali bf0a09e270 retry grpc unavailable errors even if not shutting down 2019-04-25 18:39:17 -04:00
Mahmood Ali 81841e8528 try checking process status 2019-04-25 18:16:13 -04:00
Mahmood Ali fc78521f29 add logging about attempts 2019-04-25 18:09:36 -04:00
Mahmood Ali e6ca8641a8 try sleeping for stop signal to take effect 2019-04-25 17:16:29 -04:00
Mahmood Ali ff3a095015 add a test that simulates logmon dying during Start() call 2019-04-25 16:41:17 -04:00
Mahmood Ali bbac73883c logmon: retry starting logmon if it exits
Retry if we detect shutting down during Start() api call is started,
locally.
2019-04-25 15:10:16 -04:00
Mahmood Ali b51f00a7f3 logmon client to handle grpc closing errors 2019-04-25 14:32:24 -04:00
Danielle Lancashire 3409e0be89 allocs: Add nomad alloc signal command
This command will be used to send a signal to either a single task within an
allocation, or all of the tasks if <task-name> is omitted. If the sent signal
terminates the allocation, it will be treated as if the allocation has crashed,
rather than as if it was operator-terminated.

Signal validation is currently handled by the driver itself and nomad
does not attempt to restrict or validate them.
2019-04-25 12:43:32 +02:00
Chris Baker 91c4e1eabb
Merge pull request #5541 from hashicorp/b/5540-bad-client-alloc-metrics
client/metrics: fixed stale metrics
2019-04-22 15:07:30 -04:00
Mahmood Ali f515b93b5e
Merge pull request #5577 from hashicorp/dani/b-logmon-unrecoverable
logging: Attempt to recover logmon failures
2019-04-22 14:40:24 -04:00
Michael Schurter 61f17a1043
tweak logging level for failed log line
Co-Authored-By: notnoop <mahmood@notnoop.com>
2019-04-22 14:40:17 -04:00