inspecting state

2016-06-29 13:32:31 -07:00 · 2016-06-29 13:32:31 -07:00 · 98d19aadb6
parent bf728e8e51
commit 98d19aadb6
2 changed files with 162 additions and 2 deletions
--- a/website/source/docs/jobops/inspecting.html.md
+++ b/website/source/docs/jobops/inspecting.html.md
@ -6,4 +6,164 @@ description: |-
  Learn how to inspect a Nomad Job.
 ---

-# Operating a Job
+# Inspecting state
+
+Once a job is submitted, the next step is to ensure it is running. This section
+will assume we have submitted a job with the name "example".
+
+To get a high-level over view of our job we can use the [`nomad status`
+command](/docs/commands/status.html). This command will display the list of
+running allocations, as well as any recent placement failures. An example below
+shows that the job has some allocations placed but did not have enough resources
+to place all of the desired allocations. We run with `-evals` to see that there
+is an outstanding evaluation for the job:
+
+```
+$ nomad status example
+ID          = example
+Name        = example
+Type        = service
+Priority    = 50
+Datacenters = dc1
+Status      = running
+Periodic    = false
+
+Evaluations
+ID        Priority  Triggered By  Status    Placement Failures
+5744eb15  50        job-register  blocked   N/A - In Progress
+8e38e6cf  50        job-register  complete  true
+
+Placement Failure
+Task Group "cache":
+  * Resources exhausted on 1 nodes
+  * Dimension "cpu exhausted" exhausted on 1 nodes
+
+Allocations
+ID        Eval ID   Node ID   Task Group  Desired  Status
+12681940  8e38e6cf  4beef22f  cache       run      running
+395c5882  8e38e6cf  4beef22f  cache       run      running
+4d7c6f84  8e38e6cf  4beef22f  cache       run      running
+843b07b8  8e38e6cf  4beef22f  cache       run      running
+a8bc6d3e  8e38e6cf  4beef22f  cache       run      running
+b0beb907  8e38e6cf  4beef22f  cache       run      running
+da21c1fd  8e38e6cf  4beef22f  cache       run      running
+```
+
+In the above example we see that the job has a "blocked" evaluation that is in
+progress. When Nomad can not place all the desired allocations, it creates a
+blocked evaluation that waits for more resources to become available. We can use
+the [`eval-status` command](/docs/commands/eval-status.html) to examine any
+evaluation in more detail. For the most part this should never be necessary but
+can be useful to see why everything was not placed. For example if we run it on
+the evaluation that had placement failures we see:
+
+```
+nomad eval-status 8e38e6cf
+ID                 = 8e38e6cf
+Status             = complete
+Status Description = complete
+Type               = service
+TriggeredBy        = job-register
+Job ID             = example
+Priority           = 50
+Placement Failures = true
+
+Failed Placements
+Task Group "cache" (failed to place 3 allocations):
+  * Resources exhausted on 1 nodes
+  * Dimension "cpu exhausted" exhausted on 1 nodes
+
+Evaluation "5744eb15" waiting for additional capacity to place remainder
+```
+
+More interesting though is the [`alloc-status`
+command](/docs/commands/alloc-status.html). This command gives us the most
+recent events that occured for a task, its resource usage, port allocations and
+more:
+
+```
+nomad alloc-status 12
+ID            = 12681940
+Eval ID       = 8e38e6cf
+Name          = example.cache[1]
+Node ID       = 4beef22f
+Job ID        = example
+Client Status = running
+
+Task "redis" is "running"
+Task Resources
+CPU    Memory           Disk     IOPS  Addresses
+2/500  6.3 MiB/256 MiB  300 MiB  0     db: 127.0.0.1:57161
+
+Recent Events:
+Time                   Type        Description
+06/28/16 15:46:42 UTC  Started     Task started by client
+06/28/16 15:46:10 UTC  Restarting  Task restarting in 30.863215327s
+06/28/16 15:46:10 UTC  Terminated  Exit Code: 137, Exit Message: "Docker container exited with non-zero exit code: 137"
+06/28/16 15:37:46 UTC  Started     Task started by client
+06/28/16 15:37:44 UTC  Received    Task received by client
+```
+
+In the above example we forced killed the docker container so that we could see
+in the event history that Nomad detected the failure and restarted the
+allocation.
+
+The `alloc-status` command is a good starting to point for debugging an
+application that did not start. In this example task we are trying to start a
+redis image using `redis:2.8` but the user has accidentally put a comma instead
+of a period, typing `redis:2,8`.
+
+
+When the job is run, it produces an allocation that fails. The `alloc-status`
+command gives us the reason why:
+
+```
+nomad alloc-status c0f1
+ID            = c0f1b34c
+Eval ID       = 4df393cb
+Name          = example.cache[0]
+Node ID       = 13063955
+Job ID        = example
+Client Status = failed
+
+Task "redis" is "dead"
+Task Resources
+CPU  Memory   Disk     IOPS  Addresses
+500  256 MiB  300 MiB  0     db: 127.0.0.1:23285
+
+Recent Events:
+Time                   Type            Description
+06/28/16 15:50:22 UTC  Not Restarting  Error was unrecoverable
+06/28/16 15:50:22 UTC  Driver Failure  failed to create image: Failed to pull `redis:2,8`: API error (500): invalid tag format
+06/28/16 15:50:22 UTC  Received        Task received by client
+```
+
+Not all failures are this easily debuggable. If the `alloc-status` command shows
+many restarts occuring as in the example below, it is a good hint that the error
+is occuring at the application level during start up. These failres can be
+debugged by looking at logs which is covered [here](/docs/jobops/logs.html).
+
+```
+$ nomad alloc-status e6b6
+ID            = e6b625a1
+Eval ID       = 68b742e8
+Name          = example.cache[0]
+Node ID       = 83ef596c
+Job ID        = example
+Client Status = pending
+
+Task "redis" is "pending"
+Task Resources
+CPU  Memory   Disk     IOPS  Addresses
+500  256 MiB  300 MiB  0     db: 127.0.0.1:30153
+
+Recent Events:
+Time                   Type        Description
+06/28/16 15:56:16 UTC  Restarting  Task restarting in 5.178426031s
+06/28/16 15:56:16 UTC  Terminated  Exit Code: 1, Exit Message: "Docker container exited with non-zero exit code: 1"
+06/28/16 15:56:16 UTC  Started     Task started by client
+06/28/16 15:56:00 UTC  Restarting  Task restarting in 5.00123931s
+06/28/16 15:56:00 UTC  Terminated  Exit Code: 1, Exit Message: "Docker container exited with non-zero exit code: 1"
+06/28/16 15:55:59 UTC  Started     Task started by client
+06/28/16 15:55:48 UTC  Received    Task received by client
+```
--- a/website/source/docs/jobops/taskconfig.html.md
+++ b/website/source/docs/jobops/taskconfig.html.md
@ -126,7 +126,7 @@ defaults
    timeout server  10000

 listen http-in
-    bind {{service "my-web-lb"}} {{range service "my-web"}}
+    bind {{env "NOMAD_ADDR_inbound"}} {{range service "my-web"}}
    server {{.Node}} {{.Address}}:{{.Port}}{{end}}
 ```