Content for the operating a job guide

This commit is contained in:
Michael Lange 2019-08-16 03:10:51 -07:00
parent 0b4c65b488
commit 5d9c2f35a3
1 changed files with 160 additions and 3 deletions


# Reviewing Job and Allocation Status
The Web UI can be a powerful companion when monitoring and debugging jobs running in Nomad. The Web
UI will list all jobs, link jobs to allocations, allocations to client nodes, client nodes to driver
health, and much more. This creates a fluid exploratory experience.
## Reviewing All Jobs
The first page you will see in the Web UI is the Jobs List page. Here you will find every job for a
namespace in a region. The table of jobs is searchable, sortable, and filterable. Each job row in
the table shows basic information, such as job name, status, type, and priority, as well as richer
information such as a visual representation of all allocation statuses.
This view will also live-update as jobs get submitted, get purged, and change status.
~> Screenshot (Jobs Overview)
## Filtering Jobs
If your Nomad cluster has many jobs, it can be useful to filter the list of all jobs down to only
those matching certain facets. The Web UI has four facets you can filter by:
1. **Type:** The type of job, including Batch, Parameterized, Periodic, Service, and System.
2. **Status:** The status of the job, including Pending, Running, and Dead.
3. **Datacenter:** The datacenter the job is running in, including a dynamically generated list
based on the jobs in the namespace.
4. **Prefix:** The possible common naming prefix for a job, including a dynamically generated list
based on job names up to the first occurrence of `-`, `.`, and `_`. Only prefixes that match
multiple jobs are included.
~> Screenshot (Zoom in on job filters, with one open)
## Monitoring an Allocation
In Nomad, allocations are the schedulable units of work. This is where runtime metrics begin to
surface. An allocation is composed of one or more tasks, and the utilization metrics for tasks are
aggregated so they can be observed at the allocation level.
### Resource Utilization
Nomad has APIs for reading point-in-time resource utilization metrics for tasks and allocations. The
Web UI uses these metrics to create time-series graphs for the current session.
When viewing an allocation, the Web UI will automatically start recording resource utilization.
~> Screenshot (resource utilization)
### Task Events
When Nomad places, prepares, and starts a task, a series of task events are emitted to help debug
issues in the event that the task fails to start.
Task events are listed on the Task Detail page and live-update as Nomad manages the task.
~> Screenshot (task events)
### Rescheduled Allocations
Allocations will be placed on any client node that satisfies the constraints of the job definition.
However, even a node that appears to be a good fit may turn out to have unforeseen problems
(e.g., a corrupted `/bin` or no access to a container registry).
Allocations can be configured [in the job definition to reschedule](/docs/job-specification/reschedule.html)
to a different client node if the allocation ends in a failed status. This will happen after the
task has exhausted its [local restart attempts](/docs/job-specification/restart.html).
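As a sketch, a task group might pair the two stanzas like this (the `web` and `app` names and all
values here are illustrative, not defaults):

```hcl
job "web" {
  group "app" {
    # First, attempt local restarts on the same client node.
    restart {
      attempts = 2
      interval = "30m"
      delay    = "15s"
      mode     = "fail"
    }

    # Once local restarts are exhausted, reschedule the allocation
    # to a different client node, backing off between attempts.
    reschedule {
      attempts       = 3
      interval       = "1h"
      delay          = "30s"
      delay_function = "exponential"
      max_delay      = "120s"
      unlimited      = false
    }
  }
}
```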
The end result of this automatic procedure is a failed allocation and that failed allocation's
rescheduled successor. Since Nomad handles all of this automatically, the Web UI makes sure to
explain the state of allocations through iconography and linking previous and next allocations in a
reschedule chain.
~> Screenshot (reschedule icon)
~> Screenshot (reschedule section on alloc detail)
### Unhealthy Driver
Given the nature of long-lived processes, it's possible for the state of the client node an
allocation is scheduled on to change during the lifespan of the allocation. Nomad attempts to
monitor pertinent conditions including driver health.
The Web UI denotes when a driver an allocation depends on is unhealthy on the client node the
allocation is running on.
~> Screenshot (unhealthy driver)
### Preempted Allocations
Much like how Nomad will automatically reschedule allocations, Nomad will automatically preempt
allocations when necessary. When monitoring allocations in Nomad, it's useful to know what
allocations were preempted and what job caused the preemption.
The Web UI makes sure to tell this full story by showing which allocation caused an allocation to be
preempted as well as the opposite: what allocations an allocation preempted. This makes it possible
to traverse down from a job to a preempted allocation, to the allocation that caused the preemption,
to the job that the preempting allocation is for.
~> Screenshot (preempter)
~> Screenshot (preemptions)
## Reviewing Logs for a Task
A task will typically emit log information to `stdout` and `stderr`. Nomad captures these logs and
exposes them through an API. The Web UI uses these APIs to offer `head`, `tail`, and streaming logs
from the browser.
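How much captured log data Nomad retains per task is configurable in the job file via the `logs`
stanza; a task fragment might cap its logs like this (the task name and values are illustrative):

```hcl
task "server" {
  driver = "docker"

  # Retain at most 10 rotated log files of 10 MB each
  # per stream (stdout and stderr).
  logs {
    max_files     = 10
    max_file_size = 10
  }
}
```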
The Web UI will first attempt to connect directly to the client node the task is running on.
Typically, client nodes are not accessible from the public internet. If this is the case, the Web UI
will fall back and proxy to the client node from the server node with no loss of functionality.
~> Screenshot (task logs)
~> Not all browsers support streaming http requests. In the event that streaming is not supported,
logs will still be followed using interval polling.
## Restarting or Stopping an Allocation or Task
Ideally, software always runs smoothly and as intended, but this isn't something we can count on.
Sometimes tasks will have a memory leak, sometimes a node will have noisy neighbors, and sometimes
we have no clue what's going on, so it's time to try turning it off and on again.
For these times, Nomad allows for restarting and stopping individual allocations and tasks. When a
task is restarted, Nomad will perform a local restart of the task. When an allocation is stopped,
Nomad will mark the allocation as complete and perform a reschedule onto a different client node.
Both of these features are also available in the Web UI.
~> Screenshot (allocation stop and restart)
## Forcing a Periodic Instance
Periodic jobs run on a cron-like schedule. Sometimes, you may want to launch an instance of the job
immediately instead of waiting for the period duration to elapse. Nomad calls this a
[periodic force](/docs/commands/job/periodic-force.html) and it can be done from the Web UI on the
Job Overview page for a periodic job.
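A periodic job declares its schedule with the `periodic` stanza in the job file. A minimal sketch of
a hypothetical batch job (the names, schedule, and image are illustrative):

```hcl
job "nightly-report" {
  type = "batch"

  # Launch an instance every night at 2 AM; a periodic force
  # launches one immediately, without waiting for this schedule.
  periodic {
    cron             = "0 2 * * *"
    prohibit_overlap = true
  }

  group "report" {
    task "generate" {
      driver = "docker"

      config {
        image = "reports:latest"
      }
    }
  }
}
```

The same operation is also available on the command line as `nomad job periodic force`.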
~> Screenshot (periodic force)
## Submitting a New Version of a Job
From the Job Definition page, a job can be edited. After clicking the Edit button in the top-right
corner of the code window, the job definition JSON becomes editable. The edits can then be planned
and scheduled.
~> Screenshot (definition edit)
~> Since each job within a namespace must have a unique name, it is possible to submit a new version
of a job from the Run Job screen. Always review the plan output!
## Monitoring a Deployment
When a system or service job includes the [`update` stanza](/docs/job-specification/update.html), a
deployment is created upon job submission. Job deployments can be monitored in real time from the Web
UI.
The Web UI will show new allocations as they are placed, counting toward the expected total, and
tally allocations as they become healthy or unhealthy.
Optionally, a job may use canary deployments to allow for additional health checks or manual testing
before a full rollout. If a job uses canaries and is not configured to automatically promote the
canary, the canary promotion operation can be done from the Job Overview page in the Web UI.
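The deployment behavior above is driven by the job's `update` stanza. A sketch of a group whose
canary must be promoted manually (the group name and all values are illustrative):

```hcl
group "api" {
  count = 3

  update {
    max_parallel     = 1
    canary           = 1
    auto_promote     = false  # canaries must be promoted manually
    min_healthy_time = "30s"
    healthy_deadline = "5m"
    auto_revert      = true
  }
}
```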
~> Screenshot (job deployment with canary link)
## Stopping a Job
Jobs can be stopped from the Job Overview page. Stopping a job will gracefully stop all allocations,
mark them as complete, and free up resources in the cluster.
~> Screenshot (job stop)
## Access Control
Depending on the size of your team and the details of your Nomad deployment, you may wish to control
which features different internal users have access to. This includes differentiation between
submitting jobs, restarting allocations, and viewing potentially sensitive logs.
Nomad has an access control list system for doing just that.
By default, all features—read and write—are available to all users of the Web UI. Check out the
[Securing the Web UI with ACLs](/guides/web-ui/securing.html) guide to learn how to prevent
anonymous users from having write permissions as well as how to continue to use Web UI write
features as a privileged user.
## Best Practices
Although the Web UI lets users submit jobs in an ad-hoc manner, Nomad was deliberately designed to
declare jobs using a configuration language. It is recommended to treat your job definitions, like
the rest of your infrastructure, as code.
By checking your job definition files into source control, you will always have a log of changes to
assist in debugging issues, rolling back versions, and collaborating on changes using development
best practices like code review.