# Reviewing Job and Allocation Status

The Web UI can be a powerful companion when monitoring and debugging jobs running in Nomad. The Web
UI will list all jobs, link jobs to allocations, allocations to client nodes, client nodes to driver
health, and much more. This creates a fluid exploratory experience.

## Reviewing All Jobs

The first page you will see in the Web UI is the Jobs List page. Here you will find every job for a
namespace in a region. The table of jobs is searchable, sortable, and filterable. Each job row in
the table shows basic information, such as job name, status, type, and priority, as well as richer
information such as a visual representation of all allocation statuses.

This view will also live-update as jobs get submitted, get purged, and change status.
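
If you prefer to cross-check what the Web UI shows from a terminal, a rough CLI counterpart to the
Jobs List page is `nomad job status` (the job name below is a placeholder):

```shell
# List every job in the current namespace and region
$ nomad job status

# Drill into a single job to see its task groups and allocations
$ nomad job status example
```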

~> Screenshot (Jobs Overview)

## Filtering Jobs

If your Nomad cluster has many jobs, it can be useful to filter the list of all jobs down to only
those matching certain facets. The Web UI has four facets you can filter by:

1. **Type:** The type of job, including Batch, Parameterized, Periodic, Service, and System.
2. **Status:** The status of the job, including Pending, Running, and Dead.
3. **Datacenter:** The datacenter the job is running in, including a dynamically generated list
   based on the jobs in the namespace.
4. **Prefix:** The possible common naming prefix for a job, including a dynamically generated list
   based on job names up to the first occurrence of `-`, `.`, and `_`. Only prefixes that match
   multiple jobs are included (a rough CLI analogue is sketched below).
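
The Prefix facet has a loose CLI analogue: `nomad job status` accepts a name prefix and narrows the
list to matching jobs (`cache` below is a hypothetical prefix):

```shell
# Show only jobs whose names begin with the given prefix
$ nomad job status cache
```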

~> Screenshot (Zoom in on job filters, with one open)

## Monitoring an Allocation

In Nomad, allocations are the schedulable units of work. This is where runtime metrics begin to
surface. An allocation is composed of one or more tasks, and the utilization metrics for tasks are
aggregated so they can be observed at the allocation level.

### Resource Utilization

Nomad has APIs for reading point-in-time resource utilization metrics for tasks and allocations. The
Web UI uses these metrics to create time-series graphs for the current session.

When viewing an allocation, resource utilization will automatically start logging.
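
As a sketch of those same metrics APIs, one point-in-time sample can be pulled from the CLI or over
HTTP; the allocation ID and the local address below are placeholders:

```shell
# Point-in-time CPU and memory usage for an allocation and its tasks
$ nomad alloc status -stats 5f3b9c2e

# The allocation stats endpoint, which returns resource usage as JSON
$ curl http://127.0.0.1:4646/v1/client/allocation/<alloc-id>/stats
```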

~> Screenshot (resource utilization)

### Task Events

When Nomad places, prepares, and starts a task, a series of task events are emitted to help debug
issues in the event that the task fails to start.

Task events are listed on the Task Detail page and live-update as Nomad manages the task.
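
The same events are available from the CLI if you'd rather not open the browser; the allocation ID
below is a placeholder:

```shell
# Show allocation details, including recent task events for each task
$ nomad alloc status 5f3b9c2e
```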

~> Screenshot (task events)

### Rescheduled Allocations

Allocations will be placed on any client node that satisfies the constraints of the job definition.
However, just because a node sounds like a good fit doesn't mean there isn't the possibility of
unforeseen hostility (e.g., a corrupted `/bin` or no access to a container registry).

Allocations can be configured [in the job definition to reschedule](/docs/job-specification/reschedule.html)
to a different client node if the allocation ends in a failed status. This will happen after the
task has exhausted its [local restart attempts](/docs/job-specification/restart.html).

The end result of this automatic procedure is a failed allocation and that failed allocation's
rescheduled successor. Since Nomad handles all of this automatically, the Web UI makes sure to
explain the state of allocations through iconography and linking previous and next allocations in a
reschedule chain.
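
A rough way to walk the same chain from the CLI is to inspect the failed allocation directly;
exactly which reschedule fields appear in the output depends on your Nomad version, and the ID below
is a placeholder:

```shell
# Inspect the failed allocation; its status output references the
# replacement allocation created by the reschedule
$ nomad alloc status -verbose 5f3b9c2e
```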

~> Screenshot (reschedule icon)

~> Screenshot (reschedule section on alloc detail)

### Unhealthy Driver

Given the nature of long-lived processes, it's possible for the state of the client node an
allocation is scheduled on to change during the lifespan of the allocation. Nomad attempts to
monitor pertinent conditions including driver health.

The Web UI denotes when a driver that an allocation depends on is unhealthy on the client node the
allocation is running on.
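
Driver health can also be checked per client node from the CLI, which is a handy way to confirm what
the Web UI is reporting; the node ID below is a placeholder:

```shell
# Show a client node's details, including the health of its task drivers
$ nomad node status -verbose 4d2ba53b
```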

~> Screenshot (unhealthy driver)

### Preempted Allocations

Much like how Nomad will automatically reschedule allocations, Nomad will automatically preempt
allocations when necessary. When monitoring allocations in Nomad, it's useful to know what
allocations were preempted and what job caused the preemption.

The Web UI makes sure to tell this full story by showing which allocation caused an allocation to be
preempted as well as the opposite: what allocations an allocation preempted. This makes it possible
to traverse down from a job to a preempted allocation, to the allocation that caused the preemption,
to the job that the preempting allocation is for.

~> Screenshot (preempter)

~> Screenshot (preemptions)

## Reviewing Logs for a Task

A task will typically emit log information to `stdout` and `stderr`. Nomad captures these logs and
exposes them through an API. The Web UI uses these APIs to offer `head`, `tail`, and streaming logs
from the browser.

The Web UI will first attempt to connect directly to the client node the task is running on.
Typically, client nodes are not accessible from the public internet. If this is the case, the Web UI
will fall back and proxy to the client node from the server node with no loss of functionality.
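
The CLI exposes the same log APIs, which can be useful when scripting or when a browser isn't handy;
the allocation ID and task name below are placeholders:

```shell
# Follow stdout for a task as new output arrives
$ nomad alloc logs -f 5f3b9c2e redis

# Show the last 50 lines of stderr instead
$ nomad alloc logs -stderr -tail -n 50 5f3b9c2e redis
```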

~> Screenshot (task logs)

~> Not all browsers support streaming HTTP requests. In the event that streaming is not supported,
logs will still be followed using interval polling.

## Restarting or Stopping an Allocation or Task

Ideally software always runs smoothly and as intended, but this isn't something we can count on.
Sometimes tasks will have a memory leak, sometimes a node will have noisy neighbors, and sometimes
we have no clue what's going on, so it's time to try turning it off and on again.

For these times, Nomad allows for restarting and stopping individual allocations and tasks. When a
task is restarted, Nomad will perform a local restart of the task. When an allocation is stopped,
Nomad will mark the allocation as complete and perform a reschedule onto a different client node.

Both of these features are also available in the Web UI.
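
For reference, these are roughly the CLI equivalents of the Web UI buttons (both require a Nomad
version that includes the `alloc restart` and `alloc stop` commands; the ID and task name below are
placeholders):

```shell
# Restart a task in place on the same client node
$ nomad alloc restart 5f3b9c2e redis

# Stop the allocation; Nomad will reschedule a replacement
$ nomad alloc stop 5f3b9c2e
```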

~> Screenshot (allocation stop and restart)

## Forcing a Periodic Instance

Periodic jobs are configured much like cron jobs. Sometimes, we want to micromanage the job instead
of waiting for the period duration to elapse. Nomad calls this a
[periodic force](/docs/commands/job/periodic-force.html) and it can be done from the Web UI on the
Job Overview page for a periodic job.
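
The CLI command linked above looks roughly like this; `backup` is a placeholder job name:

```shell
# Launch a new instance of a periodic job right away, without waiting
# for its next scheduled period
$ nomad job periodic force backup
```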

~> Screenshot (periodic force)

## Submitting a New Version of a Job

From the Job Definition page, a job can be edited. After clicking the Edit button in the top-right
corner of the code window, the job definition JSON becomes editable. The edits can then be planned
and scheduled.
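
The plan-then-schedule flow mirrors the CLI workflow for submitting a new version of a job;
`example.nomad` is a placeholder file name, and the check index comes from the plan output:

```shell
# Dry-run the edited job to see what the scheduler would change
$ nomad job plan example.nomad

# Submit the new version; -check-index guards against concurrent edits
$ nomad job run -check-index 125 example.nomad
```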

~> Screenshot (definition edit)

~> Since each job within a namespace must have a unique name, it is possible to submit a new version
of a job from the Run Job screen. Always review the plan output!

## Monitoring a Deployment

When a system or service job includes the [`update` stanza](/docs/job-specification/update.html), a
deployment is created upon job submission. Job deployments can be monitored in real time from the
Web UI.

The Web UI will show new allocations as they are placed, tallying towards the expected total, and
will tally allocations as they become healthy or unhealthy.

Optionally, a job may use canary deployments to allow for additional health checks or manual testing
before a full rollout. If a job uses canaries and is not configured to automatically promote the
canary, the canary promotion operation can be done from the Job Overview page in the Web UI.
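
If you also want to watch or promote a deployment from a terminal, the corresponding CLI commands
look roughly like this (the job name and deployment ID are placeholders):

```shell
# List deployments for a job, then check the latest one
$ nomad job deployments example
$ nomad deployment status 0b23b149

# Promote the canaries once they look healthy
$ nomad deployment promote 0b23b149
```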

~> Screenshot (job deployment with canary link)

## Stopping a Job

Jobs can be stopped from the Job Overview page. Stopping a job will gracefully stop all allocations,
marking them as complete and freeing up resources in the cluster.
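
The CLI equivalent is `nomad job stop`; `example` is a placeholder job name:

```shell
# Gracefully stop the job and all of its allocations
$ nomad job stop example

# Optionally remove the job from the cluster state entirely
$ nomad job stop -purge example
```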

~> Screenshot (job stop)

## Access Control

Depending on the size of your team and the details of your Nomad deployment, you may wish to control
which features different internal users have access to. This includes differentiation between
submitting jobs, restarting allocations, and viewing potentially sensitive logs.

Nomad has an access control list system for doing just that.

By default, all features—read and write—are available to all users of the Web UI. Check out the
[Securing the Web UI with ACLs](/guides/web-ui/securing.html) guide to learn how to prevent
anonymous users from having write permissions as well as how to continue to use Web UI write
features as a privileged user.

## Best Practices

Although the Web UI lets users submit jobs in an ad-hoc manner, Nomad was deliberately designed to
declare jobs using a configuration language. It is recommended to treat your job definitions, like
the rest of your infrastructure, as code.

By checking your job definition files into source control, you will always have a log of changes to
assist in debugging issues, rolling back versions, and collaborating on changes using development
best practices like code review.