---
layout: docs
page_title: Transform - Secrets Engines - Tokenization
description: >-
  More information on the Tokenization transform.
---

# Tokenization Transform

Not to be confused with Vault tokens, tokenization exchanges a
sensitive value for an unrelated value called a _token_. Tokens are
irreversible: the original sensitive value cannot be recovered from a token
alone. Unlike format preserving encryption, tokenization is stateful. To decode
the original value, the token must be submitted to Vault, where the value is
retrieved from a cryptographic mapping in storage.

## Operation

On encode, Vault generates a random, signed token and stores a mapping from a
version of that token to encrypted versions of the plaintext and metadata. It
also stores a fingerprint of the original plaintext, which supports the
`tokenized` endpoint for querying whether a plaintext exists in the system.

Depending on the mapping mode, the plaintext may be decodable only with
possession of the distributed token, or may also be recoverable through the
export operation. See [Security Considerations](#security-considerations) for
more.

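To make the stateful mapping concrete, the following is a minimal sketch of the idea in Python. It is a toy model and not Vault's implementation: the `ToyTokenStore` class, its in-memory dictionaries, and the HMAC-based fingerprint are all assumptions made for illustration, and unlike Vault it stores the plaintext unencrypted.

```python
import hashlib
import hmac
import secrets


class ToyTokenStore:
    """In-memory stand-in for a stateful tokenization mapping (hypothetical)."""

    def __init__(self):
        # Hypothetical fingerprint key; a real system would protect this key.
        self._fp_key = secrets.token_bytes(32)
        self._by_token = {}          # token -> plaintext (Vault stores this encrypted)
        self._fingerprints = set()   # keyed hashes of every plaintext seen so far

    def _fingerprint(self, plaintext: str) -> bytes:
        return hmac.new(self._fp_key, plaintext.encode(), hashlib.sha256).digest()

    def encode(self, plaintext: str) -> str:
        token = secrets.token_urlsafe(24)  # random, unrelated to the plaintext
        self._by_token[token] = plaintext
        self._fingerprints.add(self._fingerprint(plaintext))
        return token

    def decode(self, token: str) -> str:
        return self._by_token[token]  # raises KeyError for unknown tokens

    def tokenized(self, plaintext: str) -> bool:
        # Answers "has this plaintext been encoded?" without a token.
        return self._fingerprint(plaintext) in self._fingerprints


store = ToyTokenStore()
t1 = store.encode("4111-1111-1111-1111")
t2 = store.encode("4111-1111-1111-1111")
```

Encoding the same value twice yields two different tokens, and decoding only works by consulting the stored mapping, which is the sense in which tokenization is stateful.
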
### Convergence

By default, tokenization produces a unique token for every encode operation.
This makes the resulting token fully independent of its plaintext and expiration.
Sometimes, though, it is beneficial for a given plaintext/expiration pair to
tokenize consistently to the same value. For example, one may want to do a
statistical analysis of the tokens as they relate to some other field in a
database (without decoding the token), or need to tokenize in two different
systems but be able to relate the results. In these cases, one can create a
tokenization transformation that is *convergent*.

When convergence is enabled at transformation creation time, Vault alters the
calculation so that encoding a given plaintext and expiration produces the same
token every time, and storage keeps only a single entry for that token. Like the
exportable mapping mode, convergence should only be enabled if needed.
Convergent tokenization has a small performance penalty in external stores and
a larger one in the built-in store, due to the need to avoid duplicate entries
and to update metadata when convergently encoding. If some use cases require
convergence and others do not, create two different tokenization transforms
with convergence enabled on only one.

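The property that convergence provides can be sketched as a deterministic function of the plaintext and expiration. The example below uses a keyed HMAC purely as an illustrative stand-in; the source does not describe Vault's actual calculation, and the `convergent_token` function and demo key are hypothetical.

```python
import base64
import hashlib
import hmac


def convergent_token(key: bytes, plaintext: str, expiration: str = "") -> str:
    """Derive the same token for the same (plaintext, expiration) pair."""
    data = plaintext.encode()
    # Length-prefix the plaintext so ("ab", "c") and ("a", "bc") cannot collide.
    message = len(data).to_bytes(4, "big") + data + expiration.encode()
    digest = hmac.new(key, message, hashlib.sha256).digest()
    return base64.urlsafe_b64encode(digest).decode().rstrip("=")


key = b"demo-key-for-illustration-only"
a = convergent_token(key, "123-45-6789", "2030-01-01T00:00:00Z")
b = convergent_token(key, "123-45-6789", "2030-01-01T00:00:00Z")
c = convergent_token(key, "123-45-6789", "2031-01-01T00:00:00Z")
```

Here `a` and `b` are equal because the inputs match, while `c` differs because the expiration changed. This is also why convergence weakens the independence between token and plaintext and should only be enabled when needed.
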
### Token Lookup

Some use cases need to look up the token for a given plaintext. Ordinarily
this is contrary to the nature of tokenization, which aims to prevent an
attacker from determining that a token corresponds to a plaintext value (a
known plaintext attack). For use cases that require it, the
[token lookup](/vault/api-docs/secret/transform#token-lookup)
operation is supported, but only in some configurations of the tokenization
transformation: when convergence is enabled, or when the mapping mode is
exportable *and* the storage backend is external.

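The support rule above reduces to a small boolean expression. The helper below is a hypothetical sketch for reasoning about configurations, not part of any Vault client library; the function name and parameters are assumptions.

```python
def token_lookup_supported(convergent: bool, mapping_mode: str, external_store: bool) -> bool:
    """Return True when the configuration described above allows token lookup."""
    return convergent or (mapping_mode == "exportable" and external_store)


# Enumerate every combination to see which configurations qualify.
for convergent in (False, True):
    for mapping_mode in ("default", "exportable"):
        for external_store in (False, True):
            ok = token_lookup_supported(convergent, mapping_mode, external_store)
            print(f"convergent={convergent!s:5} mode={mapping_mode:10} "
                  f"external={external_store!s:5} -> lookup={ok}")
```
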
## Performance Considerations

### Builtin (Internal) Store

As tokenization is stateful, the encode operation necessarily writes values to
storage. By default, that storage is the Vault backend store itself. This
differs from some secrets engines in that encode and decode each require a
storage access per operation. Other engines use storage for configuration
but can process operations largely without accessing any storage.

With internal storage, writes must be performed on primary nodes, so the
scalability of the encode operation is limited by the performance of the
primary and its storage subsystem. All other operations can be performed
on secondaries.

Finally, due to replication, writes to the primary may take some time to reach
secondaries, so read operations such as decode or metadata may not succeed on
the secondaries until this happens. In other words, tokenization is eventually
consistent.

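One way a client might tolerate this eventual consistency is to retry a read briefly before treating a missing token as an error. The sketch below is an assumption about client-side handling, not documented Vault behavior: `retry_read` is a hypothetical helper, and `LookupError` stands in for however a client surfaces a token that has not replicated yet.

```python
import time


def retry_read(read, attempts=5, base_delay=0.05, sleep=time.sleep):
    """Call read() until it succeeds, backing off exponentially between tries."""
    for attempt in range(attempts):
        try:
            return read()
        except LookupError:
            if attempt == attempts - 1:
                raise  # give up after the final attempt
            sleep(base_delay * 2 ** attempt)


# Simulated secondary that only sees the token on the third read.
calls = {"n": 0}


def flaky_decode():
    calls["n"] += 1
    if calls["n"] < 3:
        raise LookupError("token not replicated yet")
    return "plaintext"


# Pass a no-op sleep so the demonstration runs instantly.
result = retry_read(flaky_decode, sleep=lambda seconds: None)
```
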
### External Storage

With external storage, all nodes (except DR secondaries) can participate in all
operations, but one must take care to monitor and scale the external storage
for the level of traffic experienced. The storage schema is simple, however,
and well-known scaling approaches should be effective.

## Security Considerations

The goal of tokenization is to let end users' devices store the token rather than
their sensitive values (such as credit card numbers) and still participate in
transactions where the token is a stand-in for the sensitive value. For this
reason, the token Vault generates is completely unrelated to the sensitive
value and cannot be reversed into it.

Furthermore, the Tokenization transform is designed to resist a number of attacks
on the values produced during encode. In particular, it is designed so that
attackers cannot recover plaintext even if they steal the tokenization values
from Vault itself. In the default mapping mode, even stealing the underlying
transform key does not allow them to recover the plaintext without also
possessing the encoded token. An attacker must obtain all of the values in the
construct.

In the `exportable` mapping mode, however, the plaintext values are encrypted
in a way that can be decrypted within Vault. If the attacker possesses the
transform key and the tokenization mapping values, the plaintext can be
recovered. This mode is available for the case where operators prioritize the
ability to export all of the plaintext values in an emergency, via the
`export-decoded` operation.

### Metadata

Since tokenization isn't format preserving and requires storage, one can associate
arbitrary metadata with a token. Metadata is considered less sensitive than the
original plaintext value. As it has its own retrieval endpoint, operators can
configure policies that allow access to the metadata of a token but not
its decoded value, enabling workflows that operate only on the metadata.

## TTLs and Tidying

By default, tokens are long lived, and their storage is maintained
indefinitely. Where the underlying value has a natural time-to-live, it is
strongly encouraged that tokens be generated with a TTL. For example, as credit
cards have an expiration date, it is recommended that a credit card primary
account number (PAN) be tokenized with a TTL that corresponds to the time
after which the PAN is invalid.

This allows such values to be _tidied_ and removed from storage once expired.
Tokens themselves encode the expiration time, so decode and other operations
can immediately reject an expired token.

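The interaction between TTLs, rejection of expired tokens, and tidying can be sketched with a toy store. This is an illustration under stated assumptions, not Vault's implementation: the `ExpiringStore` class is hypothetical, and for simplicity it keeps the expiration next to the stored entry, whereas Vault tokens encode the expiration themselves.

```python
from datetime import datetime, timedelta, timezone


class ExpiringStore:
    """Toy token store with optional per-token expiry (hypothetical)."""

    def __init__(self):
        self._entries = {}  # token -> (plaintext, expires_at or None)

    def put(self, token, plaintext, ttl=None, now=None):
        now = now or datetime.now(timezone.utc)
        self._entries[token] = (plaintext, now + ttl if ttl else None)

    def decode(self, token, now=None):
        now = now or datetime.now(timezone.utc)
        plaintext, expires_at = self._entries[token]
        if expires_at is not None and now >= expires_at:
            raise ValueError("token expired")
        return plaintext

    def tidy(self, now=None):
        """Remove expired entries and return how many were removed."""
        now = now or datetime.now(timezone.utc)
        expired = [t for t, (_, exp) in self._entries.items()
                   if exp is not None and now >= exp]
        for token in expired:
            del self._entries[token]
        return len(expired)


t0 = datetime(2024, 1, 1, tzinfo=timezone.utc)
s = ExpiringStore()
s.put("tok-card", "4111111111111111", ttl=timedelta(days=365), now=t0)
s.put("tok-forever", "no-expiry-value", now=t0)
removed = s.tidy(now=t0 + timedelta(days=366))
```

Entries stored without a TTL are never eligible for tidying, which is why long-lived tokens keep consuming storage indefinitely.
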
## Storage

### External SQL Stores

Currently the PostgreSQL, MySQL, and MSSQL relational databases are supported as
external storage backends for tokenization.
The [Schema Endpoint](/vault/api-docs/secret/transform#create-update-store-schema)
may be used to initialize and upgrade the necessary database tables. Vault uses
a schema versioning table to determine whether it needs to create or modify the
tables when using that endpoint. If you make changes to those tables yourself,
the automatic schema management may fall out of sync and fail in the future.

External stores are often preferred because they can achieve much higher
performance at scale, especially when used with batch operations.

### Snapshot/Restore

Snapshot allows one to iteratively retrieve the tokenization state, for
backup or migration purposes. The resulting data can be fed to the restore
endpoint of the same or a different tokenization store. Note that the state
is only usable by the tokenization transform that created it, as the state is
encrypted with keys in that configured transform.

### Export Decoded

For stores configured with the `exportable` mapping mode, the export decoded
endpoint allows operators to retrieve the _decoded_ contents of tokenization
state, which includes tokens and their decoded, sensitive values. The
`exportable` mode is only recommended if this use case is required, as state in
the default mode cannot be decoded by attackers even if they gain access to
Vault's storage and keys.

### Migration

Tokenization stores are configured separately from the tokenization transform,
and the transform can point to multiple stores. The primary use case for this
one-to-many relationship is to facilitate migration between two tokenization
stores.

When multiple stores are configured, Vault writes new tokenization state to all
configured stores, and reads from each store in the order they were configured.
Thus, one can use multiple configured stores along with the snapshot/restore
functionality to perform a zero-downtime migration to a new store:

1. Configure the new tokenization store in the API.
1. Modify the existing tokenization transform to use both the existing and new
   store.
1. Snapshot the old store.
1. Restore the snapshot to the new store.
1. Perform any desired validations.
1. Modify the tokenization transform to use only the new store.

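The write-to-all, read-in-order behavior and the migration steps above can be modeled in a few lines. This is a toy sketch with plain dictionaries standing in for stores; the `MultiStoreTransform` class and the dictionary-based snapshot and restore are assumptions for illustration and do not use Vault's snapshot or restore endpoints.

```python
class MultiStoreTransform:
    """Toy transform that fans writes out to every configured store."""

    def __init__(self, stores):
        self.stores = list(stores)  # dict-like stores, in configured order

    def encode(self, token, value):
        for store in self.stores:   # new state is written to all stores
            store[token] = value

    def decode(self, token):
        for store in self.stores:   # reads try each store in order
            if token in store:
                return store[token]
        raise KeyError(token)


old = {"t1": "a", "t2": "b"}        # existing store with prior state
new = {}                            # step 1: configure the new store
transform = MultiStoreTransform([old])

transform.stores = [old, new]       # step 2: use both stores
transform.encode("t3", "c")         # writes during migration land in both

snapshot = dict(old)                # step 3: snapshot the old store
for token, value in snapshot.items():
    new.setdefault(token, value)    # step 4: restore into the new store

migrated = new == old               # step 5: validate before cutting over
transform.stores = [new]            # step 6: use only the new store
```

Because writes go to both stores from step 2 onward, tokens created while the snapshot is being restored are not lost, which is what makes the migration zero-downtime in this model.
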
## Key Management

Tokenization supports key rotation. Keys are tied to transforms, so a key's
name is the same as the name of the corresponding tokenization transform.
Keys can be rotated to a new version, with backward compatibility for
decoding. Encoding is always performed with the newest key version. Key versions
can be tidied as well. Keys may also be rotated automatically on a user-defined
time interval, specified by the `auto_rotate_period` field of the key config.
For more information, see the
[transform API docs](/vault/api-docs/secret/transform).

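The rotation semantics described above (newest version for new work, older versions still accepted until tidied) can be sketched as follows. This is a toy model using HMAC tags, not Vault's key handling; the `VersionedKey` class and its method names are hypothetical.

```python
import hashlib
import hmac
import secrets


class VersionedKey:
    """Toy versioned key: sign with the newest version, verify with any kept one."""

    def __init__(self):
        self.versions = {1: secrets.token_bytes(32)}
        self.min_version = 1

    @property
    def latest(self):
        return max(self.versions)

    def rotate(self):
        self.versions[self.latest + 1] = secrets.token_bytes(32)
        return self.latest

    def sign(self, data: bytes):
        version = self.latest  # new work always uses the newest version
        tag = hmac.new(self.versions[version], data, hashlib.sha256).digest()
        return version, tag

    def verify(self, version, data: bytes, tag: bytes) -> bool:
        if version < self.min_version or version not in self.versions:
            return False  # version was tidied away or never existed
        expected = hmac.new(self.versions[version], data, hashlib.sha256).digest()
        return hmac.compare_digest(expected, tag)

    def tidy(self, min_version):
        """Drop key versions older than min_version."""
        self.min_version = min_version
        for version in [v for v in self.versions if v < min_version]:
            del self.versions[version]


key = VersionedKey()
v1, tag1 = key.sign(b"token-payload")
key.rotate()
v2, tag2 = key.sign(b"token-payload")
still_valid = key.verify(v1, b"token-payload", tag1)  # old version still accepted
```

In this model, calling `key.tidy(2)` would cause anything produced with version 1 to be rejected, so tidying key versions should wait until tokens that depend on them are no longer needed.
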
## Tutorial

Refer to [Tokenize Data with Transform Secrets Engine](/vault/tutorials/adp/tokenization)
for a step-by-step tutorial.