open-vault/website/content/docs/secrets/transform/tokenization.mdx

202 lines
9.7 KiB
Plaintext

---
layout: docs
page_title: Transform - Secrets Engines - Tokenization
description: >-
More information on the Tokenization transform.
---
# Tokenization transform
Not to be confused with Vault tokens, Tokenization exchanges a
sensitive value for an unrelated value called a _token_. The original sensitive
value cannot be recovered from a token alone, they are irreversible. Instead,
unlike format preserving encryption, tokenization is stateful. To decode the
original value, the token must be submitted to Vault where it is
retrieved from a cryptographic mapping in storage.
## Operation
On encode, Vault generates a random, signed token and stores a mapping of a
version of that token to encrypted versions of the plaintext and metadata, as
well as a fingerprint of the original plaintext which facilitates the `tokenized`
endpoint that lets one query whether a plaintext exists in the system.
Depending on the mapping mode, the plaintext may be decoded only with possession
of the distributed token, or may be recoverable in the export operation. See
[Security Considerations](#security-considerations) for more.
### Convergence
By default, tokenization produces a unique token for every encode operation.
This makes the resulting token fully independent of its plaintext and expiration.
Sometimes, though, it may be beneficial if the tokenization of a plaintext/expiration
pair tokenizes consistently to the same value. For example if one wants to
do a statistical analysis of the tokens as they relate to some other field
in a database (without decoding the token), or if one needed to tokenize
in two different systems but be able relate the results. In this case,
one can create a tokenization transformation that is *convergent*.
When enabled at transformation creation time, Vault alters the calculation so that
encoding a plaintext and expiration tokenizes to the same value every time, and
storage keeps only a single entry of that token. Like the exportable mapping
mode, convergence should only be enabled if needed. Convergent tokenization
has a small performance penalty in external stores and a larger one in the
built in store due to the need to avoid duplicate entries and to update
metadata when convergently encoding. It is recommended that if one has some
use cases that require convergence and some that do not, one should create two
different tokenization transforms with convergence enabled on only one.
### Token lookup
Some use cases may want to lookup the value of a token given its plaintext. Ordinarily
this is contrary to the nature of tokenization where we want to prevent the ability
of an attacker to determine that a token corresponds to a plaintext value (a known
plaintext attack). But for use cases that require it, the
[token lookup](/vault/api-docs/secret/transform#token-lookup)
operation is supported, but only in some configurations of the tokenization
transformation. Token lookup is supported when convergence is enabled, or
if the mapping mode is exportable *and* the storage backend is external.
## Performance considerations
### Builtin (Internal) store
As tokenization is stateful, the encode operation necessarily writes values to
storage. By default, that storage is the Vault backend store itself. This
differs from some secret engines in that the encode and decode operations require
an access of storage per operation. Other engines use storage for configuration
but can process operations largely without accessing any storage.
Since these operations involve writes to storage, and therefore must be performed
on primary nodes, the scalability of the encode operation is limited by the
primary's storage performance.
Additionally, using internal storage, since writes must be performed on primary
nodes, the scalability of the encode operation will be limited by the performance
of the primary and its storage subsystem. All other operations can be performed
on secondaries.
Finally, due to replication, writes to the primary may take some time to reach
secondaries, so other read operations like decode or metadata may not succeed on
the secondaries until this happens. In other words, tokenization is eventually
consistent.
### External storage
All nodes (except DRs) can participate in all operations using external storage,
but one must take care to monitor and scale the external storage for the level of
traffic experienced. The storage schema is simple however and well known approaches
should be effective.
## Security considerations
The goal of Tokenization is to let end users' devices store the token rather than
their sensitive values (such as credit card numbers) and still participate in
transactions where the token is a stand-in for the sensitive value. For this reason
the token Vault generates is completely unrelated (e.g. irreversible) to the
sensitive value.
Furthermore, the Tokenization transform is designed to resist a number of attacks
on the values produced during encode. In particular it is designed so that
attackers cannot recover plaintext even if they steal the tokenization values
from Vault itself. In the default mapping mode,
even stealing the underlying transform key does not allow them to recover
the plaintext without also possessing the encoded token. An attacker must have
gotten access to all values in the construct.
In the `exportable` mapping mode however, the plaintext values are encrypted
in a way that can be decrypted within Vault. If the attacker possesses the
transform key and the tokenization mapping values, the plaintext can be
recovered. This mode is available for the case where operators prioritize the
ability to export all of the plaintext values in an emergency, via the
`export-decoded` operation.
### Metadata
Since tokenization isn't format preserving and requires storage, one can associate
arbitrary metadata with a token. Metadata is considered less sensitive than the
original plaintext value. As it has it's own retrieval endpoint, operators can
configure policies that may allow access to the metadata of a token but not
its decoded value to enable workflows that operate just on the metadata.
## TTLs and tidying
By default, tokens are long lived, and the storage for them will be maintained
indefinitely. Where there is a concept of time-to-live, it is strongly encouraged
that the tokens be generated with a TTL. For example, as credit cards
have an expiration date, it is recommended that tokenizing a credit card
primary account number (PAN) be done with a TTL that corresponds to the time
after which the PAN is invalid.
This allows such values to be _tidied_ and removed from storage once expired.
Tokens themselves encode the expiration time, so decode and other operations
can immediately reject the operation when presented with an expired token.
## Storage
### External SQL stores
Currently the PostgreSQL, MySQL, and MSSQL relational databases are supported as
external storage backends for tokenization.
The [Schema Endpoint](/vault/api-docs/secret/transform#create-update-store-schema)
may be used to initialize and upgrade the necessary database tables. Vault uses
a schema versioning table to determine if it needs to create or modify the
tables when using that endpoint. If you make changes to those tables yourself,
the automatic schema management may become out of sync and may fail in the future.
External stores may often be preferred due to their ability to achieve a much
higher scale of performance, especially when used with batch operations.
### Snapshot/Restore
Snapshot allows one to iteratively retrieve the tokenization state, for
backup or migration purposes. The resulting data can be fed to the restore
endpoint of the same or a different tokenization store. Note that the state
is only useable by the tokenization transform that created it, as state is
encrypted via keys in that configured transform.
### Export decoded
For stores configured with the `exportable` mapping mode, the export decoded
endpoint allows operators to retrieve the _decoded_ contents of tokenization
state, which includes tokens and their decoded, sensitive values. The
`exportable` mode is only recommended if this use case is required, as the default
cannot be decoded by attackers even if they gain access to Vault's storage and
keys.
### Migration
Tokenization stores are configured separately from the tokenization transform,
and the transform can point to multiple stores. The primary use case for this
one-to-many relationship is to facilitate migration between two tokenization
stores.
When multiple stores are configured, Vault writes new tokenization state to all
configured stores, and reads from each store in the order they were configured.
Thus, one can use multiple configured stores along with the snapshot/restore
functionality to perform a zero-downtime migration to a new store:
1. Configure the new tokenization store in the API.
1. Modify the existing tokenization transform to use both the existing and new
store.
1. Snapshot the old store.
1. Restore the snapshot to the new store.
1. Perform any desired validations.
1. Modify the tokenization transform to use only the new store.
## Key management
Tokenization supports key rotation. Keys are tied to transforms, so key
names are the same as the name of the corresponding tokenization transform.
Keys can be rotated to a new version, with backward compatibility for
decoding. Encoding is always performed with the newest key version. Keys versions
can be tidied as well. Keys may also be rotated automatically on a user-defined
time interval, specified by the `auto_rotate_field` of the key config. For more
information, see the [transform api docs](/vault/api-docs/secret/transform).
## Tutorial
Refer to [Tokenize Data with Transform Secrets
Engine](/vault/tutorials/adp/tokenization) for a
step-by-step tutorial.