This commit adds the 3rd of the 5 possible scalar types in YAML to the
scanner. It is compliant with the YAML spec, _except_ for its handling
of "JSON-like" keys, which allow the following value token (e.g. ':')
to _not_ be followed by whitespace.
I frankly find this exception absurd, as the spec _clearly_ half-assed
this in so that it could declare YAML a "strict superset of JSON",
never mind that _a lot_ of the semantics of _every_ other context
for keys rely on a key being followed by whitespace.
I may eventually return to this and add it; I've a pretty good idea how --
we just need to keep track of the "last" token produced, as only the
? ' " ] } characters would modify the behavior, but I'd need to
make sure I haven't missed any subtle side effects, as almost all other
key handling implicitly relies on: Key token === ": ".
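A rough sketch of that "last token" tracking, covering only the quoted-scalar and flow-collection cases; all the names here are illustrative, not the Scanner's actual types:

```rust
/// Which token the scanner produced last -- only the closers of
/// JSON-like keys (quoted scalars, flow collections) matter here.
#[derive(Clone, Copy, PartialEq)]
enum LastToken {
    FlowScalarEnd(char),     // closing ' or "
    FlowCollectionEnd(char), // closing ] or }
    Other,
}

/// In flow context, a ':' directly after a JSON-like key needs no
/// trailing whitespace (`{"key":value}` is legal); everywhere else,
/// ':' only becomes a Value token when followed by whitespace.
fn value_token_needs_whitespace(last: LastToken) -> bool {
    !matches!(
        last,
        LastToken::FlowScalarEnd('\'' | '"') | LastToken::FlowCollectionEnd(']' | '}')
    )
}

fn main() {
    // `"key":value` is fine in a flow context...
    assert!(!value_token_needs_whitespace(LastToken::FlowScalarEnd('"')));
    // ...but after a plain scalar, ':' must be followed by whitespace.
    assert!(value_token_needs_whitespace(LastToken::Other));
}
```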
Previously, the loop would incorrectly update scalar_stats _after_
reaching a ': ' terminus. This is now fixed, as I check for these
cases before reentering the word loop.
This commit adds the primary driver for scanning plain YAML scalars.
The implementation tries to fit as closely as possible to the YAML
spec, particularly in its handling of (the lack of) spacing
requirements inside flow contexts, comment detection, and the special
casing of '-', '?' and ':' as the first character in flow contexts.
Two things are notably missing:
1. Proper tab ('\t') handling in block context indentation
2. A sane maximum whitespace limit and better handling of whitespace
   storage. Rather than storing every whitespace character given, I
   could instead count the whitespace separated by line breaks, and
   then add it back later, such that the maximum described above would
   apply to total line breaks, with the intervening whitespace stored
   as a u64/usize
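The proposed storage could look roughly like this -- a sketch under the assumption that plain-scalar line folding applies (a single break folds to one space, n > 1 breaks fold to n - 1 newlines); the type and method names are mine:

```rust
/// Run-length record of a stretch of whitespace, instead of buffering
/// every whitespace character individually.
struct WhitespaceRun {
    breaks: u64, // line breaks seen in this run
    spaces: u64, // spaces/tabs seen *after* the last break
}

impl WhitespaceRun {
    fn new() -> Self {
        Self { breaks: 0, spaces: 0 }
    }

    fn push(&mut self, c: char) {
        match c {
            // A break resets the trailing space count: leading
            // whitespace on continuation lines is stripped anyway.
            '\n' => {
                self.breaks += 1;
                self.spaces = 0;
            }
            ' ' | '\t' => self.spaces += 1,
            _ => unreachable!("only whitespace is recorded"),
        }
    }

    /// Rehydrate the run with YAML line folding applied.
    fn fold(&self) -> String {
        match self.breaks {
            0 => " ".repeat(self.spaces as usize),
            1 => " ".to_string(),
            n => "\n".repeat((n - 1) as usize),
        }
    }
}

fn main() {
    let mut run = WhitespaceRun::new();
    run.push(' ');
    run.push(' ');
    assert_eq!(run.fold(), "  "); // no breaks: spaces kept verbatim
    run.push('\n');
    assert_eq!(run.fold(), " "); // one break folds to a space
    run.push('\n');
    assert_eq!(run.fold(), "\n"); // n breaks fold to n - 1 newlines
}
```

A maximum could then be enforced against `breaks` alone, rather than against the total number of stored characters.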
While the previous commit did add support for _adding_ zero indented
sequences to the token stream, it unfortunately relied on the indent
stack flush that happens upon reaching the end of the stream to push
the stored BlockEnd tokens.
This commit adds better support for removing zero indented sequences
from the stack once they are finished.
The heuristic used here is:
A zero_indented BlockSequence starts when:
- The top stored indent is for a BlockMapping
- A BlockEntry occupies the same indentation level
And terminates when:
- The top indent stored is a BlockSequence and is tagged as zero indented
- A BlockEntry _does not_ occupy the same indentation level
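As a minimal sketch of that heuristic (types and names are illustrative, not the actual Scanner's):

```rust
/// Illustrative stand-ins for the Scanner's indent bookkeeping.
enum IndentKind {
    BlockMapping,
    BlockSequence { zero_indented: bool },
}

struct Indent {
    column: usize,
    kind: IndentKind,
}

/// A zero indented BlockSequence starts when the top stored indent is
/// a BlockMapping and a BlockEntry ('-') sits at the same column.
fn starts_zero_indented(top: &Indent, entry_column: usize) -> bool {
    matches!(top.kind, IndentKind::BlockMapping) && top.column == entry_column
}

/// ...and terminates when the top indent is a zero indented
/// BlockSequence and the next BlockEntry does _not_ share its column.
fn ends_zero_indented(top: &Indent, entry_column: usize) -> bool {
    matches!(top.kind, IndentKind::BlockSequence { zero_indented: true })
        && top.column != entry_column
}

fn main() {
    let map = Indent { column: 0, kind: IndentKind::BlockMapping };
    assert!(starts_zero_indented(&map, 0)); // `key:` then `- ...` at column 0
    assert!(!starts_zero_indented(&map, 2)); // normally indented sequence

    let seq = Indent {
        column: 0,
        kind: IndentKind::BlockSequence { zero_indented: true },
    };
    assert!(ends_zero_indented(&seq, 2)); // entry left column 0: close it
    assert!(!ends_zero_indented(&seq, 0)); // still at column 0: continue
}
```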
This fixes the edge case YAML allows where block sequence entries may
be zero indented relative to their parent mapping key, yet still start
a sequence.
E.g. using the following YAML:
key:
- "one"
- "two"
The following tokens would have been produced (before this commit):
StreamStart
BlockMappingStart
Key
Scalar('key')
Value
BlockEntry
Scalar('one')
BlockEntry
Scalar('two')
BlockEnd
StreamEnd
Note the lack of any indication that the values are in a sequence.
Post commit, the following is produced:
StreamStart
BlockMappingStart
Key
Scalar('key')
Value
BlockSequenceStart <--
BlockEntry
Scalar('one')
BlockEntry
Scalar('two')
BlockEnd <--
BlockEnd
StreamEnd
Before, we only checked for the existence of a saved key, but *didn't*
also check that it was still valid / possible.
This led to a subtle error wherein scalars that were no longer valid
keys would still be picked up and used.
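One plausible shape for such a validity check, hinging on the spec's implicit key limits (an implicit key must fit on a single line and within 1024 characters); the struct and field names here are hypothetical, not the Scanner's:

```rust
/// Hypothetical record of where a potential implicit key began.
struct SavedKey {
    mark_offset: usize, // offset of the key candidate's start
    mark_line: usize,   // line it started on
}

/// The YAML spec restricts implicit keys to a single line and at most
/// 1024 characters, so a saved key stops being possible once the
/// scanner has moved past either bound.
fn key_still_possible(key: &SavedKey, offset: usize, line: usize) -> bool {
    line == key.mark_line && offset - key.mark_offset <= 1024
}

fn main() {
    let key = SavedKey { mark_offset: 10, mark_line: 3 };
    assert!(key_still_possible(&key, 500, 3)); // same line, in range
    assert!(!key_still_possible(&key, 500, 4)); // a line break invalidates it
    assert!(!key_still_possible(&key, 2000, 3)); // past the 1024 limit
}
```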
Previously, we checked for the existence of a Value twice: once after
parsing a scalar, and again when actually adding the Value token to
the queue. This change simplifies the flow for scalar tokens and stops
doing unnecessary work.
This commit completely rewrites the key subsystem of the Scanner. Rather
than merely tracking whether a key could be added, Key now manages the
state tracking for potential implicit keys.
A TokenEntry is designed as a wrapper for Tokens returned from the
Scanner, ensuring that they are returned from the Queue in an order that
mirrors where in the buffer the token was read.
This will allow me to push Tokens out of order, particularly when
handling Keys, and still have them returned in the expected order.
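One plausible shape for this wrapper -- the Ord keyed on the read position is the essential bit, everything else (the stand-in Token type, the BinaryHeap) is illustrative:

```rust
use std::cmp::Ordering;
use std::collections::BinaryHeap;

/// A token tagged with the buffer position it was read at, ordered so
/// that a priority queue yields tokens in buffer order even when they
/// were pushed out of order.
#[derive(PartialEq, Eq)]
struct TokenEntry {
    read_at: usize,
    token: &'static str, // stand-in for the real Token enum
}

impl Ord for TokenEntry {
    fn cmp(&self, other: &Self) -> Ordering {
        // Inverted so BinaryHeap (a max-heap) pops the *smallest*
        // buffer offset first.
        other.read_at.cmp(&self.read_at)
    }
}

impl PartialOrd for TokenEntry {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

fn main() {
    let mut queue = BinaryHeap::new();
    // Pushed out of order (e.g. a Key token emitted retroactively)...
    queue.push(TokenEntry { read_at: 8, token: "Value" });
    queue.push(TokenEntry { read_at: 4, token: "Key" });
    // ...but popped in buffer order.
    assert_eq!(queue.pop().unwrap().token, "Key");
    assert_eq!(queue.pop().unwrap().token, "Value");
}
```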