This commit surfaces a public API for streaming YAML events from a read
source. It provides callers with an Events{} type that can be generated
from any reader::Read implementation -- so, for the moment, OwnedReader(s)
and BorrowReader(s) -- via the module functions from_reader() and
from_reader_with(). This type implements IntoIterator, so it can be
integrated into any iterator-based flow and benefits from the extensive
ecosystem around iterators.
That said, I expect this to be a relatively unused part of this library
in the long term, as it is the lowest level public API the library
exposes.
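In use, that might look roughly like the following. This is a minimal
sketch: Events, from_reader(), and the IntoIterator behavior are named
above, while the module path and the iterator's item shape are
assumptions.
```
// Module path assumed; from_reader() and Events come from this commit.
use lib::event::from_reader;

fn dump(yaml: &str) {
    // A &str source is backed by a BorrowReader; the returned Events
    // implements IntoIterator, so plain for-loops and iterator
    // adaptors both work.
    for event in from_reader(yaml) {
        println!("{event:?}");
    }
}
```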
These define the configuration that library users are allowed to set
when iterating over Events.
It currently has only one meaningful option, O_LAZY, which reflects the
lazy scanning behavior exposed by lib/scanner. This will likely change
in the future, if more customization is desired when working with Event
streams.
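A sketch of the difference in use -- the exact signature of
from_reader_with() and the flag type are assumptions here; the names
come from this commit:
```
// Hypothetical call shape: source plus a flag set.
fn dump_lazy(yaml: &str) {
    // O_LAZY defers scalar processing, mirroring lib/scanner's
    // behavior; from_reader() without flags would use the defaults.
    for event in from_reader_with(yaml, O_LAZY) {
        println!("{event:?}");
    }
}
```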
- set default rust version
instead of a folder override, as we always expect to use the provided
version globally per run.
- explicitly declare extra rustup components
rather than relying implicitly on the current defaults
These macros make up the test harness used by module tests. They allow
us to declare a set of tokens! which will be matched against the expected
events! that those tokens should produce.
The others simplify declaring some of the more deeply nested event
structures.
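For flavor, here is a sketch of how a test built on them might read.
The macro names come from this commit, but the macro arms, the
Token/Event shapes, and the parse() helper are all invented for
illustration:
```
// Hypothetical usage only; the real macro grammar is not shown here.
#[test]
fn empty_stream() {
    let tokens = tokens![Token::StreamStart, Token::StreamEnd];

    // Note the implicit document pair: the parser guarantees at least
    // one even when no document tokens exist in the stream.
    let expected = events![
        Event::StreamStart,
        Event::DocumentStart { implicit: true },
        Event::DocumentEnd { implicit: true },
        Event::StreamEnd,
    ];

    assert_eq!(parse(tokens), expected);
}
```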
- flow_sequence_entry_mapping_key
- flow_sequence_entry_mapping_value
- flow_sequence_entry_mapping_end
These are special cased due to how some of the implied values can pop
up, and because we need far fewer rules than in the transition from
block_{sequence,mapping}->flow_mapping.
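Concretely -- event names from this commit, exact payload representation
assumed:
```
// `[a: b]` is a flow sequence whose single entry is an implied
// single-pair mapping -- no `{}` ever appears in the source.
const YAML: &str = "[a: b]";
// Expected event shape:
//   SequenceStart,
//     MappingStart, Scalar("a"), Scalar("b"), MappingEnd,
//   SequenceEnd
```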
- block_sequence_entry
- block_mapping_key
- block_mapping_value
- flow_sequence_entry
- flow_mapping_key
- flow_mapping_value
These were mostly straightforward; the only tricky bit is handling all
the cases in which YAML allows a (scalar) node to be "implied".
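For instance -- event names from this commit; how the implied null is
represented is an assumption:
```
// `key:` carries no value in the byte stream, yet YAML implies an
// empty (null) scalar for it, which the states above must synthesize.
const YAML: &str = "key:";
// Expected event shape:
//   MappingStart, Scalar("key"), Scalar(<implied null>), MappingEnd
```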
- document_start
- document_end
- explicit_document_content
Note that we guarantee at least one (DocumentStart, DocumentEnd) event
pair in the event stream, regardless of whether these tokens exist or
not.
We also guarantee that each DocumentStart _will_ eventually have a
matching DocumentEnd, again regardless of whether one exists in the
token stream.
This isn't explicitly required by the YAML spec, but it makes usage of
the Parser more pleasant for callers, as all "indentation" events --
documents, sequences, mappings -- have a guaranteed start and end event,
without the caller needing to infer this behavior from the stream
itself.
If the caller is interested, each DocumentStart and DocumentEnd event
records whether it was implicit (missing from the byte stream), or not.
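Sketched with the event names above (payload details assumed):
```
// A bare document -- no `---` or `...` markers in the bytes -- still
// produces the pair, with both events flagged as implicit.
const YAML: &str = "hello";
// Expected event shape:
//   StreamStart,
//     DocumentStart (implicit), Scalar("hello"), DocumentEnd (implicit),
//   StreamEnd
```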
The most notable of the types included in this commit is EventData. Its
parent, Event, is a small wrapper with some additional stream information
encoded -- the approximate start and end bytes covered.
EventData has 10 variants:
1. StreamStart
2. StreamEnd
3. DocumentStart
4. DocumentEnd
5. Alias
6. Scalar
7. MappingStart
8. MappingEnd
9. SequenceStart
10. SequenceEnd
Combined, they allow us to express a stream of YAML in an iterative
event model that should hopefully be easy to consume (at least compared
to YAML proper).
Expressed in pseudo Backus-Naur form, this is the expected form of any
given event stream:
=== Event Stream ===
stream := StreamStart document+ StreamEnd
document := DocumentStart content? DocumentEnd
content := Scalar | collection
collection := sequence | mapping
sequence := SequenceStart node* SequenceEnd
mapping := MappingStart (node node)* MappingEnd
node := Alias | content
=== Syntax ===
? => 0 or 1 of prefix
* => 0 or more of prefix
+ => 1 or more of prefix
() => production grouping
| => production logical OR
=== End ===
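Tying the variants and the grammar together, here is a minimal,
self-contained sketch. The variant names come from the list above; the
payloads and `implicit` flags are assumptions for illustration, not the
library's real definitions.
```
#[allow(dead_code)]
#[derive(Debug)]
enum EventData<'a> {
    StreamStart,
    StreamEnd,
    DocumentStart { implicit: bool },
    DocumentEnd { implicit: bool },
    Alias,
    Scalar(&'a str),
    MappingStart,
    MappingEnd,
    SequenceStart,
    SequenceEnd,
}

fn main() {
    use EventData::*;

    // stream   := StreamStart document+ StreamEnd
    // document := DocumentStart content? DocumentEnd
    // content  := a mapping here: MappingStart (node node)* MappingEnd
    //
    // The stream below is what the grammar admits for `key: value`.
    let stream = [
        StreamStart,
        DocumentStart { implicit: true },
        MappingStart,
        Scalar("key"),
        Scalar("value"),
        MappingEnd,
        DocumentEnd { implicit: true },
        StreamEnd,
    ];

    for event in &stream {
        println!("{event:?}");
    }
}
```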
This module will house the first, lowest level public API of this
library, eventually exposing a structure that allows callers to consume
high level YAML 'Events', likely with an Iterator interface.
This is an implementation of "Stacked Borrows" wherein memory is
allocated in chunks; once a chunk's capacity is reached, a new chunk is
allocated and the old one's stack state (cap,len,ptr) is moved into
the tail.
Any Read implementation must uphold the contract:
(&'de self) -> Tokens<'de>
That is, any borrows into the backing bytes that are given out must
remain valid: the backing bytes must not be mutated in any way.
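A toy mirror of that contract's shape (not the library's actual trait):
```
// Tying Tokens<'de> to &'de self shared-borrows the backing bytes for
// as long as any token lives, so they cannot be mutated underneath it.
struct Tokens<'de> {
    bytes: &'de [u8],
}

trait Read {
    fn tokens<'de>(&'de self) -> Tokens<'de>;
}
```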
For an existing borrow (e.g. &str) this is trivially possible; however,
things get much more complicated when dealing with an owned source that
might not be complete -- a `std::io::Read` object, for example.
While we could simply read the entire thing first and then borrow from
the complete byte stream, this is less than ideal, particularly for
Serde implementations, as an owned source will only provide a
DeserializeOwned implementation, consequently copying data. It also
arbitrarily limits stream processing of YAML to the total size of the
stream, rather than the data actually stored -- e.g.
sum(SCALAR.len()) + count(SCALAR) -- which is a strong limitation, given
YAML's natural stream processing capabilities.
To overcome this limitation, I've decided to introduce a "Stacked Borrow"
pattern with the use of a little unsafe.
```
; A rust vector is just a capacity, length and ptr to somewhere in the
; heap
VEC := (cap,len,ptr)
; Each OwnedReader keeps two VECs, one for bytes (u8) and another for
; VECs of bytes
OwnedReader := {
    head: (cap, len, ptr)
    tail: (cap, len, ptr)
}

; Demonstration of the various memory segments stored on the program's
; heap, and how the OwnedReader's ptrs connect
HEAP := {
    head.ptr->[u8..]
    tail.ptr->[VEC..]
    tail[0].ptr->[u8..]
    tail[n].ptr->[u8..]
}
```
The OwnedReader makes a promise to NEVER call realloc on an existing
heap segment; therefore any references given out to heap segments are
immutable, fulfilling the contract required by the Parser (and Scanner).
Instead, if/when more of the byte stream is requested, it will allocate
a new .head and swap the old .head onto the .tail stack, thus keeping
the memory live.
Notably, this process hasn't described how to determine if any .tail
segments are no longer needed and unload them. Mostly because I haven't
figured that part out completely yet. Probably keeping track of the
lowest borrowed segment somehow and running reconciliation periodically.
But it _is_ possible using this strategy.
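A safe-Rust sketch of just the chunk rollover described above -- names
and sizes are hypothetical, and the real OwnedReader additionally needs
the "little unsafe" mentioned earlier to hand out borrows that survive
later pushes:
```
// Sketch only: demonstrates the never-realloc-in-place discipline. The
// heap data behind a retired chunk never moves -- only its
// (cap,len,ptr) triple is pushed onto the tail -- so slices into it
// remain stable.
struct OwnedChunks {
    head: Vec<u8>,      // currently-filling chunk, never grown in place
    tail: Vec<Vec<u8>>, // retired chunks, kept alive for old borrows
    chunk_size: usize,
}

impl OwnedChunks {
    fn new(chunk_size: usize) -> Self {
        Self {
            head: Vec::with_capacity(chunk_size),
            tail: Vec::new(),
            chunk_size,
        }
    }

    // Append bytes, rolling over to a fresh head chunk -- and retiring
    // the old one -- whenever the current head would need to realloc.
    fn push(&mut self, bytes: &[u8]) {
        if self.head.len() + bytes.len() > self.head.capacity() {
            let fresh = Vec::with_capacity(self.chunk_size.max(bytes.len()));
            let retired = std::mem::replace(&mut self.head, fresh);
            self.tail.push(retired);
        }
        // Within capacity, so this never reallocates the head's buffer.
        self.head.extend_from_slice(bytes);
    }
}
```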
Each of these will likely appear in parts of the public API, even if
they aren't directly used.
It's likely these will be "public but unreachable" -- e.g. a public type
in a private module.
This will likely be revisited on the way to a stable 1.0 library
version, but works for now.
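That pattern, sketched with hypothetical names:
```
// "Public but unreachable": Token is pub, but lives in a private
// module, so downstream crates see it in signatures and can hold one,
// yet cannot import or construct it by path.
mod private {
    pub struct Token;
}

pub fn next_token() -> private::Token {
    private::Token
}
```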
When checking for a terminating sequence in plain scalars, we either
need a flow indicator (in flow contexts only), or a ': ' byte sequence,
where the space can be any valid YAML whitespace.
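That check, in isolation, looks roughly like this -- the byte classes
are assumptions, and the scanner's real code differs:
```
// Sketch of the terminator test described above. In flow contexts any
// flow indicator ends a plain scalar; in any context a ':' followed by
// YAML whitespace (or end of input) does.
fn ends_plain_scalar(in_flow: bool, cur: u8, next: Option<u8>) -> bool {
    let flow_indicator = matches!(cur, b',' | b'[' | b']' | b'{' | b'}');
    let ws = |b: u8| matches!(b, b' ' | b'\t' | b'\r' | b'\n');

    (in_flow && flow_indicator) || (cur == b':' && next.map_or(true, ws))
}
```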
The issue here is that the lazy variant was correctly identifying the
terminating sequence, _but not recording it_ for the Deferred's slice.
This commit fixes that, ensuring we always record the final 1 or 2 bytes
before exiting the main loop.
To remain consistent with scanner/tests, also derive the base TEST_FLAGS
from scanner/tests.TEST_FLAGS, minus options that do not make sense for
this test battery (O_EXTENDABLE).