mirror of https://github.com/google/snappy.git
Add a framing format description. We do not have any implementation of this at
the current point, but there seems to be enough of a general interest in the
topic (cf. public bug #34).

R=csilvers,sanjay

git-svn-id: https://snappy.googlecode.com/svn/trunk@55 03e5f5b5-db94-4691-08a0-1a8bf15f6143
parent d7eb2dc413
commit 0755c81519
@@ -17,7 +17,7 @@ TESTS = snappy_unittest
 noinst_PROGRAMS = $(TESTS)
 
 EXTRA_DIST = autogen.sh testdata/alice29.txt testdata/asyoulik.txt testdata/baddata1.snappy testdata/baddata2.snappy testdata/baddata3.snappy testdata/cp.html testdata/fields.c testdata/geo.protodata testdata/grammar.lsp testdata/house.jpg testdata/html testdata/html_x_4 testdata/kennedy.xls testdata/kppkn.gtb testdata/lcet10.txt testdata/mapreduce-osdi-1.pdf testdata/plrabn12.txt testdata/ptt5 testdata/sum testdata/urls.10K testdata/xargs.1
-dist_doc_DATA = ChangeLog COPYING INSTALL NEWS README format_description.txt
+dist_doc_DATA = ChangeLog COPYING INSTALL NEWS README format_description.txt framing_format.txt
 
 libtool: $(LIBTOOL_DEPS)
 	$(SHELL) ./config.status --recheck
@@ -0,0 +1,124 @@
Snappy framing format description
Last revised: 2011-12-15

This document describes a framing format for Snappy, allowing compressing to
files or streams that can then more easily be decompressed without having
to hold the entire stream in memory. It also provides data checksums to
help verify integrity. It does not provide metadata checksums, so it does
not protect against all forms of truncation, for example.

Implementation of the framing format is optional for Snappy compressors and
decompressors; it is not part of the Snappy core specification.


1. General structure

The file consists solely of chunks, lying back-to-back with no padding
in between. Each chunk consists first of a single byte of chunk identifier,
then a two-byte little-endian length of the chunk in bytes (from 0 to 65535,
inclusive), and then the data, if any. The three bytes of chunk header are
not counted in the data length.

The different chunk types are listed below. The first chunk must always
be the stream identifier chunk (see section 4.1, below). The stream
ends when the file ends; there is no explicit end-of-file marker.
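
As an illustration of the header layout described above, the following is a
minimal sketch of a header parser. The struct and function names are
hypothetical and not part of this format description.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper type; the names are illustrative only. */
struct chunk_header {
  uint8_t type;     /* chunk identifier byte */
  uint16_t length;  /* data length, excluding the three header bytes */
};

/* Parses the three-byte chunk header at buf.
   Returns 0 on success, or -1 if fewer than three bytes are available. */
int parse_chunk_header(const uint8_t *buf, size_t avail,
                       struct chunk_header *out) {
  if (avail < 3) return -1;
  out->type = buf[0];
  /* Two-byte little-endian length: low byte first. */
  out->length = (uint16_t)(buf[1] | (buf[2] << 8));
  return 0;
}
```

A reader would call this at the start of each chunk, then consume exactly
`length` further bytes before parsing the next header.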


2. File type identification

The following identifiers for this format are recommended where appropriate.
However, note that none have been registered officially, so this is only to
be taken as a guideline. We use "Snappy framed" to distinguish between this
format and raw Snappy data.

  File extension:         .sz
  MIME type:              application/x-snappy-framed
  HTTP Content-Encoding:  x-snappy-framed


3. Checksum format

Some chunks have data protected by a checksum (the ones that do will say so
explicitly). The checksums are always masked CRC-32Cs.

A description of CRC-32C can be found in RFC 3720, section 12.1, with
examples in section B.4.

Checksums are not stored directly, but masked, because computing a CRC over
data that contains its own CRC can be problematic. The masking is the same
as used in Apache Hadoop: Rotate the checksum right by 15 bits, then add the
constant 0xa282ead8 (using wraparound as normal for unsigned integers). This
is equivalent to the following C code:

  #include <stdint.h>

  /* Rotate right by 15 bits, then add the constant modulo 2^32. */
  uint32_t mask_checksum(uint32_t x) {
    return ((x >> 15) | (x << 17)) + 0xa282ead8;
  }

Note that the masking is reversible.
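
To make the reversibility concrete, here is a sketch of the inverse
operation: subtract the constant, then rotate left by 15 bits. The name
unmask_checksum is an assumption for illustration; mask_checksum is repeated
from the text so that the sketch compiles on its own.

```c
#include <stdint.h>

/* Masking as given in the text: rotate right by 15, then add the constant. */
uint32_t mask_checksum(uint32_t x) {
  return ((x >> 15) | (x << 17)) + 0xa282ead8u;
}

/* Hypothetical inverse, not named in the format description. */
uint32_t unmask_checksum(uint32_t masked) {
  uint32_t rot = masked - 0xa282ead8u;  /* undo the addition (mod 2^32) */
  return (rot << 15) | (rot >> 17);     /* rotate left by 15 bits */
}
```

A decoder does not strictly need the inverse: it can instead mask the
CRC-32C it computed and compare that against the stored value.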

The checksum is always stored as a four-byte integer, in little-endian
byte order.
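
For readers without a CRC-32C routine at hand, the following is a slow but
simple bit-at-a-time sketch, together with a helper that stores a 32-bit
value in little-endian order. It assumes the usual reflected form of the
Castagnoli polynomial (0x82f63b78) with an all-ones initial value and final
inversion; both function names are hypothetical. Production code would
normally use a table-driven or hardware-assisted implementation instead.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82f63b78. */
uint32_t crc32c(const uint8_t *data, size_t len) {
  uint32_t crc = 0xffffffffu;  /* initial value: all ones */
  for (size_t i = 0; i < len; i++) {
    crc ^= data[i];
    for (int bit = 0; bit < 8; bit++) {
      /* Shift right; XOR in the polynomial if the low bit was set. */
      crc = (crc >> 1) ^ (0x82f63b78u & (0u - (crc & 1u)));
    }
  }
  return crc ^ 0xffffffffu;    /* final inversion */
}

/* Stores v into out[0..3], least significant byte first. */
void store_le32(uint8_t out[4], uint32_t v) {
  out[0] = (uint8_t)(v & 0xff);
  out[1] = (uint8_t)((v >> 8) & 0xff);
  out[2] = (uint8_t)((v >> 16) & 0xff);
  out[3] = (uint8_t)((v >> 24) & 0xff);
}
```

The value written to a chunk is the masked form of the CRC-32C, stored with
a helper such as store_le32.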


4. Chunk types

The currently supported chunk types are described below. The list may
be extended in the future.


4.1. Stream identifier (chunk type 0xff)

The stream identifier is always the first element in the stream.
Its data is exactly six bytes long and contains "sNaPpY" in ASCII. This
means that a valid Snappy framed stream always starts with the bytes

  0xff 0x06 0x00 0x73 0x4e 0x61 0x50 0x70 0x59

The stream identifier chunk may also appear again later in the stream;
if such a chunk shows up, it should simply be ignored, assuming
it has the right length and contents. This allows for easy concatenation of
compressed files without the need for re-framing.
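
A reader can check for the nine leading bytes with a plain comparison. The
following is a minimal sketch; the constant and function names are
illustrative assumptions.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Chunk type 0xff, length 6 (little-endian), then "sNaPpY" in ASCII. */
static const uint8_t kStreamIdentifier[9] = {
  0xff, 0x06, 0x00, 0x73, 0x4e, 0x61, 0x50, 0x70, 0x59
};

/* Returns 1 if buf begins with the stream identifier chunk, else 0. */
int has_stream_identifier(const uint8_t *buf, size_t len) {
  return len >= sizeof kStreamIdentifier &&
         memcmp(buf, kStreamIdentifier, sizeof kStreamIdentifier) == 0;
}
```

The same comparison can be applied to a stream identifier chunk found later
in the stream before it is ignored.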


4.2. Compressed data (chunk type 0x00)

Compressed data chunks contain a normal Snappy compressed bitstream;
see the compressed format specification. The compressed data is preceded by
the masked CRC-32C (see section 3) of the _uncompressed_ data.

Note that the data portion of the chunk, i.e., the compressed contents,
can be at most 65531 bytes (2^16 - 1, minus the four-byte checksum).
However, we place an additional restriction that the uncompressed data
in a chunk must be no longer than 32768 bytes. This allows consumers to
easily use small fixed-size buffers.
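
The layout of a compressed data chunk can be sketched as a function that
splits the chunk data into the stored checksum and the Snappy payload. The
type and function names are hypothetical; decompression itself, the check
of the 32768-byte limit on the decoded output, and the checksum comparison
are left to the caller.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical view of a compressed data chunk's contents. */
struct compressed_chunk_view {
  uint32_t masked_crc;     /* stored (masked) CRC-32C of the uncompressed data */
  const uint8_t *payload;  /* Snappy compressed bytes */
  size_t payload_len;      /* at most 65531 */
};

/* data points just past the three-byte chunk header; chunk_len is the length
   taken from that header. Returns 0 on success, -1 on a malformed chunk. */
int split_compressed_chunk(const uint8_t *data, size_t chunk_len,
                           struct compressed_chunk_view *out) {
  if (chunk_len < 4 || chunk_len > 65535) return -1;
  /* Four-byte little-endian checksum comes first. */
  out->masked_crc = (uint32_t)data[0] | ((uint32_t)data[1] << 8) |
                    ((uint32_t)data[2] << 16) | ((uint32_t)data[3] << 24);
  out->payload = data + 4;
  out->payload_len = chunk_len - 4;
  return 0;
}
```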


4.3. Uncompressed data (chunk type 0x01)

Uncompressed data chunks allow a compressor to send uncompressed,
raw data; this is useful if, for instance, incompressible or
near-incompressible data is detected, and faster decompression is desired.

As in the compressed chunks, the data is preceded by its own masked
CRC-32C (see section 3).

An uncompressed data chunk, like compressed data chunks, should contain
no more than 32768 data bytes, so the maximum legal chunk length with the
checksum is 32772.
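
Putting the pieces together, a writer for uncompressed data chunks might
look like the sketch below. The function name is hypothetical, and the
caller is assumed to pass in the already masked CRC-32C of the data
(section 3), so that the sketch stays independent of any particular CRC
implementation.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper. Writes one uncompressed data chunk (type 0x01) into
   out and returns the number of bytes written, or 0 if the input exceeds
   32768 bytes or out is too small. */
size_t write_uncompressed_chunk(uint8_t *out, size_t out_cap,
                                const uint8_t *data, size_t len,
                                uint32_t masked_crc) {
  if (len > 32768) return 0;               /* limit from section 4.3 */
  size_t chunk_len = len + 4;              /* checksum plus data, <= 32772 */
  if (out_cap < 3 + chunk_len) return 0;

  out[0] = 0x01;                           /* chunk type */
  out[1] = (uint8_t)(chunk_len & 0xff);    /* length, little-endian */
  out[2] = (uint8_t)(chunk_len >> 8);
  out[3] = (uint8_t)(masked_crc & 0xff);   /* checksum, little-endian */
  out[4] = (uint8_t)((masked_crc >> 8) & 0xff);
  out[5] = (uint8_t)((masked_crc >> 16) & 0xff);
  out[6] = (uint8_t)((masked_crc >> 24) & 0xff);
  if (len > 0) memcpy(out + 7, data, len); /* raw data follows the checksum */
  return 3 + chunk_len;
}
```

Inputs longer than 32768 bytes would be split by the caller into several
chunks, each with its own checksum.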


4.4. Reserved unskippable chunks (chunk types 0x02-0x7f)

These are reserved for future expansion. A decoder that sees such a chunk
should immediately return an error, as it must assume it cannot decode the
stream correctly.

Future versions of this specification may define meanings for these chunks.


4.5. Reserved skippable chunks (chunk types 0x80-0xfe)

These are also reserved for future expansion, but unlike the chunks
described in 4.4, a decoder seeing these must skip them and continue
decoding.

Future versions of this specification may define meanings for these chunks.
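
The handling rules in sections 4.1 through 4.5 can be summarized as a
dispatch on the chunk type byte. The enum and function below are a sketch
with hypothetical names, not part of this format description.

```c
#include <stdint.h>

/* Hypothetical classification of what a decoder does with a chunk. */
enum chunk_action {
  CHUNK_COMPRESSED,         /* 0x00: decompress and verify checksum */
  CHUNK_UNCOMPRESSED,       /* 0x01: copy and verify checksum */
  CHUNK_STREAM_IDENTIFIER,  /* 0xff: validate contents, then ignore */
  CHUNK_SKIP,               /* 0x80-0xfe: reserved, skippable */
  CHUNK_ERROR               /* 0x02-0x7f: reserved, unskippable */
};

enum chunk_action classify_chunk(uint8_t type) {
  if (type == 0x00) return CHUNK_COMPRESSED;
  if (type == 0x01) return CHUNK_UNCOMPRESSED;
  if (type == 0xff) return CHUNK_STREAM_IDENTIFIER;
  if (type >= 0x80) return CHUNK_SKIP;  /* 0xff was handled above */
  return CHUNK_ERROR;                   /* remaining range is 0x02-0x7f */
}
```

For CHUNK_SKIP, the decoder advances past the chunk using the length from
the chunk header; for CHUNK_ERROR, it stops decoding.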