diff --git a/Makefile.am b/Makefile.am index 2fa36fd..f17b2e3 100644 --- a/Makefile.am +++ b/Makefile.am @@ -17,7 +17,7 @@ TESTS = snappy_unittest noinst_PROGRAMS = $(TESTS) EXTRA_DIST = autogen.sh testdata/alice29.txt testdata/asyoulik.txt testdata/baddata1.snappy testdata/baddata2.snappy testdata/baddata3.snappy testdata/cp.html testdata/fields.c testdata/geo.protodata testdata/grammar.lsp testdata/house.jpg testdata/html testdata/html_x_4 testdata/kennedy.xls testdata/kppkn.gtb testdata/lcet10.txt testdata/mapreduce-osdi-1.pdf testdata/plrabn12.txt testdata/ptt5 testdata/sum testdata/urls.10K testdata/xargs.1 -dist_doc_DATA = ChangeLog COPYING INSTALL NEWS README format_description.txt +dist_doc_DATA = ChangeLog COPYING INSTALL NEWS README format_description.txt framing_format.txt libtool: $(LIBTOOL_DEPS) $(SHELL) ./config.status --recheck diff --git a/framing_format.txt b/framing_format.txt new file mode 100644 index 0000000..08fda03 --- /dev/null +++ b/framing_format.txt @@ -0,0 +1,124 @@ +Snappy framing format description +Last revised: 2011-12-15 + +This format decribes a framing format for Snappy, allowing compressing to +files or streams that can then more easily be decompressed without having +to hold the entire stream in memory. It also provides data checksums to +help verify integrity. It does not provide metadata checksums, so it does +not protect against e.g. all forms of truncations. + +Implementation of the framing format is optional for Snappy compressors and +decompressor; it is not part of the Snappy core specification. + + +1. General structure + +The file consists solely of chunks, lying back-to-back with no padding +in between. Each chunk consists first a single byte of chunk identifier, +then a two-byte little-endian length of the chunk in bytes (from 0 to 65535, +inclusive), and then the data if any. The three bytes of chunk header is not +counted in the data length. + +The different chunk types are listed below. The first chunk must always +be the stream identifier chunk (see section 4.1, below). The stream +ends when the file ends -- there is no explicit end-of-file marker. + + +2. File type identification + +The following identifiers for this format are recommended where appropriate. +However, note that none have been registered officially, so this is only to +be taken as a guideline. We use "Snappy framed" to distinguish between this +format and raw Snappy data. + + File extension: .sz + MIME type: application/x-snappy-framed + HTTP Content-Encoding: x-snappy-framed + + +3. Checksum format + +Some chunks have data protected by a checksum (the ones that do will say so +explicitly). The checksums are always masked CRC-32Cs. + +A description of CRC-32C can be found in RFC 3720, section 12.1, with +examples in section B.4. + +Checksums are not stored directly, but masked, as checksumming data and +then its own checksum can be problematic. The masking is the same as used +in Apache Hadoop: Rotate the checksum by 15 bits, then add the constant +0xa282ead8 (using wraparound as normal for unsigned integers). This is +equivalent to the following C code: + + uint32_t mask_checksum(uint32_t x) { + return ((x >> 15) | (x << 17)) + 0xa282ead8; + } + +Note that the masking is reversible. + +The checksum is always stored as a four bytes long integer, in little-endian. + + +4. Chunk types + +The currently supported chunk types are described below. The list may +be extended in the future. + + +4.1. Stream identifier (chunk type 0xff) + +The stream identifier is always the first element in the stream. +It is exactly six bytes long and contains "sNaPpY" in ASCII. This means that +a valid Snappy framed stream always starts with the bytes + + 0xff 0x06 0x00 0x73 0x4e 0x61 0x50 0x70 0x59 + +The stream identifier chunk can come multiple times in the stream besides +the first; if such a chunk shows up, it should simply be ignored, assuming +it has the right length and contents. This allows for easy concatenation of +compressed files without the need for re-framing. + + +4.2. Compressed data (chunk type 0x00) + +Compressed data chunks contain a normal Snappy compressed bitstream; +see the compressed format specification. The compressed data is preceded by +the CRC-32C (see section 3) of the _uncompressed_ data. + +Note that the data portion of the chunk, i.e., the compressed contents, +can be at most 65531 bytes (2^16 - 1, minus the checksum). +However, we place an additional restriction that the uncompressed data +in a chunk must be no longer than 32768 bytes. This allows consumers to +easily use small fixed-size buffers. + + +4.3. Uncompressed data (chunk type 0x01) + +Uncompressed data chunks allow a compressor to send uncompressed, +raw data; this is useful if, for instance, uncompressible or +near-incompressible data is detected, and faster decompression is desired. + +As in the compressed chunks, the data is preceded by its own masked +CRC-32C (see section 3). + +An uncompressed data chunk, like compressed data chunks, should contain +no more than 32768 data bytes, so the maximum legal chunk length with the +checksum is 32772. + + +4.4. Reserved unskippable chunks (chunk types 0x02-0x7f) + +These are reserved for future expansion. A decoder that sees such a chunk +should immediately return an error, as it must assume it cannot decode the +stream correctly. + +Future versions of this specification may define meanings for these chunks. + + +4.5. Reserved skippable chunks (chunk types 0x80-0xfe) + +These are also reserved for future expansion, but unlike the chunks +described in 4.4, a decoder seeing these must skip them and continue +decoding. + +Future versions of this specification may define meanings for these chunks.