Copyright (C) Dell, Inc. You are permitted to copy and distribute material from this book provided any material you distribute includes this license, the material is not modified, and you do not charge a fee or require any other considerations for copies or for any works that incorporate material from this book. These restrictions do not apply to normal "fair use", defined as cited quotations totaling less than one page.

This book is for the reader who wants to understand how data compression works, or who wants to write data compression software. Prior programming ability and some math skills will be needed. The book is intended to be self contained.

Data compression is the art of reducing the number of bits needed to store or transmit data. Compression can be either lossless or lossy. Losslessly compressed data can be decompressed to exactly its original value. An example is Morse Code, in which each letter of the alphabet is coded as a sequence of dots and dashes. The most common letters in English, like E and T, receive the shortest codes. The least common, like J, Q, X, and Z, are assigned the longest codes. All data compression algorithms consist of at least a model and a coder, with optional preprocessing transforms. A model estimates the probability distribution (E is more common than Z). The coder assigns shorter codes to the more likely symbols. There are efficient and optimal solutions to the coding problem. However, optimal modeling has been proven not computable. Modeling (or, equivalently, prediction) is both an artificial intelligence (AI) problem and an art.

Lossy compression discards "unimportant" data, for example, details of an image or audio clip that are not perceptible to the eye or ear. An example is the NTSC standard for broadcast color TV, used until 2009. The human eye is less sensitive to fine detail between colors of equal brightness (like red and green) than it is to brightness (black and white). Thus, the color signal is transmitted with less resolution over a narrower frequency band than the monochrome signal. Lossy compression consists of a transform to separate important from unimportant data, followed by lossless compression of the important part and discarding the rest. The transform is an AI problem because it requires understanding what the human brain can and cannot perceive.

Information theory places hard limits on what can and cannot be compressed losslessly: there is no universal compression, and there is no recursive compression of already compressed data. This is proved by the counting argument. Suppose there were a compression algorithm that could compress all strings of at least a certain size, say, n bits. There are exactly 2^n different binary strings of length n. A universal compressor would have to encode each input differently; otherwise, if two inputs compressed to the same output, then the decompresser would not be able to decompress that output correctly. However, there are only 2^n - 1 binary strings shorter than n bits (1 + 2 + 4 + ... + 2^(n-1)), so it is impossible to encode all 2^n inputs to distinct shorter strings; at least one input cannot be compressed. In fact, the vast majority of strings cannot be compressed by very much. The fraction of strings that can be compressed from n bits to m bits is at most 2^(m-n). For example, less than 0.4% of strings can be compressed by one byte. Every compressor that can compress any input must also expand some of its input. However, the expansion never needs to be more than one symbol: any compression algorithm can be modified by adding one bit to indicate that the rest of the data is stored uncompressed. The counting argument also applies to systems that would recursively compress their own output.
In general, compressed data appears random to the algorithm that compressed it, so that it cannot be compressed again. As an example, consider compressing the digits of π. Assume our model is that each digit occurs with probability 0.1, independently of the other digits. A binary coded decimal code uses 4 bits per digit, so the compression ratio is 4 bits per character (4 bpc); the decompresser would decode the data by dividing it into 4 bit strings. A Huffman code for this model assigns 3 bit codes to six of the digits and 4 bit codes to the remaining four. The code is uniquely decodable because no code is a prefix of any other code. The compression ratio is (6*3 + 4*4)/10 = 3.4 bpc. A plain binary code (0, 1, 10, 11, ..., 1001) is not uniquely decodable: for example, 111 could be decoded as 7, or 31, or 13, or 111.

There are better codes than the Huffman code given above. For example, we could assign Huffman codes to pairs of digits. There are 100 pairs, each with probability 0.01. We could assign 6 bit codes to the pairs 00 through 27 and 7 bit codes to 28 through 99. The average code length is (28*6 + 72*7)/100 = 6.72 bits per pair of digits, or 3.36 bpc. Similarly, coding groups of 3 digits using 9 or 10 bits would yield about 3.3 bpc.

Shannon and Weaver proved that the best you can do for a symbol with probability p is to assign a code of length log2 1/p bits. Shannon defined the expected information content or equivocation (now called entropy) of a random variable X as its expected code length. Suppose X may have values X1, X2, ..., and that each Xi has probability p(i). Then the entropy of X is H(X) = E[log2 1/p(X)] = Σi p(i) log2 1/p(i). For the digit model above, the entropy is 10 × (0.1 log2 10) = log2 10 ≈ 3.32 bpc. There is no smaller code for this model that could be decoded unambiguously. The information content of a set of strings is at most the sum of the information content of the individual strings: H(X, Y) ≤ H(X) + H(Y). If the two sides are equal, then X and Y are independent; knowing one string would tell you nothing about the other.

The entropy of an information source is the expected limit to which you can compress it. There are efficient coding methods, such as arithmetic codes, which are for all practical purposes optimal in this sense. It should be emphasized, however, that entropy can only be calculated for a known probability distribution, and in general the model is not known. It is also possible to use a better model: the digits of π are not really random; they are only unknown until you compute them.

The counting argument says that most strings are not compressible. So it is a rather remarkable fact that most strings that we care about, for example English text, images, software, sensor readings, and DNA, are in fact compressible. These strings generally have short descriptions, whether they are described in English or as a program in C or x86 machine code. Solomonoff, Kolmogorov, and Chaitin independently proposed a universal a-priori probability distribution over strings based on their minimum description length. The algorithmic probability of a string x is defined as the fraction of random programs in some language L that output x, where each program M is weighted by 2^-|M| and |M| is the length of M in bits. This probability is dominated by the shortest such program. We call this length the Kolmogorov complexity K_L(x) of x.

Algorithmic probability and complexity of a string x depend on the choice of language L, but only by a constant that is independent of x. Suppose that M1 and M2 are encodings of x in languages L1 and L2 respectively. If L2 is English, then M2 would be a description of x with just enough detail that it would allow you to write x exactly. Now, for any pair of languages, it is possible to write in one language a compiler or interpreter or rules for understanding the other language. The size of the language description or compiler does not depend on x in any way. Then for any description M1 in any language L1 of any x, it is possible to find a description M2 in any other language L2 of x by appending to M1 a fixed length description of L1 written in L2.
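To make the Shannon limit concrete, here is a minimal sketch (my own illustration, not code from this book) that computes the entropy of the uniform digit model and compares it with the 3.4 bpc average length of a whole-bit Huffman code:

// Computes the Shannon limit H = sum p(i) * log2(1/p(i)) for the uniform
// digit model and the average length of a Huffman code rounded to whole bits.
#include <cmath>
#include <cstdio>

int main() {
  const int N = 10;            // alphabet size (decimal digits)
  const double p = 1.0 / N;    // each digit has probability 0.1

  // Shannon limit: log2(10), about 3.32 bits per digit.
  double H = 0;
  for (int i = 0; i < N; ++i) H += p * std::log2(1.0 / p);

  // A Huffman code for 10 equally likely symbols assigns 3 bits to 6 symbols
  // and 4 bits to the remaining 4, for an average of 3.4 bits per digit.
  double huffman = (6 * 3 + 4 * 4) / 10.0;

  std::printf("entropy = %.3f bpc\n", H);        // prints 3.322
  std::printf("huffman = %.3f bpc\n", huffman);  // prints 3.400
  return 0;
}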
It is not proven that algorithmic probability is a true universal prior probability. Nevertheless it is widely accepted on empirical grounds because of its success in sequence prediction and machine learning over a wide range of data types. In machine learning, the minimum description length principle of choosing the simplest hypothesis consistent with the training data applies to a wide range of algorithms.

To summarize, the best compression we can achieve for any string x is to encode it as the shortest program M in some language L that outputs x. Furthermore, the choice of L becomes less important as the strings get longer. All that remains is to find a procedure that finds M for any x in some language L. However, Kolmogorov proved that there is no such procedure in any language. In fact, there is no procedure that just gives you |M|, the length of the shortest possible description of x, for all x. Suppose there were. Then it would be possible to describe "the first string that cannot be described in less than a million bits", leading to the paradox that we had just done so. (By "first", assume an ordering over strings from shortest to longest, breaking ties lexicographically.)

Nor is there a general test for randomness. Suppose there were. Then it would be possible to enumerate all proofs (lists of axioms and applications of the rules of inference) and describe "the string x such that it is the first to be proven to be a million bits long and random", in spite of the fact that we just gave a short description of it. If F describes any formal system, then the longest string that can be proven to be random is never much longer than F, even though there are an infinite number of longer strings and most of them are random. Nor is there a general test to determine whether the compression of a string can be improved any further, because that would be equivalent to testing whether the compressed string is random. Because determining the length of the shortest descriptions of strings is not computable, neither is optimal compression. It is not hard to find difficult cases, for example, compressing encrypted data: to any program that does not know the key, the data looks completely random and incompressible.

Prediction is intuitively related to understanding. If you understand a sequence, then you can predict it. If you understand a language, then you could predict what word might appear next in a paragraph in that language. If you understand an image, then you could predict what might lie under portions of the image that are covered up. Conversely, random data has no meaning and is not predictable. This intuition suggests that prediction or compression could be used to test or measure understanding. We now formalize this idea.

Optimal compression, if it were computable, would optimally solve the artificial intelligence (AI) problem under two vastly different definitions of "intelligence": the Turing test (Turing, 1950) and universal intelligence (Legg and Hutter). Turing first proposed a test for AI to sidestep the philosophically difficult question (which he considered irrelevant) of whether machines could think. This test, now known as the Turing test, is widely accepted. The test is a game played by two humans who have not previously met and the machine under test. One human (the judge) communicates with the other human (the confederate) and with the machine through a terminal. Both the confederate and the machine try to convince the judge that each is human, and the machine passes if the judge cannot reliably tell which is which. Passing the test therefore amounts to a prediction problem: the machine must model how a human would respond in conversation. But if a model could make such predictions accurately, then it could also generate responses indistinguishable from those of a human.
Predicting transcripts is similar to the problem of predicting ordinary written language. In either case it requires vast, real-world knowledge. Shannon estimated that the information content of written, case-insensitive English without punctuation is 0.6 to 1.3 bits per character, based on experiments in which human subjects guessed successive characters in text with the help of letter n-gram frequency tables and dictionaries. The uncertainty is due not so much to variation in subject matter and human skill as to the fact that different probability assignments lead to the same observed guessing sequences. Nevertheless, the best text compressors are only now compressing near the upper end of this range.

Legg and Hutter consider the problem of reward-seeking agents in completely arbitrary environments described by random programs. In this model, an agent communicates with an environment by sending and receiving symbols. The environment also sends a reinforcement or reward signal to the agent. The goal of the agent is to maximize accumulated reward. Universal intelligence is defined as the expected reward over all possible environments, where the probability of each environment described by a program M is algorithmic, proportional to 2^-|M|. Hutter proved that the optimal (but not computable) strategy for the agent is to guess after each input that the distribution over M is dominated by the shortest program consistent with past observation. Hutter calls this strategy AIXI. It is, of course, just our uncomputable compression problem applied to a transcript of past interaction. The best predictor of the future is the simplest or shortest theory that explains the past.

To summarize: there is no such thing as universal compression, recursive compression, or compression of random data. Compression is both an art and an artificial intelligence problem. The key to compression is to understand the data you want to compress.

A data compression benchmark measures compression ratio over a data set, and sometimes memory usage and speed on a particular computer. Some benchmarks evaluate size only, in order to avoid hardware dependencies. Compression ratio is often measured by the size of the compressed output file, or in bits per character (bpc), meaning compressed bits per uncompressed byte. In either case, smaller numbers are better. Generally there is a 3 way trade off between size, speed, and memory usage. The top ranked compressors by size require a lot of computing resources.

The Calgary corpus is the oldest compression benchmark still in use.
It was created in 1987 and described in a survey of text compression models by Bell, Witten and Cleary. It consists of 14 files with a total size of 3,141,622 bytes, as follows:

111,261 BIB - ASCII text in UNIX "refer" format - bibliographic references
768,771 BOOK1 - unformatted ASCII text - Thomas Hardy: Far from the Madding Crowd
610,856 BOOK2 - ASCII text in UNIX "troff" format - Witten: Principles of Computer Speech
102,400 GEO - 32 bit numbers in IBM floating point format - seismic data
377,109 NEWS - ASCII text - USENET batch file on a variety of topics
21,504 OBJ1 - VAX executable program - compilation of PROGP
246,814 OBJ2 - Macintosh executable program - "Knowledge Support System"
53,161 PAPER1 - UNIX "troff" format - Witten, Neal, Cleary: Arithmetic Coding for Data Compression
82,199 PAPER2 - UNIX "troff" format - Witten: Computer (in)security
513,216 PIC - 1728 x 2376 bitmap image (MSB first): text in French and line diagrams
39,611 PROGC - Source code in C - UNIX compress
71,646 PROGL - Source code in Lisp - system software
49,379 PROGP - Source code in Pascal - program to evaluate PPM compression
93,695 TRANS - ASCII and control characters - transcript of a terminal session

The structure of the corpus is shown in the diagram below. Each pixel represents a match between consecutive occurrences of a string. The color of the pixel represents the length of the match: black for 1 byte, red for 2, green for 4, and blue for 8. The horizontal axis represents the position of the second occurrence of the string. The vertical axis represents the distance back to the match on a logarithmic scale. The image was generated by the fv program with labels added by hand.

The horizontal dark line in BOOK1 is the result of linefeed characters repeating at a regular interval (the typical line length). Dark lines in GEO are due to repeated data in block headers, and the dark bands at multiples of 4 are due to the 4 byte structure of the data. PIC has a dark band at 1 due to long runs of zero bytes, and lines at multiples of 216, which is the width of the image in bytes. The blue curves at the top of the image show matches between different text files, namely BIB, BOOK1, BOOK2, NEWS, PAPER1, PAPER2, PROGC, PROGL, PROGP and TRANS, all of which contain English text. Thus, compressing these files together can result in better compression because they contain mutual information. No such matches are seen between GEO, OBJ2, and PIC with other files; these files should be compressed separately. Other structure can be seen. For example, there are dark bands in OBJ2 at 4, 6, 8, and higher even numbers due to the lengths of machine instructions. The dark band in BIB represents the average length of a bibliographic entry.

The following is a sample of the text in BOOK1:

We are brought on to a day in February, on which was held the yearly statute or hiring fair in the county-town of Casterbridge At one end of the street stood from two to three hundred blithe and hearty labourers waiting upon Chance all men of the stamp to whom labour suggests nothing worse than a wrestle with gravitation, and pleasure nothing better than a renunciation of the same among these, carters and waggoners were distinguished by having a piece of whip-cord twisted round their hats;

The tables below show the 10 most frequent n-grams (n byte sequences) in BOOK1 for n from 1 to 5, and the 10 most frequent words, bigrams, and trigrams (2 and 3 word sequences).

The PIC image is represented by scanning in rows from the top left, using a 1 for black or a 0 for white. The bits are packed 8 to a byte, MSB first, with 216 bytes per row. The file has no header.
The file can be viewed in some imaging software as a raw bitmap.

GEO contains seismic data as 32 bit IBM floating point numbers. Each number has a sign bit (1 if negative), a 7 bit exponent, and a 24 bit mantissa representing a fraction between 0 and 1. The number is decoded as sign × mantissa × 16^(exponent - 64). The decoded data is graphed below. Each line represents one data sequence, starting at the top. The file has a repeating structure, with a header (not shown) preceding each sequence. Values have at most 16 bits of precision, so the last 8 bits of all of the mantissa values are 0. The values are further quantized, with the rounding step increasing with magnitude.

Early tests sometimes used an 18 file version of the corpus that included 4 additional papers (PAPER3 through PAPER6). Programs were often ranked by measuring bits per character (bpc) on each file separately and reporting the results individually or taking the average. Simply adding the compressed sizes is called a "weighted average", since it is weighted toward the larger files.

The Calgary corpus is no longer widely used due to its small size. However, it is still used in an ongoing compression challenge run by Leonid A. Broukhis with small cash prizes. The best compression ratios have been established over the years by Malcolm Taylor, Maxim Smirnov, Serge Voskoboynikov, Matt Mahoney, Alexander Rhatushnyak (several times), and Przemyslaw Skibinski. The rules of the Calgary challenge specify that the compressed size include the size of the decompression program, either as a Windows or Linux executable file or as source code. This is to avoid programs that cheat by hiding information from the corpus in the decompression program. Furthermore, the program and compressed files must either be packed in an archive in one of several specified formats, or else 4 bytes plus the length of each file name is added. This is to prevent cheating by hiding information in the file names and sizes. Without such precautions, programs like barf could claim to compress to zero bytes. Earlier submissions were custom variants of compressors by the authors based on PPM algorithms (rk for Taylor, slim for Voskoboynikov, ppmn for Smirnov). Later submissions are variants of the open source paq6 context mixing algorithm; the most recent is based on paq8.

A large number of other compressors have also been tested on the Calgary corpus. Compression times (CT) and decompression times (DT) are process times (not wall times) in seconds, measured on a 2 GHz T3200 under Windows Vista. Underlined values indicate the Pareto frontier, meaning that no other program is both faster and higher ranked. When options are shown, they are usually set for maximum or nearly maximum compression at the expense of speed and memory (up to 1 GB); memory usage is not critical, however, because of the small size of the test data. Some programs are tested again with default options. The decompression program size is not included. The year is for the version tested. Many programs have more than one author. The tested programs cover a wide range of algorithms (LZ, LZW, LZSS, LZP, ROLZ, PPM, DMC, SR, BWT, and context mixing) and include balz, lzturbo, quad, ha, tarsalzp, qazar, dmc, lzuf, lcssr, lzss, ulz, arbc2z, lzc, lzop, zhuff, srank, and compress, by authors including Ilia Muraviev, Hamid Bouzidi, Harri Hirvola, Piotr Tarsa, Denis Kyznetsov, Gordon V. Cormack, Gerald R. Tamayo, Frank Schwellinger, David A. Scott, Nania Francesco Antonio, Markus F. Oberhumer, Yann Collet, Peter Fenwick, and others.
They are described in more detail throughout the rest of this book.

The Large Text Compression Benchmark consists of a single Unicode encoded XML file containing a dump of Wikipedia text from March 3, 2006. Programs, including different versions and options, are evaluated for compressed size (including the decompression program source or executable and any other needed files as a zip archive), speed, and memory usage. The benchmark is open, meaning that anyone can submit results. Programs are ranked by compressed size with options selecting maximum compression where applicable. The best result obtained is 127,784,888 bytes by Dmitry Shkarin for a customized version of durilca using 13 GB memory, run with a size-optimized decompression program on a quad core machine with 16 GB memory under 64 bit Windows XP Pro. The data was preprocessed with a custom dictionary built from the benchmark and encoded with order 40 PPM. By comparison, zip -9 compresses to 322,649,703 bytes and decompresses in 35 seconds using far less memory. The benchmark shows a 3 way trade off between compressed size, speed, and memory usage.

The two graphs below show the Pareto frontier: those compressors for which no other compressor both compresses smaller and faster, or smaller and using less memory. The graphs are somewhat out of date, but the current data shows a similar trend. In particular, no single algorithm (shown in parentheses) is the "best". Note that speed tests may be run on different machines, and that only the options for maximum compression for each program are used. Nevertheless, the general trend remains valid. Individual compressors often have options that allow the user to make the same 3 way trade off.

The reason that the benchmark is so dependent on memory is that there are string matches that span the entire length of the 1 GB file. This can be seen in the analysis below. The blue area at the top right of the image represents matches between the beginning and end of the file. The dark band in the middle of the image is due to several thousand highly similar Wikipedia articles for U.S. cities; this region is highly compressible.

The Hutter prize is based on the first 10^8 bytes (the file enwik8) of the Large Text Compression Benchmark, with similar rules and goals. The best result is 15,949,688 bytes for an archive and a decompresser submitted by Alexander Rhatushnyak on May 23, 2009; decompression was timed on a 2 GHz dual core T3200 under 32 bit Windows Vista. The submission is based on two open source, context mixing programs, paq8hp12 and lpaq9m, with a custom dictionary for preprocessing. By comparison, zip -9 compresses the same data to 36,445,373 bytes.

The maximum compression benchmark has two parts: a set of 10 public files totaling 53 MB, and a larger private collection of files. In the public data set (SFC, or single file compression), each file is compressed separately and the sizes added. Programs are ranked by size only, with options set for best compression individually for each file.
WinRK, another context mixing algorithm, is top ranked on 4 of the files (txt, exe, dll, pdf). WinRK uses a dictionary which is not included in the total size by which programs are ranked. The binary is not available for download.

In the second part (MFC, or multiple file compression), the files are compressed together into a single archive. If a compressor cannot create archives, then the files are collected into an uncompressed archive (TAR or QFC), which is then compressed. In the MFC test, paq8px is top ranked by size. The top ranked programs are archivers that detect file type and apply different algorithms depending on type.

The Generic Compression Benchmark has the goal of evaluating compression algorithms in the context of universal prediction or intelligence, as defined by Legg and Hutter. By this definition, data sources are assumed to have a universal Solomonoff distribution, i.e. to be the outputs of random programs. The purpose of the test is to find good compression algorithms that are not tuned to specific file types. The benchmark does not publish any test data. Rather, it publishes a program to generate the data from a secret seed or an internally hardware generated random number. The data consists of the bit string outputs of one million random Turing machines, truncated and packed into null terminated byte strings. The test allows public verification while eliminating the need to measure the decompresser size, because it is not possible to hide the test data in the decompresser without knowing the cryptographic random number seed. Programs are ranked by the ratio of their compressed output to the compressed output of a reference compressor (ppmonstr) to improve repeatability. Unfortunately the benchmark fails to completely eliminate the problem of tuning compressors to public benchmarks. The top ranked program is a stationary context mixing model configuration implemented in zpaq, using a preprocessor by J. Ondrus that splits each string into an incompressible prefix and a bit repeat instruction. Generally, however, the rank order of compressors is similar to that of other benchmarks.

Compression Ratings by Sami Runsas ranks programs on several gigabytes of various data types from public sources using a score that combines size and speed. The data includes English text, Windows executable code, RGB and grayscale images, CD quality audio, and a mix of data types from two video games. Some of the 10 data sets contain multiple files; single file compressors are tested on an equivalent tar file. Compressed sizes include the size of the decompression program source code or executable program compressed with 7zip. Programs must pass a qualifying round with minimum compression ratio and time requirements on a small data set. The benchmark includes a calculator that allows the user to rank compressors using different weightings for the importance of size, compression speed, and decompression speed. Under the default scale, a small loss in compression ratio, or a doubling of the total time, is worth half the rating. The top ranked programs for the default settings were nanozip followed by freearc, CCM, flashzip, and 7-zip. Runsas is the author of nanozip. There are additional benchmarks that compare BWT based compressors and bzip2-compatible compressors, and additional benchmarks for some special file types.

Squeeze Chart by Stephen Busch ranks programs on mostly private data of various types by size only. There are separate tests for single file compressors and archivers; the top ranked programs in both use context mixing.
UCLC by Johan de Bock contains several benchmarks of public data for compressors with a command line interface (which is most of them). As of this writing, paq8i or paq8p was top ranked by size on most of them. Metacompressor is an automated benchmark that allows developers to submit programs and test files and compare size (excluding the decompresser), compression time, and decompression time.

A study of desktop file systems found that the fraction of disk space in use remained nearly constant from year to year in spite of a doubling of disk capacity almost every year. Average file size also remained nearly constant at about 4 KB, but the number of files increased, to an average of 225,000 files in 36,000 directories. File size varies widely: about half of all data is in files larger than 25 MB, while one fourth is in files smaller than 3 MB and one fourth in files larger than 1 GB. These percentiles are also increasing as the number of larger files grows. Much of this data is duplicated across file systems. If links are allowed to point to duplicates on other file systems, then compression improves by about 3 percentage points for each doubling of the number of computers. The obvious application of deduplication is incremental backups: it is only necessary to back up data that has changed, and in the study most files had not been modified for a long time.

A code is an assignment of bit strings to symbols such that the strings can be decoded unambiguously to recover the original data. The optimal code for a symbol with probability p will have a length of log2 1/p bits. Several efficient coding algorithms are known.

Huffman developed an algorithm that calculates an optimal assignment over an alphabet of n symbols in O(n log n) time. However, Huffman codes are inefficient in practice because code lengths must be rounded to a whole number of bits. This coding inefficiency can be reduced by assigning probabilities to longer groups of symbols, but only at the cost of an exponential increase in alphabet size, and thus in run time. The algorithm is as follows. We are given an alphabet and a probability for each symbol. We construct a binary tree by starting with each symbol in its own tree and repeatedly joining the two trees that have the two smallest probabilities until we have one tree. We can then label the branches 0 for left and 1 for right (although the choice is arbitrary); the code for each symbol is the sequence of labels on the path from the root to that symbol.

A code may be static or dynamic. A static code is computed by the compressor and transmitted to the decompresser as part of the compressed data. A dynamic code is computed by the compressor and periodically updated, but not transmitted; instead, the decompresser reconstructs the code using exactly the same algorithm, using the previously decoded data to estimate the probabilities. Neither method compresses better, because any space saved by not transmitting the model is paid back by having less data with which to estimate probabilities. Huffman codes are typically static, mainly for speed: the compressor only needs to compute the code once, using the entire input to compute probabilities. To transmit a Huffman table, it is only necessary to send the code length of each symbol, for example 3,3,3,3,4,4,4,4,3,3. Both the compressor and decompresser would then assign codes by starting with the shortest symbols, counting up from 0, and appending a 0 bit whenever the code gets longer. This results in a different but equally effective code, as sketched below.
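Below is a minimal sketch (my own illustration, not taken from any particular compressor) of that canonical code assignment. The code lengths are the ten example values above; symbols are ordered by length, the code counts up from 0, and a 0 bit is appended whenever the length increases.

// Assign canonical Huffman codes from a list of code lengths.
#include <algorithm>
#include <cstdio>
#include <vector>

int main() {
  // Example code lengths for symbols 0..9, as in the text.
  std::vector<int> len = {3, 3, 3, 3, 4, 4, 4, 4, 3, 3};
  int n = (int)len.size();

  // Order symbols by code length (stable, so ties keep symbol order).
  std::vector<int> order(n);
  for (int i = 0; i < n; ++i) order[i] = i;
  std::stable_sort(order.begin(), order.end(),
                   [&](int a, int b) { return len[a] < len[b]; });

  // Count up from 0, appending 0 bits whenever the code gets longer.
  unsigned code = 0;
  int prevLen = len[order[0]];
  for (int k = 0; k < n; ++k) {
    int s = order[k];
    code <<= (len[s] - prevLen);      // append 0 bits if this code is longer
    prevLen = len[s];
    std::printf("symbol %d -> ", s);
    for (int b = len[s] - 1; b >= 0; --b) std::printf("%d", (code >> b) & 1);
    std::printf("\n");
    ++code;                           // count up for the next symbol
  }
  return 0;
}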
For file compression, Huffman coded data still needs to be packed into bytes. JPEG packs bits in MSB (most significant bit) to LSB (least significant bit) order. The deflate format used in zip, gzip, and png files packs bits in LSB to MSB order, as if each byte were written backward. The decoder also needs to know where the coded data ends so that it does not decode extra symbols from the padding of the last byte. JPEG does this by not assigning any symbol to a code of all 1 bits and then padding the last byte with 1 bits. Deflate handles this by reserving a code to indicate the end of the data; this tells the decoder not to decode the remaining bits of the last byte.

Huffman coding has the drawback that code lengths must be a whole number of bits. This effectively constrains the model to probabilities that are powers of 1/2. The size penalty for modeling errors is roughly proportional to the square of the error, and it can be large for small alphabets. For example, the only possible way to Huffman code a binary alphabet is to code each bit as itself (or its opposite), resulting in no compression.

Arithmetic coding (Rissanen), also called range coding, does not suffer from this difficulty. Let P be a model, meaning that for any string x, P(x) is the probability of that string. The arithmetic code for x is a binary number y in the interval [P(<x), P(<x) + P(x)), where P(<x) is the total probability of all strings lexicographically smaller than x. Such a number can always be found that is no more than 1 bit longer than the Shannon limit, log2 1/P(x). In practice the coder maintains a range, initially [0, 1). As each symbol is coded, the range is divided in proportion to the probability distribution predicted by the model, and the portion of the range corresponding to the symbol actually coded becomes the new range. As the range shrinks, the leading bits of low and high match and can be output immediately, because the code y must be between them. The decompresser is able to decode y by making an identical sequence of predictions and range reductions.

Most modern data compressors use arithmetic coding. Early compressors used Huffman coding because arithmetic coding was patented and because its implementation required multiplication operations, which were slow on older processors. Neither of these issues is relevant today, because the important patents have expired and newer processors have fast multiply instructions (faster than memory access). The most common arithmetic coders code one byte at a time (PPM) or one bit at a time (CM, DMC). Free source code with no licensing restrictions for a bytewise encoder can be found in the source code for ppmd. Bitwise coders licensed under the GPL can be found in the source code for PAQ based compressors including FPAQ, LPAQ, and ZPAQ, the BWT compressor BBB, and the symbol ranking compressor SR2.

The simplest of these is the order 0 coder fpaq0. The encoder range is represented by two 32 bit integers (unsigned int) low and high, which are initially 1 and 0xffffffff respectively. After the range is split, a 1 is coded in the lower part of the range and a 0 in the upper part. After the range is split and reduced, it is normalized by shifting out any leading bytes that are identical between low and high. (This particular arrangement is a detail of fpaq0; it is not a requirement in general and is not used in the rest of the PAQ series.) The decoder has one additional variable in its state, the 32 bit integer curr, which is initialized to the first 4 bytes of compressed data.

One additional detail is how to handle the end of file. Most of the PAQ series compressors encode the file size separately and perform 8 encoding operations per byte. After the last encoding operation, the 4 bytes of either high or low, or some value in between, must be flushed to the archive because the decoder will read these 4 bytes in. After that, 4 zero bytes are written to mark the end of the compressed data. The assertions in the code are not necessary to make it work; the code is compiled with -DNDEBUG to remove them after debugging. Most high end compressors use arithmetic coding.
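The sketch below shows the general shape of such a carry-free bitwise coder: two 32 bit bounds split in proportion to a probability, with identical leading bytes shifted out. It is an illustration under my own simplifying assumptions (the Encoder name, the 16 bit probability scale, and the omission of the matching decoder are choices made here for brevity), not the fpaq0 source.

// A bitwise arithmetic encoder sketch. p1 is P(next bit = 1) scaled to
// 0..65535; as in the text, a 1 is coded in the lower part of the range.
#include <cstdint>
#include <vector>

struct Encoder {
  uint32_t low = 0, high = 0xffffffff;
  std::vector<uint8_t> out;

  void encode(int y, uint32_t p1) {
    // Split the range in proportion to p1; mid is the top of the '1' region.
    uint32_t mid = low + (uint32_t)(((uint64_t)(high - low) * p1) >> 16);
    if (y) high = mid; else low = mid + 1;     // keep the chosen subrange
    // Shift out leading bytes that low and high agree on.
    while ((low ^ high) < (1u << 24)) {
      out.push_back(low >> 24);
      low <<= 8;
      high = (high << 8) | 0xff;
    }
  }

  void flush() {                 // the decoder will read these 4 bytes in
    for (int i = 0; i < 4; ++i) { out.push_back(low >> 24); low <<= 8; }
  }
};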
However, another possibility with the same theoretical coding and time efficiency for bit strings is asymmetric binary coding or ABC (Duda). An asymmetric coder has a single n-bit integer state variable x, as opposed to the two variables low and high in an arithmetic coder. This allows a lookup table implementation. Because compression and decompression are reverse operations of each other, they must be performed in reverse order by storing the predictions and coded bits in a stack in either the compressor or the decompresser. The coder is implemented in the order-0 compressors fpaqa, fpaqb, and fpaqc and in the context mixing compressor lpaq1a from the PAQ series. It uses a 10 bit state, and the probability is quantized to 7 bits on a nonlinear scale (finer near 0 and 1, coarser near 1/2). Increasing these sizes would result in better compression at the expense of a larger table. The coder from fpaqc is used in lpaq1a (a context mixing program), replacing the arithmetic coder in lpaq1. Although a lookup table implementation might be faster on some small processors, it is slower on modern computers because multiplication is faster than random memory access.

Some benchmarks are shown below for enwik8 (100 MB of text) on a 2 GHz T3200 processor running on one of two cores. Ratio is the fraction of the original size; compression and decompression times are in nanoseconds per byte. The coders compared are fpaqa (ABC, lookup table), fpaqb and fpaqc (ABC, direct calculation), and fpaqa compiled with -DARITH (arithmetic coding). For high end compressors, CPU time and memory are dominated by the model, so the choice of coder makes little difference. A second comparison, timed on an Athlon, pits lpaq1a (ABC, direct calculation, using the fpaqb coder) against lpaq1 (arithmetic). The arithmetic coder in lpaq1 and fpaqa -DARITH compresses slightly worse than the ABC coder because it uses 12 bits of precision for the probability, rather than the higher precision used in ZPAQ.

Numeric codes are Huffman codes for numbers with some common probability distributions, such as might appear in coding prediction errors in images or audio, or run lengths in highly redundant data. Generally such codes have simpler descriptions to transmit to the decoder than a full Huffman code length table. A unary code for a non-negative integer n is n ones followed by a zero (or n zeros followed by a one). Signed integers can be encoded by mapping positive n to 2n and negative n to -2n-1, forming the sequence 0, -1, 1, -2, 2, -3, 3, etc. A Golomb code with modulus M divides n into a quotient q = floor(n/M) and a remainder r = n mod M; then q is encoded in unary and r in binary. If M is a power of 2, then the code is called a "Rice code", and all of the remainders are coded in log2 M bits. Otherwise, the smaller remainders are coded using floor(log2 M) bits and the larger ones with one more bit.

An "extra bit" code uses a Huffman code to encode a range of numbers, then encodes a value within that range (the extra bits) in binary. This method can be used to approximate any distribution that is reasonably smooth. JPEG uses extra bit codes to encode discrete cosine transform coefficients: the ranges are 0, 1, and then successive ranges increasing by powers of 2, with a sign bit for all of the ranges except 0. The deflate format used in zip and gzip uses extra bit codes to encode match lengths and offsets using a more complex set of ranges. A floating point number is a special case of an extra-bit code where the reals are approximated by mapping to integers on a logarithmic scale, and where each range is the same size and equally probable. It encodes the real number s × m × 2^e, where s is the sign, m the mantissa, and e the exponent.
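As an illustration of these numeric codes, here is a minimal sketch (not from any particular format) of the signed mapping followed by a Rice code. The names zigzag and rice are my own, and the bits are returned as a character string purely for readability; a real coder would pack them into bytes.

#include <cstdint>
#include <string>

// Map a signed integer to a non-negative one: 0,-1,1,-2,2,... -> 0,1,2,3,4,...
uint32_t zigzag(int32_t n) { return n >= 0 ? 2u * n : -2 * n - 1; }

// Encode n with Rice parameter k (modulus M = 2^k): the quotient is coded in
// unary (q ones and a zero) and the remainder in k binary bits.
std::string rice(uint32_t n, int k) {
  uint32_t q = n >> k;                 // quotient
  uint32_t r = n & ((1u << k) - 1);    // remainder
  std::string out(q, '1');
  out += '0';
  for (int b = k - 1; b >= 0; --b) out += ((r >> b) & 1) ? '1' : '0';
  return out;
}

// Example: rice(zigzag(-3), 2) maps -3 to 5, giving quotient 1 and
// remainder 01, so the output is "1001".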
Compressed files are stored in archives. An archive may contain a single file or multiple files. For example, gzip compresses a file, adds a .gz extension to its name, and deletes the original; decompression restores the original and deletes the compressed file. zip stores multiple files in an archive with a .zip extension. It has commands to list, add, update, delete, and extract files from the archive without deleting the input files. However, both programs use the same compression algorithm.

Compression can sometimes be improved by compressing similar files together to take advantage of mutual information. Some archivers support this using a "solid" format, which requires that all of the files be compressed or extracted at the same time. It is optional for some archivers such as WinRAR. A file compressor such as gzip can be used to create solid archives by first collecting the input files together using a non-compressing archiver such as tar and then compressing. The tar format was originally designed for UNIX or Linux backups on tape. Thus, it is possible to make a tar file of an entire file system and restore its structure unchanged. The tar program appends a 512 byte header to each file containing the file name, path, size, timestamp, owner, group, permissions, and padding with zero bytes. Then the files are all appended together. The files are also padded with zero bytes to a multiple of 512. These zero bytes are easily compressed. Archivers such as zip, WinRAR, and 7-zip have features similar to tar in that they preserve timestamps and can compress and extract directory trees. However, they do not store owner, group, and permission information, in order to be compatible with Windows, which has a different security model.

Archivers sometimes add a checksum to detect transmission errors. The checksum is computed and stored before compression and verified after decompression. The need for a checksum is questionable because networks and storage media typically use error correction, so that errors are very unlikely. Decompression errors are more likely to result from software bugs or deliberate manipulation. A single bit change in an archive can produce a completely different file. A checksum increases the archive size slightly (usually by 4 bytes) and also increases compression and decompression time by the time it takes to compute it.

Some common checksum functions are described below. A parity bit is computed by counting the number of 1 bits in the input stream: the checksum is 1 if the total is odd or 0 if even. All single bit errors are detected, but other errors are detected only with probability 1/2. CRC-32 is a commonly used cyclic redundancy check. It detects all burst errors up to 31 bits and all other errors with probability 1 - 2^-32. The input is a stream of bits which are shifted through a 32 bit window. Each time a 1 bit is shifted out, the new window contents are XORed with 0x04C11DB7; otherwise the window is unchanged. The final contents of the window is the 32 bit checksum. A fast implementation will shift and XOR one byte at a time using a 256 by 32 bit lookup table.
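A bit-at-a-time version of the CRC-32 just described can be sketched as follows. This is only an illustration of the shift-and-XOR mechanism; real implementations such as zlib use a reflected, table-driven variant with pre- and post-inversion, so the value computed here will not match their output.

#include <cstddef>
#include <cstdint>

uint32_t crc32_bitwise(const uint8_t* data, size_t n) {
  uint32_t window = 0;
  for (size_t i = 0; i < n; ++i) {
    for (int b = 7; b >= 0; --b) {            // feed one bit at a time, MSB first
      uint32_t shifted_out = window >> 31;    // bit about to leave the window
      window = (window << 1) | ((data[i] >> b) & 1);
      if (shifted_out) window ^= 0x04C11DB7;  // XOR when a 1 bit is shifted out
    }
  }
  return window;                              // final window contents = checksum
}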
The Adler-32 checksum is used in the zlib implementation of deflate, which is used in zip and gzip. It is a 32 bit checksum, detecting almost all accidental errors, with two 16 bit parts. The first part is 1 plus the sum of all input bytes, modulo 65521. The second part is the sum of all the intermediate sums of the first part, modulo 65521. The computation is very fast because the slow mod operation only needs to be calculated infrequently using 32 bit arithmetic; 65521 is the largest prime number less than 65536.

CRC-32 and Adler-32 detect most accidental errors, but it is easy to deliberately alter the input without changing the checksum. Cryptographic hash functions are designed to resist even deliberate attempts to find collisions (two different inputs that produce the same checksum). They are designed so that the best possible algorithms are no better than randomly guessing the inputs and testing the outputs. The disadvantage is that they are larger (up to 32 bytes) and take longer to compute, although they are still reasonably fast. ZPAQ optionally adds a 160 bit (20 byte) SHA-1 checksum. It detects errors, even deliberately introduced ones, except with negligible probability. On a 2 GHz T3200, the SHA-1 calculation takes somewhat longer per byte than CRC-32 or Adler-32 (measured with the SlavaSoft fsum implementation on a 1 GB text file).

Some archivers will encrypt data as well as compress it. I do not recommend adding this feature unless you know what you are doing. It is tempting to design your own algorithm, perhaps combining it with compression by altering the initial state of the model in a way that depends on a password. Such an attempt will almost certainly fail; recall that there is no general test for randomness, so output that looks random proves nothing. If you add encryption, use a well tested algorithm such as AES. Compress your data before encryption, because encrypted data cannot be compressed. Use published encryption code. Then publish all your source code and have security experts look at it. Others will try to break your code, and you should make their job as easy as possible by clearly documenting what your code does and how it works. For example, if your archiver deletes the plaintext, it might still be recovered from unallocated disk sectors or the swap file; likewise for the password or password-derived data in memory that gets swapped to disk. There are many subtle things that can go wrong. Do not believe that a secret algorithm or an executable-only release is more secure. An encryption algorithm is not considered secure unless it can withstand chosen plaintext attacks by adversaries with full knowledge of the algorithm. Most security experts will not even waste their time trying to attack a closed system; they will just use other, well tested solutions. If your system ever does become popular, then others will reverse engineer it and they will break it.

A model is an estimate of the probability distribution of inputs to a compressor. Usually this is expressed as a sequence of predictions of successive symbols (bits, bytes, or words) in the input sequence given the previous input as context. Once we have a model, coding is a solved problem. But, as proved by Kolmogorov, there is no algorithm for determining the best model. This is the hard problem in data compression.

A model can be static or adaptive. In the static case, the compressor analyzes the input, computes the probability distribution for its symbols, and transmits this data to the decompresser, followed by the coded symbols. Both the compressor and decompresser then select codes of the appropriate lengths using identical algorithms. This method is often used with Huffman coding. Typically the best compressors use dynamic models and arithmetic coding.
The compressor uses past input to estimate a probability distribution (prediction) for the next symbol without looking at it. Then it passes the prediction and the symbol to the arithmetic coder, and finally updates the model with the symbol it just coded. The decompresser makes an identical prediction using the data it has already decoded, decodes the symbol, then updates its model with the decoded output symbol. The model is unaware of whether it is compressing or decompressing. This is the technique we will use in the rest of this chapter.

The simplest model is a fixed order model. An order n model inputs the last n bytes or symbols (the context) into a table and outputs a probability distribution for the next symbol. In the update phase, the actual symbol is revealed and the table is updated to increase its probability when the same context next appears. An order 0 model uses no context.

A probability distribution is typically computed by using a counter for each symbol in the alphabet. If the symbols are bytes, then the size of the alphabet is 256. The prediction for each symbol is the count for that symbol divided by the total count, and the update procedure is to increment the count for that symbol. If the arithmetic coder codes one byte at a time, then you pass the array of counts and the total to the arithmetic coder; for compression, you also pass the byte to be coded, and for decompression, it returns the decoded byte. The encoder and decoder each take an array of counts and divide the current range in proportion to those counts. For an order n model, the update procedure also stores the last n bytes in the low bits of the context.

There are a few problems with this method. First, what happens if a byte has a probability of zero? An ideal encoder would give it a code of infinite size; in practice, the encoder would fail. One fix is to initialize all elements of count to 1. Sometimes it improves compression if the initial count is effectively smaller than 1, which can be arranged by keeping the initial counts at 1 and increasing the increment to, say, 2 or more.

A second problem is that a count can eventually overflow. One solution is to rescale all of the counts by dividing by 2 when a count becomes too large. Setting a small upper limit (typically 30 to several hundred) can improve compression of data with blocks of mixed types, like text with embedded images, because the statistics reflect recent input. This is an adaptive or nonstationary model. Data with uniform statistics, such as pure text, is compressed better with stationary models, where the counts are allowed to grow large; in this case, probabilities depend equally on all of the past input.

A third problem is that the table of counts grows exponentially with the context order. One solution is to reduce the context to a hash of k bits per byte; then the last nk bits of context depend on the last n bytes of input. A larger k will result in better compression at the expense of more memory.

A fourth problem is that straightforward bytewise arithmetic coding is inefficient: the decoder must compute range intervals to find the one containing the compressed data. This could be solved by using cumulative counts, i.e. storing for each symbol the sum of the counts of all lower valued symbols. Another solution is to encode one bit at a time using a bitwise encoder like the one described earlier. The idea is to encode one bit at a time, using the previous bits of the current byte as additional context. Only two values are stored: a count of ones (count1) and a total count. The prediction is count1 divided by the total count. The update procedure is to increment the total count, and to increment count1 if the bit is 1. We handle zero probabilities, overflow, and large contexts as before.
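The counting model just described can be sketched as follows. The structure names (BitCounter, Order0Model), the 16 bit probability scale, and the rescaling threshold are my own illustrative choices, not taken from any particular program.

#include <cstdint>

// One bitwise context: a count of ones and a total count, initialized to
// avoid zero probabilities and halved when the total gets large.
struct BitCounter {
  uint32_t count1 = 1, total = 2;          // as if one 0 and one 1 were seen

  // Prediction P(next bit = 1), scaled to 0..65535 for an arithmetic coder.
  uint32_t p1() const { return (uint32_t)(((uint64_t)count1 << 16) / total); }

  void update(int bit) {
    if (total >= 255) { count1 = (count1 + 1) / 2; total = (total + 1) / 2; }
    count1 += bit;
    ++total;
  }
};

// An order 0 bitwise model: the context is the value 1..255 formed by a
// leading 1 bit followed by the bits of the current byte coded so far.
struct Order0Model {
  BitCounter t[256];
  uint32_t p1(int partialByte) const { return t[partialByte].p1(); }
  void update(int partialByte, int bit) { t[partialByte].update(bit); }
};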
Alternatively, we can avoid a slow division operation by storing the prediction directly. Each bitwise context is associated with a prediction that the next bit will be a 1 (initially 1/2) and an update count (initially 0). After each bit, the prediction is adjusted toward the bit actually seen by an amount inversely proportional to the count plus a small constant DELTA, and the count is incremented. The update rate is thus initially fast and decreases as the count grows, resulting in a stationary model. The count is incremented only up to a maximum value, LIMIT; at that point the model switches from stationary to adaptive. DELTA and LIMIT are tunable parameters, and the best values depend on the data. A large LIMIT works best for stationary data; a small LIMIT works better for mixed data types. The choice of DELTA is less critical because it only has a large effect when the data size is small relative to the model size.

At each of the 8 steps of coding a byte, the previously coded bits are packed into a number in the range 1 to 255: a binary number consisting of a 1 followed by up to 7 earlier bits. After 8 decoding operations the number holds 256 plus the decoded byte, and the leading 1 is stripped off before the byte is returned.

ZPAQ packs a 22 bit prediction and a 10 bit count into a 32 bit model element. As a further optimization, the model is stored as a one dimensional array aligned on a 64 byte cache line boundary. The bytewise context is updated once per byte as usual, but the extra bits of the current byte are expanded in groups of 4 in a way that causes only two cache misses per byte: the leading bits of the partial byte are expanded to a 9 bit value, which is then exclusive-ORed with the bytewise context address. ZPAQ fixes DELTA, but LIMIT is configurable (4, 8, 12, and so on).

Using a higher order model can improve compression at the cost of memory. However, direct lookup tables are not practical for orders higher than about 2; the order 2 model in ZPAQ already uses on the order of 100 MB of memory.
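The adaptive prediction update described above (a prediction plus a count per context) can be sketched as follows. This is an illustration only, not the ZPAQ source; the 16 bit scale, the DELTA of 1, and the default LIMIT are assumptions made here.

#include <cstdint>

struct DirectModel {
  uint16_t p = 32768;   // P(bit = 1), scaled so that 32768 means 1/2
  uint16_t n = 0;       // update count

  void update(int bit, int LIMIT = 255) {
    if (n < LIMIT) ++n;                 // past LIMIT the model stays adaptive
    int target = bit ? 65535 : 0;
    p += (target - p) / (n + 1);        // error divided by (count + DELTA)
  }
};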
Suppose you are given a sequence like 0000000001 and asked to predict what bit comes next. If we assume that the source is stationary, then the answer is 0.1, because 1 out of 10 bits is a 1. If we assume a nonstationary source, then the answer is higher, because we give preference to newer history. How do we decide? An indirect model learns the answer by observing what happened after similar sequences appeared. The model uses two tables. The first table maps a context to a bit history, a state representing a past sequence of bits. The second table maps the history to a prediction, just like a direct context model. Indirect models were introduced in paq6.

A bit history may be written in the form (n0, n1, LB), which means that there have been n0 zeros and n1 ones, and that the last bit was LB (0 or 1). For example, a sequence of 3 zeros and 2 ones ending in a 1 would result in the state (3, 2, 1). The initial state is (0, 0, -), meaning there is no last bit. In paq6 and its derivatives, including ZPAQ, a bit history is stored as 1 byte, which limits the number of states to 256. The state diagram below shows the allowed states in ZPAQ, with n0 on the horizontal axis and n1 on the vertical axis. A single dot represents a single state where LB can take only one value, because the state is reachable with either a 0 or a 1 but not both; in that case LB is the bit with the larger count. In general, an update with a 0 moves to the right and an update with a 1 moves up. The initial state is marked with a 0 in the lower left corner. The diagram is symmetric about the diagonal.

There are some exceptions to the update rule. Since it is not possible to go off the end of the diagram, the general rule is to move back to the nearest allowed state in the direction of the lower left corner, preserving the ratio of n0 to n1. There is another rule intended to make the model somewhat nonstationary: when one of the counts is large and the other is incremented, the larger count is reduced. The specific rule from the ZPAQ standard is that if the larger count is 6 or 7 it is decremented, and if it is larger than 7 then it is reduced to 7. This rule is applied first, prior to moving away from an invalid state. The initial prediction associated with each bit history is a function of its two counts. The details of the design were determined experimentally to improve compression slightly over the PAQ8 series, which uses a similar design.

An indirect model is more memory efficient than a direct one because it uses only one byte per context instead of four. In all of the PAQ series, it is implemented as a hash table. In the PAQ8 series, the hash table is designed to allow lookups with at most 3 cache misses per byte. In ZPAQ, there are 2 cache misses per byte, similar to the direct model. The ZPAQ hash table maps a context on a 4 bit boundary to an array of bit histories and an 8-bit checksum; the histories represent all 15 possible contexts that occur after up to 3 more bits. The checksum detects most hash collisions, which hurt compression more for indirect models than for direct ones, although a small chance of an undetected collision remains. For comparison, direct context models of the same orders were also tested on the Calgary corpus; direct models of order 3, 4, and 5 use context hashes with LIMIT set to 32 and the same memory usage. There are many other possibilities. For example, M1, a context mixing compressor by Christopher Mattern, uses 2 dimensional context models taking 2 quantized bit histories as input.

Fixed order models compress better using longer contexts, but only up to a point (order 3 for the Calgary corpus). Beyond that, compression gets worse because many higher order contexts are being seen for the first time and no prediction can be made. One solution is to collect statistics for different orders at the same time and then use the longest matching context for which we know something. DMC does this for bit level predictions, and PPM for byte level predictions.

DMC (dynamic Markov coding) was described in a paper by Gordon Cormack and Nigel Horspool in 1987 and implemented in C. The compressor hook by Nania Francesco Antonio is based on this implementation. Each table entry holds a pair of bit counts, implementing a stationary, direct context model (there are other possibilities, of course). Contexts are not stored or looked up in the table. Rather, each table entry has a pair of pointers to the next entries corresponding to appending a 0 or 1 bit to the current context. The next context might also drop bits off the back; normally it is arranged so that each context starts on a byte boundary. In DMC and HOOK, the initial table contains 64K entries consisting of all contexts of length 8 to 15 bits that start on a byte boundary, i.e. the last full byte plus 0 to 7 bits of the current byte. In addition to updating counts, we may also add new table entries corresponding to the input just observed. We do this by "cloning" the next state with one that does not drop any bits off the back. Consider the case where we update the model with a 0 bit and the corresponding next state is also reachable from other states: that next state is cloned, and the clone receives copies of the original's outgoing pointers and counts.
However, we proportionally split the counts between the clone and the original according to the contribution from the transition being followed and from other states: the newly cloned state gets that fraction of the counts and the original gets the rest. This scaling maintains the condition that the input and output counts are equal. The counts are implemented as fixed or floating point numbers to allow fractional values. The transition that was followed is changed to point to the new state. Before cloning (which uses memory), there should be sufficient justification to do so: cloning is normally done only when both the count on the transition being followed and the counts contributed by other states exceed some thresholds. In the original DMC, both thresholds are 2. Normally, when the state table is full, the model is discarded and re-initialized; setting higher thresholds can delay this from happening.

PPM (prediction by partial match) uses a byte-wise context model. Each context up to some maximum order is mapped to an array of counts for the next byte that occurred in this context. To predict the next byte, we find the longest context seen at least once before and allocate probabilities in proportion to those counts. The main complication is that some of the counts might be zero, but we must never assign a zero probability to any byte value because if it did occur, it would have an infinite code length. This is handled by assigning an "escape" probability that the next byte will not be any of the ones that have nonzero counts, and dividing that probability according to the frequency distribution of the next lower order context. This is repeated all the way down to order 0. If any of the byte values have still not been assigned a probability, then the remaining space is divided equally among them (an order -1 model).

Estimating the escape probability can be complex. Suppose you draw 10 marbles from an urn. There are 8 blue, 1 green, and 1 white. What is the probability that the next marble will be a different color, not seen before? One estimate, known in the PPM literature as method X, is the fraction of colors seen exactly once, here 2 out of 10. Method X was shown to be optimal under certain assumptions, including that the source is stationary. Of course, that is not always the case. Suppose you receive the sequence "BBBBBBBBWG" and are asked to predict whether the next character will be novel; the answer might be different for the sequence "WGBBBBBBBB". These methods are not used in modern PPM compressors. Rather, better compression can be obtained by learning the escape probability directly. The method is known as secondary escape estimation (SEE). Charles Bloom used SEE in PPMZ. Dmitry Shkarin refined this method in ppmd; a newer variant, ppmd var. J, gets slightly better compression than var. I. The code is the basis of slower but better compressing programs such as ppmonstr and durilca. A variant of durilca using 13 GB memory is top ranked on the large text benchmark.

The prediction for the most probable byte is also adjusted using secondary symbol estimation (SSE). This is a direct context model taking as input the quantized prediction and a very complex context, and outputting a new prediction. Both programs use other techniques to improve compression. They use partial update exclusion: when a character is counted in some context, it is counted with a reduced weight in the next lower order context. Also, when computing symbol probabilities, they perform a weighted averaging with the predictions of the lower order context, with the weight of the lower order context inversely proportional to the number of different higher order contexts of which it is a suffix. Statistics are stored in a tree which grows during modeling. When the memory limit is reached, the tree is discarded and rebuilt from scratch. This improves compression but takes longer.
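To illustrate the escape mechanism, here is a heavily simplified sketch of PPM-style prediction. It is not ppmd: the escape estimate (one count's worth of probability per context) is a crude stand-in for the methods discussed above, exclusions are omitted, and the data structures are chosen for clarity rather than speed.

#include <array>
#include <map>
#include <string>

using Counts = std::map<unsigned char, int>;   // byte -> count
using Model  = std::map<std::string, Counts>;  // context -> counts for next byte

std::array<double, 256> ppmPredict(const Model& m, std::string ctx) {
  std::array<double, 256> p{};
  double remaining = 1.0;                      // probability not yet assigned
  while (true) {
    auto it = m.find(ctx);
    if (it != m.end() && !it->second.empty()) {
      int total = 0;
      for (auto& kv : it->second) total += kv.second;
      double escape = 1.0 / (total + 1);       // crude escape estimate
      for (auto& kv : it->second)              // share out the non-escape part
        p[kv.first] += remaining * (1.0 - escape) * kv.second / total;
      remaining *= escape;                     // the rest falls through
    }
    if (ctx.empty()) break;
    ctx.erase(0, 1);                           // drop to the next lower order
  }
  for (int i = 0; i < 256; ++i) p[i] += remaining / 256;  // order -1: uniform
  return p;
}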
Although bytewise arithmetic encoding can be inefficient, ppmd is in practice faster than many equivalent bitwise context mixing models. First, a byte is encoded as a sequence of escape symbols followed by a non-escape, and each of these encodings is drawn from a smaller alphabet. Second, the alphabet within each context can be ordered so that the most likely symbols come first, which reduces the number of operations in the majority of cases. Compressed sizes were measured for the Calgary corpus both as a tar file and as separate files; compression and decompression times are about the same. Option -o16 means use the maximum order of 16, and -m256 means use 256 MB of memory.

CTW, like DMC, predicts one bit at a time using bit count statistics for contexts that begin on byte boundaries. Like PPM and unlike DMC, it combines the predictions of different order contexts from order 0 up to some maximum order D. Each context is associated with two bit counters, n0 and n1, and a mixing weight, w. Each context order i from 0 through D computes an estimator e_i that the next bit will be a 1, depending on the two counts. Each of the lower order contexts predicts p_i, a weighted average of e_i and the next higher order prediction p_(i+1); the final prediction is p_0. On update with bit y (0 or 1), the count n_y is incremented and w is updated to favor the more accurate prediction in each context, either toward e_i or toward p_(i+1), except for the special case where one of the counts is 0, which we consider later. The estimator, expressed as a ratio, is the Krichevsky-Trofimov (KT) estimate e_i = (n1 + 1/2) / (n0 + n1 + 1).

CTW uses a zero redundancy (ZR) estimator for the special case where one of the two counts is zero. The idea is to predict the other bit more strongly. Define k_n as the cumulative probability assigned by the KT estimator to a run of n identical bits; the ZR estimator e is then defined in terms of k_n. All estimators are symmetric with respect to 0 and 1: the probability of a 1 after a run of zeros is the same as the probability of a 0 after an equally long run of ones. For a run of n0 zeros and no ones, the ZR estimator gives a smaller probability of a 1 than the KT estimator does. For large runs of identical bits, the total code length for KT grows logarithmically, but for the ZR estimator it is bounded by 2 bits. However, ZR adds 1 bit of redundancy when the other bit value eventually occurs. For large n0, the ZR estimated probability of a 1 approaches 0 faster than the roughly 1/n0 rate of the KT estimator.

The CTW implementation uses a context tree to store the two counts and w. Each count is 8 bits. If one count exceeds 255, then both counts are divided by 2. It stores log w rather than w and does all arithmetic in the log domain so that multiplication and division are implemented using addition and subtraction. The KT and ZR estimators are looked up in a table indexed by the two counts. The parent of each node is the suffix obtained by removing one byte. The children of the root node represent the possible partial byte order 0 contexts of length 0 to 7 bits. To save space, children of contexts that have only occurred once are not stored. Each tree node uses 8 bytes, including a 24 bit pointer to the context in a separate file buffer. CTW is best suited for stationary sources. The published CTW implementation compresses the Calgary corpus to 767,594 bytes as 14 separate files in 62 seconds with options -n16M -b16M -d12 set for maximum compression. When the tree is full, no new contexts are added, but the counts and weights continue to be updated. These settings require the maximum amount of memory that the published implementation can use.
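As a concrete illustration, here is a small Python sketch of the KT estimator and of mixing the estimates along a context path from order D down to order 0. The weight update shown, nudging w toward whichever of e_i or p_(i+1) was more accurate, is a deliberate simplification; the real CTW rule computes the weights from accumulated sequence probabilities in the log domain, as described above.

    def kt_estimate(n0, n1):
        # Krichevsky-Trofimov estimate that the next bit will be a 1
        return (n1 + 0.5) / (n0 + n1 + 1.0)

    def ctw_predict(path):
        """path is a list of [n0, n1, w] nodes ordered from order D down to 0.
        Each order i mixes its own estimate e_i with the prediction handed
        down from order i+1:  p_i = w*e_i + (1-w)*p_(i+1).  Returns the
        final prediction p_0 and the intermediate predictions."""
        preds, p = [], None
        for n0, n1, w in path:
            e = kt_estimate(n0, n1)
            p = e if p is None else w * e + (1.0 - w) * p
            preds.append(p)
        return p, preds

    def ctw_update(path, preds, bit, rate=0.02):
        """Count the observed bit and nudge each weight toward whichever of
        e_i or p_(i+1) predicted it better (a simplified, gradient-style
        stand-in for the true CTW weighting)."""
        p_higher = None
        for node, p_i in zip(path, preds):
            n0, n1, w = node
            if p_higher is not None:
                e = kt_estimate(n0, n1)
                better_e = abs(bit - e) < abs(bit - p_higher)
                node[2] = min(1.0, max(0.0, w + (rate if better_e else -rate)))
            node[bit] += 1          # node[0] is n0, node[1] is n1
            p_higher = p_i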
Context mixing algorithms based on the PAQ series are top ranked on many benchmarks by size, but are very slow. These algorithms predict one bit at a time like CTW, except that weights are associated with models rather than contexts, and the contexts need not be mixed from longest to shortest order. Contexts can be arbitrary functions of the history, not just suffixes of different lengths. Often the result is that the combined prediction of independent models compresses better than any of the individual models that contributed to it.

DMC, PPM, and CTW are based on the premise that the longest context for which statistics are available is the best predictor. This is usually true for text, but not always. For example, in an audio file a predictor would be better off ignoring the low order bits of the samples in its context because they are mostly noise. For image compression, the best predictors are the neighboring pixels in two dimensions, which do not form a contiguous context. For text, we can improve compression using some contexts that begin on word boundaries and merge upper and lower case letters. In data with fixed length records such as spreadsheets, databases, or tables, the column number is a useful context, sometimes in combination with adjacent data in two dimensions. PAQ based compressors may have tens or hundreds of these different models to predict the next input bit.

A fundamental question is how to combine predictions. Suppose that in context A one model predicts the next bit will be a 1 with probability pa, and in context B another model predicts a 1 with probability pb, and we want the combined probability p given that both contexts are present. Assume that A and B have each occurred often enough for the two models to make reliable guesses, but that the two contexts have never occurred together before. Probability theory does not answer the question: it is possible to construct sequences where p can be anything at all for any pa and pb. But intuitively, we should do some kind of averaging or weighted averaging. For example, if we wish to estimate P(car accident | dark and raining) given P(car accident | dark) and P(car accident | raining), we would expect the effects to be additive. Furthermore, we want to mix predictions weighted by confidence. All PAQ versions do this. Early versions expressed predictions as pairs of counts (observed zeros and ones) and added them together. This implicitly gave greater weight to predictions near 0 or 1, because such predictions can only occur when one of the counts is large. This allowed greater flexibility in modeling techniques. Early versions used fixed weights.

In PAQ6 (Mahoney), a probability is expressed as a count of zeros and ones. Probabilities are combined by weighted addition of the counts. Weights are adjusted in the direction that minimizes coding cost in weight space. The optimal weight update can be found by taking the partial derivative of the coding cost with respect to w_i. The coding cost of a 0 is -log(1 - p), and the coding cost of a 1 is -log(p). The result is that after coding bit y (0 or 1), the weights are updated by moving a small step along the cost gradient in weight space. Counts are discounted to favor newer data over older. A pair of counts is represented as a bit history similar to the one described earlier, but with more aggressive discounting: when a bit is observed and the count for the opposite bit is more than 2, the excess is halved.
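A minimal Python sketch of mixing by weighted addition of counts, as in the early PAQ versions described above. The smoothing constants, the fixed weights, and the omission of the adaptive gradient update are simplifications for illustration.

    def linear_mix(counts, weights):
        """counts: list of (n0, n1) pairs, one per model.
        weights: one non-negative weight per model.
        Returns P(next bit = 1). Adding counts automatically gives more
        influence to confident models, since a prediction near 0 or 1
        requires one of the counts to be large."""
        s0 = sum(w * n0 for (n0, n1), w in zip(counts, weights))
        s1 = sum(w * n1 for (n0, n1), w in zip(counts, weights))
        # small offsets keep the coded probability away from exactly 0 or 1
        return (s1 + 0.5) / (s0 + s1 + 1.0)

    # two confident models disagreeing with one weak model
    print(linear_mix([(40, 2), (1, 1), (0, 12)], [1.0, 1.0, 1.0]))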
Beginning with PAQ7, linear mixing was replaced by logistic mixing, which is more general since only a probability is needed as input. This allows the use of direct context models and a more flexible arrangement of different model types. It is used in the PAQ8, LPAQ, and PAQ8HP series and in ZPAQ. Given a set of predictions p_i that the next bit will be a 1, and a set of weights w_i, the combined prediction is p = squash(SUM_i w_i * stretch(p_i)), where stretch(p) = ln(p / (1 - p)) and squash(x) = 1 / (1 + e^-x) is its inverse. The probability computation is essentially a neural network evaluation taking stretched probabilities as input. Again we find the optimal weight update by taking the partial derivative of the coding cost with respect to the weights. The result is that the update for bit y (0 or 1) is w_i := w_i + L * stretch(p_i) * (y - p), where L is the learning rate. This is simpler than back propagation, which would minimize RMS error instead of coding cost. Unlike linear mixing, the weights can be negative. Compression can often be improved by using a set of weights selected by a small context, such as a bytewise order 0 context.

In PAQ and ZPAQ, squash and stretch are implemented using lookup tables. In PAQ, both output 12 bit fixed point numbers, and a stretched probability has a range of -8 to 8. ZPAQ also represents stretched probabilities as 12 bit fixed point numbers, and stores squashed probabilities in 15 bits as odd multiples of the smallest representable value, so a probability is never exactly 0 or 1. This representation was found to give slightly better compression than PAQ. ZPAQ allows different components (models and mixers) to be connected in arbitrary ways. All components output a stretched probability, which simplifies the mixer implementation. In ZPAQ, 16 bit mixer weights were found to be inadequate for best compression, so weights were expanded to 20 bit signed fixed point values with a range of -8 to 8 (a precision of 1/65536).

SSE (secondary symbol estimation) is implemented in all PAQ versions beginning with PAQ2. Like in ppmonstr, it inputs a prediction and a context and outputs a refined prediction. The prediction is quantized, typically to 32 or 64 values on a nonlinear scale with finer resolution near 0 and 1, and sometimes interpolated between the two closest values. A typical place for SSE is to adjust the output of a mixer using a low order (0 or 1) context. SSE components may be chained in series with contexts typically in increasing order, or they may be placed in parallel with independent contexts and their outputs mixed or averaged together. The table is initialized so that the output prediction is equal to the input prediction for all contexts. SSE was introduced in PAQ2 with 64 quantization levels and no interpolation. Later versions used 32 levels and interpolation, with updates to the two nearest values above and below. In some versions of PAQ, SSE is also known as an APM (adaptive probability map). ZPAQ allows an SSE to be placed anywhere in the prediction sequence with any context. There are other possibilities. CCM, a context mixing compressor by Christian Martelock, uses a 2 dimensional SSE taking 2 quantized predictions as input.
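Here is a minimal Python sketch of the logistic mixing and weight update described at the start of this section; the SSE refinement stage is not shown. PAQ and ZPAQ do the same computation in fixed point with table lookups for stretch and squash, and select the weight vector by a small context. The learning rate below is an illustrative value.

    import math

    def stretch(p):
        return math.log(p / (1.0 - p))

    def squash(x):
        return 1.0 / (1.0 + math.exp(-x))

    class LogisticMixer:
        def __init__(self, n, rate=0.002):
            self.w = [0.0] * n          # weights start at 0: first output is 0.5
            self.rate = rate
            self.inputs = [0.0] * n
            self.p = 0.5

        def mix(self, probs):
            """probs: predictions strictly between 0 and 1, one per model."""
            self.inputs = [stretch(p) for p in probs]
            self.p = squash(sum(w * x for w, x in zip(self.w, self.inputs)))
            return self.p

        def update(self, y):
            """y is the observed bit (0 or 1); move weights along the
            coding-cost gradient: w_i += rate * stretch(p_i) * (y - p)."""
            err = y - self.p
            for i, x in enumerate(self.inputs):
                self.w[i] += self.rate * x * err

    m = LogisticMixer(2)
    p = m.mix([0.9, 0.6])   # two model predictions that the next bit is 1
    m.update(1)             # the bit turned out to be 1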
ISSE (indirect secondary symbol estimation) is a technique introduced in paq9a and used as a component in ZPAQ. The idea is to use SSE as a direct prediction method rather than to refine an existing prediction. However, SSE does not work well with high order contexts because the large table size uses too much memory. More generally, a large model with lots of free parameters (each table entry is a free parameter) will overfit the training data and have no predictive power for future input. As a general rule, a model should not be larger than the input it is trained on. ISSE does not use a 2-D table. Instead, it first maps a context to a bit history, as with an indirect context map. Then the 8-bit bit history is used as a context to select the pair of weights for a 2 input mixer taking the input prediction and a fixed constant as its two inputs. The weights are initialized so that the initial output prediction is equal to the input prediction. PAQ9A and the default compression mode of ZPAQ both start with an order 0 model prediction and refine it using a chain of ISSE components of increasing order. In ZPAQ, the ISSE weights are 20 bit signed fixed point numbers with a range of -8 to 8, as in a MIX.

A match model searches for the most recent occurrence of the current high order context in the history and predicts whatever bit followed that occurrence. The accuracy of the prediction depends on the length of the context match: longer matches generally give more confidence to the prediction. Typically a high order match model is mixed with lower order context models. A match model is faster and uses less memory than a corresponding context model, but does not model low orders well. Match models are used in PAQ and ZPAQ. They consist of a rotating history buffer and a hash table mapping contexts to pointers into the buffer. In ZPAQ, a pointer to the match is maintained until a mismatching bit is found; the model will then look for a new match at the start of the next byte. On each byte boundary, the buffer is updated with the modeled byte and the hash table is updated so that the current context hash points to the end of the buffer. ZPAQ allows both the hash table size and the buffer size to be user specified as powers of 2. For best compression, the history buffer should be as large as the input if this is practical, and the hash table size is typically 1/4 of this; because each pointer is 4 bytes, both data structures then use the same amount of memory. Match models in PAQ maintain multiple context hashes of different orders and multiple pointers into the buffer. The prediction is indirect, mapping the match length to a prediction through a direct context model. ZPAQ uses a simpler match model with just one pointer and one hash, although it is possible to have multiple, independent match models. The prediction for a match of L bytes (8L bits) is that the next bit will be the same as the matched bit, with a confidence that grows with L. The user may specify the context length by using a rolling hash that depends on the desired number of characters.
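A minimal Python sketch of a match model. A real implementation, such as the one in ZPAQ, uses a rotating buffer and a hash table of context hashes; here a dict maps the last `order` bytes directly to the position of the byte that followed them, and the length-to-confidence mapping is an illustrative assumption.

    class MatchModel:
        def __init__(self, order=6):
            self.order = order
            self.buf = bytearray()   # history of modeled bytes
            self.table = {}          # context -> position of the byte that followed it
            self.match = -1          # position in buf of the predicted byte
            self.length = 0          # current match length in bytes

        def predict_bit(self, bits_seen, partial):
            """P(next bit = 1) given the `bits_seen` high bits of the
            current byte decoded so far (as the integer `partial`)."""
            if not (0 <= self.match < len(self.buf)):
                return 0.5
            b = self.buf[self.match]
            if (b >> (8 - bits_seen)) != partial:    # match already broken
                return 0.5
            bit = (b >> (7 - bits_seen)) & 1
            conf = min(0.5 + 0.05 * self.length, 0.98)
            return conf if bit else 1.0 - conf

        def update(self, byte):
            """Call once per modeled byte, after it is fully known."""
            if 0 <= self.match < len(self.buf) and self.buf[self.match] == byte:
                self.match += 1          # the match continues
                self.length += 1
            else:
                self.match, self.length = -1, 0
            self.buf.append(byte)
            if len(self.buf) >= self.order:
                key = bytes(self.buf[-self.order:])
                if self.match < 0:
                    # a freshly found match starts at length 0 and gains
                    # confidence only as it is confirmed
                    self.match = self.table.get(key, -1)
                self.table[key] = len(self.buf)   # the next byte will follow here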
The high compression ratio and slow speed of PAQ come from using many different context models for different types of data. The versions are described below in historical order. Schmidhuber and Heil developed an experimental neural network data compressor. It used a 3 layer network trained by back propagation to predict characters from an 80 character alphabet in text, with separate training and prediction phases. Compressing 10 KB of text required several days of computation on an HP workstation. A later approach used a much simpler network, trained online in a single pass, that predicts one bit at a time with context hashes as sparse inputs. These changes make the algorithm vastly faster, mainly because only a few neurons out of millions are active at any one time. To make the training online, it is necessary to add a pair of counters to each weight to count 0 and 1 bits, so that the learning rate is initially large and then decreases in inverse proportion to the smaller of the two counts. There are 3 versions: P5, P6, and P12. P5 represents 5 context orders, with each input neuron representing a context hash and one output neuron giving the bit prediction. P6 uses more input neurons and memory. P12 is the same size as P6 but adds a whole word model. This context hash depends only on alphabetic characters mapped to lower case, and is reset after a nonalphabetic character. It improves compression of text files.

In PAQ1 (Mahoney), it was realized that the counts alone could be used to make predictions, so the weights were eliminated. Predictions are combined by adding the 0 and 1 counts associated with each context. Each counter is 1 byte. PAQ3N by Serge Osnach added sparse models: three order-2 models that skip 1, 2, or 3 bytes of context between the two context bytes. This improves compression of some binary files. PAQ4 uses adaptive linear weighting of models as described earlier. It also introduced a record model, which identifies structures that repeat at regular intervals, as found in spreadsheets, tables, databases, and images, and adds contexts of adjacent bytes in two dimensions. PAQ5 has some minor improvements over PAQ4, including word models for text, models for audio and images, an improved hash table, and modeling of run lengths within contexts. It uses two mixers with different contexts to select their weights, and the two mixer outputs are averaged together. PAQ6 takes options allowing the amount of memory to be selected. It is the basis of a number of forks and dozens of versions, and an early version won the Calgary challenge. Many other models and optimizations were added by Berto Destasio, Johan de Bock, David A. Scott, Fabio Buffoni, Jason Schmidt, Alexandar Rhatushnyak (PAQAR), Przemyslaw Skibinski (PASQDA, with text preprocessing), Rudi Cilibrasi, Pavel Holoborodko, Bill Pettis, Serge Osnach, and Jan Ondrus. The primary difference is a greatly increased number of mixers and SSE chains.

PAQ7 was a complete rewrite. It uses logistic mixing rather than linear mixing, as described earlier. It has models for color BMP, TIFF, and JPEG images. The BMP and TIFF models use adjacent pixels as context. JPEG is already compressed; the model partially undoes the compression back to the DCT (discrete cosine transform) coefficients and uses these as context to predict the Huffman codes. PAQ8A adds an E8E9 preprocessor to improve compression of x86 EXE and DLL files. The preprocessor replaces relative CALL and JMP addresses with absolute addresses, which improves compression because an address may appear multiple times. Many other compressors use this technique. Some versions add an external dictionary to replace words in text files with 1 to 3 byte symbols. This technique was used successfully in the Hutter Prize and in the top ranked programs on the large text benchmark. These were followed by PAQ8A2, PAQ8B, PAQ8C, PAQ8D, PAQ8E, and PAQ8G. PAQ8F adds a more memory efficient context model and a new indirect model in which the byte history within a low order context is modeled by another low order context. As of this writing, development remains active on the PAQ8 series; there have been hundreds of versions with improvements and additional models.

A benchmark on the Calgary corpus was made for the versions of PAQ with major code changes; the many intermediate versions with minor improvements were omitted. Sizes marked with a D use an external English dictionary that must be present during decompression. The size shown does not include the dictionary, so it is artificially low; however, including it would give a size artificially high because the dictionary is not extracted from the test data. All versions of PAQ are archivers that compress in solid mode, exploiting similarities between files. In addition to PAQ8PX, there are 3 additional forks, no longer under active development. Of the hundreds of PAQ versions, no program can decompress files compressed by any other version.
The goal of the proposed ZPAQ standard is to solve this problem. It specifies a format in which a description of the compression algorithm is stored in the compressed archive. The specification includes a reference decoder. The specification does not describe the compression algorithm; however, several compression programs and models are available on the ZPAQ page. There is a ZPAQ program that takes a configuration file to describe the compression algorithm, as well as other programs like ZPIPE that use a fixed compression algorithm. All of these produce files that can be decompressed with the reference decoder or by any of these programs. The standard was published by Matt Mahoney.

ZPAQ describes an archive format, although it may be used for single file compression or memory to memory compression. A compressed stream consists of a sequence of blocks that are independent and can be reordered. Each block starts with a header that describes the decompression algorithm. A block consists of a sequence of compressed segments that must be decompressed in sequence from the start of the block. A segment may be a file or a part of a file. Each segment has an optional file name and an optional comment (file size, timestamp, etc.). If the file name is omitted, then the decompresser must supply it.

An algorithm description consists of a network of components (each making a prediction), a program that computes the bytewise contexts for each component, and an optional program that post-processes the output. Both programs are written in a language called ZPAQL which is compiled or interpreted by the decompresser. If the post-processing program is present, then it is appended to the front of the first uncompressed segment and compressed along with its input data. If not, then the data is compressed directly, with a one byte header to indicate its absence. The components are placed in an array. Each component in the model inputs a context hash and possibly the predictions of earlier components, and outputs a stretched prediction. The output of the last component is squashed and used to arithmetic code the data one bit at a time. The component types include the direct and indirect context models, mixers, SSE, ISSE, and match models described earlier. Contexts are computed by a ZPAQL program that is called once per modeled byte with that byte as input. The program places the context hashes in an array H of 32 bit unsigned values. Only the low bits of each hash are used, depending on the table size of the corresponding component. Because each bit must be modeled, the context hashes are combined with the previous bits of the current byte. This is done by expanding the previous bits to a 9 bit value (for an ICM or ISSE) or an 8 bit value (all other components) and adding it to the bytewise context.

ZPAQL is designed for fast execution rather than ease of programming, and it resembles an assembly language instruction set. A ZPAQL program runs on a virtual machine whose state (registers such as A, B, and D, a program counter, and the arrays H and M) is all initialized to 0 at the start of a block. The sizes of H and M are specified as powers of 2 in the block header. Most instructions are either 1 byte, or 2 bytes including a numeric operand. The post-processor, called PCOMP, if present, is called once per decoded byte with that byte in the A register. At the end of each segment, it is called once more with -1 in A. The decompresser output is whatever is output by the OUT instruction. The context model, called HCOMP, is always present. It is called once per decoded byte and puts its result in H; OUT has no effect in HCOMP.
HCOMP sees as input the PCOMP code (if present) followed by a contiguous stream of segments with no separator symbols.

The ZPAQ program is a development environment for writing and debugging models. It allows programs to be single stepped or run separately from compression for debugging, and it allows passing of numeric arguments and comments in parentheses; otherwise the code is interpreted as the data is compressed or decompressed. Decompression requires the same amount of memory as compression.

The configuration file is divided into 3 sections. COMP describes the arrangement of components. HCOMP contains ZPAQL code that computes the contexts and puts them in H. POST 0 indicates that there is no post-processing. COMP is followed by 5 arguments: hh, hm, ph, pm, n. The n components must be numbered 0 through n-1 in the COMP section. Except for the line numbers, each token compiles to one byte of ZPAQL. For the indirect models, the low 7 bits of a context hash index a table of 16 byte arrays of bit histories, and the next 8 bits are the checksum used to detect collisions.

The HCOMP section computes the contexts and puts them in H. For the direct model it leaves the low bits clear, because the bitwise context will be added to them; this is not a concern for the other contexts because they are hashed. HCOMP is called once after modeling each byte, with that byte in the A register. All state information except A and PC (which is reset to the first instruction) is preserved between calls. The code uses B as a working pointer to compute hashes and D as a pointer into H to store the result.

Components 9 and 10 are an ISSE chain of word-oriented order 0 and 1 contexts for modeling text. These form a separate chain. Generally, the best compression is obtained when each ISSE context contains the lower order context of its input; otherwise the models should be independent and mixed later. The word context is formed by mapping upper and lower case letters together and discarding all other input. The order 0 context is a hash of all of the letters since the beginning of the current word, or 0 if the last byte was not a letter. The order 1 hash combines this with the previous word. Components 11 through 13 are sparse order 1 contexts with a gap of 1 to 3 bytes between the context and the current byte; these are useful for modeling binary files. Component 14 is a model for CCITT binary fax images (PIC in the Calgary corpus). The context is the 8 bits from the previous scan line and 2 additional bits from the second to last scan line. Components 15 and 16 are order 1 and order 0 mixers taking all earlier components as inputs. The second, order 0 mixer has a slower learning rate because it has fewer free parameters. In the ZPAQL code, an "if" is converted to a conditional jump to the matching "else" or "endif". If the test (comparing the input byte against the ASCII codes for letters) passes, then the two word hashes are updated; JMP 9 skips 9 bytes, past the commented "else" clause. Results for the Calgary corpus were measured as a solid archive when supported.

Crinkler, by Stubbe and Aske Simon Christensen, is a tool for producing small, self extracting x86 executable programs. It is optimized for producing demos, or other programs limited to a few kilobytes. It works like UPX, extracting to memory and then running itself. But for this application it is important to use a much better compression algorithm, and the decompression code must be very small, in this case a small amount of size-optimized x86 assembler. Compression speed is much less important. As explained by Alexandre Mutel, Crinkler is based on PAQ1-style linear mixing with sparse models and static weights tuned by the compressor. Model weights are powers of 2.
Up to 32 model weights are delta coded and packed into 4 bytes, such that a 1 bit means that the weight is twice the previous weight and a 0 means it is the same. Contexts are mapped to a pair of 8-bit counters in a hash table. To make the decompresser as small as possible, there is no hash table collision detection; to compensate, the hash table is made very large, typically hundreds of megabytes. Also, there is no counter overflow detection, which saves 6 bytes of code. The compressor searches model space to find the best compression. It tries adding one model at a time, iterating in bit mirrored order, and keeping a model if it improves compression. Each time a model is added, it tries removing existing models one at a time to see if compression improves further.

Crinkler uses additional tricks to save space prior to compression. It is a size-optimized linker, so it can reorder code and data segments to group similar segments together. It eliminates unnecessary data that some linkers include, such as debugging symbols, and it replaces the DLL symbol table with a small piece of custom code. It compresses the code and data segments separately and applies an E8E9 transform (replacing relative JMP and CALL operands with absolute addresses) to the code segment.

A transform converts data into a sequence of symbols which can be compressed with a simpler or faster model, or one using less memory, such as an order 0 context model. Those symbols still need to be modeled and coded as described in sections 4 and 3 respectively. A transform should ideally be a bijection: for each input, there should be exactly one possible representation. More often, the transform is an injection and its inverse a surjection: an input may have more than one valid representation, any of which is transformed back to the original data during decompression. This increases the information content of the transformed data, because the arbitrary choice of representation has to be modeled and coded, taking up space.

A run length code replaces a long repeating sequence of identical symbols with two symbols representing a count and the value to be repeated. For example, the string "AAAAA" would be coded as (5, "A").

In LZ77 compression, duplicate strings in the input are replaced by pointers to previous occurrences. LZ77 is not a bijection, because a given string can usually be parsed into literals and pointers in more than one way. A pointer consists of an offset and a length. It is allowed for the copied string to overlap the output. For example, "AAAAA" could be coded as ("A", (-1, 4)), meaning write a literal "A" and then go back 1 and copy the next 4 bytes. Thus, LZ77 may also encode run lengths. LZ77 decompression is extremely fast, faster than compression: the compressor must search for matching strings, typically using a hash table or tree, while the decompresser only needs to maintain an output buffer from which to copy repeated strings, writing a copy of its output to the buffer. The name "LZ77" comes from Lempel and Ziv, who described it in a 1977 paper.
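A tiny Python sketch of LZ77 decoding makes the overlap rule concrete. The token format used here ('lit' and 'copy' tuples) is purely illustrative, not any particular file format; copies are performed one byte at a time so that a copy may overlap the data it produces.

    def lz77_decode(tokens):
        """Each token is ('lit', byte) or ('copy', offset, length), where
        offset counts backwards from the current end of the output."""
        out = bytearray()
        for t in tokens:
            if t[0] == 'lit':
                out.append(t[1])
            else:
                _, offset, length = t
                for _ in range(length):
                    out.append(out[-offset])   # may read what we just wrote
        return bytes(out)

    # "AAAAA" coded as one literal plus an overlapping copy
    assert lz77_decode([('lit', ord('A')), ('copy', 1, 4)]) == b'AAAAA'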
LZSS (Lempel-Ziv-Storer-Szymanski) uses 1 bit flags to mark whether the next symbol is a literal or a pointer. LZSS is used in NTFS file compression in Windows when the folder property is set to store files compressed. Its primary goal is to be extremely fast (faster than disk access) rather than to provide good compression. The exact format was not published. A group of 16 flag bits is followed by 16 symbols, which are either 1 byte literals or 2 byte pointers. The offset is variable length with a maximum of 12 bits. Any remaining bits are allocated to the length, which has a minimum value of 3. Thus, after 2K of input, each pointer is a 12 bit offset and a 4 bit length ranging from 3 to 18. Windows indicates that a compressed folder containing the Calgary corpus occupies 1,916,928 bytes. On the large text benchmark, the 1 GB text file enwik9 compresses to a size slightly larger than that of an order 0 coder and about twice the size of zip. Copying enwik9 between 2 uncompressed folders takes 41 seconds on the test machine (a laptop). Copying from a compressed folder to an uncompressed folder takes 35 seconds, i.e. decompression is faster than copying the uncompressed data. Copying from an uncompressed folder to a compressed folder takes 51 seconds, so compression adds only a small amount over the time to copy. The NTFS implementation of LZSS is very similar to lzrw1-a, implemented by Ross Williams; lzrw1-a uses a fixed 12 bit offset and 4 bit length.

The widely popular deflate format is used in zip and gzip and is supported by many other archivers. It is used internally in PNG images, PDF documents, and Java JAR archives. The format is documented in an RFC and supported by the open source zlib library. In the deflate format, pointer offsets range from 1 to 32768 and lengths from 3 to 258. Literals and lengths are coded in a single literal/length alphabet which is Huffman coded, followed by up to 5 extra uncompressed bits of the length. A length code is followed by an offset from a 30 symbol Huffman coded alphabet, followed by up to 13 extra uncompressed bits. The sequence of code lengths defining the Huffman codes is itself compressed by run length encoding, using another Huffman code to encode the lengths and run lengths. That sequence is further compressed by reordering it so that the values most likely to be 0 (not used) are at the end, and sending the sequence only up to the last nonzero value; a 4 bit number indicates the sequence length. The order is: 16, 17, 18, 0, 8, 7, 9, 6, 10, 5, 11, 4, 12, 3, 13, 2, 14, 1, 15. All Huffman codes are packed in LSB to MSB order.

All zip compression options produce compressed data in deflate format, which decompresses at the same speed regardless of the option used (much faster than compression with the same algorithm). The difference is that with the higher options, the compressor spends more time looking for encodings that compress better. A typical implementation will keep a list of 3 byte matches (the shortest allowed) in a hash table and test the following data to find the longest match. With a higher option, the compressor will spend more time searching. It is also sometimes possible to improve compression by encoding a literal even if a match is found, if doing so results in a longer match starting at the next byte. Such testing also increases compression time. Compressed sizes and compression times were measured on the test machine for the 14 file Calgary corpus at each zip option from -1 (fastest) through -9, and for kzip.
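Since deflate is available through zlib, the time and size tradeoff of the compression levels is easy to observe directly. A minimal sketch (the file name is an arbitrary assumption):

    import time, zlib

    data = open('book1', 'rb').read()    # e.g. a file from the Calgary corpus
    for level in (1, 6, 9):              # 6 is zlib's default, like zip's default option
        t = time.time()
        packed = zlib.compress(data, level)
        assert zlib.decompress(packed) == data
        print(level, len(packed), round(time.time() - t, 3), 'sec')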
LZMA (Lempel Ziv Markov Algorithm) is the native compression mode of 7-zip. Although 7-zip is open source, the algorithm was never well documented or described until it was analyzed by Charles Bloom. Compression is improved over simple LZ77 by using a longer history buffer (up to 4 GB), optimal parsing, shorter codes for recently repeated matches (as in LZX), literal exclusion after matches, and arithmetic coding. Simple LZ77 uses greedy parsing: the next byte is always coded using the longest match found in the history buffer. Matias and Sahinalp proved that for any dictionary encoding method with the prefix property, there is a fast O(n) algorithm that optimally parses the input into the minimum number of phrases by looking only at the lengths of the next two phrases considered. A dictionary has the prefix property if, for any phrase in it, all of its prefixes are also in the dictionary. LZ77 and LZW both have this property. However, optimal parsing is more difficult to achieve in practice because phrases may have different code lengths, so the minimum number of phrases does not necessarily give the best compression.

Repeated matches. LZMA uses shorter codes, as in LZX, to code the 3 most recent match offsets in place of a full LZ77 offset. In the case of the most recent offset, the match length can be as small as 1.

Literal exclusion after matches. LZMA uses a "delta literal" to code the first literal after a match. The idea is that the literal to be coded is known to be different from the predicted byte, the byte following the match in the history buffer. An ordinary context model would incorrectly assign a high probability to the predicted byte, which would waste code space. To prevent this, the coded literal is exclusive-ored with the predicted byte.

Coding. LZMA uses a binary arithmetic coder. It first codes one of 4 possible choices: literal, match, rep (repeat of one of the 3 last match offsets), or short rep (repeat the last offset with a length of 1). A literal code is followed by the 8 bits of the coded byte in MSB to LSB order. A match is coded as a length followed by the offset, each using a variable length code in which the length of the binary value is transmitted followed by the value. A rep is followed by a length and a short code indicating the position in a move-to-front queue of the last 3 offsets. A short rep code is complete by itself. Literals are coded using an order 1 context model plus the low bits of the current position. The position is useful for modeling binary files that have a repeating structure that is a power of 2, for example an array of integers or floating point numbers. An order 1 model means that the context for each bit is the 8 bits of the previous byte and the 0 to 7 previously coded bits of the current byte. Match offsets are coded by transmitting the binary length of the offset and then the offset in LSB to MSB order. The length of the offset is coded using the match length as context. The context for the 4 low bits of the offset is the 0 to 3 previously coded bits; the higher order bits are not compressed. The coder also keeps a state, one of 12 values, which is updated after each literal, match, rep, or short rep according to a state table. The 12 states distinguish: a literal after 2 or more literals; a literal after a match then a literal; a literal after a rep then a literal, or after a non-literal, short rep, literal; a literal after a literal, short rep, literal; a delta literal after a match; a delta literal after a rep, or after a non-literal then a short rep; a delta literal after a literal then a short rep; a match after a literal; a rep after a literal; a short rep after a literal; a match after a non-literal; and a rep or short rep after a non-literal.
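The delta literal trick described above is easy to state in code. This sketch shows only the byte-level idea; LZMA actually codes the result bit by bit under the arithmetic coder, using the predicted byte's bits as additional context.

    def encode_delta_literal(literal, predicted):
        # the decoder also knows `predicted` (the byte that would have
        # extended the match), so XOR-ing is reversible; since the literal
        # is guaranteed to differ from it, the coded value 0 never occurs
        return literal ^ predicted

    def decode_delta_literal(coded, predicted):
        return coded ^ predicted

    assert decode_delta_literal(encode_delta_literal(0x41, 0x42), 0x42) == 0x41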
LZX is an LZ77 variant used in Microsoft CAB files and compressed help (CHM) files. It uses a history buffer of up to 2 MB and Huffman coding. To improve compression, it uses shorter codes to code the 3 most recent matches, as with LZMA. The idea of using shorter codes for recent matches can be extended. The compressor for lzrw3 builds a dictionary (a hash table of pointers into the history buffer) as usual to find matching strings, but instead of coding the offset, it codes the index into the table. The decompresser builds an identical hash table from the data it has already decompressed, then uses the index to find the match. The length is coded as usual. ROLZ (reduced offset LZ) extends this idea further by replacing the large hash table with many smaller hash tables selected by a low order context. This reduces the size of the offset, although it can sometimes cause the best match to be missed. ROLZ was implemented in WinRK. The extreme case of ROLZ is to use one element per hash table. In this case, only a literal or a length must be coded. This algorithm is called LZP. It was first described by Charles Bloom. LZP works best with a high order context, so it is often used as a preprocessor to a low or moderate order context model rather than with a fast order 0 model.

snappy is a free, open source (Apache licensed) compression library from Google. It uses byte aligned LZ77 intended for high speed, especially on x86 hardware, rather than good compression. Google uses snappy internally to compress its data structures for its search engine. Short literals are coded with a length field in the upper 6 bits of a tag byte; longer literals are coded by storing a special value in the upper 6 bits to indicate that the length is encoded in the next 1 to 4 bytes in little-endian (LSB first) format. This is followed by the uncompressed literals. Matches of length 4 to 11 with offsets of 1 to 2047 are encoded using a 1 byte match: the match length is stored in the middle 3 bits of the tag byte, the most significant 3 bits of the offset are stored in the upper 3 bits of the tag byte, and the lower 8 bits of the offset are stored in the next byte. A match may overlap the area to be copied. In a 2 byte match, the length is encoded in the high 6 bits of the tag byte and the offset is stored in the next 2 bytes with the least significant byte first. Longer matches are broken into matches of length 64 or less. A 4 byte match allows 32 bit offsets to be encoded, as with a 2 byte match. The decompresser will decode them, but the compressor does not produce them because the input is compressed in 32K blocks such that a match does not span a block boundary. The entire sequence of matches and literals is preceded by the uncompressed length, written in base 128, LSB first, using 1 to 5 digits in the low 7 bits of each byte; the high bit is 1 to indicate that more digits follow.

Compression searches for matches by comparing a hash of the 4 current bytes with previous occurrences of the same hash earlier in the 32K block. The hash function interprets the 4 bytes as a 32 bit value, LSB first, multiplies by 0x1e35a7bd, and shifts out the low bits. The hash table size is the smallest power of 2, up to a fixed maximum, that is at least as large as the input; this is a speed optimization to reduce the time to clear the hash table for small inputs. As an optimization for hard to compress data, after 32 failures to find a match, the compressor checks only every second location in the input for the next 32 tests, then every third for the next 32 tests, and so on. When it finds a match, it goes back to testing every location. As another optimization for the x86 architecture, short copies are done using two 64 bit assignments rather than memcpy. To support this, if 15 or fewer bytes remain after a match then they are encoded as literals with no further search.
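A small Python sketch of the match-finding hash just described. The table size and the reporting of raw (position, previous position) candidates are illustrative choices; the real compressor immediately extends each candidate into the longest match and emits it.

    def snappy_hash(data, i, shift):
        """Hash the 4 bytes at position i: read them as a 32 bit
        little-endian value, multiply by 0x1e35a7bd, shift out the low bits."""
        x = int.from_bytes(data[i:i + 4], 'little')
        return ((x * 0x1e35a7bd) & 0xFFFFFFFF) >> shift

    def find_match_candidates(data, table_bits=14):
        """Toy match finder over a block of bytes."""
        shift = 32 - table_bits
        table = [-1] * (1 << table_bits)
        candidates = []
        for i in range(len(data) - 3):
            h = snappy_hash(data, i, shift)
            j = table[h]
            if j >= 0 and data[j:j + 4] == data[i:i + 4]:
                candidates.append((i, j))
            table[h] = i
        return candidates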
Deduplication is the application of LZ77 to a file system rather than a data stream. The idea is to find duplicate files, or files containing large blocks of data duplicated elsewhere, and replace them with pointers. Deduplication can save disk space, but is more commonly used for incremental backups where only data that has changed is saved. Whole file deduplication is easy to implement. If a file system enforces a "last modified" timestamp or an "archive" flag, then files that have not been modified since the last backup need not be saved. Otherwise, or for on-disk deduplication, a secure hash such as SHA-1 of each file can be computed. If two files have the same 160 bit hash, then it is safe to assume that the contents are identical. Furthermore, it is believed to be computationally infeasible to deliberately produce a collision.

Compression can be improved by dividing large files into smaller blocks and searching for duplicates among them. This can be done efficiently by computing a hash of each block and saving pointers to the blocks in a hash table. To save space, it might be desirable to use a smaller but insecure hash and check for potential collisions by comparing the blocks. Dividing large files into fixed sized blocks (typically 8 KB to 64 KB) will not find duplicates between versions in which a small amount of data was inserted or deleted near the beginning, because the data afterward would have moved relative to the block boundaries. This problem can be solved by selecting boundaries that depend on a rolling hash function over a sliding window: we set a boundary whenever n bits of the hash are equal to a fixed value, so the average block size will be 2^n.

A dictionary code may be fixed, static, or dynamic. In the fixed case, the dictionary is specified as part of the algorithm. In the static case, the compressor analyzes the input, constructs a dictionary, and transmits it to the decompresser. In the dynamic case, both the compressor and the decompresser construct identical dictionaries from past data using identical algorithms.

LZW (Lempel-Ziv-Welch) is a dynamic dictionary method. It is used by the UNIX compress program and GIF images, and is one of the compressed modes in TIFF images. The algorithm was patented by both Sperry (later Unisys) and by IBM, the USPTO not realizing that the two patents covered the same algorithm. Unisys was criticized for waiting until GIF became an established standard before demanding royalties from makers of software that could read or write GIF images. Both patents are now expired. LZW starts with a dictionary of 256 byte symbols. It parses the input into the longest possible strings that match a dictionary entry, then replaces each string with its index. After each encoding, that string plus the byte that follows it is added to the dictionary. Codes grow with the dictionary: while the size is at most 512, each code has 9 bits; up to 1024, each code has 10 bits; and so on. A Windows version of the UNIX compress program compresses the Calgary corpus to files totaling 1,272,722 bytes. Other variants of LZW may use larger dictionaries, other replacement strategies such as LRU (least recently used), or other strategies for adding new symbols, such as concatenating the last two coded symbols instead of a symbol plus the next byte.
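A minimal LZW encoder following the description above. Packing the codes into 9, 10, ... bits as the dictionary grows is omitted (the codes are returned as a list of integers), and there is no dictionary reset or replacement strategy.

    def lzw_encode(data):
        dictionary = {bytes([i]): i for i in range(256)}   # initial 256 byte symbols
        codes = []
        s = b''
        for c in data:
            t = s + bytes([c])
            if t in dictionary:
                s = t                                 # keep extending the match
            else:
                codes.append(dictionary[s])           # emit the longest match
                dictionary[t] = len(dictionary)       # add match + next byte
                s = bytes([c])
        if s:
            codes.append(dictionary[s])
        return codes

    # lzw_encode(b'ABABABA') -> [65, 66, 256, 258]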
Dictionary encoding improves the compression of text files by replacing whole words with symbols of 1 to 3 bytes. Fixed English dictionaries are used in WinRK, durilca, and some versions of PAQ such as PAsQDa. It is, of course, possible to compress to arbitrarily small sizes using this technique. The extreme case is barf. It has a built in 14 word dictionary, one for each file of the Calgary corpus. When the compressor detects a match, it "compresses" the file to 1 byte, which the decompresser correctly expands. For this reason, the large text benchmark and contests like the Calgary challenge and the Hutter Prize include the size of the decompression program and all other files needed to decompress. Still, it may be useful to use a dictionary for one or more languages if the input is expected to contain text in those languages.

Of more interest are static dictionaries. These are used by the top 3 programs on the large text benchmark and by all of the Hutter Prize winners. Some of the later Calgary challenge winners also use small dictionaries. Recall that text compression is an AI problem: humans can beat computers at the prediction game because the problem requires a vast understanding of English and of the world. Nevertheless, some of the constraints of natural language can be modeled. These constraints fall roughly into lexical, semantic, and grammatical categories.

While playing the game, you will notice that useful contexts start on word boundaries. Thus, text models in PAQ construct contexts that begin on word boundaries. It should be irrelevant whether a context spans a line break, so word contexts in PAQ discard the characters between words. Furthermore, it should be irrelevant whether the context is upper or lower case, because case does not change the meaning; thus, there are word models that ignore case.

Semantics can be modeled by associating each pair of words, like (cat, mouse), with a co-occurrence frequency over a small window. Words that frequently occur near each other tend to have related meanings. This can be modeled with a sparse order-1 word model, skipping one or more words between the context and the predicted word. Many PAQ versions have sparse word models with small gaps of 1 to 3 words. For modeling semantics, it is useful to split text into "meaningful" units or morphemes if possible. For example, "moves" really has two independent parts, the stem "move" and the suffix "s". Ideally these should be modeled as separate words.

Grammar constrains text to make certain sequences more likely, such as "the NOUN" or "a ADJ NOUN". It is possible to learn the parts of speech by observing when words occur in similar contexts and grouping them. For example, "the dog", "the cat", and "a dog" could be used to predict the unseen sequence "a cat". This works by grouping "the" and "a" into one semantic category and "dog" and "cat" into another category.

A dictionary transform works by replacing the input text with a sequence of highly predictive symbols corresponding to morphemes, independent of capitalization and punctuation. This improves compression both by allowing simpler models to be used and by reducing the size of the input, which improves speed and reduces pressure on memory. Compression can be improved further by grouping semantically or grammatically related words, so that the compressor can merge them into single contexts by ignoring the low bits of the dictionary codes. Care should be taken not to remove useful context, which can happen if a dictionary is too large or divides words in the wrong places. For example, coding "move" and "remove" as unrelated symbols would not help compression.

A capitalization model replaces upper case letters with a special symbol followed by a lower case letter. For example, "The" could be coded as "Athe", where "A" directs the decompresser to capitalize the next letter.
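A sketch of such a capitalization transform. The escape byte chosen here is an assumption; a real transform must use a symbol that cannot otherwise appear in the input, or escape it when it does.

    CAP = b'\x01'   # assumed escape byte marking "capitalize the next letter"

    def encode_caps(text: bytes) -> bytes:
        out = bytearray()
        for c in text:
            if 65 <= c <= 90:            # 'A'..'Z'
                out += CAP
                out.append(c + 32)       # emit the lower case letter
            else:
                out.append(c)
        return bytes(out)

    def decode_caps(data: bytes) -> bytes:
        out = bytearray()
        cap = False
        for c in data:
            if c == CAP[0]:
                cap = True
            elif cap:
                out.append(c - 32 if 97 <= c <= 122 else c)
                cap = False
            else:
                out.append(c)
        return bytes(out)

    sample = b'The cat. The dog.'
    assert decode_caps(encode_caps(sample)) == sample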
Alternatively, a more sophisticated model might automatically capitalize the first letter after a period, and insert a special symbol to encode exceptions.

Because newlines are semantically equivalent to spaces, it is sometimes useful to replace them with spaces and to encode the information needed to put them back in a separate stream. A simple transform is space stuffing, where a space is inserted after every newline. Space stuffing does not help with multi-word contexts that span lines. However, the alternative is to remove context that could predict newlines, such as periods at the end of a paragraph.

Word encoding is done in two passes. In the first pass, the text is parsed into words (sequences of A-Z, a-z, and possibly UTF-8 characters in non-English alphabets) and the words are counted. Words with counts below a threshold are discarded. In the second pass, words found in the dictionary are replaced with 1 or 2 byte codes (or 3 bytes for large dictionaries). The dictionary is listed at the beginning of the output, followed by the encoded data. Words not found in the dictionary, and non-letters, are passed through unchanged.
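A sketch of the two pass scheme, with illustrative thresholds and a deliberately simple code format (an escape byte, a decimal index, and a terminator byte). A real transform would emit compact 1 to 3 byte codes and would also write out the dictionary so the decoder can invert the substitution; the marker bytes used here are assumptions and must not otherwise occur in the text.

    import re
    from collections import Counter

    def build_dictionary(text, min_count=5, max_words=4000):
        # first pass: count words and keep the frequent ones
        counts = Counter(re.findall(r'[A-Za-z]+', text))
        frequent = [w for w, n in counts.most_common(max_words) if n >= min_count]
        return {w: i for i, w in enumerate(frequent)}

    def encode_words(text, dictionary):
        # second pass: replace dictionary words with short codes
        def sub(m):
            i = dictionary.get(m.group(0))
            return m.group(0) if i is None else '\x02%d\x03' % i
        return re.sub(r'[A-Za-z]+', sub, text)

    sample = "the cat sat on the mat because the cat likes the mat " * 10
    d = build_dictionary(sample)
    print(encode_words(sample, d)[:60])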
