📡

Information Theory Foundations

Entropy, bits, and the science of communication

This pack distills the core principles of information theory from Shannon onward, including entropy, channel capacity, compression, and error correction. It connects these ideas to modern applications in data science, cryptography, biology, and cognition. Designed for professionals and thinkers seeking rigorous conceptual tools rather than mathematical proofs.

10 documents · sourced from Ilan Shomorony · Tai-Danae Bradley / Entropy as a Topological Operad Derivation / arXiv:2107.09581v2 · Ioannis Haranas · Shannon’s source coding theorem (Perplexity web research) / arXiv 1511.06071v2 · Erik Agrell / The Channel Capacity Increases with Power / arXiv:1108.0391v3 · arXiv 2308.05472v1 / arXiv 2007.07507v1 / Perplexity web research on noisy channel coding theorem · Chang · Sudipto Mukherjee · Nilanjana Datta · Perplexity web research on entropy and information measures in ML


What’s inside

The Birth of Information Theory: Shannon's Foundational 1948 Paper

Ilan Shomorony, Reinhard Heckel / Information-Theoretic Foundations of DNA Data Storage / arXiv:2211.05552v1

Recent research extends information-theoretic principles into specialized domains with rigorous analysis of fundamental limits. In DNA data storage, millions of short noisy sequences must reliably encode and reconstruct data despite synthesis, sequencing, and decay errors; probabilistic models derived from current technological constraints quantify achievable storage rates under these conditions. Group testing addresses sparse defective-item identification through pooled tests, yielding optimal rates via achievability bounds for efficient algorithms and matching converse bounds that hold across sparsity regimes, with explicit constant-factor characterizations of information learned per test. Quantum theory receives an informational treatment that frames its axioms through operational tasks and foils, clarifying distinctions between classical and quantum resources. Homotopy type theory supplies univalent foundations by identifying isomorphic structures via the univalence axiom and by using higher inductive types to define homotopy-theoretic objects directly inside type theory, enabling invariant conceptions of mathematical structures inaccessible in set-theoretic settings. Collectively these developments demonstrate how core information measures govern capacity, inference, and foundational reasoning across physical and abstract systems.

Entropy: Measuring Uncertainty in Information Systems

Tai-Danae Bradley / Entropy as a Topological Operad Derivation / arXiv:2107.09581v2

Shannon entropy for a discrete random variable X equals the negative sum over outcomes x of p(x) log base two of p(x); it is the expected information content, or surprise, of a realized outcome when the distribution is already known. Each outcome contributes its surprise, negative log base two of p(x), which grows as the probability falls, weighted by how often that outcome occurs. The average is therefore largest when probability is spread uniformly across many outcomes and smallest when the source is highly predictable. In the operad of topological simplices, Shannon entropy satisfies the derivation property with respect to an abelian bimodule over the operad; for every derivation of this operad there is a point at which it coincides with a constant multiple of Shannon entropy, recovering and extending Faddeev’s 1956 axiomatic characterization together with Leinster’s recent variant. For any N-point distribution whose Rényi entropies of orders two and three are known, explicit lower and upper bounds on Shannon entropy follow directly, and their average supplies a concrete extrapolation formula that also relates the von Neumann entropy of a mixed quantum state to its linear entropy. In continuous time-dependent densities the same Gibbs-Shannon functional evolves differently from Kullback-Leibler divergence under both classical Smoluchowski and quantum Schrödinger dynamics, exposing power-transfer processes inaccessible to relative entropy alone, while successive-measurement uncertainty relations formulated with Rényi or Tsallis entropies retain state-independent lower bounds set by apparatus acceptance functions.
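As a rough illustration of the definition above, the following Python sketch computes entropy in bits for a distribution supplied as a list of probabilities. The function name shannon_entropy and the two example distributions are illustrative choices, not taken from the cited paper.

```python
import math

def shannon_entropy(probs):
    """Return H(X) in bits for a discrete distribution given as a list of probabilities."""
    if any(p < 0 for p in probs) or not math.isclose(sum(probs), 1.0, abs_tol=1e-9):
        raise ValueError("probs must be non-negative and sum to 1")
    # Outcomes with p == 0 are skipped, following the convention 0 * log(0) = 0.
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical four-outcome sources: one uniform, one highly predictable.
uniform = shannon_entropy([0.25, 0.25, 0.25, 0.25])
skewed = shannon_entropy([0.97, 0.01, 0.01, 0.01])

print(f"uniform: {uniform:.3f} bits, skewed: {skewed:.3f} bits")
```

The uniform source reaches the maximum of two bits for four outcomes, while the skewed source comes out well below it, matching the claim that entropy falls as a source becomes predictable.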

Bits as the Atomic Unit of Information

Ioannis Haranas, Ioannis Gkigkitzis / arXiv:1406.3041v1

A bit is the atomic unit for quantifying information, defined precisely as the entropy of a fair binary alternative whose two outcomes each have probability one half. Shannon entropy for any discrete random variable equals the negative sum over each outcome of its probability multiplied by the base-two logarithm of that probability, producing an average information measure expressed in bits. For two equiprobable outcomes the resulting value is exactly one bit; in general entropy can be any non-negative real number of bits, and a fractional value simply means that an outcome carries, on average, less information than one fair binary choice. With b bits it becomes possible to distinguish among two to the power b distinct messages, so that specifying one outcome among N equally likely alternatives requires log base two of N bits. In the cited paper, a number of information bits N enters explicit relations derived from the cosmological constant, Planck constant, speed of light and gravitational constant, linking it to a minimum quantum mass of order 2.0 times ten to the minus sixty two kilograms and a maximum gravitational mass of order 2.3 times ten to the fifty four kilograms; the same framework produces the large number ten to the one hundred twenty two from fundamental cosmological parameters. Related constructs such as mutual information, quantum bit error rate and maximal alpha-leakage quantify information transfer, error and leakage in concrete channels; maximal alpha-leakage recovers ordinary mutual information when its tunable parameter alpha equals one.
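The counting argument above can be made concrete with a small Python sketch. The helper names (bits_for_alternatives, whole_bits_to_label, binary_entropy) are hypothetical and chosen only for illustration; the sketch contrasts the information in one of N equally likely outcomes with the whole number of binary digits needed to label N messages, and shows a biased coin carrying a fraction of a bit.

```python
import math

def bits_for_alternatives(n):
    """Information, in bits, of one outcome among n equally likely alternatives."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return math.log2(n)

def whole_bits_to_label(n):
    """Smallest whole number b of binary digits such that 2**b >= n."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return (n - 1).bit_length()

def binary_entropy(p):
    """Entropy in bits of a binary outcome that occurs with probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

print(bits_for_alternatives(8))    # one of eight equally likely messages
print(whole_bits_to_label(10))     # ten messages need a fourth binary digit
print(binary_entropy(0.5))         # fair coin
print(binary_entropy(0.9))         # biased coin: a fraction of a bit
```

A biased coin with probability 0.9 still conveys information on each toss, just less than a fair one, which is what a fractional number of bits expresses.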

Source Coding Theorem and Lossless Compression

Shannon’s source coding theorem (Perplexity web research) / arXiv 1511.06071v2

Shannon's source coding theorem establishes that a discrete memoryless source X with Shannon entropy H(X) admits asymptotically lossless block compression at any average rate above H(X) bits per symbol, and at no rate below it. Blocks of N symbols can be encoded with total length slightly above N H(X) while the error probability vanishes as N tends to infinity, yet any scheme whose rate lies strictly below H(X) must incur information loss with high probability. The same limit holds for stationary ergodic sources once H(X) is replaced by the entropy rate h, the limit of (1/n) H(X1 … Xn) as n tends to infinity. For ordinary prefix codes on individual symbols, the expected length L of an optimal code satisfies H(X) ≤ L < H(X) + 1. These classical statements receive a quantum extension in the coded-source-compression setting of arXiv 1511.06071v2, where measurement compression supplies the direct-coding argument that yields a single-letter characterization of the achievable rate region when a helper sends quantum side information over a classical channel; separate measurement followed by compression is shown to be strictly suboptimal.
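The symbol-by-symbol bound H(X) ≤ L < H(X) + 1 can be checked numerically. The sketch below builds binary Huffman codeword lengths, a standard construction of an optimal prefix code that the text above does not itself name, and compares the expected length with the entropy. The function name huffman_lengths and the example probabilities are assumptions made for illustration.

```python
import heapq
import math

def huffman_lengths(probs):
    """Return codeword lengths of a binary Huffman code for the given probabilities."""
    if len(probs) == 1:
        return [1]
    # Heap entries: (subtree probability, unique tie-breaker, symbol indices in subtree).
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    counter = len(probs)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        for i in s1 + s2:
            lengths[i] += 1  # each symbol in the merged subtree moves one level deeper
        heapq.heappush(heap, (p1 + p2, counter, s1 + s2))
        counter += 1
    return lengths

def entropy_bits(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical source distributions: one dyadic, one not.
for probs in ([0.5, 0.25, 0.125, 0.125], [0.4, 0.3, 0.2, 0.1]):
    lengths = huffman_lengths(probs)
    avg_len = sum(p * l for p, l in zip(probs, lengths))
    print(lengths, f"H = {entropy_bits(probs):.3f}", f"L = {avg_len:.3f}")
```

For the dyadic distribution the expected length meets the entropy exactly; for the second it lies strictly between H(X) and H(X) + 1.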

Channel Capacity: The Maximum Rate of Reliable Transmission

Erik Agrell / The Channel Capacity Increases with Power / arXiv:1108.0391v3

Channel capacity is defined as the maximum reliable information rate of a channel, given mathematically by the maximum of the mutual information between channel input and output over all possible input distributions. For a discrete memoryless channel with input X and output Y, this capacity equals the supremum of I(X;Y) taken over all input distributions p_X(x). Operationally, it is the supremum of all rates at which information can be transmitted with arbitrarily small error probability, meaning any rate below capacity is achievable with vanishing error using long enough codes. For symmetric channels, the capacity calculation simplifies with the optimal input distribution being uniform. Specific cases include the binary symmetric channel where capacity is one minus the binary entropy of the crossover probability, and the binary erasure channel with capacity one minus the erasure probability. For continuous-time additive white Gaussian noise channels that are bandlimited and power constrained, the capacity follows the Shannon-Hartley formula as bandwidth times the base-two logarithm of one plus the signal to noise ratio. It has been proved that for memoryless vector channels the capacity cannot decrease with increasing average transmitted power.
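The closed-form capacities named above translate directly into a few lines of Python. This is a minimal sketch; the function names and the numeric inputs (a 3 kHz channel at a linear signal-to-noise ratio of 1000) are hypothetical examples rather than values from the cited paper.

```python
import math

def binary_entropy(p):
    """Entropy in bits of a binary outcome that occurs with probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bsc_capacity(crossover):
    """Binary symmetric channel: one minus the binary entropy of the crossover probability."""
    return 1.0 - binary_entropy(crossover)

def bec_capacity(erasure):
    """Binary erasure channel: one minus the erasure probability."""
    return 1.0 - erasure

def awgn_capacity(bandwidth_hz, snr_linear):
    """Shannon-Hartley capacity in bits per second; snr_linear is a ratio, not decibels."""
    return bandwidth_hz * math.log2(1.0 + snr_linear)

print(f"BSC, crossover 0.11: {bsc_capacity(0.11):.3f} bits per use")
print(f"BEC, erasure 0.25:   {bec_capacity(0.25):.3f} bits per use")
print(f"AWGN, 3 kHz, SNR 1000: {awgn_capacity(3000, 1000):.0f} bits per second")
```

Note that a crossover probability of one half drives the binary symmetric channel capacity to zero, since the output is then independent of the input.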

Noisy Channel Coding Theorem Explained

arXiv 2308.05472v1 / arXiv 2007.07507v1 / Perplexity web research on noisy channel coding theorem

The noisy channel coding theorem establishes that every discrete memoryless channel has a finite capacity C with two properties: at any rate R below C, appropriate coding drives the error probability arbitrarily close to zero, whereas at any rate R above C the error probability stays bounded away from zero. Noise therefore imposes a strict rate limit rather than rendering communication impossible, and the redundancy carried by sufficiently long codewords is what enables error correction below capacity. Research on polarization-adjusted convolutional codes demonstrates that these concatenated schemes, built on polar codes, approach the finite-length bound for binary-input additive white Gaussian noise channels at short blocklengths and extend similarly to source and joint source-channel coding tasks. Analyses of noisy permutation channels, formed by a discrete memoryless channel followed by an independent random permutation, yield matching lower and upper bounds on capacity for strictly positive full-rank stochastic matrices, expressed via matrix rank for achievability and entrywise positivity for converses. Entanglement-assisted zero-error source-channel coding results further show that the Lovász theta number upper-bounds the Shannon capacity even with entanglement, while Szegedy's number supplies lower bounds on achievable rates in the corresponding graph-theoretic formulations. Capacity characterizations for the linear deterministic interference channel with noisy output feedback likewise identify asymmetric scenarios where feedback on one link matches the sum-rate gains of bilateral feedback.
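To see why the theorem is surprising, it helps to look at the naive alternative. The sketch below computes the exact decoding error of an n-fold repetition code with majority voting over a binary symmetric channel; the function name and the crossover probability of 0.1 are illustrative assumptions. The repetition code is not one of the schemes discussed above: it is shown only as a contrast, because its error falls only as its rate 1/n also falls toward zero, whereas the theorem guarantees vanishing error at any fixed rate below C.

```python
import math

def repetition_error_prob(n, p):
    """Probability that majority voting over n (odd) channel uses decodes the wrong bit."""
    if n % 2 == 0:
        raise ValueError("n must be odd so that the majority is always defined")
    # Decoding fails when more than half of the n transmitted copies are flipped.
    return sum(
        math.comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )

p = 0.1  # hypothetical crossover probability
for n in (1, 3, 5, 7, 9):
    print(f"n = {n}: rate = {1 / n:.3f}, error = {repetition_error_prob(n, p):.6f}")
```

For comparison, the capacity of this channel is one minus the binary entropy of 0.1, roughly 0.53 bits per use, so well-designed long codes can in principle hold the rate near that value while still driving the error down.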

Error Detection and Correction Codes

Chang, Hsun-Hsien / An Introduction to Error-Correcting Codes: From Classical to Quantum / arXiv:quant-ph/0602157v1

Error detection and correction codes rest on one principle: adding redundancy to combat noise, in both classical and quantum settings. Classical block codes map a fixed number of information bits to longer codewords, while convolutional codes also incorporate memory of prior bits. Among specific constructions, repetition codes enable majority voting; parity bits detect basic inconsistencies; Hamming codes correct single errors, as in the (7,4) example; BCH codes are cyclic codes that correct multiple errors; Reed-Solomon codes handle burst errors effectively in practical systems; and Hadamard codes offer a large minimum distance for noisy environments. These constructions derive from information-theoretic approaches to channel reliability. In quantum contexts, entanglement-assisted quantum error-correcting codes offer structured protection by leveraging shared entanglement, as introduced in arXiv 1610.04013v1. Continuous-time quantum error correction models both decoherence and feedback operations as ongoing processes using weak measurements, reducing to single-qubit protection strategies, according to arXiv 1311.2485v2. Memory-oriented unidirectional codes minimize check bits and are tailored to byte-organized systems, per arXiv 1002.1191v1. Surveys bridging classical and quantum methods highlight their common reliance on redundancy for detectability and correctability conditions, as presented in arXiv quant-ph/0602157v1.
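The Hamming (7,4) example mentioned above is small enough to write out in full. The following Python sketch uses the common layout with parity bits at positions 1, 2 and 4, so that the syndrome directly gives the position of a single flipped bit. The function names and bit ordering are one conventional choice, not necessarily the presentation used in the cited survey.

```python
def hamming74_encode(data):
    """Encode four data bits [d1, d2, d3, d4] as a 7-bit codeword (positions 1 to 7)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_decode(word):
    """Correct at most one flipped bit and return the four data bits."""
    w = list(word)
    s1 = w[0] ^ w[2] ^ w[4] ^ w[6]
    s2 = w[1] ^ w[2] ^ w[5] ^ w[6]
    s3 = w[3] ^ w[4] ^ w[5] ^ w[6]
    error_pos = s1 + 2 * s2 + 4 * s3  # 0 means no error detected
    if error_pos:
        w[error_pos - 1] ^= 1  # flip the bit the syndrome points at
    return [w[2], w[4], w[5], w[6]]

message = [1, 0, 1, 1]
codeword = hamming74_encode(message)
corrupted = list(codeword)
corrupted[4] ^= 1  # simulate a single-bit channel error at position 5
print(codeword, corrupted, hamming74_decode(corrupted))
```

Three parity bits protect four data bits, so the code rate is 4/7. Two simultaneous bit errors would be miscorrected, which is the price of this small amount of redundancy.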

Mutual Information and Dependence Measures

Sudipto Mukherjee, Himanshu Asnani, Sreeram Kannan / arXiv:1906.01824v1

Mutual information quantifies dependence between random variables X and Y by the Kullback-Leibler divergence between the joint distribution p_XY and the product of marginals p_X p_Y, so that the quantity equals zero if and only if the variables are independent. This definition registers any departure from independence and therefore captures nonlinear relationships in addition to linear ones. For discrete variables the same quantity equals the difference H(X) + H(Y) − H(X,Y), showing that dependence reduces joint entropy below the sum of the separate entropies. Classifier-based estimators obtain the divergence by training a model to separate samples from the joint versus the product distribution, then extend the construction to conditional mutual information via conditional generative models; the resulting estimators maintain accuracy as dimension grows, unlike k-nearest-neighbor or kernel baselines. Ensemble estimators formed as weighted sums of kernel plug-in estimators with different bandwidths attain the parametric 1/N mean-squared-error rate for generalized mutual information when the relevant conditional densities are sufficiently smooth, including the mixed discrete-continuous setting. Finite-time mutual information between correlated Gaussian processes, obtained via Mercer expansions of trace-class operators, can exceed the long-term average rate inside a single observation window. The same operator framework yields mutual-information expressions for spatially continuous electromagnetic channels modeled by random fields under both white and colored noise.
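For small discrete distributions, the divergence-based definition and the entropy identity above can both be evaluated directly. The sketch below is a plug-in calculation on a fully known joint table, not one of the classifier-based or ensemble estimators described above; the function names and the three example tables are assumptions made for illustration.

```python
import math

def entropy_bits(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def mutual_information(joint):
    """I(X;Y) in bits from a joint pmf given as a list of rows, joint[x][y]."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    mi = 0.0
    for x, row in enumerate(joint):
        for y, pxy in enumerate(row):
            if pxy > 0:
                # KL divergence between the joint and the product of marginals.
                mi += pxy * math.log2(pxy / (px[x] * py[y]))
    return mi

independent = [[0.25, 0.25], [0.25, 0.25]]  # X and Y unrelated
identical = [[0.5, 0.0], [0.0, 0.5]]        # Y is a copy of X
noisy = [[0.4, 0.1], [0.1, 0.4]]            # Y agrees with X 80 percent of the time

for name, joint in (("independent", independent), ("identical", identical), ("noisy", noisy)):
    print(f"{name}: I(X;Y) = {mutual_information(joint):.3f} bits")
```

The independent table gives zero, the copied variable gives one full bit, and the noisy case falls in between, consistent with H(X) + H(Y) − H(X,Y).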

Rate-Distortion Theory for Lossy Compression

Nilanjana Datta, Min-Hsiu Hsieh, Mark M. Wilde / Quantum rate distortion, reverse Shannon theorems, and source-channel separation / arXiv:1108.4940v3

Rate-distortion theory characterizes the fundamental limits of lossy compression by identifying the minimum rate needed to represent a source while keeping expected distortion at or below a chosen level D. For a source X with known distribution, the rate-distortion function R(D) equals the minimum mutual information I(X; hat X) over all conditional distributions p(hat x|x) satisfying the constraint E[d(X, hat X)] less than or equal to D, where d is a distortion measure such as squared error. This expression follows directly from Shannon’s rate-distortion theorem and supplies the greatest lower bound on achievable compression rate for any encoding and reconstruction scheme operating under the same source statistics and fidelity criterion. The resulting R(D) is non-increasing in D, tracing the precise trade-off between rate and quality; for a discrete source under a distortion measure that is zero only for exact reconstruction, such as Hamming distortion, R(0) equals the source entropy, recovering the lossless case as a boundary. The value of R(D) depends on both the source distribution and the chosen distortion metric, and no practical lossy compressor can beat this bound for the given conditions. Extensions to quantum sources replace the classical mutual information with the regularized entanglement of purification, while secrecy formulations incorporate an adversary’s distortion as an additional figure of merit, yet the core single-letter minimization remains the classical benchmark against which all such generalizations are compared.
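One case where the minimization above has a well-known closed form is a Bernoulli(p) source under Hamming distortion, for which R(D) equals h(p) − h(D) when D is below min(p, 1 − p) and zero otherwise, with h the binary entropy function. The sketch below evaluates that textbook formula; this particular source and the function names are chosen for illustration and do not come from the cited quantum paper.

```python
import math

def binary_entropy(p):
    """Entropy in bits of a binary outcome that occurs with probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bernoulli_rate_distortion(p, d):
    """R(D) in bits per symbol for a Bernoulli(p) source under Hamming distortion."""
    if not 0.0 <= p <= 1.0 or d < 0.0:
        raise ValueError("need 0 <= p <= 1 and d >= 0")
    # Always guessing the likelier symbol costs zero rate and errs min(p, 1 - p) of the time.
    d_max = min(p, 1.0 - p)
    if d >= d_max:
        return 0.0
    return binary_entropy(p) - binary_entropy(d)

p = 0.5  # hypothetical fair binary source
for d in (0.0, 0.05, 0.1, 0.25, 0.5):
    print(f"D = {d:.2f}: R(D) = {bernoulli_rate_distortion(p, d):.3f} bits per symbol")
```

At D equal to zero the function returns the source entropy, and it falls to zero once the allowed distortion reaches what blind guessing already achieves.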

Information Theory in Modern Data Science

Perplexity web research on entropy and information measures in ML

Entropy and information measures underpin many data science and machine learning algorithms by quantifying uncertainty, dependence, and distributional mismatch, allowing them to rank features, choose splits in trees, estimate relationships between variables, and optimize probabilistic models. Entropy in Shannon’s sense measures the average uncertainty of a random variable or distribution and serves as a feature or objective in data analysis and machine learning. Mutual information quantifies statistical dependence between variables through entropy reduction and identifies variables most informative about a target. Relative entropy or KL divergence measures how one distribution differs from another, supporting model fitting and comparison of empirical data to a reference. Cross-entropy functions as a loss in classification by assessing how predicted probabilities match the true label distribution. In decision-tree learning, ID3 selects splits via information gain, defined as the difference between a decision attribute’s entropy and its conditional entropy given a feature, which equals mutual information. Entropy-based methods support classification, discrimination, clustering, segmentation, anomaly detection, feature extraction, and algorithm optimization in data analysis and machine learning, where efficient estimation matters for large complex data. Low entropy indicates data that are more structured or predictable, whereas high entropy indicates data that are more random or uncertain, guiding algorithms on what to split on, compress, predict confidently, or align with a target distribution.
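The information-gain criterion described above can be sketched in a few lines of Python. The toy labels and the two candidate features are invented purely for illustration, and the function names are hypothetical; a real ID3 implementation would apply the same calculation recursively at every node of the tree.

```python
import math
from collections import Counter

def entropy(labels):
    """Empirical entropy in bits of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """Entropy of the labels minus their conditional entropy given the feature."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional

# Invented toy data: four examples, two candidate features.
labels = ["yes", "yes", "no", "no"]
informative = ["a", "a", "b", "b"]   # separates the classes perfectly
uninformative = ["a", "b", "a", "b"]  # tells us nothing about the class

print(information_gain(informative, labels))
print(information_gain(uninformative, labels))
```

A split on the first feature removes all uncertainty about the label, while a split on the second leaves it unchanged, so a gain-based learner would choose the first.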
