Thursday, June 28, 2012
Maximum Entropy Distributions



Entropy is an important topic in many fields; it has very well known uses in statistical mechanics, thermodynamics, and information theory. The classical formula for entropy is S = −Σᵢ pᵢ log pᵢ, where pᵢ is the probability that the system is found in microstate i. But what is this probability distribution? How must the likelihoods of the states be configured so that we observe the appropriate macrostates?
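To make the formula concrete, here is a minimal Python sketch (my own addition, not part of the original post) that evaluates the discrete entropy of a toy distribution; the function name and the example probabilities are arbitrary choices for illustration.

```python
import math

def entropy(p, base=math.e):
    """Discrete entropy S = -sum_i p_i log p_i of a probability vector p."""
    assert abs(sum(p) - 1.0) < 1e-9, "probabilities must sum to 1"
    return -sum(pi * math.log(pi, base) for pi in p if pi > 0)

# A fair four-sided die is maximally uncertain; a loaded one is less so.
print(entropy([0.25, 0.25, 0.25, 0.25]))  # ln 4 ≈ 1.386, the maximum for 4 states
print(entropy([0.70, 0.10, 0.10, 0.10]))  # ≈ 0.940
```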



In accordance with the second law of thermodynamics, we wish for the entropy to be maximized. Passing to the continuum limit of a large number N of microstates, the sum becomes an integral over a probability density φ = φ(x), and we can treat the entropy with calculus as the functional S[φ] = −∫dx φ ln φ. (A functional is, essentially, a function that takes another function as its argument.) How can we maximize S? We will proceed using the methods of the calculus of variations and Lagrange multipliers.



First we introduce three constraints. We require normalization, so that ∫dx φ = 1; this is a condition any probability distribution must satisfy, since the total probability over the domain of possible values must be unity. We require symmetry, so that the expected value of x is zero: the system is as likely to be found in microstates to the left of the mean as to the right (this derivation treats the one-dimensional case for simplicity). This gives the constraint ∫dx x·φ = 0. Finally, we explicitly fix the variance to be σ², so that ∫dx x²·φ = σ².



Using Lagrange multipliers, we will instead extremize the augmented functional J[φ] = ∫(φ ln φ + λ₀φ + λ₁xφ + λ₂x²φ) dx. Since S = −∫dx φ ln φ, maximizing the entropy subject to the constraints amounts to finding a stationary point of J; the integrand is just the sum of the integrands above, weighted by the Lagrange multipliers λₖ for which we’ll be solving.



Applying the Euler-Lagrange equation (since the integrand contains no derivative φ′, this amounts to setting its partial derivative with respect to φ to zero) gives ln φ + 1 + λ₀ + λ₁x + λ₂x² = 0, so φ = 1/exp(1 + λ₀ + λ₁x + λ₂x²). From here, our symmetry condition forces λ₁ = 0, and evaluating the remaining integral conditions fixes the other λ’s, so that φ = (1/(2πσ²))^½ · exp(−x²/(2σ²)), which is just the Normal (or Gaussian) distribution with mean 0 and variance σ². This remarkable distribution appears in many descriptions of nature, in no small part due to the Central Limit Theorem.
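As a sanity check (again my own addition, not from the original post), the sketch below numerically compares the differential entropy of three densities that all have mean 0 and variance σ² = 1; the Gaussian should come out on top, matching its closed-form entropy ½ ln(2πeσ²).

```python
import numpy as np

def differential_entropy(phi, dx):
    """Grid approximation of S[phi] = -∫ dx phi ln(phi)."""
    phi = np.clip(phi, 1e-300, None)      # avoid log(0) where the density vanishes
    return -np.sum(phi * np.log(phi)) * dx

sigma = 1.0
x, dx = np.linspace(-30.0, 30.0, 600001, retstep=True)

# Three densities, each with mean 0 and variance sigma^2:
gaussian = np.exp(-x**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)
b = sigma / np.sqrt(2)                    # Laplace scale chosen so the variance is sigma^2
laplace = np.exp(-np.abs(x) / b) / (2 * b)
a = sigma * np.sqrt(3)                    # half-width of a uniform with variance sigma^2
uniform = np.where(np.abs(x) <= a, 1 / (2 * a), 0.0)

for name, phi in [("gaussian", gaussian), ("laplace", laplace), ("uniform", uniform)]:
    print(name, differential_entropy(phi, dx))
# gaussian ≈ 1.419 (= ½ ln(2πeσ²)), laplace ≈ 1.347, uniform ≈ 1.242
```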


Tuesday, August 30, 2011
Imagine you had a function P that upon swallowing a subset E of a universal set Ω will return a number x from the real number line. Keep imagining that P must also obey the following rules:
1. If P can eat the subset, it will always return a nonnegative number.
2. If you give P the universe Ω, it will give you back 1.
3. If you collected together disjoint subsets and gave them to P to process, the result would be the same as feeding P each subset individually and adding the answers.
Simple, if odd out of context.
Mathematicians have a curious way of pulling magic out of simplicity.
~
Probability today is studied as a mathematical science based on the three axioms (flavored by set theory) stated above. These are the “first principles” from which many other, derivative propositions have been conjectured and proved. The results of the modern study of probability fuel many branches of engineering, including signal processing in electrical and computer engineering; the insurance and finance industries, which translate probabilities into economic movement; and many other enterprises. Along the way it has borrowed from the other giants of mathematics, analysis and algebra, and it goes on generating new research ideas for itself and other fields. This is the way of math: set down a bunch of rules (preferably simple to start) and see how their consequences play out.
But what is probability? If it is a quantitative measure, what is it measuring? How valid is that measure, and how could it be checked? Even these are rich questions to probe. A working qualitative description for practitioners might be that probability quantifies uncertainty. It answers, with some degree of success, such questions as “What is the chance?” or “How likely is this?” If a system contains uncertainty, probability provides the model for handling it, and data gathered from the system can validate or improve the probability model.
According to Wikipedia, there are three main interpretations for probability:
1. Frequentists talk about probabilities only when dealing with experiments that are random and well-defined. The probability of a random event denotes the relative frequency of occurrence of an experiment’s outcome, when repeating the experiment. Frequentists consider probability to be the relative frequency “in the long run” of outcomes.
2. Subjectivists assign numbers per subjective probability, i.e., as a degree of belief.
3. Bayesians include expert knowledge as well as experimental data to produce probabilities. The expert knowledge is represented by a prior probability distribution. The data is incorporated in a likelihood function. The product of the prior and the likelihood, normalized, results in a posterior probability distribution that incorporates all the information known to date.

~
So let’s reinterpret the math.
Let Ω be the sample space, the set of all possible outcomes; let the Eᵢ be subsets of Ω denoting different events for different i; and let 𝔹 be the set of all events. Then a probability map P is defined as any function from 𝔹 → ℝ satisfying
1. P(Eᵢ) ≥ 0. All probabilities are non-negative.
2. P(Ω) = 1. It is certain that one of the outcomes of Ω will happen.
3. If Eᵢ ∩ Eⱼ = ∅ whenever i ≠ j, then P(∪ᵢ Eᵢ) = Σᵢ P(Eᵢ). Probabilities of disjoint events can be added to get the probability of any of them happening.
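To see the axioms in action, here is a toy Python sketch of my own (not from the post; all names are arbitrary) that builds P for a fair six-sided die by summing the probabilities of elementary outcomes and then checks the three properties on a few events.

```python
from itertools import combinations

# A toy probability measure on the finite sample space of a fair six-sided die.
omega = frozenset(range(1, 7))
p_outcome = {w: 1 / 6 for w in omega}          # probability of each elementary outcome

def P(event):
    """Probability map: an event is any subset of omega."""
    assert event <= omega, "P only eats subsets of the universe"
    return sum(p_outcome[w] for w in event)

# Axiom 1: non-negativity, checked over the whole power set of omega.
events = [frozenset(s) for r in range(len(omega) + 1) for s in combinations(omega, r)]
assert all(P(E) >= 0 for E in events)

# Axiom 2: the universe is certain.
assert abs(P(omega) - 1.0) < 1e-12

# Axiom 3: additivity over disjoint events.
E1, E2 = frozenset({1, 2}), frozenset({5, 6})
assert E1 & E2 == frozenset()
assert abs(P(E1 | E2) - (P(E1) + P(E2))) < 1e-12

print(P(frozenset({2, 4, 6})))                 # chance of rolling an even number: 0.5
```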
—
Image generated by Rene Schwietzke using POV-Ray, a free raytracing program that creates 3D computer graphics.
Further reading:
A First Course in Probability (8th ed., 2010), Sheldon Ross.
Probability and Statistics (4th ed., 2010), Mark J. Schervish and Morris H. DeGroot.


Thursday, August 25, 2011
The Stern-Gerlach Experiment
In 1922 at the University of Frankfurt in Germany, Otto Stern and Walther Gerlach sent a beam of silver atoms through an inhomogeneous magnetic field in their experimental device. They were investigating the new concept of quantized spin angular momentum. If indeed the spin associated with particles could only take on two (or some other countable number of) states, then the atoms transmitted through the other end of their machine should come out as two (or more) concentrated beams. If instead the quantum theory were wrong, classical physics predicted that the profile of a single smeared-out beam would result on the detector screen, the magnetic field deflecting each randomly spin-oriented atom by a different amount on a continuous, rather than discrete, scale.
As you can see above, the results of the Stern-Gerlach experiment confirmed the quantization of spin for elementary particles.
Spin and quantum states
A spin-1/2 particle actually corresponds to a qubit
|ψ> = c₁|ψ↑> + c₂|ψ↓>
a wavefunction representing a particle whose quantum state can be seen as the superposition (or linear combination) of two pure states, one for each kind of possible spin along a chosen axis (such as x, y, or z). The silver atoms of Stern and Gerlach’s experiment fit this description because they are made of spin-1/2 particles (electrons and quarks, which make up protons and neutrons).
Significantly, the constant coefficients c₁ and c₂ are complex and can’t be directly measured. But the squared moduli |c₁|² and |c₂|² of these coefficients give the probabilities that a particle in state |ψ> will be observed as spin up or spin down at the detector.
|c₁|² + |c₂|² = 1 : it is certain that the particle will be detected in one of the two spin states.
That means that when we pass a large sample of particles in identical quantum states through a Stern-Gerlach (S-G) machine and detector, we are actually measuring the probabilities that a particle will adopt the spin-up or spin-down state along the particular axis of the S-G machine. This follows the relative-frequency interpretation of probability: as the number of identical trials grows large, the relative frequency of an event approaches the true probability that the event will occur in any one trial.
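A quick simulation (my own illustration; the amplitudes below are arbitrary, made-up values) shows the relative frequency of “up” detections converging to |c₁|² as the sample grows:

```python
import random

# Arbitrary complex amplitudes for an example state |psi> = c1|up> + c2|down>.
c1, c2 = complex(0.6, 0.3), complex(0.2, -0.5)
norm = (abs(c1)**2 + abs(c2)**2) ** 0.5
c1, c2 = c1 / norm, c2 / norm                  # enforce |c1|^2 + |c2|^2 = 1

p_up = abs(c1)**2                              # Born-rule probability of "spin up"
print("theoretical P(up) =", p_up)

# Send N identically prepared particles through the S-G machine and count "up" hits.
N = 100_000
ups = sum(1 for _ in range(N) if random.random() < p_up)
print("observed relative frequency of up =", ups / N)   # approaches p_up as N grows
```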
By moving the screen so that either the up or the down beam is allowed to pass while the other is stopped at the screen, we are “polarizing” the beam to a certain spin orientation along the S-G machine’s axis. We can then place one or more S-G machines with stops in front of that beam and reproduce all the experiments analogous to linear polarization of light.
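For the sequential-machine setup, here is a small sketch (again my own, using the standard spin-1/2 result that a beam polarized up along one axis is detected up along a second axis tilted by θ with probability cos²(θ/2)):

```python
import math
import random

def second_machine_up_fraction(theta, n=100_000):
    """Fraction of a z-polarized 'up' beam that exits 'up' from a second S-G
    machine whose axis is tilted by theta radians; QM predicts cos^2(theta/2)."""
    p_up = math.cos(theta / 2) ** 2
    return sum(1 for _ in range(n) if random.random() < p_up) / n

for theta in (0.0, math.pi / 2, math.pi):
    predicted = math.cos(theta / 2) ** 2
    print(f"theta = {theta:.2f} rad: predicted {predicted:.3f}, "
          f"simulated {second_machine_up_fraction(theta):.3f}")
# theta = 0 passes the whole beam, theta = pi/2 (an x-axis machine) splits it 50/50,
# and theta = pi blocks the "up" channel, mirroring crossed linear polarizers for light.
```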
