Exploring Frequentist Probability vs Bayesian Probability
Confessions of a moderate Bayesian, part 2
Read Part 1: Confessions of a moderate Bayesian, part 1
Bayesian statistics by and for non-statisticians
Background
One of the continuous and occasionally contentious debates surrounding Bayesian statistics is the interpretation of probability. For anyone who is familiar with my posts on this forum, I am not generally a big fan of interpretation debates. This one is no exception. So I am going to present both interpretations as factually as I can, and then conclude with my personal take on the issue and my approach.
Probability axioms
Probability is a mathematical concept that is applied to various domains. I think that it is worthwhile to point out the mathematical underpinnings in at least a brief and non-rigorous form.
The axioms of probability that are typically used were formulated by Kolmogorov. He started with a complete set of “events” forming a sample space and a measure on that sample space called the probability of the event. Then probability is defined by the following axioms:
- The probability of any event in the sample space is a non-negative real number.
- The probability of the whole sample space is 1.
- The probability of the union of several mutually exclusive events is equal to the sum of the probabilities of the individual events.
Anything that behaves according to these axioms can be treated as a probability. I have glossed over some of the technical details of setting up the sample space and the events, and also it is worth noting that the third axiom can be written in terms of a countably infinite union or a finite union.
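To make the axioms concrete, here is a minimal Python sketch for a finite sample space. The three-outcome space and the helper names `prob` and `is_probability_measure` are illustrative assumptions, not part of the original discussion; the point is only that any assignment of weights that passes these checks can be treated as a probability.

```python
from fractions import Fraction

def prob(event, weights):
    """Probability of an event, i.e. a set of elementary outcomes."""
    return sum(weights[outcome] for outcome in event)

def is_probability_measure(weights):
    """Check axioms 1 and 2 for weights on a finite sample space.

    Axiom 3 (additivity over mutually exclusive events) holds by
    construction here, because `prob` sums over distinct outcomes.
    """
    non_negative = all(w >= 0 for w in weights.values())  # axiom 1
    normalized = sum(weights.values()) == 1               # axiom 2
    return non_negative and normalized

# Hypothetical three-outcome sample space with exact rational weights.
weights = {"a": Fraction(1, 2), "b": Fraction(1, 3), "c": Fraction(1, 6)}

print(is_probability_measure(weights))
# Additivity for the mutually exclusive events {a} and {b, c}:
print(prob({"a"}, weights) + prob({"b", "c"}, weights)
      == prob({"a", "b", "c"}, weights))
```

Exact fractions are used so that the normalization check is not disturbed by floating-point rounding.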
Randomness
It is important to recognize that nothing in the axioms of probability requires randomness. That is, the mathematical concept of probability is used to analyze randomness, but that is an application of probability, not probability itself.
Similarly, vectors are used to represent the outcome of a measurement of some quantity like velocity, but nothing in the mathematical definition of a vector requires velocity. Velocity is an application of vectors just as randomness is an application of probability.
Frequentist probabilities
In typical introductory classes, the concept of probability is introduced together with the notion of a random variable that can be repeatedly sampled. A good example is an outcome of flipping a coin. It doesn’t matter too much if we consider a coin-flipping system to be inherently random or simply random due to ignorance of the details of the initial conditions on which the outcome depends. Either way, we can perform the physical experiment of flipping a coin and we can observe that the result of the experiment is either a heads or a tails.
Now, to apply the axioms of probability to this we need to construct a sample space. That is rather easy: our sample space can be ##\{H,T\}## where ##H## is the event of getting heads on a single flip and ##T## is the event of getting tails on a single flip.
Now, we need a way to determine the measure ##P(H)##. For frequentist probabilities, the way to determine ##P(H)## is to repeat the experiment a large number of times and calculate the frequency with which the event ##H## happens. In other words, if you do ##N## trials and get ##n_H## heads then $$P(H) \approx \frac{n_H}{N}$$ for large ##N## with equality for a hypothetical infinite ##N##. So a frequentist probability is simply the “long-run” frequency of some event.
This has some nice features. First, it is objective; anyone with access to the same infinite set of data will get the same number for ##P(H)##. Second, it follows the axioms above, so you can either use ##P(H)## and the axioms to calculate ##P(T)## or you can use your data set to get the long-run frequency of tails ##n_T/N##.
It also has some problematic features, the worst of which is the long-run frequency. It is not realistic to get an infinite set of data even for something as inexpensive as flipping a coin, let alone for more expensive experiments where a single data point may cost thousands of dollars and years of time. The best you can do is get an approximation to ##P(H)## and sometimes that approximation can be quite bad.
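The finite-data approximation can be illustrated with a small simulation. This is only a sketch: the function name `estimate_p_heads`, the simulated "true" value `p_true = 0.5`, and the seed are assumptions chosen for illustration, and a pseudorandom generator stands in for a physical coin.

```python
import random

def estimate_p_heads(n_trials, p_true=0.5, seed=0):
    """Estimate P(H) as the observed frequency n_H / N over N simulated flips."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    n_heads = sum(rng.random() < p_true for _ in range(n_trials))
    return n_heads / n_trials

# The estimate is only approximate for finite N and tends to
# improve as N grows; small N can give a poor approximation.
for n in (10, 100, 10_000, 1_000_000):
    print(n, estimate_p_heads(n))
```

No finite run reaches the hypothetical infinite-``N`` limit; each printed value is just an approximation to the value assumed in the simulation.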
Bayesian probabilities
The Bayesian concept of probability is more about uncertainty than about randomness. Remember, randomness is an important application of probability, not probability itself. Of course, if something is random, then we will be uncertain about it, but we can be uncertain about things that we don’t consider to be random.
For example, consider the value of the gravitational constant ##G## in SI units. We wouldn’t generally think of that as being random, but we also do not know it with certainty. We can therefore treat our uncertain knowledge of ##G## as a Bayesian probability. Some of the terminology remains from the frequentist usage, so we may even call ##G## a random variable, although a purist (which I am not) may insist on calling it a parameter.
Bayesian probabilities obey the standard axioms of probability, so they are full-fledged probabilities, regardless of whether they describe true randomness or other uncertainty. Often they are described in terms of subjective beliefs; however, “belief” in this sense is formalized in a way that requires “beliefs” to follow the axioms of probability. This is not how the psychological phenomenon of belief always works.
Bayes’ theorem
From the axioms of probability, it is relatively straightforward to derive Bayes’ theorem, from which Bayesian probability gets its name and its most important procedure: $$P(A|B)=\frac{P(B|A) \ P(A)}{P(B)}$$
For science we usually choose ##A=\text{hypothesis}## and ##B=\text{data}## so that $$P(\text{hypothesis}|\text{data}) = \frac{P(\text{data}|\text{hypothesis}) \ P(\text{hypothesis})} {P(\text{data})}$$ This gives us a way of expressing our uncertainty about scientific hypotheses, something that doesn’t make sense in terms of frequentist probability. As importantly, it tells us how to update our scientific beliefs in the face of new evidence.
In this equation ##P(\text{hypothesis})## is the probability that describes our uncertainty in the hypothesis before seeing the data, called the “prior”. ##P(\text{hypothesis}|\text{data})## is our uncertainty in the hypothesis after seeing the data, called the “posterior”. Both are probabilities so they each have probability distribution functions etc.
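As a rough illustration of the prior-to-posterior update, the sketch below applies Bayes' theorem to two competing hypotheses about a coin. The hypotheses ("fair" with ##P(H)=1/2## and "biased" with ##P(H)=4/5##), the equal priors, and the data (8 heads in 10 flips) are all invented for this example, as are the function names.

```python
from fractions import Fraction
from math import comb

def binomial_likelihood(p, n_heads, n_flips):
    """P(data | hypothesis): probability of n_heads in n_flips when P(H) = p."""
    return comb(n_flips, n_heads) * p**n_heads * (1 - p)**(n_flips - n_heads)

def posterior(prior, likelihood):
    """Bayes' theorem over a finite set of mutually exclusive hypotheses."""
    evidence = sum(prior[h] * likelihood[h] for h in prior)  # P(data)
    return {h: prior[h] * likelihood[h] / evidence for h in prior}

# Hypothetical setup: two candidate values of P(H), equal prior belief.
p_heads = {"fair": Fraction(1, 2), "biased": Fraction(4, 5)}
prior = {"fair": Fraction(1, 2), "biased": Fraction(1, 2)}

# Hypothetical data: 8 heads in 10 flips.
likelihood = {h: binomial_likelihood(p, 8, 10) for h, p in p_heads.items()}
post = posterior(prior, likelihood)

for h in post:
    print(h, float(prior[h]), "->", float(post[h]))
```

With these made-up numbers the data shift belief toward the "biased" hypothesis, and the posterior could serve as the prior for the next batch of flips.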
The frequentist vs Bayesian conflict
For some reason, the whole difference between frequentist and Bayesian probability seems far more contentious than it should be, in my opinion. I think some of it may be due to the mistaken idea that probability is synonymous with randomness. The Bayesian use of probability seems fundamentally wrong to someone who equates the two. But since both types of probability follow the same axioms, mathematically they are both valid and theorems that apply for one apply for the other. In particular, Bayesians don’t have some sort of exclusive rights to Bayes’ theorem.
The uncertainty should be the same as the long-term frequency once you have accumulated that infinite amount of data. And since you never have that infinite amount of data you will always have some uncertainty remaining. So the two types of probability are also complementary to each other. Furthermore, as we have seen, Bayesian methods give us ##P(\text{hypothesis}|\text{data})## and frequentist methods focus on ##P(\text{data}|\text{hypothesis})##, which are also complementary.
Summary
Just as I am not a fan of rigid adherence to scientific interpretations, I am also not a fan of rigid adherence to interpretations of probability. In both cases, I think that it is far more beneficial to learn multiple interpretations and switch between them as needed. When one is particularly suited to a given problem, then use that, and when the other is more suitable then switch. Just as different scientific interpretations produce the same experimental results so they can be used interchangeably, similarly the different interpretations of probability both follow the same axioms and can be used largely interchangeably. They are equivalent in that sense.
I hope this overview has given you a basic understanding of the differences between Bayesian and frequentist probabilities and perhaps a better understanding of the distinction between probability and randomness. And perhaps the odd contention between adherents of these two interpretations can eventually be dismissed as more people become familiar with both and use each when appropriate.
Continue to part 3: How Bayesian Inference Works in the Context of Science
Education: PhD in biomedical engineering and MBA
Interests: family, church, farming, martial arts
http://www.cs.ru.nl/P.Lucas/teaching/CI/efron.pdf
Why Isn't Everyone a Bayesian?
Author(s): B. Efron
Source: The American Statistician, Vol. 40, No. 1 (Feb., 1986), pp. 1-5
Just a note that "incoherent" is nowadays the more usual technical term in English.
Here are two examples from Schlosshauer's review https://arxiv.org/abs/quant-ph/0312059.
"It is a well-known and important property of quantum mechanics that a superposition of states is fundamentally different from a classical ensemble of states, where the system actually is in only one of the states but we simply do not know in which (this is often referred to as an “ignorance-interpretable,” or “proper” ensemble)."
"Most prominently, the orthodox interpretation postulates a collapse mechanism that transforms a pure-state density matrix into an ignorance-interpretable ensemble of individual states (a “proper mixture”)."
So Fisher clearly thinks that it is not necessary to establish “randomness” but merely to have a sample population with a well defined frequency. That fits in well with the frequentist definition of probability as a population frequency. One thing that Fisher doesn’t address there is sampling individual values from the population. Can you still use frequentist probability if the sampling is non-random (e.g. a random number generator with a specified seed)? I suspect that Fisher would say yes, but I am not sure that all prominent frequentists would agree.
So potentially, depending on the individual, there is not much difference between the frequentist and Bayesian interpretation in a deterministic population where we have ignorance.
Where you get a difference is in situations where there is simply no sample population. For example, ##G## or ##\alpha##. Those quantities are not a population, there is only one value but we are uncertain about it. With a frequentist approach ##P(\alpha=1/137)## is somewhere between weird and impossible, whereas a Bayesian would have no qualms about such an expression.
So we see a Frequentist discussing ignorance and knowledge in connection with the concept of probability. That view may not be statistically typical of the population of Frequentists, but it is a view that would allow probabilities to be assigned to the population of numbers generated by a deterministic random number generator – provided that when we take samples, we don't know how to distinguish sub-populations that have statistical characteristics different than the parent population.
Isn't it the same in frequentist thinking that randomness can arise from determinism, ie. from our ignorance of the details of a deterministic process?
Isn’t what the same?
Isn't that the same in frequentist thinking?
It's fair to say that the concept of probability that people originally had in mind involves a situation where there are several "possible" outcomes of some physical phenomena, but only one of the "possible" outcomes "actually" occurs. The concept of probability associated with such a situation involves a "tendency" for certain outcomes to actually happen that can be measured by a number, but the lack of any absolute guarantee that this number will correspond to the observed frequencies of the outcomes that actually do happen. This is still how many people applying probability theory think of probability.
However, such thoughts involve the complicated metaphysical concepts of "possible" as distinct from "actual". There is not yet any (well-known) system of mathematics that formalizes these metaphysical concepts and also provides anything useful for applications that the Kolmogorov approach doesn't already supply.
The Kolmogorov approach (measure theory) provides a reliable basis for proving theorems about probabilities. The price of this approach is that probability theory is essentially circular. We have theorems that say if certain probabilities are such-and-such then the probabilities of other things are so-and-so. Any interpretation of probability theory as a guarantee of what will actually happen is outside this theory. It falls under whatever field of science deals with the problem to which the theory is applied.
It seems to me that in physics there is a long tradition of attempts to formulate theories of probability on the basis of actual frequencies of outcomes. For example, if we consider tossing a fair coin as a physical event, then such a theory would tell us to consider the "ensemble" of tossed coins. The ensemble must be an actual thing. It may involve all fair coins that have been tossed in past and all that will be tossed in the future, and coins tossed on other planets etc. In this actual ensemble of fair coins there is an actual frequency that have (or will) land heads. So this frequency is a specific number if the ensemble is finite. (If the ensemble isn't finite, we have more conceptual work to do.)
These ensemble theories do not explain taking independent samples from the ensemble unless we add further structure to theory. (For example, why won't the sub-ensemble corresponding to one experimenter's tosses all come out heads?) So we need the ensemble to be distributed in space and time (e.g. among various labs and among various times-of-day) in some way that mimics the appearance of independent trials.
Not necessarily. We are certainly uncertain about random things, but we are also uncertain about some non-random things. Both can be represented as a distribution from which we can draw samples. So the mere act of drawing from a distribution does not imply randomness.
A good example is a pseudorandom number generator. There is nothing actually random about it. But we are uncertain of its next value, so we can describe it using a distribution and draw samples from it.
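The pseudorandom generator point can be checked directly. In this sketch the helper name `draw` and the seed values are arbitrary choices for illustration; Python's standard `random.Random` is used as the deterministic generator.

```python
import random

def draw(seed, n=5):
    """Draw n values from a deterministic pseudorandom generator."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]

# Same seed, same sequence: nothing random is happening.
print(draw(42) == draw(42))

# Someone who does not know the seed is still uncertain about the next
# value, and can reasonably describe it with a uniform distribution on [0, 1).
print(draw(42))
```

Whether the draws count as "random" depends on what the observer knows, which is the sense in which a distribution here represents uncertainty rather than randomness.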
But if a Bayesian draws samples from a distribution, then wouldn't the Bayesian be using the idea of randomness?
E.g.
https://en.wikipedia.org/wiki/Gibbs_sampling
http://www.mit.edu/~ilkery/papers/GibbsSampling.pdf
Yes. That is what axiomatization does. It abstracts a concept. Then the word “probability” (in that mathematical and axiomatic sense) itself becomes an abstraction representing anything which satisfies the axioms.
I do sympathize with that view, but realistically it is too late in this case. The Kolmogorov axioms are already useful and well accepted, and using the word “probability” to refer to measures which satisfy those axioms is firmly established in the literature.
The best you can do is to recognize that the word “probability”, like so many other words, has multiple meanings. One is the mathematical meaning of anything which satisfies Kolmogorov’s axioms, and the other is the “concept that people originally had in mind”. Then you merely make sure that it is understood which meaning is being used, as you do with any other multiple-meaning word.
But what is probability then about? About anything that satisfies the axioms of probability? My view is that, if a set of axioms does not really capture the concept that people originally had in mind before proposing the axioms, then it is the axioms, not the concept, that needs to be changed.
That one isn’t particularly exotic. It is a simple “balls in an urn” probability but weighted by energy rather than being equally weighted.
However, I am sure that there are other measures that are more surprising or genuinely exotic. The thing is to realize that probability is not about randomness. If something satisfies the axioms then it is a probability even if there is no sense of randomness or uncertainty involved.
Bayes theorem and all of the other theorems of probability would apply. Whether they would be useful is a separate question, but they would surely apply.
$$p_i=\frac{E_i}{\sum_{j=1}^N E_j}$$
satisfies the probability axioms. @Dale any comments?
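A quick numerical check of this energy-weighted measure is sketched below. It assumes, which the post does not state explicitly, that every ##E_i## is non-negative and that the total is positive; the integer energy values and the function name `energy_weights` are made up for illustration.

```python
from fractions import Fraction

def energy_weights(energies):
    """Return p_i = E_i / sum_j E_j for a list of integer energies.

    Assumes every energy is non-negative and the total is positive;
    otherwise the result would not satisfy the probability axioms.
    """
    total = sum(energies)
    return [Fraction(e, total) for e in energies]

# Hypothetical energies in arbitrary units.
p = energy_weights([1, 2, 3, 4])

print(all(p_i >= 0 for p_i in p))  # axiom 1: non-negativity
print(sum(p) == 1)                 # axiom 2: total measure is 1
# Axiom 3: disjoint subsets of indices add, because each p_i is a weight.
print(p[0] + p[1] == Fraction(1 + 2, 10))
```

Nothing in the check involves randomness or uncertainty; the weights qualify as a probability purely because they satisfy the axioms.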
I think for me that was the big “aha” moment: when I realized that probability and randomness were different things. It doesn’t matter what ##P(A)## represents operationally; if it follows the Kolmogorov axioms then it is a probability. It could represent true randomness, it could represent ignorance, it could represent uncertainty, and I am sure that there are other things it could represent.
I tend to like the idea of uncertainty more than randomness, because I find randomness a lot harder to pin down. It seems to get jumbled up with determinism and other things that you don’t have to worry about for uncertainty.
Begin with a concise definition (from https://en.wikipedia.org/wiki/Inverse_probability, which references the Feinberg paper):
For example, suppose we model 10 tosses of a possibly unfair coin as a random variable with binomial distribution with probability ##p## of the coin landing heads. Then the observed data is the 10 results of tossing the coin. The parameter ##p## is not observed. (We can say the effects of ##p## are observed, but the value of ##p## itself is not directly observed.) If we assume a probability model where ##p## is assumed to have a uniform distribution on the interval [0,1] then we have assigned a probability distribution to an unobserved variable, so we are using inverse probability.
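The coin example can be carried one step further with a small sketch. It relies on the standard result that a uniform prior on ##p## is a Beta(1, 1) distribution and that a binomial likelihood turns a Beta(a, b) prior into a Beta(a + heads, b + tails) posterior. The data (7 heads in 10 tosses) and the function names are assumptions made for illustration.

```python
from fractions import Fraction

def posterior_beta_params(n_heads, n_tosses, a_prior=1, b_prior=1):
    """Beta posterior parameters for p; Beta(1, 1) is the uniform prior."""
    n_tails = n_tosses - n_heads
    return a_prior + n_heads, b_prior + n_tails

def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return Fraction(a, a + b)

# Hypothetical data: 7 heads observed in 10 tosses.
a, b = posterior_beta_params(7, 10)
print(a, b)             # posterior is Beta(8, 4)
print(beta_mean(a, b))  # posterior mean of p is 2/3
```

The posterior mean 2/3 differs slightly from the observed frequency 7/10 because the uniform prior still carries some weight after only 10 tosses.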
Using "inverse probability" is now what we would call assigning a prior distribution to a parameter. The modern terminology "prior distribution" does not emphasize the fact that it is a distribution for a quantity that is not directly observed in the data.
The practical distinction between Frequentists and Bayesians is: Frequentists reject the use of inverse probability and Bayesians employ it.
The correct description of the history of probability and statistics is not that the earliest methods were Frequentist methods and that Bayesian methods were an innovation that came later. Instead, the earliest methods included using "inverse probability".
Frequentism developed in the 1920's when prominent statisticians rejected the use of "inverse probability". I haven't researched why they rejected using inverse probability – whether their reasons were metaphysical or practical – or unique to each individual Frequentist.
The Frequentist style of statistics became the dominant style for decades. (It's an interesting question why this happened – perhaps because Frequentist probability models have a simpler structure. They minimize the number of probability distributions involved.)
Bayesian methods were recognized as a distinct style of probability modeling when statisticians began to revive the use of "inverse probability".
Describing the practical difference between Bayesian and Frequentist styles in terms of "inverse probability" is a correct explanation, but it does not delve into the consequences of the decision to use or not to use "inverse probability".
The consequences of rejecting "inverse probability" are usually that we get a probability model that can only be used to answer questions of the form "Assuming such-and-such, what is the probability of the data?". Allowing the use of inverse probability can create probability models that answer questions of the form "Given the data, what is the probability of such-and-such?"
Explaining the consequences of using or not using "inverse probability" is a technical matter and requires a technical article. Explaining the practical difference between Bayesian and Frequentist styles in terms of the definition of "inverse probability" can be done without many technical details and starts the reader off on the right foot.
For a bit of further reading, people might like to look into the Cox axioms:
https://en.wikipedia.org/wiki/Cox's_theorem
Thanks
Bill
Besides being a mere critic of other posts, I'll make the (perhaps self-evident) points:
Bayesian vs Frequentist can be described in practical terms as a style of choosing probability models for real life problems. People who pick a particular style do not necessarily accept or understand the philosophical views of prominent Bayesians and Frequentists.
The Bayesian style of probability modeling is to use a probability model that answers questions of the form that people most commonly ask. E.g. Given the data, what is the probability that the population has such-and-such properties?
The Frequentist style of probability modeling is to use the minimum number of parameters and assumptions – even if this results in only being able to answer questions of the form: Given I assume the population has such-and-such properties, what is the probability of the data?
Understanding the distinction between the Bayesian and Frequentist styles is made difficult by the fact that Frequentists use a vocabulary that strongly suggests that they are answering the questions that the Bayesian method is obligated to answer. For example, "There is 90% confidence that the observed mean will be within plus or minus 0.23 of the population mean" suggests (but does not actually imply) that "The observed mean is 6.00, therefore there is a 0.90 probability that the population mean is in the interval [6.00 - 0.23, 6.00 + 0.23]". Similar misinterpretations of terms like "statistical significance" and "p-value" suggest to laymen, and even students of introductory statistics, that Frequentist methods are telling them something about the probability of some fact given the observed data. But instead Frequentism generally deals with probabilities where the condition is changed to be "Given these facts are assumed, the probability of the observed data is …".
The biggest obstacle to explaining the practical difference between Bayesian statistics and Frequentist statistics is explaining that the methods answer different questions. The biggest obstacle to explaining that the methods answer different questions is negotiating the treacherous vocabulary of Frequentist statistics to clarify the type of question that Frequentist statistics actually answers. Explaining the Bayesian and Frequentist distinction in terms of a difference between "subjective" and "objective" probability does not, by itself, explain the practical distinction. A reader might keep the misconception that Frequentist methods and Bayesian methods solve the same problems, and conclude that the difference in the styles only has to do with the different philosophical thoughts swimming about in the minds of two people who are doing the same mathematics.
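One way to see what the "90% confidence" statement does say is to simulate the Frequentist condition directly: assume a population mean, generate many data sets, and count how often the interval around the observed mean covers the assumed value. The sketch below makes several assumptions not in the post: a normal population with known standard deviation, a true mean of 6.0, samples of size 50, and the function name `coverage`.

```python
import random
from statistics import NormalDist, fmean

def coverage(true_mean=6.0, sigma=1.0, n=50, level=0.90, repeats=2000, seed=0):
    """Fraction of repeated experiments whose interval covers true_mean."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(0.5 + level / 2)  # two-sided critical value
    half_width = z * sigma / n ** 0.5          # known-sigma interval half-width
    hits = 0
    for _ in range(repeats):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        # The population mean is held fixed; only the data vary.
        if abs(fmean(sample) - true_mean) <= half_width:
            hits += 1
    return hits / repeats

print(coverage())  # close to 0.90 over many repeated experiments
```

The 90% describes the long-run behaviour of the procedure given an assumed population mean. It is a statement of the form "given these facts are assumed, the probability of the data is …", and turning it into a probability for the population mean given one observed data set requires a prior.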
———
As to an interpretation of probability in terms of observed frequencies, mathematically it can only remain an intuitive notion. The attempt to use probability to say something definite about an observed frequency is self-contradictory except in the trivial case where you assign a particular frequency a probability of 1, or of zero. For example, it would be satisfying to say "In 100 tosses of a fair coin, at least 3 tosses will be heads". That type of statement is an absolute guaranteed connection between a probability and an observed frequency. However, the theorems of probability theory provide no such guaranteed connections. The theorems of probability tell us about the probability of frequencies. The best we can get in absolute guarantees are theorems with conclusions like ##\lim_{n \rightarrow \infty} \Pr(E(n)) = 1##. Then we must interpret what such a limit means. Poetically, we can say "At infinity the event ##E(\infty)## is guaranteed to happen". But such a verbal interpretation is mathematically imprecise and, in applications, the concept of an event "at infinity" may or may not make sense.
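The "at least 3 heads in 100 tosses" example can be computed exactly, which shows that probability theory assigns the statement a probability rather than a guarantee. This is a minimal sketch assuming independent tosses of a fair coin; the function name `prob_at_least` is made up for the example.

```python
from fractions import Fraction
from math import comb

def prob_at_least(k, n, p=Fraction(1, 2)):
    """Exact P(at least k heads in n independent tosses with P(H) = p)."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

p_ok = prob_at_least(3, 100)
print(p_ok < 1)            # True: extremely likely, but not guaranteed
print(float(1 - p_ok))     # probability of 0, 1, or 2 heads
print(1 - p_ok == Fraction(1 + 100 + 4950, 2**100))
```

The shortfall from 1 is tiny but strictly positive, so the theorem delivers "the probability of this frequency is nearly 1", not "this frequency will occur".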
As a question in physics, we can ask whether there exists a property of situations called probability that is independent of different observers – to the extent that if different people perform the same experiment to test a situation, they (probably) will get (approximately) the same estimate for the probability in question if they collect enough data. If we take the view that we live in a universe where scientists have at least average luck, we can replace the qualifying adjective "probably" with "certainly" and if we idealize "enough data" to be "an infinite amount of data", we can change "approximately" to "exactly". Such thinking is permitted in physics. I think the concept is called "physical probability".
My guess is that most people who do quantum physics believe in physical probability. Prominent Bayesians like de Finetti explicitly reject the existence of such objective probabilities. I haven't researched prominent Frequentists. I don't even know who they are yet, so I don't know if any of them assert physical probabilities are real. The point of mentioning this is that, yes, there is detail involved in explaining the difference between "objective" and "subjective" probability. However, as pointed out above, explaining all this detail does not, by itself, explain the practical distinction between the styles of Bayesian vs Frequentist probability modeling.
In fact, the cause-and-effect relation between a person's metaphysical opinions and their style of probability modeling is, to me, unclear. Historically, how did the connection between the metaphysics of Bayesians and the probability modeling style of Bayesians evolve? Did one precede the other? Were there people who held Frequentist philosophical beliefs but began using the Bayesian style of probability modeling?
[Just found this: The article https://projecteuclid.org/download/pdf_1/euclid.ba/1340371071 indicates that a Bayesian style of probability modeling existed before the philosophical elaboration of subjective probability. It was called using "inverse probability".]
Isn’t that essentially what you proved above? I don’t understand your point.
If the frequentist definition of probability is circular as you showed then it does seem like it isn’t an objective property of a physical system.
I am not sure what point you are trying to make with your posts. Can you clarify?
Don’t you mean “So we can’t (objectively) assign a probability to the toss of a fair coin or the throw of a fair dice?”
Yes – with the caveat that adopting the views of a prominent person by citing a mild summary of them is different than understanding their details! It can be embarrassing to find yourself using a method when a well-known proponent of the method has extreme views. As a moderate Bayesian, would you associate yourself with de Finetti's:
as quoted in the paper by Nau https://faculty.fuqua.duke.edu/~rnau/definettiwasright.pdf
An interpretation of de Finetti's position is that we cannot implement probability as an (objective) property of a physical system. So we can't (objectively) toss a fair coin or throw a fair dice, or even an unfair coin or unfair dice with some objective physical properties that measure the unfairness?
Aren’t prominent people in a field considered prominent precisely because the consensus in that field is to adopt their view?
This is a good point. But they can certainly objectively test if that decision is supported by the data. (It almost never is for large data sets).
Anyway, your responses here have left me thinking that the standard frequentist operational definition is circular. I had originally thought that the limit I wrote was valid, but you are correct that it is not a legitimate limit. But the replacement you offered uses probability to define probability, so that is circular. Circularity is not necessarily an unresolvable problem, but it at least bears scrutiny.
How are you defining a "Bayesian probability"?
Are you referring to a system of mathematics that postulates some underlying structure for probability and then defines a probability measure in terms of objects defined in that underlying structure?
Those notes show an example of where a Frequentist assumes the existence of a "fixed but unknown" distribution ##Q## and a Bayesian assumes a distribution ##P##, and it is proven that "In ##P## the distribution ##Q## exists as a random object". Apparently both ##P## and ##Q## are parameterized by a single parameter called "the limiting frequency".
Isn't the general pattern for the Bayesian approach to take a parameter ##k## of a distribution ##Q_k## that a Frequentist would assume is "fixed but unknown" and model ##k## as the outcome of a random variable ##P##? That approach makes ##k## and ##Q_k## random objects generated by ##P##.
I don't see how the example in those notes gives a Bayesian any special liberty to turn a Frequentist variable into a Bayesian random variable that a Bayesian would not ordinarily take.
The notes say they demonstrate a "bridge" between the two approaches. I don't know how to interpret that. One guess is that if a Bayesian models a situation by assuming ##P## then he finds that a random distribution ##Q_k## "pops out" that can be interpreted as giving possible choices for the "fixed but unknown" distribution ##Q_k## that a Frequentist would use. Whereas the typical Bayesian approach would be to start with ##Q_k## and turn ##Q_k## into a random distribution by turning ##k## into a random variable.
I know you mean "coherent" in a different sense, but Bayesian probability is coherent, where "coherent" is a technical term.
Although Bayesians and Frequentists start from different assumptions, Bayesians can use many Frequentist procedures when there is exchangeability and the de Finetti representation theorem applies.
http://www.stats.ox.ac.uk/~steffen/teaching/grad/definetti.pdf
Ideally, there is a need for such definitions, but it will be hard to say anything precise. People make subjective decisions without having a coherent system of ideas to justify them. You can look at what prominent Bayesians say versus prominent Frequentists say. Prominent people usually feel obligated to portray their opinions as clear and systematic. But prominent people can also be individualistic, so you might not find any consensus views.
Other articles about Frequentist vs Bayesian approaches to statistics have definite opinions about the differences. However, is there really a consensus view of probability among Frequentists or among Bayesians? Are the authors of this type of article just copying what previous authors of this type of article have written, namely that Bayesians view probability as "subjective" and Frequentists view it as "objective"?
I can't see a Bayesian (of any sort) defending an estimate of a probability that is contradicted by a big batch of data. So is it correct to say that Bayesians don't accept the intuitive idea that a probability is revealed as a limiting frequency?
If a Frequentist decides to model a population by a particular family of probability distributions, will he claim that he has made an objective decision?
I think we are running into a miscommunication here. I agree with the point you are making, but it isn’t what I am asking about.
In physics we have the mathematical concept of a vector and the application of a velocity. In order to use velocity vectors you need more than just the axioms and theorems of vectors, you also need an operational definition of how to determine velocity. Here, communication is hampered because we use the word probability to refer to both the mathematical structure and the thing represented by the structure. There needs to be operational definitions of frequentist and Bayesian probability. That is what I am talking about.
I think that Bayesians have a good operational definition of probability. The valid limit you described above would be a circular operational definition for frequentist probability, but unfortunately I don’t know a better one. The one I wrote isn’t circular, but as you correctly pointed out it isn’t a real limit.
I agree. And, as far as I can see, no formal definition of any kind of limit defines the concept of a probability.
As you mentioned in the insight, the mathematical approach to probability defines it via a "measure", which is a certain type of function whose domain is a collection of sets. This theory does not formalize the idea that it is possible to take samples of a random variable nor does it define probability in the context that there is one outcome that "actually" happens in an experiment where there are many "possible" outcomes. So the mathematical theory bypasses the complicated metaphysical concepts of "actuality" and "possibility". It does not formally define those concepts and hence says nothing about them.
Also, as you said, both Frequentists and Bayesians accept the mathematical theory of probability. So any difference in how the two schools formally define probability would have to be based on some method of creating a mathematical system that defines new things that underlie the concept of probability and shows how these new things can be used to define a measure. I recall seeing examples where a formal mathematical model of "degree of belief" or "amount of information" is developed and probability is defined in terms of the mathematical objects in such models. Richard von Mises had the view that probability can be defined as a "limiting frequency" (http://www.statlit.org/pdf/2008SchieldBurnhamASA.pdf), but the consensus view of mathematicians is that his approach doesn't pass muster as formal mathematics.
However, I think most practicing statisticians don't think in terms of a precisely defined mathematical structure that underlies probability. The way that typical Frequentists differ from typical Bayesians is in how their imprecise and intuitive notions differ, i.e. in their metaphysical opinions.
No, of course not. But I don’t think that you can use the limit you posted above as a definition for frequency-based probability non-circularly.
I agree more or less. I would say that the issue is not exactly whether a quantity is definite but unknown, but rather whether or not to use probability to represent such a quantity.
E.g. I think that both Bayesians and frequentists would classify ##G## as definite but unknown, but Bayesians would happily assign it a PDF and frequentists would not.
I think that is only slightly different from your take.
Such a limit is used in the technical content of the Law of Large Numbers, and frequentists don't disagree with that theorem.
To me, the essential distinction between the frequentist approach and the Bayesian approach boils down to whether certain variables are assumed to represent a "definite but unknown" quantity versus a quantity that is the outcome of some stochastic process. For example, a frequentist might model a situation as a sequence of Bernoulli trials with definite but unknown probability ##p##. In that case, a question like "Given there are 5 successes in 10 Bernoulli trials, what is the probability that ##.4 < p < .6##?" is almost meaningless because ##p## is not something that has a nontrivial probability distribution. So we can only say that ##Pr(.4 < p < .6)## is either 1 or zero, and we don't know which. By contrast, a Bayesian might model the situation as a sequence of Bernoulli trials performed after Nature or something else uses a stochastic process to determine ##p##, and be bold enough to assume a probability distribution for ##p##. In that scenario, the above question has a meaningful answer.
A frequentist criticism of the Bayesian approach is: "Suppose ##p## was indeed the result of some stochastic process. The value of ##p## has already been selected by that process. Are we to base our analysis only on taking a single sample of ##p## from the process?"
A Bayesian criticism of the frequentist approach is: "You aren't setting up a mathematical problem that answers questions that people want to ask. People want answers to questions of the form 'What is the probability that <some property of the situation> is true given we have observed the data?' The way you model the problem, you can only answer questions of the form 'Assuming <some property of the situation> is true, then what is the probability of the observed data?'"
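As a rough sketch of how the Bayesian version of that question gets a number, assume (this is my assumption, not something stated above) a uniform Beta(1, 1) prior on ##p##. Then 5 successes in 10 trials give a Beta(6, 6) posterior, and for integer parameters the Beta CDF can be written as a binomial tail sum. The function name `beta_cdf_int` is just a hypothetical helper for this illustration.

```python
from math import comb

def beta_cdf_int(x, a, b):
    """CDF of Beta(a, b) at x for positive integers a, b.

    Uses the identity I_x(a, b) = P(Binomial(a + b - 1, x) >= a).
    """
    n = a + b - 1
    return sum(comb(n, j) * x**j * (1 - x)**(n - j) for j in range(a, n + 1))

successes, trials = 5, 10

# Assumed prior: uniform, i.e. Beta(1, 1). The posterior is then
# Beta(1 + successes, 1 + failures).
a = 1 + successes
b = 1 + (trials - successes)

prob = beta_cdf_int(0.6, a, b) - beta_cdf_int(0.4, a, b)
print(f"Posterior Pr(.4 < p < .6) under a uniform prior: {prob:.4f}")
```

A different prior would give a different number, which is exactly the point of contention between the two approaches.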
Nice.
Is that considered problematic by frequentist purists? It seems to define probability in terms of probability.
The interpretation of "##\lim_{N \rightarrow \infty} \frac{ n_h}{N} = P(H)##" in the sense used in calculus would say that for each ##\epsilon > 0## there exists an ##M > 0## such that if ##N > M## then ##P(H) - \epsilon < \frac{n_h}{N} < P(H) + \epsilon##. However, there is no guarantee that this will happen. To assert that it must happen contradicts the concept of a probabilistic experiment. The quantity ##\frac{n_h}{N}## is not a deterministic function of ##N##, so the notation used in calculus for limits of functions does not apply.
For independent trials, the calculus type of limit that does exist, for a given ##\epsilon > 0##, is ##\lim_{N \rightarrow \infty} S(N) = 1## where ##S(N) = Pr( P(H) - \epsilon < \frac{n_h}{N} < P(H) + \epsilon)## is a deterministic function of ##N##. To compute ##S## we use the probability distribution for ##N## replications of the experiment to compute the probability that there is a number of occurrences ##n_h## that makes ##P(H) - \epsilon < \frac{n_h}{N} < P(H) + \epsilon##. The notation "##n_h##" denotes an index variable for a summation of probabilities. We sum over all ##n_h## that satisfy the above inequality. So ##S## is a function of ##N##, not of ##n_h##.
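To make that deterministic function concrete, here is a minimal sketch that takes ##S(N)## to be the probability that ##\frac{n_h}{N}## lands within ##\epsilon## of ##P(H)##, computed by summing binomial probabilities over the qualifying ##n_h##. The values ##P(H) = 1/2## and ##\epsilon = 1/20## are assumptions chosen only for illustration, and `log_binomial_pmf` is a hypothetical helper name.

```python
import math
from fractions import Fraction

def log_binomial_pmf(n_h, N, p):
    """Log of the binomial probability of n_h occurrences in N trials."""
    log_comb = (math.lgamma(N + 1) - math.lgamma(n_h + 1)
                - math.lgamma(N - n_h + 1))
    return log_comb + n_h * math.log(p) + (N - n_h) * math.log(1 - p)

def S(N, p=Fraction(1, 2), eps=Fraction(1, 20)):
    """Pr(p - eps < n_h / N < p + eps) for N independent trials.

    n_h is only a summation index here; the result depends on N alone.
    Exact rational comparison avoids rounding trouble at the boundaries.
    """
    total = 0.0
    for n_h in range(N + 1):
        if p - eps < Fraction(n_h, N) < p + eps:
            total += math.exp(log_binomial_pmf(n_h, N, float(p)))
    return total

for N in (10, 100, 1000, 10000):
    print(N, S(N))
```

The printed values climb toward 1 as ##N## grows, which is the ordinary calculus limit of the function ##S##. Nothing here says that any particular run of the experiment must stay inside the ##\epsilon## band.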
There is no disagreement between Bayesians and frequentists about how such a limit is interpreted.
There are theorems demonstrating that in the long run the Bayesian probability converges to the frequentist probability for any suitable prior (e.g., one that is non-zero at the frequentist probability).
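A small numerical illustration of that convergence (not a proof), assuming coin-flip data with Beta priors so that the posterior mean has a closed form. The three priors and the observed frequency of 0.3 are hypothetical choices made only for this sketch.

```python
def posterior_mean(a, b, heads, flips):
    """Posterior mean of P(H) under a Beta(a, b) prior after the given data."""
    return (a + heads) / (a + b + flips)

# Hypothetical priors, all non-zero at the observed frequency.
priors = {
    "uniform Beta(1, 1)": (1, 1),
    "heads-leaning Beta(10, 2)": (10, 2),
    "tails-leaning Beta(2, 10)": (2, 10),
}

def spread(flips):
    """Largest gap between posterior means across the priors above,
    for data whose observed frequency of heads is exactly 0.3."""
    heads = 3 * flips // 10
    means = [posterior_mean(a, b, heads, flips) for a, b in priors.values()]
    return max(means) - min(means)

for flips in (10, 100, 1000, 100000):
    heads = 3 * flips // 10
    row = {name: round(posterior_mean(a, b, heads, flips), 4)
           for name, (a, b) in priors.items()}
    print(flips, row, "spread:", round(spread(flips), 5))
```

With little data the priors disagree noticeably. As the number of flips grows, every posterior mean is pulled toward the observed frequency and the disagreement shrinks.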
What do you mean here?
It should be emphasized that the notation "##P(H) = \lim_{N \rightarrow \infty} \frac{ n_h} {N}##" conveys an intuitive belief, not a statement that has a precise mathematical definition in terms of the concept in calculus denoted by the similar looking notation ##L = \lim_{N \rightarrow \infty} f(N)##.
In applications of statistics we typically assume that "in the long run" observed frequencies of events will approximately be equal to their probability of occurrence. (In applying probability theory to a real life situation, would a Bayesian disagree with that intuitive notion?) But probability theory itself does not make this assumption. The nearest thing to it is the "Law of Large Numbers", but that law, like most theorems of probability, tells us about the probability of something happening, not about an absolute guarantee that it will.
I will have numerical examples for most of them. This one was just philosophical, so it didn’t really lend itself to examples.
I think that I will have at least two more. The one I am working on now is about Bayesian inference in science. It will include how the Bayesian approach naturally includes Occam's razor and Popper's falsifiability. The fourth will be a deeper dive into the posterior distribution and the posterior predictive distribution.
After that, I don’t know.
The Bayesian interpretation is straightforward. It just means that I am not certain that it is going to rain on Thursday, but I think it is likely. More operationally, if I had to bet a dollar either that it would rain on Thursday or that I would get heads on a single flip of a fair coin, then I would rather take the bet on the rain.
To update your probability you need to have a model.
For a concrete example, suppose that the only condition you were looking at is barometric pressure. A typical model might be that the log of the odds of rain is a linear function of the barometric pressure. Then the previous data would be used to estimate the slope and the intercept of that model.
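Here is a minimal sketch of that kind of model, fitted by Newton's method using only the standard library. Everything numeric in it is an assumption for illustration: the pressures and rain outcomes are synthetic, generated from a made-up slope and intercept, and are not real weather data. Pressure is centered at 1010 hPa (also an arbitrary choice) to keep the fit numerically stable.

```python
import math
import random

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic(xs, ys, iterations=25):
    """Fit log-odds = b0 + b1 * x by Newton's method (maximum likelihood)."""
    b0 = b1 = 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(b0 + b1 * x)
            w = p * (1.0 - p)
            g0 += y - p          # gradient of the log-likelihood
            g1 += (y - p) * x
            h00 += w             # negative Hessian entries
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Synthetic "previous data": made-up parameters, not measurements.
random.seed(0)
TRUE_INTERCEPT, TRUE_SLOPE = 0.0, -0.15
pressures = [random.uniform(990.0, 1030.0) for _ in range(400)]
xs = [p - 1010.0 for p in pressures]
ys = [1 if random.random() < sigmoid(TRUE_INTERCEPT + TRUE_SLOPE * x) else 0
      for x in xs]

intercept, slope = fit_logistic(xs, ys)

def prob_rain(pressure_hpa):
    """Estimated probability of rain at the given barometric pressure."""
    return sigmoid(intercept + slope * (pressure_hpa - 1010.0))

print("intercept:", intercept, "slope:", slope)
print("P(rain | 995 hPa):", prob_rain(995.0))
print("P(rain | 1025 hPa):", prob_rain(1025.0))
```

This is a plain maximum-likelihood fit. A Bayesian treatment of the same model would put priors on the slope and intercept and report a posterior for them instead of point estimates.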
Well, I am a moderate Bayesian, so I do lean towards Bayes in my preferences. But being moderate I also use the frequentist interpretation and frequentist methods whenever convenient or useful.
I just don’t think that my preference is “right” or that someone else’s preference is “wrong”. I use both and even find cases where using both together is helpful.