It’s all a matter of confidence.

In the first article, Beyond Averages, I used Monte Carlo simulation to demonstrate experimentally the theoretical probability of 31.25% for the event of getting exactly 3 heads when flipping a coin 5 times.

In the implementation, I used only one input, totalSimulacoes (Portuguese for “total simulations”), because the total number of simulations affects the statistical confidence of the result.

I also did not use any method to choose the total number of simulations to execute. In that demonstration, arbitrary amounts were used:

  • A very low amount: 50 simulations.
  • An extremely high amount: 1 billion simulations.
  • A high amount: 200 million simulations.

The results are in the box below.

# Few rounds do not bring a good estimate; with 1 round, for example, the estimated probability will be either 0% or 100%.

./flipCoin3on5 1
Simulated 1 rounds of 5 tosses...
Cases with exactly 3 heads: 0
Estimated Probability: 0.0000 (0.00%)

./flipCoin3on5 1
Simulated 1 rounds of 5 tosses...
Cases with exactly 3 heads: 1
Estimated Probability: 1.0000 (100.00%)

# With an insufficient amount of rounds, the data is not reliable.
./flipCoin3on5 50
Cases with exactly 3 heads: 17
Estimated Probability: 0.3400 (34.00%)

./flipCoin3on5 50
Cases with exactly 3 heads: 17
Estimated Probability: 0.3400 (34.00%)

# With too many rounds, we have a great estimate but we waste time.
# 1 Billion rounds => 1 minute
time ./flipCoin3on5 1000000000   
Simulated 1000000000 rounds of 5 tosses...
Cases with exactly 3 heads: 312500797
Estimated Probability: 0.3125 (31.25%)
real    1m8,066s
user    1m8,147s
sys    0m0,080s

# With enough rounds, we have a great estimate 
# with 8 out of 10 executions yielding an estimated probability identical to the theoretical one.
# Two hundred million rounds => 13 sec
time ./flipCoin3on5 200000000 
Simulated 200000000 rounds of 5 tosses...
Cases with exactly 3 heads: 62506172
Estimated Probability: 0.3125 (31.25%)
real    0m13,721s
user    0m13,732s
sys    0m0,012s

time ./flipCoin3on5 200000000
Simulated 200000000 rounds of 5 tosses...
Cases with exactly 3 heads: 62489409
Estimated Probability: 0.3124 (31.24%)
real    0m13,547s
user    0m13,557s
sys    0m0,024s

In these simulations, with the theoretical probability in hand (31.25%), I allowed myself to vary totalSimulacoes until I found an acceptable result, as with 200 million simulations, to reach a precision of ±0.01% (or 0.0001).

It was also only because I had the theoretical probability that the whole empirical notion of an acceptable result, a prediction, and an ideal value for totalSimulacoes seemed defensible. It was just a matter of looking at the correctly calculated result of 31.25% and playing “Hunt the thimble”.

Thus, the next question I find interesting is: what if? What if we didn’t have the theoretical probability?

Without the theoretical probability to compare against, it wouldn’t be possible to play hot and cold, thinking: “the sample of 50 landed far from the right result, but the 200 million one landed close to the calculated number… it must be right enough.”

I would have to abandon accuracy (the closeness of a measurement to its true value), since there would be nothing to compare against, and seek another way to meet the need to trust the estimate.

It could be precision…
The first idea would be to start thinking about precision. Precision is the closeness between multiple repeated measurements (repeatability of the result), even if they are wrong.

Is the probability estimated with a totalSimulacoes of 50 precise? In a series of 5 estimates, the results were: 0.34, 0.40, 0.37, 0.20, 0.30. Definitely not!

However, when I use a totalSimulacoes of 1 billion, the series of 5 brings good precision: 0.312497, 0.312493, 0.31254, 0.312496, 0.312497. But each round took ± 9 minutes with flipCoin.go (and ± 3 minutes with flipCoinBigNumber.go).
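One way to quantify this “repeatability” is the sample standard deviation of a series of estimates. A minimal sketch, using the two series of five estimates quoted above (the helper name sampleStdDev is illustrative, not from the article’s code):

```go
package main

import (
	"fmt"
	"math"
)

// sampleStdDev returns the standard deviation of a series of
// estimates (dividing by n-1, Bessel's correction).
func sampleStdDev(xs []float64) float64 {
	mean := 0.0
	for _, x := range xs {
		mean += x
	}
	mean /= float64(len(xs))
	sumSq := 0.0
	for _, x := range xs {
		sumSq += (x - mean) * (x - mean)
	}
	return math.Sqrt(sumSq / float64(len(xs)-1))
}

func main() {
	low := []float64{0.34, 0.40, 0.37, 0.20, 0.30}                     // 50 rounds each
	high := []float64{0.312497, 0.312493, 0.31254, 0.312496, 0.312497} // 1 billion rounds each

	fmt.Printf("spread with 50 rounds:        %.6f\n", sampleStdDev(low))
	fmt.Printf("spread with 1 billion rounds: %.6f\n", sampleStdDev(high))
}
```

The 50-round series scatters by several percentage points, while the 1-billion-round series agrees to the fourth decimal place: that gap is precision.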

… or confidence.
The confidence level you decide you need.

Unlike what I sought with the ideas above, where the notion of the “right result” was subjective, the idea of a confidence level is objective.

By executing 2 billion rounds of the game “flip the coin 5 times and count how often I get 3 heads,” the variation in the estimate decreased. I could increase the number of rounds even further, but up to what point should I keep increasing it to be sure?

Using 1 million rounds gave me a small variation, but it didn’t meet my goal of experimentally proving the theoretical answer of 31.25%, because it still failed to pin down the hundredths place.

In the table below, you can see the variation decreasing (the number of stable decimal places increasing), meaning the precision of the response is increasing.

Number of Rounds | Estimate
50               | 0.2600000
                 | 0.3400000
                 | 0.2800000
1 million        | 0.3125800
                 | 0.3120200
                 | 0.3128260
                 | 0.3124460
42 million       | 0.3125884
                 | 0.3125093
                 | 0.3124413
400 million      | 0.31250427
                 | 0.31249493

This perception of improvement lacks proof, and for that, we need two tools: the Standard Error formula and the zScore.

The Standard Error (SE)

The standard error measures the expected dispersion of the estimates around the sample mean (the sum of all sample values divided by the number of values).

error = sqrt( p * (1 - p) / n )

// variance of a single toss: p * (1 - p).
// For a fair coin, the chance of heads or tails is 50% = 0.5,
// so 0.5 * (1 - 0.5) = 0.5 * 0.5 = 0.25
// n: number of trials.
error := math.Sqrt(p * (1 - p) / float64(n))

An interesting question to answer is: Why are we using a p of 0.5? There is an article addressing this topic here: [[How to deal with the Standard Error if I don’t know the theoretical probability?]] The answer, for now, is that we chose to ignore the theoretical probability of 0.3125 in favor of a purely experimental analysis.

And according to the experiments, we have the following standard errors:

Number of Rounds | Standard Error
50               | 0.0707   => 7.07%
1 million        | 0.0005   => 0.05%
42 million       | 0.000077 => 0.0077%
400 million      | 0.000025 => 0.0025%

“Precision in Monte Carlo is expensive: the computational cost grows by the square of the desired precision.”

At this point, you can see how much it costs to reduce the error. This is a key point of the problem: precision does not increase linearly. You need to quadruple the sample size to double the precision, a direct consequence of the square root of n in the Standard Error formula (the Law of Large Numbers guarantees convergence, but the rate is set by that square root).

Let’s analyze the data:

Confidence Ruler

In practice, the standard error is the unit of a confidence ruler. So, the value of 7.07%, for example, is the typical size of the deviation we expect to see in 50 tosses. Each step on the confidence ruler for the 50-round experiment represents 7.07%.

“But percentage is a relative measure. 7.07% relative to what?”

To the Normal! 🙂

Using the coin example: when tossing an unbiased coin (one not altered to land more often on one side), the theoretical probability (the normal) says each side has a 50% chance. If you toss this coin 50 times, the theoretical result is 25 heads and 25 tails.

But the real world has randomness. In reality, sometimes you will get 28 heads x 22 tails, sometimes 29 tails x 21 heads, and so on. And you will find this variation normal; it’s a matter of luck.

But if the result is 45 heads and 5 tails, you will be suspicious.

The standard error is an objective statistical tool that says whether a variation is normal or strange, as 45 heads and 5 tails seems to be.

With the standard error alone, it is still not possible to find the ideal number of simulations for the experiment, but we already learn something from it:

  1. Round your estimate to the same order of magnitude as your Standard Error.

// if SE = 0.0005 and estimate = 0.31256789

// the line below treats noise as precision
fmt.Printf("Probability of %.8f\n", estimate)

// the line below tells the truth about precision because it ignores the experimental noise
fmt.Printf("Probability of %.4f\n", estimate)

// the line below tells the truth and adds important information about rounding behavior
fmt.Printf("Probability of %.5f\n", estimate)

Remember the idea of SE being a unit on a ruler? So, if your ruler has a mark every 0.0005, how are you going to measure something smaller than that? Use this idea to remove the noise.
  2. Save computational power.
    Use the SE as a stop condition for your simulation. By stating that you want the simulation to stop when it reaches an SE of 0.0005, your simulation begins to vary according to the uncertainty of the event instead of a fixed-size loop. Don’t forget to start the sample with a significant number, or you might have a false positive with small samples (see the table for 50 rounds).

The ZScore

The zScore is a multiplier: it tells you how many standard errors away from the mean you are. In the previous image, you have a ruler with a normal curve and some markings on it. The central point of the curve is the theoretical probability; it marks 0.5 as the theoretical chance of landing heads or tails when flipping a coin, and from the center to the edges you count the standard deviations.

Below is a table of Standard Deviation and zScore. Take a look at the 68-95-99.7 rule to learn more.

Standard Deviation | Z     | Interpretation
0                  | 50%   | the data is exactly at the mean
± 1                | 68%   | approximately 68% of the data is between -1 and +1
± 1.645            | 90%   | approx. 90% of the data is within 1.645 standard deviations of the mean
± 1.96             | 95%   | approx. 95% of the data is within 1.96 standard deviations of the mean
± 3                | 99.7% | approx. 99.7% of the data is between -3 and +3

Cost of Error vs. Sample Cost

In practice, the 95% confidence interval is the most used in scientific and business areas, and this is due to an analysis that takes into account:

  • The cost of error, or how bad it is if our conclusion is wrong.
  • The cost of the sample, or how much it costs to obtain data to be more certain.

The higher the cost of the error, the higher the confidence interval used. For example, the pharmaceutical industry uses 95% confidence in the functionality of the drug compared to a placebo; however, they require 99% confidence that the drug is safe and non-toxic when looking at its side effects.

“It is more acceptable for a medicine to have no effect at all than to have a harmful side effect that can lead to death.”

Returning to our example, the confidence interval used is 95%, so the z is 1.96.

Margin of Error

MarginError = zScore x Standard Error.
ME(50 rounds) = 1.96 * 0.0707
ME(50 rounds) = 0.13857 ~= 0.1386 => 13.86%.

This means that when flipping a coin 50 times, any result up to 13.86% different from 50% is considered “normal luck.”

            | Mean | Margin of Error | Final Value | %     | Result Example
Upper Limit | 0.5  | + 0.1386        | 0.6386      | 63.86 | 31 heads x 19 tails
Lower Limit | 0.5  | – 0.1386        | 0.3614      | 36.14 | 18 heads x 32 tails

Number of Rounds | Standard Error      | Margin of Error
50               | 0.0707   => 7.07%   | ± 0.13857
1 million        | 0.0005   => 0.05%   | ± 0.00098
42 million       | 0.000077 => 0.0077% | ± 0.00015
400 million      | 0.000025 => 0.0025% | ± 0.000049

In the experiment with 50 rounds, the standard error is 7.07%, which means that if you repeat the experiment enough times, you will typically find results about 7.07 percentage points away from the theoretical probability of 50%, on either side. The script flipCoin50xSE707.go runs an experiment to validate this theoretical figure.

In practice, this means we cannot use only 50 rounds to validate a coin toss event. This is because, with this quantity, “luck” still beats the law of large numbers, and because of that, the precision of the ruler is very low.

The 50-round experiment does not have the desired precision.

To experimentally prove the theoretical probability, we need hundredth-level precision. The theoretical result is 31.25%.

To illustrate: you can’t use a ruler with 7 cm precision to measure something when you want the measurement in hundredths of a millimeter, because you will be forced to say it measures “7 cm and something.” It lacks precision.

Just as we have rulers, measuring tapes, calipers, micrometers, and other instruments with different precisions depending on the application, we use different sample sizes for different applications.

Now that we’ve discussed the parts necessary to experimentally validate the theoretical result, let’s move to the implementation.

We need code that:
a) has a margin of error that guarantees hundredth-level precision, as in 0.3125;
b) knows when to stop.

The idea is to implement a stop rule (b) using the concept of (a).

Starting from the end, the result:

Sampling Size: 40000000
Estimate : 0.312592 | Margin of Error : 0.0001 
[NOK] Margin of Error 0.000144 > 0.000140.

Sampling Size: 41000000
Estimate : 0.312556 | Margin of Error : 0.0001 
[NOK] Margin of Error 0.000142 > 0.000140.

Sampling Size: 42000000
Estimate : 0.312404 | Margin of Error : 0.0001 
[NOK] Margin of Error 0.000140 > 0.000140.

Sampling Size: 43000000
Estimate : 0.312398 | Margin of Error : 0.0001 
[OK] Margin of Error 0.000139 < 0.000140.
Simulated 43000000 rounds of 5 tosses...
Estimated Probability: 0.3124 (31.24%)

Rule Definition.

const tossesPerTime = 5                                                     
const targetHeads = 3                                                             

func main() {                                                                   

    rand.Seed(time.Now().UnixNano())                                            
    // acceptable margin of error in the estimate
    acceptedVariation := 0.00014
    // to avoid false positives with few simulations start with 10,000,000
    totalSimulations := 10000000
    // if the result is unsatisfactory, increase the number of simulations by 1,000,000
    stepSize := 1000000
    // define a 95% confidence level
    confidenceLevel := 1.96  

Executing the event to collect results.

// Inside the round function, I toss the coin 5 times and count the result.
headsInExperiment := 0
// toss the coin 5 times.    
for j := 0; j < tossesPerTime; j++ {
    if flipCoin() == 1 {
        headsInExperiment++
    }
}

// If the experiment resulted in exactly 3 heads, we count it as a success
if headsInExperiment == targetHeads {
    successes++
}

Executing the simulation until the rules are satisfied.

“I will only be satisfied with the simulation when I have 95% statistical certainty that the real value is between (Estimate – 0.00014) and (Estimate + 0.00014).”

// ... configuration variables (accepted margin of error, 95% confidence level)

for {
    // 1. Run the massive simulation (e.g., starts with 10 million)
    estimatedProbability = round(totalSimulations)

    // 2. Calculate the Standard Error for a binomial proportion
    // Formula: sqrt( (p * (1-p)) / n )
    standardError := math.Sqrt( (estimatedProbability * ( 1 - estimatedProbability)) / float64(totalSimulations))

    // 3. Calculate the current Margin of Error (based on 95% confidence, Z=1.96)
    currentMarginError := confidenceLevel * standardError

    // ... prints ...

    // 4. Check if the margin of error is already small enough
    if currentMarginError > acceptedVariation {
        // If the error is still large, increase the sample size and try again
        totalSimulations = totalSimulations + stepSize
        fmt.Printf("[NOK] Margin of Error %f > %f.\n", currentMarginError, acceptedVariation)
    } else {
        // If the error is already acceptable, stop the loop.
        fmt.Printf("[OK] Margin of Error %f < %f.\n", currentMarginError, acceptedVariation)
        break
    }
}

Why does acceptedVariation have the value of 0.00014?

  • If the acceptance were 0.01 (1%), the code would stop very quickly, but the answer would be imprecise (e.g., 31% ± 1%).
  • If the acceptance were 0.000001, the code would run for hours or days to reach extreme and perhaps unnecessary precision.
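The choice can also be made the other way around: invert the margin-of-error formula to predict how many rounds a given tolerance will cost. Solving ME = z·sqrt(p·(1-p)/n) for n gives n = z²·p·(1-p)/ME². A quick sketch, using the estimate p ≈ 0.3125 observed during the run (the helper name requiredRounds is illustrative):

```go
package main

import (
	"fmt"
	"math"
)

// requiredRounds inverts the margin-of-error formula:
// n = z^2 * p*(1-p) / ME^2
func requiredRounds(z, p, me float64) int {
	return int(math.Ceil(z * z * p * (1 - p) / (me * me)))
}

func main() {
	// With the observed estimate p ≈ 0.3125 and the article's tolerance:
	fmt.Println(requiredRounds(1.96, 0.3125, 0.00014)) // on the order of 42 million
	// A 10x looser tolerance costs 100x fewer rounds:
	fmt.Println(requiredRounds(1.96, 0.3125, 0.0014))
}
```

This predicts a sample on the order of 42 million rounds, which matches where the adaptive loop actually stopped (43 million).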

Why start the number of rounds with 10 million?

  • To avoid a false positive in which, by pure luck, a small sample reports an error small enough to stop (see the table for 50 rounds).

This was a long journey to answer a seemingly simple question.
This exercise of explaining the “whys” of the code was a journey focused on cementing probability concepts, but it was not an exhaustive journey.

Despite being long, the explanation is simplified and leaves several points to be explained, which I will address in other texts and link back to this one, allowing broader navigation through the concepts and learnings. And since curiosity is the mother of learning, nothing is better than questions to illustrate these points.

  • Why was 0.5 used for the coin probability?
  • Why does noise decrease and bring me closer to the theoretical probability when increasing the number of simulations?
  • Why was a 95% confidence level used?
  • Why isn’t level 2 in the confidence level table?
  • Why was a Confidence Interval used?
  • Why does the result of the coin’s binomial distribution fit into the result of the normal distribution?
  • Why is it confusing to talk about Standard Deviation and Standard Error?

You can find the full implementation code in flipCoin3on5_notheoricalKnowledge.go

GitHub repository: https://github.com/iannsp/montecarlo