The safestats package contains safe alternatives for tests of two
proportions. These tests are applicable to data collected from two
groups (indicated with “a” and “b” from here), where each data point is
a binary outcome 0 (e.g., ill) or 1 (e.g., cured). For example, group
“a” might refer to the group of patients that are given the placebo,
whereas group “b” is given the drug. The tests highlighted in this
vignette have been designed specifically for an *optional
stopping* setting, where one analyses research data in real time as
they come in and wants to make a decision about the hypotheses as
quickly as possible. The standard tests for comparing two proportions,
Fisher's exact test and the chi-squared test, cannot handle such
optional stopping.
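This failure of the standard tests can be illustrated with a small, self-contained simulation in base R (not using safestats): we repeatedly apply Fisher's exact test after every new pair of patients and stop at the first p-value below 0.05, even though there is truly no difference between the groups.

```
# Naive optional stopping with a classical test: check Fisher's exact
# test after every complete (1, 1)-block and stop as soon as p < 0.05.
# The null hypothesis is true here, so every rejection is a false one.
peekingExperiment <- function(nBlocks, alpha) {
  ya <- rbinom(nBlocks, size = 1, prob = 0.5)
  yb <- rbinom(nBlocks, size = 1, prob = 0.5)
  for (i in seq_len(nBlocks)) {
    tab <- rbind(c(sum(ya[1:i]), i - sum(ya[1:i])),
                 c(sum(yb[1:i]), i - sum(yb[1:i])))
    if (fisher.test(tab)$p.value < alpha) return(TRUE)
  }
  FALSE
}
set.seed(1)
falseRejections <- replicate(300, peekingExperiment(nBlocks = 50, alpha = 0.05))
mean(falseRejections)  # far above the promised 0.05
```

The safe tests in this vignette are designed so that exactly this kind of continuous monitoring keeps the type I error below \(\alpha\).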

In this vignette we will first describe how you can design, plan, and execute safe experiments based on desired significance levels, power, and minimal clinically relevant differences or odds ratios with the functions in the safestats package. Next, we illustrate how the power of experiments can be increased and how experiments can be finished more quickly through optional stopping, and how borderline significant experiments can be extended through optional continuation with safe testing. We also show how to build safe confidence intervals and confidence sequences.

The main aim of this vignette is to give practical code examples and some intuition into optional stopping and continuation scenarios through numerical illustrations; details and theoretical background can be found in our paper (Turner et al., 2021).

Before we can start an experiment, to be able to do optional stopping
with safe tests, we have to decide on the incoming *data block
size*. A block is a set of data consisting of \(n_a\) outcomes in group \(a\) and \(n_b\) outcomes in group \(b\), for some pre-specified \(n_a\) and \(n_b\). One is allowed to perform a safe
test and look at the results each time a new block is complete.
Therefore, one should choose these blocks as small as possible, enabling
the user to evaluate the accumulating evidence as often as possible. For example, if
we consider a balanced design, it is best to choose \(n_a = 1\) and \(n_b = 1\). Imagine a clinical trial
researcher starting up their computer after the weekend and seeing that
data of 15 patients have been processed in a control group, and 12 in a
case group. The researcher could then perform the safe test with the
data of the first 12 patients from each group, as 12 complete “blocks”
with one patient from each group are present.
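The number of complete blocks follows directly from the group totals and the block sizes; the helper below is our own illustration, not a safestats function:

```
# Number of complete data blocks, given the total number of
# observations per group and the pre-specified block sizes.
completeBlocks <- function(nObsA, nObsB, na = 1, nb = 1) {
  min(floor(nObsA / na), floor(nObsB / nb))
}

completeBlocks(nObsA = 15, nObsB = 12)          # balanced design: 12 blocks
completeBlocks(nObsA = 15, nObsB = 12, na = 2)  # with na = 2: only 7 blocks
```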

The safe test functions expect numeric vectors giving the number of successes per block as arguments. An example of observed data in the correct format for the package, with 10 blocks of \(n_a = n_b = 1\), could be:

```
ya <- c(1,1,1,1,1,1,0,1,1,1)
yb <- c(0,0,1,0,1,0,0,0,0,0)
```

We might know the minimal clinically relevant effect size upfront for our test of two proportions. For example, we might only be interested in further researching or developing a drug when the difference ("delta" in the code) in the proportion of cured patients in the treatment group compared to the placebo group is at least 0.3. In practice this implies, for example, that when 20% of patients get cured on average in the placebo group, we want the drug to add at least 30 percentage points to this average, so in the treated group at least 50% of patients should be cured. We could design a safe test that achieves a power of 80% while testing at the relatively standard significance level of 0.05 in this setting:

```
balancedSafeDesign <- designSafeTwoProportions(
  na = 1,
  nb = 1,
  alpha = 0.05,   # significance level for testing
  beta = 1 - 0.8, # power 80% equals type II error (beta) of 0.2
  delta = 0.3
)
```

```
print(balancedSafeDesign)
#>
#> Safe Test of Two Proportions Design
#>
#> na±2se, nb±2se, nBlocksPlan±2se = 1±0, 1±0, 95±5.042694
#> minimal difference = 0.3
#> alternative = twoSided
#> alternative restriction = none
#> power: 1 - beta = 0.8
#> parameter: Beta hyperparameters = standard, REGRET optimal
#> alpha = 0.05
#> decision rule: e-value > 1/alpha = 20
#>
#> Timestamp: 2022-11-23 10:15:05 CET
#>
#> Note: Optimality of hyperparameters only verified for equal group sizes (na = nb = 1)
```

For detecting this difference with a power of at least 80%, while testing at significance level 0.05, we would need to observe 95 data blocks.

A safe test could now be performed with this design object; for this, some mock data are generated below:

```
set.seed(19012022)
successProbabilityA <- 0.2
successProbabilityB <- 0.5
ya <- rbinom(n = balancedSafeDesign[["nPlan"]]["nBlocksPlan"], size = 1, prob = successProbabilityA)
yb <- rbinom(n = balancedSafeDesign[["nPlan"]]["nBlocksPlan"], size = 1, prob = successProbabilityB)
graphics::plot(x = seq_along(ya), y = cumsum(ya), type = "p", ylim = c(0, max(c(sum(ya), sum(yb)))),
               xlab = "block number", ylab = "# successes",
               main = "Total # successes per group over time")
graphics::points(x = seq_along(yb), y = cumsum(yb), type = "p", col = "grey")
graphics::legend(x = "topleft", legend = c("A", "B"), col = c("black", "grey"), pch = 1)
```

Performing the safe test reveals:

```
safe.prop.test(ya = ya, yb = yb, designObj = balancedSafeDesign)
#>
#> Safe Test of Two Proportions
#>
#> data: ya and yb. nObsA = 95, nObsB = 95
#>
#> test: Beta hyperparameters = standard, REGRET optimal
#> e-value = 48222 > 1/alpha = 20 : TRUE
#> alternative hypothesis: true difference between proportions in group a and b is not equal to 0
#>
#> design: the test was designed with alpha = 0.05
#> for experiments with na = 1, nb = 1, nBlocksPlan = 95
#> to guarantee a power = 0.8 (beta = 0.2)
#> for minimal relevant difference = 0.3 (twoSided)
```

In this example, we would reject the null hypothesis that the true
difference between the success probabilities in the groups receiving
either the new treatment or a placebo is equal to 0. The interpretation
of the “e-values” calculated in safe tests is the opposite of that of
p-values: the higher the e-value, the more *evidence* for the
alternative hypothesis we have. The e-value exceeds the significance
threshold of \(1 / \alpha = 1 / 0.05 =
20\), hence, we reject the null hypothesis. More details and the
surprisingly simple proof of the type-I error guarantee of e-values can
be found in the original safe testing paper, on page 3 (Grünwald, 2019).
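The essence of that guarantee can be checked numerically: under the null an e-value has expectation at most 1, so by Markov's inequality \(P(E \geq 1/\alpha) \leq \alpha\). Below is a toy illustration in base R with a simple likelihood-ratio e-value (a Bernoulli(0.5) null against a Bernoulli(0.7) alternative); this toy construction is ours, not the safestats test itself:

```
# A likelihood ratio is an e-value: its expectation under the null is 1.
# By Markov's inequality, it exceeds 1/alpha with probability at most alpha.
set.seed(1)
alpha <- 0.05
eValues <- replicate(1e4, {
  x <- rbinom(50, size = 1, prob = 0.5)         # data generated under the null
  prod(dbinom(x, 1, 0.7) / dbinom(x, 1, 0.5))   # likelihood ratio e-value
})
mean(eValues >= 1 / alpha)  # empirical type I error, well below alpha
```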

We might not have enough resources to fund our study to detect the
minimal difference of 0.3. For example, we might only have funds to
treat *50* patients in each group, so *100 in total*. If
this is the case, we could inspect the minimal number of patients we
need for the experiment to achieve a power of 80% at our significance
level per effect size of interest in the worst case:

```
#make the plot for standard parameter settings
standardPrior <- list(betaA1 = 0.18,
                      betaA2 = 0.18,
                      betaB1 = 0.18,
                      betaB2 = 0.18)
#set a seed for the simulations
set.seed(1082021)
sampleSizePerMinimalDifference <- simulateTwoProportions(
  list(standard = standardPrior),
  alternativeRestriction = "none",
  alpha = 0.05,
  beta = 0.2,
  na = 1,
  nb = 1,
  deltamin = 0.3,
  deltamax = 0.9
)
```

```
graphics::plot(sampleSizePerMinimalDifference)
```

The simulateTwoProportions function generates experiments for each
minimal difference on the x-axis and carries out a simulation with
*optional stopping*, i.e., experiments are stopped as soon as
\(e > 1/\alpha=20\) is observed. The realised average number of
patients in the worst case, shown as *worst case*, depicts the
sample size we need to plan for if we want to reach 80% power. Observe
that the smallest absolute difference detectable with our available
resources is 0.47 if we want the worst-case power guarantee; we might
have to cancel the study, or try to acquire more research funds. With
our current funds, we cannot guarantee a high enough power for
detecting the difference between the groups we are interested in. This
implies that, when a non-significant result is obtained, we would be
unsure whether this was caused by our underpowered study, or whether
there really was no difference between the groups.

Furthermore, the plot also shows the *expected sample sizes*
under optional stopping. To retrieve these, the simulateTwoProportions
function carries out simulations with *optional stopping* where
experiments were stopped when \(e >
1/\alpha=20\) was observed, or when the worst-case sample size
for the 80% power guarantee was reached. Observe that the difference
between the planned sample size and the expected sample size under
optional stopping is substantial. In section 2, the behaviour of the
safe test for two proportions and Fisher’s exact test under optional
stopping is studied further.

What happens if we want to have \(n_a \neq n_b\) within the data blocks? For example, in a case-control study where we expect to collect twice as many controls as cases? We can incorporate this into our safe design:

```
unbalancedSafeDesign <- designSafeTwoProportions(
  na = 2,
  nb = 1,
  alpha = 0.05,
  beta = 0.2,
  delta = 0.3
)
```

We can then again perform a safe test each time a new complete data block has become available, i.e., when we have collected data of two new patients in the control group and one in the case group. The data we use for the test again comprise the number of successes per group per data block, and could, for example, look like the following:

```
set.seed(692021)
yaUnbalanced <- rbinom(n = unbalancedSafeDesign[["nPlan"]]["nBlocksPlan"], size = 2, prob = successProbabilityA)
ybUnbalanced <- rbinom(n = unbalancedSafeDesign[["nPlan"]]["nBlocksPlan"], size = 1, prob = successProbabilityB)
print(yaUnbalanced)
#> [1] 0 1 0 0 0 1 1 1 0 0 0 0 0 0 2 0 1 0 1 1 2 2 0 1 0 1 1 1 0 1 1 0 2 0 0 1 2 1
#> [39] 1 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 1 1 0 2 0 0 0 1 0 1 1 0 0 0 0 0
#> [77] 0 0 0
print(ybUnbalanced)
#> [1] 1 1 1 0 1 1 0 1 1 0 1 0 1 1 1 0 0 1 0 0 0 0 1 0 1 1 1 0 0 0 1 0 0 1 0 0 1 1
#> [39] 0 0 0 0 1 1 0 0 0 0 1 1 1 0 0 1 0 0 0 0 0 0 0 1 1 0 1 0 0 1 1 1 1 1 0 1 1 1
#> [77] 0 0 0
```

Notice that in the control group, we can now observe \(0\), \(1\) or \(2\) successes per data block, as we observe a total of 2 patients per block. We can perform the safe test with this design object and data exactly as before:

```
safeTwoProportionsTest(ya = yaUnbalanced, yb = ybUnbalanced, designObj = unbalancedSafeDesign)
#>
#> Safe Test of Two Proportions
#>
#> data: yaUnbalanced and ybUnbalanced. nObsA = 158, nObsB = 79
#>
#> test: Beta hyperparameters = standard, REGRET optimal
#> e-value = 1.8097 > 1/alpha = 20 : FALSE
#> alternative hypothesis: true difference between proportions in group a and b is not equal to 0
#>
#> design: the test was designed with alpha = 0.05
#> for experiments with na = 2, nb = 1, nBlocksPlan = 79
#> to guarantee a power = 0.8 (beta = 0.2)
#> for minimal relevant difference = 0.3 (twoSided)
```

When the safe design for the unbalanced design is printed, one might notice that a note is printed about Beta prior parameter settings:

```
print(unbalancedSafeDesign)
#>
#> Safe Test of Two Proportions Design
#>
#> na±2se, nb±2se, nBlocksPlan±2se = 2±0, 1±0, 79±4.043125
#> minimal difference = 0.3
#> alternative = twoSided
#> alternative restriction = none
#> power: 1 - beta = 0.8
#> parameter: Beta hyperparameters = standard, REGRET optimal
#> alpha = 0.05
#> decision rule: e-value > 1/alpha = 20
#>
#> Timestamp: 2022-11-23 10:17:06 CET
#>
#> Note: Optimality of hyperparameters only verified for equal group sizes (na = nb = 1)
```

The test for two proportions “learns” from the data as it comes in
during the optional stopping process; the speed with which this happens
is influenced by these Beta prior parameter settings. These have been
optimised with respect to power and a notion of performance called
*regret* for the balanced trial setting; details can be found in
Turner et al., 2021. In numerical experiments, good performance has
been observed for various unbalanced designs with these standard prior
settings. However, one might want to experiment
with different settings, for which the following code could be used. In
this example, we compare an equivalent of a uniform prior to the
standard prior settings of the safe design:

```
#compare a uniform-like prior to the standard parameter settings
standardPrior <- unbalancedSafeDesign[["betaPriorParameterValues"]]
uniformLikePrior <- list(betaA1 = 2, # pretend we "have seen" 2 successes and
                         betaA2 = 2, # 2 failures in group A before starting
                         betaB1 = 1, # and 1 success and 1 failure in group B
                         betaB2 = 1)
#set a seed for the simulations
set.seed(1082021)
sampleSizePerMinimalDifference <- simulateTwoProportions(
  hyperparameterList = list(standard = standardPrior, uniform = uniformLikePrior),
alternativeRestriction = "none",
alpha = 0.05,
beta = 0.2,
na = 2,
nb = 1,
deltamin = 0.3,
deltamax = 0.9
)
```

```
graphics::plot(sampleSizePerMinimalDifference)
```

From this plot, it can be observed that the worst-case and expected sample sizes for smaller differences between the groups are lower when using the uniform-like prior settings. When carrying out the study in this unbalanced setting, it might therefore pay off to use this non-standard prior.

Above we already briefly mentioned that the safe test *learns*
from the data stream during the optional stopping process. When one is
quite certain, before starting the experiment, about the difference one
expects to detect between the groups, this can be used to speed up the
learning process: the safe test can be "optimized" to detect
differences of a certain size and larger. For example, we could
restrict the alternative hypothesis of our safe design to all instances
where the difference between the groups is 0.3 or bigger. This is
easily done by adding one argument to the designSafeTwoProportions
function:

```
differenceRestrictedSafeDesign <- designSafeTwoProportions(
  na = 1,
  nb = 1,
  alpha = 0.05,   # significance level for testing
  beta = 1 - 0.8, # power 80% equals type II error (beta) of 0.2
  delta = 0.3,
  alternativeRestriction = "difference" # also available: logOddsRatio
)
```

We can use this new design on the data obtained for the unrestricted balanced safe design above:

```
safe.prop.test(ya = ya, yb = yb, designObj = differenceRestrictedSafeDesign)
#>
#> Safe Test of Two Proportions
#>
#> data: ya and yb. nObsA = 95, nObsB = 95
#>
#> test: Beta hyperparameters = standard, REGRET optimal
#> e-value = 337447 > 1/alpha = 20 : TRUE
#> alternative hypothesis: true difference between proportions in group a and b is greater than 0
#>
#> design: the test was designed with alpha = 0.05
#> for experiments with na = 1, nb = 1, nBlocksPlan = 54
#> to guarantee a power = 0.8 (beta = 0.2)
#> for minimal relevant difference = 0.3 (greater)
```

We see that we have now obtained a much higher e-value, i.e., more evidence for the alternative hypothesis. The design also shows that if the true difference between the groups were 0.3, we would only need to collect 54 data blocks instead of 95. However, one needs to be careful: if the true difference were smaller than 0.3, we could miss evidence and would need to continue experiments for longer than with the unrestricted safe test; a good illustration can be found in figure 4 of Turner et al., 2021.

In this section, we illustrate the concept of optional stopping under various scenarios.

In the figure below the spread of the stopping times among 1,000 simulated experiments is depicted, if the real effect size equals the minimal clinically relevant effect size as planned:

```
#set a seed for the simulations
set.seed(1082021)
optionalStoppingResult <- simulateOptionalStoppingScenarioTwoProportions(
  safeDesign = balancedSafeDesign,
  M = 1e3,
  thetaA = 0.2,
  thetaB = 0.5
)
```

```
plotHistogramDistributionStoppingTimes(safeSim = optionalStoppingResult,
nPlan = balancedSafeDesign[["nPlan"]]["nBlocksPlan"],
deltaTrue = 0.3)
```

We designed the safe test to have a power of at least 0.8 if the data truly came from a distribution with an absolute difference of 0.3 between the proportions of cured patients in the two groups. Has this power been achieved?

```
optionalStoppingResult[["powerOptioStop"]]
#> [1] 0.841
```

We have designed the safe test for a minimal clinically relevant effect size, but what would happen if the difference between the groups was even larger in reality, i.e., if the drug had an even bigger effect?

```
#set a seed for the simulations
set.seed(1082021)
optionalStoppingResultTrueDifferenceBigger <-
  simulateOptionalStoppingScenarioTwoProportions(
safeDesign = balancedSafeDesign,
M = 1e3,
thetaA = 0.2,
thetaB = 0.7
)
```

```
plotHistogramDistributionStoppingTimes(safeSim = optionalStoppingResultTrueDifferenceBigger,
nPlan = balancedSafeDesign[["nPlan"]]["nBlocksPlan"],
deltaTrue = 0.5)
```

We would stop, on average, even earlier! The power of the experiment also increases:

```
optionalStoppingResultTrueDifferenceBigger[["powerOptioStop"]]
#> [1] 1
```

We can also illustrate what would happen under optional stopping,
when our *null hypothesis* that there is no difference between
the effect of the drug and the placebo is true:

```
#set a seed for the simulations
set.seed(1082021)
optionalStoppingResultTrueDifferenceNull <- simulateOptionalStoppingScenarioTwoProportions(
  safeDesign = balancedSafeDesign,
  M = 1e3,
  thetaA = 0.5,
  thetaB = 0.5
)
```

```
plotHistogramDistributionStoppingTimes(safeSim = optionalStoppingResultTrueDifferenceNull,
nPlan = balancedSafeDesign[["nPlan"]]["nBlocksPlan"],
deltaTrue = 0)
```

The type I error rate has stayed below 0.05, i.e., the percentage of rejected experiments is:

```
optionalStoppingResultTrueDifferenceNull[["powerOptioStop"]]
#> [1] 0.026
```

Optional stopping, however, causes Fisher’s exact test to overreject the null. When the null is true, the rate of incorrect null rejections exceeds the tolerable \(\alpha\)-level:

```
optionalStoppingWrongFisher <-
  simulateIncorrectStoppingTimesFisher(
thetaA = 0.5,
thetaB = 0.5,
alpha = 0.05,
na = balancedSafeDesign[["nPlan"]][["na"]],
nb = balancedSafeDesign[["nPlan"]][["nb"]],
maxSimStoptime = balancedSafeDesign[["nPlan"]][["nBlocksPlan"]],
M = 1e3,
numberForSeed = 1082021
)
```

```
mean(optionalStoppingWrongFisher[["rejections"]] == 1)
#> [1] 0.196
```

In almost 20% of experiments we would reject the null, four times as much as promised! Turner et al., 2021 contains a figure showing how this probability increases even further as more and more patients are included in the experiments.

In each of the simulations above, a fraction of the experiments did not lead to the rejection of the null hypothesis. Since safe tests allow for optional continuation, one could decide to plan a replication experiment after such a ‘failed’ first experiment, for example when the e-value is borderline significant. The resulting e-values from these replication studies could then be multiplied to calculate a final e-value.
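The validity of this multiplication step rests on the fact that, under the null, an e-value has expectation at most 1, so the product of e-values from independent studies is again an e-value. A toy check in base R (again with a simple likelihood-ratio e-value of our own, not the safestats test):

```
# The product of two independent e-values is again an e-value:
# under the null its expectation stays at most 1, so comparing the
# product to 1/alpha preserves the type I error guarantee.
set.seed(1)
oneEValue <- function(n = 20) {
  x <- rbinom(n, size = 1, prob = 0.5)          # data under the null
  prod(dbinom(x, 1, 0.7) / dbinom(x, 1, 0.5))   # likelihood ratio e-value
}
combinedEValues <- replicate(1e4, oneEValue() * oneEValue())
mean(combinedEValues >= 20)  # type I error of the combined test, below 0.05
```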

We are now going to zoom in on two of the optional stopping simulations we carried out above, where the true difference between the groups equalled our design difference (0.3), and where the true difference equalled 0. In the experiment where the true difference was 0.3, we did not reject the null in 15.9% of the studies. If we now imagine the situation we would encounter in reality, where we would not know that we were really sampling from the alternative hypothesis, how high should e-values then be to support starting a replication study? To give ourselves some pointers, we could look at the spread of e-values from studies where the null was not rejected, from our experiments under the null and under the alternative:

```
eValuesNotRejected <-
  optionalStoppingResult[["eValues"]][!optionalStoppingResult[["allSafeDecisions"]]]
eValuesNotRejectedNull <-
  optionalStoppingResultTrueDifferenceNull[["eValues"]][!optionalStoppingResultTrueDifferenceNull[["allSafeDecisions"]]]
```

It can be observed that, when the true difference between the groups equals our design difference, the e-values are spread out between 0 and 20. On the other hand, with our experiment under the null, all e-values were smaller than 2.

Based on this plot we could for example conclude that studies that yielded a final e-value between 10 and 20 look promising; under the null hypothesis, such high e-values were not observed in the spread plot! What would happen if we followed these studies up with a smaller extra study with 60 participants (30 data blocks), and combined the resulting e-values? How many of the initially futile experiments will now lead to rejection of the null hypothesis?

```
promisingEValues <- eValuesNotRejected[eValuesNotRejected >= 10]
followUpDesign <- designSafeTwoProportions(na = 1, nb = 1, nBlocksPlan = 30)
followUpResults <- simulateOptionalStoppingScenarioTwoProportions(followUpDesign,
                                                                  M = length(promisingEValues),
                                                                  thetaA = 0.2, thetaB = 0.5)
newEValues <- followUpResults[["eValues"]] * promisingEValues
```

We now have rejected the null hypothesis in:

```
sum(newEValues >= 20)/length(promisingEValues)
#> [1] 0.6
```

60% of the continued experiments.

What happens when we apply this optional continuation when the data are truly generated under the null hypothesis? (note that we relax our bound of initial ‘promising’ e-values here to 1, otherwise there would be no e-values to continue with)

```
promisingEValuesNull <-
  eValuesNotRejectedNull[eValuesNotRejectedNull >= 1]
followUpDesign <-
  designSafeTwoProportions(na = 1,
                           nb = 1,
                           nBlocksPlan = 30)
followUpResults <-
  simulateOptionalStoppingScenarioTwoProportions(
    followUpDesign,
    M = length(promisingEValuesNull),
    thetaA = 0.5,
    thetaB = 0.5
  )
newEValuesNull <-
  followUpResults[["eValues"]] * promisingEValuesNull
```

We still keep our type-I error probability guarantee: we have not rejected the null in any of the follow-up experiments.

Just as we were allowed to look at our e-value and decide to stop at
any time in the experiments above, we can also use e-values to
construct *anytime-valid confidence sequences*. This means
that each time a data block has come in, we can calculate a confidence
interval using safe statistics. Independent of the stopping rule one
wishes to use, we know that if we construct the confidence sequence at
significance level \(\alpha\), the probability that the true value of
the measure is not included in the confidence interval is bounded by
\(\alpha\).
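The construction behind such sequences can be mimicked in a simpler setting: the confidence set at any time consists of all parameter values whose e-value has not (yet) exceeded \(1/\alpha\). Below is a toy one-sample analogue for a single Bernoulli proportion, using a Beta(1,1)-mixture likelihood-ratio e-value over a grid of candidate values; this sketch is ours and not the two-proportion method implemented in safestats:

```
# Anytime-valid confidence set for a single Bernoulli proportion:
# keep every candidate value p0 whose mixture likelihood-ratio
# e-value stays below 1/alpha.
confidenceSetBernoulli <- function(x, alpha = 0.05,
                                   grid = seq(0.01, 0.99, by = 0.01)) {
  s <- sum(x); n <- length(x)
  # log e-value: Beta(1, 1)-mixture marginal likelihood divided by
  # the likelihood under the candidate value p0
  logE <- lbeta(s + 1, n - s + 1) - (s * log(grid) + (n - s) * log(1 - grid))
  grid[logE < log(1 / alpha)]
}

set.seed(1)
x <- rbinom(100, size = 1, prob = 0.3)
range(confidenceSetBernoulli(x))  # an interval containing the true value 0.3
```

Because the mixture likelihood ratio is a nonnegative martingale under each candidate value, this set is valid at any stopping time, which is exactly the property the two-proportion confidence sequences below provide.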

For the two-by-two contingency tables, we have developed confidence sequences for two measures of interest: the difference between the success rates in the groups, and the logarithm of the odds ratio between the groups (similar to the relative risk ratio). The mathematical details can be found in section 7.3 of Turner et al., 2021, version 1; we will here instead focus on some practical examples.

Let us again use the example from above, where we wanted to compare
probabilities of being cured for two groups of patients, one receiving a
placebo and one a new treatment. However, now, we do not want to test
*if* there exists a difference between the groups, but we want to
generate an *estimate of that difference*. We again consider the
setting where we plan to collect approximately 100 data blocks, i.e.,
100 patients per group. We re-use the data generated in the previous
experiment, with success probability 0.2 in the placebo group, and 0.5
in the treated group. The code below is used to iterate over the
observed data blocks and store the confidence interval observed at each
time point:

```
plotConfidenceSequenceTwoProportions(ya = ya,
yb = yb,
safeDesign = balancedSafeDesign,
differenceMeasure = "difference",
precision = 100,
trueDifference = 0.3)
```

The above plot shows that after finishing our experiment, \(0\) is not included in the confidence interval (grey dashed line). The true value, 0.3, remains included. The precision indicates how many difference values between -1 and 1 are checked while building the confidence sequence. It is recommended to set this value to 100 (default).

The code below can be used to check that our confidence sequence indeed offers the \(1 - \alpha\) guarantee and includes the difference between the two success probabilities of \(0.3\) in at least 95% of simulated scenarios:

```
coverageSimResult <- simulateCoverageDifferenceTwoProportions(successProbabilityA = 0.2,
                                                              trueDelta = 0.3,
                                                              safeDesign = balancedSafeDesign,
                                                              numberForSeed = 1082021)
```

```
print(coverageSimResult)
#> [1] 0.974
```

Our confidence interval is even a bit conservative: the true difference (delta) is included in 97.4% of the confidence intervals in 1000 simulations.

Due to the mathematical properties of this measure, only one-sided confidence intervals can be constructed for the log of the odds ratio in the 2x2 setting (more details in Turner et al., 2021). When one hypothesizes or observes that the log odds ratio is positive (the success probability in the treatment group is bigger than that in the placebo group), a positive lower bound can be given for this log odds ratio. Vice versa, if one hypothesizes a negative log odds ratio, a negative upper bound can be given.

Let us again look at the example above, where the success probability in the treatment group was 0.5, and 0.2 in the placebo group. The true log odds ratio for this scenario is:

```
log(0.5/0.5 * 0.8/0.2)
#> [1] 1.386294
```

We can estimate a lower bound for the log of the odds ratio based on the data. We have to provide the interval to search over, and the precision to search with for this lower bound.

```
confidenceBoundOdds <- computeConfidenceBoundForLogOddsTwoProportions(ya = ya,
                                                                      yb = yb,
                                                                      safeDesign = balancedSafeDesign,
                                                                      bound = "lower",
                                                                      precision = 100,
                                                                      deltaStart = 0.001,
                                                                      deltaStop = 2)
```

```
print(confidenceBoundOdds)
#> [1] 0.7682929
```

We see that the true log odds ratio lies above this lower bound, and that a log odds ratio of 0 (i.e., no difference between the groups) is not included in the confidence interval.

When the user accidentally asks for the “wrong” type of bound to be calculated, for example, for an upper bound when the ML estimate of the logarithm of the odds ratio is positive, a warning message is displayed:

```
computeConfidenceBoundForLogOddsTwoProportions(ya = ya,
yb = yb,
safeDesign = balancedSafeDesign,
bound = "upper",
precision = 10,
deltaStart = -0.01,
deltaStop = -2)
#> Warning in computeConfidenceBoundForLogOddsTwoProportions(ya = ya, yb = yb, : No
#> upper bound could be established; try different bound or smaller deltaStart
#> [1] 0
```

As with the difference confidence sequence, we can visualize how the confidence sequence converges over the data blocks:

```
plotConfidenceSequenceTwoProportions(ya = ya,
yb = yb,
safeDesign = balancedSafeDesign,
differenceMeasure = "odds",
precision = 100,
trueDifference = log(0.5/0.5 * 0.8/0.2),
deltaStart = 0.001,
deltaStop = 2)
```

1. Rosanne Turner, Alexander Ly, and Peter Grünwald. Two-Sample Tests that are Safe under Optional Stopping, with an Application to Contingency Tables, 2021. https://arxiv.org/abs/2106.02693

2. Peter Grünwald, Rianne de Heide and Wouter Koolen. Safe testing, 2019. https://arxiv.org/abs/1906.07801