Formal architecture verification in practice

Introduction

When we want to make sure that our code works as expected, we write unit tests or integration tests. But these tests can only reveal the errors they happen to exercise; they cannot guarantee the absence of errors. Formal architecture verification can do just that, within the boundaries of the method. Let’s look at a real-life example using TLA+.

Use Case

In one of our projects for the healthcare sector, we use a trust anchor (TA) to verify some certificates (see here for more information about such processes). This TA has a validity timestamp after which it is not supposed to be used anymore. To make sure this trust anchor is always up-to-date, we use a scheduled update service to download the newest available version from a central server. This update service runs every morning at 8am. It downloads the TA from the central server and saves it into our own local database. From there, the service actually verifying the certificates can fetch the updated TA. This process is pictured below.

Now let’s take a closer look at what happens when an update of the TA is triggered. We’ll start with a simple update process that has two possible outcomes: Either everything is fine and the new TA is saved to the database, or something goes wrong and the process is aborted, leaving the currently used TA in the database. We’ll later see that there’s more to it than that, but for now it looks like this:

To verify that this process works as intended, we first need to specify it in TLA+.

Spec

Modeling the update process in TLA+ results in the spec below. Some hints for a better understanding:

/\ and \/ mean “and” and “or”; they can be used to combine conditions as well as assignments
step = "start" checks whether the current value of step is "start", whereas step' = "stop" means that step is assigned the new value "stop" in the next state
Related variables are grouped in sequences (tuples) for more convenient access. These are defined on lines 7 and 8 of the module as time_values and download_in_progress.
Variables (and sequences) that don’t change their value in a particular step are marked with the UNCHANGED keyword

----------------------------- MODULE UpdateTA -----------------------------
 
EXTENDS Integers
VARIABLES step, execution_time_minutes, validity_current_ta_minutes, validity_new_ta_minutes, ta_in_use, process_running
CONSTANTS time_per_step_minutes, cron_repetition_minutes
 
time_values == <<validity_current_ta_minutes, validity_new_ta_minutes, execution_time_minutes>>
download_in_progress == <<process_running, ta_in_use>>
currentTa == "current"
newTa == "new"
 
Init == /\ step = "start"
        /\ execution_time_minutes = 0
        /\ validity_current_ta_minutes = 30
        /\ validity_new_ta_minutes = 1440       \* 24h
        /\ ta_in_use = currentTa
        /\ process_running = TRUE
 
LoadAndValidateTa == /\ step = "start"
                     /\ UNCHANGED download_in_progress
                     /\ execution_time_minutes' 
                          = execution_time_minutes + time_per_step_minutes
                     /\ validity_current_ta_minutes' 
                          = validity_current_ta_minutes - time_per_step_minutes
                     /\ validity_new_ta_minutes' 
                          = validity_new_ta_minutes - time_per_step_minutes
                     /\ step' \in {"update", "abort"}
               
UpdateCurrentTa == /\ step = "update"
                   /\ process_running' = ~process_running
                   /\ validity_current_ta_minutes' 
                        = validity_current_ta_minutes - time_per_step_minutes
                   /\ execution_time_minutes' 
                        = execution_time_minutes + time_per_step_minutes
                   /\ validity_new_ta_minutes' 
                        = validity_new_ta_minutes - time_per_step_minutes
                   /\ ta_in_use' = newTa
                   /\ step' = "stop"
 
AbortProcess == /\ step = "abort"
                /\ process_running' = ~process_running
                /\ UNCHANGED time_values
                /\ ta_in_use' = currentTa
                /\ step' = "stop"
 
Next == LoadAndValidateTa \/ UpdateCurrentTa \/ AbortProcess
 
ValidTaInUse == \/ /\ ta_in_use = newTa
                   /\ validity_new_ta_minutes >= cron_repetition_minutes
                \/ /\ ta_in_use = currentTa
                   /\ validity_current_ta_minutes >= cron_repetition_minutes
                \/ /\ ta_in_use = currentTa
                   /\ process_running = TRUE
                   /\ validity_current_ta_minutes 
                        >= (2 * time_per_step_minutes - execution_time_minutes)
 
=============================================================================

Each of the steps in the process has a corresponding definition here: Init defines the initial state of the system when the update process starts, LoadAndValidateTa is where the actual work happens, and UpdateCurrentTa and AbortProcess correspond to the two possible outcomes of the process.

The most important part here is the invariant, because defining it forces us to think through the possible valid states of our process and to write them down in a verifiable way. In the example above, that’s the part called ValidTaInUse. So what are the valid states here? We came up with three:

The new TA is in the database and the validity timestamp is far enough in the future that the update cron job is guaranteed to run again before then. Otherwise the fact that the TA is no longer valid would not be detected in time.
The current TA is still in use and the validity timestamp is far enough in the future that the update cron job is guaranteed to run again before then. Otherwise we run into the same problem as in the previous scenario.
The current TA is still in use, but the update process is in progress and the validity timestamp is far enough in the future for the process to finish in time.
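
Alongside ValidTaInUse, TLA+ specs often carry a second, simpler invariant that only pins down which values each variable may take. The sketch below is such a type invariant for the spec above. The name TypeOK is a common convention and not part of the original spec, and the ranges are assumptions read off the Init and step definitions.

```tla
\* Hypothetical type invariant (not in the original spec).
TypeOK == /\ step \in {"start", "update", "abort", "stop"}
          /\ ta_in_use \in {currentTa, newTa}
          /\ process_running \in BOOLEAN
          /\ execution_time_minutes \in Nat
          \* Remaining validity may drop below zero once a TA has expired.
          /\ validity_current_ta_minutes \in Int
          /\ validity_new_ta_minutes \in Int
```

Checked as an additional invariant, it would catch slips such as a misspelled step name before they distort the result for ValidTaInUse.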

The constants used in the above spec are assigned their values in the corresponding model module for the model checker:

---- MODULE MC ----
EXTENDS UpdateTA, TLC
 
\* CONSTANT definitions @modelParameterConstants:0cron_repetition_minutes
cron_repetition_minutes_value ==
1440        \* 24h
----
 
\* CONSTANT definitions @modelParameterConstants:1time_per_step_minutes
time_per_step_minutes_value ==
1
----
 
=============================================================================

We have to make some assumptions here: we assume that each step of the process finishes within a minute, so time_per_step_minutes is set to 1.
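
Assumptions like this can also be written into the spec itself, so that the model checker rejects a model whose constants make no sense. A minimal sketch for the spec as shown above, assuming both constants are meant to be positive whole minutes (the article does not state this explicitly):

```tla
\* Hypothetical addition to UpdateTA, placed after the CONSTANTS declaration.
\* Nat is available because the module already EXTENDS Integers.
ASSUME /\ time_per_step_minutes \in Nat
       /\ time_per_step_minutes > 0
       /\ cron_repetition_minutes \in Nat
       /\ cron_repetition_minutes > 0
```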

Lastly, to make sure that the different steps are executed in the correct order, we use the state variable step. It is reassigned at the end of each step to signal which steps are possible next. So, after LoadAndValidateTa, both UpdateCurrentTa and AbortProcess are valid choices. When the model checker runs the spec, it explores both paths.
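
One consequence of this design is that no step is enabled once step has reached "stop". TLC reports such a state as a deadlock unless deadlock checking is switched off in the model. A common way around this is an explicit stuttering step for the finished process. The sketch below assumes the names Terminated and NextWithTermination, neither of which is part of the original spec.

```tla
\* Hypothetical: the finished process may stay in its final state forever.
Terminated == /\ step = "stop"
              /\ UNCHANGED <<step, time_values, download_in_progress>>

\* Used in place of Next as the next-state relation of the model.
NextWithTermination == Next \/ Terminated
```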

Findings

Now we want to check this spec with TLC, the TLA+ model checker. There is a dedicated toolbox for that, the TLA+ Toolbox, but there is also, for example, a plugin for IntelliJ IDEA.

When we run the model checker, it gives us error traces for two paths.

First, if the TA is successfully updated, we end up with a new TA that won’t be valid long enough for the cron job to run again. This violates the first condition.

Second, if an error occurs while loading and validating the new TA, the process will be aborted and we end up with the current TA still in use, although it might not be valid long enough for the cron job to run again. This is in violation of the second condition of our invariant.
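
Both traces can be reproduced by hand from the spec with time_per_step_minutes = 1 and cron_repetition_minutes = 1440. The two state predicates below describe the final state of each path. The names AfterUpdate and AfterAbort are only for illustration, and the values are derived from the step definitions rather than copied from tool output.

```tla
\* Init -> LoadAndValidateTa -> UpdateCurrentTa: two steps of one minute each.
AfterUpdate == /\ step = "stop"
               /\ ta_in_use = newTa
               /\ process_running = FALSE
               /\ execution_time_minutes = 2
               /\ validity_current_ta_minutes = 28
               /\ validity_new_ta_minutes = 1438     \* 1440 - 2, less than 1440

\* Init -> LoadAndValidateTa -> AbortProcess: the abort step leaves
\* time_values unchanged.
AfterAbort == /\ step = "stop"
              /\ ta_in_use = currentTa
              /\ process_running = FALSE
              /\ execution_time_minutes = 1
              /\ validity_current_ta_minutes = 29    \* less than 1440
              /\ validity_new_ta_minutes = 1439
```

In AfterUpdate the first disjunct of ValidTaInUse fails because 1438 < 1440. In AfterAbort the second disjunct fails because 29 < 1440, and the third no longer applies because process_running is FALSE.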

Adjusting the spec (and architecture)

Now that we know the weaknesses of our architecture, we can think about how to fix them. One obvious fix for the first problem is to run the cron job more frequently. To figure out an interval that works, we can declare cron_repetition_minutes as a variable instead of a constant and include it in our Init statement like this:

Init == /\ step = "start"
        /\ execution_time_minutes = 0
        /\ validity_current_ta_minutes = 30
        /\ validity_new_ta_minutes = 1440               \* 24h
        /\ cron_repetition_minutes \in 1410..1440       \* 23.5h to 24h
        /\ ta_in_use = currentTa
        /\ process_running = TRUE

This means that when we run the spec, the model checker explores every possible value for cron_repetition_minutes, namely 1410 to 1440 minutes. We could also assign validity_new_ta_minutes in the same way. We just need to make sure the ranges don’t get too broad, because the number of combinations to check multiplies quickly (and with it, the time needed for a full run).
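
The article only shows the changed Init. A sketch of the remaining changes, assuming the value is meant to stay fixed during a run: the name moves from CONSTANTS to VARIABLES, and because TLC requires every step to determine the next value of every variable, each step gets an extra UNCHANGED conjunct (shown here for AbortProcess only).

```tla
VARIABLES step, execution_time_minutes, validity_current_ta_minutes,
          validity_new_ta_minutes, ta_in_use, process_running,
          cron_repetition_minutes
CONSTANTS time_per_step_minutes

AbortProcess == /\ step = "abort"
                /\ process_running' = ~process_running
                /\ UNCHANGED time_values
                /\ UNCHANGED cron_repetition_minutes   \* assumed: fixed per run
                /\ ta_in_use' = currentTa
                /\ step' = "stop"
```

The value assigned to cron_repetition_minutes in the model module is then no longer needed.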

Fixing the second problem could mean deleting the current TA if the process is aborted due to an error and the validity timestamp of the current TA is not far enough in the future that the update cron job is guaranteed to run again before then. That way, there would be no invalid TA for verifying certificates. To incorporate that into the spec, we would need to introduce a third state for the ta_in_use variable and another possible step in the process:

AbortAndDelete == /\ step = "abortDelete"
                  /\ validity_current_ta_minutes 
                       < (cron_repetition_minutes + 2 * time_per_step_minutes)
                  /\ process_running' = ~process_running
                  /\ UNCHANGED time_values
                  /\ ta_in_use' = "none"
                  /\ step' = "stop"

This step includes a condition for the value that validity_current_ta_minutes is allowed to have for the process to enter this step. We need to add the reverse condition to AbortProcess and include the new step in our next state relation as well as in the list of steps that are possible after LoadAndValidateTa.
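
Put together, the changes described above might look as follows. This is a sketch based on the description, not necessarily the exact spec used in the project. It assumes that the reverse condition is the negation of the guard in AbortAndDelete.

```tla
LoadAndValidateTa == /\ step = "start"
                     /\ UNCHANGED download_in_progress
                     /\ execution_time_minutes'
                          = execution_time_minutes + time_per_step_minutes
                     /\ validity_current_ta_minutes'
                          = validity_current_ta_minutes - time_per_step_minutes
                     /\ validity_new_ta_minutes'
                          = validity_new_ta_minutes - time_per_step_minutes
                     \* The new step is offered as a third possible successor.
                     /\ step' \in {"update", "abort", "abortDelete"}

AbortProcess == /\ step = "abort"
                \* Reverse of the guard in AbortAndDelete: the current TA
                \* stays valid long enough to be kept.
                /\ validity_current_ta_minutes
                     >= (cron_repetition_minutes + 2 * time_per_step_minutes)
                /\ process_running' = ~process_running
                /\ UNCHANGED time_values
                /\ ta_in_use' = currentTa
                /\ step' = "stop"

Next == LoadAndValidateTa \/ UpdateCurrentTa \/ AbortProcess \/ AbortAndDelete

\* If cron_repetition_minutes is declared as a variable, every step
\* additionally needs the conjunct /\ UNCHANGED cron_repetition_minutes.
```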

In addition, we have to add another valid state to the invariant:

ValidTaInUse == \/ /\ ta_in_use = newTa
                   /\ validity_new_ta_minutes >= cron_repetition_minutes
                \/ /\ ta_in_use = currentTa
                   /\ validity_current_ta_minutes >= cron_repetition_minutes
                \/ /\ ta_in_use = currentTa
                   /\ process_running = TRUE
                   /\ validity_current_ta_minutes 
                        >= (2 * time_per_step_minutes - execution_time_minutes)
                \/ /\ ta_in_use = "none"
                   /\ validity_current_ta_minutes 
                        < (cron_repetition_minutes + 2 * time_per_step_minutes)

If we now run the spec again with these modifications, it still gives us some error paths. But if we look closely, we’ll find that these only relate to certain initial values of the cron_repetition_minutes variable, so we know which frequencies not to choose for our cron job.
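
Which values fail can be worked out from the update path: it takes two steps, so the new TA ends up with 1440 - 2 * time_per_step_minutes minutes of remaining validity, while the first condition of the invariant demands at least cron_repetition_minutes. A sketch of that calculation, assuming time_per_step_minutes = 1 and using the hypothetical name SafeCronValues:

```tla
\* Candidate intervals for which the update path still satisfies ValidTaInUse.
SafeCronValues == { c \in 1410..1440 : 1440 - 2 * time_per_step_minutes >= c }
\* With time_per_step_minutes = 1 this is the set 1410..1438.
```

Under these assumptions, only the intervals of 1439 and 1440 minutes violate the invariant.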

So what we found is that we need another branch of error handling. Now that we verified that it actually fixes our problems, we can adjust our architecture accordingly:

Changing the initial values

We found that adjusting the frequency of our cron job avoids the problem that the new TA might not be valid until the next run. But this approach only takes us so far. What happens if the remaining validity of the new TA is very short, or even zero? Let’s change our initial state so that the validity is either 0 or 1450 minutes:

Init == /\ step = "start"
        /\ execution_time_minutes = 0
        /\ validity_current_ta_minutes = 30
        /\ validity_new_ta_minutes \in {0, 1450}
        /\ ta_in_use = currentTa
        /\ process_running = TRUE

Now, if we run the spec, we get an error if we start with a validity of 0 minutes. We can’t run the update job that often, so instead we’ll handle it the same way as we did above when the remaining validity of the current TA is too low: we’ll delete the existing TA so that we don’t end up with an invalid one. To incorporate this into the spec, we add a condition to the update step, so that it can only be executed if the new TA is valid long enough:

UpdateCurrentTa == /\ step = "update"
                   /\ validity_new_ta_minutes 
                        >= (cron_repetition_minutes + 2 * time_per_step_minutes)
                   /\ process_running' = ~process_running
                   /\ validity_current_ta_minutes' 
                        = validity_current_ta_minutes - time_per_step_minutes
                   /\ execution_time_minutes' 
                        = execution_time_minutes + time_per_step_minutes
                   /\ validity_new_ta_minutes' 
                        = validity_new_ta_minutes - time_per_step_minutes
                   /\ ta_in_use' = newTa
                   /\ step' = "stop"
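
The guard alone only prevents the update; the article does not show the step that deletes the TA in this case. A possible counterpart, modeled on AbortAndDelete, could look like this. The name RejectNewTa and the choice to leave the time values unchanged are assumptions.

```tla
\* Hypothetical step: the new TA would expire before the next cron run.
RejectNewTa == /\ step = "update"
               /\ validity_new_ta_minutes
                    < (cron_repetition_minutes + 2 * time_per_step_minutes)
               /\ process_running' = ~process_running
               /\ UNCHANGED time_values
               /\ ta_in_use' = "none"
               /\ step' = "stop"
```

Such a step would have to be added to Next as another disjunct. The invariant shown earlier accepts the state "none" only if the current TA is itself close to expiry, which holds for the initial value of 30 minutes used here.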

This shows that the results of the verification partly depend on the initial values, so some thought should be put into them.

Now this is the final version of the update process, with errors properly handled:

Conclusion

Formal architecture verification can help us find flaws in our software design before we have written a single line of code for the actual implementation. That way, it helps us prevent costly redesigns late in the development process. Requirements may still change later on, for example due to new government regulations or additional features. In that case, the necessary changes to the architecture can be verified against the spec before they are implemented. In both cases, formal verification lets us be more confident in our design. It gives us another layer of verification in addition to the various types of automated tests, as it concerns a different part of the development process. This is especially helpful if the stakes are high, for example if the software in question controls a medical device. More use cases that benefit from the application of formal architecture verification can be found in this previous post.