Machine Learning and Knowledge Extraction (Mach. Learn. Knowl. Extr.), MDPI, ISSN 2504-4990
DOI: 10.3390/make3010010
Article

Property Checking with Interpretable Error Characterization for Recurrent Neural Networks

Franz Mayr *,† (https://orcid.org/0000-0002-1610-7334), Sergio Yovine *,† (https://orcid.org/0000-0002-2737-4382) and Ramiro Visca
Academic Editor: Yoichi Hayashi
Facultad de Ingeniería, Universidad ORT Uruguay, 11100 Montevideo, Uruguay; visca@ort.edu.uy
* Correspondence: mayr@ort.edu.uy (F.M.); yovine@ort.edu.uy (S.Y.)
This paper presents a novel on-the-fly, black-box, property-checking-through-learning approach for verifying requirements of recurrent neural networks (RNN) in the context of sequence classification. Our technique builds on a tool for learning probably approximately correct (PAC) deterministic finite automata (DFA). The sequence classifier inside the black box consists of a Boolean combination of several components, including the RNN under analysis together with the requirements to be checked, possibly modeled as RNN themselves. On the one hand, if the output of the algorithm is an empty DFA, there is a proven upper bound (as a function of the algorithm parameters) on the probability that the language of the black box is nonempty. This implies that the property probably holds on the RNN, with probabilistic guarantees. On the other hand, if the DFA is nonempty, it is certain that the language of the black box is nonempty, which entails that the RNN does not satisfy the requirement. In this case, the output automaton serves as an explicit and interpretable characterization of the error. Our approach does not rely on a specific property specification formalism and is capable of handling nonregular languages as well. Moreover, it neither explicitly builds individual representations of any of the components of the black box nor resorts to any external decision procedure for verification. This paper also improves previous theoretical results regarding the probabilistic guarantees of the underlying learning algorithm.
Keywords: recurrent neural networks; probably approximately correct learning; black-box explainability

1. Introduction
Artificial intelligence (AI) is a flourishing research area with numerous real-life applications. Intelligent software is developed in order to automate processes, classify images, translate text, drive vehicles, make medical diagnoses, and support basic scientific research. The design and development of this kind of system is guided by quality attributes that are not exactly the same as those that drive the construction of a typical software system. Indeed, a salient one is the degree to which a human being (e.g., a physician) can really understand the actual cause of a decision made by an AI system (e.g., the diagnosis of a disease). This attribute is called interpretability [1,2,3].
Undoubtedly, artificial neural networks (ANN) are currently the cutting-edge AI models [4]. However, their inherent nature undermines the human capability of achieving an acceptable comprehension of the reasons for their outputs. A major obstacle to interpreting their behavior is their deep architectures with millions of neurons and connections. Such overwhelming complexity works against interpretability even if the ANN structure used in a particular context is known (e.g., convolutional neural networks in computer vision or recurrent neural networks in language translation) and the mathematical principles on which they are grounded are understood [5].
Thoroughly interpreting the functioning of AI components is a must when they are used in safety- and security-critical domains such as intelligent driving [6,7], intrusion, attack, and malware detection [8,9,10,11,12], human activity recognition [13], medical records analysis [14,15], and DNA promoter region recognition [16], which involve using deep recurrent neural networks (RNN) for modeling the behavior of the controlled, monitored, or analyzed systems or data. Moreover, it is paramount to verify their outputs with respect to the requirements they must fulfill to correctly perform the task they have been trained for. Whenever a network outcome does not satisfy a required property, it is necessary to adequately characterize and interpret the misbehavior in order to properly correct the fault, which may involve redesigning and retraining the network. Indeed, when it comes to interpreting the error of an RNN with respect to a given requirement, typically expressed as a property over sequences (i.e., a language in a formal sense), it is useful to do so through an operational and visual characterization, as a means of gaining insight into the set of incorrect RNN outputs (e.g., wrong classification of a human DNA region as a promoter) in reasonable time.
One way of checking language properties of RNN devoted to sequence classification consists in extracting an automaton, such as a deterministic finite automaton (DFA), from the network and resorting to automata-theoretic tools to perform the verification task on the extracted automaton. That is, once the automaton is obtained, it can be model-checked against a desired property using an appropriate model-checker [17]. This approach can be implemented by resorting to white-box learning algorithms such as the ones proposed in [18,19,20]. However, RNN are more expressive than DFA [21]. Therefore, the language of the learned automaton is, in general, an approximation of the set of sequences classified as positive by the RNN. The cited procedures do not provide quantitative assessments of how precisely the extracted DFA characterizes the actual language of the RNN. Nonetheless, this issue is overcome by the black-box learning algorithm proposed in [22], which learns DFA that are probably approximately correct (PAC) [23] approximations of the RNN. This means that the error between the outputs of the analyzed RNN and the extracted DFA can be bounded with a given confidence.
When applied in practice, this general approach has several important drawbacks. The first one is state explosion. That is, the DFA learned from the RNN may be too large to be explicitly constructed. Another important inconvenience is that when the model-checker fails to verify the property on the DFA, counterexamples found on the automaton are not necessarily real counterexamples of the RNN. Indeed, since the DFA is an approximation of the RNN, counterexamples found on the former could be false negatives, that is, spurious counterexamples. Last but not least, it has been advocated in [24] that there is also a need for property checking techniques that interact directly with the actual software that implements the network.
To cope with these issues, Reference [25] devised a technique based on the general concept of learning-based black-box checking (BBC) proposed in [26]. BBC is a refinement procedure where DFA are incrementally built by querying a black box. At each iteration, these automata are checked against a requirement by means of a model-checker. The counterexamples, if any, found by the model-checker are validated on the black box. If a false negative is detected, it is used to refine the automaton. A downside of BBC is that it requires (a) fixing a formalism for specifying the requirements, typically linear-time temporal logic, and (b) resorting to an external model-checker to verify the property. Moreover, the black box is assumed to be some kind of finite-state machine.
Instead, the method proposed in [25] performs on-the-fly property checking during the learning phase, without using an external model-checker. Besides, the algorithm handles both the RNN and the property as black boxes and it does not build, assume, or require them to be expressed in any specific way. The approach devised in [25] focuses on checking language inclusion, that is, whether every sequence classified as positive by the RNN belongs to the set of sequences defined by the property. This question can be answered by checking language emptiness: the requirement is satisfied if the intersection of the language of the RNN and the negation of the property is empty; otherwise it is not. Language emptiness is tackled in [25] by learning a probably approximately correct DFA. On the one hand, if the learning algorithm returns an empty DFA, there is a proven upper bound on the probability of the language being nonempty, and therefore of the RNN not satisfying the property. In other words, the property is probably true, with probabilistic guarantees given in terms of the algorithm parameters. On the other hand, if the output is a nonempty DFA, the language is ensured to be nonempty. In this case, the property is certainly false. Besides, the output DFA is an interpretable characterization of the error.
The contribution of this paper is twofold. First, we revise and improve the theoretical results of [25]. We extend the approach to checking not only language inclusion but any verification problem that can be reduced to checking emptiness. Besides, we provide stronger results regarding the probabilistic guarantees of the procedure. Second, we apply the method to other use cases, including checking context-free properties and equivalence between RNN.
The structure of the paper is the following. Section 2 reviews probably approximately correct learning. Section 3 introduces on-the-fly black-box property checking through learning. Section 4 revisits the framework proposed in [25] and shows the main theoretical results. These include improvements with respect to the previously known probabilistic guarantees of the underlying learning algorithm. Section 5 describes the experimental results obtained in a number of use cases from different application areas. Section 6 discusses related work. Section 7 presents the conclusions.
2. Probably Approximately Correct Learning
Let us first give some preliminary definitions. There is a universe of examples, denoted $\mathcal{X}$. Given two subsets of examples $X, X' \subseteq \mathcal{X}$, the difference $X \setminus X'$ is the set of $x \in X$ such that $x \notin X'$, or equivalently, the set $X \cap \overline{X'}$, where $\overline{X} \subseteq \mathcal{X}$ is the complement of $X$. Their symmetric difference, denoted $X \oplus X'$, is defined as $(X \setminus X') \cup (X' \setminus X)$. Examples are assumed to be independently and identically distributed (i.i.d.) according to an unknown probability distribution $\mathcal{D}$ over $\mathcal{X}$.
A concept $C$ is a subset of $\mathcal{X}$. A concept class $\mathcal{C}$ is a set of concepts. Given an unknown concept $C \in \mathcal{C}$, the purpose of a learning algorithm is to output a hypothesis $H \in \mathcal{H}$ that approximates $C$, where $\mathcal{H}$, called the hypothesis space, is a class of concepts possibly different from $\mathcal{C}$.
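The prediction error $E$ of a hypothesis $H$ with respect to the unknown concept $C$, measured in terms of the probability distribution $\mathcal{D}$, is the probability that an example $x \in \mathcal{X}$, drawn from $\mathcal{D}$, lies in the symmetric difference of $C$ and $H$. Formally:
$$E_{\mathcal{D},C}(H) = \mathbb{P}_{x \sim \mathcal{D}}\left[x \in C \oplus H\right] \tag{1}$$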
An oracle $\mathrm{EX}_{\mathcal{D},C}$ draws i.i.d. examples from $\mathcal{X}$ following $\mathcal{D}$, and labels them according to whether they belong to $C$. An example $x \in \mathcal{X}$ is labeled as positive if $x \in C$; otherwise it is labeled as negative. Repeated calls to $\mathrm{EX}_{\mathcal{D},C}$ are independent of each other.
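A Probably Approximately Correct (PAC) learning algorithm [23,27,28] takes as input an approximation parameter $\epsilon \in (0,1)$, a confidence parameter $\delta \in (0,1)$, a target concept $C \in \mathcal{C}$, an oracle $\mathrm{EX}_{\mathcal{D},C}$, and a hypothesis space $\mathcal{H}$, and, if it terminates, it outputs an $H \in \mathcal{H}$ which satisfies $\mathbb{P}_{x \sim \mathcal{D}}[x \in C \oplus H] \le \epsilon$ with confidence at least $1-\delta$. Formally:
$$\mathbb{P}\left[E_{\mathcal{D},C}(H) > \epsilon\right] < \delta \tag{2}$$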
The output $H$ of a PAC-learning algorithm is said to be an $\epsilon$-approximation of $C$ with confidence at least $1-\delta$, or equivalently, an $(\epsilon,\delta)$-approximation of $C$.
Typically, $\mathrm{EX}_{\mathcal{D},C}$ is composed of a sampling procedure that draws an example $x \sim \mathcal{D}$ and calls a membership query oracle $\mathrm{MQ}_C$ to check whether $x \in C$. Besides $\mathrm{EX}$ and $\mathrm{MQ}$, a PAC-learning algorithm may be equipped with an equivalence query oracle $\mathrm{EQ}_{\mathcal{D},C}$. This oracle takes as input a hypothesis $H$ and a sample size $m$, and answers whether $H$ is an $(\epsilon,\delta)$-approximation of $C$ by drawing a sample $S \subset \mathcal{X}$ of size $m$ using $\mathrm{EX}_{\mathcal{D},C}$, i.e., $S \sim \mathcal{D}^m$, and checking whether for all $x \in S$, $x \in C$ iff $x \in H$, or equivalently, $S \cap (C \oplus H) = \emptyset$.
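As an illustration of the sampling-based equivalence query described above, the following minimal sketch draws m examples and looks for one in the symmetric difference of the target and the hypothesis. The function names, the uniform sampler, and the use of Python predicates to stand for MQ and for the hypothesis are assumptions made for this example only; they are not part of the tool used in the paper.

```python
import random
from typing import Callable, Optional, Tuple

Word = str


def make_sampler(alphabet: str, max_len: int, rng: random.Random) -> Callable[[], Word]:
    """Hypothetical stand-in for the sampling procedure x ~ D:
    pick a length uniformly in [0, max_len], then letters uniformly."""
    def draw() -> Word:
        length = rng.randint(0, max_len)
        return "".join(rng.choice(alphabet) for _ in range(length))
    return draw


def equivalence_query(
    in_target: Callable[[Word], bool],      # plays the role of MQ_C
    in_hypothesis: Callable[[Word], bool],  # membership test for H
    draw_example: Callable[[], Word],       # sampling procedure x ~ D
    m: int,                                 # sample size
) -> Tuple[bool, Optional[Word]]:
    """Return (True, None) if no sampled example lies in C (+) H,
    otherwise (False, x) for the first sampled x on which C and H disagree."""
    for _ in range(m):
        x = draw_example()
        if in_target(x) != in_hypothesis(x):
            return False, x
    return True, None
```

A failed query returns the disagreeing example, which is what a learner would use as a counterexample to refine its hypothesis.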
We revisit here some useful results from [25].
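Lemma 1. Let $C \in \mathcal{C}$ and $H \in \mathcal{H}$ such that $H$ is an $(\epsilon,\delta)$-approximation of $C$. For any subset $X \subseteq C \oplus H$, we have that $\mathbb{P}_{x \sim \mathcal{D}}[x \in X] \le \epsilon$ with confidence $1-\delta$.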
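Proof. For any subset $X \subseteq C \oplus H$, it holds that $\mathbb{P}_{x \sim \mathcal{D}}[x \in X] \le \mathbb{P}_{x \sim \mathcal{D}}[x \in C \oplus H]$. It follows that $\mathbb{P}_{x \sim \mathcal{D}}[x \in C \oplus H] \le \epsilon$ implies $\mathbb{P}_{x \sim \mathcal{D}}[x \in X] \le \epsilon$. Now, for any $S \subseteq \mathcal{X}$ satisfying $S \cap (C \oplus H) = \emptyset$, we have that $S \cap X = \emptyset$. Hence, any sample $S \sim \mathcal{D}^m$ drawn by $\mathrm{EQ}_{\mathcal{D},C}$ that ensures $\mathbb{P}_{x \sim \mathcal{D}}[x \in C \oplus H] \le \epsilon$ with confidence $1-\delta$ also guarantees $\mathbb{P}_{x \sim \mathcal{D}}[x \in X] \le \epsilon$ with confidence $1-\delta$. ☐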
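Proposition 1. Let $C \in \mathcal{C}$ and $H \in \mathcal{H}$ such that $H$ is an $(\epsilon,\delta)$-approximation of $C$. For any $X \subseteq \mathcal{X}$:
$$\mathbb{P}_{x \sim \mathcal{D}}\left[x \in C \cap \overline{H} \cap X\right] \le \epsilon \tag{3}$$
$$\mathbb{P}_{x \sim \mathcal{D}}\left[x \in \overline{C} \cap H \cap X\right] \le \epsilon \tag{4}$$
with confidence at least $1-\delta$.
Proof. From Lemma 1, because $C \cap \overline{H} \cap X$ and $\overline{C} \cap H \cap X$ are subsets of $C \oplus H$. ☐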
Given an unknown concept $C \in \mathcal{C}$ and a known property $P \in \mathcal{H}$ to be checked on $C$, we want to answer whether $C \subseteq P$ holds, or equivalently, whether $C \cap \overline{P} = \emptyset$. One way of doing it in a black-box setting consists in resorting to a model-checking approach. That is, first learn a hypothesis $H \in \mathcal{H}$ of $C$ with a PAC-learning algorithm and then check whether $H$ satisfies property $P$. We call this approach post-learning verification. In order for it to be feasible, there must be an effective procedure for checking $H \cap \overline{P} = \emptyset$.
Assume an algorithm for checking emptiness exists. Proposition 2 from [25] proves that, whatever the outcome of the decision procedure for $H \cap \overline{P}$, the probability of the same result not being true for $C$ is smaller than $\epsilon$, with confidence at least $1-\delta$.
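Proposition 2. Let $C \in \mathcal{C}$ and $H \in \mathcal{H}$ such that $H$ is an $(\epsilon,\delta)$-approximation of $C$. For any $\overline{P} \in \mathcal{H}$:
if $H \cap \overline{P} = \emptyset$ then $\mathbb{P}_{x \sim \mathcal{D}}[x \in C \cap \overline{P}] \le \epsilon$, and
if $H \cap \overline{P} \neq \emptyset$ then $\mathbb{P}_{x \sim \mathcal{D}}[x \in \overline{C} \cap H \cap \overline{P}] \le \epsilon$,
with confidence at least $1-\delta$.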
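Proof. If $H \cap \overline{P} = \emptyset$ then $\overline{P} = \overline{H} \cap \overline{P}$. Thus, $C \cap \overline{P} = C \cap \overline{H} \cap \overline{P}$ and from Proposition 1(3) it follows that $\mathbb{P}_{x \sim \mathcal{D}}[x \in C \cap \overline{H} \cap \overline{P}] \le \epsilon$, with confidence at least $1-\delta$.
If $H \cap \overline{P} \neq \emptyset$, from Proposition 1(4) we have that $\mathbb{P}_{x \sim \mathcal{D}}[x \in \overline{C} \cap H \cap \overline{P}] \le \epsilon$, with confidence at least $1-\delta$. ☐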
When applied in practice, an important inconvenience of this approach is that whenever the model-checker finds that $P$ does not hold on $H$, the counterexamples found on $H$ may not be counterexamples in $C$, even if this happens with small probability. Therefore, whenever that happens, we would need to resort to $\mathrm{EX}$ to draw examples from $H \cap \overline{P}$ and call $\mathrm{MQ}$ to figure out whether they belong to $C$, in order to try to find a concrete counterexample in $C$.
From a computational perspective, in particular in the application scenario of verifying RNN, we should be aware that the learned hypothesis could be too large and that the running time of the learning algorithm adds up to the running time of the model-checker, thus making the overall procedure impractical.
Last but not least, this approach can only be applied for checking properties for which there exists a model-checking procedure in $\mathcal{H}$. In our context, this prevents verifying nonregular properties.
3.2. On-the-Fly Property Checking through Learning
To overcome the aforementioned issues, rather than learning an $(\epsilon,\delta)$-approximation of $C$, Ref. [25] proposed to use the PAC-learning algorithm to learn an $(\epsilon,\delta)$-approximation of $C \cap \overline{P} \in \mathcal{C}$. This approach is called on-the-fly property checking through learning.
Indeed, this idea can be extended to cope with any verification problem which can be expressed as checking the emptiness of some concept $\Psi(C) \in \mathcal{C}$, which in the simplest case is $C \cap \overline{P}$. In such a context, we have the following, more general, result.
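Proposition 3. Let $C \in \mathcal{C}$, $\Psi(C) \in \mathcal{C}$ and $H \in \mathcal{H}$ such that $H$ is an $(\epsilon,\delta)$-approximation of $\Psi(C)$. Then:
if $H = \emptyset$ then $\mathbb{P}_{x \sim \mathcal{D}}[x \in \Psi(C)] \le \epsilon$, and
if $H \neq \emptyset$ then $\mathbb{P}_{x \sim \mathcal{D}}[x \in H \setminus \Psi(C)] \le \epsilon$,
with confidence at least $1-\delta$.
Proof. Straightforward since $\mathbb{P}_{x \sim \mathcal{D}}[x \in \Psi(C) \oplus H] \le \epsilon$, with confidence at least $1-\delta$, by the fact that $H$ is an $(\epsilon,\delta)$-approximation of $\Psi(C)$. ☐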
Proposition 3 proves that checking properties during the learning phase yields the same theoretical probabilistic assurance as doing it afterwards on the learned model of the target concept $C$. Nevertheless, from a practical point of view, on-the-fly property checking through learning has several interesting advantages over post-learning verification. First, no model of $C$ is ever explicitly built, which may result in a lower computational effort, both in terms of running time and memory. Therefore, this approach could be used in cases where it is computationally too expensive to construct a hypothesis for $C$. Second, there is no need to resort to external model-checkers. The approach may even be applied in contexts where such algorithms do not exist. Indeed, in contrast to post-learning verification, an interesting fact about on-the-fly checking is that, when the PAC-learning algorithm outputs a nonempty hypothesis, it may actually happen that the oracle $\mathrm{EX}$ draws an example belonging to $\Psi(C)$ at some point during the execution, which constitutes concrete, real evidence that $\Psi(C)$ is certainly not empty.
4. On-the-Fly Property-Checking for RNN
In this section we further develop the general principle of on-the-fly property checking in the context of RNN. More precisely, the universe $\mathcal{X}$ is the set of words $\Sigma^*$ over a set of symbols $\Sigma$, the target concept inside the black box is a language $C \subseteq \Sigma^*$ implemented as an RNN, and the hypothesis class $\mathcal{H}$ is the set of regular languages or, equivalently, of deterministic finite automata (DFA).
4.1. Bounded-$L^*$: An Algorithm for Learning DFA from RNN
DFA can be learned with $L^*$ [29], an iterative algorithm that incrementally constructs a DFA by calling oracles $\mathrm{MQ}$ and $\mathrm{EQ}$. PAC-based $L^*$ satisfies the following property.
Property 1 (From [29]). (1) If $L^*$ terminates, it outputs an $(\epsilon,\delta)$-approximation of the target language. (2) $L^*$ always terminates if the target language is regular.
$L^*$ may not terminate when used to learn DFA approximations of RNN because, in general, the latter are strictly more expressive than the former [21,30,31]. That is, there exists an RNN $C$ for which there is no DFA $A$ with the same language. Therefore, it may happen that at every iteration $i$ of the algorithm, the call to $\mathrm{EQ}$ for the $i$-th hypothesis $A_i$ fails, i.e., $S_i \cap (A_i \oplus C) \neq \emptyset$, where $S_i \sim \mathcal{D}^m$ is the sample set drawn by $\mathrm{EQ}$. Hence, $L^*$ will never terminate for $C$.
To cope with this issue, Bounded-$L^*$ has been proposed in [22]. It bounds the number of iterations of $L^*$ by constraining the maximum number of states of the automaton to be learned and the maximum length of the words used when calling $\mathrm{EX}$, which are typically used as parameters to determine the complexity of a PAC-learning algorithm [32]. For the sake of simplicity, we only consider here the bound $n$ imposed on the number of states. This version of Bounded-$L^*$ is shown in Algorithm 1.
Algorithm 1: Bounded-$L^*$
Bounded-$L^*$ works as follows. Similarly to $L^*$, the learner builds a table of observations, denoted $OT$, by interacting with the teacher. This table is used to keep track of which words are and are not accepted by the target language. $OT$ is built iteratively by asking the teacher membership queries through $\mathrm{MQ}$. $OT$ is a finite matrix $\Sigma^* \times \Sigma^* \to \{0,1\}$. Its rows are split in two. The “upper” rows represent a prefix-closed set of words and the “lower” rows correspond to the concatenation of the words in the upper part with every $\sigma \in \Sigma$. Columns represent a suffix-closed set of words. Each cell represents the membership relationship, that is, $OT[u][v] = \mathrm{MQ}(uv)$. We denote by $\lambda \in \Sigma^*$ the empty word and by $OT_i$ the value of the observation table at iteration $i$.
The algorithm starts by initializing $OT_0$ (line 1) with a single upper row $OT_0[\lambda]$, a lower row $OT_0[\sigma]$ for every $\sigma \in \Sigma$, and a single column for the empty word $\lambda \in \Sigma^*$, with values $OT_0[u][\lambda] = \mathrm{MQ}(u)$.
At each iteration $i > 0$, the algorithm makes $OT_i$ closed (line 7) and consistent (line 10). $OT_i$ is closed if, for every row in the bottom part of the table, there is an equal row in the top part. $OT_i$ is consistent if for every pair of rows $u, v$ in the top part, for every $\sigma \in \Sigma$, if $OT_i[u] = OT_i[v]$ then $OT_i[u\sigma] = OT_i[v\sigma]$.
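The two conditions on the observation table can be sketched as follows. This is a minimal illustration, not the implementation used in the paper: the table is assumed to be stored as a dictionary mapping each row word to a tuple of membership answers (one per column), and the helper names `fill_table`, `is_closed`, and `is_consistent` are hypothetical.

```python
from typing import Callable, Dict, Sequence, Tuple

Row = Tuple[bool, ...]


def fill_table(upper: Sequence[str], suffixes: Sequence[str], alphabet: str,
               mq: Callable[[str], bool]) -> Dict[str, Row]:
    """Build rows[u] = (MQ(u + v) for every column v) for all upper rows
    and for all lower rows u + a, with a ranging over the alphabet."""
    words = set(upper) | {u + a for u in upper for a in alphabet}
    return {w: tuple(mq(w + v) for v in suffixes) for w in words}


def is_closed(rows: Dict[str, Row], upper: Sequence[str], alphabet: str) -> bool:
    """Closed: every lower row equals some upper row."""
    upper_rows = {rows[u] for u in upper}
    return all(rows[u + a] in upper_rows for u in upper for a in alphabet)


def is_consistent(rows: Dict[str, Row], upper: Sequence[str], alphabet: str) -> bool:
    """Consistent: upper rows that are equal stay equal after appending any symbol."""
    for u in upper:
        for v in upper:
            if rows[u] == rows[v] and any(rows[u + a] != rows[v + a] for a in alphabet):
                return False
    return True
```

For instance, with a target accepting words with an even number of `a`, the initial table with the single upper row for the empty word is not closed, because the row of `a` has no equal row in the upper part.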
Once the table is closed and consistent, the algorithm proceeds to build the conjectured DFA $A_i$ (line 13), whose accepting states correspond to the entries of $OT_i$ such that $OT_i[u][\lambda] = 1$.
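The construction of the conjecture from a closed and consistent table can be sketched as below, under the same hypothetical dictionary representation of the table (row word mapped to a tuple of answers, with the empty word as first column). States are the distinct upper rows, and a state is accepting when its entry in the empty-word column is positive.

```python
from typing import Dict, Sequence, Set, Tuple

Row = Tuple[bool, ...]
DFA = Tuple[int, Set[int], Dict[Tuple[int, str], int]]


def build_dfa(rows: Dict[str, Row], upper: Sequence[str], alphabet: str,
              suffixes: Sequence[str]) -> DFA:
    """Build (initial state, accepting states, transition map) from a table
    that is assumed to be closed and consistent, with "" among the upper rows."""
    if suffixes[0] != "":
        raise ValueError("the first column must be the empty word")
    state_of: Dict[Row, int] = {}
    for u in upper:
        state_of.setdefault(rows[u], len(state_of))  # one state per distinct row
    initial = state_of[rows[""]]
    accepting = {q for row, q in state_of.items() if row[0]}
    delta = {(state_of[rows[u]], a): state_of[rows[u + a]]
             for u in upper for a in alphabet}
    return initial, accepting, delta


def accepts(dfa: DFA, word: str) -> bool:
    """Run the DFA on a word."""
    state, accepting, delta = dfa
    for symbol in word:
        state = delta[(state, symbol)]
    return state in accepting
```

With this representation, the conjectured language is empty exactly when the set of accepting states is empty, which is the situation Section 4.3 relies on.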
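Then, Bounded-$L^*$ calls $\mathrm{EQ}$ (line 14) to check whether $A_i$ is PAC-equivalent to the target language. To do this, $\mathrm{EQ}$ draws a sample $S_i \sim \mathcal{D}^{\mu_i}$ of size $\mu_i$ defined as follows [29]:
$$\mu_i = \frac{1}{\epsilon}\left(i \ln 2 - \ln \delta\right) \tag{5}$$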
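To give a feeling for the magnitudes involved, the sample size above can be computed as in the following sketch. Rounding up to the next integer is an assumption of this example (the formula itself is real-valued), and the function name is hypothetical.

```python
import math


def eq_sample_size(i: int, eps: float, delta: float) -> int:
    """Size of the i-th EQ sample, (i ln 2 - ln delta) / eps, rounded up."""
    if i < 1:
        raise ValueError("iterations are numbered from 1")
    if not (0.0 < eps < 1.0 and 0.0 < delta < 1.0):
        raise ValueError("eps and delta must lie in (0, 1)")
    return math.ceil((i * math.log(2) - math.log(delta)) / eps)
```

The size grows linearly with the iteration number and with 1/ϵ, and only logarithmically with 1/δ, so tightening the confidence is much cheaper than tightening the approximation parameter.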
If $S_i \cap (A_i \oplus C) = \emptyset$, the equivalence test $\mathrm{EQ}$ succeeds and Bounded-$L^*$ terminates, producing the output DFA $A_i$. Clearly, in this case, we conclude that $A_i$ is an $(\epsilon,\delta)$-approximation of the black box.
Corollary 1. For any $C \subseteq \Sigma^*$, if Bounded-$L^*(n,\epsilon,\delta)$ terminates with a DFA $A$ which passes $\mathrm{EQ}$, then $A$ is an $(\epsilon,\delta)$-approximation of $C$.
Proof. Straightforward from Property 1(1). ☐
If $A_i$ and $C$ are not equivalent according to $\mathrm{EQ}$, a counterexample is produced. If $\mathrm{States}(A_i) < n$, the algorithm uses this counterexample to update the observation table $OT$ (line 16) and continues. Otherwise, Bounded-$L^*$ returns $A_i$ together with the counterexample.
4.2. Analysis of the Approximation Error of Bounded-$L^*$
Upon termination, Bounded-$L^*$ may output an automaton $A$ which fails to pass $\mathrm{EQ}$. In such cases, $A$ and the target language disagree on $k > 0$ sequences of the sample $S$ drawn by $\mathrm{EQ}$. Therefore, it is important to analyze in detail the approximation error incurred by Bounded-$L^*$ in such a case. In order to do so, let us start by giving the following definition:
$$\phi_i(k) = (\mu_i - k)^{-1}\left(\mu_i\,\epsilon + \ln\binom{\mu_i}{k}\right) \tag{6}$$
for all $i \in \mathbb{N}$, $i \ge 1$. Notice that for all $k \in [0, \mu_i)$, $\phi_i(k) \ge \epsilon$, and $\phi_i(0) = \epsilon$.
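Theorem 1. For any target concept $C$, if Bounded-$L^*(\ell,n,\epsilon,\delta)$ returns a DFA $A$ with $k \in \mathbb{N}$ $\mathrm{EQ}$-divergences, such that $\tilde{\epsilon}(k) \in (0,1)$, then $A$ is an $(\tilde{\epsilon}(k),\delta)$-approximation of $C$, where
$$\tilde{\epsilon}(k) = \max\{\phi_i(k) \mid 1 \le i \le n\} \tag{7}$$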
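Proof. Let $K(S_i) = |S_i \cap (A \oplus C)|$ for $S_i \sim \mathcal{D}^{\mu_i}$. Using the same arguments as [29], we have that:
$$\mathbb{P}\left[E_{\mathcal{D},C}(A) > \tilde{\epsilon}(k)\right] \le \sum_{i=1}^{n} \mathbb{P}_{S_i \sim \mathcal{D}^{\mu_i}}\left[K(S_i) = k \,;\, E_{\mathcal{D},C}(A) > \tilde{\epsilon}(k)\right] \tag{8}$$
Now, for every $1 \le i \le n$:
$$\mathbb{P}_{S_i \sim \mathcal{D}^{\mu_i}}\left[K(S_i) = k \,;\, E_{\mathcal{D},C}(A) > \tilde{\epsilon}(k)\right] = \binom{\mu_i}{k}\left(1 - E_{\mathcal{D},C}(A)\right)^{\mu_i - k} E_{\mathcal{D},C}(A)^{k} < \binom{\mu_i}{k}\left(1 - \tilde{\epsilon}(k)\right)^{\mu_i - k} \tag{9}$$
Using the inequality $1 - u < e^{-u}$, it follows that:
$$\mathbb{P}_{S_i \sim \mathcal{D}^{\mu_i}}\left[K(S_i) = k \,;\, E_{\mathcal{D},C}(A) > \tilde{\epsilon}(k)\right] < \binom{\mu_i}{k}\, e^{-\tilde{\epsilon}(k)(\mu_i - k)} \tag{10}$$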
Therefore, by Equations (6) and (7):
$$\phi_i(k) = (\mu_i - k)^{-1}\left(\mu_i\,\epsilon + \ln\binom{\mu_i}{k}\right) \le \tilde{\epsilon}(k)$$
By definition of $\mu_i$ (Equation (5)), this entails:
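$$-\tilde{\epsilon}(k)(\mu_i - k) + \ln\binom{\mu_i}{k} \le -\mu_i\,\epsilon \le -i \ln 2 + \ln \delta$$
Then,
$$\binom{\mu_i}{k}\, e^{-\tilde{\epsilon}(k)(\mu_i - k)} \le 2^{-i}\delta$$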
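Thus, from Equations (8)–(10), it follows that:
$$\mathbb{P}\left[E_{\mathcal{D},C}(A) > \tilde{\epsilon}(k)\right] < \sum_{i=1}^{n} 2^{-i}\delta < \delta$$
Hence, $A$ is an $(\tilde{\epsilon}(k),\delta)$-approximation of $C$. ☐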
It is important to notice that this result improves the kind of “forensics” analysis developed in [22], which concentrates on studying the approximation error of the actual DFA returned by Bounded-$L^*$ on a particular run, rather than on any outcome of the algorithm, as stated by Theorem 1.
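The degraded approximation parameter of Theorem 1 can be evaluated numerically as in the sketch below. It assumes that the EQ sample size is rounded up to an integer (an assumption of this example, so that the binomial coefficient is defined), and the function names are hypothetical.

```python
import math


def eq_sample_size(i: int, eps: float, delta: float) -> int:
    """i-th EQ sample size, (i ln 2 - ln delta) / eps, rounded up (assumption)."""
    return math.ceil((i * math.log(2) - math.log(delta)) / eps)


def phi(i: int, k: int, eps: float, delta: float) -> float:
    """phi_i(k) = (mu_i * eps + ln C(mu_i, k)) / (mu_i - k), for 0 <= k < mu_i."""
    mu = eq_sample_size(i, eps, delta)
    if not 0 <= k < mu:
        raise ValueError("k must satisfy 0 <= k < mu_i")
    return (mu * eps + math.log(math.comb(mu, k))) / (mu - k)


def eps_tilde(k: int, n: int, eps: float, delta: float) -> float:
    """Maximum of phi_i(k) over iterations 1..n; Theorem 1 needs it in (0, 1)."""
    return max(phi(i, k, eps, delta) for i in range(1, n + 1))
```

With no divergences the bound coincides with ϵ, and it grows with the number k of divergences observed in the last EQ sample, which quantifies how much weaker the guarantee becomes when the returned automaton fails the test.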
4.3. Characterization of the Error Incurred by the RNN
Let us recall that the black-box checking problem consists in verifying whether $\Psi(C) = \emptyset$. Solving this task with on-the-fly checking through learning, using Bounded-$L^*$ as the learning algorithm, yields a DFA which is a PAC-approximation of $\Psi(C)$. Indeed, the output DFA serves to characterize the possible wrong classifications made by the RNN $C$ in an operational and visual formalism. As a matter of fact, Bounded-$L^*$ ensures that whenever the returned regular language is nonempty, the language in the black box is also nonempty. This result is proven below.
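Proposition 4. For any $C \subseteq \Sigma^*$ and $i > 1$, if Bounded-$L^*(n,\epsilon,\delta)$ builds an automaton $A_i \neq \emptyset$ at iteration $i$, then $C \neq \emptyset$.
Proof. Suppose $A_i \neq \emptyset$. Then, $A_i$ has at least one accepting state. By construction, $\exists u \in \Sigma^*$ such that $OT_i[u][\lambda] = 1$. For this to be true, a positive membership query for $u$ must have occurred at some iteration $j \in [1,i]$, that is, $\mathrm{MQ}_j(u) = 1$. Hence, $u \in C$. This proves that $C \neq \emptyset$. ☐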
This result is important because it entails that whenever the output for the target language $C \cap \overline{P}$ is nonempty, $C$ does not satisfy $P$. Moreover, for every entry of the observation table such that $OT[u][v] = 1$, the sequence $uv \in \Sigma^*$ is a counterexample.
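Corollary 2. For any $C, \Psi(C) \subseteq \Sigma^*$, if Bounded-$L^*(n,\epsilon,\delta)$ returns a DFA $A \neq \emptyset$, then $\Psi(C) \neq \emptyset$. Besides, $\forall u, v \in \Sigma^*$, if $OT[u][v] = 1$ then $uv \in \Psi(C)$.
Proof. Straightforward from Proposition 4. ☐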
Indeed, from Proposition 4, it could be argued that Bounded-$L^*$ for $\Psi(C)$ could finish as soon as $OT$ has a positive entry, yielding a witness of $\Psi(C)$ being nonempty. However, stopping Bounded-$L^*$ at this stage would prevent providing a more detailed, explanatory, even if approximate, characterization of the set of misbehaviors.
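The observation that every positive table entry is a concrete counterexample can be illustrated with the following minimal sketch. The table is assumed here to be a dictionary from (row word, column word) pairs to membership answers, and the function name `witnesses` is hypothetical.

```python
from typing import Dict, List, Tuple


def witnesses(table: Dict[Tuple[str, str], bool]) -> List[str]:
    """Collect the words u + v with OT[u][v] = 1, shortest first.
    Each of them was answered positively by a membership query on Psi(C),
    so each is a real counterexample rather than an artifact of the DFA."""
    words = {u + v for (u, v), positive in table.items() if positive}
    return sorted(words, key=lambda w: (len(w), w))
```

Taking the first element of the returned list gives a shortest witness found so far, which can be reported to the user alongside the learned automaton.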
Theorem 1 and Corollary 2 can be combined to show the theoretical guarantees yielded by Bounded-$L^*$ when used for black-box property checking through learning.
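Theorem 2. For any $C, \Psi(C)$, if Bounded-$L^*(\ell,n,\epsilon,\delta)$ returns a DFA $A$ with $k \in \mathbb{N}$ $\mathrm{EQ}$-divergences and $\tilde{\epsilon}(k) \in (0,1)$, then:
(1) $A$ is an $(\tilde{\epsilon}(k),\delta)$-approximation of $\Psi(C)$.
(2) If $A \neq \emptyset$ or $k > 0$, then $\Psi(C) \neq \emptyset$.
Proof. (1) Straightforward from Theorem 1.
(2) By Corollary 2, it follows that $A \neq \emptyset$ implies $\Psi(C) \neq \emptyset$. Let $A = \emptyset$ and $k > 0$. By the fact that $k > 0$, we have that $A \oplus \Psi(C) \neq \emptyset$. Since $A = \emptyset$, it results that $A \oplus \Psi(C) = \emptyset \oplus \Psi(C) = \Psi(C)$. Hence, $\Psi(C) \neq \emptyset$. ☐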
5. Case Studies
In this section we apply the approach presented in the previous sections to a number of case studies. The teacher is given $\Psi(C)$. For instance, in order to verify language inclusion, that is, to check whether the language of the RNN $C$ is included in some given language $P$ (the property), $\Psi(C)$ is $C \cap \overline{P}$. The complement of $P$ is actually never computed, since the algorithm only requires evaluating membership. That is, to answer $\mathrm{MQ}(u)$ on $C \cap \overline{P}$ for a word $u \in \Sigma^*$, the teacher evaluates $P(u)$, complements its output, and evaluates the conjunction with the output of $C(u)$. It is straightforward to generalize this idea to any Boolean combination of $C$ with other concepts $P_1, \ldots, P_r$. Every concept $P_j$ may be any kind of property, even a nonregular language, such as a context-free language, or an RNN.
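The way the teacher answers membership queries on a Boolean combination, without ever building a complement, can be sketched as follows. The classifiers are modeled as plain Python predicates, and all names (`inclusion_teacher`, `boolean_teacher`, and the toy stand-ins in the usage note) are hypothetical; in the experiments the components would be trained RNN or other black boxes.

```python
from typing import Callable

Classifier = Callable[[str], bool]


def inclusion_teacher(rnn: Classifier, prop: Classifier) -> Classifier:
    """Membership oracle for C ∩ complement(P): a word is in the target
    exactly when the RNN accepts it and the property rejects it."""
    def mq(word: str) -> bool:
        return bool(rnn(word)) and not bool(prop(word))
    return mq


def boolean_teacher(combine: Callable[..., bool], *components: Classifier) -> Classifier:
    """Membership oracle for an arbitrary Boolean combination of components:
    each component is queried on the word and the answers are combined."""
    def mq(word: str) -> bool:
        return bool(combine(*(component(word) for component in components)))
    return mq
```

For example, with a toy classifier accepting words with an even number of `a` and a toy property rejecting words that contain `bb`, the oracle returned by `inclusion_teacher` answers positively on `aabb`, which is accepted by the classifier but violates the property.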
We carried out controlled experiments where RNN were trained with sample datasets from diverse sources, such as known automata, context-free grammars, and domain-specific data, as a way of validating the approach. However, it is important to remark that context-free grammars or DFA are artifacts only used with the purpose of controlling the experiments. In real application scenarios, they are not assumed to exist at all. Unless otherwise stated, RNN consisted of a two-layer network starting with a single-cell three-dimensional LSTM layer [33] followed by a two-dimensional dense classification layer with a softmax activation function. The loss function was categorical cross-entropy. They were trained with the Adam optimizer, with a default learning rate of 0.5, using two-phase early stopping, with an 80%/20% random split for train-validation of the corresponding datasets. The performance of trained RNN was measured on test datasets. Symbols of the alphabet were represented using one-hot encoding. We stress the fact that knowledge of the internal structure, training process, or training data (except for the alphabet) is by no means required by our approach. This information is provided in the paper only to describe the performed controlled experiments.
We applied our approach in three kinds of scenarios.
First, we studied RNN trained with sequences generated by context-free grammars (CFG) and checked regular and nonregular properties. In addition, we compared two different RNN trained with sequences from the same language specification, in order to check whether they are actually equivalent. Here, $\Psi$ is a Boolean combination of the RNN under analysis.
Second, we checked regular properties over RNN trained with sequences of models of two different software systems, namely a cruise controller and an e-commerce application. The former deals with the situation where post-learning model-checking finds that the DFA extracted from the RNN does not satisfy the property, but it is not possible to replay the produced counterexample on the RNN. In the latter, we injected canary bad sequences in the training set in order to check that they end up being discovered by on-the-fly black-box checking.
Third, we studied domain-specific datasets, from system security and bioinformatics, where the actual data-generator systems were unknown, and no models of them were available. In one of these case studies the purpose is to analyze the behavior of an RNN trained to identify security anomalies in Hadoop file system (HDFS) logs. The experiment revealed that the RNN could mistakenly classify a log as normal when it is actually abnormal, even though the RNN incurred no false positives on the test dataset during the training phase. The DFA returned by Bounded-$L^*$ served to gain insight into the error. In the last case study, we studied an RNN that classifies promoter DNA sequences as having or not having a TATA-box subsequence. Here, post-learning verification was unfeasible because Bounded-$L^*$ did not terminate in reasonable time when asked to extract a DFA from the RNN. Nevertheless, it successfully checked the desired requirement using on-the-fly black-box checking through learning.
5.1. Context-Free Language Modeling
Parenthesis prediction is a typical problem used to study the capacity of RNN for context-free language modeling [34].
First, we randomly generated 550,000 sequences up to length 20, labeled as positive or negative according to whether they belong or not to the following 3-symbol Dyck-1 CFG with alphabet $\{(, ), c\}$:
$$S \to S\,T \mid T\,S \mid T \qquad T \to (\,T\,) \mid (\,) \qquad T \to c$$
The RNN was trained using a subset of 500,000 samples until achieving 100% accuracy on the remaining validation set of 50,000 sequences. The following properties were checked:
(1) The set of sequences recognized by the RNN $C$ is included in the Dyck-1 grammar above. That is, $\Psi_1(C) = C \cap \overline{S}$. Recall that $\overline{S}$ is not computed, since only membership queries are posed.
(2) The set of sequences recognized by the RNN $C$ is included in the regular property $P = (c)^*$. In this case, $\Psi_2(C) = C \cap \overline{P}$.
(3) The set of sequences recognized by the RNN $C$ is included in the context-free language $Q = (^m\,)^n$ with $m < n$. Here, $\Psi_3(C) = C \cap \overline{Q}$. Again, $\overline{Q}$ is not computed.
Experimental results are shown in Table 1 and Table 2. For each $(\epsilon,\delta)$, five runs were executed. All runs finished with 0-divergence $\mathrm{EQ}$. Execution times are in seconds. The mean sample size refers to the average $\mathrm{EQ}$ test size at the last iteration of each run. The figures show that, on average, the running times exhibited by on-the-fly property checking were typically smaller than those needed just to extract an automaton from the RNN. It is important to remark that cases (1) and (3) fall in undecidable territory, since checking whether a regular language is contained in a context-free language is undecidable [35]. For case (1), our technique could not find a counterexample, thus giving probabilistic guarantees of emptiness, that is, of the RNN correctly modeling the 3-symbol parenthesis language. For cases (2) and (3), PAC DFA of the intersection language are found in all runs, showing that the properties are indeed not satisfied. Besides, counterexamples are generated orders of magnitude faster (on average) than extracting a DFA from the RNN alone.
Second, we randomly generated 550,000 sequences up to length 20, labeled as positive or negative according to whether they belong or not to the following 5-symbol Dyck-2 CFG with alphabet $\{(, ), [, ], c\}$:
$$S \to S\,T \mid T\,S \mid T \qquad T \to (\,T\,) \mid (\,) \qquad T \to [\,T\,] \mid [\,] \qquad T \to c$$
The RNN was trained on 500,000 samples until achieving 99.646% accuracy on the remaining validation set of 50,000 sequences. This RNN was checked against its specification. For each $(\epsilon,\delta)$, five runs were executed, with a timeout of 300 s. Experimental results are shown in Table 3 and Table 4. For each configuration, at least three runs of on-the-fly checking finished before the timeout and one found, as expected, that the property was not satisfied by the RNN, exhibiting a counterexample showing that it did not model the CFG and yielding a PAC DFA of the wrong classifications.
5.2. Checking Equivalence between RNNs
Following Theorem 2, we present a case where it is of interest to check two RNNs against each other. An RNN $N_1$ is trained with data from a given language $L$, and a second RNN $N_2$ is trained with sequences from a language $L'$ contained in $L$. If both networks, when checked against $L$, are found compliant with it, the following question arises: Are the networks equivalent? And, if the answer is negative, can the divergences be modeled? In order to answer those questions, the property to be checked is expressed as a Boolean composition $\Psi(N_1,N_2) = N_1 \equiv N_2$.
To illustrate this use case, an RNN N1 was trained with data from Tomita’s 5th grammar [36] (Figure 1) until it reached 100% accuracy on all data. Similarly, a second network N2, with the same characteristics, was trained until complete overfitting with sequences from a sublanguage (Figure 2).
The architecture of the networks is depicted in Figure 3 (network sketches have been generated using Keras utilities, https://keras.io/api/utils/model_plotting_utils/, accessed on 5 February 2021). For each layer, its type, name (for clarity), and input/output shapes are shown. In all cases, the first component of the shape vector is the batch size and the last component is the number of features. For three-dimensional shapes, the middle element is the length of the sequence. “?” means that the parameter is not statically fixed but dynamically instantiated at the training phase. The initial layer is a two-dimensional dense embedding of the input. This layer is followed by a sequence-to-sequence subnetwork composed of a 64-dimensional LSTM chained to a 30-dimensional dense layer with a ReLU activation function. The network ends with a classification subnetwork composed of a 62-dimensional LSTM connected to a two-dimensional dense layer with a softmax activation function. This architecture has a total of 42,296 coefficients.
Each network has been trained in a single phase with specific parameters summarized in Table 5. This is the reason why batch size and sequence length have not been fixed in Figure 3 and therefore appear as “?”. The training process of both networks used sets of randomly generated sequences labeled as belonging or not to the corresponding target language. These sets have been split in two parts: 80% for the development set and 20% for the test set. The development set has been further partitioned into 67% for train and 33% for validation.
When checking both networks for inclusion in Tomita’s 5th grammar, both of them were found to verify the inclusion, passing PAC tests with ϵ=0.001 and δ=0.0001. However, when the verification goal was to check N1≡N2, the output was different. In that scenario, on-the-fly verification returned a nonempty DFA, showing that the networks are indeed not equivalent. Figure 4 depicts the DFA approximating the language of their disagreement, that is, the symmetric difference N1⊕N2. After further inspection, we found out that N2 does not recognize the empty word λ.
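The composition used in this experiment can be illustrated with a minimal sketch in which the black box answers membership queries for the symmetric difference of two classifiers. The two classifiers below are hypothetical stand-ins, not the trained networks: `tomita5` assumes the commonly stated definition of Tomita's 5th grammar (an even number of 0s and an even number of 1s), and `n2_stand_in` mimics the reported finding by rejecting the empty word.

```python
from itertools import product


def tomita5(word):
    """Stand-in for N1 (assumed definition): even number of 0s and of 1s."""
    return word.count("0") % 2 == 0 and word.count("1") % 2 == 0


def n2_stand_in(word):
    """Stand-in for N2: same language, except that it rejects the empty word."""
    return word != "" and tomita5(word)


def symmetric_difference(c1, c2):
    """Membership oracle for N1 xor N2: accepts exactly the words on which
    the two black-box classifiers disagree."""
    return lambda word: c1(word) != c2(word)


psi = symmetric_difference(tomita5, n2_stand_in)

# Enumerate all binary words up to length 4 and keep the disagreements.
words = ("".join(w) for n in range(5) for w in product("01", repeat=n))
disagreements = [w for w in words if psi(w)]
print(disagreements)
```

With these stand-ins the only disagreement is the empty word. Neither classifier is inspected or converted to an automaton; only their answers to membership queries are combined.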
5.3. An RNN Model of a Cruise Control Software
Here, we analyze an RNN trained with sequences from the model of a cruise controller software [37] depicted in Figure 5. In the figure, only the actions and states modeling the normal operation of the controller are shown. All illegal actions are assumed to go to a nonaccepting sink state. The training dataset contained 200,000 randomly generated sequences labeled as normal or abnormal according to whether or not they correspond to executions of the controller (i.e., whether or not they are recognized by the DFA in Figure 5). All executions have a length of at most 16 actions. The accuracy of the RNN on a test dataset with 16,000 randomly generated sequences was 99.91%.
The requirement P to be checked on the RNN is the following: a break action can occur only if action gasacc has already happened and no other break action has occurred in between. P is modeled by the DFA illustrated in Figure 6.
In this experiment, we compare both approaches, namely our on-the-fly technique vs. post-learning verification.
Every run of on-the-fly verification through learning terminates with perfect EQ tests conjecturing that $C \cap \overline{P}$ is empty. Table 6 shows the metrics obtained in these experiments (running times, EQ sample sizes, and ϵ˜) for different values of the parameters ϵ and δ.
Table 7 shows the metrics for extracting DFA from the RNN. The timeout was set at 200 s. For the first configuration, four out of five runs terminated before the timeout, producing automata that exceeded the maximum number of states. Moreover, three of those were shown to violate the requirement. For the second one, there were three out of five successful extractions, with all automata exceeding the maximum number of states, while for two of them the property did not hold. For the third configuration, all runs hit the timeout. Actually, the RNN under analysis classified all the counterexamples returned by the model checker as negative, that is, they do not belong to its language. In other words, they were false positives. In order to look for true violating sequences, we generated 2 million sequences with EX for each of the automata H for which the property did not hold. Indeed, none of those sequences was accepted simultaneously by both the RNN under analysis and $H \cap \overline{P}$. Therefore, it is not possible to disprove that the RNN is correct with respect to P, as conjectured by on-the-fly black-box checking. It goes without saying that post-learning verification required considerably more computational effort as a consequence of its misleading verdicts.
The cruise controller case study illustrates an important benefit of our approach vs. post-learning verification: every counterexample produced by on-the-fly property checking is a true witness of Ψ(C) being nonempty, whereas a counterexample produced by post-learning verification may be spurious.
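The difference between a true witness and a spurious counterexample can be sketched as follows. The monitor `satisfies_p` is one possible reading of requirement P (the DFA in Figure 6 remains the reference), the action names are taken verbatim from the text, and `classifier` is a hypothetical stand-in for the membership query on the RNN.

```python
def satisfies_p(actions):
    """One reading of P: each 'break' must be preceded by a 'gasacc'
    with no other 'break' in between (assumption; see Figure 6)."""
    armed = False  # True once 'gasacc' has occurred since the last 'break'
    for action in actions:
        if action == "gasacc":
            armed = True
        elif action == "break":
            if not armed:
                return False
            armed = False
    return True


def is_true_witness(actions, classifier):
    """A sequence witnesses nonemptiness of C intersected with the complement
    of P only if the classifier accepts it AND it violates P."""
    return classifier(actions) and not satisfies_p(actions)


# A counterexample taken from an extracted automaton H must still be replayed
# on the RNN: if the RNN rejects it, it is spurious.
candidate = ["break"]
print(is_true_witness(candidate, classifier=lambda seq: False))
```

In the on-the-fly setting this validation is implicit, because every positive membership query is answered by the RNN itself.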
5.4. An RNN Model of an E-Commerce Web Site
In this case study, the target concept is an RNN trained with the purpose of modeling the behavior of a web application for e-commerce. We used a training dataset of 100,000 randomly generated sequences of length smaller than or equal to 16, using a variant of the model in [22,38] to tag the sequences as positive or negative. Purposely, we have modified the model so as to add canary sequences not satisfying the properties to be checked. The RNN achieved 100% accuracy on a test dataset of 16,000 randomly generated sequences. We overfitted to ensure faulty sequences were classified as positive by the RNN. The goal of this experiment was to verify whether on-the-fly black-box checking could successfully unveil whether the RNN learned these misbehaviors.
We analyzed the regular properties shown in Figure 7, where labels aPSC, eSC, and bPSC model the actions (associated with their corresponding buttons) of adding products to the shopping cart, removing all products from the shopping cart, and buying products in the shopping cart, respectively. Requirement P1, depicted in Figure 7a, states that the e-commerce site must not allow a user to buy products in the shopping cart if the shopping cart does not contain any product. Property P2, depicted in Figure 7b, requires the system to prevent the user from performing consecutive clicks on the buy products button.
Table 8 shows the metrics obtained for extracting automata. All runs terminated with an EQ with no divergences. Therefore, the extracted automata were (ϵ,δ)-approximations of the RNN. Although we did not perform post-learning verification, these metrics are helpful to compare the computational performance of both approaches.
For each property Pj, j∈{1,2}, the concept inside the black box is $\Psi_j(C) = C \cap \overline{P_j}$. As shown in Table 9, the on-the-fly method correctly asserted that none of the properties were satisfied. It is worth noticing that all experiments terminated with perfect EQ, i.e., k=0. Therefore, the extracted DFA were (ϵ,δ)-approximations of Ψj(C). The average running time to output an automaton of the language of faulty behaviors is bigger than the running time for extracting an automaton of the RNN alone. Nevertheless, the first witness of Ψj(C) (i.e., the first witness of nonemptiness) was always found by on-the-fly checking in comparable time.
Figure 8 shows an automaton of Ψ1(C) built by the on-the-fly algorithm. For instance, it reveals that the RNN classifies as correct a sequence where the user opens a session (label event os), consults the list of available products (label gAP), and then buys products (bPSC), although the shopping cart is empty: q1;os;q4;gAP;q3;bPSC. Indeed, it provides valuable information about possible causes of the error, which is helpful for understanding and correcting it, since it makes apparent that every time gAP occurred in an open session, the property was violated.
Figure 9 depicts an automaton for Ψ2(C). A sequence showing that P2 is not satisfied is: q1;os;q5;gAP;q4;bPSC;q3;bPSC. Notice that this automaton shows that P1 is violated as well, since state q3 is reachable without any occurrence of aPSC.
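For illustration, the two requirements can be written as simple monitors over action sequences and applied to the witness sequences read off the automata above. This is a sketch under stated assumptions rather than the DFA of Figure 7: the cart is assumed to be empty initially and to be left unchanged by bPSC, and the function names are hypothetical.

```python
def violates_p1(actions):
    """P1 is violated if bPSC (buy) occurs while the cart is empty.
    Assumption: the cart starts empty, aPSC makes it nonempty,
    eSC empties it, and bPSC leaves it unchanged."""
    cart_nonempty = False
    for action in actions:
        if action == "aPSC":
            cart_nonempty = True
        elif action == "eSC":
            cart_nonempty = False
        elif action == "bPSC" and not cart_nonempty:
            return True
    return False


def violates_p2(actions):
    """P2 is violated if two bPSC actions are adjacent in the sequence."""
    return any(a == "bPSC" and b == "bPSC" for a, b in zip(actions, actions[1:]))


# Action sequences extracted from the witness paths quoted in the text.
witness_1 = ["os", "gAP", "bPSC"]
witness_2 = ["os", "gAP", "bPSC", "bPSC"]
print(violates_p1(witness_1), violates_p2(witness_2), violates_p1(witness_2))
```

Under these assumptions the second witness violates both requirements, which matches the observation that the automaton for Ψ2(C) also exposes a violation of P1.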
5.5. An RNN for Classifying Hadoop File System Logs
This experiment concerns the analysis of an RNN trained to find anomalies in logs of an application software based on Hadoop Distributed File System (HDFS). Data used in this case study come from [39]. Logs are sequences of natural numbers ranging from 0 to 28 which correspond to different kinds of logged messages. That is, the set of symbols is Σ={0,…,28}. The training dataset consists of 4856 normal logs of different lengths. We built an autoregressive network that predicts the probability distribution of symbols at each position in the sequence. Symbols are one-hot encoded. The LSTM layer outputs a 128-dimensional vector which is passed to a 29-dimensional dense layer that outputs the probability distribution of the next symbol. That is, for every position t∈[0,T−1], where T is the length of the sequence, the network outputs a vector $v_t \in [0,1]^{29}$, whose i-th position holds the predicted probability $v_t(i) = P[\sigma_t = i \mid \sigma_0 \ldots \sigma_{t-1}]$ of number i being the t-th symbol in the sequence [40]. Figure 10 shows a sketch of the architecture. This network has 84,637 parameters. The activation function of the last layer is a softmax and the loss function is the corresponding categorical cross-entropy. For the sake of readability, we fixed the sequence length in Figure 10. However, in the actual architecture this parameter is not statically defined.
For each log in the training set we obtained all complete subsequences of length T=10 by sliding a window of size 10 from start to end. Overall, there were a total of 56,283 such subsequences, which were split into 80% (36,020 samples) for training and 20% (9006 samples) for validation. A single training phase of five epochs was performed using a learning rate of $10^{-3}$ and a batch size of 30.
In order to build a classifier, the RNN is used to predict the probability of a log. Then, a log is considered to be normal if its predicted probability is above a threshold of $2 \times 10^{-7}$. Otherwise, it is tagged as anomalous. The performance of the classifier was tested on a perfectly balanced subset of 33,600 samples taken from the test dataset of [39]. No false positives were produced by the classifier, which incurred an overall error of 2.65%.
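The decision function can be sketched as follows, assuming that the probability of a log is obtained by the chain rule from the next-symbol distributions (the text does not detail this step). `uniform_model` is a hypothetical stand-in for the trained network, and the computation is carried out in log space to avoid numerical underflow on long logs.

```python
import math

THRESHOLD = 2e-7   # threshold reported in the text
NUM_SYMBOLS = 29   # message types 0..28


def uniform_model(prefix):
    """Hypothetical stand-in for the RNN: returns a uniform next-symbol
    distribution regardless of the prefix."""
    return [1.0 / NUM_SYMBOLS] * NUM_SYMBOLS


def log_probability(log, next_symbol_dist):
    """Chain rule (assumption): log P(log) = sum_t log P(s_t | s_0 ... s_{t-1})."""
    total = 0.0
    for t, symbol in enumerate(log):
        p = next_symbol_dist(log[:t])[symbol]
        if p == 0.0:
            return float("-inf")
        total += math.log(p)
    return total


def classify(log, next_symbol_dist=uniform_model):
    """Label a log as normal if its predicted probability exceeds the threshold."""
    if log_probability(log, next_symbol_dist) > math.log(THRESHOLD):
        return "normal"
    return "anomalous"


print(classify([0, 1, 2, 3]), classify([0, 1, 2, 3, 4]))
```

The classifier C analyzed in this section is this kind of composition: the verification procedure queries `classify` as a black box and never looks at the distributions themselves.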
During an exploratory analysis of the training dataset, we made the following observations. First, there was a subset of numbers, concretely {6, 7, 9, 11–14, 18, 19, 23, 26–28}, that were not present in the normal logs used for training. Let us call this set A for anomalous message types. Second, many logs have a subsequence containing numbers 4 and 21, such that their accumulated count was at most 5, that is, #4+#21≤5. We analyzed the classifier with the purpose of investigating whether the RNN actually learned these patterns as characteristic of normal logs.
Based on those observations, we defined the following properties. The first statement, P1, claims that the classifier always labels as anomalous any log containing a number in A. The second one, P2, says that every log satisfying #4+#21≤5 is classified as normal. As in the e-commerce case study, for each property Pj, j∈{1,2}, the concept inside the black box is $\Psi_j(C) = C \cap \overline{P_j}$, where C is the classifier. It is worth mentioning that C is indeed the composition of an RNN with the decision function that labels logs according to the probability output by the RNN.
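As an illustration, violations of the two properties can be phrased as predicates over a log and a black-box decision function. This is only one possible reading: counts are taken over the whole log, and `is_normal` is a hypothetical stand-in for the classifier C.

```python
# Message types absent from the normal training logs (set A in the text).
A = {6, 7, 9, 11, 12, 13, 14, 18, 19, 23, 26, 27, 28}


def violates_p1(log, is_normal):
    """P1 is violated by a log that contains a number in A
    and is nevertheless classified as normal."""
    return is_normal(log) and any(symbol in A for symbol in log)


def violates_p2(log, is_normal):
    """P2 is violated by a log with #4 + #21 <= 5
    that is not classified as normal."""
    return log.count(4) + log.count(21) <= 5 and not is_normal(log)


# Hypothetical decision functions used only to exercise the predicates.
always_normal = lambda log: True
never_normal = lambda log: False
print(violates_p1([5, 6, 5], always_normal), violates_p2([4, 21, 4], never_normal))
```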
Table 10 shows the results obtained with on-the-fly checking through learning. As in previous experiments, five runs of the algorithm were executed for each configuration. All runs terminated with perfect EQ tests. Hence, all output hypotheses were (ϵ,δ)-approximations of Ψj(C).
On one hand, property P2 is satisfied by C with PAC guarantees. On the other hand, all runs of the algorithm for Ψ1(C) returned a nonempty automaton and a set of logs that violate P1. Therefore, we conclude that C actually classifies as normal some logs containing numbers in A. Figure 11 depicts the automaton obtained for Ψ1(C). It helps to understand the errors of C. For example, it reveals that C labels as normal a log that contains an occurrence of a number in A in its prefix of length 2. This behavior is captured by paths q0q1q2, q0q1q6, and q0q4q2. Indeed, this outcome highlights the importance of verification, since it revealed a clear mismatch with the results observed on the test dataset, where all logs containing numbers in A were labelled as anomalous because C reported no false positives whatsoever.
5.6. An RNN for Recognizing TATA-Boxes in DNA Promoter Sequences
DNA promoter sequences are in charge of controlling gene activation or repression. A TATA-box is a promoter subsequence with the special role of indicating to other molecules the starting place of the transcription. A TATA-box is a subsequence having a length of six base pairs (bp). It is located upstream, close to the gene transcription start site (TSS), from positions −30 bp to −25 bp (TSS is located at +1 bp). It is characterized by the fact that the accumulated number of occurrences of A’s and T’s is larger than that of C’s and G’s.
Recently, RNN-based techniques for recognizing TATA-box promoter regions in DNA sequences have been proposed [16]. Therefore, it is of interest to check whether an RNN classifies as positive sequences having a TATA-box and as negative those not having it. In terms of a formal language, the property can be characterized as the set of sequences u∈{A,T,C,G}* with a subsequence v of length 6 from −30 bp to −25 bp such that #A+#T>#C+#G, where #σ is the number of occurrences of σ∈{A,T,C,G} in v.
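The property can be sketched as a predicate on sequences. The mapping from base-pair positions to string indices is an assumption: for a 50-symbol sequence ending at the TSS, positions are taken to be −49,…,−1,+1 (there is no position 0), so that −30 bp corresponds to index 19. The function name and its parameters are hypothetical.

```python
def has_tata_box(u, start=19, length=6):
    """Check #A + #T > #C + #G on the window v = u[start:start+length].

    Assumption: index k of u corresponds to position (k - 49) bp for k < 49,
    and index 49 is the TSS (+1 bp); hence -30 bp..-25 bp is u[19:25].
    """
    v = u[start:start + length]
    if len(v) < length:
        return False  # the window does not fit in the sequence
    at_count = v.count("A") + v.count("T")
    cg_count = v.count("C") + v.count("G")
    return at_count > cg_count


with_box = "C" * 19 + "TATAAA" + "C" * 25     # A/T-rich window
without_box = "C" * 19 + "TATCGC" + "C" * 25  # tie: 3 vs. 3, not strictly larger
print(has_tata_box(with_box), has_tata_box(without_box))
```

The requirement is a counting condition on a fixed window, so it can be evaluated directly on each queried sequence, which is all the black-box composition needs.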
For that purpose, we trained an RNN until achieving 100% accuracy on the training data, consisting of 16,455 aligned TATA and non-TATA promoter sequences of human DNA extracted from the online database EPDnew (https://epd.epfl.ch/index.php, accessed 5 February 2021). All sequences have a total length of 50 and end at the TSS. Overall, there were 2067 sequences with TATA-boxes and 14,388 sequences without. The LSTM layer had a 128-dimensional output. In this case, training was performed in a single phase with a learning rate of $10^{-3}$ and a batch size of 64. Neither validation nor test sets were used. Figure 12 shows a graphical sketch of the model. The input dimension is given by the batch size, the length of the sequence, and the number of symbols.
Table 11 shows the results obtained only with the on-the-fly approach. Indeed, every attempt to learn a DFA of the RNN C caused Bounded-L* to terminate with a timeout. Therefore, this case study illustrates the case where post-learning verification is not feasible while on-the-fly checking is. It turns out that all executions concluded that the empty language was an (ϵ,δ)-approximation of the black box Ψ(C). Thus, C verifies the requirement with PAC guarantees. It is worth noticing that in the last reported experiment, with ϵ and δ equal to 0.0001, the sample used for checking equivalence was about an order of magnitude bigger than the dataset used for training.
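The growth of the EQ sample with the parameters can be made concrete with a short computation. The sketch below assumes the standard sample-size bound for the i-th sampling-based equivalence query in PAC learning with queries, $r_i = \lceil \frac{1}{\epsilon}(i \ln 2 - \ln \delta) \rceil$; the index i = 2 is an assumption, chosen because it reproduces the mean sample sizes reported in Table 11.

```python
import math


def eq_sample_size(epsilon, delta, i):
    """Number of random samples drawn for the i-th equivalence query,
    assuming r_i = ceil((i * ln 2 - ln delta) / epsilon)."""
    return math.ceil((i * math.log(2) - math.log(delta)) / epsilon)


TRAINING_SET_SIZE = 16455  # TATA-box training data, from the text

for eps, delta in [(0.01, 0.01), (0.001, 0.001), (0.0001, 0.0001)]:
    r = eq_sample_size(eps, delta, i=2)
    print(eps, delta, r, round(r / TRAINING_SET_SIZE, 2))
```

The sample size is inversely proportional to ϵ and only logarithmic in 1/δ, which is why tightening ϵ dominates the cost of each equivalence test.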
6. Related Work
Regular inference on RNN can be considered to be a kind of rule extraction technique [41], where the rules that are extracted are represented by a DFA. Several different approaches for extracting automata out of RNN have been proposed. The method developed in [19,20] resorts to quantizing the hidden values of the network states and to using clustering for grouping them into automata states. The algorithm discussed in [18] combines L* and partition refinement. The equivalence query compares the proposed hypothesis with an abstract representation of the network obtained by refining a partition of its internal states. Those techniques are white-box as they rely on some level of knowledge of the internal structure of the network. They can be applied for post-learning verification but they are not directly usable for on-the-fly black-box property checking. None of them provide provable PAC guarantees on the generated automata.
There are a number of works that perform white-box, compositional, automata-theoretic verification of temporal properties by learning assumptions but require an external decision procedure [42,43,44]. Verification of regular properties of systems modeled as nonregular languages (expressed as automata equipped with FIFO queues) by means of learning DFA is proposed in [45]. However, the algorithm is white-box, it relies on a state-based representation of the FIFO automaton, and it requires being able to compute successor states of words by transitions of the target automata, which is by no means feasible for RNN. Our approach also differs from [46], since this work proposes an iterative technique for regular model checking based on the Trakhtenbrot-Barzdin passive learning algorithm [47], which requires generating complete datasets of positive and negative sequences.
Regarding BBC-based approaches, on-the-fly property checking through learning differs from on-the-fly BBC [26], which consists of a strategy for seeking paths in the automaton of the requirement. In this context, it is worth mentioning test case generation with learning-based testing (LBT) [48]. LBT works by incrementally constructing hypotheses of the system under test (SUT) and model-checking them against a requirement. The counterexamples returned by the external model checker become the test cases. LBT does not rely on PAC-learning and does not provide provable probabilistic guarantees on the hypothesis. This issue has been partially studied in [49], but at the price of relaxing the black-box setting by observing and storing the SUT internal state.
White-box verification and testing of safety properties on feed-forward (FFNN) and convolutional (CNN) neural networks based on Linear Programming (LP) and Satisfiability Modulo Theories (SMT) has been explored in several works, for instance [50,51,52,53]. Reluplex [51] is a problem-specific SMT solver which handles ReLU constraints. The method in [52] exhaustively searches for adversarial misclassifications, propagating the analysis from one layer to the other directly through the source code. Several works have approached the problem of checking robustness, which is a specific property that evaluates ANN resilience to adversarial examples. DeepSafe [54] is a white-box tool for checking robustness based on clustering and constraint solvers. A black-box approach for robustness testing is developed in [55]. Those approaches have been applied for image classification with deep convolutional and dense layers but not for RNN over symbolic sequences.
In the case of RNN, a white-box, post-learning approach for adversarial accuracy verification is presented in [56]. The technique relies on extracting DFA from RNN but does not provide PAC guarantees. Besides, no real-life applications have been analyzed but only RNN trained with sequences of 0s and 1s from academic DFA [36]. In [57], white-box RNN verification is done by generating a series of abstractions. Specifically, the method strongly relies on the internal structure and weights of the RNN to generate an FFNN, which is proven to compute the same output. Then, reachability analysis is performed resorting to LP and SMT. RNSVerify [58] implements white-box verification of safety properties by unrolling the RNN and resorting to LP to solve a system of constraints. The method strongly relies on the internal structure and weight matrices of the RNN. Overall, these techniques are white-box and are not able to handle arbitrary properties over sequences. Moreover, they do not address the problem of producing interpretable characterizations of the errors incurred by the RNN under analysis.
A related but different approach is statistical model checking (SMC) [59,60]. SMC seeks to check whether a stochastic system satisfies a (possibly stochastic) property with a probability beyond some threshold. However, in our context, both the RNN and the property are deterministic. That is, any sequence u∈Σ* either satisfies Ψ(C) or not. Moreover, our technique works by PAC-learning an arbitrary language expressed as a formula Ψ(C), where C is an RNN.
7. Conclusions
This paper explores the problem of checking properties of RNN devoted to sequence classification over symbolic alphabets in a black-box setting. The approach is not restricted to any particular class of RNN or property. Besides, it is on-the-fly because it does not construct a model of the RNN on which the property is verified. The key idea is to express the verification problem on an RNN C as a formula Ψ(C) such that its language is empty if and only if C satisfies the requirement, and to apply a PAC-learning algorithm for learning Ψ(C). On one hand, if the resulting DFA is empty, the algorithm provides PAC guarantees about the language Ψ(C) being itself empty. On the other, if the output DFA is not empty, it provides an actual sequence of C that belongs to Ψ(C). Besides, the DFA itself serves as an approximate characterization of the set of all sequences in Ψ(C). For instance, our method can be used to verify whether an RNN C satisfies a linear-time temporal property P by checking $C \cap \overline{P}$. Since the approach does not require computing the complement, it can also be applied to verify nonregular properties expressed, for instance, as context-free grammars, and to check equivalence between RNN, as illustrated in Section 5.
On-the-fly checking through learning has several advantages with respect to post-learning verification. When the learnt language that approximates Ψ(C) is nonempty, the algorithm provides true evidence of the failure by means of concrete counterexamples. In addition, the algorithm outputs an interpretable characterization of an approximation of the set of incorrect behaviors. Besides, it allows checking properties, with PAC guarantees, for which no decision procedure exists. Moreover, the experimental results on a number of case studies from different application domains provide empirical evidence that the on-the-fly approach typically outperforms post-learning verification if the requirement is probably approximately satisfied.
Last but not least, Theorem 1 provides an upper bound of the error incurred by any DFA returned by Bounded-L*. Hence, this paper also improves the previously known theoretical results regarding the probabilistic guarantees of this learning algorithm [22].
Author Contributions
F.M. and S.Y. contributed equally to the theoretical and experimental results and to the writing. R.V. contributed to prototyping and experimentation. All authors have read and agreed to the published version of the manuscript.
Funding
This research was partially funded by ICT4V—Information and Communication Technologies for Verticals grant number POS_ICT4V_2016_1_15, and ANII—Agencia Nacional de Investigación e Innovación grant numbers FSDA_1_2018_1_154419 and FMV_1_2019_1_155913.
Data Availability Statement
Data sources used in this work were referenced throughout the article.
Conflicts of Interest
The authors declare no conflict of interest.
References
1. Miller, T. Explanation in artificial intelligence: Insights from the social sciences.
2. Biran, O.; Cotton, C.V. Explanation and Justification in Machine Learning: A Survey.
3. Ahmad, M.; Teredesai, A.; Eckert, C. Interpretable Machine Learning in Healthcare.
4. LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning.
5. Ribeiro, M.T.; Singh, S.; Guestrin, C. “Why Should I Trust You?”: Explaining the Predictions of Any Classifier.
6. Scheiner, N.; Appenrodt, N.; Dickmann, J.; Sick, B. Radar-based Road User Classification and Novelty Detection with Recurrent Neural Network Ensembles.
7. Kocić, J.; Jovičić, N.; Drndarević, V. An End-to-End Deep Neural Network for Autonomous Driving Designed for Embedded Automotive Platforms.
8. Kim, J.; Kim, J.; Thu, H.L.T.; Kim, H. Long short term memory recurrent neural network classifier for intrusion detection.
9. Yin, C.; Zhu, Y.; Fei, J.; He, X. A deep learning approach for intrusion detection using recurrent neural networks.
10. Pascanu, R.; Stokes, J.W.; Sanossian, H.; Marinescu, M.; Thomas, A. Malware classification with recurrent networks.
11. Rhode, M.; Burnap, P.; Jones, K. Early Stage Malware Prediction Using Recurrent Neural Networks.
12. Vinayakumar, R.; Alazab, M.; Soman, K.; Poornachandran, P.; Venkatraman, S. Robust Intelligent Malware Detection Using Deep Learning.
13. Singh, D.; Merdivan, E.; Psychoula, I.; Kropf, J.; Hanke, S.; Geist, M.; Holzinger, A. Human activity recognition using recurrent neural networks.
14. Choi, E.; Bahadori, M.T.; Schuetz, A.; Stewart, W.F.; Sun, J. Doctor AI: Predicting Clinical Events via Recurrent Neural Networks.
15. Pham, T.; Tran, T.; Phung, D.; Venkatesh, S. Predicting healthcare trajectories from medical records: A deep learning approach.
16. Oubounyt, M.; Louadi, Z.; Tayara, H.; Chong, K.T. DeePromoter: Robust promoter predictor using deep learning.
17. Clarke, E.M.; Grumberg, O.; Peled, D.A.
18. Weiss, G.; Goldberg, Y.; Yahav, E. Extracting Automata from Recurrent Neural Networks Using Queries and Counterexamples.
19. Wang, Q.; Zhang, K.; Ororbia, A.G., II; Xing, X.; Liu, X.; Giles, C.L. An Empirical Evaluation of Rule Extraction from Recurrent Neural Networks.
20. Wang, Q.; Zhang, K.; Ororbia, A.G., II; Xing, X.; Liu, X.; Giles, C.L. A Comparison of Rule Extraction for Different Recurrent Neural Network Models and Grammatical Complexity.
21. Merrill, W. Sequential neural networks as automata.
22. Mayr, F.; Yovine, S. Regular Inference on Artificial Neural Networks.
23. Valiant, L.G. A Theory of the Learnable.
24. Odena, A.; Olsson, C.; Andersen, D.; Goodfellow, I.J. TensorFuzz: Debugging Neural Networks with Coverage-Guided Fuzzing.
25. Mayr, F.; Visca, R.; Yovine, S. On-the-fly Black-Box Probably Approximately Correct Checking of Recurrent Neural Networks.
26. Peled, D.; Vardi, M.Y.; Yannakakis, M. Black box checking.
27. Angluin, D. Computational Learning Theory: Survey and Selected Bibliography.
28. Ben-David, S.; Shalev-Shwartz, S.
29. Angluin, D. Learning Regular Sets from Queries and Counterexamples.
30. Siegelmann, H.T.; Sontag, E.D. On the Computational Power of Neural Nets.
31. Suzgun, M.; Belinkov, Y.; Shieber, S.M. On Evaluating the Generalization of LSTM Models in Formal Languages.
32. Heinz, J.; de la Higuera, C.; van Zaanen, M. Formal and Empirical Grammatical Inference.
33. Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory.
34. Hao, Y.; Merrill, W.; Angluin, D.; Frank, R.; Amsel, N.; Benz, A.; Mendelsohn, S. Context-free transductions with neural stacks.
35. Hopcroft, J.E.; Motwani, R.; Ullman, J.D. Introduction to automata theory, languages, and computation.
36. Tomita, M. Dynamic Construction of Finite Automata from examples using Hill-climbing.
37. Meinke, K.; Sindhu, M.A. LBTest: A Learning-Based Testing Tool for Reactive Systems.
38. Merten, M. Active Automata Learning for Real Life Applications.
39. Du, M.; Li, F.; Zheng, G.; Srikumar, V. DeepLog: Anomaly Detection and Diagnosis from System Logs through Deep Learning.
40. Bengio, Y.; Ducharme, R.; Vincent, P.; Janvin, C. A Neural Probabilistic Language Model.
41. Craven, M.W. Extracting Comprehensible Models from Trained Neural Networks.
42. Cobleigh, J.M.; Giannakopoulou, D.; Păsăreanu, C.S. Learning assumptions for compositional verification.
43. Alur, R.; Madhusudan, P.; Nam, W. Symbolic compositional verification by learning assumptions.
44. Feng, L.; Han, T.; Kwiatkowska, M.; Parker, D. Learning-based compositional verification for synchronous probabilistic systems.
45. Vardhan, A.; Sen, K.; Viswanathan, M.; Agha, G. Actively learning to verify safety for FIFO automata.
46. Habermehl, P.; Vojnar, T. Regular Model Checking Using Inference of Regular Languages.
47. Trakhtenbrot, B.A.; Barzdin, I.M.
48. Meinke, K. Learning-based testing: Recent progress and future prospects.
49. Meijer, J.; van de Pol, J. Sound black-box checking in the LearnLib.
50. Pulina, L.; Tacchella, A. Challenging SMT solvers to verify neural networks.
51. Katz, G.; Barrett, C.W.; Dill, D.L.; Julian, K.; Kochenderfer, M.J. Reluplex: An Efficient SMT Solver for Verifying Deep Neural Networks.
52. Huang, X.; Kwiatkowska, M.; Wang, S.; Wu, M. Safety Verification of Deep Neural Networks.
53. Ehlers, R. Formal Verification of Piece-Wise Linear Feed-Forward Neural Networks.
54. Gopinath, D.; Katz, G.; Pasareanu, C.S.; Barrett, C.W. DeepSafe: A Data-Driven Approach for Assessing Robustness of Neural Networks.
55. Wicker, M.; Huang, X.; Kwiatkowska, M. Feature-guided black-box safety testing of deep neural networks.
56. Wang, Q.; Zhang, K.; Liu, X.; Giles, C.L. Verification of Recurrent Neural Networks Through Rule Extraction.
57. Kevorchian, A. Verification of Recurrent Neural Networks.
58. Akintunde, M.E.; Kevorchian, A.; Lomuscio, A.; Pirovano, E. Verification of RNN-Based Neural Agent-Environment Systems.
59. Legay, A.; Lukina, A.; Traonouez, L.M.; Yang, J.; Smolka, S.A.; Grosu, R. Statistical Model Checking.
60. Agha, G.; Palmskog, K. A Survey of Statistical Model Checking.
Figures and Tables
Figure 1. DFA recognizing Tomita’s 5th grammar.
Figure 2. DFA recognizing a sublanguage of Tomita’s 5th grammar.
Figure 3. Sketch of the architecture used for Tomita’s 5th grammar and its variant.
Figure 4. DFA approximating N1⊕N2.
Figure 5. Cruise controller: DFA model.
Figure 6. Cruise controller: Property P.
Figure 7. E-commerce system: Automata of the analyzed requirements.
Figure 8. E-commerce system: Automaton for Ψ1(C).
Figure 9. E-commerce system: Automaton for Ψ2(C).
Figure 10. Sketch of the architecture of the language model of Hadoop Distributed File System (HDFS) logs.
Figure 11. Hadoop file system logs: Automaton for Ψ1(C) obtained with ϵ=0.01 and δ=0.01.
Figure 12. Sketch of the architecture of the TATA-box classification network.
Table 1. Dyck 1: Probably approximately correct (PAC) deterministic finite automata (DFA) extraction from recurrent neural networks (RNN).

| ϵ | δ | Min time (s) | Max time (s) | Mean time (s) | Mean sample size |
|---|---|---|---|---|---|
| 0.005 | 0.005 | 1.984 | 7.205 | 3.072 | 1899 |
| 0.0005 | 0.005 | 3.713 | 10.445 | 5.997 | 20,093 |
| 0.00005 | 0.005 | 7.982 | 30.470 | 9.997 | 203,007 |
| 0.00005 | 0.0005 | 8.128 | 36.621 | 9.919 | 249,059 |
| 0.00005 | 0.00005 | 9.625 | 41.884 | 12.185 | 295,111 |
Table 2. Dyck 1: On-the-fly verification of RNN.

| Ψ | ϵ | δ | Min time (s) | Max time (s) | Mean time (s) | First positive MQ | Mean sample size |
|---|---|---|---|---|---|---|---|
| Ψ1 | 0.005 | 0.005 | 0.004 | 0.012 | 0.006 | | 1476 |
| Ψ1 | 0.0005 | 0.005 | 0.051 | 0.125 | 0.067 | | 14,756 |
| Ψ1 | 0.00005 | 0.005 | 0.682 | 0.833 | 0.747 | | 147,556 |
| Ψ1 | 0.00005 | 0.0005 | 1.164 | 1.595 | 1.340 | | 193,607 |
| Ψ1 | 0.00005 | 0.00005 | 1.272 | 1.809 | 1.386 | | 239,659 |
| Ψ2 | 0.005 | 0.005 | 0.031 | 34.525 | 5.762 | 0.099 | 1948 |
| Ψ2 | 0.0005 | 0.005 | 0.397 | 37.846 | 10.245 | 0.084 | 20,370 |
| Ψ2 | 0.00005 | 0.005 | 4.713 | 30.714 | 6.547 | 0.825 | 206,473 |
| Ψ3 | 0.005 | 0.005 | 0.025 | 0.966 | 0.302 | 0.006 | 1899 |
| Ψ3 | 0.0005 | 0.005 | 0.267 | 1.985 | 0.787 | 0.070 | 20,093 |
| Ψ3 | 0.00005 | 0.005 | 4.376 | 6.479 | 4.775 | 0.764 | 203,007 |
Table 3. Dyck 2: PAC DFA extraction from RNN.

| ϵ | δ | Min time (s) | Max time (s) | Mean time (s) | Mean sample size | Mean ϵ˜ |
|---|---|---|---|---|---|---|
| 0.005 | 0.005 | 2.753 | 149.214 | 19.958 | 1795 | 0.00559 |
| 0.0005 | 0.005 | 23.343 | 300.000 | 105.367 | 18,222 | 0.04432 |
| 0.00005 | 0.005 | 42.518 | 139.763 | 77.652 | 186,372 | 0.16248 |
Table 4. Dyck 2: On-the-fly verification of RNN.

| ϵ | δ | Min time (s) | Max time (s) | Mean time (s) | First positive MQ | Mean sample size | Mean ϵ˜ |
|---|---|---|---|---|---|---|---|
| 0.005 | 0.005 | 0.004 | 122.388 | 24.483 | 90.285 | 1504 | 0.00618 |
| 0.0005 | 0.005 | 55.084 | 300.000 | 215.508 | 42.462 | 16,604 | 0.00895 |
| 0.00005 | 0.005 | 0.695 | 324.144 | 158.195 | 4.545 | 166,040 | 0.00005 |
Table 5. Training parameters used for Tomita’s 5th grammar and its variant.

| Network | Dataset size | Batch size | Sequence length | Learning rate |
|---|---|---|---|---|
| N1 | 5K | 30 | 15 | 0.01 |
| N2 | 1M | 100 | 10 | 0.001 |
Table 6. Cruise controller: On-the-fly black-box checking.

| ϵ | δ | Min time (s) | Max time (s) | Mean time (s) | First positive MQ | Mean sample size |
|---|---|---|---|---|---|---|
| 0.01 | 0.01 | 0.003 | 0.006 | 0.004 | | 669 |
| 0.001 | 0.01 | 0.061 | 0.096 | 0.075 | | 6685 |
| 0.0001 | 0.01 | 0.341 | 0.626 | 0.497 | | 66,847 |
Table 7. Cruise controller: Automaton extraction.

| ϵ | δ | Min time (s) | Max time (s) | Mean time (s) | Mean sample size | Mean ϵ˜ |
|---|---|---|---|---|---|---|
| 0.01 | 0.01 | 11.633 | 200.000 | 67.662 | 808 | 0.07329 |
| 0.001 | 0.01 | 52.362 | 200.000 | 135.446 | 8071 | 0.22684 |
| 0.0001 | 0.01 | | | | | |
Table 8. E-commerce: PAC DFA extraction from RNN.

| ϵ | δ | Min time (s) | Max time (s) | Mean time (s) | Mean sample size |
|---|---|---|---|---|---|
| 0.01 | 0.01 | 16.863 | 62.125 | 36.071 | 863 |
| 0.001 | 0.01 | 6.764 | 9.307 | 7.864 | 8487 |
| 0.0001 | 0.01 | 18.586 | 41.137 | 30.556 | 83,482 |
Table 9. E-commerce: On-the-fly verification of RNN.

| Ψ | ϵ | δ | Min time (s) | Max time (s) | Mean time (s) | First positive MQ | Mean sample size |
|---|---|---|---|---|---|---|---|
| Ψ1 | 0.01 | 0.01 | 87.196 | 312.080 | 174.612 | 3.878 | 891 |
| Ψ1 | 0.001 | 0.01 | 0.774 | 203.103 | 102.742 | 0.744 | 9181 |
| Ψ1 | 0.0001 | 0.01 | 105.705 | 273.278 | 190.948 | 2.627 | 94,573 |
| Ψ2 | 0.01 | 0.01 | 0.002 | 487.709 | 148.027 | 80.738 | 752 |
| Ψ2 | 0.001 | 0.01 | 62.457 | 600.000 | 428.400 | 36.606 | 8765 |
| Ψ2 | 0.0001 | 0.01 | 71.542 | 451.934 | 250.195 | 41.798 | 87,641 |
Table 10. Hadoop file system logs: On-the-fly verification.

| Prop | ϵ | δ | Min time (s) | Max time (s) | Mean time (s) | First positive MQ | Mean sample size |
|---|---|---|---|---|---|---|---|
| Ψ1 | 0.01 | 0.01 | 209.409 | 1,121.360 | 555.454 | 5.623 | 932 |
| Ψ1 | 0.001 | 0.001 | 221.397 | 812.764 | 455.660 | 1.321 | 12,037 |
| Ψ2 | 0.01 | 0.01 | 35.131 | 39.762 | 37.226 | | 600 |
| Ψ2 | 0.001 | 0.001 | 252.202 | 257.312 | 254.479 | | 8295 |
Table 11. TATA-box: On-the-fly verification of RNN.

| ϵ | δ | Min time (s) | Max time (s) | Mean time (s) | Mean sample size |
|---|---|---|---|---|---|
| 0.01 | 0.01 | 5.098 | 5.259 | 5.168 | 600 |
| 0.001 | 0.001 | 65.366 | 66.479 | 65.812 | 8295 |
| 0.0001 | 0.0001 | 865.014 | 870.663 | 867.830 | 105,967 |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.