Mathematics (MDPI), ISSN 2227-7390, Article mathematics-09-01010, doi:10.3390/math9091010

Effect of Probability Distribution of the Response Variable in Optimal Experimental Design with Applications in Medicine †

Sergio Pozuelo-Campos *,‡ (https://orcid.org/0000-0002-8662-3385), Víctor Casero-Alonso ‡ (https://orcid.org/0000-0001-8165-5858) and Mariano Amo-Salas ‡ (https://orcid.org/0000-0002-0956-2531)

Academic Editor: Lev Klebanov

Department of Mathematics, University of Castilla-La Mancha, 13071 Ciudad Real, Spain; victormanuel.casero@uclm.es (V.C.-A.); Mariano.Amo@uclm.es (M.A.-S.)

* Correspondence: Sergio.Pozuelo@uclm.es
† This paper is an extended version of a conference paper published in the proceedings of the 35th International Workshop on Statistical Modeling (IWSM), Bilbao, Spain, 19–24 July 2020.
In optimal experimental design theory it is usually assumed that the response variable follows a normal distribution with constant variance. However, some works assume other probability distributions based on additional information or practitioner’s prior experience. The main goal of this paper is to study the effect, in terms of efficiency, when misspecification in the probability distribution of the response variable occurs. The elemental information matrix, which includes information on the probability distribution of the response variable, provides a generalized Fisher information matrix. This study is performed from a practical perspective, comparing a normal distribution with the Poisson or gamma distribution. First, analytical results are obtained, including results for the linear quadratic model, and these are applied to some real illustrative examples. The nonlinear 4-parameter Hill model is next considered to study the influence of misspecification in a dose-response model. This analysis shows the behavior of the efficiency of the designs obtained in the presence of misspecification, by assuming heteroscedastic normal distributions with respect to the D-optimal designs for the gamma, or Poisson, distribution, as the true one.
Keywords: elemental information matrix; gamma distribution; Poisson distribution; D-optimization; misspecification

1. Introduction
To obtain optimal designs, it is common to assume a homoscedastic normal distribution of the response variable and under this assumption there is vast literature focused mainly on nonlinear models. However, there are also papers that use probability distributions different from a normal distribution [1,2,3,4,5,6,7]. At this point, it is important to remember that the probability distribution of the response variable is assumed, on many occasions, from the nature of the experiment to be performed. However, there are usually no prior observations to allow this assumption to be checked.
There are very few available references that set out a general framework for optimal experimental design for any probability distribution of the response variable. Ref. [8] present a method to compute the D-optimal designs for Generalized Linear Models with a binary response allowing uncertainty in the link function, ref. [9] study the Generalized Linear Model from the perspective of optimal experimental design, ref. [10] present the “elemental information matrix” for different probability distributions, and [11] compute optimal designs based on the maximum quasi-likelihood estimator to avoid misspecification in the probability distribution of the response. The aim of this paper is to analyze the effect of misspecification in the probability distribution in optimal design. In other words, it allows those cases to be identified in which it is important to pay special attention to the assumed probability distribution. In this study, apart from theoretical results, real applications involving the linear quadratic model and a dose-response model are considered. For the latter, we focus on the well-known Hill model, widely used to describe dependence between the concentration of a substance and a variety of responses in biochemistry, physiology or pharmacology. From the point of view of optimal experimental design, this model is studied in many papers [12,13,14,15]. Specifically, ref. [13] study the effect of some drugs which inhibit the growth of tumor cells, providing D-optimal designs under the assumption that the response variable follows a heteroscedastic normal distribution with a given structure for the variance.
The article is organized as follows. Section 2 introduces the model used and the theory of optimal experimental design. Section 3 presents the structure of the variance of the heteroscedastic normal distribution and proves a general theoretical result. Section 4 focuses on the linear quadratic model and provides some theoretical results for gamma or Poisson distributions. This section also shows applications of these results to real examples found in the literature. Finally, the 4-parameter Hill model is studied in Section 5. Assuming the heteroscedastic normal distribution, as in [13], an efficiency analysis is performed, considering the Poisson, or gamma, as the true probability distribution. A sensitivity analysis with respect to a parameter of the variance structure is also performed. The paper concludes with a summary and conclusions section.
2. Model and Optimal Experimental Design
The model of interest to the practitioner is expressed in a general way as
$$E[y] = g^{-1}(\eta(x;\theta)), \tag{1}$$
where y is the response variable, following a probability distribution with pdf d(y;ρ), where ρ is the vector of parameters of the assumed distribution, η(x;θ) is the regression function (linear or nonlinear in the parameters), x is the vector of controllable variables and θ the vector of unknown parameters that must be estimated. Lastly, g is the link function relating the regression function to the mathematical expectation of the response. Ref. [16] carry out an in-depth study of the link function and Generalized Linear Models. In line with these authors, this paper considers the canonical link function for the probability distributions involved in the study, as it guarantees that the maximum likelihood estimators of the model parameters, θ^, are sufficient.
An exact design of size n is defined as a set of values of the explanatory variables, x1,…,xn, in which some may be repeated. These values belong to a compact set called design space X, which is usually a subset of $\mathbb{R}^N$. However, the real applications, examples and results in this study consider the one-dimensional case. Assuming that only q of these values are distinct, we may consider the set x1,…,xq and associate with it a probability measure defined by w1,…,wq, where each wi represents the proportion of experiments carried out under the condition xi. This suggests a more general definition of approximate design as a probability measure ξ over the design space X:
$$\xi = \begin{Bmatrix} x_1 & \dots & x_q \\ w_1 & \dots & w_q \end{Bmatrix} \in \Xi, \qquad \sum_{i=1}^{q} w_i = 1,$$
where ξ(xi)=wi and Ξ represents the set of all approximate designs.
The scenario studied in this work is the estimation of a single parameter of the probability distribution of the response, with the rest being fixed. Thus, the elemental information matrix (EIM), introduced by [10], is scalar and is defined as
$$\nu(\eta(x;\theta)) = -E\left[\frac{\partial^2 \log d(y;\eta(x;\theta))}{\partial \eta(x;\theta)^2}\right], \tag{2}$$
which contains information about the probability distribution of the response variable y, given by the pdf d(y;ρ). The relationship between the parameters to estimate, ρ, of the probability distribution and the regression function η(x;θ) is established by the link function, g, shown in (1). Table 1 sets out the canonical link function, the mathematical expectation of the response variable as a function of η(x;θ) and the EIM for the probability distributions used in this paper, some of which are derived in Section 3.
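The single-point information matrix in x∈X is given by
$$I(x;\theta) = -E\left[\frac{\partial^2 \log d(y;\eta(x;\theta))}{\partial\theta_i\,\partial\theta_j}\right] = \nu(\eta(x;\theta))\, f^T(x;\theta)\, f(x;\theta), \qquad \forall\, i,j = 1,\dots,m,$$
where ν(η(x;θ)) is the EIM defined in (2) and
$$f^T(x;\theta) = \frac{\partial \eta(x;\theta)}{\partial \theta}.$$
Finally, the Fisher information matrix (FIM) is defined for the approximate design with probability measure ξ as
$$M(\xi;\theta) = \int_X I(x;\theta)\,\xi(x)\,dx.$$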
The FIM establishes a connection between optimal experimental design and the Generalized Regression Model. The standard form of FIM under the normality hypothesis can be generalized to any probability distribution by including the EIM. By definition, the inverse of the FIM is asymptotically proportional to the variance and covariance matrix of estimators of θ, the parameters of the model. This matrix may depend on these parameters, so nominal values for them are necessary and therefore locally optimal designs can be obtained. By Carathéodory’s theorem, it is known that for any design there is always another with the same information matrix of at most m(m+1)/2+1 different points, where m is the number of unknown parameters to be estimated for the model η(x;θ) [17]. Therefore, it is sufficient to seek designs with finite support.
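As a minimal sketch of the construction above, the following Python snippet assembles the FIM of a finite-support design as a weighted sum of single-point information matrices. The function name `fim`, the quadratic regressors and the constant EIM (the homoscedastic normal case) are illustrative assumptions, not taken from the paper; the vector f(x) is treated as a column vector so that the single-point matrix is m×m.

```python
import numpy as np

def fim(support, weights, f, nu):
    """FIM of a finite-support design: sum_i w_i * nu(x_i) * f(x_i) f(x_i)^T."""
    m = len(f(support[0]))
    M = np.zeros((m, m))
    for x, w in zip(support, weights):
        fx = np.asarray(f(x), dtype=float)
        M += w * nu(x) * np.outer(fx, fx)   # weighted single-point information matrix
    return M

# Hypothetical illustration: quadratic regression on [-1, 1] with EIM = 1
f_quad = lambda x: [1.0, x, x**2]     # gradient of eta with respect to theta
nu_const = lambda x: 1.0              # homoscedastic normal, unit variance (assumed)

M = fim([-1.0, 0.0, 1.0], [1/3, 1/3, 1/3], f_quad, nu_const)
det_M = np.linalg.det(M)
print(M)
print(det_M)
```

Swapping `nu_const` for another EIM (for instance one depending on η(x;θ)) is all that is needed to move from the normal case to a different response distribution in this sketch.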
Optimization criteria are functions of the FIM that allow this matrix to be optimized in different ways. Consider the criterion function Φ as a real, convex, bounded function defined over the space of information matrices $\mathcal{M} = \{M(\xi) : \xi \in \Xi\}$. A design ξ* is then Φ-optimal if $\xi^* = \arg\min_{\xi\in\Xi} \Phi(M(\xi;\theta))$. A number of studies, for example Chapter 10 of [18], give the criteria most commonly used in the literature. This paper uses the D-optimality criterion, whose goal is to minimize the volume of the confidence ellipsoid of θ^, the estimators of θ. This criterion may be expressed by
$$\Phi_D(M(\xi;\theta)) = \log\left|M^{-1}(\xi;\theta)\right|.$$
In practice this criterion is equivalent to maximizing the determinant of the information matrix. The General Equivalence Theorem (see [19]) is a tool that allows optimality of a given design under a specific criterion to be checked. The sensitivity function φ(x;ξ,θ) is defined as a directional derivative
$$\varphi(x;\xi,\theta) = \lim_{\alpha\to 0^+} \frac{\partial}{\partial\alpha}\, \Phi\left[M\big((1-\alpha)\xi + \alpha\bar{\xi}_x;\theta\big)\right],$$
where $\bar{\xi}_x$ is an arbitrary design centered on a point x. Given an optimal design, ξ*, we find that φ(x;ξ*,θ)≥0, with equality at the support points of the optimal design. The sensitivity function for the D-optimality criterion is given by
$$\varphi(x;\xi,\theta) = m - \nu(\eta(x;\theta))\, f^T(x;\theta)\, M^{-1}(\xi;\theta)\, f(x;\theta). \tag{3}$$
The efficiency allows any design ξ to be compared to the Φ-optimal design ξ*,
$$\mathrm{eff}_\Phi(\xi\,|\,\xi^*) = \frac{\Phi(M(\xi^*;\theta))}{\Phi(M(\xi;\theta))}.$$
Also, if Φ is positively homogeneous, the value of the efficiency can be interpreted practically. If the efficiency value is 0.7, this means that the Φ-optimal design can be used to obtain the same information, or equivalently, the same statistical inference of the estimators of the model parameters, with a saving of 30% of the observations. For the D-optimization criterion, which is positively homogeneous, D-efficiency is calculated as follows:
$$\mathrm{eff}_D(\xi\,|\,\xi^*) = \left(\frac{|M(\xi;\theta)|}{|M(\xi^*;\theta)|}\right)^{1/m}. \tag{4}$$
This expression will be termed “efficiency” from here on, as there is no possible confusion.
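The sensitivity function (3) and the D-efficiency (4) can be sketched numerically as follows. This is an illustration under stated assumptions: the helper names, the quadratic model on [−1, 1] with unit EIM and the five-point competitor design are hypothetical choices, and f(x) is again treated as a column vector.

```python
import numpy as np

def fim(support, weights, f, nu):
    """Information matrix of a finite-support design."""
    return sum(w * nu(x) * np.outer(f(x), f(x)) for x, w in zip(support, weights))

def sensitivity(x, M, f, nu):
    """phi(x) = m - nu(x) f(x)^T M^{-1} f(x), as in Equation (3)."""
    fx = f(x)
    return len(fx) - nu(x) * fx @ np.linalg.solve(M, fx)

def d_efficiency(M_xi, M_opt):
    """D-efficiency of a design relative to the D-optimal one, as in Equation (4)."""
    m = M_xi.shape[0]
    return (np.linalg.det(M_xi) / np.linalg.det(M_opt)) ** (1.0 / m)

f = lambda x: np.array([1.0, x, x**2])   # quadratic regressors (assumed example)
nu = lambda x: 1.0                       # unit EIM (homoscedastic normal, assumed)

M_opt = fim([-1.0, 0.0, 1.0], [1/3] * 3, f, nu)        # classical D-optimal design
M_alt = fim(np.linspace(-1, 1, 5), [0.2] * 5, f, nu)   # equispaced competitor

# Equivalence-theorem check on a grid: phi should be nonnegative everywhere
phi = np.array([sensitivity(x, M_opt, f, nu) for x in np.linspace(-1, 1, 201)])
eff = d_efficiency(M_alt, M_opt)
print(phi.min(), eff)
```

In this example the sensitivity function vanishes (up to rounding) at the three support points and is positive elsewhere, while the equispaced design loses efficiency relative to the optimal one.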
3. Variance Structure and EIM for a Heteroscedastic Normal Distribution
In most applications in the context of optimal experimental design, the homoscedastic normal distribution is used. However, when the response follows the gamma or the Poisson distribution, the variance depends on the explanatory variable. For a fair comparison with these distributions, we consider the heteroscedastic normal distribution with a variance structure given by
$$\mathrm{Var}[y] = k\,E[y]^{2r}, \tag{5}$$
where $k\in\mathbb{R}^+$ and $r\in\mathbb{R}$ are constants and E[y]=η(x;θ). Thus, taking k=1, for a value of r=0.5 the variance structure for the heteroscedastic normal distribution is similar to that of the Poisson distribution (Var[y]=E[y]). On the other hand, with k=1/α and r=1, the structure of the variance for the heteroscedastic normal distribution is $\mathrm{Var}[y] = E[y]^2/\alpha$, similar to the variance of the gamma distribution, Γ(α,β), when parameter α is constant. Finally, the case r=0 corresponds to the homoscedastic normal distribution.
Then, using (2), the EIM for the heteroscedastic normal distribution with variance given by (5) is
$$\nu_N(\eta(x;\theta);r,k) = \frac{2r^2}{\eta(x;\theta)^2} + \frac{1}{k\,\eta(x;\theta)^{2r}}.$$
Theorem 1. Let η(x;θ)>0 be the function of some regression model. For any optimization criterion Φ based on the FIM, the Φ-optimal designs for the heteroscedastic normal distribution with r=1 in the variance defined in (5) and for the gamma distribution with constant α coincide. Also, the Φ-optimal design obtained is independent of α and k.
Proof. Taking r=1 in the variance given by (5), the EIM for the heteroscedastic normal distribution is $\nu_N(\eta(x;\theta)) = (2k+1)/(k\,\eta(x;\theta)^2)$, while the EIM for the gamma distribution is $\nu_\Gamma(\eta(x;\theta)) = \alpha/\eta(x;\theta)^2$, and so
$$M_N(\xi;\theta) = \frac{2k+1}{k}\,\frac{1}{\eta(x;\theta)^2}\, f^T(x;\theta) f(x;\theta) \;\propto\; \frac{\alpha}{\eta(x;\theta)^2}\, f^T(x;\theta) f(x;\theta) = M_\Gamma(\xi;\theta).$$
Therefore the Φ-optimal designs calculated with either matrix agree. Also, the parameters k and α are constants that multiply each expression of the FIM, and so do not affect the Φ-optimal design. □
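The proportionality used in the proof can be checked numerically with a short sketch. The function names and the constants k and α below are arbitrary illustrative choices; the formulas are the EIMs stated in this section.

```python
import numpy as np

def eim_normal(eta, r, k):
    """EIM of the heteroscedastic normal with Var[y] = k * E[y]^(2r)."""
    return 2 * r**2 / eta**2 + 1.0 / (k * eta ** (2 * r))

def eim_gamma(eta, alpha):
    """EIM of the gamma distribution with constant alpha."""
    return alpha / eta**2

eta = np.linspace(0.1, 5.0, 50)   # positive values of the regression function
k, alpha = 0.7, 3.0               # arbitrary illustrative constants

# r = 1: the ratio is the same constant (2k+1)/(k*alpha) for every eta
ratio_r1 = eim_normal(eta, r=1, k=k) / eim_gamma(eta, alpha)

# r = 0.5: the ratio changes with eta, so no proportionality holds
ratio_r05 = eim_normal(eta, r=0.5, k=k) / eim_gamma(eta, alpha)

print(ratio_r1.min(), ratio_r1.max())
print(ratio_r05.min(), ratio_r05.max())
```

A constant ratio means the two FIMs differ only by a scalar factor, which leaves the optimizer of any criterion based on the FIM unchanged.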
The EIMs of the heteroscedastic normal (r=0.5) and Poisson distributions are not proportional. Therefore, in this case, no result analogous to Theorem 1 is available.
4. Linear Quadratic Model
The linear quadratic model is considered in many studies which assume different probability distributions, such as gamma or Poisson distributions (for instance, refs. [1,4]). The regression function of the model is given by
$$\eta(x;\theta) = \theta_0 + \theta_1 x + \theta_2 x^2, \qquad x \in X.$$
The aim of this section is to provide D-optimal designs for this model when the response variable follows first a gamma and then a Poisson distribution. It also discusses the influence of misspecification for an assumed heteroscedastic normal distribution.
4.1. Gamma Distribution
Gamma models are suitable when the response is non-negative, continuous, skewed and heteroscedastic [7]. The introduction of the cited reference mentions several papers with real applications. From the point of view of optimal experimental design some papers could be cited, for example [6,20] for the case of multivariate gamma models, and [4] for the univariate case. In the present study, this last reference is revisited as an example of the applicability of the following results.
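Theorem 2. Let $\eta(x;\theta) = \theta_0 + \theta_1 x + \theta_2 x^2 + \dots + \theta_p x^p > 0$ be the function of a linear regression model of order p≥1, where x is defined on a design space X=[xl,xu]. If the response variable follows a gamma distribution with constant parameter α, the D-optimal design is supported at p+1 equally weighted points with x1=xl and xp+1=xu. It can be expressed by
$$\xi_\Gamma^* = \begin{Bmatrix} x_1 & x_2 & \dots & x_p & x_{p+1} \\ \frac{1}{p+1} & \frac{1}{p+1} & \dots & \frac{1}{p+1} & \frac{1}{p+1} \end{Bmatrix}.$$
For the linear quadratic model (p=2), the D-optimal design is
$$\xi_\Gamma^* = \begin{Bmatrix} x_l & x_2 & x_u \\ 1/3 & 1/3 & 1/3 \end{Bmatrix},$$
where x2∈(xl,xu) is a root of the quadratic equation
$$\big(\theta_1 + \theta_2(x_l + x_u)\big)x_2^2 - \big(2\theta_2 x_l x_u - 2\theta_0\big)x_2 - \big(\theta_0(x_l + x_u) + \theta_1 x_l x_u\big) = 0. \tag{6}$$
Thus, it will be one of the solutions of
$$x_2 = \frac{2\theta_2 x_l x_u - 2\theta_0 \pm \sqrt{(2\theta_2 x_l x_u - 2\theta_0)^2 + 4\big(\theta_1 + \theta_2(x_l + x_u)\big)\big(\theta_0(x_l + x_u) + \theta_1 x_l x_u\big)}}{2\big(\theta_1 + \theta_2(x_l + x_u)\big)}.$$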
Proof. Particularizing the sensitivity function given in (3) using the EIM for the gamma distribution (Table 1) gives
$$\varphi(x;\xi,\theta) = (p+1) - \frac{\alpha}{\eta(x;\theta)^2}\, f(x)^T M^{-1}(\xi;\theta) f(x).$$
By the General Equivalence Theorem, if ξΓ* is the D-optimal design, φ(x;ξΓ*,θ)≥0 must be satisfied for all x∈[xl,xu], and there must be equality in the support points of the design. It is, therefore, necessary to study the zeros of the function
$$g(x) = (p+1)\,\eta(x;\theta)^2 - \alpha\, f(x)^T M^{-1}(\xi_\Gamma^*;\theta) f(x),$$
which is a polynomial of order 2p whose zeros coincide with the zeros of φ(x;ξΓ*,θ). First, the number of support points of the D-optimal design must be greater than or equal to the number of unknown parameters in the model, m=p+1, in order for the FIM to be regular. Suppose, then, that the D-optimal design ξΓ* has p+2 support points. In this case, there will be at least p internal points with multiplicity two, since both the sensitivity function and its derivative vanish there, and the polynomial g(x) will have at least 2p+2 roots, contradicting its order, which is 2p. Therefore, the D-optimal design can have neither more nor fewer than p+1 points, and so must have exactly p+1 points. Now suppose that one extreme of X is not a support point of the design. Assume, without loss of generality, that the support points of the optimal design x1,…,xp+1 satisfy xl<x1<…<xp+1=xu. The points x1,x2,…,xp are roots of multiplicity 2 of g(x), and by Rolle’s Theorem, there exist c1∈(x1,x2), c2∈(x2,x3),…,cp∈(xp,xp+1) such that g′(ci)=0, i=1,…,p. Therefore, g′(x) vanishes at 2p points (the p double roots x1,…,xp and the p points c1,…,cp), once again contradicting the order of the polynomial g′(x), which is 2p−1. By analogous reasoning for the case xl=x1<…<xp+1<xu, the conclusion is that the D-optimal design must have the two extremes in its support and, by the above, p−1 internal points.
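Finally, the D-optimal design is equally weighted because the weights factor out of the determinant:
$$|M(\xi;\theta)| = \prod_{i=1}^{p+1} \nu(\eta(x_i;\theta))\; F(x_1,\dots,x_{p+1})\; w_1 \cdots w_{p+1},$$
where
$$F(x_1,\dots,x_{p+1}) = \prod_{\substack{i=1,\; j=2 \\ i<j}}^{p+1} (x_i - x_j)^2$$
only depends on the support points. Thus, the maximum product of the p+1 weights, which are restricted to being positive and summing to 1, is reached for wi=1/(p+1).
For p=2, the internal point of the design is found by solving, with x1=xl and x3=xu, the equation
$$\frac{\partial |M(\xi;\theta)|}{\partial x_2} = \frac{2(x_2 - x_l)(x_2 - x_u)(x_l - x_u)^2\, a(x_2;\theta)\, w_1 w_2 w_3}{\eta(x_2;\theta)^3\, \eta(x_l;\theta)^2\, \eta(x_u;\theta)^2} = 0, \tag{7}$$
where $a(x_2;\theta) = \big(\theta_1 + \theta_2(x_l + x_u)\big)x_2^2 - \big(2\theta_2 x_l x_u - 2\theta_0\big)x_2 - \big(\theta_0(x_l + x_u) + \theta_1 x_l x_u\big)$. Solving Equation (7) is equivalent to solving a(x2;θ)=0, which is a quadratic equation with roots
$$x_2 = \frac{2\theta_2 x_l x_u - 2\theta_0 \pm \sqrt{(2\theta_2 x_l x_u - 2\theta_0)^2 + 4\big(\theta_1 + \theta_2(x_l + x_u)\big)\big(\theta_0(x_l + x_u) + \theta_1 x_l x_u\big)}}{2\big(\theta_1 + \theta_2(x_l + x_u)\big)}.$$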
By the previous results, only one of the two roots can be on the interval (xl,xu). □
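A hedged numerical sketch of Theorem 2 for p=2 follows. The nominal values θ=(1, 0.5, 0.2), the design space [0, 2] and α=1 are hypothetical and chosen only so that η(x;θ)>0 on X; the helper name `gamma_inner_point` is likewise an assumption. The snippet solves Equation (6) with `numpy.roots`, keeps the root inside (xl, xu), and then evaluates the sensitivity function on a grid as an equivalence-theorem check.

```python
import numpy as np

def gamma_inner_point(theta, xl, xu):
    """Root of Equation (6) lying in the open interval (xl, xu)."""
    t0, t1, t2 = theta
    a = t1 + t2 * (xl + xu)
    b = 2 * t2 * xl * xu - 2 * t0
    c = t0 * (xl + xu) + t1 * xl * xu
    roots = np.roots([a, -b, -c])            # a*x^2 - b*x - c = 0
    inside = [z.real for z in roots if abs(z.imag) < 1e-12 and xl < z.real < xu]
    return inside[0]

# Hypothetical nominal values and design space (eta > 0 on [0, 2])
theta, xl, xu, alpha = (1.0, 0.5, 0.2), 0.0, 2.0, 1.0
x2 = gamma_inner_point(theta, xl, xu)

eta = lambda x: theta[0] + theta[1] * x + theta[2] * x**2
f = lambda x: np.array([1.0, x, x**2])
nu = lambda x: alpha / eta(x) ** 2           # gamma EIM

support = [xl, x2, xu]
M = sum(nu(x) * np.outer(f(x), f(x)) / 3 for x in support)   # equal weights 1/3

grid = np.linspace(xl, xu, 401)
phi = np.array([3 - nu(x) * f(x) @ np.linalg.solve(M, f(x)) for x in grid])
phi_support = [3 - nu(x) * f(x) @ np.linalg.solve(M, f(x)) for x in support]
print(x2, phi.min(), phi_support)
```

For these values the sensitivity function stays nonnegative on the grid and vanishes (up to rounding) at the three support points, which is the behaviour the theorem predicts.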
Corollary 1. Let $\eta(x;\theta) = \theta_0 + \theta_1 x + \theta_2 x^2 + \dots + \theta_p x^p > 0$ be the function of a linear regression model of order p≥1, where x is defined on a design space X=[xl,xu]. If the response variable follows a heteroscedastic normal distribution, with r=1 in the variance defined by Equation (5), then ξN*=ξΓ*.
Proof. This is a direct consequence of Theorems 1 and 2. □
Corollary 2. Under the hypotheses of Theorem 2, the following specific cases exist where the internal point of the design, x2, does not depend on the values of the parameters θ of the model:
(i) If θ1=−θ2(xl+xu) and θ0≠θ2xlxu, Equation (6) is linear and gives x2=(xl+xu)/2. In this case, the designs for the gamma distribution with constant α and the homoscedastic normal (r=0 in (5)) agree.
(ii) If θ0=θ2xlxu and xlxu>0, then $x_2^2 = x_l x_u$. Therefore $x_2 = \pm\sqrt{x_l x_u}$, where x2 is the root lying in the interval (xl,xu).
(iii) If θ1=−θ0(xl+xu)/(xlxu) with xl,xu≠0 and xl+xu≠0, then x2=0 or x2=2xlxu/(xl+xu).
Proof. The cases follow by algebraic manipulation of Equation (6). □
In [4], Bayesian, A- and D-optimal designs are computed for linear models assuming a gamma distribution. For the linear quadratic model, D-optimal designs are computed for different nominal values. Some of them are not equally weighted, or are even supported at only two points (singular designs). This might seem to contradict Theorem 2 above. However, it happens only for nominal values θ(0) for which the linear predictor η(x;θ)≤0 for at least one x∈X. If η(x;θ)=0, the EIM is undefined ($\nu(\eta(x;\theta)) = 1/\eta(x;\theta)^2$). On the other hand, the case η(x;θ)<0 does not make mathematical sense since $\eta(x;\theta)^{-1} = E[y] = \alpha/\beta$, where α,β>0 are the parameters of the gamma distribution (see Table 1).
For all nominal values θ(0) for which η(x;θ(0))>0, Theorem 2 can be applied to obtain D-optimal designs. Thus, both extremes of the design space are included in D-optimal designs, all of which are equally weighted, and the inner points, x2, are obtained by solving (6) (Table 2). The D-optimality condition is verified by the General Equivalence Theorem, through the sensitivity function (3). In addition, for the nominal values θ(0)=(0.3,−0.3,0.3), the first condition of Corollary 2 is satisfied. Thus, it can be shown that the D-optimal design is supported in the midpoint of X, which agrees with the D-optimal design for a homoscedastic normal distribution.
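The midpoint case can be reproduced with a short sketch based on the closed-form roots of Equation (6). The design space X=[0, 1] is an assumption made here for illustration (this section does not state it); it is chosen because it satisfies θ1=−θ2(xl+xu) for the quoted nominal values. The explicit branch for a vanishing leading coefficient is also an illustrative choice.

```python
import numpy as np

theta = (0.3, -0.3, 0.3)   # nominal values quoted in the text
xl, xu = 0.0, 1.0          # ASSUMED design space, so that theta1 = -theta2*(xl + xu)

t0, t1, t2 = theta
a = t1 + t2 * (xl + xu)             # coefficient of x2^2 in Equation (6)
b = 2 * t2 * xl * xu - 2 * t0       # Equation (6) reads a*x2^2 - b*x2 - c = 0
c = t0 * (xl + xu) + t1 * xl * xu

if np.isclose(a, 0.0):
    # Equation (6) degenerates to a linear equation: -b*x2 - c = 0
    x2 = -c / b
else:
    disc = np.sqrt(b**2 + 4 * a * c)
    candidates = [(b + disc) / (2 * a), (b - disc) / (2 * a)]
    x2 = next(z for z in candidates if xl < z < xu)

# The theorem requires eta > 0 on the whole design space
grid = np.linspace(xl, xu, 101)
eta_positive = bool(np.all(t0 + t1 * grid + t2 * grid**2 > 0))
print(x2, (xl + xu) / 2, eta_positive)
```

Under this assumed design space the inner point coincides with the midpoint of X, as in the homoscedastic normal case.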
4.2. Poisson Distribution
Generalized Linear Models for Poisson distribution are widely used in the literature. Special attention is paid to linear quadratic models in oncology [21,22,23,24]. A reference involving optimal designs and Poisson distribution is [1], where different linear regression models are considered.
Theorem 3. Let $\eta(x;\theta) = \theta_0 + \theta_1 x + \theta_2 x^2$ be the function of the linear quadratic regression model, with x defined on the design space X=[xl,xu], and the response variable following a Poisson distribution. Then, for the 3-point D-optimal design, we have the following sufficient conditions:
(i) If θ2<0 and θ1+2xlθ2<4/(xu−xl), the lower extreme of X, xl, is included in the D-optimal design.
(ii) If θ2<0 and θ1+2xuθ2>0, the upper extreme of X, xu, is included in the D-optimal design.
Also, if both extremes of X are included in the design, the internal point x2 will be the solution, lying in X, of the cubic equation
$$-2\theta_2 x_2^3 + \big[2\theta_2(x_u + x_l) - \theta_1\big]x_2^2 + \big[\theta_1(x_u + x_l) - 2x_l x_u\theta_2 - 4\big]x_2 + \big[2x_u + x_l(2 - x_u\theta_1)\big] = 0. \tag{8}$$
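Proof. Consider the 3-point D-optimal design
$$\xi_P^* = \begin{Bmatrix} x_1 & x_2 & x_3 \\ 1/3 & 1/3 & 1/3 \end{Bmatrix}$$
with xl≤x1<x2<x3≤xu. The design is equally weighted because the weights can be separated out in the optimization of the determinant (see Proof of Theorem 2).
The explicit expression of the derivative with respect to x1 is
$$\frac{\partial |M(\xi;\theta)|}{\partial x_1} = \frac{1}{27}\exp\Big(\sum_{i=1}^{3}\eta(x_i;\theta)\Big)(x_2 - x_1)(x_3 - x_1)(x_3 - x_2)^2 \times \big[(4x_1 - 2x_2 - 2x_3) + (x_2 - x_1)(x_3 - x_1)(\theta_1 + 2x_1\theta_2)\big].$$
If ∂|M(ξ;θ)|/∂x1<0 on [xl,x2), then the maximum of the determinant will be reached at x1=xl. Thus
$$\frac{\partial |M(\xi;\theta)|}{\partial x_1} < 0 \iff \theta_1 + 2x_1\theta_2 < \frac{2x_2 + 2x_3 - 4x_1}{(x_2 - x_1)(x_3 - x_1)}. \tag{9}$$
If we consider θ2<0, we have θ1+2x1θ2≤θ1+2xlθ2 and the inequalities
$$\frac{2x_2 + 2x_3 - 4x_1}{(x_2 - x_1)(x_3 - x_1)} > \frac{4(x_2 - x_1)}{(x_2 - x_1)(x_3 - x_1)} = \frac{4}{x_3 - x_1} \geq \frac{4}{x_u - x_l}$$
are satisfied.
Therefore, the inequality (9) is true if the following is satisfied:
$$\theta_1 + 2x_l\theta_2 < \frac{4}{x_u - x_l}.$$
Also,
$$\frac{\partial |M(\xi;\theta)|}{\partial x_3} = \frac{1}{27}\exp\Big(\sum_{i=1}^{3}\eta(x_i;\theta)\Big)(x_3 - x_1)(x_3 - x_2)(x_2 - x_1)^2 \times \big[(4x_3 - 2x_1 - 2x_2) + (x_3 - x_1)(x_3 - x_2)(\theta_1 + 2x_3\theta_2)\big],$$
and if ∂|M(ξ;θ)|/∂x3>0 on (x2,xu] the maximum will be found at x3=xu. This gives
$$\frac{\partial |M(\xi;\theta)|}{\partial x_3} > 0 \iff \frac{2x_1 + 2x_2 - 4x_3}{(x_3 - x_1)(x_3 - x_2)} < \theta_1 + 2x_3\theta_2. \tag{10}$$
If θ2<0, we have θ1+2xuθ2≤θ1+2x3θ2. Moreover, since x1<x2<x3,
$$\frac{2x_1 + 2x_2 - 4x_3}{(x_3 - x_1)(x_3 - x_2)} < 0,$$
and so if 0<θ1+2xuθ2, the inequality in (10) is satisfied.
Finally, if xl and xu are in the support of the design, as in Theorem 2, the internal point will be a solution of the equation ∂|M(ξ;θ)|/∂x2=0, which is equivalent to the cubic equation given by (8). □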
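The cubic equation (8) can be solved numerically as sketched below. The nominal values θ=(1, 1, −0.25) and the design space [0, 1] are hypothetical and were picked only because they satisfy both sufficient conditions of Theorem 3; the helper name is an assumption. As a cross-check, the inner point is compared with a grid search of the log-determinant of the equally weighted design {xl, x2, xu}, assuming the Poisson EIM exp(η) that appears in the derivatives of the proof.

```python
import numpy as np

def poisson_inner_points(theta, xl, xu):
    """Real roots of the cubic equation (8) that lie in (xl, xu)."""
    t0, t1, t2 = theta
    coeffs = [-2 * t2,
              2 * t2 * (xu + xl) - t1,
              t1 * (xu + xl) - 2 * xl * xu * t2 - 4,
              2 * xu + xl * (2 - xu * t1)]
    roots = np.roots(coeffs)
    return [z.real for z in roots if abs(z.imag) < 1e-10 and xl < z.real < xu]

# Hypothetical nominal values and design space
theta, xl, xu = (1.0, 1.0, -0.25), 0.0, 1.0

# Sufficient conditions of Theorem 3 for including both extremes
cond_lower = theta[2] < 0 and theta[1] + 2 * xl * theta[2] < 4 / (xu - xl)
cond_upper = theta[2] < 0 and theta[1] + 2 * xu * theta[2] > 0

(x2,) = poisson_inner_points(theta, xl, xu)

# Cross-check: maximize log|M| over x2 on a grid (terms constant in x2 dropped)
grid = np.linspace(xl, xu, 2001)[1:-1]
eta = theta[0] + theta[1] * grid + theta[2] * grid**2
logdet = eta + 2 * np.log(grid - xl) + 2 * np.log(xu - grid)
x2_grid = grid[np.argmax(logdet)]
print(cond_lower, cond_upper, x2, x2_grid)
```

The root of the cubic and the grid maximizer agree up to the grid resolution in this example.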
As mentioned above, the linear quadratic model plays an important role in oncology, and optimal experimental design has an important practical role in determining the best doses for carrying out the experiment and fitting the model. To illustrate the previous result, we consider the example in [1] where the response variable y explains the number of living cells in a system and the explanatory variable x is the dose of an injected oncology drug. Hence, the expected number of living cells for any dose xi is given by
$$\lambda_i = E[y_i] = e^{\theta_0 + \theta_1 x_i + \theta_2 x_i^2}, \qquad x_i \geq 0.$$
From the context of the problem, the relationship between x and y must be inverse: the higher the dose inoculated the lower the number of living cells and vice versa. For the examples, θ1≤0 and θ2<0 are considered to satisfy this relationship for all x∈X. Furthermore, to consider a high dose would not be realistic, as the number of living cells could be very low and might compromise the survival of the system. Let $\lambda_c = e^{\theta_0}$ be the mean of the number of surviving cells for a control dose (x=0). Then, the expected survival proportion for any dose xi is λi/λc≥c, where c∈(0,1] is the minimal survival proportion. The value of c is a characteristic for each system and for the context of the problem. For this study we consider c=0.4. When $\theta_1^2/\theta_2 \geq -4\log c$, the survival proportion is not less than the minimal survival proportion in the design space X=[0,xu], where xu is expressed as a function of the parameters of the model (see details in [1]).
Based on the above, the first condition of Theorem 3 is satisfied and therefore a control dose x=0 is always included in the D-optimal design. Table 3 shows D-optimal designs when the response variable follows heteroscedastic normal (with r=0.5 in (5)) or Poisson distributions. The nominal values considered fulfill the relationship in [1]. Moreover, all D-optimal designs include the upper extreme in their support, so only the inner points x2 of D-optimal designs are shown. For a Poisson distribution the point x2 may be computed by solving Equation (8) of Theorem 3. Finally, an efficiency study is carried out. The efficiencies of the designs are calculated by adapting (4) as
$$\mathrm{eff}_D(\xi_A\,|\,\xi_T) = \left(\frac{|M_T(\xi_A;\theta)|}{|M_T(\xi_T;\theta)|}\right)^{1/m}, \tag{11}$$
where ξA is the D-optimal design for the probability distribution assumed by the researcher (for this example, heteroscedastic normal with r=0.5), while ξT is the D-optimal design and MT is the FIM, both for the true probability distribution (in this example, a Poisson distribution). The last column of Table 3 shows these efficiencies. Unlike the results obtained for the gamma distribution, where the D-optimal designs coincide with those for the heteroscedastic normal distribution when the relationship between mean and variance agrees (Corollary 1), there is a non-negligible loss of efficiency, around 20% or more, in this case. It is noteworthy that the inner point for the Poisson distribution is lower than that for a heteroscedastic normal distribution for the designs computed.
5. Extended Hill Model
The Hill model is a dose-response model commonly used in practice to describe the relationship between the concentration of a drug and its effect. Several papers [12,14,15] have addressed this issue from the point of view of optimal experimental design. This model may explain both discrete and continuous responses, such as counting cells [25] or the effect of a drug on cell growth [13], among many others. Here we focus on the 4-parameter Hill model.
If we consider x to be the dose of an administered drug, the function of the regression model which explains the effect can be expressed as
$$\eta(x;\theta) = \frac{(E_{con} - b)\left(\dfrac{x}{IC_{50}}\right)^{s}}{1 + \left(\dfrac{x}{IC_{50}}\right)^{s}} + b, \tag{12}$$
where θ=(Econ,b,IC50,s) are the parameters to be estimated. The parameter Econ is the effect on the control, i.e., where there is no dose. The parameter b corresponds to the asymptotic value of the response when the concentration of the drug tends to infinity and IC50 corresponds to the dose at which a response would be found equal to the middle of the effect range, Econ−b. Finally, the parameter s is a form parameter: if s>0, η(x;θ) will be strictly increasing, and if s<0, strictly decreasing. Thus, when the parameter b>0 and s<0 the drug has an inhibitory effect where b implies that the whole cell population is not destroyed, as shown in Figure 1. This is the case considered in this paper. Here it is studied from two perspectives simultaneously, where the gamma, or Poisson, is the true distribution of the response variable and the practitioner assumes a heteroscedastic normal distribution with the variance structure given by (5).
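The regression function (12) and its gradient with respect to θ, which is the vector f needed to build the FIM, can be sketched as follows. The parameter values are hypothetical and are not the nominal values of Table 4; the function names are assumptions. The formula is rewritten in terms of (x/IC50)^(−s) so that it stays finite at x=0 in the inhibitory case s<0, and the analytic gradient is compared with central finite differences as a sanity check.

```python
import numpy as np

def hill(x, e_con, b, ic50, s):
    """4-parameter Hill model (12), written to be finite at x = 0 when s < 0."""
    v = (x / ic50) ** (-s)            # v = 1 / (x/IC50)^s
    return (e_con - b) / (1.0 + v) + b

def hill_grad(x, e_con, b, ic50, s):
    """Gradient of (12) with respect to (E_con, b, IC50, s), for x > 0."""
    v = (x / ic50) ** (-s)
    h = 1.0 / (1.0 + v)               # fraction of the effect range
    dh = v / (1.0 + v) ** 2
    return np.array([h,
                     1.0 - h,
                     -(e_con - b) * s * dh / ic50,
                     (e_con - b) * dh * np.log(x / ic50)])

# Hypothetical parameter values (inhibitory case: b > 0, s < 0)
params = np.array([100.0, 10.0, 2.0, -1.5])
x = 3.0

g = hill_grad(x, *params)
step = 1e-6
fd = np.array([(hill(x, *(params + step * e)) - hill(x, *(params - step * e))) / (2 * step)
               for e in np.eye(4)])   # central finite differences

print(hill(0.0, *params), hill(params[2], *params), hill(1e9, *params))
print(g, fd)
```

With these values the response equals E_con at the control dose, the midpoint of the effect range at x=IC50, and approaches b for very large doses.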
Ref. [13] bring together different maximum likelihood estimations of the parameters of (12) for different types of drugs. Table 4 shows these nominal values and the 4-point D-optimal designs obtained for different probability distributions of the response variable: ξΓ (gamma distribution with constant α), ξN (heteroscedastic normal distribution with variance structure given by (5)) and ξP (Poisson distribution). By Theorem 1, when r=1 in (5) ξΓ and ξN coincide. However, the designs ξP and ξN with r=0.5 are distinct, even though both comparisons show a similar relationship between the mean and the variance. Table 4 shows only the inner points (intermediate doses) of the D-optimal designs, as the extremes of the design space X=[0,Dmax] are included in all the cases studied. The maximum dosage Dmax was given by the value 1000·IC50, except for the drug AG2009, since the authors considered this dosage to be impractical. It can be seen that the D-optimal design leads to experimenting with three very low doses, and at the maximum dose (Dmax), except for drug AG2009, where the doses are more spread out.
The last column of Table 4 shows that the efficiency, computed by (11), of the D-optimal designs when a heteroscedastic normal distribution with r=0.5 is assumed while the true distribution is the Poisson is around 73%, except for the drug AG2009, whose efficiency is higher. Again, in this practical case there is a considerable loss of efficiency in estimating the model parameters due to misspecification in the probability distribution. All D-optimal designs in this section have been computed using the Wynn-Fedorov algorithm [26].
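For readers unfamiliar with this family of algorithms, the following is a minimal sketch of a Wynn-Fedorov type vertex-direction iteration for D-optimality on a discretized design space. It is not the authors' implementation: the grid, the uniform starting design, the stopping rule and the test problem (quadratic regression on [−1, 1] with unit EIM, whose D-optimal design is known) are assumptions made for illustration. The step length used is the exact line-search maximizer of the determinant along the direction of the selected point.

```python
import numpy as np

def wynn_fedorov(grid, f, nu, n_iter=2000, tol=1e-6):
    """Vertex-direction iteration for a D-optimal design on a finite grid."""
    F = np.array([f(x) for x in grid])            # rows are f(x)^T
    v = np.array([nu(x) for x in grid])           # EIM at each grid point
    m = F.shape[1]
    w = np.full(len(grid), 1.0 / len(grid))       # start from the uniform design
    for _ in range(n_iter):
        M = (F * (w * v)[:, None]).T @ F
        d = v * np.einsum("ij,ij->i", F @ np.linalg.inv(M), F)   # nu * f^T M^{-1} f
        j = int(np.argmax(d))                     # point where phi = m - d is most negative
        if d[j] - m < tol:                        # equivalence theorem nearly satisfied
            break
        alpha = (d[j] - m) / (m * (d[j] - 1.0))   # exact line-search step length
        w *= 1.0 - alpha                          # shrink current design ...
        w[j] += alpha                             # ... and move mass to the chosen point
    return w

# Assumed test problem with a known answer: support {-1, 0, 1}, weights 1/3
grid = np.linspace(-1.0, 1.0, 201)
f = lambda x: np.array([1.0, x, x**2])
nu = lambda x: 1.0

w = wynn_fedorov(grid, f, nu)
F = np.array([f(x) for x in grid])
M = (F * w[:, None]).T @ F
eff = (np.linalg.det(M) / (4.0 / 27.0)) ** (1.0 / 3.0)   # efficiency vs known optimum
print(eff, grid[w > 0.05])
```

Replacing `f` by the gradient of the Hill model and `nu` by the EIM of the assumed distribution would adapt the same loop to the setting of this section, with nominal values supplied for θ.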
Sensitivity Analysis
The main aim of this section is to study the effect of the relationship between E[y] and Var[y], characterized by the parameter r in (5), on the efficiency. So, a sensitivity analysis of this parameter is done. Ref. [11] studies the influence of misspecification in the structure of the variance in an analysis carried out for the gamma distribution and the heteroscedastic normal distribution separately. Here, a similar study was carried out with a point of view in which a practitioner considers a heteroscedastic normal response, but the true distribution of the response is gamma, or Poisson. For both distributions, efficiencies, using (11), are computed by comparing D-optimal designs for heteroscedastic normal distribution with the D-optimal design for the true probability distribution, as a function of the values of r.
The efficiencies achieved for different drugs are shown in Figure 2. It can be seen how, when the true distribution is gamma (Figure 2a), the efficiency is 1 for r=1 (dot), since in this case the designs coincide as proven in Theorem 1. However, when the true distribution is the Poisson (Figure 2b), maximum efficiency is not obtained for r=0.5 (dots), as might be expected. It is achieved in this case for negative values of r, close to r=0, and so it would have been better, in terms of efficiency, for the practitioner to have assumed the homoscedastic normal distribution rather than heteroscedastic normal with r=0.5. Furthermore, it does not reach the value 1 for any value of r. Finally, for this model in the neighborhood of r=0 (homoscedastic normal distribution) opposite effects are produced on the efficiency for each of the distributions: greater efficiencies when the true distribution is the Poisson, and lower in the case of the gamma. It is important to highlight that there is no analytic explanation for this effect, and it is motivated by the model and nominal values.
For that reason, the effect of r on the trend of the efficiency is studied as a function of the values taken by η(x;θ). Again, in this analysis, it is assumed that y follows a heteroscedastic normal distribution with variance structure given by (5) when the true distribution of y is the Poisson or the gamma, so that misspecification takes place.
First, for sufficiently large values of r and η(x;θ)>1, ∀x∈X, given that $\eta(x;\theta)^{-2r}/k \approx 0$, we have
$$\nu_N(\eta(x;\theta);r,k) = \frac{2r^2}{\eta(x;\theta)^2} + \frac{\eta(x;\theta)^{-2r}}{k} \approx \frac{2r^2}{\eta(x;\theta)^2} \propto \frac{\alpha}{\eta(x;\theta)^2} = \nu_\Gamma(\eta(x;\theta)).$$
Thus, when the true probability distribution is a gamma distribution, Figure 3a (solid line) shows how the efficiency tends to 1 as r increases. On the other hand, the lower the value of r, the greater the difference between νN(η(x;θ);r,k) and νΓ(η(x;θ)), and therefore the efficiency tends to 0, as can be seen in Figure 3a. However, if 0<η(x;θ)≤1 (dashed line), ∀x∈X, the effect of r on the trend of the efficiencies of the designs obtained for the heteroscedastic normal distribution when the true distribution is a gamma distribution is the opposite. As shown in Figure 3a, if r increases, the efficiency tends to 0, and if r decreases, the efficiency tends to 1.
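The role of the term η(x;θ)^(−2r)/k in this argument can be illustrated numerically. The sketch below, with hypothetical values of η and r and with k=1, computes the share of the heteroscedastic normal EIM contributed by that term: when η>1 the share vanishes as r grows, so the EIM becomes proportional to the gamma EIM, whereas when 0<η<1 the same term dominates.

```python
import numpy as np

def eim_normal(eta, r, k=1.0):
    """EIM of the heteroscedastic normal with Var[y] = k * E[y]^(2r)."""
    return 2 * r**2 / eta**2 + eta ** (-2.0 * r) / k

def second_term_share(eta, r, k=1.0):
    """Fraction of the normal EIM contributed by eta^(-2r)/k."""
    return (eta ** (-2.0 * r) / k) / eim_normal(eta, r, k)

# Hypothetical values: one eta above 1 and one below 1
shares = {eta: [second_term_share(eta, r) for r in (1, 3, 10)]
          for eta in (2.0, 0.5)}

for eta, vals in shares.items():
    print(eta, vals)
```

This only illustrates the behaviour of the EIM at a single value of η; the efficiencies in Figure 3 additionally depend on the whole design space and on the regression function.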
When the true distribution is a Poisson distribution, there is no direct comparison between its EIM and the EIM of a heteroscedastic normal distribution. However, Figure 3b shows that the efficiency reaches a maximum at a particular value of r and decreases as r moves away from that value. This is because the study considers values of s<0, and so η(x;θ) and νP(η(x;θ)) are monotonic. Therefore, the maximum efficiency is attained at the value of r for which the distance between νN(η(x;θ)) and νP(η(x;θ)) is minimal (regardless of whether η(x;θ)>1 or 0<η(x;θ)≤1). Although the 4-parameter Hill model is taken as an example and the graphs in Figure 3 are obtained from that model, the study of the effect of r on the trend of the efficiency is general for any regression function satisfying the inequalities η(x;θ)>1 or 0<η(x;θ)≤1.
Finally, it is interesting to point out the differences between the graphs in Figure 2 and Figure 3. First, the trends in the efficiencies as a function of r do not coincide. This is because, for the drugs studied with the 4-parameter Hill model, neither of the inequalities η(x;θ)>1 or 0<η(x;θ)≤1 is satisfied over the whole design space considered in the examples. Second, when the true distribution is Poisson, the maximum efficiency in Figure 3b is obtained for a value close to r=0 when 0<η(x;θ)≤1, as also occurs in Figure 2b, whereas for η(x;θ)>1 the maximum efficiency is obtained close to r=−3. That is, for the same model, the nominal values defined by the context of the problem affect the loss of efficiency.
6. Summary and Conclusions
This study has been carried out to analyze the effect of misspecification in the probability distribution of the response variable. We measure that effect by calculating the efficiency of the optimal design obtained with an assumed, or working, distribution relative to the design obtained with the true probability distribution. The typical case is that of a researcher who assumes a normal distribution, possibly a heteroscedastic one, for the response variable of the problem when, on closer examination, another distribution is more appropriate, for example a gamma (or Poisson) distribution. When there is misspecification in the probability distribution, there is a loss of efficiency which depends both on the assumed probability distribution and on the regression function η(x;θ).
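As an illustration of this efficiency measure, the sketch below computes a D-efficiency for the linear quadratic predictor η=θ0+θ1x+θ2x² on [0,1] when the true distribution is gamma. It uses the standard definition effD = (det M(ξ)/det M(ξ*))^(1/m), with m the number of parameters. The nominal values and the gamma design {0, 0.366, 1} come from the first column of Table 2. The comparison design {0, 0.5, 1} is the well-known D-optimal design for quadratic regression under a homoscedastic normal response. All function names, and the choice α=1 (which cancels in the ratio), are assumptions made for this example.

```python
def info_matrix(design, theta, weight_fn):
    """M(xi) = sum_i w_i * nu(eta(x_i)) * f(x_i) f(x_i)^T with f(x) = (1, x, x^2)."""
    m = [[0.0] * 3 for _ in range(3)]
    for x, w in design:
        f = (1.0, x, x * x)  # gradient of the linear predictor w.r.t. theta
        eta = sum(t * fi for t, fi in zip(theta, f))
        nu = weight_fn(eta)  # elemental information at this point
        for i in range(3):
            for j in range(3):
                m[i][j] += w * nu * f[i] * f[j]
    return m


def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def d_efficiency(design, reference, theta, weight_fn):
    """D-efficiency of `design` relative to `reference` under the true EIM."""
    num = det3(info_matrix(design, theta, weight_fn))
    den = det3(info_matrix(reference, theta, weight_fn))
    return (num / den) ** (1.0 / 3.0)


def gamma_eim(eta, alpha=1.0):
    """Gamma EIM from Table 1; alpha = 1 is an arbitrary choice that cancels out."""
    return alpha / eta**2


if __name__ == "__main__":
    theta = (0.3, 0.3, 0.3)  # nominal values, first column of Table 2
    xi_gamma = [(0.0, 1 / 3), (0.366, 1 / 3), (1.0, 1 / 3)]  # design from Table 2
    xi_homoscedastic = [(0.0, 1 / 3), (0.5, 1 / 3), (1.0, 1 / 3)]
    print(d_efficiency(xi_homoscedastic, xi_gamma, theta, gamma_eim))
```

The information matrices of both designs are evaluated with the EIM of the true distribution; only the support points differ according to the assumed distribution.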
We provide some theoretical results, as well as practical ones. The first result is quite general, being valid for any regression function and any criterion based on the FIM: it guarantees that there is no loss of efficiency when the response variable follows a gamma distribution and a heteroscedastic normal distribution with r=1 in the variance structure given by (5) is assumed. For the linear quadratic model, analytical results are obtained for computing the optimal design under Poisson and gamma distributions. These theoretical results have been applied to real examples from the literature, providing designs that are useful for practitioners.
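The first result can be read directly from the EIM expressions in Table 1. As a short worked step (a sketch derived from the table entries, not a reproduction of the proof of Theorem 1), setting r=1 in the EIM of the heteroscedastic normal distribution gives

$$\nu_N(\eta;1,k)=\frac{2}{\eta^2}+\frac{1}{k\,\eta^2}=\frac{2k+1}{k}\cdot\frac{1}{\eta^2}=\frac{2k+1}{k\,\alpha}\,\nu_\Gamma(\eta).$$

The two EIMs therefore differ only by a constant factor that does not depend on x, so the corresponding information matrices are proportional for every design. This is consistent with the designs coinciding when r=1.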
Finally, the 4-parameter Hill model was used to illustrate and quantify the loss of efficiency. When a heteroscedastic normal distribution is assumed with values close to r=0 in (5) and the true distribution is gamma, between about 18% and 25% of the efficiency is lost for all the drugs considered in the study. Thus, in this case, the usual assumption of normality and homoscedasticity (r=0) of the response variable is not a good option. However, when the true distribution is Poisson, the loss of efficiency is less severe, and the maximum efficiency is reached for values close to r=0 for all the drugs. This is a striking case, as one might expect the maximum efficiency to be achieved at r=0.5, which leads to the same relationship between the mean and the variance for the heteroscedastic normal and the Poisson distributions.
It is worth finishing this paper by mentioning that the EIM is an essential tool, as it collects information about both the regression function and the probability distribution of the response variable. As already mentioned, assuming the homoscedastic normal distribution when obtaining optimal designs may lead to a great loss of efficiency. Nonetheless, the examples given show that this depends on the true distribution of the response variable and on the model function chosen. The existence of uncertainty about the probability distribution of the response variable therefore motivates, as future work, the goal of obtaining designs that are robust to this uncertainty.
Author Contributions
Conceptualization, M.A.-S., V.C.-A. and S.P.-C.; methodology, M.A.-S., V.C.-A. and S.P.-C.; software, M.A.-S., V.C.-A. and S.P.-C.; formal analysis, M.A.-S., V.C.-A. and S.P.-C.; investigation, M.A.-S., V.C.-A. and S.P.-C.; writing—original draft preparation, M.A.-S., V.C.-A. and S.P.-C.; writing—review and editing, M.A.-S., V.C.-A. and S.P.-C. All authors have read and agreed to the published version of the manuscript.
Funding
All authors were sponsored by Ministerio de Economía y Competitividad and FEDER funds MTM2016-80539-C2-1-R and by Junta de Comunidades de Castilla-La Mancha SBPLY/17/180501/000380.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
Not applicable.
Conflicts of Interest
The authors declare no conflict of interest.
References
1. Wang, Y.; Myers, R.H.; Smith, E.P.; Ye, K. D-optimal designs for Poisson regression models.
2. García-Camacha, I.; Martín-Martín, R. The Construction of Locally D-Optimal Designs by Canonical Forms to an Extension for the Logistic Model.
3. Amo-Salas, M.; Delgado-Márquez, E.; Filová, L.; López-Fidalgo, J. Optimal designs for model discrimination and fitting for the flow particles.
4. Aminenjad, M.; Jafari, H. Bayesian A- and D-optimal designs for gamma regression model with inverse link function.
5. Casero-Alonso, V.; López-Fidalgo, J.; Torsney, B. A computer tool for a minimax criterion in binary response and heteroscedastic simple linear regression models.
6. Idais, O. Locally optimal designs for multivariate generalized linear models.
7. Idais, O.; Schwabe, R. Analytic solutions for locally optimal designs for gamma models having linear predictors without intercept.
8. Woods, D.C.; Lewis, S.M.; Eccleston, J.A.; Russell, K.G. Designs for Generalized Linear Models With Several Variables and Model Uncertainty.
9. Stufken, J.; Yang, M. Optimal designs for generalized linear models.
10. Atkinson, A.C.; Fedorov, V.V.; Herzberg, A.M.; Zang, R. Elemental information matrices and optimal experimental design for generalized regression models.
11. Shen, G.; Hyun, S.W.; Wong, W.K. Optimal designs based on the maximum quasi-likelihood estimator.
12. Bezeau, M.; Endrenyi, L. Design of Experiments for the Precise Estimation of Dose-Response Parameters: The Hill Equation.
13. Khinkis, L.A.; Levasseur, L.; Faessel, H.; Greco, W.R. Optimal Design for Estimating Parameters of the 4-Parameter Hill Model.
14. Fang, H.B.; Ross, D.D.; Sausville, E.; Tan, M. Experimental design and interaction analysis of combination studies of drugs with log-linear dose responses.
15. Sperrin, M.; Thygesen, H.; Su, T.; Harbron, C.; Whitehead, A. Experimental designs for detecting synergy and antagonism between two drugs in a pre-clinical study.
16. McCullagh, P.; Nelder, J.A.
17. Karlin, S.; Studden, W.J. Optimal Experimental Designs.
18. Atkinson, A.; Donev, A.N.; Tobias, R.D.
19. Kiefer, J.; Wolfowitz, J. The equivalence of two extremum problems.
20. Gaffke, N.; Idais, O.; Schwabe, R. Locally optimal designs for gamma models.
21. Tucker, S.L. Tests for the fit of the linear-quadratic model to radiation isoeffect data.
22. Roch-Lefèvre, S.; Martin-Bodiot, C.; Grègoire, E.; Desbrée, A.; Roy, L.; Barquinero, J.F. A mouse model of cytogenetic analysis to evaluate caesium137 radiation dose exposure and contamination level in lymphocytes.
23. McMahon, S.J. The linear quadratic model: Usage, interpretation and challenges.
24. Shuryak, I.; Cornforth, M.N. Accounting for overdispersion of lethal lesions in the linear quadratic model improves performance at both high and low radiation doses.
25. Minkin, S. Experimental Design for Clonogenic Assays in Chemotherapy.
26. Wynn, H.P. The sequential generation of D-optimum experimental designs.
Figures and Tables
Figure 1. Graph of the regression function η(x;θ) of the 4-parameter Hill model with nominal values corresponding to the drug TMTX, shown in Table 4.
Figure 2. Efficiencies when comparing the designs obtained for the heteroscedastic normal distribution with variance given by (5) and different values of r when the true distributions are the gamma (a) and Poisson (b), for the 4-parameter Hill model using the nominal values of Table 4. The graphs for the drugs MTX, AG2032 and AG2034 are similar to the graph for the drug TMTX.
Figure 3. Study of efficiency trend when comparing the designs obtained for the heteroscedastic normal distribution as a function of the parameter r of (5) considering the gamma (a) or Poisson (b) distributions as the true distributions, for the 4-parameter Hill model in X=[0,1500]. Solid lines assume Econ=10 and b=1 (therefore η(x;θ)>1), while for dashed lines Econ=1 and b=0.1 (0<η(x;θ)≤1). Both cases use the nominal values IC50=550 and s=−2.
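Table 1. Density function, link function, expectation of the response variable as a function of η=η(x;θ) and the EIM for the probability distributions used in this paper.

| Distribution | pdf, $d(y;\rho)$ | $g(E[y])$ | $E[y]$ | EIM |
|---|---|---|---|---|
| $N(\mu,\sigma^2)$, constant $\sigma^2$ | $\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left\{-\frac{(y-\mu)^2}{2\sigma^2}\right\}$ | Identity | $\mu=\eta$ | $\frac{1}{\sigma^2}$ |
| $N(\mu,k\mu^{2r})$ | $\frac{1}{\sqrt{2k\pi\mu^{2r}}}\exp\left\{-\frac{(y-\mu)^2}{2k\mu^{2r}}\right\}$ | Identity | $\mu=\eta$ | $\frac{2r^2}{\eta^2}+\frac{1}{k\eta^{2r}}$ |
| $P(\lambda)$ | $\frac{\lambda^{y}e^{-\lambda}}{y!}$ | Log | $\lambda=e^{\eta}$ | $e^{\eta}$ |
| $\Gamma(\alpha,\beta)$, constant $\alpha$ | $\frac{\beta^{\alpha}}{\Gamma(\alpha)}y^{\alpha-1}e^{-y\beta}$ | Reciprocal | $\frac{\alpha}{\beta}=\frac{1}{\eta}$ | $\frac{\alpha}{\eta^2}$ |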
Table 2. Locally D-optimal designs {x1=0,x2,x3=1} equally weighted obtained for the linear quadratic regression model when the probability distribution of the response is gamma. The nominal values of the parameters of the model are those considered in [4].

| Parameter | | | | | | |
|---|---|---|---|---|---|---|
| $\theta_0$ | 0.3 | 0.3 | 0.3 | 0.3 | 0.3 | 1 |
| $\theta_1$ | 0.3 | 2 | 5 | 10 | −0.3 | 1 |
| $\theta_2$ | 0.3 | 0.3 | 0.3 | 0.3 | 0.3 | −0.3 |
| $x_2$ | 0.366 | 0.254 | 0.188 | 0.144 | 0.5 | 0.434 |
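The interior points x2 in Table 2 can be cross-checked numerically. The sketch below is an independent check and not the authors' analytical derivation. It assumes the linear predictor η(x)=θ0+θ1x+θ2x², the gamma EIM α/η² of Table 1, and the three-point equally weighted design {0, x2, 1}. For such a design the determinant of the FIM depends on x2 only through [x2(1−x2)]²/η(x2)², which is maximized here by a golden-section search. The function names are hypothetical, and the search assumes that the criterion is unimodal on (0,1).

```python
import math


def log_det_term(x2, theta):
    """Part of log det M that depends on x2 for the design {0, x2, 1} (gamma EIM)."""
    eta = theta[0] + theta[1] * x2 + theta[2] * x2 * x2
    return 2.0 * math.log(x2 * (1.0 - x2)) - 2.0 * math.log(eta)


def golden_max(fn, lo, hi, tol=1e-8):
    """Golden-section search for the maximizer of a unimodal function on (lo, hi)."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    while b - a > tol:
        c = b - inv_phi * (b - a)  # left interior point
        d = a + inv_phi * (b - a)  # right interior point
        if fn(c) > fn(d):
            b = d  # maximizer lies in [a, d]
        else:
            a = c  # maximizer lies in [c, b]
    return (a + b) / 2.0


def optimal_x2(theta):
    """Interior support point maximizing the D-criterion for the given nominal values."""
    return golden_max(lambda x: log_det_term(x, theta), 0.0, 1.0)


if __name__ == "__main__":
    # Nominal values (theta0, theta1, theta2) taken column by column from Table 2.
    columns = [(0.3, 0.3, 0.3), (0.3, 2, 0.3), (0.3, 5, 0.3),
               (0.3, 10, 0.3), (0.3, -0.3, 0.3), (1, 1, -0.3)]
    for theta in columns:
        print(theta, round(optimal_x2(theta), 3))
```

The search never evaluates the criterion at the endpoints 0 and 1, where the logarithm would be undefined, because only interior points of the current bracket are used.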
Table 3. Locally D-optimal designs {x1=0,x2,x3=xu} equally weighted obtained for the linear quadratic model when the probability distribution of the response is heteroscedastic normal or Poisson. The nominal values are θ(0)=(0.95,−1,θ2) and the minimal survival fraction is c=0.4. The last column shows the efficiency when comparing the design for the heteroscedastic normal distribution with r=0.5 to the Poisson distribution.

| $\theta_2$ | $x_u$ | $x_2$ for $\xi_N\,(r=0.5)$ | $x_2$ for $\xi_P$ | $\mathrm{eff}_D(\xi_N \mid \xi_P)$ |
|---|---|---|---|---|
| −1/50 | 0.9001 | 0.7017 | 0.3993 | 0.7724 |
| −1/20 | 0.8778 | 0.6826 | 0.3895 | 0.7748 |
| −1/10 | 0.8449 | 0.6546 | 0.3751 | 0.7785 |
| −1/5 | 0.7911 | 0.6091 | 0.3515 | 0.7845 |
| −1 | 0.5799 | 0.4354 | 0.2585 | 0.8073 |
Table 4. Locally D-optimal designs {x1=0,x2,x3,x4=Dmax} equally weighted for the 4-parameter Hill model for different drugs and probability distributions. The nominal values Econ(0)=1.70 and b(0)=0.137 were considered for all the drugs. Columns 2–4 show the nominal values of the parameters and columns 5–10 show the internal points of the D-optimal designs ξN, ξΓ and ξP. The last column shows the efficiency when comparing the design for the heteroscedastic normal distribution with r=0.5 to the Poisson distribution.

| Drug | $IC_{50}^{(0)}$ | $s^{(0)}$ | $D_{max}$ | $x_2$ for $\xi_\Gamma=\xi_N\,(r=1)$ | $x_3$ for $\xi_\Gamma=\xi_N\,(r=1)$ | $x_2$ for $\xi_N\,(r=0.5)$ | $x_3$ for $\xi_N\,(r=0.5)$ | $x_2$ for $\xi_P$ | $x_3$ for $\xi_P$ | $\mathrm{eff}_D(\xi_N \mid \xi_P)$ |
|---|---|---|---|---|---|---|---|---|---|---|
| TMTX | 0.00895 | −1.79 | 8.95 | 0.00918 | 0.03568 | 0.00748 | 0.03010 | 0.00407 | 0.01283 | 0.729 |
| MTX | 0.0223 | −2.74 | 22.3 | 0.02265 | 0.05502 | 0.01982 | 0.04922 | 0.01330 | 0.02817 | 0.728 |
| AG2032 | 0.453 | −0.825 | 453 | 0.07837 | 0.15728 | 0.07057 | 0.14411 | 0.05159 | 0.09299 | 0.728 |
| AG2034 | 0.0774 | −3.49 | 77.4 | 0.43634 | 7.70106 | 0.28694 | 5.46295 | 0.08152 | 0.96714 | 0.743 |
| AG2009 | 111 | −1.03 | 1500 | 63.1061 | 432.616 | 49.5552 | 361.701 | 23.4549 | 156.114 | 0.836 |