Nonparametric Statistical Inference of Dependent Censoring Data Based on an Assumed Copula

Author  WangChunJie 
Tutor  ShiNingZhong; WangDeHui 
School  Jilin University 
Course  Probability Theory and Mathematical Statistics 
Keywords  Interval censored data; Current status data; Copula; Dependent censoring; Nonparametric estimator
CLC  O212.7 
Type  PhD thesis 
Year  2012 
The statistical analysis of lifetime, survival time or failure time data arises in a number of applied fields, such as engineering, biomedicine, public health, epidemiology, economics and demography. From product durability studies to studies of human disease, survival analysis is needed. Censored data arise when an individual's lifetime is known only to fall within a certain period of time. Possible censoring schemes are right censoring, where all that is known is that the individual is still alive at a given time; left censoring, where all that is known is that the individual experienced the event of interest prior to the start of the study; and interval censoring, where the only information is that the event occurred within some interval. Truncation schemes are left truncation, where only individuals who survive a sufficient time are included in the sample, and right truncation, where only individuals who have experienced the event by a specified time are included in the sample. The issues of censoring and truncation are dealt with more carefully here. A common feature of these data sets is that they contain either censored or truncated observations.
Interval censoring is a type of censoring that has become increasingly common in the areas that produce failure time data. In the past 20 years or so, a voluminous literature on the statistical analysis of interval-censored failure time data has appeared. Interval-censored data are divided into two types: current status data, that is, case I interval-censored data, and case II interval-censored data. If the failure time and the observation time can be assumed to be independent, several methods have been developed for the problem. In practice, however, the failure time variable T and the observation time variable C (or the pair U, V) often do not satisfy the independence assumption; this situation is called dependent censoring.
Copula theory plays an increasingly large role in modeling the dependence between variables. Here we focus on the situation where the independence assumption does not hold for the two types of interval-censored data, propose estimation procedures for the distribution of the failure time under the Copula model framework, and apply these methods to tumorigenicity experiments and AIDS data. The main results of this paper are introduced below.
First, Chapter 2 discusses nonparametric estimation of a survival function when one observes only current status data (McKeown and Jewell, 2010; Sun, 2006; Sun and Sun, 2005). In this case, each subject is observed only once and the failure time of interest is observed to be either smaller or larger than the observation or censoring time. If the failure time and the observation time can be assumed to be independent, several methods have been developed for the estimation. Here we focus on the situation where the independence assumption does not hold, which often occurs in, for example, tumorigenicity experiments. For this problem, two simple estimation procedures are proposed under the Copula model framework. The estimates allow one to perform a sensitivity analysis or identify the shape of a survival function, among other uses. A simulation study indicates that the two methods work well, and they are applied to a motivating example from a tumorigenicity study.
Nonparametric estimation of a survival function is often the first task in performing failure time data analysis, and many procedures have been developed for the problem (Kalbfleisch & Prentice, 2002; Hu & Lawless, 1996; Hu et al., 1998; Klein & Moeschberger, 2002). However, most of the existing procedures are for right-censored failure time data.
This paper discusses nonparametric estimation of a survival function when one observes only current status data (McKeown & Jewell, 2010; Sun, 2006; Zhang et al., 2005). In this situation, each subject is observed only once and the failure time of interest is observed to be either smaller or larger than the observation time. Several procedures have been developed for the estimation problem in the literature if the failure time and the observation time can be assumed to be independent. In this paper, we focus on the situation where the independence assumption does not hold, for which there seems to be no nonparametric estimation procedure available.
This study was motivated by the analysis of tumorigenicity experiments, in which current status data routinely occur and the estimation of the tumor prevalence function is often required (Keiding, 1991). This is because in this situation the failure time of interest is usually the time to tumor onset, which is commonly not observed. Instead, only the death or sacrifice time of an animal, serving as the observation time here, is known (Hoel & Walburg, 1972; Lagakos & Louis, 1988). If the tumor is nonlethal, then it is usually reasonable to assume that the time to tumor onset and the death time are independent. Of course, if the tumor is lethal, the estimation is straightforward, as we would have right-censored data on the time to tumor onset. On the other hand, it is well known that most types of tumors are between lethal and nonlethal, meaning that the time to tumor onset and the death time are related. Lagakos and Louis (1988) discussed several examples of such data and pointed out that, for this problem, a sensitivity analysis has to be performed because the correlation between the two time variables, often referred to as lethality, cannot be estimated. The proposed estimates make such a sensitivity analysis possible.
Another example will be discussed below.
As mentioned above, several procedures have been proposed for nonparametric estimation of the survival or cumulative distribution function based on current status data when the failure time of interest and the observation time can be assumed to be independent (Sun, 2006). In the following, such data will be referred to as independent current status data, and otherwise as dependent current status data. For example, for independent current status data, the maximum likelihood estimate of the survival function can be easily derived by using the pool-adjacent-violators algorithm for isotonic regression. Also for independent current status data, many authors have considered other analysis issues such as treatment comparison and regression analysis (Sun, 2006). For dependent current status data, on the other hand, there exists only limited literature in general, except in the field of tumorigenicity experiments. Among others, Zhang et al. (2005) considered regression analysis of general dependent current status data and proposed an estimating-equation-based inference procedure. A large literature exists for tumorigenicity experiments, and in this case a common approach is to apply the three-state model to describe the tumor process. However, with respect to estimation of the tumor growth and prevalence functions, most of the existing methods are parametric procedures or assume that tumors are either lethal or nonlethal. A nonparametric estimate would clearly be very useful for identifying the shape of the prevalence function or for model checking, among other uses.
Several approaches are commonly used to deal with correlated failure times or failure time data with dependent censoring (Hougaard, 1986). One is the frailty approach, which models the correlation by using latent variables, and another is the Copula model approach. Let T and X denote the two possibly related random variables, F and G their marginal distributions, respectively, and H their joint distribution.
Then it has been shown (Nelsen, 2006) that there exists a Copula function $C(u, v)$, defined on $I^2 = [0,1]^2$ with $C(u,0) = C(0,v) = 0$, $C(u,1) = u$ and $C(1,v) = v$, such that
$H(t, x) = C(F(t), G(x)), \quad t \ge 0,\ x \ge 0.$ (1)
If $F$ and $G$ are continuous, $C$ is unique. Furthermore, for any given Copula function $C$ and marginal distribution functions $F$ and $G$, the function $H$ defined in equation (1) is the corresponding joint distribution function. Among others, Zheng and Klein (1995) employed the Copula model approach for estimation of a survival function in the case of dependent right-censored data. In the following, we adopt the same approach. However, it should be noted that the two situations are quite different: with current status data, the data structure is much more complex and the relevant observed information is much more limited.
In the following, we assume that $T$ and $X$ represent the failure time of interest and the observation time, respectively. The main goal of this paper is to discuss nonparametric estimation of $F$ under model equation (1). In Section 2, we show that if the Copula function $C$ is known, the marginal distribution function $F$ of interest can be uniquely identified. In Section 3, we present two simple consistent estimates of $F$, both of which can be easily obtained. The first one is developed for Archimedean Copula functions and has a closed form, while the second one is for general Copula functions. Section 4 gives some numerical results, and in Section 5 we apply the methods to a set of current status data arising from a tumorigenicity experiment. Section 6 concludes with some discussion and remarks. In the following, we assume that both $T$ and $X$ are continuous variables.
Consider a survival study that involves $n$ independent subjects and gives the observed data $\{X_i, \delta_i = I(X_i \ge T_i);\ i = 1,\dots,n\}$, which are i.i.d. replications of $\{X, \delta = I(X \ge T)\}$. Let $F$, $G$ and $H$ be defined as before and suppose that $P(T_i = X_i) = 0$.
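The data structure just described can be illustrated with a small simulation. The following is a minimal sketch, not taken from the thesis: the function name `simulate_current_status`, the choice of a Clayton copula (sampled by conditional inversion), and the exponential margins are all assumptions made only for illustration.

```python
import math
import random

def simulate_current_status(n, theta, rate_t=1.0, rate_x=1.0, seed=0):
    """Draw n current status observations (X_i, delta_i), delta_i = I(X_i >= T_i).

    (T, X) are linked by a Clayton copula with parameter theta > 0.
    Exponential margins are an illustrative assumption.
    """
    rng = random.Random(seed)
    eps = 1e-12  # keep uniforms strictly inside (0, 1)
    data = []
    for _ in range(n):
        u = min(max(rng.random(), eps), 1.0 - eps)
        w = min(max(rng.random(), eps), 1.0 - eps)
        # Conditional inversion: solve dC(u, v)/du = w for v (Clayton copula).
        v = (u ** (-theta) * (w ** (-theta / (1.0 + theta)) - 1.0) + 1.0) ** (-1.0 / theta)
        t = -math.log1p(-u) / rate_t  # failure time T = F^{-1}(u), never observed
        x = -math.log1p(-v) / rate_x  # observation time X = G^{-1}(v)
        data.append((x, 1 if x >= t else 0))
    return data
```

Only the pairs $(X_i, \delta_i)$ are returned, mirroring the fact that the failure times themselves are never observed. Larger values of `theta` give stronger positive dependence between $T$ and $X$.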
It is easy to see that, given the observed current status data, one can directly estimate the following three functions: $p_1(x) = P(X \le x) = G(x)$, $p_2(x) = P(X > x, T < X)$ and $p_3(x) = P(X \le x, T < X)$, for $0 \le x \le \infty$. In the following, we show that the marginal distribution function $F$ of $T$ is uniquely determined by these functions.
Let $C$ be the Copula function defined in (1). Then we have $C(u,v) = H(F^{-1}(u), G^{-1}(v))$, $u, v \in [0,1]$. Let $\mu_C$ denote the probability measure corresponding to $C$. Then a simple calculation gives $p_2(x) = \mu_C(A_x)$, where $A_x = \{(u,v): 0 \le u \le F(G^{-1}(v)),\ G(x) \le v \le 1\}$. The following theorem establishes the identifiability of $F$ for the situation considered here.
Theorem 1. Suppose the marginal distribution functions $F$ and $G$ are continuous and strictly increasing over $(0,\infty)$. Also suppose $\mu_C(E) > 0$ for any open set $E$ in $[0,1] \times [0,1]$. Then, given the Copula function $C$, the marginal distribution function $F$ is uniquely determined by $p_1(x)$ and $p_2(x)$, or by $p_1(x)$ and $p_3(x)$.
In this subsection, we assume that $C$ is an Archimedean Copula function given by
$C_\varphi(u,v) = \varphi^{-1}[\varphi(u) + \varphi(v)], \quad \varphi \in \Phi,$ (2)
where $\Phi$ denotes the class of functions $\varphi: [0,1] \to [0,\infty]$ with continuous first and second derivatives satisfying $\varphi(1) = 0$, $\varphi'(t) < 0$ and $\varphi''(t) > 0$ for $0 < t < 1$. Then we can show the following theorem.
Theorem 2. Suppose that the conditions in Theorem 1 hold and also suppose that $\varphi(t) \to \infty$ and $\varphi'(t) \to -\infty$ as $t \to 0$. Then $F$ can be expressed as
$F(t) = \varphi^{-1}\{\varphi[(\varphi')^{-1}(g(t)\,\varphi'(G(t))/D_1(t))] - \varphi(G(t))\},$ (3)
where $D_1(t) = dP(T < X, X < t)/dt$, $G(t) = P(X \le t)$ and $g(t) = G'(t)$.
Next, we consider estimation of the marginal distribution function $F$ for general Copula functions. Let $G$, $p_3$ and $g$ be defined as before. By Theorem 1, $F$ can be uniquely determined by $G$ and $p_3$ given $C$, so it is natural to develop an estimate of $F$ by using $G$ and $p_3$. Let $0 < x_1 < x_2 < \dots < x_k$ denote a sequence of fixed time points; their selection will be discussed below. To derive the estimate, note the equation
$p_3(x) = \int_0^x P(T < X \mid X = y)\,dG(y) = \int_0^x C_v(F(y), G(y))\,dG(y),$ (4)
where $C_v(u,v) = \partial C(u,v)/\partial v$.
This suggests that a natural estimate of $F$ can be derived by replacing $G$ and $p_3$ in the equation above with their empirical estimates. More specifically, let $\hat F_2$ denote the resulting estimate of $F$. Then at $x_j$, $\hat F_2(x_j)$ can be determined by solving the empirical version of equation (4), with $F(x_l)$ replaced by $\hat F_2(x_l)$ for $l < j$. If $C$ is taken to be the Frank Copula function, the equation above takes an explicit form [equation not recoverable from the source].
Secondly, Chapter 3 gives a simple Copula-based estimation procedure for interval-censored data with informative censoring. Nonparametric estimation of a survival function is one of the most commonly asked questions in the analysis of failure time data, and a number of procedures have been developed for it under various types of censoring structures (Kalbfleisch and Prentice, 2002). In particular, several algorithms are available for interval-censored failure time data with an independent censoring mechanism (Sun, 2006; Turnbull, 1976). In this paper, we consider interval-censored data where the censoring mechanism may be related to the failure time of interest, for which there does not seem to exist a nonparametric estimation procedure. It is well known that with informative censoring, estimation is possible only under some assumptions. To attack the problem, we take a Copula model approach to model the relationship between the failure time of interest and the censoring variables and present a simple nonparametric estimation procedure. The method allows one to conduct a sensitivity analysis, among other uses.
Statistical analysis of interval-censored failure time data has recently attracted a great deal of attention (Finkelstein, 1986; Sun, 2006; Zhang and Sun, 2010). By interval-censored data, we usually mean failure time data in which the failure time of interest is observed only to belong to some interval instead of being known exactly. Such data occur in many fields, including clinical trials and longitudinal studies.
One common example occurs in medical or health studies that entail periodic follow-up. In this situation, an individual due for prescheduled observations for a clinically observable change in disease or health status may miss some observations and return with a changed status. Accordingly, we only know that the true event time is greater than the last observation time at which the change had not occurred and less than or equal to the first observation time at which the change was observed, thus giving an interval that contains the real (but unobserved) time of occurrence of the change. In this paper, we discuss nonparametric estimation of a survival function in these situations, where in addition the censoring mechanism may be related to the failure time of interest.
A more specific and well-known example of interval-censored data was discussed in Finkelstein (1986), among others. The data arose from a retrospective study of early breast cancer patients who had been treated at the Joint Center for Radiation Therapy in Boston between 1976 and 1980. During the study, the patients were given either radiation therapy alone or radiation therapy plus adjuvant chemotherapy and were supposed to be seen at clinic visits every 4 to 6 months. However, actual visit times differ from patient to patient, and times between visits also vary. At visits, physicians evaluated the cosmetic appearance of the patient, such as breast retraction, a response that has a negative impact on overall cosmetic appearance. With respect to the time to breast retraction, some patients did not experience breast retraction during the study, thus giving right-censored observations. For the other patients, the observations are intervals given by the last clinic visit time at which breast retraction had not yet occurred and the first clinic visit time at which breast retraction was detected. That is, only interval-censored data were observed for the time to breast retraction.
Another example will be discussed below.
Nonparametric estimation of a survival function is often the first task in performing failure time data analysis, and many procedures have been developed for the problem (Kalbfleisch & Prentice, 2002; Hu & Lawless, 1996; Hu et al., 1998; Klein & Moeschberger, 2003). However, most of the existing procedures are for right-censored failure time data. Several procedures are also available for interval-censored data (Gentleman and Geyer, 1994; Sun, 2006; Turnbull, 1976; Wellner and Zhan, 1997). For example, the simplest procedure is perhaps the self-consistency algorithm given by Turnbull (1976). However, all of the available procedures are for the situation of an independent censoring mechanism. There does not seem to exist a nonparametric estimation procedure for the situation where the censoring mechanism is informative, that is, may be related to the failure time of interest. It is well known that in the presence of informative censoring, the survival function may not be identifiable without some assumptions about the relationship between the failure time of interest and the censoring variables (Zheng and Klein, 1995). Nevertheless, a nonparametric estimate would be very useful, as it would allow one to, for example, conduct a sensitivity analysis or identify the shape of the underlying survival function.
Two approaches are commonly used to deal with failure time data with dependent censoring (Hougaard, 1986). One is the frailty approach, which models the relationship by using latent variables (Zhang et al., 2005), and the other is the Copula model approach. In this paper, we take the Copula model approach, which is briefly described in Section 3.1 along with some notation. In Section 3.2, we present a nonparametric estimation procedure for interval-censored data where the censoring variables may be related to the failure time variable of interest.
The procedure can be easily implemented, and the key idea behind it is to divide the observed data into two sets of current status data, for which the estimation is relatively easy. Section 3.3 gives some numerical results; in particular, the method is illustrated by a set of interval-censored data arising from an AIDS study.
In this chapter, we define the data structure and describe some notation and assumptions that will be used throughout the paper. In particular, the Copula model is briefly discussed.
Consider a failure time study and let $T$ denote the failure time of interest. Suppose that instead of being observed exactly, the observation on $T$ is characterized by two random variables $U$ and $V$ with $U < V$ such that one observes $\{U, V, \delta_1 = I(T \le U), \delta_2 = I(U < T \le V)\}$. That is, $T$ is known only to be smaller than $U$, between $U$ and $V$, or greater than $V$. Define $W = V - U$. Then an alternative way of representing an interval-censored observation is $\{U, W, \delta_1, \delta_2\}$. Let $F_1$, $F_2$ and $F_3$ denote the marginal distributions of $T$, $U$ and $W$, respectively. One can easily estimate $F_2$ and $F_3$ by their empirical estimates. In the following, we suppose that $T$, $U$ and $W$ may be related, and the main goal is to estimate $F_1$.
As mentioned above, the Copula model provides a very flexible approach to modeling the relationship among correlated random variables (Hougaard, 1986; Nelsen, 2006). Define $I = [0,1]$. A $k$-dimensional Copula is a function $C$ from $I^k$ to $I$ such that, for all $(u_1,\dots,u_k) \in I^k$, $C(u_1,\dots,u_k) = 0$ if at least one coordinate is 0, and $C(u_1,\dots,u_k) = u_j$ if all coordinates are 1 except $u_j$. In particular, if $C$ is a two-dimensional Copula function, we have $C(u,0) = C(0,v) = 0$, $C(u,1) = u$ and $C(1,v) = v$ for all $u, v \in I$, and
$C(u_2,v_2) - C(u_2,v_1) - C(u_1,v_2) + C(u_1,v_1) \ge 0$
for all $u_1, u_2, v_1, v_2$ in $I$ such that $u_1 < u_2$ and $v_1 < v_2$.
One of the important properties of the Copula model is given by the Sklar theorem.
It states that if $H$ is a $k$-dimensional distribution function with marginal distributions $G_1,\dots,G_k$, then there exists a $k$-dimensional Copula function $C$ such that
$H(x_1,\dots,x_k) = C(G_1(x_1),\dots,G_k(x_k))$ (6)
for all $(x_1,\dots,x_k)$. Furthermore, if $G_1,\dots,G_k$ are continuous, then $C$ is unique. On the other hand, for any given $k$-dimensional Copula function $C$ and $k$ one-dimensional distribution functions $G_1,\dots,G_k$, the function $H$ defined above is a distribution function with marginal distributions $G_1,\dots,G_k$. A key feature of the above expression is that the Copula function $C$ characterizes the relationship among the $k$ random variables concerned.
Another advantage of the Copula model approach is its flexibility, as there exist many different Copula functions. Among them, one commonly used class is the Archimedean Copula functions, with the three-dimensional function defined as
$C_3(u_1,u_2,u_3) = \varphi^{-1}(\varphi(u_1) + \varphi(u_2) + \varphi(u_3)), \quad \varphi \in \Phi,$ (7)
where $\Phi$ denotes the class of functions $\varphi: I \to [0,\infty]$ with continuous first and second derivatives satisfying $\varphi(1) = 0$, $\varphi'(t) < 0$ and $\varphi''(t) > 0$ for $0 < t < 1$. Note that given $C_3$, we have
$C^{(1,2)}(u_1,u_2) = C_3(u_1,u_2,1) = \varphi^{-1}(\varphi(u_1) + \varphi(u_2) + \varphi(1)) = \varphi^{-1}(\varphi(u_1) + \varphi(u_2)).$
That is, the resulting two-dimensional marginal distribution function $C^{(1,2)}(u_1,u_2)$ is still an Archimedean Copula function with the same generator function $\varphi$. The same also holds for the marginal distribution functions $C^{(1,3)}(u_1,u_3)$ and $C^{(2,3)}(u_2,u_3)$.
In the following, we assume that the joint distribution function $H$ of $T$, $U$ and $W$ can be described by a Copula function as in (6) and discuss the estimation of the marginal distribution function $F_1$ of $T$. A similar idea was used by, among others, Zheng and Klein (1995) for estimation of a survival function based on dependent right-censored failure time data. It should be noted that the data structure considered here is much more complex and the relevant observed information is much more limited.
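The marginal property of Archimedean Copula functions noted above can be checked numerically. The sketch below is illustrative only: it assumes the Clayton generator $\varphi(t) = (t^{-\theta} - 1)/\theta$ with $\theta > 0$, which satisfies the stated conditions on $\Phi$, and the function names are hypothetical.

```python
import math

def clayton_phi(t, theta):
    """Clayton generator phi(t) = (t**(-theta) - 1) / theta, with phi(0) = +inf."""
    if t == 0.0:
        return math.inf
    return (t ** (-theta) - 1.0) / theta

def clayton_phi_inv(s, theta):
    """Inverse generator phi^{-1}(s) = (1 + theta * s)**(-1 / theta)."""
    return (1.0 + theta * s) ** (-1.0 / theta)

def archimedean(us, theta):
    """Archimedean copula C(u_1, ..., u_k) = phi^{-1}(phi(u_1) + ... + phi(u_k))."""
    return clayton_phi_inv(sum(clayton_phi(u, theta) for u in us), theta)

# Setting the third coordinate to 1 recovers the two-dimensional copula,
# because phi(1) = 0 contributes nothing to the sum.
c3 = archimedean([0.3, 0.7, 1.0], theta=2.0)
c2 = archimedean([0.3, 0.7], theta=2.0)
```

Here `c3` and `c2` agree up to floating-point error, and the boundary conditions $C(u,1) = u$ and $C(u,0) = 0$ can be checked in the same way.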
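Now we consider estimation of the marginal distribution function $F_1$. Suppose that the observed data are $n$ i.i.d. replications of $\{U, V, \delta_1 = I(T \le U), \delta_2 = I(U < T \le V)\}$, expressed as $\{U_i, V_i, \delta_{i1} = I(T_i \le U_i), \delta_{i2} = I(U_i < T_i \le V_i);\ i = 1,\dots,n\}$. For the problem, following Zhu et al. (2008), we propose to divide the observed data into two sets of current status data,
$\Sigma_1 = \{U_i, \delta_{i1};\ i = 1,\dots,n\}, \quad \Sigma_2 = \{V_i, \delta_{i3} = \delta_{i1} + \delta_{i2};\ i = 1,\dots,n\},$
and to estimate $F_1$ based on each of the two data sets separately.
To present the estimation procedure, let $C$ denote the Copula function defined in (6) for the joint distribution function $H$ of $T$, $U$ and $V$, and let $C^{(1,2)}(u_1,u_2)$ and $C^{(1,3)}(u_1,u_3)$ denote the resulting marginal joint distribution functions of $(T,U)$ and $(T,V)$, respectively, with $F_3$ here taken as the marginal distribution of $V$. Define $p_1(x) = P(T \le U, U \le x)$ and $p_2(y) = P(T \le V, V \le y)$. Then one can show that
$p_1(x) = \int_0^x C_2^{(1,2)}(F_1(u), F_2(u))\,dF_2(u)$ (8)
and
$p_2(y) = \int_0^y C_3^{(1,3)}(F_1(v), F_3(v))\,dF_3(v),$ (9)
where $C_2^{(1,2)}(u_1,u_2) = \partial C^{(1,2)}(u_1,u_2)/\partial u_2$ and $C_3^{(1,3)}(u_1,u_3) = \partial C^{(1,3)}(u_1,u_3)/\partial u_3$.
As mentioned before, one can easily estimate $F_2$ and $F_3$ by their empirical estimates, since we have complete data on the $U_i$'s and $V_i$'s. This is also true for $p_1(x)$ and $p_2(x)$, and thus it is natural to estimate $F_1$ based on (8) and (9) by replacing $F_2$, $F_3$, $p_1(x)$ and $p_2(x)$ with their empirical estimates.
Specifically, let $x_0 = 0 < x_1 < x_2 < \dots < x_k$ denote a sequence of fixed time points; their selection will be discussed below. Also let $\hat F_2$ and $\hat F_3$ denote the empirical estimates of $F_2$ and $F_3$, respectively. Define
$\hat f_2(x_j) = \hat F_2(x_j) - \hat F_2(x_{j-1}), \quad \hat f_3(x_j) = \hat F_3(x_j) - \hat F_3(x_{j-1}), \quad j = 1,\dots,k,$
and let $\hat p_1(x)$ and $\hat p_2(x)$ denote the empirical estimates of $p_1(x)$ and $p_2(x)$, so that these four quantities are the empirical estimates of $dF_2$, $dF_3$, $p_1(x)$ and $p_2(x)$, respectively. Then, based on equation (8), it is natural to define an estimate $\hat F_1^{(1)}$ of $F_1$ as follows: given $\hat F_1^{(1)}(x_1),\dots,\hat F_1^{(1)}(x_{j-1})$, define the estimate $\hat F_1^{(1)}(x_j)$ as the solution to the empirical version of equation (8), with $F_1(x_l)$ replaced by $\hat F_1^{(1)}(x_l)$ for $l = 1,\dots,j-1$, $j = 1,\dots,k$. Similarly, based on equation (9), another natural estimate $\hat F_1^{(2)}$ of $F_1$ can be defined as follows: given $\hat F_1^{(2)}(x_1),\dots,\hat F_1^{(2)}(x_{j-1})$, define the estimate $\hat F_1^{(2)}(x_j)$ as the solution to the empirical version of equation (9), with $F_1(x_l)$ replaced by $\hat F_1^{(2)}(x_l)$ for $l = 1,\dots,j-1$, $j = 1,\dots,k$.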
Given the $\hat F_1^{(1)}(x_j)$'s and $\hat F_1^{(2)}(x_j)$'s, one can estimate $F_1$ simply by $\hat F_1(x_j) = (\hat F_1^{(1)}(x_j) + \hat F_1^{(2)}(x_j))/2$ for $j = 1,\dots,k$. Note that the estimate $\hat F_1$ given above is defined only at the $x_j$'s. For its values between the $x_j$'s, it is natural to define them through linear interpolation. It can be easily shown that the estimate proposed above is consistent and, for large $n$, nondecreasing.
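The grid-based procedure can be sketched in code. This is a rough illustration under stated assumptions, not the thesis's implementation: all function names are hypothetical, the Frank Copula is used only as an example, and the integral equation is discretized cell by cell (the increment of the empirical $p$ over the increment of the empirical distribution of the observation time is equated to the conditional probability $\partial C/\partial v$), which is one simple way to solve for the estimate one grid point at a time.

```python
import math

def frank_cv(u, v, theta):
    """dC/dv for the Frank copula (theta != 0), i.e. P(U <= u | V = v)."""
    a = math.expm1(-theta * u)
    b = math.expm1(-theta * v)
    c = math.expm1(-theta)
    return a * math.exp(-theta * v) / (c + a * b)

def solve_u(target, v, cv, tol=1e-10):
    """Find u in [0, 1] with cv(u, v) = target by bisection (cv is nondecreasing in u)."""
    target = min(max(target, 0.0), 1.0)
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if cv(mid, v) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def estimate_F(times, deltas, grid, cv):
    """Estimate F at the grid points from current status data (times[i], deltas[i])."""
    n = len(times)
    est, prev = [], 0.0
    for j, xj in enumerate(grid):
        x_prev = grid[j - 1] if j > 0 else 0.0
        in_cell = [d for t, d in zip(times, deltas) if x_prev < t <= xj]
        if in_cell:  # empty cells carry the previous value forward
            g_hat = sum(1 for t in times if t <= xj) / n
            ratio = sum(in_cell) / len(in_cell)  # increment of p over increment of G
            prev = solve_u(ratio, g_hat, cv)
        est.append(prev)
    return est

def estimate_F1(U, V, d1, d2, grid, cv12, cv13):
    """Average the two current status estimates built from (U, d1) and (V, d1 + d2)."""
    d3 = [a + b for a, b in zip(d1, d2)]
    e1 = estimate_F(U, d1, grid, cv12)
    e2 = estimate_F(V, d3, grid, cv13)
    return [(a + b) / 2.0 for a, b in zip(e1, e2)]

def interpolate(grid, values, x):
    """Linear interpolation between grid points, assuming the estimate is 0 at time 0."""
    if x <= 0.0:
        return 0.0
    pts = [(0.0, 0.0)] + list(zip(grid, values))
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        if x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    return pts[-1][1]  # constant beyond the last grid point
```

For example, `estimate_F1(U, V, d1, d2, grid, cv, cv)` with `cv = lambda u, v: frank_cv(u, v, 2.0)` would give the averaged estimate at the grid points under an assumed Frank Copula with parameter 2. Repeating the call over a range of parameter values is one way to carry out the sensitivity analysis mentioned in the text. In finite samples this sketch does not force the estimate to be nondecreasing.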