Censoring is a selection phenomenon that occurs when we can only obtain precise measurements within a certain range of values. The classical example is a scale which can only measure weights up to a certain threshold, say \(y_{\mathrm{max}}\). A scale that can measure any weight would produce the set of measurements \(\{y_1^*, y_2^*, \ldots, y_n^*\}\) where \(n\) is the sample size. We call these the latent values or true values. Our imperfect scale would then produce the observed values, \[
y_i = \begin{cases}
y_i^*, & y_i^* \leq y_{\mathrm{max}} \\
y_{\mathrm{max}}, & y_i^* > y_{\mathrm{max}}
\end{cases}; \quad i = 1, \ldots, n.
\] Specifically, this is an example of right censoring, where there is an upper limit of detection: a maximum value that we can observe with precision. Right censoring is also incredibly common in epidemiology and biomedical studies: when we conduct an observational study or clinical trial, we expect to observe some individuals who have not experienced the outcome of interest before the trial ends. These individuals have a time-to-event with a lower bound equal to their time under observation (the event could have occurred the moment we stopped observing) and no finite upper bound, because the event may never occur. Such an individual's survival time would be right-censored, and many common survival analysis methods can account for right-censored data (assuming an appropriate parametric distribution for the outcome).
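To make the observation process concrete, here is a minimal simulation sketch in Python; the lognormal distribution for the latent weights and the threshold of 100 are arbitrary choices for illustration, not part of the example above.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical latent weights: the distribution and the threshold below are
# made up for this sketch, not taken from any real scale.
y_star = rng.lognormal(mean=4.0, sigma=0.5, size=100)  # latent values y_i^*
y_max = 100.0                                          # scale's maximum reading

# Right censoring: any latent value above y_max is recorded as y_max.
y_obs = np.minimum(y_star, y_max)
is_censored = y_star > y_max

print(f"{is_censored.sum()} of {y_obs.size} observations are right-censored")
```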
We can also have the opposite case, where we can detect any theoretical measurement above a certain value. In this case, the data are said to have a lower limit of detection and this phenomenon is called left censoring. For example, imagine we are testing the concentration of lead in the tap water of several buildings. Our test cannot detect lead levels below 1 part per billion (ppb), but can detect any larger amount of lead. In that case, our observed values would instead look like this: \[
\begin{aligned}
y_i &= \begin{cases}
y_i^*, & y_i^* \geq y_{\mathrm{min}} \\
y_{\mathrm{min}}, & y_i^* < y_{\mathrm{min}}
\end{cases};\\
i &= 1, \ldots, n; \\
y_{\min} &= 1 \text{ ppb;}
\end{aligned}
\] where \(y_{\mathrm{min}}\) is the lower limit of detection (LOD). I don't know anything about how lead concentration tests work, but suppose that at a certain point our test "saturates": we know that the concentration is at or above, say, 5 parts per million (ppm), but the test cannot distinguish any higher values. Then some of our observations would be left-censored (below the lower LOD) and some observations would be right-censored (above the upper LOD). So the lead level data that we observe would look like this:
\[
\begin{aligned}
y_i &= \begin{cases}
y_{\mathrm{min}}, & y_i^* < y_{\mathrm{min}} \\
y_i^*, & y_{\max} \geq y_i^* \geq y_{\mathrm{min}} \\
y_{\max}, & y_i^* > y_{\max}
\end{cases};\\
i &= 1, \ldots, n; \\
y_{\min} &= 1 \text{ ppb;} \\
y_{\max} &= 5 \text{ ppm;}
\end{aligned}
\] Note that for a continuous random variable like (presumably) the level of lead in drinking water, it doesn’t really matter which categories include the boundary values, because \(P(Y_i^* = y_i^*) = 0\) for any value of \(y_i^*\) including \(y_{\min}\) and \(y_{\max}\). If the censored variable is not absolutely continuous, then this distinction may need to be clarified and depends on the observation process.
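Here is a quick sketch of this doubly censored observation process; the lognormal parameters for the latent concentrations are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical latent lead concentrations in ppb; the lognormal parameters
# are made up for illustration.
y_star = rng.lognormal(mean=2.0, sigma=2.0, size=1000)

y_min = 1.0     # lower limit of detection: 1 ppb
y_max = 5000.0  # upper limit of detection: 5 ppm = 5000 ppb

# Doubly censored observations: values below y_min are recorded as y_min,
# values above y_max are recorded as y_max, everything else is exact.
y_obs = np.clip(y_star, y_min, y_max)

print("left-censored: ", np.sum(y_star < y_min))
print("right-censored:", np.sum(y_star > y_max))
```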
Finally, we can have interval censoring, where we know a data value is within some interval, but we do not know precisely where the value lies within that interval. We could imagine a lead test that looks like this (whether such a test could exist in reality, I am unsure). Suppose that the test involves dipping a test strip into water, and the test strip changes color based on the amount of lead present in the water. However, only a discrete number of shades can be visibly distinguished, so even though the underlying variable of interest is continuous (again, the lead concentration), we only observe a discrete set of outcomes. For example, if our test can only reliably discern 6 different concentration levels, the data might look like this. The table below also includes a coding scheme in the "Data Value" column showing how we might write down the data.

| Lead Concentration (ppb)      | Data Value |
|-------------------------------|-----------:|
| less than 1                   | 0          |
| at least 1, less than 10      | 5          |
| at least 10, less than 100    | 50         |
| at least 100, less than 1000  | 500        |
| at least 1000, less than 5000 | 2500       |
| 5000 or more                  | 5000       |

In general there is no unique or standardized way to record interval censored data, and the data model will depend heavily on the observation process and the coding scheme for the values. For this specific example, we could write down the observation model like this:
\[
y_i = \begin{cases}
0, & y_i^* < 1 \\
5, & 1 \leq y_i^* < 10 \\
50, & 10 \leq y_i^* < 100 \\
500, & 100 \leq y_i^* < 1000 \\
2500, & 1000 \leq y_i^* < 5000 \\
5000, & y_i^* \geq 5000
\end{cases}; \quad i = 1, \ldots, n.
\] It is also possible to write this model in a more compact form; we will discuss this further in a detailed section on interval censoring.
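As a quick illustration of the coding scheme, here is one way to implement this observation model; the function name `code_lead` is just a placeholder for this sketch.

```python
import numpy as np

# Bin edges and recorded codes from the observation model above.
edges = np.array([1.0, 10.0, 100.0, 1000.0, 5000.0])
codes = np.array([0, 5, 50, 500, 2500, 5000])

def code_lead(y_star):
    """Map latent concentrations (ppb) to the discrete recorded values."""
    # np.digitize returns the index of the bin each value falls into,
    # using half-open bins [edges[k-1], edges[k]) when right=False.
    return codes[np.digitize(y_star, edges, right=False)]

print(code_lead(np.array([0.4, 3.0, 42.0, 950.0, 1200.0, 9999.0])))
# -> [   0    5   50  500 2500 5000]
```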
Another example of this is the influenza hemagglutination inhibition (HAI) titer, which we will dedicate an entire case study to later on. Briefly, the values are typically reported as 10, 20, 40, etc., but a value of 10 does not mean the precise value of the measurement is 10; it means the true value is between 10 and 20. If we assume our titer is measured on the log scale and has no limits of detection, we could write \[
y_i = \lfloor y_i^* \rfloor; \quad i = 1, \ldots, n,
\] because we only perform a discrete set of dilutions. This tells us the interval in which the latent value lies: \[y_i^* \in \left[\lfloor y_i^* \rfloor, \lfloor y_i^* + 1 \rfloor\right) = \left[y_i, y_i + 1\right).\]
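For concreteness: if the latent log-scale value were \(y_i^* = 3.4\), we would record \(y_i = \lfloor 3.4 \rfloor = 3\), and from the recorded value alone we could conclude only that \(y_i^* \in [3, 4)\).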
A given variable can be subject to all of these types of censoring simultaneously: for example, HAI titers are interval censored in this way, but they also have lower and upper limits of detection (though the upper limits are rarely important in practice, because they can be arbitrarily increased during the assay). However, a particular observation of this variable can only be subject to one type of censoring at a time: e.g., if an observation is below the detection limit, that value is left censored; it cannot simultaneously be right censored or interval censored.
Notably, the distinction between these "types" of censoring is useful for several analytic methods, but it is not strictly necessary. All censored values can be treated as interval censored data, where the lower endpoint for a left censored value is negative infinity, and the upper endpoint for a right censored value is positive infinity. Thus, we could write the data generating process for HAI titers with limits of detection as \[
y_i = \begin{cases}
y_{l}, & \ y_i^* < y_{\text{min}}\\
\lfloor y_i^* \rfloor, &\ y_{\text{min}} \leq y_i^* < y_{\text{max}} \\
y_{u}, & y_i^* \geq y_{\text{max}}
\end{cases},
\] where \(y_{\text{min}}\) is the lower limit of detection and \(y_{\text{max}}\) is the upper limit of detection. Here, \(y_l \leq y_{\min}\) and \(y_u \geq y_{\max}\) are arbitrary constants that can be the same as the limits of detection or not; we can record the censored values however we want, so long as we know which censoring category each \(y_i\) value falls into.
To express the DGP in interval notation, we would write \[
y_i \in \begin{cases}
\left(-\infty, y_{\text{min}}\right), & y_i^* < y_{\text{min}}\\
\left[\lfloor y_i^* \rfloor, \lfloor y_i^* + 1 \rfloor\right), & y_{\text{min}} \leq y_i^* < y_{\text{max}} \\
\left[y_{\text{max}}, \infty\right), & y_i^* \geq y_{\text{max}}
\end{cases}.
\] Note also that, assuming \(y^*_i\) is drawn from an absolutely continuous distribution (e.g., normal or lognormal), the final likelihood model will be equivalent regardless of which interval endpoints are open or closed. This representation allows us to put all of the censored observations into the likelihood function in the same framework, without having to sort the observations into buckets by type of censoring.
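To make this concrete, here is a small sketch that converts each observation into a (lower, upper) pair; the three-level censoring code is a convention chosen for this sketch, not notation from the text.

```python
import numpy as np

def censoring_intervals(y_obs, c, y_min, y_max):
    """Return (lower, upper) interval endpoints for each observation.

    c uses a three-level code (a convention chosen for this sketch):
    -1 = left censored, 0 = interval censored, 1 = right censored.
    """
    lower = np.where(c == -1, -np.inf, np.where(c == 1, y_max, np.floor(y_obs)))
    upper = np.where(c == -1, y_min, np.where(c == 1, np.inf, np.floor(y_obs) + 1.0))
    return lower, upper

# Two detectable log titers, plus one value recorded at each limit of
# detection (i.e., taking y_l = y_min and y_u = y_max).
y_obs = np.array([2.0, 4.7, 0.0, 8.0])
c = np.array([0, 0, -1, 1])
print(censoring_intervals(y_obs, c, y_min=0.0, y_max=8.0))
# lower: [ 2.,  4., -inf,  8.]; upper: [ 3.,  5.,  0., inf]
```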
The probability of each observation \(y_i\) can then be expressed as the probability that the latent random variable \(Y_i^*\) takes on a realization inside the given interval. That is, if \(y_i^*\) is detectable (greater than the lower LOD, but less than the upper LOD), we would write \[P(Y_i = y_i) = P\left(\lfloor y_i^* + 1 \rfloor \geq Y_i^* \geq \lfloor y_i^* \rfloor \right).\] Recall that every random variable \(Y\) has a cumulative distribution function, or CDF, defined as \[F_Y(y) = P(Y \leq y).\] Using these definitions, we can say (proof omitted) that \[
\begin{aligned}
P(Y_i = y_i) &= P\left(\lfloor y_i^* + 1 \rfloor \geq Y_i^* \geq \lfloor y_i^* \rfloor \right) \\
&= P(Y_i^* \leq \lfloor y_i^* + 1 \rfloor) - P(Y_i^* \leq \lfloor y_i^* \rfloor) \\
&= F_Y(\lfloor y_i^* + 1 \rfloor) - F_Y(\lfloor y_i^* \rfloor).
\end{aligned}
\] This applies to the nondetectable measurements as well, but we need to take the limit as the interval endpoint goes to \(\pm \infty\). Fortunately, these limits are a defining property of the CDF: \(\lim_{a \to -\infty} F(a) = 0\) and \(\lim_{b \to \infty} F(b) = 1\). So we can write \[
\begin{aligned}
P(Y_i = y_\min) &= \lim_{a \to -\infty}P\left(y_{\text{min}} \geq Y_i^* \geq a \right) \\
&= F(y_\min) - \lim_{a \to -\infty} F(a) \\
&= F(y_\min) - 0 = F(y_\min).
\end{aligned}
\] Similarly, we get that \[
\begin{aligned}
P(Y_i = y_\max) &= \lim_{b \to \infty}P\left(b \geq Y_i^* \geq y_\max \right) \\
&= \lim_{b \to \infty} F(b) - F(y_\max) \\
&= 1 - F(y_\max).
\end{aligned}
\]
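To see these formulas in action, here is a sketch that computes each kind of probability, assuming (purely for illustration) that the latent values follow a normal distribution with made-up parameters.

```python
from scipy import stats

# Assume, purely for illustration, that the latent log-scale values follow
# a normal distribution with these made-up parameters.
dist = stats.norm(loc=3.0, scale=1.5)

y_min, y_max = 0.0, 8.0  # hypothetical limits of detection (log scale)

# Interval-censored observation: y_i = floor(y_i^*) = 2, so y_i^* is in [2, 3).
p_interval = dist.cdf(3.0) - dist.cdf(2.0)  # F(3) - F(2)

# Left-censored observation: y_i^* below the lower LOD.
p_left = dist.cdf(y_min)                    # F(y_min)

# Right-censored observation: y_i^* at or above the upper LOD.
p_right = 1.0 - dist.cdf(y_max)             # 1 - F(y_max)

print(p_interval, p_left, p_right)
```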
At first, it can be difficult to see how these probability calculations are useful for statistical data analysis. To fully make the connection, we will need to make use of an indicator variable, say \(c_i\), which is equal to 1 when the observation \(y_i\) is censored, and equal to 0 if \(y_i\) is not censored. To fit a parametric model with parameter vector \(\theta\), we often turn to the likelihood function, which is \[
\mathcal{L}\left(\theta \mid y \right) = f_Y(y \mid \theta)
\] if there are no censored values (here we implicitly assume that \(y\) is absolutely continuous). The function \(f_Y(y \mid \theta)\) is the probability density function of random variable \(Y\) where we consider the parameters to be fixed at the value \(\theta\).
In constructing the likelihood for our observation \(Y_i\), we run into the slightly tricky business[^1] that \(Y_i\) is not always continuous: if the \(i\)th observation is censored, \(Y_i\) actually becomes discrete. We will not address the technicalities of mixing continuous and discrete components here, but suffice it to say that if the parametric model we assume for \(Y_i\) is any standard distribution (all GLM-compatible distributions and many others besides), we can say that \[
\mathcal{L}(\theta \mid y_i) = \bigg( f(y_i \mid \theta) \bigg)^{1 - c_i} \bigg( P(Y_i = y_i)\bigg)^{c_i}.
\] That is, we can use the continuous probability density \(f\) when \(y_i\) is not censored and the discrete probability of the realization \(y_i\) if the observation is censored.
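Here is a minimal sketch of this likelihood contribution, assuming a normal model for the latent values; the helper `loglik_i` and the way it infers the censoring type from the recorded value are conventions of this illustration, not notation from the text.

```python
import numpy as np
from scipy import stats

def loglik_i(theta, y, c, y_min, y_max):
    """Log-likelihood contribution of one observation under a normal model.

    theta = (mu, sigma); c is the indicator from the text (1 = censored).
    For censored observations we also need to know *how* the value is
    censored; this sketch infers it from the recorded value, assuming
    censored values are recorded at the limits of detection.
    """
    mu, sigma = theta
    dist = stats.norm(loc=mu, scale=sigma)
    if c == 0:
        return dist.logpdf(y)                # f(y | theta) for precise values
    if y <= y_min:                           # left censored: F(y_min)
        return np.log(dist.cdf(y_min))
    if y >= y_max:                           # right censored: 1 - F(y_max)
        return np.log(1.0 - dist.cdf(y_max))
    # interval censored: F(floor(y) + 1) - F(floor(y))
    return np.log(dist.cdf(np.floor(y) + 1.0) - dist.cdf(np.floor(y)))
```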
We can then (assuming mutual, potentially conditional, independence of the observations) write the likelihood of the sample by multiplying together all of our individual \(y_i\) likelihoods. With the sample likelihood in hand, we can proceed to do whatever kinds of torture we would normally do in modern statistics, like finding parameter values that maximize the likelihood, or adding priors and using Bayes’ theorem.
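Continuing the sketch above (reusing `loglik_i`), here is a minimal example of maximizing the resulting sample log-likelihood; the toy data and the log-sigma parameterization are choices made for this illustration.

```python
import numpy as np
from scipy import optimize

def negloglik(params, y, c, y_min, y_max):
    mu, log_sigma = params
    # Optimize over log(sigma) so that sigma stays positive.
    theta = (mu, np.exp(log_sigma))
    # Sum of individual log-likelihood contributions (independence).
    return -sum(loglik_i(theta, yi, ci, y_min, y_max) for yi, ci in zip(y, c))

# Toy data on the log scale: three precise values plus one observation
# recorded at each limit of detection (y_min = 0, y_max = 8).
y = [2.3, 3.1, 4.8, 0.0, 8.0]
c = [0, 0, 0, 1, 1]

res = optimize.minimize(negloglik, x0=[3.0, 0.0], args=(y, c, 0.0, 8.0),
                        method="Nelder-Mead")
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])
print(mu_hat, sigma_hat)  # maximum likelihood estimates of mu and sigma
```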
In the following chapters, we'll discuss a bit more theory but primarily go through some examples. Notably, we've only discussed univariate analyses here, so the first example will walk through dealing with a censored outcome variable. Dealing with censored predictor (independent) variables is much more involved, and will be explained in its own example.
[^1]: To really get into the details of this, you need Lebesgue's decomposition theorem, which relies on measure theory. Writing out the likelihood in a more general form without the censoring indicators is also not possible without measure theory, because it relies on the Radon-Nikodym theorem. If this is the part you want to learn about, you will have to learn it from someone other than me!