Robust estimation of correlation coefficient for censored data

3 minute read

Published:

-

Background

We recently got an interesting request from our collaborators from the bench side. They have two types of experimental assays, let’s call them assay X and assay Y, that measure the activity of the same biological pathway. The difference between them is that assay Y has a limit of detection of certain value U, and it reports any value lower than U as U; on the contrary, X is much sensitive and effectively have no limit of detection. In other word, Y is left-censored, whereas X is not censored at all. A simulated data is shown below.

Our collaborators would like to learn how “consistent” these two assays are. A popular metrics for between-assay consistency is Pearson correlation coefficient. However, it is obviously inappropriate to be used here, unless the censoring is accounted for. In addition, we need not only the point estimation, but also a confidence interval, of correlation coefficient. To address this challenge, one of my colleagues developed an approach to robustly estimate Pearson correlation coefficient (and its credible interval) for such censored data. The original workflow was implemented in JAGS and SAS. I am curious if I can recreate the analysis using R and Stan (I mean, why not?). So here we go!

Simulate a toy example data

Let \(X\) be a random variable, and define random variable \(Y\) as:

\[\displaylines{ y_n = \alpha + \beta x_n + \epsilon_n, \text{ where } \epsilon_n \sim normal(0, \sigma) }\]

…and \(Y\) is left-censored at a known threshold, 0:

\[\displaylines{ \begin{equation} y_{n} = \begin{cases} y_n & \text{if $y_n$ > 0}\\ 0 & \text{if $y_n$ <= 0 } \end{cases} \end{equation} }\]

An example of simulated toy data is shown as below:

Stan model

We then define the following model using Stan. It imputes censored data points of \(Y\), and Pearson correlation coefficient is computed as:

\[\displaylines{ \rho = \frac{Cov(X,Y_{imputed})}{sd(X) \times sd(Y_{imputed})} }\]
data {
  int<lower=0> N_obs;          // Number of observed data points
  int<lower=0> N_cens;         // Number of censored data points
  vector[N_obs] x_obs;      // Observed predictor matrix
  vector[N_cens] x_cens;    // Censored predictor matrix
  vector[N_obs] y_obs;
  real sd_x;
}

transformed data{
  int<lower=0> N;
  N = N_obs + N_cens;
  
  vector[N] x;
  x = append_row(x_obs, x_cens);
  
  
}

parameters {
  real beta;       // coefficients for predictors
  real alpha;
  real<lower=0> sigma;  // error scale
  vector<upper=0>[N_cens] y_cens;
}

transformed parameters {
  
  vector[N] y_imputed;
  y_imputed = append_row(y_obs, y_cens);
  
  real cov_x_y;
  cov_x_y = sum((y_imputed - mean(y_imputed)) .* (x - mean(x))) / (N-1);
  
  real<lower=0> sd_y;
  sd_y = sqrt(square(beta * sd_x) + square(sigma));
  
  real<lower = -1, upper=1> rho;
  rho = cov_x_y/sd_x/sd_y;
  
}
model {
  y_obs ~ normal(alpha + x_obs * beta, sigma);  // target density - observed points
  y_cens ~ normal(alpha + x_cens * beta, sigma); // target density - censored points
}


Since \(\rho\) is computed for each draw, we can then obtain the posterior distribution of \(\rho\) and get its 95% credible interval. And here is the final output, where the simulated data, the point estimation and 95% credible interval of \(\rho\), and a regression line of \(X\) and \(Y_{imputed}\) are shown on the same plot:

And the posterior distribution of \(\rho\) and the ground-truth value of \(\rho\) are shown below:

Recap

  • Several R packages allow one to fit censored data using Tobit model. One advantages of using Bayesian analysis, instead, is the ability of analyzing complex model,e.g. hierarchical models.