Cleavage Prediction with Logistic Regression
Whether a site is predicted to be cleaved or not cleaved depends on which amino acids occur at the locations surrounding that site. Cleavage can be treated as a binary outcome because there are only two possibilities: cleaved and not cleaved. We are therefore interested in the probability that a site will be cleaved, given the amino acids that surround it.
A common approach to predicting binary outcomes is logistic regression, used here to describe the relationship between cleavage at a site and the amino acids that surround that site. The advantage of logistic regression is that the predicted probability is always constrained between zero and one, so negative probabilities and probabilities greater than one never occur.
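The bounding property can be illustrated with a minimal Python sketch of the logistic function. The function name `logistic` and the example inputs are illustrative assumptions, not taken from the source:

```python
import math


def logistic(z: float) -> float:
    """Map a real-valued linear predictor z to a probability between 0 and 1."""
    # Use the algebraically equivalent form for negative z so that
    # math.exp never receives a large positive argument (avoids overflow).
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


# Illustrative linear-predictor values (assumed, not from the source).
for z in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(f"z = {z:6.1f}  ->  p = {logistic(z):.5f}")
```

However large or small the linear predictor becomes, the output stays within the unit interval, which is why the model cannot produce an invalid probability.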
Logistic regression uses the logit function to model the log-odds of cleavage at a site as a linear function of the amino acids at the different locations that surround the site:
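One standard way to write such a model is sketched below. The notation is an assumption made for illustration and is not taken from the source: $p$ is the cleavage probability, $j = 1, \dots, J$ indexes the locations surrounding the site, $\mathcal{A}$ is the set of amino acids, $x_{j,a}$ equals 1 if amino acid $a$ occurs at location $j$ and 0 otherwise, and $\beta_0$ and $\beta_{j,a}$ are the regression coefficients.

```latex
\operatorname{logit}(p) = \ln\!\left(\frac{p}{1 - p}\right)
  = \beta_0 + \sum_{j=1}^{J} \sum_{a \in \mathcal{A}} \beta_{j,a}\, x_{j,a}
```

In practice, one reference amino acid per location is usually omitted from the sum so that the coefficients remain identifiable alongside the intercept $\beta_0$.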
Given the specific amino acids present at each location surrounding the site, the cleavage probability of a site can be calculated as:
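Under a standard logistic regression formulation, this calculation can be sketched as follows. The notation is assumed for illustration rather than taken from the source: $a_j$ denotes the amino acid observed at location $j$, $\beta_{j,a_j}$ is the coefficient for that amino acid at that location, $\beta_0$ is the intercept, and $J$ is the number of locations considered.

```latex
p = \frac{\exp\!\left(\beta_0 + \sum_{j=1}^{J} \beta_{j,a_j}\right)}
         {1 + \exp\!\left(\beta_0 + \sum_{j=1}^{J} \beta_{j,a_j}\right)}
  = \frac{1}{1 + \exp\!\left(-\beta_0 - \sum_{j=1}^{J} \beta_{j,a_j}\right)}
```

This is the inverse of the logit transformation: the linear predictor can take any real value, while $p$ always lies between zero and one.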
