Nan Xiao <https://nanx.me>
Introduction
The protr package offers a unique and comprehensive toolkit forgenerating various numerical representation schemes of proteinsequences. The descriptors included are extensively utilized inbioinformatics and chemogenomics research.
The commonly used descriptors listed in protr include amino acidcomposition, autocorrelation, CTD, conjoint traid, quasi-sequence order,pseudo amino acid composition, and profile-based descriptors derived byPosition-Specific Scoring Matrix (PSSM).
The descriptors for proteochemometric (PCM) modeling include thescales-based descriptors derived by principal components analysis,factor analysis, multidimensional scaling, amino acid properties(AAindex), 20+ classes of 2D and 3D molecular descriptors (for example,Topological, WHIM, VHSE, among others), and BLOSUM/PAM matrix-deriveddescriptors.
The protr package also implemented parallelized similaritycomputation derived by pairwise protein sequence alignment and GeneOntology (GO) semantic similarity measures. ProtrWeb, the webapplication built on protr, can be accessed from http://protr.org.
If you find protr useful in your research, please feel free to citeour paper:
Nan Xiao, Dong-Sheng Cao, Min-Feng Zhu, and Qing-Song Xu. (2015).protr/ProtrWeb: R package and web server for generating variousnumerical representation schemes of protein sequences.Bioinformatics 31 (11), 1857-1859.
BibTeX entry:
@article{Xiao2015, author = {Xiao, Nan and Cao, Dong-Sheng and Zhu, Min-Feng and Xu, Qing-Song.}, title = {protr/{ProtrWeb}: {R} package and web server for generating various numerical representation schemes of protein sequences}, journal = {Bioinformatics}, year = {2015}, volume = {31}, number = {11}, pages = {1857--1859}, doi = {10.1093/bioinformatics/btv042}}
An example predictive modeling workflow
Here, we use the subcellular localization dataset of human proteinspresented in Chou and Shen (2008) todemonstrate the basic usage of protr.
The complete dataset includes 3,134 protein sequences (2,750different proteins) classified into 14 human subcellular locations. Weselected two classes of proteins as our benchmark dataset. Class 1contains 325 extracell proteins and class 2 includes 307mitochondrion proteins. We aim to build a random forestclassification model to classify these two types of proteins.
First, we load the protr package, then read the protein sequencesstored in two separated FASTA files with readFASTA()
:
library("protr")# Load FASTA files# Note that `system.file()` is for accessing example files# in the protr package. Replace it with your own file path.extracell <- readFASTA( system.file( "protseq/extracell.fasta", package = "protr" ))mitonchon <- readFASTA( system.file( "protseq/mitochondrion.fasta", package = "protr" ))
To read protein sequences stored in PDB format files, usereadPDB()
instead. The loaded sequences will be stored astwo lists in R, and each component is a character string representingone protein sequence. In this case, there are 325 extracellprotein sequences and 306 mitonchon protein sequences:
length(extracell)
## [1] 325
length(mitonchon)
## [1] 306
To ensure that the protein sequences only have the 20 standard aminoacid types, which is usually required for the descriptor computation, weuse the protcheck()
function to do the amino acid typesanity check and remove the non-standard sequences:
extracell <- extracell[(sapply(extracell, protcheck))]mitonchon <- mitonchon[(sapply(mitonchon, protcheck))]
length(extracell)
## [1] 323
length(mitonchon)
## [1] 304
Two protein sequences were removed from each class. For the remainingsequences, we calculate the Type II PseAAC descriptor, i.e., theamphiphilic pseudo amino acid composition (APseAAC) descriptor (Chou 2005) and make class labels forclassification modeling.
# Calculate APseAAC descriptorsx1 <- t(sapply(extracell, extractAPAAC))x2 <- t(sapply(mitonchon, extractAPAAC))x <- rbind(x1, x2)# Make class labelslabels <- as.factor(c(rep(0, length(extracell)), rep(1, length(mitonchon))))
In protr, the functions of commonly used descriptors for proteinsequences and proteochemometric (PCM) modeling descriptors are namedafter extract...()
.
Next, we will split the data into a 75% training set and a 25% testset.
set.seed(1001)# Split training and test settr.idx <- c( sample(1:nrow(x1), round(nrow(x1) * 0.75)), sample(nrow(x1) + 1:nrow(x2), round(nrow(x2) * 0.75)))te.idx <- setdiff(1:nrow(x), tr.idx)x.tr <- x[tr.idx, ]x.te <- x[te.idx, ]y.tr <- labels[tr.idx]y.te <- labels[te.idx]
We will train a random forest classification model on the trainingset with 5-fold cross-validation, using the randomForest
package.
library("randomForest")rf.fit <- randomForest(x.tr, y.tr, cv.fold = 5)rf.fit
The training result is:
## Call:## randomForest(x = x.tr, y = y.tr, cv.fold = 5)## Type of random forest: classification## Number of trees: 500## No. of variables tried at each split: 8#### OOB estimate of error rate: 25.11%## Confusion matrix:## 0 1 class.error## 0 196 46 0.1900826## 1 72 156 0.3157895
With the model trained on the training set, we predict on the testset and plot the ROC curve with the pROC
package, as isshown in Figure 1.
# Predict on test setrf.pred <- predict(rf.fit, newdata = x.te, type = "prob")[, 1]# Plot ROC curvelibrary("pROC")plot.roc(y.te, rf.pred, grid = TRUE, print.auc = TRUE)
The area under the ROC curve (AUC) is:
## Call:## plot.roc.default(x = y.te, predictor = rf.pred, col = "#0080ff",## grid = TRUE, print.auc = TRUE)#### Data: rf.pred in 81 controls (y.te 0) > 76 cases (y.te 1).## Area under the curve: 0.8697
Figure 1: ROC curve for the test set of protein subcellular localizationdata.
Package overview
The protr package (Xiao et al. 2015)implemented most of the state-of-the-art protein sequence descriptorswith R. Generally, each type of the descriptor (feature) can becalculated with a function named extractX()
in the protrpackage, where X
stands for the abbreviation of thedescriptor name. The descriptors are:
- Amino acid composition
extractAAC()
- Amino acid compositionextractDC()
- Dipeptide compositionextractTC()
- Tripeptide composition
- Autocorrelation
extractMoreauBroto()
- Normalized Moreau-BrotoautocorrelationextractMoran()
- Moran autocorrelationextractGeary()
- Geary autocorrelation
- CTD descriptors
extractCTDC()
- CompositionextractCTDT()
- TransitionextractCTDD()
- Distribution
- Conjoint triad descriptors
extractCTriad()
- Conjoint triad descriptors
- Quasi-sequence-order descriptors
extractSOCN()
- Sequence-order-coupling numberextractQSO()
- Quasi-sequence-order descriptors
- Pseudo-amino acid composition
extractPAAC()
- Pseudo-amino acid composition(PseAAC)extractAPAAC()
- Amphiphilic pseudo-amino acidcomposition (APseAAC)
- Profile-based descriptors
extractPSSM()
extractPSSMAcc()
extractPSSMFeature()
The descriptors commonly used in Proteochemometric Modeling (PCM)implemented in protr include:
extractScales()
,extractScalesGap()
-Scales-based descriptors derived by Principal Components AnalysisextractProtFP()
,extractProtFPGap()
-Scales-based descriptors derived by amino acid properties from AAindex(a.k.a. Protein Fingerprint)extractDescScales()
- Scales-based descriptors derivedby 20+ classes of 2D and 3D molecular descriptors (Topological, WHIM,VHSE, etc.)
extractFAScales()
- Scales-based descriptors derived byFactor AnalysisextractMDSScales()
- Scales-based descriptors derivedby Multidimensional ScalingextractBLOSUM()
- BLOSUM and PAM matrix-deriveddescriptors
The protr package integrates the function of parallelized similarityscore computation derived by local or global protein sequence alignmentbetween a list of protein sequences; Biostrings and pwalign provide thesequence alignment computation, and the corresponding functions listedin the protr package include:
twoSeqSim()
- Similarity calculation derived bysequence alignment between two protein sequencesparSeqSim()
- Parallelized pairwise similaritycalculation with a list of protein sequences (in-memory version)parSeqSimDisk()
- Parallelized pairwise similaritycalculation with a list of protein sequences (disk-based version)crossSetSim()
- Parallelized pairwise similaritycalculation between two sets of protein sequences (in-memoryversion)
protr also integrates the function for parallelized similarity scorecomputation derived by Gene Ontology (GO) semantic similarity measuresbetween a list of GO terms / Entrez Gene IDs, the GO similaritycomputation is provided by GOSemSim, the corresponding functions listedin the protr package include:
twoGOSim()
- Similarity calculation derived by GO-termssemantic similarity measures between two GO terms / Entrez GeneIDs;parGOSim()
- Pairwise similarity calculation with alist of GO terms / Entrez Gene IDs.
To use the parSeqSim()
function, we suggest the users toinstall the packages foreach and doParallel first in order to make theparallelized pairwise similarity computation available.
In the following sections, we will introduce the descriptors andfunction usage in this order.
Commonly used descriptors
Note: Users need to evaluate the underlying detailsof the descriptors provided intelligently, instead of using protr withtheir data blindly, especially for the descriptor types that have moreflexibility. It would be wise for the users to use some negative andpositive control comparisons where relevant to help guide theinterpretation of the results.
A protein or peptide sequence with \(N\) amino acid residues can be generallyrepresented as \(\{\,R_1, R_2, \ldots,R_n\,\}\), where \(R_i\)represents the residue at the \(i\)-thposition in the sequence. The labels \(i\) and \(j\) are used to index amino acid positionin a sequence, and \(r\), \(s\), \(t\)are used to represent the amino acid type. The computed descriptors areroughly categorized into 4 groups according to their majorapplications.
A protein sequence can be partitioned equally into segments. Thedescriptors designed for the complete sequence can often be applied toeach individual segment.
Amino acid composition descriptor
The amino acid composition describes the fraction of each amino acidtype within a protein sequence. The fractions of all 20 natural aminoacids are calculated as follows:
\[f(r) = \frac{N_r}{N} \quad r = 1, 2, \ldots, 20.\]
where \(N_r\) is the number of theamino acid type \(r\) and \(N\) is the length of the sequence.
As was described above, we can use the functionextractAAC()
to extract the descriptors (features) fromprotein sequences:
library("protr")x <- readFASTA(system.file( "protseq/P00750.fasta", package = "protr"))[[1]]extractAAC(x)#> A R N D C E Q #> 0.06405694 0.07117438 0.03914591 0.05160142 0.06761566 0.04804270 0.04804270 #> G H I L K M F #> 0.08185053 0.03024911 0.03558719 0.07651246 0.03914591 0.01245552 0.03202847 #> P S T W Y V #> 0.05338078 0.08896797 0.04448399 0.02313167 0.04270463 0.04982206
Here, using the function readFASTA()
, we loaded a singleprotein sequence (P00750, Tissue-type plasminogen activator) from aFASTA format file. Then, we extracted the AAC descriptors withextractAAC()
. The result returned is a named vector whoseelements are tagged with the name of each amino acid.
Dipeptide composition descriptor
Dipeptide composition gives a 400-dimensional descriptor, definedas:
\[f(r, s) = \frac{N_{rs}}{N - 1} \quad r, s = 1, 2, \ldots, 20.\]
where \(N_{rs}\) is the number ofdipeptides represented by amino acid type \(r\) and type \(s\). Similar to extractAAC()
,we use extractDC()
to compute the descriptors:
dc <- extractDC(x)head(dc, n = 30L)#> AA RA NA DA CA EA #> 0.003565062 0.003565062 0.000000000 0.007130125 0.003565062 0.003565062 #> QA GA HA IA LA KA #> 0.007130125 0.007130125 0.001782531 0.003565062 0.001782531 0.001782531 #> MA FA PA SA TA WA #> 0.000000000 0.005347594 0.003565062 0.007130125 0.003565062 0.000000000 #> YA VA AR RR NR DR #> 0.000000000 0.000000000 0.003565062 0.007130125 0.005347594 0.001782531 #> CR ER QR GR HR IR #> 0.005347594 0.005347594 0.000000000 0.007130125 0.001782531 0.003565062
Here, we only showed the first 30 elements of the result vector andomitted the rest of the result. The element names of the returned vectorare self-explanatory, as before.
Tripeptide composition descriptor
Tripeptide composition gives an 8000-dimensional descriptor, definedas:
\[f(r, s, t) = \frac{N_{rst}}{N - 2} \quad r, s, t = 1, 2, \ldots, 20\]
where \(N_{rst}\) is the number oftripeptides represented by amino acid type \(r\), \(s\), and \(t\). With the functionextractTC()
, we can easily obtain the length-8000descriptor, to save some space, here we also omitted the longoutputs:
tc <- extractTC(x)head(tc, n = 36L)#> AAA RAA NAA DAA CAA EAA #> 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 #> QAA GAA HAA IAA LAA KAA #> 0.001785714 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 #> MAA FAA PAA SAA TAA WAA #> 0.000000000 0.000000000 0.000000000 0.001785714 0.000000000 0.000000000 #> YAA VAA ARA RRA NRA DRA #> 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 #> CRA ERA QRA GRA HRA IRA #> 0.000000000 0.000000000 0.000000000 0.001785714 0.000000000 0.000000000 #> LRA KRA MRA FRA PRA SRA #> 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000
Autocorrelation descriptors
Autocorrelation descriptors are defined based on the distribution ofamino acid properties along the sequence. The amino acid properties usedhere are various types of amino acids index (Retrieved from the AAindex database,see Kawashima, Ogata, and Kanehisa (1999),Kawashima and Kanehisa (2000), and Kawashima et al. (2008); see Figure 2 for anillustrated example). Three types of autocorrelation descriptors aredefined here and described below.
All the amino acid indices are centralized and standardized beforethe calculation, i.e.
\[P_r = \frac{P_r - \bar{P}}{\sigma}\]
where \(\bar{P}\) is the average ofthe property of the 20 amino acids:
\[\bar{P} = \frac{\sum_{r=1}^{20} P_r}{20} \quad \textrm{and} \quad \sigma= \sqrt{\frac{1}{2} \sum_{r=1}^{20} (P_r - \bar{P})^2}\]
Figure 2: An illustrated example in the AAIndex database.
Normalized Moreau-Broto autocorrelation descriptors
For protein sequences, the Moreau-Broto autocorrelation descriptorscan be defined as:
\[AC(d) = \sum_{i=1}^{N-d} P_i P_{i + d} \quad d = 1, 2, \ldots,\textrm{nlag}\]
where \(d\) is called the lag of theautocorrelation; \(P_i\) and \(P_{i+d}\) are the properties of the aminoacids at position \(i\) and \(i+d\); \(\textrm{nlag}\) is the maximum value of thelag.
The normalized Moreau-Broto autocorrelation descriptors are definedas:
\[ATS(d) = \frac{AC(d)}{N-d} \quad d = 1, 2, \ldots, \textrm{nlag}\]
The corresponding function for this descriptor isextractMoreauBroto()
. A typical call would be:
moreau <- extractMoreauBroto(x)head(moreau, n = 36L)#> CIDH920105.lag1 CIDH920105.lag2 CIDH920105.lag3 CIDH920105.lag4 #> 0.081573213 -0.016064817 -0.015982990 -0.025739038 #> CIDH920105.lag5 CIDH920105.lag6 CIDH920105.lag7 CIDH920105.lag8 #> 0.079058632 -0.042771564 -0.036320847 0.024087298 #> CIDH920105.lag9 CIDH920105.lag10 CIDH920105.lag11 CIDH920105.lag12 #> -0.005273958 0.052274763 0.082170073 0.005419919 #> CIDH920105.lag13 CIDH920105.lag14 CIDH920105.lag15 CIDH920105.lag16 #> 0.083292042 0.004810584 0.001872446 -0.001531495 #> CIDH920105.lag17 CIDH920105.lag18 CIDH920105.lag19 CIDH920105.lag20 #> -0.011917230 0.071161551 0.033473197 0.026882737 #> CIDH920105.lag21 CIDH920105.lag22 CIDH920105.lag23 CIDH920105.lag24 #> 0.073075402 0.115272790 0.041517897 -0.027025993 #> CIDH920105.lag25 CIDH920105.lag26 CIDH920105.lag27 CIDH920105.lag28 #> 0.033477388 -0.003245255 0.078117010 -0.028177304 #> CIDH920105.lag29 CIDH920105.lag30 BHAR880101.lag1 BHAR880101.lag2 #> 0.046695832 0.020584423 0.052740185 0.030804784 #> BHAR880101.lag3 BHAR880101.lag4 BHAR880101.lag5 BHAR880101.lag6 #> 0.037170476 -0.058993771 0.070641780 -0.089192490
The eight default properties used here are:
- AccNo. CIDH920105 - Normalized Average Hydrophobicity Scales
- AccNo. BHAR880101 - Average Flexibility Indices
- AccNo. CHAM820101 - Polarizability Parameter
- AccNo. CHAM820102 - Free Energy of Solution in Water, kcal/mole
- AccNo. CHOC760101 - Residue Accessible Surface Area inTripeptide
- AccNo. BIGC670101 - Residue Volume
- AccNo. CHAM810101 - Steric Parameter
- AccNo. DAYM780201 - Relative Mutability
Users can change the property names of the AAindex database with theargument props
. The AAindex data shipped with protr can beloaded by data(AAindex)
which has detailed information oneach property. With the arguments customprops
andnlag
, users can specify their own properties and lag valueto calculate with. For example:
# Define 3 custom propertiesmyprops <- data.frame( AccNo = c("MyProp1", "MyProp2", "MyProp3"), A = c(0.62, -0.5, 15), R = c(-2.53, 3, 101), N = c(-0.78, 0.2, 58), D = c(-0.9, 3, 59), C = c(0.29, -1, 47), E = c(-0.74, 3, 73), Q = c(-0.85, 0.2, 72), G = c(0.48, 0, 1), H = c(-0.4, -0.5, 82), I = c(1.38, -1.8, 57), L = c(1.06, -1.8, 57), K = c(-1.5, 3, 73), M = c(0.64, -1.3, 75), F = c(1.19, -2.5, 91), P = c(0.12, 0, 42), S = c(-0.18, 0.3, 31), T = c(-0.05, -0.4, 45), W = c(0.81, -3.4, 130), Y = c(0.26, -2.3, 107), V = c(1.08, -1.5, 43))# Use 4 properties in the AAindex database and 3 customized propertiesmoreau2 <- extractMoreauBroto( x, customprops = myprops, props = c( "CIDH920105", "BHAR880101", "CHAM820101", "CHAM820102", "MyProp1", "MyProp2", "MyProp3" ))head(moreau2, n = 36L)#> CIDH920105.lag1 CIDH920105.lag2 CIDH920105.lag3 CIDH920105.lag4 #> 0.081573213 -0.016064817 -0.015982990 -0.025739038 #> CIDH920105.lag5 CIDH920105.lag6 CIDH920105.lag7 CIDH920105.lag8 #> 0.079058632 -0.042771564 -0.036320847 0.024087298 #> CIDH920105.lag9 CIDH920105.lag10 CIDH920105.lag11 CIDH920105.lag12 #> -0.005273958 0.052274763 0.082170073 0.005419919 #> CIDH920105.lag13 CIDH920105.lag14 CIDH920105.lag15 CIDH920105.lag16 #> 0.083292042 0.004810584 0.001872446 -0.001531495 #> CIDH920105.lag17 CIDH920105.lag18 CIDH920105.lag19 CIDH920105.lag20 #> -0.011917230 0.071161551 0.033473197 0.026882737 #> CIDH920105.lag21 CIDH920105.lag22 CIDH920105.lag23 CIDH920105.lag24 #> 0.073075402 0.115272790 0.041517897 -0.027025993 #> CIDH920105.lag25 CIDH920105.lag26 CIDH920105.lag27 CIDH920105.lag28 #> 0.033477388 -0.003245255 0.078117010 -0.028177304 #> CIDH920105.lag29 CIDH920105.lag30 BHAR880101.lag1 BHAR880101.lag2 #> 0.046695832 0.020584423 0.052740185 0.030804784 #> BHAR880101.lag3 BHAR880101.lag4 BHAR880101.lag5 BHAR880101.lag6 #> 0.037170476 -0.058993771 0.070641780 -0.089192490
About the standard input format of props
andcustomprops
, see ?extractMoreauBroto
fordetails.
Moran autocorrelation descriptors
For protein sequences, the Moran autocorrelation descriptors can bedefined as:
\[I(d) = \frac{\frac{1}{N-d} \sum_{i=1}^{N-d} (P_i - \bar{P}')(P_{i+d} - \bar{P}')}{\frac{1}{N} \sum_{i=1}^{N} (P_i -\bar{P}')^2} \quad d = 1, 2, \ldots, 30\]
where \(d\), \(P_i\), and \(P_{i+d}\) are defined in the same way as inthe first place; \(\bar{P}'\) isthe considered property \(P\) along thesequence, i.e.,
\[\bar{P}' = \frac{\sum_{i=1}^N P_i}{N}\]
\(d\), \(P\), \(P_i\) and \(P_{i+d}\), \(\textrm{nlag}\) have the same meaning asabove.
With extractMoran()
(which has the identical parametersas extractMoreauBroto()
), we can compute the Moranautocorrelation descriptors (only print out the first 36 elements):
# Use the 3 custom properties defined before# and 4 properties in the AAindex databasemoran <- extractMoran( x, customprops = myprops, props = c( "CIDH920105", "BHAR880101", "CHAM820101", "CHAM820102", "MyProp1", "MyProp2", "MyProp3" ))head(moran, n = 36L)#> CIDH920105.lag1 CIDH920105.lag2 CIDH920105.lag3 CIDH920105.lag4 #> 0.062895724 -0.044827681 -0.045065117 -0.055955678 #> CIDH920105.lag5 CIDH920105.lag6 CIDH920105.lag7 CIDH920105.lag8 #> 0.060586377 -0.074128412 -0.067308852 -0.001293384 #> CIDH920105.lag9 CIDH920105.lag10 CIDH920105.lag11 CIDH920105.lag12 #> -0.033747588 0.029392193 0.061789800 -0.023368437 #> CIDH920105.lag13 CIDH920105.lag14 CIDH920105.lag15 CIDH920105.lag16 #> 0.062769417 -0.024912264 -0.028298043 -0.031584063 #> CIDH920105.lag17 CIDH920105.lag18 CIDH920105.lag19 CIDH920105.lag20 #> -0.043466730 0.047830694 0.005883901 -0.001769769 #> CIDH920105.lag21 CIDH920105.lag22 CIDH920105.lag23 CIDH920105.lag24 #> 0.049334048 0.096427969 0.015147594 -0.060092509 #> CIDH920105.lag25 CIDH920105.lag26 CIDH920105.lag27 CIDH920105.lag28 #> 0.007549152 -0.033987885 0.056307675 -0.061844453 #> CIDH920105.lag29 CIDH920105.lag30 BHAR880101.lag1 BHAR880101.lag2 #> 0.021484780 -0.008461776 0.014229951 -0.009142419 #> BHAR880101.lag3 BHAR880101.lag4 BHAR880101.lag5 BHAR880101.lag6 #> -0.003272262 -0.109613332 0.033346233 -0.141538598
Geary autocorrelation descriptors
Geary autocorrelation descriptors for protein sequences can bedefined as:
\[C(d) = \frac{\frac{1}{2(N-d)} \sum_{i=1}^{N-d} (P_i -P_{i+d})^2}{\frac{1}{N-1} \sum_{i=1}^{N} (P_i - \bar{P}')^2} \quad d= 1, 2, \ldots, 30\]
where \(d\), \(P\), \(P_i\), \(P_{i+d}\), and \(\textrm{nlag}\) have the same meaning asabove.
For each amino acid index, there will be \(3 \times \textrm{nlag}\) autocorrelationdescriptors. The usage of extractGeary()
is exactly thesame as extractMoreauBroto()
andextractMoran()
:
# Use the 3 custom properties defined before# and 4 properties in the AAindex databasegeary <- extractGeary( x, customprops = myprops, props = c( "CIDH920105", "BHAR880101", "CHAM820101", "CHAM820102", "MyProp1", "MyProp2", "MyProp3" ))head(geary, n = 36L)#> CIDH920105.lag1 CIDH920105.lag2 CIDH920105.lag3 CIDH920105.lag4 #> 0.9361830 1.0442920 1.0452843 1.0563467 #> CIDH920105.lag5 CIDH920105.lag6 CIDH920105.lag7 CIDH920105.lag8 #> 0.9406031 1.0765517 1.0675786 0.9991363 #> CIDH920105.lag9 CIDH920105.lag10 CIDH920105.lag11 CIDH920105.lag12 #> 1.0316555 0.9684585 0.9353130 1.0201990 #> CIDH920105.lag13 CIDH920105.lag14 CIDH920105.lag15 CIDH920105.lag16 #> 0.9340933 1.0207373 1.0251486 1.0290464 #> CIDH920105.lag17 CIDH920105.lag18 CIDH920105.lag19 CIDH920105.lag20 #> 1.0414375 0.9494403 0.9905987 0.9987183 #> CIDH920105.lag21 CIDH920105.lag22 CIDH920105.lag23 CIDH920105.lag24 #> 0.9472542 0.9010009 0.9828848 1.0574098 #> CIDH920105.lag25 CIDH920105.lag26 CIDH920105.lag27 CIDH920105.lag28 #> 0.9897955 1.0290018 0.9400066 1.0584150 #> CIDH920105.lag29 CIDH920105.lag30 BHAR880101.lag1 BHAR880101.lag2 #> 0.9762904 1.0029734 0.9818711 1.0051730 #> BHAR880101.lag3 BHAR880101.lag4 BHAR880101.lag5 BHAR880101.lag6 #> 0.9967069 1.1012905 0.9595859 1.1337056
Composition/Transition/Distribution
The CTD descriptors are developed by Dubchaket al. (1995) and Dubchak et al.(1999).
Figure 3: The sequence of a hypothetic protein indicating theconstruction of composition, transition, and distribution descriptors ofa protein. The sequence index indicates the position of an amino acid inthe sequence. The index for each type of amino acid in the sequence(1
, 2
or 3
) indicates theposition of the first, second, third, … of that type of amino acid. 1/2transition indicates the position of 12
or 21
pairs in the sequence (1/3 and 2/3 are defined similarly).
Step 1: Sequence Encoding
The amino acids are categorized into three classes according to theirattributes, and each amino acid is encoded by one of the indices 1, 2, 3according to which class it belongs. The attributes used here includehydrophobicity, normalized van der Waals volume, polarity, andpolarizability. The corresponding classification for each attribute islisted in Table 1.
Group 1 | Group 2 | Group 3 | |
---|---|---|---|
Hydrophobicity | Polar | Neutral | Hydrophobicity |
R, K, E, D, Q, N | G, A, S, T, P, H, Y | C, L, V, I, M, F, W | |
Normalized van der Waals Volume | 0-2.78 | 2.95-4.0 | 4.03-8.08 |
G, A, S, T, P, D, C | N, V, E, Q, I, L | M, H, K, F, R, Y, W | |
Polarity | 4.9-6.2 | 8.0-9.2 | 10.4-13.0 |
L, I, F, W, C, M, V, Y | P, A, T, G, S | H, Q, R, K, N, E, D | |
Polarizability | 0-1.08 | 0.128-0.186 | 0.219-0.409 |
G, A, S, D, T | C, P, N, V, E, Q, I, L | K, M, H, F, R, Y, W | |
Charge | Positive | Neutral | Negative |
K, R | A, N, C, Q, G, H, I, L, M, F, P, S, T, W, Y, V | D, E | |
Secondary Structure | Helix | Strand | Coil |
E, A, L, M, Q, K, R, H | V, I, Y, C, W, F, T | G, N, P, S, D | |
Solvent Accessibility | Buried | Exposed | Intermediate |
A, L, F, C, G, I, V, W | R, K, Q, E, N, D | M, S, P, T, H, Y |
For example, for a given sequence MTEITAAMVKELRESTGAGA
,it will be encoded as 32132223311311222222
according to itshydrophobicity.
Step 2: Compute Composition, Transition, and DistributionDescriptors
Three types of descriptors, Composition (C),Transition (T), and Distribution (D)can be calculated for a given attribute as follows.
Composition
Composition is defined as the global percentage for each encodedclass in the protein sequence. In the above example, using thehydrophobicity classification, the numbers for encoded classes1
, 2
, 3
are 5, 10, and 5, so thatthe compositions for them will be \(5/20=25\%\), \(10/20=10\%\), and \(5/20=25\%\), where 20 is the length of theprotein sequence. The composition descriptor can be expressed as
\[C_r = \frac{n_r}{n} \quad r = 1, 2, 3\]
where \(n_r\) is the number of aminoacid type \(r\) in the encodedsequence; \(N\) is the length of thesequence. An example for extractCTDC()
:
extractCTDC(x)#> hydrophobicity.Group1 hydrophobicity.Group2 hydrophobicity.Group3 #> 0.29715302 0.40569395 0.29715302 #> normwaalsvolume.Group1 normwaalsvolume.Group2 normwaalsvolume.Group3 #> 0.45195730 0.29715302 0.25088968 #> polarity.Group1 polarity.Group2 polarity.Group3 #> 0.33985765 0.33274021 0.32740214 #> polarizability.Group1 polarizability.Group2 polarizability.Group3 #> 0.33096085 0.41814947 0.25088968 #> charge.Group1 charge.Group2 charge.Group3 #> 0.11032028 0.79003559 0.09964413 #> secondarystruct.Group1 secondarystruct.Group2 secondarystruct.Group3 #> 0.38967972 0.29537367 0.31494662 #> solventaccess.Group1 solventaccess.Group2 solventaccess.Group3 #> 0.43060498 0.29715302 0.27224199
The result shows the elements that are named asPropertyNumber.GroupNumber
in the returned vector.
Transition
A transition from class 1 to 2 is the percent frequency with which 1is followed by 2 or 2 is followed by 1 in the encoded sequences. Thetransition descriptor can be computed as
\[T_{rs} = \frac{n_{rs} + n_{sr}}{N - 1} \quad rs = \text{12}, \text{13},\text{23}\]
where \(n_{rs}\), \(n_{sr}\) are the numbers of dipeptideencoded as rs
and sr
in the sequence; \(N\) is the length of the sequence. Anexample for extractCTDT()
:
extractCTDT(x)#> prop1.Tr1221 prop1.Tr1331 prop1.Tr2332 prop2.Tr1221 prop2.Tr1331 prop2.Tr2332 #> 0.27094474 0.16042781 0.23351159 0.26737968 0.22638146 0.17112299 #> prop3.Tr1221 prop3.Tr1331 prop3.Tr2332 prop4.Tr1221 prop4.Tr1331 prop4.Tr2332 #> 0.21033868 0.20499109 0.23707665 0.27272727 0.15151515 0.24598930 #> prop5.Tr1221 prop5.Tr1331 prop5.Tr2332 prop6.Tr1221 prop6.Tr1331 prop6.Tr2332 #> 0.18181818 0.02139037 0.15686275 0.21925134 0.22816399 0.15864528 #> prop7.Tr1221 prop7.Tr1331 prop7.Tr2332 #> 0.25133690 0.21568627 0.18003565
Distribution
The distribution descriptor describes the distribution ofeach attribute in the sequence.
There are five “distribution” descriptors for each attribute, andthey are the position percent in the whole sequence for the firstresidue, 25% residues, 50% residues, 75% residues, and 100% residues fora certain encoded class. For example, there are 10 residues encoded as2
in the above example, the positions for the first residue2
, the 2nd residue 2
(25% * 10 = 2), the 5th2
residue (50% * 10 = 5), the 7th 2
(75% * 10= 7) and the 10th residue 2
(100% * 10) in the encodedsequence are 2, 5, 15, 17, 20, so that the distribution descriptors for2
are: 10.0 (2/20 * 100), 25.0 (5/20 * 100), 75.0 (15/20 *100), 85.0 (17/20 * 100), 100.0 (20/20 * 100).
To compute the distribution descriptor, useextractCTDD()
:
extractCTDD(x)#> prop1.G1.residue0 prop1.G1.residue25 prop1.G1.residue50 prop1.G1.residue75 #> 0.3558719 23.1316726 50.1779359 73.8434164 #> prop1.G1.residue100 prop1.G2.residue0 prop1.G2.residue25 prop1.G2.residue50 #> 99.8220641 0.5338078 27.4021352 47.3309609 #> prop1.G2.residue75 prop1.G2.residue100 prop1.G3.residue0 prop1.G3.residue25 #> 75.2669039 100.0000000 0.1779359 19.5729537 #> prop1.G3.residue50 prop1.G3.residue75 prop1.G3.residue100 prop2.G1.residue0 #> 51.7793594 75.6227758 99.6441281 0.3558719 #> prop2.G1.residue25 prop2.G1.residue50 prop2.G1.residue75 prop2.G1.residue100 #> 25.6227758 48.0427046 75.4448399 100.0000000 #> prop2.G2.residue0 prop2.G2.residue25 prop2.G2.residue50 prop2.G2.residue75 #> 1.4234875 23.3096085 54.4483986 76.3345196 #> prop2.G2.residue100 prop2.G3.residue0 prop2.G3.residue25 prop2.G3.residue50 #> 99.4661922 0.1779359 22.7758007 48.9323843 #> prop2.G3.residue75 prop2.G3.residue100 prop3.G1.residue0 prop3.G1.residue25 #> 69.5729537 99.8220641 0.1779359 20.9964413 #> prop3.G1.residue50 prop3.G1.residue75 prop3.G1.residue100 prop3.G2.residue0 #> 50.8896797 74.5551601 99.6441281 0.5338078 #> prop3.G2.residue25 prop3.G2.residue50 prop3.G2.residue75 prop3.G2.residue100 #> 26.5124555 46.2633452 75.4448399 100.0000000 #> prop3.G3.residue0 prop3.G3.residue25 prop3.G3.residue50 prop3.G3.residue75 #> 0.3558719 24.1992883 50.5338078 73.8434164 #> prop3.G3.residue100 prop4.G1.residue0 prop4.G1.residue25 prop4.G1.residue50 #> 99.8220641 0.3558719 26.5124555 48.3985765 #> prop4.G1.residue75 prop4.G1.residue100 prop4.G2.residue0 prop4.G2.residue25 #> 76.1565836 99.2882562 1.4234875 21.5302491 #> prop4.G2.residue50 prop4.G2.residue75 prop4.G2.residue100 prop4.G3.residue0 #> 51.4234875 75.8007117 100.0000000 0.1779359 #> prop4.G3.residue25 prop4.G3.residue50 prop4.G3.residue75 prop4.G3.residue100 #> 22.7758007 48.9323843 69.5729537 99.8220641 #> prop5.G1.residue0 prop5.G1.residue25 prop5.G1.residue50 prop5.G1.residue75 #> 0.8896797 20.8185053 48.9323843 69.5729537 #> prop5.G1.residue100 prop5.G2.residue0 prop5.G2.residue25 prop5.G2.residue50 #> 99.8220641 0.1779359 24.9110320 49.1103203 #> prop5.G2.residue75 prop5.G2.residue100 prop5.G3.residue0 prop5.G3.residue25 #> 75.2669039 100.0000000 0.3558719 26.1565836 #> prop5.G3.residue50 prop5.G3.residue75 prop5.G3.residue100 prop6.G1.residue0 #> 64.2348754 77.4021352 99.2882562 0.1779359 #> prop6.G1.residue25 prop6.G1.residue50 prop6.G1.residue75 prop6.G1.residue100 #> 22.9537367 50.8896797 74.3772242 99.8220641 #> prop6.G2.residue0 prop6.G2.residue25 prop6.G2.residue50 prop6.G2.residue75 #> 1.6014235 21.5302491 49.2882562 70.8185053 #> prop6.G2.residue100 prop6.G3.residue0 prop6.G3.residue25 prop6.G3.residue50 #> 98.9323843 0.3558719 29.0035587 48.2206406 #> prop6.G3.residue75 prop6.G3.residue100 prop7.G1.residue0 prop7.G1.residue25 #> 77.4021352 100.0000000 0.5338078 23.4875445 #> prop7.G1.residue50 prop7.G1.residue75 prop7.G1.residue100 prop7.G2.residue0 #> 50.0000000 74.5551601 98.9323843 0.3558719 #> prop7.G2.residue25 prop7.G2.residue50 prop7.G2.residue75 prop7.G2.residue100 #> 23.1316726 50.1779359 73.8434164 99.8220641 #> prop7.G3.residue0 prop7.G3.residue25 prop7.G3.residue50 prop7.G3.residue75 #> 0.1779359 27.2241993 48.0427046 75.4448399 #> prop7.G3.residue100 #> 100.0000000
Conjoint triad descriptors
Conjoint triad descriptors are proposed by Shen et al. (2007). The conjoint triaddescriptors were used to model protein-protein interactions based on theclassification of amino acids. In this approach, each protein sequenceis represented by a vector space consisting of descriptors of aminoacids. To reduce the dimensions of vector space, the 20 amino acids wereclustered into several classes according to their dipoles and volumes ofthe side chains. The conjoint triad descriptors are calculated asfollows:
Step 1: Classification of Amino Acids
Electrostatic and hydrophobic interactions dominate protein-proteininteractions. These two kinds of interactions may be reflected by thedipoles and volumes of the side chains of amino acids, respectively.Accordingly, these two parameters were calculated using thedensity-functional theory method B3LYP/6-31G and the molecular modelingapproach. Based on the dipoles and volumes of the side chains, the 20amino acids can be clustered into seven classes (See Table 2). Aminoacids within the same class likely involve synonymous mutations becauseof their similar characteristics.
No. | Dipole Scale\(^1\) | Volume Scale\(^2\) | Class |
---|---|---|---|
1 | \(-\) | \(-\) | Ala, Gly, Val |
2 | \(-\) | \(+\) | Ile, Leu, Phe, Pro |
3 | \(+\) | \(+\) | Tyr, Met, Thr, Ser |
4 | \(++\) | \(+\) | His, Asn, Gln, Tpr |
5 | \(+++\) | \(+\) | Arg, Lys |
6 | \(+'+'+'\) | \(+\) | Asp, Glu |
7 | \(+^3\) | \(+\) | Cys |
1 Dipole Scale (Debye): \(-\), Dipole < 1.0; \(+\), 1.0 < Dipole < 2.0; \(++\), 2.0 < Dipole < 3.0; \(+++\), Dipole > 3.0; \(+'+'+'\), Dipole > 3.0 withopposite orientation.
2 Volume Scale (\(\overset{\lower.5em\circ}{\mathrm{A}}\lower.01em^3\)):\(-\), Volume < 50; \(+\), Volume > 50.
3 Cys is separated from class 3 because of its ability toform disulfide bonds.
Step 2: Conjoint Triad Calculation
The conjoint triad descriptors considered the properties of one aminoacid and its vicinal amino acids and regarded any three continuous aminoacids as a unit. Thus, the triads can be differentiated according to theclasses of amino acids, i.e., triads composed of three amino acidsbelonging to the same classes, such as ART and VKS, can be treatedidentically. To conveniently represent a protein, we first use a binaryspace \((\mathbf{V}, \mathbf{F})\) torepresent a protein sequence. Here, \(\mathbf{V}\) is the vector space of thesequence features, and each feature \(v_i\) represents a sort of triad type;\(\mathbf{F}\) is the frequency vectorcorresponding to \(\mathbf{V}\), andthe value of the \(i\)-th dimension of\(\mathbf{F} (f_i)\) is the frequencyof type \(v_i\) appearing in theprotein sequence. For the amino acids that have been categorized intoseven classes, the size of \(\mathbf{V}\) should be \(7 \times 7 \times 7\); thus \(i = 1, 2, \ldots, 343\). The detaileddescription for (\(\mathbf{V}\), \(\mathbf{F}\)) is illustrated in Figure4.
Figure 4: Schematic diagram for constructing the vector space (\(\mathbf{V}\), \(\mathbf{F}\)) of protein sequences. \(\mathbf{V}\) is the vector space of thesequence features; each feature (\(v_i\)) represents a triad composed of threeconsecutive amino acids; \(\mathbf{F}\)is the frequency vector corresponding to \(\mathbf{V}\), and the value of the \(i\)-th dimension of \(\mathbf{F} (f_i)\) is the frequency that\(v_i\) triad appeared in the proteinsequence.
Clearly, each protein correlates to the length (number of aminoacids) of protein. In general, a long protein would have a large valueof \(f_i\), which complicates thecomparison between two heterogeneous proteins. Thus, we defined a newparameter, \(d_i\), by normalizing\(f_i\) with the followingequation:
\[d_i = \frac{f_i - \min\{\,f_1, f_2 , \ldots, f_{343}\,\}}{\max\{\,f_1,f_2, \ldots, f_{343}\,\}}\]
The numerical value of \(d_i\) ofeach protein ranges from 0 to 1, enabling the comparison betweenproteins. Accordingly, we obtain another vector space (designated \(\mathbf{D}\)) consisting of \(d_i\) to represent protein.
To compute conjoint triads of protein sequences, we can use:
ctriad <- extractCTriad(x)head(ctriad, n = 65L)#> VS111 VS211 VS311 VS411 VS511 VS611 VS711 VS121 VS221 VS321 VS421 VS521 VS621 #> 0.1 0.3 0.6 0.2 0.4 0.0 0.3 1.0 0.6 0.5 0.0 0.2 0.3 #> VS721 VS131 VS231 VS331 VS431 VS531 VS631 VS731 VS141 VS241 VS341 VS441 VS541 #> 0.0 0.2 0.4 0.5 0.2 0.3 0.3 0.1 0.3 0.3 0.2 0.2 0.0 #> VS641 VS741 VS151 VS251 VS351 VS451 VS551 VS651 VS751 VS161 VS261 VS361 VS461 #> 0.1 0.2 0.2 0.2 0.5 0.1 0.2 0.0 0.0 0.1 0.4 0.2 0.3 #> VS561 VS661 VS761 VS171 VS271 VS371 VS471 VS571 VS671 VS771 VS112 VS212 VS312 #> 0.2 0.0 0.1 0.1 0.3 0.1 0.0 0.1 0.0 0.1 0.8 0.4 0.4 #> VS412 VS512 VS612 VS712 VS122 VS222 VS322 VS422 VS522 VS622 VS722 VS132 VS232 #> 0.6 0.1 0.5 0.2 0.8 0.5 0.2 0.3 0.2 0.0 0.2 0.1 0.3
by which we only outputted the first 65 of 343 dimensions to savespace.
Quasi-sequence-order descriptors
The quasi-sequence-order descriptors are proposed by Chou (2000). They are derived from the distancematrix between the 20 amino acids.
Sequence-order-coupling number
The \(d\)-th ranksequence-order-coupling number is defined as:
\[\tau_d = \sum_{i=1}^{N-d} (d_{i, i+d})^2 \quad d = 1, 2, \ldots,\textrm{maxlag}\]
where \(d_{i, i+d}\) is the distancebetween the two amino acids at position \(i\) and \(i+d\).
Note: maxlag is the maximum lag and the length ofthe protein must be not less than \(\textrm{maxlag}\).
The function extractSOCN()
is used for computing thesequence-order-coupling numbers:
extractSOCN(x)#> Schneider.lag1 Schneider.lag2 Schneider.lag3 Schneider.lag4 Schneider.lag5 #> 204.2036 199.8708 206.8102 197.4828 193.3366 #> Schneider.lag6 Schneider.lag7 Schneider.lag8 Schneider.lag9 Schneider.lag10 #> 208.1936 195.5476 200.9789 196.7110 193.9931 #> Schneider.lag11 Schneider.lag12 Schneider.lag13 Schneider.lag14 Schneider.lag15 #> 199.7031 204.9389 187.0140 198.4702 205.4526 #> Schneider.lag16 Schneider.lag17 Schneider.lag18 Schneider.lag19 Schneider.lag20 #> 193.1274 187.3529 190.4949 202.8853 198.5299 #> Schneider.lag21 Schneider.lag22 Schneider.lag23 Schneider.lag24 Schneider.lag25 #> 191.1013 185.0074 189.9857 202.7113 201.6267 #> Schneider.lag26 Schneider.lag27 Schneider.lag28 Schneider.lag29 Schneider.lag30 #> 194.5770 185.9939 204.1297 191.1629 183.9073 #> Grantham.lag1 Grantham.lag2 Grantham.lag3 Grantham.lag4 Grantham.lag5 #> 6674686.0000 6761609.0000 7138892.0000 6748261.0000 6291229.0000 #> Grantham.lag6 Grantham.lag7 Grantham.lag8 Grantham.lag9 Grantham.lag10 #> 6839853.0000 6594164.0000 6556148.0000 6620183.0000 6770614.0000 #> Grantham.lag11 Grantham.lag12 Grantham.lag13 Grantham.lag14 Grantham.lag15 #> 6495689.0000 6865537.0000 6297267.0000 6498247.0000 6615566.0000 #> Grantham.lag16 Grantham.lag17 Grantham.lag18 Grantham.lag19 Grantham.lag20 #> 6572680.0000 6569081.0000 6173947.0000 6570829.0000 6471308.0000 #> Grantham.lag21 Grantham.lag22 Grantham.lag23 Grantham.lag24 Grantham.lag25 #> 6461649.0000 5939432.0000 6532121.0000 6652472.0000 6480660.0000 #> Grantham.lag26 Grantham.lag27 Grantham.lag28 Grantham.lag29 Grantham.lag30 #> 6382281.0000 6276521.0000 6537634.0000 6442991.0000 6350157.0000
Users can also specify the maximum lag value with thenlag
argument.
Note: Besides the Schneider-Wrede physicochemicaldistance matrix (Schneider and Wrede 1994)used by Kuo-Chen Chou, another chemical distance matrix by Grantham (1974) is also used here. So thedescriptors dimension will be nlag * 2
. Thequasi-sequence-order descriptors described next also utilized the twomatrices.
Quasi-sequence-order descriptors
For each amino acid type, a quasi-sequence-order descriptor can bedefined as:
\[X_r = \frac{f_r}{\sum_{r=1}^{20} f_r + w \sum_{d=1}^{\textrm{maxlag}}\tau_d} \quad r = 1, 2, \ldots, 20\]
where \(f_r\) is the normalizedoccurrence for amino acid type \(i\)and \(w\) is a weighting factor (\(w=0.1\)). These are the first 20quasi-sequence-order descriptors. The other 30 quasi-sequence-order aredefined as:
\[X_d = \frac{w \tau_{d-20}}{\sum_{r=1}^{20} f_r + w\sum_{d=1}^{\textrm{maxlag}} \tau_d} \quad d = 21, 22, \ldots, 20 +\textrm{maxlag}\]
Figure 5: A schematic drawing to show (a) the 1st-rank, (b) the2nd-rank, and (3) the 3rd-rank sequence-order-coupling mode along aprotein sequence. (a) Reflects the coupling mode between all the mostcontiguous residues, (b) that between all the 2nd most contiguousresidues, and (c) that between all the 3rd most contiguous residues.This figure is from Chou (2000).
A minimal example for extractQSO()
:
extractQSO(x)#> Schneider.Xr.A Schneider.Xr.R Schneider.Xr.N Schneider.Xr.D Schneider.Xr.C #> 6.096218e-02 6.773576e-02 3.725467e-02 4.910842e-02 6.434897e-02 #> Schneider.Xr.E Schneider.Xr.Q Schneider.Xr.G Schneider.Xr.H Schneider.Xr.I #> 4.572164e-02 4.572164e-02 7.789612e-02 2.878770e-02 3.386788e-02 #> Schneider.Xr.L Schneider.Xr.K Schneider.Xr.M Schneider.Xr.F Schneider.Xr.P #> 7.281594e-02 3.725467e-02 1.185376e-02 3.048109e-02 5.080182e-02 #> Schneider.Xr.S Schneider.Xr.T Schneider.Xr.W Schneider.Xr.Y Schneider.Xr.V #> 8.466970e-02 4.233485e-02 2.201412e-02 4.064145e-02 4.741503e-02 #> Grantham.Xr.A Grantham.Xr.R Grantham.Xr.N Grantham.Xr.D Grantham.Xr.C #> 1.835033e-06 2.038926e-06 1.121409e-06 1.478221e-06 1.936980e-06 #> Grantham.Xr.E Grantham.Xr.Q Grantham.Xr.G Grantham.Xr.H Grantham.Xr.I #> 1.376275e-06 1.376275e-06 2.344765e-06 8.665435e-07 1.019463e-06 #> Grantham.Xr.L Grantham.Xr.K Grantham.Xr.M Grantham.Xr.F Grantham.Xr.P #> 2.191845e-06 1.121409e-06 3.568120e-07 9.175167e-07 1.529194e-06 #> Grantham.Xr.S Grantham.Xr.T Grantham.Xr.W Grantham.Xr.Y Grantham.Xr.V #> 2.548657e-06 1.274329e-06 6.626509e-07 1.223356e-06 1.427248e-06 #> Schneider.Xd.1 Schneider.Xd.2 Schneider.Xd.3 Schneider.Xd.4 Schneider.Xd.5 #> 3.457972e-02 3.384600e-02 3.502111e-02 3.344162e-02 3.273951e-02 #> Schneider.Xd.6 Schneider.Xd.7 Schneider.Xd.8 Schneider.Xd.9 Schneider.Xd.10 #> 3.525537e-02 3.311390e-02 3.403364e-02 3.331093e-02 3.285068e-02 #> Schneider.Xd.11 Schneider.Xd.12 Schneider.Xd.13 Schneider.Xd.14 Schneider.Xd.15 #> 3.381760e-02 3.470422e-02 3.166883e-02 3.360882e-02 3.479121e-02 #> Schneider.Xd.16 Schneider.Xd.17 Schneider.Xd.18 Schneider.Xd.19 Schneider.Xd.20 #> 3.270408e-02 3.172623e-02 3.225829e-02 3.435647e-02 3.361893e-02 #> Schneider.Xd.21 Schneider.Xd.22 Schneider.Xd.23 Schneider.Xd.24 Schneider.Xd.25 #> 3.236099e-02 3.132904e-02 3.217206e-02 3.432701e-02 3.414334e-02 #> Schneider.Xd.26 Schneider.Xd.27 Schneider.Xd.28 Schneider.Xd.29 Schneider.Xd.30 #> 3.294954e-02 3.149609e-02 3.456720e-02 3.237140e-02 3.114275e-02 #> Grantham.Xd.1 Grantham.Xd.2 Grantham.Xd.3 Grantham.Xd.4 Grantham.Xd.5 #> 3.402298e-02 3.446605e-02 3.638918e-02 3.439801e-02 3.206838e-02 #> Grantham.Xd.6 Grantham.Xd.7 Grantham.Xd.8 Grantham.Xd.9 Grantham.Xd.10 #> 3.486488e-02 3.361253e-02 3.341875e-02 3.374516e-02 3.451195e-02 #> Grantham.Xd.11 Grantham.Xd.12 Grantham.Xd.13 Grantham.Xd.14 Grantham.Xd.15 #> 3.311057e-02 3.499580e-02 3.209915e-02 3.312361e-02 3.372162e-02 #> Grantham.Xd.16 Grantham.Xd.17 Grantham.Xd.18 Grantham.Xd.19 Grantham.Xd.20 #> 3.350302e-02 3.348467e-02 3.147055e-02 3.349358e-02 3.298629e-02 #> Grantham.Xd.21 Grantham.Xd.22 Grantham.Xd.23 Grantham.Xd.24 Grantham.Xd.25 #> 3.293706e-02 3.027516e-02 3.329628e-02 3.390974e-02 3.303396e-02 #> Grantham.Xd.26 Grantham.Xd.27 Grantham.Xd.28 Grantham.Xd.29 Grantham.Xd.30 #> 3.253250e-02 3.199340e-02 3.332438e-02 3.284195e-02 3.236875e-02
where users can also specify the maximum lag with the argumentnlag
and the weighting factor with the argumentw
.
Pseudo-amino acid composition (PseAAC)
This group of descriptors is proposed by Chou(2001). PseAAC descriptors are also named as the type 1pseudo-amino acid composition. Let \(H_1^o (i)\), \(H_2^o (i)\), \(M^o (i)\) (\(i=1,2, 3, \ldots, 20\)) be the original hydrophobicity values, theoriginal hydrophilicity values and the original side chain masses of the20 natural amino acids, respectively. They are converted to thefollowing qualities by a standard conversion:
\[H_1 (i) = \frac{H_1^o (i) - \frac{1}{20} \sum_{i=1}^{20} H_1^o(i)}{\sqrt{\frac{\sum_{i=1}^{20} [H_1^o (i) - \frac{1}{20}\sum_{i=1}^{20} H_1^o (i) ]^2}{20}}}\]
\(H_2^o (i)\) and \(M^o (i)\) are normalized as \(H_2 (i)\) and \(M(i)\) in the same way.
Figure 6: A schematic drawing to show (a) the first-tier, (b) thesecond-tier, and (3) the third-tier sequence order correlation modealong a protein sequence. Panel (a) reflects the correlation modebetween all the most contiguous residues, panel (b) between all thesecond-most contiguous residues, and panel (c) between all thethird-most contiguous residues. This figure is from Chou (2001).
Then, a correlation function can be defined as
\[\Theta (R_i, R_j) = \frac{1}{3} \bigg\{ [ H_1 (R_i) - H_1 (R_j) ]^2 + [H_2 (R_i) - H_2 (R_j) ]^2 + [ M (R_i) - M (R_j) ]^2 \bigg\}\]
This correlation function is an average value for the three aminoacid properties: hydrophobicity, hydrophilicity, and side chain mass.Therefore, we can extend this definition of correlation functions forone amino acid property or for a set of \(n\) amino acid properties.
For one amino acid property, the correlation can be defined as:
\[\Theta (R_i, R_j) = [H_1 (R_i) - H_1(R_j)]^2\]
where \(H (R_i)\) is the amino acidproperty of amino acid \(R_i\) afterstandardization.
For a set of n amino acid properties, it can be defined as: where\(H_k (R_i)\) is the \(k\)-th property in the amino acid propertyset for amino acid \(R_i\).
\[\Theta (R_i, R_j) = \frac{1}{n} \sum_{k=1}^{n} [H_k (R_i) - H_k (R_j)]^2\]
where \(H_k(R_i)\) is the \(k\)-th property in the amino acid propertyset for amino acid \(R_i\).
A set of descriptors named sequence order-correlated factors aredefined as:
\[\begin{align*}\theta_1 & = \frac{1}{N-1} \sum_{i=1}^{N-1} \Theta (R_i, R_{i+1})\\\theta_2 & = \frac{1}{N-2} \sum_{i=1}^{N-2} \Theta (R_i, R_{i+2})\\\theta_3 & = \frac{1}{N-3} \sum_{i=1}^{N-3} \Theta (R_i, R_{i+3})\\ & \ldots \\\theta_\lambda & = \frac{1}{N-\lambda} \sum_{i=1}^{N-\lambda}\Theta (R_i, R_{i+\lambda})\end{align*}\]
\(\lambda\) (\(\lambda < L\)) is a parameter to bespecified. Let \(f_i\) be thenormalized occurrence frequency of the 20 amino acids in the proteinsequence, a set of \(20 + \lambda\)descriptors called the pseudo-amino acid composition for a proteinsequence can be defined as:
\[X_c = \frac{f_c}{\sum_{r=1}^{20} f_r + w \sum_{j=1}^{\lambda} \theta_j}\quad (1 < c < 20)\]
\[X_c = \frac{w \theta_{c-20}}{\sum_{r=1}^{20} f_r + w\sum_{j=1}^{\lambda} \theta_j} \quad (21 < c < 20 + \lambda)\]
where \(w\) is the weighting factorfor the sequence-order effect and is set to \(w = 0.05\) in protr as suggested byKuo-Chen Chou.
With extractPAAC()
, we can compute the PseAACdescriptors directly:
extractPAAC(x)#> Xc1.A Xc1.R Xc1.N Xc1.D Xc1.C #> 9.07025432 10.07806035 5.54293319 7.30659376 9.57415734 #> Xc1.E Xc1.Q Xc1.G Xc1.H Xc1.I #> 6.80269074 6.80269074 11.58976941 4.28317565 5.03903018 #> Xc1.L Xc1.K Xc1.M Xc1.F Xc1.P #> 10.83391488 5.54293319 1.76366056 4.53512716 7.55854527 #> Xc1.S Xc1.T Xc1.W Xc1.Y Xc1.V #> 12.59757544 6.29878772 3.27536961 6.04683621 7.05464225 #> Xc2.lambda.1 Xc2.lambda.2 Xc2.lambda.3 Xc2.lambda.4 Xc2.lambda.5 #> 0.02514092 0.02500357 0.02527773 0.02553159 0.02445265 #> Xc2.lambda.6 Xc2.lambda.7 Xc2.lambda.8 Xc2.lambda.9 Xc2.lambda.10 #> 0.02561910 0.02486131 0.02506656 0.02553952 0.02437663 #> Xc2.lambda.11 Xc2.lambda.12 Xc2.lambda.13 Xc2.lambda.14 Xc2.lambda.15 #> 0.02491262 0.02533803 0.02351915 0.02479912 0.02548431 #> Xc2.lambda.16 Xc2.lambda.17 Xc2.lambda.18 Xc2.lambda.19 Xc2.lambda.20 #> 0.02478210 0.02513770 0.02457224 0.02543046 0.02500889 #> Xc2.lambda.21 Xc2.lambda.22 Xc2.lambda.23 Xc2.lambda.24 Xc2.lambda.25 #> 0.02476967 0.02342389 0.02431684 0.02610300 0.02626722 #> Xc2.lambda.26 Xc2.lambda.27 Xc2.lambda.28 Xc2.lambda.29 Xc2.lambda.30 #> 0.02457082 0.02343049 0.02588823 0.02490463 0.02451951
The extractPAAC()
function also provides the additionalarguments props
and customprops
, which aresimilar to those arguments for Moreau-Broto/Moran/Geary autocorrelationdescriptors. For their minor differences, please see?extracPAAC
. Users can specify the lambda parameter and theweighting factor with the arguments lambda
andw
.
Note: In the work of Kuo-Chen Chou, the definitionfor “normalized occurrence frequency” was not given. In this work, wedefine it as the occurrence frequency of amino acids in the sequencenormalized to 100%, and hence, our calculated values are not the same astheir values.
Amphiphilic pseudo-amino acid composition (APseAAC)
Amphiphilic Pseudo-Amino Acid Composition (APseAAC) was proposed inChou (2001). APseAAC is also recognized asthe type 2 pseudo-amino acid composition. The definitions ofthese qualities are similar to the PAAC descriptors. From \(H_1 (i)\) and \(H_2 (j)\) defined before, thehydrophobicity and hydrophilicity correlation functions are definedas:
\[\begin{align*}H_{i, j}^1 & = H_1 (i) H_1 (j)\\H_{i, j}^2 & = H_2 (i) H_2 (j)\end{align*}\]
From these qualities, sequence order factors can be defined as:
\[\begin{align*}\tau_1 & = \frac{1}{N-1} \sum_{i=1}^{N-1} H_{i, i+1}^1\\\tau_2 & = \frac{1}{N-1} \sum_{i=1}^{N-1} H_{i, i+1}^2\\\tau_3 & = \frac{1}{N-2} \sum_{i=1}^{N-2} H_{i, i+2}^1\\\tau_4 & = \frac{1}{N-2} \sum_{i=1}^{N-2} H_{i, i+2}^2\\ & \ldots \\\tau_{2 \lambda - 1} & = \frac{1}{N-\lambda} \sum_{i=1}^{N-\lambda}H_{i, i+\lambda}^1\\\tau_{2 \lambda} & = \frac{1}{N-\lambda} \sum_{i=1}^{N-\lambda}H_{i, i+\lambda}^2\end{align*}\]
Figure 7: A schematic diagram to show(a1/a2) the first-rank,(b1/b2) the second-rank and(c1/c2) the third-ranksequence-order-coupling mode along a protein sequence through ahydrophobicity/hydrophilicity correlation function, where \(H_{i, j}^1\) and \(H_{i, j}^2\) are given by Equation (3).Panel (a1/a2) reflects the coupling mode between all the most contiguousresidues, panel (b1/b2) between all the second-most contiguous residues,and panel (c1/c2) between all the third-most contiguous residues. Thisfigure is from Chou (2005).
Then a set of descriptors called Amphiphilic Pseudo-Amino AcidComposition (APseAAC) are defined as:
\[P_c = \frac{f_c}{\sum_{r=1}^{20} f_r + w \sum_{j=1}^{2 \lambda} \tau_j}\quad (1 < c < 20)\]
\[P_c = \frac{w \tau_u}{\sum_{r=1}^{20} f_r + w \sum_{j=1}^{2 \lambda}\tau_j} \quad (21 < u < 20 + 2 \lambda)\]
where \(w\) is the weighting factor,its default value is set to \(w = 0.5\)in protr.
A minimal example for extracAPAAC()
is:
extractAPAAC(x)#> Pc1.A Pc1.R Pc1.N #> 3.537412e+01 3.930458e+01 2.161752e+01 #> Pc1.D Pc1.C Pc1.E #> 2.849582e+01 3.733935e+01 2.653059e+01 #> Pc1.Q Pc1.G Pc1.H #> 2.653059e+01 4.520027e+01 1.670445e+01 #> Pc1.I Pc1.L Pc1.K #> 1.965229e+01 4.225242e+01 2.161752e+01 #> Pc1.M Pc1.F Pc1.P #> 6.878302e+00 1.768706e+01 2.947844e+01 #> Pc1.S Pc1.T Pc1.W #> 4.913073e+01 2.456536e+01 1.277399e+01 #> Pc1.Y Pc1.V Pc2.Hydrophobicity.1 #> 2.358275e+01 2.751321e+01 2.196320e-04 #> Pc2.Hydrophilicity.1 Pc2.Hydrophobicity.2 Pc2.Hydrophilicity.2 #> 1.025766e-03 -3.088876e-04 -1.834385e-04 #> Pc2.Hydrophobicity.3 Pc2.Hydrophilicity.3 Pc2.Hydrophobicity.4 #> 1.174146e-03 7.400156e-04 -1.105715e-03 #> Pc2.Hydrophilicity.4 Pc2.Hydrophobicity.5 Pc2.Hydrophilicity.5 #> -4.493680e-04 1.766358e-03 1.471212e-03 #> Pc2.Hydrophobicity.6 Pc2.Hydrophilicity.6 Pc2.Hydrophobicity.7 #> -1.441572e-03 -4.913600e-03 -1.678053e-05 #> Pc2.Hydrophilicity.7 Pc2.Hydrophobicity.8 Pc2.Hydrophilicity.8 #> 7.312356e-04 -1.885399e-03 -1.928708e-03 #> Pc2.Hydrophobicity.9 Pc2.Hydrophilicity.9 Pc2.Hydrophobicity.10 #> -2.931177e-03 -1.555660e-03 2.916597e-03 #> Pc2.Hydrophilicity.10 Pc2.Hydrophobicity.11 Pc2.Hydrophilicity.11 #> 3.602591e-03 1.055082e-04 8.697920e-04 #> Pc2.Hydrophobicity.12 Pc2.Hydrophilicity.12 Pc2.Hydrophobicity.13 #> -9.276413e-04 -2.001384e-03 1.705044e-03 #> Pc2.Hydrophilicity.13 Pc2.Hydrophobicity.14 Pc2.Hydrophilicity.14 #> 4.364007e-03 7.883453e-04 -9.441693e-04 #> Pc2.Hydrophobicity.15 Pc2.Hydrophilicity.15 Pc2.Hydrophobicity.16 #> -3.133437e-04 -3.599332e-03 3.689079e-05 #> Pc2.Hydrophilicity.16 Pc2.Hydrophobicity.17 Pc2.Hydrophilicity.17 #> 2.483867e-03 4.832798e-04 2.465788e-03 #> Pc2.Hydrophobicity.18 Pc2.Hydrophilicity.18 Pc2.Hydrophobicity.19 #> -3.142728e-04 2.021961e-03 6.421283e-05 #> Pc2.Hydrophilicity.19 Pc2.Hydrophobicity.20 Pc2.Hydrophilicity.20 #> -8.896690e-04 -2.986886e-04 9.304039e-04 #> Pc2.Hydrophobicity.21 Pc2.Hydrophilicity.21 Pc2.Hydrophobicity.22 #> -6.777458e-04 1.646818e-03 3.193506e-03 #> Pc2.Hydrophilicity.22 Pc2.Hydrophobicity.23 Pc2.Hydrophilicity.23 #> 3.270656e-03 2.533569e-03 2.478252e-03 #> Pc2.Hydrophobicity.24 Pc2.Hydrophilicity.24 Pc2.Hydrophobicity.25 #> -2.489106e-03 -1.031008e-03 -3.992322e-03 #> Pc2.Hydrophilicity.25 Pc2.Hydrophobicity.26 Pc2.Hydrophilicity.26 #> -2.596060e-03 8.690771e-04 -1.221378e-03 #> Pc2.Hydrophobicity.27 Pc2.Hydrophilicity.27 Pc2.Hydrophobicity.28 #> 5.208649e-03 4.617400e-03 -1.088584e-03 #> Pc2.Hydrophilicity.28 Pc2.Hydrophobicity.29 Pc2.Hydrophilicity.29 #> -2.512263e-03 1.387641e-03 2.060890e-03 #> Pc2.Hydrophobicity.30 Pc2.Hydrophilicity.30 #> 3.177340e-04 1.451909e-03
This function has the same arguments asextractPAAC()
.
Profile-based descriptors
The profile-based descriptors for protein sequences are available inthe protr package. The feature vectors of profile-based methods werebased on the PSSM by running PSI-BLAST and often show good performance.See Ye, Wang, and Altschul (2011) andRangwala and Karypis (2005) for details.The functions extractPSSM()
, extractPSSMAcc()
,and extractPSSMFeature()
are used to generate thesedescriptors. Users need to install the NCBI-BLAST+ software packagefirst to make the functions fully functional.
Proteochemometric modeling descriptors
Proteochemometric (PCM) modeling utilizes statistical modeling tomodel the ligand-target interaction space. The below descriptorsimplemented in protr are extensively used in proteochemometricmodeling.
Scales-based descriptors derived by Principal ComponentsAnalysis
- Scales-based descriptors derived by Amino Acid Properties fromAAindex (Protein Fingerprint)
- Scales-based descriptors derived by 20+ classes of 2D and 3Dmolecular descriptors (Topological, WHIM, VHSE, etc.)
Scales-based descriptors derived by Factor Analysis
Scales-based descriptors derived by MultidimensionalScaling
BLOSUM and PAM matrix-derived descriptors
Note that every scales-based descriptor function can freely combinemore than 20 classes of 2D and 3D molecular descriptors to constructhighly customized scales-based descriptors. Of course, these functionsare designed to be flexible enough that users can provide totallyself-defined property matrices to construct scales-baseddescriptors.
For example, to compute the “topological scales” derived by PCA(using the first 5 principal components), one can useextractDescScales()
:
x <- readFASTA(system.file( "protseq/P00750.fasta", package = "protr"))[[1]]descscales <- extractDescScales( x, propmat = "AATopo", index = c(37:41, 43:47), pc = 5, lag = 7, silent = FALSE)#> Summary of the first 5 principal components: #> PC1 PC2 PC3 PC4 PC5#> Standard deviation 2.581537 1.754133 0.4621854 0.1918666 0.08972087#> Proportion of Variance 0.666430 0.307700 0.0213600 0.0036800 0.00080000#> Cumulative Proportion 0.666430 0.974130 0.9954900 0.9991700 0.99998000
The propmat
argument invokes the AATopo
dataset shipped with the protr package. The argument index
selects the 37 to 41 and the 43 to 47 columns (molecular descriptors) inthe AATopo
dataset to use. The parameter lag
was set for the Auto Cross Covariance (ACC) to generate scales-baseddescriptors of the same length. At last, we printed the summary of thefirst 5 principal components (standard deviation, proportion ofvariance, cumulative proportion of variance).
The result is a length 175 named vector, which is consistent with thedescriptors before:
length(descscales)#> [1] 175head(descscales, 15)#> scl1.lag1 scl2.lag1 scl3.lag1 scl4.lag1 scl5.lag1 #> -2.645644e-01 -1.717847e-02 1.975438e-02 -7.930659e-05 -3.710597e-05 #> scl1.lag2 scl2.lag2 scl3.lag2 scl4.lag2 scl5.lag2 #> 3.548612e-01 1.343712e-01 5.699395e-03 -5.489472e-04 -6.364577e-05 #> scl1.lag3 scl2.lag3 scl3.lag3 scl4.lag3 scl5.lag3 #> 2.011431e-02 -9.211136e-02 -1.461755e-03 6.747801e-04 2.386782e-04
For another example, to compute the descriptors derived by theBLOSUM62 matrix and use the first 5 scales, one can use:
x <- readFASTA(system.file( "protseq/P00750.fasta", package = "protr"))[[1]]blosum <- extractBLOSUM( x, submat = "AABLOSUM62", k = 5, lag = 7, scale = TRUE, silent = FALSE)#> Relative importance of all the possible 20 scales: #> [1] 1.204960e+01 7.982007e+00 6.254364e+00 4.533706e+00 4.326286e+00#> [6] 3.850579e+00 3.752197e+00 3.538207e+00 3.139155e+00 2.546405e+00#> [11] 2.373286e+00 1.666259e+00 1.553126e+00 1.263685e+00 1.024699e+00#> [16] 9.630187e-01 9.225759e-01 7.221636e-01 1.020085e-01 -4.592648e-17
The result is a length 175 named vector:
length(blosum)#> [1] 175head(blosum, 15)#> scl1.lag1 scl2.lag1 scl3.lag1 scl4.lag1 scl5.lag1 #> 0.0042370555 -0.0021502057 0.0005993291 0.0006456375 0.0014849592 #> scl1.lag2 scl2.lag2 scl3.lag2 scl4.lag2 scl5.lag2 #> -0.0014919096 0.0032873726 0.0011734162 -0.0021758536 -0.0018127568 #> scl1.lag3 scl2.lag3 scl3.lag3 scl4.lag3 scl5.lag3 #> -0.0029413528 0.0001494193 0.0003298806 -0.0017877430 -0.0051044133
Dealing with gaps. In proteochemometrics (sequencealignment), gaps can be very useful since a gap in a certain positioncontains information. The protr package has built-in support for suchgaps. We deal with the gaps by using a dummy descriptor to code for the21st type of amino acid. The function extractScalesGap()
and extractProtFPGap()
can be used to deal with such gaps.See ?extractScalesGap
and ?extractProtFPGap
for details.
Similarity calculation by sequence alignment
Similarity computation derived by local or global protein sequencealignment between a list of protein sequences is of great need inprotein research. However, this type of pairwise similarity computationis often computationally intensive, especially when many proteinsequences exist. Luckily, this process is also highly parallelizable.The protr package integrates the function of parallelized similaritycomputation derived by local or global protein sequence alignmentbetween a list of protein sequences.
The function twoSeqSim()
calculates the alignment resultbetween two protein sequences. The function parSeqSim()
calculates the pairwise similarity with a list of protein sequences inparallel:
s1 <- readFASTA(system.file("protseq/P00750.fasta", package = "protr"))[[1]]s2 <- readFASTA(system.file("protseq/P08218.fasta", package = "protr"))[[1]]s3 <- readFASTA(system.file("protseq/P10323.fasta", package = "protr"))[[1]]s4 <- readFASTA(system.file("protseq/P20160.fasta", package = "protr"))[[1]]s5 <- readFASTA(system.file("protseq/Q9NZP8.fasta", package = "protr"))[[1]]plist <- list(s1, s2, s3, s4, s5)psimmat <- parSeqSim(plist, cores = 4, type = "local", submat = "BLOSUM62")psimmat
## [,1] [,2] [,3] [,4] [,5]## [1,] 1.00000000 0.11825938 0.10236985 0.04921696 0.03943488## [2,] 0.11825938 1.00000000 0.18858241 0.12124217 0.06391103## [3,] 0.10236985 0.18858241 1.00000000 0.05819984 0.06175942## [4,] 0.04921696 0.12124217 0.05819984 1.00000000 0.05714638## [5,] 0.03943488 0.06391103 0.06175942 0.05714638 1.00000000
Similarly, the function crossSetSim()
calculates thepairwise similarity between two sets of protein sequences inparallel.
Note that for a small number of proteins, calculating their pairwisesimilarity scores derived by sequence alignment in parallel may notsignificantly reduce the overall computation time since each task onlyrequires a relatively short time to finish. Thus, computationaloverheads may exist and affect the performance. In our benchmarks, weused about 1,000 protein sequences on 64 CPU cores and observedsignificant performance improvement compared to the sequentialcomputation.
Scaling up for a large number of sequences
batches
. A useful argument for reducingmemory usage is batches
. It splits the similaritycomputations into smaller batches. A larger batch number reduces thememory need for each parallel process but also adds the overhead offorking new processes. This is a trade-off between time (CPU) and space(memory). Also, use verbose = TRUE
to print the computationprogress.
parSeqSimDisk()
. For similaritycomputations over an even larger number of protein sequences, please tryparSeqSimDisk()
and set batches
to a largenumber. Unlike the in-memory version, this disk-based function cachesthe partial results in each batch to the hard drive and merges all theresults together in the end, which further reduces memory usage.
Example. Let’s assume we have 20,000 sequences. Bysetting cores = 64
and batches = 10000
inparSeqSimDisk()
, the workload will be around20000^2/2/64/10000 = 312.5
alignments per batch per core.This could give you relatively quick feedback about whether the memorysize is going to be sufficient, and how long the computation will takein total after the first few batches.
Similarity calculation by GO semantic similarity measures
The protr package also integrates the function for similarity scorecomputation derived by Gene Ontology (GO) semantic similarity measuresbetween a list of GO terms or Entrez Gene IDs.
The function twoGOSim()
calculates the similarityderived by GO terms semantic similarity measures between two GO terms /Entrez Gene IDs, and the function parGOSim()
calculates thepairwise similarity with a list of GO terms / Entrez Gene IDs:
# By GO Termsgo1 <- c( "GO:0005215", "GO:0005488", "GO:0005515", "GO:0005625", "GO:0005802", "GO:0005905") # AP4B1go2 <- c( "GO:0005515", "GO:0005634", "GO:0005681", "GO:0008380", "GO:0031202") # BCAS2go3 <- c( "GO:0003735", "GO:0005622", "GO:0005840", "GO:0006412") # PDE4DIPgolist <- list(go1, go2, go3)parGOSim(golist, type = "go", ont = "CC", measure = "Wang")
## [,1] [,2] [,3]## [1,] 1.000 0.344 0.17## [2,] 0.344 1.000 0.24## [3,] 0.170 0.240 1.00
# By Entrez gene idgenelist <- list(c("150", "151", "152", "1814", "1815", "1816"))parGOSim(genelist, type = "gene", ont = "BP", measure = "Wang")
## 150 151 152 1814 1815 1816## 150 1.000 0.702 0.725 0.496 0.570 0.455## 151 0.702 1.000 0.980 0.456 0.507 0.551## 152 0.725 0.980 1.000 0.460 0.511 0.538## 1814 0.496 0.456 0.460 1.000 0.687 0.473## 1815 0.570 0.507 0.511 0.687 1.000 0.505## 1816 0.455 0.551 0.538 0.473 0.505 1.000
Miscellaneous tools
In this section, we will briefly introduce some useful tools theprotr package provides.
Retrieve protein sequences from UniProt
This function getUniProt()
gets protein sequences fromuniprot.org by protein ID(s). The input ID
is a charactervector specifying the protein ID(s). The returned sequences are storedin a list:
ids <- c("P00750", "P00751", "P00752")prots <- getUniProt(ids)prots
## [[1]]## [1] "MDAMKRGLCCVLLLCGAVFVSPSQEIHARFRRGARSYQVICRDEKTQMIYQQHQSWLRPVLRSNRVEYCWCN## SGRAQCHSVPVKSCSEPRCFNGGTCQQALYFSDFVCQCPEGFAGKCCEIDTRATCYEDQGISYRGTWSTAESGAECT## NWNSSALAQKPYSGRRPDAIRLGLGNHNYCRNPDRDSKPWCYVFKAGKYSSEFCSTPACSEGNSDCYFGNGSAYRGT## HSLTESGASCLPWNSMILIGKVYTAQNPSAQALGLGKHNYCRNPDGDAKPWCHVLKNRRLTWEYCDVPSCSTCGLRQ## YSQPQFRIKGGLFADIASHPWQAAIFAKHRRSPGERFLCGGILISSCWILSAAHCFQERFPPHHLTVILGRTYRVVP## GEEEQKFEVEKYIVHKEFDDDTYDNDIALLQLKSDSSRCAQESSVVRTVCLPPADLQLPDWTECELSGYGKHEALSP## FYSERLKEAHVRLYPSSRCTSQHLLNRTVTDNMLCAGDTRSGGPQANLHDACQGDSGGPLVCLNDGRMTLVGIISWG## LGCGQKDVPGVYTKVTNYLDWIRDNMRP"#### [[2]]## [1] "MGSNLSPQLCLMPFILGLLSGGVTTTPWSLARPQGSCSLEGVEIKGGSFRLLQEGQALEYVCPSGFYPYPVQ## TRTCRSTGSWSTLKTQDQKTVRKAECRAIHCPRPHDFENGEYWPRSPYYNVSDEISFHCYDGYTLRGSANRTCQVNG## RWSGQTAICDNGAGYCSNPGIPIGTRKVGSQYRLEDSVTYHCSRGLTLRGSQRRTCQEGGSWSGTEPSCQDSFMYDT## PQEVAEAFLSSLTETIEGVDAEDGHGPGEQQKRKIVLDPSGSMNIYLVLDGSDSIGASNFTGAKKCLVNLIEKVASY## GVKPRYGLVTYATYPKIWVKVSEADSSNADWVTKQLNEINYEDHKLKSGTNTKKALQAVYSMMSWPDDVPPEGWNRT## RHVIILMTDGLHNMGGDPITVIDEIRDLLYIGKDRKNPREDYLDVYVFGVGPLVNQVNINALASKKDNEQHVFKVKD## MENLEDVFYQMIDESQSLSLCGMVWEHRKGTDYHKQPWQAKISVIRPSKGHESCMGAVVSEYFVLTAAHCFTVDDKE## HSIKVSVGGEKRDLEIEVVLFHPNYNINGKKEAGIPEFYDYDVALIKLKNKLKYGQTIRPICLPCTEGTTRALRLPP## TTTCQQQKEELLPAQDIKALFVSEEEKKLTRKEVYIKNGDKKGSCERDAQYAPGYDKVKDISEVVTPRFLCTGGVSP## YADPNTCRGDSGGPLIVHKRSRFIQVGVISWGVVDVCKNQKRQKQVPAHARDFHINLFQVLPWLKEKLQDEDLGFL"#### [[3]]## [1] "APPIQSRIIGGRECEKNSHPWQVAIYHYSSFQCGGVLVNPKWVLTAAHCKNDNYEVWLGRHNLFENENTAQF## FGVTADFPHPGFNLSLLKXHTKADGKDYSHDLMLLRLQSPAKITDAVKVLELPTQEPELGSTCEASGWGSIEPGPDB## FEFPDEIQCVQLTLLQNTFCABAHPBKVTESMLCAGYLPGGKDTCMGDSGGPLICNGMWQGITSWGHTPCGSANKPS## IYTKLIFYLDWINDTITENP"
Read FASTA files
The function readFASTA()
provides a convenient way toread protein sequences stored in FASTA format files. See?readFASTA
for details. The returned sequences are storedin a named list, whose components are named with the protein sequences’names.
Read PDB files
The Protein Data Bank (PDB) file format is a text file format thatdescribes the three-dimensional structures of proteins. ThereadPDB()
function allows you to read protein sequencesstored in PDB format files. See ?readPDB
for details.
Sanity check for amino acid types
The protcheck()
function checks if the proteinsequence’s amino acid types are in the 20 default types, which returnsTRUE
if all the amino acids in the sequence belong to the20 default types:
x <- readFASTA(system.file("protseq/P00750.fasta", package = "protr"))[[1]]# A real sequenceprotcheck(x)#> [1] TRUE# An artificial sequenceprotcheck(paste(x, "Z", sep = ""))#> [1] FALSE
Remove gaps from sequences
The removeGaps()
function allows us to remove/replacegaps (-
) or any “irregular” characters from amino acidsequences to prepare them for feature extraction or sequencealignment-based similarity computation.
Protein sequence partition
The protseg()
function partitions the protein sequencesto create sliding windows. This is usually required when creatingfeature vectors for machine learning tasks. Users can specify a sequencex
, a character aa
(one of the 20 amino acidtypes), and a positive integer k
, which controls the windowsize (half of the window).
This function returns a named list. Each component contains one ofthe segmentations (a character string). Names of the list components arethe positions of the specified amino acid in the sequence.
protseg(x, aa = "M", k = 5)#> $`48`#> [1] "DEKTQMIYQQH"#> #> $`242`#> [1] "LPWNSMILIGK"#> #> $`490`#> [1] "TVTDNMLCAGD"#> #> $`525`#> [1] "LNDGRMTLVGI"
Auto cross covariance computation
Auto Cross Covariance (ACC) is extensively used in the scales-baseddescriptors computation, this approach calculates the auto covarianceand auto cross covariance for generating scale-based descriptors of thesame length. Users can write their own scales-based descriptor functionswith the help of acc()
function in the protr package.
Pre-computed 2D and 3D descriptor sets for the 20 amino acids
The protr package ships with more than 20 pre-computed 2D and 3Ddescriptor sets for the 20 amino acids to use with the scales-baseddescriptors. Please use data(package = "protr")
to list allthe datasets included in the protr package.
BLOSUM and PAM matrices for the 20 amino acids
The BLOSUM and PAM matrices for the 20 amino acids can be used tocalculate BLOSUM and PAM matrix-derived descriptors with functionextractBLOSUM()
, the datasets are named inAABLOSUM45
, AABLOSUM50
,AABLOSUM62
, AABLOSUM80
,AABLOSUM100
, AAPAM30
, AAPAM40
,AAPAM70
, AAPAM120
, andAAPAM250
.
Metadata of the 20 amino acids
As the reference, the AAMetaInfo
dataset containsmetadata of the 20 amino acids used for the 2D and 3D descriptorcalculation in the protr package. This dataset includes each aminoacid’s name, one-letter representation, three-letter representation,SMILE representation, PubChem CID, and PubChem link. Seedata(AAMetaInfo)
for details.
ProtrWeb
The web application built on protr, namely ProtrWeb, is available athttp://protr.org.
ProtrWeb does not require the user to have any programming knowledge.It is a user-friendly web application that computes the protein sequencedescriptors in the protr package.
Figure 8: A screenshot of the web application ProtrWeb.
Source code repository for this Shiny web application: https://github.com/nanxstats/protrweb.
Summary
The summary of the descriptors in the protr package is listed inTable 3.
Descriptor Group | Descriptor Name | Descriptor Dimension | Function Name |
---|---|---|---|
Amino Acid Composition | Amino Acid Composition | 20 | extractAAC() |
Dipeptide Composition | 400 | extractDC() | |
Tripeptide Composition | 8000 | extractTC() | |
Autocorrelation | Normalized Moreau-Broto Autocorrelation | 240\(^1\) | extractMoreauBroto() |
Moran Autocorrelation | 240\(^1\) | extractMoran() | |
Geary Autocorrelation | 240\(^1\) | extractGeary() | |
CTD | Composition | 21 | extractCTDC(), extractCTDCClass() |
Transition | 21 | extractCTDT(), extractCTDTClass() | |
Distribution | 105 | extractCTDD(), extractCTDDClass() | |
Conjoint Triad | Conjoint Triad | 343 | extractCTriad(), extractCTriadClass() |
Quasi-Sequence-Order | Sequence-Order-Coupling Number | 60\(^2\) | extractSOCN() |
Quasi-Sequence-Order Descriptors | 100\(^2\) | extractQSO() | |
Pseudo-Amino Acid Composition | Pseudo-Amino Acid Composition | 50\(^3\) | extractPAAC() |
Amphiphilic Pseudo-Amino Acid Composition | 80\(^4\) | extractAPAAC() |
1 The number depends on the choice of the number ofproperties of amino acids and the choice of the maximum values of thelag
. The default is to use 8 types of properties andlag = 30
.
2 The number depends on the maximum value oflag
. By default, lag = 30
. Two distancematrices are used, so the descriptor dimension is \(30 \times 2 = 60\) and \((20 + 30) \times 2 = 100\).
3 The number depends on the choice of the number of the setof amino acid properties and the choice of the \(\lambda\) value. The default is to use 3types of properties proposed by Kuo-Chen Chou and \(\lambda = 30\).
4 The number depends on the choice of the \(\lambda\) vlaue. The default is \(\lambda = 30\).
The scales-based PCM descriptors in the protr package are listed inTable 4.
Derived by | Descriptor Class | Function Name | |
---|---|---|---|
Principal Components Analysis | Scales-based descriptors derived by PrincipalComponents Analysis | extractScales(), extractScalesGap() | |
Scales-based descriptors derived by amino acidproperties from AAindex (a.k.a. Protein Fingerprint) | extractProtFP(), extractProtFPGap() | ||
Scales-based descriptors derived by 2D and 3D moleculardescriptors (Topological, WHIM, VHSE, etc.) | extractDescScales() | ||
Factor Analysis | Scales-based descriptors derived by FactorAnalysis | extractFAScales() | |
Multidimensional Scaling | Scales-based descriptors derived by MultidimensionalScaling | extractMDSScales() | |
Substitution Matrix | BLOSUM and PAM matrix-derived descriptors | extractBLOSUM() |
The amino acid descriptor sets used by scales-based descriptorsprovided by the protr package are listed in Table 5. Note that thenon-informative descriptors (like the descriptors have only one valueacross all the 20 amino acids) in these datasets have already beenfiltered out.
Dataset Name | Descriptor Set Name | Dimensionality | Calculated by |
---|---|---|---|
AA2DACOR | 2D Autocorrelations Descriptors | 92 | Dragon |
AA3DMoRSE | 3D-MoRSE Descriptors | 160 | Dragon |
AAACF | Atom-Centred Fragments Descriptors | 6 | Dragon |
AABurden | Burden Eigenvalues Descriptors | 62 | Dragon |
AAConn | Connectivity Indices Descriptors | 33 | Dragon |
AAConst | Constitutional Descriptors | 23 | Dragon |
AAEdgeAdj | Edge Adjacency Indices Descriptors | 97 | Dragon |
AAEigIdx | Eigenvalue-Based Indices Descriptors | 44 | Dragon |
AAFGC | Functional Group Counts Descriptors | 5 | Dragon |
AAGeom | Geometrical Descriptors | 41 | Dragon |
AAGETAWAY | GETAWAY Descriptors | 194 | Dragon |
AAInfo | Information Indices Descriptors | 47 | Dragon |
AAMolProp | Molecular Properties Descriptors | 12 | Dragon |
AARandic | Randic Molecular Profiles Descriptors | 41 | Dragon |
AARDF | RDF Descriptors | 82 | Dragon |
AATopo | Topological Descriptors | 78 | Dragon |
AATopoChg | Topological Charge Indices Descriptors | 15 | Dragon |
AAWalk | Walk and Path Counts Descriptors | 40 | Dragon |
AAWHIM | WHIM Descriptors | 99 | Dragon |
AACPSA | CPSA Descriptors | 41 | Accelrys Discovery Studio |
AADescAll | All the 2D Descriptors Calculated by Dragon | 1171 | Dragon |
AAMOE2D | All the 2D Descriptors Calculated by MOE | 148 | MOE |
AAMOE3D | All the 3D Descriptors Calculated by MOE | 143 | MOE |
References
Chou, Kuo-Chen. 2000. “Prediction of Protein Subcellar Locationsby Incorporating Quasi-Sequence-Order Effect.” Biochemicaland Biophysical Research Communications 278: 477–83.
———. 2001. “Prediction of Protein Cellular Attributes UsingPseudo-Amino Acid Composition.” PROTEINS: Structure,Function, and Genetics 43: 246–55.
———. 2005. “Using Amphiphilic Pseudo Amino Acid Composition toPredict Enzyme Subfamily Classes.” Bioinformatics 21(1): 10–19.
Chou, Kuo-Chen, and Hong-Bin Shen. 2008.“Cell-PLoc: A Package of Web Servers forPredicting Subcellular Localization of Proteins in VariousOrganisms.” Nature Protocols 3 (2): 153–62.
Dubchak, Inna, Ilya Muchink, Stephen R. Holbrook, and Sung-Hou Kim.1995. “Prediction of Protein Folding Class Using GlobalDescription of Amino Acid Sequence.” Proceedings of theNational Academy of Sciences 92: 8700–8704.
Dubchak, Inna, Ilya Muchink, Christopher Mayor, Igor Dralyuk, andSung-Hou Kim. 1999. “Recognition of a Protein Fold in the Contextof the SCOP Classification.” Proteins:Structure, Function and Genetics 35: 401–7.
Grantham, R. 1974. “Amino Acid Difference Formula to Help ExplainProtein Evolution.” Science 185: 862–64.
Kawashima, S., and M. Kanehisa. 2000. “AAindex: AminoAcid Index Database.” Nucleic Acids Research 28: 374.
Kawashima, S., H. Ogata, and M. Kanehisa. 1999.“AAindex: Amino Acid Index Database.”Nucleic Acids Research 27: 368–69.
Kawashima, S., P. Pokarowski, M. Pokarowska, A. Kolinski, T. Katayama,and M. Kanehisa. 2008. “AAindex: Amino Acid IndexDatabase (Progress Report).” Nucleic Acids Research 36:D202–5.
Rangwala, Huzefa, and George Karypis. 2005. “Profile-Based DirectKernels for Remote Homology Detection and Fold Recognition.”Bioinformatics 21 (23): 4239–47.
Schneider, Gisbert, and Paul Wrede. 1994. “The Rational Design ofAmino Acid Sequences by Artificial Neural Networks and SimulatedMolecular Evolution: Do Novo Design of an Idealized Leader CleavageSite.” Biophysical Journal 66: 335–44.
Shen, J. W., J. Zhang, X. M. Luo, W. L. Zhu, K. Q. Yu, K. X. Chen, Y. X.Li, and H. L. Jiang. 2007. “Predicting Protein-ProteinInteractions Based Only on Sequences Information.”Proceedings of the National Academy of Sciences 104: 4337–41.
Xiao, Nan, Dong-Sheng Cao, Min-Feng Zhu, and Qing-Song Xu. 2015.“protr/ProtrWeb:R Package and Web Server for Generating Various NumericalRepresentation Schemes of Protein Sequences.”Bioinformatics 31 (11): 1857–59.
Ye, Xugang, Guoli Wang, and Stephen F Altschul. 2011. “AnAssessment of Substitution Scores for Protein Profile–ProfileComparison.” Bioinformatics 27 (24): 3356–63.