Tropsha 4 5 05

Published on November 24, 2007

Author: Dabby

Source: authorstream.com

Content

Quantitative Genotype Phenotype Relationships (QGPR): Can we learn from Quantitative Structure Activity Relationships (QSAR) modeling?
Alexander Tropsha, Sasha Golbraikh, Scott Oloff, Raed Khashan
Laboratory for Molecular Modeling, School of Pharmacy
The unbearable lightness of “predictive” modeling

The relationship between target property and attributes (descriptors):
Objects   Target Property   Attributes (Descriptors)
Comp.1    Value1            D1  D2  D3  D4
Comp.2    Value2            "   "   "   "
Comp.3    Value3            "   "   "   "
...
Comp.N    ValueN            "   "   "   "
{TP} = K̂{Attributes}

Predictive biological data modeling: focus on validation:
- QSPR is an empirical data modeling exercise:
  - Choice of statistical data modeling techniques
  - Choice of descriptor types
  - VALIDATE both internally and externally
- Non-linear methods with variable selection using stochastic optimization techniques to determine context-dependent descriptors
- Integrated workflow for predictive QSPR modeling
- Some simple validation techniques and (an example of) the applicability domain definition
- Examples of studies in QSPR and QGPR areas
- IT issues

Components of QSPR Modeling:
- Target properties
  - Continuous (e.g., weight)
  - Categorical unrelated (e.g., different phenotypes)
  - Categorical related (e.g., subranges described as classes)
- Descriptors (or independent variables)
  - Continuous (allows distance based similarity)
  - Categorical related (allows distance based similarity)
  - Categorical unrelated (genotypes; special similarity metrics)
- Correlation methods (with and w/o variable selection)
  - Linear (e.g., LR, MLR, PCR, PLS)
  - Non-linear (e.g., kNN, RP, ANN, SVM)
- Validation and prediction
  - Internal (training set) vs. external (test set)

Slide5: VARIABLE SELECTION kNN QSAR*
1. Randomly select a subset of descriptors (HDP)
2. LEAVE-ONE-OUT CROSS-VALIDATION:
   - Exclude a compound
   - Predict activity ŷ of the excluded compound as the weighted average of activities of 1 to K nearest neighbors
   - Calculate the predictive ability (q2) of the “model”
3. SIMULATED ANNEALING: modify descriptor subset
4. Select the best QSAR model for nvar and K
*Zheng, W. and Tropsha, A. JCICS., 2000; 40; 185-194

Slide6: Predictive R2 versus cross-validated R2 (q2) for QSAR models with q2 > 0.5, using the common definition (e.g., [3]) of training and test sets.
Dataset: 31 Cramer steroids [1] (benchmark to investigate novel QSAR methods [2])
Training set: compounds 1-21; test set: compounds 22-31
Training set: compounds 1-12 and 23-31; test set: compounds 13-22
BEWARE OF q2!!! (Golbraikh & Tropsha, J. Mol. Graphics Mod. 2002, 20, 269-276.)
1. Cramer, R.D. III, Patterson, D.E., Bunce, J.D. Comparative molecular field analysis (CoMFA). 1. Effect of shape on binding of steroids to carrier proteins. J. Am. Chem. Soc. 1988, 110, 5959-5967.
2. Coats, E.A. The CoMFA steroids as a benchmark data set for development of 3D QSAR methods. In 3D QSAR in Drug Design. V.3. Kubinyi, H., Folkers, G., Martin, Y.C., Eds. Kluwer/ESCOM: Dordrecht, 1998, pp 199-213.
3. Kubinyi, H.; Hamprecht, F.A. & Mietzner, T. Three-Dimensional Quantitative Similarity-Activity Relationships (3D QSiAR) from SEAL Similarity Matrices. J. Med. Chem. 1998, 41, 2553-2564.

COMPONENTS OF PREDICTIVE QSAR MODELING WORKFLOW*:
- Model Building: combination of various descriptor sets and variable selection data modeling methods (Combi-QSAR)
- Model Validation
  - Y-randomization
  - Training and test set selection
  - Applicability domain
  - Evaluation of external predictive power
*Tropsha, A., Gramatica, P., Gombar, V. The importance of being earnest:… Quant. Struct. Act. Relat. Comb. Sci. 2003, 22, 69-77.
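The leave-one-out loop from the variable selection kNN QSAR slide can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the Euclidean metric, the inverse-distance weighting, and the function names `knn_predict` and `loo_q2` are all hypothetical choices, and q2 is computed with the common definition 1 - PRESS/SS. The descriptor subset search by simulated annealing is not shown.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two descriptor vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(x, train_X, train_y, k):
    """Predict activity as a weighted average over the k nearest neighbors.

    Assumption: weights are inverse distances; the slide only says
    "weighted average" and does not give the weighting scheme.
    """
    neighbors = sorted(
        (euclidean(x, xi), yi) for xi, yi in zip(train_X, train_y)
    )[:k]
    # If some neighbors coincide with x, average those to avoid dividing by 0.
    exact = [y for d, y in neighbors if d == 0.0]
    if exact:
        return sum(exact) / len(exact)
    weights = [1.0 / d for d, _ in neighbors]
    return sum(w * y for w, (_, y) in zip(weights, neighbors)) / sum(weights)


def loo_q2(X, y, k):
    """Leave-one-out cross-validated R2: q2 = 1 - PRESS / SS_total."""
    y_mean = sum(y) / len(y)
    press = 0.0
    for i in range(len(X)):
        # Exclude compound i, then predict it from the remaining compounds.
        X_rest = X[:i] + X[i + 1:]
        y_rest = y[:i] + y[i + 1:]
        y_hat = knn_predict(X[i], X_rest, y_rest, k)
        press += (y[i] - y_hat) ** 2
    ss_total = sum((yi - y_mean) ** 2 for yi in y)
    return 1.0 - press / ss_total


# Toy data (made up for illustration): one descriptor, four compounds.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0.0, 1.0, 2.0, 3.0]
print(loo_q2(X, y, k=2))
```

A high q2 from this loop says nothing by itself about external predictivity, which is the point of the "BEWARE OF q2" slide: the same model still has to be checked on compounds that were never used in training.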
Activity randomization:
[Diagram: structures Struc.1, Struc.2, Struc.3, ..., Struc.n paired with randomly reassigned property values Pro.1, Pro.2, Pro.3, ..., Pro.n]

RATIONAL SELECTION OF MULTIPLE TRAINING AND TEST SETS*:
*Golbraikh et al., J. Comp. Aid. Mol. Design 2003, 17, 241–253.

Slide10: DEFINING THE APPLICABILITY DOMAIN
Training set: 60 compounds
Test set: 35 compounds
MODEL: two nearest neighbors
The number of descriptors: 8
Q2(CV) = 0.57, R2 = 0.67
DISTANCES: <D>train = 0.287, StDev(D)train = s = 0.149
Closest nearest neighbors of test set compounds: Dtest ≤ <D>train + s × ZCutOff (ZCutOff = 0.5)
N is the total number of distances (Ntrain = 60 × 2 = 120; Ntest = 70)
Ni is the number of distances in each category (bin)

Slide11: Criteria for Predictive QSAR Model.
CRITERIA: correlation coefficient; coefficients of determination; regression; regression through the origin

QSPR modeling process revisited:
GENET-, GENOM-, PROTEOM-, BIOINFORMAT-, MEDINFORMAT-, CHEMOGENOM-, CHEMOINFORMAT-, PROTEOCHEMOMETR- + -ICS
“-ics”: an old Latin suffix that means “way too much”
COMBINATORIAL QSPRomics, or C-Qics

COMBINATORIAL QSPRomics (C-Qics):
SAR Dataset
Compound representation: COMFA descriptors, MolconnZ descriptors, Chirality descriptors, Volsurf descriptors, Comma descriptors, MOE descriptors, Dragon descriptors
QSAR modeling: KNN (MML), BINARY QSAR, …, SVM (MML), DECISION TREE
Selection of best models
Model validation using external test set and Y-Randomization

Predictive QSAR Workflow:
Only accept models that have q2 > 0.6, R2 > 0.6, etc.
Flowchart boxes: Original Dataset; Split into Training and Test Sets; Multiple Training Sets; Multiple Test Sets; Combi-QSAR Modeling; Y-Randomization; Activity Prediction; Validated Predictive Models with High Internal & External Accuracy; Database Screening

Slide15: STructure-Activity Relationships for the Design of Molecules (STARDOM™): WORKFLOW
Legend: programs, functions, user input
User input: Input Structure File; Input Descriptor File; Input Activity File; Database to Screen
Descriptor Generation:
  Convert Structures: dbtranslate, Babel, etc.
  Generate Descriptors: MolconnZ, GenAP, etc.
  Normalize Descriptors: Utility (UNC)
Descriptor formatting:
  Reformat Descriptors: MolconnZ-ToDescr (UNC)
Build & Test Models:
  Train & Test Set Selection: SE8 (UNC)
  Randomize Activities: Randomize (UNC)
  QSAR Algorithm: RWKNN, SAPLS (UNC), etc.
  Predict Test Set: KNNPredict, SAPLSPred (UNC), etc.
Report & Visualize Results:
  Compile Results: ModStat (UNC)
  Visualize Results: Weblab, TSAR, MOE, Spotfire, etc.
Screen Database:
  Normalize Descriptors: Utility
  Mine Database: DBMine, KNNPredict, etc.
  Visualize Hits: TSAR, MOE, Spotfire, etc.
QSAR Model(s)

Predictive QSAR workflow as an automated grid application (currently based on IBM’s middleware):
Browser; Portal Server; WebSphere Application Server; WebSphere Work Flow; Java Wrappers
Applications run on the Computer Grid: kNN, SVM’s, etc.; kNNPredict, SVMPredict, etc.
Relational Database (DB2 or Oracle); File Database (Data Grid)
Screening of Compound Databases
Visualization Tools (Spotfire, ChemDraw, Chime, etc.)
KEY (↔): Initial Model Building Flow; Screen Database Flow; Data Retrieval and Visualization

Slide17: EXAMPLE 1: COMBINATORIAL QSAR OF AMBERGRIS FRAGRANCE COMPOUNDS*
Odor descriptions of the example compounds:
- Amber, woody, cedarwood, animal, strong
- Amber woody, camphoraceous, spicy, weak
- Amber, exotic woody, animal
- Strong amber
- Amber woody, sea water
- Amber, camphoraceous
*Kovatcheva A., et al. J. Chem. Inf. Comp. Sci., 2004, 44, 582-95

Slide18: TOTAL PREDICTION ACCURACY FOR THE TEST SET USING BEST ACTUAL & RANDOMIZED MODELS

Example 3. Consensus QSAR models for the prediction of Ames genotoxicity*:
- 3,363 diverse compounds (including >300 drugs) tested for their Ames genotoxicity
- 60% mutagens, 40% non mutagens
- 148 initial topological descriptors
- ANN, kNN, Decision Forest (DF) methods
- 2963 compounds in the training set, 400 compounds (39 drugs) in randomly selected test set
*Votano JR, Parham M, Hall LH, Kier LB, Oloff S, Tropsha A, Xie Q, Tong W. Mutagenesis, 2004, 19, 365-77.

Comparison of GenTox prediction for 30 drugs in external test set

Content-dependent descriptor types identified by different models (LogP was never selected)

Effect of applicability domain on the prediction accuracy of kNN QSAR

Genomic Butterfly Spot Dataset:
- 2000 data examples with presence or lack of phenotype.
- 6 developmental loci result in the phenotype.
- 30 additional loci added as noise.
HYPOTHESIS: Our well developed QSPR-omics methodologies can be accurately applied to QGPR to identify the developmental loci

kNN Results (Traditional):
70-90% training set accuracy; however, phenotypes were predicted differently with identical selected descriptor values

New “k”NN for QGPR:
If more than “k” elements have identical selected descriptors, then we average all of those elements rather than the first “k”.
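The tie rule of the new “k”NN can be illustrated with a short Python sketch. Everything beyond the rule itself is an assumption made for illustration: the Hamming distance on categorical genotype descriptors, the 0/1 phenotype encoding, the generalization from "identical descriptors" to "any training element tied at the k-th nearest distance", and the function names `hamming` and `tie_aware_knn`.

```python
def hamming(a, b):
    """Number of positions at which two categorical descriptor vectors differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def tie_aware_knn(query, train_X, train_y, k):
    """Average the phenotype over the k nearest neighbors, keeping ALL ties.

    A traditional kNN keeps the first k elements after sorting, so elements
    with identical descriptors can be included or dropped arbitrarily.
    Here every element whose distance equals the k-th smallest distance is
    kept, so identical descriptor vectors always contribute together.
    Assumes k >= 1 and a non-empty training set.
    """
    ranked = sorted(
        ((hamming(query, x), y) for x, y in zip(train_X, train_y)),
        key=lambda pair: pair[0],
    )
    kth_distance = ranked[min(k, len(ranked)) - 1][0]
    votes = [y for dist, y in ranked if dist <= kth_distance]
    return sum(votes) / len(votes)


# Toy genotypes (made up): three identical vectors with conflicting phenotypes.
train_X = [("A", "B"), ("A", "B"), ("A", "B"), ("a", "B")]
train_y = [1, 1, 0, 0]
print(tie_aware_knn(("A", "B"), train_X, train_y, k=1))
```

With k = 1 a traditional kNN would return the phenotype of whichever identical element happens to be sorted first; the tie-aware version returns the average over all three identical elements, so objects with identical selected descriptors always receive the same prediction.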
ONLY the descriptors c_source and c_thresh were found to be relevant

Slide26: SVM Classification

Slide27: SVM Classification

Descriptors Found Identified by SVM

Recursive Partitioning using DTReg:
www.dtreg.com

Random Forests using DTReg:
[Figure label: Shuffled]

Difficult Data Structures to model:
- “k”NN works well
- SVM-RBF works well
- Decision Trees: no correlation
- Random Forest: no correlation

CLASSIFICATION ACCURACY CRITERIA AS TARGET FUNCTIONS IN QSAR:
Alexander Golbraikh
April 5, 2005

Slide33: 2x2 CONFUSION MATRIX AND MEASURES OF CLASSIFICATION ACCURACY
               ACTUAL(+)   ACTUAL(-)   TOTAL
PREDICTED(+)   A           B           A+B
PREDICTED(-)   C           D           C+D
TOTAL          A+C         B+D         N=A+B+C+D

Measure                           Formula             Mark
Prevalence                        (A+C)/N
Overall diagnostic power          (B+D)/N
Correct classification rate       (A+D)/N
Sensitivity (Ss)                  A/(A+C)             +
Specificity (Sp)                  D/(B+D)             +
False positive rate               B/(B+D)             +
False negative rate               C/(A+C)             +
Positive predictive power (PPP)   A/(A+B)
Negative predictive power (NPP)   D/(C+D)
Misclassification rate            (B+C)/N
Odds ratio                        (AD)/(BC)           +
Enrichment                        AN/[(A+B)(A+C)]
Kappa                             {(A+D)/N - [(A+C)(A+B)+(B+D)(C+D)]/N^2} / {1 - [(A+C)(A+B)+(B+D)(C+D)]/N^2}
Fielding, A.H.; Bell, J.F. Environmental Conservation 1997, 24 (1), 38-49.

Slide34: DRAWBACK OF SOME CHARACTERISTICS
                Actual (+)   Actual (-)   Total
Predicted (+)   60           6            66
Predicted (-)   20           14           34
Total           80           20           100
PPP = 60/66 = 0.91; Prev = 80/100 = 0.80; E = 0.91/0.80 = 1.14

                Actual (+)   Actual (-)   Total
Predicted (+)   6            6            12
Predicted (-)   2            14           16
Total           8            20           28
PPP = 6/12 = 0.50; Prev = 8/28 = 0.29; E = 0.50/0.29 = 1.72

Slide35: NORMALIZED CONFUSION MATRICES
                Actual (+)   Actual (-)   Total
Predicted (+)   42/70        60/340       42/70+60/340
Predicted (-)   28/70        280/340      28/70+280/340
Total           70/70        340/340      70/70+340/340

                Actual (+)   Actual (-)   Total
Predicted (+)   0.60         0.18         0.78
Predicted (-)   0.40         0.82         1.22
Total           1            1            2
PPP = 0.60/0.78 = 0.77; Prev = 1/2 = 0.50; E = 0.77/0.50 = 1.54

Slide36: THE NORMALIZED CONFUSION MATRIX AND CLASSIFICATION ACCURACY MEASURES
                Actual(+)   Actual(-)   Total
Predicted(+)    A/(A+C)     B/(B+D)     A/(A+C)+B/(B+D)
Predicted(-)    C/(A+C)     D/(B+D)     C/(A+C)+D/(B+D)
Total           1           1           2

Measure                             Formula     Mark
Prevalence                          (A+C)/N
Overall diagnostic power            (B+D)/N
Correct classification rate (CCR)               +
Sensitivity (Ss)                    A/(A+C)     +
Specificity (Sp)                    D/(B+D)     +
False positive rate                 B/(B+D)     +
False negative rate                 C/(A+C)     +
Positive predictive power (PPP)                 +
Negative predictive power (NPP)                 +
Misclassification rate                          +
Odds ratio                          (AD)/(BC)   +
Enrichment                                      +
Kappa                                           +

CLASSIFICATION QSAR: nxn NORMALIZED CONFUSION MATRIX

CLASSIFICATION ACCURACY: CONSIDERATIONS:
- Many parameters used for evaluation of classification accuracy cannot be used as characteristics of QSAR models, because they depend on the size of each class.
- These parameters become independent of the size of each class, if they are calculated using normalized confusion matrices.
- n^2 - n linearly independent parameters are necessary to fully characterize the performance of classification accuracy algorithms.
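The class-size dependence noted in these considerations can be checked numerically. The sketch below uses the enrichment formula AN/[(A+B)(A+C)] from Slide33 and the per-class normalization from Slide35 and Slide36; the function names `enrichment` and `normalize` are hypothetical, and the tenfold-larger negative class in the last call is a made-up variation added only to show the invariance.

```python
def enrichment(a, b, c, d):
    """Enrichment E = PPP / prevalence = A*N / [(A+B)(A+C)]."""
    n = a + b + c + d
    return a * n / ((a + b) * (a + c))


def normalize(a, b, c, d):
    """Divide each cell by its actual-class total so each class sums to 1.

    Returns the normalized cells in the same (A, B, C, D) order.
    """
    return a / (a + c), b / (b + d), c / (a + c), d / (b + d)


# First matrix of Slide34: A=60, B=6, C=20, D=14.
print(enrichment(60, 6, 20, 14))

# Slide35 counts: A=42, B=60, C=28, D=280 (70 actual positives, 340 negatives).
raw = (42, 60, 28, 280)
print(enrichment(*raw))              # computed from raw counts
print(enrichment(*normalize(*raw)))  # computed from the normalized matrix

# Same per-class rates, but ten times as many actual negatives (made-up data):
bigger = (42, 600, 28, 2800)
print(enrichment(*bigger))              # raw value changes with class size
print(enrichment(*normalize(*bigger)))  # normalized value does not
```

The normalized enrichment for the Slide35 matrix is 17/11 ≈ 1.545; the slide reports 1.54 because it divides the rounded PPP of 0.77 by 0.50. Only the normalized value stays the same when the negative class is enlarged.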
- When we are not interested in the classes to which misclassified compounds are assigned, the n diagonal elements of the normalized confusion matrix are sufficient to estimate the algorithm performance.
- A set of criteria that good classification models must satisfy was established.

Decision Tree (MOE): data:
Dataset 1 and 2: 2000 objects, 36 descriptors
- Training set: 1200 objects (used for learning a tree); Class 1: 600 objects, Class 2: 600 objects
- Internal test set: 400 objects (used for pruning a tree); Class 1: 200 objects, Class 2: 200 objects
- External test set: 400 objects (used for prediction); Class 1: 200 objects, Class 2: 200 objects

Decision Tree (MOE): parameters:
Protocol: separate test sample
Descriptors included: 36 or 34
Node Split Size: 10
Max. Sample Size: 255
Max. Tree Depth: 10
Best Tree Thresh: 1.0, 0.8*, 0.6*, 0.4*
Use Priors
* With 34 descriptors only

Slide41: Decision Tree (MOE): results
All 36 descriptors
The trees included only two descriptors: c_source and c_thresh
Prediction accuracy for BOTH DATASETS:*
- Training + Internal Test sets: 100%
- External Test set: 100%
* Result has been checked using EXCEL: pairs of c_source and c_thresh values uniquely define the object class for the whole datasets!

Slide42: Decision Tree (MOE): results
34 descriptors: c_source and c_thresh were excluded
Prediction accuracy for BOTH DATASETS: BAD

Decision Tree (MOE): conclusions:
- A model based on only two descriptors, c_source and c_thresh, predicts the classes with an accuracy of 100%.
- There are no other important descriptors in the dataset.

Summary:
- Predictive QSPR workflow affords statistically significant models which can be used directly for database mining.
- Extensive model validation is a must!
- Consensus screening is more effective than using single models.
- Model building should be an ongoing process concurrent with experimental validation and model enrichment → integrated workflows

The public has an insatiable curiosity to know everything, except what is worth knowing. Oscar Wilde

ACKNOWLEDGMENTS:
UNC ASSOCIATES
Current:
  QSAR group: Alex GOLBRAIKH, Raed KHASHAN, Scott OLOFF, Kun Wang, Mei Wang, Chris Grulke, Jun FENG, Yun-De XIAO, Yuanyuan QIAO, Patricia LIMA, Assia KOVACHEVA, M. KARTHIKEYAN
  Protein structure group: John GRIER, Luke HUAN, Ruchir SHAH, Shuxing ZHANG, Shuquan ZONG, Peter Itskowitz
Former: Stephen CAMMER, Sung Jin CHO, Weifan ZHENG, Min SHEN, Bala KRISHNAMOORTHY
Funding: NIH, NSF, NCI-BSF, Berlex, IBM, MCNC, GSK, Inspire, Millennium, Ortho-McNeil
