Quantitative structure–activity relationship: Difference between revisions

Cheminfo (talk | contribs)
Created by translating the page "Quantitative structure–activity relationship"
Line 15:


== SAR and the SAR paradox ==
The basic assumption behind all molecule-based hypotheses is that similar molecules have similar activities. This principle is called the structure–activity relationship (SAR). The underlying problem is therefore how to define a ''small'' difference at the molecular level, because each kind of activity, for example [https://ko.wikipedia.org/w/화학%20반응 reactivity], biotransformation ability, [https://ko.wikipedia.org/w/용해도 solubility], or target activity, may depend on a different difference. Good examples are given in the bioisosterism reviews by Patani/LaVoie<ref name="pmid11848856">{{저널 인용|제목=Bioisosterism: A Rational Approach in Drug Design|저널=Chemical Reviews|날짜=Dec 1996|권=96|호=8|쪽=3147–3176|doi=10.1021/cr950066q|pmid=11848856}}</ref> and Brown.<ref>Nathan Brown. </ref>


In general, one is more interested in finding strong trends. Any [[가설|hypothesis]] that is set up always rests on a [[유한 집합|finite]] amount of chemical data. The [[귀납|induction principle]] should therefore be respected, to avoid overfitted hypotheses and to avoid deriving overfitted or uninterpretable conclusions from structural/molecular data.
Line 30:


An advanced approach to fragment- or group-based QSAR, based on the concept of pharmacophore similarity, has been developed. This method, pharmacophore-similarity-based QSAR (PS-QSAR), uses topological pharmacophoric descriptors to develop QSAR models. The resulting activity prediction may help assess the contribution of certain pharmacophore features, encoded by the respective fragments, toward activity improvement and/or detrimental effects.<ref name="Kumar_2013">{{저널 인용|제목=Pharmacophore-similarity-based QSAR (PS-QSAR) for group-specific biological activity predictions|저널=Journal of Biomolecular Structure & Dynamics|날짜=November 2013|권=33|호=1|쪽=56–69|doi=10.1080/07391102.2013.849618|pmid=24266725}}</ref>

=== 3D-QSAR ===
The acronym '''3D-QSAR''' or '''3-D QSAR''' refers to the application of force field calculations requiring three-dimensional structures of a given set of small molecules with known activities (training set). The training set needs to be superimposed (aligned) using either experimental data (e.g. based on ligand-protein [[결정학|crystallography]]) or molecule superimposition software. It uses computed potentials, e.g. the Lennard-Jones potential, rather than experimental constants and is concerned with the overall molecule rather than a single substituent. The first 3-D QSAR was named Comparative Molecular Field Analysis (CoMFA) by Cramer et al. It examined the steric fields (shape of the molecule) and the electrostatic fields,<ref name="isbn0-582-38210-6">{{서적 인용|제목=Molecular modelling: principles and applications|연도=2001|출판사=Prentice Hall|위치=Englewood Cliffs, N.J|isbn=0-582-38210-6}}</ref> which were correlated by means of partial least squares regression (PLS).

The created data space is then usually reduced by a following feature extraction (see also dimensionality reduction). The following learning method can be any of the already mentioned [[기계 학습|machine learning]] methods, e.g. [[서포트 벡터 머신|support vector machines]].<ref name="isbn0-262-19509-7">{{서적 인용|제목=Kernel methods in computational biology|연도=2004|출판사=MIT Press|위치=Cambridge, Mass|isbn=0-262-19509-7}}</ref> An alternative approach uses multiple-instance learning by encoding molecules as sets of data instances, each of which represents a possible molecular conformation. A label or response is assigned to each set corresponding to the activity of the molecule, which is assumed to be determined by at least one instance in the set (i.e. some conformation of the molecule).<ref>{{저널 인용|제목=Solving the multiple instance problem with axis-parallel rectangles|저널=Artificial Intelligence|연도=1997|권=89|호=1–2|쪽=31–71|doi=10.1016/S0004-3702(96)00034-3}}</ref>

On June 18, 2011 the Comparative Molecular Field Analysis (CoMFA) patent dropped any restriction on the use of GRID and partial least-squares (PLS) technologies, and the Rome Center for Molecular Design (RCMD) team ([http://dctf.uniroma1.it/dipartimento/persone/docenti/professori-associati/ragno-rino www.rcmd.it]) opened a 3-D QSAR web server ([http://www.3d-qsar.com/ www.3d-qsar.com]). In October 2016 the 3D QSAR web server was updated and four basic web applications were opened to the public: Py-MolEdit, Py-ConfSearch, Py-Align and Py-CoMFA. The prefix Py stands for Python, as both the web site and the applications were developed in the [[파이썬|Python]] language. The four applications make it possible to build a 3-D QSAR model from scratch knowing only the training set structures and bioactivities. The www.3D-QSAR.com server includes all the features needed to analyze the molecular interaction fields (MIFs) and all the 3-D QSAR maps interactively in 3-D.

=== Chemical descriptor based ===
In this approach, descriptors quantifying various electronic, geometric, or steric properties of a molecule are computed and used to develop a QSAR.<ref>{{저널 인용|제목=Catalyst design: knowledge extraction from high-throughput experimentation|저널=J. Catal.|연도=2003|권=216|쪽=3776–3777|doi=10.1016/S0021-9517(02)00036-2}}</ref> It differs from the fragment (or group contribution) approach in that the descriptors are computed for the system as a whole rather than from the properties of individual fragments. It differs from the 3D-QSAR approach in that the descriptors are computed from scalar quantities (e.g., energies, geometric parameters) rather than from 3D fields.

An example of this approach is the QSARs developed for olefin polymerization by half sandwich compounds.<ref name="pmid17348648">{{저널 인용|제목=Structure-activity correlation in titanium single-site olefin polymerization catalysts containing mixed cyclopentadienyl/aryloxide ligation|저널=Journal of the American Chemical Society|날짜=Apr 2007|권=129|호=13|쪽=3776–7|doi=10.1021/ja0640849|pmid=17348648}}</ref><ref name="Organometallics2012">{{저널 인용|제목=Structure–Activity Correlation for Relative Chain Initiation to Propagation Rates in Single-Site Olefin Polymerization Catalysis|저널=Organometallics|연도=2012|권=31|호=2|쪽=602–618|doi=10.1021/om200884x}}</ref>

== Modeling ==
In the literature it can often be found that chemists prefer partial least squares (PLS) methods,{{출처}} since PLS applies feature extraction and [[귀납|induction]] in one step.

=== Data mining approach ===
Computer SAR models typically calculate a relatively large number of features. Because those lack structural interpretation ability, the preprocessing steps face a feature selection problem (i.e., which structural features should be interpreted to determine the structure-activity relationship). Feature selection can be accomplished by visual inspection (qualitative selection by a human); by data mining; or by molecule mining.

A typical [[데이터 마이닝|data mining]] based prediction uses, for example, [[서포트 벡터 머신|support vector machines]], [[결정 트리|decision trees]], or [[인공신경망|neural networks]] for [[귀납|inducing]] a predictive learning model.

Molecule mining approaches, a special case of structured data mining approaches, apply a similarity matrix based prediction or an automatic fragmentation scheme into molecular substructures. Furthermore, there exist also approaches using maximum common subgraph searches or graph kernels.<ref name="isbn0-521-58519-8">{{서적 인용|제목=Algorithms on strings, trees, and sequences: computer science and computational biology|연도=1997|출판사=Cambridge University Press|위치=Cambridge, UK|isbn=0-521-58519-8}}</ref><ref name="isbn0-8247-2397-X">{{서적 인용|제목=Predictive toxicology|연도=2005|출판사=Taylor & Francis|위치=Washington, DC|isbn=0-8247-2397-X}}</ref>
[[파일:QSAR-protocol.jpg|오른쪽|섬네일|500x500픽셀]]

=== Matched molecular pair analysis ===
Typically, QSAR models derived from nonlinear [[기계 학습|machine learning]] are seen as a "black box" that fails to guide medicinal chemists. The relatively new concept of matched molecular pair analysis (MMPA), or prediction-driven MMPA, is coupled with a QSAR model in order to identify activity cliffs.<ref name="Prediction-driven matched molecular pairs to interpret QSARs and aid the molecular optimization process">http://www.jcheminf.com/content/6/1/48/</ref>

== Evaluation of the quality of QSAR models ==
QSAR modeling produces predictive models derived from application of statistical tools correlating biological activity (including desirable therapeutic effect and undesirable side effects) or physico-chemical properties in QSPR models of chemicals (drugs/toxicants/environmental pollutants) with descriptors representative of [[분자기하|molecular structure]] or properties. QSARs are being applied in many disciplines, for example: [[리스크 평가제도|risk assessment]], toxicity prediction, and regulatory decisions<ref name="Tong_2005">{{저널 인용|제목=Assessing QSAR Limitations – A Regulatory Perspective|저널=Current Computer-Aided Drug Design|날짜=April 2005|권=1|호=2|쪽=195&ndash;205|doi=10.2174/1573409053585663}}</ref> in addition to drug discovery and lead optimization.<ref name="pmid13677480">{{저널 인용|제목=In silico prediction of drug toxicity|저널=Journal of Computer-Aided Molecular Design|연도=2003|권=17|호=2-4|쪽=119–27|bibcode=2003JCAMD..17..119D|doi=10.1023/A:1025361621494|pmid=13677480}}</ref> Obtaining a good quality QSAR model depends on many factors, such as the quality of input data, the choice of descriptors and statistical methods for modeling and for validation. Any QSAR modeling should ultimately lead to statistically robust and predictive models capable of making accurate and reliable predictions of the modeled response of new compounds.

For validation of QSAR models, usually various strategies are adopted:<ref name="isbn3-527-30044-9">{{서적 인용|제목=Chemometric methods in molecular design|연도=1995|편집자-성=Waterbeemd, Han van de|출판사=VCH|위치=Weinheim|쪽=309&ndash;318|장=Statistical validation of QSAR results|isbn=3-527-30044-9}}</ref>
# internal validation or cross-validation (cross-validation is a measure of model robustness: the more robust a model is (higher q2), the less the original model is perturbed when data are left out);
# external validation by splitting the available data set into a training set for model development and a prediction set for checking model predictivity;
# blind external validation by applying the model to new external data; and
# data randomization or Y-scrambling for verifying the absence of chance correlation between the response and the modeling descriptors.
The success of any QSAR model depends on accuracy of the input data, selection of appropriate descriptors and statistical tools, and most importantly validation of the developed model. Validation is the process by which the reliability and relevance of a procedure are established for a specific purpose; for QSAR models validation must be mainly for robustness, prediction performances and applicability domain (AD) of the models.<ref name="Roy2007">{{저널 인용|제목=On some aspects of validation of predictive quantitative structure-activity relationship models|저널=Expert Opinion on Drug Discovery|날짜=Dec 2007|권=2|호=12|쪽=1567–77|doi=10.1517/17460441.2.12.1567}}</ref>

Some validation methodologies can be problematic. For example, ''leave-one-out'' cross-validation generally leads to an overestimation of predictive capacity. Even with external validation, it is difficult to determine whether the selection of training and test sets was manipulated to maximize the predictive capacity of the model being published.

Different aspects of validation of QSAR models that need attention include methods for selecting training set compounds,<ref>{{저널 인용|제목=On selection of training and test sets for the development of predictive QSAR models|저널=QSAR & Combinatorial Science|연도=2006|권=25|호=3|쪽=235–251|doi=10.1002/qsar.200510161}}</ref> setting the training set size,<ref>{{저널 인용|제목=Exploring the impact of size of training sets for the development of predictive QSAR models|저널=Chemometrics and Intelligent Laboratory Systems|연도=2008|권=90|호=1|쪽=31–42|doi=10.1016/j.chemolab.2007.07.004}}</ref> and the impact of variable selection<ref name="pmid17933600">{{저널 인용|제목=Review on modelling aspects in reversed-phase liquid chromatographic quantitative structure-retention relationships|저널=Analytica Chimica Acta|날짜=Oct 2007|권=602|호=2|쪽=164–72|doi=10.1016/j.aca.2007.09.014|pmid=17933600}}</ref> for training set models in determining the quality of prediction. The development of novel validation parameters for judging the quality of QSAR models is also important.<ref name="Roy_2009">{{저널 인용|제목=On two novel parameters for validation of predictive QSAR models|저널=Molecules|연도=2009|권=14|호=5|쪽=1660–701|doi=10.3390/molecules14051660|pmid=19471190}}</ref><ref name="pmid21800825">{{저널 인용|제목=Real external predictivity of QSAR models: how to evaluate it? Comparison of different validation criteria and proposal of using the concordance correlation coefficient|저널=Journal of Chemical Information and Modeling|날짜=Sep 2011|권=51|호=9|쪽=2320–35|doi=10.1021/ci200211n|pmid=21800825}}</ref>

=== Chemical ===
One of the first historical QSAR applications was to predict [[끓는점|boiling points]].<ref name="isbn0-85626-454-7">{{서적 인용|제목=Chemical graph theory: introduction and fundamentals|연도=1991|출판사=Abacus Press|위치=Tunbridge Wells, Kent, England|isbn=0-85626-454-7}}</ref>

It is well known, for instance, that within a particular family of [[화합물|chemical compounds]], especially in [[유기화학|organic chemistry]], there are strong [[상관분석|correlations]] between structure and observed properties. A simple example is the relationship between the number of carbons in [[알케인|alkanes]] and their [[끓는점|boiling points]]. The boiling point clearly increases with the number of carbons, and this serves as a means of predicting the boiling points of higher alkanes.

Still very interesting applications are the Hammett equation, the Taft equation, and pKa prediction methods.<ref name="RMC_2013">{{백과사전 인용}}</ref>

=== Biological ===
The biological activity of molecules is usually measured in assays to establish the level of inhibition of particular [[신호 전달|signal transduction]] or [[대사회로|metabolic pathways]]. Drug discovery often involves the use of QSAR to identify chemical structures that could have good inhibitory effects on specific targets and have low [[독성도|toxicity]] (non-specific activity). Of special interest is the prediction of partition coefficient log ''P'', which is an important measure used in identifying "druglikeness" according to Lipinski's Rule of Five.

While many quantitative structure–activity relationship analyses involve the interactions of a family of molecules with an [[효소|enzyme]] or [[수용체|receptor]] binding site, QSAR can also be used to study the interactions between the structural domains of proteins. Protein-protein interactions can be quantitatively analyzed for structural variations resulting from site-directed mutagenesis.<ref name="pmid12668435">{{저널 인용|제목=Structural modeling extends QSAR analysis of antibody-lysozyme interactions to 3D-QSAR|저널=Biophysical Journal|날짜=Apr 2003|권=84|호=4|쪽=2264–72|bibcode=2003BpJ....84.2264F|doi=10.1016/S0006-3495(03)75032-2|pmc=1302793|pmid=12668435}}</ref>

It is part of the [[기계 학습|machine learning]] method to reduce the risk for a SAR paradox, especially taking into account that only a finite amount of data is available (see also MVUE). In general, all QSAR problems can be divided into coding<ref name="isbn3-527-29913-0">{{서적 인용|제목=Handbook of Molecular Descriptors|연도=2002|출판사=Wiley-VCH|위치=Weinheim|isbn=3-527-29913-0}}</ref>
and [[학습|learning]].<ref name="isbn0-471-05669-3">{{서적 인용|제목=Pattern classification|연도=2001|출판사=John Wiley & Sons|위치=Chichester|isbn=0-471-05669-3}}</ref>

=== Applications ===
(Q)SAR models have been used for [[위험관리|risk management]]. QSARs are suggested by regulatory authorities; in the [[유럽 연합|European Union]], QSARs are suggested by the [[REACH]] regulation, where "REACH" abbreviates "Registration, Evaluation, Authorisation and Restriction of Chemicals".

The chemical descriptor space whose [[볼록 폐포|convex hull]] is generated by a particular training set of chemicals is called the training set's applicability domain. Prediction of properties of novel chemicals that are located outside the applicability domain uses [[보외법|extrapolation]], and so is less reliable (on average) than prediction within the applicability domain. The assessment of the reliability of QSAR predictions remains a research topic.

The QSAR equations can be used to predict biological activities of newer molecules before their synthesis.

Examples of machine learning tools for QSAR modeling include:<ref name="pmid25448759">{{저널 인용|제목=Machine-learning approaches in drug discovery: methods and applications|저널=Drug Discovery Today|url=https://dx.doi.org/10.6084/m9.figshare.3123040.v1|날짜=Mar 2015|권=20|호=3|쪽=318–31|doi=10.1016/j.drudis.2014.10.012|pmid=25448759}}</ref>
{| class="wikitable" style="margin-bottom: 10px;"
! S.No.
! Name
! Algorithms
! External link
|-
| 1.
| R
| RF, SVM, Naïve Bayesian, and ANN
| {{웹 인용|url=http://www.r-project.org/|제목=R: The R Project for Statistical Computing}}
|-
| 2.
| libSVM
| SVM
| {{웹 인용|url=https://www.csie.ntu.edu.tw/~cjlin/libsvm/|제목=LIBSVM -- A Library for Support Vector Machines}}
|-
| 3.
| Orange
| RF, SVM, and Naïve Bayesian
| {{웹 인용|url=http://www.ailab.si/orange/|제목=Orange Data Mining}}
|-
| 4.
| RapidMiner
| SVM, RF, Naïve Bayes, DT, ANN, and k-NN
| {{웹 인용|url=http://rapid-i.com/|제목=RapidMiner &#124; #1 Open Source Predictive Analytics Platform}}
|-
| 5.
| Weka
| RF, SVM, and Naïve Bayes
| {{웹 인용|url=http://www.cs.waikato.ac.nz/ml/weka/|제목=Weka 3 - Data Mining with Open Source Machine Learning Software in Java}}
|-
| 6.
| Knime
| DT, Naïve Bayes, and SVM
| {{웹 인용|url=http://www.knime.org/|제목=KNIME &#124; Open for Innovation}}
|-
| 7.
| AZOrange<ref name="pmid21798025">{{저널 인용|제목=AZOrange - High performance open source machine learning for QSAR modeling in a graphical programming environment|저널=Journal of Cheminformatics|연도=2011|권=3|쪽=28|doi=10.1186/1758-2946-3-28|pmc=3158423|pmid=21798025}}</ref>
| RT, SVM, ANN, and RF
| {{웹 인용|url=https://github.com/AZcompTox/AZOrange|제목=AZCompTox/AZOrange: AstraZeneca add-ons to Orange.|웹사이트=GitHub}}
|-
| 8.
| Tanagra
| SVM, RF, Naïve Bayes, and DT
| {{웹 인용|url=http://eric.univ-lyon2.fr/~ricco/tanagra/en/tanagra.html|제목=TANAGRA - A free DATA MINING software for teaching and research}}
|-
| 9.
| Elki
| k-NN
| {{웹 인용|url=http://elki.dbs.ifi.lmu.de/++|제목=ELKI Data Mining Framework}}
|-
| 10.
| MALLET
| {{웹 인용|url=http://mallet.cs.umass.edu/|제목=MALLET homepage}}
|-
| 11.
| MOA
| {{웹 인용|url=http://moa.cms.waikato.ac.nz/+|제목=MOA Massive Online Analysis &#124; Real Time Analytics for Data Streams}}
|}

== See also ==
{{Columns-list|2|*[[ADME]]
*[[Matched molecular pair analysis]]
*[[Cheminformatics]]
*[[Computer-assisted drug design]] (CADD)
*[[Conformation–activity relationship]]
*[[Differential solubility]]
*[[Molecular design software]]
*[[Partition coefficient]]
*[[Pharmacokinetics]]
*[[Pharmacophore]]
*''[[QSAR & Combinatorial Science]]'' &ndash; [[Scientific journal]]
*[[List of software for molecular mechanics modeling|Software for molecular mechanics modeling]]
* [[Chemicalize.org]]:[[Chemicalize.org#List of the predicted structure based properties|List of predicted structure based properties]]}}


== References ==

Revision as of 01:27, 15 June 2017 (Thu)

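Quantitative structure–activity relationship models (QSAR models) are regression or classification models used in chemistry, biology, and engineering. Like other regression models, QSAR regression models relate a set of predictor variables (X) to a numerical response variable (Y), while QSAR classification models relate the predictor variables to a categorical response variable.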

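In QSAR modeling, the predictors consist of physicochemical properties or theoretical molecular descriptors. The QSAR response variable can be a biological activity of the compounds. A QSAR model first summarizes a supposed relationship between chemical structure and biological activity, and is then used to predict the activities of new compounds.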

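A related term is the quantitative structure–property relationship (QSPR), in which a chemical property is used as the response variable.[1][2] Various properties and behaviors of chemical molecules are studied in the field of QSPR. Some examples are quantitative structure–reactivity relationships (QSRRs), quantitative structure–chromatography relationships (QSCRs), quantitative structure–toxicity relationships (QSTRs), quantitative structure–electrochemistry relationships (QSERs), and quantitative structure–biodegradability relationships (QSBRs).[3]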

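For example, biological activity can be expressed quantitatively as the concentration of a substance required to produce a certain biological response. Additionally, when physicochemical properties and chemical structures are expressed as numbers, one can find a mathematical relationship, or quantitative structure–activity relationship, between them. The mathematical expression, if carefully validated,[4][5][6] can then be used to predict the response of other chemical structures.[7]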

A QSAR has the form of a mathematical model:

  • Activity = f(physicochemical properties and/or structural properties) + error

The error includes model error (bias) and observational variability, that is, the variability within the observations themselves.
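The model form above can be made concrete with a minimal sketch. It assumes the simplest possible f, a straight line in a single descriptor, fitted by ordinary least squares; the descriptor and activity values are hypothetical and chosen only for illustration. The residuals play the role of the error term.

```python
# Hypothetical data: one descriptor value and one measured activity per compound.
descriptor = [1.0, 2.0, 3.0, 4.0, 5.0]
activity = [2.1, 3.9, 6.2, 7.8, 10.1]


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


intercept, slope = fit_line(descriptor, activity)

# Activity = f(descriptor) + error, so error = observed - predicted.
predicted = [intercept + slope * xi for xi in descriptor]
residuals = [yi - pi for yi, pi in zip(activity, predicted)]
```

Real QSAR models normally use many descriptors and more flexible functions; the single-descriptor line is only meant to show where f and the error term sit.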

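Essential steps in QSAR studies

The principal steps of QSAR/QSPR are (1) selection of the data set and extraction of structural and empirical descriptors, (2) variable selection, (3) model construction, and (4) validation and evaluation.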

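SAR and the SAR paradox

The basic assumption behind all molecule-based hypotheses is that similar molecules have similar activities. This principle is called the structure–activity relationship (SAR). The underlying problem is therefore how to define a small difference at the molecular level, because each kind of activity, for example reactivity, biotransformation ability, solubility, or target activity, may depend on a different difference. Good examples are given in the bioisosterism reviews by Patani/LaVoie[8] and Brown.[9]

In general, one is more interested in finding strong trends. Any hypothesis that is set up always rests on a finite amount of chemical data. The induction principle should therefore be respected, to avoid overfitted hypotheses and to avoid deriving overfitted or uninterpretable conclusions from structural/molecular data.

The SAR paradox refers to the fact that not all similar molecules have similar activities.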

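Types

Fragment based (group contribution)

Analogously, the "partition coefficient", a measure of differential solubility that is itself used as a predictor in QSAR predictions, can be predicted either by atomic methods (known as "XLogP" or "ALogP") or by chemical fragment methods (known as "CLogP" and other variations). It has been shown that the logP of a compound can be determined as the sum of the contributions of its fragments, and fragment-based methods are generally accepted as better predictors than atomic-based methods.[10] The fragment contributions are determined statistically from experimental data for known logP values. This method gives mixed results and is generally not trusted to be accurate to better than ±0.1 units.[11]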
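The additive idea behind fragment-based logP estimation can be sketched as follows. The fragment names and contribution values are hypothetical placeholders, not values from CLogP or any published fragment table, and real fragment methods also apply correction factors that are omitted here.

```python
# Hypothetical fragment contributions (illustrative only, not published values).
FRAGMENT_CONTRIBUTIONS = {"CH3": 0.5, "CH2": 0.5, "OH": -1.0}


def estimate_logp(fragment_counts, contributions=FRAGMENT_CONTRIBUTIONS):
    """Sum contribution * count over all fragments in the molecule.

    Raises KeyError if a fragment has no tabulated contribution.
    """
    return sum(contributions[name] * count
               for name, count in fragment_counts.items())


# Molecules described as fragment counts (a hypothetical fragmentation).
ethanol = {"CH3": 1, "CH2": 1, "OH": 1}
butanol = {"CH3": 1, "CH2": 3, "OH": 1}

ethanol_logp = estimate_logp(ethanol)
butanol_logp = estimate_logp(butanol)
```

In practice the contribution table would be obtained by regression against measured logP values, as the paragraph above describes.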

Group or Fragment based QSAR is also known as GQSAR. GQSAR allows flexibility to study various molecular fragments of interest in relation to the variation in biological response. The molecular fragments could be substituents at various substitution sites in congeneric set of molecules or could be on the basis of pre-defined chemical rules in case of non-congeneric sets. GQSAR also considers cross-terms fragment descriptors, which could be helpful in identification of key fragment interactions in determining variation of activity.[12] Lead discovery using Fragnomics is an emerging paradigm. In this context FB-QSAR proves to be a promising strategy for fragment library design and in fragment-to-lead identification endeavours.[13]

An advanced approach to fragment- or group-based QSAR, based on the concept of pharmacophore similarity, has been developed. This method, pharmacophore-similarity-based QSAR (PS-QSAR), uses topological pharmacophoric descriptors to develop QSAR models. The resulting activity prediction may help assess the contribution of certain pharmacophore features, encoded by the respective fragments, toward activity improvement and/or detrimental effects.[14]

3D-QSAR

The acronym 3D-QSAR or 3-D QSAR refers to the application of force field calculations requiring three-dimensional structures of a given set of small molecules with known activities (training set). The training set needs to be superimposed (aligned) using either experimental data (e.g. based on ligand-protein crystallography) or molecule superimposition software. It uses computed potentials, e.g. the Lennard-Jones potential, rather than experimental constants and is concerned with the overall molecule rather than a single substituent. The first 3-D QSAR was named Comparative Molecular Field Analysis (CoMFA) by Cramer et al. It examined the steric fields (shape of the molecule) and the electrostatic fields,[15] which were correlated by means of partial least squares regression (PLS).

The created data space is then usually reduced by a following feature extraction (see also dimensionality reduction). The following learning method can be any of the already mentioned machine learning methods, e.g. support vector machines.[16] An alternative approach uses multiple-instance learning by encoding molecules as sets of data instances, each of which represents a possible molecular conformation. A label or response is assigned to each set corresponding to the activity of the molecule, which is assumed to be determined by at least one instance in the set (i.e. some conformation of the molecule).[17]

On June 18, 2011 the Comparative Molecular Field Analysis (CoMFA) patent dropped any restriction on the use of GRID and partial least-squares (PLS) technologies, and the Rome Center for Molecular Design (RCMD) team (www.rcmd.it) opened a 3-D QSAR web server (www.3d-qsar.com). In October 2016 the 3D QSAR web server was updated and four basic web applications were opened to the public: Py-MolEdit, Py-ConfSearch, Py-Align and Py-CoMFA. The prefix Py stands for Python, as both the web site and the applications were developed in the Python language. The four applications make it possible to build a 3-D QSAR model from scratch knowing only the training set structures and bioactivities. The www.3D-QSAR.com server includes all the features needed to analyze the molecular interaction fields (MIFs) and all the 3-D QSAR maps interactively in 3-D.

Chemical descriptor based

In this approach, descriptors quantifying various electronic, geometric, or steric properties of a molecule are computed and used to develop a QSAR.[18] It differs from the fragment (or group contribution) approach in that the descriptors are computed for the system as a whole rather than from the properties of individual fragments. It differs from the 3D-QSAR approach in that the descriptors are computed from scalar quantities (e.g., energies, geometric parameters) rather than from 3D fields.

An example of this approach is the QSARs developed for olefin polymerization by half sandwich compounds.[19][20]

Modeling

In the literature it can often be found that chemists prefer partial least squares (PLS) methods,[citation needed] since PLS applies feature extraction and induction in one step.

Data mining approach

Computer SAR models typically calculate a relatively large number of features. Because those lack structural interpretation ability, the preprocessing steps face a feature selection problem (i.e., which structural features should be interpreted to determine the structure-activity relationship). Feature selection can be accomplished by visual inspection (qualitative selection by a human); by data mining; or by molecule mining.

A typical data mining based prediction uses, for example, support vector machines, decision trees, or neural networks for inducing a predictive learning model.

Molecule mining approaches, a special case of structured data mining approaches, apply a similarity matrix based prediction or an automatic fragmentation scheme into molecular substructures. Furthermore, there exist also approaches using maximum common subgraph searches or graph kernels.[21][22]

Matched molecular pair analysis

Typically, QSAR models derived from nonlinear machine learning are seen as a "black box" that fails to guide medicinal chemists. The relatively new concept of matched molecular pair analysis (MMPA), or prediction-driven MMPA, is coupled with a QSAR model in order to identify activity cliffs.[23]

Evaluation of the quality of QSAR models

QSAR modeling produces predictive models derived from application of statistical tools correlating biological activity (including desirable therapeutic effect and undesirable side effects) or physico-chemical properties in QSPR models of chemicals (drugs/toxicants/environmental pollutants) with descriptors representative of molecular structure or properties. QSARs are being applied in many disciplines, for example: risk assessment, toxicity prediction, and regulatory decisions[24] in addition to drug discovery and lead optimization.[25] Obtaining a good quality QSAR model depends on many factors, such as the quality of input data, the choice of descriptors and statistical methods for modeling and for validation. Any QSAR modeling should ultimately lead to statistically robust and predictive models capable of making accurate and reliable predictions of the modeled response of new compounds.

For validation of QSAR models, usually various strategies are adopted:[26]

  1. internal validation or cross-validation (cross-validation is a measure of model robustness: the more robust a model is (higher q2), the less the original model is perturbed when data are left out);
  2. external validation by splitting the available data set into a training set for model development and a prediction set for checking model predictivity;
  3. blind external validation by applying the model to new external data; and
  4. data randomization or Y-scrambling for verifying the absence of chance correlation between the response and the modeling descriptors.
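Two of the strategies listed above, leave-one-out cross-validation and Y-scrambling, can be sketched in a few lines. This is a minimal illustration under stated assumptions: the model is a one-descriptor straight line, the data are hypothetical, and q2 is taken as 1 − PRESS/SS_total with SS_total computed around the mean of all responses (other conventions exist).

```python
import random


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def q2_leave_one_out(x, y):
    """Cross-validated q2: refit without each compound, then predict it."""
    press = 0.0
    for i in range(len(x)):
        x_train = x[:i] + x[i + 1:]
        y_train = y[:i] + y[i + 1:]
        intercept, slope = fit_line(x_train, y_train)
        press += (y[i] - (intercept + slope * x[i])) ** 2
    mean_y = sum(y) / len(y)
    ss_total = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - press / ss_total


def y_scrambling(x, y, n_rounds=100, seed=0):
    """q2 values obtained after randomly permuting the responses."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_rounds):
        shuffled = y[:]
        rng.shuffle(shuffled)
        scores.append(q2_leave_one_out(x, shuffled))
    return scores


# Hypothetical descriptor values and responses.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.0, 4.1, 5.9, 8.2, 9.9, 12.1, 14.0, 15.8]

real_q2 = q2_leave_one_out(x, y)
scrambled_q2 = y_scrambling(x, y)
```

If the q2 of the real model is not clearly higher than the typical q2 of the scrambled models, the apparent correlation may be due to chance.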

The success of any QSAR model depends on accuracy of the input data, selection of appropriate descriptors and statistical tools, and most importantly validation of the developed model. Validation is the process by which the reliability and relevance of a procedure are established for a specific purpose; for QSAR models validation must be mainly for robustness, prediction performances and applicability domain (AD) of the models.[27]

Some validation methodologies can be problematic. For example, leave-one-out cross-validation generally leads to an overestimation of predictive capacity. Even with external validation, it is difficult to determine whether the selection of training and test sets was manipulated to maximize the predictive capacity of the model being published.

Different aspects of validation of QSAR models that need attention include methods for selecting training set compounds,[28] setting the training set size,[29] and the impact of variable selection[30] for training set models in determining the quality of prediction. The development of novel validation parameters for judging the quality of QSAR models is also important.[31][32]

Chemical

One of the first historical QSAR applications was to predict boiling points.[33]

It is well known, for instance, that within a particular family of chemical compounds, especially in organic chemistry, there are strong correlations between structure and observed properties. A simple example is the relationship between the number of carbons in alkanes and their boiling points. The boiling point clearly increases with the number of carbons, and this serves as a means of predicting the boiling points of higher alkanes.

Still very interesting applications are the Hammett equation, the Taft equation, and pKa prediction methods.[34]
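The alkane example can be illustrated with a rough sketch. The boiling points below are approximate, rounded literature values for the straight-chain alkanes methane through octane, and the straight-line model is an assumption made for simplicity: the real trend flattens with chain length, so a linear extrapolation to higher alkanes is only a crude estimate.

```python
# Approximate normal boiling points (degrees Celsius) of n-alkanes C1..C8.
carbons = [1, 2, 3, 4, 5, 6, 7, 8]
boiling_point_c = [-161.5, -88.6, -42.1, -0.5, 36.1, 68.7, 98.4, 125.7]


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


intercept, slope = fit_line(carbons, boiling_point_c)


def predict_boiling_point(n_carbons):
    """Linear extrapolation; expect growing error for longer chains."""
    return intercept + slope * n_carbons


nonane_estimate = predict_boiling_point(9)
```

A curved model, or a descriptor other than the raw carbon count, would follow the trend more closely; the point here is only that a structural count can act as a predictor.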

Biological

The biological activity of molecules is usually measured in assays to establish the level of inhibition of particular signal transduction or metabolic pathways. Drug discovery often involves the use of QSAR to identify chemical structures that could have good inhibitory effects on specific targets and have low toxicity (non-specific activity). Of special interest is the prediction of partition coefficient log P, which is an important measure used in identifying "druglikeness" according to Lipinski's Rule of Five.
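As a small illustration of how logP enters a druglikeness check, the sketch below counts violations of Lipinski's Rule of Five from four precomputed properties. The function name, argument names, and the example property values are hypothetical; computing the properties themselves from a structure is outside the scope of this sketch.

```python
def rule_of_five_violations(mol_weight, log_p, h_bond_donors, h_bond_acceptors):
    """Count how many of Lipinski's four criteria a compound violates."""
    checks = [
        mol_weight >= 500,        # molecular weight should be below 500 Da
        log_p > 5,                # logP should not exceed 5
        h_bond_donors > 5,        # at most 5 hydrogen-bond donors
        h_bond_acceptors > 10,    # at most 10 hydrogen-bond acceptors
    ]
    return sum(checks)


def is_druglike(mol_weight, log_p, h_bond_donors, h_bond_acceptors):
    """Common reading of the rule: no more than one violation."""
    return rule_of_five_violations(
        mol_weight, log_p, h_bond_donors, h_bond_acceptors) <= 1


# Hypothetical property values for two compounds.
small_compound = rule_of_five_violations(350.0, 2.5, 2, 5)
large_compound = rule_of_five_violations(650.0, 6.2, 6, 12)
```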

While many quantitative structure–activity relationship analyses involve the interactions of a family of molecules with an enzyme or receptor binding site, QSAR can also be used to study the interactions between the structural domains of proteins. Protein-protein interactions can be quantitatively analyzed for structural variations resulting from site-directed mutagenesis.[35]

Reducing the risk of a SAR paradox is part of the machine learning approach, particularly because only a finite amount of data is available (see also MVUE). In general, all QSAR problems can be divided into coding[36] and learning.[37]

Applications

(Q)SAR models have been used for risk management. QSARs are suggested by regulatory authorities; in the European Union, they are suggested by the REACH regulation, where "REACH" stands for "Registration, Evaluation, Authorisation and Restriction of Chemicals".

The region of chemical descriptor space enclosed by the convex hull of a particular training set of chemicals is called the training set's applicability domain. Prediction of properties of novel chemicals that are located outside the applicability domain relies on extrapolation, and so is less reliable (on average) than prediction within the applicability domain. The assessment of the reliability of QSAR predictions remains a research topic.
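
The convex-hull definition can be sketched in two dimensions. The example below is a minimal illustration, assuming only two descriptors per compound; the descriptor values and function names are hypothetical. It builds the hull of the training set with the monotone chain method and tests whether a query compound falls inside it.

```python
def cross(o, a, b):
    """Z-component of (a - o) x (b - o); positive for a left turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Monotone chain convex hull; returns vertices counter-clockwise."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other.
    return lower[:-1] + upper[:-1]


def in_domain(hull, query, tol=1e-12):
    """True if query lies inside or on the hull (counter-clockwise vertices)."""
    n = len(hull)
    if n < 3:
        return False  # degenerate hull: no area to be inside
    return all(cross(hull[i], hull[(i + 1) % n], query) >= -tol
               for i in range(n))


# Hypothetical training set: (descriptor 1, descriptor 2) per compound.
training = [(1.0, 1.5), (2.0, 3.0), (3.5, 2.5),
            (2.5, 1.0), (1.5, 2.0), (3.0, 3.5)]
hull = convex_hull(training)

print("inside: ", in_domain(hull, (2.2, 2.2)))
print("outside:", in_domain(hull, (4.5, 4.0)))
```

Real descriptor spaces have many more dimensions, where exact hulls become expensive to compute, so simpler approximations such as descriptor ranges or distance-based measures are often used in practice.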

The QSAR equations can be used to predict biological activities of newer molecules before their synthesis.

Examples of machine learning tools for QSAR modeling include:[38]

S.No. | Name | Algorithms | External link
1. | R | RF, SVM, Naïve Bayesian, and ANN | "R: The R Project for Statistical Computing"
2. | libSVM | SVM | "LIBSVM -- A Library for Support Vector Machines"
3. | Orange | RF, SVM, and Naïve Bayesian | "Orange Data Mining"
4. | RapidMiner | SVM, RF, Naïve Bayes, DT, ANN, and k-NN | "RapidMiner | #1 Open Source Predictive Analytics Platform"
5. | Weka | RF, SVM, and Naïve Bayes | "Weka 3 - Data Mining with Open Source Machine Learning Software in Java"
6. | Knime | DT, Naïve Bayes, and SVM | "KNIME | Open for Innovation"
7. | AZOrange[39] | RT, SVM, ANN, and RF | "AZCompTox/AZOrange: AstraZeneca add-ons to Orange" (GitHub)
8. | Tanagra | SVM, RF, Naïve Bayes, and DT | "TANAGRA - A free DATA MINING software for teaching and research"
9. | Elki | k-NN | "ELKI Data Mining Framework"
10. | MALLET | | "MALLET homepage"
11. | MOA | | "MOA Massive Online Analysis | Real Time Analytics for Data Streams"


References

  1. Nantasenamat C, Isarankura-Na-Ayudhya C, Naenna T, Prachayasittikul V (2009). “A practical overview of quantitative structure-activity relationship”. 《Excli J.》 8: 74–88. 
  2. Nantasenamat C, Isarankura-Na-Ayudhya C, Prachayasittikul V (Jul 2010). “Advances in computational methods to predict the biological activity of compounds”. 《Expert Opinion on Drug Discovery》 5 (7): 633–54. doi:10.1517/17460441.2010.492827. PMID 22823204. 
  3. Yousefinejad S, Hemmateenejad B (2015). “Chemometrics tools in QSAR/QSPR studies: A historical perspective”. 《Chemometrics and Intelligent Laboratory Systems》. 149, Part B: 177–204. doi:10.1016/j.chemolab.2015.06.016. 
  4. “The Importance of Being Earnest: Validation is the Absolute Essential for Successful Application and Interpretation of QSPR Models”. 《QSAR & Comb. Sci.》 22: 69–77. 2003. doi:10.1002/qsar.200390007. 
  5. “Principles of QSAR models validation: internal and external”. 《QSAR & Comb. Sci.》 26: 694–701. 2007. doi:10.1002/qsar.200610151. 
  6. “Real external predictivity of QSAR models. Part 2. New intercomparable thresholds for different validation criteria and the need for scatter plot inspection”. 《Journal of Chemical Information and Modeling》 52 (8): 2044–58. Aug 2012. doi:10.1021/ci300084j. PMID 22721530. 
  7. Tropsha, Alexander (2010). “Best Practices for QSAR Model Development, Validation, and Exploitation”. 《Molecular Informatics》 29 (6-7): 476–488. doi:10.1002/minf.201000061. ISSN 1868-1743. 
  8. “Bioisosterism: A Rational Approach in Drug Design”. 《Chemical Reviews》 96 (8): 3147–3176. Dec 1996. doi:10.1021/cr950066q. PMID 11848856. 
  9. Nathan Brown.
  10. “On the hydrophobicity of peptides: Comparing empirical predictions of peptide log P values”. 《Bioinformation》 1 (7): 237–41. 2006. doi:10.6026/97320630001237. PMC 1891704. PMID 17597897. 
  11. “Prediction of physicochemical parameters by atomic contributions”. 《J. Chem. Inf. Comput. Sci》 39 (5): 868–873. 1999. doi:10.1021/ci990307l. 
  12. “Group-Based QSAR (G-QSAR)”. 
  13. “Rationalizing fragment based drug discovery for BACE1: insights from FB-QSAR, FB-QSSR, multi objective (MO-QSPR) and MIF studies”. 《Journal of Computer-Aided Molecular Design》 24 (10): 843–64. Oct 2010. Bibcode:2010JCAMD..24..843M. doi:10.1007/s10822-010-9378-9. PMID 20740315. 
  14. “Pharmacophore-similarity-based QSAR (PS-QSAR) for group-specific biological activity predictions”. 《Journal of Biomolecular Structure & Dynamics》 33 (1): 56–69. November 2013. doi:10.1080/07391102.2013.849618. PMID 24266725. 
  15. 《Molecular modelling: principles and applications》. Englewood Cliffs, N.J: Prentice Hall. 2001. ISBN 0-582-38210-6. 
  16. 《Kernel methods in computational biology》. Cambridge, Mass: MIT Press. 2004. ISBN 0-262-19509-7. 
  17. “Solving the multiple instance problem with axis-parallel rectangles”. 《Artificial Intelligence》 89 (1–2): 31–71. 1997. doi:10.1016/S0004-3702(96)00034-3. 
  18. “Catalyst design: knowledge extraction from high-throughput experimentation”. 《J. Catal.》 216: 3776–3777. 2003. doi:10.1016/S0021-9517(02)00036-2. 
  19. “Structure-activity correlation in titanium single-site olefin polymerization catalysts containing mixed cyclopentadienyl/aryloxide ligation”. 《Journal of the American Chemical Society》 129 (13): 3776–7. Apr 2007. doi:10.1021/ja0640849. PMID 17348648. 
  20. “Structure–Activity Correlation for Relative Chain Initiation to Propagation Rates in Single-Site Olefin Polymerization Catalysis”. 《Organometallics》 31 (2): 602–618. 2012. doi:10.1021/om200884x. 
  21. 《Algorithms on strings, trees, and sequences: computer science and computational biology》. Cambridge, UK: Cambridge University Press. 1997. ISBN 0-521-58519-8. 
  22. 《Predictive toxicology》. Washington, DC: Taylor & Francis. 2005. ISBN 0-8247-2397-X. 
  23. http://www.jcheminf.com/content/6/1/48/
  24. “Assessing QSAR Limitations – A Regulatory Perspective”. 《Current Computer-Aided Drug Design》 1 (2): 195–205. April 2005. doi:10.2174/1573409053585663. 
  25. “In silico prediction of drug toxicity”. 《Journal of Computer-Aided Molecular Design》 17 (2-4): 119–27. 2003. Bibcode:2003JCAMD..17..119D. doi:10.1023/A:1025361621494. PMID 13677480. 
  26. Waterbeemd, Han van de, ed. (1995). 〈Statistical validation of QSAR results〉. 《Chemometric methods in molecular design》. Weinheim: VCH. pp. 309–318. ISBN 3-527-30044-9. 
  27. “On some aspects of validation of predictive quantitative structure-activity relationship models”. 《Expert Opinion on Drug Discovery》 2 (12): 1567–77. Dec 2007. doi:10.1517/17460441.2.12.1567. 
  28. “On selection of training and test sets for the development of predictive QSAR models”. 《QSAR & Combinatorial Science》 25 (3): 235–251. 2006. doi:10.1002/qsar.200510161. 
  29. “Exploring the impact of size of training sets for the development of predictive QSAR models”. 《Chemometrics and Intelligent Laboratory Systems》 90 (1): 31–42. 2008. doi:10.1016/j.chemolab.2007.07.004. 
  30. “Review on modelling aspects in reversed-phase liquid chromatographic quantitative structure-retention relationships”. 《Analytica Chimica Acta》 602 (2): 164–72. Oct 2007. doi:10.1016/j.aca.2007.09.014. PMID 17933600. 
  31. “On two novel parameters for validation of predictive QSAR models”. 《Molecules》 14 (5): 1660–701. 2009. doi:10.3390/molecules14051660. PMID 19471190. 
  32. “Real external predictivity of QSAR models: how to evaluate it? Comparison of different validation criteria and proposal of using the concordance correlation coefficient”. 《Journal of Chemical Information and Modeling》 51 (9): 2320–35. Sep 2011. doi:10.1021/ci200211n. PMID 21800825. 
  33. 《Chemical graph theory: introduction and fundamentals》. Tunbridge Wells, Kent, England: Abacus Press. 1991. ISBN 0-85626-454-7. 
  34. Empty citation template. 
  35. “Structural modeling extends QSAR analysis of antibody-lysozyme interactions to 3D-QSAR”. 《Biophysical Journal》 84 (4): 2264–72. Apr 2003. Bibcode:2003BpJ....84.2264F. doi:10.1016/S0006-3495(03)75032-2. PMC 1302793. PMID 12668435. 
  36. 《Handbook of Molecular Descriptors》. Weinheim: Wiley-VCH. 2002. ISBN 3-527-29913-0. 
  37. 《Pattern classification》. Chichester: John Wiley & Sons. 2001. ISBN 0-471-05669-3. 
  38. “Machine-learning approaches in drug discovery: methods and applications”. 《Drug Discovery Today》 20 (3): 318–31. Mar 2015. doi:10.1016/j.drudis.2014.10.012. PMID 25448759. 
  39. “AZOrange - High performance open source machine learning for QSAR modeling in a graphical programming environment”. 《Journal of Cheminformatics》 3: 28. 2011. doi:10.1186/1758-2946-3-28. PMC 3158423. PMID 21798025.