|2.1 Retrieval and compilation of bioassays related to M. tuberculosis|
The PubChem BioAssay database of NCBI and the ChEMBL assay database store and manage a large number of biochemical assays and are routinely updated with new entries. These assays report bioactivity outcomes for several thousand macromolecular targets, tested in whole-cell (BioSystems), enzyme-based and protein-target formats.
We observed that 52% of the bioassay entries correspond to proteins, meaning that protein sequences were specified as targets; 47% fall in the BioSystems category, meaning that the targets are sequences involved in biological pathways; and 1% correspond to RNAi target sequences.
Our assumption is that data-driven results can be used to relate untested compounds to tested ones. Because structure-structure similarity helps estimate which compounds are most closely associated with one another, it also helps infer targets for untested compounds through a shared-target strategy.
Table 1: Target distribution for Mtb Assays
|2.2 Multi-dimensional Scaling studies on Mtb bioassay sets|
Multi-dimensional Scaling (MDS): This technique is a non-linear mapping approach that “rearranges” objects in a low-dimensional space to arrive at a configuration that best approximates the observed distances. MDS uses a function minimization algorithm that evaluates different configurations with the goal of maximizing the goodness of fit (equivalently, minimizing the lack of fit, often called stress).
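As a minimal sketch of this idea, the following base R code searches for a two-dimensional configuration whose pairwise distances match a set of observed distances. The toy points, the helper name `stress` and the choice of raw (squared-error) stress as the fit measure are assumptions made for illustration; they are not taken from the analysis reported here.

```r
# Toy example: four hypothetical objects with known pairwise distances
# (corners of a unit square; NOT data from this study).
pts <- matrix(c(0, 0,
                1, 0,
                1, 1,
                0, 1), ncol = 2, byrow = TRUE)
observed <- dist(pts)                 # observed pairwise distances

# Raw stress: summed squared mismatch between the distances of a
# candidate configuration and the observed distances (assumed fit measure).
stress <- function(par, d, k = 2) {
  config <- matrix(par, ncol = k)     # one row per object
  sum((dist(config) - d)^2)
}

set.seed(1)
start <- rnorm(length(pts))           # random starting configuration
fit <- optim(start, stress, d = observed, method = "BFGS")

config <- matrix(fit$par, ncol = 2)   # final low-dimensional coordinates
final_stress <- fit$value             # lower stress means a better fit
```

The optimizer only ever moves to configurations with lower stress, so `final_stress` is never worse than the stress of the random start. Base R's `cmdscale` offers a closed-form (classical) alternative when an iterative search is not needed.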
Atom-pair descriptor: An atom pair is defined as two non-hydrogen atoms and the interatomic separation between them, measured in bonds along the shortest path connecting the two atoms. The description of each atom includes the number of heavy-atom connections and the number of π electron pairs on the atom.
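The definition above can be sketched in base R for a single small molecule. The example molecule (acetaldehyde), the hand-supplied π counts and the label format such as `CX2pi1` are illustrative assumptions; this is not the encoding used internally by ChemmineR.

```r
# Heavy-atom graph of acetaldehyde (CH3-CH=O); hydrogens are omitted.
elements <- c("C", "C", "O")
pi_count <- c(0, 1, 1)                  # assumed pi counts, supplied by hand
bonds    <- rbind(c(1, 2), c(2, 3))     # bonded heavy-atom index pairs

# Symmetric adjacency matrix of the heavy-atom graph.
n   <- length(elements)
adj <- matrix(FALSE, n, n)
adj[bonds] <- TRUE
adj[bonds[, 2:1, drop = FALSE]] <- TRUE

# All-pairs shortest path lengths, counted in bonds (Floyd-Warshall).
d <- ifelse(adj, 1, Inf)
diag(d) <- 0
for (k in seq_len(n)) for (i in seq_len(n)) for (j in seq_len(n))
  d[i, j] <- min(d[i, j], d[i, k] + d[k, j])

# Atom type: element, number of heavy-atom neighbours, pi count.
heavy_nb  <- rowSums(adj)
atom_type <- paste0(elements, "X", heavy_nb, "pi", pi_count)

# One atom pair per unordered pair of heavy atoms: type, separation, type.
idx <- which(upper.tri(d), arr.ind = TRUE)
atom_pairs <- apply(idx, 1, function(p) {
  ends <- sort(atom_type[p])            # canonical order so A-B equals B-A
  paste(ends[1], d[p[1], p[2]], ends[2], sep = "_")
})

# Frequency of each atom-pair type; this count vector is the descriptor.
ap_counts <- table(atom_pairs)
```

A molecule with n heavy atoms yields n(n-1)/2 atom pairs, so the three heavy atoms here give three pairs. Two molecules can then be compared through the atom-pair types they share.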
We calculated atom-pair descriptors for the data set as a basis for clustering similar compounds and for finding structural outliers through R visualization methods. The compound data set used for the MDS analysis was AID 1626, a screen against the whole-cell Mtb H37Rv strain, retrieved from PubChem BioAssay. Of the 215101 compounds tested, 2044 were reported as active, 209567 as inactive and 3501 as inconclusive.
We considered only the active data set and performed the analysis in RStudio. R is a statistical language used for a wide range of statistical and graphical analyses on different data sets. ChemmineR is a package developed for cheminformatics analysis of drug-like small-molecule data in the R environment. It contains functions for efficient processing of large numbers of molecules, physicochemical and structural property prediction, structural similarity searching, and classification and clustering of compound libraries with a wide spectrum of algorithms.
The commands executed with the ChemmineR package are as follows:
b) header(sdfset[]) #header information
“95 97 0 1 0 0 0 0 0999 V2000”
d) propma[1:4,] # displays the Molecular formula, mol.wt and atom description of four compounds
#creating a boxplot on the atoms frequency
g) groups(sdfset[1:4], groups="fctgroup", type="countMA")
#Enumerate functional groups:
h) apset <- sdf2ap(sdfset)
#Computing the atom-pair descriptors for the compounds
i) clusters <- cmp.cluster(db=apset, cutoff = c(0.7, 0.8, 0.9), quiet = TRUE)
j) cluster.visualize(apset, clusters, size.cutoff = 2, quiet = TRUE)
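To illustrate what clustering at several similarity cutoffs produces, the following base R sketch groups five toy compounds by Tanimoto similarity at the cutoffs 0.7, 0.8 and 0.9. It assumes that binning clustering at a cutoff behaves like single-linkage clustering cut at distance 1 - cutoff. The feature sets and compound names are hypothetical stand-ins for atom-pair sets, and the code does not call ChemmineR.

```r
# Hypothetical feature sets for five toy compounds (stand-ins for atom-pair sets).
features <- list(
  cmpA = c("f1", "f2", "f3", "f4", "f5"),
  cmpB = c("f1", "f2", "f3", "f4", "f5", "f6"),
  cmpC = c("f1", "f2", "f3", "f7"),
  cmpD = c("g1", "g2", "g3"),
  cmpE = c("g1", "g2", "g3", "g4")
)

# Tanimoto coefficient: shared features divided by all distinct features.
tanimoto <- function(a, b) length(intersect(a, b)) / length(union(a, b))

# Pairwise similarity matrix.
n   <- length(features)
sim <- matrix(1, n, n, dimnames = list(names(features), names(features)))
for (i in seq_len(n)) for (j in seq_len(n))
  sim[i, j] <- tanimoto(features[[i]], features[[j]])

# Single-linkage tree on the distance 1 - similarity (assumed equivalent
# to binning clustering), cut once per similarity cutoff.
tree     <- hclust(as.dist(1 - sim), method = "single")
cutoffs  <- c(0.7, 0.8, 0.9)
clusters <- sapply(cutoffs, function(s) cutree(tree, h = 1 - s))
colnames(clusters) <- paste0("CLID_", cutoffs)

# Number of clusters found at each cutoff.
n_clusters <- apply(clusters, 2, function(x) length(unique(x)))
```

Each column of `clusters` assigns a cluster ID to every compound at one cutoff. Raising the cutoff can only split clusters, never merge them, which is why stricter cutoffs leave more singletons.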
Table 2: Assay activity scores of above compounds
The compound distribution shown in the binning clustering plot has a wide range of applications in analyzing compounds based on their activity scores. Although we chose compounds from the plot at random here, in future work we will examine the different dimensions in which compound hot-spots can be found. This could also help in predicting the activity of virtual compounds.
|2.3 Developing a Hadoop-MapReduce Platform for Storage and Analysis of Chemical Data|
Another aspect of our big data project is to develop a Hadoop architecture offered as Platform as a Service (PaaS). Its scope ranges from plain storage through to management techniques for manipulating and wrangling data sets to extract information in the form of patterns and trends.
Essentially, however, a Big Data application has to go beyond the related fields of pattern discovery and data mining. Hadoop, an advanced framework for distributed data management and processing, contains libraries for running data processing on a distributed computer architecture using the MapReduce programming model and its own distributed file system, HDFS. Its scalability is well established. In particular, Hadoop can process extremely large volumes of data with varying structures (or no structure at all).
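The MapReduce programming model can be sketched in plain R without a Hadoop cluster. The example below counts activity outcomes over chunks of records, with each chunk standing in for a block stored on a separate node. The records, the two-chunk split and the function name `map_fn` are hypothetical; a real deployment would run the map and reduce steps through Hadoop rather than through `lapply`.

```r
# Hypothetical screening records (toy data, not taken from AID 1626).
records <- data.frame(
  cid     = 1:8,
  outcome = c("active", "inactive", "inactive", "inconclusive",
              "inactive", "active", "inactive", "inactive"),
  stringsAsFactors = FALSE
)

# Split the records into chunks, standing in for blocks held on separate nodes.
chunks <- split(records, rep(1:2, each = 4))

# Map: each chunk independently emits (key, value) pairs, here (outcome, 1).
map_fn <- function(chunk) {
  data.frame(key = chunk$outcome, value = 1, stringsAsFactors = FALSE)
}
mapped <- lapply(chunks, map_fn)

# Shuffle: gather all values that share a key.
pairs   <- do.call(rbind, mapped)
grouped <- split(pairs$value, pairs$key)

# Reduce: combine the values for each key, here by summation.
counts <- vapply(grouped, function(v) Reduce(`+`, v), numeric(1))
```

Because the map step touches each chunk independently and the reduce step only needs the values for one key, both steps can be distributed across nodes, which is what makes the model scale to large compound collections.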
I. To apply different clustering algorithms to the compound data set using MDS approaches.
II. To write R scripts for the MapReduce algorithm and run descriptor calculations for 90 M compounds and the GDB-17 database.
III. To visualize larger data sets and thereby build network-based analyses for deciphering compound-target relationships.
IV. To develop a Hadoop architecture as a platform for carrying out high-dimensional analysis and for integrating R with Hadoop.
1. Fan J, Han F, Liu H. Challenges of Big Data Analysis. Natl Sci Rev. 2014 Jun;1(2):293-314.
2. Haggarty SJ, Clemons PA, Schreiber SL. Chemical genomic profiling of biological networks using graph theory and combinations of small molecule perturbations. J Am Chem Soc. 2003 Sep 3;125(35):10543-5.
3. Carhart RE, Smith DH, Venkataraghavan R. Atom pairs as molecular features in structure-activity studies: definition and applications. J Chem Inf Comput Sci. 1985;25(2):64-73.
4. Cao Y, Charisi A, Cheng LC, Jiang T, Girke T. ChemmineR: a compound mining framework for R. Bioinformatics. 2008 Aug 1;24(15):1733-4.