Work methodology

Work done:

2.1 Retrieval and compiling of various bioassays related to M. tuberculosis

 

The PubChem BioAssay database of NCBI and the ChEMBL assay collection store and manage a large number of biochemical assays and are updated routinely to include new entries. These assays report the bioactivity outcomes associated with several thousand macromolecular targets, tested in whole-cell (BioSystems), enzyme-based and protein-target formats.

We observed that 52% of the bioassay entries correspond to proteins, that is, assays whose targets were specified as protein sequences; 47% fall in the BioSystems category, that is, target sequences involved in biological pathways; and 1% have RNAi as the target sequence.

Our approach is to use data-driven results to determine the relationship between untested compounds and tested compounds. Structure-structure similarity helps estimate which compounds are most closely associated with which other compounds; it also helps in inferring the targets of untested compounds through a shared-target strategy [2].
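The shared-target idea can be sketched in a few lines of base R. This is only an illustration under stated assumptions: the compound names, feature labels, targets and the 0.5 cutoff are all hypothetical, and Tanimoto similarity on feature sets is used here as one common choice of structure-structure similarity.

```r
# Tanimoto (Jaccard) similarity between two sets of structural features.
tanimoto <- function(a, b) {
  length(intersect(a, b)) / length(union(a, b))
}

# Hypothetical tested compounds: feature sets and known targets (toy data).
tested <- list(
  cmpA = list(features = c("f1", "f2", "f3", "f4"), targets = c("T1", "T2")),
  cmpB = list(features = c("f1", "f2", "f5"),       targets = c("T2")),
  cmpC = list(features = c("f6", "f7"),             targets = c("T3"))
)

# Hypothetical untested compound.
query <- c("f1", "f2", "f3", "f8")

# Similarity of the query to every tested compound.
sims <- sapply(tested, function(x) tanimoto(query, x$features))

# Propose the targets of tested compounds at or above a chosen cutoff
# (the cutoff value is an assumption made for this example).
cutoff <- 0.5
neighbours <- names(sims)[sims >= cutoff]
proposed <- unique(unlist(lapply(tested[neighbours], function(x) x$targets),
                          use.names = FALSE))

print(round(sims, 2))
print(proposed)
```

In this toy case only cmpA passes the cutoff, so its targets are the ones proposed for the query compound.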

 Assay distribution

Assay type     Number
Protein        1204
RNAi Target    19
BioSystems     1088
Total          2311

             Table 1: Target distribution for Mtb Assays
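The percentages quoted earlier follow directly from the counts in Table 1. A minimal check in base R (the variable names are our own):

```r
# Assay counts per target category, as listed in Table 1.
counts <- c(Protein = 1204, "RNAi Target" = 19, BioSystems = 1088)

total <- sum(counts)                  # should match the Total row
share <- round(100 * counts / total)  # percentage per category, rounded

print(total)
print(share)
```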

2.2 Multi-dimensional Scaling studies on Mtb bioassay sets

Multi-dimensional Scaling (MDS): This technique is a non-linear mapping approach that “rearranges” objects in a low-dimensional space so as to arrive at a configuration that best approximates the observed distances between them. MDS uses a function minimization algorithm that evaluates different configurations with the goal of maximizing the goodness-of-fit (equivalently, minimizing the lack of fit, usually called stress).
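The minimization idea can be sketched in base R. This is a toy illustration and not the procedure used for the analysis in this report: the data are random, the raw-stress function is a simple choice of fit criterion, and classical MDS (`cmdscale`) is used only to obtain a starting configuration that `optim` then refines.

```r
set.seed(42)  # reproducible toy data

# Toy data: 6 hypothetical objects described by 5 numeric features.
n <- 6
x <- matrix(rnorm(n * 5), nrow = n)
d_obs <- as.vector(dist(x))  # observed pairwise distances

# Raw stress: squared mismatch between the distances in a 2-D configuration
# and the observed distances. Lower stress means better goodness-of-fit.
stress <- function(par) {
  conf <- matrix(par, nrow = n, ncol = 2)
  sum((as.vector(dist(conf)) - d_obs)^2)
}

# Start from classical (metric) MDS, then refine by numerical minimization.
init <- as.vector(cmdscale(dist(x), k = 2))
fit  <- optim(init, stress, method = "BFGS")

conf <- matrix(fit$par, nrow = n, ncol = 2)  # final 2-D coordinates
cat("stress at start:", stress(init), "\n")
cat("stress at end:  ", fit$value, "\n")
```

Plotting `conf` gives the kind of two-dimensional map in which similar objects end up close together.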

Atom-pair descriptor: An atom pair is defined as two non-hydrogen atoms together with their interatomic separation, measured in bonds along the shortest path connecting the two atoms. The description of each atom includes the number of heavy-atom connections and the number of π electron pairs on that atom [3].
We calculated atom-pair descriptors for the data set as a basis for clustering similar compounds and for finding structural outliers through R visualization methods. The compound data set used for the MDS analysis was AID 1626, which was screened against the whole-cell Mtb H37Rv strain and retrieved from PubChem BioAssay. Of the 215101 compounds tested, 2044 were reported as active, 209567 as inactive and 3501 as inconclusive.
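To make the atom-pair definition concrete, the sketch below enumerates atom pairs for a small hand-built molecule graph in base R. It is an illustration only and is not how ChemmineR computes its descriptors: the molecule (the four heavy atoms of acetic acid), the hand-assigned `pi` values and the label format are all assumptions made for this example.

```r
# Heavy atoms of a toy molecule (acetic acid): CH3, carboxyl C, =O, -OH.
# The pi column is assigned by hand for illustration.
atoms <- data.frame(element = c("C", "C", "O", "O"),
                    pi      = c(0, 1, 1, 0),
                    stringsAsFactors = FALSE)
bonds <- rbind(c(1, 2), c(2, 3), c(2, 4))  # bonded atom indices

# Symmetric adjacency matrix of the heavy-atom graph.
n <- nrow(atoms)
adj <- matrix(0, n, n)
adj[bonds] <- 1
adj[bonds[, 2:1]] <- 1

# Shortest-path separation in bonds (Floyd-Warshall).
d <- ifelse(adj == 1, 1, Inf)
diag(d) <- 0
for (k in 1:n) for (i in 1:n) for (j in 1:n) {
  d[i, j] <- min(d[i, j], d[i, k] + d[k, j])
}

# Atom description: element, heavy-atom connections, pi value.
label <- paste0(atoms$element, "(X", rowSums(adj), ",pi", atoms$pi, ")")

# One atom pair per unordered pair of heavy atoms; labels are sorted so
# that the same pair always produces the same string.
idx <- combn(n, 2)
pairs <- apply(idx, 2, function(p) {
  ends <- sort(label[p])
  paste0(ends[1], "-", d[p[1], p[2]], "-", ends[2])
})

print(table(pairs))
```

The resulting table of atom-pair counts is the kind of feature set on which two compounds can be compared.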

We considered only the active data set and performed the analysis in RStudio. R is a language for statistical computing and graphics, used for a wide range of analyses on different data sets. ChemmineR [4] is an R package for cheminformatics analysis of drug-like small-molecule data. It contains functions for efficient processing of large numbers of molecules, physicochemical/structural property prediction, structural similarity searching, and classification and clustering of compound libraries with a wide spectrum of algorithms.

The commands executed with the ChemmineR package are as follows:

a) sdfset

b) header(sdfset[[1]]) # header information
Molecule_Name
"54719543"
Source
"-OEChem-02191408432D"
Comment
""
Counts_Line
"95 97 0 1 0 0 0 0 0999 V2000"

c) propma

d) propma[1:4,] # displays the molecular formula, molecular weight and atom counts of the first four compounds

PubChem_Compound_ID  Mol.formula   Mol.Wt    C   H   O  Cl  N  S  F  Se  I
54719543             C34H53O8      589.7798  34  53  8  0   0  0  0  0   0
54682933             C22H21ClN2O8  476.8637  22  21  8  1   2  0  0  0   0
54680690             C21H21ClN2O8  464.8530  21  21  8  1   2  0  0  0   0
54679384             C23H27N3O7    457.4764  23  27  7  0   3  0  0  0   0

e) propma
f) boxplot(propma, col="blue", main="Atom Frequency")
# create a boxplot of the atom frequencies

Atom frequency

g) groups(sdfset[1:4], groups="fctgroup", type="countMA")
# enumerate functional groups

PubChem_Compound_ID  RNH2  R2NH  R3N  ROPO3  ROH  RCHO  RCOR  RCOOH  RCOOR  ROR  RCCH  RCN
54719543             0     0     0    0      3    0     1     1      0      2    0     0
54682933             0     0     1    0      5    0     2     0      0      0    0     0
54680690             0     0     1    0      5    0     2     0      0      0    0     0
54679384             0     0     2    0      4    0     2     0      0      0    0     0

h) apset <- sdf2ap(sdfset)
# compute the atom-pair descriptors for the compounds
i) clusters <- cmp.cluster(db=apset, cutoff=c(0.7, 0.8, 0.9), quiet=TRUE)
j) cluster.visualize(apset, clusters, size.cutoff=2, quiet=TRUE)
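For readers who want to try the command sequence end to end, the sketch below strings the steps together as one script. It rests on several assumptions: it uses the `sdfsample` data set bundled with ChemmineR instead of the AID 1626 actives, the way `propma` is built here follows the package documentation and may differ from what was run for this report, and `atomma` is a name introduced only for this example.

```r
library(ChemmineR)

# Assumption: the bundled sample set stands in for the AID 1626 actives.
data(sdfsample)
sdfset <- sdfsample[1:50]

# Header block of the first molecule.
print(header(sdfset[[1]]))

# Property matrix: molecular formula, molecular weight and atom counts.
propma <- data.frame(MF = MF(sdfset), MW = MW(sdfset), atomcountMA(sdfset))
print(propma[1:4, ])

# Atom-count matrix used for the atom-frequency boxplot.
atomma <- atomcountMA(sdfset, addH = FALSE)
boxplot(atomma, col = "blue", main = "Atom Frequency")

# Functional-group counts for the first four compounds.
print(groups(sdfset[1:4], groups = "fctgroup", type = "countMA"))

# Atom-pair descriptors, then binning clustering at three similarity cutoffs.
apset <- sdf2ap(sdfset)
clusters <- cmp.cluster(db = apset, cutoff = c(0.7, 0.8, 0.9), quiet = TRUE)
print(head(clusters))

# MDS-based view of the clusters containing at least two members.
cluster.visualize(apset, clusters, size.cutoff = 2, quiet = TRUE)
```

To run this on real data, `sdfset` would instead be read from an SD file with `read.SDFset()`.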

Clustering

PUBCHEM_CID   PUBCHEM_ACTIVITY_OUTCOME   PUBCHEM_ACTIVITY_SCORE
882594        Active                     41
16192266      Active                     75
16746272      Active                     76
16746619      Active                     41
3244848       Active                     41
2121796       Active                     80
1986244       Active                     72
655522        Active                     77

Table 2: Assay activity scores of the above compounds

The compound distribution shown in the binning clustering plot has a wide range of applications in analyzing compounds by their activity scores. Although we chose the compounds from the plot at random, in future work we will examine the different dimensions in which compound hot-spots can be found. This could also help in predicting the activity of virtual compounds.

2.3 Developing a Hadoop-MapReduce Platform for Storage and Analysis of Chemical Data

Another aspect of our big data project is to develop a Hadoop architecture offered as Platform as a Service (PaaS). This ranges from storage through to management techniques for manipulating and wrangling useful data sets to obtain information in the form of patterns and trends.
But essentially, a Big Data application has to go beyond the related fields of pattern discovery and data mining. Hadoop, a framework for distributed data management and processing, contains libraries for running data processing on a distributed computer architecture using the MapReduce programming model and its own distributed file system, HDFS. It scales well: in particular, Hadoop can process extremely large volumes of data with varying structures (or no structure at all).
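The MapReduce programming model itself can be illustrated without a cluster. The sketch below emulates the map, shuffle and reduce phases in memory with base R; it is not Hadoop code, and the records, `map_fn` and `reduce_fn` are hypothetical names and toy data chosen for this example (counting how often each element occurs across a few compounds).

```r
# Hypothetical input records: a compound ID and its heavy atoms (toy data).
records <- list(
  list(cid = "c1", atoms = c("C", "C", "O", "N")),
  list(cid = "c2", atoms = c("C", "O", "O")),
  list(cid = "c3", atoms = c("C", "N", "Cl"))
)

# Map: each record emits one (key = element, value = 1) pair per atom.
map_fn <- function(rec) {
  data.frame(key = rec$atoms, value = 1, stringsAsFactors = FALSE)
}

# Reduce: combine all values that share a key.
reduce_fn <- function(values) sum(values)

mapped   <- do.call(rbind, lapply(records, map_fn))  # map phase
shuffled <- split(mapped$value, mapped$key)          # shuffle: group by key
reduced  <- sapply(shuffled, reduce_fn)              # reduce phase

print(reduced)
```

On a real Hadoop cluster the map and reduce functions run on different nodes over data blocks stored in HDFS, while the framework handles the shuffle step between them.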

Future Work:
I. To apply different clustering algorithms to the compound data set using MDS approaches.
II. To write R scripts for the MapReduce algorithm and run descriptor calculations for 90 M compounds and the GDB-17 database.
III. To visualize larger data sets and thereby build network-based analyses for deciphering compound-target relationships.
IV. To develop a Hadoop architecture as a platform for carrying out high-dimensional analysis, and to integrate R with Hadoop.

References:
1. Fan J, Han F, Liu H. Challenges of Big Data Analysis. Natl Sci Rev. 2014 Jun;1(2):293-314.
2. Haggarty SJ, Clemons PA, Schreiber SL. Chemical genomic profiling of biological networks using graph theory and combinations of small molecule perturbations. J Am Chem Soc. 2003 Sep 3;125(35):10543-5.
3. Carhart RE, Smith DH, Venkataraghavan R. Atom pairs as molecular features in structure-activity studies: definition and applications. J Chem Inf Comput Sci. 1985;25(2):64-73.
4. Cao Y, Charisi A, Cheng LC, Jiang T, Girke T. ChemmineR: a compound mining framework for R. Bioinformatics. 2008 Aug 1;24(15):1733-4.