This is a post about the biomedical and chemical datasets.
- ZINC database [J. J. Irwin et al., 2012] [Sterling and Irwin, 2015] [Kusner et al., 2017]
- molecule dataset
- 250,000 drug-like, commercially available molecules
- 35 million commercially-available compounds
- maximum atom number 38
- paper: ZINC: A Free Tool to Discover Chemistry for Biology [J. J. Irwin et al., 2012]
- paper: ZINC 15 – Ligand Discovery for Everyone [Sterling and Irwin, 2015]
- Connectivity Map [Subramanian A, et al., 2017] [Lamb J, et al., 2006]
- A Next Generation Connectivity Map: L1000 Platform And The First 1,000,000 Profiles [Subramanian A, et al.]
- The Connectivity Map: using gene-expression signatures to connect small molecules, genes, and disease. [Lamb J, et al.]
- Protein Data Bank [Helen Berman et al., 2003] [Kleywegt GJ 2018]
- information about the 3D structures of proteins, nucleic acids, and complex assemblies.
- GEO [Tanya Barrett, 2013]
- Gene Expression Omnibus
- international public repository that archives and freely distributes microarray, next-generation sequencing, and other forms of high-throughput functional genomics data submitted by the research community.
- BioSNAP : Stanford Biomedical Network Dataset Collection [Sagar Maheshwari and Jure Leskovec, 2018]
- Entity types
- Cell [C] : basic structural, biological, and functional unit of all organisms measured by single-cell technologies
- Disease [D] : medical condition that is associated with specific symptoms and signs
- Drug/Chemical [Ch] : chemical substance of known structure that produces a biological effect
- Function [F] : gene role classified into molecular functions, cellular components, and biological processes
- Gene [G] : sequence of DNA or RNA that codes for a molecule that has a function
- Genomic region [Gr] : segment of a nucleic acid molecule, e.g., a regulatory sequence
- Protein [P] : molecule that performs a vast array of functions within organisms
- Side-effect [Se] : secondary, typically undesirable effect of a drug or medical treatment
- Species/Organism [S] : basic unit of classification and a taxonomic rank, as well as a unit of biodiversity
- Tissue [T] : cellular organizational level between cells and a complete organ
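The bracketed codes above work as a compact vocabulary for describing which entity types a network connects. The following is a minimal sketch, assuming a network is labeled by concatenating two such codes (for example `ChG` for a drug-gene network). The helper `decode_pair` and the labels passed to it are illustrative only and are not part of any BioSNAP API.

```python
# Entity codes copied from the list above.
ENTITY_CODES = {
    "C": "Cell",
    "D": "Disease",
    "Ch": "Drug/Chemical",
    "F": "Function",
    "G": "Gene",
    "Gr": "Genomic region",
    "P": "Protein",
    "Se": "Side-effect",
    "S": "Species/Organism",
    "T": "Tissue",
}


def decode_pair(label):
    """Split a concatenated label such as 'ChG' into entity names.

    Hypothetical helper: two-letter codes are tried first so that 'Ch'
    is not read as 'C' followed by an unknown 'h'.
    """
    names = []
    i = 0
    while i < len(label):
        for width in (2, 1):
            code = label[i:i + width]
            if code in ENTITY_CODES:
                names.append(ENTITY_CODES[code])
                i += len(code)
                break
        else:
            raise ValueError(f"unknown entity code at position {i} in {label!r}")
    return names


print(decode_pair("ChG"))
print(decode_pair("DCh"))
```

Trying the longer code first is a simple way to resolve the overlap between `C`/`Ch`, `G`/`Gr`, and `S`/`Se`.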
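GTEx (Genotype-Tissue Expression): genotype (SNP chip, whole exome, whole genome), transcriptome (RNA-Seq), phenotype (comprehensive phenotype and clinical information)
NCI-60 (National Cancer Institute Anticancer Drug Screen): genotype (whole exome), transcriptome (mRNA chip, miRNA chip), proteome (SWATH profile), phenotype (cancer cell lines and drug treatment information)
ENCODE (Encyclopedia of DNA Elements): genotype (whole genome of cell lines), transcriptome (RNA-Seq), epigenome (ChIP-seq, DNase-seq, 5C: Chromatin Conformation Capture Carbon Copy)
TCGA (The Cancer Genome Atlas): genotype (cancer whole genome and whole exome), transcriptome (RNA-Seq, miRNA-Seq), epigenome (methyl-Seq), proteome (reverse phase protein array), phenotype (pathology, basic clinical information, drug information)
1000 Genomes Project: genotype (whole genome), transcriptome (RNA-Seq), phenotype (pedigree and ancestry information)
NIH Roadmap Epigenomics Project: genotype (whole genome), transcriptome (RNA-Seq, small RNA-Seq), epigenome (ChIP-Seq), phenotype (dozens of cell lines and various cultured cells)
GEO (Gene Expression Omnibus): transcriptome (array, RNA-Seq)
SRA (Sequence Read Archive): genotype (whole genome and whole exome), transcriptome (RNA-Seq, small RNA-Seq, miRNA-Seq), epigenome, microbiome (16S rRNA, shotgun genome)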
- PharmGKB [M. Whirl-Carrillo et al., 2012]
- Pharmacogenomics Knowledge Base
- knowledge about the impact of genetic variation on drug response
- relationship between genetic variations and how our body responds to medications.
- Drugs, Pathways, Dosing Guidelines, Drug Labels
- Chemical-Protein Interaction Networks
- Organisms: 2,031; chemicals: 0.5 million; proteins: 9.6 million; interactions: 1.6 billion
- 68,000 different chemicals, including 2200 drugs, and connects them to 1.5 million genes across 373 genomes.
- RDKit [G. Landrum, 2006] [Landrum, 2016]
- Rdkit: Open-source cheminformatics
- SMILES -> chemical structure graph tool
- Decagon (Multimodal graph of polypharmacy) [Marinka Zitnik, 2018]
- Protein-protein interaction network
- Drug-target protein associations
- Drug-target protein associations culled from several curated databases
- Polypharmacy side effects in the form of (drug A, side effect type, drug B) triples
- Side effects of individual drugs in the form of (drug A, side effect type) tuples
- Side effect categories
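The Decagon components listed above (protein-protein edges, drug-target edges, and drug-drug edges typed by side effect) can be held in one typed adjacency structure. The sketch below uses only the standard library. All identifiers such as `DrugA`, `P1`, and `SE_nausea` are made up for illustration, and the layout is an assumption of this sketch, not the format of the released files.

```python
from collections import defaultdict

# Made-up identifiers for illustration only; the real files use their own
# drug, protein, and side-effect IDs.
ppi_edges = [("P1", "P2"), ("P2", "P3")]
drug_targets = [("DrugA", "P1"), ("DrugB", "P3")]
polypharmacy = [("DrugA", "SE_nausea", "DrugB")]
mono_side_effects = [("DrugA", "SE_headache")]


def build_multimodal_graph(ppi, targets, combos):
    """Return adjacency sets keyed by (edge_type, node)."""
    graph = defaultdict(set)
    for a, b in ppi:
        graph[("interacts", a)].add(b)
        graph[("interacts", b)].add(a)
    for drug, protein in targets:
        graph[("targets", drug)].add(protein)
        graph[("targeted_by", protein)].add(drug)
    for drug_a, side_effect, drug_b in combos:
        # Each side effect type acts as its own edge type between two drugs.
        graph[(side_effect, drug_a)].add(drug_b)
        graph[(side_effect, drug_b)].add(drug_a)
    return graph


graph = build_multimodal_graph(ppi_edges, drug_targets, polypharmacy)

# In this sketch, single-drug side effects become node attributes, not edges.
drug_features = defaultdict(set)
for drug, side_effect in mono_side_effects:
    drug_features[drug].add(side_effect)

print(sorted(graph[("interacts", "P2")]))
```

Keying edges by type keeps each side effect separable, which is the property a multimodal graph of polypharmacy relies on.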
- DeepChem aims to provide a high quality open-source toolchain that democratizes the use of deep-learning in drug discovery, materials science, quantum chemistry, and biology.
- MoleculeNet [Zhenqin Wu et al., 2017]
- MoleculeNet is a benchmark collection for molecular machine learning aimed at tasks such as drug discovery. Its datasets fall into four categories (quantum mechanics, physical chemistry, biophysics, and physiology) and are released through DeepChem.
RNA-seq based expression profiles of genes extracted from TCGA breast cancer level 3 data [Prat Aparicio, 2012]
Cancer hallmark gene sets [Liberzon et al., 2015]
PAM50 molecular subtype
breast cancer subtype scheme. Luminal A, Luminal B, Basal-like, HER2.
- DrugBank
- bioinformatics and cheminformatics resource that combines detailed drug data with comprehensive drug target information.
- Version 5.1.1, released 2018-07-03, contains 11,877 drug entries including 2,474 approved small molecule drugs, 1,180 approved biotech (protein/peptide) drugs, 129 nutraceuticals and over 5,748 experimental drugs. Additionally, 5,131 non-redundant protein (i.e. drug target/enzyme/transporter/carrier) sequences are linked to these drug entries. Each DrugCard entry contains more than 200 data fields, with half of the information devoted to drug/chemical data and the other half to drug target or protein data.
- STRING [Szklarczyk et al., 2014]
- database of putatively associating genes from multiple pieces of evidence like biological experiments, text-mined literature information, computational prediction, etc.
- ex) protein-protein interaction network topology
- SMILES [Weininger, 1988]
- Simplified molecular-input line-entry system
- Drug chemical structure
- specification in form of a line notation for describing the structure of chemical species using short ASCII strings
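To make the line notation concrete, here is a toy parser that turns a SMILES string into atom and bond lists. It is a sketch under strong assumptions: it covers only unbracketed organic-subset atoms, the bond symbols `-`, `=`, `#`, parenthesized branches, and single-digit ring closures. Charges, isotopes, stereochemistry, and aromaticity perception are not handled. The function name `parse_smiles` is made up for this example. For real work, a full toolkit such as RDKit is the appropriate choice.

```python
def parse_smiles(smiles):
    """Parse a small SMILES subset into (atoms, bonds).

    Supported: B C N O P S F Cl Br I, lowercase b c n o p s, bond symbols
    - = #, branches in parentheses, and single-digit ring closures.
    Bonds between lowercase (aromatic) atoms are recorded with order 1.
    """
    two_letter = {"Cl", "Br"}
    one_letter = set("BCNOPSFIbcnops")
    bond_order = {"-": 1, "=": 2, "#": 3}

    atoms = []          # element symbol per atom index
    bonds = []          # (atom_i, atom_j, order)
    branch_stack = []   # atoms to return to when a branch closes
    open_rings = {}     # ring digit -> (atom index, explicit order or None)
    previous = None     # index of the atom the next atom bonds to
    pending = None      # explicit bond order waiting for the next atom

    i = 0
    while i < len(smiles):
        ch = smiles[i]
        if smiles[i:i + 2] in two_letter:
            symbol, step = smiles[i:i + 2], 2
        elif ch in one_letter:
            symbol, step = ch, 1
        else:
            symbol, step = None, 1

        if symbol is not None:
            atoms.append(symbol)
            current = len(atoms) - 1
            if previous is not None:
                bonds.append((previous, current, pending or 1))
            previous, pending = current, None
        elif ch in bond_order:
            pending = bond_order[ch]
        elif ch == "(":
            branch_stack.append(previous)
        elif ch == ")":
            previous = branch_stack.pop()
        elif ch.isdigit():
            if ch in open_rings:
                start, order = open_rings.pop(ch)
                bonds.append((start, previous, pending or order or 1))
            else:
                open_rings[ch] = (previous, pending)
            pending = None
        else:
            raise ValueError(f"unsupported SMILES character {ch!r}")
        i += step
    return atoms, bonds


atoms, bonds = parse_smiles("CC(=O)O")  # acetic acid
print(atoms, bonds)
```

The branch stack and the ring-closure table are the two pieces of state that let a flat ASCII string encode a graph with side chains and cycles.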
- DTIP [Kyle Yingkai Gao et al., 2018]
- IBM research dataset from BindingDB
- paper : Interpretable Drug Target Prediction Using Deep Neural Representation
- 39,747 positive examples and 31,218 negative examples
- BindingDB [Gilson et al., 2016]
- Public, web-accessible database
- binding affinities, focusing chiefly on the interactions of small molecules (drugs/drug candidates) and proteins (targets/target candidates)
- SIDER database [Kuhn et al., 2015]
- drug side effect
- 996 drugs and 4192 side effects
- drugs confounder-controlled side effects
- 1,332 drugs and 10,093 side effects
- extended-connectivity fingerprints with diameter 6
- drug structural features
- the hashed 1,024-bit length vector encoding the presence or absence of substructure in a drug molecule
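The hashing step behind a fixed-length fingerprint can be sketched without any chemistry library. The example below only illustrates how substructure identifiers are folded into a 1,024-bit presence/absence vector. It does not enumerate circular atom environments the way a real ECFP implementation does, and the substructure labels and the function `fold_to_bits` are hypothetical.

```python
import hashlib

N_BITS = 1024  # vector length quoted in the text


def fold_to_bits(substructure_ids, n_bits=N_BITS):
    """Hash each substructure identifier to a bit position and set that bit.

    `substructure_ids` stands in for the atom-environment identifiers an
    ECFP implementation would enumerate; producing them is out of scope here.
    """
    bits = [0] * n_bits
    for identifier in substructure_ids:
        digest = hashlib.sha256(identifier.encode("utf-8")).digest()
        position = int.from_bytes(digest[:8], "big") % n_bits
        bits[position] = 1
    return bits


# Hypothetical substructure labels for one drug molecule.
fingerprint = fold_to_bits(["C-C", "C=O", "C-O-H"])
print(len(fingerprint), sum(fingerprint))
```

Because different substructures can hash to the same position, a set bit means "at least one substructure mapped here", which is the usual trade-off of a hashed fingerprint.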
Gene ontology (GO) annotation [Ashburner et al., 2009]
- Twosides database [Tatonetti et al., 2012]
- 645 drugs and 1,618 DDI (drug-drug interaction) types, for 63,473 DDI pairs in total
- CPI database [Wishart et al., 2008]
- chemical protein interactome
- about how much power a drug needs to bind with its protein target
- ChEMBL (StARlite)
- Chemical European Molecular Biology Laboratory
- chemical database of bioactive molecules with drug-like properties.
- 1.8M compounds, 1.1M assays, 69k documents, 12k targets, 11k drugs, 1.7k cells
- TTD database [Chen et al., 2002] [Y. H. Li et al., 2018]
- therapeutic target database
- database to provide information about the known and explored therapeutic protein and nucleic acid targets, the targeted disease, pathway information and the corresponding drugs directed at each of these targets. Also included in this database are links to relevant databases containing information about target function, sequence, 3D structure, ligand binding properties, enzyme nomenclature and drug structure, therapeutic class, clinical development status. All information provided are fully referenced.
- protein and nucleic acid target
- PDBBind dataset [Liu et al., 2017]
- binding affinities for the protein-ligand complexes in the Protein Data Bank (PDB).
- Version 2017, released on Jan 1st, 2017. This release provides binding data for a total of 17,900 biomolecular complexes, including protein-ligand (14,761), nucleic acid-ligand (121), protein-nucleic acid (837), and protein-protein complexes (2,181), which is currently the largest collection of this kind.
- Liu et al., Acc. Chem. Res. 2017, 50, 302-309
- Tox21 Data Challenge 2014
- prediction of compounds' interference in biochemical pathways using only chemical structure data (SMILES).
- GDB Databases
- GDB-11 [Fink, T et al., 2005, 2007]
- small organic molecules up to 11 atoms of C, N, O and F following simple chemical stability and synthetic feasibility rules.
- GDB-13 [Blum L. C. et al., 2009]
- small organic molecules up to 13 atoms of C, N, O, S and Cl following simple chemical stability and synthetic feasibility rules.
- GDB-17 [Ruddigkeit Lars et al., 2012]
- 166.4 billion molecules of up to 17 atoms of C, N, O, S, and halogens.
- Compared to known molecules in PubChem, GDB-17 molecules are much richer in nonaromatic heterocycles, quaternary centers, and stereoisomers, densely populate the third dimension in shape space, and represent many more scaffold types.
- QM (Quantum machine) dataset [Blum L. C. et al., 2009]
- QM7: subset of GDB-13 (a database of nearly 1 billion stable and synthetically accessible organic molecules) composed of all molecules of up to 23 atoms (including 7 heavy atoms C, N, O, and S), totalling 7165 molecules.
- QM7b: extension of the QM7 dataset for multitask learning where 13 additional properties (e.g. polarizability, HOMO and LUMO eigenvalues, excitation energies) have to be predicted at different levels of theory (ZINDO, SCS, PBE0, GW). Additional molecules comprising chlorine atoms are also included, totalling 7211 molecules.
- QM9: computed geometric, energetic, electronic, and thermodynamic properties for 134k stable small organic molecules made up of CHONF. These molecules correspond to the subset of all 133,885 species with up to nine heavy atoms (CONF) out of the GDB-17 chemical universe of 166 billion organic molecules.
- QM8 [L. Ruddigkeit et al., 2012]
- training set of 10 000 molecules, (coupled-cluster) CC2 excitation energies can be reproduced to within ±0.1 eV for the remaining molecules.
- MD Trajectories of small molecules [S. Chmiela et al., 2017]
- The molecular dynamics (MD) datasets in this package range in size from 150k to nearly 1M conformational geometries. All trajectories are calculated at a temperature of 500 K and a resolution of 0.5 fs. The molecules have different sizes and the molecular PESs exhibit different levels of complexity.
- MD Trajectories of C7O2H10
- This data set consists of molecular dynamics trajectories of 113 randomly selected C7O2H10 isomers calculated at a temperature of 500 K and a resolution of 1 fs using density functional theory with the PBE exchange-correlation potential. C7O2H10 is the largest set of isomers in QM9.
- Datasets including densities
- These datasets contain not only molecular geometries and energies but also valence densities. For each dataset, the energies are given in energies.txt (in kcal/mol, one line per molecular geometry). The densities are given in densities.txt (in Fourier basis coefficients, one line per molecular geometry). The structures are given in structures.xyz (with positions in Bohr).
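The file layout described above can be read with a few lines of standard-library Python. This is a sketch under assumptions: `energies.txt` is taken to hold one number per line, and `structures.xyz` is taken to follow the conventional XYZ layout (atom count, comment line, then one `symbol x y z` line per atom). The helper names are made up, and the tiny in-memory strings stand in for the real files.

```python
BOHR_TO_ANGSTROM = 0.529177210903  # Bohr radius in angstrom (CODATA 2018)


def read_energies(text):
    """Parse energies.txt content: one energy (kcal/mol) per non-empty line."""
    return [float(line) for line in text.splitlines() if line.strip()]


def read_xyz_frames(text):
    """Parse XYZ content into frames of (symbol, x, y, z) in angstrom.

    Input positions are assumed to be in Bohr, as stated for structures.xyz.
    """
    lines = text.splitlines()
    frames = []
    i = 0
    while i < len(lines):
        if not lines[i].strip():
            i += 1
            continue
        n_atoms = int(lines[i])
        frame = []
        # Skip the comment line, then read n_atoms coordinate rows.
        for row in lines[i + 2:i + 2 + n_atoms]:
            symbol, x, y, z = row.split()[:4]
            coords = tuple(float(v) * BOHR_TO_ANGSTROM for v in (x, y, z))
            frame.append((symbol,) + coords)
        frames.append(frame)
        i += 2 + n_atoms
    return frames


# Made-up values standing in for the real files.
energies = read_energies("-10.5\n-10.7\n")
frames = read_xyz_frames("2\nframe 0\nH 0.0 0.0 0.0\nH 1.4 0.0 0.0\n")
print(energies, frames)
```

Converting to angstrom at load time avoids mixing units later, since most other structure files in this list use angstrom.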
- ISO17 - MD Trajectories of C7O2H10 with total energies and atomic forces [K.T. Schütt et al., 2017]
- The molecules were randomly drawn from the largest set of isomers in the QM9 dataset [Ramakrishnan et al., 2014], which consists of molecules with a fixed composition of atoms (C7O2H10) arranged in different chemically valid structures. It is an extension of the isomer MD data used in [Schütt et al., 2017].
- R. Ramakrishnan, P. O. Dral, M. Rupp, and O. A. von Lilienfeld. Quantum chemistry structures and properties of 134 kilo molecules. Scientific Data, 1, 2014.
- Schütt, K. T., Arbabzadah, F., Chmiela, S., Müller, K. R., & Tkatchenko, A. (2017). Quantum-chemical insights from deep tensor neural networks. Nature Communications, 8, 13890.
- dSPP: Database of structural propensities of proteins
- This repository comprises residual propensities of individual residues in proteins to populate helical, extended or disordered structural states. The data are derived from experimental NMR assignments of unrelated proteins in solution state near physiological conditions. The residue-specific propensity scores are normalized in a range -1.0 to 1.0 and prepared for machine learning in Plain-text, Numerical Python, Keras and Tensorflow formats.
- Organic photovoltaics [Hachmann, J. et al., 2014]
- candidate structures for organic electronic materials in particular photovoltaics
- Harvard Clean Energy Project.
- promising compounds that have emerged after studying 2.3 million molecular motifs by means of 150 million density functional theory calculations.
- Protein graph [Dobson, P. and Doig, 2003]
- 1178 high-resolution proteins in a structurally non-redundant subset of the Protein Data Bank using simple features such as secondary-structure content, amino acid propensities, surface properties and ligands.
- two functional groupings, enzymes and non-enzymes.
- nodes are amino acids and two nodes are connected if they are less than 6 Angstroms apart.
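The edge rule above (connect two amino acids when they are less than 6 Angstroms apart) can be sketched as follows. The residue names and coordinates are made up, and using a single representative point per residue is an assumption of this sketch.

```python
from itertools import combinations
from math import dist

CUTOFF_ANGSTROM = 6.0  # threshold quoted in the text

# Made-up residue coordinates in angstrom, one point per residue.
residues = {
    "ALA1": (0.0, 0.0, 0.0),
    "GLY2": (3.8, 0.0, 0.0),
    "SER3": (7.6, 0.0, 0.0),
    "LEU4": (3.8, 5.0, 0.0),
}


def contact_edges(coords, cutoff=CUTOFF_ANGSTROM):
    """Return residue pairs whose distance is strictly below the cutoff."""
    return [
        (a, b)
        for a, b in combinations(coords, 2)
        if dist(coords[a], coords[b]) < cutoff
    ]


edges = contact_edges(residues)
print(edges)
```

The pairwise loop is quadratic in the number of residues, which is fine for single proteins. A spatial index would be the usual next step for larger structures.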
- HMDD [Lu M et al., 2008] [Li Y et al., 2014]
- the Human microRNA Disease Database
- database that curated experiment-supported evidence for human microRNA (miRNA) and disease associations.
- HMDD v3.0, released June 28, 2018, contains 32,281 miRNA-disease association entries covering 1,102 miRNA genes and 850 diseases from 17,412 papers.
- DrugTargetCommons [Jing Tang et al., 2018]
- Drug Target Commons (DTC) is a crowd-sourcing platform to improve the consensus and use of drug-target interactions.
- IDG Pharos
- compound and target data resources in the public domain
- Ligand, disease, target
- DDIExtraction2013 [BioNLP Challenge]
- Extraction of Drug-Drug Interactions from BioMedical Texts
- Task 1: Recognition and classification of drug names.
- Task 2: Extraction of drug-drug interactions.
- Biocreative PPI [BioNLP Kaggle]
- BioCreAtIvE (Critical Assessment of Information Extraction Systems in Biology)
- text mining and information extraction systems applied to the biological domain.
- Gene mention tagging [GM]
- Gene normalization [GN]
- Extraction of protein-protein interactions from text
- Polysearch2 [Liu Y et al., 2015]
- online text-mining system for identifying relationships between human diseases, genes, proteins, drugs, metabolites, toxins, metabolic pathways, organs, tissues, subcellular organelles, positive health effects, negative health effects, drug actions, Gene Ontology terms, MeSH terms, ICD-10 medical codes, biological taxonomies and chemical taxonomies.
- SuperTarget [BioNLP]
- database developed primarily to collect information about drug-target relations. It consists mainly of three types of entities: drugs, proteins, and side effects.
- database that contains a core dataset of about 7300 drug-target relations of which 4900 interactions have been subjected to a more extensive manual annotation effort. SuperTarget provides tools for 2D drug screening and sequence comparison of the targets. The database contains more than 2500 target proteins, which are annotated with about 7300 relations to 1500 drugs
- data from DrugBank, BindingDB and SuperCyp
- ConsensusPathDB [Kamburov, A. et al., 2013]
- [2018.10.04] unique physical entities: 170,276, unique interactions: 603,543, gene regulations: 17,410, protein interactions: 397,088, genetic interactions: 1,738, biochemical reactions: 23,482, drug-target interactions: 163,825, pathways: 5,359
- Data originate from currently 32 public resources for interactions and interactions that we have curated from the literature.
- ChemDataExtractor [Swain, M. C., & Cole, J. M, 2016]
- ChemDataExtractor is a Python toolkit for automatically extracting chemical information from scientific documents.
- HTML, XML and PDF document readers
- Chemistry-aware natural language processing pipeline
- Chemical named entity recognition
- Rule-based parsing grammars for property and spectra extraction
- Table parser for extracting tabulated data
- Document processing to resolve data interdependencies
- Cancer Gene Expression Datasets
- two datasets termed GCM and Acute Leukemia datasets
Datasets in Papers
Hybrid Approach of Relation Network and Localized Graph Convolutional Filtering for Breast Cancer Subtype Classification
Sungmin Rhee, Seokjun Seo, Sun Kim
Interpretable Drug Target Prediction Using Deep Neural Representation
Kyle Yingkai Gao, Achille Fokoue, Heng Luo, Arun Iyengar, Sanjoy Dey, Ping Zhang
Drug Similarity Integration Through Attentive Multi-view Graph Auto-Encoders
Tengfei Ma, Cao Xiao, Jiayu Zhou, Fei Wang
Graph Convolutional Policy Network for Goal-Directed Molecular Graph Generation
Jiaxuan You, Bowen Liu, Rex Ying, Vijay Pande, Jure Leskovec
Optimizing distributions over molecular space. An Objective-Reinforced Generative Adversarial Network for Inverse-design Chemistry (ORGANIC)
Benjamin Sanchez-Lengeling, Carlos Outeiral, Gabriel L. Guimaraes, Alan Aspuru-Guzik
ChemRxiv e-prints, 8 2017.