The DataWarrior installation already comes with various sample data files. These include FDA-approved drugs, compound collections with physico-chemical properties or measured pKa values, kinase ligands, and a few files with non-chemical content to illustrate various program features.
This page contains download links to larger data files that are not included in the
DataWarrior installers, because they would significantly increase
installer size or because they may not be of general interest.
The original reaction collection was extracted by Daniel Lowe from United States patents by text mining. The collection consists of two parts: grants between 1976 and September 2016, and applications from 2001 to September 2016. Both reaction sets are available as CML or reaction SMILES. The original files, with about 1.8 million and 1.6 million reactions encoded as reaction SMILES, can be downloaded, unzipped with a program like 7-Zip, and then opened directly with DataWarrior V5.0.0 or newer.
The reactions have been extracted using an enhanced version of the reaction extraction code described in https://www.repository.cam.ac.uk/handle/1810/244727, with LeadMine used for chemical entity recognition. They contain many duplicate reactions, because the same or highly similar text occurs in multiple patents. This is especially true when comparing the application and grant datasets, since many reactions from applications later appear in patent grants. Paragraph numbers are only present for 2005+ patent grants and patent applications. Multiple reactions can be extracted from the same paragraph. Atom maps in the reaction SMILES are derived using EPAM's Indigo toolkit. Although usually correct, the atom maps are wrong in many cases and should not be relied on entirely.
The reactions have been filtered to remove common cases of incorrectly extracted reactions: All product atoms must be accounted for by the atom-mapping. The product(s) must have >8 heavy atoms. The product must not be charged if it is a single component. The number of products must be <5, and the number of reactants plus agents must be <16.
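The filter rules above can be sketched in a few lines of Python. This is only an illustration, not the code used to build the dataset: the `ReactionSummary` class and its fields are hypothetical, and the sketch assumes that the per-reaction counts have already been computed with a cheminformatics toolkit. It also assumes that the heavy-atom threshold applies to all products taken together, which the text does not state explicitly.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ReactionSummary:
    """Hypothetical pre-computed summary of one extracted reaction."""
    n_reactants: int
    n_agents: int
    product_heavy_atoms: List[int]  # heavy-atom count per product component
    product_charges: List[int]      # net formal charge per product component
    all_product_atoms_mapped: bool  # True if the atom map covers every product atom


def passes_filters(r: ReactionSummary) -> bool:
    """Return True if the reaction satisfies all four plausibility rules."""
    n_products = len(r.product_heavy_atoms)
    # Rule 1: every product atom must be accounted for by the atom-mapping.
    if not r.all_product_atoms_mapped:
        return False
    # Rule 2 (assumption: summed over all products): more than 8 heavy atoms.
    if sum(r.product_heavy_atoms) <= 8:
        return False
    # Rule 3: a single-component product must not be charged.
    if n_products == 1 and r.product_charges[0] != 0:
        return False
    # Rule 4: fewer than 5 products, fewer than 16 reactants plus agents.
    if n_products >= 5 or r.n_reactants + r.n_agents >= 16:
        return False
    return True
```

A reaction with two reactants, one agent, and a single neutral, fully mapped product of 12 heavy atoms passes; the same reaction with a charged product does not.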
The file includes columns PatentNumber, ParagraphNum, Year, TextMinedYield, and CalculatedYield.
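For readers who want to process the original reaction SMILES files outside of DataWarrior, the following Python sketch shows one way to read such a file. It rests on several assumptions that should be checked against the actual download: that the file is tab-delimited with a header row, that the reaction column is named `ReactionSmiles`, that any text after the first space in that column is an extension that can be ignored, and that yield cells contain a number optionally decorated with characters such as `%`. The sample row is made up for illustration and is not taken from the dataset.

```python
import csv
import io
import re
from typing import Iterator, Optional, TextIO


def parse_yield(text: Optional[str]) -> Optional[float]:
    """Return the first number found in a yield cell, or None if there is none."""
    match = re.search(r"\d+(?:\.\d+)?", text or "")
    return float(match.group()) if match else None


def read_reactions(handle: TextIO) -> Iterator[dict]:
    """Yield one dict per well-formed row of a tab-delimited reaction file."""
    for row in csv.DictReader(handle, delimiter="\t"):
        # A reaction SMILES has the form reactants>agents>products.
        parts = (row.get("ReactionSmiles") or "").split(" ")[0].split(">")
        if len(parts) != 3:
            continue  # skip rows without a complete reaction SMILES
        reactants, agents, products = (
            [s for s in part.split(".") if s] for part in parts
        )
        yield {
            "reactants": reactants,
            "agents": agents,
            "products": products,
            "patent": row.get("PatentNumber"),
            "year": row.get("Year"),
            "text_mined_yield": parse_yield(row.get("TextMinedYield")),
            "calculated_yield": parse_yield(row.get("CalculatedYield")),
        }


# Made-up sample row (placeholder patent number), for illustration only.
SAMPLE = (
    "ReactionSmiles\tPatentNumber\tParagraphNum\tYear\tTextMinedYield\tCalculatedYield\n"
    "CC(=O)O.OCC>OS(=O)(=O)O>CC(=O)OCC\tUS00000000\t\t1990\t85%\t\n"
)

if __name__ == "__main__":
    for reaction in read_reactions(io.StringIO(SAMPLE)):
        print(reaction["reactants"], reaction["products"], reaction["text_mined_yield"])
```

To read a real file, replace the `io.StringIO(SAMPLE)` handle with `open(path, newline="", encoding="utf-8")`.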
Lit.: Lowe, Daniel (2017): Chemical reactions from US patents (1976-Sep2016). figshare. Fileset.
The DataWarrior file that you can download from this website is a subset of the original files. From the original 3.4 million reactions, we removed all reactions that did not have a text-mined yield between 25 % and 150 %. We also removed duplicate reactions, treating reactions that share the same reactants, products, and catalysts as equal. The remaining set contains about half a million reactions and can be searched directly on your computer by reaction sub-structure (i.e. transformation search), by reaction similarity, or by Retron search. The latter search type is a simple yet unusual concept suggested by Roger Sayle: one draws a sub-structure that is intended to be synthesized. A sub-graph search then checks whether this sub-structure occurs more often in the products than in the reactants. If so, the reaction is considered a match, because the sub-structure must have been built in the reaction. (File size: 186 MB, unzipped: 373 MB, 530'328 reactions)
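The subset selection and the Retron criterion described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure actually used to build the file: the dictionary keys are hypothetical, the yield bounds are taken as inclusive, and duplicates are detected by comparing SMILES strings, which only works if the strings are already canonical. The sub-structure counts for the Retron check would come from a sub-graph search in a cheminformatics toolkit; here they are simply passed in as numbers.

```python
from typing import Dict, Iterable, List, Tuple

Reaction = Dict[str, object]


def reaction_key(rxn: Reaction) -> Tuple[Tuple[str, ...], ...]:
    """Order-independent key built from reactants, products, and catalysts."""
    return tuple(
        tuple(sorted(rxn[name])) for name in ("reactants", "products", "catalysts")
    )


def select_subset(reactions: Iterable[Reaction]) -> List[Reaction]:
    """Keep reactions with a yield of 25 to 150 % and drop duplicates."""
    seen = set()
    kept = []
    for rxn in reactions:
        y = rxn["text_mined_yield"]
        # Assumption: both bounds are inclusive; missing yields are rejected.
        if y is None or not (25.0 <= y <= 150.0):
            continue
        key = reaction_key(rxn)
        if key in seen:
            continue  # same reactants, products, and catalysts seen before
        seen.add(key)
        kept.append(rxn)
    return kept


def is_retron_match(count_in_products: int, count_in_reactants: int) -> bool:
    """A reaction matches if the query occurs more often in the products."""
    return count_in_products > count_in_reactants
```

For example, a query sub-structure found once in the products and not at all in the reactants gives `is_retron_match(1, 0) == True`, while a sub-structure that is merely carried through the reaction (one occurrence on each side) is not a match.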
Since version 4.2.2, DataWarrior is able to generate conformers. The algorithm combines self-organization with a rule-based approach. The latter is based on statistical data derived from a large number of 3-dimensional, diverse, organic structures from a crystallographic database. The de-facto standard source for organic, crystallographic molecule structures would be the Cambridge Structural Database (CSD). Its license, however, does not permit deriving and publishing geometrical statistical data as part of an open source package. Luckily, there is an open alternative, the Crystallography Open Database (COD). While this database consists of one CIF file per structure, Saulius Grazulis and Antanas Vaitkus from the COD have built an automatic procedure to convert the database into DataWarrior format, which is more suitable for cheminformaticians. Their procedure uses a combination of Perl, Java and OpenChemLib. Here you may download a COD snapshot with 255'668 quality-checked 3D-structures in DataWarrior format (136'018 organic, 110'829 metalorganic, and 8821 inorganic structures, 351 MByte, COD snapshot, June 13, 2020).
The DrugBank database is a unique bioinformatics and cheminformatics resource that combines detailed drug (i.e. chemical, pharmacological and pharmaceutical) data with comprehensive drug target (i.e. sequence, structure, and pathway) information. The database contains 8250 drug entries including 2016 FDA-approved small molecule drugs, 229 FDA-approved biotech (protein/peptide) drugs, 94 nutraceuticals and over 6000 experimental drugs. This DataWarrior file is a subset of DrugBank 5.0.10 downloaded from https://www.drugbank.ca. DrugBank is offered to the public as a freely available resource. Use and re-distribution of the data, in whole or in part, for commercial purposes (including internal use) requires a license. We ask that users who download significant portions of the database cite the DrugBank paper in any resulting publications. Citing DrugBank: Wishart DS, Knox C, Guo AC, Shrivastava S, Hassanali M, Stothard P, Chang Z, Woolsey J. DrugBank: a comprehensive resource for in silico drug discovery and exploration. Nucleic Acids Res. 2006 Jan 1;34(Database issue):D668-72. PMID: 16381955.