The DataWarrior installation already comes with various sample data files. These include FDA-approved drugs, compound collections with physico-chemical properties or measured pKa values, kinase ligands, and a few files with non-chemical content to illustrate various program features.
This page contains download links to larger data files that are not included in the DataWarrior installers, either because they would significantly increase the installer size or because they may not be of general interest.
All Non-Covalent Binding Sites From the PDB-Database (Dec 2024)
This file contains most non-covalent binding sites from most of the highly resolved protein-ligand complexes in the PDB database. They can be searched by ligand substructure, resolution, PDB-ID, and upload date. The original structure files were downloaded as '.pdb.gz' files and processed around December 20, 2024.
Then, for every PDB file, the protein surrounding each qualifying ligand was cropped, discarding everything farther than 10 Å from any ligand atom. All entries contain the original title and keywords from the PDB entry. They also contain the PDB-ID as a link that, when clicked, opens that PDB database entry in the web browser. About 10% of the entries contain affinity information (EC50, IC50, Kd) taken from the BindingDB database. Two open filters allow searching all entries by either text or ligand substructure.
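The cropping step described above can be illustrated with a minimal Python sketch. It is not the code used to build the file: it assumes that cropping is done per atom, that atoms are given as plain (x, y, z) tuples in Å, and the function name crop_protein is made up for this example.

```python
import math

def crop_protein(protein_atoms, ligand_atoms, cutoff=10.0):
    """Keep protein atoms within `cutoff` angstroms of at least one ligand atom.

    Assumption: atoms are (x, y, z) tuples; the real process may work on
    residues or use a different data model.
    """
    return [
        p for p in protein_atoms
        if any(math.dist(p, l) <= cutoff for l in ligand_atoms)
    ]

# Toy coordinates, invented for illustration only
ligand = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
protein = [
    (3.0, 4.0, 0.0),    # 5.0 A from the first ligand atom: kept
    (11.4, 0.0, 0.0),   # about 9.9 A from the second ligand atom: kept
    (12.0, 0.0, 0.0),   # 10.5 A from the nearest ligand atom: dropped
    (0.0, 0.0, -25.0),  # far away: dropped
]
kept = crop_protein(protein, ligand)
print(kept)
```

The pairwise loop is fine for a single binding site; for whole structures a spatial grid or k-d tree would avoid comparing every protein atom with every ligand atom.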
The 3D-binding site view shows the molecular surface of the binding site colored by the atom types of close-by protein atoms. The protein backbone is shown as cartoon and its side chains are displayed only if their atoms are close to the ligand. The view slightly rotates back and forth around a vertical axis to improve 3D-perception.
All shown interactions are calculated according to "PLIP: fully automated protein–ligand interaction profiler" by Michael Schroeder et al., doi: 10.1093/nar/gkv315.
You may change display options of the ligand or protein from the popup menu that appears when you right-click while the mouse pointer is on top of the respective molecule.
Important: You need DataWarrior V6.4.0 or newer to properly work with these files!
The original file created by the process described above contains 185'474 binding sites out of 61'505 PDB entries and requires 1.3 GB of hard disk space and slightly more RAM when opened by DataWarrior. If your resources are limited, consider downloading one of the subset files. The first subset was created by removing all duplicates from the original file, considering the same ligand in the same PDB entry as a redundant binding site. The second subset was created by keeping from every PDB entry just the binding site with the largest ligand structure.
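The two subset rules can be sketched in a few lines of Python. This is only an illustration under assumptions: the record keys pdb_id, ligand and ligand_atoms are hypothetical, ligand identity is compared as a plain string, and "largest ligand" is taken to mean the highest atom count.

```python
def remove_duplicates(sites):
    """First subset: keep one binding site per (PDB entry, ligand) pair."""
    seen = set()
    unique = []
    for site in sites:
        key = (site["pdb_id"], site["ligand"])
        if key not in seen:
            seen.add(key)
            unique.append(site)
    return unique

def largest_ligand_per_entry(sites):
    """Second subset: keep, per PDB entry, the site with the largest ligand.

    Assumption: ligand size is measured as an atom count; ties keep the
    first site encountered.
    """
    best = {}
    for site in sites:
        current = best.get(site["pdb_id"])
        if current is None or site["ligand_atoms"] > current["ligand_atoms"]:
            best[site["pdb_id"]] = site
    return list(best.values())

# Placeholder records, not real PDB entries
sites = [
    {"pdb_id": "1ABC", "ligand": "LIG_A", "ligand_atoms": 31},
    {"pdb_id": "1ABC", "ligand": "LIG_A", "ligand_atoms": 31},
    {"pdb_id": "1ABC", "ligand": "LIG_B", "ligand_atoms": 12},
    {"pdb_id": "2XYZ", "ligand": "LIG_B", "ligand_atoms": 12},
]
no_duplicates = remove_duplicates(sites)
largest_only = largest_ligand_per_entry(sites)
print(len(no_duplicates), len(largest_only))
```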
Chemical reactions from US patents (1976-Sep2016)
The original reaction collection was extracted by Daniel Lowe using text-mining from United States patents. The collection contains two parts: grants between 1976 and September 2016 and applications from 2001 till September 2016. Both reaction sets are available as CML or reaction SMILES. The original files with about 1.8 million and 1.6 million reactions encoded as reaction SMILES can be downloaded, unzipped using a program like 7-Zip, and then directly opened with DataWarrior V5.0.0 or newer.
The reactions have been extracted using an enhanced version of the reaction extraction code described in https://www.repository.cam.ac.uk/handle/1810/244727 with LeadMine used for chemical entity recognition. They contain many duplicate reactions because the same or highly similar text occurs in multiple patents. This is especially true when comparing the application and grant datasets, since many reactions from applications later appear in patent grants. Paragraph numbers are only present for 2005+ patent grants and patent applications. Multiple reactions can be extracted from the same paragraph. Atom maps in the reaction SMILES are derived using Epam's Indigo toolkit. While typically correct, the atom maps are wrong in many cases and hence should not be entirely relied on.
The reactions have been filtered to remove common cases of incorrectly extracted reactions: All product atoms must be accounted for by the atom-mapping. The product(s) must have >8 heavy atoms. The product must not be charged if it is a single component. The number of products must be <5 and number of reactants+agents<16.
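As a rough illustration of the two component-count rules above, the following Python sketch checks a reaction SMILES of the form reactants>agents>products. It is not the original filtering code. It assumes that components are separated by '.', that any annotation following whitespace can be ignored, and it does not cover the atom-mapping, heavy-atom, or charge rules, which need a chemistry toolkit. The function names are made up for this example.

```python
def components(part):
    """Split one side of a reaction SMILES into its '.'-separated components."""
    return part.split(".") if part else []

def passes_count_rules(rxn_smiles):
    """Check: number of products < 5 and reactants + agents < 16.

    Assumption: every '.'-separated fragment counts as one component, so a
    salt written as two fragments is counted twice.
    """
    core = rxn_smiles.split()[0]  # drop any trailing annotation
    reactants, agents, products = (components(p) for p in core.split(">"))
    return len(products) < 5 and len(reactants) + len(agents) < 16

# Esterification with an acid catalyst: 2 reactants, 1 agent, 2 products
print(passes_count_rules("CC(=O)O.OCC>[H+]>CC(=O)OCC.O"))
# Artificial example with 5 products, which violates the first rule
print(passes_count_rules("CC>>C.C.C.C.C"))
```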
The file includes columns PatentNumber, ParagraphNum, Year, TextMinedYield, and CalculatedYield.
Lit.: Lowe, Daniel (2017): Chemical reactions from US patents (1976-Sep2016). figshare. Fileset.
The DataWarrior file that you can download from this website is a subset of the original files. From the original 3.4 million reactions, we removed all reactions that did not have a text-mined yield between 25 % and 150 %. In addition, we removed duplicate reactions, treating reactions that share the same reactants, products, and catalysts as equal. The remaining set contains about half a million reactions and can be searched directly on your computer by reaction substructure (i.e. transformation search), reaction similarity, or Retron search. The latter search type is a simple yet unusual concept, which was suggested by Roger Sayle: one draws a substructure that is intended to be synthesized. A subgraph search then checks whether this substructure is found more often in the products than in the reactants. If so, the reaction is considered a match, because the substructure must have been built in the reaction. (File size: 186 MB, unzipped: 373 MB, 530'328 reactions)
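The subset rules and the Retron match criterion can be sketched in Python as follows. This is a simplified illustration, not DataWarrior's implementation: the record keys are hypothetical, the yield bounds are assumed to be inclusive, structures are assumed to be given as canonical strings so that equal molecules compare equal, and the substructure counts for the Retron test are assumed to come from a separate substructure search.

```python
def in_yield_range(rxn, low=25.0, high=150.0):
    """True if the reaction has a text-mined yield within [low, high] percent."""
    y = rxn.get("text_mined_yield")
    return y is not None and low <= y <= high

def reaction_key(rxn):
    """Order-independent key: same reactants, products and catalysts are equal."""
    return (
        frozenset(rxn["reactants"]),
        frozenset(rxn["products"]),
        frozenset(rxn["catalysts"]),
    )

def build_subset(reactions):
    """Apply the yield filter, then drop duplicates (first occurrence wins)."""
    seen, subset = set(), []
    for rxn in reactions:
        if not in_yield_range(rxn):
            continue
        key = reaction_key(rxn)
        if key not in seen:
            seen.add(key)
            subset.append(rxn)
    return subset

def is_retron_match(count_in_products, count_in_reactants):
    """Retron rule: the substructure occurs more often in products than reactants."""
    return count_in_products > count_in_reactants

# Placeholder records; "A", "B", ... stand for canonical structure strings
reactions = [
    {"reactants": ["A", "B"], "products": ["C"], "catalysts": [], "text_mined_yield": 80.0},
    {"reactants": ["B", "A"], "products": ["C"], "catalysts": [], "text_mined_yield": 75.0},
    {"reactants": ["A", "B"], "products": ["C"], "catalysts": ["Pd"], "text_mined_yield": 90.0},
    {"reactants": ["D"], "products": ["E"], "catalysts": [], "text_mined_yield": 10.0},
    {"reactants": ["F"], "products": ["G"], "catalysts": [], "text_mined_yield": None},
]
subset = build_subset(reactions)
print(len(subset))
```

For example, if a drawn ring system is found once in the products and not at all in the reactants, is_retron_match(1, 0) reports a match, whereas a substructure that is merely carried through the reaction (one occurrence on each side) does not match.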
January 2025 Snapshot of the Crystallography Open Database (COD)
Since version 4.2.2 DataWarrior is able to generate conformers. The algorithm uses a combination of self-organization and a rule-based approach. The latter is based on statistical data derived from a large number of diverse, 3-dimensional organic structures from a crystallographic database. The de-facto standard source for organic, crystallographic molecule structures would be the Cambridge Structural Database (CSD). Its license, however, does not permit deriving and publishing geometrical statistical data as part of an open source package. Luckily, there is an open alternative, the Crystallography Open Database (COD). While this database consists of one CIF file per structure, Saulius Grazulis and Antanas Vaitkus from the COD have built an automatic procedure to convert the database into DataWarrior format, which is more suitable for cheminformaticians. Their procedure uses a combination of Perl, Java and OpenChemLib. Here you may download a COD snapshot with 257'823 quality-checked 3D-structures in DataWarrior format (139'602 organic, 111'467 metalorganic, and 6754 inorganic structures, 354 MByte, COD snapshot, Aug 13, 2025).
DrugBank Version 5.0.10 (Subset in DataWarrior format)
The DrugBank database is a unique bioinformatics and cheminformatics resource that combines detailed drug (i.e. chemical, pharmacological and pharmaceutical) data with comprehensive drug target (i.e. sequence, structure, and pathway) information. The database contains 8250 drug entries including 2016 FDA-approved small molecule drugs, 229 FDA-approved biotech (protein/peptide) drugs, 94 nutraceuticals and over 6000 experimental drugs. This DataWarrior file is a subset of DrugBank 5.0.10 downloaded from https://www.drugbank.ca. DrugBank is offered to the public as a freely available resource. Use and re-distribution of the data, in whole or in part, for commercial purposes (including internal use) requires a license. We ask that users who download significant portions of the database cite the DrugBank paper in any resulting publications. Citing DrugBank: Wishart DS, Knox C, Guo AC, Shrivastava S, Hassanali M, Stothard P, Chang Z, Woolsey J. DrugBank: a comprehensive resource for in silico drug discovery and exploration. Nucleic Acids Res. 2006 Jan 1;34(Database issue):D668-72. 16381955.