DataWarrior User Manual

Accessing External Resources


DataWarrior, as its name implies, thrives on data. Since the amount of structured data freely available over the internet is growing daily, DataWarrior has built-in functionality to access some of the most obvious data sources directly. Our intention is to make more data sources available with future software updates.


Wikipedia

Wikipedia is undoubtedly one of the most important freely available sources of knowledge on the internet. Many thousands of its articles are about chemical substances. DataWarrior can retrieve an up-to-date list of all Wikipedia compound structures with their names and formulas into a local DataWarrior file. To do so, choose Retrieve Wikipedia Molecules from the Database menu. This offers an easy way to search or otherwise process the chemical content of Wikipedia offline. Within the local file every compound has a link back to the corresponding Wikipedia web page. A right mouse click within DataWarrior opens a popup menu that lets you open the compound's Wikipedia article in the web browser.


Google Patents

At the time of this writing Google had compiled, translated and indexed more than 120 million patent publications from 22 patent offices around the world including the USA, Europe, Japan, China, South Korea, WIPO, Russia, etc. During this process chemical structures are extracted and added to the index to allow for structure searching. DataWarrior provides an easy interface to access that wealth of information. In DataWarrior Google Patents can be searched using keywords, by meta data like assignee, inventor, or date, and of course, combined with a chemical structure query.

Exact structure searches in Google Patents or Google Scholar can be launched from any compound within any DataWarrior view by right-clicking the compound and choosing Search Structure in Google Patents (or Scholar). For more flexibility select Search Google Patents... from the Database menu, which opens the following dialog.

Google Patents search for strawberry aroma with given substructure.

Query Structure: If your patent search should be restricted to patents containing a specific structure, a similar structure, or a super-structure of your query, then draw, paste, or drop the query structure into this field. Note that for sub-structure searches (called super-structure searches in this dialog) DataWarrior does not allow query features to be specified.

Search Type: Here you may choose from these structure search modes: exact, similar, and super-structure, which is often called sub-structure search.

Keywords: In this field you may specify one or more keywords that have to be found in the patent text. Each word automatically includes plurals and close synonyms. If you provide multiple words, they are combined with a logical AND. You may form logical expressions using the keywords AND and OR, e.g. "(safety belt) OR B60R22/00". In addition you may specify proximity operators such as NEAR, ADJ, and SAME, e.g. "(safety ADJ/5 belt) NEAR/10 (baby OR child) SAME vehicle". Proximity operators, however, only influence the ranking of patents, not whether a patent is returned as a result.

More information about using keywords or meta data is available on Google's About Google Patent Searches page.
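As a minimal sketch of how such keyword expressions are put together, the following Python snippet assembles the proximity example from the Keywords description as a plain string. The helper names any_of and proximity are hypothetical and only serve this illustration; they are not part of DataWarrior or of any Google interface.

```python
def any_of(*terms):
    """Join alternatives with OR and wrap them in parentheses."""
    return "(" + " OR ".join(terms) + ")"


def proximity(operator, left, right, distance=None):
    """Combine two terms with NEAR, ADJ, or SAME, optionally with a word distance."""
    if operator not in {"NEAR", "ADJ", "SAME"}:
        raise ValueError(f"unsupported proximity operator: {operator}")
    op = operator if distance is None else f"{operator}/{distance}"
    return f"{left} {op} {right}"


# Rebuild the example expression from the Keywords description step by step.
belt = "(" + proximity("ADJ", "safety", "belt", 5) + ")"
occupants = any_of("baby", "child")
query = proximity("SAME", proximity("NEAR", belt, occupants, 10), "vehicle")
print(query)
```

The resulting string can be pasted into the Keywords field as it is.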

Results of Google Patents search with links to original patents.

If a patent search is successful, then a list of matching patents is returned to DataWarrior. In addition to the patent ID, its title, assignee, inventor, and some dates, each entry contains a link to the original patent as well as links to supporting drawings and other information, if available.


The ChEMBL Activity Database

The ChEMBL database is one of the most widely used compound activity databases available. DataWarrior lets you query the database by specifying search criteria such as biological targets, PubMed IDs, or chemical structures. Structure searches may be based on chemical similarities, contained substructures, structure equivalence or tautomers.

ChEMBL Database Query Dialog

In the following the ChEMBL database query dialog is explained in more detail:

Target Contains: This field lets you quickly filter the target list by target name, gene accession number, organism or other available target related information. You may specify multiple search phrases, for instance "renin rat".

Hierarchical Protein Family Filter: In the ChEMBL database many target proteins are assigned to a small protein family. Multiple small related protein families are grouped on a higher level into a less specific larger family. Related larger families are again grouped into an even larger one and so forth. This forms a hierarchical tree of related proteins. This filter lets you narrow down the targets by first selecting a coarse protein family and then successively selecting subfamilies of the previously selected protein family.

Select Target(s): This text area contains a list of targets that are available in the ChEMBL database and not filtered out by any filter criteria. If one or multiple targets of this list are selected, then the search will only retrieve activity values on these targets.

Target Detail: If one target is selected in the target list above, then this text area shows some detail about the target, e.g. the name, UniProt accession number, type, organism and its protein family classification.

Structure search options: To run a structure search as part of the ChEMBL database search, you can either specify (draw, paste or drag&drop) a structure in the Structure field of the dialog, or you may select one or more structures in a DataWarrior window before opening the ChEMBL search dialog. The type of structure search can be selected from a dedicated menu among sub-structure, exact match, non-stereo specific match, tautomer match and structure similarity. Choosing structure similarity activates a slider to define the similarity limit. If some structures were pre-selected and the option any selected structure is chosen, then a database compound is considered a hit if its structure matches any of the selected structures.

Pubmed-ID(s) or DOI(s): Here you can specify one or multiple source papers separated by comma, semicolon or space. For most papers the ChEMBL database contains Pubmed-IDs, while the DOI is often missing. Thus, searching for Pubmed-IDs is much more likely to yield useful results than searching for DOIs.

Group results with same compound, target, and result type: If this option is selected, then all results with the same target, the same chemical structure, and the same result type are merged into one result row. Within these rows individual result values appear on separate lines within one table cell.
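The effect of the grouping option can be pictured with a small Python sketch. It is an illustration only, not DataWarrior's implementation: the row layout with the keys structure, target, type, and value is an assumption, and the sample values are made up.

```python
from collections import defaultdict


def group_results(rows):
    """Merge rows that share structure, target, and result type into one row."""
    grouped = defaultdict(list)
    for row in rows:
        key = (row["structure"], row["target"], row["type"])
        grouped[key].append(str(row["value"]))
    # Individual values end up on separate lines within one cell.
    return [
        {"structure": s, "target": t, "type": ty, "value": "\n".join(values)}
        for (s, t, ty), values in grouped.items()
    ]


# Made-up example rows: two IC50 values and one Ki value for the same compound.
rows = [
    {"structure": "CCO", "target": "Renin", "type": "IC50", "value": 12.0},
    {"structure": "CCO", "target": "Renin", "type": "IC50", "value": 15.5},
    {"structure": "CCO", "target": "Renin", "type": "Ki", "value": 3.2},
]
merged = group_results(rows)
for row in merged:
    print(row)
```

Here the two IC50 rows collapse into one row with a two-line value cell, while the Ki row stays separate.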

To run a query, you need to specify at least one of the three kinds of search criteria: targets, compound structures, or paper references. If your intention is to download the entire database, you should do that from the ChEMBL web site.
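The rule that at least one kind of criterion must be present, together with the separators accepted in the reference field, can be sketched as follows. The function names are hypothetical and the identifiers in the usage lines are placeholders, not real Pubmed-IDs.

```python
import re


def split_references(text):
    """Split a reference field on commas, semicolons, or whitespace."""
    return [token for token in re.split(r"[,;\s]+", text.strip()) if token]


def has_search_criteria(targets, structures, reference_text):
    """Return True if at least one of the three kinds of criteria is given."""
    return bool(targets) or bool(structures) or bool(split_references(reference_text))


# Placeholder identifiers, mixed separators.
print(split_references("111, 222;333 444"))
print(has_search_criteria([], [], ""))
print(has_search_criteria([], [], "111"))
```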


The Enamine Building Block Database

Commercially available chemicals are the basis for every custom synthesis, and searching the catalogs of chemical building block providers is day-to-day business for every bench chemist. Acknowledging the role of Enamine as one of the largest building block providers, DataWarrior offers a flexible structure search on the Enamine building block database, which can be combined with filter criteria on price and amount. To define and run a search select Search Enamine Building Blocks... from the Database menu. The following dialog opens:

Enamine building block search dialog with aniline substructure

The Enamine building block search dialog offers these options:

Structure search options: The type of the structure search can be selected at the top left menu from these options: sub-structure, exact match, non-stereo specific match, tautomer match, and structure similarity. When the similarity option is chosen, a slider is activated that lets you adjust the similarity threshold. The query structure itself can be defined in the dialog's Structure field by dragging or pasting a structure into it and/or by editing the structure in the structure editor that opens upon a double click. Alternatively, you may select one or more structures in a DataWarrior window before opening the building block search dialog. If you then select the option any selected structure in the top right popup menu, then any database compound that matches at least one of the preselected structures is considered a hit.

Maximum price: Here you define the maximum price you intend to pay for a building block.

Minimum package size: This is the smallest amount in mg you consider sufficient for your purpose.

Molweight: This restricts the molecular weight of the building blocks to be below or above a certain value or to be within a defined range.

Maximum row count: This limits the number of retrieved building blocks to the given number.

Random, Cheapest, Diverse: If a maximum result count is defined and if the server finds more building blocks matching your search criteria, then this option defines the server's strategy for selecting a reasonable building block subset that is finally sent to the DataWarrior client. This may be a random, the cheapest, or the most diverse subset of all matching compounds.
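How these options combine can be pictured with a small Python sketch that filters an in-memory list. It is not Enamine's server logic: the field names price, package_mg, and molweight as well as the sample entries are hypothetical, and the diverse strategy is left out because it would need a structure similarity measure.

```python
import random


def select_building_blocks(blocks, max_price=None, min_package_mg=None,
                           molweight_range=(None, None), max_rows=None,
                           strategy="cheapest", seed=0):
    """Apply the filter criteria, then reduce the hits to max_rows if needed."""
    low, high = molweight_range
    hits = [
        b for b in blocks
        if (max_price is None or b["price"] <= max_price)
        and (min_package_mg is None or b["package_mg"] >= min_package_mg)
        and (low is None or b["molweight"] >= low)
        and (high is None or b["molweight"] <= high)
    ]
    # The subset strategy only matters if more hits exist than rows requested.
    if max_rows is None or len(hits) <= max_rows:
        return hits
    if strategy == "cheapest":
        return sorted(hits, key=lambda b: b["price"])[:max_rows]
    if strategy == "random":
        return random.Random(seed).sample(hits, max_rows)
    raise ValueError(f"unsupported strategy: {strategy}")


# Hypothetical catalog entries.
blocks = [
    {"id": "BB-1", "price": 50, "package_mg": 100, "molweight": 120.0},
    {"id": "BB-2", "price": 20, "package_mg": 250, "molweight": 180.0},
    {"id": "BB-3", "price": 35, "package_mg": 500, "molweight": 310.0},
    {"id": "BB-4", "price": 90, "package_mg": 1000, "molweight": 150.0},
]
cheapest = select_building_blocks(blocks, max_price=60, min_package_mg=100,
                                  molweight_range=(None, 350), max_rows=2)
print([b["id"] for b in cheapest])
```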

Result table after Enamine building block search.


The Crystallography Open Database (COD)

The COD is probably the most comprehensive open database for crystallographic structures of small molecules. It is a valuable resource for studying conformational aspects of molecules like typical torsion angles, bond lengths and atom distances. The COD contains more than 360,000 organic and organo-metallic structures. DataWarrior uses conformational knowledge extracted from the COD for its conformation generation algorithm.

DataWarrior lets you run sub-structure queries on the COD database, which return the matching 3D-structures along with meta data. The picture below shows a DataWarrior window with organic structures from the COD and a form view containing a large 3D-structure view.

COD database entry showing the torsion angle of a sulfonamide

If you have measured or will create X-ray structures from small molecules, please consider uploading them to the COD. Open source software needs open data and the well known Cambridge Structural Database does not qualify as open data, since neither its structures nor any extracted information may be used as part of open source software.


Data From Custom URL

The internet is full of web services that offer the retrieval of data tables of any kind, typically in TAB delimited or similar text formats. For instance, the Open Source Malaria Project maintains data in a Google spreadsheet, which can be accessed in TAB delimited format via this URL:
https://docs.google.com/spreadsheets/d/1Rvy6OiM291d1GN_cyT6eSw_C3lSuJ1jaR7AJa8hgGsc/export?format=tsv
One of its columns contains the chemical structures in SMILES format. DataWarrior's Database menu contains an item Retrieve Data From URL to retrieve data from such web resources. Whenever a column of such a web resource contains SMILES codes, then DataWarrior recognizes them and creates an additional column showing the associated chemical structures. If one frequently retrieves updated data from the same URL, then it may be a good idea to create a short macro with the URL retrieval task and save it in the macro folder within the DataWarrior installation folder. This way the Macro menu contains an item to directly retrieve the data into a new DataWarrior window.
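To see what such a TAB delimited resource looks like outside of DataWarrior, the following Python sketch parses TAB delimited text and guesses which columns hold SMILES. It is an illustration under assumptions: DataWarrior's own SMILES recognition is not described here, so the sketch simply looks at column titles, and the sample rows are made up. The fetch_text helper is defined but not called, so the snippet runs without network access.

```python
import csv
import io
import urllib.request


def fetch_text(url):
    """Download a URL and decode the response as UTF-8 text."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")


def parse_tsv(text):
    """Parse TAB delimited text into a list of dicts keyed by column title."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def smiles_columns(rows):
    """Guess SMILES columns by title (an assumption, not DataWarrior's logic)."""
    if not rows:
        return []
    return [name for name in rows[0] if "smiles" in name.lower()]


# Made-up sample data standing in for a downloaded table.
sample = "ID\tSMILES\tIC50\nA-1\tc1ccccc1N\t1.5\nA-2\tCCO\t20\n"
rows = parse_tsv(sample)
print(smiles_columns(rows))
print(rows[0]["SMILES"])
```

To work on live data, pass the text returned by fetch_text(url) to parse_tsv instead of the sample string.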


Accessing Relational Databases

The majority of all databases are relational databases, which can be accessed using a dedicated language called Structured Query Language (SQL). SQL is used to specify the tables and columns from which to retrieve data, to define the query conditions, and how to logically join tables, when information from multiple tables needs to be retrieved. For accessing a relational database directly from DataWarrior one needs to provide the so-called connect-string or connection URL as well as the SQL command to be executed.

SQL-Query to retrieve all reliable activity data from the ChEMBL database

The screenshot above shows the SQL-Query dialog configured to retrieve data from a ChEMBL database in MySQL-format running on a server with the domain name myserver.mycompany.com. In addition to MySQL DataWarrior also supports PostgreSQL, Oracle, and Microsoft SQL-Server databases. The SQL statement shown is a typical example. After the SELECT keyword it defines a few columns from various tables to retrieve data from. The primary table activities after the FROM keyword is logically bound to other tables with the JOIN keyword and after the WHERE keyword some conditions are defined that specify and limit the data to be retrieved.
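The structure of such a statement can be tried out with the following Python sketch. It does not reproduce the statement from the screenshot. The table and column names follow the public ChEMBL schema as far as we are aware, but should be treated as assumptions, as should the use of a confidence score threshold to stand in for reliable data. To keep the example runnable anywhere, it uses an in-memory SQLite database with a few made-up rows instead of a MySQL server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Miniature stand-in for a few ChEMBL tables (names are assumptions).
conn.executescript("""
CREATE TABLE activities (activity_id INTEGER, assay_id INTEGER, molregno INTEGER,
                         standard_type TEXT, standard_value REAL, standard_units TEXT);
CREATE TABLE assays (assay_id INTEGER, tid INTEGER, confidence_score INTEGER);
CREATE TABLE target_dictionary (tid INTEGER, pref_name TEXT, organism TEXT);
CREATE TABLE compound_structures (molregno INTEGER, canonical_smiles TEXT);
INSERT INTO activities VALUES (1, 10, 100, 'IC50', 25.0, 'nM');
INSERT INTO activities VALUES (2, 11, 101, 'IC50', 900.0, 'nM');
INSERT INTO assays VALUES (10, 1, 9);
INSERT INTO assays VALUES (11, 1, 4);
INSERT INTO target_dictionary VALUES (1, 'Renin', 'Homo sapiens');
INSERT INTO compound_structures VALUES (100, 'CCO');
INSERT INTO compound_structures VALUES (101, 'c1ccccc1');
""")

# SELECT lists the columns, FROM names the primary table, JOIN binds
# the other tables to it, and WHERE limits the rows to be retrieved.
QUERY = """
SELECT cs.canonical_smiles, td.pref_name, act.standard_type,
       act.standard_value, act.standard_units
FROM activities act
JOIN assays a ON a.assay_id = act.assay_id
JOIN target_dictionary td ON td.tid = a.tid
JOIN compound_structures cs ON cs.molregno = act.molregno
WHERE a.confidence_score >= 8 AND act.standard_type = 'IC50'
"""
rows = conn.execute(QUERY).fetchall()
print(rows)
```

Only the first activity passes the confidence filter, so a single row with SMILES, target name, and activity value is returned.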

Note that one of the retrieval columns contains chemical structures in SMILES format. These are automatically recognized by DataWarrior and an additional column containing the corresponding chemical structures is generated.

Note also: If DataWarrior has an open document window when an SQL query is performed, and if a column of DataWarrior's data table contains identifiers used in the database, then one may construct an SQL statement using those identifiers as retrieval condition in the SQL statement's WHERE-clause as in "WHERE my_table.compound_id IN($IDNUMBER)". In this example the DataWarrior document would contain a column named 'IDNUMBER' containing identifiers used in the database column 'compound_id' of table 'my_table'. Before processing the SQL statement, DataWarrior would resolve 'IN($IDNUMBER)' to IN('ID1','ID2','ID3') assuming that the IDNUMBER column would contain the entries 'ID1', 'ID2', and 'ID3'.
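The substitution described above can be sketched in a few lines of Python. This is an illustration of the described behavior, not DataWarrior's code: the function name is hypothetical, the table is modeled as a plain dictionary of columns, and only column names made of letters, digits, and underscores are handled.

```python
import re


def resolve_placeholders(sql, table):
    """Replace each $COLUMN with a quoted, comma-separated list of its values."""
    def expand(match):
        column = match.group(1)
        if column not in table:
            raise KeyError(f"no column named {column!r}")
        # Double any single quote so the value stays a valid SQL string literal.
        return ",".join("'" + str(v).replace("'", "''") + "'" for v in table[column])
    return re.sub(r"\$(\w+)", expand, sql)


table = {"IDNUMBER": ["ID1", "ID2", "ID3"]}
sql = "SELECT * FROM my_table WHERE my_table.compound_id IN($IDNUMBER)"
resolved = resolve_placeholders(sql, table)
print(resolved)
```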


Synthesis Planning

Computational retrosynthetic synthesis planning is one of the oldest challenges in cheminformatics, with early substantial contributions from E. J. Corey, J. B. Hendrickson, and others starting as early as the 1960s. In the late 1980s multiple large pharmaceutical companies joined forces in the CASP project, yielding software and thousands of hand-coded transformation rules. Unfortunately, none of the software solutions from the last century achieved wide usage in the chemistry community. For about 20 years little was heard about computer aided synthesis planning tools, until in recent years multiple papers triggered new interest in the field, e.g. by M.H.S. Segler and M.P. Waller, who introduced a Monte Carlo Tree Search to handle the combinatorial explosion and trained neural networks to replace rule based approaches for assessing transformation feasibility and functional group compatibility. At the same time rule based systems had also made considerable progress, e.g. Chematica by B. A. Grzybowski et al. Currently, multiple commercial software packages offer synthesis planning functionality to the chemistry community.

DataWarrior does not have its own synthesis planning functionality, but it lets you directly access the free version of Spaya by right-clicking in any of the DataWarrior main views and choosing Suggest Synthesis Route -> of Structure using Spaya.ai. This opens the Spaya.ai web site, prefilled with your molecule, in your preferred web browser. Spaya.ai requires you to register the first time you access the platform. Later only one click is needed to log in again. The login is needed to persistently store user settings.

Some companies run their own internal Spaya server. To configure DataWarrior to access your own server rather than the public spaya.ai, press CTRL and at the same time the right mouse button over a structure. Then choose Suggest Synthesis Route -> Change Spaya Server URL. This opens a dialog to update the server name, which is then stored and used for any further requests.


Continue with Creating and using row lists...