www.openmolecules.org

Loading Data into DataWarrior

In addition to its own native file formats, DataWarrior can read and write TAB-delimited and comma-separated text files containing pure data tables of alphanumerical data. DataWarrior also imports and exports SD-files, which are the de-facto industry standard for exchanging chemical information. Beyond reading data from files, data may be imported from the clipboard or retrieved from databases. After reading data from any source, DataWarrior analyses every column to understand the kind of data it contains, i.e. whether it contains numerical and/or category data, whether it contains chemical structures, whether it contains unique values or empty values, etc. It also checks for numerical correlations between columns. Then it creates default views and filters.

If DataWarrior was installed correctly, then every file type discussed in this section should have a proper icon assigned, and double-clicking a file's icon should result in DataWarrior opening the file. This section explains the interaction with files and the clipboard.

Native DataWarrior Files

When a data window is saved, DataWarrior typically writes a native DataWarrior file with the .dwar extension. Technically, native DataWarrior files are plain text files combining a TAB-delimited table of the data itself with some additional XML-like sections providing further information. A more detailed .dwar file format explanation can be found here. In addition to the plain table data, .dwar files may contain the following kinds of information:

Which views are visible, how they are arranged and what they display

Which filters are visible, how they are configured and, thus, which data rows are visible

Which row lists are defined and which rows belong to them

Which rows are currently selected and which row is chosen to be the reference row

An HTML based text describing the file's content

Column meta data describing column content, e.g. defining range, rounding, references to other columns, etc.

Cell related detail data like formatted text and images

Keys and links for on-the-fly retrieval of cell related detail data from external sources

Hidden columns with molecule and reaction data such as descriptors and atom coordinates

Macros that make it possible to completely automate DataWarrior (from version 4.0)

To open a native DataWarrior file, choose Open from the File menu or just double-click an icon representing a .dwar file.

DataWarrior Template Files

A DataWarrior Template file contains the complete configuration of views and filters as they were when the Template file was saved. If you want to store the current state of views and filters of an open DataWarrior window in order to possibly restore it later with the same or another dataset, you may save a Template file. To re-apply a formerly stored template to an open DataWarrior window, choose Open Special -> Apply Template... from the File menu. You may then select either a .dwat or a .dwar file. In both cases the template will be read from the file and all views and filters will be replaced by new ones as defined in the file.

DataWarrior Macro Files

DataWarrior versions 4.0 and above support recording, editing and replaying entire workflows. These may be stored as part of a native DataWarrior file or can be exported into a dedicated macro file. Similar to templates, you may run a macro by opening a dedicated macro file with Open Special -> Run Macro... from the File menu.

DataWarrior SOM Files

By creating a self-organizing map (SOM), DataWarrior can position chemical molecules or other objects on a two-dimensional area in such a way that any object's closest neighbours in the plane are the objects most similar to it in the dataset. A calculated SOM is actually a 2-dimensional grid of reference vectors, each of which resembles one or more molecules/objects of the dataset. Once these reference vectors are calculated, the objects are assigned one by one to the reference vector that is most similar to the object. If one intends to map a second set of objects from an external file to a previously calculated SOM, then these vectors must be available. For that reason they can be saved as a SOM file, which can later be used to map external objects, effectively creating compatible 2-dimensional object coordinates.
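The assignment step can be illustrated with a short Python sketch. This is not DataWarrior's implementation: it assumes that objects and reference vectors are plain numeric vectors compared by Euclidean distance, and the names best_matching_unit and grid as well as all values are hypothetical.

```python
import math

def best_matching_unit(grid, vector):
    """Return (x, y) of the grid cell whose reference vector is closest to `vector`."""
    best_pos, best_dist = None, math.inf
    for y, row in enumerate(grid):
        for x, ref in enumerate(row):
            dist = math.dist(ref, vector)  # Euclidean distance (assumed similarity measure)
            if dist < best_dist:
                best_pos, best_dist = (x, y), dist
    return best_pos

# A tiny 2x2 grid of reference vectors, as if loaded from a saved SOM file (made-up values).
grid = [
    [(0.0, 0.0), (1.0, 0.0)],
    [(0.0, 1.0), (1.0, 1.0)],
]

# Objects from a second dataset are mapped onto the existing grid.
new_objects = [(0.1, 0.2), (0.9, 0.8), (0.8, 0.1)]
positions = [best_matching_unit(grid, obj) for obj in new_objects]
```

Because the grid is reused rather than recalculated, objects from the second dataset receive coordinates that are directly comparable to those of the original dataset.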

DataWarrior Query Files

A .dwaq file or Query File does not contain any data. It rather contains a database query that is performed when the file is opened. Moreover, it may contain the template information needed to construct certain views and filter settings after the query result data has been retrieved. Query files are used if data in a database is frequently changing or to confidentially communicate new results, e.g. via e-mail. To open a .dwaq file, select Open Special -> Run Query... from the File menu, or double-click the icon representing the file.

SD-Files

SD-Files are the de-facto industry standard for exchanging chemical structures and associated alphanumerical information. The format was developed and published by Molecular Design Ltd. (MDL). The most widely used version is version 2, which has limited support for stereochemistry: a so-called chiral flag defines for the entire molecule whether it is a racemate or a single enantiomer. With version 2 SD-files it is not possible to define epimers, mixtures of diastereomers, etc. To tackle these deficiencies, MDL introduced an updated concept called Enhanced Stereo Recognition along with an updated file format: version 3. DataWarrior consistently uses this new concept, which makes it possible to define for any stereo center within a molecule whether it is absolute or whether it belongs to a group of stereo centers with a specific relative stereo configuration.

From the File menu, select Open... and use the dialog window to select the SD-file(s) (the file extension is .sdf) to import. DataWarrior reads the entire content of the SD-File, displays rows in the Table View, creates default 2D- and 3D-Views and a Structure View, and generates a structure index (FragFp descriptor), which is needed internally for some structure-related tasks. While the indexing process is underway and its progress bar is visible in the status area, functions that depend on the index, e.g. sub-structure search, are not yet available.

DataWarrior also opens compressed SD-files without the need to unpack them, if they have been gzipped and their name ends with '.sdf.gz'.
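For readers who want to inspect such files outside DataWarrior, the following Python sketch reads plain or gzipped SD-files record by record. It relies only on two general properties of the SD format: every record ends with a line containing '$$$$', and data fields start with a header line like '> <Name>' followed by value lines and a blank line. The function names and the demo record are hypothetical, and this is not how DataWarrior itself parses the file.

```python
import gzip
import os
import tempfile

def read_sd_records(path):
    """Yield (molfile_text, fields) for every record of a '.sdf' or '.sdf.gz' file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as handle:
        lines = []
        for line in handle:
            line = line.rstrip("\r\n")
            if line.strip() == "$$$$":          # record delimiter
                yield split_record(lines)
                lines = []
            else:
                lines.append(line)

def split_record(lines):
    """Split one record into its molfile block and a dict of data fields."""
    # The molfile block ends with 'M  END'; data fields follow it.
    end = next((i for i, ln in enumerate(lines) if ln.startswith("M  END")),
               len(lines) - 1)
    molfile = "\n".join(lines[:end + 1])
    fields, name = {}, None
    for line in lines[end + 1:]:
        if line.startswith(">"):                # field header, e.g. '> <ID>'
            start, stop = line.find("<"), line.rfind(">")
            name = line[start + 1:stop] if 0 <= start < stop else None
            if name is not None:
                fields[name] = []
        elif line.strip() == "":                # blank line closes the field
            name = None
        elif name is not None:
            fields[name].append(line)
    return molfile, {key: "\n".join(value) for key, value in fields.items()}

def demo_record(identifier):
    """Build a minimal one-atom record for demonstration purposes only."""
    return "\n".join([
        "methane", "  sketch", "",
        "  1  0  0  0  0  0  0  0  0  0999 V2000",
        "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
        "M  END", "> <ID>", str(identifier), "", "$$$$", "",
    ])

# Write a small gzipped SD-file to a temporary directory and read it back.
path = os.path.join(tempfile.mkdtemp(), "demo.sdf.gz")
with gzip.open(path, "wt", encoding="utf-8") as out:
    out.write(demo_record(1) + demo_record(2))
records = list(read_sd_records(path))
```

The file is read as a stream, so it never has to be unpacked to disk.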

Text Files

TAB-delimited and comma-separated text files ('.txt' and '.csv') are among the most portable file formats because they can be created by many programs. In these text files each line represents a row and all fields within the row are separated by TABs or commas. If one or more columns of the text file contain chemical structures in SMILES format or chemical reactions encoded as reaction SMILES, then DataWarrior automatically recognizes them and creates additional columns with structures, reactions and default descriptors for every SMILES-containing column. To open a text file, select Open... from the File menu, select the file and click OK.
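The layout of such a file can be illustrated with a few lines of Python. The sketch below assumes that the separator follows from the file extension ('.csv' means comma, anything else means TAB); the helper names and the sample file name are hypothetical, and the code says nothing about how DataWarrior itself detects the format.

```python
import csv
import io

def delimiter_for(filename):
    """Pick the field separator from the file extension ('.csv' -> comma, else TAB)."""
    return "," if filename.lower().endswith(".csv") else "\t"

def parse_table(text, delimiter):
    """Split delimited text into a header list and a list of row lists."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    return rows[0], rows[1:]

# Content of a small TAB-delimited file with a SMILES column.
text = "Name\tSmiles\tMW\nethanol\tCCO\t46.07\nbenzene\tc1ccccc1\t78.11\n"
header, rows = parse_table(text, delimiter_for("compounds.txt"))
```

Each line becomes one row, and the 'Smiles' column holds the structure codes from which a structure column could be generated.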

Example Data Files

In the standard DataWarrior installation, the File menu contains two submenus with direct access to some example files. The option Open Reference File covers various files with chemical structures and related data, e.g. known drugs, pKa values, bioactive compounds, and other datasets of interest. Open Example File provides examples that illustrate non-chemistry related aspects of DataWarrior. Depending on the installation, further submenus may provide quick access to files within user defined directories.

Paste Data from the Clipboard

If you copy tabular data from any text editor or spreadsheet application, you may paste it directly into DataWarrior. This will open the data as if it were loaded from a text file. By analyzing the data, DataWarrior will try to determine whether a header row is present. If it believes that there is none, it will generate default column names.

In most cases DataWarrior will correctly predict whether the clipboard content starts with a header row. If it fails because of insufficient clues, then one may use one of the Paste Special options to indicate whether a header row is present or not.
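This manual does not describe the rule DataWarrior uses to detect a header row. The following Python sketch shows one plausible heuristic, purely as an assumption: the first row is taken as a header if none of its cells is numeric while at least one column is numeric in all remaining rows. The function names and the default names 'Column 1', 'Column 2', etc. are hypothetical.

```python
def is_number(cell):
    """Return True if the cell text can be parsed as a floating point number."""
    try:
        float(cell)
        return True
    except ValueError:
        return False

def looks_like_header(rows):
    """Guess whether rows[0] is a header row (illustrative heuristic only)."""
    if len(rows) < 2:
        return False
    first, body = rows[0], rows[1:]
    if any(is_number(cell) for cell in first):
        return False
    # A column that is numeric everywhere except in the first row hints at a header.
    for col in range(len(first)):
        if all(col < len(row) and is_number(row[col]) for row in body):
            return True
    return False

def with_column_names(rows):
    """Return (names, data), generating default names when no header is found."""
    if looks_like_header(rows):
        return rows[0], rows[1:]
    names = [f"Column {i + 1}" for i in range(len(rows[0]))]
    return names, rows

with_header = [["Name", "MW"], ["ethanol", "46.07"], ["benzene", "78.11"]]
without_header = [["ethanol", "46.07"], ["benzene", "78.11"]]
names_a, data_a = with_column_names(with_header)
names_b, data_b = with_column_names(without_header)
```

A table consisting only of text columns gives such a heuristic no clue, which is the kind of situation where an explicit hint is needed.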

In the following example some data was selected within a spreadsheet application and then copied to the clipboard with Ctrl-C.

After switching to DataWarrior and after choosing Paste (Ctrl-V) DataWarrior responds by displaying the clipboard's content in a new window. It has recognized that the column named "Smiles" contains valid SMILES codes and automatically created an additional column with chemical structures from the SMILES strings. It also created two graphical default views and, since the data now contains chemical structures, it also created a dedicated structure view.

While the Paste item of the main menu creates a new DataWarrior document, one may also paste TAB-delimited text directly into an existing DataWarrior table. This is achieved with a right mouse click on a specific cell and selecting Paste Into Table from the popup menu. This causes DataWarrior to replace cell content starting from the selected cell and extending to the right and downwards to cover a table area reflecting the row and column count of the clipboard's data. If the table to be pasted contains more columns or rows than the current table has available extending from the chosen cell, then DataWarrior will add new columns and/or rows to create the needed space. Please note that the content of hidden rows or columns stays untouched, because only the content of visible cells is replaced. Of course, after the data has been pasted, the current filter settings apply and may cause changed or new rows to be hidden immediately.
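The replace-and-extend behaviour can be sketched in Python as follows. The sketch is a simplification under stated assumptions: the table is a list of equal-length row lists, hidden rows and columns are not modelled, and the function name paste_into is hypothetical.

```python
def paste_into(table, block, row, col, empty=""):
    """Overwrite `table` with `block`, starting at cell (row, col).

    The table grows by new rows and columns when the block does not fit.
    """
    needed_rows = row + len(block)
    current_cols = len(table[0]) if table else 0
    needed_cols = max(current_cols, col + max(len(r) for r in block))
    for existing in table:                      # widen existing rows if necessary
        existing.extend([empty] * (needed_cols - len(existing)))
    while len(table) < needed_rows:             # append new rows if necessary
        table.append([empty] * needed_cols)
    for dr, block_row in enumerate(block):
        for dc, value in enumerate(block_row):
            table[row + dr][col + dc] = value
    return table

table = [["a", "b"],
         ["c", "d"]]
clipboard = "1\t2\n3\t4"                        # TAB-delimited clipboard text
block = [line.split("\t") for line in clipboard.splitlines()]
paste_into(table, block, row=1, col=1)
```

Pasting a 2x2 block at the lower right cell of a 2x2 table overwrites that cell and adds one column and one row for the remaining values.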

Importing Data From Databases

Depending on your particular installation DataWarrior is able to directly retrieve data from a variety of databases. These include:

Commercially available building block chemicals from Enamine

Compounds & Activities from the ChEMBL database

3D-Structures from the Crystallography Open Database (COD)

All chemical structures from Wikipedia

Data from relational databases (Oracle, MySQL, Microsoft SQL-Server, PostgreSQL)

Data from any Web-based data source in TAB or comma delimited format, e.g. Google spreadsheets

If your installation contains plugins from your organization or from other sources, then your database menu may contain additional items, which open custom dialogs to define and run queries on internal or external databases. Such plugins can be easily developed using the DataWarrior Plugin SDK.

At Idorsia the following additional database options exist:

Chemical and biological data from the Osiris Database

Reactions and reaction procedures from the chemical lab-notebook Mercury

Compounds and data from Actelion's Chemical Inventory

Compounds & products from the Commercial Chemicals Database

Compounds from the Screening Compounds Database

Protein Crystallization Data

Micro Array Data

Gene Expression Data

Merging Data From Files

Merging data from multiple tables into one table is a frequent activity for any data scientist or analyst. For this purpose DataWarrior provides powerful merge functionality. For merging data from one file into another file you need to open one file first. Then select Merge File... from the File menu. After selecting the file from which to merge information into the first file, a dialog opens and shows available merge options.

"Merge File" dialog configured to merge geo-locations into cost of living data.

For every column in the second file you need to decide whether you want to ignore its information, put it into a new column, put it into an existing column, or use it as a merge key that identifies matching rows between both files. In the example above the first file contains cost of living data for many cities of the world. The second file contains cities, states, and each city's geolocation as longitude and latitude. For every column of the second file there is a row in the dialog. The left combo boxes let you assign each of these columns either to a column existing in the first file or to a new column, or let you choose to ignore the column's information. If you assign it to an existing column, then the right combo box lets you define a merge option:

As merge key: If you select this option, then the row values of this column are used to uniquely identify the matching row. Therefore, this column must contain unique values in the second file. In the example above, every city may exist only once in the second file. You may define multiple columns as merge keys. In this case the values of all key columns must match for a row to be considered a match.

As merge key (ignore case): This works like the previous option, but ignores whether letters are upper or lower case.

As merge key (word search): This option allows searching longer text for words or sequences of words that serve as merge key. The picture below shows an example of this merge option. In this case drug names, which may consist of multiple words (column 'Name' in second file), are found in study titles within the first file. Longer word sequences take preference, e.g. if both keys 'ethanol' and 'ethanol acetate' are found, then the text will be matched to the 'ethanol acetate' key. This search option is not case-sensitive and cannot be combined with another merge option.

Append values: If this option is selected, then the column value of the second file is appended to the value in the matching row and assigned column of the first file.

Replace with new: This option replaces an existing value with the one from the second file if a row of the second file can be matched to the row of the first file.

Replace if empty: This option only copies the value from the second file if the first file does not have a value in the matching row.
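The interplay of merge key and value options can be sketched in Python. The sketch makes several assumptions that this manual does not confirm: tables are lists of dictionaries, only a single key column is used, appended values are joined with '; ', and rows of the second file without a match are ignored. The function name merge, the mode names and all sample values are hypothetical.

```python
def merge(first, second, key, modes, ignore_case=False):
    """Merge rows of `second` into `first` (both lists of dicts) in place.

    key   -- column name used as merge key in both tables
    modes -- maps a column of `second` to 'append', 'replace' or 'replace_if_empty'
    """
    norm = (lambda v: v.lower()) if ignore_case else (lambda v: v)
    # The key must be unique in the second table, otherwise later rows win here.
    lookup = {norm(row[key]): row for row in second}
    for row in first:
        match = lookup.get(norm(row[key]))
        if match is None:
            continue
        for column, mode in modes.items():
            new, old = match.get(column, ""), row.get(column, "")
            if mode == "append":
                row[column] = f"{old}; {new}" if old and new else old or new
            elif mode == "replace":
                row[column] = new
            elif mode == "replace_if_empty":
                row[column] = old or new
    return first

# Made-up sample data in the spirit of the cost of living example.
first = [
    {"City": "Basel", "Cost Index": "130", "Latitude": ""},
    {"City": "Lyon", "Cost Index": "80", "Latitude": "45.76", "Country": "France"},
]
second = [
    {"City": "basel", "Latitude": "47.56", "Country": "CH"},
    {"City": "lyon", "Latitude": "0", "Country": "FR"},
]
merge(first, second, "City",
      {"Latitude": "replace_if_empty", "Country": "append"}, ignore_case=True)
```

Here the empty latitude of Basel is filled in, the existing latitude of Lyon is kept, and the country value is appended where one already exists.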

Merge example with "Word Search": Chemical structures and identifiers from a drug dictionary shall be merged into clinical trial information.
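The word search option can be approximated with a regular expression per key. The Python sketch below assumes that keys are matched as whole words, case-insensitively, and that among several matching keys the one with the most words wins; the function name find_word_key and the sample titles are hypothetical, and DataWarrior's actual matching rules may differ in detail.

```python
import re

def find_word_key(text, keys):
    """Return the longest key that occurs in `text` as a whole word sequence, or None."""
    best = None
    for key in keys:
        words = key.split()
        # Words of a key may be separated by any whitespace in the text.
        pattern = r"\b" + r"\s+".join(re.escape(word) for word in words) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            if best is None or len(words) > len(best.split()):
                best = key
    return best

keys = ["ethanol", "ethanol acetate"]
match_long = find_word_key("Safety of Ethanol Acetate in healthy volunteers", keys)
match_short = find_word_key("Ethanol intake and sleep quality", keys)
match_none = find_word_key("Methanol exposure study", keys)
```

The last title yields no match because 'ethanol' only occurs inside the longer word 'Methanol'.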

DataWarrior User Manual