DataWarrior supports standard formats as TAB-delimited or comma-separated text files to import or export plain table data. Moreover, it can read and write SD-files to transfer chemical structures along with alpha-numerical information from and to other chemistry related applications.
However, an open DataWarrior window contains much more that just a plain data table that may contain chemical structures in one column. DataWarrior allows multiple columns to contain chemical structures, substructures, or reactions. Columns may have assigned properties. Invisible columns may contain chemical descriptors or 2D- or 3D-atom-coordinates. Additional views may show scatter plots, bar or pie charts, etc. Filters may exist and may be configured to hide parts of the data. Macros may be defined to automate complex tasks. And additional information, e.g. pictures, may be associated with individual table cells.
The ‘.dwar’ file format was designed to safely store the data itself along with all information needed to recreate everything within an open DataWarrior window. Its design principles were to combine compactness, extensibility, backwards compatibility, and modularity within an interpretable text file. Therefore, it contains the data as TAB-delimited table plus multiple XML-like sections, which contain information about the data itself or about how the data shall be presented. These sections contain general file properties, column related information, view and filter settings, macros, and row lists. None of these sections are mandatory, neither must a '.dwar' file contain data rows, but as a very minimum it must contain one TAB delimited line with column names. Except for the data table, all sections typically consist of some sort of key-value pairs. Within a ‘.dwar’ file all sections, if they exist, appear in the following order:
The file header section starts and ends with datawarrior-fileinfo tags. In between are key-value pairs defining file version, creation time in milliseconds elapsed since 1-Jan-1970 UTC, and the number of table rows. None of these tags are mandatory.
This section, if present, is supposed to contain a description in HTML format of the '.dwar' file's data, the source of the data, and interesting aspects of the data. It may describe what can be seen in configured graphical view, or how to interact with filters to gain insights into subsets of the data. It also may describe the purpose of embedded macros or how to connect to referenced external data. Within a DataWarrior window the content of the explanation section is displayed inside an 'Explanation View'.
<p>This file contains data about...</p>
<p>The original source of the data is ...</p>
<p>Views visualize... Use filters to drill down...</p>
Macros are used to automate task sequences in DataWarrior. They can be either globally defined or exist just within the scope of a document window. Such macros with local scope are saved within the macro section of a '.dwar' file. Since there may be multiple macros within one file, the section starts and ends with datawarrior macroList tags. Inside are one or more macro definitions, which themselfes are surrounded by macro tags. These contain one or more task definitions indicated by task tags, which finally contain the task's configuration as zero or more key-value pairs.
The next section surrounded by column properties tags provides information regarding the data itself. Individual column properties always refer to a single table column, but not every column must have associated column properties. Table columns, which are not described by this section, are treated as containing alpha-numerical data. For them DataWarrior will analyse, whether the content is numerical, whether and which categories the column contains, and more. Where column properties exist, they may declare a column to contain chemical structures, reactions, descriptors, identifiers. Column properties may define relationships between columns, e.g. may assign logically related columns to one column group. Or a column that contains 3D-atom coordinates may be assigned to a parent column, which contains the corresponding chemical structures. In such a case the data from both columns together are used to build and display row specific conformers. If the conformers are the result of a docking procedure, then another column property may define the structure and orientation of the protein pocket used during the docking. Then DataWarrior would display ligand poses of every row always together with the same shared protein structure.
Column properties may also be used to define the data range for numerical columns. For columns that contain identifiers which can be used to retrieve further information, a lookup URL may be defined. For columns that contain results of evaluated expressions, column properties may define the formula that is needed to re-evaluate the column content if rows were changed or added. The complete list of available column properties can be found in the Java source code of the CompoundTableConstants interface (look for all Strings named cColumnProperty...).
A simple and typical column property section for a file that contains one column with chemical structures, a column with associated 2D-atom coordinates, a 'FragFp' descriptor column, and some more alpha-numerical columns is shown in the example below. In a column property sections a column name is always followed by all properties assigned to that column. Thus, the following example defines the column named 'Structure' to have a special type 'idcode', which means that this column contains ID-Codes, which are DataWarrior's way to encode chemical structures as canonical, very compact, text strings that cover aromaticity and enhanced stereochemistry. If an ID-Code describes a substructure, then it also may cover atom and bond query features and exclude groups. Since ID-Codes are canonical, they cannot contain atom coordinates. Therefore, if specific atom coordinates shall be stored in a '.dwar' file, then they must be stored in an additional column. The example declares a column named 'coords' to contain a special data type 'idcoordinates2D'. It also defines for that column that its parent column is the column named 'Structure'. For a third column named 'FragFp' we read that it contains special information of type 'FragFp', its parent column is the 'Structure' column and the version of its 'FragFp' entries is '1.2.1'.
and two associated columns for 2D-atom coordinates and FragFp descriptor
The Data Table
The column property section is followed by the data table itself, which usually is the largest section in the file unless the data consists of a small number of rows only. The table starts with one line defining table column titles and continues with zero or more lines defining one data row each. Column names and column content within every row are separated by TAB symbols (ASCII-code:9), which are depicted in the examples as TAB.
Column titles serve as case-sensitive keys to address their respective columns. They cannot be changed by users and are used in the column properties section as well as the template (datawarrior properties) section, whenever something refers to or is assigned to a column. Users may assign a column alias to a column, which would then be displayed instead of the column title, while the original column title still serves as the column identifier under the hood.
If defined in the column properies section, then some columns may contain non-alphanumerical information as for instance molecules, chemical reactions, chemical descriptors, or 2D- or 3D-coordinates. Within a DataWarrior file these special data types are always encoded as text strings. Molecular structure, substructure, and reaction encoding uses a unique and very efficient algorithm that produces a canonical representation of the original chemical object, which means that the same structure, when encoded, will always be represented by the same text string, independent of the original atom order, atom coordinates, or how stereo centers were defined. If two structures represent the same stereo isomer or the same epimeric mixture, then the created string is the same. Likewise, if two chemical structure representing strings (called idcodes) are equal, then they represent the same chemical entity. Since canonical representations are independent of atom coordinates, these are not part of the idcode itself. Chemists, however, expect drawn structures to be shown with their original atom coordinates. Therefore, the encoding algorithm may produce a second string that contains original atom coordinates in a compact way. Typically, but not necessarily, structure columns (specialType=idcode) are associated with a second column that contains 2-dimensional atom coordinates. In addition, there may be one or more columns containing 3-dimensional atom coordinates, which together with the associated idcode would represent conformers.
The algorithm to canonicalize and string-encode chemical structures involves ring and aromaticity detection, recursive enhanced stereo perception, meso fragment perception, and more. Conceptually, it is similar to the SEMA (Stereo-enhanced Morgan Algorithm) algorithm and format used my MACCS, REACCS, and Isis-Base originally developed by MDL Information Systems. A detailed description of the algorithm would be beyond the scope of this document, but the interested reader may study the Java source code of the Canonizer class, which contains the main part of the algorithm.
The basic idea of the algorithm is to first rank all atoms using atom properties like the atomic number, charge, pi-electron count, ring membership, and more. Atoms sharing the same properties have the same rank. In the next step a sorted list of all neighbour ranks is built for every atom. Atoms with the same previous rank, but with a different neighbour rank list are now assigned different ranks, while the ranking order of the previous step is retained. An updated list of neighbour ranks is built and equal ranks are again split where neighbour ranks differ. This is repeated until no equally ranked atoms can be distinguished anymore. Then, stereo centers and stereo bonds are perceived and stereo configurations are used to distinguish otherwise equally ranked atoms. Since stereo feature assignment may depend on other stereo features, the process is done repeatedly until no new stereo features can be detected. In a next step called tie-breaking remaining equally ranked atoms are assigned different ranks in a reproducible way, such that we finally end up with a reproducible sorting of all atoms. Now the molecular canonical graph is written into a string using as little bits as possible.
Individual table cells may reference external binary or text objects like images or HTML-formatted text information, which in the DataWarrior application are displayed inside the detail area if the corresponding row is highlighted. A table cell may point to zero or one object of a defined kind, while an object may be accessed by many table cells at the same time. These objects may either be stored as part of the 'dwar' file, or they may reside in external files. If objects are part of a '.dwar' file, then they are located in the 'detail' section that looks like this:
Three sections must play together for this mechanism to work:
- In the column properties section a name, the mime-type, and the source location need to be defined for every detail object type that shall be referenced by cells of a particular column. If details are stored within the '.dwar' file itself, then the detail source is defined as 'embedded'.
- Detail references must be concatenated to some table cells in the defined column within the table data section. The reference consists of a '|#|' indicator, followed by a digit indicating the detail type index, followed by a colon ':', followed finally by the detail-ID, which is the unique detail name in the detail section. If a cell references multiple detail types, then these are concatenated one after another.
- Detail-IDs referencing '.dwar' file embedded detail objects must be represented in the detail area. Binary objects use a simple encoding with 6 bits per character added to ASCII value 64. Text detail objects are zipped before the encoding, which is indicated by negative detail-IDs.
Row Selection and Row Lists
In DataWarrior one may define row lists, which are nothing more than a named subset of all rows. Typically, row lists are created after applying one or more filters that hide parts of the data and by assigning a name to the remaining visible set of rows. Alternatively, one may interactively select rows, e.g. markers in a scatter plot or chemical structures in a structure view, and then assign a name to the selection. When a '.dwar' file with defined row lists is saved, then a hitlist data section is created that contains two entries for every list, first the list name, then the list data. If rows are selected, then the row selection is also saved as a pseudo list using the predefined name "<selection>". The list data itself contains one character for every six rows, of which the ASCII code is 64 plus a binary 6-digit number that encodes the membership of those rows.
The Template Section
Apart from the data table itself, the template section is usually the largest section of a '.dwar' file. It defines how the data is presented within a DataWarrior window. It defines which views are visible, their sizes and relative orientations; it describes the nature and configuration of every view like which chart type is used, which categories are presented in which color, which columns are assigned to which axes, where labels are shown, zoom states, etc... It also defines which filters exist, on which columns they operate, and how they are configured. Therefore, indirectly it also defines, which part of the data is currently visible.
Since the template section is not part of the data itself and just describes how the data is presented, this section can be useful outside the context of a particular '.dwar' file. In the case where multiple datasets share a similar structure, i.e. use the same column titles, one may easily apply a template to all datasets after it was designed for the first one. Typically, this is done, if there are frequent updates of a dataset and one wants to re-use a template created for an earlier version of the data. In DataWarrior this is done with 'Apply Template' from the 'File' menu. This task opens a '.dwar' or '.dwat' file and applies the contained template to the current data window. '.dwat' files are meant for this purpose, because they contain a template section only.
The following simple example shows a template for a small '.dwar' file containing just a structure column and two numerical columns ('logP' and 'logS'). In addition to the mandatory table view, it defines two more views, an explanation view and a 2-dimensional graphical view with a scatter plot. Three filters are defined, a structure filter in substructure mode, and two numerical slider filters. Explaining each individual setting would exceed the scope of this document. Again, the interested reader is pointed to the Java source code of the DERuntimeProperties class, which works in both directions: it either applies a template to an open DataWarrior window and creates views and filters accordingly; or it constructs a new template from all components of an open DataWarrior window.
<filterAnimation1="state=stopped low2=80% high1=20% time=10">
<filterAnimation2="state=stopped low2=80% high1=20% time=10">