DataWarrior User Manual

DataWarrior Main Views

The Main Views of DataWarrior are the windows to your data. They display all rows, or the visible subset of all rows, in a view-specific way. When you open data from a text file, paste data from the clipboard, or retrieve data from a database, DataWarrior automatically creates a few default views. You may, however, close existing views or create new views at any time from the popup menu that appears after a right mouse button click on any main view's header area.

While default views are created side by side, views may also be stacked on top of each other. Main Views can be arranged freely by dragging one view's header area to another view and dropping it either on the target view's center or close to one of its four edges. A semitransparent ghost view indicates what will happen when the view is dropped at the current position. When dropped at the center, the dragged view is stacked on top of the target view. Otherwise, it splits the space of the target view horizontally or vertically and occupies half of it. Views can be resized by dragging the divider area between the views. At any time, any visible view can be maximized to cover the entire main view area by clicking the maximize button on the view's header area.

The Table View

The Table View is the only mandatory view, which means that it is automatically created when you open data from any source and that it cannot be closed. It offers a spreadsheet-like view of all visible rows, i.e. those data rows that are not hidden by any filter. It also shows visible columns only. Columns may be inherently invisible or they may be temporarily defined to be invisible.

Any column that is shown in the Table View displays data of an associated column of the underlying data table. Typically, all cells of a column contain data of the same kind, e.g. numerical data, dates, categories, chemical structures, identifiers, or text. Some columns of the underlying data table may not be displayable, e.g. chemical descriptors or 3D-atom coordinates. These columns are connected to a displayable parent column and hold supporting information for various purposes. Descriptors are used for the calculation of chemical similarities or to accelerate sub-structure filtering. Atom coordinates or atom color information are used by the molecule depiction of 2D-structures or conformers.

Cell coloring: Typically, text and chemical structures are drawn in black or white, depending on the chosen look and feel. However, for any column the foreground drawing color as well as the cell's background colors can be chosen to reflect the data values of this or other columns' cell values. Colors may also represent the similarity of this row's molecule compared to the molecule in the Reference Row. In this case colors change dynamically, whenever the Reference Row is changed. Column color assignments can be done with Set Text/Structure/Background Color... from the popup menu that opens upon a right click on the respective column title.

Table View with buttons; 1: Scroll to Reference Row, 2: Expand Column Titles, 3: Column Filter

Column content is automatically analysed to determine whether it is numerical data, date values, categories, or just unstructured text. For numerical and date columns the covered data range is determined, and for category data a list of found categories is maintained. What DataWarrior can do with a column depends on the kind of data it contains. When you hover with the mouse over a column title, a tooltip with the column's characteristics is shown.

Tooltip over column title showing column characteristics

Table cells may contain multiple values of the same kind, e.g. multiple numerical values. If a cell contains multiple values, then these are either separated by "; " (semicolon and space) or they are on separate lines within the same cell. If a column contains category values and one of its cells contains multiple distinct values, then this row belongs to multiple categories with regard to this column.
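The separator convention for multi-value cells can be illustrated with a minimal Python sketch. The function names below are hypothetical and are not part of DataWarrior; the sketch only assumes the two separators named in the text.

```python
def split_cell_values(cell: str) -> list[str]:
    """Split a raw cell string into its individual values.

    Assumption: values are separated by "; " or by line breaks.
    This is an illustration, not DataWarrior code.
    """
    values = []
    for line in cell.splitlines():
        for part in line.split("; "):
            part = part.strip()
            if part:
                values.append(part)
    return values


def categories_of_row(cell: str) -> set[str]:
    """A row belongs to every distinct category found in its cell."""
    return set(split_cell_values(cell))
```

A cell holding "red; blue" would place its row in both the "red" and the "blue" category.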

Note: There are two popup menus to access table view related functionality. The popup menu that opens on the table view's header (i.e. title bar) gives you access to general view and table view related functionality such as font size and column order. The second popup menu, available from any column title, mainly relates to functionality concerning that respective column. However, there is a gray zone.

Popup menus available from table view header and column title

Display mode: Numerical columns may be defined to show rounded values by selecting the Show Rounded Values... option from the column header's popup menu. For columns containing multiple values in at least one cell, the Show Multiple Values As... option from the same popup menu lets you select a summary mode. Multiple values will then be summarized into one value that is shown in the cell. Typically, average or median values are shown, but available options also include the largest, the smallest, or the sum of all cell values. If a summary mode is defined for a column, then it is not only shown in the Table View, but it is used whenever DataWarrior accesses numerical values of this column, e.g. in filters, 2D- or 3D-views, in calculations, and even when cells are copied to the clipboard. If, for instance, a filter is set to hide objects smaller than 100 cm, and an object's size column contains "80; 110", and this column's summary mode is set to Show Highest Value, then this object is not hidden by the filter. If All Values are shown, then DataWarrior automatically uses a mean value in graphical views and filters. If the column is set to logarithmic mode, then geometric mean values are used.
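The interplay of summary mode and filtering can be sketched in Python as follows. This is only an illustration of the rules described above; the mode keys and function names are hypothetical and do not reflect DataWarrior's internal code or its exact menu labels.

```python
import math
from statistics import mean, median

# Hypothetical mode keys; the real menu labels may differ.
SUMMARY_MODES = {
    "mean": mean,
    "median": median,
    "highest": max,
    "lowest": min,
    "sum": sum,
}


def summarize(cell: str, mode: str = "mean", logarithmic: bool = False) -> float:
    """Reduce a multi-value cell such as "80; 110" to one number."""
    parts = cell.replace("\n", "; ").split("; ")
    values = [float(p) for p in parts if p.strip()]
    if mode == "mean" and logarithmic:
        # Geometric mean: average in log space, then transform back.
        return math.exp(mean([math.log(v) for v in values]))
    return SUMMARY_MODES[mode](values)


def is_hidden(cell: str, minimum: float, mode: str) -> bool:
    """A filter that hides rows whose summarized value is below `minimum`."""
    return summarize(cell, mode) < minimum
```

With the cell "80; 110" and a lower filter limit of 100, `is_hidden` returns False in "highest" mode, but True in "mean" mode, because the mean of both values is 95.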

Selecting cells in the Table View works in a 2-dimensional way, which means that by selecting cells you modify the selection of both columns and rows. The row selection affects all views, because it is shared among all views. The column selection, however, affects the Table View only. Nevertheless, there is an issue that needs to be understood: if no columns are selected and you select some rows outside of the Table View, then you don't see any selection in the Table View, because no columns are selected. Thus, if you intend to select rows in another view in order to later copy cells from the Table View, then select the needed columns first. Continuous selection can be achieved by dragging the mouse from one cell to another or by selecting one cell first and clicking another cell while pressing the Shift key. Doing the same thing with the Ctrl key (Cmd key on the Macintosh) lets you select disconnected cells, columns and rows.

The Reference Row is indicated by a red border around it. In the Table View you may change the Reference Row by clicking on the row number on the left hand side. If the Reference Row is not visible because it is scrolled out of the visible range, then the table can be scrolled to reveal the current Reference Row by clicking the top left arrow-button.

Sorting: The rows of a Table View may be sorted by clicking on the respective column header. Depending on the column's data content, the sorting will be numerical, alphabetical, or by the size of chemical structures. Where cells have the same content, the previous sort order is retained. The selection of rows and columns is unaffected by sorting. Other views that display rows in order (Structure View and Form View) share the Table View's sort order. You may use this behaviour to sort the Table View, for instance, by some ascending biological activity and then switch to the Structure View, where the most active structures now start at the top. If some rows are selected and the Shift key is pressed during sorting, then all selected rows are moved to the top of the table; selected and unselected rows are still sorted independently.
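The sorting rules (ties keep their previous order, and selected rows may be moved to the top) correspond to what is generally called a stable sort. The following Python sketch illustrates the idea; `sort_rows` is a hypothetical helper, not a DataWarrior function.

```python
def sort_rows(rows, key, selected=frozenset(), selected_first=False):
    """Return row indices in display order.

    rows: list of rows in their current order.
    key: function that extracts the sort value from a row.
    sorted() is stable, so rows with equal keys keep their previous order.
    """
    order = sorted(range(len(rows)), key=lambda i: key(rows[i]))
    if selected_first:
        # Second stable pass: selected rows move to the top (False sorts
        # before True), while the order inside each block is preserved.
        order = sorted(order, key=lambda i: i not in selected)
    return order
```

Because the second pass is stable as well, the selected and the unselected block each remain sorted by the chosen column.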

Column order: The order of visible columns in a table can easily be changed by dragging one column title horizontally to a new position. For files with hundreds of columns, however, a complete re-organization of the column order this way would be a very tiresome task. This can be done more efficiently using a dedicated dialog that opens when selecting Change Column Order... from the view header popup menu. The dialog lets you drag single or multiple column titles at once to a new position.

Column grouping: Another set of features that is particularly useful for files with many columns is the ability to associate a subset of all columns with a group. A column may be a member of multiple groups. If groups are defined, every group has an associated color, which is used to decorate its members' column titles. A column that is a member of multiple groups has a gray decoration.

Column visibility: Another set of features that is particularly useful for files with many columns is column hiding. The column title popup menu lets you reversibly hide individual or selected columns from the view. When column groups are defined, you can show only those columns that belong to a particular group.

Column filtering: Clicking the spy glass button at the top left of the table view opens a text field that can be used to define filter criteria, which immediately hide non-matching columns. When the field contains a text string, only those columns are shown whose title contains the query string. Multiple query strings can be given by separating them with commas (','). If the text string starts with 'regex:', then everything that follows is interpreted as a regular expression for full search flexibility.
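The column filter logic can be approximated by the Python sketch below. Several details are assumptions that the text does not specify: that matching is case-insensitive, that a column is shown if it matches any of the comma-separated strings, and that a regular expression may match any part of the title. The function name `visible_columns` is hypothetical.

```python
import re


def visible_columns(titles, query):
    """Return the column titles that match the filter text."""
    query = query.strip()
    if not query:
        return list(titles)  # an empty filter hides nothing
    if query.startswith("regex:"):
        # Assumption: the expression may match anywhere in the title.
        pattern = re.compile(query[len("regex:"):])
        return [t for t in titles if pattern.search(t)]
    # Assumption: case-insensitive, and ANY query string may match.
    needles = [n.strip().lower() for n in query.split(",") if n.strip()]
    return [t for t in titles if any(n in t.lower() for n in needles)]
```

For example, the query "log, ic50" would keep the columns "cLogP" and "IC50 [nM]" under these assumptions.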

Editing: The Table View also permits editing the data. After double clicking into a cell a data type specific editor lets you edit the cell's content. After choosing OK the entire column is re-evaluated, because its perceived data type may have been changed after the edit. Filters and views may be updated to reflect the change and to avoid conflicts that might arise from a changed data type.

Detail data: In DataWarrior a table cell may have associated detail information like pictures or formatted text. Such detail information is not shown in the Table View. Detail information is shown in the Detail View in the lower right part of every DataWarrior window. And it may be shown in form views. In the Table View cells with detail content display a number on a small colored rectangle in the right upper corner indicating how many detail objects are available. The detail content itself may be either part of the DataWarrior file or the file may only contain references to the details along with the information of how to access it. Green or brown backgrounds indicate file internal detail or referenced details, respectively.

Table with green detail markers indicating file-embedded detail text and pictures.

Links: Columns may also contain clickable web links. These may be valid complete URLs, or they may just contain keys or names of some kind and the column meta data tells DataWarrior how to construct a full URL from it. Examples are the tables generated by a Wikipedia retrieval or a Google Patents search.

Graphical 2D- and 3D-Views

The graphical 2D- and 3D-Views incorporate plenty of functionality. Most important are the two or three axes that define the numerical/category space in which the data is shown. The xyz button on the view's header area opens a popup menu for defining which axis is used in which way. Here you may assign a data column to any of the axes by selecting it within the respective axis popup menu. Columns that don't contain numerical values, date values, category information or chemical descriptors cannot be assigned to axes and, therefore, don't appear in the column menus. Columns with incomplete data, i.e. with some empty cells, appear in red, indicating that some rows of the data may not be shown if this column is selected. Non-numerical category columns appear in blue, while complete numerical or date columns are shown in black.
You may use the double-ended range sliders to zoom into the numerical or category space of any axis. This is done by dragging the slider's left or right thumb area towards the middle. Pressing the Ctrl key while dragging the thumb increases the resolution by a factor of 20. You may also double click the thumb area to type in an exact value. Finally, using the scroll wheel of a mouse or a scroll gesture on a touch pad will also zoom into or out of the view. After zooming into a 2D-view, you may move the visible area by dragging the middle of the range slider. Alternatively, you may drag the view itself with the mouse while pressing the Ctrl key.

Graphical views can be linked to another view, such that their axis-column assignments and their zoom and rotation states are kept synchronized to those of the linked view. When you change the axis assignment of a synchronized view, or when you zoom into it or rotate it, then all synchronized views immediately behave the same. This way you may use some 3D-views side by side, whose markers encode different properties, but which show the same objects and rotate and zoom synchronously. A view can be synchronized to another view by selecting Synchronize View To from the popup menu in the view's header area. The controlling view, which propagates its settings to one or more linked views, may have more, but not fewer, dimensions than any client view. This means that a 2D-view may be attached to and controlled by a 3D-view, but not the other way round.

Synchronized views show different properties with same perspective

It is a DataWarrior principle that all visible views usually display the same rows of data. 2D- and 3D-views, however, themselves have an influence on the overall visibility of rows. For instance, markers that are zoomed out of the view are not visible anymore. Rows are also hidden if a column with incomplete data is assigned to an axis and the view is configured to not show empty values. In such a case these rows are usually hidden in other views as well. If the restricting view itself becomes invisible, e.g. because it is sent to the background and another view is displayed on top of it, then the formerly hidden rows reappear in all remaining visible views. There are situations, however, where zooming into the data should not affect the row visibility of other views. Therefore, graphical views can be configured to not Hide invisible rows in other views; this setting is found in the view's Set General View Options... dialog.

All view specific settings are accessible via a popup menu, which opens with a right mouse click anywhere in the view (2D-view) or over any marker (3D-view).

Since the graphical views support multiple fundamentally different chart/view types, often the first thing to do after creating a new view is defining, which view type to show. This is done by choosing Set Preferred Chart Type... and configuring the options in the upcoming dialog. If the columns, which are assigned to the axes, contain data that is incompatible with the preferred chart type, then DataWarrior shows a scatterplot instead, because scatterplots are compatible with any data type. Bar and pie charts require one or more category columns while box and whisker plots need one numerical and one category column. Being presented with an unintended scatterplot, one may still change the axis assignment to make the preferred chart possible.

Markers and their corresponding data rows can be selected by dragging the mouse pointer around markers. If you press the Alt key before starting the selection, you automatically select a rectangular area. Pressing the Shift key causes any new selection to be added to a previous selection. The effects of the Alt and Shift keys can be combined.

The 3D-View can be rotated by pressing and immediately dragging the mouse with the right mouse button. If the Shift key is kept pressed while the view is rotated, then the rotation is restricted to a pure horizontal or vertical rotation. Note that in 3D-views the right mouse button is used to rotate views as well as to show a popup menu to configure view details. If you press and drag, then the view rotates. If you press and release or wait for 800 ms then the popup appears.

To visualize more dimensions than already given by the 2D- or 3D-coordinate systems, marker colors, sizes or shapes can be associated with data columns to reflect numerical or categorical cell values. Naturally, marker shapes cannot represent numerical data, and they can only represent category information if the number of categories does not exceed the number of available shapes.

Moreover, the 2D-view lets you tinge the background according to nearby markers' column values. In this case every marker emits a corona of color that fades with increasing distance. All overlapping colors mix in hue and saturation to create a smooth colorful property landscape, where areas with similar colors visualize areas of similar properties. Consequently, background coloring is particularly useful if the property associated with it correlates to some extent with the column values that are assigned to the x- and y-axis. In other words, background colors work best if close markers have similar properties, e.g. in a chemical space visualization. The background color dialog is accessible through the Set Marker Background Color... item of the view popup menu.

Both the radius of the affected area around markers and the color fading with distance can be defined in the dialog. One needs to play around a little to find optimal settings. By default all visible rows contribute to background coloring. If row lists are defined, one may alternatively select a row list. The color landscape then visualizes the properties of the list members only, even if the markers of these list members are not visible in the view. This way a view's background may visualize some property space, while an independent set of visible rows is shown on top of it.

Set Marker Labels... offers decorating markers with resizeable labels at up to eight positions around the marker. A label may even be shown instead of the marker. Labels show the content of an associated column, which may be alphanumerical or a chemical structure. Labels may be shown for all markers or only for a subset of markers. In the latter case one needs to define a row list containing the rows to be decorated with labels. This list then needs to be selected from the Show labels on choices. An alternative way of restricting labels to a subset of all rows would be setting the view's focus on a data subset (see below).

The view above shows airplane names for some of the markers. Notice, that the labels were manually moved to less populated locations to avoid label collisions with data and other labels. In DataWarrior labels may simply be dragged from their default locations to other locations using the mouse.

The Set Focus To Row List... option puts a visual focus on a certain group of markers. The markers in focus are drawn on top of the remaining markers, while the remaining ones are drawn in a dimmed way. If marker labels are set to be shown, then only those row markers are decorated with labels, which are in the focus. This lets you show a certain group of rows in the context of a bigger group. The rows in the focus are either those rows that belong to a row list or all selected rows are considered to be in the focus. In the latter case the focus is changing, whenever the selection changes.

The above chart shows GDP and life expectancy of the states of the world in 1950. The focus is set to highlight the current G8 states.

Set Marker Jittering... adds an adjustable random displacement to the position of every marker. This can be useful if you have groups of markers that overlap at exactly the same positions and therefore appear as single markers. Typically, this happens frequently with category data. Adding some jittering lets you visually perceive the number of markers within every category. Displaying the data with a bar or pie chart may often be an alternative.
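The principle of jittering can be illustrated with a few lines of Python. The sketch assumes a uniformly distributed displacement; which distribution DataWarrior actually uses is not stated here, and the function name `jitter` is hypothetical.

```python
import random


def jitter(points, amount, seed=None):
    """Displace every (x, y) point by a random offset within +/- amount.

    Assumption: offsets are drawn from a uniform distribution.
    """
    rng = random.Random(seed)
    return [
        (x + rng.uniform(-amount, amount), y + rng.uniform(-amount, amount))
        for x, y in points
    ]
```

Five markers that all sit at the position (1.0, 2.0) would, after `jitter(points, 0.2)`, be scattered within a square of side length 0.4 around that position and become individually visible.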

In the example above all compounds of a drug discovery project were clustered by a self-organizing map, such that every molecule was assigned to one of 18 by 18 nodes. The above view shows the nodes in a 2-dimensional grid. Some jittering was added to scatter all members of every cluster around the respective grid position. The view background and the markers are colored according to the biological activity and compound synthesis date, respectively. Note that background coloring works well in this example, because the compounds on every node are very similar and, therefore, usually have similar activities, causing a consistent background color around the node.

The menu item Split View By Categories... lets you split one 2D view into multiple equally scaled 2D-views of which each contains only those rows that belong to a certain category. Therefore, defining view splitting means selecting one or even two category columns as basis for the splitting. If you select one column, then the individual views are automatically arranged in a grid such that the grid cell widths and heights are similar. If two category columns are defined then the grid will have an n*m topology with n and m reflecting the number of categories in each column. For an example see the image below.
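How such a grid might be chosen can be sketched in Python. This is only one plausible reading of "similar cell widths and heights", not DataWarrior's actual algorithm. Both function names are hypothetical, and which of two category columns maps to grid columns or grid rows is an assumption.

```python
import math


def grid_shape(n_views, width, height):
    """Pick (columns, rows) so that the grid cells are as square as possible."""
    best = None
    for cols in range(1, n_views + 1):
        rows = math.ceil(n_views / cols)
        cell_w, cell_h = width / cols, height / rows
        # A score of 0 means perfectly square cells.
        score = abs(math.log(cell_w / cell_h))
        if best is None or score < best[0]:
            best = (score, cols, rows)
    return best[1], best[2]


def grid_shape_two_columns(categories_a, categories_b):
    """Two category columns give an n*m grid, one cell per category pair."""
    return len(set(categories_a)), len(set(categories_b))
```

For six categories in a view of 900 by 600 pixels this sketch selects three columns and two rows, which gives square cells of 300 by 300 pixels.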

Connection Lines

Whenever the change of a numerical parameter over time shall be visualized, or when some logical connection between objects needs to be shown, or whenever a tree graph needs to be displayed, then drawing lines between markers is a crucial functionality. In DataWarrior connection lines can be defined in the dialog accessible through Set Connection Lines....

Group & connect by: This setting defines whether to show connection lines and whether in this case a category column is used to group markers. If a category column is chosen, then all markers belonging to one category are independently connected by lines. Select the option <Don't group, connect all> if you want all markers to be connected with one line.
Some datasets, e.g. after an Activity Cliff Analysis, contain columns which define links between rows. Such row connecting columns also show up in this menu and cause, if selected, those links to be shown as connecting lines.

Connection order by: This option defines in which order individual markers of the selected groups are connected. Typically, markers are simply connected from left to right, i.e. along the x-axis. Alternatively, any numerical column can be selected here to define the order.

Relative line width: This defines the width of the connection lines.

Invert arrow direction: If markers are connected by a column which references key values of another column in a directional way, e.g. after creating an Evolutionary Library, then connection lines are automatically drawn with arrows indicating the direction to the parent marker. This option lets you invert the arrow direction.

In the example above the Country column was selected for grouping and connecting markers. <X-axis> was selected as the connection order. The marker size was reduced to zero to effectively hide any markers.
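The combination of grouping and connection order can be illustrated with a short Python sketch. The row layout (dictionaries with "x" and "y" keys) and the function name `connection_lines` are hypothetical and serve the illustration only.

```python
from collections import defaultdict


def connection_lines(rows, group_by=None, order_by="x"):
    """Return one polyline (list of (x, y) points) per group.

    rows: list of dicts with at least "x" and "y" keys (hypothetical layout).
    group_by=None corresponds to <Don't group, connect all>.
    order_by: name of the numerical column that defines the connection order.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_by] if group_by else None].append(row)
    return {
        name: [(r["x"], r["y"]) for r in sorted(members, key=lambda r: r[order_by])]
        for name, members in groups.items()
    }
```

Called with `group_by="Country"`, the sketch produces one independent line per country, with each line running from left to right along the x-axis.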

Detail tree mode: This mode can only be selected, if in Group & connect by: a column is selected that contains references to other rows. If a detail tree mode is chosen, and if a Reference Row is selected, then the view does not show markers being positioned by values of columns assigned to axes. Instead, it takes the Reference Row as root node of a detail graph. Then it adds all directly connected rows as second layer of the graph. All not yet used rows connected to any second layer row are added as third layer and so forth until the defined number of graph levels is reached. The layout of the detail graph is either radial or a horizontal or vertical tree, depending on this menu's setting (see following image).

Show all markers if no tree root is chosen: If a detail tree mode is selected, one may toggle between seeing all markers and a specific detail tree, either by clicking on a marker to make it the new root or by clicking into the empty space to see all markers again. If a view is meant to show a detail graph only and should never show markers positioned by the columns assigned to the axes, then one may deselect this checkbox.

If nodes get invisible, re-arrange and remove sub-branches: Like all other views, a detail tree view shows markers only if they are not hidden by filters or other views. If this option is not selected, then the graph is constructed once after the root is chosen, using all its visible and non-visible neighbours. When rows are hidden, they just disappear with their connection lines from the view. If, however, this option is selected, then the layout of the graph is freshly created whenever rows are hidden or un-hidden, taking only those neighbours into consideration that are currently visible. This way marker positions change with visibility, but the available space is evenly used.

Invert tree line directions: If markers are connected by a column which references key values of another column in a directional way, e.g. after creating an Evolutionary Library, then a detail graph layer is compiled by taking all rows referenced by the previous layer's rows. If this option is chosen, then the next layer is always built from rows referencing the current layer's rows. This effectively inverts the direction of reference.

Detail tree layers: This defines the maximum number of displayed neighbor levels around or beneath the root row.
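The layer-wise construction of a detail tree resembles a breadth-first traversal. The following Python sketch shows the idea under simplifying assumptions: references are given as a plain dictionary, row visibility is ignored, and the function name `detail_tree_layers` is hypothetical.

```python
def detail_tree_layers(root, references, max_levels, invert=False):
    """Return a list of layers (lists of row ids), starting with [root].

    references: dict mapping a row id to the row ids it references.
    max_levels: maximum number of neighbour levels around the root.
    invert=True builds each layer from rows that reference the current layer.
    """
    if invert:
        reverse = {}
        for row, targets in references.items():
            for target in targets:
                reverse.setdefault(target, []).append(row)
        references = reverse
    layers, used, current = [[root]], {root}, [root]
    while len(layers) - 1 < max_levels:
        next_layer = []
        for row in current:
            for neighbour in references.get(row, []):
                if neighbour not in used:  # add only rows not yet used
                    used.add(neighbour)
                    next_layer.append(neighbour)
        if not next_layer:
            break
        layers.append(next_layer)
        current = next_layer
    return layers
```

If child rows reference their parents, then starting at a child walks up towards its ancestors, while `invert=True` starting at an ancestor walks down to its descendants.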

Bar Charts And Pie Charts

If all axes of a graphical view are either unassigned or connected to a column with category data, then the view can be configured to show a bar chart or pie chart. This is done by selecting Set Preferred Chart Type... from the view's popup menu and selecting Bar Chart in the upcoming dialog.

By default bar charts are histograms, which means that any bar's size represents the number of rows in the respective category. Alternatively, the bar's size may reflect numerical values of a specific column. Since one bar often represents multiple rows, the values of these rows need to be summarized so that one bar's size can reflect them. Therefore, one needs to select one of the summary modes in the Bar/Pie size by: menu. Current options are mean, minimum, maximum or the sum of the individual values.
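The difference between a histogram and a value-based bar size can be illustrated with a small Python sketch. The row layout and the name `bar_sizes` are hypothetical; the mode keys merely mirror the options listed above.

```python
from collections import defaultdict
from statistics import mean

SUMMARIES = {"mean": mean, "minimum": min, "maximum": max, "sum": sum}


def bar_sizes(rows, category, value=None, mode="mean"):
    """Return {category: bar size}.

    value=None gives a histogram (number of rows per category).
    Otherwise the values of all rows in a category are summarized by `mode`.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[row[category]].append(row[value] if value else 1)
    if value is None:
        return {name: len(members) for name, members in groups.items()}
    return {name: SUMMARIES[mode](members) for name, members in groups.items()}
```

If every category contains exactly one row, then the mean, minimum, maximum and sum all equal that row's value.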

In the example above bar sizes directly represent values of the Amount column. The chart is defined to show Mean Values, but since in this case every bar contains one data row only, the mean values are equal to the exact row values.
This example uses Case Separation to split bars by the 'Kind' column to show 'Earnings' and 'Spendings' for every year side by side rather than the mean value of both values in one bar, which would not make much sense.

If marker colors are associated to a data column with Set Marker Color..., then bars are drawn in multiple colors such that the bar visualizes the property distribution within the category that is represented by the bar. Marker background coloring, marker shapes and marker sizes cannot be used with bars, because there are no distinct markers anymore.

If a column contains numerical values, e.g. molecular weights of some molecules, and if a histogram of the distribution of this value shall be shown, then a category column must be created artificially with a process called Binning.
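The idea behind binning can be sketched in Python as follows. The label format, the parameter names, and the functions `bin_label` and `histogram` are hypothetical; DataWarrior's own binning dialog may offer different options.

```python
import math
from collections import Counter


def bin_label(value, bin_size, start=0.0):
    """Map a number to a category label such as "300 - 350"."""
    index = math.floor((value - start) / bin_size)
    low = start + index * bin_size
    return f"{low:g} - {low + bin_size:g}"


def histogram(values, bin_size, start=0.0):
    """Count how many values fall into each bin."""
    return Counter(bin_label(v, bin_size, start) for v in values)
```

A molecular weight of 312.4 with a bin size of 50 falls into the category "300 - 350"; the resulting category column can then be assigned to a bar chart axis.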

Pie charts can be configured analogously to bar charts and can represent the same kind of information. Since bars are typically long and pies are as wide as they are high, pie charts are often preferred when both axes show categories of similar sizes. Then the pies simply use the space more efficiently. Relative pie sizes can be adjusted with Set Marker Size.... This allows oversizing large pies in order to make very small pies recognizable as well. Overlapping pies are usually tolerable, and therefore pies can often visualize large size variations better than bars.

Box Plots And Whisker Plots

Box Plots and Whisker Plots are popular when measured data can be assigned to groups and when the variation of the data within the groups and between the groups is of concern. Often the median or mean of a group is compared to other groups, and statistical significance is deduced from p-values as a result of a t-test.

This example shows a simple box plot with the three main cases split a second time by genders. The dialog that opens after selecting Set Statistical View Options... from the view's popup menu lets one define, which statistical parameters are directly shown in the view, how fold-change and p-values are calculated, and how mean or median values are indicated in the view.

Whisker Plots are similar to Box Plots. While the latter depicts a box from the 1st to the 3rd quartile of the data, the Whisker Plot shows all individual values as markers. If there are many individual values, then it is often difficult to judge where all of them are, because they highly overlap. In these cases it often helps to add some transparency with Set Marker Transparency...
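The quartiles that delimit a box can be computed with Python's standard library, as the following sketch shows. Which interpolation method DataWarrior uses for quartiles is not stated here; the "inclusive" method is an assumption, and `box_stats` is a hypothetical name.

```python
from statistics import quantiles


def box_stats(values):
    """Return the quartiles a box plot would draw for one group.

    Assumption: quartiles use the "inclusive" interpolation method.
    """
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return {"q1": q1, "median": q2, "q3": q3, "iqr": q3 - q1}
```

For the values 1, 2, 3, 4 and 5 the box would extend from 2 to 4 with the median at 3.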

In the example above we see a whisker plot with lines connecting the Main Cases, which are the primary categories shown in a box or whisker plot. Primary cases are separated by animal species, and marker colors also reflect the species. Since we separate three cases, we also have three lines instead of one. Each line connects five size groups of one species. Since every species is shown in a different color and since every line connects markers of one species only, the lines are automatically drawn in the color associated with the particular species.

The Form View

With the Form View, you can display some or all fields of one row in a customizable format using the entire main view area. This view is particularly appropriate if your data contains detail information that is not visible in the Table View. A built-in form designer lets you custom tailor your forms by defining an underlying grid, column and row resizing rules, and the locations of form items on the grid.

To create a new Form View, right-click within the header area of any existing view and choose New Form View from the popup menu. This action creates a new Form View with a default layout containing form items for every available column or detail information. You can keep it unchanged or modify the layout using the Form Designer. This layout editor lets you customize the resizing behaviour of rows and columns of the layout and move, add, delete or modify fields within the Form View. You may toggle between the Form View and the Form Designer by checking and unchecking Design Mode in the popup menu, which appears upon a right mouse click on the header area of the Form View.

DataWarrior's forms and their items adapt automatically to the available space, which you notice when you resize the form or print it on different paper formats. This behaviour is achieved through a layout manager that works behind the scenes. The layout manager assigns the available space to form items as follows: You may define a certain number of columns. The resizing behaviour of every column is defined by assigning one of these types: Fixed Size, Relative Size and Share Remaining Space. Fixed size columns are served first. They get a fixed number of pixels from the available width. Relative size columns are defined to get a certain percentage of the remaining width. What is left of the total width after all columns of these two kinds have claimed their part is divided equally between the columns that are of type Share Remaining Space. Sharing the available height among all rows works in the same way. Columns and rows together define a grid which can then be used to attach form items. These may occupy just one grid cell or stretch across multiple columns and rows. Rows or columns that are used as borders or as spacing between form items are typically defined as being of fixed size. Those that bear form items often share the remaining space.
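The three-step space distribution can be made concrete with a short Python sketch. It reads "percentage of the remaining width" as a percentage of the width left after all fixed columns are served, which is an interpretation of the text. The tuple encoding of column types and the name `distribute` are hypothetical.

```python
def distribute(total, columns):
    """Assign widths (or heights) to grid columns (or rows).

    columns: list of ("fixed", pixels), ("relative", percent) or ("share", None).
    """
    widths = [0.0] * len(columns)
    # 1. Fixed size columns are served first.
    for i, (kind, amount) in enumerate(columns):
        if kind == "fixed":
            widths[i] = float(amount)
    remaining = float(total) - sum(widths)
    # 2. Relative columns get a percentage of what is left after step 1
    #    (assumed interpretation of "remaining width").
    base = max(remaining, 0.0)
    for i, (kind, amount) in enumerate(columns):
        if kind == "relative":
            widths[i] = base * amount / 100.0
            remaining -= widths[i]
    # 3. The rest is divided equally among the sharing columns.
    sharing = [i for i, (kind, _) in enumerate(columns) if kind == "share"]
    for i in sharing:
        widths[i] = max(remaining, 0.0) / len(sharing)
    return widths
```

With a total width of 500 pixels, two fixed border columns of 20 pixels, one relative column of 25 percent and two sharing columns, the sketch yields 20, 172.5, 115, 172.5 and 20 pixels.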

In the Form Designer you insert or remove columns or rows by means of a popup menu, which appears after a right mouse click on the appropriate lilac-colored column or row header area.

This popup menu also lets you switch between the associated column or row modes. If you need to remove many of a form's rows or columns, this can be done quickly by selecting them in the row or column header area and then choosing Remove Selected Columns from the popup menu. When you remove columns or rows from a form, any form items that are positioned exclusively in these rows or columns are removed as well.

Form items may be introduced by right clicking into a light blue cell of a form and selecting Add <field name> from the popup menu. Form items may be repositioned by dragging them around on the form's grid or by moving their edge and corner positions with the mouse. A right mouse click on a form item opens a popup menu that lets you assign it to a different data field or remove it entirely from the form.

When all form editing is done, you may leave the Form Designer by unchecking the Design Mode option in the form's main view popup menu.

The Structure View

This view renders all molecules of a data column within a 2-dimensional grid. The number of structures per row and with it the size of the structures may be adjusted by right-clicking into the view and selecting number of columns from the popup menu. The structure pictures may be decorated with further information from the same data row. To do so choose Show/Hide Label from the same popup menu. A dialog lets you assign column titles to predefined label positions around the chemical structure.

The sort order of structures and the Structure Highlight Mode reflect the settings defined in the Table View.

Structure Filter TIP: You can drag and drop a molecule from the Structure View to a Structure Filter or directly into the Structure Editor.

Continue with Analysing Data...