DataWarrior User Manual

Working with Data


DataWarrior supports you in various ways when exploring and comprehending large datasets, whether they contain chemical structures or not.


Pivoting Data

Consider a data table with three columns, of which the first contains product numbers, the second a year, and the third the profits gained with this item in the given year. Now imagine you intend to correlate profits between the different years. Then you need the data table to be arranged differently: rows should contain products, columns should contain years, and the individual cells should contain the achieved profit values. Column titles would look like 'Product Number', '2010', '2011', '2012', ... This kind of data transformation is called Pivoting or Cross-Tabulation. In DataWarrior it can be done by choosing New From Pivoting... from the File menu, which lets you define the involved columns in a dialog and then creates a new window with the copied and converted data table.
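To illustrate with the product example (profit values invented for illustration), pivoting transforms a table like

Product Number   Year   Profit
P-001            2010   1.5
P-001            2011   2.1
P-002            2010   0.8

into one row per product with one column per year:

Product Number   2010   2011
P-001            1.5    2.1
P-002            0.8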

The example used a file with dopamine receptor antagonists from the ChEMBL database. Besides various other information it contained chemical structures, measured effects, and receptor subtypes (D1 - D5) in separate columns. Columns were selected such that results are grouped into one row per chemical structure and data is split into separate columns for every receptor subtype. Three data columns were selected: the measured numerical value, the unit, and the 'm' column, which sometimes contains a modifier (e.g. '<' symbols).


Reverse Pivoting Data

The Reverse Pivoting transformation is the opposite of the pivoting conversion. It creates a new DataWarrior document from the original data table. For every original data row it distributes the cell content of multiple selected data columns into one new table column, creating multiple new rows. A second column is created, which receives the names of the original data columns. This way every new row is assigned to one of the originally selected data columns. The cell content of the other, non-selected columns is redundantly copied to every new row.

The Reverse Pivoting dialog lets one select those columns that shall be merged into one new column. One also needs to specify a name for this column as well as for the second new column, which receives the original column names and serves to assign rows to the original categories.
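To continue the product example from the previous section: selecting the year columns '2010' and '2011' for merging, naming the new value column 'Profit' and the new category column 'Year' (names chosen for illustration) would recreate the original long format with one row per product and year.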


Merging Columns

This rather simple functionality merges cell content from multiple columns into a new one. This may be useful if multiple columns contain data of the same kind, which shall be used together in views, filters or further calculations. After merging into one column, the values from the different source columns are still properly separated, such that DataWarrior still perceives them as multiple values. With Show Multiple Values As... from the column title popup menu one may choose to display the mean, the median, or another summary mode.

This functionality gets particularly interesting when chemical structures are involved, especially after an R-group deconvolution (SAR Analysis). In this case chemical structures are not only merged into one new structure; they are also re-joined wherever they contain an R-substituent that is defined in one of the columns to be merged. In the following example R2 and R3 are re-joined with the scaffold, while R1 is left out and, thus, remains as the varying substituent with regard to the new column.

Merge Column dialog and table view with new column after structure merge.

If a scaffold is re-united this way with all but one or two R-group columns, then one can investigate the influence of these one or two R-groups more closely by focusing on any of the molecule categories defined by the merged structure column. This is somewhat similar to a matched molecular pair analysis.


Binning Numerical Data

If the statistical distribution of all numerical values within one column is of concern, and if a histogram shall be displayed, then one needs to create a new column from the numerical values, which assigns a bin to each value, i.e. converts the numerical column into a special kind of category column.
Choosing Add Binned Column... from the Data menu opens the following dialog, where the bin size and the exact location of the first bin can be defined, either by moving sliders or by typing in the desired values. A histogram preview shows how all rows are distributed among the bins with the current settings.
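For example (settings invented for illustration): with the first bin starting at 0.0 and a bin size of 10, the values 3.7, 12.4 and 37.2 would be assigned to the first, second and fourth bin, respectively. The resulting category column can then be used in a histogram view or as a filter category.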

Binning dialog and histogram view from created new column.


Calculating Column Data From Custom Expressions

Often there is the need to calculate or extract numerical, date, or text data from other columns, e.g. to use it for filtering, as graph input values, as merge keys, or for other purposes. For instance, one may extract the year and the day of the year from complete date values in order to overlay multiple curves in one graph, each of which represents a different year's evolution of some value.

For calculating a new column from existing column values, DataWarrior has the open-source expression parser JEP built in. It lets you define an arithmetical, logical or textual expression from constants, operators (+, -, *, >, &&, !, ==, etc.), functions such as sin(), sqrt(), if(), contains(), etc., and variables that represent data from other columns.

In addition to the mathematical standard functions, DataWarrior understands various functions related to dates, to molecules, to the data table, or relevant in macros. For instance, there are functions to calculate molecule similarities, ligand efficiencies, or the occurrence frequency of specific cell values.

To define and process a new expression choose Add Calculated Values... from the Data menu. When typing in a formula, you don't need to type variable names. Instead, select the column name from the selection in the dialog and click Add Variable. This ensures correct spelling, especially since some column names are slightly modified to avoid syntax errors when the expression is parsed. The expression syntax, all operators, and the available functions are explained below.
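As a simple illustration (the column names MW and IC50 are invented; your dataset's variable names may differ), the following expression flags potent, reasonably small compounds with 1 and all others with 0:

Example: if(MW < 500 && IC50 < 100, 1, 0)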


Operators

In expressions all common arithmetic and boolean operators are fully supported. Boolean expressions evaluate to either 1 or 0 (true or false, respectively). In the following tables a check mark (✓) indicates that the operator can be used with the specific type of variable.

Description                          Function    Double   String
Power                                ^           ✓
Boolean Not                          !           ✓
Unary Plus, Unary Minus              +x, -x      ✓
Modulus                              %           ✓
Division                             /           ✓
Multiplication                       *           ✓
Addition, Subtraction                +, -        ✓        ✓ (only +)
Less or Equal, More or Equal         <=, >=      ✓
Less Than, Greater Than              <, >        ✓
Not Equal, Equal                     !=, ==      ✓        ✓
Boolean And                          &&          ✓
Boolean Or                           ||          ✓
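For example, the following boolean expression evaluates to 1 if either condition holds and to 0 otherwise (the column names MW and LogP are invented for illustration):

Example: MW > 500 || LogP > 5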

Mathematical Standard Functions

This function set contains more or less all common numerical and string functions one would expect to find on an electronic calculator, in a spreadsheet, or in a programming language. Again, a check mark (✓) indicates whether the function can be used with numerical or string parameters.

Description                                                   Function               Double   String
Sine                                                          sin(x)                 ✓
Cosine                                                        cos(x)                 ✓
Tangent                                                       tan(x)                 ✓
Arc Sine                                                      asin(x)                ✓
Arc Cosine                                                    acos(x)                ✓
Arc Tangent                                                   atan(x)                ✓
Arc Tangent (with 2 parameters)                               atan2(y, x)            ✓
Hyperbolic Sine                                               sinh(x)                ✓
Hyperbolic Cosine                                             cosh(x)                ✓
Hyperbolic Tangent                                            tanh(x)                ✓
Inverse Hyperbolic Sine                                       asinh(x)               ✓
Inverse Hyperbolic Cosine                                     acosh(x)               ✓
Inverse Hyperbolic Tangent                                    atanh(x)               ✓
Natural Logarithm                                             ln(x)                  ✓
Logarithm base 10                                             log(x)                 ✓
Exponential                                                   exp(x)                 ✓
Absolute Value / Magnitude                                    abs(x)                 ✓
Integer Value                                                 int(x)                 ✓
Rounded Value                                                 round(x, scale)        ✓
Random number (between 0 and 1)                               rand()
Modulus                                                       mod(x, y) = x % y      ✓
Square Root                                                   sqrt(x)                ✓
Sum (of three values)                                         sum(x, y, z)           ✓        ✓
Maximum (returns larger value)                                max(x, y)              ✓
Minimum (returns smaller value)                               min(x, y)              ✓
If (returns x if condition is true, y if false)               if(condition, x, y)    ✓
Is Empty (String or Number; returns 1 or 0)                   isempty(x)             ✓        ✓
Replace Empty (returns y if x is empty, otherwise x)          replaceempty(x, y)     ✓        ✓
String representation of x                                    str(x)                 ✓        ✓
String length                                                 len(s)                          ✓
Checks whether string s1 contains s2 (1 or 0)                 contains(s1, s2)                ✓
n-th position of String s2 in String s1                       indexof(s1, s2, n)              ✓
Substring of string s from m-th char to (n-1)-th char         substring(s, m, n)              ✓
n-th entry, when splitting s1 at all occurrences of s2        entry(s1, s2, n)                ✓
String s after replacing all occurrences of regex f with r    replace(s, f, r)                ✓
Check whether s matches the regular expression r              matchregex(s, r)                ✓
Binomial coefficients                                         binom(n, i)            integers
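Two short illustrations (literal values and the column name Comment are invented; scale is assumed to be the number of decimal places):

Example: round(sqrt(2), 3)   evaluates to 1.414
Example: replaceempty(Comment, "n/a")   returns 'n/a' for empty cells of the assumed text column Comment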

Date Functions

If a column contains date values, DataWarrior recognizes them automatically. Internally, date values are represented as the number of days that have passed since January 1st, 1970. Therefore, when a date column is used in filters or assigned to an axis, the behaviour is nicely proportional and covers the full date range in a linear way. When used in expressions, the following date functions work seamlessly with date values from date columns.

Description                                                          Function
Day of Week (1 := Monday)                                            dayOfWeekEU(date-value)
Day of Week (1 := Sunday)                                            dayOfWeekUS(date-value)
Day of Week (1 := Saturday)                                          dayOfWeekME(date-value)
Day of Month from Date                                               dayOfMonth(date-value)
Day of Year from Date                                                dayOfYear(date-value)
Week of Year (ISO 8601; week starts on Monday; Jan 4 is in week 1)   weekISO(date-value)
Week of Year (week starts on Monday; Jan 1 is in week 1)             weekEU(date-value)
Week of Year (week starts on Sunday; Jan 1 is in week 1)             weekUS(date-value)
Week of Year (week starts on Saturday; Jan 1 is in week 1)           weekME(date-value)
Month from Date                                                      month(date-value)
Year from Date                                                       year(date-value)
Year from Date (corresponding to weekISO)                            yearISO(date-value)
Year from Date (corresponding to weekEU)                             yearEU(date-value)
Year from Date (corresponding to weekUS)                             yearUS(date-value)
Year from Date (corresponding to weekME)                             yearME(date-value)
Date of today                                                        today()

Date functions allow extracting the day, month or year from a date value. If a column was recognized by DataWarrior to contain date values, then one can use these functions to get a numerical value that reflects the day, month or year.

The dayOfWeekEU() function returns values from 1 to 7 representing Monday to Sunday according to ISO 8601, which defines Monday as the first day of the week. While this convention is mainly used in Europe and Asia, other regions of the world consider Sunday or Saturday the first day of the week and may use dayOfWeekUS() or dayOfWeekME(), respectively.

The dayOfMonth() and dayOfYear() functions also return values starting with 1, which always indicates the first day of the period. Analogously, the month() function returns values from 1 to 12 representing January to December, respectively. The year() function extracts the year as a 4-digit number.

The weekISO(), weekEU(), weekUS(), and weekME() functions return the calendar week according to regional conventions. These define which day is considered the first day of the week and which date defines week 1.

The yearISO(), yearEU(), yearUS(), and yearME() functions return a year that matches the respective weekXXX() functions. Here the new year starts with the first day that is considered to belong to week 1. This day may still be in December or it may be one of the first days of January.

The today() function returns the current date as the number of days that have passed since January 1st, 1970. Because DataWarrior represents date values the same way, the today() function can be used to convert an absolute date into the number of days that have passed since then. For this, use an expression like today()-myDataColumn.
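For instance, to overlay one curve per year as described earlier, one could calculate two new columns from a date column (here assumed to be named Date):

Example: year(Date)
Example: dayOfYear(Date)

Assigning the day of the year to the X-axis and using the year as a category then yields one curve per year.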


Table Related Functions

When DataWarrior calculates new column values by applying a given custom formula, it solves the expression for every row individually. During any of these calculations variables represent data from the currently processed row only. Data from other rows is not visible to the expression evaluator and cannot be used in the formula. For certain purposes, however, the data from other rows must be taken into account. For instance, if a value must be normalized and centered relative to all values of the same column, then these values must be known. Some functions need to know the values of other rows, which belong to a shared category. Or sometimes access to the previous row is needed in order to calculate a value increase. The following functions have in common that they all preprocess the entire table before starting to calculate values for every row. This way all background information is available under the hood when a particular row's expression is evaluated.

Note: Some of these function results depend on the table's sort order; e.g. the value increase from the previous row obviously depends on which row is located above the row for which the increase is calculated.

Note: Many of these functions require a column name as a parameter, which is not the variable holding the row's value. It is the exact column name as it appears in the table header, enclosed in double quotes.

Description                                         Function
Row index of processed row                          row()
Numerical value of cell                             numvalue(column-name, row)
Number of cell values including empty ones          valueCount(column-var)
Number of non-empty cell values                     nonEmptyValueCount(column-var)
Standard deviation of cell values from mean         valueStdDev(column-var)
Frequency of occurrence                             frequency(value, column-name)
Frequency of same value in same category            frequencyInCategory(category-column, value-column)
Absolute value increase from previous row           increase(value-column)
Absolute value increase in category                 increaseInCategory(category-column, value-column)
Percent increase from previous row                  percentIncrease(value-column)
Percent increase in category                        percentIncreaseInCategory(category-column, value-column)
Cumulative sum of previous rows                     cumulativeSum(value-column)
Cumulative sum in category                          cumulativeSumInCategory(category-column, value-column)
Moving sum in category                              movingSumInCategory(category-column, value-column, n1, n2)
Moving average in category                          movingAverageInCategory(category-column, value-column, n1, n2)
First value within category                         categoryFirst(category-column, value-column)
Smallest value in category                          categoryMin(category-column, value-column)
Largest value in category                           categoryMax(category-column, value-column)
Sum of values in category                           categorySum(category-column, value-column)
Mean value in category                              categoryMean(category-column, value-column)
Median value in category                            categoryMedian(category-column, value-column)
Reference value in category                         refvalue(ref-ID-column, category-column, value-column, ref-ID)
Normalize and center value                          normalize(column-name)

The row() function just returns the row index of the currently processed row starting with 1 for the first row of the table.

The numvalue() function returns the numerical content of a cell defined by its column name and its row index starting from 1 for the first row in the table.

The valueCount() function counts the number of entries in a particular cell, no matter whether these are separated by '; ' or whether they are in separate lines. Empty values are included. The function takes any variable name that refers to a displayable table column.

The nonEmptyValueCount() function works like valueCount(), but does not include empty values.

The valueStdDev() function calculates the standard deviation of all values within one cell with regard to their mean value.

The frequency() function counts how many rows of the entire table contain the given value in the specified value-column. When using this function, typically, but not necessarily, value is a column variable, while value-column is the name of the same column, e.g. 'frequency(AnimalKind, "Animal Kind")'.

The frequencyInCategory() function is similar to the frequency() function, but in addition it considers the row's category membership. It counts how many table rows contain the given value in the value-column and at the same time belong to the specified category, i.e. have the given category value in the category-column.

The increase() and percentIncrease() functions calculate the value increase from the previous row, either as an absolute difference or as a percentage relative to the previous value. The cumulativeSum() function calculates the total sum of this value and all previous values. All three functions depend on the table sort order.
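For instance, after sorting the table by date, the following expressions (assuming a numerical column named 'Price') yield the row-to-row change and the running total; note the double quotes, because these functions take the column name rather than a variable:

Example: increase("Price")
Example: cumulativeSum("Price")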

The increaseInCategory(), percentIncreaseInCategory(), cumulativeSumInCategory(), movingSumInCategory(), and movingAverageInCategory() functions all calculate a value from multiple rows, which belong to the same specified category. They all depend on the table's current sort order. increaseInCategory() calculates the absolute increase of the current row's value in regard to the value of the closest row in the table above, which belongs to the same category. percentIncreaseInCategory() gives you the relative increase in percent. cumulativeSumInCategory() calculates the sum of all previous values plus the current row's value within the same category.

movingAverageInCategory() calculates an average from the row's value, the n1 previous and the n2 following values in the table that belong to the same category. If the table contains fewer than n1 previous rows or fewer than n2 following rows matching the same category, then an average value is calculated anyway from the available rows only.

Similarly, movingSumInCategory() calculates the value sum from the row's value, the n1 previous and the n2 following values that belong to the same category. If the table contains fewer than n1 previous rows or fewer than n2 following rows matching the same category, then NaN is returned.

The categoryFirst() function returns the first value within a given value-column that belongs to a specified category. Thus, the result depends on the current order of table rows.
Example: If a table contains stock prices from different companies at different dates, then in order to visualize the relative price evolution of the different companies over time, one might sort the table by ascending date and then calculate a new column 'Relative Price' using the formula 100*price/categoryFirst("Company","Price"). A scatter plot showing relative prices over dates, with lines connecting rows of the same company, would nicely show the price evolution curves of all companies, all starting at 100%.

All other categoryXxxx() functions work similarly to categoryFirst(), but they are independent of the table row order. These functions all summarize different aspects of all numerical values in a given column, whose rows belong to the same category, defined by another column. Depending on the specific function, the lowest or highest value, the mean, the median, or the sum of all individual values within that category is returned. Empty cells are not considered. If single cells contain multiple values, then these are first summarized using the defined multiple value mode. Thus, every cell contributes one (potentially summarized) value to the category summary.
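For example, reusing the stock price table from above, the deviation of every price from its company's mean price could be calculated as:

Example: price - categoryMean("Company","Price")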

The refvalue() function is a look-up function that finds, within all rows of a given category, the particular row which contains a given ID in a given ID-column. The identified row is then considered the reference row for the entire category and supposedly contains a reference value in a specified column. For all rows of the same category the refvalue() function returns this same value.
The following example shall make this clearer. Let's assume that we have measured a set of compounds in multiple experiments. For every compound we have a result value for every experiment, and the corresponding dataset contains these three columns:
'Experiment-ID': an experiment identifier that serves as a category name
'Compound-ID': the compound identifier (ID-column)
'Result': a measured numerical experimental result value
Let's further assume that one compound named 'A-123' shall serve as a reference compound within all experiments. We intend to use its measured value to normalize all other result values within the same experiment. In this case we would use the function refvalue("Compound-ID","Experiment-ID","Result","A-123") to get the result of the reference compound within the same experiment. Note that all function parameters are hard-coded text strings: three column names and one compound identifier.
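The intended normalization could then be expressed, for instance, as a percentage of the reference compound's result (Result being the variable that refers to the 'Result' column):

Example: 100 * Result / refvalue("Compound-ID","Experiment-ID","Result","A-123")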

The normalize() function normalizes and centers an input value based on the distribution statistics of one column's numerical values. Typically, the input value itself is from the same column, but it doesn't need to be. The normalization is achieved by subtracting the column's mean from the input value and then dividing by the square root of the column's variance. Value normalization is needed if multiple columns shall serve as input for an unbiased machine learning approach.


Biology and Chemistry Related Functions

In addition to the functions above, DataWarrior provides a few special functions that may be useful in the context of drug discovery and chem- or bio-informatics. In contrast to the functions above, these functions may also take chemical structures, reactions or descriptors as parameters.

Description                      Function
Ligand Efficiency (HTS)          ligeff1(ra, conc in μmol/l, structure)
Ligand Efficiency (IC50)         ligeff2(ic50 in nmol/l, structure)
Chemical Similarity (A)          chemsim(descriptor, idcode)
Chemical Similarity (B)          chemsim(descriptor1, descriptor2)
Highest Chemical Similarity      maxsim(descriptor)

The ligeff1() and ligeff2() functions calculate ligand efficiencies as relative free binding energy in kcal/mol per non-hydrogen atom. While the first function ligeff1() requires the remaining activity of an HTS result, the second function ligeff2() needs IC50 values to work on. Ligand efficiency values are a much more reasonable basis for selecting leads from an HTS campaign than remaining activities, because they avoid the strong bias towards high molecular weight compounds, which is an implicit drawback of selecting those compounds as leads whose remaining activity lies below a certain threshold. Also during lead optimization one should compare target affinities based on ligand efficiencies rather than pure IC50 values.
"For the purposes of HTS follow-up, we recommend considering optimizing the hits or leads with the highest ligand efficiencies rather than the most potent..." (Ref.: A. L. Hopkins et al., Drug Disc. Today, 9 (2004), pp. 430-431).

To give an example: A compound with 30 non-H atoms (400 MW) that binds with Kd=10 nM has a ligand efficiency value of 0.36 kcal/mol per non-H atom. Another compound with 38 non-H atoms (500 MW) and the same ligand efficiency would have a 100-fold higher activity with Kd=0.106 nM. Let us assume an HTS screening revealed two hit compounds A and B with equal activities of IC50=10 nM, but different molecular weights of 400 and 500, respectively. Based on activities both compounds look equally attractive. Considering, however, that the synthetic introduction of a new group with 8 non-H atoms into compound A would match compound B in terms of weight, but would increase the activity by a factor of 100 if its ligand efficiency value can be maintained, it becomes clear that compound A is by far the more attractive alternative.

The remaining activity ra supplied to the ligeff1() function should be roughly between 0 and 100. The second parameter to this function is the assay concentration conc of the potential inhibitor in μmol/l. The third parameter is the molecular structure from which the number of non-hydrogen-atoms is determined automatically. In order to avoid misinterpretations one should understand the way the ligeff1() function works:
1) ra values below 1.0 are set to 1.0. Those above 99.0 are set to 99.0.
2) IC50 values are calculated from these range limited ra values as ic50 = conc / (100/ra - 1.0)
3) Assuming that the ic50 values are equivalent to the Kd the free energy of the ligand binding is calculated as dG = -RT * ln(ic50) with R=1.986 cal/(mol*K) and T=300K
4) The ligand efficiency is then calculated as ligeff = dG / N, where N is the number of non-hydrogen atoms.
The consequences of this calculation are: Calculated ic50 values cover 4 log units, with values at the lower and upper end of this range having the highest uncertainty. In other words, the higher the noise of the screening, the higher the uncertainty of the ic50 values and also of the ligand efficiency values, especially at the lower and upper end of the scale.
If one can use the second function ligeff2() based on measured IC50 values, one avoids the error potential of calculating IC50 values from remaining activities. Ligand efficiency values from this function are therefore much more reliable and only carry the error margin of the original IC50 value.
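For instance, assuming a numerical column IC50 holding values in nmol/l and a structure column Structure (both names chosen for illustration), a ligand efficiency column could be calculated as:

Example: ligeff2(IC50, Structure)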

The chemsim() function calculates similarities between two chemical structures or reactions. This function is available in two variations:

A Use syntax A to calculate the similarities of one column's chemistry objects against one reference compound or reaction. The first parameter of this function defines the kind of similarity to be calculated. It must be the name of a descriptor column from the popup menu. The second parameter is the idcode of the reference structure. The following example calculates the 3D-pharmacophore similarity of the compounds in column Structure to pyridine (gFx@@eJf`@@@ is the idcode of pyridine).

Example: chemsim(PP3DMM2_of_Structure,"gFx@@eJf`@@@")

B Alternatively, you may use syntax B to calculate the similarities between two different columns containing chemistry objects. In this case you may need to calculate chemical descriptors first. Be aware that the descriptors supplied to the chemsim() function need to be of the same type. This example calculates the similarities between a Reactant and a Product column.

Example: chemsim(FragFp_of_Reactant,FragFp_of_Product)

The maxsim() function calculates the chemical similarity to all other compounds and returns the highest of these similarity values. In other words, it calculates the similarity to the most similar other compound in the data set. The function's only parameter is a chemical descriptor, which determines the kind of similarity to be used.

One application of this function lies in the field of machine learning. When structure-derived descriptors serve as input data for a machine-learning approach, it is crucial that the compounds used for model testing are substantially different in structure from any compound used in the model training phase. A reasonable cut-off value depends on the particular problem and on the descriptor used. As a rule of thumb, values below 0.6 (SkelSpheres), 0.55 (SphereFp), 0.45 (PathFp), 0.4 (FragFp), and 0.2 (Flexophore) ensure that all other molecules in the data set are pretty dissimilar in the eyes of the descriptor.

Example: maxsim(SkelSpheres_of_Structure)
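For the machine-learning use case just described, one might flag sufficiently dissimilar compounds with an expression like the following (cut-off and column name chosen for illustration); rows yielding 1 are dissimilar enough from all other compounds to serve as test compounds:

Example: if(maxsim(SkelSpheres_of_Structure) < 0.6, 1, 0)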

Macro Related Functions

These functions are typically used in expressions that are evaluated as part of a macro.

Description                    Function
Ask for a string value         askString(dialog-message)
Ask for a numerical value      askNumber(dialog-message)
Ask for a column variable      askColumnVar(dialog-message)
Ask for a column title         askColumnTitle(dialog-message)

The askColumnVar(), askColumnTitle(), askString() and askNumber() functions are special functions, which do not calculate anything. Instead they interactively ask the user to provide some information before any calculation is done. The argument dialog-message is shown within a dialog to the user. Typically it will be something like 'Which column do you want to base your calculation on?'.
The askColumnVar() function lets the user select a column from the current dataset and returns the name of the variable, which refers to the column content.
The askColumnTitle() function also asks the user to select a column. However, it returns the column-name as a String value.
The askString() function asks for a text string to be entered, which is then returned as a double-quoted String value.
The askNumber() function asks for a numerical value, which can be used in the formula, wherever a numerical value is appropriate.
These functions help automating DataWarrior with task sequences, since they allow specifying task conditions at run-time rather than at task definition time.
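For instance, a formula used inside a macro could apply a threshold that is entered at run-time (MW is an invented column name):

Example: if(MW < askNumber("Please enter a molecular weight cut-off"), 1, 0)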


The Gini Selectivity Score

The Gini Coefficient was originally invented to represent the income or wealth distribution of a nation's residents, and is today the most commonly used measure of inequality. For instance, in the context of medicinal chemistry P. Graczyk applied it to express the selectivity of kinase inhibitors (doi:10.1021/jm070562u). A Gini Coefficient of 0.0 expresses perfect equality, where all values are the same, while a value of 1.0 represents maximal inequality, e.g. a totally selective inhibitor that is only active on one of many kinases. Typically, the Gini Coefficient is used for a large number of positive numerical values of which a few are much larger than the rest. Thus, inhibition values are a perfect input, while logarithmic data is a less meaningful data source.

To calculate the Gini Coefficient in DataWarrior one needs to provide the input values in columns. The Calculate Selectivity Score... item in the Data menu opens a dialog that lets you define a set of columns, either by selecting them individually from a list or by specifying a common element of the columns' names. In addition, one may select an option that converts effect values to inhibition values by subtracting the original value from 100. DataWarrior then calculates a new column containing the Gini Coefficient using the trapezoid method as described in the paper by Graczyk. However, DataWarrior does not clip input values to the range from 0 to 100, because this would reduce the noise in an unsymmetrical fashion. We consider the Gini Coefficient robust enough to digest negative input data without sacrificing its usefulness.
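As a rough sketch of the underlying idea (the standard Lorenz-curve formulation, given here for orientation; the exact procedure is defined in the cited paper): sort the n inhibition values in ascending order, build the normalized cumulative sums y(i) = (v(1)+...+v(i)) / (v(1)+...+v(n)) with y(0) = 0, and approximate the area B under this curve with trapezoids, B = sum over i of (y(i-1)+y(i)) / (2n). The Gini Coefficient is then 1 - 2B, i.e. 0.0 for a flat distribution and close to 1.0 if a single value dominates the sum.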


Defining a Fuzzy Score

Sometimes it is helpful to express the fitness or usefulness of an object with a single numerical value, which somehow summarizes the fitness of multiple properties. For this purpose DataWarrior allows calculating a fuzzy score from multiple contributing property values. These values may be taken from numerical columns of the data set or they may be calculated on-the-fly from a chemical structure. No matter which individual properties are used, DataWarrior derives from every contributing property an individual score between 0.0 and 1.0. This is done using a smooth sigmoid curve defined for every property. In the dialog (see below) this curve is drawn in magenta. The curve's slope and inflection point(s) can be adjusted by setting the Minimum and Maximum values as well as by moving the Slope slider. An additional Weight slider allows reducing or increasing the contribution of the associated property's score to the overall score.

The individual valuation scores are then summarized by calculating their mean value or by multiplying all individual scores (geometric mean). Usually the latter method is chosen if one bad property alone can render an object useless.
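A small numerical illustration (scores invented): three individual property scores of 0.9, 0.8 and 0.1 yield a mean of 0.6, but a product of only 0.072. The multiplicative summary therefore punishes a single bad property much harder, which is the desired behaviour if one failing property alone renders an object useless.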

The properties that contribute to the overall fuzzy score are compiled by adding property specific panels to the lower scrollable part of the dialog. This is done by selecting the property's name in the Contributing Values popup menu and pressing the Add Value button. Then a new panel appears at the end of the property panel list letting you define the valuation curve. If the property refers to the current data set, then the distribution of the associated column's data is shown as a histogram in the curve area. This helps defining proper valuation curves.

Defining a fuzzy score profile is often done once and then applied multiple times on the same, but updated data set or on different data sets. Therefore, this functionality is often used within a macro.

Fuzzy score dialog defined to reward low IC50 values and low molecular weights.


Working with Graph Data

Often the rows of a data table represent objects that are logically connected or related, e.g. if some objects are similar to others or if objects are parents to or predecessors of others. DataWarrior uses a simple way to define these kinds of relationships between rows, provided that one column contains names or numbers which uniquely identify every row. If a second column's cells then contain one or more of these identifiers, these rows effectively point to (or reference) other rows of the same table. This turns the table of rows into a graph whose nodes are represented by table rows and whose edges are defined by row references. If you want DataWarrior to understand and make use of row links, you need to tell it which column references which identifier column. Choose Set Column Reference... from the popup menu that appears upon a right mouse click within the referencing column's title area.

Popup menu and dialog to define a column reference.

Column Name: This is the column that references the identifier column. Typically, it contains zero, one, or multiple identifiers in every cell, which establish the link or reference to other rows. If the dialog was opened through a column popup menu as shown above, then this column was implicitly chosen in that step and cannot be changed in the dialog.

Referenced column: This popup menu defines the column containing unique row identifying names or numbers. It does not offer columns with duplicate entries.

Strength column: This popup menu allows assigning a numerical column whose values define the strength of the links. If a strength column is chosen, it must have the same number of values in every row as the referencing column, i.e. every referencing identifier must have a matching strength value.

Links are bi-directional: If links are defined in the referencing column by one entry only, then we have a directed graph, which may define one or more trees, cycles or just linear strings of connected rows. Typical examples are parent or child relationships. Alternatively, relationships may not have a direction. In this case every link is defined by two identifier entries: each of the two linked rows points to its opposite row. This kind of link typically represents neighbourship or high row similarities.
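A minimal example (identifiers invented): if an identifier column 'ID' contains the values A, B, C, D and a referencing column 'Parent' contains -, A, A, B in the same rows, then defining 'Parent' to reference 'ID' turns the four rows into a small directed tree: A is the root, B and C are its children, and D is a child of B.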

Once a column reference is defined and therefore rows are crosslinked with each other, what can you do with it?

  • Most obvious is that these links can be shown as connection lines in scatter plots. For that to happen choose Set Column Reference... after a right mouse click within the particular 2D- or 3D-view. In the connection line dialog choose your referencing column from the Group & connect by menu. You may also adjust the line width or invert the arrow direction in case of directed links. Note that the position of your row markers is not touched when showing connection lines. These positions still depend on which columns are assigned to the two or three axes.
  • You may configure a view to show a sub-graph starting from the currently selected reference row as root. Since reference rows can be changed with one mouse click, such a view is a very dynamic one and shows a different sub-graph with every change of the reference row. These sub-graphs are trees whose nodes are automatically arranged to optimize the tree layout. You may choose between a circular tree layout with the root in its center or a traditional hierarchical tree with the root node at one of the view edges. To show dynamic sub-graphs choose anything but none from the Detail tree mode options in the connection line dialog.
  • You may ask DataWarrior to calculate a new set of 2D-coordinates for your rows, which are optimized to visualize your row relations. The algorithm used tries to optimize two criteria: Linked rows are placed close to each other and all rows are distributed evenly over the available space. The following section explains this in more detail.


Arranging Rows According to Graph Relationships

If one column is defined to reference another one and, therefore, rows are linked to other rows, then DataWarrior may use this information to arrange all rows on a 2-dimensional plane such that the distance between linked rows is minimized. For that, DataWarrior uses a Rubberband Scaling algorithm that generates new X and Y coordinates for each row. When these coordinates are assigned to the two axes of a 2D-view, even complex graphs can be displayed nicely. DataWarrior uses the same algorithm to visualize chemical space after a similarity or activity cliff analysis.

To calculate row coordinates using this method, choose Arrange Graph Nodes... from the Data menu, select the referenced, the referencing, and optionally a connection strength column, and click OK. DataWarrior then creates a new 2D-view that shows all rows using the freshly generated coordinates. It also shows a connection line for every defined row relationship.

2D-View displaying directed graph after Rubberband Scaling.

In addition, a second 2D-view is created and configured to display a sub-graph tree with the current reference row as root, provided that a reference row is chosen. Whenever the reference row changes, because a different row is clicked with the mouse in any view, then the sub-graph is automatically updated to show multiple layers of the newly selected reference row's neighbours.

2D-View configured to show circular detail graph with 'Lightning Injuries' as root.

For more information about how to use connection lines and detail tree sub-graphs check the section on connection lines.


Continue with Machine Learning...