DataWarrior User Manual

Working with Data


DataWarrior supports you in various ways when exploring and comprehending large datasets, whether they contain chemical structures or not.


Pivoting Data

Consider a data table with three columns, of which the first contains product numbers, the second a year, and the third the profits gained with this item in the given year. Now imagine you intend to correlate profits between the different years. Then you need the data table to be arranged differently: rows should contain products, columns should contain years, and the individual cells should contain the achieved profit values. Column titles would look like 'Product Number', '2010', '2011', '2012', ... This kind of data transformation is called Pivoting or Cross-Tabulation. In DataWarrior it can be done by choosing New From Pivoting... from the File menu, which lets you define the involved columns in a dialog and then creates a new window with the copied and converted data table.
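To illustrate with the product example (profit values invented for illustration), pivoting transforms a table like

Product Number   Year   Profit
P-001            2010   1.5
P-001            2011   2.1
P-002            2010   0.8

into one row per product with one column per year:

Product Number   2010   2011
P-001            1.5    2.1
P-002            0.8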

The example used a file with dopamine receptor antagonists from the ChEMBL database. Besides various other information it contained chemical structures, measured effects, and receptor subtypes (D1 - D5) in separate columns. Columns were selected such that results are grouped into one row per chemical structure and data is split into separate columns for every receptor subtype. Three data columns were selected: the measured numerical value, the unit, and the 'm' column, which sometimes contains a modifier (e.g. '<' symbols).


Reverse Pivoting Data

The Reverse Pivoting transformation is the opposite of the pivoting conversion. It creates a new DataWarrior document from the original data table. For every original data row it distributes the cell content of multiple selected data columns into one new table column, creating multiple new rows. A second column is created, which receives the names of the original data columns. This way every new row is assigned to one of the originally selected data columns. The cell content of the other, non-selected columns is redundantly copied to every new row.

The Reverse Pivoting dialog lets one select those columns that shall be merged into one new column. One also needs to specify a name for this column as well as for the second new column, which receives the original column names and serves to assign rows to the original categories.
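To continue the product example from the previous section: selecting the year columns '2010' and '2011' for merging, naming the new value column 'Profit' and the new category column 'Year' (names chosen for illustration) would recreate the original long format with one row per product and year.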


Merging Columns

This rather simple functionality merges cell content from multiple columns into a new one. This may be useful if multiple columns contain data of the same kind, which shall be used together in views, filters or further calculations. After merging into one column, the values from the different source columns are still properly separated, such that DataWarrior still perceives them as multiple values. With Show Multiple Values As... from the column title popup menu one may choose to display the mean, the median, or another summary mode.

This functionality gets particularly interesting when chemical structures are involved, especially after an R-group deconvolution (SAR Analysis). In this case chemical structures are not only merged into one new structure; they are also re-joined wherever they contain an R-substituent that is defined in one of the columns to be merged. In the following example R2 and R3 are re-joined with the scaffold, while R1 is left out and, thus, remains as the varying substituent with regard to the new column.

Merge Column dialog and table view with new column after structure merge.

If a scaffold is re-united this way with all but one or two R-group columns, then one can investigate the influence of these one or two R-groups more closely by focusing on any of the molecule categories defined by the merged structure column. This is somewhat similar to a matched molecular pair analysis.


Binning Numerical Data

If the statistical distribution of all numerical values within one column is of concern, and if a histogram shall be displayed, then one needs to create a new column from the numerical values, which assigns a bin to each value, i.e. converts the numerical column into a special kind of category column.
Choosing Add Binned Column... from the Data menu opens the following dialog, where the bin size and the exact location of the first bin can be defined, either by moving sliders or by typing in the desired values. A histogram preview shows how all rows are distributed among the bins with the current settings.
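For example (settings invented for illustration): with the first bin starting at 0.0 and a bin size of 10, the values 3.7, 12.4 and 37.2 would be assigned to the first, second and fourth bin, respectively. The resulting category column can then be used in a histogram view or as a filter category.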

Binning dialog and histogram view from created new column.


Calculating Column Data From Custom Expressions

Often there is the need to calculate or extract numerical, date, or text data from other columns, e.g. to use it for filtering, as graph input values, as merge keys, or for other purposes. For instance, one may extract the year and the day of the year from complete date values in order to overlay multiple curves in one graph, each of which represents a different year's evolution of some value.

For calculating a new column from existing column values, DataWarrior has the open-source expression parser JEP built in. It lets you define an arithmetical, logical or textual expression from constants, operators (+, -, *, >, &&, !, ==, etc.), functions such as sin(), sqrt(), if(), contains(), etc., and variables that represent data from other columns.

In addition to the mathematical standard functions, DataWarrior understands various functions related to dates, to molecules, to the data table, or relevant in macros. For instance, there are functions to calculate molecule similarities, ligand efficiencies, or the occurrence frequency of specific cell values.

To define and process a new expression choose Add Calculated Values... from the Data menu. When typing in a formula, you don't need to type variable names. Instead, select the column name from the selection in the dialog and click Add Variable. This ensures correct spelling, especially since some column names are slightly modified to avoid syntax errors when the expression is parsed. The expression syntax, all operators, and the available functions are explained below.
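As a simple illustration (the column names MW and IC50 are invented; your dataset's variable names may differ), the following expression flags potent, reasonably small compounds with 1 and all others with 0:

Example: if(MW < 500 && IC50 < 100, 1, 0)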


Operators

In expressions all common arithmetic and boolean operators are fully supported. Boolean expressions evaluate to either 1 or 0 (true or false, respectively). In the following tables a check mark (✓) indicates that the operator can be used with the specific type of variable.

Description                          Function    Double   String
Power                                ^           ✓
Boolean Not                          !           ✓
Unary Plus, Unary Minus              +x, -x      ✓
Modulus                              %           ✓
Division                             /           ✓
Multiplication                       *           ✓
Addition, Subtraction                +, -        ✓        ✓ (only +)
Less or Equal, More or Equal         <=, >=      ✓
Less Than, Greater Than              <, >        ✓
Not Equal, Equal                     !=, ==      ✓        ✓
Boolean And                          &&          ✓
Boolean Or                           ||          ✓
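For example, the following boolean expression evaluates to 1 if either condition holds and to 0 otherwise (the column names MW and LogP are invented for illustration):

Example: MW > 500 || LogP > 5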

Mathematical Standard Functions

This function set contains more or less all common numerical and string functions one would expect to find on an electronic calculator, in a spreadsheet, or in a programming language. Again, a check mark (✓) indicates whether the function can be used with numerical or string parameters.

Description                                                   Function               Double   String
Sine                                                          sin(x)                 ✓
Cosine                                                        cos(x)                 ✓
Tangent                                                       tan(x)                 ✓
Arc Sine                                                      asin(x)                ✓
Arc Cosine                                                    acos(x)                ✓
Arc Tangent                                                   atan(x)                ✓
Arc Tangent (with 2 parameters)                               atan2(y, x)            ✓
Hyperbolic Sine                                               sinh(x)                ✓
Hyperbolic Cosine                                             cosh(x)                ✓
Hyperbolic Tangent                                            tanh(x)                ✓
Inverse Hyperbolic Sine                                       asinh(x)               ✓
Inverse Hyperbolic Cosine                                     acosh(x)               ✓
Inverse Hyperbolic Tangent                                    atanh(x)               ✓
Natural Logarithm                                             ln(x)                  ✓
Logarithm base 10                                             log(x)                 ✓
Exponential                                                   exp(x)                 ✓
Absolute Value / Magnitude                                    abs(x)                 ✓
Integer Value                                                 int(x)                 ✓
Rounded Value                                                 round(x, scale)        ✓
Random number (between 0 and 1)                               rand()
Modulus                                                       mod(x, y) = x % y      ✓
Square Root                                                   sqrt(x)                ✓
Sum (of three values)                                         sum(x, y, z)           ✓        ✓
Maximum (returns larger value)                                max(x, y)              ✓
Minimum (returns smaller value)                               min(x, y)              ✓
If (returns x if condition is true, y if false)               if(condition, x, y)    ✓
Is Empty (String or Number; returns 1 or 0)                   isempty(x)             ✓        ✓
Replace Empty (returns y if x is empty, otherwise x)          replaceempty(x, y)     ✓        ✓
String representation of x                                    str(x)                 ✓        ✓
String length                                                 len(s)                          ✓
Checks whether string s1 contains s2 (1 or 0)                 contains(s1, s2)                ✓
n-th position of String s2 in String s1                       indexof(s1, s2, n)              ✓
Substring of string s from m-th char to (n-1)-th char         substring(s, m, n)              ✓
n-th entry, when splitting s1 at all occurrences of s2        entry(s1, s2, n)                ✓
String s after replacing all occurrences of regex f with r    replace(s, f, r)                ✓
Check whether s matches the regular expression r              matchregex(s, r)                ✓
Binomial coefficients                                         binom(n, i)            integers
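Two short illustrations (literal values and the column name Comment are invented; scale is assumed to be the number of decimal places):

Example: round(sqrt(2), 3)   evaluates to 1.414
Example: replaceempty(Comment, "n/a")   returns 'n/a' for empty cells of the assumed text column Comment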

Date Functions

If a column contains date values, DataWarrior recognizes them automatically. Internally, date values are represented as the number of days that have passed since January 1st, 1970. Therefore, when a date column is used in filters or assigned to an axis, the behaviour is nicely proportional and covers the full date range in a linear way. When used in expressions, the following date functions work seamlessly with date values from date columns.

Description                                                          Function
Day of Week (1 := Monday)                                            dayOfWeekEU(date-value)
Day of Week (1 := Sunday)                                            dayOfWeekUS(date-value)
Day of Week (1 := Saturday)                                          dayOfWeekME(date-value)
Day of Month from Date                                               dayOfMonth(date-value)
Day of Year from Date                                                dayOfYear(date-value)
Week of Year (ISO 8601; week starts on Monday; Jan 4 is in week 1)   weekISO(date-value)
Week of Year (week starts on Monday; Jan 1 is in week 1)             weekEU(date-value)
Week of Year (week starts on Sunday; Jan 1 is in week 1)             weekUS(date-value)
Week of Year (week starts on Saturday; Jan 1 is in week 1)           weekME(date-value)
Month from Date                                                      month(date-value)
Year from Date                                                       year(date-value)
Year from Date (corresponding to weekISO)                            yearISO(date-value)
Year from Date (corresponding to weekEU)                             yearEU(date-value)
Year from Date (corresponding to weekUS)                             yearUS(date-value)
Year from Date (corresponding to weekME)                             yearME(date-value)
Date of today                                                        today()

Date functions allow extracting the day, month or year from a date value. If a column was recognized by DataWarrior to contain date values, then one can use these functions to get a numerical value that reflects the day, month or year.

The dayOfWeekEU() function returns values from 1 to 7 representing Monday to Sunday according to ISO 8601, which defines Monday as the first day of the week. While this convention is mainly used in Europe and Asia, other regions of the world consider Sunday or Saturday the first day of the week and may use dayOfWeekUS() or dayOfWeekME(), respectively.

The dayOfMonth() and dayOfYear() functions also return values starting with 1, which always indicates the first day of the period. Analogously, the month() function returns values from 1 to 12 representing January to December, respectively. The year() function extracts the year as a 4-digit number.

The weekISO(), weekEU(), weekUS(), and weekME() functions return the calendar week according to regional conventions. These define which day is considered the first day of the week and which date defines week 1.

The yearISO(), yearEU(), yearUS(), and yearME() functions return a year that matches the respective weekXXX() functions. Here the new year starts with the first day that is considered to belong to week 1. This day may still be in December or it may be one of the first days of January.

The today() function returns the current date as the number of days that have passed since January 1st, 1970. Because DataWarrior represents date values the same way, the today() function can be used to convert an absolute date into the number of days that have passed since then. For this, use an expression like today()-myDataColumn.
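For instance, to overlay one curve per year as described earlier, one could calculate two new columns from a date column (here assumed to be named Date):

Example: year(Date)
Example: dayOfYear(Date)

Assigning the day of the year to the X-axis and using the year as a category then yields one curve per year.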


Table Related Functions

When DataWarrior calculates new column values by applying a given custom formula, it solves the expression for every row individually. During any of these calculations variables represent data from the currently processed row only. Data from other rows is not visible to the expression evaluator and cannot be used in the formula. For certain purposes, however, the data from other rows must be taken into account. For instance, if a value must be normalized and centered relative to all values of the same column, then these values must be known. Some functions need to know the values of other rows, which belong to a shared category. Or sometimes access to the previous row is needed in order to calculate a value increase. The following functions have in common that they all preprocess the entire table before starting to calculate values for every row. This way all background information is available under the hood when a particular row's expression is evaluated.

Note: Some of these function results depend on the table's sort order; e.g. the value increase from the previous row obviously depends on which row is located above the row for which the increase is calculated.

Note: Many of these functions require a column name as a parameter, which is not the variable holding the row's value. It is the exact column name as it appears in the table header, enclosed in double quotes.

Description                                         Function
Row index of processed row                          row()
Numerical value of cell                             numvalue(column-name, row)
Number of cell values including empty ones          valueCount(column-var)
Number of non-empty cell values                     nonEmptyValueCount(column-var)
Standard deviation of cell values from mean         valueStdDev(column-var)
Frequency of occurrence                             frequency(value, column-name)
Frequency of same value in same category            frequencyInCategory(category-column, value-column)
Absolute value increase from previous row           increase(value-column)
Absolute value increase in category                 increaseInCategory(category-column, value-column)
Percent increase from previous row                  percentIncrease(value-column)
Percent increase in category                        percentIncreaseInCategory(category-column, value-column)
Cumulative sum of previous rows                     cumulativeSum(value-column)
Cumulative sum in category                          cumulativeSumInCategory(category-column, value-column)
Moving sum in category                              movingSumInCategory(category-column, value-column, n1, n2)
Moving average in category                          movingAverageInCategory(category-column, value-column, n1, n2)
First value within category                         categoryFirst(category-column, value-column)
Smallest value in category                          categoryMin(category-column, value-column)
Largest value in category                           categoryMax(category-column, value-column)
Sum of values in category                           categorySum(category-column, value-column)
Mean value in category                              categoryMean(category-column, value-column)
Median value in category                            categoryMedian(category-column, value-column)
Reference value in category                         refvalue(ref-ID-column, category-column, value-column, ref-ID)
Normalize and center value                          normalize(column-name)

The row() function just returns the row index of the currently processed row starting with 1 for the first row of the table.

The numvalue() function returns the numerical content of a cell defined by its column name and its row index starting from 1 for the first row in the table.

The valueCount() function counts the number of entries in a particular cell, no matter whether these are separated by '; ' or whether they are in separate lines. Empty values are included. The function takes any variable name that refers to a displayable table column.

The nonEmptyValueCount() function works like valueCount(), but does not include empty values.

The valueStdDev() function calculates the standard deviation of all values within one cell with regard to their mean value.

The frequency() function counts how many rows of the entire table contain the given value in the specified value-column. When using this function, typically, but not necessarily, value is a column variable, while value-column is the name of the same column, e.g. 'frequency(AnimalKind, "Animal Kind")'.

The frequencyInCategory() function is similar to the frequency() function, but in addition it considers the row's category membership. It counts how many table rows contain the given value in the value-column and at the same time belong to the specified category, i.e. have the given category value in the category-column.

The increase() and percentIncrease() functions calculate the value increase from the previous row, either as an absolute difference or as a percentage relative to the previous value. The cumulativeSum() function calculates the total sum of this value and all previous values. All three functions depend on the table sort order.
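For instance, after sorting the table by date, the following expressions (assuming a numerical column named 'Price') yield the row-to-row change and the running total; note the double quotes, because these functions take the column name rather than a variable:

Example: increase("Price")
Example: cumulativeSum("Price")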

The increaseInCategory(), percentIncreaseInCategory(), cumulativeSumInCategory(), movingSumInCategory(), and movingAverageInCategory() functions all calculate a value from multiple rows, which belong to the same specified category. They all depend on the table's current sort order. increaseInCategory() calculates the absolute increase of the current row's value in regard to the value of the closest row in the table above, which belongs to the same category. percentIncreaseInCategory() gives you the relative increase in percent. cumulativeSumInCategory() calculates the sum of all previous values plus the current row's value within the same category.

movingAverageInCategory() calculates an average from the row's value, the n1 previous and the n2 following values in the table that belong to the same category. If the table contains fewer than n1 previous rows or fewer than n2 following rows matching the same category, then an average value is calculated anyway from the available rows only.

Similarly, movingSumInCategory() calculates the value sum from the row's value, the n1 previous and the n2 following values that belong to the same category. If the table contains fewer than n1 previous rows or fewer than n2 following rows matching the same category, then NaN is returned.

The categoryFirst() function returns the first value within a given value-column that belongs to a specified category. Thus, the result depends on the current order of table rows.
Example: If a table contains stock prices from different companies at different dates, then in order to visualize the relative price evolution of the different companies over time, one might sort the table by ascending date and then calculate a new column 'Relative Price' using the formula 100*price/categoryFirst("Company","Price"). A scatter plot showing relative prices over dates, with lines connecting rows of the same company, would nicely show the price evolution curves of all companies, all starting at 100%.

All other categoryXxxx() functions work similarly to categoryFirst(), but they are independent of the table row order. These functions all summarize different aspects of all numerical values in a given column, whose rows belong to the same category, defined by another column. Depending on the specific function, the lowest or highest value, the mean, the median, or the sum of all individual values within that category is returned. Empty cells are not considered. If single cells contain multiple values, then these are first summarized using the defined multiple value mode. Thus, every cell contributes one (potentially summarized) value to the category summary.
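For example, reusing the stock price table from above, the deviation of every price from its company's mean price could be calculated as:

Example: price - categoryMean("Company","Price")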

The refvalue() function is a look-up function that finds, within all rows of a given category, the particular row which contains a given ID in a given ID-column. The identified row is then considered the reference row for the entire category and supposedly contains a reference value in a specified column. For all rows of the same category the refvalue() function returns this same value.
The following example shall make this clearer. Let's assume that we have measured a set of compounds in multiple experiments. For every compound we have a result value for every experiment, and the corresponding dataset contains these three columns:
'Experiment-ID': an experiment identifier that serves as a category name
'Compound-ID': the compound identifier (ID-column)
'Result': a measured numerical experimental result value
Let's further assume that one compound named 'A-123' shall serve as a reference compound within all experiments. We intend to use its measured value to normalize all other result values within the same experiment. In this case we would use the function refvalue("Compound-ID","Experiment-ID","Result","A-123") to get the result of the reference compound within the same experiment. Note that all function parameters are hard-coded text strings: three column names and one compound identifier.
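The intended normalization could then be expressed, for instance, as a percentage of the reference compound's result (Result being the variable that refers to the 'Result' column):

Example: 100 * Result / refvalue("Compound-ID","Experiment-ID","Result","A-123")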

The normalize() function normalizes and centers an input value based on the distribution statistics of one column's numerical values. Typically, the input value itself is from the same column, but it doesn't need to be. The normalization is achieved by subtracting the column's mean from the input value and then dividing by the square root of the column's variance. Value normalization is needed if multiple columns shall serve as input for an unbiased machine learning approach.


Biology and Chemistry Related Functions

In addition to the functions above, DataWarrior provides a few special functions that may be useful in the context of drug discovery and chem- or bio-informatics. In contrast to the functions above, these functions may also take chemical structures, reactions or descriptors as parameters.

Description                      Function
Ligand Efficiency (HTS)          ligeff1(ra, conc in μmol/l, structure)
Ligand Efficiency (IC50)         ligeff2(ic50 in nmol/l, structure)
Chemical Similarity (A)          chemsim(descriptor, idcode)
Chemical Similarity (B)          chemsim(descriptor1, descriptor2)
Highest Chemical Similarity      maxsim(descriptor)

The ligeff1() and ligeff2() functions calculate ligand efficiencies as relative free binding energy in kcal/mol per non-hydrogen atom. While the first function ligeff1() requires the remaining activity of an HTS result, the second function ligeff2() needs IC50 values to work on. Ligand efficiency values are a much more reasonable basis for selecting leads from an HTS campaign than remaining activities, because they avoid the strong bias towards high molecular weight compounds, which is an implicit drawback of selecting those compounds as leads whose remaining activity lies below a certain threshold. Also during lead optimization one should compare target affinities based on ligand efficiencies rather than pure IC50 values.
"For the purposes of HTS follow-up, we recommend considering optimizing the hits or leads with the highest ligand efficiencies rather than the most potent..." (Ref.: A. L. Hopkins et al., Drug Disc. Today, 9 (2004), pp. 430-431).

To give an example: A compound with 30 non-H atoms (400 MW) that binds with Kd=10 nM has a ligand efficiency value of 0.36 kcal/mol per non-H atom. Another compound with 38 non-H atoms (500 MW) and the same ligand efficiency would have a 100-fold higher activity with Kd=0.106 nM. Let us assume an HTS screening revealed two hit compounds A and B with equal activities of IC50=10 nM, but different molecular weights of 400 and 500, respectively. Based on activities both compounds look equally attractive. Considering, however, that the synthetic introduction of a new group with 8 non-H atoms into compound A would match compound B in terms of weight, but would increase the activity by a factor of 100 if its ligand efficiency value can be maintained, it becomes clear that compound A is by far the more attractive alternative.

The remaining activity ra supplied to the ligeff1() function should be roughly between 0 and 100. The second parameter to this function is the assay concentration conc of the potential inhibitor in μmol/l. The third parameter is the molecular structure from which the number of non-hydrogen-atoms is determined automatically. In order to avoid misinterpretations one should understand the way the ligeff1() function works:
1) ra values below 1.0 are set to 1.0. Those above 99.0 are set to 99.0.
2) IC50 values are calculated from these range limited ra values as ic50 = conc / (100/ra - 1.0)
3) Assuming that the ic50 values are equivalent to the Kd the free energy of the ligand binding is calculated as dG = -RT * ln(ic50) with R=1.986 cal/(mol*K) and T=300K
4) The ligand efficiency is then calculated as ligeff = dG / N, where N is the number of non-hydrogen atoms.
The consequences of this calculation are: Calculated ic50 values cover 4 log units, with values at the lower and upper end of this range having the highest uncertainty. In other words, the higher the noise of the screening, the higher the uncertainty of the ic50 values and also of the ligand efficiency values, especially at the lower and upper end of the scale.
If one can use the second function ligeff2() based on measured IC50 values, one avoids the error potential of calculating IC50 values from remaining activities. Ligand efficiency values from this function are therefore much more reliable and only carry the error margin of the original IC50 value.
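For instance, assuming a numerical column IC50 holding values in nmol/l and a structure column Structure (both names chosen for illustration), a ligand efficiency column could be calculated as:

Example: ligeff2(IC50, Structure)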

The chemsim() function calculates similarities between two chemical structures or reactions. This function is available in two variations:

A Use syntax A to calculate the similarities of one column's chemistry objects against one reference compound or reaction. The first parameter of this function defines the kind of similarity to be calculated. It must be the name of a descriptor column from the popup menu. The second parameter is the idcode of the reference structure. The following example calculates the 3D-pharmacophore similarity of the compounds in column Structure to pyridine (gFx@@eJf`@@@ is the idcode of pyridine).

Example: chemsim(PP3DMM2_of_Structure,"gFx@@eJf`@@@")

B Alternatively, you may use syntax B to calculate the similarities between two different columns containing chemistry objects. In this case you may need to calculate chemical descriptors first. Be aware that the descriptors supplied to the chemsim() function need to be of the same type. This example calculates the similarities between a Reactant and a Product column.

Example: chemsim(FragFp_of_Reactant,FragFp_of_Product)

The maxsim() function calculates the chemical similarity to all other compounds and returns the highest of these similarity values. In other words, it calculates the similarity to the most similar other compound in the data set. The function's only parameter is a chemical descriptor, which determines the kind of similarity to be used.

One application of this function lies in the field of machine learning. When structure-derived descriptors serve as input data for a machine-learning approach, it is crucial that the compounds used for model testing are substantially different in structure from any compound used in the model training phase. A reasonable cut-off value depends on the particular problem and on the descriptor used. As a rule of thumb, values below 0.6 (SkelSpheres), 0.55 (SphereFp), 0.45 (PathFp), 0.4 (FragFp), and 0.2 (Flexophore) ensure that all other molecules in the data set are pretty dissimilar in the eyes of the descriptor.

Example: maxsim(SkelSpheres_of_Structure)
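For the machine-learning use case just described, one might flag sufficiently dissimilar compounds with an expression like the following (cut-off and column name chosen for illustration); rows yielding 1 are dissimilar enough from all other compounds to serve as test compounds:

Example: if(maxsim(SkelSpheres_of_Structure) < 0.6, 1, 0)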

Macro Related Functions

These functions are typically used in expressions that are evaluated as part of a macro.

Description                    Function
Ask for a string value         askString(dialog-message)
Ask for a numerical value      askNumber(dialog-message)
Ask for a column variable      askColumnVar(dialog-message)
Ask for a column title         askColumnTitle(dialog-message)

The askColumnVar(), askColumnTitle(), askString() and askNumber() functions are special functions, which do not calculate anything. Instead they interactively ask the user to provide some information before any calculation is done. The argument dialog-message is shown within a dialog to the user. Typically it will be something like 'Which column do you want to base your calculation on?'.
The askColumnVar() function lets the user select a column from the current dataset and returns the name of the variable, which refers to the column content.
The askColumnTitle() function also asks the user to select a column. However, it returns the column-name as a String value.
The askString() function asks for a text string to be entered, which is then returned as a double-quoted String value.
The askNumber() function asks for a numerical value, which can be used in the formula, wherever a numerical value is appropriate.
These functions help automating DataWarrior with task sequences, since they allow specifying task conditions at run-time rather than at task definition time.
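For instance, a formula used inside a macro could apply a threshold that is entered at run-time (MW is an invented column name):

Example: if(MW < askNumber("Please enter a molecular weight cut-off"), 1, 0)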


The Gini Selectivity Score

The Gini Coefficient was originally invented to represent the income or wealth distribution of a nation's residents, and is today the most commonly used measure of inequality. For instance, in the context of medicinal chemistry P. Graczyk applied it to express the selectivity of kinase inhibitors (doi:10.1021/jm070562u). A Gini Coefficient of 0.0 expresses perfect equality, where all values are the same, while a value of 1.0 represents maximal inequality, e.g. a totally selective inhibitor that is only active on one of many kinases. Typically, the Gini Coefficient is used for a large number of positive numerical values of which a few are much larger than the rest. Thus, inhibition values are a perfect input, while logarithmic data is a less meaningful data source.

To calculate the Gini Coefficient in DataWarrior one needs to provide the input values in columns. The Calculate Selectivity Score... item in the Data menu opens a dialog that lets you define a set of columns, either by selecting them individually from a list or by specifying a common element of the columns' names. In addition, one may select an option that converts effect values to inhibition values by subtracting the original value from 100. DataWarrior then calculates a new column containing the Gini Coefficient using the trapezoid method as described in the paper by Graczyk. However, DataWarrior does not clip input values to the range from 0 to 100, because this would reduce the noise in an unsymmetrical fashion. We consider the Gini Coefficient robust enough to digest negative input data without sacrificing its usefulness.
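As a rough sketch of the underlying idea (the standard Lorenz-curve formulation, given here for orientation; the exact procedure is defined in the cited paper): sort the n inhibition values in ascending order, build the normalized cumulative sums y(i) = (v(1)+...+v(i)) / (v(1)+...+v(n)) with y(0) = 0, and approximate the area B under this curve with trapezoids, B = sum over i of (y(i-1)+y(i)) / (2n). The Gini Coefficient is then 1 - 2B, i.e. 0.0 for a flat distribution and close to 1.0 if a single value dominates the sum.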


Defining a Fuzzy Score

Sometimes it is helpful to express the fitness or usefulness of an object with a single numerical value, which somehow summarizes the fitness of multiple properties. For this purpose DataWarrior allows calculating a fuzzy score from multiple contributing property values. These values may be taken from numerical columns of the data set or they may be calculated on-the-fly from a chemical structure. No matter which individual properties are used, DataWarrior derives from every contributing property an individual score between 0.0 and 1.0. This is done using a smooth sigmoid curve defined for every property. In the dialog (see below) this curve is drawn in magenta. The curve's slope and inflection point(s) can be adjusted by setting the Minimum and Maximum values as well as by moving the Slope slider. An additional Weight slider allows reducing or increasing the contribution of the associated property's score to the overall score.

The individual valuation scores are then summarized by calculating their mean value or by multiplying all individual scores (geometric mean). Usually the latter method is chosen if one bad property alone can render an object useless.
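A small numerical illustration (scores invented): three individual property scores of 0.9, 0.8 and 0.1 yield a mean of 0.6, but a product of only 0.072. The multiplicative summary therefore punishes a single bad property much harder, which is the desired behaviour if one failing property alone renders an object useless.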

The properties that contribute to the overall fuzzy score are compiled by adding property specific panels to the lower scrollable part of the dialog. This is done by selecting the property's name in the Contributing Values popup menu and pressing the Add Value button. Then a new panel appears at the end of the property panel list letting you define the valuation curve. If the property refers to the current data set, then the distribution of the associated column's data is shown as a histogram in the curve area. This helps defining proper valuation curves.

Defining a fuzzy score profile is often done once and then applied multiple times on the same, but updated data set or on different data sets. Therefore, this functionality is often used within a macro.

Fuzzy score dialog defined to reward low IC50 values and low molecular weights.


Working with Graph Data

Often the rows of a data table represent objects that are logically connected or related, e.g. if some objects are similar to others or if objects are parents to or predecessors of others. DataWarrior uses a simple way to define these kinds of relationships between rows, provided that one column contains names or numbers which uniquely identify every row. If a second column's cells then contain one or more of these identifiers, these rows effectively point to (or reference) other rows of the same table. This turns the table of rows into a graph whose nodes are represented by table rows and whose edges are defined by row references. If you want DataWarrior to understand and make use of row links, you need to tell it which column references which identifier column. Choose Set Column Reference... from the popup menu that appears upon a right mouse click within the referencing column's title area.

Popup menu and dialog to define a column reference.

Column Name: This is the column that references the identifier column. Typically, it contains zero, one, or multiple identifiers in every cell, which establish the link or reference to other rows. If the dialog was opened through a column popup menu as shown above, then this column was implicitly chosen in that step and cannot be changed in the dialog.

Referenced column: This popup menu defines the column containing unique row identifying names or numbers. It does not offer columns with duplicate entries.

Strength column: This popup menu allows assigning a numerical column whose values define the strength of the links. If a strength column is chosen, it must have the same number of values in every row as the referencing column, i.e. every referencing identifier must have a matching strength value.

Links are bi-directional: If links are defined in the referencing column by one entry only, then we have a directed graph, which may define one or more trees, cycles or just linear strings of connected rows. Typical examples are parent or child relationships. Alternatively, relationships may not have a direction. In this case every link is defined by two identifier entries: each of the two linked rows points to its opposite row. This kind of link typically represents neighbourship or high row similarities.
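A minimal example (identifiers invented): if an identifier column 'ID' contains the values A, B, C, D and a referencing column 'Parent' contains -, A, A, B in the same rows, then defining 'Parent' to reference 'ID' turns the four rows into a small directed tree: A is the root, B and C are its children, and D is a child of B.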

Once a column reference is defined and therefore rows are crosslinked with each other, what can you do with it?

  • Most obvious is that these links can be shown as connection lines in scatter plots. For that to happen choose Set Column Reference... after a right mouse click within the particular 2D- or 3D-view. In the connection line dialog choose your referencing column from the Group & connect by menu. You may also adjust the line width or invert the arrow direction in case of directed links. Note that the position of your row markers is not touched when showing connection lines. These positions still depend on which columns are assigned to the two or three axes.
  • You may configure a view to show a sub-graph starting from the currently selected reference row as root. Since reference rows can be changed with one mouse click, such a view is a very dynamic one and shows a different sub-graph with every change of the reference row. These sub-graphs are trees whose nodes are automatically arranged to optimize the tree layout. You may choose between a circular tree layout with the root in its center or a traditional hierarchical tree with the root node at one of the view edges. To show dynamic sub-graphs choose anything but none from the Detail tree mode options in the connection line dialog.
  • You may ask DataWarrior to calculate a new set of 2D-coordinates for your rows, which are optimized to visualize your row relations. The algorithm used tries to optimize two criteria: Linked rows are placed close to each other and all rows are distributed evenly over the available space. The following section explains this in more detail.


Arranging Rows According to Graph Relationships

If one column is defined to reference another one and, therefore, rows are linked to other rows, then DataWarrior may use this information to arrange all rows on a 2-dimensional plane such that the distance between linked rows is minimized. For that, DataWarrior uses a Rubberband Scaling algorithm that generates new X and Y coordinates for each row. When these coordinates are assigned to the two axes of a 2D-view, even complex graphs can be displayed nicely. DataWarrior uses the same algorithm to visualize chemical space after a similarity or activity cliff analysis.

To calculate row coordinates using this method, choose Arrange Graph Nodes... from the Data menu, select the referenced, the referencing, and optionally a connection strength column, and click OK. DataWarrior then creates a new 2D-view that shows all rows using the freshly generated coordinates. It also shows a connection line for every defined row relationship.

2D-View displaying directed graph after Rubberband Scaling.

In addition, a second 2D-view is created and configured to display a sub-graph tree with the current reference row as root, provided that a reference row is chosen. Whenever the reference row changes, because a different row is clicked with the mouse in any view, then the sub-graph is automatically updated to show multiple layers of the newly selected reference row's neighbours.

2D-View configured to show circular detail graph with 'Lightning Injuries' as root.

For more information about how to use connection lines and detail tree sub-graphs check the section on connection lines.


Continue with Machine Learning...