May 2014, revised December 2018

**Abstract:** The statistic σ_{1} = ∑ max(min(x, y), min(-x, -y))
is proposed as an analog to the usual covariance. The L1 statistic is shown to have the same relation to its L2
counterparts as fuzzy logic does to probability theory. Distance functions based on correlation are shown to be a sum
of metric and non-metric parts, that may be interpreted as the sum of disagreement and agreement, respectively, of
two objects. Applications in data clustering are discussed.

A fundamental task in statistics is to seek relations between variables as a means of explaining the world. For
this purpose, the Pearson correlation has been long universally applied in search of a relation between two variables.
The result is a number *r* between -1 and 1, which is positive when an increase in one tends to be accompanied
by an increase in the other, and negative when an increase in one tends to be accompanied by a decrease in the
other.

A closely related task is to quantify the dissimilarity between entities rather than variables. In a 1936 article,
P.C. Mahalanobis introduced a generalized statistical distance. Given a sample of measurements of *p* variables
for each of *n* objects, each pair of objects is characterized by a nondimensional distance. The Mahalanobis
distance has often been used as a measure of dissimilarity between objects in a sample. The building block of both of
these statistics is the covariance,

s_{xy} = (1/N) ∑ xy = (1/N) ∑ (x - μ)(y - ν)

where μ and ν are the means of x and y, respectively.

A weakness of the standard covariance (and L2 statistics in general) is their sensitivity to outliers. A few
extreme objects have an disproportionate effect on the sample mean and variance. In contrast, the L1 statistics median
and mean absolute deviation, are comparatively robust against these effects. For example, increasing the largest
observation by a arbitrary factor of *f* has no effect at all on the median. The effect of an extreme
measurement on the mean absolute deviation is proportional to its difference from the average, but its effect on the
variance is proportional to the *square* of the difference. If the outliers are spurious, the estimate of the
standard deviation may be severely corrupted.

Previous work in developing a robust generalized distance has focused on choosing a subset of reliable
observations.^{5} Since the number of possible non-empty subsets of *n* observations is 2^{n}-1,
there is exponential growth in the the number of possibilities to consider, and a consequently high computational
cost. Perhaps more importantly, choosing to banish an observation from the data set necessarily discards possibly
valid information. Silently weighing the analyst's prior assumptions about what sort of distribution the data should
have sacrifices scientific objectivity.

An L1 variant of covariance is thus desirable for at least two reasons: it achieves a more robust estimate without
willfully discarding information, and it extends a correspondence with L2 statistics.

The Euclidean distance, given by the Pythagorean theorem

d_{2} = √∑(x-y)²

has its counterpart in the Manhattan distance

d_{1} = ∑|x-y|

Likewise, the standard deviation

σ_{2} = √[(1/N) ∑(x-μ)²]

has its L1 counterpart in the mean absolute deviation:

σ_{1} = (1/N) ∑|x-μ|

L1 analogs of covariance and correlation will now be introduced. Consider the familiar covariance

(1/N) ∑xy = (1/N) ∑[(x-μ)(y-ν)]

The expression xy is positive when x and y have the same sign, and negative when they have different signs. Employing
the identities

(a+b)² = a² + 2ab + b²

(a-b)² = a² - 2ab + b²

∴ 4ab = (a+b)² - (a-b)²

factor:

xy = ¼[(x+y)² - (x-y)²]

Notice that the covariance is the difference of an agreement and a disagreement. Extending the analogy of mean
absolute deviation to standard deviation, define:

σ_{1} = (1/2N) ∑(|x+y| - |x-y|)

as a new measure of the relationship between x and y. The factor 1/2 ensures that for x = y, the expression has the
nice property that it reduces to mean absolute deviation. Thus, the mean absolute deviation is a special case of the
L1 codeviation, just as the L2 variance is a special case of the L2 covariance.

r

By analogy, define a Manhattan correlation

r

For p variables, there is a p×p covariance matrix S. The squared Mahalanobis distance, neglecting a factor of
det(S), is

d_{2} = x_{i}(S^{-1})_{ij}x_{j}

An expression for generalized L1 statistical distance is

d_{1} = ∑|(S^{-1})_{ij}x_{j}|

d = max(min(x, 1-y), min(1-x, y))

To obtain the Manhattan codeviation, begin with an expression for logical equivalence:

x ⇔ y = (x ∧ y) ∨ (¬x ∧ ¬y)

= max(min(x, y), min(1-x, 1-y))

Since this article is mostly concerned with scalar data, switch from logical variables to scalar variables. In
doing so, the effects of the min and max operations are unchanged. For negation, values must be reflected across 0
instead of ½, thus ¬x becomes -x rather than 1-x, resulting in:

x ⇔ y = max(min(x, y), min(-x, -y))

½(|x+y| - |x-y|) ≡ max(min(x, y), min(-x, -y))

This demonstrates a remarkable relationship between fuzzy logic and L1 statistics. Two distinct approaches have led to the same formula. The first approach was an L1 modification of L2 statistics. The second approach was a fuzzy logic modification of calculus of probabilities. It may be concluded that fuzzy logic is related to probability theory in the same way that L1 statistics are related to L2 statistics.

For the purposes of computation, the min/max formula is to be preferred, since it is not susceptible to subtractive
cancellation.

The expression for L1 codeviation yields insights into the metric and non-metric properties of distance functions.
Seek a decomposition d = m + n of distance into a metric part and a non-metric part. Then for triples of objects u, v,
w drawn from some data set, the triangle inequality holds for the metric part of distance

m_{w} ≤ m_{u} + m_{v}

but may or may not be violated in the non-metric part. Such a decomposition always exists, but is not
unique.^{6}

Two important distance functions are the cosine distance and the correlation distance. The difference between them
is that the correlation distance operates on the difference from the average, and the cosine distance operates on the
original data. The cosine distance often is used to compare documents represented by a list of words and the frequency
with which each word occurs (the "bag of words" format). The correlation distance has been used to cluster gene
expression data.^{4}

Suppose that the data vectors have been normalized, by dividing each by

√(∑x²)

for the L2 case, or by

∑|x|

for the L1 case. The correlation is

r_{2} = (x · y) / √(∑x²∑y²)

The correlation distance is preferably arccos(r), which is a metric distance. Elsewhere, the correlation distance has
been defined as:^{7, 8}

d = (1-r) / 2

L2 normalizing the data makes √∑x²∑y² equal one. Substituting ¼[(x+y)² -
(x-y)²] for xy yields

d_{2} = [∑(x-y)² / 8] + [½ - ∑(x+y)² / 8]

which is a metric decomposition, as verified by numerical simulations testing for triangle inequality violations. For
the L1 case,

r_{1} = ∑(|x+y| - |x-y|) / ∑(|x| + |y|)

L1 normalizing the data makes the denominator equal two, thus:

d_{1} = [∑|x-y| / 4] + [½ - ∑|x+y| / 4]

which is a metric decomposition. The disagreement term (x-y) is in the metric part of distance, and the agreement term
(x+y) is in the non-metric part of distance, as would be expected by intuition.

An example of inherently non-metric data occurs when trying to synthesize a set of observations. Some number
*n* of events have occurred, which have been observed by some number *q* of sensors. Each sensor tries
to give a single report of each event, but there may be mix-ups. A sensor may miss an event, or it may double-count an
event.

Other things being equal, if two observations were made by the same sensor, they probably refer to different
events. Thus, it is informative to include in the observation an integer [1,q] to identify the sensor that recorded
it. This ID data has the unusual property that the distance is greatest between two ID's when they are identical. Let
such data be called **contra-categorical**. It is inherently non-metric.

The Minkowski distance function:

d = (∑(x-y)^{L})^{(1/L)}

is usually metric. A very simple decomposition is available into d = m + n. Contra-categorical and binary asymmetric
data are non-metric. Scalar, categorical, ordinal, or cyclic data are metric. Specifically, letting x_{m} and
y_{m} denote the metric components of vectors x and y,

m = (∑(x_{m}-y_{m})^{L})^{(1/L)}

n = d - m

A set of objects described by a matrix of non-metric distances cannot be embedded into Euclidean space. It can be
embedded into a pseudo-Euclidean space, in which some components are imaginary numbers, so that the greater the
difference in these components, the closer together two objects are. This is precisely the behavior of
contra-categorical data. The pseudo-Euclidean representation is not esoteric as it may seem at first. Rather, it is
the natural representation of perfectly ordinary objects.

The L1 generalized distance can be used as a replacement for the conventional Mahalanobis distance. Traditional
linear discriminant analysis (LDA) was compared to the L1 variant in classifying 19 data sets from the UCI
repository^{1} and one other. The results are disappointing, showing that the L1 variant is usually quite
similar to traditional LDA and generally inferior. The table below result lists the results for classification
accuracy, and the difference. The 95% confidence interval was computed by a paired Student's t test, according to the
procedure described by Mitchell.^{3} 10-fold cross-validation was used, except as noted.

Data set | # variables | # objects | # classes | L2 | L1 | dif | interval | note |

iris | 4 | 150 | 3 | 0.980 | 0.980 | 0.000 | 0.000 | * in 5-fold cross-validation |

balance-scale | 4 | 563 | 3 | 0.851 | 0.812 | 0.039 | 0.143 | |

banknote_authentication | 4 | 1372 | 2 | 0.976 | 0.966 | 0.010 | 0.000 | |

wilt | 5 | 4839 | 2 | 0.943 | 0.939 | 0.003 | 0.000 | |

Wholesale customers | 6 | 440 | 2 | 0.848 | 0.900 | -0.052 | 0.036 | * ignoring the nominal variable "region" |

seed | 7 | 210 | 3 | 0.962 | 0.957 | 0.005 | 0.049 | * in 5-fold cross-validation |

E. coli | 7 | 336 | 8 | 0.881 | 0.858 | 0.024 | 0.026 | |

pima-indians-diabetes | 8 | 768 | 2 | 0.778 | 0.743 | 0.035 | 0.031 | |

yeast | 8 | 1484 | 10 | 0.587 | 0.577 | 0.010 | 0.018 | |

fertility | 9 | 100 | 2 | 0.830 | 0.800 | 0.030 | 0.075 | * in 3-fold cross-validation |

glass | 9 | 214 | 6 | 0.631 | 0.594 | 0.037 | 0.073 | * in 5-fold cross-validation, omitting 1 empty class |

breast-cancer-wisconsin | 9 | 783 | 2 | 0.960 | 0.940 | 0.020 | 0.013 | * 16 missing data deleted |

hockey | 10 | 796 | 2 | 0.732 | 0.737 | −0.005 | 0.086 | * new data set |

wine | 13 | 178 | 3 | 0.989 | 0.989 | 0.000 | 0.000 | * in 5-fold cross-validation |

leaf | 14 | 340 | 30 | 0.812 | 0.756 | 0.056 | 0.045 | * 30 non-empty classes. Not to be confused with 'One-hundred species plant leaves' or 'folio' data sets |

parkinsons | 22 | 195 | 2 | 0.872 | 0.662 | 0.210 | 0.199 | * in 5-fold cross-validation |

ionosphere | 34 | 351 | 2 | 0.877 | 0.795 | 0.083 | 0.050 | |

Landsat satellite | 36 | 6435 | 6 | 0.838 | 0.804 | 0.034 | 0.007 | |

musk | 166 | 6598 | 2 | 0.945 | 0.811 | 0.134 | 0.000 | |

isolet | 617 | 7797 | 26 | 0.944 | 0.446 | 0.499 | 0.156 | |

average | 0.862 | 0.803 | 0.059 |

The L1 codeviation seems to be chiefly of theoretical interest. It is insensitive to extreme values. This makes it
robust against outliers, but it is not always an advantage. These measures are equal:

½(|x+y|-|x-y|)

max(min(x,y), min(-x,-y))

sign(xy) · min(|x|, |y|)

The last form makes the behavior explicit. If the signs of x and y are the same, the result is positive, and if the
signs of x and y are different, the result is negative. The magnitude is the lesser of x and y.

For example, these data sets have the same σ_{1}

{(1,1), (2,2), (3,3)} and

{(1,1), (2,2), (3,99)}.

Which suggests that standardizing the data may be beneficial.

Improper metric decompositions are valuable aids to building binary tree data structures. If the distance function
is metric, the triangle inequality may be used to set bounds for which partitions must be searched to find the nearest
neighbors of a query point. If a decomposition of the distance function is available, heuristics for searching the
tree can be derived that are faster than exhaustive search, and more accurate than ignoring the metric violations.
Decompositions for important non-metric distance functions such as the Edit Distance would be important developments.
No general method for obtaining such decompositions is known at the present.

- http://archive.ics.uci.edu/ml/datasets.html
- G. Klir, B. Yuan, Fuzzy Sets and Fuzzy Logic, Prentice Hall, 1995.
- T. M. Mitchell, Machine Learning, New Delhi: McGraw Hill Education, 2013. [1997]
- M. Eisen, P. Spellman, P. Brown, D. Botstein, "Cluster Analysis and display of genome-wide expression data," Proceedings of National Academy of Science, USA v.95 pp.14863-14868, 1998. cited in: [8]
- Peter J. Rousseeuw and Katrien Van Driessen, "A Fast Algorithm for the Minimum Covariance Determinant Estimator," Revised version, 15 December 1998.
- Julian Laub, Non-Metric Pairwise Proximity Data. Technical University of Berlin, 10 December 2004.
- Sergios Theodoridis and Konstantinos Koutroumbas, Pattern Recognition, Third Edition, Academic Press/Elsevier, 2006.
- Here the correlation distance is defined as 1-r.

Michiel de Hoon, Seiya Imoto, and Satoru Miyano. Users' manual for: The C Clustering Library for cDNA Microarray Data University of Tokyo, Institute of Medical Science, Human Genome Center, 27 December 2010.