Applying Generalized Orthogonality

Factoring matrices and solutions to overdetermined and underdetermined systems of linear equations

A.J. Allinger
Kruger Optical, Inc.
December 2018

Summary:

Software package GOPAK uses an expanded concept of orthogonality to give alternative solutions to standard problems in linear algebra. While the concepts of generalized orthogonality have been known to mathematics for many years, GOPAK appears to be the first program to actually compute such vectors.

Background: Linear independence and Orthogonality
- Overdetermined and underdetermined problems
Generalized Orthogonality
- History
- Minimum residual orthogonality
- Taxicab isosceles orthogonality
- ℓ^∞ isosceles orthogonality
- Lorentzian isosceles orthogonality
- Comedian orthogonality
Factoring matrix A = PDQ^T
A plan for solving undetermined and overdetermined systems
About the code
Example problems
Appendix
References

The introductory essay is intended not to be too technical. The reader should know some linear algebra. Most of the math is deferred to the appendix.

Background

A common task in linear algebra is to ask, given a set of vectors, if one of the set can be represented as a linear combination of the others. If so, then the set is linearly dependent. If not, the set is linearly independent. In the plane, linearly dependent vectors are collinear, and linearly independent vectors meet at an angle other than 0° or 180°. The purest form of linear independence occurs when no vector contains any component of any other. Such a set is called orthogonal.

Collinear vectors, vectors meeting at an acute angle, vectors meeting at a right angle

An application of orthogonality is linear regression. Suppose you want to predict something, for example, the share price of Ford Motor Company after one year. Suppose there is some data on which to base a prediction, for example, the entries on the company's quarterly balance sheet. Suppose further that a satisfactory prediction can be made as a linear combination of these values, and that some data is available that relates past balance sheet entries to changes in share price. Multiple linear regression finds factors β_j so that the linear combination

     y_i ≈ ŷ_i = ∑X_ijβ_j

is an estimate of the value to be predicted.

Overdetermined and underdetermined problems

Suppose an adequate amount of information has been collected on corporate balance-sheet entries and share price changes. Arrange the balance-sheet information into matrix X and the price changes into vector y, something like:

	Inventory	Property, plant & equipment	Accounts payable
Ford	x₁₁	x₁₂	x₁₃
Visa	x₂₁	x₂₂	x₂₃
Pfizer	x₃₁	x₃₂	x₃₃
Goldman Sachs	x₄₁	x₄₂	x₄₃
DowDuPont	x₅₁	x₅₂	x₅₃

and

	Change in share price
Ford	y₁
Visa	y₂
Pfizer	y₃
Goldman Sachs	y₄
DowDuPont	y₅

Let m denote the number of rows of the matrix, and n the number of columns. When m > n, the problem is overdetermined. It is very unlikely that any factors β exist that will exactly satisfy y = Xβ. Instead, seek an estimate ŷ that is as near as possible to y. The shortest path from a point to a line is in the direction perpendicular to the line. Since the difference ŷ-y should be as small as possible, it must be perpendicular to the span of X, that is,

     (ŷ - y) ⊥ Xβ

and so orthogonality is the key to solving multiple regression problems.

When m < n, the problem is underdetermined. There is an infinity of ways to choose β to satisfy y = Xβ. One approach is to choose β so as to give weight to the columns of X that are most nearly aligned to y. Since parallelism is the opposite of orthogonality, the method of assessing orthogonality can also be used to choose the columns most parallel to y. Thus, orthogonality is also the key to solving the underdetermined problem.

It is possible for a system of linear equations to be both overdetermined and underdetermined, because of linear dependence between the rows or columns.

Generalized Orthogonality

Mathematicians who research non-Euclidean geometry take the approach of discarding some axiom and exploring the consequences. The present article will take the opposite position: that it is obligatory to explain beforehand what justification exists for neglecting any property of Euclidean space.

Two squares of (A+B)x(A+B), with four right triangles in each

The Pythagorean theorem can be proved simply by taking four tiles, each a right triangle with sides AxB, and rearranging them inside a square with an edge of A+B. Measuring the remaining area, A² + B² = C². How could this ever be wrong?

First, drawing a plot of one variable against another can be misleading. For example, plotting a curve of temperature against pressure renders these quantities into geometry. This is only an analogy! You can't add apples and oranges, and you can't add squared temperature to squared pressure. Distances on such a plot are without physical meaning.

Second, the condition that the prediction error be orthogonal to the prediction is equivalent to the condition that the sum of the individual squared errors be minimized. (See appendix.) This means, for example, minimizing an error measured in squared dollars. This quantity has no business significance; it would make more sense to minimize the total absolute values of prediction error, a quantity measured in dollars.

Keeping this justification in mind, alternative definitions of orthogonality may be considered in geometries in which the Pythagorean theorem does not apply.

History

The history of non-Euclidean orthogonality goes back at least to Galileo Galilei, who sought to minimize the absolute value of error in estimating the position of a newly discovered star. [Bidabad, 2005] Hermann Minkowski proposed that the distance from point A to point B in N-dimensional space is given by the equation

     d = (∑(B_j - A_j)^p)^1/p

where p > 0. When p = 2, this reduces to the Pythagorean theorem. Other values of p yield different distance functions and a well-studied family of non-Euclidean geometries.

The isosceles orthogonality criterion [Roberts, 1934] says that two vectors u and v are orthogonal if and only if the distance from u to v is the same as the distance from u to negative v.

Isosceles triangle with sides u+v, u-v, and 2v

When distance is given by the Pythagorean theorem, the isosceles criterion is equivalent to Euclidean orthogonality. Other distance functions give different meanings to orthogonal. Isosceles orthogonality is symmetric, so that u ⊥ v ⇔ v ⊥ u

A residual orthogonality criterion [Birkhoff, 1935] says that vectors u and v are orthogonal when no multiple of u can be added to v such that the sum v + su has a magnitude less than v. When distance is given by the Pythagorean theorem, the residual criterion is equivalent to Euclidean orthogonality. Different distance functions yield different notions of residual orthogonality. Residual orthogonality is not symmetric, that is u ⊥ v ⇏ v ⊥ u.

When distance is given by the Pythagorean theorem, isosceles and residual orthogonality coincide. Otherwise they are distinct.

Various other orthogonality criteria have been proposed, generalizing from other properties of orthogonal vectors in Euclidean geometry. Freese et. al. [1985] and Alonso & Benítez [1988] survey this prior research.

Minimum residual orthogonality

The minimum residual orthogonality criterion [Birkhoff, 1935] is especially convenient for the regression problem. If v is the residual error ŷ-y, and u is a column of the data matrix X, then the condition u⊥v means to choose β so that the magnitude of the error ||ŷ-y|| is a minimum, and of course this is exactly what is sought.

When distance is given by the Minkowski function with p=1, the objective is to minimize the sum of the absolute values of the errors. This is sometimes called the Manhattan, Taxicab or ℓ¹ metric.
When p=2, the objective is to minimize the sum of squares of the errors, which is ordinary least squares regression. This is called the Euclidean or ℓ² metric.
When p→∞, the objective is to minimize the greatest absolute error. This is sometimes called the Chebyshev or ℓ^∞ metric.
Another interesting distance function is d = ∑ln(1 + \|B_i - A_i\|) which has been dubbed the Lorentzian [Cha, 2007]. It is not indicated which Lorentz it is named after!

A further distance function considered here is the median absolute value,

     d = med(|B_i - A_i|)

which is motivated by the median absolute deviation (MAD), which is an important measure of dispersion in robust statistics.

Taxicab isosceles orthogonality

A criterion for the orthogonality of two vectors u and v is that the distance from u to v should equal the distance from u to -v. Applying the ℓ¹ metric, this means that

     ∑(|u_i + v_i| - |u_i - v_i|) = 0

Let u ⋄ v denote ½∑(|u_i + v_i| - |u_i - v_i|). It is proposed to call this quantity the codeviation, because it is analogous to the covariance. To orthogonalize v against u, seek a scalar s such that

    u ⋄ (v - su) = 0

This reduces the problem to solving an equation of a single variable. A set of vectors may be made orthogonal by a variant of the Gram-Schmidt process. Normalizing the vectors destroys orthogonality, because (su) ⋄ v ≠ s(u ⋄ v). An iterative method may be employed, alternately normalizing a vector, then subtracting parallel components.

Chebyshev isosceles orthogonality

Applying the ℓ^∞ metric to the isosceles criterion, find that

     ½(max(|u_i + v_i|) - max(|u_i - v_i|)) = 0

Similar to the taxicab isosceles criterion, a set of vectors may be made Chebyshev isosceles orthogonal by numerically solving a system of equations.

Lorentzian isosceles orthogonality

Applying the Lorentzian function to the isosceles criterion, find that

     ∑ln((1+|u_i + v_i|)/(1+|u_i - v_i|)) = 0

Comedian orthogonality

The dot product of two vectors is closely related to their covariance. The comedian is a robust statistic similar to the covariance. Suppose that the comedian may be used as a measure of linear dependence and require

     com(u, v) = med(u_iv_i) = 0

Comedian orthogonality is not of the isosceles family. It has the desirable properties that it is symmetric and homogeneous, that is:

     u ⊥ v ⇔ v ⊥ u
and  u ⊥ v ⇒ su ⊥ v

for a scalar s.

Factoring matrix A = PDQ^T

The generalized Gram-Schmidt orthogonalization leads directly to a generalized QR factorization. A general m×n matrix A may be factored A = PDQ^T, where P is an m×r matrix with orthonormal columns, D is an r×r diagonal matrix, Q is an n×r matrix with orthonormal columns, and r is the rank of A. Begin by setting P := A, D and Q to the identity matrix, and r := n. Apply a sequence of multiplies and factors:
P' := PD
PD := P'
Q' := QD^T
QD^T := Q'
Loop until D is diagonal. The result is a generalization of the singular value decomposition (SVD).

A plan for solving Over- and Under- determined systems of equations

A linear system of equations is Xβ = Y, where X is m×n, β is n×o, and Y is m×o. X and Y are known and β are the unknown weights to be solved for.

Begin by choosing a symmetric orthogonality criterion, and factor X = PDQ^T.
Next choose a residual criterion, and solve the overdetermined problem PE = Y for E.
Solve the underdetermined problem Q^Tβ = D^-1E for β.

About the code

The main entry point is subroutine GORTH. It has the calling sequence:

      SUBROUTINE GORTH (U, MLIM, M, N, RELTOL, CRIT, JOB, PEX, V, 
     $                   R, S, LIIN, VP, W, IND, IFAULT)

Each call to GORTH solves one equation of the form Ur = v for r. It handles both overdetermined and underdetermined problems, and provides several orthogonalities to choose from. The user must specify U, the matrix; M, the rows; N, the columns; RELTOL, a relative tolerance; CRIT, an orthogonality criterion; JOB, 'O' for overdetermined or 'U' for underdetermined; PEX, the exponent in the Minkowski metric; and V, the input vector. The return values are R, the solution vector; S, the magnitude of the residual; LIIN, indicating if V is linearly independent of U; VP, the residual vector V' after components of U have been removed; and IFAULT, an error code. Arrays W and IND are workspace for the routine. The choices for orthogonality criterion are:
  '2' Euclidean, ordinary
  'T 'Use L1 isosceles orthogonality ("Taxicab")
  'S' Use L-infinity isosceles orthogonality (S for "Symmetric")
  'C' Use comedian orthogonality
  'R' Use Lorentzian orthogonality
  '1' Minimize L1 norm of V'
  'P' Minimize Lp norm of V' for user's choice of PEX
  'I' Minimize L-infinity (Chebyshev) norm of V'
  'L' Minimize Lorentzian norm of V'
  'M' Minimize median absolute norm of V'

The first phase is a greedy algorithm which orthogonalizes V against each column of U separately. The second phase seeks the multivariate solution or optimum to satisfy the criteria simultaneously. The isosceles orthogonality criteria require the simultaneous solution of a system of nonlinear equations, such that u_j ·v = 0 for all j. The residual criteria require that ||v'|| be minimized for a specified vector norm.

Individual subroutines

DNQSOL by Lawson and Krogh solves systems of nonlinear equations. It is based on MINPACK routines by Moré et. al., and uses Powell's hybrid method.
DHFTI by Lawson and Hanson solves least-squares problems by Householder reflections. It is included in the package for sake of comparison.
L1 by N. Abdelmalek solves an overdetermined system of linear equations according to the ℓ¹ norm. It is included in the package for sake of comparison.
CHEB by Barrodale and Phillips solves an overdetermined system of linear equations according to the ℓ^∞ metric. It is included in the package for sake of comparison.
QSORTD by R. Renka passively sorts double precision numbers.
DMED by A. Miller finds the median of an array of double precision numbers.
DAFDIV is based on a program by Julio to check if division will overflow.
DAMAX uses the method of Birbil et. al. to find the maximum of an array and its derivative.
DADD DOT compute the sum and the dot product, using W. Kahan's compensated addition. These are placed in a separate file, because they should be compiled with no optimization.
L2NO finds the factor to orthogonalize one vector against another according to the ℓ² metric, using the high accuracy formula given by W. Kahan [1981]
L1NMIN finds the factor to orthogonalize one vector against another according to the ℓ¹ metric. (see appendix)
ROWRED BAKSUB solve systems of linear equations by row reduction with complete pivoting and back-substitution.
L1NFG LA1NFG L8NFG LINFG LPNFG LONFG MANFG For a proposed solution, these subroutines compute the norm of the resultant vector and its gradient, according to the ℓ¹, approximate ℓ¹, ℓ⁸, ℓ^∞, ℓ^p, Lorentzian, and median absolute norms, respectively.
L1OFJ L2OFJ L8OFJ LIOFJ LOOFJ CMOFJ For a proposed solution, these subroutines compute the dot product of the resultant vector with each column of U and the Jacobian, according to the ℓ¹, ℓ², ℓ⁸, ℓ^∞, Lorentzian, and Comedian criteria, respectively.
GOBRAK GOBS Bracket and bisect to solve an equation of a single variable. GOBS has been written specially to consider the possibility of an interval for which f(x) = 0
GOSWEL GOGOL Minimize a function of a single variable by the golden section algorithm.
GOLIN GOCJUG Minimize a function of several variables by conjugate gradient descent, using the hybrid method of Dai & Yuan.
GOLILY Minimize a function of several variables, according to a variant of the Nelder-Mead method. [Nazareth & Tseng, 2000]
GOQR Repeatedly calls GORTH to construct a QR factorization according to a symmetric orthogonality criterion.
PDQ Factors a matrix A = PDQ^T according to a symmetric orthogonality criterion.
GOSOLV Solves a possibly overdetermined, underdetermined, or both, system of linear equations, according to a symmetric orthogonality criterion and a vector norm.

For more information, please refer to the source code. All the important subroutines have prologues explaining how to use them.

Example problems

Estimated prices and quoted prices were collected for a sample of 36 machine parts. This data set is interesting because it has a definite overall trend, but contains several gross errors. This makes it useful to test the robustness of generalized orthogonal regressions. Seven types of regression were tested.

7 regression lines and 36 points for the machine parts data

Example: the yacht problem

The yacht-hydrodynamics data [Gerritsma, et. al., 1981] considers the efficiency of a sailboat hull as a function of some parameters of its shape. Specifically, the input data is:

Longitudinal position of the center of buoyancy, adimensional.
Prismatic coefficient, adimensional.
Length-displacement ratio, adimensional.
Beam-draught ratio, adimensional.
Length-beam ratio, adimensional.
Froude number, adimensional.

and the quantity to be predicted is residuary resistance per unit weight of displacement, adimensional.

Machine learning is powerful because you don't have to understand how it works in order to use it. You don't have to understand the subject being studied either — although the practitioner should if possible. If not possible, he should defer to a better practitioner. In the present problem, the purpose really is not to say anything about yacht hulls, it is only to illustrate techniques for linear regression.

The test program YACHT calls the classical subroutines DHFTI, CHEB, and L1 to compute the ℓ², ℓ^∞, and ℓ¹ regressions. Subroutine GORTH is then called to attempt to replicate these results.

Accuracy of the computed solution may be assessed with the coefficient of determination, defined as:

	c_d = 1 - ∑(y_i - ŷ_i)² / ∑(y_i - y_ave)²;

This statistic lends itself to being modified to suit the ℓ¹ and ℓ^∞ metrics:

	c_d¹ = 1 - ∑|y_i - ŷ_i| / ∑|y_i - y_med|
     c_d^∞ = 1 - max(y_i - ŷ_i) / ½(max(y_i) - min(y_i))

The GORTH ℓ² result has the same coefficient of determination as DHFTI up to machine precision. The GORTH ℓ¹ result has the same coefficient of determination as L1 up to 11 significant digits. The GORTH ℓ^∞ result has a coefficient of determination 0.3% less than that of CHEB.

Regressions are computed for the Lorentzian and Median residual criteria, and the Comedian, Lorentzian, ℓ¹, and ℓ^∞ isosceles criteria. As should be expected, the best coefficient of determination according to the ℓ¹, ℓ², and ℓ^∞ norms was found by minimizing these residuals.

Graph of accuracy regression by residual minimization on the yacht data

From the graph it appears that the Lorentzian criterion is lower order than ℓ¹, and that the median criterion is lower still.

The isosceles criteria did not perform especially well on this problem. The isosceles-Lorentzian was quite similar to the ℓ² result. The taxicab criterion was noticeably bad.

Graph of accuracy of regression by isosceles orthogonality of the yacht data

Example: factoring the iris data

This well-known data set [Anderson, 1935] consists of 4 measurements each of 150 plant specimens, representing 3 species of iris. The 150×4 matrix was factored by algorithm PDQ according to the ℓ¹, ℓ^2,, ℓ^∞, and Lorentzian norms. The ℓ² factorization is the singular value decomposition; the other factorizations are analogues of the singular value decomposition. The first two columns of the resulting matrix P were plotted.

Plot of points in one distinct and two indistinct clusters

The graphs are not identical, but the differences are slight. It is not apparent that there is any advantage of using one of these alternative factorizations instead of the SVD.

Example: a toy AX = B problem

Program t_go solves a small artificial linear system by a scheme of factoring matrix A, then in turn solving the overdetermined and underdetermined sides of the problem, according to the ℓ², ℓ¹, ℓ^∞, and Lorentzian norms and their corresponding isosceles orthogonality criteria. It produces the results:

 at begin, A is: 
    2.000       -3.000       -1.000       -6.000       -2.000        0.000        5.000       -5.000    
   -1.000        1.000        0.000        5.000        4.000       -1.000        3.000        2.000    
    0.000        2.000        2.000        3.000        7.000       -4.000        2.000        2.000
    
 B is: 
   -3.000        4.000        6.000       -1.000       -1.000    
    2.000        2.000       -1.000        0.000        0.000    
    0.000       -2.000        0.000        1.000        2.000
    
 Call gosolv with criterion: 2

 Solution X:
   -1.779        0.000      -0.1073E-02   0.1692E-02    0.000    
  -0.7131E-05   0.1455E-03  -0.2394E-03   0.3388E-03    0.000    
  -0.6376E-02   -2.247      -0.3310E-01   0.5405        1.000    
  -0.7910E-02   0.6870E-01  -0.5134E-01  -0.9386E-05    0.000    
  -0.5331E-04   0.5328E-03   0.3486      -0.3292E-01    0.000    
   0.4114E-01  -0.3542       0.4314E-04  -0.4993E-04    0.000    
   0.1007       0.4333       0.8089E-01  -0.1562E-01    0.000    
  -0.5400E-07  -0.3287E-04   -1.191       0.8994E-01    0.000    

 Call gosolv with criterion: T

 Solution X:
   -1.626       0.1074       0.7898E-01  -0.1406E-01    0.000    
   0.6986E-02  -0.2407E-01  -0.3202E-01   0.5318E-02    0.000    
  -0.1004E-01   -3.145       0.4602E-01   0.5829        1.000    
   0.3411E-01  -0.1178      -0.1564       0.2596E-01    0.000    
  -0.3366E-01   0.5994       0.1337      -0.2234E-01    0.000    
   0.2920E-01  -0.1008      -0.1339       0.2223E-01    0.000    
   0.1041       0.1292       0.2545      -0.2847E-01    0.000    
   0.2401E-01  -0.8285E-01  -0.7697       0.2390E-01    0.000    

 Call gosolv with criterion: S

 Solution X:
   0.4649E-02  -0.7797E-02  -0.6659E-02  -0.2578E-03    0.000    
  -0.9296E-01   -1.790       -1.329       0.9056E-02    0.000    
  -0.6749E-01  -0.1109E-01   0.1383        0.000        0.000    
   0.6018        1.097       0.4295E-01    0.000        0.000    
  -0.2022       0.2533E-01   0.4417E-01    0.000        0.000    
  -0.3019E-03   0.5061E-03  -0.6873      -0.2849      -0.6000    
  -0.3253E-01   0.6059E-01   0.4958E-01  -0.1366      -0.2000    
  -0.2686E-02  -0.9932      -0.4526       0.5784E-01    0.000    

 Call gosolv with criterion: R

 Solution X:
  -0.2310E-01  -0.9739       0.1212        0.000       0.4919    
   0.9348E-02  -0.4316      -0.6656       0.2311       0.7251    
   -1.141      -0.9810       0.7040E-01   0.3141       0.1186E-01
   0.4742E-01   0.1531      -0.2576      -0.2926E-01  -0.6250E-01
   0.1508       0.4928E-01   0.1809      -0.8519E-07   0.3152E-01
   0.3612E-01    0.000      -0.1961      -0.1621E-01  -0.1630    
  -0.4572E-01   0.4741       0.2428      -0.3366E-01  -0.5064E-01
   0.6503      -0.4637      -0.2867        0.000      -0.2895E-01

This shows that for an underdetermined system, widely different answers are possible. Solution by generalized orthogonality offers a distinct approach.

Conclusion

Generalized orthogonality seems to be mostly of theoretical mathematical interest. Package GOPAK is able to compute these vectors using standard optimization techniques for some small problems. High computational expense and numerical difficulties are prohibitive for larger problems.

It has not been shown that there is a compelling practical reason to prefer these generalized methods over traditional linear algebra. They are offered as curiosities, with the hope that a use for them remains to be discovered.

Acknowledgement

I would like to thank Kevin Vixie for a very productive discussion.

Appendix

Formula for Gram-Schmidt orthogonalization

	Given vectors u and v, seek a scalar c such that u and (v-cu) are orthogonal.

     u · (v-cu) = 0
     ∑(u_iv_i - cu_i²) = 0
     c = ∑u_iv_i / ∑u_i²
     c = u·v / u·u

This also minimizes the length of v-cu, because:

     ℓ² = ∑(v_i-cu_i)²
     
Differentiate with respect to c and set equal to zero:

     dℓ²/dc = 2 ∑(v_i-cu_i)(-u_i) = 0
     ∑(-u_iv_i + cu_i²) = 0
     c = u·v / u·u

Expanding absolute value bars

The composite function f(g(x)) = |g(x)| may be written:

     |g(x)| = {-g(x), g(x) < 0}
              {0,     g(x) = 0}
              {+g(x), g(x) > 0}

Its derivative df/dx = (df/dg)(dg/dx) is then:

     df/dx = {-dg/dx,    g(x) < 0}
             {+dg/dx,    g(x) > 0}
             {undefined, g(x) = 0}

For present purposes, the derivative at zero shall be treated as zero, 
as this is the average of the value on either side.

A common approximation of the absolute value

     a = |u| ≈ (u² + τ)^½
     da/dx ≈ (u / (u² + τ)^½) du/dx
where τ is very small.

Differentiable approximation for a quantile of a data set

Given an array w_i, 1...i...n, any quantile w_q must
satisfy the counting function:

     ∑{-1, w_i < w_q} = (1 - q)n - qn
      {+1, w_i > w_q}
           
Using the approximation of the absolute value, this may be written:

     ∑(w_i - w_q) / ((w_i - w_q)² + τ)^½ = (1 - 2q)n
     
Differentiating with w as a function of x shows that:

∂w_q/∂x ≈ (∑ ∂w_i/∂x ((w_i - w_q)² + τ)^-3/2) / (∑ ((w_i - w_q)² + τ)^-3/2)

Smooth approximation to the maximum and its derivative (Birbil et. al., 2003)

Define the function f(a,b) = ½(a + b + ((a-b)² + τ)^½) 
where τ is very small.  Then,

     max(w₁, w₂, ... w_i ... w_n) ≈ 
     h = f(f(...f(...f(0, w₁), w₂), ..., w_i), ..., w_n)
     
Let f⁽ⁱ⁾ denote the approximate maximum of the first i elements of w.  Then,

     ∂h/∂x = ∂f⁽ⁿ⁾/∂w_n ∂w_n/∂x + ...
        + ∂f⁽ⁱ⁺¹⁾/∂f⁽ⁱ⁾(∂f⁽ⁱ⁾/∂w_i ∂w_i/∂x + ...
          + ∂f⁽²⁾/∂f⁽¹⁾(∂f⁽¹⁾/∂w₁ ∂w₁/∂x))
          
Now let p(a,b) = ∂f/∂a = ½(1 + (a-b)/((a-b)² + τ)^½)
        q(a,b) = ∂f/∂b = ½(1 - (a-b)/((a-b)² + τ)^½)

Observe that while f is symmetric with respect to a and b, its derivatives are not.
The approximate derivative of the maximum may be written: 

     ∂h/∂x = q(f^(n-1),w_n) ∂w_n/∂x + p(f^(n-1),w_n)(q(f^(n-2),w_n-1) ∂w_n-1/∂x + ...
       + p(f⁽ⁱ⁾,w_i+1)(q(f^(i-1),w_i) ∂w_i/∂x + ...
         + p(f⁽¹⁾,w₂)(q(0,w₁) ∂w₁/∂x) ...) ...)

Comedian orthogonality existence and uniqueness

Given vectors u and v, seek a scalar c such that:

     com(u, (v-cu)) = 0
     med(u_i(v_i-cu_i)) = 0

Define the function

     f(c) = med(u_iv_i - cu_i²)

Each component of the resultant vector is continuous with respect to c.
Let the components be written in ascending order.  In response to a change in
c, for the component in median position in this ordering to change places
with its adjacent component, u_mv_m - cu_m² = u_m+1v_m+1 - cu_m+1²,
because they are continuous.  Therefore, f(c) is continuous.

    df/dc = -u_m² ≤ 0, so f(c) is decreasing.

For a sufficiently large negative c, f(c) = med(u_iv_i - cu_i²) > 0
For a sufficiently large positive c, f(c) = med(u_ib_i - cu_i²) < 0

f(c) is continuous, monotonic, and has values less than and greater than zero.
Therefore the equation f(c) = 0 has a solution, either at a unique point
or a single interval.

Algorithm to minimize the ℓ1 length of a linear combination of two vectors

Given vectors u and v, seek a scalar c to minimize:

     ℓ¹ = ∑|v_i - cu_i| 

     ∑|v_i - cu_i| = ∑ { v_i - cu_i, v_i - cu_i > 0}
                     {-v_i + cu_i, v_i - cu_i < 0}

Differentiate with respect to c and set equal to zero:

     dℓ/dc = ∑{-u_i, v_i - cu_i > 0} = ∑ {-u_i, {c < v_i/u_i, u_i > 0}} = 0
              { u_i, v_i - cu_i < 0}     {     {c > v_i/u_i, u_i < 0}}
                                      { u_i, {c > v_i/u_i, u_i > 0}}
                                      {     {c < v_i/u_i, u_i < 0}}

Combining cases:

    ∑{-|u_i|, c < v_i/u_i} = 0
     {0,     c = v_i/u_i}
     {+|u_i|, c > v_i/u_i}
     

Thus, c must be chosen so that the sum of absolute values of u_i for 
terms satisfying c > v_i/u_i must equal the sum of absolute values 
of u_i for terms for which c < v_i/u_i.

begin
Compute ½∑|u_i|
Compute v_i/u_i for all i and sort.
Accumulate total t, adding each term |u_i| in order of ascending v_i/u_i
When t_j ≥ ½∑|u_i| at some step j, return c = v_j/u_j

ℓ1 Isosceles orthogonality against n vectors

Given a vector of m components v_i, and n vectors of m components u_ij,
seek factors x and y, such that the linear combination  v' = yv_i - ∑x_lu_il
satisfies the conditions

     f_j = v' ⋄ u_j = 0
     v' ⋄ u_j = ½∑(|yv_i - ∑x_lu_il + u_ij| - |yv_i - ∑x_lu_il - u_ij|)

To solve this system of equations numerically, use the Jacobian matrix:


     ∂f_j/∂x_k = ∑{+u_ik, +u_ij < v_i' < -u_ij}
                {-u_ik, -u_ij < v_i' < +u_ij}
                {0, otherwise          }

     ∂f_j/∂y = ∑{-v_i, +u_ij < v_i' < -u_ij}
               {+v_i, -u_ij < v_i' < +u_ij}
               {0, otherwise         }

Because ℓ1 Isosceles orthogonality is not homogeneous, that is, u ⊥ v ⇏ u ⊥ av
in general for a scalar a, require also unit length of v'
For the ℓ1 norm, this gives:

     f₀ = ∑|v_i'| - 1 = ∑|yv_i - ∑x_lu_il| - 1 = 0

     ∂f₀/∂x_k = ∑{+u_ik, v_i' < 0}
                {0,    v_i' = 0}
                {+u_ik, v_i' > 0}

     ∂f₀/∂y = ∑{-v_i, v_i' < 0}
               {0,   v_i' = 0}
               {+v_i, v_i' > 0}

Relationship of ℓ1 geometry and fuzzy logic

Interestingly, the taxicab-isosceles formula can be obtained by the use 
of fuzzy logic. The standard intersection operator in fuzzy logic is 
the minimum.  The standard union operator in fuzzy logic is the maximum. 
To obtain the codeviation, begin with an expression for logical equivalence:

     u ⇔ v = (u ∧ v) ∨ (¬u ∧ ¬v)
     max(min(u, v), min(1-u, 1-v))

In switching from logical variables to scalar variables, the effects of the
min and max operations are unchanged. For negation, values must be
reflected across 0 instead of ½, thus ¬u becomes -u rather than 1-u,
resulting in:

     u ⇔ v = max(min(u, v), min(-u, -v))

Compare the expression derived from geometry with the expression derived
from fuzzy logic; they are identical.

     ½(|u+v| - |u-v|) ≡ max(min(u, v), min(-u, -v))

This demonstrates a remarkable relationship between fuzzy logic and ℓ¹
geometry. Two distinct approaches have led to the same formula. 
For the purposes of computation, the min/max formula is to be
preferred, since it is not susceptible to subtractive cancellation.

ℓ∞ Isosceles orthogonality against n vectors

Given a vector of m components v_i, and n vectors of m components u_ij,
seek factors x and y, such that the linear combination  v' = yv_i - ∑x_lu_il
satisfies the condition 

     f_j = v' ▫ u_j = 0
     v' ▫ u_j = ½(max|yv_i - ∑x_lu_il + u_ij| - max|yv_i - ∑x_lu_il - u_ij|) = 0

To solve this system of equations numerically, use the Jacobian matrix:

∂f_j/∂x_k = ½({+u_pk, v_p' + u_pj < 0} - {+u_qk, v_q' - u_qj < 0})
            {0,    v_p' + u_pj = 0}   {0,   v_q' - u_qj = 0}
            {-u_pk, v_p' + u_pj > 0}   {-u_qk, v_q' - u_qj > 0}

∂f_j/∂y = ½({-v_p, v_p' + u_pj < 0} - {-v_q, v_q' - u_qj < 0})
           {0,   v_p' + u_pj = 0}   {0,   v_q' - u_qj = 0}
           {+v_p, v_p' + u_pj > 0}   {+v_q, v_q' - u_qj > 0}

where p is the index of the maximum element of |v' + u_j|
and q is the index of the maximum element of |v' - u_j|

Because ℓ∞ Isosceles orthogonality is not homogeneous, require unit length of v'

     f₀ = max|v_i'| - 1 = max|yv_i - ∑x_lu_il| - 1 = 0

     ∂f₀/∂x_k = {+u_rk, v_r' < 0}
               {0,    v_r' = 0}
               {-u_rk, v_r' > 0}

     ∂f₀/∂y = {-v_r, v_r' < 0}
              {0,   v_r' = 0}
              {+v_r, v_r' > 0}

where r is the index of the maximum element of |v'|

Lorentzian-Isosceles orthogonality against n vectors

Given a vector of m components v_i, and n vectors of m components u_ij,
seek factors x and y, such that the linear combination  v' = yv_i - ∑x_lu_il
satisfies the condition 

     f_j = ∑log(1 + |v' + u_j|) - log(1 + |v' - u_j|) = 0
     
To solve this system of equations numerically, use the Jacobian matrix:

∂f_j/∂x_k = ∑({+1/(1 - v_i' - u_ij), v_i'+u_ij < 0} - {+1/(1 - v_i' + u_ij), v_i'-u_ij < 0}) u_ik
            {0,                 v_i'+u_ij = 0}    {0,                 v_i'-u_ij = 0}
            {-1/(1 + v_i' + u_ij), v_i'+u_ij > 0}   {-1/(1 + v_i' - u_ij), v_i'-u_ij > 0}

∂f_j/∂y = ∑({+1/(1 - v_i' - u_ij), v_i'+u_ij < 0} - {+1/(1 - v_i' + u_ij), v_i'-u_ij < 0}) (-v_i)
           {0,                 v_i'+u_ij = 0}    {0,                 v_i'-u_ij = 0}
           {-1/(1 + v_i' + u_ij), v_i'+u_ij > 0}   {-1/(1 + v_i' - u_ij), v_i'-u_ij > 0}


Because Lorentzian-Isosceles orthogonality is not homogeneous, require unit length of v'

     f₀ = ∑log(1+|v_i'|) - 1 = 0

     ∂f₀/∂x_k = ∑{+u_ik/(1-v_i'), v_i' < 0}
                {0,           v_i' = 0}
                {-u_ik/(1+v_i'), v_i' > 0}

     ∂f₀/∂y = ∑{-v_i/(1-v_i'), v_i' < 0}
               {0,           v_i' = 0}
               {+v_i/(1+v_i'), v_i' > 0}

Comedian orthogonality against n vectors

Given a vector of m components v_i, and n vectors of m components u_ij,
seek factors x, such that the linear combination v' = v_i - ∑x_lu_il
satisfies the condition 

     f_j = com(v', u_j) = 0
     f_j = med((v_i - ∑x_lu_il)u_ij) = 0
     f_j = u_oj(v_o - ∑x_lu_ol) = 0

where o is the index of the median.  To solve this system of equations numerically, 
use the Jacobian matrix:

     ∂f_j/∂x_k = -u_oju_ok

Euclidean orthogonality against n vectors

Given a vector of m components v_i, and n vectors of m components u_ij,
seek factors x, such that the linear combination  v' = v_i - ∑x_ku_ik
satisfies the condition 

     u_j · v' = 0
     u_j·v_i - ∑u_j·x_ku_k = 0
     ∑∑u_iju_ikx_k = ∑u_ijv_i
     U^TUx = U^Tv
     x = U⁺v

Which may be recognized as the normal equations for the least-squares 
solution of the linear system.  Alternatively, solve the system numerically 
using the Jacobian matrix:

     ∂f_j/∂x_k = -∑u_iju_ik

ℓ1 residual orthogonality against n vectors

Given a vector of m components v_i, and n vectors of m components u_ik,
seek factors x, such that the linear combination v' = v_i - ∑x_ku_ik
has minimum ℓ1 norm.

     f = ∑|v_i'| = ∑|v_i - ∑x_ku_ik|

To minimize this function numerically, use the gradient:

     ∂f/∂x_k = ∑{+u_ik, v_i' < 0}
               {0,    v_i' = 0}
               {-u_ik, v_i' > 0}

ℓ∞ residual orthogonality against n vectors

Given a vector of m components v_i, and n vectors of m components u_ik,
seek factors x, such that the linear combination v' = v_i - ∑x_ku_ik
has minimum ℓ∞ norm.

     f = max|v_i'| = max|v_i - ∑x_ku_ik|

To minimize this function numerically, use the gradient:

     ∂f/∂x_k = {+u_rk, v_r' < 0}
              {0,    v_r' = 0}
              {-u_rk, v_r' > 0}

where r is the index of the maximum element of |v'|

Lorentzian residual orthogonality against n vectors

Given a vector of m components v_i, and n vectors of m components u_ik,
seek factors x, such that the linear combination v' = v_i - ∑x_lu_il
has minimum Lorentzian norm.

     f = ∑log(1 + |v_i'|) = ∑log(1 + |v_i - ∑x_lu_il|)

To minimize this function numerically, use the gradient:

     ∂f/∂x_k = ∑{+u_ik/(1-v_i'), v_i' < 0}
               {0,           v_i' = 0}
               {-u_ik/(1+v_i'), v_i' > 0}

Median absolute residual orthogonality against n vectors

Given a vector of m components v_i, and n vectors of m components u_ik,
seek factors x, such that the linear combination v' = v_i - ∑x_lu_il
has minimum median absolute value.

     f = med(|v_i'|) = med(|v_i - ∑x_lu_il|)

To minimize this function numerically, use the gradient.

     ∂f/∂x_k = {+u_ok, v_o' < 0}
              {0,    v_o' = 0}
              {-u_ok, v_o' > 0}

where o is the index of the median element of the array |v_i-∑x_lu_il|

References

B. D. Roberts, "On the geometry of abstract vector spaces", Tôhuku Math. Jour. v.39 pp.49-59, 1934. cited in [Alonso & Benítez, 1988]
Edgar Anderson, "The Irises of the Gaspé peninsula", Bulletin of the American Iris Society, v.59, pp.2-5, 1935.
Garrett Birkhoff, "Orthogonality in linear metric spaces", Duke Math. J. v.1 n.2 pp.169-172, 1935. cited in [Alonso & Benítez, 1988]
R. C. James, "Orthogonality in normed linear spaces", Duke Math. Jour. v.12 pp.291-301, 1945. cited in [Alonso & Benítez, 1988]
Charles L. Lawson and Richard J. Hanson, Solving Least Squares Problems, Prentice-Hall, 1974.
I. Barrodale and C. Phillips, "Algorithm 495: Solution of an Overdetermined System of Linear Equations in the Chebychev Norm", ACM Transactions on Mathematical Software (TOMS) v.1 i.3 pp.264-270, September 1975.
Nabih N. Abdelmalek, "Algorithm 551: A Fortran Subroutine for the L₁ Solution of Overdetermined Systems of Linear Equations", ACM Transactions on Mathematical Software (TOMS) v.6 i.2 pp.228-230, June 1980.
J. Gerritsma, R. Onnink, and A. Versluis, "Geometry, resistance and stability of the delft systematic yacht hull series", International Shipbuilding Progress v.28 pp.276-297, 1981. Dataset at UCI
William Kahan, "Computing Cross-Products and Rotations in 2- and 3- Dimensional Euclidean Spaces", University of California at Berkeley, 1981-2016.
Raymond W. Freese, Charles R. Dimminie, and Edward Z. Andalafte, "A Study of Generalized Orthogonality Relations in Normed Linear Spaces", Mathematische Nachrichte v.122 i.1 pp.197-204, 1985.
M. Lichman (Admin.), UCI Machine Learning Repository, Irvine, CA: University of California, School of Information and Computer Science, 1987-present
Javier Alonso and Carlos Benítez, "Orthogonality in Normed Linear Spaces: A Survey: Part I: Main Properties", Extracta Mathematicae v.3 n.1 pp.1-15, 1988.
Fred T. Krogh, Charles L. Lawson, and W. Van Snyder, "MATH77 and mathc90: Libraries of Mathematical Subprograms in Fortran 77 and C". Pasadena, California: JPL Section 395, Supercomputing and Computational Mathematics Group, 1992-1998. Available from netlib
George Klir and Bo Yuan, Fuzzy Sets and Fuzzy Logic, Prentice Hall, 1995.
Ziya Akça and Rüstem Kaya, "On the Taxicab Trigonometry", Jour. of Inst. of Math. & Comp. Sci. (Math. Ser.) v.10 n.3 pp.151-159, 1997.
Michael Falk, "On MAD and Comedians", Ann. Inst. Statist. Math. v.49 n.4 pp.615-644, 1997.
Larry Nazareth & Paul Tseng, "Gilding the Lily: a Variant of the Nelder-Mead Algorithm based on Golden-Section Search", January 2000, revised July 2001
S. I. Birbil, S.-C. Fang, J.B.G. Frenk and S. Zhang "Recursive Approximation of the High Dimensional max Function" Erasmus Research Institute of Management, 8 January 2003.
Bijan Bidabad, "L₁ Norm Based Computational Algorithms", 4 January 2005. Available from the author's website
Andreas Klein, "A Generalized Kahan-Babuška-Summation-Algorithm", 21 April 2005
Sung-Hyuk Cha, "Comprehensive Survey on Distance/Similarity Measures between Probability Density Functions" International Journal of Mathematical Models and Methods in Applied Sciences v.1 i.4, 2007.
Sergios Theodoridis and Konstantinos Koutroumbas, Pattern Recognition, Fourth Edition, Academic Press/Elsevier, 2009.