# 2. Supervised Metric Learning

Supervised metric learning algorithms take as inputs points `X` and target labels `y`, and learn a distance matrix that makes points from the same class (for classification) or with close target values (for regression) close to each other, and points from different classes or with distant target values far away from each other.

## 2.1. General API

Supervised metric learning algorithms essentially use the same API as scikit-learn.

### 2.1.1. Input data

In order to train a model, you need two array-like objects, `X` and `y`. `X` should be a 2D array-like of shape `(n_samples, n_features)`, where `n_samples` is the number of points in your dataset and `n_features` is the number of attributes describing each point. `y` should be a 1D array-like of shape `(n_samples,)`, containing for each point in `X` the class it belongs to (or the value to regress for this sample, if you use MLKR for instance).

Here is an example of a dataset of two dogs and one cat (the classes are ‘dog’ and ‘cat’), with each animal represented by two numbers.

```
>>> import numpy as np
>>> X = np.array([[2.3, 3.6], [0.2, 0.5], [6.7, 2.1]])
>>> y = np.array(['dog', 'cat', 'dog'])
```

Note

You can also use a preprocessor instead of directly giving the inputs as 2D arrays. See the Preprocessor section for more details.

### 2.1.2. Fit, transform, and so on

The goal of supervised metric-learning algorithms is to transform points into a new space, in which the distance between two points from the same class will be small, and the distance between two points from different classes will be large. To do so, we fit the metric learner (for example, NCA).

```
>>> from metric_learn import NCA
>>> nca = NCA(random_state=42)
>>> nca.fit(X, y)
NCA(init='auto', max_iter=100, n_components=None,
preprocessor=None, random_state=42, tol=None, verbose=False)
```

Now that the estimator is fitted, you can use it on new data for several purposes.

First, you can transform the data into the learned space, using `transform`. Here we transform two points into the new embedding space.

```
>>> X_new = np.array([[9.4, 4.1], [2.1, 4.4]])
>>> nca.transform(X_new)
array([[ 5.91884732, 10.25406973],
[ 3.1545886 , 6.80350083]])
```

Also, as explained before, our metric learner has learned a distance between points. You can use this distance in two main ways:

- You can either return the distance between pairs of points using the `score_pairs` function:

```
>>> nca.score_pairs([[[3.5, 3.6], [5.6, 2.4]], [[1.2, 4.2], [2.1, 6.4]]])
array([0.49627072, 3.65287282])
```

- Or you can return a function that will return the distance (in the new space) between two 1D arrays (the coordinates of the points in the original space), similarly to distance functions in `scipy.spatial.distance`.

```
>>> metric_fun = nca.get_metric()
>>> metric_fun([3.5, 3.6], [5.6, 2.4])
0.4962707194621285
```

Note

If the metric learner that you use learns a Mahalanobis distance (as is the case for all algorithms currently in metric-learn), you can get the plain learned Mahalanobis matrix using `get_mahalanobis_matrix`.

```
>>> nca.get_mahalanobis_matrix()
array([[0.43680409, 0.89169412],
[0.89169412, 1.9542479 ]])
```
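The outputs shown above are consistent with the usual Mahalanobis form \(d(\mathbf{x}, \mathbf{x}') = \sqrt{(\mathbf{x} - \mathbf{x}')^T \mathbf{M} (\mathbf{x} - \mathbf{x}')}\). As a quick sanity check, the NumPy sketch below plugs the matrix printed above into that formula for the two pairs passed to `score_pairs`. The helper `mahalanobis` is defined here for illustration only; it is not a metric-learn function.

```python
import numpy as np

# Learned matrix copied from the get_mahalanobis_matrix() output above.
M = np.array([[0.43680409, 0.89169412],
              [0.89169412, 1.9542479]])

def mahalanobis(u, v, M):
    """Hypothetical helper: sqrt((u - v)^T M (u - v))."""
    diff = np.asarray(u, dtype=float) - np.asarray(v, dtype=float)
    return float(np.sqrt(diff @ M @ diff))

# Same pairs as in the score_pairs example.
d1 = mahalanobis([3.5, 3.6], [5.6, 2.4], M)
d2 = mahalanobis([1.2, 4.2], [2.1, 6.4], M)
print(d1, d2)  # agrees with the score_pairs output up to rounding of M
```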

### 2.1.3. Scikit-learn compatibility

All supervised algorithms are scikit-learn estimators (`sklearn.base.BaseEstimator`) and transformers (`sklearn.base.TransformerMixin`), so they are compatible with pipelines (`sklearn.pipeline.Pipeline`) and scikit-learn model selection routines (`sklearn.model_selection.cross_val_score`, `sklearn.model_selection.GridSearchCV`, etc.).

## 2.2. Algorithms

### 2.2.1. `LMNN`

Large Margin Nearest Neighbor Metric Learning (`LMNN`)

LMNN learns a Mahalanobis distance metric in the kNN classification setting. The learned metric attempts to keep close k-nearest neighbors from the same class, while keeping examples from different classes separated by a large margin. This algorithm makes no assumptions about the distribution of the data.

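The distance is learned by solving the following optimization problem:

\[\min_{\mathbf{L}} \sum_{i,j} \eta_{ij} \|\mathbf{L}(\mathbf{x}_i - \mathbf{x}_j)\|^2 + c \sum_{i,j,l} \eta_{ij} (1 - y_{il}) \left[1 + \|\mathbf{L}(\mathbf{x}_i - \mathbf{x}_j)\|^2 - \|\mathbf{L}(\mathbf{x}_i - \mathbf{x}_l)\|^2\right]_+\]

where \(\mathbf{x}_i\) is a data point, \(\mathbf{x}_j\) is one of its k-nearest neighbors sharing the same label, and \(\mathbf{x}_l\) ranges over all the other instances within that region with different labels. \(\eta_{ij}, y_{ij} \in \{0, 1\}\) are both indicators: \(\eta_{ij} = 1\) means that \(\mathbf{x}_{j}\) is one of the k-nearest neighbors (with the same label) of \(\mathbf{x}_{i}\), and \(y_{il} = 0\) means that \(\mathbf{x}_{i}\) and \(\mathbf{x}_{l}\) belong to different classes. \(c\) is a positive constant weighting the second term, and \([\cdot]_+ = \max(0, \cdot)\) is the hinge loss.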
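To make the two parts of this objective concrete (a pull term over same-class target neighbors, and a hinge-penalized push term over differently labeled points), here is a minimal NumPy sketch that evaluates the loss for a fixed \(\mathbf{L}\). The function `lmnn_loss`, its arguments, and the choice of picking target neighbors by Euclidean distance in the input space are assumptions for illustration; this is not the library's implementation and it does no optimization.

```python
import numpy as np

def lmnn_loss(L, X, y, k=1, c=1.0):
    """Hypothetical helper: LMNN objective value for a fixed L."""
    n = X.shape[0]
    # Squared Euclidean distances in input space, used to pick target neighbors.
    D0 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    # Squared distances after the linear map x -> Lx.
    Z = X @ L.T
    D = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    pull, push = 0.0, 0.0
    for i in range(n):
        same = np.flatnonzero((y == y[i]) & (np.arange(n) != i))
        diff = np.flatnonzero(y != y[i])
        targets = same[np.argsort(D0[i, same])[:k]]  # eta_ij = 1
        for j in targets:
            pull += D[i, j]
            # Hinge term over differently labeled points only.
            push += np.maximum(0.0, 1.0 + D[i, j] - D[i, diff]).sum()
    return pull + c * push

# Toy data: a class-1 point sits between the two class-0 points.
X_toy = np.array([[0.0, 0.0], [2.0, 0.0], [1.0, 0.0]])
y_toy = np.array([0, 0, 1])
loss = lmnn_loss(np.eye(2), X_toy, y_toy, k=1, c=1.0)
print(loss)
```

In this toy case the differently labeled point lies closer to each class-0 point than its target neighbor does, so the push term is non-zero.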

```
import numpy as np
from metric_learn import LMNN
from sklearn.datasets import load_iris
iris_data = load_iris()
X = iris_data['data']
Y = iris_data['target']
# verbose is a constructor argument; fit only takes the data and labels
lmnn = LMNN(k=5, learn_rate=1e-6, verbose=False)
lmnn.fit(X, Y)
```

References:

[1] Weinberger et al. Distance Metric Learning for Large Margin Nearest Neighbor Classification. JMLR 2009

[2] Wikipedia entry on Large Margin Nearest Neighbor

### 2.2.2. `NCA`

Neighborhood Components Analysis (`NCA`)

NCA is a distance metric learning algorithm which aims to improve the accuracy of nearest neighbors classification compared to the standard Euclidean distance. The algorithm directly maximizes a stochastic variant of the leave-one-out k-nearest neighbors (KNN) score on the training set. It can also learn a low-dimensional linear transformation of data that can be used for data visualization and fast classification.

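NCA uses the decomposition \(\mathbf{M} = \mathbf{L}^T\mathbf{L}\) and defines the probability \(p_{ij}\) that \(\mathbf{x}_i\) is the neighbor of \(\mathbf{x}_j\) as a softmax over the squared Mahalanobis distances:

\[p_{ij} = \frac{\exp(-\|\mathbf{L}\mathbf{x}_i - \mathbf{L}\mathbf{x}_j\|_2^2)}{\sum_{l \neq i} \exp(-\|\mathbf{L}\mathbf{x}_i - \mathbf{L}\mathbf{x}_l\|_2^2)}, \qquad p_{ii} = 0\]

Then the probability that \(\mathbf{x}_i\) will be correctly classified by the stochastic nearest neighbors rule is:

\[p_i = \sum_{j:\, j \neq i,\; y_j = y_i} p_{ij}\]

The optimization problem is to find the matrix \(\mathbf{L}\) that maximizes the sum of the probabilities of being correctly classified:

\[\mathbf{L} = \arg\max_{\mathbf{L}} \sum_i p_i\]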
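The quantities above can be evaluated directly for a fixed \(\mathbf{L}\). The following NumPy sketch computes the softmax neighbor probabilities (with \(p_{ii} = 0\)), the per-point probability of correct classification, and their sum. `nca_objective` is a hypothetical helper written for this illustration, not part of metric-learn, and it only evaluates the objective without maximizing it.

```python
import numpy as np

def nca_objective(L, X, y):
    """Hypothetical helper: returns (sum_i p_i, matrix of p_ij) for a fixed L."""
    Z = X @ L.T
    D = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(D, np.inf)            # enforces p_ii = 0
    D = D - D.min(axis=1, keepdims=True)   # shift rows for numerical stability
    P = np.exp(-D)
    P /= P.sum(axis=1, keepdims=True)      # row-wise softmax over j != i
    same = y[:, None] == y[None, :]
    p_i = (P * same).sum(axis=1)           # mass on same-class neighbors
    return p_i.sum(), P

# Two coincident class-0 points and one distant class-1 point.
X_toy = np.array([[0.0], [0.0], [10.0]])
y_toy = np.array([0, 0, 1])
objective, P = nca_objective(np.eye(1), X_toy, y_toy)
print(objective)
```

Here the two class-0 points pick each other with probability close to 1, while the lone class-1 point has no same-class neighbor and contributes 0.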

```
import numpy as np
from metric_learn import NCA
from sklearn.datasets import load_iris
iris_data = load_iris()
X = iris_data['data']
Y = iris_data['target']
nca = NCA(max_iter=1000)
nca.fit(X, Y)
```

References:

[1] Goldberger et al. Neighbourhood Components Analysis. NIPS 2005

[2] Wikipedia entry on Neighborhood Components Analysis

### 2.2.3. `LFDA`

Local Fisher Discriminant Analysis (`LFDA`)

LFDA is a linear supervised dimensionality reduction method which effectively combines the ideas of Linear Discriminant Analysis and Locality-Preserving Projection. It is particularly useful when dealing with multi-modality, where one or more classes consist of separate clusters in input space. The core optimization problem of LFDA is solved as a generalized eigenvalue problem.

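The algorithm defines the Fisher local within-/between-class scatter matrices \(\mathbf{S}^{(w)}/ \mathbf{S}^{(b)}\) in a pairwise fashion:

\[\mathbf{S}^{(w)} = \frac{1}{2}\sum_{i,j=1}^{n} W_{ij}^{(w)} (\mathbf{x}_i - \mathbf{x}_j)(\mathbf{x}_i - \mathbf{x}_j)^T, \qquad \mathbf{S}^{(b)} = \frac{1}{2}\sum_{i,j=1}^{n} W_{ij}^{(b)} (\mathbf{x}_i - \mathbf{x}_j)(\mathbf{x}_i - \mathbf{x}_j)^T\]

where

\[W_{ij}^{(w)} = \begin{cases} \mathbf{A}_{i,j}/n_l & \text{if } y_i = y_j = l \\ 0 & \text{if } y_i \neq y_j \end{cases}, \qquad W_{ij}^{(b)} = \begin{cases} \mathbf{A}_{i,j}(1/n - 1/n_l) & \text{if } y_i = y_j = l \\ 1/n & \text{if } y_i \neq y_j \end{cases}\]

Here \(\mathbf{A}_{i,j}\) is the \((i,j)\)-th entry of the affinity matrix \(\mathbf{A}\), which can be calculated with local scaling methods, and `n` and `n_l` are the total number of points and the number of points per cluster `l` respectively.

The learning problem then becomes deriving the LFDA transformation matrix \(\mathbf{L}_{LFDA}\):

\[\mathbf{L}_{LFDA} = \arg\max_{\mathbf{L}} \left[\operatorname{tr}\left((\mathbf{L}^T \mathbf{S}^{(w)} \mathbf{L})^{-1} \mathbf{L}^T \mathbf{S}^{(b)} \mathbf{L}\right)\right]\]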

That is, it looks for a transformation matrix \(\mathbf{L}\) such that nearby data pairs in the same class are made close and data pairs in different classes are separated from each other; data pairs in the same class that are far apart are not forced to be close.
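As a small illustration of the pairwise construction, the sketch below builds the two scatter matrices from a given affinity matrix. It assumes the weighting from Sugiyama's paper: same-class pairs get weight `A_ij / n_l` in the within-class matrix and `A_ij * (1/n - 1/n_l)` in the between-class matrix, while different-class pairs get `0` and `1/n` respectively. The helper `lfda_scatter` and the all-ones affinity are assumptions for illustration (with `A = 1` everywhere, the construction reduces to ordinary Fisher discriminant analysis); the affinity computed by local scaling is not reproduced here.

```python
import numpy as np

def lfda_scatter(X, y, A):
    """Hypothetical helper: local within/between-class scatter matrices."""
    n = X.shape[0]
    same = y[:, None] == y[None, :]
    n_l = same.sum(axis=1)[:, None]  # size of the class of each point
    # Pairwise weights (assumed form, see the lead-in above).
    W_w = np.where(same, A / n_l, 0.0)
    W_b = np.where(same, A * (1.0 / n - 1.0 / n_l), 1.0 / n)
    diff = X[:, None, :] - X[None, :, :]  # shape (n, n, n_features)
    S_w = 0.5 * np.einsum('ij,ijk,ijl->kl', W_w, diff, diff)
    S_b = 0.5 * np.einsum('ij,ijk,ijl->kl', W_b, diff, diff)
    return S_w, S_b

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(6, 2))
y_toy = np.array([0, 0, 0, 1, 1, 1])
A_ones = np.ones((6, 6))  # placeholder affinity: every pair counts equally
S_w, S_b = lfda_scatter(X_toy, y_toy, A_ones)

# With A = 1 the two matrices add up to the total scatter of the data.
X_centered = X_toy - X_toy.mean(axis=0)
print(np.allclose(S_w + S_b, X_centered.T @ X_centered))
```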

```
import numpy as np
from metric_learn import LFDA
from sklearn.datasets import load_iris
iris_data = load_iris()
X = iris_data['data']
Y = iris_data['target']
lfda = LFDA(k=2, n_components=2)
lfda.fit(X, Y)
```

References:

[1] Sugiyama. Dimensionality Reduction of Multimodal Labeled Data by Local Fisher Discriminant Analysis. JMLR 2007

[2] Tang. Local Fisher Discriminant Analysis on Beer Style Clustering.

### 2.2.4. `MLKR`

Metric Learning for Kernel Regression (`MLKR`)

MLKR is an algorithm for supervised metric learning, which learns a distance function by directly minimizing the leave-one-out regression error. This algorithm can also be viewed as a supervised variation of PCA and can be used for dimensionality reduction and high dimensional data visualization.

Theoretically, MLKR can be applied with many types of kernel functions and distance metrics; we hereafter focus the exposition on the particular instance of the Gaussian kernel and the Mahalanobis metric, as these are used in our empirical development. The Gaussian kernel is denoted as:

\[k_{ij} = \frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{d(\mathbf{x}_i, \mathbf{x}_j)}{\sigma^2}\right)\]

where \(d(\cdot, \cdot)\) is the squared distance under some metric. Here, in the Mahalanobis setting, \(d(\mathbf{x}_i, \mathbf{x}_j) = ||\mathbf{L}(\mathbf{x}_i - \mathbf{x}_j)||^2\), where the transformation matrix \(\mathbf{L}\) is derived from the decomposition of the Mahalanobis matrix \(\mathbf{M=L^TL}\).

Since \(\sigma^2\) can be integrated into \(d(\cdot)\), we can set \(\sigma^2=1\) for the sake of simplicity. Here we use the cumulative leave-one-out quadratic regression error of the training samples as the loss function:

\[\mathcal{L} = \sum_i (y_i - \hat{y}_i)^2\]

where the prediction \(\hat{y}_i\) is derived from kernel regression by calculating a weighted average of all the other training samples:

\[\hat{y}_i = \frac{\sum_{j \neq i} y_j k_{ij}}{\sum_{j \neq i} k_{ij}}\]
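The leave-one-out loss can be evaluated for a fixed \(\mathbf{L}\) in a few lines. The NumPy sketch below sets \(\sigma^2 = 1\), uses a Gaussian kernel on squared distances in the transformed space, and zeroes the diagonal of the kernel matrix so that each point is predicted from the others only. The kernel's normalizing constant is omitted because it cancels in the weighted average. `mlkr_loss` is a hypothetical helper for illustration, not the library's optimizer.

```python
import numpy as np

def mlkr_loss(L, X, y):
    """Hypothetical helper: leave-one-out kernel regression error for a fixed L."""
    Z = X @ L.T
    D = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)  # squared distances
    K = np.exp(-D)                                      # Gaussian kernel, sigma^2 = 1
    np.fill_diagonal(K, 0.0)                            # leave-one-out: drop j = i
    y_hat = (K @ y) / K.sum(axis=1)                     # kernel-weighted average
    return ((y - y_hat) ** 2).sum(), y_hat

# Three equally spaced 1D points with a linear target.
X_toy = np.array([[0.0], [1.0], [2.0]])
y_toy = np.array([0.0, 1.0, 2.0])
loss, y_hat = mlkr_loss(np.eye(1), X_toy, y_toy)
print(loss, y_hat)
```

The middle point is predicted exactly (its two neighbors are equidistant and average to its target), while the two end points are pulled toward the middle.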

```
from metric_learn import MLKR
from sklearn.datasets import load_iris
iris_data = load_iris()
X = iris_data['data']
Y = iris_data['target']
mlkr = MLKR()
mlkr.fit(X, Y)
```

References:

[1] Weinberger et al. Metric Learning for Kernel Regression. AISTATS 2007

### 2.2.5. Supervised versions of weakly-supervised algorithms

Each weakly-supervised algorithm has a supervised version of the form `*_Supervised` where similarity tuples are randomly generated from the label information and passed to the underlying algorithm.

Warning

Supervised versions of weakly-supervised algorithms interpret label -1 (or any negative label) as a point with unknown label. Those points are discarded in the learning process.

For pairs learners (see Learning on pairs), pairs (tuples of two points from the dataset) and pair labels (`int` indicating whether the two points are similar (+1) or dissimilar (-1)) are sampled with the function `metric_learn.constraints.positive_negative_pairs`. To sample positive pairs (of label +1), this method will look at all the samples with the same label and randomly sample a pair among them. To sample negative pairs (of label -1), this method will look at all the samples from a different class and randomly sample a pair among them. The method will try to build `num_constraints` positive pairs and `num_constraints` negative pairs, but sometimes it cannot find enough of one of those, so forcing `same_length=True` will return, for both kinds of pairs, the minimum of the two lengths.
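The sampling idea can be sketched in plain NumPy as follows. This is only an illustration of the procedure described here, not the code of `metric_learn.constraints`: the function `sample_pairs`, its signature and its return format (index pairs plus +1/-1 labels) are assumptions. Points with a negative label are treated as unlabeled and never sampled, and both lists are truncated to the same length, in the spirit of `same_length=True`.

```python
import numpy as np

def sample_pairs(y, num_constraints, rng):
    """Hypothetical helper: sample index pairs labeled +1 (same class) or -1."""
    y = np.asarray(y)
    known = np.flatnonzero(y >= 0)  # negative labels mean "unknown": discard
    pos, neg = [], []
    for _ in range(num_constraints):
        i = rng.choice(known)
        same = known[(y[known] == y[i]) & (known != i)]
        other = known[y[known] != y[i]]
        if same.size:               # a positive pair may not exist
            pos.append((i, rng.choice(same)))
        if other.size:              # a negative pair may not exist
            neg.append((i, rng.choice(other)))
    m = min(len(pos), len(neg))     # keep as many positives as negatives
    pairs = np.array(pos[:m] + neg[:m])
    labels = np.array([1] * m + [-1] * m)
    return pairs, labels

y_toy = np.array([0, 0, 1, 1, -1])  # the last point is unlabeled
pairs, labels = sample_pairs(y_toy, 10, np.random.default_rng(0))
print(pairs.shape, labels)
```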

To use quadruplets learners (see Learning on quadruplets) in a supervised way, positive and negative pairs are sampled as above and concatenated so that we have a 3D array of quadruplets, where for each quadruplet the first two points are from the same class, and the last two points are from different classes (so the last two points should indeed be less similar than the first two).

```
from metric_learn import MMC_Supervised
from sklearn.datasets import load_iris
iris_data = load_iris()
X = iris_data['data']
Y = iris_data['target']
mmc = MMC_Supervised(num_constraints=200)
mmc.fit(X, Y)
```