==========================
Supervised Metric Learning
==========================

Supervised metric learning algorithms take as inputs points `X` and target
labels `y`, and learn a distance metric that makes points from the same class
(for classification) or with close target values (for regression) close to
each other, and points from different classes or with distant target values
far away from each other.

General API
===========

Supervised metric learning algorithms essentially use the same API as
scikit-learn.

Input data
----------

In order to train a model, you need two array-like objects, `X` and `y`. `X`
should be a 2D array-like of shape `(n_samples, n_features)`, where
`n_samples` is the number of points of your dataset and `n_features` is the
number of attributes describing each point. `y` should be a 1D array-like of
shape `(n_samples,)`, containing for each point in `X` the class it belongs
to (or the value to regress for this sample, if you use `MLKR` for instance).

Here is an example of a dataset of two dogs and one cat (the classes are
'dog' and 'cat'), each animal being represented by two numbers.

>>> import numpy as np
>>> X = np.array([[2.3, 3.6], [0.2, 0.5], [6.7, 2.1]])
>>> y = np.array(['dog', 'cat', 'dog'])

.. note::

   You can also use a preprocessor instead of directly giving the inputs as
   2D arrays. See the :ref:`preprocessor_section` section for more details.

Fit, transform, and so on
-------------------------

The goal of supervised metric-learning algorithms is to transform points into
a new space, in which the distance between two points from the same class
will be small, and the distance between two points from different classes
will be large. To do so, we fit the metric learner (example: `NCA`).

>>> from metric_learn import NCA
>>> nca = NCA(random_state=42)
>>> nca.fit(X, y)
NCA(init='auto', max_iter=100, n_components=None, preprocessor=None,
    random_state=42, tol=None, verbose=False)

Now that the estimator is fitted, you can use it on new data for several
purposes.

First, you can transform the data in the learned space, using `transform`:
here we transform two points in the new embedding space.

>>> X_new = np.array([[9.4, 4.1], [2.1, 4.4]])
>>> nca.transform(X_new)
array([[ 5.91884732, 10.25406973],
       [ 3.1545886 ,  6.80350083]])

Also, as explained before, our metric learner has learned a distance between
points. You can use this distance in two main ways:

- You can either return the distance between pairs of points using the
  `pair_distance` function:

  >>> nca.pair_distance([[[3.5, 3.6], [5.6, 2.4]], [[1.2, 4.2], [2.1, 6.4]], [[3.3, 7.8], [10.9, 0.1]]])
  array([0.49627072, 3.65287282, 6.06079877])

- Or you can return a function that will return the distance (in the new
  space) between two 1D arrays (the coordinates of the points in the original
  space), similarly to distance functions in `scipy.spatial.distance` (see
  the sketch below for how to plug this function into a scikit-learn
  estimator).

  >>> metric_fun = nca.get_metric()
  >>> metric_fun([3.5, 3.6], [5.6, 2.4])
  0.4962707194621285

- Alternatively, you can use `pair_score` to return the **score** between
  pairs of points (the larger the score, the more similar the pair). For
  Mahalanobis learners, it is equal to the opposite of the distance.

  >>> score = nca.pair_score([[[3.5, 3.6], [5.6, 2.4]], [[1.2, 4.2], [2.1, 6.4]], [[3.3, 7.8], [10.9, 0.1]]])
  >>> score
  array([-0.49627072, -3.65287282, -6.06079877])

This is useful because `pair_score` matches the **score** semantic of
scikit-learn's classification metrics.
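For instance, the function returned by `get_metric` can be plugged into any
scikit-learn estimator that accepts a callable metric, such as
`sklearn.neighbors.KNeighborsClassifier`. The following is only an
illustrative sketch (the k-NN setup is our own addition, not part of the
example above, and it reuses `nca`, `X`, `y` and `X_new` from the session)::

    from sklearn.neighbors import KNeighborsClassifier

    # The classifier computes neighborhoods with the learned distance,
    # directly on points expressed in the original feature space.
    knn = KNeighborsClassifier(n_neighbors=1, metric=nca.get_metric())
    knn.fit(X, y)
    predictions = knn.predict(X_new)  # one class per new point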
.. note::

   If the metric learner that you use learns a Mahalanobis distance (as is
   the case for all algorithms currently in metric-learn), you can get the
   plain learned Mahalanobis matrix using `get_mahalanobis_matrix`.

   >>> nca.get_mahalanobis_matrix()
   array([[0.43680409, 0.89169412],
          [0.89169412, 1.9542479 ]])

Scikit-learn compatibility
--------------------------

All supervised algorithms are scikit-learn estimators
(`sklearn.base.BaseEstimator`) and transformers
(`sklearn.base.TransformerMixin`), so they are compatible with pipelines
(`sklearn.pipeline.Pipeline`) and scikit-learn model selection routines
(`sklearn.model_selection.cross_val_score`,
`sklearn.model_selection.GridSearchCV`, etc.). You can also use some of the
scoring functions from `sklearn.metrics`.
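As an illustration of this compatibility, the following sketch chains a
metric learner with a k-NN classifier and cross-validates the whole pipeline.
The particular estimators, parameters and number of folds are illustrative
choices, not recommendations::

    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import Pipeline

    from metric_learn import NCA

    X, y = load_iris(return_X_y=True)

    # NCA acts as a transformer inside the pipeline: the k-NN classifier is
    # trained and evaluated in the learned embedding space.
    pipe = Pipeline([('nca', NCA(random_state=42)),
                     ('knn', KNeighborsClassifier(n_neighbors=3))])
    scores = cross_val_score(pipe, X, y, cv=3)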
Algorithms
==========

.. _lmnn:

:py:class:`LMNN <metric_learn.LMNN>`
--------------------------------------------

Large Margin Nearest Neighbor Metric Learning
(:py:class:`LMNN <metric_learn.LMNN>`)

`LMNN` learns a Mahalanobis distance metric in the kNN classification
setting. The learned metric attempts to keep close k-nearest neighbors from
the same class, while keeping examples from different classes separated by a
large margin. This algorithm makes no assumptions about the distribution of
the data.

The distance is learned by solving the following optimization problem:

.. math::

      \min_\mathbf{L}\sum_{i, j}\eta_{ij}||\mathbf{L(x_i-x_j)}||^2 +
      c\sum_{i, j, l}\eta_{ij}(1-y_{il})[1+||\mathbf{L(x_i-x_j)}||^2-||
      \mathbf{L(x_i-x_l)}||^2]_+

where :math:`\mathbf{x}_i` is a data point, :math:`\mathbf{x}_j` is one of
its k-nearest neighbors sharing the same label, and :math:`\mathbf{x}_l` are
all the other instances within that region with different labels.
:math:`\eta_{ij}, y_{il} \in \{0, 1\}` are indicator variables:
:math:`\eta_{ij}=1` indicates that :math:`\mathbf{x}_{j}` is one of the
k-nearest neighbors (with the same label) of :math:`\mathbf{x}_{i}`, and
:math:`y_{il}=0` indicates that :math:`\mathbf{x}_{i}` and
:math:`\mathbf{x}_{l}` belong to different classes.
:math:`[\cdot]_+=\max(0, \cdot)` is the hinge loss.

.. rubric:: Example Code

::

    import numpy as np
    from metric_learn import LMNN
    from sklearn.datasets import load_iris

    iris_data = load_iris()
    X = iris_data['data']
    Y = iris_data['target']

    lmnn = LMNN(n_neighbors=5, learn_rate=1e-6, verbose=False)
    lmnn.fit(X, Y)

.. rubric:: References

.. container:: hatnote hatnote-gray

    [1]. Weinberger et al. Distance Metric Learning for Large Margin Nearest Neighbor Classification. JMLR 2009.

    [2]. Wikipedia entry on Large Margin Nearest Neighbor.

.. _nca:

:py:class:`NCA <metric_learn.NCA>`
--------------------------------------------

Neighborhood Components Analysis (:py:class:`NCA <metric_learn.NCA>`)

`NCA` is a distance metric learning algorithm which aims to improve the
accuracy of nearest neighbors classification compared to the standard
Euclidean distance. The algorithm directly maximizes a stochastic variant of
the leave-one-out k-nearest neighbors (KNN) score on the training set. It can
also learn a low-dimensional linear transformation of data that can be used
for data visualization and fast classification.

The algorithm uses the decomposition :math:`\mathbf{M} =
\mathbf{L}^T\mathbf{L}` and defines the probability :math:`p_{ij}` that
:math:`\mathbf{x}_i` selects :math:`\mathbf{x}_j` as its neighbor through a
softmax over the squared Mahalanobis distances:

.. math::

      p_{ij} = \frac{\exp(-|| \mathbf{Lx}_i - \mathbf{Lx}_j ||_2^2)}
      {\sum_{l\neq i}\exp(-||\mathbf{Lx}_i - \mathbf{Lx}_l||_2^2)},
      \qquad p_{ii}=0

Then the probability that :math:`\mathbf{x}_i` will be correctly classified
by the stochastic nearest neighbors rule is:

.. math::

      p_{i} = \sum_{j:j\neq i, y_j=y_i}p_{ij}

The optimization problem is to find the matrix :math:`\mathbf{L}` that
maximizes the sum of the probabilities of being correctly classified:

.. math::

      \mathbf{L} = \text{argmax}\sum_i p_i

.. rubric:: Example Code

::

    import numpy as np
    from metric_learn import NCA
    from sklearn.datasets import load_iris

    iris_data = load_iris()
    X = iris_data['data']
    Y = iris_data['target']

    nca = NCA(max_iter=1000)
    nca.fit(X, Y)
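As mentioned above, `NCA` can also learn a low-dimensional linear
transformation. The following sketch complements the example by requesting a
2D embedding for visualization; the `n_components` value here is an
illustrative choice, not a recommendation::

    from metric_learn import NCA
    from sklearn.datasets import load_iris

    X, y = load_iris(return_X_y=True)

    # Project the 4D iris data onto a 2D space learned by NCA.
    nca_2d = NCA(n_components=2, max_iter=1000)
    X_2d = nca_2d.fit_transform(X, y)   # shape (150, 2)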
.. rubric:: References

.. container:: hatnote hatnote-gray

    [1]. Goldberger et al. Neighbourhood Components Analysis. NIPS 2005.

    [2]. Wikipedia entry on Neighborhood Components Analysis.

.. _lfda:

:py:class:`LFDA <metric_learn.LFDA>`
--------------------------------------------

Local Fisher Discriminant Analysis (:py:class:`LFDA <metric_learn.LFDA>`)

`LFDA` is a linear supervised dimensionality reduction method which
effectively combines the ideas of Linear Discriminant Analysis and
Locality-Preserving Projection. It is particularly useful when dealing with
multi-modality, where one or more classes consist of separate clusters in
input space. The core optimization problem of LFDA is solved as a generalized
eigenvalue problem.

The algorithm defines the Fisher local within-class and between-class scatter
matrices :math:`\mathbf{S}^{(w)}` and :math:`\mathbf{S}^{(b)}` in a pairwise
fashion:

.. math::

    \mathbf{S}^{(w)} = \frac{1}{2}\sum_{i,j=1}^nW_{ij}^{(w)}(\mathbf{x}_i -
    \mathbf{x}_j)(\mathbf{x}_i - \mathbf{x}_j)^T,\\
    \mathbf{S}^{(b)} = \frac{1}{2}\sum_{i,j=1}^nW_{ij}^{(b)}(\mathbf{x}_i -
    \mathbf{x}_j)(\mathbf{x}_i - \mathbf{x}_j)^T,\\

where

.. math::

    W_{ij}^{(w)} = \left\{\begin{aligned}0 \qquad y_i\neq y_j \\
    \,\,\mathbf{A}_{i,j}/n_l \qquad y_i = y_j\end{aligned}\right.\\
    W_{ij}^{(b)} = \left\{\begin{aligned}1/n \qquad y_i\neq y_j \\
    \,\,\mathbf{A}_{i,j}(1/n-1/n_l) \qquad y_i = y_j\end{aligned}\right.\\

Here, :math:`\mathbf{A}_{i,j}` is the :math:`(i,j)`-th entry of the affinity
matrix :math:`\mathbf{A}`, which can be calculated with local scaling
methods, and :math:`n` and :math:`n_l` are the total number of points and the
number of points in class :math:`l`, respectively. The learning problem then
becomes deriving the LFDA transformation matrix :math:`\mathbf{L}_{LFDA}`:

.. math::

    \mathbf{L}_{LFDA} = \arg\max_\mathbf{L}
    [\text{tr}((\mathbf{L}^T\mathbf{S}^{(w)}
    \mathbf{L})^{-1}\mathbf{L}^T\mathbf{S}^{(b)}\mathbf{L})]

That is, it is looking for a transformation matrix :math:`\mathbf{L}` such
that nearby data pairs in the same class are made close and data pairs in
different classes are separated from each other, while far apart data pairs
in the same class are not forced to be close.

.. rubric:: Example Code

::

    import numpy as np
    from metric_learn import LFDA
    from sklearn.datasets import load_iris

    iris_data = load_iris()
    X = iris_data['data']
    Y = iris_data['target']

    lfda = LFDA(k=2, n_components=2)
    lfda.fit(X, Y)

.. note::
    LFDA suffers from a problem called "sign indeterminacy", which means the
    sign of the ``components`` and the output from transform depend on a
    random state. This is directly related to the calculation of eigenvectors
    in the algorithm. The same input run at different times might lead to
    different transforms, but both will be valid. To work around this, fit
    instances of this class to data once, then keep the instance around to do
    transformations.

.. rubric:: References

.. container:: hatnote hatnote-gray

    [1]. Sugiyama. Dimensionality Reduction of Multimodal Labeled Data by Local Fisher Discriminant Analysis. JMLR 2007.

    [2]. Tang. Local Fisher Discriminant Analysis on Beer Style Clustering.

.. _mlkr:

:py:class:`MLKR <metric_learn.MLKR>`
--------------------------------------------

Metric Learning for Kernel Regression (:py:class:`MLKR <metric_learn.MLKR>`)

`MLKR` is an algorithm for supervised metric learning, which learns a
distance function by directly minimizing the leave-one-out regression error.
This algorithm can also be viewed as a supervised variation of PCA and can be
used for dimensionality reduction and high dimensional data visualization.

Theoretically, `MLKR` can be applied with many types of kernel functions and
distance metrics; here the exposition focuses on a particular instance, the
Gaussian kernel and the Mahalanobis metric. The Gaussian kernel is denoted
as:

.. math::

    k_{ij} = \frac{1}{\sqrt{2\pi}\sigma}\exp(-\frac{d(\mathbf{x}_i,
    \mathbf{x}_j)}{\sigma^2})

where :math:`d(\cdot, \cdot)` is the squared distance under a given metric.
With a Mahalanobis metric, it is :math:`d(\mathbf{x}_i, \mathbf{x}_j) =
||\mathbf{L}(\mathbf{x}_i - \mathbf{x}_j)||^2`, where the transformation
matrix :math:`\mathbf{L}` is derived from the decomposition of the
Mahalanobis matrix :math:`\mathbf{M=L^TL}`. Since :math:`\sigma^2` can be
integrated into :math:`d(\cdot)`, we can set :math:`\sigma^2=1` for the sake
of simplicity.

Here the cumulative leave-one-out quadratic regression error of the training
samples is used as the loss function:

.. math::

    \mathcal{L} = \sum_i(y_i - \hat{y}_i)^2

where the prediction :math:`\hat{y}_i` is derived from kernel regression by
calculating a weighted average of all the training samples:

.. math::

    \hat{y}_i = \frac{\sum_{j\neq i}y_jk_{ij}}{\sum_{j\neq i}k_{ij}}

.. rubric:: Example Code

::

    from metric_learn import MLKR
    from sklearn.datasets import load_iris

    iris_data = load_iris()
    X = iris_data['data']
    Y = iris_data['target']

    mlkr = MLKR()
    mlkr.fit(X, Y)
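Since `MLKR` minimizes a leave-one-out *regression* error, it is typically
fitted on continuous targets. The example above reuses the iris class labels
as numeric targets; the following sketch instead uses a standard regression
dataset (the dataset choice is an illustrative assumption)::

    from metric_learn import MLKR
    from sklearn.datasets import load_diabetes

    # Continuous regression targets instead of class labels.
    X_reg, y_reg = load_diabetes(return_X_y=True)

    mlkr = MLKR()
    mlkr.fit(X_reg, y_reg)

    # The fitted learner can then embed the data into the learned space.
    X_embedded = mlkr.transform(X_reg)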
.. rubric:: References

.. container:: hatnote hatnote-gray

    [1]. Weinberger et al. Metric Learning for Kernel Regression. AISTATS 2007.

.. _supervised_version:

Supervised versions of weakly-supervised algorithms
----------------------------------------------------

Each weakly-supervised algorithm has a supervised version of the form
`*_Supervised`, where similarity tuples are randomly generated from the label
information and passed to the underlying algorithm.

.. warning::

    Supervised versions of weakly-supervised algorithms interpret label -1
    (or any negative label) as a point with unknown label. Those points are
    discarded in the learning process.

For pairs learners (see :ref:`learning_on_pairs`), pairs (tuples of two
points from the dataset) and pair labels (`int` indicating whether the two
points are similar (+1) or dissimilar (-1)) are sampled with the function
`metric_learn.constraints.positive_negative_pairs`. To sample positive pairs
(of label +1), this method will look at all the samples with the same label
and randomly sample a pair among them. To sample negative pairs (of label
-1), this method will look at samples from different classes and randomly
sample a pair among them. The method will try to build `n_constraints`
positive pairs and `n_constraints` negative pairs, but sometimes it cannot
find enough of one of those, so forcing `same_length=True` will truncate both
lists to the minimum of the two lengths.

For using quadruplets learners (see :ref:`learning_on_quadruplets`) in a
supervised way, positive and negative pairs are sampled as above and
concatenated so that we have a 3D array of quadruplets, where for each
quadruplet the first two points are from the same class, and the last two
points are from different classes (so indeed the last two points should be
less similar than the first two points).

.. rubric:: Example Code

::

    from metric_learn import MMC_Supervised
    from sklearn.datasets import load_iris

    iris_data = load_iris()
    X = iris_data['data']
    Y = iris_data['target']

    mmc = MMC_Supervised(n_constraints=200)
    mmc.fit(X, Y)
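Once fitted, these supervised wrappers expose the same general API as the
other Mahalanobis learners described at the top of this page. A short,
illustrative follow-up to the example above (it assumes the `mmc` and `X`
variables from the previous snippet)::

    # Inspect the learned Mahalanobis matrix ...
    M = mmc.get_mahalanobis_matrix()

    # ... or measure learned distances between pairs of points.
    distances = mmc.pair_distance([[X[0], X[1]], [X[0], X[100]]])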