*This blog is part 2 of a series featuring excerpts from the original technical paper, “Methods for Multidimensional Scaling” by Douglas B. Clarkson, IMSL, Inc., 1987.*

In this blog post we pick up from Part 1 with a discussion of a generalized criterion function along with further generalizations that may be useful.

# A General Criterion Function

The generalized criterion function is

$$ q = \sum_{h} \omega_h \sum_{i,j} |f(\tilde \delta_{ijm}) - a_h - b_h f(\delta_{ijm})|^p $$

where the \(\omega_h\) are “weights” which depend upon the \( f(\tilde \delta_{ijm}) \) in the \(h^{th}\) stratum; \(h\) indexes the strata; \(f\) is one of several transformations discussed in the Data transformation section below; \(a_h\) and \(b_h\) are parameters used in stratum \(h\) and discussed below; \(m\) indexes the dissimilarity matrix and depends upon \(h\) according to the stratification used; and \(p\) allows for general \(L_p\) (\(p^{th}\) power) estimation. The most likely values for \(p\) are 2.0 for least squares and 1.0 for least absolute value. Other parameters in the model are found in the distance models for \(\delta_{ijm}\) and, for nonmetric scaling, in the disparities \(\tilde \delta_{ijm}\).
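As a concrete illustration, one stratum's contribution to \(q\) can be sketched in a few lines of Python. This is a hypothetical sketch with illustrative names, not code from the paper:

```python
import numpy as np

# One stratum's contribution to q (illustrative sketch):
# q_h = w_h * sum over pairs |f(d_tilde) - a_h - b_h * f(d_hat)|^p
def stratum_q(d_tilde, d_hat, f, a_h=0.0, b_h=1.0, p=2.0, w_h=1.0):
    resid = f(d_tilde) - a_h - b_h * f(d_hat)
    return w_h * np.sum(np.abs(resid) ** p)

# p = 2 gives least squares; p = 1 gives least absolute value.
d_tilde = np.array([1.0, 2.0, 3.0])   # disparities in stratum h
d_hat   = np.array([1.1, 1.9, 3.2])   # model distances in stratum h
q_h = stratum_q(d_tilde, d_hat, f=lambda x: x)   # about 0.06
```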

Null and Sarle (1982) have given a criterion function which uses \(p^{th}\) power estimation for ratio and interval data. The criterion \(q\) generalizes \(p^{th}\) power estimates to the nonmetric case.

## Data transformation

The function \(f\) transforms the disparities \(\tilde \delta\) and the distances \(\delta\), allowing the user to change assumptions about the distribution of the observed dissimilarities. This ability is most important in metric data, but \( f \) also has effects in nonmetric data, primarily through the scaling weights \(\omega_h\). Since least squares and normal theory maximum likelihood estimates are equivalent when there is a single stratum, for metric data the function \( f \) may be thought of as a transformation to normality. Alternatively, \(f\) may be used to stabilize the within-strata variances.

Three commonly used choices for \(f\) are \(f(x) = x\), \(f(x) = x^2\), and \(f(x) = \ln x\), where \(x > 0\). For metric data, if squared distances have constant variance within each stratum, then \(f(x) = x^2\) should be used, with other transformations used in a similar manner as appropriate. The ALSCAL models (Takane, Young, and DeLeeuw 1977) use \(f(x) = x^2\) in the stress function, while \(f(x) = x\) is used in the MULTISCALE models (Ramsay 1983) and in KYST (Kruskal, Young, and Seery 1973), among other models. The MULTISCALE models also allow the use of \(f(x) = \ln x\).
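For concreteness, the three transformations can be written as plain callables (a trivial sketch; \(x\) is assumed positive for the logarithm):

```python
import numpy as np

# The three common choices of f discussed above (x assumed positive for log):
transforms = {
    "identity": lambda x: x,        # KYST, MULTISCALE
    "square":   lambda x: x ** 2,   # ALSCAL stress
    "log":      np.log,             # MULTISCALE option
}

x = np.array([0.5, 1.0, 2.0])
fx = {name: f(x) for name, f in transforms.items()}
```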

## The distance models

These distance models are equivalent to those in ALSCAL (Takane, Young, and DeLeeuw 1977).

• Euclidean model

$$ \delta^2_{ijm} = \sum_{k=1}^{\tau} (X_{ik} - X_{jk})^2 $$

where \(i\) and \(j\) index the stimuli, and \(m\) indexes the matrix.

• Individual-differences model

$$ \delta^2_{ijm} = \sum_{k=1}^{\tau} W_{mk} (X_{ik} - X_{jk})^2 $$

where \(W_{mk}\) is the weight on the \(k^{th}\) dimension for the \(m^{th}\) matrix.

• Stimulus weighted model

$$ \delta^2_{ijm} = \sum_{k=1}^{\tau} S_{ik} (X_{ik} - X_{jk})^2 $$

where \(S_{ik}\) is the weight on the \(k^{th}\) dimension for the \(i^{th}\) stimulus.

• Stimulus-weighted individual-differences model

$$ \delta^2_{ijm} = \sum_{k=1}^{\tau} S_{ik}W_{mk} (X_{ik} - X_{jk})^2 $$

Of course, other distance models exist. For example, IDIOSCALE (Carroll and Chang 1970) allows a rotation of each individual’s coordinate axes, while Weeks and Bentler (1982) propose some asymmetric models via skew-symmetric matrices.
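Since all four models differ only in how the squared coordinate differences are weighted, they can be sketched in one function. This is an illustrative sketch (not the ALSCAL code), assuming a configuration \(X\) of \(n\) stimuli by \(\tau\) dimensions, subject weights \(W\), and stimulus weights \(S\):

```python
import numpy as np

# delta^2_{ijm} as a weighted sum of squared coordinate differences.
# Omitting W and S gives the Euclidean model; supplying either (or both)
# gives the individual-differences and stimulus-weighted variants.
def squared_distances(X, W=None, S=None, m=0):
    n, t = X.shape
    diff2 = (X[:, None, :] - X[None, :, :]) ** 2   # (n, n, t)
    wk = np.ones(t) if W is None else W[m]         # per-dimension weights for matrix m
    sk = np.ones((n, t)) if S is None else S       # per-stimulus weights
    return np.einsum('ik,ijk,k->ij', sk, diff2, wk)

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
D2 = squared_distances(X)   # Euclidean squared distances
```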

## Strata weights

In metric data, strata weights are used to scale the criterion function within a stratum. Weights which are inversely proportional to the variances are preferred because they lead to normal distribution theory maximum-likelihood estimates. Weights inversely proportional to the variances (when \(p = 2\)) are given as

$$ \omega^{-1}_{h} = \frac{ \sum |f(\tilde{\delta}_{ijm}) - a_h - b_hf(\hat{\delta}_{ijm})|^p}{n_h} $$

where \(\omega_h\) is the weight in the \(h^{th}\) stratum, the sum is over all observations in the stratum, and \(n_h\) is the number of observations in the stratum. When \(p \ne 2\), the resulting weights may still be optimal in some sense.

The astute reader will note that with the choice for \(\omega_h\) above, the criterion function is the ratio of two proportional quantities and as such is the sum over \(h\) of the proportionality constant, \(n_h\). Since the estimation algorithm treats \(\omega_h\) as fixed during each iteration, the criterion function is still “optimized” during each iteration (in the sense that its first partial derivatives will converge to zero). The partial derivatives of \(q\) for fixed \(\omega_h\) are identical to those obtained from the criterion

$$ \tilde{q} = \sum_{h} n_h \ln \left [ \sum_i \sum_j |f(\tilde{\delta}_{ijm}) - a_h - b_h f(\hat{\delta}_{ijm})|^p \right ]$$

When weights inversely proportional to the strata variances are used, \(\tilde{q}\) is actually optimized. However, because of the equivalence in the derivatives, the result is the same as if one were to optimize \(q\) for fixed weights, where the fixed weights depend upon the final parameter estimates.
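The variance-based weight above reduces to a one-liner. In this sketch (names are illustrative), `resid` collects the within-stratum residuals \(f(\tilde{\delta}) - a_h - b_h f(\hat{\delta})\):

```python
import numpy as np

# omega_h^{-1} = (sum of |residual|^p over the stratum) / n_h, so
# omega_h = n_h / sum |residual|^p.  p = 2 gives the variance-based weight.
def variance_weight(resid, p=2.0):
    return len(resid) / np.sum(np.abs(resid) ** p)

resid = np.array([0.1, -0.1, 0.2])
w_h = variance_weight(resid)   # 3 / 0.06, i.e. 50
```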

In nonmetric scaling the criterion function is minimized with respect to both \(\delta\) and \(\tilde{\delta}\), and the weights are used as a scaling factor so that the solution will not degenerate to zero. In most nonmetric multidimensional scaling models, scaling is provided by the use of one of two possible strata weights proposed by Kruskal (1964). These weights are given as

$$ \omega_h^{-1} = \frac{\sum |f(\tilde{\delta}_{ijm})|^p}{n_h}$$

or

$$ \omega_h^{-1} = \frac{\sum |f(\tilde{\delta}_{ijm}) - \bar{f}(\tilde{\delta}_{\cdot\cdot\cdot})|^p}{n_h}$$

where the sum is over the observations in the stratum, and where \(\bar{f}(\tilde{\delta}_{\cdot\cdot\cdot})\) denotes the average of the disparities in the stratum. (Kruskal used \(p = 2\).) Both weighting schemes are allowed in \(q\) for metric as well as nonmetric data.
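The two Kruskal-style weights can be sketched together (an illustrative sketch; \(p = 2\) matches Kruskal's usage):

```python
import numpy as np

# Kruskal-style stratum weights: omega_h^{-1} = sum |f(d~) - c|^p / n_h,
# where c = 0 for the first scheme and c = mean(f(d~)) for the second.
def kruskal_weight(fd, p=2.0, centered=False):
    c = fd.mean() if centered else 0.0
    return len(fd) / np.sum(np.abs(fd - c) ** p)

fd = np.array([1.0, 2.0, 3.0])   # f applied to the disparities in stratum h
w1 = kruskal_weight(fd)                  # 3 / (1 + 4 + 9) = 3/14
w2 = kruskal_weight(fd, centered=True)   # 3 / (1 + 0 + 1) = 3/2
```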

## The parameters \(a_h\) and \(b_h\)

The meaning of parameters \(a_h\) and \(b_h\) varies depending on other aspects of the model. Since both \(a_h\) and \(b_h\) are redundant in nonmetric data, \(a_h\) is fixed at \(0\), while \(b_h\) is fixed at \(1\) in this case. For metric data the meaning and use of both \(a_h\) and \(b_h\) vary with the transformation \(f\) and with the data type as follows:

• If the transformation \(f(x) = x\) is used, then the parameter \(a_h\) is a translation parameter used for interval data. In this case, the quantity \(\tilde{\delta}_{ijm} - a_h\) is assumed to be a distance (i.e., when \(\tilde{\delta}_{ijm} - a_h\) is zero, the objects are assumed to be in the same location, while all other values of \(\tilde{\delta}_{ijm} - a_h\) are positive). When the observed data are ratio, \(a_h\) is set to zero.

• For \(f(x) = x\), \(b_h\) is a scaling factor allowing for different scalings of distance between strata. For example, some subjects may consistently give a smaller response than others. Whether \(b_h\) should be used when different strata scales are allowed depends upon the parameterization used for \( \delta \), since \(b_h\) may be redundant for some models. (See section 2.5.)

• If \(f(x) = x^2\), then for interval data the model assumes that \(\tilde{\delta}^2 - a_h\) is a squared distance measure. Thus, squared distance plus a constant is assumed to be observed. As when \(f(x) = x\), \(a_h\) should be set to \(0\) for ratio data. Once more, when \(f(x) = x^2\), the parameter \(b_h\) is a scaling parameter.

• When \(f(x) = \ln(x)\), the general meaning of \(a_h\) and \(b_h\) changes. For such data, \(\exp(a_h)\) is a scaling factor, while \(b_h\) is a power transformation parameter (the transformation can be rewritten as \(\ln(\delta^{b_h})\)). Of course, \(b_h\) should be fixed at \(1\) if a power transformation is not desired. Since no translation to zero distance is possible when \(f(x) = \ln(x)\), interval data are not handled in this case.
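One way to see the roles of \(a_h\) and \(b_h\) under the log transformation (a small derivation, not taken from the paper):

$$ \ln \tilde{\delta} - a_h - b_h \ln \delta = \ln\left(\frac{\tilde{\delta}}{e^{a_h}\,\delta^{b_h}}\right) $$

so the residual vanishes exactly when \(\tilde{\delta} = e^{a_h}\delta^{b_h}\): \(e^{a_h}\) acts as a multiplicative scale and \(b_h\) as a power on \(\delta\).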

## Unique parameter estimates

All of the models given are overparameterized, so the resulting parameter estimates are not uniquely defined. As was discussed for the Euclidean model, the columns of \(X\) can be translated; in the Euclidean model, rotation is also possible. To eliminate the lack of uniqueness due to translation, model estimates for the configuration can be centered in all models. No attempt at eliminating the rotation problem is usually made, but note that rotation invariance is not a problem in some of the models given.

With more general models than the Euclidean model, other kinds of overparameterization occur. Further restrictions on the parameters to eliminate this overparameterization are given below by model transformation (\(f\)) type. In the following, \(W_{mk} \in W\) and \(S_{ik} \in S\), where \(W\) is the matrix of subject weights and \(S\) is a matrix of stimulus weights.

1. For all models

(a) \(\sum_{i=1}^n x_{ik} = 0\); i.e., center the columns of \(X\).

(b) If \(W\) is in the model, scale the columns of \(W\) so that \(\sum_{i=1}^n x^2_{ik} = 1\).

(c) If \(S\) is in the model and \(W\) is not in the model, scale the columns of \(S\) so that \(\sum_{i=1}^n x^2_{ik} = 1\).

(d) If both \(S\) and \(W\) are in the model, scale the columns of \(W\) so that \(\sum_{i=1}^n s^2_{ik} = 1\).

2. For \(f(x) = x\) and \(f(x) = x^2\)

(a) Set \(b_h = 1\) if the data are matrix conditional and \(W\) is in the model, or if the data are unconditional. (In all cases, matrix conditional with only one matrix is considered the same as unconditional data.)

(b) If the data are matrix conditional and \(W\) is not in the model, scale all elements in \(S\) (or \(X\) if \(S\) is not in the model) so that \(\sum_{h=1}^\gamma b^2_{h} = \gamma\), where \(\gamma\) is the number of matrices observed.

(c) If the data are row conditional and \(W\) is in the model, then scale the rows of \(W\) so that \(\sum_{i=1}^n b^2_{h} = n\), where \(h\) corresponds to \(i\) for the appropriate matrix.

(d) If the data are row conditional and \(W\) is not in the model, but \(S\) is in the model, then scale the rows of \(S\) so that \(\sum_{m=1}^{\gamma} b^2_{h} = \gamma\), where \(h\) corresponds to \(m\) for the appropriate stimulus.

(e) If the data are row conditional and neither \(W\) nor \(S\) is in the model, then scale all elements in \(X\) so that \(\sum_{h} b^2_{h} = n\gamma\).

3. For \(f(x) = \ln(x)\), substitute \(a_h\) for \(b_h\) (but set \(a_h\) to \(0\) instead of \(1\)) in all restrictions in item 2.
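Restrictions 1(a) and 1(b) amount to a centering and a joint rescaling that leave the fitted distances unchanged. A sketch (illustrative names), assuming the individual-differences model:

```python
import numpy as np

# 1(a): center the columns of X.
# 1(b): rescale so each column of X has unit sum of squares, compensating in
#       W so that W_mk * (X_ik - X_jk)^2, and hence delta^2, is unchanged.
def apply_restrictions(X, W):
    X = X - X.mean(axis=0)                  # 1(a)
    scale = np.sqrt((X ** 2).sum(axis=0))   # column root-sum-of-squares
    return X / scale, W * scale ** 2        # 1(b): move the scale into W

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
W = np.ones((1, 2))
Xn, Wn = apply_restrictions(X, W)
```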

## Missing Values

Missing data are usually handled by eliminating the missing data points from the stress function. If the amount of missing data is not too severe, then it is not likely that problems (e.g., overparameterization) will result. In computing initial estimates, the average of the data elements in the stratum is often substituted for the missing data, and the usual estimation procedure can then be used. If the data are row conditional and all elements in a stratum are missing, the average of the non-missing data values for the matrix is often used.
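The imputation rule for initial estimates can be sketched as follows (an illustrative sketch, with `NaN` marking missing entries):

```python
import numpy as np

# Replace missing (NaN) entries in a stratum by the stratum mean; if the
# entire stratum is missing, fall back to the mean of the matrix's
# non-missing values, as described above for row-conditional data.
def impute_stratum(stratum, matrix_mean):
    vals = np.array(stratum, dtype=float)
    missing = np.isnan(vals)
    fill = matrix_mean if missing.all() else np.nanmean(vals)
    vals[missing] = fill
    return vals
```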

## What’s next

In our next blog, we will continue with the computational procedures for optimization and parameter estimation for the generalized criterion functions. Later blogs will illustrate the procedure with an additional, more complex example.