Statistics
LeastTrimmedSquares
compute least trimmed squares regression
Calling Sequence
Parameters
Description
Options
Notes
Examples
References
Compatibility
LeastTrimmedSquares(X, Y, v)
LeastTrimmedSquares(XY, v)
LeastTrimmedSquares(X, Y, v, opts)
LeastTrimmedSquares(XY, v, opts)
X - values of independent variable(s)
Y - values of dependent variable
XY - values of independent and dependent variables
v - algebraic expression in which to express the result, or a list or Vector of values of the independent variable(s) at which to evaluate the computed estimator
opts - (optional) one or more equations of the form include = h, intercept = tf, or subsets = s
The LeastTrimmedSquares function computes a robust linear estimator from a collection of input data. The estimator was first described by Rousseeuw [1] and is computed using the FAST-LTS algorithm of Rousseeuw and Van Driessen [2]. The article [1] that introduced least trimmed squares regression also introduced least median of squares regression, but least trimmed squares regression was later considered the better choice by many, including the original author; see, e.g., [2].
Least trimmed squares regression yields a robust estimator. This means that it will continue to perform well if some points are replaced by outliers. Least-squares regression, the type of regression most commonly used and implemented by LinearFit and NonlinearFit, is very susceptible to outliers.
Conceptually, one performs least trimmed squares regression by selecting an arbitrary subset of h of the input data points, where h is an input parameter (which can be set using the include option). One then performs regular least-squares regression on these h data points, and records the sum of squared residuals. This sum of squared residuals is now minimized over all subsets of h data points.
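As a rough illustration of this definition (not of the FAST-LTS algorithm the command actually uses), the following sketch performs the brute-force computation on a tiny, made-up data set; all names and values here are illustrative only:
with(Statistics): with(combinat):
X := [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]:
Y := [2.1, 3.9, 6.2, 8.1, 9.8, 30.0]:     # the last point is an outlier
h := 4:                                   # number of points to include
best := Float(infinity):
for s in choose([seq(1 .. 6)], h) do
    # ordinary least-squares fit on the h selected points
    f := unapply(Fit(a*t + b, Vector(X[s]), Vector(Y[s]), t), t);
    ssr := add((Y[s][i] - f(X[s][i]))^2, i = 1 .. h);
    if ssr < best then best := ssr; bestfit := f(t); end if;
end do:
bestfit;   # close to the trend of the five clean points, roughly 2*t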
The FAST-LTS algorithm implemented in this command is an approximation algorithm: it will in all reasonable cases return the optimal regression model, but this is not guaranteed. An exact algorithm that does guarantee the optimal result can be selected with the subsets = all option, but this algorithm will take a number of steps that grows very quickly with the size of the problem.
Least trimmed squares regression with parameter h on n input points has a breakdown point of approximately (n - h)/n. This means that when fewer than n - h of the data points in a sample are changed, the result of least trimmed squares regression can only change a limited amount. For example, with n = 1000 points and the default value of h (a little over 500), nearly half of the data can be replaced by outliers before the estimator breaks down. See the discussion of the include option below for limits on the values h can assume.
In the first and third calling sequences, the parameter X is a Matrix containing the values of the independent variables; each row corresponds to one data point and each column corresponds to one independent variable. Alternatively, you can specify X as a list of lists. If there is only a single independent variable, you can also specify X as a column Vector or a list.
In those same calling sequences, Y is a Vector containing the values of the dependent variable, in the same order as the data points occur in X. Alternatively, you can specify Y as a list, or a Matrix with a single column.
In the second and fourth calling sequences, XY is a Matrix containing both the dependent and independent data, with the dependent data being in the right-most column.
The returned value is an expression representing the estimator evaluated at v. By supplying a variable name here, you obtain the general expression for the estimator in terms of the variables v[1], v[2], and so on, corresponding to the independent variables; if there is a single independent variable, the result is expressed directly in terms of v, as in the examples below. If instead you supply a list (or Vector) of values of the independent variables, you obtain the value of the estimator at that point.
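As an illustration, the following calls (with small, made-up data; the matrix and values are only for demonstration) show the symbolic and the evaluated forms of the result, as well as the combined-matrix calling sequence:
with(Statistics):
X := Matrix([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0], [6.0, 7.0]]):
Y := <3.0, 4.0, 7.0, 8.0, 11.0, 14.0>:
LeastTrimmedSquares(X, Y, v);            # expression in v[1] and v[2]
LeastTrimmedSquares(X, Y, [1.5, 2.5]);   # numeric value of the estimator at (1.5, 2.5)
LeastTrimmedSquares(<X | Y>, v);         # same data, combined-matrix calling sequence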
include = h
This option determines how many data points are included (and consequently, how many are ignored). The value h should either be an integer greater than 1, in which case it is directly used as the number of data points to include, or a number that is less than or equal to one, in which case it is multiplied by n (the number of data points) and rounded to obtain the number of data points to include.
In all cases, the resulting number of data points to include (denoted by h in the rest of this discussion) needs to be at least (n + p + 1)/2, where n is the number of data points and p is the number of independent variables, including (if present) the intercept. The default value for h is this minimum, rounded up to an integer. For example, with n = 1000 data points, one independent variable, and an intercept (so p = 2), the default is h = 502.
If you know that the number of outliers in your data is less than some number m, and n - m > (n + p + 1)/2, then you can supply the option include = n - m to get better statistical efficiency.
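For example, if you expect at most about 10% of the points to be outliers, you could pass the fraction form of the option; this sketch assumes data x and y set up as in the first example below:
# include 90% of the n data points (rounded), assuming at most about 10% outliers
LeastTrimmedSquares(x, y, X, include = 0.9);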
intercept = tf
This option determines if a column for the intercept (a constant term in the resulting regression model) should be added to X. The default value is true; you can supply the option intercept = false to turn this off. The intercept is internally handled by adding a column filled with the value 1 to the right of X.
Supplying intercept = false is necessary if one of the independent variables already assumes only a single constant value, or if the all-one vector is otherwise already in the column space of X: the matrix used internally needs to have full column rank, and joining an all-one column to a Matrix that already contains that vector in its column space would destroy that.
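A minimal sketch of this situation, with made-up data for illustration only: the second column of X is constant, so the automatically added intercept column must be suppressed.
with(Statistics):
X := Matrix([[1.0, 1], [2.0, 1], [3.0, 1], [4.0, 1], [5.0, 1], [6.0, 1]]):
Y := <3.1, 5.0, 6.8, 9.1, 11.0, 12.9>:
# the constant second column already plays the role of the intercept
LeastTrimmedSquares(X, Y, v, intercept = false);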
subsets = s
This option can be used by experts to control the details of the approximation process used by the FAST-LTS algorithm. The discussion of this option assumes knowledge of the Rousseeuw and Van Driessen paper [2] that describes the algorithm.
If p = 1 (after including the intercept), this command uses a fast exact algorithm and ignores the subsets option.
By submitting the subsets = all option, the user can select a variant that tries all p-subsets of the data points. It runs two C-steps on all regression lines generated this way, selects the 10 best results, and runs C-steps on these results until convergence. This should in all cases return the optimal result, but the number of regression lines to process grows very quickly as a function of n and p. This algorithm is selected by default if binomial(n, p) < 1500.
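As a minimal sketch (with a tiny made-up data set, so that the number of p-subsets stays small), the exhaustive variant can be requested explicitly as follows; here it would have been chosen by default anyway, since binomial(8, 2) = 28 < 1500:
with(Statistics):
X := <1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0>:
Y := <2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 40.0>:   # the last point is an outlier
LeastTrimmedSquares(X, Y, v, subsets = all);        # tries all 28 2-subsets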
Otherwise, the algorithm runs in a number of rounds in which candidate regression lines are generated and subjected to C-steps. The details of these rounds depend, by default, on n, but they can be fine-tuned by submitting an option of the form subsets = s, where s is a list of lists. Each sublist corresponds to one round of the algorithm and contains a number of options specified as equations, listed below; an illustrative configuration is sketched after these descriptions.
subsets = c and size = s
These options specify that the data points should be subdivided into c disjoint subsets of size s each, all treated independently. Any points not included are ignored for the time being. The other options then apply to each subset; for example, the generate option specifies how many candidates are generated for each subset. The default values are c = 1 and s equal to the total number of data points. These options can only be specified for the first round.
generate = g
This specifies how to obtain the candidates at the start of the round:
If g is a positive integer, Maple will randomly create that many candidates. This can only be specified for the first round.
If g is the value merge, then the data points selected in all subsets in previous rounds are merged, and the candidates found are merged as well.
If g is the value all, then all original data points are used, including those potentially ignored due to subsets and size options in the first round. The candidates found are merged as well.
steps = t
This specifies the number of C-steps done on each candidate in this round. The default is 2.
select = k
This specifies that the top k candidates go on to the next round. This is the only required option: there is no default, because this option needs to be included for every round (if the subsets option is used at all).
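To illustrate the structure of such a list (the data x and y are assumed to be defined as in the first example below; the round structure and all numbers here are purely illustrative and are not the command's defaults):
LeastTrimmedSquares(x, y, X, subsets = [
    # round 1: 5 disjoint subsets of 200 points, 100 random candidates per subset
    [subsets = 5, size = 200, generate = 100, steps = 2, select = 10],
    # round 2: merge the subsets and their surviving candidates
    [generate = merge, steps = 2, select = 10],
    # round 3: use all data points and run longer C-step iterations
    [generate = all, steps = 10, select = 1]]);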
The underlying computation is done in floating-point; therefore, all data points must have type realcons and all returned solutions are floating-point, even if the problem is specified with exact values. For more information about numeric computation in the Statistics package, see the Statistics/Computation help page.
Another method for robust regression is implemented in the RepeatedMedianEstimator procedure.
with(Statistics):
In this example, we have 1000 data points. There is a single independent variable, x, with values uniformly distributed between 0 and 10. The dependent variable is a linear function of the independent variable plus some additive noise, y = 5 x + 10 + noise, where the noise comes from a probability distribution known to produce severe outliers: the Cauchy distribution, with location parameter 0 and scale parameter 1.
x := Sample(Uniform(0, 10), 1000):
noise := Sample(Cauchy(0, 1), 1000):
y := (5*x + noise) +~ 10:
Here we see all data points:
plots:-setoptions(size = [0.7, "golden"]):
pp := PointPlot(y, xcoords = x):
pp;
Linear least squares regression will be severely affected by the outliers.
ls_regression_result := Fit(a*X + b, x, y, X);
    ls_regression_result := 3.44970682383807 X + 10.6816568681413
ls_deviation_from_model := (coeff(ls_regression_result, X, 1) - 5)^2 + (coeff(ls_regression_result, X, 0) - 10)^2;
    ls_deviation_from_model := 2.86806501793850
Least trimmed squares regression gets much closer to the true, noise-free line.
lts_regression_result := LeastTrimmedSquares(x, y, X);
    lts_regression_result := 5.03537530551575 X + 9.82475419561271
lts_deviation_from_model := (coeff(lts_regression_result, X, 1) - 5)^2 + (coeff(lts_regression_result, X, 0) - 10)^2;
    lts_deviation_from_model := 0.0319625041956812
The result is even better if we include 900 out of the 1000 points, instead of the default of a little over 500.
lts_900_regression_result := LeastTrimmedSquares(x, y, X, include = 900);
    lts_900_regression_result := 5.00862730339998 X + 10.0156318668695
lts_900_deviation_from_model := (coeff(lts_900_regression_result, X, 1) - 5)^2 + (coeff(lts_900_regression_result, X, 0) - 10)^2;
    lts_900_deviation_from_model := 0.000318785625780940
The other robust regression method, implemented in the RepeatedMedianEstimator procedure, also gets a good result.
rme_regression_result := RepeatedMedianEstimator(x, y, X);
    rme_regression_result := 10.0306661686300 + 5.00564125476873 X
rme_deviation_from_model := (coeff(rme_regression_result, X, 1) - 5)^2 + (coeff(rme_regression_result, X, 0) - 10)^2;
    rme_deviation_from_model := 0.000972237653807886
In order to visualize these results, we show the same point plot as before, including the four regression lines. The three regression lines from robust methods cannot be distinguished, but the least squares method is clearly off. We zoom in on the vertical range that includes most points.
plots:-display(pp, plot([ls_regression_result, lts_regression_result, lts_900_regression_result, rme_regression_result], X = 0 .. 10, legend = ["Least squares", "Least trimmed squares", "Least trimmed squares (900 points)", "Repeated median estimator"]), view = [0 .. 10, -10 .. 110]);
In the next example, we have 1000 input points with two independent variables, most of which follow the model z = 5 x - 3 y + 2 + noise, where the noise is relatively small. However, 10 of the points are far away in both the dependent and independent directions (and they have much greater noise). Such cases are particularly hard for non-robust methods to deal with.
We first see what this point cloud looks like, and then view the resulting fit.
xy := Sample(Normal(0, 1), [1000, 2]):
noise := Sample(Normal(0, 1), 1000):
xy[1 .. 10] := Sample(Normal(0, 10), [10, 2]);
    xy[1 .. 10] :=
        [ -7.25835457416460     4.46546251109489   ]
        [ -12.8116871270602     6.95701976222452   ]
        [  5.99013830391918    -0.0711220669337460 ]
        [  5.75154152981187    -24.7628908901018   ]
        [  9.09548316783929     3.96336774375604   ]
        [  8.73959176796014     10.8028281369976   ]
        [  10.7436650987406    -17.8756884460622   ]
        [  4.35976236509024    -11.7928910523180   ]
        [ -5.77171156203078     4.33504635234648   ]
        [  2.59252474066801    -10.1924397735107   ]
noise[1 .. 10] := Sample(Normal(0, 100), 10);
    noise[1 .. 10] := [100.800809666339, 8.96726548728334, -46.8119150928935,
        -259.969027745269, 19.0856519868389, -153.183461105396, -25.7014908594142,
        94.1751893325151, 75.9056085428043, -120.552449126529]
z := ((xy . <5, -3>) +~ 2) + noise^%T:
pp := plots:-pointplot3d(<xy | z>, color = black):
pp;
ls_regression_result := Fit(a*X + b*Y + c, xy, z, [X, Y]);
    ls_regression_result := 2.79396059575048 X - 0.846671122546459 Y + 1.81658520523718
lts_regression_result := LeastTrimmedSquares(xy, z, [X, Y]);
    lts_regression_result := 4.89975550872497 X - 3.01061932803024 Y + 2.16309086233403
plots:-display(pp, plot3d([ls_regression_result, lts_regression_result], X = -2 .. 2, Y = -2 .. 2, color = [red, blue]), view = [-2 .. 2, -2 .. 2, -5 .. 20], orientation = [25, 80, -60]);
Data sets of significant size can be handled in reasonable time. In the third example, we have 100000 points with 10 independent variables.
x := Sample(Uniform(0, 1), [100000, 10]):
y := x . <1, 2, 3, 4, 5, 6, 7, 8, 9, 10>:
y[1 .. 1000] := y[1 .. 1000] +~ 10:
CodeTools:-Usage(LeastTrimmedSquares(x, y, X));
    memory used=298.15MiB, alloc change=45.50MiB, cpu time=4.43s, real time=1.80s, gc time=1.56s
    1.00000000000002 X[1] + 1.99999999999999 X[2] + 2.99999999999999 X[3] + 4.00000000000000 X[4] + 5.00000000000000 X[5] + 5.99999999999999 X[6] + 7.00000000000000 X[7] + 8.00000000000002 X[8] + 8.99999999999999 X[9] + 10.0000000000000 X[10] + 8.20789049648978*10^(-16)
[1] Rousseeuw, Peter J. Least Median of Squares Regression. Journal of the American Statistical Association 79, 1984, pp. 871-880.
[2] Rousseeuw, Peter J., and Van Driessen, Katrien. Computing LTS Regression for Large Data Sets. Data Mining and Knowledge Discovery 12, 2006, pp. 29-45.
The Statistics[LeastTrimmedSquares] command was introduced in Maple 2019.
For more information on Maple 2019 changes, see Updates in Maple 2019.
See Also
Statistics/Regression
Statistics[RepeatedMedianEstimator]