Statistics
SplitByColumn
split matrix data into submatrices by values of one column
Calling Sequence
Parameters
Description
Options
Examples
Compatibility
SplitByColumn(M, n, options)
SplitByColumn(L, n, options)
M
-
DataFrame or Matrix of data
n
index or label of the 'key' column
L
list of Vectors or other ordered one-dimensional data sets; all data sets should have the same number of elements
options
(optional) equation(s) of the form option=value where option is one of bounds, output, or ignore; specify options for the SplitByColumn function
The SplitByColumn function splits a DataFrame or Matrix of data into a list of submatrices. Data placement into a submatrix is decided on a row by row basis, depending on the value in a particular column (the 'key' column).
This is useful if the values in any one row are related and data in any one column are of the same type. This type of data is often stored in a DataFrame object. For example, data about house sales in a region may be organized in three columns giving the price, number of bedrooms, and surface area of each house, where the three values belonging to any one sale are in the same row. This function allows you to separate your data into submatrices with all data for sales involving one bedroom houses, two bedroom houses, etc. Many other Statistics commands can then find properties for these submatrices individually. For more details and a list of applicable Statistics commands, see Statistics/DataFrames.
Alternatively, if your data is organized into a number of Vectors for each of the types of data (corresponding to the columns in the DataFrame case), you can also submit it as a list of Vectors. In this case, n is the index of the 'key' Vector in the list L of Vectors.
The options argument can contain one or more of the options shown below.
bounds - By default, SplitByColumn creates a separate submatrix for every different value in the key column of M (as determined by n). This may not be what you want, especially if your key data is more or less continuous (such as the house prices in the example above). In this case, you can use the bounds option to specify boundaries of 'bins' into which values are collected. There are several different ways to use this option:
bounds=none is the default and leads to a separate 'bin' for every value in the key column.
bounds=b0,b1,...,bn means values from the key column will be divided into n bins: from b0 to b1, from b1 to b2, etc. Values less than b0 or greater than bn, and the rows containing them, are discarded.
These intervals are closed on the left and open on the right; that is, if the exact value b1 occurs in the data, then that row goes into the second matrix, not the first one. The rightmost bin is an exception: it represents an interval closed on both sides.
Instead of using a list b0,b1,...,bn, you can also use a Vector with the same entries. This has the same effect.
bounds=quantiles⁡q0,q1,...,qn or bounds=deciles⁡d0,d1,...,dn or bounds=percentiles⁡p0,p1,...,pn or bounds=quartiles⁡q0,q1,...,qn are equivalent ways of specifying bins similar to specifying a list, but instead of specifying absolute numbers for the bounds, you specify which fraction of the data should fall in each of the bins.
For example, specifying bounds=percentiles⁡0,25,60,90 specifies three bins, the first of which contains the lowest 25% of data, then then next 60% - 25% = 35% of data, and finally the following 90% - 60% = 30% of data. That is, the bounds are at the 0th, 25th, 60th, and 90th Percentile of the data in the key column. This could also be specified as bounds=quantiles⁡0,0.25,0.6,0.9 using the notion of Quantile instead of Percentile, or bounds=deciles⁡0,2.5,6,9 using Deciles, or bounds=quartile⁡0,1,2.4,3.6 using Quartiles.
bounds=n for a positive integer n means values from the key column will be divided into n bins of equal width.
output = default or vectorlist or matrix can be used to override the output format for each submatrix: a DataFrame/Matrix or a list of column Vectors. The default is to return the same type of data as was submitted. That is, for the calling sequence involving M, SplitByColumn returns a list of matrices by default, and for the calling sequence involving L, it returns a list of lists of Vectors. This can be overridden with this option.
In either case, the internal computations occur with a Matrix, so input and/or output of a list of Vectors costs some time and memory for conversion. (This is usually negligible, though.)
ignore = true or false specifies how undefined values in the key column should be treated. With ignore = true, all undefined values (and the corresponding rows) will be removed from the data set. With ignore = false (the default), the behavior of SplitByColumn depends on the value of bounds: with bounds = none, any occurring undefined values in the key column will give rise to a submatrix with undefined values. If multiple different values of type,undefined occur, then every value gets its own submatrix. With all other settings of bounds, undefined values will end up in an arbitrary submatrix.
with⁡Statistics:
We construct a Matrix with housing data. The first column has number of bedrooms, the second has number of square feet, the third has price.
We construct a DataFrame with housing data. The first column has number of bedrooms, the second has the area in square feet, the third has price.
bedrooms≔3,4,2,4,3,2,2,3,4,4,2,4,4,3,3
area≔1130,1123,1049,1527,907,580,878,1075,1040,1295,1100,995,908,853,856
price≔114700,125200,81600,127400,88500,59500,96500,113300,104400,136600,80100,128000,115700,94700,89400
HouseSalesData≔DataFrame⁡bedrooms,area,price,columns=Bedrooms,Area,Price
We can create box plots of the price for subgroups of sales defined by number of bedrooms.
ByRooms≔SplitByColumn⁡HouseSalesData,Bedrooms
ByRooms≔BedroomsAreaPrice3210498160062580595007287896500112110080100,BedroomsAreaPrice13113011470053907885008310751133001438539470015385689400,BedroomsAreaPrice2411231252004415271274009410401044001041295136600124995128000134908115700
PricesByRooms≔map⁡m↦mPrice,ByRooms:
BoxPlot⁡PricesByRooms,deciles=false,datasetlabels=2 bdrms,3 bdrms,4 bdrms,color=Red,Purple,Blue
plotsdisplay⁡zip⁡m,c↦PointPlot⁡mPrice,xcoords=mArea,color=c,symbolsize=20,ByRooms,Red,Purple,Blue
Looking at this plot, we see that there is one house that is especially small and cheap, and two that are especially large and expensive, and the rest are clustered fairly closely together. Let's split that cluster into two subgroups according to their area. There is a small gap around 950 square feet.
ByArea≔SplitByColumn⁡HouseSalesData,Area,bounds=700,950,1250
ByArea≔BedroomsAreaPrice539078850072878965001349081157001438539470015385689400,BedroomsAreaPrice13113011470024112312520032104981600831075113300941040104400112110080100124995128000
PricesByArea≔map⁡m↦mPrice,ByArea
PricesByArea≔5885007965001311570014947001589400,1114700212520038160081133009104400118010012128000
BoxPlot⁡PricesByArea,deciles=false,datasetlabels=700-950,950-1250,color=Green,Cyan
We can compute the means of the columns in an individual data set by using the Mean command. To apply it to the list of DataFrame objects returned by SplitByColumn, we can use the elementwise version of Mean, obtained by appending a tilde.
Mean⁡HouseSalesData
Bedrooms3.13333333333333Area1021.06666666667Price103706.666666667
`~`Mean⁡ByArea
Bedrooms3.Area880.400000000000Price96960.,Bedrooms3.14285714285714Area1073.14285714286Price106757.142857143
The same thing can be done in one step with the Aggregate method of the DataFrame object. Since this command is part of the DataFrame object, it does not accept Matrices.
Aggregate⁡HouseSalesData,Bedrooms
BedroomsAreaPrice12901.75000000000079425.23964.200000000000100120.341148.122883.333333333
We can also split the data into four roughly equally large groups by price. The output is lists of vectors this time.
ByPrice≔SplitByColumn⁡HouseSalesData,3,bounds=quartiles⁡0,1,2,3,4,output=vectorlist
ByPrice≔2322,10499075801100,81600885005950080100,233,878853856,965009470089400,3344,113010751040908,114700113300104400115700,4444,112315271295995,125200127400136600128000
Let's examine the areas of these houses.
AreasByPrice≔map⁡x↦x2,ByPrice
AreasByPrice≔10499075801100,878853856,113010751040908,112315271295995
BoxPlot⁡AreasByPrice,deciles=false,datasetlabels=1st quartile,2nd quartile,3rd quartile,4th quartile,color=Red..Yellow
There are undefined values in the following Matrix. If we do not supply a bounds argument, then the undefined values in the third column are grouped into a separate bin.
data≔Matrix⁡1,2,3,4,5,undefined,7,undefined,9,10,11,undefined,13,8,3
data≔12345undefined7undefined91011undefined1383
SplitByColumn⁡data,3
1231383,7undefined9,45undefined1011undefined
If we do supply a bounds argument, then the result is not well-defined. In this case, the row with value 9 in the key column disappears, and undefined values are placed in the first half of data.
SplitByColumn⁡data,3,bounds=quantiles⁡0,12,1
45undefined1011undefined,1231383
With the ignore option, the undefined values are removed from the key column. Note that the undefined value in the second column remains untouched.
SplitByColumn⁡data,3,bounds=quantiles⁡0,23,1,ignore
1231383,7undefined9
The Statistics[SplitByColumn] command was introduced in Maple 16.
For more information on Maple 16 changes, see Updates in Maple 16.
See Also
Statistics[Computation]
Statistics[Sort]
Statistics[Tally]
Statistics[TallyInto]
Download Help Document