Grouping Continuous Data

From MM*Stat International


Consider height data on 100 school boys. To gain an overview of the distribution of heights, you start ‘reading’ the raw data. But the typical person soon discovers that making sense of more than, say, 10 observations without some process of simplification is hardly feasible. Intuitively, one starts to group individuals with similar heights. By focusing on the size of these groups rather than on the raw data themselves, one gains an overview of the data. Even though detailed information about exact heights has been set aside, a clearer overall picture emerges.

Data sampled from continuous or quasi-continuous random variables can be condensed by partitioning the sample space into mutually exclusive classes. Counting the number of realizations falling into each of these classes provides a descriptive summary of the data. Grouping data into classes can greatly enhance our ability to ‘see’ the structure of the data, i.e. the distribution of the realizations over the sample space.

Classes are non-overlapping intervals specified by their upper and lower limits (class boundaries).

Loss of information arises from replacing the actual values by the sizes and locations of the classes into which they fall.

If one uses too few classes, useful patterns may be concealed. Too many classes may inhibit the expositional value of grouping.

Class boundaries

The upper and lower values of a class are called class boundaries. A class is fully specified by its lower boundary $x_j^l$ and upper boundary $x_j^u$, where $x_j^u = x_{j+1}^l$, i.e. the upper boundary of the $j$th class and the lower boundary of the $(j+1)$th class coincide. A value $x$ belongs to the $j$th class if $x_j^l \le x < x_j^u$ or, alternatively, if $x_j^l < x \le x_j^u$, i.e. the class boundary can be attributed to either of the classes it separates.

Example

| Lower boundary included | Upper boundary included |
|---|---|
| less than 10 | less than or equal to 10 |
| 10 to less than 12 | greater than 10 to less than or equal to 12 |
| 12 to less than 15 | greater than 12 to less than or equal to 15 |
| 15 or greater | greater than 15 |
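The two boundary conventions in the example can be sketched in Python. This is a minimal illustration, not part of the original text: the function names `class_index` and `class_frequencies` and the sample values in `data` are made up for the purpose, and the inner boundaries 10, 12 and 15 are taken from the example above.

```python
from bisect import bisect_left, bisect_right
from collections import Counter


def class_index(value, boundaries, closed="left"):
    """Return the 0-based index of the class into which `value` falls.

    `boundaries` are the sorted inner class boundaries; the first and
    last classes are open-ended.
    closed="left":  classes of the form  lower <= x <  upper
    closed="right": classes of the form  lower <  x <= upper
    """
    if closed == "left":
        # A value equal to a boundary goes to the class above it.
        return bisect_right(boundaries, value)
    if closed == "right":
        # A value equal to a boundary goes to the class below it.
        return bisect_left(boundaries, value)
    raise ValueError("closed must be 'left' or 'right'")


def class_frequencies(data, boundaries, closed="left"):
    """Count how many observations fall into each class."""
    counts = Counter(class_index(x, boundaries, closed) for x in data)
    return [counts.get(j, 0) for j in range(len(boundaries) + 1)]


boundaries = [10, 12, 15]                 # inner boundaries from the example
data = [8, 10, 11, 12, 12, 14, 15, 20]    # made-up observations

left = class_frequencies(data, boundaries, closed="left")
right = class_frequencies(data, boundaries, closed="right")
print(left)    # frequencies when the lower boundary belongs to the class
print(right)   # frequencies when the upper boundary belongs to the class
```

Only the observations lying exactly on a boundary (10, 12 and 15 here) change class between the two conventions; all other values are classified identically.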

When measurements of (theoretically) unbounded variables are being classified, the left- and/or right-most classes extend to $-\infty$ and $+\infty$, respectively, i.e. they form semi-open intervals.

Class width

Taking the difference between the two boundaries of a class yields the class width (sometimes referred to as the class size): $\Delta x_j = x_j^u - x_j^l$, where $x_j^l$ and $x_j^u$ denote the lower and upper boundary of the $j$th class. Classes need not be of equal width.

Class midpoint

The class midpoint $x_j = (x_j^l + x_j^u)/2$ can be interpreted as a representative value for the class if the measurements falling into it are evenly or symmetrically distributed.

Politicians and political scientists are interested in the income distribution. In Germany, a large portion of the population has taxable income. The 1986 data, compiled from various official sources, display a concentration in the small and medium income brackets. Relatively few individuals earned more than one million marks. Greater class widths have been chosen for the higher income brackets to retain a compact exposition despite the skewness in the data.

Source: Datenreport 1992, p. 255; Statistisches Jahrbuch der Bundesrepublik Deutschland 1993, p. 566
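| Income from (marks) | to less than (marks) | Persons (1000) | Consolidated gross income (million marks) |
|---|---|---|---|
| 1 | 4000 | 1445.2 | 2611.3 |
| 4000 | 8000 | 1455.5 | 8889.2 |
| 8000 | 12000 | 1240.5 | 12310.9 |
| 12000 | 16000 | 1110.7 | 15492.7 |
| 16000 | 25000 | 2762.9 | 57218.5 |
| 25000 | 30000 | 1915.1 | 52755.4 |
| 30000 | 50000 | 6923.7 | 270182.7 |
| 50000 | 75000 | 3876.9 | 234493.1 |
| 75000 | 100000 | 1239.7 | 105452.9 |
| 100000 | 250000 | 791.6 | 108065.7 |
| 250000 | 500000 | 93.7 | 31433.8 |
| 500000 | 1 million | 26.6 | 17893.3 |
| 1 million | 2 million | 8.6 | 11769.9 |
| 2 million | 5 million | 3.7 | 10950.8 |
| 5 million | 10 million | 0.9 | 6041.8 |
| 10 million | and more | 0.5 | 10749.8 |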
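Class width and class midpoint can be computed for a few rows of the income table, and the midpoint compared with the average income actually observed in each class. The following is a rough sketch under stated assumptions: the helper names `class_width` and `class_midpoint` are hypothetical, only a selection of rows is used, and the average income per person is derived here as gross income divided by the number of persons, a figure that does not appear in the source.

```python
def class_width(lower, upper):
    """Width of a class: upper boundary minus lower boundary."""
    return upper - lower


def class_midpoint(lower, upper):
    """Midpoint of a class: arithmetic mean of its two boundaries."""
    return (lower + upper) / 2


# Selected rows of the table:
# (lower, upper, persons in 1000, gross income in million marks)
rows = [
    (1, 4000, 1445.2, 2611.3),
    (4000, 8000, 1455.5, 8889.2),
    (8000, 12000, 1240.5, 12310.9),
    (30000, 50000, 6923.7, 270182.7),
    (100000, 250000, 791.6, 108065.7),
]

summary = []
for lower, upper, persons_k, income_mio in rows:
    # Derived quantity (not in the source): average income per person in marks.
    mean_income = income_mio * 1e6 / (persons_k * 1e3)
    summary.append({
        "class": (lower, upper),
        "width": class_width(lower, upper),
        "midpoint": class_midpoint(lower, upper),
        "mean_income": mean_income,
    })

for s in summary:
    print(f"{s['class']}: width={s['width']}, "
          f"midpoint={s['midpoint']:.1f}, mean income={s['mean_income']:.1f}")
```

In the narrow classes the midpoint lies close to the average income, whereas in the wide class from 100000 to 250000 marks the average (roughly 136500 marks) lies well below the midpoint of 175000. This illustrates why the midpoint is a good representative value only when the observations are spread evenly or symmetrically within the class.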