# Grouping Continuous Data

Consider height data on 100 school boys. To gain an overview of the distribution of heights, you start ‘reading’ the raw data. But the typical person will soon discover that making sense of more than, say, 10 observations without some process of simplification is not feasible. Intuitively, one starts to group individuals with similar heights. By focussing on the size of these groupings rather than on the raw data itself, one gains an overview of the data. Even though one has set aside detailed information about exact heights, one has created a clearer overall picture.

Data sampled from continuous or quasi-continuous random variables can be condensed by partitioning the sample space into mutually exclusive classes. Counting the number of realizations falling into each of these classes provides a descriptive summary of the data. Grouping data into classes can greatly enhance our ability to ‘see’ the structure of the data, i.e. the distribution of the realizations over the sample space.

Classes are non-overlapping intervals specified by their upper and lower limits (class boundaries). Loss of information arises from replacing the actual values by the sizes and locations of the classes into which they fall. If one uses too few classes, useful patterns may be concealed. Too many classes may inhibit the expositional value of grouping.

## Class boundaries

The upper and lower values of a class are called class boundaries. A class $j$ is fully specified by its lower boundary $x_{j}^{l}$ and upper boundary $x_{j}^{u}$ $\left(j=1,\ldots ,k\right)$, where $x_{j}^{u}=x_{j+1}^{l}\quad \left(j=1,\ldots ,k-1\right)$, i.e. the upper boundary of the $j$th class and the lower boundary of the $(j+1)$th class coincide. An observation $x$ is assigned to class $j$ if $x_{j}^{l}<x\leq x_{j}^{u}$ or, alternatively, if $x_{j}^{l}\leq x<x_{j}^{u}$, i.e. the class boundary can be attributed to either of the classes it separates, as long as the same convention is used for all classes.

Example:
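
| Lower boundary included (in words) | Lower boundary included | Upper boundary included | Upper boundary included (in words) |
|---|---|---|---|
| less than 10 | $<10$ | $\leq 10$ | less than or equal to 10 |
| 10 to less than 12 | $\geq 10,<12$ | $>10,\leq 12$ | greater than 10 to less than or equal to 12 |
| 12 to less than 15 | $\geq 12,<15$ | $>12,\leq 15$ | greater than 12 to less than or equal to 15 |
| 15 or greater | $\geq 15$ | $>15$ | greater than 15 |
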
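
The two boundary conventions in the example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the interior boundaries 10, 12 and 15 come from the example, while the function names `class_index` and `class_counts` and the sample values are hypothetical and not part of the original text.

```python
from bisect import bisect_left, bisect_right
from collections import Counter

# Interior boundaries from the example; the outermost classes are open-ended.
BOUNDARIES = [10, 12, 15]


def class_index(x, boundaries, closed="left"):
    """Return the 0-based index of the class that x falls into.

    closed="left":  lower boundary included, e.g. "10 to less than 12"
    closed="right": upper boundary included, e.g. "greater than 10 to 12"
    """
    if closed == "left":
        # A value equal to a boundary is attributed to the class above it.
        return bisect_right(boundaries, x)
    if closed == "right":
        # A value equal to a boundary is attributed to the class below it.
        return bisect_left(boundaries, x)
    raise ValueError("closed must be 'left' or 'right'")


def class_counts(data, boundaries, closed="left"):
    """Count how many observations fall into each class."""
    counts = Counter(class_index(x, boundaries, closed) for x in data)
    return [counts.get(j, 0) for j in range(len(boundaries) + 1)]


# Hypothetical sample, chosen so that some values sit exactly on a boundary.
sample = [8, 10, 11, 12, 14.5, 15, 20]
print(class_counts(sample, BOUNDARIES, closed="left"))   # [1, 2, 2, 2]
print(class_counts(sample, BOUNDARIES, closed="right"))  # [2, 2, 2, 1]
```

The two calls differ only for observations that lie exactly on a boundary (here 10, 12 and 15), which is why the convention has to be fixed before counting.
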
When measurements of (theoretically) unbounded variables are being classified, the left- and/or right-most classes extend to $-\infty$ and $+\infty$, respectively, i.e. they form half-open (semi-infinite) intervals.

## Class width

Taking the difference between the two boundaries of a class yields the class width (sometimes referred to as the class size):

$\Delta x_{j}=x_{j}^{u}-x_{j}^{l}\quad \left(j=1,\ldots ,k\right)$

Classes need not be of equal width.

## Class midpoint

The class midpoint $x_{j}$ is the average of the two class boundaries:

$x_{j}={\frac {x_{j}^{l}+x_{j}^{u}}{2}}\quad \left(j=1,\ldots ,k\right)$

It can be interpreted as a representative value for the class if the measurements falling into it are evenly or symmetrically distributed.

Politicians and political scientists are interested in the income distribution. In Germany, a large portion of the population has taxable income. The 1986 data, compiled from various official sources, displays a concentration in small and medium income brackets. Relatively few individuals earned more than one million marks. Greater class widths have been chosen for higher income brackets to retain a compact exposition despite the skewness in the data.
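
The formulas for class width and class midpoint can be checked with a short sketch. The boundaries below are hypothetical heights in centimetres with deliberately unequal widths. They are not taken from the original data, and the function names `class_widths` and `class_midpoints` are likewise only illustrative. The sketch assumes all boundaries are finite, since open-ended outer classes have neither a finite width nor a finite midpoint.

```python
def class_widths(boundaries):
    """Width of each class: upper boundary minus lower boundary."""
    return [upper - lower for lower, upper in zip(boundaries, boundaries[1:])]


def class_midpoints(boundaries):
    """Midpoint of each class: average of its lower and upper boundary."""
    return [(lower + upper) / 2 for lower, upper in zip(boundaries, boundaries[1:])]


# Hypothetical boundaries (cm): k + 1 boundary values define k adjacent classes,
# because the upper boundary of class j is the lower boundary of class j + 1.
height_boundaries = [150, 160, 165, 170, 180, 200]

print(class_widths(height_boundaries))     # [10, 5, 5, 10, 20]
print(class_midpoints(height_boundaries))  # [155.0, 162.5, 167.5, 175.0, 190.0]
```

Storing a single sorted list of boundaries, rather than a separate lower and upper value per class, enforces that adjacent classes share a boundary and cannot overlap.
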