STEMBOX.TXT       J C Nash   1986

Graphical displays for univariate data
-     Frequency or relative frequency, and scale, are the main features.
-     The most common displays are bar charts (histograms).
-     Pie charts are also common for relative frequency data.
-     A stem & leaf diagram is both a graph and a table of the (original?) data.

EXAMPLE -- Shows how to include the cumulative frequency plot as well.
    An extended Stem-and-leaf diagram displaying miles per gallon data. 
    Entries with an asterisk are derived from metric raw data.  This
    example illustrates the possibility of preserving high precision of
    data, additional detail (metric origin of data) and the cumulative
    frequency diagram.

 1|  5.75                                                  
 6|  6.93*  7.17   7.70*  9.24   9.29*  9.48* 10.18* 10.85 
11| 11.01  11.21* 11.71  12.58* 13.72                      
16| 18.10* 19.01*                                          
21| 22.97* 25.94*                                          
26|                                                        
31|
36|                                                        
41| 43.08*                                                 
                                                           

 1|| 1 |  1 |X
 6|| 8 |  9 |XXXXXXXXX
11|| 5 | 14 |XXXXXXXXXXXXXX
16|| 2 | 16 |XXXXXXXXXXXXXXXX
21|| 2 | 18 |XXXXXXXXXXXXXXXXXX
26|| 0 | 18 |XXXXXXXXXXXXXXXXXX
31|| 0 | 18 |XXXXXXXXXXXXXXXXXX
36|| 0 | 18 |XXXXXXXXXXXXXXXXXX
41|| 1 | 19 |XXXXXXXXXXXXXXXXXXX
     ^    ^       Ogive
     |    |
     |    Cumulative Frequency (counted from the lowest stem)
     Frequency 
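
The frequency and cumulative frequency columns above can be reproduced
with a few lines of code.  The following is a minimal sketch in Python
(not part of the original notes).  The name stem_counts and its
parameters are illustrative, and it assumes stems of width 5 starting at
1, as in the example.

```python
# Miles-per-gallon values from the example (asterisks dropped).
MPG = [5.75, 6.93, 7.17, 7.70, 9.24, 9.29, 9.48, 10.18, 10.85,
       11.01, 11.21, 11.71, 12.58, 13.72, 18.10, 19.01,
       22.97, 25.94, 43.08]

def stem_counts(data, start=1, width=5):
    """Return (stems, frequencies, cumulative frequencies)."""
    nbins = int((max(data) - start) // width) + 1
    freq = [0] * nbins
    for x in data:
        # Each stem s holds the values in the interval [s, s + width).
        freq[int((x - start) // width)] += 1
    stems = [start + k * width for k in range(nbins)]
    cum, total = [], 0
    for f in freq:
        total += f
        cum.append(total)
    return stems, freq, cum

stems, freq, cum = stem_counts(MPG)
for s, f, c in zip(stems, freq, cum):
    # One X per observation accumulated so far gives the ogive.
    print(f"{s:2d}|| {f} | {c:2d} |{'X' * c}")
```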
 
Textbooks often give "rules" for drawing stem and leaf diagrams.  Note
that modifications may be made to suit needs.  Practice and comparison of
results to those produced by good software, e.g., Minitab, is the best
method for learning this technique.  More detail is given below.

Similarly for histograms, practice is worth more than abstract study. 
There is NO substitute for experience in judging an appropriate set of
break-points (class intervals) for the axis of a histogram (the categories
into which data will be put).  Be careful to label categories, especially
if the axis is not equispaced.

Warning: some graphs are drawn on the principle that area rather than
height is proportional to relative frequency.  Most practitioners prefer
to use bars of the same width and insist that height is the measure of
relative frequency.
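
To make the distinction concrete, here is a small sketch in Python (an
illustration only; the function name bar_heights and the sample numbers
are invented).  It computes both kinds of bar height for class intervals
of unequal width.  With equal widths the two differ only by a constant
factor.

```python
def bar_heights(counts, edges):
    """Return (relative_frequency, density) for each class interval.

    counts[k] is the number of observations in [edges[k], edges[k+1]).
    """
    n = sum(counts)
    rel = [c / n for c in counts]                  # height = relative frequency
    widths = [b - a for a, b in zip(edges, edges[1:])]
    dens = [r / w for r, w in zip(rel, widths)]    # area = relative frequency
    return rel, dens

# Two classes with 2 observations each: [0, 1) and [1, 3).
rel, dens = bar_heights([2, 2], [0, 1, 3])
print(rel)    # bars whose HEIGHT is the relative frequency
print(dens)   # bars whose AREA is the relative frequency
```

The second class is twice as wide, so under the area principle its bar is
drawn half as high even though both classes hold the same share of the
data.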

A histogram or stem and leaf diagram can be used to support statements about
the data, e.g., x% of the data is greater than y.  Obvious applications
are to situations such as "1% of property owners pay more than $10,000 in
taxes." However, we need to do more if we want to be able to make
statements such as "1% of property owners paid more than 20% of total
taxes" (this sort of information can be drawn from a bivariate plot called
a Q-Q plot).

Some practitioners prefer the cumulative frequency plot to the histogram,
because statements of the type above are easier to read from it.
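
Both kinds of statement can also be computed directly from the raw data.
The sketch below is in Python.  The function names and the tax figures
are invented for illustration and are not from any real data set.

```python
def fraction_above(data, y):
    """Fraction of the observations that are greater than y."""
    return sum(1 for x in data if x > y) / len(data)

def share_above(data, y):
    """Share of the TOTAL contributed by observations greater than y."""
    return sum(x for x in data if x > y) / sum(data)

taxes = [1, 2, 3, 4, 10]            # made-up values, arbitrary units
print(fraction_above(taxes, 3))     # "x% of owners pay more than 3"
print(share_above(taxes, 3))        # "... and they pay this share of the total"
```

The first function needs only counts, which is why it can be read from a
histogram or cumulative frequency plot.  The second needs the sizes of
the values as well.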

Other tools: BOXPLOTS, DOTPLOTS, HISTOGRAMS, OGIVE

The following sections are taken from documentation for a special project.

Exploratory graphical data analysis tools

i. Stem and leaf diagrams

Stem and leaf diagrams serve much the same function as histograms.  That
is, they allow a set of data y[i], i=1..n, to be sorted into BINS of a
given, usually equal, size for plotting.  In drawing histograms, one of
the major decisions is the size (or width) of the interval(s) which make
up the bins.  The stem and leaf diagram uses the natural scale of the
numbers in y[] as a guide to this choice.  Furthermore, it helps overcome
the possibility that all the data which fall in a bin actually lie near
the boundaries of that bin.  This is accomplished as follows.

The leading digit of each data element is used to determine which STEM
position will be used to plot that data element.  If the data is
distributed so that there are, say, 6 or more different first digits, then
we can use the first digit as the index for the stem.  Then the second
digit of each data element is plotted against the appropriate stem value. 
Thus, we might get

       1 29
       2 321456
       3 11778
       4 22
       5 3445
       6 0

from the data set 12, 19, 23, 22, 21, 24, 25, 26, 31, 31, 37, 37, 38, 42,
42, 53, 54, 54, 55, 60.
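
For two-digit data of this kind the construction is short enough to
sketch in Python.  This is an illustration only, not the project's
program.  The function name stem_and_leaf is invented, and the sketch
assumes non-negative two-digit integers.

```python
def stem_and_leaf(data):
    """Map each first digit (stem) to the string of second digits (leaves).

    Leaves are kept in the order in which the data arrive (unsorted).
    """
    rows = {}
    for x in data:
        stem, leaf = divmod(x, 10)
        rows[stem] = rows.get(stem, "") + str(leaf)
    return rows

data = [12, 19, 23, 22, 21, 24, 25, 26, 31, 31, 37, 37, 38, 42,
        42, 53, 54, 54, 55, 60]
rows = stem_and_leaf(data)
for stem in range(min(rows), max(rows) + 1):
    # Print empty stems too, so that the vertical scale stays uniform.
    print(f"       {stem} {rows.get(stem, '')}")
```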

There are a number of important details which make the program code for
stem and leaf displays quite difficult.  First, data is not always of the
right form so that the first digit conveniently provides an index to the
stem.  To overcome this, we may decide to allow fewer distinct "leaf"
digits on each stem line, namely 5, 2 or only 1, so that each stem digit
is spread over 2, 5 or 10 lines.  The program augments the stem indicator
as follows:

a) if only 1 leaf character is allowed, then we are really plotting a
histogram of the data, and use the data value for the stem and an asterisk
(*) for the leaf;

b) for a 2 character leaf display, if the stem indicator digit is D, then
the leaf digits

   0 or 1 are plotted on the stem line      D*
   2 or 3                                   DT   (T for TWO or THREE)
   4 or 5                                   DF   (F for FOUR or FIVE)
   6 or 7                                   DS   (S for SIX or SEVEN)
   8 or 9                                   D.

c) for a 5 character leaf display, if the stem indicator digit is D, then
the leaf digits

          0,1,2,3, or 4   are plotted on stem line         D
          5,6,7,8, or 9   are plotted on stem line         D*
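
Rules a) to c) amount to a small lookup from the leaf digit to a stem
label.  The following Python sketch shows one way to write it.  The
function stem_label and its argument names are hypothetical and are not
taken from the project's program.

```python
def stem_label(stem_digit, leaf_digit, leaves_per_line):
    """Label of the stem line on which a (stem, leaf) digit pair is plotted."""
    d = str(stem_digit)
    if leaves_per_line == 1:
        # Case (a): the whole data value labels the line; the leaf is "*".
        return d + str(leaf_digit)
    if leaves_per_line == 2:
        # Case (b): five lines per stem digit, marked *, T, F, S and "."
        return d + "*TFS."[leaf_digit // 2]
    if leaves_per_line == 5:
        # Case (c): two lines per stem digit, D for 0-4 and D* for 5-9.
        return d + ("" if leaf_digit < 5 else "*")
    raise ValueError("leaves_per_line must be 1, 2 or 5")

for leaf in (0, 3, 5, 6, 9):
    print(leaf, stem_label(3, leaf, 2), stem_label(3, leaf, 5))
```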

The data in the stem and leaf diagram may be sorted on each stem.

In general, I have found it convenient to use 2 or even 3 digit leaf
displays so that the original data is wholly preserved in the plot.  That
is, if all digits of each data element can be plotted in the diagram, then
the diagram preserves all the information in the original data.

Other modifications are possible.  For example, outliers can be left out
of the plot and simply listed separately.  This may be important, since
outliers can easily distort the display.

ii. Boxplots

Boxplots are a convenient method for displaying distributional information
for sets of data.  In particular they allow several sets of data to be
compared easily.

To draw a boxplot for a single set of data, we first need to sort the
data.  Suppose the data is in the array y[i], i=1..n, and sorted in
ascending order.  The DEPTH of the i'th point is the smaller of i and
(n+1-i), that is, the position of the point counted from the nearer end of
the data.  The depth of the MEDIAN is (n+1)/2.  (If n is odd, the median
is y[(n+1)/2]; if n is even, the median is any point between y[n/2] and
y[n/2 + 1], conventionally their average.  Thus the depth, (n+1)/2, is
correct in either case.) Let the depth of the median be d(M).
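
As a minimal sketch, depth and median can be written in Python as
follows.  The function names are illustrative.  Python lists are indexed
from 0, so the 1-based position i of the text is y[i - 1] in the code.
For even n the sketch takes the average of the two middle points, which
is one common choice among the "any point between" allowed above.

```python
def depth(i, n):
    """Depth of the i'th sorted point (1-based): distance from the nearer end."""
    return min(i, n + 1 - i)

def median(y):
    """Median of the sorted list y, located by its depth d(M) = (n+1)/2."""
    n = len(y)
    d_m = (n + 1) / 2
    lo = int(d_m)        # 1-based position at or just below d(M)
    hi = n + 1 - lo      # the mirror-image position from the other end
    # For odd n, lo == hi and this is simply y[(n+1)/2].
    return (y[lo - 1] + y[hi - 1]) / 2

print(depth(1, 25), depth(13, 25), depth(21, 25))
print(median([1, 2, 3]), median([1, 2, 3, 4]))
```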

We can now split each half of the data by defining the HINGEs.  The depth
of the hinges is

     d(H) = (int( d(M) ) + 1)/2

where int(x) is the integer part of x.  (N.B.  This is slightly different
from Mendenhall's formula.) Each hinge is at d(H) from either the i=1 or
i=n end of the y[ ] array.  The values of the observations which
correspond to the hinges are not necessarily at the same distance from the
median, so we have a tool for indicating the asymmetry of the distribution
of the y[] data.  Note that hinges are similar to quartiles.  The first
quartile, Q1, is defined for y[] so that 1/4 of the data points fall below
Q1.  Similarly, 1/4 of the y[] values fall above Q3.  (Q2 = M, the
median.) The use of the depth of the median in the definition of hinges
may cause the hinges to lie slightly closer to the median than the
quartiles, but no serious misinterpretation is likely to follow from
treating hinges as quartiles.  The arithmetic for calculating the hinges
is simpler than that for quartiles.  We will label the lower and upper
hinges HL and HU.
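
The hinge calculation can be sketched in Python as below.  The text does
not say what to do when d(H) is not a whole number.  The sketch ASSUMES
the usual convention of averaging the two neighbouring observations, and
the function name hinges is illustrative.

```python
def hinges(y):
    """Return (HL, HU) for the list y sorted in ascending order."""
    n = len(y)
    d_m = (n + 1) / 2               # depth of the median
    d_h = (int(d_m) + 1) / 2        # depth of the hinges
    lo = int(d_h)                   # 1-based position at or just below d(H)
    hi = lo if d_h == lo else lo + 1
    # Assumption: a fractional depth means "average the two neighbours".
    hl = (y[lo - 1] + y[hi - 1]) / 2    # counted from the i=1 end
    hu = (y[n - lo] + y[n - hi]) / 2    # counted from the i=n end
    return hl, hu

y = sorted([12, 19, 23, 22, 21, 24, 25, 26, 31, 31, 37, 37, 38, 42,
            42, 53, 54, 54, 55, 60])
print(hinges(y))
```

For these n = 20 points, d(M) = 10.5 and d(H) = 5.5, so each hinge is the
average of the 5th and 6th points from its end of the data.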

The simplest forms of boxplot display just 5 numbers

    y[1]  HL    M     HU    y[n]

in the following way
                                   ------------------
                          ---------I    +           I---
                                   ------------------
                       y[1]      HL    M           HU y[n]

which gives rise to the full "BOX AND WHISKER PLOT" name.  We shall
simplify this to BOXPLOT in accord with common usage by statisticians.

The "5-number summary", while useful, can be improved upon.  The simple
boxplot above is relatively helpful for data which is distributed in a
manner more or less like the traditional Gaussian (or Normal)
distribution.  However, in the current project, and in many other
applications, we may be very interested not only in the "usual" data, but
in OUTLIERS -- those unusual points which may be very important to us.

To get a good handle on outliers, let us define the H-SPREAD as

   H-spread = HU - HL

Then let us define INNER FENCES

   IFL = HL - 1.5 * H-spread
   IFU = HU + 1.5 * H-spread

and OUTER FENCES

   OFL = HL - 3 * H-spread
   OFU = HU + 3 * H-spread

Using the inner fences, define the ADJACENT VALUES as those points just
INSIDE the inner fences.  Thus

   AVL = minimum value y[i] such that y[i]>=IFL
   AVU = maximum value y[j] such that y[j]<=IFU

Data outside the inner fences is sometimes termed "outside"; that beyond
the outer fences is "far outside".  The refined boxplot is as follows
(note that AVU is a data value and need not equal IFU):

                         -------
       *        ---------I +   I---------      * *     O
                         -------
      y[1]     AVL      HL M   HU      AVU            y[n]

     OFL       IFL                       IFU        OFU

The markings are as follows:
    
 +      marks the median
 I      and the "box" mark the hinges
 ----   the "whiskers" mark the distance to the adjacent values
 *      is used to plot each "outside" point (outliers)
 O      is used to plot each "far outside" point (extreme outliers)

Not every plot will have * and O points.  In fact, they will be rare for
data which is Gaussian ("normal").
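
The fences, adjacent values and the two classes of outlier can be
computed as in the following Python sketch.  The function name, the
dictionary keys and the sample data are invented for illustration.  The
hinges are passed in as arguments, and points lying exactly on a fence
are treated as being inside it.

```python
def boxplot_fences(y, hl, hu):
    """Fences, adjacent values and outside points for sorted data y."""
    spread = hu - hl                                    # H-spread
    ifl, ifu = hl - 1.5 * spread, hu + 1.5 * spread     # inner fences
    ofl, ofu = hl - 3 * spread, hu + 3 * spread         # outer fences
    inside = [v for v in y if ifl <= v <= ifu]
    return {
        "inner": (ifl, ifu),
        "outer": (ofl, ofu),
        "adjacent": (min(inside), max(inside)),         # AVL, AVU
        "outside": [v for v in y
                    if (ofl <= v < ifl) or (ifu < v <= ofu)],   # plotted *
        "far_outside": [v for v in y if v < ofl or v > ofu],    # plotted O
    }

y = [2, 9, 10, 11, 12, 13, 14, 15, 21, 40]      # invented data, n = 10
result = boxplot_fences(y, hl=10, hu=15)        # hinges of y, since d(H) = 3
for key, value in result.items():
    print(key, value)
```

Here the upper adjacent value is 21 while the upper inner fence is 22.5,
which shows that the whisker ends at a data point and not at the fence.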


2000-9-19: Minitab adds a depth column to the stem and leaf display.

For example:

Stem-and-leaf of C1        N  = 25
Leaf Unit = 1.0


    3    8 124
    3    8 
    5    9 13
   10    9 56669
   (7)  10 0012234
    8   10 589
    5   11 0022
    1   11 6

    ^
    |
    This is the DEPTH column. 

The sorted data is:
sdata   
   81.842    82.166    84.787    91.312    93.215    95.740    96.107 
   96.347    96.404    99.748   100.719   100.958   101.521   102.142 
  102.431   103.608   104.532   105.050   108.608   109.979   110.743 
  110.893   112.184   112.755   116.556 

The largest observation (116.556) has DEPTH 1, counting from the high end
of the data (the last row of the display).
110.743 is the 5th number from the high end, so has DEPTH 5.
93.215 is the 5th number from the low end, so also has DEPTH 5.
The median is observation 13 (101.521). It is in the row with 7
observations, whose count (7) is shown in parentheses to mark the medial
row (the row with the median in it).
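
The depth column can be rebuilt from the sorted data with the Python
sketch below.  It imitates the convention described above and is NOT
Minitab's own code.  The function names are invented, and row_counts
assumes positive data, a leaf unit of 1.0 and two rows per stem (leaves
0-4 and 5-9), as in this display.

```python
SDATA = [81.842, 82.166, 84.787, 91.312, 93.215, 95.740, 96.107,
         96.347, 96.404, 99.748, 100.719, 100.958, 101.521, 102.142,
         102.431, 103.608, 104.532, 105.050, 108.608, 109.979, 110.743,
         110.893, 112.184, 112.755, 116.556]

def row_counts(data):
    """Number of observations in each display row (each row spans 5 units)."""
    keys = [int(x) // 5 for x in data]      # truncate, then group by fives
    return [keys.count(k) for k in range(min(keys), max(keys) + 1)]

def depth_column(counts):
    """Depth label for each row, given the row counts from low to high."""
    n = sum(counts)
    lo = (n + 1) // 2       # position of the lower middle observation
    hi = n + 1 - lo         # position of the upper middle (same if n is odd)
    labels, before = [], 0
    for c in counts:
        after = before + c
        if before < lo and hi <= after:
            labels.append(f"({c})")     # medial row: show its own count
        else:
            # Cumulative count from whichever end of the data is nearer.
            labels.append(str(min(after, n - before)))
        before = after
    return labels

counts = row_counts(SDATA)
for label, c in zip(depth_column(counts), counts):
    print(f"{label:>7} {c}")
```

If n is even and the two middle observations fall in different rows, this
sketch marks no row with parentheses.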

