Unit E: Statistics and Probability

Chapter 1: Statistics


Outliers and Trimmed Mean

When a set of data contains outliers (values much larger or smaller than the rest of the data), the mean can be misleading. That is, the outliers distort, or skew, the mean.


Ten coworkers are sitting at a restaurant table in downtown Calgary. Nine of the workers have annual incomes of $75 000. The tenth individual is a successful businessman who earns $5 000 000 per year. The mean of their annual incomes is $567 500. This figure does not represent what the typical person at the table earns. The data value of $5 000 000 skews the mean so that it is much higher than the $75 000 salary that nine of the ten people at the table earn.
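
As a quick check of the arithmetic in this example, the following minimal Python sketch computes the mean with and without the $5 000 000 income. The variable names are illustrative only and are not part of the lesson.

```python
from statistics import mean

# Annual incomes from the restaurant example, in dollars.
coworkers = [75_000] * 9            # the nine typical incomes
incomes = coworkers + [5_000_000]   # add the businessman's income

mean_all = mean(incomes)            # mean including the outlier
mean_without = mean(coworkers)      # mean of the nine typical incomes

print(mean_all)       # 567500
print(mean_without)   # 75000
```

A single extreme value moves the mean from $75 000 to $567 500, even though nine of the ten incomes are unchanged.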

To keep outliers from skewing the mean, a trimmed mean can be used to represent the average of the set of data. A trimmed mean removes an equal number of the largest and smallest data values before the mean is calculated.

Calculate the trimmed mean in the following set of data.

10, 15, 2, 13, 12, 13, 16, 14, 15, 11, 4, 14, 15, 14


To calculate the trimmed mean, first arrange the values in ascending order. Then remove an equal number of values from the top and the bottom of the ordered data. In this set of data, the values 2 and 4 are much smaller than the other numbers.

Therefore, the two lowest values (2 and 4) and the two highest values (15 and 16) are excluded.

2, 4, 10, 11, 12, 13, 13, 14, 14, 14, 15, 15, 15, 16

The trimmed mean of this set of data is

\[
\bar{x} = \frac{\text{sum of the values in the data set}}{\text{total number of values in the data set}} = \frac{10 + 11 + 12 + 13 + 13 + 14 + 14 + 14 + 15 + 15}{10} = \frac{131}{10} = 13.1
\]

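The sort, trim, and average steps used in this example can be sketched in Python. The function name `trimmed_mean` and its parameter `k` (the number of values removed from each end) are hypothetical names chosen for illustration.

```python
def trimmed_mean(values, k):
    """Return the mean after removing the k smallest and k largest values."""
    if k < 0 or 2 * k >= len(values):
        raise ValueError("k must leave at least one value to average")
    ordered = sorted(values)              # arrange in ascending order
    kept = ordered[k:len(ordered) - k]    # drop k values from each end
    return sum(kept) / len(kept)


# The 14 data values from the example above.
data = [10, 15, 2, 13, 12, 13, 16, 14, 15, 11, 4, 14, 15, 14]

result = trimmed_mean(data, 2)   # remove 2 and 4, and 15 and 16
print(result)                    # 13.1
```

Removing the same number of values from each end keeps the calculation balanced, which is why the function takes a single count `k` rather than separate counts for the top and bottom.
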
A figure skating competition produces the following scores: 7.8, 8.1, 8.3, 7.5, 9.9.

  1. Calculate the mean and the trimmed mean. Round to the nearest hundredth.
  2. What is the purpose of using the trimmed mean instead of the mean?

  1. To calculate the mean, use the following formula:

    \[
    \bar{x} = \frac{\text{sum of the values in the data set}}{\text{total number of values in the data set}} = \frac{x_1 + x_2 + x_3 + x_4 + x_5}{n} = \frac{7.8 + 8.1 + 8.3 + 7.5 + 9.9}{5} = \frac{41.6}{5} = 8.32
    \]

    The outlier is 9.9. The top and bottom values, 7.5 and 9.9, are removed to eliminate the outlier before finding the trimmed mean.

    Trimmed mean:

    \[
    \bar{x} = \frac{\text{sum of the values in the data set}}{\text{total number of values in the data set}} = \frac{7.8 + 8.1 + 8.3}{3} = \frac{24.2}{3} \approx 8.07
    \]

  2. Calculating the trimmed mean can reduce the effects of outliers on a data set.
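
The figure skating calculation can be checked with a short Python sketch. It assumes, as in the solution above, that exactly one score is removed from each end. The variable names are illustrative.

```python
# Scores from the figure skating example.
scores = [7.8, 8.1, 8.3, 7.5, 9.9]

# Arrange in ascending order, then drop the lowest and highest score.
ordered = sorted(scores)
kept = ordered[1:-1]

# Round both results to the nearest hundredth, as the question asks.
mean_all = round(sum(scores) / len(scores), 2)
mean_trimmed = round(sum(kept) / len(kept), 2)

print(mean_all)      # 8.32
print(mean_trimmed)  # 8.07
```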

Quentin is a real estate agent in a town of 5 000 people. To be effective, he needs to know the average house price in his town for new and current clients. Currently, he has houses posted at $315 000, $299 900, $283 000, $315 000, $277 000, $269 900, and $230 900.

  1. Calculate the mean, median, and mode of the house prices.

  2. Of the three measures of central tendency, which should Quentin provide to his clients as the average house price in his town? Explain.

  3. Quentin listed another house at $645 000. Calculate the new mean, median, and mode.

  4. Compare the two means. How is the mean affected by the new listing?

  5. Compare the two medians. How is the median affected by the new listing?

  6. Compare the two modes. How is the mode affected by the new listing?

  7. Which of the three new measures of central tendency should Quentin provide to his clients as the average house price in his town? Explain.

  1. To calculate the mean, use the following formula:

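    \[
    \bar{x} = \frac{\text{sum of the values in the data set}}{\text{total number of values in the data set}} = \frac{x_1 + x_2 + x_3 + x_4 + x_5 + x_6 + x_7}{n} = \frac{315\,000 + 299\,900 + 283\,000 + 315\,000 + 277\,000 + 269\,900 + 230\,900}{7} = \frac{1\,990\,700}{7} \approx 284\,385.71
    \]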

    The mean house price is $284 385.71.

    To find the median, arrange the data in ascending order.

    $230 900, $269 900, $277 000, $283 000, $299 900, $315 000, $315 000

    There are 7 data values. As there is an odd number of data values, the median is the middle value. Therefore, the median house price is $283 000.

    Recall that the mode is the most frequently occurring data value. Therefore, the mode house price is $315 000.

  2. The mean ($284 385.71) and the median ($283 000) are both good representations of the average because they are close in value. The mode is not a useful average because it only indicates that more than one house is being sold for $315 000.

  3. New mean:

    \[
    \bar{x} = \frac{\text{sum of the values in the data set}}{\text{total number of values in the data set}} = \frac{x_1 + x_2 + x_3 + x_4 + x_5 + x_6 + x_7 + x_8}{n} = \frac{315\,000 + 299\,900 + 283\,000 + 315\,000 + 277\,000 + 269\,900 + 230\,900 + 645\,000}{8} = \frac{2\,635\,700}{8} = 329\,462.5
    \]

    The new mean house price is $329 462.50.

    New median:

    Arrange the data in ascending order.

    $230 900, $269 900, $277 000, $283 000, $299 900, $315 000, $315 000, $645 000

    There are 8 data values. Since there is an even number of data values, the median is the average of the two middle values; i.e., the fourth and fifth data values.

    The two middle values are $283 000 and $299 900.

    \[
    \text{Median} = \frac{\text{sum of the two middle values}}{2} = \frac{283\,000 + 299\,900}{2} = \frac{582\,900}{2} = 291\,450
    \]

    The new median house price is $291 450.

    New mode:

    The mode house price is still $315 000.

  4. The mean increases by over $45 000 when the new listing is added. An outlier with a much higher value than the rest of the data increases the mean substantially.

  5. The median increases by $8 450 when the new listing is added, which is far less than the change in the mean.

  6. The mode remains $315 000. Adding an outlier changes the mode only if the new value creates or contributes to a different most frequent value.

  7. As the mean is easily affected by outliers, it should not be used. The average housing price should be the median, which indicates that half the house prices are above and half are below that middle value. Mode is not a useful average since it only communicates that more than one house is listed at a particular price.
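
The comparison in this problem can be reproduced with Python's standard `statistics` module. This is a minimal sketch: the helper `summarize` and the variable names are hypothetical and used only for illustration.

```python
from statistics import mean, median, mode


def summarize(prices):
    """Return the mean (to the nearest cent), median, and mode of the prices."""
    return {
        "mean": round(mean(prices), 2),
        "median": median(prices),
        "mode": mode(prices),
    }


# Quentin's original seven listings, then the same list with the new house.
listings = [315_000, 299_900, 283_000, 315_000, 277_000, 269_900, 230_900]
with_new = listings + [645_000]

before = summarize(listings)
after = summarize(with_new)

# How much each measure moves when the $645 000 listing is added.
mean_change = round(after["mean"] - before["mean"], 2)
median_change = after["median"] - before["median"]

print(before)
print(after)
print(mean_change)    # about 45076.79
print(median_change)  # 8450.0
```

The mean moves by roughly $45 000 and the median by $8 450, while the mode stays at $315 000, which matches the answers above.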