# Descriptive Statistics

Essay by pfoster282005, University, Bachelor's, March 2009

What is the best way to measure our data on the 2005 Major League Baseball statistics? We have several ways to gauge the central tendency of our data. We can add up all the data and divide by the number of values to get the arithmetic mean, or average. We can find the most frequently recurring number in the set, which is the mode. We can also order the data from lowest to highest and take the number in the middle, the median, as another measure of central tendency.
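As a minimal sketch of the three measures described above, the following uses Python's standard `statistics` module. The payroll figures are hypothetical and chosen only for illustration; they are not the 2005 MLB data.

```python
import statistics

# Hypothetical payrolls in millions of dollars (illustrative only,
# not the 2005 MLB data used in this paper).
payrolls = [40, 55, 55, 60, 70, 80, 200]

mean_value = statistics.mean(payrolls)      # sum of the values divided by the count
median_value = statistics.median(payrolls)  # middle value after sorting
mode_value = statistics.mode(payrolls)      # most frequently occurring value

print("mean:", mean_value)
print("median:", median_value)
print("mode:", mode_value)
```

In this made-up list the single large value (200) pulls the mean up to 80, while the median stays at 60.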

Figure: 2005 MLB data; team salary (in millions) in blue and team wins in red.

If we use the mean to measure the central tendency of our data, we obtain an average team salary of about 73 million and an average of 81 wins. The problem with using the mean is that outliers produce a higher average salary per team. The middle half of the teams' salaries lies between about 50 million and 88 million, and the outlier that affects the central tendency is the New York Yankees.

An outlier is defined as a value less than Q1 - 1.5H or greater than Q3 + 1.5H, and an extreme outlier as a value less than Q1 - 3.0H or greater than Q3 + 3.0H, where Q1 and Q3 are the first and third quartiles and H = Q3 - Q1 is the interquartile range. The Yankees' team salary is 208 million dollars, which is well above the 73 million average team salary. It lies outside the interquartile range and more than 3.0H above the 75th percentile (Q3), which qualifies it as an extreme outlier.
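The fence calculation can be sketched as follows, using the quartiles and the Yankees' salary reported in this paper. The sketch assumes the usual convention that lower fences are measured from the first quartile and upper fences from the third quartile; the function name `classify` is a hypothetical helper, not part of any library.

```python
# Quartiles and salary taken from the descriptive statistics in this paper.
q1 = 50_292_565.50
q3 = 87_573_983.75
yankees_salary = 208_306_817


def classify(value, q1, q3):
    """Label a value using inner (1.5 H) and outer (3.0 H) fences, H = Q3 - Q1."""
    h = q3 - q1
    if value < q1 - 3.0 * h or value > q3 + 3.0 * h:
        return "extreme outlier"
    if value < q1 - 1.5 * h or value > q3 + 1.5 * h:
        return "outlier"
    return "not an outlier"


upper_inner_fence = q3 + 1.5 * (q3 - q1)  # about 143.5 million
upper_outer_fence = q3 + 3.0 * (q3 - q1)  # about 199.4 million
label = classify(yankees_salary, q1, q3)

print(upper_inner_fence, upper_outer_fence, label)
```

Under these assumptions the upper outer fence is roughly 199.4 million, so a salary of 208.3 million falls beyond it.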

The histogram of the salary variable with all MLB teams included is skewed right, again because of the influence of the New York Yankees' team salary.

For team wins, the mean is a better representation of central tendency. The wins data do not have any outliers that would skew the average to the high side or the low side.

Another option is to use the mode to measure central tendency. To find the mode we look for the most frequently recurring number in the data set. The problem with the mode is that no number recurs in the team salaries. Some salaries are close but not identical, so the mode does not work for team salaries, although it can be computed for team wins. When we use the mode for team wins we obtain three values: 95, 83, and 67 wins. The high value of 95 and the low value of 67 are misleading because they do not represent the typical number of wins. Sixty-seven wins is one of the lowest totals for the 2005 MLB season, whereas 95 wins is close to the top. Because the mode gives us three different numbers, it is not a good choice of measurement either.
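A short sketch of why the mode is awkward here, using `statistics.multimode` (available in Python 3.8 and later). Both lists are hypothetical values chosen to mimic the two situations described above; they are not the actual 2005 figures.

```python
from statistics import multimode

# Hypothetical win totals in which three values tie for most frequent.
wins = [95, 95, 83, 83, 67, 67, 81, 79, 90]

# Hypothetical payrolls (millions) in which no value repeats.
salaries = [48.6, 66.2, 73.1, 101.3, 208.3]

win_modes = multimode(wins)         # every value tied for the highest count
salary_modes = multimode(salaries)  # with no repeats, every value is returned

print(win_modes)
print(salary_modes)
```

When no value repeats, `multimode` returns the whole list, which gives no single representative number.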

Descriptive statistics

| Statistic | salaries |
| --- | --- |
| count | 30 |
| mean | 73,063,563.267 |
| sample variance | 1,171,964,722,279,960.000 |
| sample standard deviation | 34,233,970.297 |
| minimum | 29,679,067 |
| maximum | 208,306,817 |
| range | 178,627,750 |
| population variance | 1,132,899,231,537,290.000 |
| population standard deviation | 33,658,568.471 |
| standard error of the mean | 6,250,239.255 |
| skewness | 2.174 |
| kurtosis | 7.571 |
| coefficient of variation (CV) | 46.86% |
| 1st quartile | 50,292,565.500 |
| median | 66,191,416.500 |
| 3rd quartile | 87,573,983.750 |
| interquartile range | 37,281,418.250 |
| mode | #N/A |
| low extremes | 0 |
| low outliers | 0 |
| high outliers | 0 |
| high extremes | 1 |
| suggested interval width | 10,000,000 |

The most useful measure of the central tendency of our data is the median. To find the median we order the data from lowest to highest. Because our data set has an even number of values, we take the two numbers in the middle, add them, and divide by two; in other words, we use the mean of the two middle values as our central tendency. Measures of central tendency summarize data by finding one number that best represents all the numbers in a sample or a population. Using the median, the typical team salary was approximately 66 million and the number of wins per team was 81. The median was the best choice for this data set because the outlier does not affect it. The median depends not only on the salary and win values but also on how many observations are in the data set, here the 30 teams we used to compile our multivariate data set. As a common rule of thumb, a sample of at least 30 observations is considered large enough for the sampling distribution of the mean to be approximately normal.
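Several of the figures in the descriptive statistics above follow directly from others. As a hedged cross-check, the sketch below recomputes the range, interquartile range, standard error of the mean, coefficient of variation, and population standard deviation from the reported count, mean, sample standard deviation, extremes, and quartiles. The variable names are illustrative.

```python
import math

# Values reported in the descriptive statistics for salaries.
n = 30
mean = 73_063_563.267
sample_sd = 34_233_970.297
minimum, maximum = 29_679_067, 208_306_817
q1, q3 = 50_292_565.50, 87_573_983.75

data_range = maximum - minimum                       # maximum minus minimum
iqr = q3 - q1                                        # 3rd quartile minus 1st quartile
standard_error = sample_sd / math.sqrt(n)            # s / sqrt(n)
cv_percent = 100 * sample_sd / mean                  # s as a percentage of the mean
population_sd = sample_sd * math.sqrt((n - 1) / n)   # rescale from n - 1 to n

print(data_range, iqr)
print(round(standard_error, 3), round(cv_percent, 2), round(population_sd, 3))
```

The recomputed values agree with the reported ones up to rounding of the inputs.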

Hypothesis Test: Independent Groups (t-test, pooled variance)

| | salaries | Wins |
| --- | --- | --- |
| mean | 73,063,563.267 | 81.000 |
| std. dev. | 34,233,970.297 | 10.834 |
| n | 30 | 30 |

| Statistic | Value |
| --- | --- |
| df | 58 |
| difference (salaries - Wins) | 73,063,482.2667 |
| pooled variance | 585,982,361,140,036.0000 |
| pooled std. dev. | 24,207,072.5438 |
| standard error of difference | 6,250,239.2548 |
| hypothesized difference | 0 |
| t | 11.69 |
| p-value (two-tailed) | 6.97E-17 |

To measure the dispersion of our data we can choose between two methods. We can use the range, which takes the difference between the maximum and the minimum values, or we can use the standard deviation. Standard deviation is the better way to measure dispersion.
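The arithmetic behind the hypothesis-test figures reported above can be retraced from the summary statistics alone. This is a minimal sketch that uses the standard pooled-variance formulas and only the `math` module; it reproduces the degrees of freedom, pooled variance, standard error, and t statistic, and does not compute the p-value.

```python
import math

# Summary statistics reported in the hypothesis-test output.
mean_salary, sd_salary, n_salary = 73_063_563.267, 34_233_970.297, 30
mean_wins, sd_wins, n_wins = 81.000, 10.834, 30
hypothesized_difference = 0

# Degrees of freedom for two independent groups with pooled variance.
df = n_salary + n_wins - 2

# Pooled variance: weighted average of the two sample variances.
pooled_variance = (
    (n_salary - 1) * sd_salary**2 + (n_wins - 1) * sd_wins**2
) / df
pooled_sd = math.sqrt(pooled_variance)

# Standard error of the difference between the two means.
standard_error = math.sqrt(pooled_variance * (1 / n_salary + 1 / n_wins))

# t statistic for the observed difference.
difference = mean_salary - mean_wins
t_statistic = (difference - hypothesized_difference) / standard_error

print(df, round(pooled_sd, 4), round(standard_error, 4), round(t_statistic, 2))
```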

When using the range we take the difference between the maximum value and the minimum value. This measurement tells us only about the maximum and minimum values and provides no information on the values in between. If we were dealing with a small data set, say five teams, then we could use the range to measure the dispersion. Since our data set contains 30 teams, we use another way to measure the dispersion.

The best way to measure the dispersion of our data set is to use the standard deviation. Standard deviation is based on the mean of the data values. First we find the mean, subtract it from each data value, and square the result. We then average the squared differences to get the variance and take the square root to get the standard deviation. By using the standard deviation we use information from every data value in the set. For team salaries the standard deviation is about 33.7 million dollars (population) or 34.2 million dollars (sample). For team wins the standard deviation is about 10.8 wins.
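The steps for the standard deviation, alongside the range for comparison, can be sketched as follows. The eight values are hypothetical and chosen so the arithmetic is easy to follow by hand; dividing by n gives the population figure and dividing by n - 1 gives the sample figure.

```python
import math

# Hypothetical data values (illustrative only).
values = [2, 4, 4, 4, 5, 5, 7, 9]

# Range: uses only the two extreme values.
value_range = max(values) - min(values)

# Step 1: find the mean.
mean = sum(values) / len(values)

# Step 2: subtract the mean from each value and square the result.
squared_deviations = [(x - mean) ** 2 for x in values]

# Step 3: average the squared deviations to get the variance.
population_variance = sum(squared_deviations) / len(values)
sample_variance = sum(squared_deviations) / (len(values) - 1)

# Step 4: take the square root to get the standard deviation.
population_sd = math.sqrt(population_variance)
sample_sd = math.sqrt(sample_variance)

print(value_range, mean, population_sd, round(sample_sd, 3))
```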

We drew a scatter plot to compare the two data sets, wins and salaries. With the exception of the outlier, the slope of the plotted data indicates a correlation between salaries and wins that the histograms did not reveal.
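One way to put a number on the relationship a scatter plot suggests is the Pearson correlation coefficient. The sketch below is an assumption about how such a check could be done, not the method used in this paper; `pearson_r` is a hypothetical helper and the five salary and win pairs are made-up values, not the 2005 data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: co-variation of x and y scaled by their spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    co_variation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return co_variation / math.sqrt(spread_x * spread_y)


# Hypothetical payrolls (millions) and win totals, for illustration only.
salaries = [40, 50, 60, 70, 80]
wins = [72, 70, 81, 79, 88]

r = pearson_r(salaries, wins)
print(round(r, 4))
```

A value near +1 or -1 indicates a strong linear relationship, and a value near 0 indicates little or none.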

## Conclusion

After researching our problem statement, whether baseball team salaries affect team wins, we have come to the conclusion that money cannot buy a team wins. On paper, when you look at team salaries and the high-priced players that teams acquire, one would assume that these players would produce wins. That is not the case: one might expect a high salary to give a team a good chance at wins, but our research and data do not bear this out. A team can have a much higher salary than another and still win about the same number of games. For example, the Mets and Nationals had 83 and 81 wins respectively in 2005, but their salaries were very different, at 101.3 and 48.6 million. This leads us to conclude that player personalities and team chemistry are what add up to a team with the most wins. A team needs the right types of players, who share the goal of being the best team in the major leagues and who play as one team rather than as individuals looking for a high salary. With this information, we do not consider further research necessary, as our data showed no dependable relationship between team salaries and the production of wins.
