tag:blogger.com,1999:blog-4903856565127937599.post-82276069189134731962007-12-29T13:22:00.001-05:002008-04-15T00:26:00.129-04:00 Student's T-Distribution, Degrees of Freedom<br /><br />In the <a href="http://numoraclerecipes.blogspot.com/2007/12/students-t-distribution-degrees-of.html">previous post</a>, we showed how to compute the confidence interval when σ, i.e. the population standard deviation, is known. But in a typical situation, σ is rarely known.<br /><br /><b>Computing Confidence Interval for μ when σ is Unknown:</b> The solution then is to use the <em>sample</em> standard deviation S in place of σ, in a variant of the standardized statistic for the normal distribution, z = (X_bar - μ)/(σ/√n).<br /><br /><b>Student's T Distribution:</b><br />For a <em>normally distributed population</em>, the Student's T distribution is the distribution of this standardized statistic: <b>t = (X_bar - μ)/(S/√n), with (n - 1) degrees of freedom (df) for the deviations</b>, where S is the sample standard deviation, and <b>n</b> is the sample size. Key points:<ul><li>The t distribution resembles the bell-shaped z (normal) distribution, but with wider tails than z, with mean zero (the mean of z) for df > 1, and with variance tending to 1 (the variance of z) as df increases. For df > 2, the variance of t is <b>df/(df - 2)</b>. </li><li>The z (normal) distribution deals with one unknown <b>μ</b> - estimated by the random variable X_bar, while the t distribution deals with two unknowns <b>μ</b> and <b>σ</b> - estimated by the random variables X_bar and S respectively. 
So it tacitly handles greater uncertainty in the data.</li><li>As a consequence, the t distribution yields wider confidence intervals than z.</li><li>There is a different t distribution for each value of df (1, 2, 3, ...).</li><li>A good <a href="http://www.statsoft.com/textbook/sttable.html">comparison of the distributions is provided here</a>.</li><li>For a small sample (n < 30) taken from a normally distributed population, a <b>(1 - α) 100% confidence interval</b> for μ when σ is <u>unknown</u> is <b>X_bar ± t<sub>α/2</sub> s/√n</b>. This is the better distribution to use for small samples - with (n - 1) df, and unknown μ and σ.</li><li>But for larger samples (n ≥ 30), and hence larger df, the t distribution can be approximated by a z distribution, and the <b>(1 - α)100% confidence interval</b> for μ is <b>X_bar ± z<sub>α/2</sub> s/√n</b> (note: still using the sample standard deviation s).</li></ul>We will pause to understand df in this context (based on Aczel's book). <a href="http://numoraclerecipes.blogspot.com/2007/12/sampling-distributions-introduction.html">We noted earlier</a> that df acts as a compensating factor - here is how. Assume a population of five numbers - 1, 2, 3, 4, 5. The (known) population mean is μ = (1+2+3+4+5)/5 = 3. Assume we are asked to draw a random sample of 5 numbers (with replacement) and find the sum of squared deviations (SSD) based on μ:<pre>x | μ | deviation | deviation_squared<br />3 | 3 | 0.0 | 0<br />2 | 3 | -1.0 | 1<br />4 | 3 | 1.0 | 1<br />1 | 3 | -2.0 | 4<br />2 | 3 | -1.0 | 1<br />Sum of Squared Deviation = 7</pre>Given the population mean, the deviations of the 5 <em>randomly</em> sampled values retain all 5 degrees of freedom. Next, assume we <em>don't</em> know the population mean, and instead are asked to compute the deviations from <em>one</em> number of our choosing. Our goal is to choose the number that minimizes the sum of squared deviations. 
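As a quick numerical check of this goal, the Python sketch below scans candidate centers on a grid and picks the one with the smallest sum of squared deviations for the sample 3, 2, 4, 1, 2. The helper names (ssd, best_center) and the grid step of 0.1 are illustrative assumptions, not something taken from Aczel's book or from this blog's own code.

```python
def ssd(sample, center):
    """Sum of squared deviations of the sample values from `center`."""
    return sum((x - center) ** 2 for x in sample)


def best_center(sample, step=0.1):
    """Scan a grid from min(sample) to max(sample) and return the grid
    point with the smallest SSD (brute force, for illustration only)."""
    lo, hi = min(sample), max(sample)
    n_steps = int(round((hi - lo) / step))
    candidates = [lo + i * step for i in range(n_steps + 1)]
    return min(candidates, key=lambda c: ssd(sample, c))


sample = [3, 2, 4, 1, 2]
center = best_center(sample)

# Compare the SSD at the best grid point with the SSD about mu = 3.
print(round(center, 1), round(ssd(sample, center), 2), ssd(sample, 3))
```

For this sample, the grid minimum coincides with the arithmetic mean of the five values, and its SSD is smaller than the SSD computed about μ = 3.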
A readily available number is the sample mean (3+2+4+1+2)/5 = 2.4, which is in fact the number that minimizes the SSD - so we will use it:<pre>x | x_bar | deviation | deviation_squared<br />3 | 2.4 | 0.6 | 0.36<br />2 | 2.4 | -0.4 | 0.16<br />4 | 2.4 | 1.6 | 2.56<br />1 | 2.4 | -1.4 | 1.96<br />2 | 2.4 | -0.4 | 0.16<br />Sum of Squared Deviation = 5.2</pre>The use of the sample mean biases the SSD downward from 7 (actual) to 5.2. And having chosen one mean from the data, the deviations for the same 5 random samples retain only df = (5 - 1) = 4 degrees of freedom.<br />Subsequent choices of 2 means - (3+2)/2, (4+1+2)/3 - or 3 means, would reduce the SSD further, while also reducing the degrees of freedom of the deviations for the 5 random samples: df=(5-2), df=(5-3) and so on. As an extreme case, if we take the mean of each sampled number to be the number itself, then we have:<pre>x | x_bar | deviation | deviation_squared<br />3 | 3 | 0 | 0<br />2 | 2 | 0 | 0<br />4 | 4 | 0 | 0<br />1 | 1 | 0 | 0<br />2 | 2 | 0 | 0<br />Sum of Squared Deviation = 0</pre> which reduces the SSD to 0, and the deviation df to (5-5) = 0. So in general,<ul><li>deviations (and hence SSD) for a sample of size n, measured from a known population mean μ, will have df = n</li><li>deviations for a sample of size n, measured from the sample mean X_bar, will have df = (n - 1)</li><li> deviations for a sample of size n, measured from k ≤ n different numbers (typically means of subsets of the sample points), will have df = n - k.</li></ul>The confidence interval using the T-distribution applies to the narrow case of n < 30 and a normal population; the Z distribution covers larger samples. The practical use of this distribution appears to be more in comparing two populations using the <b>Student's T-Test</b> - which requires understanding the concepts of Hypothesis Testing. So I will defer the code for an equivalent confidence_interval() routine based on the T-distribution for later.<br /><br />Ramkumar Krishnan (Ram) http://www.blogger.com/profile/16930865713058805668 noreply@blogger.com
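The degrees-of-freedom bookkeeping described above (df = n - k when deviations are measured from k means) can be sketched in Python as follows. This is only a minimal illustration, not the confidence_interval() routine deferred in the post: the function names ssd_about_groups and t_interval are hypothetical, and the critical value 2.776 is the standard t-table entry for df = 4 and α/2 = 0.025, supplied by hand rather than computed.

```python
import math


def ssd_about_groups(groups):
    """Return (SSD, df) when each group of sample values is measured
    from its own group mean; df = n - k for k group means."""
    n = sum(len(g) for g in groups)
    total = 0.0
    for g in groups:
        mean = sum(g) / len(g)
        total += sum((x - mean) ** 2 for x in g)
    return total, n - len(groups)


def t_interval(sample, t_crit):
    """X_bar +/- t_crit * s / sqrt(n), with s based on (n - 1) df.
    t_crit must be looked up by the caller (assumption: from a t table)."""
    n = len(sample)
    ssd, df = ssd_about_groups([sample])   # one mean, so df = n - 1
    s = math.sqrt(ssd / df)                # sample standard deviation
    x_bar = sum(sample) / n
    half_width = t_crit * s / math.sqrt(n)
    return x_bar - half_width, x_bar + half_width


sample = [3, 2, 4, 1, 2]

# One mean, two means, and one "mean" per point (the extreme case).
for groups in ([sample], [[3, 2], [4, 1, 2]], [[x] for x in sample]):
    ssd, df = ssd_about_groups(groups)
    print(len(groups), "mean(s): SSD =", round(ssd, 4), "df =", df)

# 95% interval for mu; 2.776 is the table value for df = 4, alpha/2 = 0.025.
low, high = t_interval(sample, t_crit=2.776)
print("95% CI for mu:", round(low, 3), "to", round(high, 3))
```

Passing the critical value in as a parameter keeps the sketch free of any statistics library; a fuller routine would look it up from the t distribution for the given df and α.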