Applying Regression Blog<br />By J<br /><br /><br />The Ghost Map<br />Posted 2008-01-20<br /><br />If you're anything like me, then when you learned about John Snow and the Broad Street Pump, you learned something like this:<br /><ol><li>There was a cholera epidemic in an area of Soho in London in 1854.</li><li>John Snow was a doctor there, who drew a map of where the deaths were, and realized that water coming from the Broad Street pump was the source. </li><li>Upon realizing that, he removed the handle from the pump, to stop people using it.</li><li>Snow showed that cholera was an infectious disease.<br /></li><li>He thereby also founded the discipline of epidemiology.</li></ol>Steven Johnson has written a book, called <a href="http://www.amazon.co.uk/Ghost-Map-Street-Epidemic-Networks/dp/0141029366/jeremymiles">The Ghost Map</a>, named after the map that was drawn by Snow, all about the incident. (Johnson also wrote <a href="http://www.amazon.co.uk/Everything-Bad-Good-You-Popular/dp/0141018682/jeremymiles">Everything Bad is Good for You</a>.) <br /><br />The first thing you learn from the book is that the popular account of the Broad Street pump incident is wrong in many, many ways. The events happened in <span style="font-style: italic;">almost </span>the reverse order. Epidemiology existed before the incident (not long before, and Snow was a founder). Some people thought that cholera was infectious before this incident. Most didn't. Most didn't think it was infectious afterwards either. Snow removed the pump handle - but before he drew the map. 
He didn't even draw the best map - someone else drew an improved version later on.<br /><br />The book also goes into detail about what was known about the minor details of people's lives: where they worked (this was important, because this affected what they drank) and where their relatives lived. But sometimes details are lost - no one knows the name of the first victim. It examines the epidemic on a broader, political scale - why had things got to the state where it was possible for a cholera epidemic to occur? It examines how the situation arose that cholera could spread - the city got too big, the carts that had to carry the, errrmmm... , waste had too far to go, and removing the 'night soil' became too expensive; people filled their cellars with shit instead. It looks at what happened as a result - sewage systems removed sewage from people's houses, and dumped it in the river.<br /><br />So why am I writing about it here? It shows how three things that are required for statistical analysis to have an effect come together. First, there was a theory - the theory could be tested by the collection of data. Snow (and others) spent a long time collecting that data, finding who had died, where they had lived, what they did, but the theory was needed to guide the data collection. Second, the data had to be presented clearly - in this case on the map, which was a form of bar chart, with a bar on each house. (Nowadays we'd call that a <a href="http://en.wikipedia.org/wiki/GIS">Geographical Information System (GIS)</a>.) Third, the consumers of research had to be persuaded - in this case, the government departments who needed to build sewers to improve public health.<br /><br />Finally, the book is extraordinarily well written. It manages to keep the text flowing without becoming turgid, and without deviating from the facts. Here's a brief example: "Word of the outbreak had traveled through the wider city and beyond. 
The chemist's son who had enjoyed his pudding days before on Wardour Street died on that Sunday at his home in Willesden."<br /><br />There's more information in the <a href="http://en.wikipedia.org/wiki/John_Snow_%28physician%29">John Snow</a> entry on Wikipedia.<br /><br /><br />Effect Sizes<br />Posted 2007-11-17 (updated 2007-11-20)<br /><br />A Davies sent a message to the <a href="http://www.jiscmail.ac.uk/lists/psych-methods.html">psych-methods</a> list:<br /><br /><blockquote>I have carried out three hierarchical regression analyses (linear for attitudes and intention, and logistic for behaviour). Within these regressions I have controlled for past behaviour and baseline psychological variables, such that I want to know what the effect of the intervention is once these variables are controlled for in the early steps.<br /><br />I am familiar with using means and standard deviations to calculate cohen’s d, but am unsure how to obtain these values from a regression analysis? Are there other approaches that can be used from data analysed using hierarchical regression? I understand that the R2 value can be used but as I understand it that is an indicator of the effect of adding the variable to the regression. I've also read about using the t-test statistic but when I have used it it seems to indicate a large effect of the intervention (d=.45) when actually the intervention is only marginally significant (p<.07) as a predictor of change. Does anyone have any suggestions?</blockquote><br />To answer this question, we should think about what an effect size is, and what it's for. Sometimes the units of measurement are meaningful. If eating a tomato every day is associated with living (on average) two years longer, then there's no need to turn this into an effect size. You understand what two years means. It makes sense. 
You don't need an effect size, and you can easily compare it to other studies that examined the influence of eating a carrot every day.<br /><br />In psychology, and other social sciences, we often have measures that are not meaningful. If one study has looked at the efficacy of Prozac for depression, and found that Prozac was associated with a difference of 8 points on the <a href="http://www.blogger.com/counsellingresource.com/quizzes/cesd/index.html">CESD</a>; and a second study has looked at the efficacy of CBT and found a difference of 4 points on the <a href="http://www.depression-primarycare.org/clinicians/toolkits/materials/forms/phq9/">PHQ</a>, it's hard to compare those two effects.<br /><br />The solution is an effect size, something like Cohen's <span style="font-style: italic;">d</span>. Cohen's <span style="font-style: italic;">d</span> is very simple. It's the difference between the means of the two groups, divided by the standard deviation (that's a pooled standard deviation, so it's a tiny bit fiddlier, but not much).<br /><br />If the SD of the CESD was 16, then 8 is half of that, so the effect of Prozac was d = 0.5. If the SD of the PHQ was 8, then 4 is half of that, so the effect of CBT was also d = 0.5, and the two effects were the same.<br /><br />And as long as we have a dichotomous predictor, we are happy.<br /><br />However, when we move to regression and multiple regression, it gets trickier, and we need something different. There are several things that we can use.<br /><br />The first is the standardized effect, what SPSS calls beta. If you've only got one predictor, it's the correlation. Correlations are nice. We understand correlations. We might not like the standardized effect for a couple of reasons. One of them is that it destroys the units. If we have units we are interested in, then the standardized effect hides them. 
If the standardized effect of the relationship between tomatoes eaten per day and longevity is moderate (say 0.3) I might go and eat a lot of tomatoes. However, you might then tell me that eating one more tomato per day increases longevity by 12 seconds. That's pretty poor. If I only knew the standardized effect, that would be hidden from me. If your predictor is dichotomous, then the standardized effect is very silly.<br /><br />The second choice is a partially standardized effect. Here, you standardize only the outcome variable, and keep the predictor unstandardized. This effect is the difference, in standard deviation units, associated with a 1 unit change in the predictor. If your predictor is dichotomous, that's Cohen's <span style="font-style: italic;">d</span>. If it's not dichotomous, it's analogous to Cohen's d.<br /><br />The third choice if you're using hierarchical regression is the change in R2. You ask how much additional variance is explained by the predictor (or predictors) that you added to the model at each step.<br /><br /><br />In the case of logistic regression, things get harder (as a general rule, everything is harder with logistic regression). Every software program gives you the estimate of the logistic regression as output, but not every package gives the Odds Ratio (sometimes called Exp(B)) unless you ask for it. (In Stata, for example, you need to use the <span style="font-family:courier new;">, or </span>option, and in R you need to exponentiate the coefficients yourself; in SPSS and SAS it's automatic).<br /><br />You should always present the odds ratio, because it makes more sense than the estimate. But it doesn't make a lot of sense.<br /><br />We can't standardize our outcome, because it's dichotomous. We could standardize our predictor, if that made sense, and present the partially standardized OR.<br /><br />However, odds ratios are funny. They're funny because people don't know how to interpret them, so you should give them help. 
And you give them help by converting the parameter estimates into probabilities, then choose sensible values for the covariates, and calculate the probabilities associated with them.<br /><br />I'm going to use the auto data in <a href="http://www.stata.com/">Stata</a> to demonstrate this (I'll give the Stata code at the end). I'm going to regress <span style="font-family:courier new;">foreign</span> (a dichotomous variable indicating whether a car was made in the USA or not), on <span style="font-family:courier new;">price</span> (in 1000$) and <span style="font-family:courier new;">mpg</span>.<br /><br />Here's the Stata output:<br /><br /><span style="font-family:courier new;">---------------------------</span><br /><span style="font-family:courier new;"> foreign | Coef. </span><br /><span style="font-family:courier new;">-------------+-------------</span><br /><span style="font-family:courier new;"> price | .2660188 </span><br /><span style="font-family:courier new;"> mpg | .2338353 </span><br /><span style="font-family:courier new;"> _cons | -7.648111 </span><br /><span style="font-family:courier new;">---------------------------</span><br /><br />Of course, we need to get the odds ratios instead of the linear estimates.<br /><br /><span style="font-family:courier new;">----------------------------</span><br /><span style="font-family:courier new;"> foreign | Odds Ratio </span><br /><span style="font-family:courier new;">-------------+--------------</span><br /><span style="font-family:courier new;"> price | 1.30476 </span><br /><span style="font-family:courier new;"> mpg | 1.263436 </span><br /><span style="font-family:courier new;">----------------------------</span><br /><br />The two variables are in real units, so there's no need to standardize them. 
A price increase of $1000 is associated with the odds of a car being foreign increasing by 1.3 times (these are odds ratios, so they are multiplicative, and this holds mpg constant) and one more mpg is associated with odds of being foreign 1.26 times higher. But what does that mean, in terms of the probability (because that's what we're interested in) of a car being foreign?<br /><br />Let's compare two cars. One that costs $4000, and one that costs $5000 (these aren't new data). We can calculate the probability that each of those cars is foreign, and that will give us an effect size.<br /><br />We just need to plug our numbers into the regression equation. But hold on, we need numbers for mpg. Let's pick a low value, say 14. (I'm not going to go through all the calculations, because it will take too long. If you're not familiar with how this is done, you can either believe me, or look it up.)<br /><br />Our $4k car which does 14 mpg has a 0.035 (3.5%) probability of being foreign.<br />Our $5k car which does 14 mpg has a 0.045 (4.5%) probability of being foreign.<br /><br />So adding $1k to the price of a car increases the probability it's foreign by 1 percentage point.<br /><br />But what if we chose a different number of mpg? Let's use 25.<br />Our $4k car which does 25 mpg has a 0.32 (32%) probability of being foreign.<br />Our $5k car which does 25 mpg has a 0.38 (38%) probability of being foreign.<br /><br />Which means that at this range, adding $1k has increased the probability by 6 percentage points. That's rather a large difference.<br /><br />You need to calculate those probabilities, and you need to calculate them at appropriate values of the other covariates, in order to ensure that the reader can interpret your regression. However, it's a difficult issue, and you'll want to read more. 
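<br /><br />As a rough sketch of the plugging-in described above, here is the arithmetic done by hand in Python (not Stata, only so that it can be checked without the dataset). The coefficients are copied from the Stata output; the names COEF, odds_ratio and prob_foreign are made up for this illustration and are not part of any package.<br /><br />

```python
import math

# Coefficients copied from the Stata logit output above (price is in $1000s).
# The dictionary and function names are hypothetical, chosen for this sketch.
COEF = {"_cons": -7.648111, "price": 0.2660188, "mpg": 0.2338353}


def odds_ratio(name):
    """Odds ratio for a one-unit increase in the named predictor: exp(b)."""
    return math.exp(COEF[name])


def prob_foreign(price, mpg):
    """Predicted probability of being foreign: inverse logit of the linear predictor."""
    xb = COEF["_cons"] + COEF["price"] * price + COEF["mpg"] * mpg
    return 1.0 / (1.0 + math.exp(-xb))


# Compare a $4k and a $5k car at a low and a high value of mpg.
for mpg in (14, 25):
    p4 = prob_foreign(4, mpg)
    p5 = prob_foreign(5, mpg)
    print(f"mpg={mpg}: P($4k)={p4:.3f}  P($5k)={p5:.3f}  difference={p5 - p4:.3f}")
```

<br /><br />The same odds ratio for price (about 1.3) gives a difference in probability of roughly 0.01 at 14 mpg and roughly 0.06 at 25 mpg, which is why the covariate values you choose need to be reported alongside the probabilities.<br /><br />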
Two good references are:<br /><br /><a href="http://www.amazon.com/Regression-Categorical-Dependent-Quantitative-Techniques/dp/0803973748/jeremymilesspage">Regression Models for Categorical and Limited Dependent Variables, by J Scott Long</a>. (There's a 2nd edition of this book out, which I haven't seen, but it's published by Stata Press, and therefore might be Stata specific).<br /><br /><a href="http://www.amazon.com/Analysis-Regression-Multilevel-Hierarchical-Models/dp/052168689X/jeremymilesspage">Data Analysis Using Regression and Multilevel/Hierarchical Models, by Andrew Gelman and Jennifer Hill</a>.<br /><br />(Tip of the Hat: Some of my thinking on this has been influenced by <a href="http://www.ntu.ac.uk/research/school_research/social/staff/51655gp.html">Thom Baguley</a>, and thanks to Greg Meyer for a correction).<br /><br /><br /><br />Stata Code<br /><span style="font-family:courier new;">sysuse auto</span><br /><span style="font-family:courier new;">* price is stored in dollars; rescale it to $1000s</span><br /><span style="font-family:courier new;">replace price = price / 1000</span><br /><span style="font-family:courier new;">* coefficients on the log odds scale</span><br /><span style="font-family:courier new;">logit foreign price mpg</span><br /><span style="font-family:courier new;">* same model, reported as odds ratios</span><br /><span style="font-family:courier new;">logit foreign price mpg, or</span><br /><br /><br />Start them young<br />Posted 2007-07-05<br /><br /><a href="http://www.shef.ac.uk/politics/staff/seancarey.html">Sean Carey</a> teaches a course called <a href="http://www.essex.ac.uk/methods/courses/descriptions/1d07.shtm">Introduction to Regression</a>, at the <a href="http://www.essex.ac.uk/methods">Essex Summer School in Social Science Data Analysis</a>. Here's a picture of his offspring that he uses in his initial presentation. I liked it, so I <strike>stole</strike> <strike>borrowed</strike> <strike>used</strike> displayed it here. 
(Notice that the error term is missing, or the hat on the <span style="font-style: italic;">y</span>.)<br /><br /><a href="http://www.jeremymiles.co.uk/regressionbook/uploaded_images/robin-763340.jpg"><img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer;" src="http://www.jeremymiles.co.uk/regressionbook/uploaded_images/robin-762701.jpg" alt="" border="0" /></a><br /><br /><br />Historical Regression<br />Posted 2007-06-01<br /><br />There's been a story on the news recently that a report suggests that childhood obesity is increasing because more mothers are working, and not staying at home with their children. It's been reported in a few places; one of them is <a href="http://cbs4.com/topstories/local_story_138154250.html">here</a>. It says <blockquote>"We saw that start to happen. We could track childhood obesity. There's a direct correlation," said Terry Mason, Chicago's public health commissioner.</blockquote>This might well be true, or it might not (that's the sort of thing that other blogs can worry about), but they mentioned correlation, so we're going to talk about it on this blog.<br /><br />So, a direct correlation, eh? And what's the sample size here? Let's see, well, we've got before, and we've got after. So we'll give them N=2. That's not a lot, is it? In fact, any two measures (which change) measured before and after some time have to have a correlation of r=+1, or r=-1. They don't even give a specific date to it - it's 'the 1980s'. As far as that evidence goes, you might as well say that it was Ronald Reagan getting elected, the release of Windows version 1.0 (anyone remember that? 
It had a clock that was pretty cool, but nothing else), or the fact that W Germany made it to the World Cup final 3 times (I'm including 1990, 'cos we're allowed to be a bit fuzzy here.)<br /><br />In other words, if there WASN'T a direct correlation, we should be surprised.<br /><br />This is an example of a more general problem (well, two more general problems).<br /><br />The first is inspecting the data, and then making up a theory about the causal relationships that exist. But you've cheated, you looked at the data. Another example of this sort of thing is the clustering of leukaemia cases around nuclear power stations. When people first theorised this, they first looked at the data, and they said "Let's choose leukaemia (not any other sort of cancer) amongst children (we think of leukaemia as a children's disease, because that's what makes the news, but it's not necessarily), and let's make the children aged, errmm, less than 3, and make it, errrmmm, 5 miles from a nuclear power station. No! 10 miles! Ah look, we've got a cluster!" Obesity is a problem now, but they identify the root of this in the 1980s - 20 years ago, so any event between now and then would appear to do the job.<br /><br />The second (related) problem is that of the sample size. Any time you say "Y happened, because X happened before" where X and Y are one off national events, you have an N of 2, and you can't say anything. Donohue and Levitt famously argued in 1999 that legalising abortion had decreased murder rates 20 (or so) years later (it's famous because they discuss it in <a href="http://www.freakonomics.com/">Freakonomics</a>). But they've got an N of 2. 
Murder rates are unstable, they weren't going to stay the same, so there was always going to be a correlation. This is discussed in a paper by <a href="http://crab.rutgers.edu/%7Egoertzel/">Ted Goertzel</a> called <a href="http://crab.rutgers.edu/%7Egoertzel/mythsofmurder.htm">Myths of Murder and Multiple Regression</a> (and if I ever write a paper with a title that good, I'll die happy). If you want to make this sort of statement, you need to have a bigger sample - you need to have abortion criminalised again, a few years later, because the murder rate should then rise, then legalise it once more, and it should drop again. Alternatively, you need to have abortion legalised in different countries at different times, and then see what happens to the murder rate in those countries, the same period afterwards.<br /><br />Social scientists do have a habit of being able to make up a plausible sounding theoretical explanation of any result that they happened to find. What if Donohue and Levitt had found that murder rates went up, as a result of abortion being legalized? Do you think they could have made up a plausible sounding theory to account for that too? (Go on, you try. I bet you can.) Well, it happens that other researchers have found that, by controlling for a couple of other variables, legalizing abortion really did make the murder rate go up.<br /><br />It's unlikely that this will ever replicate, at least enough times to get a sensible sample, and it's unlikely that there will be changes in the workforce that mean that most mothers no longer work, so we'll never know who (if anyone) is correct. And a hallmark of science is that it has to be able to be wrong - for a theory or a finding or a fact to be considered to be scientific, we must be able to state what evidence would make us say "Oh, that was wrong then". 
For statistical analysis of this sort of result, there isn't anything that would enable us to reject the theory, and if we can't reject it, we might as well rely on psychoanalysis. ("You hate your father? That's because of repressed sexual urges. You love your father? That's because of unrepressed sexual urges.")<br /><br /><br />Stepwise regression<br />Posted 2007-05-04<br /><br />There has been a little exchange on the SAS-L list about stepwise regression, and whether you should use it.<br /><br />Jerry Davis (defending stepwise regression, in applied settings) wrote:<br /><span style="font-style: italic;" class="q">Stepwise selection methods are just statistical tools and like any tool, may be used inappropriately. I doubt if Walmart gives a rip about multicollinearity, they just want better predictions. If stepwise does the job, so be it.</span><br /><br />Peter Flom wrote:<br /><span style="font-style: italic;">The analogy to a tool is common.</span><br /><span style="font-style: italic;">Stepwise methods ARE like a tool. They are like a hammer whose head flies off at frequent but irregular intervals.</span><br /><br /><br />Mediation<br />Posted 2007-05-03<br /><br />The <a href="http://content.apa.org/journals/met/12/1">latest issue of Psychological Methods </a>(Volume 12, issue 1) has two (count them! Two!) 
articles on mediation and moderation.<br /><br />The first, by Jeffrey Edwards and Lisa Schurer Lambert, "<a href="http://content.apa.org/journals/met/12/1/1" class="text" style="color: rgb(0, 51, 153);">Methods for Integrating Moderation and Mediation: A General Analytical Framework Using Moderated Path Analysis</a>" (link is to the abstract only), is about how best to combine analyses of moderation and mediation. The article looks a bit tricky - there are lots of equations, but they are not nasty hard equations, and working through them gets what you want.<br /><br />The second, by Scott Maxwell and David Cole, <a href="http://content.apa.org/journals/met/12/1/23">Bias in Cross-Sectional Analyses of Longitudinal Mediation</a>, is possibly more important, and is about whether mediation analyses are biased, if everything is measured cross-sectionally (i.e. at the same time). The short answer is yes.<br /><br />This is problematic for many researchers, who measure three things, and then carry out a mediation analysis to show that <span style="font-style: italic;">X</span> causes <span style="font-style: italic;">M</span> and <span style="font-style: italic;">M</span> causes <span style="font-style: italic;">Y</span>. 
Almost every effect that we are interested in turns out to be biased if you don't measure the effects longitudinally.<br /><br /><br />Guide to R for SPSS and SAS Users<br />Posted 2007-02-01<br /><br />Bob Muenchen, who works in the <a href="http://oit.utk.edu/scc/">Statistical Consulting Centre at the University of Tennessee</a>, has put together a <a href="http://oit.utk.edu/scc/RforSAS&SPSSusers.doc">guide to using R, for SPSS and SAS Users</a>. I've gone on about <a href="http://www.r-project.org">R</a> before, on this blog for SEM, <a href="http://www.jeremymiles.co.uk/regressionbook/2007/01/sem-on-cheap.html">here </a>and <a href="http://www.jeremymiles.co.uk/regressionbook/2006/07/sem-in-r-for-free.html">here</a>; and I've written about <a href="http://www.jeremymiles.co.uk/regressionbook/extras/appendix2/R/">how to do regression in R</a>.<br /><br />The guide is 91 pages long, and focuses on data manipulation more than analysis. Getting info on data analysis in R is pretty straightforward; getting information on manipulation is harder, so this is a welcome addition.<br /><br /><br />Canonical Correlation<br />Posted 2007-01-26<br /><br /><div style="text-align: justify;">Catherine Day sent an email to the psych-postgrads list, which said:<br /><blockquote>Sorry to bombard you with yet another statistical problem but I'm desperate! 
I have 2 sets of latent variables (1 set measuring taste preference and another set measuring personality dimensions).<br />I'm looking at the relationship between taste preference and personality and understand that canonical correlation is the appropriate analysis. The problem is nobody at Sheffield Hallam has performed one before and we do not have the add-on package for SPSS that does this.<br />Does anyone know how to do this type of analysis or can point me in the direction of some training?</blockquote>I replied:<br /><br />Canonical correlation is one possibility. It's kind of atheoretical though - that is, if you know what the latent variables are, you probably shouldn't use it. You might want to use structural equation modelling or partial least squares analysis instead (although both of those are pretty fiddly, and you shouldn't use them unless you really, really have to).<br /><br />Just to check: are you modelling at the item level? If you are, I'd consider summing to the scales (maybe factor analysing first) and then doing regression.<br /><div style="text-align: justify;"><br />There's <a href="http://www.amazon.com/Canonical-Correlation-Analysis-Interpretation-Quantitative/dp/0803923929/">a book on canonical correlation</a>, in the Sage Little Green Books series. I think that Tabachnick and Fidell's book 'Using multivariate statistics' covers it as well.<br /><br />Finally, you have got the add-on, you just don't realise it. It's an SPSS syntax file, called 'canonical correlation.sps', and you'll find it in the SPSS folder (c:\program files\SPSS, if you're using windows with a locally installed version).<br /><br />To use it, you type syntax (into the syntax editor) like this:<br /><br /><span style="font-family:courier new;">INCLUDE 'c:\Program Files\SPSS11\Canonical correlation.sps'.</span><br /><span style="font-family:courier new;">CANCORR set1 = y1, y2, y3/</span><br /><span style="font-family:courier new;"> set2 = x1, x2/.</span><br /><br />Where x1 and x2 are predictors, and y1, y2 and y3 are outcomes (just keep going until you've got them all). (I think I've got my x and y the right way around - I don't have SPSS here, so I'm guessing a bit.)<br /></div><br /></div><br /><br /><br />SEM on the cheap<br />Posted 2007-01-07 (updated 2007-01-10)<br /><br />Helen M sent a message to the <a href="http://www.jiscmail.ac.uk/lists/PSYCH-POSTGRADS.html">psych-postgrads</a> list, because she didn't have the <a href="http://www.spss.com/amos">AMOS</a> package for SEM. And it ain't cheap.<br /><br />I sent the following reply:<br /><br />First, there is a program called Mx, which is free, and available from <a href="http://www.vcu.edu/mx/" target="_blank">http://www.vcu.edu/mx/</a> - it was designed for twin studies, but can do any kind of structural equation modelling. 
It has a path diagram input, but it needs tweaking in syntax, which is a little fiddly - until you get used to it.<br /><br />Second, there is a program called R, which is also free - it's available from <a href="http://www.r-project.org/" target="_blank">www.r-project.org</a>. R can read an SPSS file, and can do everything you'll ever want to do. R comes in packages, and one of the packages is called sem (note - lower case); it's written by John Fox, and can do SEM (upper case). It's syntax based, but it's pretty easy - you write <span style="font-family: courier new;">y <- x</span> to regress <span style="font-style: italic;">y</span> on <span style="font-style: italic;">x</span>, and <span style="font-family: courier new;">y <-> x</span> to correlate <span style="font-style: italic;">y</span> with <span style="font-style: italic;">x</span>. R will download the sem package itself, if you ask it nicely, but the sem home page is here: <a href="http://socserv.mcmaster.ca/jfox/Misc/sem/index.html" target="_blank">http://socserv.mcmaster.ca/jfox/Misc/sem/index.html</a>. Also, if you use R, you can make life easier for yourself with a program called JGR, pronounced JaGuaR - which you can find here: <a href="http://rosuda.org/JGR/" target="_blank">http://rosuda.org/JGR/</a> (JGR stands for Java GUI for R - hmmm... I'll explain an acronym with three more abbreviations).<br /><br />Third, precisely for people in your situation, you can 'rent' a copy of LISREL, which will do SEM for you - LISREL has something of a bad reputation, because it was the first package out there, and so people remember that, but it's much easier nowadays. It costs 65 feeble American dollars, which works out at about £33 nowadays - or for a year, it's $100. 
<a href="http://estore.e-academy.com/index.cfm?loc=estore/soft_browse/soft_display_product&ID_Product=525" target="_blank">http://estore.e-academy.com/index.cfm?loc=estore/soft_browse/soft_display_product&ID_Product=525</a><br /><br />Fourth, if you've got access to Stata, there's a free Stata add-on called gllamm (generalized linear latent and multilevel models - or something like that), available from <a href="http://www.gllamm.org/" target="_blank">www.gllamm.org</a>. It can do SEM, but it doesn't seem that straightforward to me (which means I tried for 5 minutes and gave up).<br /><br />Fifth, if you're doing something very straightforward, and you're not bad at Excel, you can do SEM using Excel - although if you want to do anything more complicated than a one factor CFA, you're in trouble. And even if you only want to do that, you're in a bit of trouble. I'm only saying it 'cos I wrote a paper that tells you how to do it. And no one ever does it. Not even me.<br /><br />Also, you're going to want some help when you make trivial mistakes which will drive you insane to spot (comma in the wrong place - things like that). You can email this list, which might help; the psych-methods list (on jiscmail as well); or semnet - although there you will immediately get bogged down with arguments about arcane details of your models. If you use sem, you can ask the r-users list (you'll
If you use sem, you can ask the r-users list (you'll find stuff on the R web site about that - you get an awful lot of email on it though).<br /><br />I'm assuming you've got a book, so I won't go into that now.Jnoreply@blogger.comtag:blogger.com,1999:blog-15250473.post-68290944694298489202006-12-20T06:06:00.000Z2006-12-20T06:08:54.740ZMediation in cluster trialsShocking I know, but I've found out that some people who read this blog <span style="font-style: italic;">don't</span> regularly read <a href="http://www.leaonline.com/loi/mbr">Multivariate Behavioral Research</a>. If you're one of those people you won't know that the most recent issue (vol 41, number 3) has a paper about mediation effects in cluster randomized trials. (A cluster randomized trial is one where people are randomized in clusters, such as classes, or schools, or households). <br /><br />Here's <a href="http://www.leaonline.com/doi/abs/10.1207/s15327906mbr4103_5">the abstract</a>.Jnoreply@blogger.comtag:blogger.com,1999:blog-15250473.post-1153906286329540012006-07-26T10:19:00.000+01:002006-07-26T10:31:26.340+01:00SEM in R (for free)One of the problems with trying to learn about structural equation modelling (SEM) is that one needs to have access to a relatively expensive program (<a href="http://www.statmodel.com">Mplus</a>, <a href="http://spss.com/amos/">AMOS</a>, <a href="http://www.ssicentral.com">Lisrel</a>, <a href="http://www.mvsoft.com">EQS</a>), or one needs to use a free program, such as <a href="http://www.vcu.edu/mx/">Mx</a>. <br /><br />Mx is a great program - it can do all sorts of amazingly clever things, but even the authors would admit that it isn't the world's easiest program to use - in fact, it's possibly got the steepest learning curve of any SEM program. 
<br /><br />A solution to this is the package <a style="font-family: courier new;" href="http://socserv.mcmaster.ca/jfox/Misc/sem/index.html">sem</a>, written by <a href="http://socserv.mcmaster.ca/jfox/">John Fox</a>, using <a href="http://www.r-project.org">R</a>. R is a free and open source program which can read an SPSS data file, and <span style="font-family: courier new;">sem </span>is a package (add-on) to it. Learning sem (the package) has always been a bit tricky, because there is a bit of a lack of material to help - the help files (like pretty much all R help files) tell you how everything works, but don't really tell you how to do stuff. That's all been solved by a paper, authored by Fox, in the most recent issue of the journal <a href="http://www.erlbaum.com/ME2/dirmod.asp?sid=28807ECF50FE49F0837125BE640E681F&nm=&type=eCommerce&mod=CommerceJournals&mid=B7D79E2F39304DB3A6A67FAE5C6F9AF7&tier=3&id=E94283735FB44681B615C810E6CDD654&itemid=1070-5511">Structural Equation Modeling</a> (a right riveting read, I'll tell you) (Vol 13, Number 3, pages 465-486). Anyway, it's a nice introduction, and you can read it <a href="http://socserv.mcmaster.ca/jfox/Misc/sem/SEM-paper.pdf">here</a>.Jnoreply@blogger.comtag:blogger.com,1999:blog-15250473.post-1149679154845392532006-06-07T12:16:00.000+01:002006-06-07T12:19:14.856+01:00More on multilevel in SPSS<a href="http://www.sussex.ac.uk/Users/danw/">Dan Wright</a> sent an email to the <a href="http://www.jiscmail.ac.uk/lists/psych-methods.html">psych-methods</a> list, with some websites that he gives to students interested in multilevel models in SPSS. 
They are here:<br /><br /> <span style="text-decoration: underline;"><a href="http://www.spss.ch/upload/1107355943_LinearMixedEffectsModelling.pdf">http://www.spss.ch/upload/1107355943_LinearMixedEffectsModelling.pdf</a><br /><a href="http://www.mlwin.com/softrev/reviewspss.pdf">http://www.mlwin.com/softrev/reviewspss.pdf</a><br /><a href="http://epm.sagepub.com/cgi/reprint/65/5/717">http://epm.sagepub.com/cgi/reprint/65/5/717</a><br /><a href="http://www.sussex.ac.uk/Users/danw/masters/advanced%20stats/multilevel%20exercise.htm">http://www.sussex.ac.uk/Users/danw/masters/advanced%20stats/multilevel%20exercise.htm</a><br /><a href="http://www.ats.ucla.edu/stat/examples/alda/">http://www.ats.ucla.edu/stat/examples/alda/</a></span>Jnoreply@blogger.comtag:blogger.com,1999:blog-15250473.post-1144926586164339362006-04-13T12:03:00.000+01:002006-04-13T16:21:57.113+01:00Huber-White estimates in SPSSStatistical tests usually assume that the measurements are independently and identically distributed (i.i.d.). This means that if you know the residual from one person, you should not be able to make any prediction about the residual from any other person. One occasion where this is violated is in cluster randomised trials. In a cluster randomised trial, we don't randomise individually, instead we randomise in groups, or clusters. This happens, for example, when we randomise a hospital to a group, but measure the effects on patients, or if we randomise a teacher to a group, and measure the effect on children.<br /><br />Cluster randomised trials <a href="http://www-users.york.ac.uk/%7Emb55/talks/clusml.htm">are becoming more common in medical, </a>and other research, and there is increasing recognition that we need to analyse them in appropriate ways. This can be done in a number of ways. One approach that is commonly used is the Huber-White sandwich estimator, implemented in <a href="http://www.stata.com">Stata </a>as 'robust estimates'. 
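<br /><br />To make the sandwich idea a bit more concrete, here is a minimal sketch in Python. It is not the Stata or SPSS implementation, and the data are made up purely for illustration. It fits a simple regression of a cost on a 0/1 group indicator, then works out the standard error of the slope twice: once the usual way, and once with the residual contributions summed within clusters before they are squared. The small-sample correction, G/(G-1) times (N-1)/(N-k), is an assumption about what Stata applies when clusters are requested, and every function and variable name here is hypothetical.<br /><br />

```python
from collections import defaultdict
from math import sqrt


def ols_simple(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    intercept = ybar - slope * xbar
    resid = [yi - intercept - slope * xi for xi, yi in zip(x, y)]
    return slope, intercept, resid, xbar, sxx


def slope_se_naive(x, y):
    """Conventional standard error, which assumes i.i.d. residuals."""
    _, _, resid, _, sxx = ols_simple(x, y)
    sigma2 = sum(e * e for e in resid) / (len(x) - 2)
    return sqrt(sigma2 / sxx)


def slope_se_cluster(x, y, cluster):
    """Cluster-robust (sandwich) standard error of the slope."""
    _, _, resid, xbar, sxx = ols_simple(x, y)
    n, k = len(x), 2  # k = number of estimated coefficients
    # Sum (x - xbar) * residual within each cluster, then square the sums.
    scores = defaultdict(float)
    for xi, ei, g in zip(x, resid, cluster):
        scores[g] += (xi - xbar) * ei
    n_clusters = len(scores)
    meat = sum(s * s for s in scores.values())
    # Assumed small-sample correction (see the lead-in).
    correction = (n_clusters / (n_clusters - 1)) * ((n - 1) / (n - k))
    return sqrt(correction * meat / sxx ** 2)


# Made-up example: 4 weeks, 3 callers per week, weeks 3 and 4 in group 1.
week = [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4]
group = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
cost = [10, 12, 11, 20, 22, 21, 15, 17, 16, 25, 27, 26]

print("difference:", ols_simple(group, cost)[0])
print("naive SE:", slope_se_naive(group, cost))
print("clustered SE:", slope_se_cluster(group, cost, week))
```

In this toy data set the costs within a week are much more alike than costs in different weeks, so the clustered standard error comes out larger than the naive one.<br /><br />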
It's also possible to use multilevel modelling in SPSS (although it's a bit fiddly - you really need to use syntax) or generalised estimating equations, which will be available in SPSS 15.<br /><br />However, it's possible to get the same results as the sandwich estimator using the SPSS complex samples procedure, although it's a bit fiddly. I'll demonstrate this using some data that were the basis of a paper by <a href="http://bmj.bmjjournals.com/cgi/content/full/329/7469/774">Richards, et al, from the BMJ</a>.<br /><br />When people telephone their doctor's surgery out of hours, they usually get through to a member of staff at the surgery. In this study, some weeks this remained the case, some weeks they were put through to <a href="http://www.nhsdirect.nhs.uk">NHS Direct</a>. One of the outcomes was how much it then cost to treat the person. The assumption of i.i.d. is violated - a person phoning in a given week is likely to be more similar in cost to another person phoning in the same week than to a person phoning in a different week. (For example, if there is a flu bug going around, a lot of people may phone the doctor, to be told that they need to go to bed and take paracetamol, which is cheap. If there is a very high pollen count in a week, people may be hospitalised with asthma, which is expensive.) We don't need to know if this happened, we just need to know if it <span style="font-style: italic;">might </span>have happened.<br /><br />(Although we could check using the intra class correlation - ICC, sometimes called the intra-unit correlation. I did, and it was 0.02. Not very high, but big enough to matter, as we shall see.)<br /><br />I first did the analysis with SPSS, ignoring the clustering of the data. Here, I find that the difference between the groups was £2.40 per person, with 95% confidence interval £0.58 to £4.22, p = 0.010, with NHS Direct costing more. 
Thus, we would conclude that there is evidence that people who telephone NHS Direct cost more to treat (in total) than people who telephone the surgery.<br /><br />I then did the analysis with Stata, requesting robust standard errors, with week as a clustering variable - this gives Huber-White sandwich estimators. The syntax for this is horribly easy:<br /><br /><span style="font-family:courier new;">regress totcost group, robust cluster(week)</span><br /><br />The difference between the two groups is the same - NHS Direct cost £2.40 more (which is good; if that wasn't the case, then we would have done something wrong). However, the 95% confidence interval is now -£0.85 to £5.65, p = 0.141. Ignoring the clustering would have meant that we risked an inflated type I error rate.<br /><br />I can also do this in SPSS (although it's not widely known, and the SPSS website support doesn't tell you this). We have to use the Complex Samples procedure, which appeared in SPSS 13.0. (And you have to have this module - it's an 'optional extra', as is much of SPSS).<br /><br />First, we need to create a variable which is full of 1s. The syntax is:<br /><br /><span style="font-family:courier new;">compute constant = 1.<br /><br /><span style="font-family:georgia;"><span style="font-family:georgia;">Or you can use the menu commands.</span><br /><br /><span style="font-family:georgia;">We then choose Analyse, Complex Samples, Prepare for Analysis.</span><br /><br /><span style="font-family:georgia;">Select 'Create a plan file' (which is the default), click 'Browse' and type a file name (I'll call it example, and SPSS will add '.csaplan' to it). Click Next.</span><br /><br /><span style="font-family:georgia;">Choose the constant variable as the Sample Weight (so that this is always 1 - we want everyone to be sampled) and in the cluster variable, put (surprisingly) the cluster variable - this is week. 
Click Next.</span><br /><br /><span style="font-family:georgia;">We want sampling with replacement. This is the default on the next screen, so click Next.</span><br /><br /><span style="font-family:georgia;">Then click Finish. SPSS produces a chunk of syntax which looks like this:</span><br /><br /><span style="font-family:courier new;">CSPLAN ANALYSIS</span><br /><span style="font-family:courier new;"> /PLAN FILE='example.csaplan'</span><br /><span style="font-family:courier new;"> /PLANVARS ANALYSISWEIGHT=constant</span><br /><span style="font-family:courier new;"> /PRINT PLAN</span><br /><span style="font-family:courier new;"> /DESIGN CLUSTER= week</span><br /><span style="font-family:courier new;"> /ESTIMATOR TYPE=WR.</span><br /><br /><br />Next, we run the analysis. To do this, select Analyse, Complex Samples, General Linear Model.<br /><br />In the Plan File, we choose the file that we just created, and then click continue.<br /><br />Put the outcome into the dependent variable box, and the predictor into the covariate box (even though our predictor is categorical I prefer to put it into covariate, because you never know how SPSS is going to code it.)<br /><br />Click on the statistics button, and ask for model: parameters, standard error, confidence interval and t-test (why oh why oh why would you not want these?). 
Then click continue and OK.<br /><br />We get the following chunk of syntax:<br /><br /><span style="font-family:courier new;">CSGLM totcost WITH group</span><br /><span style="font-family:courier new;"> /PLAN FILE = 'example.csaplan'</span><br /><span style="font-family:courier new;"> /MODEL group</span><br /><span style="font-family:courier new;"> /INTERCEPT INCLUDE=YES SHOW=YES</span><br /><span style="font-family:courier new;"> /STATISTICS PARAMETER SE CINTERVAL TTEST</span><br /><span style="font-family:courier new;"> /PRINT SUMMARY VARIABLEINFO SAMPLEINFO</span><br /><span style="font-family:courier new;"> /TEST TYPE=F PADJUST=LSD</span><br /><span style="font-family:courier new;"> /MISSING CLASSMISSING=EXCLUDE</span><br /><span style="font-family:courier new;"> /CRITERIA CILEVEL=95.</span><br /><br />The estimate for the difference is £2.40, with 95% confidence interval -£0.85 to £5.65, and p = 0.141, which is the same as we got in Stata.<br /><br />If the residuals are independently distributed, but are not identically distributed (and are still normal) this means that there is heteroscedasticity, and the correction can still be used. 
In this case, repeat the above procedure, but just don't put anything into the clusters box.<br /><br />(Veronica Morton helped me with this.)</span></span>Jnoreply@blogger.comtag:blogger.com,1999:blog-15250473.post-1142282389086657112006-03-13T20:35:00.000Z2006-03-13T21:58:48.946ZInterpreting Interactions<a href="http://www.unc.edu/%7Edbauer/">Daniel Bauer</a> and <a href="http://www.unc.edu/%7Ecurran/">Patrick Curran</a> have written a paper called <a href="http://www.unc.edu/%7Ecurran/pdfs/Bauer&Curran%282005%29.pdf">Probing interactions in fixed and multilevel regression: Inferential and graphical techniques</a> in the journal <a href="https://www.erlbaum.com/shop/tek9.asp?pg=products&specific=0027-3171"><span style="font-style: italic;">Multivariate Behavioral Research</span></a> which describes a method for probing interactions between continuous variables.<br /><br />I'll illustrate this in SPSS, using the data file which is used to illustrate interactions in the book <a href="http://www.jeremymiles.co.uk/regressionbook/data/data1_1.sav">Applying regression and correlation</a>. You can get it <a href="http://www.jeremymiles.co.uk/regressionbook/data/data1_1.sav">here</a> (in SPSS format). The dataset comprises the number of books (on statistics) that students read, the number of lectures that they attended, and the final grade they got. To spoil the ending, both number of books they read and number of lectures they attended were positive predictors of final grade. However, in addition, there is an interaction between books and attendance.<br /><br />To discover this, we standardise the two predictor variables and multiply them together to create the interaction variable. 
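<br /><br />As a rough sketch of that step outside SPSS, here it is in Python. The numbers are made up rather than taken from the real data file, and the function name zscores is just for illustration. It standardises each predictor using the sample SD (n - 1 denominator) and multiplies the two sets of z-scores together, person by person.<br /><br />

```python
from statistics import mean, stdev


def zscores(values):
    """Standardise values to mean 0 and SD 1 (sample SD, n - 1 denominator)."""
    m = mean(values)
    s = stdev(values)
    return [(v - m) / s for v in values]


# Hypothetical data for five students (not the real data file).
books = [0, 1, 2, 3, 4]
attend = [9, 15, 10, 16, 20]

z_books = zscores(books)
z_attend = zscores(attend)

# The interaction variable is the product of the two standardised predictors.
interaction = [zb * za for zb, za in zip(z_books, z_attend)]

for zb, za, zi in zip(z_books, z_attend, interaction):
    print(round(zb, 3), round(za, 3), round(zi, 3))
```

<br /><br />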
We then regress grade onto the standardised books, attendance and the interaction variables.<br /><br />The results are shown in the following table:<br /><br /><table style="margin-left: 4.65pt; border-collapse: collapse;" border="0" cellpadding="0" cellspacing="0"> <tbody> <tr> <td style="border-bottom: 1.5pt solid black; padding: 0in 4.65pt; text-align: left;">Variable</td> <td style="border-bottom: 1.5pt solid black; padding: 0in 4.65pt; text-align: center;">B</td> <td style="border-bottom: 1.5pt solid black; padding: 0in 4.65pt; text-align: center;">Std. Error</td> <td style="border-bottom: 1.5pt solid black; padding: 0in 4.65pt; text-align: center;">p</td> <td style="border-bottom: 1.5pt solid black; padding: 0in 4.65pt; text-align: center;">Lower 95% CI</td> <td style="border-bottom: 1.5pt solid black; padding: 0in 4.65pt; text-align: center;">Upper 95% CI</td> </tr> <tr> <td style="padding: 0in 4.65pt; text-align: left;">Zscore(books)</td> <td style="padding: 0in 4.65pt; text-align: right;">5.951</td> <td style="padding: 0in 4.65pt; text-align: right;">2.403</td> <td style="padding: 0in 4.65pt; text-align: right;">.018</td> <td style="padding: 0in 4.65pt; text-align: right;">1.076</td> <td style="padding: 0in 4.65pt; text-align: right;">10.825</td> </tr> <tr> <td style="padding: 0in 4.65pt; text-align: left;">Zscore(attend)</td> <td style="padding: 0in 4.65pt; text-align: right;">5.702</td> <td style="padding: 0in 4.65pt; text-align: right;">2.404</td> <td style="padding: 0in 4.65pt; text-align: right;">.023</td> <td style="padding: 0in 4.65pt; text-align: right;">.827</td> <td style="padding: 0in 4.65pt; text-align: right;">10.578</td> </tr> <tr> <td style="border-bottom: 4.5pt double black; padding: 0in 4.65pt; text-align: left;">Interaction</td> <td style="border-bottom: 4.5pt double black; padding: 0in 4.65pt; text-align: right;">4.503</td> <td style="border-bottom: 4.5pt double black; padding: 0in 4.65pt; text-align: right;">2.140</td> <td style="border-bottom: 4.5pt double black; padding: 0in 4.65pt; text-align: right;">.042</td> <td style="border-bottom: 4.5pt double black; padding: 0in 4.65pt; text-align: right;">.162</td> <td style="border-bottom: 4.5pt double black; padding: 0in 4.65pt; text-align: right;">8.844</td> </tr> </tbody></table><br />It's pretty clear that books has an effect, and attendance has an effect. We can interpret these in terms of standard deviations - we predict that a person who reads 1 SD more books will (controlling for attendance and the interaction term) achieve a grade that is 5.95 points higher. We predict that a person who attends 1 SD more lectures will achieve a grade that is 5.70 points higher.<br /><br />And we predict something from the interaction term, but we aren't exactly sure what yet.<br /><br />One way to see what we do predict from the interaction term is to 'pick a point'. Choose arbitrary values for attendance (say) plug them into the equation and see what happens.<br /><br />Let's do that first. Because we don't like stuff to be hard, we will pick a low, medium and high number of attendances. The easiest way to do this is to use z-scores -1, 0 and 1, respectively.<br /><br />Then, our equation looks like this:<br /><br />Pred Grade = books * 5.95 + attendance * 5.70 + interaction * 4.50<br /><br />We can expand this - because we know that the interaction is equal to books * attendance<br /><br />Pred Grade = books * 5.95 + attendance * 5.70 + books * attendance * 4.50<br /><br />Now we just stick -1, 0 or 1 into attendance, for a low medium and high attender.<br /><br />Low (-1)<br />Pred Grade = books * 5.95 + -1* 5.70 + books * -1 * 4.50<br /><br />Notice a couple of times we multiply a value by -1. 
Multiply those out.<br /><br />Pred Grade = books * 5.95 + -5.70 + books * -4.50<br /><br />Now notice that we have books * 5.95 + books * -4.50 - that makes books * (5.95 - 4.50) which makes books * 1.45<br /><br />Pred Grade = books * 1.45 + -5.70<br /><br />And, as we aren't interested in the constant, we can lose anything that's not multiplied by books.<br />Pred Grade = books * 1.45<br /><br />Same again for medium (this one's easier, because we just multiply by zero a lot).<br />Pred Grade = books * 5.95 + 0* 5.70 + books * 0 * 4.50<br />Pred Grade = books * 5.95<br /><br />Same again for high:<br />Pred Grade = books * 5.95 + 1* 5.70 + books * 1 * 4.50<br />Pred Grade = books * 10.45<br /><br />So, if you don't attend many lectures, reading another SD of books is going to help you a bit, but only a bit - your predicted grade only goes up by 1.45. If you are a moderate attender, you increase your predicted grade by 5.95 for the same number of books, and if you are a high attender, your predicted grade increases by 10.45 for the same number of books. (Notice that we are always saying predicted grade - we don't know that this is the cause).<br /><br />This is all well and good, but we might want stuff like p-values on these, or standard errors, or confidence intervals. We might say that for the low attender the relationship isn't even statistically significant - reading more books doesn't help (because the low attenders might read the wrong books). This is where Bauer and Curran's paper comes in.<br /><br />I'm not going to go into the nitty gritty. I'm just going to show what you can do with it. I'm going to use some calculators that you can find <a href="http://www.unc.edu/%7Epreacher/interact/mlr2.htm">here</a>. 
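<br /><br />Before turning to those calculators, the pick-a-point arithmetic above can be sketched in a few lines of Python. The coefficients are the ones from the table, to three decimal places; the function names are made up for illustration, and the constant is left out, as it was in the working above. The slope for books at a given attendance z-score is just the books coefficient plus the interaction coefficient multiplied by that z-score.<br /><br />

```python
# Coefficients from the regression table (z-scored predictors).
B_BOOKS = 5.951
B_ATTEND = 5.702
B_INTERACTION = 4.503


def predicted_grade(z_books, z_attend):
    """Predicted grade, ignoring the constant."""
    return (B_BOOKS * z_books
            + B_ATTEND * z_attend
            + B_INTERACTION * z_books * z_attend)


def simple_slope_of_books(z_attend):
    """Change in predicted grade per 1 SD of books, at a given attendance."""
    return B_BOOKS + B_INTERACTION * z_attend


for label, z in [("low", -1), ("medium", 0), ("high", 1)]:
    print(label, round(simple_slope_of_books(z), 2))
```

This only gives the point estimates of the slopes; it says nothing about their standard errors, which is the part the paper deals with.<br /><br />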
(But in the future, it looks like those calculators might be <a href="http://www.unc.edu/%7Ecurran/interact.htm">here</a>, and I'll probably forget to change this.)<br /><br />The most exciting piece of output is this graph (and the tools on the web let you draw it; click on the graphs for bigger, clearer images).<br /><br /><a href="http://www-users.york.ac.uk/%7Ejnvm1/pics/pic1.gif"><img style="margin: 0pt 10px 10px 0pt; float: left; cursor: pointer; width: 351px; height: 258px;" src="http://www-users.york.ac.uk/%7Ejnvm1/pics/pic1.gif" alt="" border="0" /><br /></a> The graph shows, on the x-axis, the z-score of the number of lectures attended by the student. On the y-axis is the simple slope of the (z-score of) books read. We evaluated that slope at -1, 0 and +1. We could do that by looking at those positions on the x-axis, tracking up to the black line, and reading off the values on the y-axis. The values appear to be where we said they would be - I've drawn in the lines on the next chart - remember our earlier calculations showed that they should be at 1.45, 5.95 and 10.45.<br /><br /><a href="http://www-users.york.ac.uk/%7Ejnvm1/pics/pic1lines.GIF"><img style="margin: 0pt 0pt 10px 10px; float: right; cursor: pointer; width: 387px; height: 286px;" src="http://www-users.york.ac.uk/%7Ejnvm1/pics/pic1lines.GIF" alt="" border="0" /></a><br />However the more exciting thing is the red lines. These show the confidence intervals around the slope. We said that if lecture attendance was low, we might expect that the effect of reading more books was not statistically significant. This does appear to be the case - the confidence intervals of the slope include 0 when the number of lectures attended is low.<br />Finally, the vertical dotted line shows the (z) value of lectures at which the slope becomes statistically significant. 
This line appears at -0.21. In other words, if the z-score of the number of lectures attended is below -0.21, we do not have evidence that reading more books will help improve the person's predicted grade.Jnoreply@blogger.comtag:blogger.com,1999:blog-15250473.post-1135001981230958232005-12-19T14:19:00.000Z2006-06-25T18:03:30.813+01:00Correlation and causationA <a href="http://www.badscience.net/?p=197">nice example of the failure to distinguish between correlation and causation </a>on the <a href="http://www.badscience.net/">badscience.net</a> website. The Daily Mail (which I'll decline to link to) says that moderate drinkers are thinner than non-drinkers, therefore drinking makes you thin.<br /><br />It's possible to think of a very large number of reasons that this relationship might exist. My top choice would be age - older people tend to drink less (drinking peaks at around age 21 - you'll have to take my word that I have read that somewhere reputable). Older people are more likely to be overweight than younger people. Maybe they've got their causation reversed - fatter people are less likely to drink than thinner people, because they are trying to lose weight.<br /><br />It's also possible to think of one very good reason that this is not a causal relationship. Take two identical people, make them do exactly the same things. Make them eat the same foods. Make one of them drink water. Make one of them drink wine. The one that drinks wine <span style="font-weight: bold;">will</span> become heavier than the one that doesn't - it's the laws of physics.<br /><br />However, when someone makes a causal statement based on a correlation, it's not for us, the readers, to find reasons it may not be true. It is for them to present a convincing argument. Saying "smokers get more cancer, because more smokers get cancer than non-smokers" is not a convincing argument, until a lot more evidence is collated. 
This is discussed in <a href="http://www.amazon.co.uk/exec/obidos/ASIN/0805805281/jeremymiles">Statistics as Principled Argument</a> by Robert Abelson, and also in <a href="http://www.amazon.co.uk/exec/obidos/ASIN/0761962301/jeremymiles">Applying regression and correlation</a>, by me and Mark Shevlin.Jnoreply@blogger.comtag:blogger.com,1999:blog-15250473.post-1131488269181853632005-11-08T22:07:00.000Z2005-11-08T22:17:49.193ZFreakonomicsI started to read the book <a href="http://www.amazon.co.uk/exec/obidos/ASIN/0141019018/jeremymiles">Freakonomics</a>, a book that's caused quite a stir (although it passed me by). The very first example it gives is that of a paper by <a href="http://papers.ssrn.com/paper.taf?ABSTRACT_ID=174508.">Donohue and Levitt</a> in which they claim that legalising abortion in the early 1970s cut crime in the 1990s - the type of person that has an abortion is the type of person who has children who then commit crimes. Basically, the criminals of the future were never born, so Donohue and Levitt claim.<br /><br /><a href="http://www.crab.rutgers.edu/%7Egoertzel/">Goertzel </a>has used this very example in a paper, <a href="http://www.crab.rutgers.edu/%7Egoertzel/mythsofmurder.htm">Econometrics as Junk Science</a>, published in the <a href="http://www.csicop.org/si/">Skeptical Inquirer</a>. Donohue and Levitt used multiple regression, to find the effects of legalising abortion, controlling for other variables. But then N is, effectively, 1. Abortion was only legalised once - and that means that the correlation with any other event is 1.00, and the data are unanalysable.<br /><br />As Goertzel says in his paper: "[this is] a way to model an inadequate body of data. Econometrics cannot make a valid general law out of the historical fact that abortion was legalized in the 1970s and crime went down in the 1990s. 
We would need at least a few dozen such historical experiences for a valid statistical test."<br /><br />Apart from that, I rather liked the book. It did have some interesting examples of multiple regression, although it didn't really label them as such.Jnoreply@blogger.comtag:blogger.com,1999:blog-15250473.post-1124962084350042552005-08-25T10:22:00.000+01:002005-08-25T10:28:04.356+01:00SPSS User Group MeetingThe <a href="http://www.spssusers.co.uk/">SPSS User Group </a>Annual Conference will be held in York in November this year. <a href="http://www.psy.herts.ac.uk/pub/D.E.Kornbrot/hmpage.html">Diana Kornbrot</a> will be talking about presenting results in logistic regression.Jnoreply@blogger.comtag:blogger.com,1999:blog-15250473.post-1124915628060955302005-08-24T21:28:00.000+01:002005-08-24T21:42:33.263+01:00Longitudinal multilevel models in SPSS<span style="font-family:georgia;">In Version 11, SPSS added a multilevel modelling (they call it mixed models) procedure. I find the menus completely incomprehensible - I've tried intelligently selecting the appropriate options, and I've tried randomly guessing, and neither gives me the right answer. </span><br /><span style="font-family:georgia;">Instead, you have to use syntax - and so here, to get you (and me) started, is the syntax for longitudinal multilevel models in SPSS.</span><br /><span style="font-family:georgia;">I'm assuming you've a variable called time, which represents, well, time. 
A variable called outcome, which represents the outcome, and a variable called subject, which identifies (you've guessed it) the subjects.</span><br /><br /><span style="font-family:georgia;">Your data will look something like:</span><br /><br /><span style="font-family:courier new;">Time Outcome Subject</span><br /><span style="font-family:courier new;"><br />1 10 1<br />2 12 1<br />3 15 1<br />1 11 2<br />2 15 2<br />2 14 3<br />3 17 3<br /><br />...<br /><br />3 18 200</span><br /><br /><span style="font-family:georgia;">If you only want random intercepts:<br /><span style="font-family:courier new;"><br />MIXED<br />outcome with time<br />/fixed = time<br />/random = intercept | subject (subject) covtype (id)<br />/print = solution.<br /></span><br />For random intercepts and slopes, with a zero correlation between slope and intercept:<br /><span style="font-family:courier new;"><br />MIXED<br />outcome with time<br />/fixed = time<br />/random = intercept time | subject (subject) covtype (id)<br />/print = solution.<br /></span><br /><br />(Note that the variable in brackets after subject is the one that identifies the subjects, and that covtype (id) specifies the form of the covariance matrix of the random effects - id is short for identity. Strictly, identity also constrains the intercept and slope variances to be equal; covtype (diag) lets them differ while keeping the correlation at zero.)<br /><br />And for random slope, random intercepts, and correlation between slope and intercept:<br /><br /><span style="font-family:courier new;">MIXED</span><br /><span style="font-family:courier new;"> outcome with time</span><br /><span style="font-family:courier new;">/fixed = time</span><br /><span style="font-family:courier new;">/random = intercept time | subject (subject) covtype (un)</span><br /><span style="font-family:courier new;">/print = solution.</span><br /></span><br /><br />The covtype (un) is unstructured: the two variances and the covariance between slope and intercept are all estimated.<br /><br />A much more detailed and useful guide to mixed models in SPSS, written by Alastair Leyland, can be found <a href="http://multilevel.ioe.ac.uk/softrev/revspss.html">here</a>.Jnoreply@blogger.comtag:blogger.com,1999:blog-15250473.post-1123710004571892982005-08-10T22:35:00.000+01:002005-08-10T22:41:08.993+01:00Extrapolating too farI came across a great example of 
extrapolation beyond the data, on the <a href="http://groups-beta.google.com/group/MedStats">medstats </a>mailing list.<br /><br />It's about how to cook a turkey, by dropping it out of the window. Lots of times. I did a bit of googling, but I couldn't find any other reference to it. The link in medstats is <a href="http://groups-beta.google.com/group/MedStats/browse_frm/thread/5d53ece79ac43e49/df569bff4a058da2?lnk=st&q=medstats+turkey&rnum=1#df569bff4a058da2">here</a>. If you know more, email me.<br /><br />Incidentally, the Journal of Irreproducible Results became the <a href="http://www.improbable.com/">Annals of Improbable Research</a>.Jnoreply@blogger.comtag:blogger.com,1999:blog-15250473.post-1123591673405344182005-08-09T13:45:00.000+01:002005-08-25T10:40:39.536+01:00Website updatesSome updates have been made to the website. The front page <a href="http://www.jeremymiles.co.uk/">http://www.jeremymiles.co.uk/</a> has been updated. <a href="http://www.jeremymiles.co.uk/regressionbook/errors.html">Errors and omissions </a>has been updated. <a href="http://www.jeremymiles.co.uk/regressionbook/books.html">Further reading </a>has been updated, with a couple of new books, and new editions of some books. <a href="http://www.jeremymiles.co.uk/regressionbook/books.html"></a>Jnoreply@blogger.comtag:blogger.com,1999:blog-15250473.post-1123582642202341162005-08-09T11:17:00.000+01:002005-08-09T11:23:10.950+01:00Introducing the BlogThis is the Applying Regression blog, which is going to contain stuff about updates to the book Applying Regression and Correlation, by Jeremy Miles and Mark Shevlin.<br /><br /><a href="http://www.jeremymiles.co.uk/regressionbook/index.html">More information on the book.</a>Jnoreply@blogger.com