I've discussed what I wanted for this experiment activity and what my plan was, and now I'll talk about how it went and what to change in future activities. Overall I thought it went pretty well, but a few major changes had to be made on the fly. The biggest one was the math tests themselves; I completely underestimated my students' basic math ability and we came up with useless data. Someone mentioned that they thought it was part of the activity, and that it was a great way to show how 'messy' statistics really is. I'm glad they thought that, because I really didn't anticipate it.

The initial discussion of how to construct this experiment was useful and demonstrated a number of ideas we discussed in class. Controlling for certain variables became a big part of the discussion, namely how to control for people with natural math ability. We decided to do a paired sample, matching people of similar math ability by their scores on the first test. I asked if this was really the best measure, and we had a good conversation about how to measure someone's math ability, and how for **some** people, that's their job.

To control for some of these variables, and to construct a basic demographic survey, I had students develop a few survey questions that might help explain some of the variation in math ability. This discussion included what to ask, how to ask it, and what kinds of variables (categorical, numerical, etc.) we were measuring. I suggested a question about how long it had been since you last took a math class, and some students wanted to make it a categorical variable: 0-6 months, 7-12 months, etc. I responded with the question "Is it easier to turn numerical data into categorical data or categorical data into numerical data?" and we talked about converting from one data type to another and how we were going to use the data.
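My question about conversion direction can be sketched in a few lines of Python; the bin edges here are hypothetical, not necessarily the ones the class settled on:

```python
# A numeric 'months since last math class' can always be binned into
# categories later, but a categorical answer like '0-6 months' cannot be
# recovered as an exact number. Bin edges and sample data are hypothetical.

def to_category(months):
    """Convert a numeric months value into a bin label."""
    if months <= 6:
        return "0-6 months"
    elif months <= 12:
        return "7-12 months"
    else:
        return "12+ months"

months_since_class = [2, 9, 30, 5, 14]
print([to_category(m) for m in months_since_class])
# ['0-6 months', '7-12 months', '12+ months', '0-6 months', '12+ months']
# Going the other way (bin label -> exact months) is impossible.
```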

We also talked about what would happen if a person took two similar math tests back to back. One student mentioned that people would become fatigued, and rightly so. I asked what we could do to limit the fatigue, and we discussed the pros and cons of long and short tests. I also mentioned the idea of activating previous knowledge: after seeing the first test, students would remember how to complete the questions on the second test. We 'settled' on giving both tests to both the control and experimental groups, with 18 basic math questions... since that's what was in the packet. I know this isn't in the true spirit of exploratory activities, and some people might deride me for exerting this amount of control over the process. I want students to explore this material and engage with it, but doing everything on the fly doesn't seem conducive to those aims. Without some kind of structure, students get bored, annoyed, and disengaged from the learning process. I can deal with the first two (barely), but not the third.

Students then took the survey we constructed together (number of months since last math class, sex, age, handedness, work status) and the first test. I collected them all, handed them back out randomly, and we graded them. I then had students come up to the computer to enter the information into Excel. This was a good step since it showed that data entry is an important step, one people take for granted. It is time-consuming work, and it must be done accurately. This was at the 1-hour mark, and once they entered the data we had a 5-minute break.

Once the data was entered, there were some survey responses that didn't make sense. Instead of an age, one person put 'old'. For the number of months since their last math class, someone put the categorical response 0-6 months even though we had settled on a numeric one. I then discussed data cleansing and how our seemingly simple decisions on handling these discrepancies have a real impact on our data. For the 0-6 month responses, we entered 3 and included an asterisk. For the 'old' response, we took the oldest age in the data set (26) and replaced 'old' with that number, again including an asterisk.
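These cleaning rules can be sketched as small functions; the helper names and sample values are hypothetical, and the returned flag plays the role of the asterisk in our spreadsheet:

```python
# A minimal sketch of the cleaning decisions described above.
# Each substitution returns a (value, flagged) pair so the change is visible.

def clean_months(value):
    """'0-6 months' becomes the bin midpoint 3, flagged; numbers pass through."""
    if value == "0-6 months":
        return 3, True
    return int(value), False

def clean_age(value, max_known_age):
    """Non-numeric entries like 'old' get the oldest known age, flagged."""
    if str(value).isdigit():
        return int(value), False
    return max_known_age, True

print(clean_months("0-6 months"))  # (3, True)
print(clean_age("old", 26))        # (26, True)
print(clean_age(21, 26))           # (21, False)
```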

After grading and entering the data, all but two people received perfect scores on the first test. We discussed how we couldn't use this data, since we were looking for improvement in math ability after attempting the puzzle, and we can't show improvement if everyone scores perfectly the first time. I quickly made another, more difficult test (basic algebra, roots, percentages), had students complete it, graded it, and the scores were much more varied.

This turned out great, even though my veins turned to ice when I looked at the initial scores. Students saw that our question couldn't be answered with the data set we had so carefully constructed, and we had to start again. This demonstrates the 'messiness' of statistics I try to get across to them, and how you really have to rely on sound statistical principles and your understanding of the context to get good data.

Creating the sample was now fairly simple: we paired people based on their initial scores and randomly assigned one of each pair to the control group and the other to the experimental group. While pairing, we noticed that we had an odd number of people. I asked if there was an observational unit (person) whose initial test and survey information seemed to fall outside everyone else's. We decided to remove the 'old' entry from above, since it did not seem comparable with the others. Once we did that, we created each of the groups.
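The pairing and random assignment can be sketched like this; the names and scores are hypothetical, and the seed is only there to make the sketch reproducible:

```python
import random

# Sort students by first-test score, pair adjacent students, then randomly
# send one of each pair to control and the other to experimental.
# Scores are made up; the outlier is assumed to be already removed.

scores = {"A": 10, "B": 9, "C": 14, "D": 13, "E": 7, "F": 8}

ranked = sorted(scores, key=scores.get)
rng = random.Random(0)  # seeded so the sketch is reproducible
control, experimental = [], []
for i in range(0, len(ranked), 2):
    pair = [ranked[i], ranked[i + 1]]
    rng.shuffle(pair)
    control.append(pair[0])
    experimental.append(pair[1])

print(control, experimental)
```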

The experimental group then had 10 minutes to work on the puzzle. I did not say they had to complete it, just that they had to attempt it. The control group worked on the part of the activity that was to be turned in: descriptive statistics and a box-and-whisker plot of the four data sets (pre-test/post-test, control/experimental). Once the time was up, I made another quick test, administered it, we graded it, and I collected the scores.
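The box-and-whisker plot the control group worked on rests on a five-number summary; here is a sketch with made-up scores, using Python's `statistics` module (its default 'exclusive' quantile method is one of several common conventions, so hand calculations may differ slightly):

```python
import statistics

# Five-number summary (min, Q1, median, Q3, max) behind a box-and-whisker
# plot. The scores are hypothetical, not the class data.

scores = [3, 7, 8, 5, 12, 14, 21, 13, 18]

q1, median, q3 = statistics.quantiles(scores, n=4)
summary = (min(scores), q1, median, q3, max(scores))
print(summary)
```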

Once all the data was together, there was a distinct difference between the control and experimental groups. Averaging the before and after puzzle scores for both groups, it was clear that the experimental group's averages were about ten percent higher than the control group's. I then found the differences in scores, averaged those, and found the percent increase to be about six percent. These two calculations came out to two different numbers, so I asked: which one would I pick to sell my puzzle? We then had a conversation about these sorts of averages, how to compute other similar numbers, and how marketers do similar things in their promotional materials.
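The gap between the two numbers comes from averaging in a different order. A hypothetical illustration (the scores are made up, not the class data) of how the percent change of the group averages can differ from the average of each student's percent change:

```python
# Two reasonable 'average improvement' calculations that disagree.
# pre/post are hypothetical per-student test scores.

pre  = [10, 5, 8]
post = [11, 7, 8]

def mean(xs):
    return sum(xs) / len(xs)

# Percent change of the averages: one improvement number for the whole group.
change_of_means = (mean(post) - mean(pre)) / mean(pre)

# Average of each student's percent change: a different number.
mean_of_changes = mean([(b - a) / a for a, b in zip(pre, post)])

print(round(change_of_means, 3))  # 0.13
print(round(mean_of_changes, 3))  # 0.167
```

A marketer selling the puzzle would, of course, quote whichever number is larger.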

Overall I thought it was great, and I would probably keep most of it, including the too-easy tests. I would be a little more prepared and have some back-up tests to use, but it really demonstrated that your initial plan sometimes doesn't work. I would like to include scatterplots and linear regression models next time, but they are not included in our Statistics I course.

I could have done a better job of demonstrating how to control for a variable, and I could have used some basic descriptive techniques to do so. For example, breaking the control and experimental groups down by sex, handedness, or work status could show whether there were any significant differences between these groups.
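One way to sketch that check: compute the average improvement within each (group, subgroup) cell and compare across cells. The rows here are hypothetical, not the class data:

```python
from collections import defaultdict

# Group score improvements by (group, sex) and compare the cell averages.
# A large gap between cells would hint at an uncontrolled variable.

rows = [
    ("control", "F", 2), ("control", "M", 1), ("control", "F", 0),
    ("experimental", "F", 3), ("experimental", "M", 2), ("experimental", "M", 4),
]

cells = defaultdict(list)
for group, sex, improvement in rows:
    cells[(group, sex)].append(improvement)

means = {key: sum(vals) / len(vals) for key, vals in cells.items()}
for key in sorted(means):
    print(key, means[key])
```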

Thank you for reading these posts about me struggling through the planning, development, and execution of this activity. If you have any thoughts or questions feel free to post them below.