[MUSIC] Hello and welcome to week four of this MOOC, where we start to delve into the world of statistical inference. Now inference itself we could subdivide into two main branches: firstly, estimation, our focus for week four, and secondly, hypothesis testing, our focus in week five.

Now conceptually, what are we trying to achieve? Well, this word inference means to infer something about a wider population based on an observed sample of data. So really, when we do statistical analysis, the data we observe we tend to view as a sample drawn from some wider population. Now this word population, in the everyday use of the term, may refer to perhaps the population of a country or maybe a city. Well, indeed, we may be considering those particular types of populations in our statistical studies, but we are not confined to that kind of simplistic definition of a population. Rather, a population doesn't necessarily even have to refer to human beings. It may be the population of companies whose shares are listed on some stock exchange. Maybe we're looking at the population of fish in the sea, planets in the universe, you name it.

Now at the heart of what we try to do with statistical inference is that we assume that our sample is fairly representative of that wider population. And our goal when selecting a sample in the first place is to achieve, hopefully, this representativeness. Now conceptually, that may sound straightforward enough, but it's perhaps easier said than done. History is littered with examples where inferences have been drawn from samples which were very much unrepresentative of the population.

Perhaps we will consider a few famous examples. In the 1936 US presidential election, the Literary Digest, a magazine covering various topics to which generally wealthy people tended to subscribe, predicted that the Republican candidate would win that 1936 election.
As for the size of the data set they dealt with, well, they had over 2 million responses to their survey, and that opinion poll seemed to suggest that the Republican candidate would win. In the end, FDR, the Democratic candidate, won the election. So one might think that if you base an opinion poll on over two million responses, that's going to give you a very accurate result. Well, it transpires that this was a classic case whereby the population from which the sample was drawn was not in fact representative of the target population. The target population, in this case, would have been the US electorate. However, the sample of voters that the Literary Digest considered was drawn from its own readership. So here we see an example of coverage bias of the sampling population, i.e., the people whose views were solicited were only drawn from the Literary Digest subscribers, who themselves were typically not representative of the US electorate overall. Why? Well, this was the 1936 election, really in the heart of the Great Depression, so what sorts of individuals would be subscribing to the Literary Digest? Typically those on very high incomes, and hence the sample would give us a somewhat skewed representation of the US electorate. It would tend to have a much greater proportion of individuals from high socioeconomic groups, who would tend to support the Republican candidate.

That was 1936. Scroll forward 12 years to the 1948 US presidential election. Now, most of the mass media at the time were calling for the Republican candidate, Dewey, to beat the Democratic candidate, Truman. And there's a very famous image in which the victorious President Truman is holding up a copy of the Chicago Tribune which had as its headline, "Dewey Defeats Truman". Why? Because the opinion poll which had been conducted seemed to suggest a victory for the Republican challenger.
And indeed, this was another example whereby the opinion poll was based on a sample which turned out not to be representative of the population as a whole.

Maybe some more recent examples. The Brexit referendum of 2016. Admittedly, the polls were fairly narrow, none really indicating a clear lead for either side. Nonetheless, there was a general expectation that the remain side would win the referendum itself. Now, if one actually looks at the opinion polls, they give slightly different results depending on the contact method which was used to obtain the sample. Namely, telephone polls tended to give a slight lead to the remain campaign, whereas polls conducted online tended to give the reverse picture, a slight lead for the leave campaign. Now of course, once the referendum had been concluded and we saw that the leave side won, with hindsight you might want to explain why the online polls were more accurate than those conducted by telephone. Well, it transpires that the online surveys were better able to reach a representative sample of voters based on educational background, which transpired to be one of the key predictors as to which way an individual voted in that referendum.

So our goal, hopefully, is to get a sample which is representative of the population. But we should beware the many instances in history where this has failed to happen, leading us to erroneous inferences.

So for the rest of this section, I'd just like to introduce you to a few basic examples of sampling techniques. Now we could really divide or partition sampling techniques into two main types. On one side, we have the so-called non-probability or non-random sampling, our focus for the remainder of this section. And in the next section, we will look at different types of probability or random sampling. So, examples of non-random sampling. Well, our first one would be convenience sampling.
Imagine you're out shopping in a supermarket, and someone asks you to taste test or sample perhaps a new biscuit or cheese or bread that's being offered. This is an example of convenience sampling, whereby the people who are asked to taste test the product are simply those who are in the right place at the right time, i.e., they happen to be passing that particular counter in the supermarket. Or maybe, next time you listen to or watch the TV news, you might see some people-in-the-street interviews, whereby the reporter selects a few individuals to solicit their opinions on whatever issue is being reported on. So convenience sampling is by definition convenient. It's fairly quick and easy to do. However, the sample which is generated is not necessarily representative of the wider population. For example, returning to the supermarket, depending on the time of the day, and indeed on the day of the week, the sorts of people who may be in the supermarket at that time may not be representative of all shoppers at that supermarket. Is it during school hours, when children may not be present? Is it at weekends, when maybe office workers are not at work? So one should always be conscious of just how representative your sample may prove to be.

Now an extension of convenience sampling is judgmental sampling, which is a form of convenience sampling where, however, there is some sort of expert who exercises judgment to try and reach this utopian representative sample. An example might be when so-called expert witnesses are called to give evidence in a criminal trial. There may be evidence to support either the prosecution's or perhaps the defense's case. So these experts should hopefully have some expertise in their particular field, and should be able to give a more insightful evaluation of the evidence than a member of the general public might be able to achieve.
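
To make the supermarket point about convenience sampling a little more concrete, here is a minimal Python sketch. Everything in it is a hypothetical assumption made up for illustration: the three shopper groups, their shares of all shoppers, and the chance that each is in the store on a weekday at midday. It simply compares who ends up in a convenience sample (drawn only from people present at that time) with who ends up in a simple random sample of all shoppers.

```python
import random
from collections import Counter

# Hypothetical groups: (share of all shoppers, chance of being in store on a weekday at midday).
# These numbers are invented for illustration, not taken from any data.
GROUPS = {
    "office worker": (0.5, 0.1),
    "retired": (0.2, 0.8),
    "other": (0.3, 0.4),
}


def group_shares(sample):
    """Return the proportion of each group in a sample."""
    counts = Counter(sample)
    return {name: counts.get(name, 0) / len(sample) for name in GROUPS}


def simulate_convenience_sample(seed=0, population_size=10_000, sample_size=200):
    """Compare a convenience sample with a simple random sample of shoppers."""
    rng = random.Random(seed)
    names = list(GROUPS)
    shares = [GROUPS[name][0] for name in names]

    # The full population of shoppers at this (imaginary) supermarket.
    population = rng.choices(names, weights=shares, k=population_size)

    # Only some shoppers happen to be in the store when the taste test runs.
    in_store = [g for g in population if rng.random() < GROUPS[g][1]]

    # Convenience sample: drawn only from whoever is present at that time.
    convenience = rng.sample(in_store, sample_size)
    # Simple random sample: every shopper is equally likely to be chosen.
    simple_random = rng.sample(population, sample_size)

    return group_shares(convenience), group_shares(simple_random)


if __name__ == "__main__":
    conv, rand = simulate_convenience_sample()
    print("convenience sample:", conv)
    print("random sample:     ", rand)
```

Under these assumed numbers, office workers make up half of all shoppers but only a small fraction of the midday convenience sample, which is the kind of unrepresentativeness described above.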
Thirdly, we can consider quota sampling. Here we are conscious of our goal of trying to get a sample which is representative of the wider population. Well, if we know certain attributes or characteristics within the population, we may seek to replicate those within our sample by selecting particular quotas of various types of individuals. So let's take a very simple example. Let's imagine our population consisted of 50% males and 50% females. Then, if we decided we wanted to sample 1,000 individuals, by construction we would set quotas such that 50%, i.e., 500 of those 1,000 respondents, would be male and the other 50% would be female. However, this is still a form of non-random sampling, because it will be up to the researcher him or herself to actually do the selection and choose which individuals participate in the study. So for example, if you see someone with a clipboard standing on the pavement, they may have been set particular quotas, maybe by gender, and perhaps by age group as well, if a broad cross-section of respondents by age is wanted too. Then they would select the individuals that they approach to ask to complete their survey. Of course, there may be some selection bias here. That person with the clipboard is perhaps not equally likely to select any individual passing. They may opt for those who seem friendlier, more approachable and hence more likely to respond to their survey.

And finally, we have so-called snowball sampling. Now, no, this is not something that can only be conducted in winter, but the name snowball is derived as follows. If you think about putting a small snowball at the top of a mountain and letting the thing roll down the side, it's going to become bigger and bigger and bigger. So with snowball sampling, we will select a small number of respondents, potentially randomly to begin with.
However, once they have been selected, we would then ask them to give us some referrals of their friends, family and associates who they think might be appropriate for our study. So we try to achieve a representative sample that way. However, given that these respondents are choosing from their own network of people they know, they may, in fact, exhibit some biases in their selections.

So, just to round off. Remember when we introduced the different levels of measurement: nominal and ordinal as types of categorical variables, and also our interval and ratio levels of measurement. We said at the time that the types of variables we have would affect the kinds of statistical analyses we can perform. Well, in a similar way, how we select our sample, be it non-randomly as discussed here or randomly as to be discussed next, will affect how we can actually analyze the data. Namely, a lot of our statistical inference procedures, which will be introduced in due course, tend to require samples selected in a probabilistic or random sense. So the non-random techniques considered here, convenience, judgmental, quota, and snowball sampling, have as perhaps their key benefits the general ease and convenience of collecting the data. However, due to the potential for selection bias to appear, we are perhaps limited in how we may analyze that data statistically. Nonetheless, for some sort of exploratory study, if you just want a rough idea about people's attitudes or opinions on some topic, then potentially those non-random kinds of sampling may suffice for your purposes. [MUSIC]
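
As a rough illustration of the quota sampling example from this section (a population that is 50% male and 50% female, and a sample of 1,000), the following Python sketch first turns population proportions into quotas and then fills them from a stream of passers-by. The function names, the `approachable` flag and the stopping probabilities are all hypothetical assumptions introduced for illustration; they are one way to mimic an interviewer who tends to stop friendlier-looking people.

```python
import random


def quota_sizes(proportions, sample_size):
    """Turn population proportions into per-group quotas (rounded to whole people)."""
    return {group: round(share * sample_size) for group, share in proportions.items()}


def fill_quotas(approached, quotas):
    """Accept people in the order they were approached until every quota is full."""
    remaining = dict(quotas)
    sample = []
    for person in approached:
        group = person["gender"]
        if remaining.get(group, 0) > 0:
            sample.append(person)
            remaining[group] -= 1
        if not any(remaining.values()):
            break  # all quotas are full
    return sample


def simulate_street_survey(seed=1, n_passers=20_000):
    """Simulate an interviewer with gender quotas who favours approachable people."""
    rng = random.Random(seed)
    quotas = quota_sizes({"male": 0.5, "female": 0.5}, 1000)

    approached = []
    for _ in range(n_passers):
        person = {
            "gender": rng.choice(["male", "female"]),
            # Assumption: half of all passers-by look approachable.
            "approachable": rng.random() < 0.5,
        }
        # Assumption: approachable-looking people are stopped far more often.
        stop_probability = 0.6 if person["approachable"] else 0.1
        if rng.random() < stop_probability:
            approached.append(person)

    return quotas, fill_quotas(approached, quotas)


if __name__ == "__main__":
    quotas, sample = simulate_street_survey()
    males = sum(p["gender"] == "male" for p in sample)
    approachable = sum(p["approachable"] for p in sample) / len(sample)
    print("quotas:", quotas)
    print("males in sample:", males)
    print("share approachable:", round(approachable, 2))
```

In this sketch the gender quotas are met exactly by construction, yet the share of approachable people in the sample is well above the assumed 50% in the population. That is the selection bias described above: matching the population on the quota variables does not make the sample representative on anything else. Note also that simple rounding of quotas may not add up exactly to the sample size for less convenient proportions.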