The Lab @ DC is a remarkable initiative by the Office of the City Administrator of Washington, DC. Now, Dr. Lance already introduced you to the work of the lab, and to the specific case we'll be covering today, but I just want to touch on what I find remarkable about this group. We are seeing more and more governments of all sizes open up their data, often because legislation requires them to, but the lab has gone much further and opened up the whole scientific process they use. This is next-level accountability and transparency, and frankly, I hope it catches on. It gives everyone an opportunity to learn what works and what doesn't, and importantly, it allows anyone to improve their skills, question assumptions, and contribute with very low friction. That I could take the work they've done and do a partial replication analysis for this lecture, without having to actually talk to anyone at the lab, is incredible and rare. Enough of me heaping praise; let's consider the case they were interested in. Trash stinks, and municipalities have different strategies they can employ to reduce litter in high-density areas like city centers. Public garbage cans are common, and the question asked in this work is: can a behavioral nudge, in the form of a positive message on the garbage can itself, reduce the amount of litter on the streets? If so, this would be an extremely inexpensive intervention that could be rolled out to all public garbage cans with minimal cost. To test this, in 2017, the city chose a number of streets in busy areas and randomly assigned some garbage cans to have signs and some not to. The randomization methods, which Dr. Lance covered, are really important, because we would expect some cans to show a change in garbage just based on traffic patterns in the city. Now, I'm going to focus on the analysis of the data that was collected. The outcome variable we are interested in is the average of the maximum fullness per can, per day.
Sensors were installed on the garbage cans to measure this fullness by volume. Let's take a look at the data. I'm going to bring in the tidyverse, load the data file called littercan_fills.csv, and take a look at it. We see here a bunch of data about garbage cans: sensors, the fill level, locations, and dates. There's no information, though, on the condition, with or without a nudging sign, that each garbage can has been assigned to, so we're going to have to look in another data file for that. I'm going to load littercans_randomized.csv into df_cans and take a look at that. In the second data file, we can see can-level information: the address, the street, the identifier of the block, and the side of the street. Now, in North America, odd street numbers are on one side of the street while even numbers are on the other, so you'll see lots of mention of odd and even numbers here in the data. Importantly, we see the condition assignment variable, 'Z', which is a '0' if the can had no sign on it and a '1' if it did. Now, this arrangement of data is pretty common. We have one file with all of the experimental observations that were taken, that's our df_fills, and another with details about the assignment condition and information about the population being sampled, in this case the garbage cans in our experiment, and that's our df_cans. We're going to have to join these two data frames together, and that means we get to learn a little more about R. We have a unique garbage can identifier in both datasets, MeID, and this is going to be key to our success. But before we do that, it's important to note that the base data we've been provided doesn't actually have the outcome measure the team was interested in. Remember, this was the average of the maximum fullness per can, per day. Let's go back to df_fills and take a look at a couple of the variables we have here.
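The loading and joining steps described above can be sketched as follows. The filenames, the `Z` variable, and the `MeID` key all come from the discussion; the specific `left_join()` call is my assumption about how the two tables line up, not necessarily the exact code the lab used:

```r
library(tidyverse)

# Sensor readings: one row per fullness measurement per can
df_fills <- read_csv("littercan_fills.csv")

# Can-level metadata, including the assignment indicator Z
# (0 = no sign, 1 = sign on the can)
df_cans <- read_csv("littercans_randomized.csv")

# Join on the shared can identifier so that every sensor reading
# carries its can's experimental condition
df_joined <- df_fills %>%
  left_join(df_cans, by = "MeID")
```

A left join keeps every fill observation and attaches the can-level columns wherever the `MeID` matches, which is usually what you want when the observations table is the one you're analyzing.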
The RecordedDateTime, the MeID, and the CalculatedPercentFull. It looks like we have a can identifier, a timestamp of when values were measured, and a percentage of fullness. But if you look closely, you'll see that fullness is measured more than once a day. Let's do a little exploratory data analysis on a single garbage can. I'm just going to take this first one; I looked at the table and grabbed one randomly. We're going to filter on the date November 27th, and I'll toss that into a ggplot. Here's our plot of the calculated percent full over time for that can on a single day. We have a variable fill level being measured throughout the day. Now, we want to calculate the average of the maximum fullness per can per day, which means we want to find the maximum fullness of a can on a given day and then calculate the average of that across all of the days for each can. You already have the dplyr skills to pull this off, so now would be a great time for you to pause this video and try it yourself. For my solution, I'm going to load in lubridate and use it to convert this to an actual date object, which gets rid of the time component automatically. There are other ways you could do this, of course, and you've already seen some of them; you could treat this as a string instead and just pull out the pieces you're interested in. But here I'm just going to create a new variable, date_reading, by turning the RecordedDateTime into a date. Then I'm going to group the cans by ID and date, and create a new variable which is just the maximum of the CalculatedPercentFull. I'm actually going to divide this by 100 so that I get a number between 0 and 1. Of course, we create that with summarize. Let's take a look at this. Sometimes you have to look at your data and just wonder if it's trash. This is an important part of the data manipulation and cleaning process.
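One way to carry out the steps just described, the single-can plot and the per-can, per-day maximum, might look like this. The column names match those mentioned above; the can ID is a placeholder you'd replace with a real `MeID` from the table, and the year in the date filter is my assumption based on the 2017 experiment mentioned earlier:

```r
library(tidyverse)
library(lubridate)

# Exploratory plot: fullness over a single day for a single can.
# "SOME_CAN_ID" is a stand-in for whichever MeID you grab from the table.
df_fills %>%
  filter(MeID == "SOME_CAN_ID",
         as_date(RecordedDateTime) == ymd("2017-11-27")) %>%
  ggplot(aes(x = RecordedDateTime, y = CalculatedPercentFull)) +
  geom_line()

# Outcome construction: the maximum fullness per can per day,
# rescaled from a percentage to a proportion between 0 and 1
df_max <- df_fills %>%
  mutate(date_reading = as_date(RecordedDateTime)) %>%
  group_by(MeID, date_reading) %>%
  summarize(max_fill = max(CalculatedPercentFull) / 100,
            .groups = "drop")
```

Grouping by both the can identifier and the date means `summarize()` produces one row per can per day, which is exactly the unit the outcome measure is defined over.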
For instance, here we've got a single max fill value that's greater than one. What does this actually mean? Let's dig in and investigate it a bit. I'm going to update the line plot we made earlier with this new can and toss in some axis breaks so we can take a look at the data. That's the can over the whole period, and with these breaks you can see that it does indeed go above 100 percent. It gets close to 100 at one point, and it drops down to 0 a couple of times. So in this case, it appears our data really does go above 100 percent. Is this reasonable? Is it right? Answering that requires going and talking to the experiment design team to understand how the sensors work. Maybe there's a max fill line, which represents 100 percent, and the sensor is actually reading garbage that's been piled above it. Or perhaps it's a data collection error, and the data needs to be cleaned further. Regardless, it's important for you as an analyst to work with the larger team and explore the data to understand it. In this specific case, the team at the DC lab chose to cap the values at 100 percent, so we're going to do that too. It's just a quick little mutate here; this is all old habits for you now.
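The capping step can be done with `pmin()`, which takes the element-wise minimum of each value and 1. This assumes the `max_fill` proportion column built in the earlier summarize step; the lab's own code may differ in the details:

```r
library(tidyverse)

# Cap fill proportions at 1 (i.e., 100 percent), matching the
# decision made by the team at the lab: any reading above the
# max fill line is treated as completely full
df_max <- df_max %>%
  mutate(max_fill = pmin(max_fill, 1))
```

Note that `pmin()` (parallel minimum) is the vectorized choice here; plain `min()` would collapse the whole column to a single number rather than capping each row.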