Adventures in Data Science: Can These Three Statements All Be True?

May 12, 2016

By David Yourdon, Data Scientist

The tremendous amount of consumer and campaign data now available to many marketers and publishers can be paralyzing due to its sheer size and complexity. One of the main reasons the data can stop folks in their tracks is that it can tell seemingly different and even contradictory stories depending on how you look at it.

This is why the marketing and media industry will need to become more comfortable playing around with data science. That isn’t to say every marketer, media buyer or chief revenue officer will need to learn how to design and develop pattern analysis algorithms. Instead, it means starting to think about data in ways that enable you to make sense of scenarios that seem to be paradoxical or even nonsensical.

To that end, here’s an example of how to untangle such complexity through some very basic data analysis.

Our Example

Imagine you and two colleagues are sitting in a conference room, trying to determine what drives your users to convert (e.g., click on an ad or buy your product). It’s late, and you’ve been at it for hours.

One of your colleagues asserts that the data shows that most users who log onto your site at least once a week don’t convert. She presents data to back up her assertion.

However, your other colleague doesn’t seem to agree. “Actually,” he says, “you’ve got it backwards. Most users who don’t log in at least once a week don’t convert.” And he then trots out data to support his conclusion.

Meanwhile, you’ve got some research demonstrating that most users who convert tend to log in at least once a week.

Each of you is certain that your analysis is proven by the data.

So who’s correct? What is the takeaway here? Can we ignore the data and just rely on our gut?

Believe it or not, it’s possible that all three of you are correct.

  • Most users who log onto your site at least once a week don’t convert.
  • Most users who don’t log in at least once a week don’t convert.
  • Most users who convert tend to log in at least once a week.

Some basic set theory concepts can help us to untangle this confusing set of statements. The Venn diagram below, which uses triangles and rectangles instead of circles (perfectly legal!), demonstrates how all three of you can be correct.

[Figure: Venn diagram showing U (all users), V (weekly visitors) and C (converters)]

In this diagram, the area of each shape is proportional to the number of users in that population, and the behavior in question is “visit the site at least once a week.” V is the set of users who visit the site at least once a week (the blue triangle), C is the set of users who convert (the green rectangle), and U is the set of all users (the white rectangle, which includes everyone in C and V).

  • Statement #1 – Most users who log onto your site at least once a week don’t convert. Users who visit the site once a week reside in the blue triangle, and users who don’t convert are those who aren’t in the green rectangle. So are most of the users in the blue triangle not in the green rectangle? Yes. This statement is true.
  • Statement #2 – Most users who don’t log in at least once a week don’t convert. Users who don’t visit the site frequently are those who don’t reside in the blue triangle, and users who don’t convert are those who aren’t in the green rectangle. Are most of the users who don’t reside in the blue triangle also not in the green rectangle? Yes, that’s most of the white rectangle: users who don’t visit frequently and don’t convert. This statement is also true.
  • Statement #3 – Most users who convert tend to log in at least once a week. If we translate this statement into shapes, we get, “most of the users in the green rectangle are also in the blue triangle.” This statement is true as well.
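
For readers who prefer numbers to shapes, the same check can be sketched in a few lines of Python. The counts below are hypothetical: they are not from any real data set and were chosen only to match the rough proportions in the diagram (about 45% of all users visit weekly, and about 90% of converters do). The helper name share is also just an illustration.

```python
# Hypothetical user IDs, chosen only to mirror the diagram's proportions.
all_users = set(range(1000))         # U: 1,000 users in total
weekly_visitors = set(range(450))    # V: IDs 0..449 visit at least once a week
converters = set(range(405, 455))    # C: 50 converters, 45 of them inside V

def share(subset, population):
    """Fraction of `population` that also belongs to `subset`."""
    return len(subset & population) / len(population)

non_converters = all_users - converters
infrequent_visitors = all_users - weekly_visitors

# Statement 1: most weekly visitors don't convert (405 / 450).
s1 = share(non_converters, weekly_visitors)
# Statement 2: most infrequent visitors don't convert (545 / 550).
s2 = share(non_converters, infrequent_visitors)
# Statement 3: most converters are weekly visitors (45 / 50).
s3 = share(weekly_visitors, converters)

for label, value in [("Statement 1", s1), ("Statement 2", s2), ("Statement 3", s3)]:
    print(f"{label}: {value:.1%} -> {'true' if value > 0.5 else 'false'}")
```

With these made-up counts, each of the three shares is above 50%, so all three statements hold at once.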

At first, these statements are confusing when taken together, but the picture is clear, as is the takeaway: High-frequency visitors make up a significantly higher fraction of the conversion segment than they do of the overall population. In shape terms, we see that about 90% of the green rectangle (converters) is blue (high-frequency), while only about 45% of the white rectangle (all users) is blue. High-frequency users are not users that we want to ignore.

This example demonstrates how to untangle a problem specific to the situation many of us find ourselves in, but there are some deeper lessons lurking here:

  • Statistically speaking, events like conversions are rare: Conversion is such a rare event to begin with that absolute-sounding phrases like “most users do X” or “very few users do X” tend to be more confusing than helpful. Relative comparisons are usually more important. Be wary of absolute-sounding phrases, and try to focus on concepts such as the likelihood of a conversion.
  • The size of a population can impact whether an insight is significant: Just because a certain type of consumer is more common in the conversion population than in another population doesn’t mean that there are lots of users of that type in the conversion population, or that this behavior is even common. Rather, it means that there are proportionally more of these users among those who convert than in a reference population. For instance, you may for some reason have proportionally more people over 7 feet tall in the conversion population. However, no one would suggest that trying to find additional consumers taller than 7 feet would make sense for most brands.
  • Pictures can speak louder than words: It’s easy to get tripped up in sentences like “a lot (or a little) of users who do (or don’t do) X do (or don’t do) Y” because there are so many different overlapping sets of users to consider in the span of just a few words. If you are feeling confused, try sketching a Venn diagram.
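
The first lesson, focusing on the likelihood of a conversion rather than on “most users” phrasing, can be sketched as a short calculation. Again, the counts are hypothetical and only mirror the rough proportions in the diagram, and the “lift” ratio is one common way to express a relative comparison, not something prescribed by the example above.

```python
# Hypothetical counts, consistent with the diagram's rough proportions.
total_users = 1000
weekly_visitors = 450     # about 45% of all users
converters = 50
weekly_converters = 45    # about 90% of converters

# Likelihood of converting within each segment.
rate_weekly = weekly_converters / weekly_visitors                # 45 / 450
rate_other = (converters - weekly_converters) / (total_users - weekly_visitors)  # 5 / 550
rate_overall = converters / total_users                          # 50 / 1000

# Relative comparison: how much more likely a weekly visitor is to convert
# than a user picked at random from the whole population.
lift = rate_weekly / rate_overall

print(f"Weekly visitors convert at {rate_weekly:.1%}")
print(f"Other users convert at {rate_other:.1%}")
print(f"All users convert at {rate_overall:.1%}")
print(f"Lift for weekly visitors: {lift:.1f}x")
```

Under these assumptions, conversion is rare in every segment, yet weekly visitors are about twice as likely to convert as the average user. The ratio matches the 90% versus 45% comparison in the diagram.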

It can often feel as though the mountain of data you have collected is playing tricks on you. Therefore, it doesn’t hurt to have a few tricks up your sleeve that help make sense of the complexity, confusion and contradictions the data seems to be offering.