What About Small Data? Part 1

Big data remains all the rage. Since exploding onto the scene around 2012 with the popularization of the Hadoop framework, big data has dominated the “LinkedIn press.” This myopia is certainly not without its merits; machine-generated data contain vast amounts of signal just waiting to be extracted and put to use. Indeed, many machine learning applications require real-time, streaming data for use as features, and, at the very least, hundreds of thousands of training examples.

Big Data vs. Small Data

Big data sets are, first and foremost, really big. Observations range from the 100,000s (at a minimum) to the 100s of billions. Big data sets are typically semi-structured, and while data munging is required, I’ve found that it tends to be pretty straightforward. Likewise, when data are missing, it’s usually not a big deal: assuming the missing data don’t follow a pattern, it’s safe to delete the affected observations. Finally, when it comes to finding more signals, it’s usually a matter of finding another vendor or “thing” generating data, assuming you can key on something in the main data set.

Small data is still out there, though! You can’t just make a small data problem big. Small data isn’t collected in hours; it usually takes, at minimum, weeks to collect it. The number of observations ranges from 1,000s to 100,000s; and the number of “1’s” in the data set can sometimes be really, really small—think 10 or 100. Small data tends to be relational. Missing data is precious, and thus can’t just be ignored. And finally, finding more signals usually takes a lot of creativity.

| | Big Data | Small Data |
| --- | --- | --- |
| Data Collected In | Seconds, Minutes, Hours | Days, Weeks, Months |
| Number of Observations | 100,000s – 100s of Billions | 10s – 100,000s |
| Typical Structure | Semi-Structured | Relational |
| Data Munging Effort | Moderate | Hard |
| Missing Data | Ignore or Interpolate | Not so fast… |
| Finding More Signals | Find Another Vendor or “Thing” | Get creative |
| Topical Areas | B2C, Digital | B2B, Sales, Events, Above-the-Line |
| Limiting Factors | Processing Power, Storage | Time, Creativity |

Many of the best problems out there today, the ones that will yield the most incremental fruit in terms of leads, opportunities, loyal customers, and dollars, deal with small data. At MarketBridge, we’re experts on small data, and I want to share some best practices for working in this messier, but potentially more lucrative, realm. In this blog post, I’ll go through the first best practice: providing insight along the recall-precision gradient.

Part 1: Provide Insight Along the Recall-Precision Gradient


For our example, we’ll look at a wins-and-losses dataset for a large, considered-purchase solution designed for small and medium businesses. The training data consists of 10,000 accounts touched by inside sales over a period of six months, yielding 35 wins. From a feature perspective, assume we have data on marketing stimulus, firmographics, and the competitive situation in each account.
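To make the shape of the problem concrete, here’s a purely hypothetical sketch of what such a training set might look like in pandas; the column names and distributions are made up for illustration, not drawn from actual client data. The point is the extreme class imbalance: roughly 35 positive outcomes in 10,000 rows.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_accounts = 10_000

# Hypothetical feature columns standing in for marketing stimulus,
# firmographics, and competitive situation.
train = pd.DataFrame({
    "emails_sent": rng.poisson(3, n_accounts),                  # marketing stimulus
    "webinar_attended": rng.integers(0, 2, n_accounts),
    "employee_count": rng.lognormal(4, 1, n_accounts).round(),  # firmographics
    "incumbent_competitor": rng.integers(0, 2, n_accounts),     # competitive situation
})

# Target: roughly 35 wins out of 10,000 touched accounts.
train["won"] = 0
train.loc[rng.choice(n_accounts, size=35, replace=False), "won"] = 1
```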

As a data scientist, I’m tasked with providing a scored list of accounts, based on the training data, for a new set of 10,000 target companies. Sales reps will use the list to prioritize their activities. I develop my awesome model and give Joe, a sales director, 1,000 likely accounts to call. I tuned the model for maximum recall; that is, I want to pick up every potential buyer because I don’t want to miss any revenue. Joe asks me how I built the model, and I tell him I extracted a bunch of features predicting likelihood to buy, and there was a bunch of code and statistics, and his eyes glaze over and he says, “just give me the list.”

A week later, I go by Joe’s desk and ask him how the list did. He is furious. He’s called 50 people on the list, and not one was interested. I tell him that the list was optimized for recall and to make sure he didn’t miss a single likely buyer. He walks away in a huff. How do we get around this problem?

Well, obviously, we retune our model to optimize for precision, right? I want to minimize the number of false positives in my model, so Joe doesn’t call a bunch of people who have no interest in what he’s selling. So, in this case, I give him a list of 35 predicted positives. The problem with this model is that he’s missing a lot of people he should be calling, and he tells me that he wants the “not so perfect” leads too, because he can actually adjust his activities to win in deals that might not have been perfect fits. I call this the “adaptive component” of high-touch marketing; it often goes hand-in-hand with small data analytics, and I’ll get back to this in a future post. And anyway, he’ll be done making calls in four days, but then what should he do?

In looking at this problem, we can learn a lot from the trade-offs that epidemiologists make when building tests for diseases. Of course, the optimal goal for any model is perfect recall and perfect precision. In other words, all positive cases are predicted correctly (recall), and no false positives are generated (precision). In the real world, this simply doesn’t happen; we are constantly trading off between capturing all of the positives and avoiding false positives, that is, between being aggressive and being conservative in our predictions.

An epidemiologist might tune a model to predict the presence of an extremely contagious disease, where the consequences of a false negative are grave, for maximum recall (true positives / (true positives + false negatives)). Conversely, she might tune a model to predict the presence of a disease that isn’t at all contagious, and extremely rare, to maximize precision (true positives / (true positives + false positives)) and avoid scaring a lot of people who are actually totally healthy.
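In code, the two metrics are just ratios of confusion-matrix counts; a minimal sketch:

```python
def recall(tp: int, fn: int) -> float:
    # Share of actual positives the model catches: TP / (TP + FN).
    return tp / (tp + fn)

def precision(tp: int, fp: int) -> float:
    # Share of predicted positives that are actually positive: TP / (TP + FP).
    return tp / (tp + fp)
```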

Multiple Models to Activate Human Intelligence

So what’s the key? Luckily, I don’t have to give Joe just one list. Instead, I should give him three, along the recall-precision gradient:

  • A “primo” list, maximizing precision, that is, “very few false positives”;
  • A “likely suspect” list, trading off precision and recall, perhaps maximizing F1 score;
  • And a “wide net” list, maximizing recall, that is, “get all of the likely buyers onto a list.”

Technically (using Python in this example), the lists could be generated by running a grid search (via `GridSearchCV` from Scikit-Learn, for example), maximizing recall (wide net), F1 (likely suspect), and precision (primo). Of course, this is just general guidance; there’s nothing magical about the number three, or about maximizing these particular metrics. The point is to give practitioners choices along the recall-precision gradient and teach them how to use this newfound intelligence.
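To make that concrete, here’s a minimal sketch of the three-searches approach; the estimator (a plain logistic regression), the parameter grid, and the synthetic, heavily imbalanced data are illustrative assumptions, not a production setup:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data: roughly 40 positives in 10,000 rows.
X, y = make_classification(n_samples=10_000, n_features=20,
                           weights=[0.996], flip_y=0, random_state=0)

param_grid = {"C": [0.01, 0.1, 1.0, 10.0], "class_weight": [None, "balanced"]}

# One tuned model per list, each optimized for a different metric.
tuned = {}
for list_name, metric in [("primo", "precision"),
                          ("likely_suspect", "f1"),
                          ("wide_net", "recall")]:
    search = GridSearchCV(LogisticRegression(max_iter=1000),
                          param_grid, scoring=metric, cv=5)
    search.fit(X, y)
    tuned[list_name] = search.best_estimator_

# Each estimator in `tuned` can then score the new 10,000 target accounts.
```

An equivalent, simpler route is to fit one probabilistic model and cut the scored list at three different thresholds; the grid-search version just makes the three objectives explicit.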

1) Primo Model

| | Predicted Negative | Predicted Positive | Total (Actual) |
| --- | --- | --- | --- |
| Actual Negative | 9,950 | 10 | 9,960 |
| Actual Positive | 15 | 25 | 40 |
| Total (Predicted) | 9,965 | 35 | 10,000 |

2) Likely Suspect Model

| | Predicted Negative | Predicted Positive | Total (Actual) |
| --- | --- | --- | --- |
| Actual Negative | 9,860 | 100 | 9,960 |
| Actual Positive | 1 | 39 | 40 |
| Total (Predicted) | 9,861 | 139 | 10,000 |

3) Wide Net Model

| | Predicted Negative | Predicted Positive | Total (Actual) |
| --- | --- | --- | --- |
| Actual Negative | 9,000 | 960 | 9,960 |
| Actual Positive | 0 | 40 | 40 |
| Total (Predicted) | 9,000 | 1,000 | 10,000 |
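Plugging the counts from these three confusion matrices into the formulas above makes the gradient concrete: the primo list is the most trustworthy call by call, while the wide net gives up precision so that no likely buyer slips through. A quick sketch:

```python
# (true positives, false positives, false negatives) read off the tables above.
models = {
    "primo":          (25,  10, 15),
    "likely_suspect": (39, 100,  1),
    "wide_net":       (40, 960,  0),
}

for name, (tp, fp, fn) in models.items():
    precision = tp / (tp + fp)   # TP / (TP + FP)
    recall = tp / (tp + fn)      # TP / (TP + FN)
    print(f"{name:15s} precision={precision:.2f}  recall={recall:.2f}")
```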

Now, Joe has three lists that he can use to tune his calling strategy. He can work the primo list of 35 first, perhaps putting maximum effort into tuning what he says and the content he provides. He can then go on to the likely suspect list of 139, realizing that there might be something he needs to tweak for these. Maybe the reps making the calls behind the training data weren’t quite handling these prospects correctly, so Joe uses some human intelligence to boost his performance. And finally, he might send the 1,000 wide-net prospects “keep warm” emails to nurture them and bring them along.

This is obviously an overly simple example, but this heuristic has worked very well for MarketBridge’s clients handling small data.

One final note. Upon reading this, a friend asked me a good question: “Why not just provide the probability of a win? Why the three lists?” My answer: there’s nothing wrong with providing the win probability too, but I’ve found that explaining things via three lists, drawn from three different points along the recall-precision gradient, works better. In simple terms, it helps activate the “human intelligence” component of the small data world and drives better adoption and usage.