Cloud-Based Consumer Data - Separating Hype from Reality
Cloud computing has fired the imagination of investors and entrepreneurs like few other new technologies of the last decade. Like any new meme it offers real possibility, but from my vantage point the hype-to-content ratio for cloud computing has been teetering dangerously into the red zone of late. If we layer on the extra ambiguities that accompany a new category such as Consumer Data Management and attempt to consider what constitutes Cloud-Based Consumer Data Management, the credibility hazards become especially acute.
New technology waves like cloud computing are fun. They constitute much of the reason so many of us enjoy working in technology. You’re constantly sharpening your saw, learning new frameworks, trying to figure out how to put them to work. But when the claims swirling around them become too breathless, not only do we all look like dummies, we also end up extinguishing whatever tangible possibility lies beneath them. The baby gets thrown out with the bath water, and everyone loses. With a view towards saving the baby and separating hype from reality, in this post I’d like to share a point of view on what Cloud-Based Consumer Data Management means to Krux.
Last year we outlined the main technical desiderata of a Consumer Data Management solution. We posited that consumer data management spanned five core areas:
- Portability: the ability to transport and connect data across sources and uses;
- Extensibility: the ability to move beyond just syntax and to manage consumer data at a semantic level;
- Privacy: ensuring the provenance of data in a manner that passes regulatory, legislative, and common-sense baselines;
- Atomicity: the ability to deliver finer-grained experiences that give individual consumers more of what they want, as opposed to simply tallying monthly UU’s or shipping ad campaigns; and
- Scalability: the ability to manage billions of data points, millions of users, thousands of segments across screens and sources in real-time.
What differentiates cloud-based consumer data management from plain-vanilla consumer data management? It starts with Atomicity and Scalability and moves quickly into new capabilities for integration, real-time data synthesis, and security.
Let’s first consider the aspect everyone intuitively associates with the cloud: Scalability. The advantage of a cloud-based architecture is that you can scale your storage and processing infrastructure linearly (or roughly linearly) with the volume of data you’re handling. In the case of consumer data management, the number of unique users (UU’s) is the most logical measure of data volume. Capturing UU’s, overlaying their associated attributes, and mapping them to unique user ID’s in Krux’s cloud drives storage requirements that grow naturally as we take on new customers. Clever encoding of Krux’s UID’s gives us the ability to pack more UU’s in memory, thereby reducing response times. We’ve been pleased with the flexibility Amazon Web Services (AWS) provides and have built several server management and capacity provisioning tools using Puppet. We use several services provided by AWS including but not limited to EC2 (Elastic Compute Cloud), S3 (Simple Storage Service), RDS (Relational Database Service), EMR (Elastic MapReduce), SQS (Simple Queue Service) and SNS (Simple Notification Service). With these resources we’ve been able to gracefully scale from 0 to well over 200M UU’s over the course of the last 12 months in an extraordinarily cost-effective manner.
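To make the point about compact UID encoding concrete, here is a minimal sketch in Python. It is not Krux’s actual encoding, which this post does not describe. It assumes, purely for illustration, that a UID is a 16-character hexadecimal string, and the helper names `pack_uids` and `unpack_uid` are hypothetical.

```python
from array import array


def pack_uids(hex_uids):
    """Store UIDs as unsigned 64-bit integers in one contiguous array.

    Assumption: each UID is a 16-character hex string (64 bits of identity).
    """
    return array("Q", (int(uid, 16) for uid in hex_uids))


def unpack_uid(value):
    """Turn a packed integer back into its 16-character hex form."""
    return format(value, "016x")


uids = ["00000000000000ff", "0123456789abcdef", "ffffffffffffffff"]
packed = pack_uids(uids)

# Every UID round-trips through the packed representation unchanged.
restored = [unpack_uid(v) for v in packed]
```

Each packed entry occupies `packed.itemsize` bytes (8 on common platforms), whereas a 16-character string object in CPython carries several times that in per-object overhead. The same idea, keeping identifiers in a fixed-width binary form, is one way to fit more users in memory.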
Capturing user data via cloud-based storage is the first and easiest part of scalable consumer data management. The second, much trickier aspect involves real-time processing of that data in support of Atomicity. Specifically, it requires the ability to (1) assign users to audience segments with complex rules and (2) compute audience segments in-session and on-page.
A sample use case for #1 might be to target an in-market cell phone intender with income greater than $150K from the Western seaboard who visits the cell phone subsection of the Electronics section at least 5 times within a week. On his fourth visit, such a consumer may belong to other segments of interest, but not yet to the intended one. When he comes back the fifth time, he’s now in-segment. Assigning that user to the intended segment in real-time without batching or pre-computation is not a storage problem; it’s a sophisticated processing challenge. At Krux we’ve developed a proprietary set of techniques for this. They integrate a real-time data processing pipeline, one that can reliably process high-volume data streams, with a high write-throughput NoSQL data store. We tried several message queuing systems (ActiveMQ, RabbitMQ, etc.) but found that they could not give us the reliability we needed at scale.
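The rule in that use case can be sketched as a sliding-window check that runs on every page view. This is an illustrative Python sketch only, not the proprietary pipeline described above. The class and field names (`SegmentTracker`, `UserProfile`, the `"west"` region label, the section string) are hypothetical, and it assumes events for a given user arrive in timestamp order.

```python
from collections import defaultdict, deque
from dataclasses import dataclass

WEEK_SECONDS = 7 * 24 * 60 * 60


@dataclass
class UserProfile:
    income: int   # annual income in dollars (hypothetical attribute)
    region: str   # e.g. "west" (hypothetical label)


class SegmentTracker:
    """Evaluates 'N visits to a section within a window' on each event."""

    def __init__(self, section, min_visits=5, window=WEEK_SECONDS,
                 min_income=150_000, region="west"):
        self.section = section
        self.min_visits = min_visits
        self.window = window
        self.min_income = min_income
        self.region = region
        self._visits = defaultdict(deque)  # user_id -> visit timestamps

    def record_visit(self, user_id, profile, section, ts):
        """Record a page view; return True if the user is now in-segment."""
        visits = self._visits[user_id]
        if section == self.section:
            visits.append(ts)
        # Discard visits that have aged out of the window.
        while visits and visits[0] <= ts - self.window:
            visits.popleft()
        return (len(visits) >= self.min_visits
                and profile.income > self.min_income
                and profile.region == self.region)


DAY = 24 * 60 * 60
tracker = SegmentTracker("electronics/cell-phones")
shopper = UserProfile(income=200_000, region="west")
results = [tracker.record_visit("u1", shopper, "electronics/cell-phones", i * DAY)
           for i in range(5)]
```

Here `results` is `False` for the first four visits and `True` on the fifth, matching the scenario in the text. The decision is made at write time, per event, with no batch job in between.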
#2 is about sensing and segmenting users in real-time when they come back to your page. Determining whether an individual user is in-segment in a relational database environment requires a join operation that can take hours to complete, particularly when millions of UU’s and hundreds of segments are involved. It became clear to us early on at Krux that to deliver cooler experiences and smarter ads in real-time, we needed to compute segments in 75 ms or less – many orders of magnitude faster than what a relational database permits.
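One common way to keep the read path free of joins is to hold each user’s segment memberships in memory as a bitmask, so an in-session check is a dictionary lookup plus a bit test. The Python sketch below illustrates that general idea under stated assumptions. It is not a description of Krux’s implementation, and `SegmentIndex` and the segment names are hypothetical.

```python
class SegmentIndex:
    """In-memory map of user_id -> bitmask of segment memberships."""

    def __init__(self, segment_names):
        # Assign each segment one bit position.
        self._bit = {name: 1 << i for i, name in enumerate(segment_names)}
        self._mask = {}

    def add(self, user_id, segment):
        self._mask[user_id] = self._mask.get(user_id, 0) | self._bit[segment]

    def remove(self, user_id, segment):
        self._mask[user_id] = self._mask.get(user_id, 0) & ~self._bit[segment]

    def in_segment(self, user_id, segment):
        """Constant-time membership test: one lookup, one bitwise AND."""
        return bool(self._mask.get(user_id, 0) & self._bit[segment])

    def segments_for(self, user_id):
        mask = self._mask.get(user_id, 0)
        return [name for name, bit in self._bit.items() if mask & bit]


index = SegmentIndex(["phone_intender", "auto_intender", "frequent_reader"])
index.add("u1", "phone_intender")
index.add("u1", "frequent_reader")
index.remove("u1", "frequent_reader")
```

After these calls, `index.in_segment("u1", "phone_intender")` is true and `index.segments_for("u1")` contains only `"phone_intender"`. The cost of a lookup does not grow with the number of users or segments, which is the property a tight per-request latency budget depends on.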
A further differentiator for cloud-based consumer data management is the ability not just to scale up, but to scale down. The typical problem with big-iron back-ends is that they require considerable fixed cost investment and are expensive to power, cool, and maintain, irrespective of the data that’s actually flowing through them. The cloud offers a more elastic alternative. If you’re using a service such as AWS instead of running your own iron, there are material opportunities to reduce your monthly bills by tens or even hundreds of thousands of dollars by intelligently provisioning compute resources and dynamically relinquishing them when they’re not needed.
At Krux, we do this by measuring and monitoring everything. Our cloud sensors keep a vigilant eye on server load. Via a combination of Puppet-based techniques and proprietary AWS APIs, we can provision capacity when our load increases and de-provision capacity when it decreases. With this infrastructure humming quietly in the background, we launched Krux Apps two weeks ago without a hitch, dialing up our cloud capacity to absorb additional load as new publishers quickly signed on to our new offering.
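The scale-up and scale-down decision itself can be very small. The sketch below shows one plausible form of it in Python: a target-tracking rule that sizes the fleet so average load lands near a target. It makes no AWS or Puppet calls, and the function name, thresholds, and limits are all assumptions chosen for illustration rather than Krux’s actual settings.

```python
def desired_instances(current, avg_load_pct, target_load_pct=60,
                      min_instances=2, max_instances=100):
    """Return the instance count that would bring average load near target.

    current         -- instances running now
    avg_load_pct    -- observed average load across them, 0-100
    target_load_pct -- load level we would like each instance to run at
    """
    if current <= 0:
        return min_instances
    # Ceiling division in integers: total work divided by per-instance target.
    needed = -(-(current * avg_load_pct) // target_load_pct)
    # Clamp so we never drop below a safety floor or exceed a budget ceiling.
    return max(min_instances, min(max_instances, needed))


scale_up = desired_instances(current=10, avg_load_pct=90)    # load is high
scale_down = desired_instances(current=10, avg_load_pct=30)  # load is low
```

With ten instances at 90% load the rule asks for 15, and at 30% load it asks for 5. The same function handles both directions, which is the point: relinquishing idle capacity is as automatic as adding it.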
The final two defining features of Cloud-Based Consumer Data Management are integration and security. A cloud-based DMP (Data Management Platform) should provide a standard set of APIs that can be used to integrate with any partner. What we see in the market right now – and what ails a number of our clients seeking to work with third-party data providers – is a tangle of bespoke integration methods that fail to consistently address cookie syncing, taxonomy, and data access (server-to-server or client-side double redirect). At Krux we resolved to nail all of these issues down at design time. Anyone who wants to integrate with us (clients, partners, competitors) to get data out of our platform receives the same set of API docs specifying a standard, fire-tested protocol.
In the realm of security, because we’re operating in the cloud, it becomes especially important to maintain well-defined security practices that go beyond username/password, VPNs, and firewalls. A cloud-based DMP requires support for encryption (public / private key) as well as digital-signature-based request verification. We’re layering in these capabilities now at Krux with the conviction that you can run but you can’t hide: privacy-friendly DMP ultimately requires the secure, encrypted transport of personally identifiable information (PII).
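For readers unfamiliar with signed requests, here is a minimal Python sketch of the verification flow. One substitution to note plainly: Python’s standard library has no public-key signing, so this example uses HMAC-SHA256 with a shared secret as a stand-in. A public / private key scheme of the kind described above would replace the two `hmac` calls with an asymmetric sign and verify. The function names, message layout, and five-minute skew limit are assumptions for illustration, not Krux’s protocol.

```python
import hashlib
import hmac


def sign_request(secret: bytes, method: str, path: str,
                 body: bytes, timestamp: int) -> str:
    """Compute a signature over the parts of the request that must not change."""
    message = b"\n".join([
        method.upper().encode(),
        path.encode(),
        str(timestamp).encode(),
        body,
    ])
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify_request(secret: bytes, method: str, path: str, body: bytes,
                   timestamp: int, signature: str, now: int,
                   max_skew: int = 300) -> bool:
    """Reject stale requests, then compare signatures in constant time."""
    if abs(now - timestamp) > max_skew:
        return False  # too old or too far in the future: possible replay
    expected = sign_request(secret, method, path, body, timestamp)
    return hmac.compare_digest(expected, signature)


secret = b"demo-shared-secret"  # hypothetical key, for illustration only
sig = sign_request(secret, "POST", "/segments", b'{"uid": "abc"}', 1_000)
ok = verify_request(secret, "POST", "/segments", b'{"uid": "abc"}',
                    1_000, sig, now=1_060)
```

In this run `ok` is true. Changing the body, the path, or replaying the request after the skew window all cause verification to fail, which is what lets a server trust a request beyond a username and password.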
As DMP (Data Management Platform) contenders struggle to market their way out of caffeinated valuations and sparse references, it’s increasingly important for their customers to separate hype from reality. Next time you’re talking to someone who claims to offer Cloud-Based Consumer Data Management, keep this post handy. Ask your vendor about scaling up vs. down, real-time processing, atomicity, integration, and security. And feel free to give us a call. We’re happy to compare notes and share an informed perspective.