S4 · Episode 6
Data Masters Podcast
Released November 26, 2024
Runtime: 34m 19s

How Data Streaming Transforms Data Flow and Analytics with Will LaForest of Confluent

Will LaForest
Global Field CTO at Confluent

On this episode, we are joined by Will LaForest, Global Field CTO of Confluent, to discuss the transformative impact of data streaming on businesses and platforms. He explains the shift from traditional monolithic big data models to data streaming systems: instead of bringing questions to stored data, you continuously bring new data to the same questions, enabling hyper-personalized customer experiences and operational efficiency.


[00:00:00] Will: When you're trying to do anything for a customer, and let's take generative AI, because data streaming is critically important in generative AI, you have to use the right tools. They work with data differently. And so, for the hyper-personalization, different tasks require different approaches, and how do I augment it when I ask the large language model the question, because I have to deliver...

[00:00:49] Anthony: Welcome to Data Masters. Today, our guest is Will LaForest, the Global Field CTO at Confluent. For those of you who are unfamiliar with Confluent, they are at the forefront of a new era of data infrastructure, focused on data in motion. Their cloud native platform serves as the intelligent connective tissue, enabling real time data streaming across organizations, helping businesses thrive by delivering rich digital experiences and real time operations.

[00:01:22] Anthony: Will, specifically, has a background that's as impressive as it is diverse. He has over two decades of experience in the data and technology space, and he's worked at organizations such as SPSS, MarkLogic, MongoDB, and Red Hat, and he's worked closely with both commercial and public sector organizations.

[00:01:43] Anthony: Now at Confluent, he's helping organizations unlock the potential of real time data, guiding them through complex digital transformations. In today's conversation, we're going to explore Will's journey through the data world from traditional relational databases to NoSQL and finally to the streaming data platform at Confluent.

[00:02:01] Anthony: And we'll dig into the nuances of data streaming, some use cases, and how Confluent is redefining how businesses think about data products. So, welcome to Data Masters, Will.

[00:02:15] Will: Glad to be here.

[00:02:17] Anthony: You've worked in the data space for a long time, and you've worked with many different types of databases, as alluded to in the introduction: traditional relational databases, NoSQL stores, now streaming. I always used to joke, a long time ago, that the last thing the world needs is yet another database. And I was a hundred percent, and clearly, wrong on this, repeatedly, because clearly the world does need other and new kinds of databases.

[00:02:45] Anthony: But you've had this unique experience of working across many different types and use cases. So maybe share a little bit of the journey, both from a personal perspective, but also maybe from a technical perspective.

[00:02:45] Will: Yeah. Well, so, I guess my technical career really started when I was a teenager and I got this internship at DARPA, and it was pretty cool, right? Because I got to work with, at the time, these Sun workstations and NeXT, if you remember NeXT computers,

[00:03:15] Will: which I thought NeXTSTEP was such a beautiful operating system. That was the beginning of my journey working with data; that's when I fell in love with it.

[00:03:23] Will: And the next step, dealing with data at large scale: I would say that was when I worked at an ad tech company. There I was using Sybase, which, as you know, has a lineage from Stonebraker. And so there I was doing what I would say was, at the time, very large scale data, because of course it was ad tech. Lots and lots of ads are being served. You need to analyze. You've got to figure out how to schedule ads, et cetera. Then I spent a little bit more time on the analytical side with SPSS, thinking about how you visualize data and analyze it. And they coined the term predictive analytics at the time, I would say. But one of the things I did struggle with, which is interesting, is relational databases, to your point that the world doesn't need another database. Relational databases were great for some things, but other things they were just woefully bad for. And that's what sort of led me down the NoSQL database route, first at MarkLogic and then at MongoDB. And I think the interesting thing about MongoDB is, many people don't realize this, but it was born out of ad tech too, in the sense that the founders of MongoDB, which was originally called 10gen, came from DoubleClick. And they were struggling with trying to use relational databases for the scale and the changing schemas. And so they created the database that they always wanted, right, and that was MongoDB. So, yeah, and then, after a spot at Red Hat, as you said, I landed at Confluent. And the thing that I love about this is that it's a very fresh field, in the sense that, what, databases have been around since IMS, about 60 years ago? The approach to data for the last six decades varies, whether you're traversing a network versus Codd's relational algebra, et cetera. But the principle is: let's stuff data someplace, and let's ask it questions. And it's been the same thing for 60 years. It's gotten better. You can ask different kinds of questions.

[00:05:19] Will: You can index it different ways. You can scale. Data models are more flexible. But it's the same pattern. And what excites me about data streaming is it's turned that on its head, right? Instead of bringing questions to a set of state, you're doing the opposite: you're bringing this constant stream of data to the questions, if you will. And it's really changed how you think about data and how you work with it, right? Sometimes for the better, sometimes for the worse, because it's a new paradigm, but fun.
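To make that inversion concrete, here's a tiny illustrative Python sketch (an editorial aid, not from the conversation): the classic pattern brings a question to stored state on demand, while the streaming pattern keeps a standing question and re-answers it as each new event arrives.

```python
# Classic pattern: stuff data someplace, then bring the question to it.
orders = [120, 80, 45]              # state accumulated in a store
def total_revenue(store):
    return sum(store)               # question asked once, on demand

# Streaming pattern: the question stands still; data flows through it.
def running_revenue(events):
    total = 0
    for amount in events:           # each new event updates the answer
        total += amount
        yield total                 # the result is continuously current

print(total_revenue(orders))                    # 245, once
for answer in running_revenue([120, 80, 45]):
    print(answer)                               # 120, 200, 245, continuously
```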

[00:05:50] Anthony: Yeah, so let's pull on that thread for a second. You mentioned Mike Stonebraker, who was the academic founder at Tamr, but who has also, as you pointed out, invented many different databases, and who, again, runs very much counter to my perspective that the world doesn't need another database. I think Mike would agree with you. And, you know, Mike built a company called StreamBase, which was a streaming database company, and this is probably 10 years ago, maybe.

[00:06:17] Anthony: Confluent is not the first streaming database system. But before we get into the details of Confluent and Kafka and how we think about that, maybe just talk a little bit about data streaming. And, to your point about bringing the data to the questions, how do I think about streaming use cases in general?

[00:06:39] Anthony: And is this really something new or is this just a new spin on something that's existed?

[00:06:45] Will: Well, so I think handling data in real time, processing it, I would say that's been around for a long time, right? Clearly, it's: we have some data, how do we act on it really quickly? And I think that's what Michael was trying to do with StreamBase, figure out how to act on data.

[00:07:02] Will: So you think about complex event processing, CEP. Actually, if you talk to Gualtieri, who is the Forrester analyst for this new data streaming platform segment of data infrastructure, he will tell you that, if you wind back, it wasn't originally called that.

[00:07:22] Will: Originally it was called complex event processing, then that kind of fell away, then it went to streaming analytics, and then it morphed into data streaming platform, because basically what happened is the aperture for how you handle data in real time just grew beyond this niche, let's-do-this-in-financial-services complex event processing. But that's kind of a long-winded answer for me to say that I think there's a whole set of data technologies that exist today that sit on a continuum of ways you handle it. Things like stream processing, real-time databases, real-time analytical databases. And they're really about: if I have a stream of data, how do I analyze it, or produce some state, or act on it really quickly? But what makes data streaming a little bit different is that the data streaming itself, which really had its genesis with Apache Kafka, which was created at LinkedIn, the point wasn't so much, how do I have the stream and act on it as soon as possible. The point was, how do I decouple some producer of data and allow these streams of data to flow to the consumers that need it in real time? And so there's this thing that people talk about with data streaming: its beauty is in its simplicity. It's about dumb pipes. It is literally this append-only commit log, and it's distributed and partitioned, and it scales massively. But the point is delivering the data constantly, in real time. And I would say things like stream processing, Flink is an example of that, which we can talk about later if you'd like to, but also some of these really awesome, powerful, real-time analytical things like Apache Druid, ClickHouse, Pinot, et cetera, they need to get the data from someplace, and that someplace is data streaming. And it's not always real-time, fast systems.

[00:09:24] Will: These streams of data could be going to Iceberg. They could be going to your analytical stores. But that's really the power of data streaming: it solved the problem of how you decouple data producers from consumers, do it at scale, and do it really fast, right?
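For readers who want to see what those dumb pipes look like in practice, here's a minimal sketch using the confluent-kafka Python client. The broker address, topic, and group id are illustrative, not from the conversation; the point is that the producer appends to a log and never knows who reads it.

```python
import json
from confluent_kafka import Producer, Consumer

# Producer side: append an event to a partitioned, append-only log.
producer = Producer({"bootstrap.servers": "localhost:9092"})
producer.produce(
    "orders",                                   # hypothetical topic name
    key="order-1001",                           # keys route events to partitions
    value=json.dumps({"item": "widget", "qty": 3}),
)
producer.flush()

# Consumer side: any number of independent consumer groups can read the
# same stream, each at its own pace, with no coordination with the producer.
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "fraud-detection",              # a second group, e.g. "analytics",
    "auto.offset.reset": "earliest",            # would see the same events
})
consumer.subscribe(["orders"])
msg = consumer.poll(10.0)
if msg is not None and msg.error() is None:
    print(msg.key(), msg.value())
consumer.close()
```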

[00:09:39] Anthony: So let me try to play that back to you, to make sure I'm capturing it. The differentiation you're drawing is between moving the data around, dumb pipes, as you call it, doing so in a very reliable, very scalable, and, to your point, simple way, versus, or in contrast to, the need to take specific action on it, aggregate it in some way analytically, maybe, to be a little marketing for a second, make decisions on it. Is that distinction fair?

[00:10:13] Will: Yeah, absolutely. Clearly, you need to do those latter things.

[00:10:16] Will: Otherwise, there's no point in delivering data. But if you can't deliver data fast and reliably and continuously, then you can't do things in real time.

[00:10:28] Anthony: Got it. And, maybe at the risk of taking us off track here a little bit, you make this point about data products, and I feel like this is a term which has gained some traction recently, but it's also one of these terms which means a lot of different things to a lot of different people.

[00:10:44] Anthony: And so, if you don't mind, maybe share your definition of data products, and then maybe bring these two ideas together. We have this idea of moving data around reliably at scale, and then data products; how do I think about those two together?

[00:11:02] Will: Yeah, so, you know, data products, that term has kind of been around for a while. I think part of the reason it's gained popularity and become, I would say, more concrete lately is just because it was part of this data mesh pattern that Zhamak popularized, with its principle of data as a product. But I think the thing that's interesting is, if I look at the organizations that were most successful with data streaming, that had the biggest impact, a lot of the principles that you see in data products, they were already doing, right? They didn't necessarily call it that, but the Netflixes of the world, they published lots of articles, and the Ubers and the LinkedIns, et cetera. But just to answer your question more directly: a data product is a curated data asset that's been intentionally created to make it easy for data consumers to find and consume the data, right? And they really should be created by the domain, by the people in the domain who best understand the data and control that data source. And usually that's a database supporting some application, right?

[00:12:07] Will: It might be orders, it might be inventory, whatever, but there is some source database. But one of the key things is that the data in that source database should never be directly accessed, and the structure should never just be replicated blindly into the data product. Because what you're doing is coupling potentially 10, 20, 30, 40 downstream systems to a source system. And so what happens if you change the schema? You break everything. So, you know, this notion of data products is really a decoupling thing, right? And I'd say the reason data streaming has been adopted for so many of these data product strategies is because, fundamentally, as I said earlier, it was created to decouple.

[00:12:55] Will: That's one of the core principles: decouple producers from consumers and allow data to fan out. And I think the second thing is, you will oftentimes see people talking about data products from an analytical perspective, because clearly you want good, clean, well-governed data when you're doing your analytics. But the nice thing about data streaming is, if you are producing these data products and they're represented as a stream of data inside of Confluent or Kafka, then you can support operational, real-time use cases, because they can find and tap into it and act on it, right?

[00:13:29] Will: They may put it in an operational store, but you can also support analytics. If, on the other hand, you're only creating these data products within some sort of analytical tool, the best you're probably going to be able to do is support some subset of analytical use cases, because it's just too slow.

[00:13:46] Will: You have these batch pipelines that are delivering data. Then you have this bronze, silver, gold thing. And by the time you're done with it, it's just too slow for fraud detection, or customer experience, or IoT, acting on IoT data. So I think that's why data streaming works so well with this notion of data products.
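As a concrete, hypothetical illustration of that decoupling, here's a sketch of how an orders domain might publish a curated data product instead of exposing its internal table. The column names, field names, and topic are invented for the example; the idea is that downstream consumers only ever see the published, versioned contract, so the private schema can change without breaking them.

```python
from datetime import datetime, timezone

def to_order_event(row: dict) -> dict:
    """Map an internal DB row (private schema) to the public contract."""
    return {
        "event_type": "order_placed",
        "schema_version": 2,                       # the contract is versioned
        "order_id": str(row["ord_id"]),            # internal names stay private
        "customer_id": str(row["cust_fk"]),
        "total_cents": int(round(row["amt"] * 100)),  # normalized units
        "placed_at": datetime.now(timezone.utc).isoformat(),
    }

event = to_order_event({"ord_id": 42, "cust_fk": 7, "amt": 19.99})
# Publish to a stream, e.g. produce("orders.v2", value=json.dumps(event)),
# as in the earlier producer sketch; renaming "amt" internally breaks nobody.
```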

[00:14:43] Anthony: I think that's a really important point; actually, you're making two very important points there. So if I can draw them out: the first is this idea of decoupling, that a good data architecture should create some level of decoupling between the sources and systems and the uses and consumers of data. That's a really good pattern for people to be thinking about when they're building really anything having to do with data. Which brings me to the second important point you're making here, which is separating analytical use cases from operational, and making the point that if you only focus on delivering through a batch output to an analytical use case, it's not that that's not valuable, but it's not enough.

[00:15:25] Anthony: That's just a start, and delivering to these operational use cases is equally important to the analytical ones. And so maybe we could almost think of this, going back to my consulting days, as a two-by-two matrix of decoupled systems with analytical and operational, but that's probably a bridge too far. Anyway, is that a good summary?

[00:15:46] Will: Yeah, great summary. 

[00:15:48] Anthony: And what's also helpful there is creating this conceptual linkage between the idea of a data product and data streaming as you've articulated it here. I think this should be very helpful in giving people context for that. Now, I fear that so far in this conversation we have spoken very theoretically about things which are very important; it's important to create the context,

[00:16:11] Anthony: the theoretical underpinning of anything new that you're learning about. But you are a Field CTO, and presumably that doesn't mean you have a side business being a farmer; the field in this case means you're out with customers. So maybe share, if you don't mind, some real-world examples of what Confluent is doing, and maybe that'll help

[00:16:31] Anthony: bring this to life for folks, so they understand these patterns in the context of real-world use cases.

[00:16:37] Will: Yeah, you're right, we're geeking out from a data nerd perspective, talking about Codd and Stonebraker and data products, et cetera. So, yeah, to make it more realistic: you're absolutely right, that's basically what I do for my job. I consider it one of the best jobs in the industry, because it doesn't get much better than talking to really smart people across the globe in every different industry, right? It's just awesome. But use-case wise, I would say, first of all, it's heavily adopted in financial services, for lots of reasons. Time is money, right? The faster you can act on data, the more valuable it is.

[00:17:12] Will: I think that's a worthwhile principle in any industry. So, things like payment processing: someone makes a payment, that's a record, how do you act on that piece of data? Things like trading, market data, fraud detection. Actually, one huge use case, I would say, is security and cybersecurity, because that's all about these constant streams of observability data: how do I find threats? How do I do it in a cost-effective way? And how do I do it as fast as possible? That is a big use case really across all industries, and financial services is a big one. Then you look at things that might be a little bit closer to humans as consumers.

[00:17:54] Will: So these are things like customer experience when you're working with stores and retailers, et cetera, because there, what they want is good insight into your behavior and what you like and what your preferences are. And having an experience where the data is old, or the inventory is out of date, is extremely frustrating. So real-time data in retail is used for what I would call real-time customer 360. It's used for real-time inventory, because they're all omnichannel now, so you need to know what inventory is in the store when you're ordering online to pick up in store, because if not, you're going to arrive and it's not going to be there.

[00:18:37] Will: That's a terrible experience. I would say IoT, how data is processed from the devices that we all have, is another big use case. Big in automotive, big in manufacturing. I have this telemetry coming from the car. I think a Tesla has, I don't know, over 60 sensors; these cars are basically big computers and software. Mercedes is a great example. They spoke at what's essentially our users' conference, well, it's not a users' conference, it's a data streaming conference, it's really an industry-wide thing. And they were talking about how critical it was for creating this luxury customer experience, because they want to hyper-personalize what happens when you sit in the car. What channels do you listen to? Where should your seat be? What are your driving preferences? That's all based on these streams of data that are being collected. And of course you opt in and all that good stuff. So I think there's just a ton of use cases.

[00:19:36] Will: I think one last one, and I'm probably bloviating here, but if anyone really needs to understand data streaming, the use case that almost everyone gets, that drives it home, is ride sharing, take Uber. And the idea is this: I would submit this cannot work without data streaming. It would be really difficult to achieve, right? You would basically recreate data streaming is what you would do. Because what happens is, if I'm a driver, I have this app, and it's constantly sending events: I'm Will, I'm at this point, I have a rider, I don't have a rider, I'm in this car. Every driver's doing that. And then all the riders: I have an app, I'm looking for a ride, I'm at this location. And it only works if you can essentially match up drivers to riders with these real-time streams of data, because they're constantly being produced. But the decoupling factor is key here, because not only can you use these streams to match riders with drivers, that same data is being used to figure out surge pricing.

[00:20:38] Will: Oh, I have a lot of drivers in this area, and most of them have passengers. Or look at historical trends; or machine learning is another huge use case. So ride sharing is a really good use case to explain data streaming.
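Here's a hypothetical Python sketch of that fan-out: every driver app emits location events to one stream, and matching, surge pricing, and ML each read the same stream through their own consumer group, none aware of the others. Topic and group names are invented for the example.

```python
import json
from confluent_kafka import Consumer

def make_consumer(group_id: str) -> Consumer:
    c = Consumer({
        "bootstrap.servers": "localhost:9092",
        "group.id": group_id,       # each group tracks its own read position
        "auto.offset.reset": "latest",
    })
    c.subscribe(["driver-locations"])
    return c

# The same stream of events feeds three very different jobs.
matching = make_consumer("rider-matching")      # pair nearby drivers and riders
surge = make_consumer("surge-pricing")          # count free drivers per zone
training = make_consumer("ml-feature-logger")   # archive events for models

msg = surge.poll(5.0)
if msg is not None and msg.error() is None:
    # e.g. {"driver": "will", "lat": 38.9, "lon": -77.0, "has_rider": false}
    event = json.loads(msg.value())
    print(event)
```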

[00:20:51] Anthony: Right. So the thing listeners should be thinking about is what piece of their business feels like ride sharing, and in doing so they'd likely identify use cases where data streaming could or would be valuable. You know, you started out by talking about speed as a primary driver, but as I listened to you talk, it feels like speed is actually not the most important piece of this.

[00:21:20] Anthony: Obviously faster is better, so we can agree more speed is better than less. But it strikes me that in all of these cases, the two things that rise to the top are actually more about reliability and, I'm struggling for a word here, but you used the term hyper-personalized, so I'll just steal it.

[00:21:42] Anthony: But it's this idea that any particular node in the network has exactly the information it needs, and it has this kind of guarantee about knowing the best version of that data that the system can produce at that moment. And I'm thinking here really about failure points. Taking your rideshare example: you lose network connectivity going through a tunnel on your way to the airport. You're still producing a lot of events, but the network doesn't know about them. You emerge from the tunnel, and presumably the network gets updated.

[00:22:14] Anthony: That's by definition slow, in the sense that I was in a tunnel for maybe minutes, depending on the traffic. And yet the system wants to have the best information it can have. What's the most reliable set of information to give me a better experience? So, I don't know, I'm trying to draw a bit of a distinction here against this over-indexing.

[00:22:37] Anthony: I think there's a tendency, especially for technologists, to over-index on speed. Like, everything's about speed. And, well, actually in this case, it's about being really reliable and really personal.

[00:22:47] Will: No, I think you're absolutely right. Because if you think about one of the things that distinguishes this new data streaming movement versus the traditional approaches: listen, people have been working with data fast for a long time. We just talked about it. We talked about complex event processing. There's been queuing, then in-memory queuing; there are all these ways to work with data fast, and they've been around a long time. I think you're absolutely right that it's the combination of things that's made it so powerful. It is incredibly reliable, and that temporal aspect you talked about, the decoupling, is also another reason why data streaming has taken off. Because part of the decoupling is this ability to broker the data at scale so that people can get the data at the speed they want it. And sometimes, yeah, for some use cases you ideally want to do it as quickly as possible, and if you don't, it's a problem. But in other use cases it's not; it's more about just doing it as efficiently as you can. And sometimes you may fall behind, but that's okay. Right? That's okay.

[00:23:58] Will: Because in data streaming, this data is persisted for some period of time, and you work through it sequentially at the rate you can. So you're absolutely right. And to your point about the reliability, it's not an exaggeration to say: let's just imagine that one day you wake up and Kafka, which is the basis of the data streaming movement, is broken. The entire globe would break. It would. Because it is used across all the G-SIBs, these are basically the large banks, too big to fail, right? And part of the reason is that it's just so incredibly reliable. So yeah, it's a great point. But I still do want to say that, while that reliability is incredibly important if you're a business, a lot of times the opportunity to innovate is: how can I create better experiences? How can I do smarter things by acting on data faster? So there's sort of two axes, maybe, is the way to think about it.
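That fall-behind-and-catch-up behavior is worth seeing concretely. Here's a sketch, again with invented broker, topic, and group names: because the broker persists the log and tracks each group's offset, a consumer that was offline, like the phone in Anthony's tunnel example, simply resumes from where it left off and works through the backlog in order.

```python
import json
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "trip-state",
    "enable.auto.commit": True,       # progress (offsets) is saved per group
    "auto.offset.reset": "earliest",  # with no saved offset, start at the log's head
})
consumer.subscribe(["driver-locations"])

# Even after minutes offline, nothing is lost: the broker kept the events,
# and this loop reads them in order at whatever rate it can sustain.
while True:
    msg = consumer.poll(1.0)
    if msg is None:
        break                         # caught up with the live stream
    if msg.error() is None:
        event = json.loads(msg.value())
        print(event)
```

The consumer may be minutes behind, but it still sees events in the order they were appended to each partition, so state can be rebuilt consistently.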

[00:24:56] Anthony: Right. Again, that isn't to suggest that speed is not important, to your point. Maybe the way to think about this is less about which one's more important. You could imagine scenarios, or systems even, which act very quickly but quite unreliably. They may not have the most up-to-date information, but they're willing to make decisions very quickly, based on guessing, or based on a model or things like that, which have the benefit of being very quick but may be out of date.

[00:25:29] Anthony: And to your point, going back to this idea of hyper-personalization, I think what's exciting about what Confluent is working on here is this idea that my experience is different from somebody else's, because the streams that I'm collecting, or the pieces I'm collecting off of the stream, are very different.

[00:25:46] Anthony: I'm being a little too theoretical here. Imagine for the moment that you tried to create a rideshare application which worked entirely off of, like, a local model, and everyone had the same model. Then everyone would have the same experience, and that experience could be very fast.

[00:26:03] Anthony: But it wouldn't feel as though I have this unique experience that's relevant to me.

[00:26:09] Will: I think where you're going with that is, you're trying to frame it from a customer, consumer perspective, like an everyday person. And that is true; it does enable that. But in the example you talked about, imagine that you did have what I would call a monolithic approach to it, where you have some big data store and everyone's acting on that. The big trouble there is that it makes it very slow to build anything, because everything is tightly coupled. So even if you could do it, which I would argue would be very difficult with ride sharing, but take something else similar, the issue is you make one change,

[00:26:47] Will: and you've got to coordinate that with 30 downstream consumers, because they're all using the same database. This is the problem with monolithic architectures in general. That's why decoupling is good. The key thing here is, you know how you said the world doesn't need another database, and yet we keep on getting new databases? When you're trying to do anything for a customer, and let's take generative AI, because data streaming is critically important in generative AI, you have to use the right tools. They work with data differently. And so, for hyper-personalization, different tasks require different approaches. And so in the case of generative AI,

[00:27:23] Will: I.,there's these essentially these rag architectures, but basically it's the way is how do I take context about my customer, about my product, whatever it is, about my business, that these general purpose large language models don't have, and how do I augment it when I ask the large language model the question, because I have to deliver And so you need specialized tools, in this case, like vector databases and stuff. So again, the reason why that's doable is because it's decoupled. That customer 360 that you're using to, for like your online app. So someone can see my driving behavior or the things I've ordered or whatever, is the same data that's being delivered immediately to some other technology that's supporting, you Or another system that's supporting supply chain optimization, right?

[00:28:14] Will: It's very different; they're very different technologies. And that's the power.
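To ground that, here's a deliberately tool-agnostic Python sketch of the RAG pattern Will describes. `vector_store` and `llm` are hypothetical stand-ins, not real library APIs; the point is that the retrieval step only has fresh customer context because the same streams feeding the online app also update the vector store.

```python
def answer_with_context(vector_store, llm, customer_id: str, question: str) -> str:
    # 1. Retrieve context the general-purpose model doesn't have:
    #    recent orders, preferences, support history, and so on.
    docs = vector_store.search(          # hypothetical API
        query=question,
        filter={"customer": customer_id},
        k=5,
    )

    # 2. Augment the prompt with that context before asking the question.
    context = "\n".join(d.text for d in docs)
    prompt = (
        f"Context about this customer:\n{context}\n\n"
        f"Question: {question}\n"
        f"Answer using only the context above."
    )

    # 3. Ask the model. The answer is personalized because the context is
    #    current, and the context is current because it arrives as a stream.
    return llm.complete(prompt)          # hypothetical API
```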

[00:28:19] Anthony: Right. And so your point is that, as a developer, the freedom this gives you is to develop these use cases, to innovate, to drive forward, without having to worry that each one of these new use cases and applications is another load on this big central database. Is that a fair way of saying it?

[00:28:37] Will: Yes, and that you can use the most effective tool for whatever your use case is.

[00:28:43] Anthony: Right, you're not tied to the traditional tooling. In your example with RAG, you could be using a large language model, which is obviously the newest, shiniest of these tools. But yes, a hundred percent.

[00:28:57] Will: Yeah, yeah, exactly.

[00:28:58] Anthony: So, I want to thank you for joining us on Data Masters. I can say confidently that I have a better picture and understanding of streaming data, and of the innovation that Confluent is driving in the market. And I certainly myself have a better feel for streaming data,

[00:29:15] Anthony: where we might use these technologies, and the sort of modern architecture they allow for. And I hope listeners have the same. So, Will, thank you.

[00:29:25] Will: You bet, this was fun. I wish we had more time; I could just pontificate for hours about these things. Maybe one last thing I'll add: I think the thing that's also changed is that in the early days of data streaming, it was only the godlike technology companies of the world that could use these approaches. I think the thing that's made it spread is that companies like Confluent are focused on making it as easy as possible to put these pieces together, right? You don't have to run this powerful data streaming architecture yourself. We provide the processing aspect, which we really didn't get into, but that's things like Flink: how do I act on the data as quickly as possible? And of course, connecting to the other tools that you want, the vector databases that we talked about, or search indexes, or whatever. So I think that's really key. It's becoming easier than ever for even a small business to use these approaches.

[00:30:20] Anthony: So all of these things we've talked about, creating this architecture, this decoupled, high-performance, reliable architecture, something that maybe historically only a LinkedIn could do, to use an example from earlier, is now open and available to really anyone that wants to build this way.

[00:30:39] Anthony: Yeah, and maybe that's the great hope with many of these innovations: that we can invent something that's really unique and different, but then make it something that really anyone can take advantage of.

[00:30:51] Will: Yeah, absolutely.

[00:30:53] Anthony: Awesome. Well, Will, thank you for joining us. That was extremely valuable.

[00:30:57] Anthony: Hopefully folks learned something today.

[00:30:59] Will: My pleasure.


Subscribe to the Data Masters podcast series

Apple Podcasts
Spotify
Amazon