The Secret to Boosting Data Confidence at Novocure
Matthew Holzapfel (Tamr) and Marshall Worster (Novocure) discuss how to implement a data product strategy with Tamr + Google BigQuery that ensures the delivery of the clean, trustworthy data needed to run your business.
Tamr + Google: The perfect combo for clean, trusted data
- Hear why Novocure chose Tamr + Google to boost data confidence
- Learn about the purpose-built, out-of-the-box capabilities that Tamr data products provide
- Understand how Tamr data products + Google Cloud can accelerate data curation and improve accuracy
Want to read the transcript? Dive right in.
Speaker: Matt Holzapfel
Time Stamp: 0:00
Alright. Thank you, everyone, for joining us today for this discussion between Tamr, Google, and Novocure about how Novocure is boosting confidence in their data internally and starting to drive toward a next-generation data ecosystem that delivers more value for their business stakeholders.
Joining me today is Joan Kallogjeri from Google, and Marshall Worster from Novocure.
I'm Matt Holzapfel. I lead corporate strategy at Tamr, where we help our customers deliver high-quality, trusted data products that increase the value of their data so that they can ultimately make better decisions within their organization.
I'll hand it off to Joan to give an introduction, and then to Marshall to introduce himself as well. Joan.
Speaker: Joan Kallogjeri
Time Stamp: 1:09
Thank you, Matt. Hello, everyone. I'm Joan. I'm a customer engineer here at Google. I primarily focus on data analytics and machine learning. Our team is primarily focused on biotech and biopharma companies based out of the Cambridge area, the New York area, and some in the San Francisco area. Marshall.
Speaker: Marshall Worster
Time Stamp: 1:29
Everyone, thank you as well for joining.
I’m Marshall Worster, senior director in the enterprise transformation group at Novocure.
I was actually part of Google in my past, but made the jump into industry to help drive some of these data and technology strategies forward. So for those who are unfamiliar, Novocure is an oncology company that specializes in treating some of the rarest and most aggressive types of cancer in the world. And so as we continue to grow and invest in our clinical portfolio, leveraging some of these technologies on our data, as well as getting into the cloud, has been top of mind for this group.
And part of that really has been around the way in which we look at data. So organizationally, we've done a great job in having datasets that we can leverage for things like BI, corporate reporting, and other ways to help us make decisions.
Now, as part of that, we have yet to take advantage of some of the cutting-edge technologies out in the market as they pertain to the cloud space, self-service analytics, AI, and machine learning. We've been looking at where we see our trajectory going as an organization, really understanding where we are today and what true strengths we can leverage, and also what the organization needs to move forward and to capitalize not only on new technologies, but also on the outcomes and benefits we can realize by using new ways to interrogate our data.
And so as such, we've made a significant investment to build out a comprehensive data lake, leveraging Google Cloud's BigQuery as well as a handful of other technology partners, such as Tamr, to help us cleanse that data and ensure that when we start running deeper, more curated analytics, we know that our data is accurate and that it's going to give us the answers we're looking for.
And part of our journey is not all just technology related. So part of my role at Novocure is really to help understand not just how we get there, but who is going to help us get there. One of the things that we've set out as an IT organization is partnering not just within IT, but across our business units to understand: what does our multi-year strategy need to look like in order to support our growing business? As I mentioned, Novocure is investing heavily to treat these very aggressive forms of cancer. So we need to understand what's required to support things like clinical trials, to support analytics of how patients thrive on our therapy, and to understand how we can become a proactive enabler to our business counterparts and give them a new way to engage with, or a new way to interrogate, the data that we've curated and built over the last fifteen years or so.
Part of that is also a culture shift of leveraging new technologies and finding new ways to engage the business and create true partnerships and relationships. As I mentioned, having that kind of proactive, business-first, outcomes-oriented thinking is allowing us to be more specific about how we look at new technologies, how we want to pull them in-house, and how we provide the best service to our downstream counterparts.
And then lastly, really around talent. A big part of this journey, in taking on new challenges, building this modern data lake, and leveraging new technologies, is really around upskilling our teams. We have a lot of really great and talented individuals at Novocure. Bringing in these technologies and enabling our teams and our groups to learn them, and to leverage them to deliver for our business counterparts, has been an awesome part of the journey, and we've also had to augment in some other specific areas to help us grow and scale the way that we want to. These three areas of finding that vision, having that culture shift, and investing in the people to get us there have really been a core tenet in allowing us to be successful in bringing this modern data lake project to life.
And the way that we looked at doing that is really by leveraging tools and capabilities that we already had in house, and then, as I mentioned, upskilling our team on new technologies and new capabilities, so we can take what we know, plus what we can learn, and have that drive our ability to excel at a new way of interrogating our data. So we started out leveraging Google Cloud's BigQuery as our foundation.
And we really went down that path after looking at several players in the market, evaluating off-the-shelf solutions and other hyperscalers, and the team came to a consensus that Google BigQuery was going to be the best solution for this team, especially for the types of deep analytics, leveraging things like Jupyter notebooks and RStudio, that we're really looking to do. So we knew where the landing zone was going to be, and we knew how we were going to get there.
Then the question became, what data are we going to start with? And how are we going to ensure it's really good data? So we looked to partner with a few key technology partners in the market. Since we already had SAP and Veeva, what we were really looking to do was make sure that that data is clean and able to be accessed. So we initially leveraged our partnership with Fivetran to help us replicate that data into Google BigQuery. But as we started going down that path, we realized that, for instance on the right-hand side, we have a lot of conflicting data sets as they pertain to healthcare organizations, healthcare providers, and all the data that we have around that base of providers out in the world. We can get data from the open market, for instance from IQVIA, and we use multiple other vendors to get data on the healthcare provider space globally. And we use Veeva as our CRM for our field teams that are out there working with doctors, physicians, and nurses on how Novocure's Optune therapy works and how we can best educate and leverage our knowledge to help these patients out in the field.
And so what we quickly realized is, okay, great, we can get the data and we can move it into BigQuery, but we had a lot of conflicts. We had a lot of, well, is this the right data? Is that the right data? We were seeing things coming from different areas that were not giving us a clear picture. So that's where we decided to call on Tamr, after looking at a handful of solutions in the market. What we were really looking for was something that would be very complementary to Google BigQuery, and we wanted something very lightweight and efficient that would allow us to load this data, master it, and then start leveraging it incredibly quickly. Working with the team, we saw that we could very easily load all of this data, both clean and dirty, into BigQuery, or start leveraging it straight from the sources, and then go and master it, clean it, curate it, set our rules, and use Tamr's machine learning technology to get those master records and load them into the final curated dataset within BigQuery.
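To make the idea of mastering conflicting records a bit more concrete, here is a minimal, hypothetical sketch of what matching duplicate provider records can look like at its simplest. The source names, fields, and string-similarity threshold below are illustrative only; Tamr's actual data products use machine-learning-based matching at a much larger scale.

```python
from difflib import SequenceMatcher

# Toy provider records from two hypothetical sources (a CRM and purchased market data).
records = [
    {"source": "crm", "name": "Dr. Jane A. Smith", "city": "Boston"},
    {"source": "market_data", "name": "SMITH, JANE", "city": "Boston"},
    {"source": "market_data", "name": "Dr. John Doe", "city": "New York"},
]

def normalize(name: str) -> str:
    """Lowercase, drop titles and punctuation, and sort tokens so comparisons focus on the name itself."""
    name = name.lower().replace("dr.", "").replace(",", " ")
    return " ".join(sorted(name.split()))

def similarity(a: dict, b: dict) -> float:
    """Similarity of two records based on normalized name, gated on matching city."""
    if a["city"] != b["city"]:
        return 0.0
    return SequenceMatcher(None, normalize(a["name"]), normalize(b["name"])).ratio()

# Pair up records whose similarity clears a threshold -- candidates for one master record.
for i in range(len(records)):
    for j in range(i + 1, len(records)):
        score = similarity(records[i], records[j])
        if score > 0.8:
            print(f"Likely match ({score:.2f}): {records[i]['name']} <-> {records[j]['name']}")
```

In practice, the surviving pairs would be clustered and merged into a single golden record that lands in the curated dataset within BigQuery.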
So for us, it's been a very big project, and something that has been really successful to date. And on the back side, using our own teams internally, we're constantly ensuring that we're not just pushing everything to the data lake, but also selectively building capabilities to pull that clean data out and into our downstream master record system. So, like I said, it's been a mix of leveraging new tech, leveraging the skills and teams that we have, and upskilling our team to be able to deliver on these. And Tamr's been a pretty core part of that, helping us cleanse the data and making sure that when it gets into the data lake and we start running our models, we know that the data is accurate.
Speaker: Joan Kallogjeri
Timestamp: 10:02
Thanks, Marshall. That was super helpful.
Next, I'm gonna talk about why customers are picking Google Cloud for modernization, whether that's their infrastructure or more downstream analytics, and how the cloud can help you gain further insights on your data and your analytics by using computing resources from the cloud as opposed to managing on-premises infrastructure.
Specifically for Novocure, we really want to talk about why they chose Google Cloud to begin with, and initially it was their data lake modernization, right? So taking siloed data from many disparate sources, like Veeva and SAP, and then consolidating that all into one unified data warehouse and data lake that they can then do downstream analytics from.
And in the spirit of that, why are other customers also choosing to build on GCP? There are a lot of different pillars here that you can take advantage of, right? We make it really easy to migrate this data, as you saw in the previous slide. We provide scalable infrastructure, whether it's on-demand processing or something else, right? So if you're doing a lot of compute-intensive workloads and you don't necessarily want to buy hardware upfront, you can use the scalability of the cloud to scale up and down as you see fit, and not pay for hardware that you won't use in later stages.
There's also a lot of innovation going on, whether it's in the machine learning space or the serverless space: taking advantage of managed products and services that take away a lot of the headaches of actually managing infrastructure, so you can focus on the easiest way to get to a specific insight. Especially in the machine learning space, there's a lot of work being done on providing foundational models and APIs that you can build on top of as a baseline for your research, your analytics, or your clinical abstraction and clinical research. And the other main pillar here is security and governance at scale.
When you're on premises, you really have to have a secure network, secure infrastructure, and a security team to be able to manage all that. But as you move into the cloud, a lot of that gets taken up by Google and other cloud vendors and partners that really go in and make sure that all the hardware storing your data or running your processing is secured.
Thanks. So I want to continue that conversation, but lay out the tools and services that we offer and where they fit in the process. We have the data ingestion layer, which is really all about how we get data into that centralized storage. Then there's how we catalog and add metadata, to either the ingestion or the actual files or tables being stored, to add a little more depth about what is actually being stored: things like, where did this data come from? What version is this? And so on. You want to be able to catalog that and add metadata so that, downstream, you can add search capabilities on top of it. And then, within the central storage here, BigQuery is the main data lake and data warehouse service that we offer; it really harnesses the scalability and cost-effectiveness that you need, but also takes the security aspect into account.
And then a lot of that can be augmented with proper IAM permissions: making sure that nobody has permissions and roles that they're not supposed to, federating everything down to least privilege, and then, on top of that, having services like DLP where you can de-identify data, such as PII, on the fly without actually having to build your own service. Right? And then once you move on to the next step, once you have data already centralized, you really want to kick off the processing and analyzing of this data, whether it's just basic transformations, moving from a raw zone to more of a curated zone, or much more downstream analytics, like machine learning and building actual pipelines that can use a lot of this data to either provide new insights or add a layer of prediction to whatever drug discovery process you're running.
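As a rough illustration of the DLP capability mentioned above, the snippet below sketches on-the-fly de-identification with the google-cloud-dlp Python client; the project ID, info types, and sample text are hypothetical placeholders, not a real configuration.

```python
from google.cloud import dlp_v2

project = "my-gcp-project"  # hypothetical project ID
client = dlp_v2.DlpServiceClient()

response = client.deidentify_content(
    request={
        "parent": f"projects/{project}",
        # Detect a couple of common PII info types in free text.
        "inspect_config": {
            "info_types": [{"name": "EMAIL_ADDRESS"}, {"name": "PHONE_NUMBER"}]
        },
        # Replace each finding with its info type name (e.g. [EMAIL_ADDRESS]).
        "deidentify_config": {
            "info_type_transformations": {
                "transformations": [
                    {"primitive_transformation": {"replace_with_info_type_config": {}}}
                ]
            }
        },
        "item": {"value": "Contact Dr. Smith at smith@example.com or 555-123-4567"},
    }
)
print(response.item.value)  # PII is masked before the text lands in the curated zone
```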
So next, right, we'll break that down a little further into these five buckets here. There's the capture zone: this is where you're really doing all the data ingestion, whether it's from on premises or from a different cloud. We offer a lot of tools for that, like the Data Transfer Service or the Storage Transfer Service, which are UI-based, where you can quickly start a transfer job to get data from, say, AWS, and immediately ingest it into GCP without ever having to write a line of code.
Once you have this data within the GCP environment, we offer a lot of different tools for a lot of different stakeholders.
They range all the way from much more user-friendly, non-programmatic tools like Dataprep and Data Fusion, which really harness drag-and-drop ETL: if you're not familiar with Python or SQL or any other language, you can use a UI that's basically like Google Sheets, where you drag and drop and add filters out of the box without ever having to write a single line of code. And then all the way to the other end, where you're writing Python, whether it's on Apache Beam or Apache Spark, which are much more scalable but also more skill-intensive in terms of programming.
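For the programmatic end of that spectrum, here is a small, hypothetical Apache Beam pipeline in Python. The bucket paths and the filter condition are made up for illustration; the same pipeline could run locally on the DirectRunner or on Dataflow.

```python
import apache_beam as beam

# Runs on the local DirectRunner by default; pass DataflowRunner options to run on GCP.
with beam.Pipeline() as pipeline:
    (
        pipeline
        | "ReadRaw" >> beam.io.ReadFromText(
            "gs://my-landing-bucket/raw/providers.csv", skip_header_lines=1
        )
        | "ParseCsv" >> beam.Map(lambda line: line.split(","))
        # Hypothetical rule: keep rows whose last column (country) is "US".
        | "KeepUSRows" >> beam.Filter(lambda fields: fields[-1].strip() == "US")
        | "Format" >> beam.Map(lambda fields: ",".join(fields))
        | "WriteCurated" >> beam.io.WriteToText("gs://my-landing-bucket/curated/providers_us")
    )
```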
So we offer something for both audiences here. Then once you have this transformation, you want to end up storing that data somewhere, right? So we offer a bunch of different services. For unstructured data, like files, file systems, PDF documents, images, PDB files, pickle files, any type of file format you can think of, there's the Cloud Storage ecosystem, which is just blob storage. And then for more structured data, this is where you want to dig into BigQuery: if you have schema-based files, like Parquet or CSV, you really want to store them in a table-like format so you can do further analysis on top of them, and that's where the analysis engine of BigQuery comes in. There you can create queries to further curate this data set and grab specific insights, or use something like Dataform, which is built within BigQuery, where you can really start to build ETL platforms and ETL tooling on top of BigQuery, all in SQL. And we've added machine learning on BigQuery as well, so you can use the out-of-the-box machine learning models that we've created, like regression, classification, and forecasting, all within BigQuery, all within SQL syntax. So you don't have to program in anything more complicated than SQL to be able to do machine learning.
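To show what machine learning in SQL can look like in practice, here is a short sketch that trains and queries a BigQuery ML logistic regression model through the Python client; the dataset, table, and column names are hypothetical.

```python
from google.cloud import bigquery

client = bigquery.Client()  # uses application-default credentials

# Train a logistic regression model directly in BigQuery (hypothetical dataset/columns).
create_model_sql = """
CREATE OR REPLACE MODEL `my_dataset.churn_model`
OPTIONS (model_type = 'logistic_reg', input_label_cols = ['churned']) AS
SELECT tenure_months, monthly_spend, support_tickets, churned
FROM `my_dataset.customer_features`
"""
client.query(create_model_sql).result()  # blocks until training completes

# Score the same table with the trained model.
predict_sql = """
SELECT *
FROM ML.PREDICT(MODEL `my_dataset.churn_model`,
                (SELECT tenure_months, monthly_spend, support_tickets
                 FROM `my_dataset.customer_features`))
"""
for row in client.query(predict_sql).result():
    print(dict(row))
```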
And then, obviously, there's much more advanced analytics on top of that. There's Vertex AI, our machine learning platform, for those that really want to go in depth and take advantage of Jupyter environments or R environments to further their analysis and prediction capabilities on that data, with frameworks like TensorFlow, PyTorch, scikit-learn, anything like that; you can really take advantage of those using our Vertex AI platform. Or, if you really want to stick to the analytics standpoint and mostly do data visualization on the data you've now ingested, Looker and Google Sheets both connect to BigQuery natively, so you can start to play around with all the different types of visualizations you can create, and then share them across the company or even externally with your vendors or customers.
And then there are the things that overlay all of this. Like I said, there's the Data Catalog feature, where you can add metadata to every single part of this ecosystem and have fine-grained metadata populated at every single step, so you can better track the lineage of where data is coming from, what kind of transformation is happening to that data, and where that data is living. And then Cloud Composer is more of a tool that stitches a lot of these products together to make a more unified solution.
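As a sense of how Cloud Composer (managed Apache Airflow) stitches these pieces together, here is a small, hypothetical DAG that loads files from Cloud Storage into BigQuery and then runs a curation query; the bucket, project, dataset, and table names are placeholders, not Novocure's actual setup.

```python
from datetime import datetime
from airflow import DAG
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator

with DAG(
    dag_id="daily_provider_refresh",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Step 1: land raw CSV files from Cloud Storage into a raw BigQuery table.
    load_raw = GCSToBigQueryOperator(
        task_id="load_raw_providers",
        bucket="my-landing-bucket",
        source_objects=["providers/*.csv"],
        destination_project_dataset_table="my_dataset.raw_providers",
        source_format="CSV",
        skip_leading_rows=1,
        autodetect=True,
        write_disposition="WRITE_TRUNCATE",
    )

    # Step 2: run a simple curation query into the curated zone.
    curate = BigQueryInsertJobOperator(
        task_id="curate_providers",
        configuration={
            "query": {
                "query": "SELECT DISTINCT * FROM `my_dataset.raw_providers`",
                "destinationTable": {
                    "projectId": "my-gcp-project",
                    "datasetId": "my_dataset",
                    "tableId": "curated_providers",
                },
                "writeDisposition": "WRITE_TRUNCATE",
                "useLegacySql": False,
            }
        },
    )

    load_raw >> curate
```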
And then obviously, we offer a lot of the tools and services to make all of this possible. But if you really want to take it a step further, like Novocure, this is where our partner ecosystem comes in, and fantastic partners like Tamr can take this to the next level, and I'll let Matt talk more about that.
Speaker: Matt Holzapfel
Timestamp: 19:39
Yeah. Thanks. Thanks, Joan. And, I mean, we certainly have been the beneficiaries of a lot of the work that Google has done on these underlying services, given that our software is built on top of BigQuery.
It's great that we can have confidence in the scalability, reliability and security of our product because it uses these Google services.
And one of the things that I also want to call out, which ties into what we're going to be talking about with Tamr's data products specifically, is that the journey Marshall and Novocure are on feels very representative, I'd say, of what we see pretty broadly in the market. And I think one of the things that I really appreciate about Marshall's story is the way that he and the team have taken this very holistic view of the problem: thinking about ultimately how we deliver more value for stakeholders, and what all of the pieces are that we need to put in place in order to get there, between infrastructure and technology and people and culture. How do we make our organization more successful with data? I think one of the common themes we see within organizations is when they're purely focused on the storage and compute problems.
In that case, I think it can be difficult to really capture all of the value of the underlying data. More specifically, the problem of storage and compute, I think, as Joan outlined, has really been solved. And so that is one really important piece of the puzzle, but it is just a piece of the broader puzzle.
Really, I think where we see customers and companies being most successful with their data is when they are really focused on how this data is going to be consumed. What are the different applications that we have for our data? Where do we want to get to over the next three to five years? I really liked the slide that we saw at the beginning that showed, on a relative basis, where Novocure expects to be on these different dimensions of data consumption.
And I think that's never been more important. There are so many endpoints, so many ways to consume data. And now so many people within the enterprise have a level of proficiency and competency when working with data that they're hungry for more data, better data. Being able to serve that high-quality data in a very accessible, reliable way is really the core problem for most companies.
And I think the big question we face as a data industry more broadly is: if we do have all of these tools in place, and if we've made great progress on building skills around using data, then why do we still have so many organizations, companies, and places within the enterprise where the data in dashboards isn't trusted, and where all of this data that people are consuming ultimately isn't actioned, because they don't actually believe what's in the data, or they don't believe that it's right and applicable to their business?
And I don't think the reason for this is a lack of tools. Certainly, there is no shortage of data management tooling. And I think one of the things that is really encouraging is that, in all of these dimensions, the tooling and the technology have really matured to a point where they're no longer the bottleneck. From a sheer cost standpoint, a lot of these tools have become extremely accessible, and the same goes for features and functionality. I think it's fair to say that the quality of data management tooling has increased and improved exponentially over the past five to ten years.
And so there's really something much deeper at play here, causing a lot of the issues that we see throughout organizations of not getting the maximum amount of value out of their data.
And more specifically, one of the things that we're seeing companies like Novocure, for example, doing very, very well is that they're starting to treat data as a product. So rather than focusing purely on the tools problem, they're thinking about the problem much more holistically: who's managing this data on an ongoing basis? Who are the consumers of this data? What are the types of outcomes that we want to drive towards?
And building this data-as-a-product mindset is one of the real key levers that organizations have for getting more value out of their data. It does require a more holistic view of the problem, and specifically, when thinking about what the data product itself needs to do, it introduces some new capabilities and some new ways of thinking. One of the things that I'll highlight here, which I liked about how Marshall laid it out, is this idea of bringing external data into the process as early as possible. This is something that we generally find works extremely well: understanding that our data has limitations, or at least that our internal data can only get us so far, and that we need to augment that information with external sources in order to get the clean, high-quality data we need to ultimately drive these analytic outcomes.
And so when putting in place a data product strategy and taking this kind of holistic view of how we take all of the raw data we have today, and put in place all of the tooling, processes, and people needed to turn that raw data into something useful, I think one of the key pillars is really thinking about how you augment what you have today with external data.
Another key part of this, which was highlighted by both Joan and Marshall in their discussions, was the use of AI/ML in order to solve these problems.
We're at a place now where the volume and variety of data that organizations can use, and really need to use in order to be effective, has gotten to be extremely significant. So how do you actually rein all of that in, in a way that's efficient and also prompt enough to serve data to business users and consumers when they need it? A key part of that is leveraging advanced technology such as AI/ML to automate a lot more of the process and make delivering and managing these data products feel a little more lightweight. It shouldn't feel like a massive undertaking, because a lot of data teams are really stretched thin right now. And so being able to make those data teams more productive requires the use of AI/ML to help stitch all of that data together, clean it, and then ultimately enrich it in order to create that mastered record and that clean, single view of your customers, your patients, your providers, or any other key domain or entity that really drives your business forward.
And specifically when thinking about building out these data products and going on this journey of starting to create data products for your organization, thinking about this problem in a very domain-driven and targeted way is critical, because the type of external data that's available, and the types of AI/ML models that you'll want to apply to the data, are going to be very different depending on the domain. You want to be extremely in tune with your business stakeholders' expectations in terms of things like: what attributes do they care about? How sensitive are they to data quality issues? What are the things that need to be flagged and addressed? Those are going to vary quite a bit depending on the individual domain. And so putting in place a strategy that is targeted around the individual domain, and working with the people within the organization on the business and the data side to understand how to build the best data product for your end consumers, is critical. There isn't a one-size-fits-all approach for every single domain. It's important to really understand the nuances of that domain and then use the technologies available that work well for it. And partners of yours who have expertise within that domain can help guide you through that process and give you templates so that you can be effective.
And this kind of end-to-end process of combining, linking, cleaning, and enriching these data sources in order to get these highly comprehensive, dynamic, and ultimately insight- and consumption-ready data products is really at the heart of what we do at Tamr, and where we're really excited to help our customers.
We see a lot of companies that are in the early stages of modernizing their ecosystem and are really focused on building that first data product: the data product for the most important domain within their organization. And we encourage all of our customers to start there. Start with getting that win around that important and key domain. Usually it's some form of customer data, or whatever "customer" might mean in their business. Then start to think through the vision for, once that first data product is in place and you really start to drive consumption of it, what you want the next two, three, five data products to look like, and put in place a strategy that's going to enable you to grow and expand the number of data products you're managing within your organization over time.
What I think was clear from the slides in the presentation that Joan walked us through is that the cloud has really enabled us to do much, much more with our data. The art of the possible here has really expanded. And so working with technologies, such as Google's and such as ours, that enable you to expand that ecosystem and build something that's going to help you tap into the art of the possible is critical, just given how fast the industry is changing and how quickly people are leveling up in how they use data.
And within that, one important thing to keep in mind, to bring this back to the overall subject of the webinar, is building confidence in the data. One of the things that can't be overlooked is the power of simple user interfaces that allow people to interrogate the data. We think of these as entity wiki pages, but really the key idea is this: if you're going to make your data available to dozens or hundreds of people within the enterprise, and you want them to trust that data, and ultimately to trust the aggregation of all of that data as it's presented in a dashboard or a recommendation coming from an AI model, one of the important things is making it very easy for people within the organization to drill down into individual customers, for example, or other entities that they know about, so that they can validate that the entities they know about and understand well are being accurately represented in the data. Once people review and do their own spot-checking of the data in the data product, their ability and willingness to trust that data and use it more broadly increases dramatically.
And so one of the things that we've been focused on is building out this experience that makes it easy for users to do this type of drill-down and build trust, because the anecdotes that they have about their data, or the understanding that they have about these key entities, they can really validate on their own and see with their own two eyes that yes, this is in fact accurate.
So just to wrap things up before we get into Q&A. If you're on this journey where you're starting to think about using data products and building out a data product strategy in order to ultimately drive and deliver more value from your data, I think there are five important steps to take and I'll highlight just a couple of these.
Specifically, I think the first one is knowing your why and knowing where you're going.
As we saw at the beginning, having this more holistic view and really understanding what some of the bottlenecks in the process are going to be is critical to being able to move fast and ultimately deliver value very quickly, because you're thinking a few steps ahead and you're not just myopically focused on one individual pain point or problem that you have. You're really thinking about how we ultimately deliver value for the organization at a much higher level.
And then the final piece that I want to highlight is just the idea of a minimum viable data product.
I think one of the things that causes a lot of companies to not make as much progress with their data as they would like is when they take a "boil the ocean" strategy: we're going to catalog every single piece of data that we have, we're going to try to understand it and do discovery and deep investigation of it, and then we're going to try to clean it up and make it usable.
If, instead, you start with the one or two key outcomes you want to drive towards, and then just focus on the data, the people, and the process needed to get that in good shape, then you can build this minimum viable data product and start to learn how you can effectively manage data as a product within your organization, and how you can start to capture some of the value that is really hidden in your data today.
And so with that, we still have a little bit of time here, and we wanted to save some time for questions.