Overcoming COVID-19 data collection challenges
Paul Balas
"Data is critical to combating COVID-19. But challenges around COVID-19 data collection and sharing prevent this information from being fully utilized by federal and state governments. That's the perspective of Paul Balas, who was previously the Chief Advisor Advanced Analytics at mining company Newmont. Paul shares how his analytics background gives him a unique perspective on the topic of COVID-19 data collection, what's required to better align government stakeholders and how the private sector resolved similar data collection issues."
Nate Nelson: Hey everyone, and welcome to the Data Masters podcast. My name is Nate Nelson. I'm here as usual with Mark Marinelli of Tamr, who's going to talk about the subject, and the guest of this episode of our show. Mark, how's it going?
Mark Marinelli: Pretty well, Nate, thanks. Good to be back. So, in today's episode, we're going to talk to Paul Balas, who was previously the Chief Advisor for Advanced Analytics at the mining company, Newmont Mining. He's passionate about modeling information. He's especially interested in the challenges that state and local governments face around collecting and sharing COVID-19 data. Paul talked to us about how his analytics background gives him a unique perspective on the topic of COVID-19 data collection, how a people problem is contributing to this situation and how the private sector resolved similar data collection issues historically.
Nate Nelson: Without further ado, here's Paul Balas.
Nate Nelson: Greetings Paul, thank you for being here. You're an analytics and IT expert, but today we're talking about the challenges that the government faces around collecting and reporting COVID-19 data. How did you come to work on this topic?
Paul Balas: Just as a concerned citizen with a background in data. When I started to see the reporting on COVID statistics, I became somewhat alarmed, frankly, in that there seemed to be no cohesive strategy for ensuring this vital information was prepared correctly, disseminated consistently, in order to make life and death decisions about how to apply public policy to our resources, bring those to bear to limit the loss of lives and improve the overall response. So that was just me saying, "Hey, this is a problem." And seeing it quite early in the reporting.
Nate Nelson: Right. But, of course, your background isn't in healthcare. So what among your skills and professional background qualifies you to speak on this topic?
Paul Balas: It all started when I was in my graduate program and I was deciding what I wanted to do. I was doing a double master's degree, an MBA and an MS in information systems. And I had the good fortune to work with some really talented data professionals as part of that program. And really just had a natural gravity towards modeling information, managing information. And it just became my interest and my passion over time and developing those skill sets. So having worked with data and reporting for IPOs for companies, for managing supply chain, for improving customer experiences, truly understanding what customers were doing, I had a lot of different experiences that really set me up to understand how data can go wrong, how horribly wrong it can go. And when I found this specific problem with public reporting of COVID information, I just said, "I need to focus some energy on this." I think it's important. I know I'm a small voice in a very large ocean, but I think if a number of small voices come together and are saying things that are similar, then we can really have an impact and change outcomes for the country as a whole.
Nate Nelson: So, as we know, by now, COVID-19 hit hard and fast revealing the faults in many of our institutions. What kinds of issues do you think it revealed in health care and government data collection?
Paul Balas: As I started to research what the root problems were and the quality issues around data collection and processing, the first thing I looked to was our federal government and what they were doing to pave the path, so to speak. And frankly, the CDC, as that agency enabled to manage data around life and death pandemics, has really been the experts in that capacity, but I found that they had an aging infrastructure. And in fact, they had plans to modernize their data fabric, the tools and the techniques they bring to bear to manage this data, but I think that they got caught behind the curve. There wasn't enough funding, there wasn't enough urgency. And so they had the intentions of modernizing that infrastructure that's many decades old, but they hadn't gotten far enough along to really respond to how fast this pandemic was spreading and how many cases were being generated.
Paul Balas: And they hadn't, maybe, asked the questions or applied the answers to these questions about, "What do we need to know in order to manage a pandemic?" And that's evolving and the CDC and the state agencies are improving their position on how to do that, and they're starting to gain consistency. So I think they just really got behind the curve on this. And to really, I don't want to say catastrophic, but to really dire consequences in terms of lives lost and putting stresses on our healthcare system and our amazing and courageous healthcare workers who show up day after day. So the things that need to happen going forward is a more consistent and invested effort in modernizing this platform. And then also, importantly, agreeing how we should measure and respond to this pandemic.
Nate Nelson: Would you say that these are evidence of longstanding institutional problems with how we collect data? Or is it simply that the government and relevant sectors didn't have adequate time to prepare for such a seismic event?
Paul Balas: I think it's a little bit of the former and the latter. You've got to sympathize with these great professionals at the CDC and all the state health agencies who are really doing the best they can and applying a lot of effort and skills. I know that these people in these situations are working pretty tirelessly. But if you look at this type of problem, if you're a CEO or a CIO or a CDO, in any company of reasonable size, this is the exact type of problem that every leader faces. Everybody says that data is critical to their business. In this case, to managing public response, applying resources. But getting people to come together in order to agree across functions within a company or even within functions or, in this case, across the states and the health agencies, it's a very similar type of problem. And it's more about people.
Paul Balas: The technology is really getting much, much better in terms of how to manage data at scale, how to measure quality of data, how to put systems and frameworks in place. The technology's really pretty good to do that. It's more of a people problem and bringing all these different perspectives on what the things that are important should be measured to achieve the outcomes that we want as a business or as a government. And so, because of that, in this age of social distancing, I think a key concept for being able to get to that end game where we agree on what we should measure, we agree on how we should measure it, the speed with which it needs to be measured, needs to be some sort of collaborative process, some sort of electronic collaborative process around the data itself. Email does not work.
Paul Balas: The CDC, for example, publishes documents and guidelines. And they're novels. They're very exhaustive. They have a lot of information and it's almost like getting a degree in how to manage pandemic data. And that needs to be simplified. It needs to be institutionalized as part of the framework. And we need a way in which people can come together and have that debate virtually and say, "Yes, this is what we believe truth is. These are data governance standards around this data, and this is how we're going to manage it." But it needs to be part of the process of delivering the data. It can't be what traditionally is done with many data governance efforts, which it's a wart, it's an artifact, it's separate from the process itself of managing the data and getting agreement and collaboration.
Nate Nelson: Now that we've gone over much of the context for the problem, can you describe for the listeners here some of the actual discrepancies that we found in the numbers?
Paul Balas: Oh yeah. There's a myriad of problems. If we just focus on death and managing public response to increasing death count. There's different ways in which you measure death. There's provisional counts of death, which means that it hasn't gone through the process of having a coroner or medical examiner say, "This was the cause of death in this case. It was COVID related, it was directly attributable to COVID, or it may have been attributable to COVID." And so that is a provisional measure of death, it's a proxy, it may be COVID related. And then there's this certification that goes on a death certificate, which says, "Yes, COVID." So there's gray in there, there's subtleties in there. And a lot of states and public health agencies and some of the other more popular sites that are aggregating this data from the states, they have not agreed to what those distinctions were or on how to combine it so that when you report a death number, a death statistic, it was well understood as to how it came about. Was it COVID related or not?
Paul Balas: And that's a just prime example of not having a good governance standard of how to publish the data, which says, "Why are states publishing data with different meaning?" It creates a lot of confusion. It's one point of confusion. And by applying some data quality processes and rules to something as simple as that, we would be way ahead of the response in terms of how to apply our resources to manage the crisis. And if you take that one example, and you look at any of the other measures, which are critical right now in terms of infection rate, in terms of number of cases reported, in terms of capacity to be able to deal with the crisis within our healthcare system. Each one of these things has subtleties and areas of gray that need to be demystified and agreed to and aligned. There's no reason why Florida or Georgia or California should be reporting in a different manner. And that's really the root problem that we have to solve.
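To make the distinction between provisional and certified death counts concrete, here is a minimal Python sketch of the kind of data quality rule described above. It is an illustration only, not anything the CDC or any state actually runs: the `DeathReport` record, the two status labels, and the `summarize` helper are all hypothetical names chosen for this example. The idea is simply that every reported number must declare which definition it uses, and anything that does not is flagged rather than silently added to a total.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical shared vocabulary that every reporting agency would agree on.
ALLOWED_STATUSES = {"provisional", "certified"}


@dataclass(frozen=True)
class DeathReport:
    """One reported death count. All field names are assumptions for this sketch."""
    state: str
    status: str  # must be one of ALLOWED_STATUSES
    count: int


def validate(report):
    """Return a list of human-readable problems found in a single report."""
    problems = []
    if report.status not in ALLOWED_STATUSES:
        problems.append(f"{report.state}: unknown status {report.status!r}")
    if report.count < 0:
        problems.append(f"{report.state}: negative count {report.count}")
    return problems


def summarize(reports):
    """Total valid reports per state and status; collect problems separately."""
    totals = defaultdict(lambda: {"provisional": 0, "certified": 0})
    problems = []
    for report in reports:
        issues = validate(report)
        if issues:
            # Do not fold an ambiguous number into the totals; surface it instead.
            problems.extend(issues)
            continue
        totals[report.state][report.status] += report.count
    return dict(totals), problems
```

Calling `summarize` on a list of reports returns the per-state totals, kept separate by status, alongside a list of reports that could not be interpreted. Keeping the two categories apart, instead of publishing one combined figure, is the design choice that lets a reader see how a number came about.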
Nate Nelson: So by the time that this information gets to the public and there are issues with it, does more of the responsibility lie with the structural issues in our institutions, or is it more a function of media reporting?
Paul Balas: Well, again, you've got a spectrum of interpretation and you've got a spectrum of due diligence that when the media reports numbers is to be believed. And what I mean by that is, some reporting that I've seen has really gone and dug down into what these issues are. For example, in Florida, a data scientist who was in the Department of Health for a state agency was showing that the death rate was much higher than the people above her, that she reported to, were asking her to report. Rightly or wrongly, if there was ill intent or not, there was a discrepancy that had to be agreed to in terms of what that statistic was. And this data scientist said, "I stand on my principles and I believe that to change this number would be wrong. And I've got a responsibility to do the right thing here. It's that important to me." And she was fired.
Paul Balas: Now the reporting, there was one article that was reported that dug down into the details. And it explained why there were differences sometimes. And that was great because it really gave you the nuance of why this person was putting her job at risk. Some reporting on it was spun a different way, where they didn't dig down into the details. It was more about, he said, she said, and more about, "This person wouldn't listen to directions." An HR matter. And so both are interesting perspectives, but being a data person, I want to see what motivated her to do that. And we see this type of reporting go on, depending on what news feed you choose to look at, you get a different version of the facts. And so what I would ask when people look at the reporting on this is that they do a little bit of digging if they truly want to understand, or they've got the interest to understand what's happening there.
Paul Balas: A lot of public reporting right now is being done by the news media and they reference Johns Hopkins University. Now Johns Hopkins built a dashboard pretty early in this COVID crisis because they recognized that there was no central place with which all this data existed to do the analysis necessary to shape public response. And they did a great job. But if you look at their GitHub, which is a list of all the problems that they've had with their data and all the discrepancies, they're unsolvable really. And the quality of that data should be somewhat suspect. Now, you might say, "Is it good enough to be able to measure public response?" At this point, I don't know if it's good enough.
Paul Balas: And it is challenging for these aggregators right now to be able to report accurately because we don't have the data governance in place nor the processes or standards to have really great data. So these news agencies, they want to tell a story and they rely on Johns Hopkins, for example, or the COVID Tracking Project, or some other site. Georgia, for example, had a problem with a graphic, which was a blatant error in how a chart showed the curve flattening with the infection rate, but in fact, the X axis was wrong and out of order. There were three problems with that data. And then the news media went off on that. And so I think we really need to dig down a little bit and understand the systemic problems.
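The out-of-order X axis mentioned above is the sort of error a simple automated check can catch before a chart is published. The Python sketch below is a hypothetical illustration, not a reconstruction of how the Georgia chart was produced: the function names are invented for this example, and the input is assumed to be a list of (date, case count) pairs.

```python
from datetime import date


def is_chronological(dates):
    """True if every date is strictly later than the one before it."""
    return all(earlier < later for earlier, later in zip(dates, dates[1:]))


def sort_by_date(points):
    """Return (date, count) pairs ordered by date, leaving the input untouched."""
    return sorted(points, key=lambda point: point[0])


def prepare_for_chart(points):
    """Guarantee chronological order before the data reaches a plotting library."""
    dates = [point[0] for point in points]
    if is_chronological(dates):
        return list(points)
    return sort_by_date(points)
```

A publishing pipeline could call `prepare_for_chart` just before plotting, or call `is_chronological` and refuse to publish when it returns `False`, so that an apparent downward trend cannot be an artifact of how the dates were arranged.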
Nate Nelson: So we've all come to understand at this point, this problem of information around COVID-19. There's too often been too little good information and too much bad information out there. What sorts of problems arise when we can't trust our data, especially in times like these?
Paul Balas: Well, people go with their guts, right? They do the gut check and they go, "Okay, I'm not sure I believe that number." Rightly or wrongly. In this case, we're balancing the economic livelihood of our country with the mortality rate. And that's a very tough calculus to have to deal with as a public servant. I envy nobody in that role who has to make these decisions, which are really life or death. And this, and in times of war, making those decisions without having good factual data is... You just go with your gut and it doesn't need to be that way. If we apply the right level of resource to these systemic problems for things that are really important in managing the economy, managing the health of our country, then it's definitely achievable. It's not an insurmountable problem. It's not even a huge problem in the scope of things. So I think that those are kind of foundational problems that need to be addressed in order to rise to this challenge.
Nate Nelson: Now let's talk about solutions to these problems. Paul, what is this term that I've heard from you, collaboration based data curation, mean?
Paul Balas: Yes. So it's really a great innovation that is taking place. If you've ever built a data warehouse, a dashboard, you know that when you report that data and you bring it to the boardroom, for example, that oftentimes people disagree with the numbers. It's like, "What does it mean?" And so leaders within companies will spend a lot of time debating about, "What is it? What is the data?" And in order to have everyone align and do things that are meaningful and impactful for the organization, they have to come to that level of agreement. And that's why data governance is important. The process by which we've been doing that for the past many decades is a fairly manual effort.
Paul Balas: When I did a lot of work as a data architect modeling systems, I would meet with business stakeholders and I would first look at the data and I would do a statistical analysis of it. And then I would bring that to the table when I was interviewing them and saying, "Well, this is what we're trying to get to with this dashboard. But I look at this data and what does it mean? What is this category? What is this value?" It's a very slow, laborious process. And modern day platforms have the ability to do that type of activity, to come to what the meaning is of the data using e-collaboration. And the best ones that I've seen don't make it a separate process from actually understanding and looking at the data. They make it part of the process and they make it efficient. So it's all about getting people together and having them be able to look at the data together and respond to it in an efficient way.
Paul Balas: And so that's basically what modern day collaboration looks like. It's efficient. It doesn't care about time zones. It's got kind of a queue, a response, you can ask questions of people, topical questions, I call it birds of a feather, around say a specific domain in the data. It might be your customer data, it might be COVID death data, but you can have all these people who are subject matter experts participating in this efficient, electronic collaboration. And then right then and there, they can apply their understanding to the data and the exceptions in the data. And really that's the linchpin of a modern curation platform that can get you much more quickly to the answer of agreement.
Nate Nelson: What you just said makes sense to me. But, of course, the problem we're dealing with here is a bit more broad than what might occur within the confines of a single company. How do you begin to effectively curate data when you've got to deal with all kinds of different independent players, in this case literally independent states, each with their own way of doing things?
Paul Balas: Well, again, it's mainly a people problem. And in order to tackle meaningful problems that involve more than just yourself, you have to be able to understand why. Why are we even talking about this? And that why becomes the what. What are we going to do in order to answer these questions? What questions do we need to answer? So without that kind of leadership, and in this case for COVID and pandemic response, federal leadership, because states aren't going to necessarily just come together organically, they may do it, but it won't be efficient without first saying, "Here's what we believe is necessary to manage in terms of data in order to respond." And that idea of taking information, it's only useful with a plan of action.
Paul Balas: So defining that plan of action and what we need to do to be able to enable that plan of action is the key first step in, frankly, any analytics effort. And then if we can start with that as our north star for how we manage and respond to pandemics, then the rest just becomes separate functional groups coming together in an efficient way to bring their view and perspective until we come to a commonality for that topic, for how we're going to measure that thing, in order to answer that question that we needed to ask upfront. Start with why. Figure out the whats. And then the rest becomes just execution.
Nate Nelson: What lessons of COVID-19 data and your collaborative curation concept can be applied to the business sector?
Paul Balas: In any business you've got different functions and businesses are broken up by IT, HR, finance, marketing, sales, and it's done because each discipline is a discipline, and each discipline serves the organization. But oftentimes they operate with their missions. Sometimes the missions are cross-functional and a lot of the important questions that you need to answer as any company are cross functional in nature. You've got to get those two functions to come together. So, for example, supply chain and marketing, there's a direct relationship with how much you need to build, or how much service you need to provide and staff up for based on what the customer demands are. And so if you answer these types of questions, you'll have a much better response to whatever your customer, whoever they are, and your ability to deliver a great product. So it's that nature of how we break apart organizations in order to do important functions that has to be bridged. And in modern day and age, with modern platforms, we can enable those types of solutions to allow people to efficiently come together, efficiently address problems, and get on with the business of serving their customers.
Nate Nelson: Paul, is it too late, or is there something that we can do about COVID-19 still to help the data collection reporting process? If you think positive steps can be taken, are we starting to see the relevant organizations taking them or no?
Paul Balas: You know, I think everybody who's engaged in this data flow is working very hard. My hat's off to the data stewards, the data champions, the data engineers, the infrastructure folks, the business people in government who are taking that information and shaping public policy. So everyone's working very, very hard. It's very, very inefficient. We are making improvements in terms of getting alignment of what measures are important, how they should be categorized. But we have no framework in place to really make that efficient and to make sure that what we're doing today is going to stay of good quality and improve the quality and then be able to apply what we've done to the next crisis. So there's a big gap still in terms of how people are going to work together in the future, and then how we're going to manage the quality of this type of information. No doubt.
Nate Nelson: Paul, have you got a last word to leave us with?
Paul Balas: Yeah. One other thing: I think that politics needs no hand in this. I think the data's the data, it's in the state it's in, and what we need to do is to focus on improving the data, irrespective of the perception of it. We need to be honest and frank about how we're reporting the problems with it and focus on attacking each of those problems in a systematic fashion, based on what is most critical to start with: death data, for example, or infection rate. What types of problems are each of the states having with managing that data and producing it? What are they saying their problems are? I don't think anybody has taken a look yet at what are the problems that are systemic, pervasive, common across the states, across the labs, across the hospitals, and then understand what those commonalities are and lay down a plan to deal with it.
Paul Balas: So the first part, I think, we need to focus on is what are those systemic issues that everybody can nod their heads to across the agencies and go, "Yes, we need to solve that." Because if we don't have that targeted approach to solving this problem, we won't get there. Never try to boil the ocean. Focus on a piece at a time. And that piece at a time has to be relevant for enough people, with enough impact if we fix it that we can make a difference. And so I think that's what we need to focus on as a country, as people, is let's just dive into these problems, acknowledge them, and let's solve them.
Nate Nelson: That was my interview with Paul Balas. I'm back here with Mark Marinelli. Mark, something to leave us off with?
Mark Marinelli: Sure. A key point here is that data challenges transcend industries. Paul doesn't have a healthcare or public policy background, but he knows data and analytics. And that experience is relevant to the topic of COVID-19 data collection. The challenges he brought up, the technology being available, but the people not being aligned around, "What outcomes do we want and how do we want to measure them?" The struggles around modernizing data infrastructure, making decisions with your intuition because you don't trust the data, all of these problems apply to the public sector, as well as the private sector. His solution to some of these issues, allowing subject matter experts to collaborate, to answer questions around the data, it makes a ton of sense and it's worth considering since it gets people to agree around what the data means faster, so that outcomes can be achieved more quickly.
Nate Nelson: Well that's it then. Thanks to Paul. And thank you, Mark.
Mark Marinelli: Thank you, Nate.
Nate Nelson: This has been the Data Masters podcast from Tamr. Thanks to everybody listening.