Can Data Ever Be Perfect?!

In this episode of “Talk Tech with Data Dave,” Alexis and Data Dave dive into the intriguing question posed by Jo from ZoomInfo: “What stands in the way of organizations having perfect data, and is perfect data a realistic objective?” Dave breaks down the complexities of data collection, highlighting how inaccuracies often stem from human error at the point of data entry, such as the common issue of mandatory fields leading to false information being entered. They discuss the importance of understanding the purpose of data and the necessity of accurate observations to ensure meaningful predictions.

Dave emphasizes that striving for perfect data may not be as practical as aiming for accurate and relevant data, tailored to the specific needs of an organization. They explore how data’s historical nature can influence its use in predictions, stressing the need for confidence in the data’s ability to answer the questions posed. The episode wraps up with a real-world example from Dave’s experience, illustrating the pitfalls of poor data practices and reinforcing the critical role of accurate data capture. For listeners keen to learn about the balance between data accuracy and practical application, this episode offers valuable insights and a bit of humor with a mention of dolphins on mountains.

HAVE A QUESTION?
Ask Data Dave about all things data, cloud, or technology.
We'll be happy to answer your question on the podcast.

Or send us an email at: techtalk@d3clarity.com

Published:

October 10, 2023

Duration:

00:20:00

Transcript

Alexis
Hi everyone. Welcome to Talk Tech with Data Dave; I’m Alexis, you know me from Talk Tech with Data Dave, and I’m here with my friend Data Dave.

Data Dave
Good morning, Alexis. Good morning, everybody. Welcome to Talk Tech with Data Dave with me.

Alexis
I said that enough times. Yeah, we both work for a company called D3Clarity, and we run this podcast. We’re super excited because today, we have a listener question.

Jo from ZoomInfo
Hello, Alexis and Data Dave. Jo from ZoomInfo here. I want to hear your perspective: what stands in the way of organizations having perfect data? Is perfect data a realistic objective? Thanks.

Alexis
Jo, that was an awesome question. Thank you for asking. Dave, before we move on, I need to kind of put this in Alexis words so that it makes sense to me. What stands in the way of a person or an organization or a group of people having good data, and is perfect data even something to strive for?

Data Dave
Thank you, Jo. That’s a fabulous question. That really is a great question. So, let me repeat it, and I know we’re repeating it a lot. What stands in the way of good data, and is perfect data too much to strive for?

Alexis
Yeah. Is it even possible? Is it even something we should be aiming for?

Data Dave
So, let’s talk about that for a minute. Let’s go back to our very first episode, which I think was “What is data?”. Let’s start there, because if data is evidence of behavior, if data is simply the evidence of something that happened, then what is good data versus bad data? Just to dive straight into it. So, if you start there, all data is evidence. So if the data is bad, what do we mean by bad? Did the data not collect the correct event? Not describe the event correctly? Or did the data get corrupted? What caused it to be bad?

We have to look at the observer. When the event was observed, did the observer collect the data properly?

Let me take a point of sale example right. If you collect the data at point of sale that says something was sold to somebody, and you’re collecting that as an event at point of sale then, did you collect it accurately? How was it collected?

If you want to know what was sold, that’s fairly straightforward. It’s on the barcode. You scanned it. But if you want to know who bought it, who purchased it, how do you identify that person? We see a large number of organizations that say, oh, we want the email address of every customer. Great. Nice question. You’ve probably been asked that as you check out of stores?

Alexis
Absolutely, yeah.

Data Dave
Can I have your e-mail address, please, right?

Alexis
I usually say no.

Data Dave
You usually say no, so now you’ve got a problem, right? Because the store has mandated that their point-of-sale people are going to collect email addresses. What do they put in that field? Often they’ll make that field mandatory. But you don’t have to provide it. And they make the field mandatory because, if they don’t, the point-of-sale person will take the shortest path possible, which is not to ask, and then it stays empty. If they ask and you say no, what do they do? Well, what often happens is an invalid address gets put in. It might be the email address of the store. Yeah, they’ll put in the e-mail address of the store, or they’ll put in the info@ address of the company that did the selling. Or, you’d be surprised how many times you see Mickey Mouse. Mickey Mouse at… People will put all kinds of rubbish in there.

So, you have to be super careful when you do this because now what you’ve done is you’ve just corrupted your data. You’ve just corrupted the history of that purchase because now you’re saying that Mickey Mouse just bought a refrigerator.

Alexis
What’s Mickey Mouse need a refrigerator for?

Data Dave
Thank you. A mouse did not buy a refrigerator in Austin, TX, because that’s where the store is. So, no, Mickey Mouse didn’t. We can capture that, and we can clean that up afterwards as we get into the data, and we can do statistics and all kinds of things to say, well, it looks like Mickey Mouse is actually an invalid email address, that sort of thing. We can do that.

Alexis
Right.

Data Dave
And we do do that. But you’re corrupting your data by definition. But is the data wrong? No, it’s actually the observer that was wrong. The observer didn’t collect the correct event. Therefore, now you’ve corrupted your data.
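The cleanup Dave describes can be sketched in a few lines. This is a minimal illustration only: the placeholder patterns and the store address below are invented for the example, and a real cleanup job would also lean on frequency statistics, such as the same address appearing on thousands of receipts.

```python
import re

# Placeholder patterns often seen when clerks are forced to fill a mandatory
# email field: joke names, "info@" catch-alls, test domains.
# These patterns are illustrative assumptions, not an exhaustive rule set.
PLACEHOLDER_PATTERNS = [
    r"mickey\.?mouse",
    r"^info@",
    r"^noemail@",
    r"@example\.com$",
]

STORE_EMAIL = "store123@retailer.example"  # hypothetical store's own address

def looks_like_placeholder(email: str) -> bool:
    """Flag emails probably entered just to satisfy a mandatory field."""
    e = email.strip().lower()
    if e == STORE_EMAIL:  # clerk typed in the store's own address
        return True
    return any(re.search(p, e) for p in PLACEHOLDER_PATTERNS)

print(looks_like_placeholder("Mickey.Mouse@disney.example"))  # True
print(looks_like_placeholder("jo@zoominfo.com"))              # False
```

Flagging is the easy half; as Dave notes, the record is still corrupted, because the true purchaser was never observed.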

Alexis
I like that for the beginning of his question of, you know, what stands in the way of perfect data. That was a perfect example: an observer, whoever or whatever collects that data.

Data Dave
Exactly. So that’s one point. Another point would be perfect for what? So, I’m using this data for something. So, we asked the question, what stands in the way of perfect data? Perfect for what purpose?

If you look at the data and say this data is just a collection of history. Then it can be perfectly reasonable data. It just collected history. Great. Nice. But when we get into analytics, and we get into using that data for something more interesting than history, we want to predict the future. We want to make some prediction based off this data.

We want to predict sales. We want to predict something. Then we have to ask the question, is the data good enough to make this prediction? Is the evidence in the data actually rich enough to predict the future that we want to predict? I’m going to use the example of the winter coats in Chicago that we talked about some episodes before, right?

Alexis
Oh yeah, that was our very first episode.

Data Dave
Right. I think the very first episode again, which is: if we use the data of last year to predict the sales of this year, and we say in January we’re going to sell so many winter coats in Chicago. But we ran out last year. We were out on the 15th of January. Okay, we can predict that we’ll use the same number, but is the weather going to be the same? What if it’s warmer this year? So we have to add to the data. We can use the sales data, but we need to augment that with “What was the weather like in Chicago last year?” and “What is the weather going to be like in Chicago this year?” and then make an inference. We can infer that the weather’s either going to be the same, warmer, or colder, and make a prediction. We can use it to a certain extent, but we’ve got to correlate our prediction with our confidence level that our data of history accurately describes the future as well. So, when we extrapolate that data, we’ll use mathematics to extrapolate that data. We’ve got to be cautious of how far that extrapolation is valid.

Alexis
Dave, I need you to take that down a notch for me. “Extrapolate”?

Data Dave
Ohh, extend into the future. Right, extend into the future. So, it’s kind of, I can draw a graph of the past, draw a straight line through the graph of the past, right? It goes up. Well, if I say here is now, and then I want to get… Well okay, based off of what this line is doing here, I expect it to do that, right? Expect it to carry on in exactly the same pattern.

Alexis
Right. Okay.

Data Dave
To date, the history has carried on. And then from this point, it’s going to continue in exactly the same way, or a similar way, and my pattern is correct. And this is where you start to say that mathematics is the language of prediction, and data is the language of history. So, data describes history, and then you use mathematics to extrapolate, to extend, the patterns of the past into the future and then make a prediction based off them. Does that make sense?
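The straight-line extrapolation Dave is describing can be written out as a tiny least-squares fit. The monthly sales figures here are invented purely for illustration:

```python
# Fit a straight line to past observations, then extend it forward.
def fit_line(xs, ys):
    """Ordinary least-squares fit: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

# Hypothetical coat sales for months 1..5.
months = [1, 2, 3, 4, 5]
sales = [100, 110, 120, 130, 140]

slope, intercept = fit_line(months, sales)
forecast = slope * 6 + intercept  # extend the pattern to month 6
print(round(forecast))  # 150 for this perfectly linear toy data
```

The arithmetic is the easy part; Dave’s caution is that the forecast is only as good as the assumption that the month-6 world still behaves like months 1 through 5.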

Alexis
Okay, yeah, let me recap quickly. I’m looking at that question again. So, we’ve talked about the idea that probably an observer is often what stands in the way of perfect data. But more than that, we have to figure out whether or not the data that we’re needing is actually there, and whether the data that we’re following is actually the right data to follow. Like the example of that winter coat.

Data Dave
Yes, I would say it this way, which is, we’re asking the question of the data, does the data contain the answer that we’re looking for? Right? Is it plausible to ask this question of the data, right?

Alexis
Okay. Okay. Yeah.

Data Dave
We’re going looking for dolphins. We can’t look for dolphins on a mountain. Right? We’re not going to find data that describes a dolphin on a mountain, and I’m being flippant, but it’s true. We’re looking for a prediction of sales in Chicago. Don’t look at San Francisco data.

Alexis
Right, right.

Data Dave
But yes, I’m being sort of flippant and broad. In the same way, when we’re looking at sales data in Chicago in January, for an annual prediction or against a weather forecast, is it plausible to take this forward? How plausible is it to predict based off of this data? And it’s not that we can’t do the prediction. It’s the fact that we have to put a confidence level on that prediction as well, based off of how well the data of history describes the future and is valid for that prediction.

Alexis
Let’s talk about this from Listener Jo’s perspective, because Jo works for ZoomInfo, not a sponsor. Awesome company. Their essence is to figure out correct information for people, correct data points for people. As far as phone numbers, emails, job titles, companies, stuff like that. So, he’s not often predicting things. He’s just trying to get the most accurate data.

Data Dave
So, I like ZoomInfo. I use ZoomInfo quite a lot, actually. I use it for augmenting data sets, so I use ZoomInfo when I’ve got a card or a person’s data and data about organizations and things like that. So, I use it to augment the data that I’m provided to add to the evidence that they have.

Alexis
I.e. fix when someone has a Mickey Mouse email.

Data Dave
I fix when somebody has a Mickey Mouse email. Potentially yes.

Alexis
Okay, cool.

Data Dave
Exactly. So, I use ZoomInfo data quite a bit. I would come back to the observer. First of all, there’s a latency. If somebody changes their job role or whatever, how quickly does ZoomInfo pick it up? I know ZoomInfo does a really good job, and it’s good data. But you can only take it so far. It can only be so accurate. If you make a decision to leave your job and you resign, it takes a while for the world to know. Is it appropriate for you to expect the data in the world, in the public domain, on ZoomInfo, or anywhere, to know that tomorrow if you resign today? No.

So, what is the expected latency that occurs on this data set? So, there’s always going to be a certain latency.

Alexis
Right, right.

Data Dave
In some of these, because it’s the “How was that event observed?” and “When was it captured?” that creates this data set, there’s always a latency. There’s always inaccuracy. Things change all the time. Company 1 bought Company 2; Company 3 isn’t quite big enough to be on the radar.

There’s all kinds of things: not enough web traffic at company X, and so on, for us to realize that there’s a contact there, or whatever it might be. So, there’s the “What is happening?” that is observable, and how well is it observable, and what confidence can you therefore put on the data?

Alexis
I guess my best example of this would be, like, I just moved and so I have a new address. And so right now, my credit card company has the wrong address for me because I haven’t changed it yet. That doesn’t make their data bad, right? It just means that there’s the latency. There’s some time where I have to fix it.

Data Dave
That’s right. Exactly. The data isn’t bad, but it isn’t correct either. But you can’t blame the data. You have to blame Alexis.

Alexis
Yeah, it’s my fault. It really is. I really need to do that.

Data Dave
That might throw people off, and so the law of large numbers starts to come into play when you use this data. You start to say, well, yes, Alexis’s address is wrong. In the next month, that will be corrected, and that will be good again. But somebody else’s address is right this month that was wrong last month. So, if I’ve got 1000 addresses, then it doesn’t matter, and I’ve still got the same prediction.

But you’ve got to treat the data with the level of confidence that it deserves, which starts at, “Yes, I’m going to pick 1000 people in your local area.”

And how far did you move? Right? It’s questions like that that says, “Did you move all the way across country? Did you move down the road? What is this data good for? How much data do I need to eliminate the errors in it or to make the errors less relevant to the prediction that I’m trying to make?”

If I’m trying to predict where Alexis is going to go last thing at night when she goes home from the theatre, then the data is incorrect. If I’m trying to predict sales in the state where you live, the fact that your address is slightly wrong is irrelevant, right? What is the level of accuracy of the prediction, and what is the level of precision in the data? Does that make sense?
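A toy simulation makes Dave’s point concrete. All the numbers here are made up: with 1000 records, a handful of stale addresses barely moves a state-level aggregate, while a prediction about any one of those individuals can be completely wrong.

```python
import random

random.seed(42)

# 1000 hypothetical customer records, all recorded as living in Texas.
records = [{"id": i, "state": "TX"} for i in range(1000)]

# Suppose 5 records are stale: these people moved out of state, but the
# data hasn't caught up yet (Alexis among them).
for rec in random.sample(records, 5):
    rec["state"] = "OK"  # their true state; the recorded "TX" is now wrong

# The recorded data still says 1000 Texans; reality is 995.
true_tx_share = sum(r["state"] == "TX" for r in records) / len(records)
print(true_tx_share)  # 0.995: the state-level picture is barely affected
```

For the individual-level question ("where does this one customer live?"), each of those 5 records is 100% wrong; for the aggregate question, the error is half a percent. Same data, different confidence, depending on the question asked of it.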

Alexis
I love that, and that leads perfectly into the last part of Jo’s question, which is, “Is perfect data even something worth trying to strive for?”

Data Dave
So Jo, I really love the depth of this question because it’s really good. Philosophically, I would say what we should be striving for is to make our observations as accurate as possible and as real as possible.

So, going back to the checkout counter, don’t force somebody to enter wrong data. Observe the event accurately and capture the accuracy of that event. That makes your data correct, though it might be sparse or not filled out. Now it’s telling you a better truth, shall we say. Then we’ve got to temper our prediction, and temper the questions that we ask, with the confidence that the data actually contains the answer that we’re looking for. If we look at a cadre of data and ask it an invalid question, “Do dolphins live on mountains? Is there a dolphin on this mountain?” No, there is not a dolphin on that mountain. Any evidence that you find is probably false, right? We’ve asked a bad question of the cadre of data that we have.

Now you might get the right answer, which is, “No, there’s no dolphins on this mountain,” but it was kind of a stupid question.

Alexis
I just, but it’s a stupid question. [laughing]

Data Dave
If we ask what animals live on this mountain, expecting to find dolphins is probably bad. So, how do we temper that? Is the accuracy of the data valid? How do we move that? So, is it really perfect data? Let’s make the observations as accurate as possible. Let’s make sure the data describes what we think it describes.

Have we got time for another story?

Alexis
I think we do.

Data Dave
Okay, so another story. I was working with a company some years ago, and we were doing the create-customer process: “How do you create a customer in your ERP financial systems?” And so, we sat down with the head of sales, or head of sales ops, I don’t remember, and said, “Okay, what triggers creating a customer?” And he said, as he should, “Well, when we close a deal. When we close an opportunity. Then we create a customer, because that prospect becomes a customer.”

Sounds perfectly reasonable. Yeah. Excellent. We said, “Great, okay, let me look at the data.” So, I looked at the Salesforce automation data. I looked at the ERP data, and I said, “Okay, I can’t correlate that. I can’t see any correlation between when this prospect closed and when this supposed customer was created. There’s no correlation.”

He said, “Ohh, that’s weird, because that’s when we do it.”

I said, “Well, how do you do it? How exactly do you do it?”

“Well, I honestly don’t know. Mary is our person that’s responsible for that.”

Alexis
There’s always a Mary. I like that.

Data Dave
There’s always a Mary, right? There’s always Mary, and Mary’s fabulous. I love Mary. Yeah.

Alexis
Sometimes I’m Mary if I’m being honest. So, I like it.

Data Dave
Exactly. People do what’s right. They make things work, and it’s fabulous.

So I said, “Okay, well, let’s talk to Mary.” So we got Mary on the phone and said, “So, when do you create a customer?”

She said, “Ohh, I create a customer when the salesperson calls me and tells me he’s closed the deal.” So now we’ve got two people in this event. I’ve got the salesperson, and I’ve got Mary. So, Mary is dependent on a manual email or manual notification from the salesperson to say, “I’ve closed this deal.”

So, now we’ve got a breakdown in the data. If Mary received an email on Friday and didn’t see it till Monday, I’ve now got a lack of correlation. If the salesperson closed it on Wednesday but didn’t send the email till Monday, and Mary gets to it on Wednesday, now I’ve got a week’s delay.

What if the salesperson mistypes something in it? Now I’ve got a bad correlation. What if Mary mistypes something, or Mary misreads? So, we went further, and she said, “Well, and this isn’t good, right? Because sometimes the salesperson forgets to tell me. I’ve got examples where the salesperson calls me and says, ‘Where’s my commission? Why haven’t I been paid on the sale?’” And Mary goes, “Huh? I never created them as a customer.”

And the salesperson actually says “Ohh, I forgot to tell you, we closed the account.”

So now we’ve got the customer not being created because Mary was never told that it closed, which means we never sent an invoice, which means we never created a customer, which means the whole thing just falls apart.

And again, why? Because the event wasn’t captured and the observation wasn’t there, and now you’ve got a sales leader who is describing a business that isn’t occurring. Your data does not describe the business that you think you have. So now we get into what’s in the way of good data. Well, it’s the observation. The observer isn’t doing it, and the data that we think we have doesn’t describe the business that we have, let alone the business that we think we might have.

Alexis
Oh, my goodness.

Data Dave
So now you see this breakdown, and then we’ve got the audacity to ask that data what we should do next.

Alexis
The data that mostly is there because somebody didn’t get a check.

Data Dave
Right. The data that is describing a business that you don’t have, but then we’re asking it what we should do next and expecting it to give us the right answer. So, we’ve got to temper that. So, if you look forward, it’s, “Are we observing actions correctly and capturing them properly?”

Alexis
Jo, just for the record, did put a little caveat on his question when he originally submitted it, and he said, “I know the answer to this question, but I want to hear Data Dave’s perspective,” and I just want to put that out there for our listeners. I think Jo probably was thinking the same thing, but he wanted to hear what Data Dave had to say.

But if you have a question for Data Dave, email us at talktech@d3clarity.com or send us a question on our website, d3clarity.com.

Other than that, I hope everyone has a great day. And Dave, thank you so much for teaching me about this topic of confidence and history and mathematics and data and so much more. I feel like I learned so much today… and dolphins.

Data Dave
Yeah. So thank you, Jo. Fabulous question. And connect with us on LinkedIn as well. That would be fabulous. Let’s continue this dialogue. I would love to enter into a dialogue if you agree or disagree with what I said. I know I rambled a little bit and probably went in some interesting directions, and digressed a little bit, but I would be more than happy to enter into a dialogue as to whether you agree or disagree with the points that I’ve put forward. And thank you. Thank you, Alexis.

Alexis
Awesome. Thanks everyone. Thanks Data Dave. Bye.

Ask Data Dave!

Listener questions are the best.
Ask Data Dave any question you have about all things data, all things cloud, or all things technology.
We'll be happy to answer your question on the podcast.

We will never sell, share or misuse your personal information.
