Demystifying Data Science

Anthony Scriffignano

former Chief Data Scientist

Dun & Bradstreet

Michael Krigsman

Publisher

CXOTalk

What is data science and how can we use it in business most effectively? Data science is not about the latest shiny tools, but understanding business problems and data. Industry analyst Michael Krigsman, the host of CXOTalk, speaks with one of the world's foremost data scientists to explore this exciting frontier.

51:56

Anthony Scriffignano is the Chief Data Scientist at Dun and Bradstreet. He has over 35 years of experience in information technologies, Big-4 management consulting, and international business. Scriffignano leverages deep data expertise and global relationships to position Dun & Bradstreet with strategic customers, partners, and governments.

Transcript

Michael Krigsman: What in the world is data science?! We hear that term used all the time. I don't think all that many people really, really know. But today, on Episode #274 of CxOTalk, we are speaking with one of the world's top data scientists and, by George, he knows and he's going to tell us today.

I'm Michael Krigsman. I'm an industry analyst, and I'm the host of CxOTalk. Before we begin, I want to say thank you to Livestream, which has been supporting CxOTalk and providing our video streaming infrastructure for the last two years. They're great. Go to Livestream.com/CxOTalk, and they will give you a discount.

Then I have one small favor. Would you please, right now, tell a friend? Tell a friend to join. Like us on Facebook also. Please do that.

Without further ado, it is my unalloyed pleasure to introduce Anthony Scriffignano. He's the chief data scientist at Dun & Bradstreet. He's been on this show before, and it's always a pleasure. Anthony, how are you? Welcome back.

Anthony Scriffignano: Michael, thank you very much for inviting me back. It's always a pleasure to talk to you. I'm looking forward to the conversation.

Michael Krigsman: I am as well. Anthony, let's begin. Very briefly, tell us about Dun & Bradstreet.

Anthony Scriffignano: Dun & Bradstreet is one of the oldest companies in the United States. It's a very global company. We've been around. We're in our 176th year right now, which is pretty rare air. We deal with commercial credit, also sales and marketing, and compliance. We help businesses know about each other all over the world for various purposes: total risk, total opportunity.

Michael Krigsman: Well, you are a company that's been around over 100 years, and that is definitely pretty rare air. You're the chief data scientist. What does that actually mean?

Anthony Scriffignano: Well, I like to joke that I have four thirds to my job because there's no reason why there should only be three thirds. Part of what I do is focused on hardcore data science, understanding emerging capabilities as those emerging capabilities may apply to what we do and to what our customers do. Another part of what I do has to do with working with governments around the world to help them understand the implications of regulations and various uses of data, and sometimes misuses of data. Then I work with our largest customers to help them basically use what we do in a higher order to answer better questions, more complex questions. Then I try to help our company behave in a more global way, so interacting around the world with our network of companies that we deal with and making sure that we are truly behaving in a global way in this very global economy that we're in right now.

Michael Krigsman: Okay, so you've got your hand throughout Dun & Bradstreet's data, business data analysis. Obviously, you're a data business, so you're central to this very large organization's use of data. We're talking today about demystifying data science. Maybe a good place to start to ask you the most basic question, which is, when we talk about data science, what is that; what do we mean?

Anthony Scriffignano: It's a great question, and I'm going to warn you right now that if you ask different people, you will get slightly divergent answers to that. I have a background in hard science, so I tend to approach this in a very scientific way. To me, data science is the science of using data. Science means doing things scientifically: observing the world, understanding what's going on in some way, forming some sort of hypotheses, asking important questions, selecting methodologies to answer those questions. Then collecting data, understanding the data in the context of the question and the bias in the data, then forming some sort of conclusions, and then informing others. Hopefully, lather, rinse, repeat.

Data science is using data in a scientific way to do things that are meaningful to answer questions. You will get other answers to that question that involve lots of mentioning of tools, environments, and technologies. People will talk about artificial intelligence, the Internet of Things, and blockchain. All of those things are, to me, part of the environment that informs these important questions that we ask in the data science community.

Michael Krigsman: You clearly make the distinction between the notion of data science, which is really where I think we should drill down into next, and I was going to say the expression of data science through tools or techniques, but I'm not sure that that's even the right way to describe it.

Anthony Scriffignano: Yeah, I think that's very true. If you asked someone, "What is medicine?" you've asked a doctor, "What is medicine?" the doctor wouldn't start by listing all of the medical procedures and all of the pharmacology. The doctor would start with, "Well, we start with the human body, we start with understanding the human body, and we start with observing something about that body that we want to address or prevent. Then we intervene in certain ways."

It's the same kind of thing. We have to take an approach that says, "Look. We've got all these tools. We've got all these technologies. We're living in a world that is just increasing amounts of data at arguably unmeasurable rates of increase. What do we do with all of this to address these questions that we have, and how do we address those questions in a way that might be reproducible to others?" That's where the science part of it comes in.

Michael Krigsman: Again, I'm basically confused. Rather than me try to say it, would you unconfuse me? Then again, how do you know what I'm confused about, right?

Anthony Scriffignano: I think I have an idea. Maybe an example would be helpful. Our company collects data from all over the world. It's updated millions of times a day. It's updated from integrated supply chains. There are different languages. There are different writing systems. It's an insanely complex data environment, overwhelmingly complex.

Someone might come in and say, "Well, what can we do with all this data?" My answer would be, "Well, what makes you think you need to use this data? What's the question? What's the problem? What's the challenge?" Eventually, we go through this little kabuki dance.

Let's say, for example, that someone says to me, "Well, I want to see if I can understand better if I acquire this other company, which I'm considering acquiring. They seem to have a million customers. I think I have three million customers. I'm pretty sure that means I don't have four million customers when I'm done because we have some customers in common. Let's start there. What's the overlap between us and them?"

Then, as a scientist, I would say, "All right, so when you say 'overlap,' I want to define that term. You mean customers in common." Then they might think about it and say, "Well, yeah, but maybe vendors in common would be helpful too. Maybe we should understand if people who are my competitors are their customers. Maybe we should understand if their counterparties in lawsuits are maybe people that I have relationships with."

We start to get more nuanced about this concept of overlap. Eventually, when I get a good working definition on what they mean by overlap, and I get a good working definition of what this problem is that they're trying to address, then I start to look at the data and say, "All right, now, how can we address that important question that you asked about overlap, defined very carefully in these four or five dimensions with this data that we have--by the way, maybe with other data that's not in this corpus of data--to inform the question that you're asking?"
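Editor's note: as a rough illustration of why the definition of "overlap" matters, the dimensions named above can be sketched as set operations. This is a minimal sketch, not Dun & Bradstreet's method. The `Company` class, its field names, and the function names are hypothetical, and the sketch assumes company names match exactly in both lists, which real data rarely allows without entity resolution.

```python
from dataclasses import dataclass, field


@dataclass
class Company:
    """Hypothetical record of one company's business relationships."""
    customers: set[str] = field(default_factory=set)
    vendors: set[str] = field(default_factory=set)
    competitors: set[str] = field(default_factory=set)
    litigation_counterparties: set[str] = field(default_factory=set)


def overlap_report(us: Company, them: Company) -> dict[str, set[str]]:
    """Return each agreed dimension of 'overlap' as a set of shared names."""
    # Assumption: "people I have relationships with" means our customers or vendors.
    our_relationships = us.customers | us.vendors
    return {
        "customers_in_common": us.customers & them.customers,
        "vendors_in_common": us.vendors & them.vendors,
        "our_competitors_who_are_their_customers": us.competitors & them.customers,
        "their_litigants_we_deal_with": them.litigation_counterparties & our_relationships,
    }


def combined_customer_count(us: Company, them: Company) -> int:
    """Distinct customers after the acquisition: a union, not a sum."""
    return len(us.customers | them.customers)
```

With three million and one million customers, `combined_customer_count` would return something below four million whenever the two customer sets intersect, which is the point made in the acquisition example.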

Then we go at it with all the tools, all the technology, and all the capabilities to answer the question. We don't jump right into the data and the tools, which is easy to do these days. There are lots of tools, and there are lots of data, so it's very easy to do that upside down and just start mucking around in the data before you ask those important questions. Then you'll forget about the vendors, you'll forget about the lawsuits, or you'll forget about the partners. Then you'll be back doing it again, and you'll be sort of repeating your journey into this data. You'll very possibly miss opportunity because it'll take you too long.

Michael Krigsman: Okay. Then to summarize, you begin with some business problem that you believe that the data can help you solve. Then the next step is to determine what data do you have available. Then the next step after that is to figure out how can we slice, dice, use, combine, intersect, de-intersect, and/or mangle in a thousand different ways. Is that fair, what I've just said, or not really?

Anthony Scriffignano: I might add a few steps in there, but I would agree with all of those steps. You start by observing the environment. Then you ask an important question. Then you formulate some hypotheses that are related to that question and those observations of the environment. Then you go understand what other people may have done before you to answer that question, so you're sort of not reinventing the wheel. Then you jump into picking a methodology, defending it, executing it, collecting your data, et cetera.

Yes, it's all those things you said, and then there's these things we all learned in physics class or chemistry class. How do you construct a valid experiment? You've got to do all of those things. A lot of times they're done intuitively.

If you talk to many data scientists, they'll say, "Ah, forget all that. You're making it way too complicated. That's going to take forever." Many of those steps that I just talked about can happen somewhat in an instant with proper experience and with proper exposure to methods, the literature, and what's going on. We try to stay very current in terms of what's happening in the world around us. I can almost do a literature review, not completely in my head, but I know where to go if somebody says, "Look, we're dealing with cybersecurity and the Internet of Things."

I'm not going to start by going to a search engine and trying to figure out what those two terms mean. I've read current literature on what's happening in that space. I have some idea of who the key players are. I have some idea of what the recent evolutions are. So, I'm not starting from this blank slate that we often start in if we were going to study exoplanets or something, you know, a very geeky, scientific thing.

Michael Krigsman: Okay, so you have a palette of techniques that will naturally, and usually pretty quickly, seem appropriate to addressing a particular type of domain problem.

Anthony Scriffignano: At the same time, don't ignore the fact that there might be new things happening that you're not aware of. It's a lot like going to an emergency room. You walk in with some problem, and you don't expect the people that you're talking to to just be on their iPads trying to figure out what these symptoms mean. They have some experience that they bring to the table. They've got a room full of tools and technology, and they've got experts that they can call in to work with them. And, at some point, they might do some research too if it gets tricky. But, because of their experience, their tools, and their comfort and facility with using those tools, they can get you to a solution in a way that maybe you would never get to, and certainly in a way that's faster than you would get there.

Data science is a lot like that. I'm not trying to make it sound overly complex, but there is a tendency these days, when you ask that question, to start talking about tools or to start talking about methods and data. Important, necessary, but not sufficient. You should start with your belief systems, with a guiding question, with an understanding of the fact that you want to do this in some methodological way that you can repeat, that you can explain, that you can defend. That's when the science part comes in, and that's the difference between somebody that just has a toolbox with a bunch of tools in it and somebody that's acting as a practitioner in this space.

Michael Krigsman: I want to talk about how you can define the problems in the right way. First, I want to remind everybody that we're speaking with Anthony Scriffignano, who is the chief data scientist at Dun & Bradstreet.

Right now there is a tweet chat going on using the hashtag #CxOTalk. You can ask Anthony your data science questions. Anthony, how do you formulate the problem in the right way?

Anthony Scriffignano: Let me give you an example. You just mentioned that there's a tweet chat going on. I was trying to listen to that as a data challenge. If somebody said to me, "There's a tweet chat going on. Well, how do they feel? How do these people who are tweet chatting feel? Do they feel positive about what's going on, or do they hate what's going on?" If they hate what's going on, maybe it's early enough in the show that we can kind of change the direction of the boat a little bit, you know, do something to serve them better.

That is a question that involves unstructured data. It involves observing that, well, the tweet chat, there are tweets, right? People are articulating things. It's not just the number of characters in the tweet. There's a bunch of metadata that comes along with every tweet that tells you about the profile of the user and possibly their background. There is information that can be connected to that about what you might have known about them before or what they might have disclosed about themselves.

There are methods that can be used to do very simple sentiment analysis, more complex network analysis, or even more complex inferential computational linguistics. We've got to figure out where we need to go with a question like that. The answer involves, we only have a certain amount of time. We need to do it kind of now, so maybe we're going to have to cut it off in terms of what could be done versus what will be done.

At the same time, we've got to sort of assess whether the answer we're getting is useful enough to be actionable, that we can do something. That would be an example where you wouldn't want to just jump in and start looking at the tweets. Now, if there are only ten tweets, just read the tweets. You don't need tools for that. But, imagine there were 10,000 or a million tweets. Then maybe you need a little help with some data science, some technology, some tools, some practitioner experience around language, sentiment, clustering, and things like that.
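Editor's note: the "very simple sentiment analysis" end of that spectrum can be sketched with a word list. This is a deliberately crude illustration under stated assumptions: the `POSITIVE` and `NEGATIVE` word sets and both function names are made up for this example, and tweets are assumed to arrive as plain strings with no metadata.

```python
import re

# Tiny illustrative word lists; a real lexicon would be far larger.
POSITIVE = {"great", "love", "helpful", "insightful", "excellent"}
NEGATIVE = {"boring", "hate", "confusing", "bad", "useless"}


def score_tweet(text: str) -> int:
    """Positive word count minus negative word count for one tweet."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


def summarize(tweets: list[str]) -> dict[str, int]:
    """Count how many tweets lean positive, negative, or neutral."""
    counts = {"positive": 0, "negative": 0, "neutral": 0}
    for tweet in tweets:
        score = score_tweet(tweet)
        if score > 0:
            counts["positive"] += 1
        elif score < 0:
            counts["negative"] += 1
        else:
            counts["neutral"] += 1
    return counts
```

A sketch like this ignores negation ("not great" scores as positive), sarcasm, and everything in the tweet metadata, which is one reason the more complex network and linguistic methods mentioned above exist.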

Michael Krigsman: One of the differences then between a data scientist and somebody using Excel to perform analysis on a set of data is the scale. Is that an accurate statement?

Anthony Scriffignano: It can be. I use Excel from time-to-time. It's a tool, and it has limitations in terms of the amount of data it can munge and what it can do with that data. Sometimes that's all you need to do. There are times when an experienced automotive mechanic uses a hammer. It's fine. And, I'm not calling Excel a hammer.

Definitely, the volume of data, so we always have these five Vs. We even argue about how many Vs there are. Volume, velocity, veracity, variety, value: those are the ones I look at. When any of those Vs overwhelms the best attempts to deal with them, you have a big data problem. When you have a big data problem, you probably need some data science to help you address that in a methodological way.

Yes, volume is a good place to start, but I wouldn't just say, "Because you have a lot of data, now you have one of these problems." Imagine I just had weather data for the past 100 years and I want to know what the average temperature is in New York. Find the column or the row that tells you that you're in New York and the temperature and do the math. That's not a big, complicated problem.

If somebody came along and said, "Well, wait. Hold on. You just assume that these temperature readings were taken at the same interval and that they were taken with the same equipment, that they had the same kind of bias into it, and that's not really true." You can imagine how we could nuance that answer a little bit by somebody who knew a little bit more about the data, sort of reminding you to ask a few questions you might have forgotten to ask because you were too busy jumping into the data and calculating something.
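Editor's note: the interval assumption can be made concrete with a small sketch. It compares a plain average with a time-weighted average over readings taken at uneven intervals. The function names and the `(hours, temperature)` tuple layout are assumptions for illustration, and the sketch does not address the separate question of equipment bias.

```python
def naive_mean(readings: list[tuple[float, float]]) -> float:
    """Plain average of the temperatures, ignoring when each was taken."""
    return sum(temp for _, temp in readings) / len(readings)


def time_weighted_mean(readings: list[tuple[float, float]]) -> float:
    """Average temperature over time, using the trapezoidal rule.

    Each reading is (hours_since_start, temperature). Needs at least two
    readings taken at different times.
    """
    ordered = sorted(readings)
    area = 0.0
    for (t0, v0), (t1, v1) in zip(ordered, ordered[1:]):
        # Area under the straight line between two consecutive readings.
        area += (t1 - t0) * (v0 + v1) / 2
    return area / (ordered[-1][0] - ordered[0][0])
```

For four hourly readings of 10 degrees followed by one reading of 20 degrees a day later, the two functions disagree, because the plain average gives the cluster of early readings more weight than the time they cover.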

Michael Krigsman: Would you just repeat for a moment the five elements that you mentioned?

Anthony Scriffignano: Hopefully, I can repeat them: volume, velocity, veracity, variety, and value. Volume is how much you have. Variety is how it changes, so sometimes you have string data. Sometimes you have numerical data. Sometimes you have video. That's variety, the different types of data. Velocity is how quickly it's changing over time. Is it changing once every millisecond or once every month? Veracity is the truthiness of the data. Not all data is true, and not even all true data is true all the time. If you looked at the number of people in Times Square, it might have been true at the time you collected it, but it might not be true anymore right now. Then value is really the degree to which the data that you're looking at intersects usefully with the question you're trying to ask.

If we go back to that Twitter data, and I wanted to use the Twitter data to tell me something about how frequently people use Twitter, that'd be a bad question to ask just by looking at the tweet chat about this CxOTalk. You have to kind of zoom out to ask a question like that. That would be a value type conversation.

Michael Krigsman: Okay. Let's be concrete for a second. By the way, we're getting questions from Twitter. Zachary Jeans is asking about data scientists and where they get training. Let's come back to that because I definitely want to talk about that. Yeah, it's an important question.

Let's use an example. I want to know about the people who are engaged in the tweet chat going on right now using the hashtag #CxOTalk.

Anthony Scriffignano: Let's have a little dialog, Michael. What would you like to know about them?

Michael Krigsman: Well, I want to know who they are and what are the patterns.

Anthony Scriffignano: When you say, "Who they are," do you mean their identity as a person or do you mean that you want to understand the archetypical users, the archetypical engaged people?

Michael Krigsman: I want to know the pattern. Are there similarities among these people? What are the common attributes or, in other words, why are they here?

Anthony Scriffignano: Now, those are two different questions. The common attributes, we can start with their Twitter profiles, and we can look at what they've disclosed about themselves, or we could possibly fold that into trying to see if they have a discoverable social media profile, maybe on LinkedIn. I don't want to keep naming platforms, but maybe in some other social media platform where they might have information about their title, the company that they work for, and their educational background, et cetera. I might see if there's a way permissible to use data like that to give you an answer.

You also talked about why they are here, and that's a very different question and one that I would say philosophers have been asking for centuries, right?

Michael Krigsman: [Laughter]

Anthony Scriffignano: Probably millennia. What I would do with that one is I would probably start with a few hypotheses. I would probably start with, "Well, they're here because they're in some way associated with technology," or, "They're here because they're in some way associated with new media," or, "They're here because they are in some way associated with my network or your network." Those are three hypotheses. I would come up with seven, eight, or ten hypotheses, and I would say, "What are the attributes that we can discover that would confirm or refute those hypotheses?" Scale them, do some math, do some curation of data, and I'll bet you, within a very short order, I could come up with a pretty good profile for you.
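Editor's note: one way to picture "attributes that confirm or refute those hypotheses" is to measure how many disclosed profiles contain evidence for each one. This is only a sketch. The hypothesis names, the keyword sets, and the `support` function are invented for illustration, and keyword matching on a bio is a weak stand-in for the curation described above. The network-based hypothesis is left out because it would need follower data rather than bio text.

```python
import re

# Hypothetical keyword evidence for two of the hypotheses mentioned above.
HYPOTHESES = {
    "associated_with_technology": {"cio", "cto", "engineer", "developer", "data"},
    "associated_with_new_media": {"journalist", "podcast", "editor", "media"},
}


def support(bios: list[str], hypotheses: dict[str, set[str]]) -> dict[str, float]:
    """Fraction of bios containing at least one keyword for each hypothesis.

    Expects a non-empty list of bios.
    """
    tokenized = [set(re.findall(r"[a-z]+", bio.lower())) for bio in bios]
    return {
        name: sum(bool(tokens & keywords) for tokens in tokenized) / len(tokenized)
        for name, keywords in hypotheses.items()
    }
```

A low fraction does not refute a hypothesis on its own, since many people disclose little in a bio.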

Would it be perfect? Absolutely not because you don't know if somebody is here who is somehow your competitor and they're watching you so that they can see how they can do what you do better or something like that. We don't know. There will be things that will not be easily discoverable, so we have to also have a conversation about the bias. Michael, let me tell you some things I can't see here.

Then there are also the things I won't see, right? I'm probably not going to have a conversation about the things you have to be very careful about making observations about, maybe race, religion, or something. I would probably steer away from those things. We'd have that conversation. It wouldn't take long. You know because you've had conversations like this with me. Then we'd do a little fancy data science, and I'd give you some answers. Hopefully, they would be actionable.

Michael Krigsman: Okay. You do some fancy data science. But, aside from the volume and the velocity issues of that data stream that's coming in, quite frankly, I can pick out these conclusions you were just describing simply by eyeballing the tweets.

Anthony Scriffignano: In this particular case, we're kind of torturing this example. There's probably not enough data to really require an overly robust analysis. Imagine if you asked that question about every CxOTalk going back several hundred. Imagine if you wanted to then juxtapose that observation with some other show that you consider to be either very similar or very different. You wanted to understand the nuanced interaction between. I can overwhelm it very quickly.

What you just talked about is what's called a heuristic approach. Can we build a data science method that would mirror the behavior of a group of similarly instructed, similarly incented experts? Can I teach this thing I'm building to behave like you, to observe what you would observe, and then go do it a million times because you're busy or you would get tired?

By the way, even if you think you can do that, when you try to do that a thousand times, you start to get tired. You start to maybe remember things that you've already seen. There are types of bias that are introduced even in an expert like yourself when you try to do the same thing over and over and over again. Then there's confirmation bias. You want the answer to be a certain way, so you notice certain things more than other things. People often feel like, "Gee, if I just did it myself, I could do this." When you do it yourself, you introduce types of bias that might be problematic depending on what you're going to do with the answer.

Michael Krigsman: Well, it turns out that computers are pretty good at these kinds of rapid calculations.

Anthony Scriffignano: I always say that it's better to be consistently wrong than inconsistently right. When you design a method, you can at least be consistent. If you don't like the method, if it's consistently wrong, you could tune it. You can work with it over time. But, when you get a bunch of people doing something, certain things, they have different opinions. They have pessimism and optimism. They get tired. They want to do well for you. They want to please you. You've got to control for all these things, and it gets very complicated when people get involved.

Michael Krigsman: What I'm still confused about is how is data science different from any other kind of analytic technique? You've got a body of data. You understand what you have. You understand where you're trying to go [and] the problem that you're trying to solve. You may look for sentiment. In the case of language, you may do lexical parsing. There are all kinds of techniques that you can use. Data science has come to be deified.

Anthony Scriffignano: Or vilified, yeah.

Michael Krigsman: Well, yeah, I guess that's an interesting point: either deified or vilified. If you're Facebook and making a lot of money, then it's pretty darn good for you. On the other hand, if you're suffering from being manipulated.

Anthony Scriffignano: Yeah. There you go. Exactly.

Michael Krigsman: Yeah. Okay, so let's actually talk about this notion of using data science. How do you? How does an organization like Facebook use data science to manipulate perception? That starts to become a pretty interesting and a lot more complex problem than the tweet chat that's going on, analyzing the CxOTalk hashtag.

Anthony Scriffignano: Yeah. I don't want to try to channel my inner Facebook executive to answer that question, but let me answer the question in a different way. There are some roles. Let's get off the term "data scientist" for a second. There are some roles that existed way before we started using this term: analyst, modeler, statistician, methodologist, data steward. I'll use the term data curator. There's a slight difference. These things are all part of the role of being a data scientist today, certainly, but so is being a governance expert, to some extent. Maybe not an expert, but you've got to have more than a passing awareness of what you may use and may not use, where you may and may not use it, and how you can move data.

Problem formulation: We've been talking about that. That's huge right now. Opportunity formulation, if you will, detective, visionary, storyteller, diplomat, these are all skills that are part of that job now. You don't get all those skills with everybody, so you have to understand where the strengths and maybe opportunities are within a cadre of data scientists, just like you would within a cadre of ER physicians or a cadre of diplomats. You've got diplomats that understand technology, and you've got diplomats that don't. Well, you've got technologists that understand diplomacy and technologists that don't. This is really becoming a field that requires a fairly renaissance set of skills to get it right because there are so many tools and there's so much data.

It's not about the tools and the data anymore. Yes, you need to be able to be very good with tools and very good with data. That's kind of table stakes. But, you need to be better at at least one or two other things to be meaningfully useful and differentiated from the crowd in this space these days because a lot of those things are becoming commoditized.

Michael Krigsman: Okay. Let's now take this to the next level. I know that it's not about the tools. But, you've got your data. You have the problem that you're trying to solve. Where do tools like artificial intelligence, for example, come into play? Is it even fair to say that artificial intelligence is a tool, or is it more of an umbrella marketing term? What about terms like machine learning, deep learning, cognitive? Where do those fit into play?

[Laughter] I've just asked about five or ten questions all rolled into one.

Anthony Scriffignano: Yeah, that's about eight CxOTalks right there.

Michael Krigsman: [Laughter]

Anthony Scriffignano: If I could sort of try to summarize that, this term, "artificial intelligence," we need another word. We need a new term because it's just taking on too much, and it's becoming this all-encompassing term that can mean so many things. In general, you have supervised methods, which essentially involve training. You look at a bunch of data. You do a bunch of regressive analysis. You come up with equations that kind of model the data and would have worked had you had them in the past. Then you make the assumption that the future can somehow be compared to the past. You start to apply those models, and you tune them as you go along. Those are supervised methods.
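Editor's note: the supervised pattern described here (fit an equation to past data, then apply it going forward) can be shown with the simplest possible case, a straight line fitted by ordinary least squares. This is a sketch under stated assumptions: one input variable, made-up function names, and no claim that this is the kind of model used in practice.

```python
def fit_line(xs: list[float], ys: list[float]) -> tuple[float, float]:
    """Fit y = slope * x + intercept to past observations by least squares.

    Needs at least two points with different x values.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model: tuple[float, float], x: float) -> float:
    """Apply the fitted line to a new input, assuming the future resembles the past."""
    slope, intercept = model
    return slope * x + intercept
```

The tuning step mentioned above would amount to calling `fit_line` again as new observations arrive. The model stays useful only as long as the assumption that the future can be compared to the past holds.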

A lot of things that we do with language, for example, because language is so complex, involve asking people to sort of look at sample documents and do the digital equivalent of highlighting them in various ways. Then we teach. We train algorithms to understand how that was done. Then we say, "Go forth and do it." Well, what happens when the text changes, right?

Unsupervised methods are kind of the antithesis of that. They're not working with training data. They're working with different types of, you mentioned, deep belief and deep learning. In many ways, there's an intersection, a big intersection there with forming digital hypotheses working with the data and sort of constantly revising those and forming new pathways.

Sometimes you'll hear the term "neuromorphic methods." Those are methods that are designed to model the way we think we think. That would be an interesting diversion there. Then you also have reinforcement methods, which are also part of artificial intelligence. You mentioned cognitive. Cognitive computing is an awesome, amazing approach that really doesn't give you an answer. It kind of walks alongside you, reads everything you would read if you had time, knows everything you would know if you remembered everything you knew before, makes suggestions to you about what you should probably do, watches whether you take those suggestions or not, and then tries to get better at whispering in your ear the next time. What is that? Reinforcement methods are very, very interesting.

Now, I'm going to fold all of that, all of those AI methods on top of something like trying to use artificial intelligence for something like autonomy. Artificial intelligence generally has a goal. It's trying to achieve something: the best chess move or the next best design for the seal on this tank that's got to be in this kind of an environment, that kind of thing.

What happens when we have autonomous devices that have been given goals and then the environment changes in a way that was unanticipated? The AI needs to be able to modify its goal. If we're going to let the AI modify its goal based on a change in the environment, then it's got to have some sort of higher-level understanding of a broader set of goals, or it's going to be doing the equivalent of flipping a coin.

I don't think I want drones and self-driving cars flipping coins, so we're going to probably have to give them guiding principles just like we have when we drive a car or we do something inherently complex like that. We say, "Look. You're supposed to get where you're going. You're supposed to drive safely. And, if an elephant walks out in the middle of the road and you've never seen that before and never contemplated it, you don't just let go of the wheel. You do something by using a higher set of reasoning." We've got to be able to embody that.

AI goal modification is both fascinating and terrifying at the same time. There are some very famous people who have written on this subject and basically said, "Be very afraid." I'm a little afraid, but I think we'll still be able to stay out and ahead for the next generation or so. It's definitely something to think about.

Michael Krigsman: AI goal modification, that's where, I was going to say, it becomes scary because we need to make a leap of faith that the people designing the system A) haven't introduced the kind of bias that would lead to very, very terrible, unintended consequences, and B) haven't co-opted the system for their own personal, organizational, political, nationalistic objectives.

Anthony Scriffignano: Yeah. These are all issues that are very much at the forefront of what data scientists are doing in their day job when they're working in those fields. I love it when people talk about eliminating bias. In almost all cases, when that's your goal, when you set out to eliminate bias, you sort of trade it for some other kind of bias. If nothing else, a bias towards structured data, standardized things, or whatever.

If we use some sort of AI method to figure out who gets parole, who gets arrested, who gets stopped, or who gets the extra super-duper, police-come-into-this-room screening before they get onto the airplane, there are certain guiding principles. We don't want to use certain types of information to make those decisions because it's either unconstitutional, it's unpopular, or it's ill-advised. It might be that a machine, if it weren't given those higher-order principles, might reach the conclusion that those are exactly the types of attributes you want to look at to correlate most with the type of person you're trying to find. You can't do that, you shouldn't do that, and we won't do that.
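As a rough illustration of giving a tool a "guiding principle" like the one described above, here is a minimal Python sketch that strips prohibited attributes from a record before any model sees it. The field names and the `PROHIBITED` set are hypothetical, chosen only for illustration; they are not from the conversation. Dropping a field does not stop other fields from acting as proxies for it, which fits the earlier point that eliminating one bias can trade it for another.

```python
# Hypothetical attribute names; a real list would come from law and policy.
PROHIBITED = frozenset({"religion", "national_origin"})


def allowed_features(record: dict) -> dict:
    """Return a copy of the record without any prohibited attributes."""
    return {key: value for key, value in record.items() if key not in PROHIBITED}


# Hypothetical record for illustration only.
record = {"prior_flags": 2, "ticket_type": "one_way", "religion": "unknown"}
print(allowed_features(record))
```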

How do we teach our tools and our technology like we teach our children? This is a technology that's in its adolescence right now. It's a really good question to ask. Then you bring up implicitly this rush to market, right? If you take forever to answer questions like that, your competitor beats you and they get their product on the market.

I was watching. I was surfing around this morning. In the Japan Times there was an article about a ryokan in Japan. You go into these sort of beautiful, bucolic settings, and they have these magnificent hotels. You walk around in the yukata, and there are shoji screens everywhere. It's a very ancient-feeling environment.

They have self-parking slippers. You can take your slippers off, and it's kind of like a Roomba. It'll find where it's supposed to go. The slippers kind of park themselves. It's one of the companies that's making the self-parking technology for cars. I guess they might have done it as a publicity stunt. The cushions self-park themselves on the tatami mats, and the slippers self-park themselves where they should be.

It's very clever. I'm sure that people find it very amusing to look at. It's brilliant, right? At another level, if I'm in this kind of ancient environment, do I want my slippers walking away from me? Is it really so hard for me to just go put my slippers where they belong?

I think that hopefully that was done with some element of tongue-in-cheek marketing, but is that the best problem that we can solve right now? Let's zoom out. If you're the company that wants to get more customers in your ryokan or you want to demonstrate how well your parking technology works, maybe you say, "No, that's exactly the best bet I can make right now."

I'm not a CMO. Fortunately, I don't have to make decisions like that. Those are really tough decisions. Then the legal people get involved. They say it's probably safer to smash into another slipper than into another car. Let's test it this way.

Michael Krigsman: We have a question from Twitter, a really interesting one from Gus Bekdash, and it relates to this ethical discussion we just were having. Is there, should there be, a set of ethical gateways that are applied to the development of these technologies? You're a data scientist. You're developing these techniques, technologies. How would you like to have the big brother overseer being part of your work, looking over your shoulder because, as a society, we have to be careful about what you're doing, Anthony?

Anthony Scriffignano: You do?

Michael Krigsman: [Laughter]

Anthony Scriffignano: I'm not a lawyer, but I work with them all the time. There are guiding principles in the law, and lawyers don't agree on them. We've been trying. We have constitutions, which set out sort of general principles, and then we write laws that try to follow that.

This is very tricky stuff when you start setting out guiding principles. AI should be explainable. That sounds great. I should be able to understand why the AI agent reached the conclusion that it did.

If I said to you, "Well, you can either have explainable AI or a better auto flight system on the airplane, but you can't have both because some of the methods that it's using to make a decision about whether to change the flap settings or do something with the auto-thrust, it's complex. It's not explainable in English. We're doing this with a model. It's too complex to explain it. If you reduce my actions to only those things that I can explain, then I'm not going to do the best thing. Are you willing to make that choice?"

If it's a robotic surgeon or an auto flight system, I'm probably going to say I need to think about that. If you want to breathe down my neck and say, "You just violated the explainability edict and, therefore, we're going to sue you," then I'm going to be super careful about what I create, and it's going to be a lot more pedantic and a lot more sort of obvious because I'm going to be careful about not getting sued. Then maybe I won't do the best job I can for you.

Michael Krigsman: There is a direct tradeoff between oversight and the ability to innovate, essentially.

Anthony Scriffignano: It doesn't mean, therefore there should be no oversight. There should absolutely be oversight. But, these are tricky problems. We are not done figuring this stuff out yet. There's a whole degree of ethical consideration, and we still have bioethical conversations today. We just have to have more of these in the digital space. We're not done with any of this anywhere. I don't think we've figured out the universally best use for a hammer.

I think it's important, and I'm sorry I forgot the name of the person who asked the question. Please keep asking that question, right? It's important that that question be brought to the forefront and that people who understand what they're talking about sort of advise that the pendulum can swing both ways, and you have to understand what happens if it swings all the way over here, and what happens if it swings all the way over there. Let's not just look for this binary answer because it's not that simple.

Michael Krigsman: What happens when these self-learning systems--? Let's not use that term. As you were describing it earlier, systems that adapt based on changes in the environment, what happens when that sequence magnifies or is multiplied over and over and over and over and over again so that the explainability that you just described becomes, in a practical way, almost impossible because you can't go back, back, back, back a thousand, a hundred, a million iterations ago to see what caused this new path?

Anthony Scriffignano: I think, to some extent, we're already there with the check engine light, right? Our devices tell us things that they then do a very poor job of explaining why they told us, and we just say, "Well, you need a new oxygen sensor," or you need a new this or that, and we start changing things until the light goes out. Right? I know mechanics do a better job than that, but there are situations where the feedback that we're getting was not nuanced enough to tell us really why it happened. That doesn't mean we don't want that feedback.

Back when the Space Shuttle was a thing, there was more than one computer looking at the same parameters and voting on whether we should take off or not. Once in a while, one of the computers votes no and you stop. Then you can usually sort it out, but occasionally there are glitches and gremlins. I think you have to decide how many false positives you want to have, what's the cost of being wrong in saying no versus wrong in saying yes. The good news is none of these are new questions. We're just applying them to a new science.
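The voting idea and the cost tradeoff above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the rule that a single "no" stops the launch follows the description here, not the Shuttle's actual flight software, and the function names, probabilities, and costs are all hypothetical.

```python
from typing import Sequence


def launch_decision(votes: Sequence[bool]) -> bool:
    """Proceed only if every redundant computer votes yes; one 'no' stops."""
    return len(votes) > 0 and all(votes)


def expected_error_cost(p_wrong_no: float, cost_wrong_no: float,
                        p_wrong_yes: float, cost_wrong_yes: float) -> float:
    """Weight each kind of mistake by how likely it is and what it costs."""
    return p_wrong_no * cost_wrong_no + p_wrong_yes * cost_wrong_yes


# Hypothetical votes: one computer disagrees, so the launch is held.
print(launch_decision([True, True, False, True]))

# Hypothetical numbers: wrongly saying no is common but cheap,
# wrongly saying yes is rare but very expensive.
print(expected_error_cost(0.05, 1.0, 0.001, 1000.0))
```

With numbers like these, the rare wrong "yes" dominates the expected cost, which is one way to justify tolerating more false alarms.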

You just did something, which I think is a great first principle, which is, try to avoid anthropomorphizing this technology. When the computer decides. When the computer learns. Technically, it's making a decision, but it's not cognitive in the sense of our human brain making that decision, even if you used a neuromorphic method. I think we need some better words for learning and for synthesis and for decision that apply to autonomous agents, that apply to digital agents. We don't have the right nouns and verbs, and so we're using the ones that apply to human beings, and our devices are getting a lot smarter, and it's getting a lot more dangerous to use human terms to describe these nonhuman devices.

Michael Krigsman: We're almost out of time, and we haven't at all spoken about what's the best way to learn to be a data scientist. What is the best way for organizations to hire data scientists? What is the best way for organizations to manage data scientists to get the best results? What kind of problems are most useful for data science? And, if you want to become a data scientist, let's say you have kids who are interested in becoming data scientists, what should they do?

[Laughter] How's that?

Anthony Scriffignano: Okay. I got: learn, hire, manage, focus their effort, and inspire. That's an awesome list, right?

Let me start with learning. How do you learn all this? I advise a number of academic programs around the world, and one of the questions I always ask, I sort of channel the question that's already been asked in the academic community, which is, "What do we teach today that's going to be relevant in two or three years when these students graduate?" That's a very big question.

Increasingly, my advice is, I get why you have to teach them specific tools. I'm trying not to rattle off a set of tools because then I'm advertising for products, but this language, this environment, or this database system. There are certain favorite tools and environments today, right? Do you need to teach that? Well, I guess you need to teach that. But, if you show me your curriculum and all you're doing is teaching the students how to use those tools, I'm pretty sure that by the time they graduate, those tools will be old. There'll be newer versions of those tools or there'll be other tools. Yeah, they need to know how to use a tool because you can't do anything without using a tool, but it's sort of necessary but not sufficient.

Then the conversation goes to, well, what else do we have to teach them? Well, problem formulation would be nice, something about understanding basic methods like statistical sampling methods and bias. You don't have to be necessarily the world's best mathematician, but you probably should understand basic probability, basic statistics, basic mathematical techniques for doing things, understanding sampling methods, et cetera. There's some element of that.

What about the actual new, emerging capability? Do we need to teach them about the Internet of Things and blockchain? Yeah, there probably should be some sort of general survey course. Where are the new technologies going? We get there. We eventually get to this curriculum approach that has all these different pieces in it.

Then someone comes along and says, "I don't think you need no stinking degree. I think you can just learn all this. There are open source tools. There are things you could go on the Internet." I'm not going to name places, again. "Take courses. Why don't you just learn it yourself?" Well, you might have the discipline to do that.

The danger would be that you don't have a spirit guide. How do you know what you should be learning, what you should go and learn? That's a tricky question. Unless you have someone walking alongside you who is going to say, "Well, if you go down that path, you're really looking at a supervised learning method. Just be careful you don't reach the conclusion that you can machine learn your way out of everything. There are other types of methods."

Then you say, "Well, what types of methods?" Then we have the conversation we had before. Having a spirit guide is pretty important.

How do you hire? I gave you a list before of sort of the before and after requirements. Data steward, analyst, modeler, statistician, methodologist: you probably should have some grounding in one or two of those hardcore things that you've done or can do, but I'm also asking questions around governance. I'm asking questions around storytelling. I'm asking questions about diplomacy.

When I'm generally talking to people that are more senior, I'm interested in, can they argue with me about something in a meaningful way that drives us to a solution. Can they look at a problem they've never seen before and do something more than just talk about how it's not fair and they don't know? Can they abstract what they know to something they've never heard about before? Do they have the ability to do that in a methodological way, in a scientific way, or do they just try this, try that, try this?

You talked about, how do you manage them? Great question. Data science in different organizations has different faces. In some organizations, it's very centralized. You go to the cadre of data scientists. More often now it's not. It's federated. It's all over the organization. You've got shadow data science everywhere. Everybody downloaded some tool, installed it, and thinks they're great because they have that tool.

In the federated environment, fully federated, it's really about helping best practice get spread through the organization and making sure that we are moving forward and not just trying things, and maybe providing some expert advice where it's needed and getting people to have a certain core set of skills. In the very focused environment, it's more about making sure that you don't become so insulated that you really lose touch with what the customers care about, what the organization cares about, what the business problems are. You want to understand where your organization is in that continuum of fully federated to fully centralized.

Are you there on purpose? How do you effectuate change in the direction that's most meaningful for your organization there? For that, by the way, I look at not just the technology and not just the processes, but the people and their mindset as well because it's all part of that equation.

You talked a bit about how do you focus them. The big advice I would give there is, the cost of doing nothing is not nothing. Be very careful that you understand what you're not doing while you're choosing to do what you're doing, and making sure that you're choosing mindfully and that you're choosing meaningfully among these cherished and often very scarce resources to get the best bang for your buck and the most value for your customers and your shareholders.

In terms of how do you inspire them, I talk about this a lot. We need to be learning leaders. You can't just call yourself king and move on. You've got to learn something every day in this field. If you're just learning, then you're being selfish. You ought to be teaching something every day, too.

I think you've got to be very humble. If you want to inspire people who have these awesome capabilities, you've got to give them exciting things to work on. But, you've also got to help them understand that not everything we get to do is super sexy and exciting. Sometimes we have some basic blocking and tackling that we need to do.

You've really got to be, in my opinion, much more servant leader in this environment. You've got to be bringing the skills to the table. You've got to be accessible. You've got to be a good listener. You've also got to inspire by example, and that means learning what you're talking about and not just talking about it.

Michael Krigsman: Wow! Well, this has been a fast 50 minutes. That's for sure. Next time, we need to talk about applications of AI in areas, processes, and functions like marketing, supply chain, accounting, whatever it may be. I think that would be a pretty interesting follow-up to this discussion.

Anthony Scriffignano: I agree, totally. Every time we talk, I think we spawn seven or eight new ideas and wish we talked about that 50 minutes ago. I think this is really, really valuable. I thank the people who raised the questions about education and inspiration. It's inspiring that people think about that.

The one piece of advice I'd leave you with is, don't just wave your hands and say data science. Think about what you mean when you say that and mean it for some reason. Don't just try to channel something. Understand it. Learn it. It's exciting. There's so much to do in this field.

Michael Krigsman: Okay. With that, just before we go, very quickly, I want to read. This is completely separate. I want to read a few sections from the San Jose Mercury News. This is from the police blotter in Atherton, California. If you don't know, Atherton is a very wealthy community in the heart of Silicon Valley. Okay? Anthony, are you ready for this?

Anthony Scriffignano: Yeah, I was just reading that this morning.

Michael Krigsman: Okay. This is from the police blotter--

Anthony Scriffignano: Yes.

Michael Krigsman: --in Atherton. Okay. "A resident worried that a noisy hawk in a tree was in distress. When authorities arrived the hawk was quiet and enjoying dinner. A pedestrian was reported after midnight wearing black pants [laughter] and a white dress shirt." [Laughter] Whatever.

"A man was reported to be lying on the ground possibly writing." Finally, actually two more, "A family reported being followed by a duck who resides on Tuscaloosa Avenue." Last--

Anthony Scriffignano: [Laughter]

Michael Krigsman: --but definitely not least, "A resident reported a large light in the sky. It turned out to be the moon." [Laughter]

On that profundity, I want to thank everybody for watching Episode #274 of CxOTalk. We've been speaking with Anthony Scriffignano, who is the chief data scientist at Dun & Bradstreet.

Again, I'll ask you. Please tell a friend about CxOTalk. Like us on Facebook. Subscribe on YouTube. We'd really appreciate that.

We have another great show next week, next Friday. We are speaking about the role of data and AI in healthcare, drug discovery, and personalized medicine. That's going to be an incredible show. Everybody, thank you so much and have a great day.

Published Date: Jan 26, 2018

Author: Michael Krigsman

Episode ID: 499
