Tune in for our third episode with Piethein Strengholt from Microsoft
3# Piethein Strengholt – The art of data management at scale
Data management is one of the biggest barriers for enterprises to create impact with AI. Simply put, companies often struggle to stay in control of, and get value out of, their data. In this episode of the AI Change Makers podcast, we talk to Piethein Strengholt, who currently works as a cloud solution architect at Microsoft, about what these typical data management challenges look like and how you can go about solving them. Piethein has acquired extensive knowledge and experience on the topic over the years, especially during his former role designing and implementing data mesh architectures at ABN AMRO. Based on this, he recently published the book “Data Management at Scale”, a practical guide to building a modern, scalable data landscape. And there are a lot of factors to be considered! To start exploring this topic, the most common challenges, their solutions, and best practices for doing data management right, tune in now!
Episode transcript: 3# Piethein Strengholt – The art of data management at scale
Brought to you by GAIn, this is the AI Change Makers Podcast. My name is Wouter Huygen and on this show, I talk to business leaders about how they create industry breakthroughs with AI.
In this episode I had the pleasure of talking to Piethein Strengholt about data management. Piethein was a principal data architect at ABN AMRO and is now a cloud solution architect at Microsoft. He recently published a book on how to do data management at scale, in which he shares his knowledge and experience with designing and implementing data mesh architectures at ABN AMRO. Personally I’m particularly fascinated by the topic of data management for a couple of reasons. First, it’s one of the biggest barriers, if not the biggest, to creating impact from AI at scale for many organizations. Any business that has embarked on building AI solutions will confirm the many challenges they have with respect to the availability and quality of their data, and also the pitfalls that teams run into when building data pipelines for their AI applications. So getting data management right is really a big deal. The second reason I’m excited about this is that there has been a recent surge in new paradigms in data management. Exploring these and translating them into practical use cases provides an exciting source of innovation, and a whole new level of quality and speed for AI deployment. So, without further ado, let’s get into it.
Wouter: Piethein, welcome to the AI Change Makers Podcast. I’m super excited to talk to you about data management. In this podcast we talk about breakthroughs in AI, but many organizations that have taken their first steps in developing AI applications now want to sort of industrialize AI. They run into a major bottleneck, which is data management. So I’m very keen to talk to you and understand what the typical challenges are when it comes to enterprise data management, but also how you look at new paradigms for solving those challenges and how to make it work.
Piethein: Makes sense, thank you for having me here.
Wouter: So let’s start with a little bit about you and your recently published book “Data Management at Scale”. I’m curious, why did you write this book?
Piethein: Very good question! Well, I gave so many presentations in the period I worked for ABN AMRO. Many people came up to me after presentations with questions like ‘where do you get the knowledge from’ and ‘why don’t you write a book about this?’. So after the fifth or sixth person asked me the same question, I indeed started wondering. What the heck, why isn’t there a modern book about data management? For me this was the starting point of a new journey of indeed writing a book.
Wouter: There are so many buzzwords when it comes to data: data management, master data management, data governance, data architecture… What is data management? How do you define data management?
Piethein: Good question, well… The DAMA (Global Data Management) Community, I have it here in front of me, they have a very nice definition. Data management is the development, execution, and supervision of plans, policies, programs, and practices that deliver, control, protect, and enhance the value of data and information assets throughout their lifecycles. Well, for me it’s just staying in control of your data and getting the most value out of it. I think that’s the shortest definition I can give you.
Wouter: So in that sense it’s a very broad concept?
Piethein: Indeed, so there are lots of disciplines you need to take into account. Data security, for example; master data management, which you just mentioned; data integration; interoperability; the data distribution aspect; and new things like data meshes are all part of data management.
Wouter: All the things you need to do as an organization to manage your data, surprise surprise. What about the scale part? Why do you talk about data management at scale? Wasn’t it at scale before, or..?
Piethein: What I mostly see within large enterprises is that they struggle to get value out of their data. I think the size at which those complex organizations work makes a difference, and that is why the word ‘scale’ is applied.
Wouter: Alright so it’s about doing this at scale within a large organization?
Piethein: Indeed. I think very small companies might like the book, but lots of the practices and patterns may not apply to them.
Wouter: So it’s not about the volume or the scale of the data itself, it’s more about…
Piethein: …the complexity of the organization. The size of applications you have, the large number of people you have within the organization.
Wouter: And obviously there is a reason, like you mentioned, why you wrote this book, which means you ran into the typical challenges related to data management. What are those typical challenges?
Piethein: Most typically three areas. On the sourcing side we most typically have challenges like data quality, master data management, how do you extract the data from these complex, transactional, operational systems? I think this is one point. Then the second part is the whole distribution and integration of data and then the last part is how to turn data into value. So, advanced analytics, business intelligence, machine learning and these kind of concepts come into play. So these three areas I mostly define.
Wouter: Okay, and if we go over them one by one. The first one, the source of where data is created, what sort of challenges do most organizations run into there?
Piethein: Most typically companies have lots of applications, and those applications are either outdated or not very well documented. I think that’s one of the first problems. How to interpret the data, how to get the data out of these systems, I think that’s the first part. Often you also see that the data quality is low or the data is polluted, because these systems are outdated and not so accurate anymore. There’s a lot of work to tune the data. Master data management is also often a concern. Large organizations, for example, have multiple customer administrations. And from a consuming point of view, you may want to have one single version of the truth for all those consumers, so then you first need to integrate all those different customer administrations, which might have been built for different purposes. That’s a lot of integration work, so that’s another area.
Wouter: Why is this then, because you might say well these are often well established organizations that have their systems running for years, decades even sometimes, without a problem. So why is this then now becoming an issue?
Piethein: I think you want to combine the data, not only from one particular system, but from many different systems. You need to extract data from all those different systems. This is, I think, a crucial part. And what I also see is that vendors have a crucial role to play: delivering data that is ready to be read, so systems which deliver their data easily.
Wouter: So it’s like the scale at which data is sort of consumed, demanded, used in an organization has increased over the past decade or so. Much more than before you might say, which basically brings this issue to the surface.
Piethein: Exactly. So what you typically see, when you do for example machine learning, is that you use the data intensively. When you train your model, you constantly read data. And those transactional, operational systems are not designed for spontaneously serving out lots of data. They need to work under certain conditions and guarantee stability, and they’re typically heavily normalized. They use an ACID-compliant database, for example, so these systems are not designed to spontaneously serve out lots of data.
Wouter: And then the second area that you mentioned. Because that’s also where typically organizations have set up their data warehouse to make sure that they pull everything in one place, nicely integrated, so the data can be used for all kinds of purposes. But you also mentioned that as a problem area.
Piethein: Absolutely. When I started my career at ABN AMRO I was surprised how complex these big models had become over the last couple of years. And the target operating model also does not help to scale these complex systems. There’s typically one central team maintaining all these different pipelines and doing all the integration, and they need to understand the context of all these different systems. Everything has to be funneled through this single team or system. I think that in itself does not make it scalable.
Wouter: What are the typical symptoms that you see? That central team becomes a bottleneck?
Piethein: It’s often a stovepipe. It’s also a key risk or single point of failure within your organization. And the common model is often complex, or a problem in itself.
Wouter: And by common model you mean?
Piethein: So one single data model is often used within these enterprise data warehouse systems.
Wouter: Which basically stipulates how we define our data in our organization, throughout the entire organization.
Piethein: That surprises me, because often the data, and the way you would like to use it, is unique to a particular context. So the uniqueness of context within your organization often surprises me. Take for example the term ‘household’. If you ask the stakeholders or the businesspeople from the central customer system, the administrators of clients, they take for example the local township as the definition of people who are grouped and live together. If you take marketing and ask them for a definition of a household, they look at the people who actually live together.
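To make the ‘household’ example concrete, here is a minimal sketch of how two domains could group the same people differently. It is an illustration only, not something from the conversation: the records, the field names `registered_address` and `living_address`, and the `households` helper are all hypothetical.

```python
from collections import defaultdict

# Hypothetical person records; both address fields are assumptions for illustration.
PEOPLE = [
    {"name": "Anna", "registered_address": "Main St 1", "living_address": "Main St 1"},
    {"name": "Ben", "registered_address": "Main St 1", "living_address": "Canal Rd 9"},
    {"name": "Cleo", "registered_address": "Canal Rd 9", "living_address": "Canal Rd 9"},
]

def households(people, key):
    """Group people into households using the address field a domain cares about."""
    groups = defaultdict(list)
    for person in people:
        groups[person[key]].append(person["name"])
    return dict(groups)

# Customer administration: a household is whoever is registered at the same address.
admin_view = households(PEOPLE, "registered_address")
# Marketing: a household is whoever actually lives together.
marketing_view = households(PEOPLE, "living_address")

print(admin_view)
print(marketing_view)
```

Both views are valid within their own domain, which is why a single enterprise-wide definition of ‘household’ is hard to agree on.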
Wouter: And why is that then an issue for having sort of one central team with one global enterprise-like model. A bit naive maybe, but you could also say I have two names for a household, I just give them two different names and have two definitions and we live by two definitions. You could just add both definitions to…
Piethein: You could do that, but I think that undermines the whole idea of the enterprise data warehouse, where you have one single version of the truth stored within.
Wouter: Then you end up with one model which basically consists of…
Piethein: …all the different languages within the organization. But then I think another problem is the single integration layer: you always need to integrate the data from the different source systems first in order to create value at a later stage. And I think this also creates a stovepipe, because the more systems you add, the more coupling points you will see in such a data warehouse system, and the more complex it will then be to pull new data out of the system.
Wouter: And by coupling points… What do you mean by coupling points?
Piethein: So it’s a dependency between all the different data. The more sources you add, typically, it’s an exponential number. People maybe who listen to this podcast they should look up “big ball of mud” and there’s a nice diagram on the internet you can find. But the more data you will add, the more tables you will see, the more ETL scripts. So basically everything is tied and coupled.
Wouter: And why is that a problem?
Piethein: Because of the complexity. Changing one table could cause a rippling effect. Before you know it you start changing all your tables, just for the sake of adding one new system.
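The rippling effect can be pictured as a walk over a table dependency graph. The sketch below is a hedged illustration rather than anything described in the conversation: the table names and the `DEPENDENTS` mapping are invented, and `affected_tables` is a hypothetical helper that lists everything downstream of one changed table.

```python
from collections import deque

# Hypothetical integration-layer tables: each key feeds the tables in its list.
DEPENDENTS = {
    "customer": ["account", "household"],
    "account": ["transaction", "balance_report"],
    "household": ["marketing_segment"],
    "transaction": ["balance_report"],
    "balance_report": [],
    "marketing_segment": [],
}

def affected_tables(changed, dependents):
    """Return every table downstream of `changed`, found breadth-first."""
    seen = set()
    queue = deque([changed])
    while queue:
        table = queue.popleft()
        for child in dependents.get(table, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return sorted(seen)

print(affected_tables("customer", DEPENDENTS))
print(affected_tables("household", DEPENDENTS))
```

In this toy graph a change to `customer` touches every other table, while a change to `household` touches only one, which is the difference between a tightly coupled core table and a leaf.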
Wouter: So if someone changes something in the source system, in the source database, then the whole value chain breaks.
Piethein: No, not necessarily, but if you need to make a fundamental change, for example within the integration layer of the data warehouse system, it could cause a rippling effect, because all these different tables sit in this integration layer and have dependencies and references.
Wouter: But I think you also describe in your book that in certain cases the classical way of building a data warehouse means that we just pull out the source data and pull it towards a data lake and from there do all the ETL and integration. When it’s just making a raw copy from the source system to your data lake, which is also tight coupling, it also creates a dependency with changes in the source system.
Piethein: Absolutely, so when people start to change their original source systems, their data pipeline will look different and then the whole pipeline will break.
Wouter: You could say then just make sure that data delivery is solid.
Piethein: Yes, make it stable and make a proper extraction. Yeah, we will probably come to that later. I think another problem of the data warehousing system is that there’s typically one central piece of technology sitting underneath it, a relational database. If you nowadays look at the consuming side, there’s a large variation in the ways consumers process and would like to use the data. They would maybe like to use a graph database for social network analysis, and this is a different type of technology than the one underneath the data warehousing system. So in any case, duplicating and distributing that data also undermines the whole concept of an enterprise data warehouse, where all the data sits together in one single store.
Wouter: Right, so that’s your third area then, basically the consumer side, where you see that there is an increased variety in the way data is being used.
Piethein: Indeed. Used… processed…
Wouter: So there is a new demand which poses new requirements for your data architecture. That’s the third area where you see new challenges arising.
Piethein: Over the last couple of years we have seen a tremendous number of new tools, analytical frameworks, methods, techniques to use data.
Wouter: Are some of the challenges well recognized by now?
Wouter: Most organizations realized that there is an increased need for sort of proper data management because of the rush towards advanced analytics and AI. This becomes the bottleneck. So then what?
Piethein: For this whole new architecture, we came up with an enterprise-scale architecture which has four fundamental pillars. One critical area you need to consider is data domains. You need to find the logical boundaries within your enterprise and its architecture. Often I see that this is the most difficult part. So what is that granularity? Is it an application, a business capability, an organizational unit?
Wouter: Why is this the first pillar?
Piethein: Because you also need to set the responsibilities and accountability straight within the organization and it is strongly related to the data governance. Data ownership often needs to be aligned with something, is it the application, the domain or the business capability?
Wouter: In the previous setup you would have a central team responsible for everything, but they had neither knowledge of nor a say about the data. And in your view the new paradigm puts the responsibility more at the source: the domain level.
Piethein: More on the data level. So in the past we mostly looked at systems and applications and now we put the data more centrally and for that you need to find logical boundaries and know where to put the responsibility for the data. Of course you could look at the systems which often deliver particular business value so maybe then it is more natural to go one more level up and look at the business areas and the business capabilities and to use those as a boundary for the responsibilities.
Wouter: So do you have an example because from practice I know that this is not a trivial question. So maybe just to clarify before we go there. You describe a domain as having a harmonized language within the domain.
Piethein: Yes, one common area of interest. Most typically they by nature also speak the same language. The same people work on the same objectives.
Wouter: They define household the same way and there may be another domain with a different household.
Piethein: If you see problems in language you know for sure it is a different domain.
Wouter: Let’s take an example. Suppose you have a global manufacturing organization that operates in 50 countries. They do more or less the same across the globe, but obviously they have grown through acquisitions, organically, etc., so they have country organizations and those will have different systems. In one area they might have SAP and in others something different. From a business perspective they do the same thing. They manufacture the same products, they have the same business model, etc. How would you define domains in such a case? To clarify, you could say it is logical to see countries as domains, but you could also do it by function, and say marketing, production or sales are the domains and we let them cross country boundaries.
Piethein: This is where business architecture comes in. So look at your business capabilities. If across all of the countries or functions they speak the same language and pursue the same objectives, is it then feasible or efficient to have multiple instances of the same business capability across your organization? There could be different reasons for it: you could want to give some teams or people more autonomy, or they operate the business capabilities in a slightly different way. These could be arguments to still have this federation applied. But you could also argue that if the objectives are the same, why not instantiate a single business capability and share responsibility.
Wouter: And what do you mean by business capability in this case?
Piethein: A business capability has a technological aspect, people are involved and there is some data.
Wouter: For this example you would need to harmonize the different systems?
Piethein: Yes, for instance the ERP system but also the processes around it and the data that comes with it.
Wouter: Coming back to your pillars. This is the first one: moving from centralized to decentralized data domains.
Piethein: Critical in this respect are the data products of those data domains: how do you make the data available from these domains? So do not make the data available as raw assets, tightly coupled with your operational systems. I think lots of work needs to go into making the data ‘read-optimized’, so ready for consumption. This is important because if you make the data available poorly, with data quality problems, then all these different consumer domains will be confronted with the same problem. You then see the same work repeated in all these different domains once we step away from this single model or enterprise data warehouse system and make it much more federated.
Wouter: So you ask the domains to take ownership for delivering that data as a product. So it is no longer a central team that says ‘give me all your data and we’ll make nice products out of it’, but you give the domains the responsibility and obligation to provide data as a product.
Piethein: So all these different products will be built, decentrally, by these teams. It is not just the data, but also the metadata so you need to think about a central taxonomy you want to have within your organization and that everyone understands each other’s context. Because we no longer will integrate all data into a single model and because of this I think it is crucial to think of ways that you can help teams to understand each other’s data.
Wouter: And the third area?
Piethein: The third area is the data platforms. You need to have underlying technologies in order to distribute all of the data and to envision your metadata repositories, and you need to standardize ways of communicating, so protocols for example. Within the domains they would also like to leverage certain technologies, so you need to think of at least a platform strategy. If you don’t come up with a data platform strategy, you take a big risk of proliferation: each domain then starts to explore different technologies that are maybe incompatible with each other. Then the whole distribution becomes a big problem.
Wouter: What makes it a platform? You could interpret this as having one central place of having decentralized people to put their data. But then you would still have a central platform again. In your book you talk about platform as a service. What makes it different from having a centrally managed environment where people can upload their data?
Piethein: The platform is required to distribute the data, but not as a single team that is responsible for distributing all data. You want to make it self-service, so domain teams can distribute data themselves.
Wouter: So there is a central architecture or data platform, and teams get access and their own environment within the platform where they can provision their data, and everyone else gets access to it.
Piethein: Yes, it is basically a data management platform so you need to adhere to all the things we discussed before. It has to be a secure environment, a governed environment, you want to know where data originated from, you want to capture the metadata, you want to know the schemas of these different sources. So that platform as a whole needs to do a lot of different things.
Wouter: So you put the responsibility for delivering that data at the federated, decentralized teams, but they still provision their data to a platform that is accessible throughout the organization.
Piethein: This naturally brings me to the last point: data governance/community for operating this platform for all these different domains. So you need to think of strategies on how do you envision this, how much responsibility you would like to give to the different domains and how much mandate you would like to give to the central team operating the platform and the data quality controls.
Wouter: So in essence, the shift from where we are coming and where you advocate for going towards, is from a centralized approach to data management, to a decentralized approach where you give decentral teams divided in the right way, through proper domain definitions, the responsibility to deliver data in a harmonized and decentral way. Making use of all kinds of standards: technology, governance etc. Going from centralized, it is always a tradeoff.
Piethein: I think today’s large organizations behave as an ecosystem, so they no longer behave as a central organization. Organizational boundaries are also very much unclear these days. Modern banks work together with fintechs, collaborate with external parties, use data, buy and sell data, and no longer operate within their own organizational boundaries. They very much operate within an ecosystem which is by itself much more distributed.
Wouter: Is it then correct to say that in a centralized way you would have, as an organization, full control over your data, the quality, but also the use of the data, by having it centrally? That doesn’t work, so you decentralize, giving autonomy back to the provider. But to safeguard the way data is used, to make sure it doesn’t become a new kind of spaghetti mess, you need to have increased emphasis on setting those standards and having the right governance, tools and principles.
Piethein: Spot on. This brings me also more towards a hub-spoke model. Rather than allowing all the domains to distribute data directly themselves, they for example first speak to that central community or authority, do an intake, assess the data quality, look at the metadata, ensure the pipeline is properly scripted and stable. And then from there consumption can start.
Wouter: There are two questions that come to mind. One, how do you know this is going to work this time? It’s a new concept. What are the early signs of proof of these new architectural designs? Are there any organizations that are front running this and provide proof points?
Piethein: For sure. I worked at ABN AMRO where we envisioned this, so I know this concept is working. But also, if you now look on the internet, things like data mesh are so popular, because enterprise data warehouses and data lakes are hard to operate and to scale.
Wouter: Probably when they came to market, people believed that those were the solutions back then?
Piethein: Yes, but on the other hand I’m also very happy that the whole concept of a data mesh emerged, because I think we were pioneering this concept already for quite some time but we had no name yet to give this new type of data architecture. When I saw the first materials of a data mesh I was so happy, because I could very much recognize the way of thinking.
Wouter: What I also found clarifying is that it’s not like the data lake or the data warehouse as a technology or concept is bad, it’s just used in a different way. It’s not used as a single-point technology for a full organization, but it’s part of your mesh: you can use the data lake in the mesh.
Piethein: Yes, it does not exclude for example a data warehouse. You can still do data warehousing in one or a few domains if you want to combine lots of data and do stable reporting. People might think this domain-oriented architecture is too difficult, but I truly believe that building a complex integrated data warehousing system is often difficult and takes years, and maybe with this new approach each domain can, in a federated fashion, publish their data products relatively easily by themselves. You just need to build that central platform for onboarding all the data and distributing it. Probably this takes time, but that’s more the infrastructure, and therefore cloud players will also have a big role to play here. My belief is that these big cloud players, and that is why I joined Microsoft, will also come with approaches to facilitate this at scale, taking away the pain of building and operating such a platform yourself.
Wouter: So cloud is a major enabler for these principles. The second thought that came to mind is around the change. I can envision that once you have this, that’s the ideal world from a data management perspective. But the tough question is, how do you get there from where you are? You describe that in the old model there was a discrepancy between demand and supply: from a demand perspective there is increased demand because advanced analytics and AI have exploded, and supply can’t keep up. These two worlds are misaligned and disconnected. But in a new model you could say that the tension does not necessarily go away. Going back to the example of a global company: suppose I have an AI application which is super valuable to the business, and I want to build it, scale it and roll it out across the globe. But at this point we don’t have the data managed in a way that allows us to do it. So I want the countries and those domains to provide me with the data. Maybe I don’t want data from one domain, but from multiple domains, because typically the value of AI sits in combining data from different domains. From a demand perspective in this case, an AI application, I would like that data provisioning to happen as fast as possible. But moving to the federated model requires all your domains to make a huge change.
Piethein: Yes, it’s a cultural shift for sure. Also building these read-optimized pipelines, I know from practicing this, it takes time. It’s not that easy. Because you want to make data available in a manner that it’s ready for quick consumption. We do not want to move the pain to all the different data consumers. I think this takes time.
Wouter: And do you have a view on what is then the right approach? You could also say it’s an opportunity to align that need, the demand side with the supply side, or make the value creation, the demand side, be the driver of change. In the domains, as you mention, it requires a cultural change, and people need to view data differently. But the business will say it has other objectives and priorities, whereas it also might create value for the business side. Is there, in your opinion, a way where you can align the incentive for value creation and doing the required hard work to bring the data to market?
Piethein: I think if you do this properly, it will save you time. Let’s look at an example: a transactional system these days has tons of interfaces, say from a central customer administration. In most companies you will see there is one central or multiple customer administrations. Over the years they have designed tens or maybe hundreds of interfaces. All these interfaces need to be operated and maintained, and this is a lot of work and also keeps you away from making practical changes to your applications. Because if you start modifying your application you also need to ensure that all these interfaces are guaranteed. Stepping away from this model and envisioning one single interface towards all these different domains, with read-optimized data, will help you very much because you are entirely decoupled from all your different consumers.
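As a back-of-the-envelope illustration of the interface argument, the sketch below counts what providers have to maintain under each model. The `LANDSCAPE` mapping and its system names are invented for the example, and both counting functions are hypothetical helpers, not something described in the conversation.

```python
# Hypothetical landscape: each providing system mapped to the systems that consume it.
LANDSCAPE = {
    "customer_administration": ["billing", "marketing", "risk", "reporting"],
    "payments": ["risk", "reporting", "fraud_detection"],
}

def point_to_point_count(landscape):
    """Bespoke interfaces: one per (provider, consumer) pair."""
    return sum(len(consumers) for consumers in landscape.values())

def single_interface_count(landscape):
    """One read-optimized interface per provider that has at least one consumer."""
    return sum(1 for consumers in landscape.values() if consumers)

print(point_to_point_count(LANDSCAPE))    # 7
print(single_interface_count(LANDSCAPE))  # 2
```

Adding a new consumer raises the first number but leaves the second unchanged, which is the decoupling Piethein points to.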
Wouter: That triggers another remark, one that you also mention in the book. You say that the operational use and analytical use of data actually merge. Traditionally these were different worlds, where we pulled the data out of the source for BI, but in a modern data mesh architecture there is no distinction anymore.
Piethein: Yes, all these applications will start communicating with each other. Transactional, operational, analytical.
Wouter: The future architecture consists of one way in which data is exchanged throughout an enterprise, no matter whether it’s used for transactional purposes or maybe by someone who trains or runs a model on it.
Piethein: I see one single distribution layer sitting between all the different domains. This single distribution layer must consist of several technologies to facilitate indeed a single communication – API communication, event communication, or batches for example. This will allow both operational, transactional and analytical applications all to work together.
Wouter: And that’s a big shift!
Piethein: Yes, I think it’s a big shift, because the data warehousing architecture more or less worked asynchronously, because it was batch oriented. You always had to wait for data to be processed. I think that with a more domain-oriented architecture, data sits everywhere and domains can interact with each other at any speed if they would like to do so.
Wouter: What would be your recommendation for companies that want to start a change journey, that want to transition from legacy data architecture towards this new type of architecture?
Piethein: Start small, don’t try to boil the ocean at once. Start with a single use case and not too many domains. Do a pilot. The cloud by itself has all the capabilities to do this easily, so you can make it self-service and provision easily. You can use policies for automatic enforcement. That will make it a whole lot easier, at least for large companies.
Wouter: What are the minimal viable building blocks that you need?
Piethein: Identify your first domain, work on the data products, make the data available and read-optimized, and ensure that a minimal form of data governance and metadata management is in place. And then think of the consumption. Maybe also add a small distribution layer in between to decouple the data provider and the data consumer.
Wouter: When it comes to data architectural principles, those are by definition fairly high level. Once you start doing things you have to make concrete tradeoffs and choices for technologies, for detailed ways of implementing things. That is where it can become tricky or messy. What are the non-negotiables that need to be clear once you start this? I could imagine that within that grey area some choices might still be okay, whilst unknowingly you might start implementing things which actually no longer adhere to the most important principles.
Piethein: I listed a number of principles in the book, but to give some examples: data ownership is important, because if you don’t assign owners, then accountability and follow-up on data quality problems will be very difficult; data sharing agreements or data contracts are also important, because you need to agree on stable delivery; metadata management is crucial, because you want to describe the data assets with proper metadata, so the schemas for the different interfaces need to be there to allow building stable pipelines on the consumer side. Also capturing only the unique, what I call, ‘golden data’ from these different golden sources. Often you see a complex environment in which data is duplicated. You see that the same data is also distributed to other transactional or operational systems. You want to ensure that you recognize what is truly unique data created in a particular system, and make sure you capture this data and assign proper ownership to it.
Wouter: Thanks Piethein, this was pretty clarifying. If people have more questions they should just read your book, “Data Management at Scale”. Any last remarks from your end or thoughts that you’d like to share with us?
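One way to picture how ownership, a data contract and schema metadata fit together is the small sketch below. It is an assumption-laden illustration and not the approach from the book: the `DataContract` class, its fields, and the `customer_addresses` product are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class DataContract:
    """Hypothetical contract: who owns a data product and which schema it promises."""
    product: str
    owner: str
    schema: dict  # field name -> expected Python type

    def violations(self, record):
        """List every way `record` deviates from the agreed schema."""
        problems = []
        for field, expected in self.schema.items():
            if field not in record:
                problems.append(f"missing field: {field}")
            elif not isinstance(record[field], expected):
                problems.append(f"wrong type for {field}: expected {expected.__name__}")
        for field in record:
            if field not in self.schema:
                problems.append(f"unexpected field: {field}")
        return problems

contract = DataContract(
    product="customer_addresses",
    owner="customer-administration-domain",
    schema={"customer_id": int, "postcode": str},
)

print(contract.violations({"customer_id": 42, "postcode": "1011AB"}))
print(contract.violations({"customer_id": "42"}))
```

A consumer can run such a check at the start of its pipeline, and the `owner` field tells it which domain to contact when a delivery breaks the agreement.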
Piethein: I enjoyed my first podcast. Wouter thanks a lot for having me here. I enjoyed this very much, let’s do it again after a year!
Wouter: Yes, let’s see where the world stands, and whether we have progressed in this area. In my experience developing AI solutions with organizations, this is one of the more pressing challenges that companies have. I also view it as an opportunity to accelerate the value creation from AI, but to do it in a way that gets the foundation right. I’m particularly excited to connect the two.
Piethein: Absolutely, and there’s plenty of fun out there also in the world of data, so don’t be afraid.
Wouter: Thanks Piethein. See you next year then!
Piethein: Thanks Wouter!