Supercrunch Blog

Dr. Ralph Wirth · June 8, 2017 · Big Data

Big Data, Broken Promise? Our learnings on how to avoid Big Data frustration

Part 4: Elephants and human experts: Infrastructure, tools & data governance

It may be surprising to some readers that the first few articles of our “Big Data, Broken Promise?” series have hardly covered technology topics at all. Is it really that easy to set up a suitable IT infrastructure and all the required tools for data scientists?

Of course, this is not the case: the fact that we have seen many more Big Data initiatives fail due to an unclear understanding of Big Data use cases and data requirements than due to unsuitable IT infrastructure does not mean that technology challenges should be underestimated. Matt Turck’s impressive map of current Big Data technologies clearly illustrates what a diverse and complex field we are dealing with.

Even more remarkable than the sheer number of tools, frameworks, and applications is the pace at which this ecosystem changes. Tools that are hyped in one year may very well be irrelevant one or two years later.

What does this mean for companies that are starting their Big Data journey? First of all, it emphasizes once again that you need a good understanding of your data and of what you want to do with it. For example, technology frameworks and tools that specialize in processing streaming data at very high speed may be totally irrelevant if your data arrives only from time to time, in big chunks. A second lesson: yes, you will need an IT team that is familiar with these new technologies, that is able to configure the right IT platform for your use cases, and that can operate it. Typical profiles to look for are Big Data architects and Big Data engineers.
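To make the distinction above a little more tangible, here is a deliberately simple Python sketch contrasting batch and streaming processing of the same records. It is a toy illustration of the two paradigms, not a recommendation of any particular framework; all names and data are made up.

```python
from typing import Iterable, Iterator

def process_batch(chunk: list[dict]) -> float:
    """Batch style: the whole chunk is available at once, e.g. a nightly file drop."""
    return sum(r["amount"] for r in chunk) / len(chunk)

def process_stream(records: Iterable[dict]) -> Iterator[float]:
    """Streaming style: emit an updated running average as each record arrives."""
    total, count = 0.0, 0
    for r in records:
        total += r["amount"]
        count += 1
        yield total / count

records = [{"amount": 10.0}, {"amount": 20.0}, {"amount": 30.0}]
print(process_batch(records))         # one answer, after the whole chunk has landed
print(list(process_stream(records)))  # an answer after every single record
```

If your data only ever arrives as the nightly chunk, investing in the streaming machinery buys you nothing — which is exactly why the use case has to come before the tool choice.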

Another profile that is often forgotten when setting up a team responsible for Big Data initiatives is the data steward. Any Big Data platform will usually include data from different organizational silos – data of varying granularity, quality, completeness, etc. Data governance and data curation – i.e., decisions on

  • what data goes into the “data lake” and what data stays out,
  • who has access to the data and who does not,
  • how data is tagged and documented

are crucial to making sure that your data lake does not become a data dump. The data steward is a dedicated role that owns these tasks, and while it may seem like an optional investment during the setup phase of your Big Data initiative, we emphasize that this role is a critical success factor. The data steward is not an easy profile to recruit, though. The role requires a thorough understanding of the company’s entire data landscape, of typical data issues, and of the strategic and commercial relevance of use cases. Furthermore, the data steward needs to be a convincing communicator who does not shy away from potential conflicts with “data owners”. This also means that full organizational backing is an absolute requirement to make your data steward a successful one.
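To make these three governance decisions concrete, here is a minimal Python sketch of the kind of catalog entry a data steward might maintain behind them. This is purely illustrative, assuming a simple in-house catalog; all names, fields, and roles are hypothetical and do not refer to any specific tool.

```python
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    """One dataset in the data lake, as a data steward might record it."""
    name: str                      # e.g. "pos_transactions" (hypothetical)
    source_system: str             # originating organizational silo
    admitted: bool                 # decision 1: does this data enter the lake at all?
    owners: list[str]              # accountable "data owners"
    readers: list[str]             # decision 2: roles with read access
    tags: list[str] = field(default_factory=list)  # decision 3: searchable tags
    description: str = ""          # human-readable documentation

def can_read(entry: CatalogEntry, role: str) -> bool:
    """Access check: only admitted data, and only for whitelisted roles."""
    return entry.admitted and (role in entry.readers or role in entry.owners)

# Example: the steward admits a sales dataset, grants access, and documents it.
sales = CatalogEntry(
    name="pos_transactions",
    source_system="retail_erp",
    admitted=True,
    owners=["sales_ops"],
    readers=["data_science", "finance"],
    tags=["sales", "daily", "pii:none"],
    description="Point-of-sale transactions, aggregated daily per store.",
)

assert can_read(sales, "data_science")
assert not can_read(sales, "marketing")
```

Even a lightweight record like this forces the three decisions to be made explicitly and documented in one place — which is most of what separates a data lake from a data dump.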

When it comes to deciding on the level of data curation, we usually suggest a layered model. There are key data assets that require very thorough curation. These data assets, very often at the core of a company’s value creation, have to be thoroughly documented, quality controlled, enriched with suitable metadata (at the file level and ideally also at the variable level), and indexed for easier search. Other data assets that are seldom used may undergo much more relaxed curation processes. We call this approach the layered data curation model (see illustration).

The Layered Data Curation model
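As a rough illustration of how such a layered model could be written down, the Python sketch below maps curation tiers to the obligations described above. The tier names and requirement fields are our own assumptions for this example, not part of any standard; a real model would be tailored to your data landscape.

```python
from enum import Enum

class CurationTier(Enum):
    CORE = "core"          # key assets at the heart of value creation
    STANDARD = "standard"  # regularly used assets, moderate curation
    ARCHIVE = "archive"    # seldom-used assets, relaxed curation

# Obligations per tier, following the layered model described above.
CURATION_REQUIREMENTS = {
    CurationTier.CORE: {
        "documentation": "full",
        "quality_control": True,
        "metadata_level": "variable",  # metadata down to each variable/column
        "search_indexed": True,
    },
    CurationTier.STANDARD: {
        "documentation": "summary",
        "quality_control": True,
        "metadata_level": "file",      # metadata at file level only
        "search_indexed": True,
    },
    CurationTier.ARCHIVE: {
        "documentation": "minimal",
        "quality_control": False,
        "metadata_level": "file",
        "search_indexed": False,
    },
}

def requirements_for(tier: CurationTier) -> dict:
    """Look up what the steward owes for an asset in the given tier."""
    return CURATION_REQUIREMENTS[tier]

print(requirements_for(CurationTier.CORE))
```

The point of the layering is economic: full variable-level metadata and quality control for everything would be prohibitively expensive, so the effort is concentrated where the value is.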

So you see: while Big Data technology may certainly appear challenging and overwhelming, it is once again the human factor that makes the difference.

We are looking forward to hearing about your experiences with Big Data technology and data curation!