Are we still discussing Big Data, or should we start reviewing and monitoring its many tools and platforms? And how should an organization determine the right tool to use as part of its data analytics architecture?
In today’s competitive business environment, Internet of Things devices and information services increasingly produce large amounts of data in disparate structures. Open source and commercial tools continue to appear to deal with the different characteristics of Big Data. As a result, there is an abundance of tools and platforms that analyze Big Data or act as building blocks for such analysis. Reviewing open source tools alone, we came across 300 of them. That is the number that remained after we applied strict filters such as legitimacy of the source, license type, and most recent commit activity. We are not talking about Big Data anymore; we are talking about Big Tools.
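The kind of filtering described above can be pictured with a minimal sketch. Everything specific here is an assumption made for illustration: the `Tool` record, the accepted license list, the one-year commit threshold, and the catalog entries are all hypothetical and are not the criteria or data actually used in our review.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Hypothetical thresholds; the actual review criteria are not specified here.
ACCEPTED_LICENSES = {"Apache-2.0", "MIT", "BSD-3-Clause"}
MAX_COMMIT_AGE = timedelta(days=365)


@dataclass(frozen=True)
class Tool:
    name: str
    license: str
    last_commit: date
    trusted_source: bool  # stand-in for "legitimacy of the source"


def passes_filters(tool: Tool, today: date) -> bool:
    """Return True only if the tool satisfies all three filters."""
    return (
        tool.trusted_source
        and tool.license in ACCEPTED_LICENSES
        and today - tool.last_commit <= MAX_COMMIT_AGE
    )


def shortlist(tools: list, today: date) -> list:
    """Names of the tools that survive the filters, in catalog order."""
    return [t.name for t in tools if passes_filters(t, today)]


# Made-up catalog entries, one failing each filter and one passing all.
CATALOG = [
    Tool("tool-a", "Apache-2.0", date(2017, 5, 1), True),
    Tool("tool-b", "Proprietary", date(2017, 6, 1), True),   # license rejected
    Tool("tool-c", "MIT", date(2014, 1, 15), True),          # stale repository
    Tool("tool-d", "MIT", date(2017, 3, 10), False),         # untrusted source
]
REVIEW_DATE = date(2017, 7, 1)

if __name__ == "__main__":
    print(shortlist(CATALOG, REVIEW_DATE))
```

Passing the review date in explicitly, rather than reading the system clock, keeps the shortlist reproducible when the review is repeated later.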
To extract value from Big Data, an organization should determine the right tool to use as part of its data analytics architecture. The right tool depends on the characteristics of the data to be analyzed and the domain in which the organization operates. The organization then has to train its IT workforce to gain the technical expertise needed to be effective with those tools. Businesses incur costs when they adopt these tools or change their existing source code to run on newer versions; in other words, they take on technical debt. In the Big Tools era, there is no standard for how these tools come together to compose a data analytics architecture. Most of these tools are unknown to the business world, and some were unknown even to us. Apache Beam and Apache SAMOA are good examples. The latest trend in the big data domain is toward providing a level of abstraction over popular data processing platforms. Apache Beam implements its dataflow programming model on multiple processing platforms such as Apache Spark and Apache Flink. Apache SAMOA enables programmers to apply machine learning algorithms to data streams, and applications developed with SAMOA can be executed on Apache Storm and Apache Samza. Moreover, new models and tools continue to emerge at a fast pace in the Big Data domain. There is no established method for tracking the newest developments, particularly for open source tools.
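The abstraction idea mentioned above, where a pipeline is defined once and handed to different execution platforms, can be pictured with a toy sketch. This is not the Apache Beam or Apache SAMOA API: the `Runner` protocol, the two runner classes, and the `execute` helper are hypothetical names invented for illustration, and both "platforms" here are just plain Python loops.

```python
from typing import Callable, Iterable, List, Protocol

Step = Callable[[int], int]


class Runner(Protocol):
    """Hypothetical interface standing in for an execution platform."""

    def run(self, steps: List[Step], data: Iterable[int]) -> List[int]: ...


class ElementwiseRunner:
    """Pushes each record through every step before taking the next record."""

    def run(self, steps: List[Step], data: Iterable[int]) -> List[int]:
        out = []
        for record in data:
            for step in steps:
                record = step(record)
            out.append(record)
        return out


class StagewiseRunner:
    """Applies one step to the whole dataset before moving to the next step."""

    def run(self, steps: List[Step], data: Iterable[int]) -> List[int]:
        records = list(data)
        for step in steps:
            records = [step(r) for r in records]
        return records


# The pipeline is written once, with no reference to how it will be executed.
PIPELINE: List[Step] = [lambda x: x + 1, lambda x: x * 2]


def execute(runner: Runner, data: Iterable[int]) -> List[int]:
    """Run the shared pipeline on whichever runner is supplied."""
    return runner.run(PIPELINE, data)


if __name__ == "__main__":
    print(execute(ElementwiseRunner(), [1, 2, 3]))
    print(execute(StagewiseRunner(), [1, 2, 3]))
```

Both runners produce the same result for the same pipeline, which is the property an abstraction layer of this kind aims for: application code stays unchanged when the underlying platform is swapped.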
We are working towards developing an open source big data analytics architecture. We are trying to keep it as simple as possible while providing a comprehensive picture of the big data analytics lifecycle. For academia, the architecture will show the state of the art, the tools that are missing, and the tools that are mature enough to be used in research. It will also provide a method for tracking notable new open source tools as they appear in different sources. For technical people, it will help determine which tool to use for a particular implementation. Small and medium-sized enterprises can provide services using some of these tools to address the gaps in a bigger architecture. For an established firm trying to develop a strategy, the architecture will provide a comprehensive picture of what fits where. Commercial big data solution providers can also benefit from this architecture: they will see the capabilities they lack and can collaborate with a small enterprise to provide them.
Mert Gokalp and Keres Kayabay are working with Mohamed Zaki to build this architecture. We will publish a working paper on this topic soon.