The Internet of Things (IoT) creates a lot of data, obviously. When we have more data than we can handle we tend to talk about data ‘streams’… and these streams flow so fast that they end up creating the so-called ‘data lake’. So how do we navigate the IoT water table and avoid drowning?
Indubitably Hadoop-y
A key factor in the IoT data mountain (sorry, lake, let’s not mix metaphors) is Hadoop. Essentially a large piece of software, Hadoop is an open source technology framework for storing large amounts of data and running correspondingly weighty applications on ‘clusters’ of commodity hardware.
Weighty applications in the Hadoop sense might typically have to handle ‘virtually limitless’ concurrent tasks or jobs. This in itself will create lots of data ‘events’, all happening at the same time i.e. precisely the kind of thing we would expect to see in IoT scenarios.
For example, imagine an IoT sensor in ‘always-on’ mode reporting back temperature and humidity readings to a central management data store. That is a lot of concurrent data events.
If all we need to do is monitor a set of defined factors such as temperature and humidity, then we are mostly dealing with structured data. The trouble is, many IoT sensors will collate masses of unstructured data from sound to video and so on.
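To make the structured versus unstructured distinction concrete, here is a minimal sketch in Python. The `SensorReading` class, its field names and the sample values are all hypothetical, invented purely for illustration; they do not come from any particular IoT platform.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class SensorReading:
    """A structured reading: every field has a known name and type."""
    sensor_id: str        # hypothetical identifier
    timestamp: float      # seconds since the Unix epoch
    temperature_c: float  # degrees Celsius
    humidity_pct: float   # relative humidity, in percent


def to_record(reading: SensorReading) -> str:
    """Serialize a reading to a JSON line that a data store can query by field."""
    return json.dumps(asdict(reading), sort_keys=True)


# Structured: the meaning of each value is known up front.
reading = SensorReading("sensor-1", 1700000000.0, 21.5, 40.0)
record = to_record(reading)

# Unstructured: a stand-in for a chunk of audio or video. It is just bytes,
# with no field names or schema to tell us what it means.
raw_audio = bytes([0, 255, 17, 42])
```

The structured record can be filtered and aggregated straight away (for example, by `temperature_c`), whereas the byte payload needs further processing before it yields anything we could query.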
Raw, like sushi
We often call this data ‘raw’, because we have yet to apply any contextual, humanistic or circumstantial meaning to it.
When we have too much data (much of it unstructured) to even handle and process, then all we can do is store it. We store it in a place we call the ‘data lake’, which we can think of as a kind of repository that we use as a ‘holding pen’ for data that we haven’t even managed to get to yet.
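The ‘holding pen’ idea can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the lake here is an ordinary local directory rather than a Hadoop cluster, and the `land_raw` function, the folder layout and the `camera-7` source name are all hypothetical.

```python
import hashlib
import json
import tempfile
import time
from pathlib import Path


def land_raw(lake_dir: Path, source: str, payload: bytes) -> Path:
    """Store a payload untouched, with a small metadata file alongside it."""
    # Name the file after a hash of its content so identical payloads
    # from the same source land on the same file name.
    digest = hashlib.sha256(payload).hexdigest()
    target = lake_dir / source / f"{digest}.bin"
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(payload)  # no parsing, no schema: the data stays raw

    # Record only what we know right now; meaning gets applied later.
    meta = {"source": source, "bytes": len(payload), "landed_at": time.time()}
    target.with_suffix(".json").write_text(json.dumps(meta))
    return target


# A temporary directory stands in for the lake.
lake = Path(tempfile.mkdtemp(prefix="lake_"))
stored = land_raw(lake, "camera-7", b"\x00\x01raw-video-frame")
```

Nothing is interpreted on the way in. The payload is written exactly as it arrived, which is what makes the lake a place to park data we have not got round to yet.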
So what of Hadoop and the IoT for 2017 and the road ahead?
A big data trend to watch
Big data company Syncsort has explained that in terms of who (or more specifically what) now helps to fill up Hadoop, it is the streams of data coming from the IoT that are increasingly responsible.
The firm says that traditional data sources are still the more popular way of filling the data lake, but newer sources are a sizable part of the mix. As Hadoop projects multiply, so too will the sources and volume of data required to support them.
According to Syncsort, “Newer sources are gaining importance, especially those that generate streaming data, such as smart devices and sensors. Therefore, having a tool that allows organizations to easily access and integrate all data – legacy and newer sources, batch and streaming – is vital.”
So of course guess what? Syncsort is known for producing the kinds of software tools that will help integrate data in this kind of scenario.
Embrace IoT data, all of it
Syncsort has also looked at the road ahead for Hadoop and says that the number one role of Hadoop continues to be increasing data warehouse capacity and reducing costs.
GM of Syncsort’s big data business Tendü Yoğurtçu says that it’s clear businesses are realizing the importance of tapping into a full range of data sources but, as yet, they are still struggling to do so.
The message is that all data is good data, even if we don’t yet know what that raw unstructured data means while it’s still sloshing about in the data lake.
Peace and (data) love everybody.