Hazelcast Jet aims to DAG-propel IoT data flow

Open source in-memory data grid (IMDG) company Hazelcast is increasingly aligning its ‘application-embeddable’ technologies to IoT use cases.

Apache Spark rival Hazelcast recently hit the 0.5 version release for Jet, its application-embeddable distributed computing platform for fast processing of big data sets.

This release features the company’s Pipeline API [application programming interface] for creating data connections for both batch and stream processing.

IoT application architectures

So what makes this technology a good fit for the IoT? For a start, Jet is delivered as a single library with no dependencies. Hazelcast executives claim that this means it suffers from fewer system incompatibilities when deployed across architectures where threading, parallelism and concurrency concerns could hamper other code sets.

This, in turn, means that it can (arguably) be suited to deployments in embedded systems where connectivity to multiple other systems does not exist. As a result, standalone IoT use cases come to the fore.

Hazelcast suggests that typical application use cases include online trades (where, presumably, an entire transaction execution has to happen inside a defined contained space of logic), sensor updates in IoT architectures, real-time fraud detection and system log events among others.

The Pipeline API is the primary programming interface of Hazelcast Jet for batch and stream processing, so one imagines that this should make it more appealing to a wider Java audience. Indeed, the Java 8 Stream API, a well-known and popular API in the Java community that supports functional-style operations on streams of elements, is also available in Hazelcast Jet 0.5.
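For readers who have not met the functional style in question, here is a minimal sketch using the plain JDK `java.util.stream` package on a single machine. It is not Jet's distributed implementation, and the `Reading` type, the sensor names and the -50 degree cut-off are all made up for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical example class: plain JDK streams only, no Hazelcast code involved.
class SensorStreamSketch {

    // A made-up sensor reading: a device id plus a temperature value.
    static final class Reading {
        final String sensorId;
        final double celsius;

        Reading(String sensorId, double celsius) {
            this.sensorId = sensorId;
            this.celsius = celsius;
        }
    }

    // Filter out implausible samples, then group by sensor and average each group.
    static Map<String, Double> averageBySensor(List<Reading> readings) {
        return readings.stream()
                .filter(r -> r.celsius > -50.0) // assumed cut-off for faulty samples
                .collect(Collectors.groupingBy(
                        r -> r.sensorId,
                        TreeMap::new, // sorted map so the printed order is stable
                        Collectors.averagingDouble(r -> r.celsius)));
    }

    public static void main(String[] args) {
        List<Reading> readings = Arrays.asList(
                new Reading("a", 20.0),
                new Reading("a", 22.0),
                new Reading("b", 18.0),
                new Reading("b", -999.0));
        System.out.println(averageBySensor(readings));
    }
}
```

The appeal for a Java audience is that the filter, group and aggregate steps are declared rather than hand-coded as loops, which is the same shape of program a batch or stream pipeline expresses.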

DAGs, not the Aussie kind

Overall, this is a big update to the Hazelcast Jet low-level core API, which uses directed acyclic graphs (DAGs) to model data flow – allowing detailed DAG assembly of processing jobs.

DAGs are used to model probabilities, connectivity and causality. A ‘graph’ in this sense means a structure made from nodes and edges, so kind of like a graph database.
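To make the nodes-and-edges idea concrete, the following is a rough sketch in plain Java of a processing job modelled as a DAG and then ordered so that every stage runs after the stages feeding it (Kahn's algorithm). The `DagSketch` class and the stage names are hypothetical and this is not Jet's core API, just the underlying data structure.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of a data-flow DAG; not the Hazelcast Jet API.
class DagSketch {
    // Adjacency list: each vertex maps to the vertices it sends data to.
    private final Map<String, List<String>> edges = new LinkedHashMap<>();

    void vertex(String name) {
        edges.putIfAbsent(name, new ArrayList<>());
    }

    void edge(String from, String to) {
        vertex(from);
        vertex(to);
        edges.get(from).add(to);
    }

    // Kahn's algorithm: repeatedly take a vertex with no unprocessed inputs.
    List<String> topologicalOrder() {
        Map<String, Integer> inDegree = new LinkedHashMap<>();
        for (String v : edges.keySet()) {
            inDegree.put(v, 0);
        }
        for (List<String> targets : edges.values()) {
            for (String t : targets) {
                inDegree.merge(t, 1, Integer::sum);
            }
        }
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> e : inDegree.entrySet()) {
            if (e.getValue() == 0) {
                ready.add(e.getKey());
            }
        }
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String v = ready.remove();
            order.add(v);
            for (String t : edges.get(v)) {
                if (inDegree.merge(t, -1, Integer::sum) == 0) {
                    ready.add(t);
                }
            }
        }
        // If some vertex never became ready, the graph contains a cycle.
        if (order.size() != edges.size()) {
            throw new IllegalStateException("cycle detected: not a DAG");
        }
        return order;
    }

    public static void main(String[] args) {
        DagSketch job = new DagSketch();
        job.edge("source", "parse");
        job.edge("parse", "aggregate");
        job.edge("parse", "alerts");
        job.edge("aggregate", "sink");
        System.out.println(job.topologicalOrder());
    }
}
```

The ‘acyclic’ part is what makes this work: because data never loops back to an earlier stage, there is always a valid order in which to wire up and run the stages.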

“Since its first release, Jet has put the ‘fast’ in Fast Big Data with performance up to 15 times faster than Spark and Flink,” said Hazelcast CEO Greg Luck. “In this release we have been working on bringing Hazelcast’s programming simplicity to Jet, which we think we have now achieved with the Pipeline API. Programmers, start your Jet engines.”

Also new is fault tolerance using distributed in-memory snapshots – in Hazelcast Jet 0.5, snapshots are distributed across the cluster and held in multiple replicas to provide redundancy.

Jet is now able to tolerate multiple faults such as node failure, network partition or job execution failure. Snapshots are periodically created and backed up. If there is a node failure, Jet uses the latest state snapshot and automatically restarts all jobs that contain the failed node as a job participant.
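The snapshot-and-restart idea can be sketched in a few lines of plain Java. This is a single-process toy under stated assumptions, not how Jet implements it: the `SnapshotSketch` class, the `snapshotEvery` and `failAt` parameters and the replayable input list are all invented for illustration, and the distribution of snapshots across cluster replicas is left out entirely.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical single-process illustration of periodic snapshots and restart.
class SnapshotSketch {
    private Map<String, Integer> counts = new HashMap<>();         // live job state
    private Map<String, Integer> snapshotCounts = new HashMap<>(); // last saved state
    private int snapshotOffset = 0;                                // input position of last snapshot

    private void snapshot(int offset) {
        snapshotCounts = new HashMap<>(counts);
        snapshotOffset = offset;
    }

    // Discard live state, reload the last snapshot and report where to resume.
    private int restore() {
        counts = new HashMap<>(snapshotCounts);
        return snapshotOffset;
    }

    // Counts keys, snapshotting every `snapshotEvery` records. A failure is
    // simulated once, just before the record at index `failAt` (use -1 for none).
    static Map<String, Integer> run(List<String> input, int snapshotEvery, int failAt) {
        SnapshotSketch job = new SnapshotSketch();
        boolean failed = false;
        int i = 0;
        while (i < input.size()) {
            if (!failed && i == failAt) {
                failed = true;
                job.counts.clear(); // in-memory state is lost on failure
                i = job.restore();  // restart from the latest snapshot
                continue;
            }
            job.counts.merge(input.get(i), 1, Integer::sum);
            i++;
            if (i % snapshotEvery == 0) {
                job.snapshot(i);
            }
        }
        return job.counts;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("a", "b", "a", "c", "b", "a");
        System.out.println(run(input, 2, -1)); // no failure
        System.out.println(run(input, 2, 3));  // failure before the fourth record
    }
}
```

Both runs end with the same counts, because the restarted job replays only the records that arrived after the last snapshot. The sketch assumes the input can be replayed from a known offset.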

Not so daggy on DAGs after all then? Well, Hazelcast Jet is built on top of a one-record-per-time architecture (sometimes we call this ‘continuous operators’), which means it processes records ‘as soon as they come in’, if you will, rather than accumulating them into micro-batches first.
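The difference between handing each record on as it arrives and holding records back until a micro-batch fills up can be shown with a toy in plain Java. The `RecordVsBatchSketch` class and its `process` method are hypothetical, and a batch size of 1 stands in for the one-record-per-time case.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration: what gets handed downstream, and when.
class RecordVsBatchSketch {

    // Returns the groups of records emitted downstream, in order.
    // Records still sitting in the buffer at the end have not been emitted yet.
    static List<List<Integer>> process(List<Integer> input, int batchSize) {
        List<List<Integer>> emitted = new ArrayList<>();
        List<Integer> buffer = new ArrayList<>();
        for (Integer record : input) {
            buffer.add(record);
            if (buffer.size() >= batchSize) {
                emitted.add(buffer);       // hand the full batch downstream
                buffer = new ArrayList<>();
            }
        }
        return emitted;
    }

    public static void main(String[] args) {
        List<Integer> input = Arrays.asList(1, 2, 3, 4, 5);
        System.out.println(process(input, 1)); // one record at a time
        System.out.println(process(input, 3)); // micro-batches of three
    }
}
```

With a batch size of three, records 4 and 5 are still waiting in the buffer when the input pauses, which is the latency cost that a one-record-per-time design avoids.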

To be honest, you had me at continuous-operator parallel-execution model (remember our lack of dependencies?), yeah? The IoT data flow rate is getting faster, so some of this stuff is going to matter.