At MaibornWolff, we continuously work on many IoT projects in different industries for various customers. What all of these projects have in common is that people want to gain insights from the data produced by IoT devices and sensors. In other words, practically all IoT projects are also data analytics projects. When doing data analytics, it is important to understand the environment and the data that you are working with, because this is what lets you choose the right technologies and solutions. To improve the understanding of IoT data, this blog post covers some of the things that are specific to IoT projects and IoT data. Looking at it from a distance, there are two main aspects that make IoT projects a bit different from other data analytics projects: (1) the devices and (2) the data. The following sections go into detail on these two aspects.
Large number of concurrently connected devices
In most IoT projects, you have a large number of devices that are concurrently connected to a central communication infrastructure (typically an MQTT message broker cluster). The exact setup differs a bit between projects in which end user devices (cars, household appliances etc.) are directly connected to the backend and industrial IoT projects.
In many industrial IoT projects, you might only have a few hundred or a few thousand devices or sensors connected to a gateway or edge computing solution, but these devices or sensors send updates at a high frequency (e.g. every couple of milliseconds).
In projects with end user devices, you usually have a lot more devices with standing TCP connections (possibly millions), but the devices send data less frequently (in the worst case every second or every couple of seconds).
In IoT projects, you might need to aggregate both the bi-directional communication between backend services and devices and the data sent by the devices into business objects (e.g. a car ride or a wash cycle of a washing machine) before you can do certain analyses. You therefore have to make sure that you also ingest data that might not be stored in any backend database. It’s equally important to ingest the raw messages from the devices, because data that has been preprocessed by other services might no longer contain all the information.
Unreliable networks (e.g. mobile)
The networks in many IoT projects are inherently unreliable (e.g. mobile networks), so data can get lost, be delivered multiple times, arrive late (possibly days or months) or even get corrupted in transit.
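As a minimal sketch of how an ingestion step might cope with these effects, the following Python example drops corrupted and duplicate messages, restores the order in which the device sent them, and reports gaps. It assumes that every message carries a per-device sequence number and a checksum; the field names and the `Message` type are hypothetical and not taken from any specific project.

```python
import zlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    # All field names are assumptions made for this illustration.
    device_id: str
    seq: int        # per-device sequence number assigned by the device
    sent_at: float  # device-side timestamp (seconds since epoch)
    payload: bytes
    crc32: int      # checksum the device computed over the payload


def clean(messages):
    """Drop corrupted and duplicate messages, then restore device order."""
    seen = set()
    result = []
    for msg in messages:
        if zlib.crc32(msg.payload) != msg.crc32:
            continue  # payload was damaged in transit
        key = (msg.device_id, msg.seq)
        if key in seen:
            continue  # the same message was delivered more than once
        seen.add(key)
        result.append(msg)
    # Sort by the device-side sequence number, not by arrival order,
    # so that late messages end up in the right place.
    return sorted(result, key=lambda m: (m.device_id, m.seq))


def missing_seqs(messages, device_id):
    """Return sequence numbers that never arrived for one device."""
    present = {m.seq for m in messages if m.device_id == device_id}
    if not present:
        return []
    return [s for s in range(min(present), max(present) + 1) if s not in present]
```

A gap reported by `missing_seqs` does not necessarily mean the message is lost for good. Since messages can be days or months late, it may simply not have arrived yet.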
Large number of combinations of hardware and software versions
Especially in IoT projects with end user devices, you often have a large number of different hardware versions and software versions out in the field (including all kinds of combinations of these hardware and software versions). Even if you are lucky and can continuously force all devices to do remote software updates, you will still have newly built devices coming online that might have older software versions installed. Unfortunately, you often cannot make sure that all devices update themselves. For example, cars cannot be updated while somebody is driving them or when they are parked in parking garages without an internet connection. Also, since the end user owns these devices, it’s typically up to the owner whether to install updates.
Many different types of hardware and software versions.
Icons by Font Awesome are licensed under the CC BY 3.0 license (Font Awesome 4.7.0).
The main problem with all these hardware and software combinations is that they tend to have different bugs and quirks, and you will need to be able to deal with all of them. This applies not only to the combinations that are still in the field but also to all historic combinations, in case you have to reprocess old data.
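One way to keep such quirks manageable is a lookup table of fixes per hardware and software combination that is applied before any analysis. The following Python sketch illustrates the idea. The version strings, field names and the two quirks are entirely made up for this example.

```python
def fix_fahrenheit(reading):
    # Hypothetical quirk: this firmware reported Fahrenheit in a Celsius field.
    fixed = dict(reading)
    fixed["temperature_c"] = round((reading["temperature_c"] - 32) * 5 / 9, 2)
    return fixed


def fix_inverted_speed(reading):
    # Hypothetical quirk: this hardware revision reported speed with a flipped sign.
    fixed = dict(reading)
    fixed["speed_kmh"] = abs(reading["speed_kmh"])
    return fixed


# Fixes per (hardware version, software version). Entries are never removed,
# so that historic data can still be reprocessed correctly.
QUIRK_FIXES = {
    ("hw-A", "1.2.0"): [fix_fahrenheit],
    ("hw-B", "1.0.0"): [fix_fahrenheit, fix_inverted_speed],
}


def normalize(reading):
    """Apply all known fixes for the reading's hardware/software combination."""
    for fix in QUIRK_FIXES.get((reading["hw"], reading["sw"]), []):
        reading = fix(reading)
    return reading
```

Keeping the fixes as small, separate functions makes it possible to test each quirk on its own and to see at a glance which combinations are affected.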
Continuous flow of data
The sensors and devices in IoT projects create a continuous flow of data. The data usually consists of certain events (e.g. engine started) or time-based updates that are sent with a certain frequency (e.g. temperature from a sensor). For this kind of data, event processing based on streaming technologies typically makes sense, so you can act on events as they arrive and possibly do predictive analyses.
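To illustrate the difference between the two kinds of messages, here is a small Python sketch of a processing loop that consumes messages one at a time. It reacts to an event immediately and keeps a short rolling window for time-based updates. The message layout, the window size and the temperature limit are assumptions for this example; a real project would typically run such logic inside a stream processing framework rather than a plain generator.

```python
from collections import defaultdict, deque


def process(stream, window=3, limit=90.0):
    """Consume messages one by one and yield alerts as soon as they occur."""
    recent = defaultdict(lambda: deque(maxlen=window))
    for msg in stream:
        if msg["kind"] == "event":
            # Events are acted on immediately.
            if msg["name"] == "engine_started":
                yield f"{msg['device']}: engine started"
        else:
            # Time-based updates are evaluated over a rolling window.
            values = recent[msg["device"]]
            values.append(msg["value"])
            avg = sum(values) / len(values)
            if len(values) == window and avg > limit:
                yield f"{msg['device']}: average temperature {avg:.1f} above {limit}"
```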
Small data packets, but a lot of them
The events and time-based messages described above are typically small, often around 1 kB or less (assuming they are encoded in a binary format like Google Protocol Buffers), but you will have a lot of them. The following illustration gives you a general idea of the amount of data that you are dealing with.
Small data packets, but a lot of them result in a large amount of data.
Image provided by Dominik Obermaier from HiveMQ GmbH.
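A quick back-of-the-envelope calculation shows how small messages add up. The numbers in the following Python sketch (one million devices, one message every five seconds, 1 kB per message) are purely illustrative assumptions and are not taken from the illustration above.

```python
SECONDS_PER_DAY = 86_400


def daily_volume(devices, interval_s, message_bytes):
    """Return (messages per day, bytes per day) for a fleet of devices."""
    messages_per_day = devices * (SECONDS_PER_DAY / interval_s)
    return messages_per_day, messages_per_day * message_bytes


# Hypothetical fleet: 1 million devices, one 1 kB message every 5 seconds.
messages, size = daily_volume(devices=1_000_000, interval_s=5, message_bytes=1_000)
print(f"{messages:,.0f} messages/day, {size / 1e12:.2f} TB/day")
```

Under these assumptions, the fleet produces about 17 billion messages and roughly 17 TB per day, even though each individual message is tiny.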
Putting the data-puzzle pieces together
The individual messages sent by the IoT devices are often not useful by themselves. For many analyses, you need to aggregate them to some form of business object on which you want to do analyses. For example, in connected car settings you often do analyses based on rides. The rides are aggregates of events and time-based messages that happen during a ride. To do these analyses you have to aggregate all these messages, and due to the problems caused by the unreliable networks, it’s a bit like putting a puzzle back together.
A lot of small data packets form a larger image.
Icons by Font Awesome are licensed under the CC BY 3.0 license (Font Awesome 4.7.0). Color changes were made.
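The ride example can be sketched in a few lines of Python. The sketch assumes that every message has a car ID, a device-side timestamp and a type, and that a ride is delimited by `engine_started` and `engine_stopped` events. All of these names are hypothetical. Rides whose stop event never arrived are kept but marked as incomplete.

```python
from collections import defaultdict


def build_rides(events):
    """Group events into rides per car, tolerating out-of-order and lost messages."""
    by_car = defaultdict(list)
    for event in events:
        by_car[event["car_id"]].append(event)

    rides = []
    for car_id, car_events in by_car.items():
        current = None
        # Sort by device-side timestamp because arrival order is unreliable.
        for event in sorted(car_events, key=lambda e: e["ts"]):
            if event["type"] == "engine_started":
                if current is not None:
                    rides.append(current)  # stop event was lost, keep as incomplete
                current = {"car_id": car_id, "start": event["ts"], "end": None,
                           "events": [], "complete": False}
            elif current is None:
                continue  # start event was lost; this sketch skips such events
            elif event["type"] == "engine_stopped":
                current["end"] = event["ts"]
                current["complete"] = True
                rides.append(current)
                current = None
            else:
                current["events"].append(event)
        if current is not None:
            rides.append(current)  # ride still open or stop event lost
    return rides
```

Because late messages can still arrive after a ride has been built, an aggregation like this usually has to be repeatable, so that a ride can be rebuilt once missing pieces show up.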
Data quality is even worse than usual
Experience shows that the data quality in many IoT projects is even worse than what you are used to in regular enterprise data analytics projects. One reason for this is the problem described in the section "Unreliable networks". Another reason is that, in contrast to many enterprise data analytics settings, the data from the IoT devices is not an inherent part of a business transaction that is stored in a database following ACID principles. It’s up to you to create these business objects or transactions out of individual messages that were not built for this purpose.
Edge computing required in many cases (industrial IoT, cars…)
In many cases, the sensors in an IoT project produce a lot of data. This is not only true in industrial IoT projects but also in other IoT projects like connected car projects. To give you an idea of the amount of data: a number typically quoted for cars with autonomous driving sensors is that a single car produces up to 10 TB of sensor data per day. Of course, the number will vary per car and per manufacturer, but the important thing is the order of magnitude. Due to the amount of data being created, you can’t just transfer everything into your backend (regardless of whether it’s in the cloud or on premises). Instead, you will often need to do some preprocessing of the data on the edge. In industrial IoT settings, you might have a full edge computing cluster available, while in other IoT projects you often have some kind of gateway that can do the preprocessing (e.g. the head unit in a car).
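What such preprocessing looks like depends heavily on the use case. As one simple possibility, the following Python sketch condenses high-frequency sensor readings into one summary per time window, so that only the summaries have to be transferred to the backend. The window size and the choice of statistics are assumptions made for this example.

```python
from statistics import fmean


def summarize(readings, window_s=1.0):
    """Condense (timestamp_s, value) readings into one summary per time window."""
    windows = {}
    for ts, value in readings:
        # Readings with the same window index fall into the same summary.
        windows.setdefault(int(ts // window_s), []).append(value)
    return [
        {
            "window_start": index * window_s,
            "count": len(values),
            "min": min(values),
            "max": max(values),
            "mean": fmean(values),
        }
        for index, values in sorted(windows.items())
    ]
```

With a sensor that sends a value every few milliseconds and a window of one second, this reduces hundreds of readings to a single record. The trade-off is that the raw values are no longer available in the backend for later analyses.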
In a follow-up post, I will talk about what decisions can be made based on the characteristics described above.