MQTT + Spark Streaming and DynamoDB - apache-spark

I am trying to design an IoT platform using the above-mentioned technologies. I would be happy if someone could comment on the architecture and whether it is good and scalable!
I get IoT sensor data through MQTT, which I will receive through Spark Streaming (there is an MQTT connector for Spark Streaming that does this). I only have to subscribe to the topics; a third-party server publishes the IoT data to them.
Then I parse the data and insert it into AWS DynamoDB. Yes, the whole setup will run on AWS.
I may have to process/transform the data in the future depending on the IoT use cases, so I thought Spark might be useful. I have also heard that Spark Streaming is blazing fast.
It's a simple overview and I am not sure if it's a good architecture. Would it be overkill to use Spark Streaming? Are there other ways to store data received over MQTT directly in DynamoDB?
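For reference, a rough sketch of the pipeline I have in mind is below; the broker URL, topic name, and DynamoDB table name are placeholders, and I am assuming the spark-streaming-mqtt connector's MQTTUtils API:

import json
from decimal import Decimal

import boto3
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.mqtt import MQTTUtils  # from the spark-streaming-mqtt connector

sc = SparkContext(appName="MqttToDynamoDB")
ssc = StreamingContext(sc, batchDuration=10)  # 10-second micro-batches

# Placeholder broker URL and topic published by the third-party server.
stream = MQTTUtils.createStream(ssc, "tcp://broker.example.com:1883", "sensors/telemetry")

def write_partition(records):
    # One DynamoDB handle per partition; "SensorReadings" is an assumed table name.
    table = boto3.resource("dynamodb").Table("SensorReadings")
    for raw in records:
        item = json.loads(raw, parse_float=Decimal)  # DynamoDB does not accept Python floats
        table.put_item(Item=item)

stream.foreachRDD(lambda rdd: rdd.foreachPartition(write_partition))

ssc.start()
ssc.awaitTermination()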

I cannot say whether your components will result in a scalable architecture, since you did not elaborate on how you will scale them, what the estimated load on such a system would be, or whether there will be peaks in load.
If you are talking about scalability in terms of performance, you should also consider scalability in terms of pricing, which may be important to your project.
For instance, DynamoDB is a very scalable NoSQL database service which offers elastic performance with very efficient pricing. I do not know much about Apache Spark, and even though it has been designed to be very efficient at scale, how will you distribute incoming data? Will you host multiple instances on EC2 and use auto scaling to manage them?
My advice would be to break your needs down into components in order to conduct a successful analysis. To summarize your statements:
You need to ingest incoming sensor telemetry at scale using MQTT.
You need to transform or enrich these data on the fly.
You need to insert these data (probably as time-series) into DynamoDB in order to build an event-sourcing system.
Since you mentioned Apache Spark, I imagine you would need to perform some analysis of these data, either in near real-time, or in batch, to build value out of your data.
My advice would be to use serverless, managed services in AWS so that you only pay for what you really use, can forget about maintenance and scalability, and can focus on your project.
AWS IoT is a platform built into AWS which will allow you to securely ingest data at any scale using MQTT.
This platform also embeds a rules engine, which will allow you to build your business rules in the cloud: for example, intercepting incoming messages, enriching them, and calling other AWS services as a result (e.g. calling a Lambda function to do some processing on the ingested data).
The rules engine has a native connector to DynamoDB, which will allow you to insert your enriched or transformed data into a table.
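If the native connector's one-to-one mapping is not enough, the Lambda action mentioned above can do the enrichment and the DynamoDB write itself. A minimal sketch, assuming the rule (e.g. SELECT * FROM 'sensors/telemetry') forwards the decoded MQTT payload and a hypothetical SensorReadings table keyed by deviceId and timestamp:

from decimal import Decimal

import boto3

# Hypothetical table with deviceId as partition key and timestamp as sort key.
table = boto3.resource("dynamodb").Table("SensorReadings")

def handler(event, context):
    # The IoT rules engine passes the decoded MQTT payload as the event.
    item = {
        "deviceId": event["deviceId"],
        "timestamp": event["timestamp"],
        "temperature": Decimal(str(event["temperature"])),  # DynamoDB does not accept floats
        "source": "mqtt",  # trivial enrichment example
    }
    table.put_item(Item=item)
    return {"status": "stored"}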
The rules engine also has a connector to the new Amazon Machine Learning service, if you want to get predictions on sensor data in real time.
You can then use other services such as EMR + Spark to batch-process your data once a day, week, or month.
The advantage here is that you assemble your components and use them as you go, meaning that you do not need the full-featured stack when you are beginning, but you still have the flexibility to make changes in the future.
An overview of the AWS IoT service.

Related

REST API to query Databricks table

I have a use case and need help with the best available approach.
I use Azure Databricks to create data transformations and create tables in the presentation layer/gold layer. The underlying data for these tables is in an Azure Storage account.
The transformation logic runs twice daily and updates the gold layer tables.
I have several such tables in the gold layer, e.g. a table to store single customer view data.
An external application from a different system needs access to this data, i.e. the application would initiate an API call for details regarding a customer, and I need to send back the response with the matching details (customer details) by querying the single customer view table.
Question:
Is the Databricks SQL API the solution for this?
As it is a Spark table, the response will not be quick, I assume. Is this correct, or is there a better solution for this?
Is Databricks designed for such use cases, or is it a better approach to copy this table (gold layer) into an operational database such as Azure SQL DB after the transformations are done in PySpark via Databricks?
What are the cons of this approach? One would be that the Databricks cluster has to be up and running at all times, i.e. use an interactive cluster. Anything else?
It's possible to use Databricks for that, although it heavily depends on the SLAs, i.e. how fast the response should be. Answering your questions in order:
There is no standalone API for executing queries and getting back results (yet). But you can create a thin wrapper using one of the drivers that work with Databricks: Python, Node.js, Go, or JDBC/ODBC (see the sketch at the end of this answer).
Response time heavily depends on the size of the data, whether the data is already cached on the nodes, and other factors (partitioning of the data, data skipping, etc.). Databricks SQL warehouses are also able to cache query results, so they won't reprocess the data if the same query was already executed.
Storing data in operational databases is also an approach often used by different customers. But it heavily depends on the size of the data and other factors: if you have a huge gold layer, then SQL databases may not be the best solution from a cost/performance perspective either.
For such queries it's recommended to use Databricks SQL, which is more cost-efficient than having an always-running interactive cluster. Also, on some cloud platforms there is already support for serverless Databricks SQL, where the startup time is very short (seconds instead of minutes), so if your queries against the gold layer don't happen very often, you can configure the warehouses with auto-termination and pay only when they are used.
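To illustrate the "thin wrapper" idea from the first point, here is a hypothetical sketch using the Python driver (the databricks-sql-connector package) behind a small Flask endpoint. The hostname, HTTP path, token, and the gold.single_customer_view table with its customer_id column are all placeholders:

import os

from databricks import sql  # pip install databricks-sql-connector
from flask import Flask, jsonify

app = Flask(__name__)

def to_jsonable(value):
    # Decimals, dates, etc. coming back from the warehouse are rendered as strings.
    return value if isinstance(value, (int, float, str, bool, type(None))) else str(value)

def run_query(statement, params):
    with sql.connect(
        server_hostname=os.environ["DATABRICKS_SERVER_HOSTNAME"],
        http_path=os.environ["DATABRICKS_HTTP_PATH"],
        access_token=os.environ["DATABRICKS_TOKEN"],
    ) as connection:
        with connection.cursor() as cursor:
            # Parameter marker syntax varies with the connector version
            # (":id" in recent versions, "%(id)s" in older ones).
            cursor.execute(statement, params)
            columns = [c[0] for c in cursor.description]
            return [dict(zip(columns, map(to_jsonable, row))) for row in cursor.fetchall()]

@app.route("/customers/<customer_id>")
def get_customer(customer_id):
    rows = run_query(
        "SELECT * FROM gold.single_customer_view WHERE customer_id = :id",
        {"id": customer_id},
    )
    return jsonify(rows)

if __name__ == "__main__":
    app.run(port=8080)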

Real-time streaming data into Azure Data Warehouse from SQL Server

I'm trying to build a real-time reporting service on top of Microsoft Azure Data Warehouse. Currently I have a SQL Server with about 5 TB of data. I want to stream the data to the data warehouse and use Azure DW's computation power to generate real-time reporting based on the data. Are there any ready-to-use solutions or best practices for doing that?
One approach I was considering is to load data into Kafka and then stream it into Azure DW via Spark streaming. However, this approach is more near-real-time than real-time. Is there any way to utilize SQL Server Change Data Capture to stream data into data warehouse?
I don't personally see Azure SQL Data Warehouse in a real-time architecture. It's a batch MPP system optimised for shredding billions of rows over multiple nodes. Such a pattern is not synonymous with sub-second or real-time performance, in my humble opinion. Real-time architectures tend to look more like Event Hubs > Stream Analytics in Azure. The low concurrency available (ie currently a max of 32 concurrent users) is also not a good fit for reporting.
As an alternative you might consider Azure SQL Database in-memory tables for fast load and then hand off to the warehouse at a convenient point.
You could use Azure SQL Data Warehouse in a so-called Lambda architecture with a batch and a real-time element, where it supports the batch stream. See here for further reading:
https://social.technet.microsoft.com/wiki/contents/articles/33626.lambda-architecture-implementation-using-microsoft-azure.aspx
If you're looking for a SQL-based SaaS solution to power real-time reporting applications, we recently released an HTTP API product called Stride, which is based on PipelineDB, the open-source streaming-SQL database we built, and which can handle this type of workload.
The Stride API enables developers to run continuous SQL queries on streaming data and store the results of those continuous queries in tables that are incrementally updated as new data arrives. This might be a simpler approach to adding the kind of real-time analytics layer you mentioned above.
Feel free to check out the Stride technical docs for more detail.

Spring Cloud DataFlow http polling and deduplication

I have been reading a lot of Spring Cloud Data Flow and related documentation in order to produce a data ingest solution that will run in my organization's Cloud Foundry deployment. The goal is to poll an HTTP service for data, perhaps three times per day for the sake of discussion, and insert/update that data in a PostgreSQL database. The HTTP service seems to provide tens of thousands of records per day.
One point of confusion thus far is what the best practice is for deduplicating polled records in the context of a Data Flow pipeline. The source data do not have a timestamp field to aid in tracking polling, only a coarse day-level date field. I also have no guarantee that records are never updated retroactively. The records appear to have a unique ID, so I can dedup the records that way, but based on the documentation I am just not sure how best to implement that logic in Data Flow. As far as I can tell, the Spring Cloud Stream starters do not provide for this out of the box. I was reading about Spring Integration's smart polling, but I'm not sure that's meant to address my concern either.
My intuition is to create a custom Processor Java component in a DataFlow Stream that performs a database query to determine whether polled records have already been inserted, then inserts the appropriate records into the target database, or passes them on down the stream. Is querying the target database in an intermediate step acceptable in a Stream app? Alternatively, I could implement this all in a Spring Cloud Task as a batch operation which triggers based on some schedule.
What is the best way to proceed with respect to a DataFlow app? What are common/best practices for achieving deduplication as I described above in a DataFlow/Stream/Task/Integration app? Should I copy the setup of a starter app or just start from scratch, because I am fairly certain I'll need to write custom code? Do I even need Spring Cloud DataFlow, because I'm not sure I'll be using its DSL at all? Apologies for all the questions, but being new to Cloud Foundry and all these Spring projects, it's daunting to piece it all together.
Thanks in advance for any help.
You are on the right track: given your requirements, you will most likely need to create a custom processor. You need to keep track of what has been inserted in order to avoid duplication.
There's nothing preventing you from writing such a processor in a stream app; however, performance may take a hit, since you will issue a DB query for each record.
If order is not important, you could parallelize the queries so you could process several concurrent messages, but in the end your DB would still pay the price.
Another approach would be to use a Bloom filter, which can help quite a lot in speeding up the check for already-inserted records (a sketch of the idea is below).
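To illustrate the Bloom filter idea, here is a toy version sketched in Python for brevity (in a Spring Cloud Stream processor you would reach for an equivalent Java implementation such as Guava's BloomFilter); the sizes are illustrative:

import hashlib

class BloomFilter:
    def __init__(self, size_bits=1_000_000, num_hashes=5):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, key):
        # Derive k bit positions from salted SHA-256 digests of the record ID.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))

# Usage: skip the database lookup entirely when the filter has never seen the ID.
seen = BloomFilter()
record_id = "abc-123"
if not seen.might_contain(record_id):
    # Definitely new: insert into PostgreSQL, then remember the ID.
    seen.add(record_id)
else:
    # Possibly a duplicate (false positives are possible): fall back to a DB check.
    pass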
You can start by cloning the starter apps: you could have a poller trigger an HTTP client processor that fetches your data, then go through your custom processor, and finally to a jdbc sink. Something like: stream create time --trigger.cron=<CRON_EXPRESSION> | httpclient --httpclient.url-expression=<remote_endpoint> | customProcessor | jdbc
One of the advantages of using SCDF is that you can independently scale your custom processor via deployment properties such as deployer.customProcessor.count=8.
Spring Cloud Data Flow builds integration streams for data based on Spring Cloud Stream, which, in turn, is fully based on Spring Integration. All the principles that exist in Spring Integration can be applied at the SCDF level as well.
It really might be the case that you won't be able to avoid some coding, but what you need is what EIP calls an Idempotent Receiver. And Spring Integration provides one for us:
@ServiceActivator(inputChannel = "processChannel")
@IdempotentReceiver("idempotentReceiverInterceptor")
public void handle(Message<?> message) {
    // process the message; duplicates are detected by the idempotentReceiverInterceptor
}

Need architecture hint: Data replication into the cloud + data cleansing

I need to sync customer data from several on-premise databases into the cloud. In a second step, the customer data there needs some cleanup in order to remove duplicates (of different types). Based on that cleansed data I need to do some data analytics.
To achieve this goal, I'm searching for an open source framework or cloud solution I can use for this. I took a look at Apache Apex and Apache Kafka, but I'm not sure whether these are the right solutions.
Can you give me a hint as to which frameworks you would use for such a task?
From my quick read on Apex, it requires Hadoop underneath, coupling you to more dependencies than you probably want early on.
Kafka, on the other hand, is used for transmitting messages (it has other APIs such as Streams and Connect which I'm not as familiar with).
I'm currently using Kafka to stream log files in real time from a client system. Out of the box Kafka really only provides fire-and-forget semantics. I have had to add a bit to make it an exactly-once delivery semantic (Kafka 0.11.0 should solve this).
Overall, think of Kafka as a more low-level solution with logical message domains and queues, whereas from what I skimmed, Apex is a more heavily packaged library with a lot more things to explore.
Kafka would allow you to switch out the underlying analytical system of your choosing via its consumer API.
The question is very generic, but I'll try to outline a few different scenarios, as there are many parameters in play here. One of them is cost, which on the cloud can quickly build up. Of course, the size of the data is also important.
These are a few things you should consider:
batch vs streaming: do the updates flow continuously, or is the process run on demand/periodically? (It sounds like the latter rather than the former.)
what latency is required? That is, what's the maximum time it should take for an update to propagate through the system? The answer to this question influences question 1.
how much data are we talking about? Are you in the gigabyte, terabyte, or petabyte range? Different tools have different 'maximum altitudes'.
and in what format? Do you have text files, or are you pulling from relational DBs?
Cleaning and deduping can be tricky in plain SQL. What language/tools are you planning to use for that part? Depending on question 3 (data size), deduping usually requires a join by ID, which is done in constant time in a key-value store, but requires a sort (generally O(n log n)) in most other data systems (Spark, Hadoop, etc.); see the sketch right after this list.
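To make the dedup-by-ID point concrete, this is what it looks like in Spark, one of the systems mentioned; the S3 paths and the customer_id column are placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("customer-dedup").getOrCreate()

# Placeholder input location for the raw customer extracts.
customers = spark.read.parquet("s3://my-bucket/raw/customers/")

# Keep one row per customer_id; this triggers a shuffle across the cluster,
# which is where the cost mentioned above comes from.
deduped = customers.dropDuplicates(["customer_id"])

deduped.write.mode("overwrite").parquet("s3://my-bucket/clean/customers/")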
So, while you ponder all these questions, if you're not sure, I'd recommend you start your cloud work with an elastic, pay-as-you-go solution rather than setting up entire clusters on the cloud, which could quickly become expensive.
One cloud solution that you could quickly fire up is Amazon Athena (https://aws.amazon.com/athena/). You can dump your data in S3, where it's read by Athena, and you just pay per query, so you don't pay when you're not using it. It is based on Apache Presto, so you could write the whole system using basically SQL (a small example is below).
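As a rough illustration of the pay-per-query model, this is approximately how you would run an Athena query from Python with boto3; the bucket, database, and table names are placeholders:

import time

import boto3

athena = boto3.client("athena")

execution = athena.start_query_execution(
    QueryString="SELECT customer_id, count(*) AS events FROM events GROUP BY customer_id",
    QueryExecutionContext={"Database": "analytics"},
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
)
query_id = execution["QueryExecutionId"]

# Poll until the query reaches a terminal state, then read the result set.
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    results = athena.get_query_results(QueryExecutionId=query_id)
    for row in results["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])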
Otherwise you could use Elastic MapReduce with Hive (http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hive.html) or Spark (http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark.html). It depends on what language/technology you're most comfortable with. There are also similar products from Google (BigQuery, etc.) and Microsoft (Azure).
Yes, you can use Apache Apex for your use case. Apache Apex comes with Apache Malhar, which can help you quickly build an application that loads data using the JDBC input operator and then either stores it in your cloud storage (maybe S3) or deduplicates it before storing it in any sink. It also provides a Dedup operator for exactly this kind of operation. But as mentioned in the previous reply, Apex does need Hadoop underneath to function.

Geolocation App Google Cloud

I've no experience with geolocation-based apps and want to build a geolocation-based app with a backend written in Node.js and running on Google Cloud.
My main problem is how to design the database and which DB I should use (Bigtable or Datastore). The main query is to find places within a given radius of a given location. I have read a lot about geohashes, but the Node.js libraries aren't so good right now.
So what do you recommend for choosing and designing the database?
If you want to store the data in relational format, perform frequent joins between locations/co-ordinates, and the amount of data being processed is small (under ~50 GB), then go for Google Cloud SQL.
Cloud Bigtable is ideal for storing very large amounts of single-keyed data with very low latency. It has great integration with most of the Apache projects.
If there is no requirement for the data to be in relational format, and frequent insertions and updates are required on huge amounts of data, go for Google Cloud Datastore. The querying process would be slightly different and difficult for a naive person to understand.
You can also use Google BigQuery, which processes TBs of data within a few seconds, if frequent insertions and updates are not required. It is more of a data store.
Have a look at the following URL for better insights: https://cloud.google.com/storage-options/
Google has also announced Cloud Spanner, which is a relational database service that offers great consistency and speed (still in beta). It is still at an early stage, but it can revolutionise the concepts of SQL vs NoSQL.
All of the above databases have querying libraries written for NodeJS.
GeoMesa, an Apache-licensed open source suite of tools that enables large-scale geospatial analytics, works with Cloud Bigtable. I don't know how well it will interact with Node.js, but it's worth considering a framework like GeoMesa, since it will likely let you focus more on your core product.
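If the Node.js geohash libraries mentioned in the question fall short, the encoding itself is simple enough to implement by hand. Here is a sketch of the standard algorithm, in Python for brevity; the bit-interleaving logic ports directly to Node.js. Nearby places share a common geohash prefix, so a radius query can be approximated with a prefix scan (e.g. a key-range scan in Bigtable or Datastore) followed by an exact distance filter on the candidates:

BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)

def geohash_encode(lat, lon, precision=7):
    # Interleave longitude/latitude bisection bits, 5 bits per base32 character.
    # Precision 7 corresponds to roughly 150 m x 150 m cells.
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    bits = []
    use_lon = True  # geohash starts with a longitude bit
    while len(bits) < precision * 5:
        rng, val = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2.0
        if val >= mid:
            bits.append(1)
            rng[0] = mid
        else:
            bits.append(0)
            rng[1] = mid
        use_lon = not use_lon
    chars = []
    for i in range(0, len(bits), 5):
        index = 0
        for bit in bits[i:i + 5]:
            index = (index << 1) | bit
        chars.append(BASE32[index])
    return "".join(chars)

print(geohash_encode(52.5200, 13.4050))  # e.g. Berlin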
