Spark Streaming join with GreenPlum/Postgres data: approach

What I have?
I have a Spark Streaming application (consuming from Kafka) on a Hadoop cluster that aggregates users' clicks and some actions done on a web site every 5 minutes and converts them into metrics.
I also have a table in GreenPlum (on its own cluster) with user data that may get updated. The table is filled using Logical Log Streaming Replication via Kafka and holds 100 million users.
What I want?
I want to join the Spark streams with the static data from GreenPlum every 1 or 5 minutes and then aggregate the data using attributes from the static table, e.g. user age.
Notes
I definitely don't need to read all records from the users table: there is a fairly stable core segment, plus a number of new users registering each minute.
Currently I use PySpark 2.1.0.
My solutions
1) Copy the data from the GreenPlum cluster to the Hadoop cluster and save it as ORC/Parquet files. Every 5 minutes, add new files for new users. Once a day, reload all files.
2) Create a new DB on Hadoop and set up log replication via Kafka, as is done for GreenPlum. Read the data from that DB and use the built-in Spark Streaming joins.
3) Read the data from GreenPlum into a Spark cache and join the stream data with the cache (see the sketch after this list).
4) Every 5 minutes, save/append new user data to a file and ignore old user data. Store an extra column, e.g. last_action, to truncate this file if a user wasn't active on the web site during the last 2 weeks. Then join this file with the stream.
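For illustration, here is a minimal PySpark sketch of solution 3: a stream-static join against a cached JDBC read. The connection details, topic name, and schema are hypothetical, and the cached side stays stale until it is explicitly reloaded:

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("clicks-enrichment").getOrCreate()

# Static side: user attributes read from GreenPlum over JDBC and cached.
# The connection details, table, and column names are hypothetical.
users = (spark.read.format("jdbc")
         .option("url", "jdbc:postgresql://gp-master:5432/analytics")
         .option("dbtable", "public.users")
         .option("user", "spark_reader")
         .option("password", "secret")
         .load()
         .select("user_id", "age")
         .cache())

# Streaming side: click events from Kafka (requires the
# spark-sql-kafka-0-10 package); topic and JSON layout are hypothetical.
click_schema = StructType([
    StructField("user_id", StringType()),
    StructField("action", StringType()),
    StructField("ts", TimestampType()),
])
clicks = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "clicks")
          .load()
          .select(F.from_json(F.col("value").cast("string"), click_schema).alias("e"))
          .select("e.*"))

# Stream-static join, then aggregation on an attribute of the static table.
by_age = clicks.join(users, "user_id").groupBy("age").count()

query = (by_age.writeStream
         .outputMode("complete")
         .format("console")
         .trigger(processingTime="5 minutes")
         .start())
query.awaitTermination()
```

To pick up newly registered users, the cached DataFrame has to be unpersisted and re-read on a schedule, which is essentially what solutions 1 and 4 automate at the file level.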
Questions
Which of these solutions is more suitable for an MVP? For production?
Are there any better solutions/best practices for this sort of problem? Any literature?

Spark Streaming reading data from a cache like Apache Geode makes this better. I used this approach in a real-time fraud use case. In a nutshell: I had features generated on Greenplum Database using historical data. The feature data and some decision-making lookup data were pushed into Geode. The features were periodically refreshed (at a 10 minute interval) and then refreshed in Geode. The Spark streaming scoring job constantly scored the transactions as they came in, without reading from Greenplum. The Spark streaming job also put the score into Geode, which was synced back to Greenplum by a different thread. I had Spark Streaming running on Cloud Foundry using K8s. This is very high level, but it should give you an idea.

You might want to check out the GPDB Spark Connector --
http://greenplum-spark-connector.readthedocs.io/en/latest/
https://greenplum-spark.docs.pivotal.io/130/index.html
You can load data directly from the segments into Spark.
Currently, if you want to write back to GPDB, you need to use a standard JDBC connection to the master.
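A hedged sketch of both directions, assuming the data source name and option keys documented in the links above; the connection details and table names are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gpdb-connector-demo").getOrCreate()

# Parallel read from the GreenPlum segments. The data source name and
# option keys follow the connector docs linked above; connection details,
# table, and column names are hypothetical.
users = (spark.read.format("greenplum")
         .option("url", "jdbc:postgresql://gp-master:5432/analytics")
         .option("user", "spark_reader")
         .option("password", "secret")
         .option("dbtable", "public.users")
         .option("partitionColumn", "user_id")
         .load())

# Writing back currently goes through plain JDBC to the master.
(users.limit(100).write.format("jdbc")
 .option("url", "jdbc:postgresql://gp-master:5432/analytics")
 .option("dbtable", "public.users_sample")
 .option("user", "spark_writer")
 .option("password", "secret")
 .mode("append")
 .save())
```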

Related

Process real-time data using Kafka

I have a requirement to implement a solution for the use case below.
Currently, applications store data in a Postgres database, but Postgres is facing storage issues. The plan is to move the data from Postgres to Hadoop, with near-real-time data available in Hadoop. So we thought of the following solution:
1) Write a Kafka producer application to listen to the Postgres tables, capture changing data, and write it to a Kafka topic.
2) Write a Kafka sink application to read from the Kafka topic and write to Hive tables (Parquet, external tables, partitioned and non-partitioned). For non-partitioned tables, if we want to apply updates/deletes, we need to touch the whole table in Spark code, right? That would degrade performance for every record coming from the Kafka topic. We have already developed a Sqoop incremental job which runs every 5 minutes to do the same, but the client needs real-time data in Hadoop, so Kafka + Spark processing came into the discussion.
Could you provide pros and cons for step 2 compared to the Sqoop incremental job?
Please share code snippets/links, if any, that would help my thought process.
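For concreteness, a minimal PySpark Structured Streaming sketch of what step 2 could look like; the broker, topic, schema, and paths are hypothetical, and note that it only appends new files rather than solving in-place updates/deletes:

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("pg-cdc-to-hdfs").getOrCreate()

# Change events from the CDC topic; the topic name and JSON layout are
# hypothetical (a real Debezium payload is more deeply nested).
schema = StructType([
    StructField("id", StringType()),
    StructField("op", StringType()),          # insert/update/delete marker
    StructField("updated_at", TimestampType()),
])

changes = (spark.readStream.format("kafka")
           .option("kafka.bootstrap.servers", "broker:9092")
           .option("subscribe", "pg.public.orders")
           .load()
           .select(F.from_json(F.col("value").cast("string"), schema).alias("c"))
           .select("c.*")
           .withColumn("dt", F.to_date(F.col("updated_at"))))

# Append-only write to date-partitioned parquet. This does NOT apply
# updates/deletes in place: parquet files are immutable, which is exactly
# the limitation raised above, so compaction/merging has to happen downstream.
query = (changes.writeStream
         .format("parquet")
         .option("path", "hdfs:///data/orders")
         .option("checkpointLocation", "hdfs:///checkpoints/orders")
         .partitionBy("dt")
         .start())
query.awaitTermination()
```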
Getting data into Kafka is easy - use Debezium.
For getting it out...
I wouldn't use Hive at all for this. Real-time data (depending on the volume, obviously) results in tiny files in HDFS, and subsequently Hive queries become slower and slower over time.
Hive is not a replacement for Postgres. In fact, the Hive metastore requires a relational database still, such as Postgres.
I also wouldn't use Spark. You would have to write code for something that is already a solved problem with other tools: ingesting Kafka topics into queryable formats.
Popular options include Apache Pinot, Druid, or Apache Iceberg storage with Presto (some of which may overlap with HDFS storage, but will be much, much faster than Hive to query). Only the third option requires writing Kafka consumer code; the other two have native Kafka ingestion.
And even still, if you're stuck with HDFS, the Kafka Connect framework ships with Kafka. There's an HDFS sink plugin, written by Confluent, which supports Hive integration.

Spark Streaming: in-memory aggregation - correct usage

I have a Spark 2.2 Structured Streaming flow from an on-premise system into a containerized cloud Spark cluster, where Kafka receives the data and SSS maintains a number of queries that flush to disk every ten seconds. A query's console sink is not accessible to external sessions outside the streaming context (hence the CSV flush); the monitoring dashboard runs Spark SQL from another context to get metrics.
Right now I am only aggregating the data that has come in since streaming was last started. Now I need to aggregate data since forever together with the incoming streaming data, to provide (near-) real-time views. This will mean running a bunch of GROUP BYs over billions of records, maintaining several million aggregate rows in memory.
My question is about how Spark streaming queries can scale like this: how efficient is the memory usage (I'll probably use 32 worker containers), and is this the correct way to maintain a (near-) real-time view of incoming data using Kafka and SSS?
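For reference, an unbounded streaming aggregation in SSS keeps one state row per distinct grouping key, so memory grows with key cardinality rather than total record count. A minimal sketch, with hypothetical broker, topic, and value layout:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("running-aggregates").getOrCreate()

# Broker, topic, and the "key,amount" CSV value layout are hypothetical.
events = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "events")
          .load()
          .selectExpr("split(CAST(value AS STRING), ',')[0] AS key",
                      "CAST(split(CAST(value AS STRING), ',')[1] AS DOUBLE) AS amount"))

# An unbounded GROUP BY (no watermark): Spark keeps one state row per
# distinct key forever, so state size tracks key cardinality, not the
# number of records ever processed.
totals = events.groupBy("key").agg(F.sum("amount").alias("total"),
                                   F.count("*").alias("n"))

query = (totals.writeStream
         .outputMode("complete")
         .queryName("totals")
         .format("memory")   # queryable via SQL from this session only
         .trigger(processingTime="10 seconds")
         .start())
```

In Spark 2.2 the default state store keeps this state on the executor heap (checkpointed to fault-tolerant storage), so the several million aggregate rows must collectively fit in executor memory across the worker containers.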

Kafka to Spark batch processing

I'm looking for an optimal data architecture.
I'm dealing with time-series data that is flushed from a Redis database to an OpenTSDB database each week.
OpenTSDB stores its data in HBase, which runs on a Hadoop cluster.
The time-series data available in OpenTSDB then has to be batch processed (at a 1-6 month interval).
Knowing that OpenTSDB data is stored in binary large object format in HBase, I can't currently tackle the HBase HTTP API.
Since Spark cannot directly access the OpenTSDB API (while Kafka seems to be okay with the HTTP API), I'm facing architecture issues, which can be expressed as follows. Would it be more convenient to:
Use Apache Kafka to extract the batch data (terabytes) and use it as a pipeline to ingest the data into Spark DataFrames for analysis?
Flush the Redis data directly into HBase and then use Spark directly on it?
That said, I want to be sure that Spark can handle terabyte-scale batch analytics, and that Kafka can handle that amount, before loading it as a Spark RDD.
Any suggestions or help will be welcome. Thanks.
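Worth noting: if the Kafka route is chosen, Spark (2.2+) can also read a topic as a bounded batch DataFrame, so the Kafka pipeline does not force streaming semantics. A minimal sketch with hypothetical broker, topic, and paths:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-batch-extract").getOrCreate()

# Bounded batch read of a Kafka topic; broker, topic, and paths are
# hypothetical. Pinning offsets makes the job re-runnable over a fixed window.
ts = (spark.read.format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "opentsdb-export")
      .option("startingOffsets", "earliest")
      .option("endingOffsets", "latest")
      .load()
      .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "timestamp"))

ts.write.mode("overwrite").parquet("hdfs:///staging/opentsdb-export")
```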

Running interactive SQL queries over millions of parquet files

I have millions of streaming parquet files being written. I want to support running ad hoc interactive queries for debugging and analytics purposes (added bonus if I can run streaming queries for some real-time monitoring of key metrics as well).
What is a scalable solution for supporting this?
The two approaches I have observed are: running Spark SQL interactively over millions of parquet files (I'm not too familiar with the Spark ecosystem, but does this mean running a Spark job for every SQL query a user submits, or do I need to run some streaming job and submit queries somehow?), and using a Presto SQL engine on top of parquet (I'm not exactly sure how Presto ingests newly incoming parquet files).
Any recommendations, or pros and cons of either approach? Any better solutions, considering I have >~10 TB of data produced every day?
Let me address your use cases:
Support running ad hoc interactive queries for debugging and analytics purposes
I would recommend building a Presto cluster if you care about minimizing the latency of your queries and are willing to invest in many machines with a large amount of memory.
Reason: Presto would run fully in memory without touching disk (in most cases).
A Spark cluster can also do the job; however, it won't be as fast as Presto. The advantage of Spark over Presto is its fault-tolerance capabilities and its ability to spill to disk under out-of-memory conditions, which may be important for you given how much data you have.
Run streaming queries for some real-time monitoring of key metrics as well
As long as you have basic queries, you can build dashboards on top of Presto which could run these queries every x minutes.
Having a considerable amount of processing may be a good reason to look at Spark streaming if real-time monitoring is important.
If it isn't, then you could build an ETL job (using Spark) to calculate your metrics, store the data as a new Hive table, and then expose it for querying via Presto/SparkSQL again.
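A minimal PySpark sketch of such an ETL; the paths, columns, and table names are hypothetical:

```python
from pyspark.sql import SparkSession, functions as F

spark = (SparkSession.builder
         .appName("metrics-etl")
         .enableHiveSupport()
         .getOrCreate())

# Batch ETL over the raw parquet files; paths, columns, and table names
# are hypothetical. The result is a small Hive table that Presto or
# SparkSQL can query cheaply.
raw = spark.read.parquet("hdfs:///data/events")

daily = (raw.groupBy(F.to_date(F.col("ts")).alias("dt"), "metric")
         .agg(F.sum("value").alias("total")))

daily.write.mode("overwrite").saveAsTable("analytics.daily_metrics")
```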
How does Presto ingest new incoming parquet files?
I'm not aware of your architecture, but in any case, you need to provide Presto with a Hive connection (a Hive Metastore, to be precise).
Hive provides Presto with schemas attached to the directories where you ingest your data. Presto sees the new data dynamically by default. Spark is no different, by the way.
Presto has nothing to do with data ingestion. It only starts its job once the data is there.
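To illustrate that hookup, a hedged sketch of registering an ingestion directory in the Hive Metastore from Spark, with hypothetical table name, schema, and location; files landing in already-registered partitions become visible immediately, while new partition directories need to be registered, e.g. with the repair statement below:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("register-ingest-dir")
         .enableHiveSupport()
         .getOrCreate())

# Table name, schema, and location are hypothetical. Files landing in
# already-registered partitions are visible to Presto immediately; new
# partition directories must be registered, e.g. with the repair below.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS analytics.events (
        user_id STRING,
        value   DOUBLE
    )
    PARTITIONED BY (dt STRING)
    STORED AS PARQUET
    LOCATION 'hdfs:///data/events'
""")
spark.sql("MSCK REPAIR TABLE analytics.events")
```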

Spark as Data Ingestion/Onboarding to HDFS

While exploring various tools like NiFi, Gobblin, etc., I have observed that Databricks is now promoting Spark for data ingestion/onboarding.
We have a Spark [Scala] based application running on YARN. So far we have been working on a Hadoop and Spark cluster where we manually place the required data files in HDFS first and then run our Spark jobs later.
Now that we are planning to make our application available to clients, we expect files of any type and number (mainly CSV, JSON, XML, etc.), from any data source (FTP, SFTP, any relational or NoSQL database), of huge size (ranging from GB to PB).
Keeping this in mind, we are looking for options which could be used for data onboarding and data sanity checks before pushing the data into HDFS.
The options we are looking at, by priority:
1) Spark for data ingestion and sanity: As our application is written in Spark and runs on a Spark cluster, we plan to use it for the data ingestion and sanity tasks as well.
We are a bit worried about Spark's support for the many data sources/file types/etc. Also, we are not sure whether, if we try to copy data from, say, an FTP/SFTP source, all workers will write the data to HDFS in parallel. Are there any limitations when using it? Does Spark maintain any audit trail during this data copy?
2) NiFi in clustered mode: How good would NiFi be for this purpose? Can it be used for any data source and any size of file? Will it maintain an audit trail? Would NiFi be able to handle such large files? How large a cluster would be required if we try to copy GB to PB of data and perform certain sanity checks on top of that data before pushing it into HDFS?
3) Gobblin in clustered mode: I would like to hear answers similar to those for NiFi.
4) Is there any other good option available for this purpose, with less infrastructure/cost involved and better performance?
Any guidance/pointers/comparisons for the above-mentioned tools and technologies would be appreciated.
Best Regards,
Bhupesh
After doing some R&D, and considering the fact that using NiFi or Gobblin would demand more infrastructure cost, I started testing Spark for data onboarding.
So far I have tried using a Spark job to import data (present at a remote staging area/node) into my HDFS, and I was able to do that by mounting the remote location on all my Spark cluster worker nodes. Doing this made that location local to the workers, so the Spark job ran properly and the data was onboarded to my HDFS.
Since my whole project is going to be on Spark, keeping the data-onboarding part on Spark doesn't cost me anything extra, and so far it is going well. Hence I would suggest to others as well: if you already have a Spark cluster and a Hadoop cluster up and running, then instead of adding extra cost (where cost could be a major constraint), go with a Spark job for data onboarding.
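A minimal sketch of this pattern; the mount point, file layout, and sanity rule are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("onboard-staging").getOrCreate()

# The staging directory is assumed to be mounted at the same path on every
# worker, so a plain file:// read parallelizes across the cluster.
# The paths and the sanity rule are hypothetical.
df = (spark.read
      .option("header", "true")
      .csv("file:///mnt/staging/incoming/*.csv"))

clean = df.dropna(subset=["id"])  # stand-in for real sanity checks

clean.write.mode("append").parquet("hdfs:///data/onboarded")
```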
