How to do an incremental load/upsert in spark-redshift - apache-spark

I have an ETL pipeline where data comes from Redshift, is read into (py)spark dataframes, calculations are performed, and the result is written back to a target in Redshift. So the flow is: Redshift source schema --> Spark 3.0 --> Redshift target schema. This is done on EMR using the spark-redshift library provided by Databricks. But my data has millions of records, and doing a full load every time is not a good option.
How can I perform incremental loads/upserts with the spark-redshift library? The option I wanted to go with is Delta Lake (open source and ACID-guaranteed), but we cannot simply read and write Delta files to Redshift Spectrum using the Delta Lake integration.
Please guide me on how I can achieve this, and whether there are any alternatives.
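One commonly suggested workaround, short of Delta Lake, is a staging-table merge driven by the connector's query/preactions/postactions options. A minimal sketch, assuming those options are available and that `incremental_df` holds only the new or changed rows; all table, column, bucket and role names below are placeholders:

```python
# Hypothetical connection details.
jdbc_url = "jdbc:redshift://my-cluster:5439/analytics?user=etl&password=..."
tempdir = "s3a://my-bucket/redshift-temp/"

# 1. Read only rows changed since the last successful load (high-watermark pattern;
#    `updated_at` is an assumed watermark column).
incremental_df = (spark.read
    .format("com.databricks.spark.redshift")
    .option("url", jdbc_url)
    .option("query", "SELECT * FROM source_events WHERE updated_at > '2023-01-01'")
    .option("tempdir", tempdir)
    .load())

# ... perform your calculations on incremental_df here ...

# 2. Write the result to a staging table, then merge it into the target
#    with a delete-and-insert executed by the connector after the COPY.
merge_sql = """
    DELETE FROM events USING events_staging
        WHERE events.event_id = events_staging.event_id;
    INSERT INTO events SELECT * FROM events_staging;
    DROP TABLE events_staging;
"""
(incremental_df.write
    .format("com.databricks.spark.redshift")
    .option("url", jdbc_url)
    .option("dbtable", "events_staging")
    .option("tempdir", tempdir)
    .option("aws_iam_role", "arn:aws:iam::123456789012:role/redshift-copy")
    .option("postactions", merge_sql)
    .mode("overwrite")
    .save())
```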

Related

How to use Delta Table as sink for stateful Spark Structured Streaming

Recently I have been working with Spark stateful streaming (mapGroupsWithState and flatMapGroupsWithState). As I'm working on Databricks, I'm trying to write the results of the stateful streaming to a Delta table, but it is not possible in any of the output modes (complete, append, update). Ideally, I would like to store all states in memory and save periodic snapshots to a Delta table. Do you have any idea how to achieve that?
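One common workaround is to wrap a batch Delta write in foreachBatch, so each emitted micro-batch of state is appended as a snapshot. A minimal sketch, assuming the stateful stream is available as `stateful_df`; the paths are placeholders:

```python
def write_snapshot(batch_df, batch_id):
    # Append each micro-batch of emitted state as a snapshot to the Delta table.
    (batch_df.write
        .format("delta")
        .mode("append")
        .save("/mnt/delta/state_snapshots"))

query = (stateful_df.writeStream
    .outputMode("update")
    .foreachBatch(write_snapshot)
    .option("checkpointLocation", "/mnt/checkpoints/state_snapshots")
    .start())
```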

Need to load data from Hadoop to Druid after applying transformations. If I use Spark, can I load data from a Spark RDD or DataFrame to Druid directly?

I have data present in Hive tables. I want to apply a bunch of transformations before loading that data into Druid. There are a couple of ways, but I'm not sure about them:
1. Save the table after applying the transformations and then bulk load it through the Hadoop ingestion method. But I want to avoid the extra write on the server.
2. Use Tranquility. But it is for Spark Streaming and only for Scala and Java, not for Python. Am I right on this?
Is there any other way I can achieve this?
You can achieve this by using the Druid Kafka integration.
I think you should read the data from the tables in Spark, apply the transformations, and then write it back to a Kafka stream.
Once you set up the Druid Kafka integration, it will read the data from Kafka and push it to the Druid datasource.
Here is the documentation on the Druid Kafka integration: https://druid.apache.org/docs/latest/tutorials/tutorial-kafka.html
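A rough PySpark sketch of that flow (the spark-sql-kafka-0-10 package is required; the table, topic and broker names are placeholders):

```python
from pyspark.sql import functions as F

df = spark.table("warehouse.events")                            # read the Hive table
transformed = df.withColumn("load_ts", F.current_timestamp())   # example transformation

# Serialize each row as JSON and publish it to the topic that the
# Druid Kafka ingestion spec reads from.
(transformed
    .select(F.to_json(F.struct(*transformed.columns)).alias("value"))
    .write
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")
    .option("topic", "druid-ingest")
    .save())
```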
(Disclaimer: I am a contributor to rovio-ingest.)
With rovio-ingest you can batch ingest a Hive table to Druid with Spark. This avoids the extra write.

Spark to Hive to HDFS vs Spark to HDFS directly with Hive on top of it?

Summary of the problem:
I have a particular use case: writing >10 GB of data per day to HDFS via Spark Streaming. We are currently in the design phase. We want to write the data to HDFS (a constraint) using Spark Streaming. The data is columnar.
We have 2 options (so far):
Naturally, I would like to use the Hive context to feed the data to HDFS. The schema is defined and the data is fed in batches or row-wise.
There is another option. We can write the data directly to HDFS via the Spark Streaming API. We are also considering this because in this use case we can then query the data from HDFS through Hive. This will leave options open to use other technologies in the future for new use cases that may come.
Which is best?
Spark Streaming -> Hive -> HDFS -> Consumed by Hive.
VS
Spark Streaming -> HDFS -> Consumed by Hive , or other technologies.
Thanks.
So far I have not found a discussion on this topic; my research may fall short. If there is any article you can suggest, I would be happy to read it.
I have a particular use case to write >10 GB of data per day and the data is columnar
That means you are storing day-wise data. If that's the case, Hive can have a partition column on the date, so that you can query the data for each day easily. You can query the raw data from BI tools like Looker or Presto or any other BI tool. If you are querying from Spark, then you can use Hive features/properties. Moreover, if you store the data in a columnar format such as Parquet, Impala can query the data using the Hive metastore.
If your data is columnar, consider Parquet or ORC.
Regarding option 2:
If Hive is an option, there is no need to feed the data into HDFS and then create an external table from Hive to access it.
Conclusion:
I feel both are about the same, but Hive is preferred, considering direct querying of the raw data using BI tools or Spark. From HDFS we can also query the data using Spark; if it is there in formats like JSON, Parquet, or XML, there won't be an added advantage for option 2.
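Either way, the mechanics discussed above (date partitioning, columnar files, Hive access) look roughly like this in PySpark; a minimal sketch, assuming the stream is available as `events_df` and that all paths, columns and table names are placeholders:

```python
# Stream columnar files to HDFS, partitioned by date.
query = (events_df.writeStream
    .format("parquet")
    .option("path", "hdfs:///data/events")
    .option("checkpointLocation", "hdfs:///checkpoints/events")
    .partitionBy("event_date")
    .start())

# One-off DDL so Hive (or Spark SQL with Hive support) can query the same files.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS events (id STRING, payload STRING)
    PARTITIONED BY (event_date DATE)
    STORED AS PARQUET
    LOCATION 'hdfs:///data/events'
""")
# New daily partitions still need to be registered, e.g. MSCK REPAIR TABLE events.
```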
It depends on your final use cases. Please consider the two scenarios below while making the decision:
If you have a real-time/near-real-time case and all your data is a full refresh, then I would suggest going with the second approach, Spark Streaming -> HDFS -> Consumed by Hive. It will be faster than your first approach, Spark Streaming -> Hive -> HDFS -> Consumed by Hive, since there is one less layer in it.
If your data is incremental and also has multiple update and delete operations, then it will be difficult to use HDFS, or Hive over HDFS, with Spark, since Spark does not allow you to update or delete data in HDFS. In that case both of your approaches will be difficult to implement. You can either go with a Hive managed table and do updates/deletes using HQL (only supported in the Hortonworks Hive version), or go with a NoSQL database like HBase or Cassandra so that Spark can do upserts and deletes easily (see the sketch below). From a programming perspective, it will also be easier compared to both of your approaches.
If you dump the data into NoSQL, you can still use Hive over it for normal SQL or reporting purposes.
There are many tools and approaches available, but go with the one that fits all your cases. :)
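For the NoSQL route mentioned above, a hedged sketch of how the upsert looks from Spark with the spark-cassandra-connector (keyspace and table names are placeholders); Cassandra treats writes that share a primary key as upserts, so an append is enough:

```python
# Appending is sufficient: rows with an existing primary key are overwritten.
(incremental_df.write
    .format("org.apache.spark.sql.cassandra")
    .options(keyspace="analytics", table="events")
    .mode("append")
    .save())
```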

Parquet with Athena VS Redshift

I hope someone out there can help me with this issue. I am currently working on a data pipeline project, and my current dilemma is whether to use Parquet with Athena or to store the data in Redshift.
2 Scenarios:
First,
EVENTS --> STORE IT IN S3 AS JSON.GZ --> USE SPARK(EMR) TO CONVERT TO PARQUET --> STORE PARQUET BACK INTO S3 --> ATHENA FOR QUERY --> VIZ
Second,
EVENTS --> STORE IT IN S3 --> USE SPARK(EMR) TO STORE DATA INTO REDSHIFT
Issues with this scenario:
Spark JDBC with Redshift is slow
The spark-redshift repo by Databricks has a failing build and was last updated 2 years ago
I am unable to find useful information on which method is better. Should I even use Redshift, or is Parquet good enough?
Also, it would be great if someone could tell me whether there are any other methods for connecting Spark with Redshift, because there are only 2 solutions that I saw online - JDBC and spark-redshift (Databricks).
P.S. The pricing model is not a concern to me; also, I'm dealing with millions of events' worth of data.
Here are some ideas / recommendations
Don't use JDBC.
Spark-Redshift works fine but is a complex solution.
You don't have to use Spark to convert to Parquet; there is also the option of using Hive. See https://docs.aws.amazon.com/athena/latest/ug/convert-to-columnar.html
Athena is great when used against Parquet, so you don't need to use Redshift at all.
If you want to use Redshift, then use Redshift Spectrum to set up a view against your Parquet tables, then if necessary a CTAS within Redshift to bring the data in if you need to.
AWS Glue Crawler can be a great way to create the metadata needed to map the Parquet into Athena and Redshift Spectrum.
My proposed architecture:
EVENTS --> STORE IT IN S3 --> HIVE to convert to parquet --> Use directly in Athena
and/or
EVENTS --> STORE IT IN S3 --> HIVE to convert to parquet --> Use directly in Redshift using Redshift Spectrum
You MAY NOT need to convert to Parquet: if you use the right partitioning structure (S3 folders) and gzip the data, then Athena/Spectrum performance can be good enough without the complexity of conversion to Parquet. This depends on your use case (the volume of data and the types of queries you need to run).
Which one to use depends on your data and access patterns. Athena directly uses S3 key structure to limit the amount of data to be scanned. Let's assume you have event type and time in events. The S3 keys could be e.g. yyyy/MM/dd/type/* or type/yyyy/MM/dd/*. The former key structure allows you to limit the amount of data to be scanned by date or date and type but not type alone. If you wanted to search only by type x but don't know the date, it would require a full bucket scan. The latter key schema would be the other way around. If you mostly need to access the data just one way (e.g. by time), Athena might be a good choice.
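As a small illustration of that key-structure point, here is a hedged sketch of writing partitioned Parquet from Spark so that Athena can prune on those columns; the bucket and column names are placeholders, and note that Spark emits Hive-style year=2020/... prefixes rather than bare yyyy/MM/dd:

```python
# Partition the S3 layout by the columns you filter on most often, so Athena
# only scans the matching prefixes.
(events_df.write
    .partitionBy("year", "month", "day", "type")   # e.g. .../year=2020/month=01/day=15/type=click/
    .mode("append")
    .parquet("s3a://my-events-bucket/events/"))
```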
On the other hand, Redshift is a PostgreSQL-based data warehouse which is much more complicated and flexible than Athena. The data partitioning plays a big role in terms of performance, but the schema can be designed in many ways to suit your use case. In my experience the best way to load data into Redshift is to first store it in S3 and then use COPY (https://docs.aws.amazon.com/redshift/latest/dg/r_COPY.html). It is orders of magnitude faster than JDBC, which I found good only for testing with small amounts of data. This is also how Kinesis Firehose loads data into Redshift. If you don't want to implement the S3 copying yourself, Firehose provides an alternative for that.
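A minimal sketch of that S3-then-COPY pattern, using psycopg2 to issue the COPY against Redshift; the cluster endpoint, credentials, IAM role, table and S3 path are all placeholders:

```python
import psycopg2

# Connect to the Redshift cluster (endpoint and credentials are placeholders).
conn = psycopg2.connect(
    host="my-cluster.abc123.eu-west-1.redshift.amazonaws.com",
    port=5439, dbname="analytics", user="loader", password="...")

copy_sql = """
    COPY public.events
    FROM 's3://my-events-bucket/staging/events/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy'
    FORMAT AS PARQUET;
"""

with conn, conn.cursor() as cur:
    cur.execute(copy_sql)   # the with-block commits on success
```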
There are a few details missing in the question: how would you manage incremental upserts in the data pipeline?
If you have implemented a Slowly Changing Dimension (SCD type 1 or 2), the same can't be managed using Parquet files, but it can be managed easily in Redshift.

Importing blob data from RDBMS (Sybase) to Cassandra

I am trying to import large blob data (around 10 TB) from an RDBMS (Sybase ASE) into Cassandra, using DataStax Enterprise (DSE) 5.0.
Is Sqoop still the recommended way to do this in DSE 5.0? As per the release notes (http://docs.datastax.com/en/latest-dse/datastax_enterprise/RNdse.html):
Hadoop and Sqoop are deprecated. Use Spark instead. (DSP-7848)
So should I use Spark SQL with a JDBC data source to load the data from Sybase, and then save the DataFrame to a Cassandra table?
Is there a better way to do this? Any help/suggestions will be appreciated.
Edit: As per the DSE documentation (http://docs.datastax.com/en/latest-dse/datastax_enterprise/spark/sparkIntro.html), writing to blob columns from Spark is not supported.
The following Spark features and APIs are not supported:
Writing to blob columns from Spark
Reading columns of all types is supported; however, you must convert collections of blobs to byte arrays before serialising.
Spark is preferred for the ETL of large data sets because it performs a distributed ingest. Oracle data can be loaded into Spark RDDs or DataFrames and then you can just use saveToCassandra(keyspace, tablename). Cassandra Summit 2016 had a presentation, Using Spark to Load Oracle Data into Cassandra by Jim Hatcher, which discusses this topic in depth and provides examples.
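A hedged PySpark sketch of the JDBC-read-then-save approach from the question; the driver class, connection URL, table, key bounds, keyspace and table names are placeholders, the Sybase jConnect jar and the spark-cassandra-connector are assumed to be on the classpath, and the blob-column caveat from the question's edit still applies:

```python
# Partition the JDBC read so the ~10 TB scan is spread across executors.
docs_df = (spark.read
    .format("jdbc")
    .option("url", "jdbc:sybase:Tds:sybase-host:5000/docsdb")
    .option("driver", "com.sybase.jdbc4.jdbc.SybDriver")
    .option("dbtable", "dbo.documents")
    .option("partitionColumn", "doc_id")
    .option("lowerBound", "1")
    .option("upperBound", "100000000")
    .option("numPartitions", "200")
    .load())

# Save the DataFrame to the Cassandra table.
(docs_df.write
    .format("org.apache.spark.sql.cassandra")
    .options(keyspace="docs", table="documents")
    .mode("append")
    .save())
```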
Sqoop is deprecated but should still work in DSE 5.0. If it's a one-time load and you're already comfortable with Sqoop, try that.
