Hadoop Data Ingestion - apache-spark

I have a below requirement:
There's an upstream system which makes a key-entry in a database table. That entry indicates that a set of data is available in a database-table (oracle). We have to ingest this data and save it as a parquet file. No processing of data is required. This ingestion process should start everytime a new key-entry is available.
For this problem statement, we have planned to have a database poller which polls for the key-entry. After reading that entry, we need to ingest the data from an Oracle table. For this ingestion purpose, which tool is best? Is it Kafka, Sqoop, Spark-SQL etc.,? Please help.
Also we need to ingest csv files too. Only when a file is completely written, then only we have to start ingesting it. Please let me know how to perform this as well.

For Ingesting relational Data you can use sqoop, and for you scenario you can have look at https://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html#_incremental_imports
write sqoop incremental job and schedule it using cron , each time sqoop job will execute you will have updated data in hdfs.
For .csv files you can use flume. refer,
https://www.rittmanmead.com/blog/2014/05/trickle-feeding-webserver-log-files-to-hdfs-using-apache-flume/

With Sqoop you can import data from database in your Hadoop File System.

Related

How to do an "overwrite" output mode using spark structured streaming without deleting all the data and the checkpoint

I have this delta lake in ADLS to sink data through spark structured streaming. We usually append new data from our data source to our delta lake, but there are some cases when we find errors in the data that we need to reprocess everything. So what we do is delete all the data and the checkpoints and re-run the pipeline, having the correct data inside our ADLS.
But the problem with this approach is that the end-users stay one whole day without the data to analyze (because we need to delete it to re-run). So, to fix that, I would like to know if there's a way to do an "overwrite" output using the structured streaming so we can overwrite the data into a new delta version, and the end-user could still query the data using the current version.
I don't know if it's possible using streaming, but I would like to know if anyone had a similar problem and how you went to solve it :)
Thanks!

Azure Databricks: Switching from batch to streaming mode

Hello spark community,
Currently we have a batch pipeline in azure databricks that reads from a delta table. Job is run every night once per day. We extract its data, save it in our own location as delta table again and then we write to azure sql database table. Everything is partioned on date like this: Date=2021-01-01, etc.
Things are about to change now since our source delta table is about to get refreshed every 2-3 minutes and the requirement is to change the ETL from nightly batch to streaming mode but still using the same source and target tables as well as the same sql database table.
Right now this streaming challenge is imposing several questions:
Our delta source table is quite huge (30B+ rows), so far in our nightly batch we extracted only the changed date keys and used MERGE INTO to write the updates/inserts, however once we switch to streaming mode is it possible to tell spark to start streaming from a specific point in time since we do not want to load the whole table again this time with streaming mode?
How do you write a stream to a target delta table with MERGE INTO having to check huge amount of keys on both sides. I suppose we can get use of a foreachBatch and do that in a micro batch manner, but still how each new micro batch to check for existing/non-existing keys in our target delta table in a time-efficient manner (30B+ rows in the current target table)?
Not sure about that but is it possible for a streaming spark job to write directly to a sql database (azure) and will this become a bottleneck situation hence not advisable to be done?
I am really looking forward for some good advice and design decisions since this would be quite a big issue with the current data size. Appreciate every opinion here!

Can Apache Spark be used in place of Sqoop

I have tried connecting spark with JDBC connections to fetch data from MySQL / Teradata or similar RDBMS and was able analyse the data.
Can spark be used to store the data to HDFS?
Is there any possibility for spark outperforming
the activities of Sqoop.
Looking for you valuable answers and explanations.
There are two main things about Sqoop and Spark. The main difference is Sqoop will read the data from your RDMS doesn't matter what you have and you don't need to worry much about how you table is configured.
With Spark using JDBC connection is a little bit different how you need to load the data. If your database doesn't have any column like numeric ID or timestamp Spark will load ALL the data in one single partition. And then will try to process and save. If you have one column to use as partition than Spark sometimes can be even faster than Sqoop.
I would recommend you to take a look in this doc.enter link description here
The conclusion is, if you are going to do a simple export and that need to be done daily with no transformation I would recommend Sqoop to be simple to use and will not impact your database that much. Using Spark will work well IF your table is ready for that, besides that goes with Sqoop

Spark Streaming to Hive, too many small files per partition

I have a spark streaming job with a batch interval of 2 mins(configurable).
This job reads from a Kafka topic and creates a Dataset and applies a schema on top of it and inserts these records into the Hive table.
The Spark Job creates one file per batch interval in the Hive partition like below:
dataset.coalesce(1).write().mode(SaveMode.Append).insertInto(targetEntityName);
Now the data that comes in is not that big, and if I increase the batch duration to maybe 10mins or so, then even I might end up getting only 2-3mb of data, which is way less than the block size.
This is the expected behaviour in Spark Streaming.
I am looking for efficient ways to do a post processing to merge all these small files and create one big file.
If anyone's done it before, please share your ideas.
I would encourage you to not use Spark to stream data from Kafka to HDFS.
Kafka Connect HDFS Plugin by Confluent (or Apache Gobblin by LinkedIn) exist for this very purpose. Both offer Hive integration.
Find my comments about compaction of small files in this Github issue
If you need to write Spark code to process Kafka data into a schema, then you can still do that, and write into another topic in (preferably) Avro format, which Hive can easily read without a predefined table schema
I personally have written a "compaction" process that actually grabs a bunch of hourly Avro data partitions from a Hive table, then converts into daily Parquet partitioned table for analytics. It's been working great so far.
If you want to batch the records before they land on HDFS, that's where Kafka Connect or Apache Nifi (mentioned in the link) can help, given that you have enough memory to store records before they are flushed to HDFS
I have exactly the same situation as you. I solved it by:
Lets assume that your new coming data are stored in a dataset: dataset1
1- Partition the table with a good partition key, in my case I have found that I can partition using a combination of keys to have around 100MB per partition.
2- Save using spark core not using spark sql:
a- load the whole partition in you memory (inside a dataset: dataset2) when you want to save
b- Then apply dataset union function: dataset3 = dataset1.union(dataset2)
c- make sure that the resulted dataset is partitioned as you wish e.g: dataset3.repartition(1)
d - save the resulting dataset in "OverWrite" mode to replace the existing file
If you need more details about any step please reach out.

Spark Streaming : source HBase

Is it possible to have a spark-streaming job setup to keep track of an HBase table and read new/updated rows every batch? The blog here says that HDFS files come under supported sources. But they seem to be using the following static API :
sc.newAPIHadoopRDD(..)
I can't find any documentation around this. Is it possible to stream from hbase using spark streaming context? Any help is appreciated.
Thanks!
The link provided does the following
Read the streaming data - convert it into HBase put and then add to HBase table. Until this, its streaming. Which means your ingestion process is streaming.
The stats calculation part, I think is batch - this uses newAPIHadoopRDD. This method will treat the data reading part as files. In this case, the files are from Hbase - thats the reason for the following input formats
val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
classOf[org.apache.hadoop.hbase.client.Result])
If you want to read the updates in a HBase as streaming, then you should have a handle of WAL(write ahead logs) of HBase at the back end, and then perform your operations. HBase-indexer is a good place to start to read any updates in HBase.
I have used hbase-indexer to read hbase updates at the back end and direct them to solr as they arrive. Hope this helps.

Resources