Spark and JDBC: Iterating through large table and writing to hdfs - apache-spark

What would be the most memory efficient way to copy the contents of a large relational table using spark and then write to a partitioned Hive table in parquet format (without sqoop). I have a basic spark app and i have done some other tuning with spark's jdbc but data in relational table is still 0.5 TB and 2 Billion records so I although I can lazy load the full table, I'm trying to figure out how to efficiently partition by date and save to hdfs without running into memory issues. since the jdbc load() from spark will load everything into memory I was thinking of looping through the dates in the database query but still not sure how to make sure I don't run out of memory.

If you need to use Spark you can add to your application date parameter for filtering table by date and run your Spark application in loop for each date. You can use bash or other scripting language for this loop.
This can look like:
foreach date in dates
spark-submit your application with date parameter
read DB table with spark.read.jdbc
filter by date using filter method
write result to HDFS with df.write.parquet("hdfs://path")
Another option is to use different technology for example implement Scala application using JDBC and DB cursor to iterate through rows and save result to HDFS. This is more complex, because you need to solve problems related to writing to Parquet format and saving to HDFS using Scala. If you want I can provide Scala code responsible for writing to Parquet format.

Related

Performance of spark while reading from hive vs parquet

Assuming I have an external hive table on top parquet/orc files partitioned on date, what would be the performance impact of using
spark.read.parquet("s3a://....").filter("date_col='2021-06-20'")
v/s
spark.sql("select * from table").filter("date_col='2021-06-20'")
After reading into a dataframe, It will be followed by a series of transformations and aggregations.
spark version : 2.3.0 or 3.0.2
hive version : 1.2.1000
number of records per day : 300-700 Mn
My hunch is that there won't be any performance difference while using either of the above queries since parquet natively has most of the optimizations that a hive metastore can provide and spark is capable of using it. Like, predicate push-down, advantages of columnar storage etc.
As a follow-up question, what happens if
The underlying data was csv instead of parquet. Does having a hive table on top improves performance ?
Hive table was bucketed. Does it make sense to read the underlying file system in this case instead of reading from table ?
Also, are there any situations where reading directly from parquet is a better option compared to hive ?
Hive should actually be faster here because they both have pushdowns, Hive already has the schema stored. The parquet read as you have it here will need to infer the merged schema. You can make them about the same by providing the schema.
You can make the Parquet version even faster by navigating directly to the partition. This avoids having to do the initial filter on the available partitions.
So something like this would do it:
spark.read.option("basePath", "s3a://....").parquet("s3a://..../date_col=2021-06-20")
Note this works best if you already have a schema, because this also skips schema merging.
As to your follow-ups:
It would make a huge difference if it's CSV, as it would then have to parse all of the data and then filter out those columns. CSV is really bad for large datasets.
Shouldn't really gain you all that much and may get you into trouble. The metadata that Hive stores can allow Spark to navigate your data more efficiently here than you trying to do it yourself.

Extract and analyze data from JSON - Hadoop vs Spark

I'm trying to learn the whole open source big data stack, and I've started with HDFS, Hadoop MapReduce and Spark. I'm more or less limited with MapReduce and Spark (SQL?) for "ETL", HDFS for storage, and no other limitation for other things.
I have a situation like this:
My Data Sources
Data Source 1 (DS1): Lots of data - totaling to around 1TB. I have IDs (let's call them ID1) inside each row - used as a key. Format: 1000s of JSON files.
Data Source 2 (DS2): Additional "metadata" for data source 1. I have IDs (let's call them ID2) inside each row - used as a key. Format: Single TXT file
Data Source 3 (DS3): Mapping between Data Source 1 and 2. Only pairs of ID1, ID2 in CSV files.
My workspace
I currently have a VM with enough data space, about 128GB of RAM and 16 CPUs to handle my problem (the whole project is a research for, not a production-use-thing). I have CentOS 7 and Cloudera 6.x installed. Currently, I'm using HDFS, MapReduce and Spark.
The task
I need only some attributes (ID and a few strings) from Data Source 1. My guess is that it comes to less than 10% in data size.
I need to connect ID1s from DS3 (pairs: ID1, ID2) to IDs in DS1 and ID2s from DS3 (pairs: ID1, ID2) to IDs in DS2.
I need to add attributes from DS2 (using "mapping" from the previous bullet) to my extracted attributes from DS1
I need to make some "queries", like:
Find the most used words by years
Find the most common words, used by a certain author
Find the most common words, used by a certain author, on a yearly basi
etc.
I need to visualize data (i.e. wordclouds, histograms, etc.) at the end.
My questions:
Which tool to use to extract data from JSON files the most efficient way? MapReduce or Spark (SQL?)?
I have arrays inside JSON. I know the explode function in Spark can transpose my data. But what is the best way to go here? Is it the best way to
extract IDs from DS1 and put exploded data next to them, and write them to new files? Or is it better to combine everything? How to achieve this - Hadoop, Spark?
My current idea was to create something like this:
Extract attributes needed (except arrays) from DS1 with Spark and write them to CSV files.
Extract attributes needed (exploded arrays only + IDs) from DS1 with Spark and write them to CSV files - each exploded attribute to own file(s).
This means I have extracted all the data I need, and I can easily connect them with only one ID. I then wanted to make queries for specific questions and run MapReduce jobs.
The question: Is this a good idea? If not, what can I do better? Should I insert data into a database? If yes, which one?
Thanks in advance!
Thanks for asking!! Being a BigData developer for last 1.5 years and having experience with both MR and Spark, I think I may guide you to the correct direction.
The final goals which you want to achieve can be obtained using both MapReduce and Spark. For visualization purpose you can use Apache Zeppelin, which can run on top of your final data.
Spark jobs are memory expensive jobs, i.e, the whole computation for spark jobs run on memory, i.e, RAM. Only the final result is written to the HDFS. On the other hand, MapReduce uses less amount of memory and used HDFS for writing intermittent stage results, thus making more I/O operations and more time consuming.
You can use Spark's Dataframe feature. You can directly load data to Dataframe from a structured data (it can be plaintext file also) which will help you to get the required data in a tabular format. You can write the Dataframe to a plaintext file, or you can store to a hive table from where you can visualize data. On the other hand, using MapReduce you will have to first store in Hive table, then write hive operations to manipulate data, and store final data to another hive table. Writing native MapReduce jobs can be very hectic so I would suggest to refrain from choosing that option.
At the end, I would suggest to use Spark as processing engine (128GB and 16 cores is enough for spark) to get your final result as soon as possible.

Fast way to insert nested JSON to Hadoop (Spark Java)

I need to write to Hadoop around 150B nested Json records per day (Using Spark Java),
What is the "fast" way to do it in term of performance for example :
Create Hive table and write parquet file to HDFS
Or create Dataset from Json and using saveAsTable
Or is there another way of doing it ?

Optimized hive data aggregation using hive

I have a hive table (80 million records) with the followig schema (event_id ,country,unit_id,date) and i need to export this data to a text file as with the following requirments:
1-Rows are aggregated(combined) by event_id.
2-Aggregated rows must be sorted according to date.
For example rows with same event_id must be combined as a list of lists, ordered according to date.
What is the best performance wise solution to make this job using spark ?
Note: This is expected to be a batch job.
Performance-wise, I think the best solution is to write a spark program (scala or python) to read in the underlying files to the hive table, do your transformations, and then write the output as a file.
I've found that it's much quicker to just read the files in spark rather than querying hive through spark and pulling the result into a dataframe.

spark connector loading vs sstableloader performance

I have a spark job that right now pulls data from HDFS and transforms the data into flat files to load into the Cassandra.
The cassandra table is essentially 3 columns but the last two are map collections, so a "complex" data structure.
Right now I use the COPY command and get about 3k rows/sec load but thats extremely slow given that I need to load about 50milllion records.
I see I can convert the CSV file to sstables but I don't see an example involving map collections and/or lists.
Can I use the spark connector to cassandra to load data with map collections and lists and get better performance than just the COPY command?
Yes the Spark Cassandra Connector can be much much faster for files already in HDFS. Using spark you'll be able to distributedly grab and write into C*.
Even without Spark using a java based loader like https://github.com/brianmhess/cassandra-loader will give you a significant speed improvement.

Resources