Data Processing in Parallel using Apache Spark with Pyspark - apache-spark

I have a daily level transaction dataset for three months going upto around 17 gb combined. Now I have a server with 16 cores and 64gb RAM with 1 tb of hardisk space. I have the transaction data broken into 90 files each having the same format and a set of queries which is to be run in this entire dataset and the query for each daily level data is the same for all 90 files. The result after the query is run is appended and then we get the resultant summary back. Now before I start on my endevour I was wondering if Apache Spark with pyspark can be used to solve this. I tried R but it was very slow and ultimately I got memory outage issue
So my question has two parts
How should I create my RDD? Should I pass my entire dataset as an RDD or is there any way I can tell spark to work in Parallel in these 90 datsets
2.Can I expect a significant speed improvement if I am not working with Hadoop

Related

Spark hbase bulk load generates more than 15X data

I have Spark dataframe with just 2 columns like { Key| Value}. And this dataframe has 10 million records. I am inserting this into HBase table (has 10 pre-split regions) using bulk load approach from Spark. This works fine and loads the data successfully. When I checked the size table it was like 151GB (453 gb with 3x hadoop replication). I ran major compaction on that table, and table size got reduced to 35GB (105gb with 3x replication).
I am trying to run the same code and same data in a different cluster. But here I have quota limitation of 2TB to my namespace. My process fails while loading HFiles to HBase saying its quota limit exceeded.
I would like to know whether Spark creates much more data files than the required 151GB during the bulk load? If so how to avoid that? or is there better approach to load the same?
The question is that if actual data is around 151gb (before major_compact), then why 2TB size is not enough?

How do I process non-real time data in batches in Spark?

I am new Big data and Spark. I have to work on real-time data and old data from the past 2 years. There are around a million rows for each day. I am using PySpark and Databricks. Data is partitioned on created date. I have to perform some transformations and load it to a database.
For real-time data, I will be using spark streaming (readStream to read, perform transformation and then writeStream).
How do I work with the data from the past 2 years? I tried filtering data from 30 days I got good throughput. Should I be running the process on all 2 years of data at once or should I doing it in batches? If I perform this processes in batches, does Spark provide a way to batch it or do I do it in Python. Also, do I run these batches in parallel or in sequence?
It is kind of open ended but let me try to address your concerns.
How do I work with the data from the past 2 years? I tried filtering data from 30 days I got good throughput. Should I be running the process on all 2 years of data at once or should I doing it in batches?
Since you are new to Spark, do it in batches and start by running 1 day at a time, then 1 week and so one. Get your program to run successfully and optimize. As you increase the batch size you can increase your cluster size using Pyspark Dataframes (not Pandas). If your job is verified and efficient, you can run monthly, bi-monthly or larger batches (smaller jobs are better in your case).
If I perform this processes in batches, does Spark provide a way to batch it or do I do it in Python. Also, do I run these batches in parallel or in sequence?
You can use the date range as parameters to your Databricks job and use data bricks to schedule your jobs to ran back to back. Sure you can run them in parallel on different clusters but the whole idea with Spark is to use Sparks distributed capability and run your job on as many worker nodes as your job requires. Again, get one small job to work and validate your results, then validate a larger set and so on. If you feel confident, start a large cluster (many and fat workers) and run a large date range.
It is not an easy task for a newbie but should be a lot of fun. Best wishes.

how spark reads data when we are using a filter in where

I'm reading a key from a table which is huge in size (900 GB).
its just one where condition but spark has launched many jobs with huge no of tasks.
i'm using 11 node cluster (128 GB memory and 16 cores per node)
i know that we may need more number of tasks, but why those many jobs, why cant it process in a single stage...?
Can someone please explain what happens internally when we use a where condition..
Appreciate your response.please check this image
Spark is for bulk processing, not a single key lookup as your image shows as in, say, an ORACLE database, with an index. For a JOIN for many rows these lookups are finer, of course.
Spark does not know what you are doing (semantically), so it follows its distributed model and processes in parallel - meaning many tasks - for many partitions.
The image is not a proper use case for Spark.

Spark Performance Issue vs Hive

I am working on a pipeline that will run daily. It includes joining 2 tables say x & y ( approx. 18 MB and 1.5 GB sizes respectively) and loading the output of the join to final table.
Following are the facts about the environment,
For table x:
Data size: 18 MB
Number of files in a partition : ~191
file type: parquet
For table y:
Data size: 1.5 GB
Number of files in a partition : ~3200
file type: parquet
Now the problem is:
Hive and Spark are giving same performance (time taken is same)
I tried different combination of resources for spark job.
e.g.:
executors:50 memory:20GB cores:5
executors:70 memory:20GB cores:5
executors:1 memory:20GB cores:5
All three combinations are giving same performance. I am not sure what I am missing here.
I also tried broadcasting the small table 'x' so as to avoid shuffle while joining but not much improvement in performance.
One key observations is:
70% of the execution time is consumed for reading the big table 'y' and I guess this is due to more number of files per partition.
I am not sure how hive is giving the same performance.
Kindly suggest.
I assume you are comparing Hive on MR vs Spark. Please let me know if it is not the case.Because Hive(on tez or spark) vs Spark Sql will not differ
vastly in terms of performance.
I think the main issue is that there are too many small files.
A lot of CPU and time is consumed in the I/O itself, hence you can't experience the processing power of Spark.
My advice is to coalesce the spark dataframes immedietely after reading the parquet files. Please coalesce the 'x' dataframe into single partition and 'y'
dataframe into 6-7 partitions.
After doing the above, please perform the join(broadcastHashJoin).

Spark based processing of data stored on SSD

We are currently using Spark 2.1 based application which analyses and process huge number of records to generate some stats which is used for report generation. Now our we are using 150 executors, 2 core per executor and 10 GB per executor for our spark jobs and size of data is ~3TB stored in parquet format. For processing 12 months of data it is taking ~15 mins of time.
Now to improve performance we want to try full SSD based node to store data in HDFS. Well the question is, are there any special configuration/optimisation to be done for SSD? Are there any study done for Spark processing performance on SSD based HDFS vs HDD based HDFS?
http://spark.apache.org/docs/latest/hardware-provisioning.html#local-disks
SPARK_LOCAL_DIRS is config that you need to change.
https://www.slideshare.net/databricks/optimizing-apache-spark-throughput-using-intel-optane-and-intel-memory-drive-technology-with-ravikanth-durgavajhala
Use case is K means algo but will help.

Resources