How to combine small parquet files with Spark? - apache-spark

I have a Hive table that has a lot of small parquet files and I am creating a Spark data frame out of it to do some processing using SparkSQL. Since I have a large number of splits/files my Spark job creates a lot of tasks, which I don't want. Basically what I want is the same functionality that Hive provides, that is, to combine these small input splits into larger ones by specifying a max split size setting. How can I achieve this with Spark? I tried using the coalesce function, but I can only specify the number of partitions with it (I can only control the number of output files with it). Instead I really want some control over the (combined) input split size that a task processes.
Edit: I am using Spark itself, not Hive on Spark.
Edit 2: Here is the current code I have:
//create a data frame from a test table
val df = sqlContext.table("schema.test_table").filter($"my_partition_column" === "12345")
//coalesce it to a fixed number of partitions. But as I said in my question
//with coalesce I cannot control the file sizes, I can only specify
//the number of partitions
df.coalesce(8).write.mode(org.apache.spark.sql.SaveMode.Overwrite)
.insertInto("schema.test_table")

I have not tried but read it in getting started guide that setting this property should work "hive.merge.sparkfiles=true"
https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started
In case using Spark on Hive, than Spark's abstraction doesn't provide explicit split of data. However we can control the parallelism in several ways.
You can leverage DataFrame.repartition(numPartitions: Int) to explicitly control the number of partitions.
In case you are using Hive Context than ensure hive-site.xml contains the CombinedInputFormat. That may help.
For more info, take a look at following documentation about Spark data parallelism - http://spark.apache.org/docs/latest/tuning.html#level-of-parallelism.

Related

PySpark Parallelism

I am new to spark and am trying to implement reading data from a parquet file and then after some transformation returning it to web ui as a paginated way. Everything works no issue there.
So now I want to improve the performance of my application, after some google and stack search I found out about pyspark parallelism.
What I know is that :
pyspark parallelism works by default and It creates a parallel process based on the number of cores the system has.
Also for this to work data should be partitioned.
Please correct me if my understanding is not right.
Questions/doubt:
I am reading data from one parquet file, so my data is not partitioned and if I use the .repartition() method on my dataframe that is expensive. so how should I use PySpark Parallelism here ?
Also I could not find any simple implementation of pyspark parallelism, which could explain how to use it.
In spark cluster 1 core reads one partition so if you are on multinode spark cluster
then you need to leave some meory for existing system manager like Yarn etc.
https://spoddutur.github.io/spark-notes/distribution_of_executors_cores_and_memory_for_spark_application.html
you can use reparation and specify number of partitions
df.repartition(n)
where n is the number of partition. Repartition is for parlelleism, it will be ess expensive then process your single file without any partition.

Extract and analyze data from JSON - Hadoop vs Spark

I'm trying to learn the whole open source big data stack, and I've started with HDFS, Hadoop MapReduce and Spark. I'm more or less limited with MapReduce and Spark (SQL?) for "ETL", HDFS for storage, and no other limitation for other things.
I have a situation like this:
My Data Sources
Data Source 1 (DS1): Lots of data - totaling to around 1TB. I have IDs (let's call them ID1) inside each row - used as a key. Format: 1000s of JSON files.
Data Source 2 (DS2): Additional "metadata" for data source 1. I have IDs (let's call them ID2) inside each row - used as a key. Format: Single TXT file
Data Source 3 (DS3): Mapping between Data Source 1 and 2. Only pairs of ID1, ID2 in CSV files.
My workspace
I currently have a VM with enough data space, about 128GB of RAM and 16 CPUs to handle my problem (the whole project is a research for, not a production-use-thing). I have CentOS 7 and Cloudera 6.x installed. Currently, I'm using HDFS, MapReduce and Spark.
The task
I need only some attributes (ID and a few strings) from Data Source 1. My guess is that it comes to less than 10% in data size.
I need to connect ID1s from DS3 (pairs: ID1, ID2) to IDs in DS1 and ID2s from DS3 (pairs: ID1, ID2) to IDs in DS2.
I need to add attributes from DS2 (using "mapping" from the previous bullet) to my extracted attributes from DS1
I need to make some "queries", like:
Find the most used words by years
Find the most common words, used by a certain author
Find the most common words, used by a certain author, on a yearly basi
etc.
I need to visualize data (i.e. wordclouds, histograms, etc.) at the end.
My questions:
Which tool to use to extract data from JSON files the most efficient way? MapReduce or Spark (SQL?)?
I have arrays inside JSON. I know the explode function in Spark can transpose my data. But what is the best way to go here? Is it the best way to
extract IDs from DS1 and put exploded data next to them, and write them to new files? Or is it better to combine everything? How to achieve this - Hadoop, Spark?
My current idea was to create something like this:
Extract attributes needed (except arrays) from DS1 with Spark and write them to CSV files.
Extract attributes needed (exploded arrays only + IDs) from DS1 with Spark and write them to CSV files - each exploded attribute to own file(s).
This means I have extracted all the data I need, and I can easily connect them with only one ID. I then wanted to make queries for specific questions and run MapReduce jobs.
The question: Is this a good idea? If not, what can I do better? Should I insert data into a database? If yes, which one?
Thanks in advance!
Thanks for asking!! Being a BigData developer for last 1.5 years and having experience with both MR and Spark, I think I may guide you to the correct direction.
The final goals which you want to achieve can be obtained using both MapReduce and Spark. For visualization purpose you can use Apache Zeppelin, which can run on top of your final data.
Spark jobs are memory expensive jobs, i.e, the whole computation for spark jobs run on memory, i.e, RAM. Only the final result is written to the HDFS. On the other hand, MapReduce uses less amount of memory and used HDFS for writing intermittent stage results, thus making more I/O operations and more time consuming.
You can use Spark's Dataframe feature. You can directly load data to Dataframe from a structured data (it can be plaintext file also) which will help you to get the required data in a tabular format. You can write the Dataframe to a plaintext file, or you can store to a hive table from where you can visualize data. On the other hand, using MapReduce you will have to first store in Hive table, then write hive operations to manipulate data, and store final data to another hive table. Writing native MapReduce jobs can be very hectic so I would suggest to refrain from choosing that option.
At the end, I would suggest to use Spark as processing engine (128GB and 16 cores is enough for spark) to get your final result as soon as possible.

Partitioning strategy in Parquet and Spark

I have a job that reads csv files , converts it into data frames and writes in Parquet. I am using append mode while writing the data in Parquet. With this approach, in each write a separate Parquet file is getting generated. My questions are :
1) If every time I write the data to Parquet schema ,a new file gets
appended , will it impact read performance (as the data is now
distributed in varying length of partitioned Parquet files)
2) Is there a way to generate the Parquet partitions purely based on
the size of the data ?
3) Do we need to think to a custom partitioning strategy to implement
point 2?
I am using Spark 2.3
It will affect read performance if
spark.sql.parquet.mergeSchema=true.
In this case, Spark needs to visit each file and grab schema from
it.
In other cases, I believe it does not affect read performance much.
There is no way generate purely on data size. You may use
repartition or coalesce. Latter will created uneven output
files, but much performant.
Also, you have config spark.sql.files.maxRecordsPerFile or option
maxRecordsPerFile to prevent big size of files, but usually it is
not an issue.
Yes, I think Spark has not built in API to evenly distribute by data
size. There are Column
Statistics
and Size
Estimator may help with this.

Number of Partitions of Spark Dataframe

Can anyone explain about the number of partitions that will be created for a Spark Dataframe.
I know that for a RDD, while creating it we can mention the number of partitions like below.
val RDD1 = sc.textFile("path" , 6)
But for Spark dataframe while creating looks like we do not have option to specify number of partitions like for RDD.
Only possibility i think is, after creating dataframe we can use repartition API.
df.repartition(4)
So can anyone please let me know if we can specify the number of partitions while creating a dataframe.
You cannot, or at least not in a general case but it is not that different compared to RDD. For example textFile example code you've provides sets only a limit on the minimum number of partitions.
In general:
Datasets generated locally using methods like range or toDF on local collection will use spark.default.parallelism.
Datasets created from RDD inherit number of partitions from its parent.
Datsets created using data source API:
In Spark 1.x typically depends on the Hadoop configuration (min / max split size).
In Spark 2.x there is a Spark SQL specific configuration in use.
Some data sources may provide additional options which give more control over partitioning. For example JDBC source allows you to set partitioning column, values range and desired number of partitions.
Default number of shuffle partitions in spark dataframe(200)
Default number of partitions in rdd(10)

How to split the input file in Apache Spark

Suppose I have an input file of size 100MB. It contains large number of points (lat-long pair) in CSV format. What should I do in order to split the input file in 10 10MB files in Apache Spark or how do I customize the split.
Note: I want to process a subset of the points in each mapper.
Spark's abstraction doesn't provide explicit split of data. However you can control the parallelism in several ways.
Assuming you use YARN, HDFS file is automatically split into HDFS blocks and they're processed concurrently when Spark action is running.
Apart from HDFS parallelism, consider using partitioner with PairRDD. PairRDD is data type of RDD of key-value pairs and a partitioner manages mapping from a key to a partition. Default partitioner reads spark.default.parallelism. The partitioner helps to control the distribution of data as well as its locality in PairRDD-specific actions, e.g., reduceByKey.
Take a look at following documentation about Spark data parallelism.
http://spark.apache.org/docs/1.2.0/tuning.html
After searching through the Spark API I have found one method partition which returns the number of partitions of the JavaRDD. At the time of JavaRDD creation we have repartitioned it to desired number of partitions as told by #Nick Chammas.
JavaRDD<String> lines = ctx.textFile("/home/hduser/Spark_programs/file.txt").repartition(5);
List<Partition> partitions = lines.partitions();
System.out.println(partitions.size());

Resources