How can parquet columns be skipped when reading from HDFS? - apache-spark

We all know Parquet is column-oriented, so we can read only the columns we need and reduce IO.
But what if the Parquet file is stored in HDFS? Do we have to download the entire file first and then apply the column filter locally?
For example, if we use spark to read a parquet column from HDFS/Hive:
spark.sql("select name from wide_table")
Do we still have to download the entire Parquet file?
Or is there a way to filter the columns before the network transfer?

Actually "predicate pushdown", a feature of Spark SQL, will try to use your filters to reduce the amount of information that is processed by Spark, and column pruning does the same for the selected columns. Technically the HDFS block may still be read into memory, but the reader uses the Parquet metadata to return only the relevant results. This is normally called out in the physical plan: run .explain() on your query to see whether the feature is being used. (Not all file formats and data sources support this.)
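A minimal sketch (Scala) of checking this, reusing the query from the question:
// Ask Spark for the physical plan of the single-column read.
val df = spark.sql("select name from wide_table")
df.explain()
// In the FileScan parquet node of the output, ReadSchema lists only the
// selected columns and PushedFilters lists any predicates pushed down.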

Related

Extract and analyze data from JSON - Hadoop vs Spark

I'm trying to learn the whole open source big data stack, and I've started with HDFS, Hadoop MapReduce and Spark. I'm more or less limited to MapReduce and Spark (SQL?) for "ETL", HDFS for storage, and have no other limitation for other things.
I have a situation like this:
My Data Sources
Data Source 1 (DS1): Lots of data - totaling around 1TB. I have IDs (let's call them ID1) inside each row - used as a key. Format: 1000s of JSON files.
Data Source 2 (DS2): Additional "metadata" for data source 1. I have IDs (let's call them ID2) inside each row - used as a key. Format: Single TXT file
Data Source 3 (DS3): Mapping between Data Source 1 and 2. Only pairs of ID1, ID2 in CSV files.
My workspace
I currently have a VM with enough disk space, about 128GB of RAM and 16 CPUs to handle my problem (the whole project is for research, not a production use case). I have CentOS 7 and Cloudera 6.x installed. Currently, I'm using HDFS, MapReduce and Spark.
The task
I need only some attributes (ID and a few strings) from Data Source 1. My guess is that it comes to less than 10% in data size.
I need to connect ID1s from DS3 (pairs: ID1, ID2) to IDs in DS1 and ID2s from DS3 (pairs: ID1, ID2) to IDs in DS2.
I need to add attributes from DS2 (using the "mapping" from the previous bullet) to my extracted attributes from DS1.
I need to make some "queries", like:
Find the most used words by years
Find the most common words, used by a certain author
Find the most common words, used by a certain author, on a yearly basis
etc.
I need to visualize data (e.g. wordclouds, histograms, etc.) at the end.
My questions:
Which tool should I use to extract data from the JSON files in the most efficient way? MapReduce or Spark (SQL?)?
I have arrays inside the JSON. I know the explode function in Spark can transpose my data. But what is the best way to go here? Is it best to extract the IDs from DS1, put the exploded data next to them, and write them to new files? Or is it better to combine everything? How do I achieve this - Hadoop, Spark?
My current idea was to create something like this:
Extract attributes needed (except arrays) from DS1 with Spark and write them to CSV files.
Extract attributes needed (exploded arrays only + IDs) from DS1 with Spark and write them to CSV files - each exploded attribute to own file(s).
This means I have extracted all the data I need, and I can easily connect them with only one ID. I then wanted to make queries for specific questions and run MapReduce jobs.
The question: Is this a good idea? If not, what can I do better? Should I insert data into a database? If yes, which one?
Thanks in advance!
Thanks for asking!! Having been a big data developer for the last 1.5 years, with experience of both MR and Spark, I think I can point you in the right direction.
The final goals you want to achieve can be reached with both MapReduce and Spark. For visualization purposes you can use Apache Zeppelin, which can run on top of your final data.
Spark jobs are memory-expensive jobs, i.e. the whole computation of a Spark job runs in memory (RAM), and only the final result is written to HDFS. MapReduce, on the other hand, uses less memory and writes intermediate stage results to HDFS, which means more I/O operations and more time consumed.
You can use Spark's DataFrame feature. You can load structured data (it can be a plaintext file as well) directly into a DataFrame, which gives you the required data in a tabular format. You can write the DataFrame back to a plaintext file, or store it in a Hive table from which you can visualize the data. With MapReduce, on the other hand, you would first have to store the data in a Hive table, then write Hive operations to manipulate it and store the final data in another Hive table. Writing native MapReduce jobs can be very hectic, so I would suggest refraining from that option.
In the end, I would suggest using Spark as the processing engine (128GB and 16 cores is enough for Spark) to get your final result as soon as possible.
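A minimal sketch (Scala) of what that DataFrame approach could look like for DS1; the paths, column names, and the array field are hypothetical placeholders:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.explode

val spark = SparkSession.builder().appName("ds1-extract").getOrCreate()

// Load the DS1 JSON files; Spark infers the schema.
val ds1 = spark.read.json("hdfs:///data/ds1/*.json")

// Keep only the attributes that are needed (hypothetical column names).
val flat = ds1.select("id1", "author", "year", "title")

// Explode a hypothetical array column into one row per element, keeping the ID.
val exploded = ds1.select(ds1("id1"), explode(ds1("words")).as("word"))

// Persist both, e.g. as CSV files as in the original plan.
flat.write.mode("overwrite").option("header", "true").csv("hdfs:///out/ds1_flat")
exploded.write.mode("overwrite").option("header", "true").csv("hdfs:///out/ds1_words")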

Spark: Avro vs Parquet performance

Now that Spark 2.4 has built-in support for Avro format, I'm considering changing the format of some of the data sets in my data lake - those that are usually queried/joined for entire rows rather than specific column aggregations - from Parquet to Avro.
However, most of the work on top of the data is done via Spark, and to my understanding, Spark's in-memory caching and computations are done on columnar-formatted data. Does Parquet offer a performance boost in this regard, while Avro would incur some sort of data "transformation" penalty? What other considerations should I be aware of in this regard?
Both formats shine under different constraints but have things like strong types with schemas and a binary encoding in common. In its basic form it boils down to this differentiation:
Avro is a row-wise format. From this it follows that you can append row-by-row to an existing file. These row-wise appends are then also immediately visible to all readers that work on these files. Avro is best when you have a process that writes into your data lake in a streaming (non-batch) fashion.
Parquet is a columnar format and its files are not appendable. This means that for newly arriving records, you must always create new files. In exchange for this behaviour Parquet brings several benefits. Data is stored in a columnar fashion and compression and encoding (simple type-aware, low-CPU but highly effective compression) is applied to each column. Thus Parquet files will be much smaller than Avro files. Also, Parquet writes out basic statistics so that when you load data from it, you can push down parts of your selection to the I/O layer; then only the necessary set of rows is loaded from disk. As Parquet is already in a columnar layout and most in-memory structures are also columnar, loading data from it is in general much faster.
As you already have your data and the ingestion process tuned to write Parquet files, it's probably best for you to stay with Parquet as long as data ingestion (latency) does not become a problem for you.
A typical usage is actually a mix of Parquet and Avro. Recent, freshly arrived data is stored as Avro files, as this makes the data immediately available to the data lake. More historic data is transformed, e.g. on a daily basis, into Parquet files, as they are smaller and more efficient to load but can only be written in batches. While working with this data, you would load both into Spark as a union of two tables. Thus you have the benefit of efficient reads with Parquet combined with the immediate availability of data with Avro. This pattern is often hidden by table formats like Uber's Hudi or Apache Iceberg (incubating), which was started by Netflix.
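A sketch (Scala, Spark 2.4+ with the spark-avro module on the classpath) of reading such a mixed layout; the paths are hypothetical and both sources are assumed to share a compatible schema:
// Fresh, row-wise appended data lands as Avro.
val recent = spark.read.format("avro").load("hdfs:///lake/events/recent")

// Historic data has been compacted into Parquet.
val history = spark.read.parquet("hdfs:///lake/events/history")

// Work on the union of the two, e.g. as a single view.
val events = history.unionByName(recent)
events.createOrReplaceTempView("events")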

Partitioning strategy in Parquet and Spark

I have a job that reads csv files, converts them into data frames and writes them in Parquet. I am using append mode while writing the data to Parquet. With this approach a separate Parquet file is generated on each write. My questions are:
1) If every time I write the data to Parquet a new file gets appended, will it impact read performance (as the data is now distributed across partitioned Parquet files of varying lengths)?
2) Is there a way to generate the Parquet partitions purely based on the size of the data?
3) Do we need to think about a custom partitioning strategy to implement point 2?
I am using Spark 2.3
It will affect read performance if spark.sql.parquet.mergeSchema=true. In this case, Spark needs to visit each file and grab the schema from it. In other cases, I believe it does not affect read performance much.
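For reference, schema merging is off by default and can also be requested per read; a minimal sketch, with a hypothetical path:
// Merging schemas forces Spark to inspect the footer of every file.
val df = spark.read
  .option("mergeSchema", "true")
  .parquet("hdfs:///out/table")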
There is no way to generate files purely based on data size. You may use repartition or coalesce; the latter will create uneven output files, but is more performant. Also, you have the config spark.sql.files.maxRecordsPerFile, or the writer option maxRecordsPerFile, to prevent files from growing too large (sketched below), but usually it is not an issue.
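A minimal sketch of both ways to cap file size by record count; the value, paths, and source DataFrame are illustrative:
// df stands for whatever DataFrame is being written (hypothetical source).
val df = spark.read.parquet("hdfs:///in/table")

// Session-wide cap on records per output file.
spark.conf.set("spark.sql.files.maxRecordsPerFile", 1000000L)

// Or per write, via the equivalent writer option.
df.write
  .option("maxRecordsPerFile", 1000000L)
  .parquet("hdfs:///out/table")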
Yes, I think Spark has no built-in API to distribute the data evenly by size. Column statistics and SizeEstimator may help with this.

Reading orc/parquet files from hdfs in spark

I have a doubt about loading data into a Spark cluster (standalone mode) from HDFS, say from a parquet or orc file: will it just read the metadata or the entire data at the first instance? If I apply a filter, will it load the entire data and apply the filter to it, or will it read only the selected columns? And will the entire data be copied to Spark, and if so, where will it be present ...
thanks in advance..
Not sure exactly what you mean by filtering, but generally speaking, when accessing columnar structured files like Parquet or ORC, if you select specific columns you will read only those into memory and not the other columns.
Specifically, if you are asking about something like ds.filter("..."), the ability to read into memory only the data that satisfies the filter is called "predicate pushdown". Generally speaking it is available in Spark, but it depends on exactly what you are trying to do. For example, AFAIA, Spark can't do predicate pushdown on complex-typed columns in Parquet files.
So I would suggest, if possible, selecting only the relevant columns and then filtering. If you use complex types or Spark SQL, check on Google whether predicate pushdown is supported for your case.
Also, it doesn't matter whether the files are on HDFS or somewhere else like S3; the behaviour should be the same.
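A minimal sketch (Scala) of selecting columns first and then filtering; the path and column names are hypothetical:
import org.apache.spark.sql.functions.col

// Only the two selected columns are read from the Parquet files,
// and the simple filter can be pushed down to the scan.
val df = spark.read.parquet("hdfs:///warehouse/events")
  .select("user_id", "event_type")
  .filter(col("event_type") === "click")

df.explain()   // check ReadSchema and PushedFilters in the FileScan node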
If I apply a filter, will it load the entire data and apply the filter to it, or will it read only..
Spark doesn't load data into memory when a filter transformation is defined; it will not load data from the file until an action is run on it. This is because of lazy evaluation.

Need fewer parquet files

I am doing the following process
rdd.toDF.write.mode(SaveMode.Append).partitionBy("Some Column").parquet(output_path)
However, under each partition there are too many parquet files and each of them is very small, which makes my following steps very slow since they load all the parquet files. Is there a better way to produce fewer parquet files under each partition and increase the size of each single parquet file?
You can repartition before save:
rdd.toDF.repartition(col("Some Column")).write.mode(SaveMode.Append).partitionBy("Some Column").parquet(output_path)
I used to have this problem.
Actually, you can't directly control the number of files per partition, because it depends on what the executors write.
The way to work around it is to use the coalesce method to reduce the data to however many partitions (and hence output files) you want, but it's not an efficient approach, and you also need to give the driver enough memory to handle the operation.
df.coalesce(numPartitions).write.partitionBy("yyyyy").parquet("xxxx")
I also faced this issue. The problem is that if you use coalesce, each partition gets the same number of parquet files. Since different partitions have different sizes, ideally I would need a different coalesce for each partition.
It's going to be really quite expensive if you open a lot of small files. Let's say you open 1k files and each file's size is far below the value of your parquet.block.size.
Here are my suggestions:
Create a job that first merges your input parquet files into a smaller number of files whose sizes are near or equal to parquet.block.size (see the sketch after this list). The default block size is 128 MB, though it's configurable by updating parquet.block.size. Spark works best when your parquet file size is close to or equal to the value of parquet.block.size; the block size is the size of a row group being buffered in memory.
Or update your Spark job to read only a limited number of files.
Or, if you have a big machine and/or resources, just do the right tuning.
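A minimal sketch (Scala) of such a compaction job; the paths and the target file count are hypothetical:
// Read the many small files and rewrite them as fewer, larger ones.
val small = spark.read.parquet("hdfs:///out/small_files")

small
  .repartition(8)                 // target number of output files
  .write.mode("overwrite")
  .parquet("hdfs:///out/compacted")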
Hive queries have a way to merge small files into larger ones. This is not available in Spark SQL. Also, reducing spark.sql.shuffle.partitions won't help with the Dataframe API.
I tried the solution below and it generated a smaller number of parquet files (from 800 parquet files down to 29).
Suppose the data is loaded to a dataframe df
Create a temporary table in hive.
df.createOrReplaceTempView("tempTable")
spark.sql("CREATE TABLE test_temp LIKE test")
spark.sql("INSERT INTO TABLE test_temp SELECT * FROM tempTable")
The table test_temp will contain the small parquet files.
Populate the final hive table from the temporary table:
spark.sql("INSERT INTO test SELECT * FROM test_temp")
The final table will contain fewer files. Drop the temporary table after populating the final table.
