Read from Python to Spark in a distributed way [duplicate] - apache-spark

I am learning Spark now, and it seems to be the big data solution for Pandas Dataframe, but I have this question which makes me unsure.
Currently I am storing Pandas dataframes that are larger than memory using HDF5. HDF5 is a great tool which allows me to do chunking on the pandas dataframe. So when I need to do processing on large Pandas dataframe, I will do it in chunks. But Pandas does not support distributed processing and HDF5 is only for a single PC environment.
Using Spark dataframe may be solution, but my understanding of Spark is the dataframe must be able to fit in memory, and once loaded as a Spark dataframe, Spark will distribute the dataframe to the different workers to do the distributed processing.
Is my understanding correct? If this is the case, then how does Spark handle a dataframe that is larger than the memory? Does it support chunking, like HDF5?

the dataframe must be able to fit in memory, and once loaded as a Spark dataframe, Spark will distribute the dataframe to the different workers to do the distributed processing.
This is true only if you're trying to load your data on a driver and then parallelize. In a typical scenario you store data in a format which can be read in parallel. It means your data:
has to be accessible on each worker, for example using distributed file system
file format has to support splitting (the simplest examples is plain old csv)
In situation like this each worker reads only its own part of the dataset without any need to store data in a driver memory. All logic related to computing splits is handled transparently by the applicable Hadoop Input Format.
Regarding HDF5 files you have two options:
read data in chunks on a driver, build Spark DataFrame from each chunk, and union results. This is inefficient but easy to implement
distribute HDF5 file / files and read data directly on workers. This generally speaking harder to implement and requires a smart data distribution strategy

Related

How can I accelerate the caching in Spark (Pyspark)?

I need to cache a dataframe in Pyspark(2.4.4), and the memory caching is slow.
I benchmark the Pandas caching with Spark caching, by reading the same file(CSV). Specifically, Pandas was 3-4 times faster.
Thanks,
In advance
You are comparing apples and oranges. Pandas is a single machine single core data analysis library whereas pyspark is distributed (cluster computing) data analysis engine. That means you will never outperform pandas reading a small file on a single machine with pyspark due to the overhead (distributed architecture, JVM...). That also means that pyspark will outperform pandas as soon as your file exceeds a certain size.
You as a developer has to choose the solution which best fits your requirements. When pandas is faster for your project and you don't expect a huge increase of data in the future, use pandas. Otherwise use pyspark or dask or...

Partitioning strategy in Parquet and Spark

I have a job that reads csv files , converts it into data frames and writes in Parquet. I am using append mode while writing the data in Parquet. With this approach, in each write a separate Parquet file is getting generated. My questions are :
1) If every time I write the data to Parquet schema ,a new file gets
appended , will it impact read performance (as the data is now
distributed in varying length of partitioned Parquet files)
2) Is there a way to generate the Parquet partitions purely based on
the size of the data ?
3) Do we need to think to a custom partitioning strategy to implement
point 2?
I am using Spark 2.3
It will affect read performance if
spark.sql.parquet.mergeSchema=true.
In this case, Spark needs to visit each file and grab schema from
it.
In other cases, I believe it does not affect read performance much.
There is no way generate purely on data size. You may use
repartition or coalesce. Latter will created uneven output
files, but much performant.
Also, you have config spark.sql.files.maxRecordsPerFile or option
maxRecordsPerFile to prevent big size of files, but usually it is
not an issue.
Yes, I think Spark has not built in API to evenly distribute by data
size. There are Column
Statistics
and Size
Estimator may help with this.

How to ensure that loading of Spark DataFrame from Parquet is distributed and parallelized?

When Spark loads source data from a file into a DataFrame, what factors govern whether the data are loaded fully into memory on a single node (most likely the driver/master node) or in the minimal, parallel subsets needed for computation (presumably on the worker/executor nodes)?
In particular, if using Parquet as the input format and loading via the Spark DataFrame API, what considerations are necessary in order to ensure that loading from the Parquet file is parallelized and deferred to the executors, and limited in scope to the columns needed by the computation on the executor node in question?
(I am looking to understand the mechanism Spark uses to schedule loading of source data in the distributed execution plan, in order to avoid exhausting memory on any one node by loading the full data set.)
As long as you use spark operations, all data transformations and aggregations are perfored only on executors. Therefore there is no need for driver to load the data, its job is to manage processing flow. Driver gets the data only in case you use some terminal operations, like collect(), first(), show(), toPandas(), toLocalIterator() and similar. Additionally, executors does not load all files content into memory, but gets the smallest posible chunks (which are called partitions).
If you use column store format such as Parquet only columns required for the execution plan are loaded - this is default behaviour in spark.
Edit: I just saw that there might be a bug in spark and if you use nested columns inside your schema then unnecessary columns may be loaded, see: Why does Apache Spark read unnecessary Parquet columns within nested structures?

library to process .rrd(round robin data) using spark

I have huge time series data which is in .rrd(round robin database) format stored in S3. I am planning to use apache spark for running analysis on this to get different performance matrix.
Currently I am downloading the .rrd file from s3 and processing it using rrd4j library. I am going to do processing for longer terms like year or more. it involves processing of hundreds of thousands of .rrd files. I want spark nodes to get the file directly from s3 and run the analysis.
how can I make spark to use the rrd4j to read the .rrd files? is there any library which helps me do that?
is there any support in spark for processing this kind of data?
The spark part is rather easy, use either wholeTextFiles or binaryFiles on sparkContext (see docs). According to the documentation, rrd4j usually wants a path to construct an rrd, but with the RrdByteArrayBackend, you could load the data in there - but that might be a problem, because most of the API is protected. You'll have to figure out a way to load an Array[Byte] into rrd4j.

Which is efficient, Dataframe or RDD or hiveql?

I am newbie to Apache Spark.
My job is read two CSV files, select some specific columns from it, merge it, aggregate it and write the result into a single CSV file.
For example,
CSV1
name,age,deparment_id
CSV2
department_id,deparment_name,location
I want to get a third CSV file with
name,age,deparment_name
I am loading both the CSV into dataframes.
And then able to get the third dataframe using several methods join,select,filter,drop present in dataframe
I am also able to do the same using several RDD.map()
And I am also able to do the same using executing hiveql using HiveContext
I want to know which is the efficient way if my CSV files are huge and why?
This blog contains the benchmarks. Dataframes is much more efficient than RDD
https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html
Here is the snippet from blog
At a high level, there are two kinds of optimizations. First, Catalyst applies logical optimizations such as predicate pushdown. The optimizer can push filter predicates down into the data source, enabling the physical execution to skip irrelevant data. In the case of Parquet files, entire blocks can be skipped and comparisons on strings can be turned into cheaper integer comparisons via dictionary encoding. In the case of relational databases, predicates are pushed down into the external databases to reduce the amount of data traffic.
Second, Catalyst compiles operations into physical plans for execution and generates JVM bytecode for those plans that is often more optimized than hand-written code. For example, it can choose intelligently between broadcast joins and shuffle joins to reduce network traffic. It can also perform lower level optimizations such as eliminating expensive object allocations and reducing virtual function calls. As a result, we expect performance improvements for existing Spark programs when they migrate to DataFrames.
Here is the performance benchmark https://databricks.com/wp-content/uploads/2015/02/Screen-Shot-2015-02-16-at-9.46.39-AM.png
Both DataFrames and spark sql queries are optimized using the catalyst engine, so I would guess they will produce similar performance
(assuming you are using version >= 1.3)
And both should be better than simple RDD operations, because for RDDs, spark don't have any knowledge about the types of your data, so it can't do any special optimizations
Overall direction for Spark is to go with dataframes, so that query is optimized through catalyst

Resources