I need to cache a dataframe in PySpark (2.4.4), and the in-memory caching is slow.
I benchmarked Pandas against Spark caching by reading the same CSV file; specifically, Pandas was 3-4 times faster.
Thanks in advance
You are comparing apples and oranges. Pandas is a single-machine, single-core data analysis library, whereas PySpark is a distributed (cluster-computing) data analysis engine. That means you will never outperform Pandas at reading a small file on a single machine with PySpark, because of the overhead (distributed architecture, the JVM, ...). It also means that PySpark will outperform Pandas as soon as your file exceeds a certain size.
You, as a developer, have to choose the solution that best fits your requirements. If Pandas is faster for your project and you don't expect a big increase in data volume in the future, use Pandas. Otherwise use PySpark, Dask, or a similar tool.
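To make the comparison concrete, here is a minimal timing sketch of the kind of benchmark described in the question (the file name data.csv is a placeholder). Note that cache() is lazy, so an action such as count() has to run before anything is actually cached:

```python
import time

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-benchmark").getOrCreate()

# pandas: single process, no JVM or serialization overhead
t0 = time.time()
pdf = pd.read_csv("data.csv")
print("pandas read:", time.time() - t0)

# Spark: cache() is lazy, so an action (count) is needed to actually
# materialize the cached data before the timing means anything
t0 = time.time()
sdf = spark.read.csv("data.csv", header=True, inferSchema=True)
sdf.cache()
sdf.count()
print("spark read + cache:", time.time() - t0)
```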
Related
Recently I came across a requirement where I tried to replace Python's re with PySpark's regexp_extract, on the assumption that Spark would be faster. After comparing the processing speed of the two approaches, I concluded that re is faster than PySpark's regexp_extract. Is there any specific reason why regexp_extract is slower?
Thanks in advance
Probably more context is needed to give a specific answer, but from what you describe I can infer the following:
It depends on the size of the data and how it is partitioned in Spark. Since Spark parallelizes the work, plain Python functions will probably be faster on small amounts of data, while Spark comes out ahead on large amounts of data, where parallelization pays off.
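For illustration, a minimal sketch contrasting the two approaches (the column name and pattern are made up); on a toy DataFrame like this the Spark overhead dominates, which matches what you observed:

```python
import re

from pyspark.sql import SparkSession
from pyspark.sql.functions import regexp_extract

spark = SparkSession.builder.getOrCreate()

# Tiny example DataFrame; with so little data the Spark overhead dominates.
df = spark.createDataFrame([("order-123",), ("order-456",)], ["text"])

# Spark-native: the regex runs inside the JVM across all partitions.
df.withColumn("order_id", regexp_extract("text", r"order-(\d+)", 1)).show()

# Plain Python re on the collected rows: no scheduling or serialization cost,
# so on small data it will typically finish first.
ids = [re.search(r"order-(\d+)", r["text"]).group(1) for r in df.collect()]
print(ids)
```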
As I am new to big data platforms, I would like to do some feature engineering on my data. The database is about 30-50 GB in size. Is it possible to load the full data (30-50 GB) into a data frame, like a Pandas dataframe?
The database used here is Oracle. I tried to load the data but I get an out-of-memory error. Furthermore, I would like to work in Python.
Pandas is not a good fit when you have tens of GBs of data; it is better to use a distributed architecture to improve speed and efficiency. There is a library called Dask that can load large datasets and use a distributed architecture.
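As a rough sketch of how that could look against an Oracle table (the connection string, table name and column names are placeholders; this assumes SQLAlchemy plus an Oracle driver are installed):

```python
import dask.dataframe as dd

# Read the table in partitions rather than all at once. The index column
# should be an indexed numeric column so Dask can split the table into ranges.
ddf = dd.read_sql_table(
    "features",
    "oracle+cx_oracle://user:password@host:1521/?service_name=ORCL",
    index_col="id",
    npartitions=100,
)

# Everything stays lazy until .compute(); partitions are processed one at a time.
print(ddf.describe().compute())
```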
In terms of RAM efficiency, how much better is it?
What does Dask do to reduce/compress large data so that it runs with a small amount of RAM?
When running on a single machine with datasets smaller than RAM, pandas/numpy should serve you fine. Dask is a parallel/distributed task scheduling package, which among other things means you can read datasets lazily on a single computer. For example, a folder of .csv files that together are too big (say 60 GB) to load into memory can be loaded with Dask, so that the data is only read when you need it, i.e. when you call .compute() on the result.
Basically: start with pandas, and if your code starts throwing MemoryErrors, switch to Dask.
Source:
http://dask.pydata.org/en/latest/why.html
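A minimal sketch of that lazy pattern (the glob pattern and the column names category/value are placeholders):

```python
import dask.dataframe as dd

# Builds a lazy task graph over every matching file; nothing is read yet.
df = dd.read_csv("data/*.csv")

# Work happens chunk by chunk only when .compute() is called, so the full
# 60 GB never needs to fit in memory at once.
result = df.groupby("category")["value"].mean().compute()
print(result)
```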
I am learning Spark now, and it seems to be the big data solution for the Pandas DataFrame, but I have a question that makes me unsure.
Currently I am storing Pandas dataframes that are larger than memory using HDF5. HDF5 is a great tool that allows me to process a Pandas dataframe in chunks. So when I need to process a large Pandas dataframe, I do it chunk by chunk. But Pandas does not support distributed processing, and HDF5 is only for a single-PC environment.
Using a Spark dataframe may be a solution, but my understanding of Spark is that the dataframe must be able to fit in memory, and once loaded as a Spark dataframe, Spark will distribute the dataframe to the different workers to do the distributed processing.
Is my understanding correct? If this is the case, then how does Spark handle a dataframe that is larger than the memory? Does it support chunking, like HDF5?
the dataframe must be able to fit in memory, and once loaded as a Spark dataframe, Spark will distribute the dataframe to the different workers to do the distributed processing.
This is true only if you're trying to load your data on the driver and then parallelize it. In a typical scenario you store the data in a format that can be read in parallel. That means your data:
has to be accessible from each worker, for example via a distributed file system
has to be in a file format that supports splitting (the simplest example is plain old CSV)
In a situation like this each worker reads only its own part of the dataset, without any need to keep the data in the driver's memory. All the logic related to computing splits is handled transparently by the applicable Hadoop InputFormat.
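For example, reading a splittable CSV from a distributed file system looks like this (the HDFS path is a placeholder); each worker pulls only its own splits:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Each executor reads only its own splits of the input; nothing is funneled
# through the driver.
df = spark.read.csv("hdfs:///data/events/*.csv", header=True, inferSchema=True)
print(df.rdd.getNumPartitions())  # roughly one partition per input split
```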
Regarding HDF5 files, you have two options:
read the data in chunks on the driver, build a Spark DataFrame from each chunk, and union the results. This is inefficient but easy to implement (see the sketch after these two options)
distribute the HDF5 file(s) and read the data directly on the workers. This is, generally speaking, harder to implement and requires a smart data distribution strategy
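A rough sketch of the first option (the path, HDF5 key and chunk size are placeholders; the store has to be written in "table" format for chunked reads to work):

```python
from functools import reduce

import pandas as pd
from pyspark.sql import DataFrame, SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the HDF5 file in chunks on the driver; each chunk is a pandas DataFrame.
chunks = pd.read_hdf("data.h5", key="my_table", chunksize=1_000_000)

# Turn each pandas chunk into a Spark DataFrame and union them together.
df = reduce(DataFrame.union, (spark.createDataFrame(chunk) for chunk in chunks))
```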
I am a newbie to Apache Spark.
My job is to read two CSV files, select some specific columns from them, merge them, aggregate them, and write the result into a single CSV file.
For example,
CSV1
name,age,department_id
CSV2
department_id,department_name,location
I want to get a third CSV file with
name,age,department_name
I am loading both CSVs into dataframes.
I am then able to get the third dataframe using the DataFrame methods join, select, filter and drop.
I am also able to do the same using several RDD.map() operations.
And I am also able to do the same by executing HiveQL with a HiveContext.
I want to know which is the most efficient way if my CSV files are huge, and why.
This blog post contains benchmarks; DataFrames are much more efficient than RDDs:
https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html
Here is a snippet from the blog post:
At a high level, there are two kinds of optimizations. First, Catalyst applies logical optimizations such as predicate pushdown. The optimizer can push filter predicates down into the data source, enabling the physical execution to skip irrelevant data. In the case of Parquet files, entire blocks can be skipped and comparisons on strings can be turned into cheaper integer comparisons via dictionary encoding. In the case of relational databases, predicates are pushed down into the external databases to reduce the amount of data traffic.
Second, Catalyst compiles operations into physical plans for execution and generates JVM bytecode for those plans that is often more optimized than hand-written code. For example, it can choose intelligently between broadcast joins and shuffle joins to reduce network traffic. It can also perform lower level optimizations such as eliminating expensive object allocations and reducing virtual function calls. As a result, we expect performance improvements for existing Spark programs when they migrate to DataFrames.
Here is the performance benchmark https://databricks.com/wp-content/uploads/2015/02/Screen-Shot-2015-02-16-at-9.46.39-AM.png
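As a small illustration of the predicate pushdown mentioned in the quoted passage (the Parquet path and column name are placeholders):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# The filter shows up under "PushedFilters" in the physical plan, meaning the
# Parquet reader skips non-matching data instead of loading it first.
df = spark.read.parquet("hdfs:///data/people.parquet")
df.filter(col("age") > 30).explain()
```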
Both DataFrames and Spark SQL queries are optimized by the Catalyst engine, so I would expect them to deliver similar performance
(assuming you are using version >= 1.3).
Both should be better than plain RDD operations, because with RDDs Spark has no knowledge of the types of your data and therefore can't apply any special optimizations.
The overall direction for Spark is to move towards DataFrames, so that queries are optimized through Catalyst.
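For completeness, a hedged sketch of the DataFrame version of the job described in the question (file paths are placeholders; column names follow the example CSVs):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

csv1 = spark.read.csv("csv1.csv", header=True, inferSchema=True)   # name, age, department_id
csv2 = spark.read.csv("csv2.csv", header=True, inferSchema=True)   # department_id, department_name, location

result = (
    csv1.join(csv2, on="department_id", how="inner")
        .select("name", "age", "department_name")
)

# coalesce(1) writes the output as a single CSV part file, as the question asks.
result.coalesce(1).write.csv("output", header=True)
```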