Big Data Load in Pandas Data Frame - python-3.x

As I am new to the Big Data platform, I would like to do some feature engineering work with my data. The database size is about 30-50 GB. Is it possible to load the full data (30-50 GB) into a data frame like a pandas data frame?
The database used here is Oracle. I tried to load it, but I am getting an out-of-memory error. Furthermore, I would like to work in Python.

pandas is not a good fit if you have GBs of data; it would be better to use a distributed architecture to improve speed and efficiency. There is a library called Dask that can load large data sets and use a distributed architecture, as sketched below.
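A minimal sketch of reading the Oracle table with Dask, assuming SQLAlchemy and cx_Oracle are installed; the connection string, table name and column names are placeholders:
import dask.dataframe as dd

# Placeholder SQLAlchemy URI for the Oracle database
uri = "oracle+cx_oracle://user:password@host:1521/?service_name=myservice"

# Dask reads the table in partitions instead of pulling 30-50 GB into RAM at once;
# index_col should be an indexed numeric column so the table can be split cleanly.
df = dd.read_sql_table("MYTABLE", uri, index_col="ID", npartitions=100)

# Feature engineering is lazy; data is only pulled when compute() is called.
result = df.groupby("CATEGORY").AMOUNT.mean().compute()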

Related

How can I accelerate the caching in Spark (Pyspark)?

I need to cache a dataframe in Pyspark (2.4.4), and the in-memory caching is slow.
I benchmarked Pandas caching against Spark caching by reading the same file (CSV). Specifically, Pandas was 3-4 times faster.
Thanks in advance.
You are comparing apples and oranges. Pandas is a single-machine, single-core data analysis library, whereas pyspark is a distributed (cluster computing) data analysis engine. That means you will never outperform pandas reading a small file on a single machine with pyspark, due to the overhead (distributed architecture, JVM...). It also means that pyspark will outperform pandas as soon as your file exceeds a certain size.
You as a developer have to choose the solution which best fits your requirements. When pandas is faster for your project and you don't expect a huge increase in data in the future, use pandas. Otherwise use pyspark, dask, or another distributed framework.
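A rough sketch of the kind of comparison being made, assuming a local CSV file (the path is a placeholder); on a small file the pandas call usually wins because pyspark pays JVM and scheduling overhead:
import time
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("cache-bench").getOrCreate()

t0 = time.time()
pdf = pd.read_csv("data.csv")                      # single process, no cluster overhead
print("pandas:", time.time() - t0)

t0 = time.time()
sdf = spark.read.csv("data.csv", header=True, inferSchema=True).cache()
sdf.count()                                        # an action is needed to materialise the cache
print("pyspark:", time.time() - t0)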

How to tell if spark session will be able to hold data size in dataframe?

I intend to read data from an Oracle DB with pyspark (running in local mode) and store it locally as parquet. Is there a way to tell whether a Spark session dataframe will be able to hold the amount of data from the query (which will be the whole table, i.e. select * from mytable)? Are there common solutions for when the data would not fit in a dataframe?
* Saw a similar question here, but was a little confused by the discussion in the comments
As you are running in local mode, I assume it is not on a cluster. You cannot say exactly how much memory will be required, but you can get close. Check how much disk space your table is using. Suppose mytable occupies 1 GB on disk; then Spark will require more RAM than that, because Spark's engine needs some memory for its own processing. To be on the safe side, allow about 2 GB more than the actual table size.
To check your table size in Oracle, you can use the query below:
select segment_name,segment_type,bytes/1024/1024 MB
from dba_segments
where segment_type='TABLE' and segment_name='<yourtablename>';
It will give you a result in MB.
To configure JVM-related parameters in Apache Spark, you can check this.
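A minimal sketch of sizing the driver for local mode (the values are placeholders; tune them to your table size plus headroom):
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[*]")
         .config("spark.driver.memory", "4g")          # e.g. ~1 GB table + ~2 GB headroom, rounded up
         .config("spark.driver.maxResultSize", "2g")   # cap on data collected back to the driver
         .getOrCreate())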
It doesn't matter how big the table is if you are running Spark in a distributed manner. You would only need to worry about memory if:
You are reading the data in the driver and then doing a broadcast.
You are caching the dataframe for some computation.
Usually a DAG gets generated for your Spark application, and if you are using a JDBC source the workers will read the data directly, using shuffle space and spilling from memory to disk for memory-intensive computations.
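A sketch of such a parallel JDBC read from Oracle and the local parquet write; the connection details, table name and partition bounds are placeholders:
# Each of the numPartitions tasks issues its own bounded query,
# so no single process has to hold the whole table at once.
df = (spark.read.format("jdbc")
      .option("url", "jdbc:oracle:thin:@//host:1521/myservice")
      .option("dbtable", "mytable")
      .option("user", "user")
      .option("password", "password")
      .option("driver", "oracle.jdbc.OracleDriver")
      .option("partitionColumn", "id")       # must be a numeric, date or timestamp column
      .option("lowerBound", "1")
      .option("upperBound", "1000000")
      .option("numPartitions", "8")
      .load())

df.write.mode("overwrite").parquet("/path/to/output")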

Where does Spark save retrieved data on Azure Databricks?

I would like to understand the difference between the RAM and storage in Azure Databricks.
Suppose I am reading CSV data from Azure Data Lake (ADLS Gen 2) as follows:
df = spark.read.csv("path to the csv file").collect()
I am aware that the read method in Spark is a transformation, and that it is not going to run immediately. However, if I now perform an action using the collect() method, I would assume that the data is actually read from the data lake by Spark and loaded into RAM or onto disk. First, I would like to know where the data is stored: is it in RAM or on disk? If the data is stored in RAM, what is cache used for? And if the data is retrieved and stored on disk, what does persist do? I am aware that cache stores the data in memory for later use, and that if I have a very large amount of data I can use persist to store the data on disk.
I would like to know how far Databricks can scale if we have petabytes of data.
How much do RAM and disk differ in size?
How can I know where the data is stored at any point in time?
What is the underlying operating system running Azure Databricks?
Please note that I am a newbie to Azure Databricks and Spark.
I would like to get some recommendation on the best practices when using Spark.
Your help is much appreciated!!
First, I would like to know, where is the data stored.
When you run any action (e.g. collect), data is collected from the executor nodes to the driver node and stored in RAM (memory).
And, if the data is stored in RAM, then what is cache used for
Spark has lazy evaluation. What that means is that until you call an action it doesn't do anything; once you call one, it creates a DAG and then executes that DAG.
Let's understand it with an example. Suppose you have three tables: Table A, Table B and Table C. You join these tables and apply some business logic (maps and filters); let's call the resulting dataframe filtered_data. Now you use this DataFrame in, say, 5 different places (other dataframes) for lookups, joins and other business reasons.
If you don't persist (cache) your filtered_data dataframe, every time it is referenced it will again go through the joins and the other business logic. So it's advisable to persist (cache) a dataframe if you are going to use it in multiple places.
By default, cache stores data in memory (RAM), but you can set the storage level to disk, as in the sketch below.
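A minimal sketch of the reuse pattern described above; the table names, join key and column names are made up:
from pyspark import StorageLevel

filtered_data = (table_a.join(table_b, "key")
                        .join(table_c, "key")
                        .filter("amount > 0"))

# Keep the result in RAM where possible and spill partitions to disk if they don't fit.
filtered_data.persist(StorageLevel.MEMORY_AND_DISK)

# Subsequent uses read the persisted data instead of re-running the joins and filters.
report_1 = filtered_data.groupBy("region").count()
report_2 = filtered_data.join(table_d, "key")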
I would like to know how far Databricks can scale if we have petabytes of data.
It's a distributed environment, so what you need to do is add more executors, and maybe increase the memory and CPU configuration.
How can I know where the data is stored at any point in time?
If you haven't created a table or view, it's stored in memory.
What is the underlying operating system running Azure Databricks?
It uses a Linux operating system,
specifically Linux-4.15.0-1050-azure-x86_64-with-Ubuntu-16.04-xenial.
You can run the following commands to check:
import platform
print(platform.platform())

Can anyone tell me in what way dask.dataframe is more memory efficient than pandas?

In terms of memory (RAM) efficiency, which is better?
What does dask do to reduce/compress large data so that it runs with a small amount of RAM?
When running on a single machine with datasets smaller than RAM, pandas/numpy should work fine. Dask is a distributed task scheduling package, which basically means you can lazily read datasets, even on a single computer. For example, a folder of .csv files that together are too big (say 60 GB) to load into memory can be loaded with dask, so you only use the data when you need it, by calling dask.dataframe.compute().
Basically, start with pandas - if your code starts throwing MemoryErrors, you can use dask instead.
Source:
http://dask.pydata.org/en/latest/why.html
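A sketch of the lazy pattern described above; the glob path and column names are placeholders:
import dask.dataframe as dd

# Builds a task graph over all the CSVs; nothing is read into RAM yet.
df = dd.read_csv("data_folder/*.csv")

# Transformations are expressed lazily too...
daily_mean = df.groupby("date").value.mean()

# ...and only materialised, chunk by chunk, when compute() is called.
result = daily_mean.compute()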

Read from Python to Spark in a distributed way [duplicate]

I am learning Spark now, and it seems to be the big data solution for the Pandas dataframe, but I have a question which makes me unsure.
Currently I am storing Pandas dataframes that are larger than memory using HDF5. HDF5 is a great tool which allows me to do chunking on the pandas dataframe. So when I need to do processing on a large Pandas dataframe, I do it in chunks.
Using a Spark dataframe may be a solution, but my understanding of Spark is that the dataframe must be able to fit in memory, and once loaded as a Spark dataframe, Spark will distribute it to the different workers to do the distributed processing.
Is my understanding correct? If this is the case, then how does Spark handle a dataframe that is larger than the memory? Does it support chunking, like HDF5?
the dataframe must be able to fit in memory, and once loaded as a Spark dataframe, Spark will distribute the dataframe to the different workers to do the distributed processing.
This is true only if you're trying to load your data on the driver and then parallelize. In a typical scenario you store data in a format which can be read in parallel. That means your data:
has to be accessible from each worker, for example via a distributed file system
has to be in a file format that supports splitting (the simplest example is plain old CSV)
In a situation like this, each worker reads only its own part of the dataset without any need to store data in the driver's memory. All logic related to computing splits is handled transparently by the applicable Hadoop input format.
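A minimal sketch of such a parallel read, assuming the CSV files live on a distributed file system (the path and column name are placeholders):
# Each worker reads its own splits of the CSV files directly from storage;
# nothing is funnelled through the driver.
df = spark.read.csv("hdfs:///data/events/*.csv", header=True, inferSchema=True)
df.groupBy("event_type").count().show()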
Regarding HDF5 files you have two options:
read data in chunks on the driver, build a Spark DataFrame from each chunk, and union the results. This is inefficient but easy to implement (a sketch follows after this list)
distribute the HDF5 file(s) and read the data directly on the workers. This is, generally speaking, harder to implement and requires a smart data distribution strategy
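A sketch of the first option; the HDF5 path, key and chunk size are placeholders, and it assumes the file was written in pandas' "table" format so it can be read with a chunksize:
from functools import reduce
import pandas as pd

# Read the HDF5 store in chunks on the driver (requires format="table" on write).
chunks = pd.read_hdf("data.h5", key="mytable", chunksize=1_000_000)

# Turn each pandas chunk into a Spark DataFrame, then union them into one.
spark_chunks = [spark.createDataFrame(chunk) for chunk in chunks]
df = reduce(lambda a, b: a.union(b), spark_chunks)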
