How to convert an RDD of pandas DataFrames to a Spark DataFrame - apache-spark

I create an RDD of pandas DataFrames as an intermediate result. I want to convert it to a Spark DataFrame and eventually save it to a Parquet file.
I want to know what the efficient way is.
Thanks
def create_df(x):
    return pd.DataFrame(np.random.rand(5, 3)).assign(col=x)

sc.parallelize(range(5)).map(create_df).\
    TO_DATAFRAME().write.format("parquet").save("parquet_file")
I have tried pd.concat to reduce the RDD to one big DataFrame, but that doesn't seem right.

Talking of efficiency: since Spark 2.3, Apache Arrow is integrated with Spark, and it is supposed to transfer data efficiently between JVM and Python processes, thus speeding up the conversion from a pandas DataFrame to a Spark DataFrame. You can enable it with
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
If your Spark distribution doesn't have Arrow integrated, this should not throw an error; the setting will just be ignored.
A sample to run in the pyspark shell could look like this:
import numpy as np
import pandas as pd
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
pdf = pd.DataFrame(np.random.rand(100, 3))
df = spark.createDataFrame(pdf)
df.write.format("parquet").save('data_parquet_file')

Your create_df method returns a pandas DataFrame, and from that you can create a Spark DataFrame directly - I'm not sure why you need sc.parallelize(range(5)).map(create_df).
So your full code could look like this:
import pandas as pd
import numpy as np

def create_df(x):
    return pd.DataFrame(np.random.rand(5, 3)).assign(col=x)

pdf = create_df(10)
df = spark.createDataFrame(pdf)
df.write.format("parquet").save('data_parquet_file')

import numpy as np
import pandas as pd

def create_df(x):
    df = pd.DataFrame(np.random.rand(5, 3)).assign(col=x)
    return df.values.tolist()

sc.parallelize(range(5)).flatMap(create_df).toDF().\
    write.format("parquet").save("parquet_file")
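If you want meaningful column names rather than the auto-generated _1, _2, ..., a small variation is to pass a list of names to toDF. This is only a sketch; the names and the output path below are placeholders I made up, not from the original code:
# Column names are illustrative placeholders; use whatever fits your data
columns = ["c0", "c1", "c2", "col"]

sc.parallelize(range(5)).flatMap(create_df).toDF(columns).\
    write.format("parquet").save("parquet_file_named")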

Related

Pandas importing DataFrame Vs whole library - Speed and memory concern

I have a helper function that uses pandas DataFrames multiple times, but I don't need any other pandas functions.
Is it better to import the whole library and call DataFrame to make the code more consistent, or to import just DataFrame?
My function will be called over 100,000 times and returns a dictionary.
import pandas as pd
temp_df = pd.DataFrame()
VS.
from pandas import DataFrame
temp_df = DataFrame()
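Not an answer to which style is better, but if you want to measure the difference yourself, a minimal timeit sketch could look like the following (numbers will vary by machine, and the module itself is only imported once either way; the two styles differ mainly in the attribute lookup):
import timeit

# Compare constructing a DataFrame via the module attribute vs. the directly imported name
print(timeit.timeit("pd.DataFrame()", setup="import pandas as pd", number=100000))
print(timeit.timeit("DataFrame()", setup="from pandas import DataFrame", number=100000))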

Computing dask delayed objects stored in dataframe

I am looking for the best way to compute many dask delayed objects stored in a DataFrame. I am unsure whether the pandas DataFrame should be converted to a dask DataFrame with the delayed objects inside, or whether compute should be called on all values of the pandas DataFrame.
I would appreciate any suggestions in general, as I am having trouble with the logic of passing delayed objects across nested for loops.
import numpy as np
import pandas as pd
from scipy.stats import hypergeom
from dask import delayed, compute

steps = 5
sample = [int(x) for x in np.linspace(5, 100, num=steps)]
enr_df = pd.DataFrame()

for N in sample:
    enr = []
    for i in range(20):
        k = np.random.randint(1, 200)
        enr.append(delayed(hypergeom.sf)(k=k, M=10000, n=20, N=N, loc=0))
    enr_df[N] = enr
I cannot call compute on this dataframe without applying the function across all cells like so: enr_df.applymap(compute) (which I believe calls compute on each value individually).
However, if I convert to a dask DataFrame, the delayed objects I want to compute are nested inside the dask DataFrame structure:
import dask.dataframe as dd

enr_dd = dd.from_pandas(enr_df, npartitions=1)
enr_dd.compute()
and the computation I expect does not happen.
You can pass a list of delayed objects into dask.compute:
results = dask.compute(*list_of_delayed_objects)
So you need to get a list out of your pandas DataFrame, which is something you can do with normal Python code.
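A minimal sketch of that idea, reusing the enr_df built above; how the results are reshaped back into a DataFrame is my assumption about the layout you want:
import dask
import numpy as np
import pandas as pd

# Flatten the DataFrame of delayed objects into one list and compute everything in a single pass
delayed_values = enr_df.values.ravel().tolist()
results = dask.compute(*delayed_values)

# Reshape the computed values back into the original row/column layout
computed_df = pd.DataFrame(
    np.array(results).reshape(enr_df.shape),
    index=enr_df.index,
    columns=enr_df.columns,
)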

Pandas DataFrame with categorical columns from a Parquet file using read_parquet?

I am converting large CSV files into Parquet files for further analysis. I read the CSV data into pandas and specify the column dtypes as follows:
_dtype = {"column_1": "float64",
          "column_2": "category",
          "column_3": "int64",
          "column_4": "int64"}
df = pd.read_csv("data.csv", dtype=_dtype)
I then do some more data cleaning and write the data out into Parquet for downstream use.
_parquet_kwargs = {"engine": "pyarrow",
                   "compression": "snappy",
                   "index": False}
df.to_parquet("data.parquet", **_parquet_kwargs)
But when I read the data back into pandas for further analysis using read_parquet, I cannot seem to recover the category dtypes. The following
df = pd.read_parquet("data.parquet")
results in a DataFrame with object dtypes in place of the desired category.
The following seems to work as expected
import pyarrow.parquet as pq
_table = (pq.ParquetFile("data.parquet")
          .read(use_pandas_metadata=True))
df = _table.to_pandas(strings_to_categorical=True)
However, I would like to know how this can be done using pd.read_parquet.
This is fixed in Arrow 0.15; the following code now keeps the columns as categories (and performance is significantly faster):
import pandas
df = pandas.DataFrame({'foo': list('aabbcc'),
                       'bar': list('xxxyyy')}).astype('category')
df.to_parquet('my_file.parquet')
df = pandas.read_parquet('my_file.parquet')
df.dtypes
We were having a similar problem. When working with a multi-file Parquet dataset, our workaround is as follows.
Based on the Table.to_pandas() documentation, the following code may be relevant:
import pyarrow.parquet as pq

dft = pq.read_table('path/to/data_parquet/', use_pandas_metadata=True)
df = dft.to_pandas(categories=['column_2'])
Note that use_pandas_metadata works for the datetime64[ns] dtype.
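If upgrading Arrow isn't an option, one workaround (just a sketch, assuming you know which columns should be categorical, e.g. column_2 from the question) is to cast back after pd.read_parquet:
import pandas as pd

# Read the Parquet file, then restore the known categorical columns by hand
df = pd.read_parquet("data.parquet")
df["column_2"] = df["column_2"].astype("category")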

Convert a pandas dataframe to a PySpark dataframe [duplicate]

I have a script with the setup below.
I am using:
1) Spark DataFrames to pull data in
2) Converting to pandas DataFrames after the initial aggregation
3) Wanting to convert back to Spark for writing to HDFS
The conversion from Spark to pandas was simple, but I am struggling with how to convert a pandas DataFrame back to Spark.
Can you advise?
from pyspark.sql import SparkSession
from pyspark.sql import *
from pyspark.sql.types import *
import pyspark.sql.functions as sqlfunc
import argparse, sys
import pandas as pd

def create_session(appname):
    spark_session = SparkSession\
        .builder\
        .appName(appname)\
        .master('yarn')\
        .config("hive.metastore.uris", "thrift://uds-far-mn1.dab.02.net:9083")\
        .enableHiveSupport()\
        .getOrCreate()
    return spark_session

### START MAIN ###
if __name__ == '__main__':
    spark_session = create_session('testing_files')
I've tried the below: no errors, but no data! To confirm, df6 does have data and is a pandas DataFrame.
df6 = df5.sort_values(['sdsf'], ascending=["true"])
sdf = spark_session.createDataFrame(df6)
sdf.show()
Here we go:
# Spark to Pandas
df_pd = df.toPandas()
# Pandas to Spark
df_sp = spark_session.createDataFrame(df_pd)
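As a side note, on Spark 2.3+ you can also enable Arrow (as in the first answer above) to speed up both directions of the conversion; if Arrow isn't available, the setting is simply ignored:
# Uses the spark_session created earlier in this script
spark_session.conf.set("spark.sql.execution.arrow.enabled", "true")

df_pd = df.toPandas()                          # Spark -> pandas
df_sp = spark_session.createDataFrame(df_pd)   # pandas -> Spark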

Adding a Vectors Column to a pyspark DataFrame

How do I add a Vectors.dense column to a pyspark dataframe?
import pandas as pd
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.ml.linalg import DenseVector
py_df = pd.DataFrame.from_dict({"time": [59., 115., 156., 421.], "event": [1, 1, 1, 0]})
sc = SparkContext(master="local")
sqlCtx = SQLContext(sc)
sdf = sqlCtx.createDataFrame(py_df)
sdf.withColumn("features", DenseVector(1))
Gives an error in file anaconda3/lib/python3.6/site-packages/pyspark/sql/dataframe.py, line 1848:
AssertionError: col should be Column
It doesn't like the DenseVector type as a column. Essentially, I have a pandas dataframe that I'd like to transform to a pyspark dataframe and add a column of the type Vectors.dense. Is there another way of doing this?
Constant vectors cannot be added as a literal. You have to use a udf:
from pyspark.sql.functions import udf
from pyspark.ml.linalg import VectorUDT
one = udf(lambda: DenseVector([1]), VectorUDT())
sdf.withColumn("features", one()).show()
But I am not sure why you need that at all. If you want to transform existing columns into vectors, use the appropriate pyspark.ml tools, such as VectorAssembler (see "Encode and assemble multiple features in PySpark"):
from pyspark.ml.feature import VectorAssembler
VectorAssembler(inputCols=["time"], outputCol="features").transform(sdf)
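A quick usage check, just to confirm the new column really is a vector column (this reuses the sdf built in the question):
from pyspark.ml.feature import VectorAssembler

out = VectorAssembler(inputCols=["time"], outputCol="features").transform(sdf)
out.printSchema()            # "features" should appear with a vector type
out.show(truncate=False)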
