Convert a pandas dataframe to a PySpark dataframe [duplicate] - python-3.x

I have a script with the setup below.
I am:
1) Using Spark dataframes to pull data in
2) Converting to pandas dataframes after the initial aggregation
3) Converting back to Spark to write to HDFS
The conversion from Spark to pandas was simple, but I am struggling to convert a pandas dataframe back to Spark.
Can you advise?
from pyspark.sql import SparkSession
import pyspark.sql.functions as sqlfunc
from pyspark.sql.types import *
import argparse, sys
from pyspark.sql import *
import pandas as pd

def create_session(appname):
    spark_session = SparkSession\
        .builder\
        .appName(appname)\
        .master('yarn')\
        .config("hive.metastore.uris", "thrift://uds-far-mn1.dab.02.net:9083")\
        .enableHiveSupport()\
        .getOrCreate()
    return spark_session

### START MAIN ###
if __name__ == '__main__':
    spark_session = create_session('testing_files')
I've tried the code below. It raises no errors, but it returns no data. To confirm, df6 does contain data and is a pandas dataframe.
df6 = df5.sort_values(['sdsf'], ascending=[True])
sdf = spark_session.createDataFrame(df6)
sdf.show()

Here we go:
# Spark to Pandas
df_pd = df.toPandas()
# Pandas to Spark
df_sp = spark_session.createDataFrame(df_pd)
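When the converted frame looks wrong, it can help to check the pandas side first, since createDataFrame only sees the columns and does not carry over the pandas index. The sketch below is not part of the original answer: prepare_for_spark and pandas_to_spark are hypothetical helper names, and spark_session is assumed to be an existing SparkSession.

```python
import pandas as pd


def prepare_for_spark(pdf: pd.DataFrame) -> pd.DataFrame:
    """Validate and tidy a pandas DataFrame before handing it to Spark.

    Hypothetical helper: fails loudly on an empty frame and resets the
    index, which createDataFrame does not carry over anyway.
    """
    if pdf.empty:
        raise ValueError("pandas DataFrame is empty; nothing to convert")
    return pdf.reset_index(drop=True)


def pandas_to_spark(spark_session, pdf: pd.DataFrame):
    """Convert pandas -> Spark; spark_session is an existing SparkSession."""
    return spark_session.createDataFrame(prepare_for_spark(pdf))


# The pandas-only part runs without a cluster.
df6 = pd.DataFrame({"sdsf": [3, 1, 2]}).sort_values(["sdsf"], ascending=True)
ready = prepare_for_spark(df6)
print(ready)
```

With a live session, pandas_to_spark(spark_session, df6).show() would then display the converted rows.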

Related

Write results from Kafka to csv in pyspark

I have set up a Kafka broker and I can read the records with pyspark.
import os
from pyspark.sql import SparkSession
import pyspark
import sys
from pyspark import SparkConf, SparkContext, SQLContext, Row
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
conf = SparkConf().setMaster("my-master").setAppName("Kafka_Spark")
sc = SparkContext(conf=conf)
sc.setLogLevel("WARN")
ssc = StreamingContext(sc,5)
kvs = KafkaUtils.createDirectStream(ssc,
    ['enriched_messages'],
    {"metadata.broker.list": "my-kafka-broker", "auto.offset.reset": "smallest"},
    keyDecoder=lambda x: x,
    valueDecoder=lambda x: x)
lines = kvs.map(lambda x: x[1])
lines.pprint()
ssc.start()
ssc.awaitTermination(10)
Example of returning data (timestamp, name, lastname, height):
2020-05-07 09:16:38, JoHN, Doe, 182.5
I want to write these records to a CSV file. lines is of type KafkaTransformedDStream, and the classic RDD-based approach does not work on it.
Does anyone have a solution?
Converting a DStream to a single RDD is not possible, because a DStream is a continuous stream of RDDs. You can use the following, which writes a separate output directory for every batch interval and so produces many files, and merge them into a single file afterwards.
lines.saveAsTextFiles("prefix", "suffix")
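The merge step can be sketched as follows. saveAsTextFiles writes one directory per batch, named prefix-TIME_IN_MS.suffix, with part-* files inside. The helper merge_part_files is a hypothetical name, and the sketch assumes the output landed on a local filesystem (for HDFS the files would have to be fetched first).

```python
import glob
import os


def merge_part_files(prefix: str, suffix: str, out_path: str) -> int:
    """Concatenate every part-* file under prefix-*.suffix into out_path.

    Returns the number of lines written. Marker files such as _SUCCESS
    and hidden checksum files are skipped because they do not match part-*.
    """
    pattern = os.path.join(
        f"{glob.escape(prefix)}-*.{glob.escape(suffix)}", "part-*"
    )
    written = 0
    with open(out_path, "w", encoding="utf-8") as out:
        # Sorting keeps batches in timestamp order (same-length timestamps).
        for part in sorted(glob.glob(pattern)):
            with open(part, encoding="utf-8") as src:
                for line in src:
                    # Make sure every record ends with exactly one newline.
                    out.write(line if line.endswith("\n") else line + "\n")
                    written += 1
    return written
```

After the stream has stopped, a call such as merge_part_files("prefix", "suffix", "merged.csv") would produce the single file.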

How to use a scikit pickle model in spark structured streaming? [duplicate]

I'm trying to apply a scikit-learn model loaded from a pickle to every row of a Structured Streaming dataframe.
I've tried using pandas_udf (code below), and it gives me this error:
AttributeError: 'numpy.ndarray' object has no attribute 'isnull'
Code:
inputPath = "/FileStore/df_training/streaming_df_1_nh_nd/"
from pyspark.sql import functions as f
from pyspark.sql.types import *
data_schema = data_spark_ts.schema
import pandas as pd
from pyspark.sql.functions import col, pandas_udf, PandasUDFType # User-defined functions for pandas DataFrames
from pyspark.sql.types import LongType
get_prediction = pandas_udf(lambda x: gb2.predict(x), IntegerType())
streamingInputDF = (
    spark
    .readStream
    .schema(data_schema) # Set the schema of the CSV data
    .option("maxFilesPerTrigger", 1) # Treat a sequence of files as a stream by picking one file at a time
    .csv(inputPath)
    .fillna(0)
    .withColumn("prediction", get_prediction(f.struct([col(x) for x in data_spark.columns])))
)
display(streamingInputDF.select("prediction"))
I've also tried using a normal udf instead of the pandas_udf, and it gives me this error:
ValueError: Expected 2D array, got 1D array instead:
[.. ... .. ..]
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
I don't know how to reshape my data.
The model I am trying to apply is loaded this way:
# load the pickle
import pickle
gb2 = None
with open('pickle_modello_unico.p', 'rb') as fp:
    gb2 = pickle.load(fp)
And its specification is:
GradientBoostingClassifier(criterion='friedman_mse', init=None,
learning_rate=0.1, loss='deviance', max_depth=3,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=300,
n_iter_no_change=None, presort='auto', random_state=None,
subsample=1.0, tol=0.0001, validation_fraction=0.1,
verbose=0, warm_start=False)
Any help to solve this?
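I solved the issue by returning a pd.Series from the pandas_udf. gb2.predict returns a numpy.ndarray, whereas a scalar pandas_udf must return a pandas Series, which is why Spark failed when it looked for isnull on the result.
Here is the working code: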
inputPath = "/FileStore/df_training/streaming_df_1_nh_nd/"
from pyspark.sql import functions as f
from pyspark.sql.types import *
data_schema = data_spark_ts.schema
import pandas as pd
from pyspark.sql.functions import col, pandas_udf, PandasUDFType # User-defined functions for pandas DataFrames
from pyspark.sql.types import LongType
get_prediction = pandas_udf(lambda x: pd.Series(gb2.predict(x)), StringType())
streamingInputDF = (
    spark
    .readStream
    .schema(data_schema) # Set the schema of the CSV data
    .option("maxFilesPerTrigger", 1) # Treat a sequence of files as a stream by picking one file at a time
    .csv(inputPath)
    .withColumn("prediction", get_prediction(f.struct([col(x) for x in data_spark.columns])))
)
display(streamingInputDF.select("prediction"))
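Both error messages come down to the shape and type of what goes into and out of predict. The sketch below reproduces that with pandas and NumPy only, so it runs without Spark. ThresholdModel is a hypothetical stand-in for the unpickled gb2, and predict_batch and predict_row are illustrative names; the assumption is that a struct column reaches a pandas_udf as a pandas DataFrame (as documented for Spark 3.0 and later) and reaches a plain udf as a single row of values.

```python
import numpy as np
import pandas as pd


class ThresholdModel:
    """Hypothetical stand-in for the unpickled classifier (gb2)."""

    def predict(self, X):
        X = np.asarray(X)
        if X.ndim != 2:
            # Mirrors scikit-learn's insistence on 2D input.
            raise ValueError("Expected 2D array, got %dD array instead" % X.ndim)
        return (X.sum(axis=1) > 1.0).astype(int)


model = ThresholdModel()


def predict_batch(features: pd.DataFrame) -> pd.Series:
    """Body for a pandas_udf: 2D batch in, pandas Series out."""
    # predict returns a numpy.ndarray; wrap it so the caller gets a Series.
    return pd.Series(model.predict(features), index=features.index)


def predict_row(values) -> int:
    """Body for a plain udf: one row in, one Python int out."""
    # A single row is 1D, so reshape it to (1, n_features) first.
    row = np.asarray(values, dtype=float).reshape(1, -1)
    return int(model.predict(row)[0])


batch = pd.DataFrame({"a": [0.2, 0.9], "b": [0.3, 0.8]})
print(predict_batch(batch).tolist())
print(predict_row([0.9, 0.8]))
```

The reshape(1, -1) call in predict_row is the one the scikit-learn error message suggests for a single sample.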

How to convert a rdd of pandas DataFrame to Spark DataFrame

I create an RDD of pandas DataFrames as an intermediate result. I want to convert it to a Spark DataFrame and eventually save it to a parquet file.
What is the most efficient way to do this?
Thanks
def create_df(x):
    return pd.DataFrame(np.random.rand(5, 3)).\
        assign(col=x)
# TO_DATAFRAME() is a placeholder for the conversion step I am missing
sc.parallelize(range(5)).map(create_df).\
    TO_DATAFRAME().write.format("parquet").save("parquet_file")
I have tried using pd.concat to reduce the RDD to one big dataframe, but that does not seem right.
On efficiency: since Spark 2.3, Apache Arrow has been integrated with Spark to transfer data efficiently between the JVM and Python processes, which speeds up the conversion from a pandas dataframe to a Spark dataframe. You can enable it with
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
(Spark 3.0 renamed this setting to spark.sql.execution.arrow.pyspark.enabled and deprecated the old name.) If your Spark distribution does not have Arrow integrated, this should not throw an error; the setting is simply ignored.
Sample code to run in the pyspark shell:
import numpy as np
import pandas as pd
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
pdf = pd.DataFrame(np.random.rand(100, 3))
df = spark.createDataFrame(pdf)
df.write.format("parquet").save('data_parquet_file')
Your create_df function returns a pandas dataframe, and you can create a Spark dataframe directly from that. I am not sure why you need "sc.parallelize(range(5)).map(create_df)".
So your full code could look like this:
import pandas as pd
import numpy as np
def create_df(x):
    return pd.DataFrame(np.random.rand(5, 3)).assign(col=x)
pdf = create_df(10)
df = spark.createDataFrame(pdf)
df.write.format("parquet").save('data_parquet_file')
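Alternatively, keep the work distributed by flattening each pandas DataFrame into rows with flatMap:
import numpy as np
import pandas as pd
def create_df(x):
    df = pd.DataFrame(np.random.rand(5, 3)).assign(col=x)
    return df.values.tolist()
sc.parallelize(range(5)).flatMap(create_df).toDF().\
    write.format("parquet").save("parquet_file")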
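One detail of the flatMap approach is that df.values on a frame with mixed column types upcasts everything to a common dtype, so the integer col column comes back as floats. A hedged variant using itertuples, which yields each row with its columns' own values, is sketched below. The column names in COLUMNS and the helpers frame_to_rows and frames_rdd_to_spark are hypothetical, and sc is assumed to be a live SparkContext with an active SparkSession.

```python
import numpy as np
import pandas as pd

COLUMNS = ["a", "b", "c", "col"]  # assumed names, not from the question


def create_df(x):
    """Build one small pandas DataFrame, as in the question."""
    return pd.DataFrame(np.random.rand(5, 3), columns=COLUMNS[:3]).assign(col=x)


def frame_to_rows(pdf: pd.DataFrame) -> list:
    """Flatten a pandas DataFrame into a list of plain tuples, one per row."""
    return list(pdf.itertuples(index=False, name=None))


def frames_rdd_to_spark(sc, n=5):
    """Build the frames on the executors and flatten them into one Spark DataFrame."""
    return (
        sc.parallelize(range(n))
        .flatMap(lambda x: frame_to_rows(create_df(x)))
        .toDF(COLUMNS)
    )


# The pandas-only part runs without Spark.
rows = frame_to_rows(create_df(7))
print(len(rows), len(rows[0]))
```

With a session available, frames_rdd_to_spark(sc).write.format("parquet").save("parquet_file") would complete the pipeline from the question.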

How can I convert this row form into JSON while pushing into kafka topic

I am using a Spark application to process text files that are dropped into the /home/user1/files/ folder on my system, mapping the comma-separated data in those files into a particular JSON format. I have written the following Python code using Spark to do this, but the output that arrives in Kafka looks like this:
Row(Name=Priyesh,Age=26,MailId=priyeshkaratha#gmail.com,Address=AddressTest,Phone=112)
Python code:
import findspark
findspark.init('/home/user1/spark')
from pyspark import SparkConf, SparkContext
from operator import add
import sys
from pyspark.streaming import StreamingContext
from pyspark.sql import Column, DataFrame, Row, SparkSession
from pyspark.streaming.kafka import KafkaUtils
import json
from kafka import SimpleProducer, KafkaClient
from kafka import KafkaProducer
producer = KafkaProducer(bootstrap_servers='server.kafka:9092')
def handler(message):
    records = message.collect()
    for record in records:
        producer.send('spark.out', str(record))
        print(record)
    producer.flush()

def main():
    sc = SparkContext(appName="PythonStreamingDirectKafkaWordCount")
    ssc = StreamingContext(sc, 1)
    lines = ssc.textFileStream('/home/user1/files/')
    fields = lines.map(lambda l: l.split(","))
    udr = fields.map(lambda p: Row(Name=p[0],Age=int(p[3].split('#')[0]),MailId=p[31],Address=p[29],Phone=p[46]))
    udr.foreachRDD(handler)
    ssc.start()
    ssc.awaitTermination()

if __name__ == "__main__":
    main()
So how can I convert this row form into JSON while pushing into kafka topic?
You can convert Spark Row objects to dicts, and then serialize those to JSON. For example, you could change this line:
producer.send('spark.out', str(record))
to this:
producer.send('spark.out', json.dumps(record.asDict()))
Alternatively, since your example code does not use DataFrames, you could create each record as a dict to begin with instead of a Row.
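The dict-based alternative can be sketched without Spark or Kafka. The field positions are taken from the question's code; line_to_record and record_to_kafka_value are hypothetical helper names, and encoding to UTF-8 bytes is an assumption about what the producer should receive when no value serializer is configured.

```python
import json


def line_to_record(line: str) -> dict:
    """Map one comma-separated line to a plain dict instead of a Row."""
    p = line.split(",")
    return {
        "Name": p[0],
        "Age": int(p[3].split("#")[0]),
        "MailId": p[31],
        "Address": p[29],
        "Phone": p[46],
    }


def record_to_kafka_value(record: dict) -> bytes:
    """Serialize the dict to JSON and encode it as UTF-8 bytes."""
    return json.dumps(record).encode("utf-8")


# In the streaming job this would replace the Row construction, e.g.
#   udr = fields_as_lines.map(line_to_record)
# and the handler would call
#   producer.send('spark.out', record_to_kafka_value(record))
```

Keeping the mapping in a plain function also makes it easy to test against a sample line before wiring it into the stream.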

Adding a Vectors Column to a pyspark DataFrame

How do I add a Vectors.dense column to a pyspark dataframe?
import pandas as pd
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.ml.linalg import DenseVector
py_df = pd.DataFrame.from_dict({"time": [59., 115., 156., 421.], "event": [1, 1, 1, 0]})
sc = SparkContext(master="local")
sqlCtx = SQLContext(sc)
sdf = sqlCtx.createDataFrame(py_df)
sdf.withColumn("features", DenseVector(1))
Gives an error in file anaconda3/lib/python3.6/site-packages/pyspark/sql/dataframe.py, line 1848:
AssertionError: col should be Column
It doesn't like the DenseVector type as a column. Essentially, I have a pandas dataframe that I'd like to transform to a pyspark dataframe and add a column of the type Vectors.dense. Is there another way of doing this?
Constant vectors cannot be added as literals. You have to use a udf:
from pyspark.sql.functions import udf
from pyspark.ml.linalg import VectorUDT
one = udf(lambda: DenseVector([1]), VectorUDT())
sdf.withColumn("features", one()).show()
But I am not sure why you need that at all. If you want to transform existing columns into vectors, use the appropriate pyspark.ml tools, such as VectorAssembler (see "Encode and assemble multiple features in PySpark"):
from pyspark.ml.feature import VectorAssembler
VectorAssembler(inputCols=["time"], outputCol="features").transform(sdf)
