Execute PySpark code from a Java/Scala application - apache-spark

Is there a way to execute PySpark code from a Java/Scala application on an existing SparkSession?
Specifically, given PySpark code that receives and returns a PySpark DataFrame, is there a way to submit it to the Java/Scala SparkSession and get back the output DataFrame:
String pySparkCode = "def my_func(input_df):\n" +
                     "    from pyspark.sql.functions import *\n" +
                     "    return input_df.selectExpr(...)\n" +
                     "        .drop(...)\n" +
                     "        .withColumn(...)\n";

SparkSession spark = SparkSession.builder().master("local").getOrCreate();
Dataset inputDF = spark.sql("SELECT * from my_table");
Dataset outputDf = spark.<SUBMIT_PYSPARK_METHOD>(pySparkCode, inputDF);
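There is no built-in <SUBMIT_PYSPARK_METHOD> on SparkSession. One workaround that is sometimes suggested is to run the Python script inside the same Spark application (for example via Spark's internal org.apache.spark.deploy.PythonRunner) and exchange DataFrames through global temp views, which are shared across sessions of a single application. A rough sketch of the Python side, assuming the Java side has registered the input with inputDF.createGlobalTempView("my_input") (the view names and the my_func body are placeholders):

from pyspark.sql import SparkSession

# Attaches to the already-running SparkContext when launched inside the same
# Spark application (e.g. through PythonRunner); otherwise it starts a new one.
spark = SparkSession.builder.getOrCreate()

def my_func(input_df):
    # Placeholder for the selectExpr(...).drop(...).withColumn(...) chain
    return input_df.select("*")

input_df = spark.table("global_temp.my_input")   # registered by the Java side
my_func(input_df).createGlobalTempView("my_output")
# The Java side can then read the result back with spark.table("global_temp.my_output")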

Related

Pyspark Job with Dataproc on GCP

I'm trying to run a pyspark job, but I keep getting a job failure for this reason:
Google Cloud Dataproc Agent reports job failure. If logs are available, they can be found at: https://console.cloud.google.com/dataproc/jobs/f8f8e95794e0457d80ea1b0c4df8d815?project=long-state-352923&region=us-central1 gcloud dataproc jobs wait 'f8f8e95794e0457d80ea1b0c4df8d815' --region 'us-central1' --project 'long-state-352923' ...
Here is also my code for running the job:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName('spark_hdfs_to_hdfs') \
    .getOrCreate()

sc = spark.sparkContext
sc.setLogLevel("WARN")

MASTER_NODE_INSTANCE_NAME = "cluster-d687-m"

log_files_rdd = sc.textFile('hdfs://{}/data/logs_example/*'.format(MASTER_NODE_INSTANCE_NAME))

splitted_rdd = log_files_rdd.map(lambda x: x.split(" "))
selected_col_rdd = splitted_rdd.map(lambda x: (x[0], x[3], x[5], x[6]))

columns = ["ip", "date", "method", "url"]
logs_df = selected_col_rdd.toDF(columns)
logs_df.createOrReplaceTempView('logs_df')

sql = """
    SELECT
        url,
        count(*) as count
    FROM logs_df
    WHERE url LIKE '%/article%'
    GROUP BY url
    """

article_count_df = spark.sql(sql)

print(" ### Get only articles and blogs records ### ")
article_count_df.show(5)
I don't understand why it's failing.
Is there a problem with the code?

PySpark - Error accessing broadcast variable in UDF while running in standalone cluster mode

import datetime

import pandas as pd
import pyspark.sql.functions as f
from pyspark.sql import SparkSession
from pyspark.sql.types import DoubleType


@f.pandas_udf(returnType=DoubleType())
def square(r: pd.Series) -> pd.Series:
    print('In pandas Udf square')
    offset_value = offset.value
    return (r * r) + offset_value


if __name__ == "__main__":
    spark = SparkSession.builder.appName("Spark").getOrCreate()
    sc = spark.sparkContext
    spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
    spark.conf.set("spark.sql.execution.arrow.enabled", "true")

    offset = sc.broadcast(10)

    x = pd.Series(range(0, 100))
    df = spark.createDataFrame(pd.DataFrame(x, columns=["x"]))
    df = df.withColumn('sq', square(df.x)).withColumn('sqsq', square(f.col('sq')))

    start_time = datetime.datetime.now()
    df.show()

    offset.unpersist()
    offset.destroy()
    spark.stop()
The above code works well if I run the submit command in local mode:
Submit.cmd --master local[*] test.py
If I try to run the same code in standalone cluster mode, i.e.
Submit.cmd --master spark://xx.xx.0.24:7077 test.py
I get an error while accessing the broadcast variable in the UDF:
java.io.IOException: Failed to delete original file 'C:\Users\xxx\AppData\Local\Temp\spark-bf6b4553-f30f-4e4a-a7f7-ef117329985c\executor-3922c28f-ed1e-4348-baa4-4ed08e042b76\spark-b59e518c-a20a-4a11-b96b-b7657b1c79ea\broadcast6537791588721535439' after copy to 'C:\Users\xxx\AppData\Local\Temp\spark-bf6b4553-f30f-4e4a-a7f7-ef117329985c\executor-3922c28f-ed1e-4348-baa4-4ed08e042b76\blockmgr-ee27f0f0-ee8b-41ea-86d6-8f923845391e\37\broadcast_0_python'
at org.apache.commons.io.FileUtils.moveFile(FileUtils.java:2835)
at org.apache.spark.storage.DiskStore.moveFileToBlock(DiskStore.scala:133)
at org.apache.spark.storage.BlockManager$TempFileBasedBlockStoreUpdater.saveToDiskStore(BlockManager.scala:424)
at org.apache.spark.storage.BlockManager$BlockStoreUpdater.$anonfun$save$1(BlockManager.scala:343)
at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1298)
Without accessing the broadcast variable in the UDF, this code works fine.
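If the goal is just to get the constant onto the executors, one workaround (a sketch, not a fix for the Windows file-move error itself) is to drop the broadcast and pass the offset to the pandas UDF as a literal column; square_with_offset is a hypothetical name:

import pandas as pd
import pyspark.sql.functions as f
from pyspark.sql import SparkSession
from pyspark.sql.types import DoubleType

@f.pandas_udf(returnType=DoubleType())
def square_with_offset(r: pd.Series, offset: pd.Series) -> pd.Series:
    # offset arrives as a column of constant values, so no broadcast block
    # needs to be written to executor disk
    return ((r * r) + offset).astype("float64")

if __name__ == "__main__":
    spark = SparkSession.builder.appName("Spark").getOrCreate()
    df = spark.createDataFrame(pd.DataFrame({"x": range(100)}))
    df = df.withColumn("sq", square_with_offset(f.col("x"), f.lit(10)))
    df.show()
    spark.stop()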

Python function returns NoneType error in AWS Glue even though the same function works on the local machine

I am new to AWS Glue. I have created a job that modifies phone numbers in a column and updates the data frame.
The script below works fine on my local machine where I run it with pyspark.
It basically adds '+00' to those phone numbers which do not start with '0':
## phoneNumber column
6-451-512-3627
0-512-582-3548
1-043-733-0050

def addCountry_code(phoneNo):
    countryCode = '+00' + phoneNo
    if phoneNo[:1] != '0':
        return str(countryCode)
    else:
        return str(phoneNo)

phone_replace_udf = udf(lambda x: addCountry_code(x), StringType())
phoneNo_rep_DF = concatDF.withColumn("phoneNumber", phone_replace_udf(sf.col('phoneNumber')))  # .drop('phoneNumber')

## output
+006-451-512-3627
0-512-582-3548
+001-043-733-0050
But when I ran the same code in the Glue context, it threw the following error:
addCountry_code countryCode= '+00'+phoneNo TypeError: must be str, not NoneType
I am wondering why this function fails in Glue?
Appreciate it if anyone can help with this.
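The NoneType in the message suggests the Glue source contains null phone numbers that the local sample did not, so '+00' + phoneNo fails on the first missing value. A null-safe variant of the function (a sketch; the null cause is an assumption) would be:

from pyspark.sql import SparkSession
from pyspark.sql import functions as sf
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()

def addCountry_code(phoneNo):
    if phoneNo is None:          # guard against missing values
        return None
    return '+00' + phoneNo if phoneNo[:1] != '0' else phoneNo

phone_replace_udf = udf(addCountry_code, StringType())

# Tiny reproduction including a null value to show the guard in action
demo = spark.createDataFrame(
    [('6-451-512-3627',), ('0-512-582-3548',), (None,)], ['phoneNumber'])
demo.withColumn('phoneNumber', phone_replace_udf(sf.col('phoneNumber'))).show()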
This should give the desired result. Use spark.udf.register to register the function
import json
import boto3
import pyspark.sql.dataframe
from pyspark.sql.types import StringType

ds = [{'phoneNumber': '6-451-512-3627'},
      {'phoneNumber': '0-512-582-3548'},
      {'phoneNumber': '1-043-733-0050'}]

sf = spark.createDataFrame(ds)

def addCountry_code(phoneNo):
    countryCode = '+00' + phoneNo
    if phoneNo[:1] != '0':
        return str(countryCode)
    else:
        return str(phoneNo)

spark.udf.register('phone_replace_udf', lambda x: addCountry_code(x), StringType())

sf.createOrReplaceTempView('sf')
spark.sql('select phone_replace_udf(phoneNumber) from sf').collect()
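Note that the snippet above assumes a SparkSession named spark is already in scope. In a Glue job script it is usually obtained from the GlueContext, along the lines of the standard Glue boilerplate (a sketch):

from pyspark.context import SparkContext
from awsglue.context import GlueContext

# Wrap the existing SparkContext in a GlueContext and take its SparkSession
glueContext = GlueContext(SparkContext.getOrCreate())
spark = glueContext.spark_session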
You can achieve this without using a udf (udfs are generally slower than built-in functions).
from pyspark.sql import SparkSession
from pyspark.sql.functions import when, concat, lit

spark = SparkSession.builder.getOrCreate()

## phoneNumber column
ds = [{'PhoneNumber': '6-451-512-3627'},
      {'PhoneNumber': '0-512-582-3548'},
      {'PhoneNumber': '1-043-733-0050'}]

df = spark.createDataFrame(ds)

df = df.withColumn(
    'PhoneNumber',
    when(~df['PhoneNumber'].startswith('0'),
         concat(lit('+00'), df['PhoneNumber']))
    .otherwise(df['PhoneNumber']))

df.show()
+-----------------+
| PhoneNumber|
+-----------------+
|+006-451-512-3627|
| 0-512-582-3548|
|+001-043-733-0050|
+-----------------+

How to direct stream(kafka) a JSON file in spark and convert it into RDD?

I wrote code that does a direct-stream (Kafka) word count when a file is given (in the producer).
Code:
from pyspark import SparkConf, SparkContext
from operator import add
import sys
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

## Constants
APP_NAME = "PythonStreamingDirectKafkaWordCount"

## OTHER FUNCTIONS/CLASSES

def main():
    sc = SparkContext(appName="PythonStreamingDirectKafkaWordCount")
    ssc = StreamingContext(sc, 2)

    brokers, topic = sys.argv[1:]
    kvs = KafkaUtils.createDirectStream(ssc, [topic], {"metadata.broker.list": brokers})

    lines = kvs.map(lambda x: x[1])
    counts = lines.flatMap(lambda line: line.split(" ")) \
        .map(lambda word: (word, 1)) \
        .reduceByKey(lambda a, b: a + b)
    counts.pprint()

    ssc.start()
    ssc.awaitTermination()

if __name__ == "__main__":
    main()
I need to convert the input JSON file to a Spark DataFrame using the DStream.
This should work:
Once you have your variable kvs containing the TransformedDStream, you can simply map over it and pass each RDD to a handler function like this:
data = kvs.map( lambda tuple: tuple[1] )
data.foreachRDD( lambda yourRdd: readMyRddsFromKafkaStream( yourRdd ) )
Then define the handler function, which creates the DataFrame from your JSON data:
def readMyRddsFromKafkaStream( readRdd ):
    # Put RDD into a Dataframe
    df = spark.read.json( readRdd )
    df.registerTempTable( "temporary_table" )
    df = spark.sql( """
        SELECT
            *
        FROM
            temporary_table
        """ )
    df.show()
Hope it helps my friends :)
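One gap in the answer above: the handler references spark, which is never defined in the question's main(). A minimal way to obtain it inside the handler (a sketch) is to create or re-use a SparkSession on the driver:

from pyspark.sql import SparkSession

def readMyRddsFromKafkaStream(readRdd):
    # foreachRDD runs this on the driver, so getOrCreate() re-uses the
    # SparkContext that already backs the StreamingContext
    spark = SparkSession.builder.getOrCreate()
    df = spark.read.json(readRdd)
    df.createOrReplaceTempView("temporary_table")
    spark.sql("SELECT * FROM temporary_table").show()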

Creating a stream from a text file in Pyspark

I'm getting the following error when I try to create a stream from a text file in Pyspark:
TypeError: unbound method textFileStream() must be called with StreamingContext instance as first argument (got str instance instead)
I don't want to use SparkContext because I get another error, so to remove that error I have to use SparkSession.
My code:
import sys

from pyspark.sql import SparkSession
from pyspark.streaming import StreamingContext
from pyspark.mllib.stat import Statistics

if __name__ == "__main__":
    spark = SparkSession.builder.appName("CrossCorrelation").getOrCreate()
    ssc = StreamingContext(spark.sparkContext, 5)

    input_path1 = sys.argv[1]
    input_path2 = sys.argv[2]

    ds1 = ssc.textFileStream(input_path1)
    lines1 = ds1.map(lambda x1: x1[1])
    windowedds1 = lines1.flatMap(lambda line1: line1.strip().split("\n")) \
        .map(lambda strelem1: float(strelem1)).window(5, 10)

    ds2 = ssc.textFileStream(input_path2)
    lines2 = ds2.map(lambda x2: x2[1])
    windowedds2 = lines2.flatMap(lambda line2: line2.strip().split("\n")) \
        .map(lambda strelem2: float(strelem2)).window(5, 10)

    result = Statistics.corr(windowedds1, windowedds2, method="pearson")

    if result > 0.7:
        print("ds1 and ds2 are correlated!!!")

    spark.stop()
Thank you!
You have to create the StreamingContext object first and then use it to call textFileStream.
spark = SparkSession.builder.appName("CrossCorrelation").getOrCreate()
ssc = StreamingContext(spark.sparkContext, 1)
ds = ssc.textFileStream(input_path)
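For completeness, a minimal end-to-end sketch (assuming the command-line argument points to a directory that new text files are dropped into while the stream runs):

import sys
from pyspark.sql import SparkSession
from pyspark.streaming import StreamingContext

spark = SparkSession.builder.appName("CrossCorrelation").getOrCreate()
ssc = StreamingContext(spark.sparkContext, 5)

lines = ssc.textFileStream(sys.argv[1])
lines.pprint()   # print a few records from each batch

ssc.start()
ssc.awaitTermination()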
