reading json file in pyspark - apache-spark

I'm new to PySpark, Below is my JSON file format from kafka.
{
"header": {
"platform":"atm",
"version":"2.0"
}
"details":[
{
"abc":"3",
"def":"4"
},
{
"abc":"5",
"def":"6"
},
{
"abc":"7",
"def":"8"
}
]
}
how can I read through the values of all "abc" "def" in details and add this is to a new list like this [(1,2),(3,4),(5,6),(7,8)]. The new list will be used to create a spark data frame. how can i do this in pyspark.I tried the below code.
parsed = messages.map(lambda (k,v): json.loads(v))
list = []
summed = parsed.map(lambda detail:list.append((String(['mcc']), String(['mid']), String(['dsrc']))))
output = summed.collect()
print output
It produces the error 'too many values to unpack'
Error message below at statement summed.collect()
16/09/12 12:46:10 INFO deprecation: mapred.task.is.map is deprecated.
Instead, use mapreduce.task.ismap 16/09/12 12:46:10 INFO deprecation:
mapred.task.partition is deprecated. Instead, use
mapreduce.task.partition 16/09/12 12:46:10 INFO deprecation:
mapred.job.id is deprecated. Instead, use mapreduce.job.id 16/09/12
12:46:10 ERROR Executor: Exception in task 1.0 in stage 0.0 (TID 1)
org.apache.spark.api.python.PythonException: Traceback (most recent
call last): File
"/usr/hdp/2.3.4.0-3485/spark/python/lib/pyspark.zip/pyspark/worker.py",
line 111, in main
process() File "/usr/hdp/2.3.4.0-3485/spark/python/lib/pyspark.zip/pyspark/worker.py",
line 106, in process
serializer.dump_stream(func(split_index, iterator), outfile) File
"/usr/hdp/2.3.4.0-3485/spark/python/lib/pyspark.zip/pyspark/serializers.py",
line 263, in dump_stream
vs = list(itertools.islice(iterator, batch)) File "", line 1, in ValueError: too many values to unpack

First of all, the json is invalid. After the header a , is missing.
That being said, lets take this json:
{"header":{"platform":"atm","version":"2.0"},"details":[{"abc":"3","def":"4"},{"abc":"5","def":"6"},{"abc":"7","def":"8"}]}
This can be processed by:
>>> df = sqlContext.jsonFile('test.json')
>>> df.first()
Row(details=[Row(abc='3', def='4'), Row(abc='5', def='6'), Row(abc='7', def='8')], header=Row(platform='atm', version='2.0'))
>>> df = df.flatMap(lambda row: row['details'])
PythonRDD[38] at RDD at PythonRDD.scala:43
>>> df.collect()
[Row(abc='3', def='4'), Row(abc='5', def='6'), Row(abc='7', def='8')]
>>> df.map(lambda entry: (int(entry['abc']), int(entry['def']))).collect()
[(3, 4), (5, 6), (7, 8)]
Hope this helps!

import pyspark
from pyspark import SparkConf
# You can configure the SparkContext
conf = SparkConf()
conf.set('spark.local.dir', '/remote/data/match/spark')
conf.set('spark.sql.shuffle.partitions', '2100')
SparkContext.setSystemProperty('spark.executor.memory', '10g')
SparkContext.setSystemProperty('spark.driver.memory', '10g')
sc = SparkContext(appName='mm_exp', conf=conf)
sqlContext = pyspark.SQLContext(sc)
data = sqlContext.read.json(file.json)
I feel that he missed an important part of the read sequence. You have to initialize a SparkContext.
When you start a SparkContext, it also spins up a webUI on port 4040. The webUI can be accessed using http://localhost:4040. That is a useful place to check progress of all calculations.

try this with latest spark version.
df = spark.read.json('test.json')

According to the info in the comments, each row in messages RDD holds one line from the json file
u'{',
u' "header": {',
u' "platform":"atm",'
Your code is failing in the following line:
parsed = messages.map(lambda (k,v): json.loads(v))
Your code takes line like: '{' and try to convert it into key,value, and execute json.loads(value)
it is clear that python/spark won't be able to divide one char '{' into key-value pair.
The json.loads() command should be executed on a complete json data-object
This specific task might be accomplished easier with pure python

Related

Getting java.lang.IllegalArgumentException after running Pandas UDF on a Spark DF

I'm new to PySpark and I've trying to do inference from a saved model here. I've tried almost every way to print the returned output from the pandas UDF but it always gives me an error. More often than not, it's java.lang.IllegalArgumentException. I don't understand what I'm doing wrong here and it'd be great if someone can help me out by debugging the pandas udf code, if that if where the problem lies.
Code:
saved_xgb.load_model("baseline_xgb.json")
sc = spark.sparkContext
broadcast_model = sc.broadcast(saved_xgb)
prediction_set = spark.sql("select *, floor(rand()*100) as prediction_group from test_dataset_view")
flattened_schema = StructType(prediction_set.schema.fields + [StructField('pred_label', FloatType(), nullable=True),
StructField('score', FloatType(), nullable=True)])
#pandas_udf(flattened_schema, PandasUDFType.GROUPED_MAP)
def model_scoring(pdf):
pdf = pd.DataFrame(pdf)
pdf = pdf.replace('\\N', np.nan)
y_test = pdf['label']
X_test = pdf.drop(['label', 'prediction_group'], axis=1)
y_pred = broadcast_model.value.predict(X_test)
auc_test = roc_auc_score(y_test, broadcast_model.value.predict_proba(X_test)[:, 1])
pdf['pred_label'] = y_pred
pdf['score'] = auc_test
return pdf
prediction_set.groupby(F.col('prediction_group')).apply(model_scoring).show()
Error Stack from stdout on Spark Submit:
XGBoost Version1.5.2
PySpark Version :2.3.2.3.1.0.0-78
PySpark Version :2.3.2.3.1.0.0-78
Traceback (most recent call last):
File "baseline_h2o.py", line 130, in <module>
prediction_set.groupby(F.col('prediction_group')).apply(model_scoring).show()
File ".../pyspark.zip/pyspark/sql/dataframe.py", line 350, in show
File ".../py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
File ".../pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File ".../py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o1241.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 3.0 failed 4 times, most recent failure: Lost task 0.3 in stage 3.0 (TID 603, executor 26): java.lang.IllegalArgumentException
at java.nio.ByteBuffer.allocate(ByteBuffer.java:334)
at org.apache.arrow.vector.ipc.message.MessageChannelReader.readNextMessage(MessageChannelReader.java:64)
I have tried printing out the Spark DF in multiple ways and displaying just the 'score' column alone, but all of them lead to errors. Even tried saving that DataFrame directly by saving and writing it, even that gave the same error. Can anyone guide me to solve this?
Edit: Solved
I had to set the environment variable inside the pandas_udf function for PyArrow to work properly on the executors.
import os
os.environ['ARROW_PRE_0_15_IPC_FORMAT']='1'

load jalali date from string in pyspark

I need to load jalali date from string and then, return it as gregorian date string. I'm using the following code:
def jalali_to_gregorian(col, format=None):
if not format:
format = "%Y/%m/d"
gre = jdatetime.datetime.strptime(col, format=format).togregorian()
return gre.strftime(format=format)
# register the function
spark.udf.register("jalali_to_gregorian", jalali_to_gregorian, StringType())
# load the date and show it:)
df = df.withColumn("financial_date", jalali_to_gregorian(df.PersianCreateDate))
df.select(['PersianCreateDate', 'financial_date']).show()
it throws ValueError: time data 'Column<PersianCreateDate>' does not match format '%Y/%m/%d' error at me.
the string from the column is a match and I have tested it. this is a problem from how spark is sending the column value to my function. anyway to solve it?
to test:
df=spark.createDataFrame([('1399/01/02',),('1399/01/01',)],['jalali'])
df = df.withColumn("gre", jalali_to_gregorian(df.jalali))
df.show()
should result in
+----------+----------+
| jalali| gre|
+----------+----------+
|1399/01/02|2020/03/20|
|1399/01/01|2020/03/21|
+----------+----------+
instead, I'm thrown at with:
Fail to execute line 2: df = df.withColumn("financial_date", jalali_to_gregorian(df.jalali))
Traceback (most recent call last):
File "/tmp/zeppelin_pyspark-6468469233020961307.py", line 375, in <module>
exec(code, _zcUserQueryNameSpace)
File "<stdin>", line 2, in <module>
File "<stdin>", line 7, in jalali_to_gregorian
File "/usr/local/lib/python2.7/dist-packages/jdatetime/__init__.py", line 929, in strptime
(date_string, format))
ValueError: time data 'Column<jalali>' does not match format '%Y/%m/%d''%Y/%m/%d'
Your problem is that you're trying to apply function to the column, not to the values inside the column.
The code that you have used: spark.udf.register("jalali_to_gregorian", jalali_to_gregorian, StringType()) registers your function for use in the Spark SQL (via spark.sql(...), not in the pyspark.
To get function that you can use inside withColumn, select, etc., you need to create a wrapper function that is done with udf function and this function should be used in the withColumn:
from pyspark.sql.functions import udf
jalali_to_gregorian_udf = udf(jalali_to_gregorian, StringType())
df = df.withColumn("gre", jalali_to_gregorian_udf(df.jalali))
>>> df.show()
+----------+----------+
| jalali| gre|
+----------+----------+
|1399/01/02|2020/03/21|
|1399/01/01|2020/03/20|
+----------+----------+
See documentation for more details.
You also have the error in the time format - instead of format = "%Y/%m/d" it should be format = "%Y/%m/%d".
P.S. If you're running on Spark 3.x, then I recommend to look to the vectorized UDFs (aka, Pandas UDFs) - they are much faster than usual UDFs, and will provide better performance if you have a lot of data.

Zeppelin/Spark: org.apache.spark.SparkException: Cannot run program "/usr/bin/": error=13, no permission

I try to get a basic regression run with Zeppelin 0.7.2 and Spark 2.1.1 on Debian 9. Both zeppelin are "installed" in /usr/local/ that means /usr/local/zeppelin/ and /usr/local/spark. Zeppelin also knows the correct SPARK_HOME. First I load the data:
%spark.pyspark
from sqlalchemy import create_engine #sql query
import pandas as pd #sql query
from pyspark import SparkContext #Spark DataFrame
from pyspark.sql import SQLContext #Spark DataFrame
# database connection and sql query
pdf = pd.read_sql("select col1, col2, col3 from table", create_engine('mysql+mysqldb://user:pass#host:3306/db').connect())
print(pdf.size) # size of pandas dataFrame
# convert pandas dataFrame into spark dataFrame
sdf = SQLContext(SparkContext.getOrCreate()).createDataFrame(pdf)
sdf.printSchema()# what does the spark dataFrame look like?
Fine, it works and I get the output with 46977 row and three cols:
46977
root
|-- col1: double (nullable = true)
|-- col2: double (nullable = true)
|-- col3: date (nullable = true)
Ok, now I want to do the regression:
%spark.pyspark
# do a linear regression with sparks ml libs
# https://community.intersystems.com/post/machine-learning-spark-and-cach%C3%A9
from pyspark.ml.regression import LinearRegression
from pyspark.ml.feature import VectorAssembler
# choose several inputCols and transform the "Features" column(s) into the correct vector format
vectorAssembler = VectorAssembler(inputCols=["col1"], outputCol="features")
data=vectorAssembler.transform(sdf)
print(data)
# Split the data into 70% training and 30% test sets.
trainingData,testData = data.randomSplit([0.7, 0.3], 0.0)
print(trainingData)
# Configure the model.
lr = LinearRegression().setFeaturesCol("features").setLabelCol("col2").setMaxIter(10)
## Train the model using the training data.
lrm = lr.fit(trainingData)
## Run the test data through the model and display its predictions for PetalLength.
#predictions = lrm.transform(testData)
#predictions.show()
But while doing lr.fit(trainingData), I get errors in the console (and log files of zeppelin). The errors seems to be while starting spark: Cannot run program "/usr/bin/": error=13, Keine Berechtigung. I wonder what should to be started in /usr/bin/ since I only use the path /usr/local/.
Traceback (most recent call last):
File "/tmp/zeppelin_pyspark-4001144784380663394.py", line 367, in <module>
raise Exception(traceback.format_exc())
Exception: Traceback (most recent call last):
File "/tmp/zeppelin_pyspark-4001144784380663394.py", line 360, in <module>
exec(code, _zcUserQueryNameSpace)
File "<stdin>", line 9, in <module>
File "/usr/local/spark/python/pyspark/ml/base.py", line 64, in fit
return self._fit(dataset)
File "/usr/local/spark/python/pyspark/ml/wrapper.py", line 236, in _fit
java_model = self._fit_java(dataset)
File "/usr/local/spark/python/pyspark/ml/wrapper.py", line 233, in _fit_java
return self._java_obj.fit(dataset._jdf)
File "/usr/local/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/usr/local/spark/python/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/usr/local/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
format(target_id, ".", name), value)
Py4JJavaError: An error occurred while calling o70.fit.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): **java.io.IOException: Cannot run program "/usr/bin/": error=13, Keine Berechtigung**
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
at org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:163)
at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:89)
at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:65)
at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:116)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:128)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
It was a configuration error in Zeppelins conf/zeppelin-env.sh. There, I had the following line uncommented that caused the error and I now commented the line and it works:
#export PYSPARK_PYTHON=/usr/bin/ # path to the python command. must be the same path on the driver(Zeppelin) and all workers.
So the problem was that the path to PYSPARK_PYTHON was not set correctly, now it uses the default python binary. I found the solution by looking for the string /usr/bin/ by doing grep -R "/usr/bin/" in the Zeppelin base directory and checked the files.

TypeError: 'JavaPackage' object is not callable

when I code the spark sql API hiveContext.sql()
from pyspark import SparkConf,SparkContext
from pyspark.sql import SQLContext,HiveContext
conf = SparkConf().setAppName("spark_sql")
sc = SparkContext(conf = conf)
hc = HiveContext(sc)
#rdd = sc.textFile("test.txt")
sqlContext = SQLContext(sc)
res = hc.sql("use teg_uee_app")
#for each in res.collect():
# print(each[0])
sc.stop()
I got the following error:
enFile "spark_sql.py", line 23, in <module>
res = hc.sql("use teg_uee_app")
File "/spark/python/pyspark/sql/context.py", line 580, in sql
return DataFrame(self._ssql_ctx.sql(sqlQuery), self)
File "/spark/python/pyspark/sql/context.py", line 683, in _ssql_ctx
self._scala_HiveContext = self._get_hive_ctx()
File "/spark/python/pyspark/sql/context.py", line 692, in _get_hive_ctx
return self._jvm.HiveContext(self._jsc.sc())
TypeError: 'JavaPackage' object is not callable
how do I add SPARK_CLASSPATH or SparkContext.addFile?I don't have idea.
Maybe this will help you: When using HiveContext I have to add three jars to the spark-submit arguments:
spark-submit --jars /usr/lib/spark/lib/datanucleus-api-jdo-3.2.6.jar,/usr/lib/spark/lib/datanucleus-core-3.2.10.jar,/usr/lib/spark/lib/datanucleus-rdbms-3.2.9.jar ...
Of course the paths and versions depend on your cluster setup.
In my case this turned out to be a classpath issue - I had a Hadoop jar on the classpath that was a wrong version of Hadoop than I was running.
Make sure you only set the executor and/or driver classpaths in one place and that there's no system-wide default applied somewhere such as .bashrc or Spark's conf/spark-env.sh.

How do I collect a single column in Spark?

I would like to perform an action on a single column.
Unfortunately, after I transform that column, it is now no longer a part of the dataframe it came from but a Column object. As such, it cannot be collected.
Here is an example:
df = sqlContext.createDataFrame([Row(array=[1,2,3])])
df['array'].collect()
This produces the following error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: 'Column' object is not callable
How can I use the collect() function on a single column?
Spark >= 2.0
Starting from Spark 2.0.0 you need to explicitly specify .rdd in order to use flatMap
df.select("array").rdd.flatMap(lambda x: x).collect()
Spark < 2.0
Just select and flatMap:
df.select("array").flatMap(lambda x: x).collect()
## [[1, 2, 3]]

Resources