How to elegantly convert multi-col rows into dataframe? - apache-spark

I want to convert an RDD to a DataFrame using a StructType, but a record like "Broken,Line," causes an error. Is there an elegant way to process records like this? Thanks.
import org.apache.spark.sql.types.{StructType, StructField, StringType}
import org.apache.spark.sql.Row
val mySchema = StructType(Array(
  StructField("colA", StringType, true),
  StructField("colB", StringType, true),
  StructField("colC", StringType, true)))
val x = List("97573,Start,eee", "9713,END,Good", "Broken,Line,")
val inputx = sc.parallelize(x).
  map((x: String) => Row.fromSeq(x.split(",").slice(0, mySchema.size).toSeq))
val df = spark.createDataFrame(inputx, mySchema)
df.show
The error looks like this:
Name: org.apache.spark.SparkException Message: Job aborted due to
stage failure: Task 0 in stage 14.0 failed 1 times, most recent
failure: Lost task 0.0 in stage 14.0 (TID 14, localhost, executor
driver): java.lang.RuntimeException: Error while encoding:
java.lang.ArrayIndexOutOfBoundsException: 2
I'm using:
Spark: 2.2.0
Scala: 2.11.8
And I ran the code in spark-shell.

Row.fromSeq, to which your schema is applied, throws the error you are getting: the third element in your list splits into only two fields, so it cannot be turned into a Row of three elements unless you insert a null in place of the missing value.
When creating the DataFrame, Spark expects three elements per Row to match the schema, hence the ArrayIndexOutOfBoundsException.
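For example, a minimal sketch that pads each split line out to the schema width with nulls; splitting with a limit of -1 also keeps trailing empty fields such as the one in "Broken,Line,":
// Sketch: reuses mySchema and x from above; pad short rows with nulls so every
// Row has exactly mySchema.size fields.
val safeRows = sc.parallelize(x).map { line =>
  val fields = line.split(",", -1).slice(0, mySchema.size)  // -1 keeps trailing empty strings
  Row.fromSeq(fields.padTo(mySchema.size, null).toSeq)      // pad in case fields are still missing
}
val safeDf = spark.createDataFrame(safeRows, mySchema)
safeDf.show()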
A quick-and-dirty solution would be to use scala.util.Try to get the fields separately:
import org.apache.spark.sql.types.{StructType, StructField, StringType}
import org.apache.spark.sql.Row
import scala.util.Try
val mySchema = StructType(Array(StructField("colA", StringType, true), StructField("colB", StringType, true), StructField("colC", StringType, true)))
val l = List("97573,Start,eee", "9713,END,Good", "Broken,Line,")
val rdd = sc.parallelize(l).map { x =>
  val fields = x.split(",").slice(0, mySchema.size)
  val f1 = Try(fields(0)).getOrElse("")
  val f2 = Try(fields(1)).getOrElse("")
  val f3 = Try(fields(2)).getOrElse("")
  Row(f1, f2, f3)
}
val df = spark.createDataFrame(rdd, mySchema)
df.show
// +------+-----+----+
// | colA| colB|colC|
// +------+-----+----+
// | 97573|Start| eee|
// | 9713| END|Good|
// |Broken| Line| |
// +------+-----+----+
I wouldn't call that an elegant solution like you've asked for. Parsing strings by hand is never elegant! You ought to use the csv source to read it correctly (or spark-csv for Spark < 2.x).
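For the record, a minimal sketch of the csv-source approach, reusing l and mySchema from above (Spark 2.2+ lets spark.read.csv take a Dataset[String]; missing or empty trailing fields come back as null):
// Sketch: parse the same lines with the built-in csv reader instead of splitting by hand.
import spark.implicits._
val csvDf = spark.read.schema(mySchema).csv(l.toDS())
csvDf.show()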

Related

Pyspark Streaming with checkpointing is failing

I am working on Spark Streaming data received through a custom receiver, using PySpark. To make it fault tolerant, I enabled checkpointing. Since then, the code that ran fine before checkpointing was introduced throws the error below.
Error message:
pubsubStream.flatMap(lambda x : x).map(lambda x: convertjson(x)).foreachRDD(lambda rdd : dstream_to_rdd(rdd))
File "/home/test/spark_checkpointing/spark_checkpoint_test.py", line 227, in dstream_to_rdd
df = spark_session.read.option("multiline","true")\
NameError: name 'sparkContext' is not defined
The code is as follows:
import sys
from pyspark import SparkContext, SparkConf
from pyspark.streaming import StreamingContext
from pyspark.sql import SparkSession
from pubsub import PubsubUtils
import json
import time
from pyspark.sql.types import (StructField, StringType, StructType, IntegerType, FloatType, LongType, BooleanType)
from google.cloud import storage
import pyspark

conf_bucket_name = <bucket_name>

# Events list
events_list = ["Event1", "Event2"]

# This chunk of schema creation will be automated later
# and most probably moved outside
full_schema = StructType([
    StructField('_id', StructType([
        StructField('_data', StringType(), True)
    ])),
    StructField('ct', StructType([
        StructField('$timestamp', StructType([
            StructField('i', LongType(), True),
            StructField('t', LongType(), True),
        ]))])),
    StructField('fg', StructType([
        StructField('sgs', StructType([
            StructField('col1', StringType(), True),
            StructField('col2', StringType(), True)
        ]))])),
    StructField('col6', StringType(), True),
    StructField('_corrupt_record', StringType(), True)
])

def convertjson(ele):
    temp = json.loads(ele.decode('utf-8'))
    if temp['col6'] == 'update':
        del temp['updateDescription']
        return temp
    return temp

def dstream_to_rdd(x):
    if not x.isEmpty():
        df = spark_session.read.option("multiline", "true")\
            .option("mode", "PERMISSIVE")\
            .option("primitivesAsString", "false")\
            .schema(full_schema)\
            .option("columnNameOfCorruptRecord", "_corrupt_record")\
            .option("allowFieldAddition", "true")\
            .json(x)
        df.show(truncate=True)
        # df.printSchema()

def createContext(all_config):
    # If you do not see this printed, that means the StreamingContext has been loaded
    # from the checkpoint
    print("Creating new context")
    ssc = StreamingContext(spark_session.sparkContext, 10)
    pubsubStream = PubsubUtils.createStream(ssc, <SUBSCRIPTION>, 10000, True)
    # Print the records of the DStream
    pubsubStream.pprint()  # DStreams are getting printed on the console
    # The DStream is flattened with flatMap (a tuple may hold multiple records),
    # then converted to json format and finally pushed to BQ
    pubsubStream.flatMap(lambda x: x).map(lambda x: convertjson(x)).foreachRDD(lambda rdd: dstream_to_rdd(rdd))
    pubsubStream.checkpoint(50)
    return ssc

if __name__ == "__main__":
    # Declaration of the spark session and streaming context
    checkpointDir = <checkpointdir path on google cloud storage>
    spark_session = SparkSession.builder.appName("Test_spark_checkpoint").getOrCreate()
    spark_session.conf.set('temporaryGcsBucket', <temp bucket name>)
    ssc = StreamingContext.getOrCreate(checkpointDir, lambda: createContext(all_config))
    ssc.start()
    ssc.awaitTermination()
The error message says sparkContext is not defined. On running dir(spark_session) I found that the returned list of attributes and methods does contain sparkContext. Am I supposed to pass it explicitly? What am I missing here?
Also, please help me understand whether the position of the checkpointing call in the code is correct.
Update (piece of code): I tried with a SparkContext instead of a SparkSession:
conf = SparkConf()
conf.setAppName("Test_spark_checkpoint")
conf.set('temporaryGcsBucket', <temp bucket>)
sc = SparkContext(conf=conf)
print(dir(sc))
ssc = StreamingContext.getOrCreate(checkpointDir,lambda: createContext(all_config))
df = sc.read.option("multiline", "true")\
    .option("mode", "PERMISSIVE")\
    .option("primitivesAsString", "false")\
    .schema(full_schema)\
    .option("columnNameOfCorruptRecord", "_corrupt_record")\
    .option("allowFieldAddition", "true")\
    .json(x)
df.show(truncate=True)

When reading a CSV is there an option to start on row 2 or below?

I'm reading a bunch of CSV files into a dataframe using the sample code below.
val df = spark.read.format("csv")
  .option("sep", "|")
  .option("inferSchema", "true")
  .option("header", "false")
  .load("mnt/rawdata/corp/ABC*.gz")
I'm hoping there is a way to start on row 2 or below, because row 1 contains some basic metadata about these files. That first row has only 4 pipe characters, so Spark thinks the file has 4 columns, when the actual data has over 100 columns.
I tried playing with the inferSchema and header but I couldn't get anything to work.
If the first line of the CSV doesn't match the actual column count and names, you may need to define your schema by hand and then try this combination:
val df = spark.read.format("csv")
  .option("sep", "|")
  .option("inferSchema", "false")
  .option("header", "true")
  .schema(mySchema)
  .option("enforceSchema", "true")
  .load(...
Full list of CSV options.
Note that for Spark 2.3 and above you can use a shorthand, SQL-style notation for the schema definition -- a simple string like "column1 type1, column2 type2, ...".
If, however, your header has more than one line, you will probably be forced to ignore all "errors" by using the additional option .option("mode", "DROPMALFORMED").
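For example, a minimal sketch combining both ideas (the column names and types below are placeholders, not the real 100+ columns):
// Sketch: DDL-string schema (Spark 2.3+) plus DROPMALFORMED, which silently drops
// any line -- such as the metadata header -- whose field count doesn't match the schema.
val df = spark.read.format("csv")
  .option("sep", "|")
  .option("header", "false")
  .option("mode", "DROPMALFORMED")
  .schema("col1 STRING, col2 STRING, col3 INT")
  .load("mnt/rawdata/corp/ABC*.gz")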
You are right! You need to define a custom schema! I ended up going with this.
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType};
import org.apache.spark.sql.functions.input_file_name
val customSchema = StructType(Array(
  StructField("field1", StringType, true),
  StructField("field2", StringType, true),
  StructField("field3", StringType, true),
  StructField("field4", StringType, true),
  StructField("field5", StringType, true),
  StructField("field6", StringType, true),
  StructField("field7", StringType, true)))

val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "false")
  .option("sep", "|")
  .schema(customSchema)
  .load("mnt/rawdata/corp/ABC*.gz")
  .withColumn("file_name", input_file_name())

Round Spark DataFrame in-place

I read a .csv file into a Spark DataFrame. For a DoubleType column, is there a way to specify at the time of the file read that this column should be rounded to 2 decimal places? I'm also supplying a custom schema to the DataFrameReader API call. Here are my schema and API calls:
val customSchema = StructType(Array(StructField("id_1", IntegerType, true),
  StructField("id_2", IntegerType, true),
  StructField("id_3", DoubleType, true)))

// using Spark's CSV reader with the custom schema
// spark == SparkSession
val parsedSchema = spark.read.format("csv").schema(customSchema).option("header", "true").option("nullvalue", "?").load("C:\\Scala\\SparkAnalytics\\block_1.csv")
After the file is read into a DataFrame, I can round the decimals like this:
parsedSchema.withColumn("cmp_fname_c1", round($"cmp_fname_c1", 3))
But this creates a new DataFrame, so I'd also like to know if it can be done in-place instead of creating a new DataFrame.
Thanks
You can specify, say, DecimalType(10, 2) for the DoubleType column in your customSchema when loading your CSV file. Let's say you have a CSV file with the following content:
id_1,id_2,Id_3
1,10,5.555
2,20,6.0
3,30,7.444
Sample code below:
import org.apache.spark.sql.types._
val customSchema = StructType(Array(
  StructField("id_1", IntegerType, true),
  StructField("id_2", IntegerType, true),
  StructField("id_3", DecimalType(10, 2), true)
))

spark.read.format("csv").schema(customSchema).
  option("header", "true").option("nullvalue", "?").
  load("/path/to/csvfile").
  show
// +----+----+----+
// |id_1|id_2|id_3|
// +----+----+----+
// | 1| 10|5.56|
// | 2| 20|6.00|
// | 3| 30|7.44|
// +----+----+----+

How to replace a string value to int in Spark Dataset?

For example, input data:
1.0
\N
Schema:
val schema = StructType(Seq(
  StructField("value", DoubleType, false)
))
Read into Spark Dataset:
val df = spark.read.schema(schema)
  .csv("/path to csv file ")
When I use this Dataset, I get an exception because "\N" is invalid for a double. How can I replace "\N" with 0.0 throughout this dataset? Thanks.
If the data is malformed, don't use a schema with an inappropriate type. Define the input as StringType:
val schema = StructType(Seq(
  StructField("value", StringType, false)
))
and cast data later:
val df = spark.read.schema(schema).csv("/path/to/csv/file")
  .withColumn("value", $"value".cast("double"))
  .na.fill(0.0)

Calculating duration by subtracting two datetime columns in string format

I have a Spark DataFrame that consists of a series of dates:
from pyspark.sql import SQLContext
from pyspark.sql import Row
from pyspark.sql.types import *
sqlContext = SQLContext(sc)
import pandas as pd
rdd = sc.parallelize([('X01','2014-02-13T12:36:14.899','2014-02-13T12:31:56.876','sip:4534454450'),
                      ('X02','2014-02-13T12:35:37.405','2014-02-13T12:32:13.321','sip:6413445440'),
                      ('X03','2014-02-13T12:36:03.825','2014-02-13T12:32:15.229','sip:4534437492'),
                      ('XO4','2014-02-13T12:37:05.460','2014-02-13T12:32:36.881','sip:6474454453'),
                      ('XO5','2014-02-13T12:36:52.721','2014-02-13T12:33:30.323','sip:8874458555')])
schema = StructType([StructField('ID', StringType(), True),
                     StructField('EndDateTime', StringType(), True),
                     StructField('StartDateTime', StringType(), True),
                     StructField('ANI', StringType(), True)])
df = sqlContext.createDataFrame(rdd, schema)
What I want to do is find the duration by subtracting StartDateTime from EndDateTime. I figured I'd try to do this using a function:
# Function to calculate time delta
def time_delta(y, x):
    end = pd.to_datetime(y)
    start = pd.to_datetime(x)
    delta = (end - start)
    return delta
# create new RDD and add new column 'Duration' by applying time_delta function
df2 = df.withColumn('Duration', time_delta(df.EndDateTime, df.StartDateTime))
However this just gives me:
>>> df2.show()
ID EndDateTime StartDateTime ANI Duration
X01 2014-02-13T12:36:... 2014-02-13T12:31:... sip:4534454450 null
X02 2014-02-13T12:35:... 2014-02-13T12:32:... sip:6413445440 null
X03 2014-02-13T12:36:... 2014-02-13T12:32:... sip:4534437492 null
XO4 2014-02-13T12:37:... 2014-02-13T12:32:... sip:6474454453 null
XO5 2014-02-13T12:36:... 2014-02-13T12:33:... sip:8874458555 null
I'm not sure if my approach is correct or not. If not, I'd gladly accept another suggested way to achieve this.
As of Spark 1.5 you can use unix_timestamp:
from pyspark.sql import functions as F
timeFmt = "yyyy-MM-dd'T'HH:mm:ss.SSS"
timeDiff = (F.unix_timestamp('EndDateTime', format=timeFmt)
            - F.unix_timestamp('StartDateTime', format=timeFmt))
df = df.withColumn("Duration", timeDiff)
Note the Java style time format.
>>> df.show()
+---+--------------------+--------------------+--------+
| ID| EndDateTime| StartDateTime|Duration|
+---+--------------------+--------------------+--------+
|X01|2014-02-13T12:36:...|2014-02-13T12:31:...| 258|
|X02|2014-02-13T12:35:...|2014-02-13T12:32:...| 204|
|X03|2014-02-13T12:36:...|2014-02-13T12:32:...| 228|
|XO4|2014-02-13T12:37:...|2014-02-13T12:32:...| 269|
|XO5|2014-02-13T12:36:...|2014-02-13T12:33:...| 202|
+---+--------------------+--------------------+--------+
Thanks to David Griffin. Here's how to do this for future reference.
from pyspark.sql import SQLContext, Row
sqlContext = SQLContext(sc)
from pyspark.sql.types import StringType, IntegerType, StructType, StructField
from pyspark.sql.functions import udf
# Build sample data
rdd = sc.parallelize([('X01','2014-02-13T12:36:14.899','2014-02-13T12:31:56.876'),
                      ('X02','2014-02-13T12:35:37.405','2014-02-13T12:32:13.321'),
                      ('X03','2014-02-13T12:36:03.825','2014-02-13T12:32:15.229'),
                      ('XO4','2014-02-13T12:37:05.460','2014-02-13T12:32:36.881'),
                      ('XO5','2014-02-13T12:36:52.721','2014-02-13T12:33:30.323')])
schema = StructType([StructField('ID', StringType(), True),
                     StructField('EndDateTime', StringType(), True),
                     StructField('StartDateTime', StringType(), True)])
df = sqlContext.createDataFrame(rdd, schema)

# define timedelta function (obtain duration in seconds)
def time_delta(y, x):
    from datetime import datetime
    end = datetime.strptime(y, '%Y-%m-%dT%H:%M:%S.%f')
    start = datetime.strptime(x, '%Y-%m-%dT%H:%M:%S.%f')
    delta = (end - start).total_seconds()
    return delta
# register as a UDF
f = udf(time_delta, IntegerType())
# Apply function
df2 = df.withColumn('Duration', f(df.EndDateTime, df.StartDateTime))
Applying time_delta() will give you duration in seconds:
>>> df2.show()
ID EndDateTime StartDateTime Duration
X01 2014-02-13T12:36:... 2014-02-13T12:31:... 258
X02 2014-02-13T12:35:... 2014-02-13T12:32:... 204
X03 2014-02-13T12:36:... 2014-02-13T12:32:... 228
XO4 2014-02-13T12:37:... 2014-02-13T12:32:... 268
XO5 2014-02-13T12:36:... 2014-02-13T12:33:... 202
datediff(Column end, Column start)
Returns the number of days from start to end.
https://spark.apache.org/docs/1.6.2/api/java/org/apache/spark/sql/functions.html
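A minimal sketch of that function in Scala, assuming a DataFrame with the same string columns as above; the cast is needed because datediff expects date or timestamp columns:
// Sketch: datediff returns whole days, so it only helps if a day-level difference is enough.
import org.apache.spark.sql.functions.{col, datediff}
val withDays = df.withColumn("DurationDays",
  datediff(col("EndDateTime").cast("timestamp"), col("StartDateTime").cast("timestamp")))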
This can be done in spark-sql by converting the string date to timestamp and then getting the difference.
1: Convert to timestamp:
CAST(UNIX_TIMESTAMP(MY_COL_NAME,'dd-MMM-yy') as TIMESTAMP)
2: Get the difference between dates using datediff function.
This will be combined in a nested function like:
spark.sql("select COL_1, COL_2, datediff( CAST( UNIX_TIMESTAMP( COL_1,'dd-MMM-yy') as TIMESTAMP), CAST( UNIX_TIMESTAMP( COL_2,'dd-MMM-yy') as TIMESTAMP) ) as LAG_in_days from MyTable")
Below is the result:
+---------+---------+-----------+
| COL_1| COL_2|LAG_in_days|
+---------+---------+-----------+
|24-JAN-17|16-JAN-17| 8|
|19-JAN-05|18-JAN-05| 1|
|23-MAY-06|23-MAY-06| 0|
|18-AUG-06|17-AUG-06| 1|
+---------+---------+-----------+
Reference: https://docs-snaplogic.atlassian.net/wiki/spaces/SD/pages/2458071/Date+Functions+and+Properties+Spark+SQL
Use DoubleType instead of IntegerType
from pyspark.sql import SQLContext, Row
sqlContext = SQLContext(sc)
from pyspark.sql.types import StringType, DoubleType, StructType, StructField
from pyspark.sql.functions import udf

# Build sample data
rdd = sc.parallelize([('X01','2014-02-13T12:36:14.899','2014-02-13T12:31:56.876'),
                      ('X02','2014-02-13T12:35:37.405','2014-02-13T12:32:13.321'),
                      ('X03','2014-02-13T12:36:03.825','2014-02-13T12:32:15.229'),
                      ('XO4','2014-02-13T12:37:05.460','2014-02-13T12:32:36.881'),
                      ('XO5','2014-02-13T12:36:52.721','2014-02-13T12:33:30.323')])
schema = StructType([StructField('ID', StringType(), True),
                     StructField('EndDateTime', StringType(), True),
                     StructField('StartDateTime', StringType(), True)])
df = sqlContext.createDataFrame(rdd, schema)

# define timedelta function (obtain duration in seconds)
def time_delta(y, x):
    from datetime import datetime
    end = datetime.strptime(y, '%Y-%m-%dT%H:%M:%S.%f')
    start = datetime.strptime(x, '%Y-%m-%dT%H:%M:%S.%f')
    delta = (end - start).total_seconds()
    return delta

# register as a UDF (DoubleType, since total_seconds() returns a float)
f = udf(time_delta, DoubleType())

# Apply function
df2 = df.withColumn('Duration', f(df.EndDateTime, df.StartDateTime))
Here is a working version for Spark 2.x, derived from jason's answer:
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession,SQLContext
from pyspark.sql.types import StringType, StructType, StructField
sc = SparkContext()
sqlContext = SQLContext(sc)
spark = SparkSession.builder.appName("Python Spark SQL basic example").getOrCreate()
rdd = sc.parallelize([('X01','2014-02-13T12:36:14.899','2014-02-13T12:31:56.876'),
                      ('X02','2014-02-13T12:35:37.405','2014-02-13T12:32:13.321'),
                      ('X03','2014-02-13T12:36:03.825','2014-02-13T12:32:15.229'),
                      ('XO4','2014-02-13T12:37:05.460','2014-02-13T12:32:36.881'),
                      ('XO5','2014-02-13T12:36:52.721','2014-02-13T12:33:30.323')])
schema = StructType([StructField('ID', StringType(), True),
                     StructField('EndDateTime', StringType(), True),
                     StructField('StartDateTime', StringType(), True)])
df = sqlContext.createDataFrame(rdd, schema)
# register as a UDF
from datetime import datetime
sqlContext.registerFunction("time_delta", lambda y,x:(datetime.strptime(y, '%Y-%m-%dT%H:%M:%S.%f')-datetime.strptime(x, '%Y-%m-%dT%H:%M:%S.%f')).total_seconds())
df.createOrReplaceTempView("Test_table")
spark.sql("SELECT ID,EndDateTime,StartDateTime,time_delta(EndDateTime,StartDateTime) as time_delta FROM Test_table").show()
sc.stop()
