How to split values from map_keys() into multiple columns in PySpark - apache-spark

I have this data frame that has a schema with a map like below:
root
|-- events: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
When I explode it or use map_keys() to obtain those values, I get the dataframe below:
+--------------------+--------------------+
| map_data| map_values|
+--------------------+--------------------+
|[[{event_name=walk..|[{event_name=walk...|
|[[{event_name=walk..| 2019-02-17|
|[[{event_name=walk..| 08:00:00|
|[[{event_name=run...|[{event_name=walk...|
|[[{event_name=fly...| 2019-02-17|
|[[{event_name=run...| 09:00:00|
+--------------------+--------------------+
This is my code to get to the dataframe shown above:
events = event_data\
    .withColumn(
        "map_data",
        F.map_values(event_data.events)
    )
events.printSchema()
events.select("map_data")\
    .withColumn(
        "map_values",
        F.explode(events.map_data)
    ).show(10)
From what I started with, I would consider this a milestone reached; however, I would like my data frame to look like this:
+--------------------+-----------+--------+
| events | date | time |
+--------------------+-----------+--------+
|[{event_name=walk...| 2019-02-17|08:00:00|
|[{event_name=walk...| 2019-02-17|09:00:00|
+--------------------+-----------+--------+
I have been researching and have seen that people are using UDFs; however, I am sure there is a way to accomplish what I want purely with DataFrames and SQL functions.
For more insight, here is how my rows look when shown with .show(truncate=False):
+--------------------+--------------------+
| map_data| map_values|
+--------------------+--------------------+
|[[{event_name=walk..|[{event_name=walk, duration=0.47, x=0.39, y=0.14, timestamp=08:02:30.574892}, {event_name=walk, duration=0.77, x=0.15, y=0.08, timestamp=08:02:50.330245}, {event_name=run, duration=0.02, x=0.54, y=0.44, timestamp=08:02:22.737803}, {event_name=run, duration=0.01, x=0.43, y=0.56, timestamp=08:02:11.629404}, {event_name=run, duration=0.03, x=0.57, y=0.4, timestamp=08:02:22.660778}, {event_name=run, duration=0.02, x=0.49, y=0.49, timestamp=08:02:56.660186}]|
|[[{event_name=walk..| 2019-02-17|
|[[{event_name=walk..| 08:00:00|
Also, with the dataframe I have now, my issue is figuring out how to split an array into multiple columns. I mention this because I can either work with that or use a more efficient process to create the dataframe from the map I was given.

I have found a solution to my problem. I needed to follow this approach (Create a dataframe from a hashmap with keys as column names and values as rows in Spark) and perform that series of computations on event_data, which is my initial dataframe.
This is how my dataframe looks now:
|25769803776|2019-03-19|[{event_name=walk, duration=0.47, x=0.39, y=0.14, timestamp=08:02:30.574892}, {event_name=walk, duration=0.77, x=0.15, y=0.08, timestamp=08:02:50.330245}, {event_name=run, duration=0.02, x=0.54, y=0.44, timestamp=08:02:22.737803}, {event_name=run, duration=0.01, x=0.43, y=0.56, timestamp=08:02:11.629404}, {event_name=run, duration=0.03, x=0.57, y=0.4, timestamp=08:02:22.660778}, {event_name=run, duration=0.02, x=0.49, y=0.49, timestamp=08:02:56.660186}]|08:02:00|
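For anyone with a similar map column, here is a minimal sketch of pulling known map keys straight into their own columns with getItem(), without a UDF; the key names "events", "date" and "time" are assumptions for illustration, not taken from my real data:
import pyspark.sql.functions as F

events_wide = event_data.select(
    F.col("events").getItem("events").alias("events"),  # array of event structs
    F.col("events").getItem("date").alias("date"),
    F.col("events").getItem("time").alias("time")
)
events_wide.show(truncate=False)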

PySpark not picking the custom schema in csv

I am struggling with a very basic PySpark example, I don't know what is going on, and I would really appreciate it if someone could help me out.
Below is my PySpark code to read a CSV file which contains three columns:
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").appName("Sample App").getOrCreate()
child_df1 = spark.read.csv("E:\\data\\person.csv",inferSchema=True,header=True,multiLine=True)
child_df1.printSchema()
Below is the output of the above code:
root
|-- CPRIMARYKEY: long (nullable = true)
|-- GENDER: string (nullable = true)
|-- FOREIGNKEY: long (nullable = true)
child_df1.select("CPRIMARYKEY","FOREIGNKEY","GENDER").show()
Output
+--------------------+----------------------+------+
| CPRIMARYKEY |FOREIGNKEY |GENDER|
+--------------------+----------------------+------+
| 6922132627268452352| -4967470388989657188| F|
|-1832965148339791872| 761108337125613824| F|
| 7948853342318925440| -914230724356211688| M|
The issue comes when I provide the custom schema
import pyspark.sql.types as T
child_schema = T.StructType(
    [
        T.StructField("CPRIMARYKEY", T.LongType()),
        T.StructField("FOREIGNKEY", T.LongType())
    ]
)
child_df2 = spark.read.csv("E:\\data\\person.csv",schema=child_schema,multiLine=True,header=True)
child_df2.show()
+--------------------+----------------------+
| CPRIMARYKEY |FOREIGNKEY|
+--------------------+----------------------+
| 6922132627268452352| null|
|-1832965148339791872| null|
| 7948853342318925440| null|
I am not able to understand why, when inferring the schema, Spark can recognize the long value, but when I provide the schema it puts null values in the FOREIGNKEY column. I have been struggling with this simple exercise for a very long time with no luck. Could someone please point out what I am missing? Thank you.
As far as I understand, the CSV has 3 columns but your schema only declares 2,
so the FOREIGNKEY and GENDER columns are de facto read as one.
Spark then tries to parse -4967470388989657188,F as a Long and returns null because it's not a valid long.
Can you add the GENDER column to the schema and see if it fixes FOREIGNKEY?
If you don't want the gender column, instead of removing it from the schema, just .drop('gender') after reading the CSV.
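A minimal sketch of that suggestion, assuming the same person.csv layout as in the question; full_schema and child_df3 are names I made up:
import pyspark.sql.types as T

# declare every column that is physically present in the file
full_schema = T.StructType(
    [
        T.StructField("CPRIMARYKEY", T.LongType()),
        T.StructField("FOREIGNKEY", T.LongType()),
        T.StructField("GENDER", T.StringType())
    ]
)

# read with the full schema, then drop the column you don't need
child_df3 = spark.read.csv("E:\\data\\person.csv", schema=full_schema, multiLine=True, header=True).drop("GENDER")
child_df3.show()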

Is there any way to handle time in pyspark?

I have a string with 6 characters which should be loaded into SQL Server as the TIME data type.
But Spark doesn't have a TIME data type. I have tried a few ways, but the column does not come back as a timestamp.
I am reading the data as a string, converting it to a timestamp, and then finally trying to extract the time values, but the value comes back as a string again.
df.select('time_col').withColumn("time_col",to_timestamp(col("time_col"),"HHmmss").cast(TimestampType())).withColumn("tim2", date_format(col("time_col"), "HHmmss")).printSchema()
root
|-- time_col: timestamp (nullable = true)
|-- tim2: string (nullable = true)
And the data looks like this, but with a different data type.
df.select('time_col').withColumn("time_col",to_timestamp(col("time_col"),"HHmmss").cast(TimestampType())).withColumn("tim2", date_format(col("time_col"), "HHmmss")).show(5)
+-------------------+------+
| time_col| tim2|
+-------------------+------+
|1970-01-01 14:44:51|144451|
|1970-01-01 14:48:37|144837|
|1970-01-01 14:46:10|144610|
|1970-01-01 11:46:39|114639|
|1970-01-01 17:44:33|174433|
+-------------------+------+
Is there any way I can get the tim2 column as a timestamp, or as a column equivalent to SQL Server's TIME data type?
I think you won't get what you are trying to do; there's no type in PySpark to handle "HH:mm:ss". See: What data type should be used for a time column.
I'd suggest you use it as a string.
In my case I used to convert it into a timestamp in Spark, and just before sending it to SQL Server make it a string; that worked fine for me.
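A minimal sketch of that workflow, assuming a string column time_col in HHmmss format as in the question; time_ts and time_str are names I made up:
from pyspark.sql.functions import col, date_format, to_timestamp

df_out = (
    df.withColumn("time_ts", to_timestamp(col("time_col"), "HHmmss"))
      .withColumn("time_str", date_format(col("time_ts"), "HH:mm:ss"))
)
# write time_str (e.g. "14:44:51") to SQL Server; the TIME column accepts the string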
Maybe this will help you, but it seems to me that this changes the column to a string:
df.withColumn('TIME', date_format('datetime', 'HH:mm:ss'))
In Scala (Python will be similar):
scala> val df = Seq("144451","144837").toDF("c").select('c.cast("INT").cast("TIMESTAMP"))
df: org.apache.spark.sql.DataFrame = [c: timestamp]
scala> df.show()
+-------------------+
| c|
+-------------------+
|1970-01-02 17:07:31|
|1970-01-02 17:13:57|
+-------------------+
scala> df.printSchema()
root
|-- c: timestamp (nullable = true)
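A rough Python equivalent of the Scala snippet above, as a sketch; note that casting the integer to a timestamp interprets it as seconds since the epoch, which is why the dates drift away from 1970-01-01:
from pyspark.sql.functions import col

df_c = spark.createDataFrame([("144451",), ("144837",)], ["c"]) \
    .select(col("c").cast("int").cast("timestamp").alias("c"))
df_c.show()
df_c.printSchema()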

Spark DataFrame is Untyped vs DataFrame has schema?

I am a beginner to Spark. While reading about DataFrames, I have come across the two statements below very often:
1) DataFrame is untyped
2) DataFrame has a schema (like a database table, which has all the information related to the table's attributes: name, type, not null)
Aren't these two statements contradictory? First we say the DataFrame is untyped, and at the same time we say the DataFrame has information about all its columns, i.e. a schema. Please help me understand what I am missing here, because if the DataFrame has a schema then it also knows the types of its columns, so how is it untyped?
DataFrames are dynamically typed, while Datasets and RDDs are statically typed. That means when you define a Dataset or RDD you need to explicitly specify a class that represents the content. This can be useful, because when you go to write transformations on your Dataset, the compiler can check your code for type safety. Take for example this dataset of Pet info. When I use pet.species or pet.name the compiler knows their types at compile time.
case class Pet(name: String, species: String, age: Int, weight: Double)
val data: Dataset[Pet] = Seq(
  Pet("spot", "dog", 2, 50.5),
  Pet("mittens", "cat", 11, 15.5),
  Pet("mickey", "mouse", 1, 1.5)).toDS
println(data.map(x => x.getClass.getSimpleName).first)
// Pet
val newDataset: Dataset[String] = data.map(pet => s"I have a ${pet.species} named ${pet.name}.")
When we switch to using a DataFrame, the schema stays the same and the data is still typed (or structured), but this information is only available at runtime. This is called dynamic typing. It prevents the compiler from catching your mistakes, but it can be very useful because it allows you to write SQL-like statements and define new columns on the fly, for example appending columns to an existing DataFrame, without needing to define a new class for every little operation. The flip side is that you can define bad operations that result in nulls or, in some cases, runtime errors.
val df: DataFrame = data.toDF
df.printSchema()
// root
// |-- name: string (nullable = true)
// |-- species: string (nullable = true)
// |-- age: integer (nullable = false)
// |-- weight: double (nullable = false)
val newDf: DataFrame = df
  .withColumn("some column", ($"age" + $"weight"))
  .withColumn("bad column", ($"name" + $"age"))
newDf.show()
// +-------+-------+---+------+-----------+----------+
// | name|species|age|weight|some column|bad column|
// +-------+-------+---+------+-----------+----------+
// | spot| dog| 2| 50.5| 52.5| null|
// |mittens| cat| 11| 15.5| 26.5| null|
// | mickey| mouse| 1| 1.5| 2.5| null|
// +-------+-------+---+------+-----------+----------+
Spark checks whether a DataFrame's types align with its given schema at run time, not at compile time. This is because the elements of a DataFrame are of type Row, and Row cannot be parameterized with a type by the compiler at compile time, so the compiler cannot check their types. Because of that, a DataFrame is untyped and not type-safe.
Datasets, on the other hand, check whether types conform to the specification at compile time. That's why Datasets are type-safe.
For more information: https://blog.knoldus.com/spark-type-safety-in-dataset-vs-dataframe/
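To see the runtime-only checking in PySpark terms, here is a small illustrative sketch (it assumes an existing SparkSession named spark, and the column names are made up): selecting a column that does not exist is accepted by the API and only fails when Spark analyzes the plan, whereas the equivalent field access on a typed Dataset would be rejected by the Scala compiler.
from pyspark.sql.utils import AnalysisException

pets = spark.createDataFrame(
    [("spot", "dog", 2), ("mittens", "cat", 11)],
    ["name", "species", "age"]
)
try:
    pets.select("weight").show()   # no such column in the schema
except AnalysisException as err:
    print("caught only at runtime:", err)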

Aggregating tuples within a DataFrame together [duplicate]

This question already has answers here:
How to pivot Spark DataFrame?
(10 answers)
Closed 4 years ago.
I am currently trying to do some aggregation on the services column. I would like to group all the similar services and sum the values, and if possible flatten this into a single row.
Input:
+------------------+--------------------+
| cid | Services|
+------------------+--------------------+
|845124826013182686| [112931, serv1]|
|845124826013182686| [146936, serv1]|
|845124826013182686| [32718, serv2]|
|845124826013182686| [28839, serv2]|
|845124826013182686| [8710, serv2]|
|845124826013182686| [2093140, serv3]|
Hopeful Output:
+------------------+--------------------+------------------+--------------------+
| cid | serv1 | serv2 | serv3 |
+------------------+--------------------+------------------+--------------------+
|845124826013182686| 259867 | 70267 | 2093140 |
Below is the code I currently have
from pyspark.sql import SparkSession, functions
spark = SparkSession.builder.appName("Service Aggregation").getOrCreate()
pathToFile = '/path/to/jsonfile'
df = spark.read.json(pathToFile)
df2 = df.select('cid',functions.explode_outer(df.nodes.services))
finaldataFrame = df2.select('cid',(functions.explode_outer(df2.col)).alias('Services'))
finaldataFrame.show()
I am quite new to PySpark and have been looking at resources and trying to create a UDF to apply to that column, but the map function within PySpark only works for RDDs and not DataFrames, and I am unsure how to move forward to get the desired output.
Any suggestions or help would be much appreciated.
Result of printSchema
root
|-- clusterId: string (nullable = true)
|-- col: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- cpuCoreInSeconds: long (nullable = true)
| | |-- name: string (nullable = true)
First, extract the service and the value from the Services column by position. Note this assumes that the value is always in position 0 and the service is always in position 1 (as shown in your example).
import pyspark.sql.functions as f
df2 = df.select(
    'cid',
    f.col("Services").getItem(0).alias('value').cast('integer'),
    f.col("Services").getItem(1).alias('service')
)
df2.show()
#+------------------+-------+-------+
#| cid| value|service|
#+------------------+-------+-------+
#|845124826013182686| 112931| serv1|
#|845124826013182686| 146936| serv1|
#|845124826013182686| 32718| serv2|
#|845124826013182686| 28839| serv2|
#|845124826013182686| 8710| serv2|
#|845124826013182686|2093140| serv3|
#+------------------+-------+-------+
Note that I cast the value to integer, but it may already be an integer depending on how your schema is defined.
Once the data is in this format, it's easy to pivot() it. Group by the cid column, pivot the service column, and aggregate by summing the value column:
df2.groupBy('cid').pivot('service').sum("value").show()
#+------------------+------+-----+-------+
#| cid| serv1|serv2| serv3|
#+------------------+------+-----+-------+
#|845124826013182686|259867|70267|2093140|
#+------------------+------+-----+-------+
Update
Based on the schema you provided, you will have to get the value and service by name, rather than by position:
df2 = df.select(
    'cid',
    f.col("Services").getItem("cpuCoreInSeconds").alias('value'),
    f.col("Services").getItem("name").alias('service')
)
The rest is the same. Also, no need to cast to integer as cpuCoreInSeconds is already a long.
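Putting the update together with the pivot step, the whole chain looks like this as a sketch (it assumes df has the exploded Services struct column, exactly as in the snippets above):
import pyspark.sql.functions as f

result = (
    df.select(
        'cid',
        f.col("Services").getItem("cpuCoreInSeconds").alias('value'),
        f.col("Services").getItem("name").alias('service')
    )
    .groupBy('cid')
    .pivot('service')
    .sum('value')
)
result.show()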

Spark 1.5.2: org.apache.spark.sql.AnalysisException: unresolved operator 'Union;

I have two dataframes df1 and df2. Both of them have the following schema:
|-- ts: long (nullable = true)
|-- id: integer (nullable = true)
|-- managers: array (nullable = true)
| |-- element: string (containsNull = true)
|-- projects: array (nullable = true)
| |-- element: string (containsNull = true)
df1 is created from an Avro file, while df2 is created from an equivalent Parquet file. However, if I execute df1.unionAll(df2).show(), I get the following error:
org.apache.spark.sql.AnalysisException: unresolved operator 'Union;
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:174)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103)
I ran into the same situation, and it turns out that not only do the fields need to be the same, but you also need to maintain the exact same ordering of the fields in both dataframes to make it work.
This is old and there are already some answers lying around, but I just faced this problem while trying to make a union of two dataframes as in...
//Join 2 dataframes
val df = left.unionAll(right)
As others have mentioned, order matters, so just select the right dataframe's columns in the same order as the left dataframe's columns:
//Join 2 dataframes, but take columns in the same order
val df = left.unionAll(right.select(left.columns.map(col):_*))
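The same trick in PySpark, as a sketch; left and right stand for your two dataframes:
from pyspark.sql.functions import col

# reorder the right dataframe's columns to match the left before the union
df = left.unionAll(right.select(*[col(c) for c in left.columns]))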
I found the following PR on GitHub:
https://github.com/apache/spark/pull/11333.
It relates to UDF (user-defined function) columns, which were not correctly handled during the union and thus would cause the union to fail. The PR fixes it, but it hasn't made it into Spark 1.6.2; I haven't checked Spark 2.x yet.
If you're stuck on 1.6.x there's a stupid workaround: map the DataFrame to an RDD and back to a DataFrame.
// for a DF with 2 columns (Long, Array[Long])
val simple = dfWithUDFColumn
  .map{ r => (r.getLong(0), r.getAs[Array[Long]](1))} // DF --> RDD[(Long, Array[Long])]
  .toDF("id", "tags") // RDD --> back to DF but now without UDF column
// dfOrigin has the same structure but no UDF columns
val joined = dfOrigin.unionAll(simple).dropDuplicates(Seq("id")).cache()
