Aggregating tuples within a DataFrame together [duplicate] - apache-spark

This question already has answers here:
How to pivot Spark DataFrame?
I am currently trying to do some aggregation on the Services column. I would like to group all the similar services, sum their values, and if possible flatten this into a single row.
Input:
+------------------+--------------------+
| cid | Services|
+------------------+--------------------+
|845124826013182686| [112931, serv1]|
|845124826013182686| [146936, serv1]|
|845124826013182686| [32718, serv2]|
|845124826013182686| [28839, serv2]|
|845124826013182686| [8710, serv2]|
|845124826013182686| [2093140, serv3]|
Hopeful Output:
+------------------+--------------------+------------------+--------------------+
| cid | serv1 | serv2 | serv3 |
+------------------+--------------------+------------------+--------------------+
|845124826013182686| 259867 | 70267 | 2093140 |
Below is the code I currently have
from pyspark.sql import SparkSession, functions
spark = SparkSession.builder.appName("Service Aggregation").getOrCreate()
pathToFile = '/path/to/jsonfile'
df = spark.read.json(pathToFile)
df2 = df.select('cid',functions.explode_outer(df.nodes.services))
finaldataFrame = df2.select('cid',(functions.explode_outer(df2.col)).alias('Services'))
finaldataFrame.show()
I am quite new to pyspark and have been looking at resources and trying to create some UDF to apply to that column, but the map function within pyspark only works for RDDs and not DataFrames, and I am unsure how to move forward to get the desired output.
Any suggestions or help would be much appreciated.
Result of printSchema
root
|-- clusterId: string (nullable = true)
|-- col: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- cpuCoreInSeconds: long (nullable = true)
| | |-- name: string (nullable = true)

First, extract the service and the value from the Services column by position. Note this assumes that the value is always in position 0 and the service is always in position 1 (as shown in your example).
import pyspark.sql.functions as f
df2 = df.select(
    'cid',
    f.col("Services").getItem(0).alias('value').cast('integer'),
    f.col("Services").getItem(1).alias('service')
)
df2.show()
#+------------------+-------+-------+
#| cid| value|service|
#+------------------+-------+-------+
#|845124826013182686| 112931| serv1|
#|845124826013182686| 146936| serv1|
#|845124826013182686| 32718| serv2|
#|845124826013182686| 28839| serv2|
#|845124826013182686| 8710| serv2|
#|845124826013182686|2093140| serv3|
#+------------------+-------+-------+
Note that I cast the value to integer, but it may already be an integer depending on how your schema is defined.
Once the data is in this format, it's easy to pivot() it. Group by the cid column, pivot the service column, and aggregate by summing the value column:
df2.groupBy('cid').pivot('service').sum("value").show()
#+------------------+------+-----+-------+
#| cid| serv1|serv2| serv3|
#+------------------+------+-----+-------+
#|845124826013182686|259867|70267|2093140|
#+------------------+------+-----+-------+
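As a side note (not part of the original answer): if you already know the full set of services, you can pass them explicitly to pivot(), which saves Spark the extra job it otherwise runs to compute the distinct pivot values:
df2.groupBy('cid').pivot('service', ['serv1', 'serv2', 'serv3']).sum('value').show()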
Update
Based on the schema you provided, you will have to get the value and service by name, rather than by position:
df2 = df.select(
    'cid',
    f.col("Services").getItem("cpuCoreInSeconds").alias('value'),
    f.col("Services").getItem("name").alias('service')
)
The rest is the same. Also, no need to cast to integer as cpuCoreInSeconds is already a long.

Related

Endless execution with spark udf

I want to get the country from latitude and longitude, so I used geopy and created a sample dataframe
data = [{"latitude": -23.558111, "longitude": -46.64439},
{"latitude": 41.877445, "longitude": -87.723846},
{"latitude": 29.986801, "longitude": -90.166314}
]
then created a UDF:
#F.udf("string")
def city_state_country(lat,lng):
geolocator = Nominatim(user_agent="geoap")
coord = f"{lat},{lng}"
location = geolocator.reverse(coord, exactly_one=True)
address = location.raw['address']
country = address.get('country', '')
return country
and it works; this is the result:
df2 = df.withColumn("contr",city_state_country("latitude","longitude"))
+----------+----------+-------------+
| latitude| longitude| contr|
+----------+----------+-------------+
|-23.558111| -46.64439| Brasil|
| 41.877445|-87.723846|United States|
| 29.986801|-90.166314|United States|
+----------+----------+-------------+
But when I want to use my own data with the schema
root
|-- id: integer (nullable = true)
|-- open_time: string (nullable = true)
|-- starting_lng: float (nullable = true)
|-- starting_lat: float (nullable = true)
|-- user_id: string (nullable = true)
|-- date: string (nullable = true)
|-- lat/long: string (nullable = false)
and 4 million rows, so I use limit and select:
df_open_app3= df_open_app2.select("starting_lng","starting_lat").limit(10)
Finally, I use the same UDF:
df_open_app4= df_open_app3.withColumn('con', city_state_country("starting_lat","starting_lng"))
The problem is that when I execute a display, the process is endless. I don't know why; theoretically it should process only 10 rows.
I tried a similar scenario in my environment and it works fine with around a million records.
I created a sample UDF and a dataframe with around a million records,
selected particular columns, and executed the function on them.
As Derek O suggested, try using .cache() when creating the dataframe so you don't need to re-read it and can reuse the cached dataframe, which matters when you have billions of records. Since actions trigger the transformations, display is the first action here, so it triggers execution of everything that builds the dataframes above it, which can look like an endless run.
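A minimal sketch of that suggestion, reusing the dataframe and UDF names from the question: cache the small, limited dataframe and materialize it once before applying the geocoding UDF, so the later display/show does not re-trigger the reads above it.

df_open_app3 = (df_open_app2
                .select("starting_lng", "starting_lat")
                .limit(10)
                .cache())
df_open_app3.count()  # action that materializes the 10 cached rows once

df_open_app4 = df_open_app3.withColumn(
    "con", city_state_country("starting_lat", "starting_lng")
)
df_open_app4.show()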

PySpark not picking the custom schema in csv

I am struggling with a very basic pyspark example and I don't know what is going on; I would really appreciate it if someone could help me out.
Below is my pyspark code to read a csv file which contains three columns
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").appName("Sample App").getOrCreate()
child_df1 = spark.read.csv("E:\\data\\person.csv",inferSchema=True,header=True,multiLine=True)
child_df1.printSchema()
Below is the output of above code
root
|-- CPRIMARYKEY: long (nullable = true)
|-- GENDER: string (nullable = true)
|-- FOREIGNKEY: long (nullable = true)
child_df1.select("CPRIMARYKEY","FOREIGNKEY","GENDER").show()
Output
+--------------------+----------------------+------+
| CPRIMARYKEY |FOREIGNKEY |GENDER|
+--------------------+----------------------+------+
| 6922132627268452352| -4967470388989657188| F|
|-1832965148339791872| 761108337125613824| F|
| 7948853342318925440| -914230724356211688| M|
The issue comes when I provide the custom schema
import pyspark.sql.types as T
child_schema = T.StructType(
    [
        T.StructField("CPRIMARYKEY", T.LongType()),
        T.StructField("FOREIGNKEY", T.LongType())
    ]
)
child_df2 = spark.read.csv("E:\\data\\person.csv",schema=child_schema,multiLine=True,header=True)
child_df2.show()
+--------------------+----------------------+
| CPRIMARYKEY |FOREIGNKEY|
+--------------------+----------------------+
| 6922132627268452352| null|
|-1832965148339791872| null|
| 7948853342318925440| null|
I am not able to understand why Spark can recognize the long value when inferring the schema, but puts null values in the FOREIGNKEY column when I provide the schema. I have been struggling with this simple exercise for a very long time with no luck. Could someone please point out what I am missing? Thank you.
As far as I understand, your schema tells Spark the CSV has only 2 columns,
so the FOREIGNKEY and GENDER columns are de-facto one.
Spark then tries to parse -4967470388989657188,F as a long and returns null because it's not a valid long.
Can you add the GENDER column to the schema and see if it fixes FOREIGNKEY?
If you don't want the gender column, instead of removing it from the schema just .drop('GENDER') after reading the csv; see the sketch below.
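A minimal sketch of that suggestion, reusing the names from the question (full_schema is just a new name for illustration; the schema now covers all three CSV columns, and GENDER is dropped afterwards):

import pyspark.sql.types as T

full_schema = T.StructType(
    [
        T.StructField("CPRIMARYKEY", T.LongType()),
        T.StructField("FOREIGNKEY", T.LongType()),
        T.StructField("GENDER", T.StringType())
    ]
)
child_df2 = spark.read.csv("E:\\data\\person.csv", schema=full_schema, multiLine=True, header=True).drop("GENDER")
child_df2.show()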

How can I extract a date from a struct type column in a PySpark dataframe?

I'm dealing with a PySpark dataframe which has a struct type column, as shown below:
df.printSchema()
#root
#|-- timeframe: struct (nullable = false)
#| |-- start: timestamp (nullable = true)
#| |-- end: timestamp (nullable = true)
So I tried to collect() the end timestamps/windows of the related column and pass them on for plotting:
from pyspark.sql.functions import *
# method 1
ts1 = [val('timeframe.end') for val in df.select(date_format(col('timeframe.end'),"yyyy-MM-dd")).collect()]
# method 2
ts2 = [val('timeframe.end') for val in df.select('timeframe.end').collect()]
Normally, when the column is not a struct, I follow this answer, but in this case I couldn't find better ways except this post and this answer, which try to convert it to arrays. I'm not sure that is the best practice.
The 2 methods I tried above are unsuccessful; they output the following:
print(ts1) #[Row(2021-12-28='timeframe.end')]
print(ts2) #[Row(2021-12-28 00:00:00='timeframe.end')]
Expected outputs are below:
print(ts1) #[2021-12-28] just date format
print(ts2) #[2021-12-28 00:00:00] just timestamp format
How can I handle this matter?
You can access Row fields using brackets (row["field"]) or with a dot (row.field), not with parentheses. Try this instead:
from pyspark.sql import Row
import pyspark.sql.functions as F
df = spark.createDataFrame([Row(timeframe=Row(start="2021-12-28 00:00:00", end="2022-01-06 00:00:00"))])
ts1 = [r["end"] for r in df.select(F.date_format(F.col("timeframe.end"), "yyyy-MM-dd").alias("end")).collect()]
# or
# ts1 = [r.end for r in df.select(F.date_format(F.col("timeframe.end"), "yyyy-MM-dd").alias("end")).collect()]
print(ts1)
#['2022-01-06']
When you do row("timeframe.end") you actually calling the class Row that's why you get those values.

pyspark can't stop reading empty string as null (spark 3.0)

I have a csv data file like this (^ as delimiter):
+---+----+---+
| ID|name|age|
+---+----+---+
|  0|    |   |
|  1|Mike| 20|
+---+----+---+
When I do
df = spark.read.option("delimiter", "^").option("quote","").option("header", "true").option(
"inferSchema", "true").csv(xxxxxxx)
spark defaults the 2 columns after the 0 row to null
df.show():
+---+----+----+
| ID|name| age|
+---+----+----+
|  0|null|null|
|  1|Mike|  20|
+---+----+----+
How can I stop pyspark from reading the data as null and keep it as an empty string?
I have tried adding some options at the end:
1,option("nullValue", "xxxx").option("treatEmptyValuesAsNulls", False)
2,option("nullValue", None).option("treatEmptyValuesAsNulls", False)
3,option("nullValue", None).option("emptyValue", None)
4,option("nullValue", "xxx").option("emptyValue", "xxx")
But no matter what I do, pyspark still reads the data as null. Is there a way to make pyspark read the empty string as it is?
Thanks
It looks like empty values have been treated as null since Spark version 2.0.1. One way to achieve your result is using df.na.fill(...):
df = spark.read.csv('your_data_path', sep='^', header=True)
# root
# |-- ID: string (nullable = true)
# |-- name: string (nullable = true)
# |-- age: string (nullable = true)
# Fill all columns
# df = df.na.fill('')
# Fill specific columns
df = df.na.fill('', subset=['name', 'age'])
df.show(truncate=False)
Output
+---+----+---+
|ID |name|age|
+---+----+---+
|0 | | |
|1 |Mike|20 |
+---+----+---+
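An alternative (not from the original answer, just a sketch): coalesce the specific columns to an empty string, which has the same effect column by column:

import pyspark.sql.functions as F

df = df.withColumn("name", F.coalesce(F.col("name"), F.lit(""))) \
       .withColumn("age", F.coalesce(F.col("age"), F.lit("")))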

pyspark-java.lang.IllegalStateException: Input row doesn't have expected number of values required by the schema

I'm running pyspark-sql code on the Hortonworks sandbox
18/08/11 17:02:22 INFO spark.SparkContext: Running Spark version 1.6.3
# code
from pyspark.sql import *
from pyspark.sql.types import *
rdd1 = sc.textFile ("/user/maria_dev/spark_data/products.csv")
rdd2 = rdd1.map( lambda x : x.split("," ) )
df1 = sqlContext.createDataFrame(rdd2, ["id","cat_id","name","desc","price", "url"])
df1.printSchema()
root
|-- id: string (nullable = true)
|-- cat_id: string (nullable = true)
|-- name: string (nullable = true)
|-- desc: string (nullable = true)
|-- price: string (nullable = true)
|-- url: string (nullable = true)
df1.show()
+---+------+--------------------+----+------+--------------------+
| id|cat_id| name|desc| price| url|
+---+------+--------------------+----+------+--------------------+
| 1| 2|Quest Q64 10 FT. ...| | 59.98|http://images.acm...|
| 2| 2|Under Armour Men'...| |129.99|http://images.acm...|
| 3| 2|Under Armour Men'...| | 89.99|http://images.acm...|
| 4| 2|Under Armour Men'...| | 89.99|http://images.acm...|
| 5| 2|Riddell Youth Rev...| |199.99|http://images.acm...|
# When I try to get counts I get the following error.
df1.count()
**Caused by: java.lang.IllegalStateException: Input row doesn't have expected number of values required by the schema. 6 fields are required while 7 values are provided.**
# I get the same error for the following code as well
df1.registerTempTable("products_tab")
df_query = sqlContext.sql ("select id, name, desc from products_tab order by name, id ").show();
I see that column desc is null; I'm not sure if a null column needs to be handled differently when creating the data frame and calling methods on it.
The same error occurs when running the sql query. The sql error seems to be due to the "order by" clause; if I remove order by, the query runs successfully.
Please let me know if you need more info; an answer on how to handle this error would be appreciated.
I tried to see if the name field contains any comma, as suggested by Chandan Ray.
There's no comma in the name field.
rdd1.count()
=> 1345
rdd2.count()
=> 1345
# clipping id and name column from rdd2
rdd_name = rdd2.map(lambda x: (x[0], x[2]) )
rdd_name.count()
=>1345
rdd_name_comma = rdd_name.filter (lambda x : True if x[1].find(",") != -1 else False )
rdd_name_comma.count()
==> 0
I found the issue: it was due to one bad record where a comma was embedded in the string. Even though the string was double quoted, the plain Python split breaks it into 2 columns.
I tried using the Databricks csv package instead:
# from command prompt
pyspark --packages com.databricks:spark-csv_2.10:1.4.0
# on pyspark
schema1 = StructType([
    StructField("id", IntegerType(), True),
    StructField("cat_id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("desc", StringType(), True),
    StructField("price", DecimalType(), True),
    StructField("url", StringType(), True)
])
df1 = sqlContext.read.format('com.databricks.spark.csv').schema(schema1).load('/user/maria_dev/spark_data/products.csv')
df1.show()
+---+------+--------------------+----+-----+--------------------+
| id|cat_id| name|desc|price| url|
+---+------+--------------------+----+-----+--------------------+
| 1| 2|Quest Q64 10 FT. ...| | 60|http://images.acm...|
| 2| 2|Under Armour Men'...| | 130|http://images.acm...|
| 3| 2|Under Armour Men'...| | 90|http://images.acm...|
| 4| 2|Under Armour Men'...| | 90|http://images.acm...|
| 5| 2|Riddell Youth Rev...| | 200|http://images.acm...|
df1.printSchema()
root
|-- id: integer (nullable = true)
|-- cat_id: integer (nullable = true)
|-- name: string (nullable = true)
|-- desc: string (nullable = true)
|-- price: decimal(10,0) (nullable = true)
|-- url: string (nullable = true)
df1.count()
1345
I suppose your name field has a comma in it, so the split is breaking on that as well, and Spark then sees 7 values where the schema expects 6 columns.
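A quick way to confirm that (a sketch, reusing rdd2 from the question) is to look for rows that do not split into exactly 6 fields:

bad_rows = rdd2.filter(lambda fields: len(fields) != 6)
print(bad_rows.count())   # how many malformed rows
bad_rows.take(5)          # inspect a few of the offending rows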
There might be some malformed lines.
Please try the code below to exclude bad records, routing them to a separate path:
val df = spark.read.format("csv").option("badRecordsPath", "/tmp/badRecordsPath").load("csvpath")
// it reads the csv into a dataframe; any malformed record is moved to the path you provided
// please read below
https://docs.databricks.com/spark/latest/spark-sql/handling-bad-records.html
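For reference, a PySpark sketch of the same idea (assuming Spark 2+ and the schema1 defined in the question). Note that badRecordsPath is a Databricks-specific option; on plain Apache Spark you can instead use the standard CSV parse modes, e.g. PERMISSIVE with a corrupt-record column:

from pyspark.sql.types import StructType, StructField, StringType

# Databricks runtime: route malformed rows to a separate location
df1 = spark.read.format("csv").schema(schema1).option("badRecordsPath", "/tmp/badRecordsPath").load("/user/maria_dev/spark_data/products.csv")

# Plain Spark alternative: keep malformed rows in a _corrupt_record column
schema_with_corrupt = StructType(schema1.fields + [StructField("_corrupt_record", StringType(), True)])
df1 = (spark.read
       .option("mode", "PERMISSIVE")
       .option("columnNameOfCorruptRecord", "_corrupt_record")
       .schema(schema_with_corrupt)
       .csv("/user/maria_dev/spark_data/products.csv"))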
Here is my take on cleaning such records; we normally encounter such situations:
a. An anomaly in the data where, when the file was created, nobody checked whether "," was the best delimiter for the columns.
Here is my solution for the case:
Solution a: In such cases, we would like the process to identify, as part of data cleansing, whether a record is qualified. The rest of the records, if routed to a bad file/collection, give us the opportunity to reconcile such records later.
Below is the structure of my dataset (product_id,product_name,unit_price)
1,product-1,10
2,product-2,20
3,product,3,30
In the above case, product,3 is supposed to be read as product-3, which might have been a typo when the product was registered. In such a case, the sample below would work.
>>> tf = open("C:/users/ip2134/pyspark_practice/test_file.txt")
>>> trec = tf.read().splitlines()
>>> trec_clean, trec_bad = [], []
>>> for rec in trec:
...     if rec.count(",") == 2:
...         trec_clean.append(rec)
...     else:
...         trec_bad.append(rec)
...
>>> trec_clean
['1,product-1,10', '2,product-2,20']
>>> trec_bad
['3,product,3,30']
>>> trec
['1,product-1,10', '2,product-2,20', '3,product,3,30']
The other alternative for dealing with this problem would be to see whether skipinitialspace=True helps parse out the columns.
(Ref:Python parse CSV ignoring comma with double-quotes)
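Since the root cause here is a comma inside a double-quoted field, Python's csv module (which respects quoting) is a safer way to split the lines than str.split; a minimal sketch for a local file:

import csv

with open("products.csv", newline="") as f:
    rows = list(csv.reader(f, skipinitialspace=True))
# csv.reader keeps a quoted field like "product, 3" together instead of splitting it on the comma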
