Pyspark: create spark data frame from nested dictionary - apache-spark

How do I create a Spark data frame from a nested dictionary? I'm new to Spark, and I do not want to use a pandas data frame.
My dictionary looks like:
{'prathameshsalap#gmail.com': {'Date': datetime.date(2019, 10, 21),'idle_time': datetime.datetime(2019, 10, 21, 1, 50)},
'vaishusawant143#gmail.com': {'Date': datetime.date(2019, 10, 21),'idle_time': datetime.datetime(2019, 10, 21, 1, 35)},
'you#example.com': {'Date': datetime.date(2019, 10, 21),'idle_time': datetime.datetime(2019, 10, 21, 1, 55)}
}
I want to convert this dict to a Spark data frame using PySpark.
My expected output:
                           Date        idle_time
user_name
prathameshsalap#gmail.com  2019-10-21  2019-10-21 01:50:00
vaishusawant143#gmail.com  2019-10-21  2019-10-21 01:35:00
you#example.com            2019-10-21  2019-10-21 01:55:00

You need to reshape your dictionary and build Row objects so that the schema can be properly inferred.
import datetime
from pyspark.sql import Row
data_dict = {
    'prathameshsalap#gmail.com': {
        'Date': datetime.date(2019, 10, 21),
        'idle_time': datetime.datetime(2019, 10, 21, 1, 50)
    },
    'vaishusawant143#gmail.com': {
        'Date': datetime.date(2019, 10, 21),
        'idle_time': datetime.datetime(2019, 10, 21, 1, 35)
    },
    'you#example.com': {
        'Date': datetime.date(2019, 10, 21),
        'idle_time': datetime.datetime(2019, 10, 21, 1, 55)
    }
}
# build one Row per user, pulling each dict key in as the user_name column
data_as_rows = [Row(**{'user_name': k, **v}) for k, v in data_dict.items()]
data_df = spark.createDataFrame(data_as_rows).select('user_name', 'Date', 'idle_time')
data_df.show(truncate=False)
>>>
+-------------------------+----------+-------------------+
|user_name |Date |idle_time |
+-------------------------+----------+-------------------+
|prathameshsalap#gmail.com|2019-10-21|2019-10-21 01:50:00|
|vaishusawant143#gmail.com|2019-10-21|2019-10-21 01:35:00|
|you#example.com |2019-10-21|2019-10-21 01:55:00|
+-------------------------+----------+-------------------+
Note: if you already have the schema prepared and don't need to infer it, you can just supply the schema to the createDataFrame function:
import pyspark.sql.types as T

schema = T.StructType([
    T.StructField('user_name', T.StringType(), False),
    T.StructField('Date', T.DateType(), False),
    T.StructField('idle_time', T.TimestampType(), False)
])
data_as_tuples = [(k, v['Date'], v['idle_time']) for k, v in data_dict.items()]
data_df = spark.createDataFrame(data_as_tuples, schema=schema)
data_df.show(truncate=False)
>>>
+-------------------------+----------+-------------------+
|user_name |Date |idle_time |
+-------------------------+----------+-------------------+
|prathameshsalap#gmail.com|2019-10-21|2019-10-21 01:50:00|
|vaishusawant143#gmail.com|2019-10-21|2019-10-21 01:35:00|
|you#example.com |2019-10-21|2019-10-21 01:55:00|
+-------------------------+----------+-------------------+
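As an aside, createDataFrame can also infer the schema directly from a list of dicts, so the reshaping can be folded into a single expression. A minimal sketch using the same data_dict as above (some PySpark versions print a deprecation warning for dict-based schema inference, but it works):
# sketch: let Spark infer the schema from plain dicts instead of Row objects
data_df = spark.createDataFrame(
    [{'user_name': k, **v} for k, v in data_dict.items()]
).select('user_name', 'Date', 'idle_time')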

Convert the dictionary to a list of tuples; each tuple will then become a row in the Spark DataFrame:
rows = []
for key, value in data.items():
    row = (key, value['Date'], value['idle_time'])
    rows.append(row)
Define schema for your data:
from pyspark.sql.types import *

sch = StructType([
    StructField('user_name', StringType()),
    StructField('date', DateType()),
    StructField('idle_time', TimestampType())
])
Create the Spark DataFrame:
df = spark.createDataFrame(rows, sch)
df.show()
+--------------------+----------+-------------------+
| user_name| date| idle_time|
+--------------------+----------+-------------------+
|prathameshsalap#g...|2019-10-21|2019-10-21 01:50:00|
|vaishusawant143#g...|2019-10-21|2019-10-21 01:35:00|
| you#example.com|2019-10-21|2019-10-21 01:55:00|
+--------------------+----------+-------------------+
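To double-check that the declared types took effect, printSchema is a quick sanity check (the output below is what I would expect given the schema above, with nullability defaulting to true):
df.printSchema()
# root
#  |-- user_name: string (nullable = true)
#  |-- date: date (nullable = true)
#  |-- idle_time: timestamp (nullable = true)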

Related

Create datetime object from numpy array floats

I have a numpy array which contains hours from 4 days:
s = np.array([0.0, 1.0, 2.0, 3.0, 4.0 ....96.0])
I want to create datetime objects from it.
I know that the first element is at timestamp 2021-03-21 00:00,
so:
start_date = datetime.datetime.strptime('2021-03-21 00:00', '%Y-%m-%d %H:%M')
How can I create a new array containing datetimes, incremented by the hours in the s array?
Use timedelta to build your new array:
>>> import numpy as np
>>> from datetime import datetime, timedelta
>>> s = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 96.0])
>>> start_date = datetime.strptime('2021-03-21 00:00', '%Y-%m-%d %H:%M')
>>> [start_date + timedelta(hours=diff) for diff in s]
[datetime.datetime(2021, 3, 21, 0, 0), datetime.datetime(2021, 3, 21, 1, 0), datetime.datetime(2021, 3, 21, 2, 0), datetime.datetime(2021, 3, 21, 3, 0), datetime.datetime(2021, 3, 21, 4, 0), datetime.datetime(2021, 3, 25, 0, 0)]
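If you would rather stay in NumPy and get a datetime64 array back, a timedelta64-based sketch should also work, assuming the offsets in s are whole hours (the cast truncates fractions):
>>> start = np.datetime64('2021-03-21T00:00')
>>> start + s.astype('timedelta64[h]')  # datetime64 array from 2021-03-21T00:00 through 2021-03-25T00:00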

Convert pyspark column to a list

I thought this would be easy, but I can't find the answer :-)
How do I convert the name column into a list? I am hoping I can get isin to work rather than a join against another dataframe column, but isin seems to require a list (if I understand correctly).
Create the dataframe:
from pyspark import SparkContext, SparkConf, SQLContext
from datetime import datetime

sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)

data2 = [
    ('George', datetime(2010, 3, 24, 3, 19, 58), 3),
    ('Sally', datetime(2009, 12, 12, 17, 21, 30), 5),
    ('Frank', datetime(2010, 11, 22, 13, 29, 40), 2),
    ('Paul', datetime(2010, 2, 8, 3, 31, 23), 8),
    ('Jesus', datetime(2009, 1, 1, 4, 19, 47), 2),
    ('Lou', datetime(2010, 3, 2, 4, 33, 51), 3),
]
df2 = sqlContext.createDataFrame(data2, ['name', 'trial_start_time', 'purchase_time'])
df2.show(truncate=False)
Should look like:
+------+-------------------+-------------+
|name |trial_start_time |purchase_time|
+------+-------------------+-------------+
|George|2010-03-24 07:19:58|3 |
|Sally |2009-12-12 22:21:30|5 |
|Frank |2010-11-22 18:29:40|2 |
|Paul |2010-02-08 08:31:23|8 |
|Jesus |2009-01-01 09:19:47|2 |
|Lou |2010-03-02 09:33:51|3 |
+------+-------------------+-------------+
I am not sure if collect is the closest I can come to this.
df2.select("name").collect()
[Row(name='George'),
Row(name='Sally'),
Row(name='Frank'),
Row(name='Paul'),
Row(name='Jesus'),
Row(name='Lou')]
Any suggestions on how to output the name column to a list?
It may need to look something like this:
[George, Sally, Frank, Paul, Jesus, Lou]
Use the collect_list function and then collect to get a list variable.
Example:
from pyspark.sql.functions import *
df2.agg(collect_list(col("name")).alias("name")).show(10,False)
#+----------------------------------------+
#|name |
#+----------------------------------------+
#|[George, Sally, Frank, Paul, Jesus, Lou]|
#+----------------------------------------+
lst=df2.agg(collect_list(col("name"))).collect()[0][0]
lst
#['George', 'Sally', 'Frank', 'Paul', 'Jesus', 'Lou']
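A plain collect plus a list comprehension (no collect_list) should give the same list; a quick sketch, with the RDD-based flatMap variant alongside (row order here matches the small input, but is not guaranteed in general):
names = [row['name'] for row in df2.select('name').collect()]
# or via the underlying RDD
names = df2.select('name').rdd.flatMap(lambda x: x).collect()
names
#['George', 'Sally', 'Frank', 'Paul', 'Jesus', 'Lou']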

delete duplicated rows based on conditions pandas

I want to delete rows in the dataframe if (x1, x2, x3) are the same across different rows, and save the ids of the deleted rows in a variable.
For example, with this data, I want to delete the second row:
d = {'id': ["i1", "i2", "i3", "i4"], 'x1': [13, 13, 61, 61], 'x2': [10, 10, 13, 13], 'x3': [12, 12, 2, 22], 'x4': [24, 24,9, 12]}
df = pd.DataFrame(data=d)
#input data
d = {'id': ["i1", "i2", "i3", "i4"], 'x1': [13, 13, 61, 61], 'x2': [10, 10, 13, 13], 'x3': [12, 12, 2, 22], 'x4': [24, 24,9, 12]}
df = pd.DataFrame(data=d)
#create new column where contents from x1, x2 and x3 columns are merged
df['MergedColumn'] = df[df.columns[1:4]].apply(lambda x: ','.join(x.dropna().astype(str)),axis=1)
#remove duplicates based on the created column and drop created column
df1 = pd.DataFrame(df.drop_duplicates("MergedColumn", keep='first').drop(columns="MergedColumn"))
#print output dataframe
print(df1)
#merge two dataframes
df2 = pd.merge(df, df1, how='left', on = 'id')
#find rows with null values in the right table (rows that were removed)
df2 = df2[df2['x1_y'].isnull()]
#prints ids of rows that were removed
print(df2['id'])
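For what it's worth, the same result can be reached without the helper column or the merge, using duplicated directly on the key columns; a sketch against the original df (before MergedColumn is added):
# mark every row whose (x1, x2, x3) combination has already been seen
mask = df.duplicated(subset=['x1', 'x2', 'x3'], keep='first')
removed_ids = df.loc[mask, 'id'].tolist()  # ids of the dropped rows; ['i2'] for the sample data
df1 = df[~mask]                            # deduplicated dataframe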

How to convert some pyspark dataframe's column into a dict with its column name and combine them to be a json column?

I have data in the following format, and I want to change its format using pyspark so that it has two columns ('tag' and 'data').
The 'tag' column values are unique, and the 'data' column values are a JSON string built from the original columns 'date', 'stock', 'price' and 'num',
combining 'stock' and 'price' into the 'A' value and 'date' and 'num' into the 'B' value.
I haven't found or written a good function to achieve this.
My Spark version is 2.1.0.
original DataFrame
date, stock, price, tag, num
1388534400, GOOG, 50, a, 1
1388534400, FB, 60, b, 2
1388534400, MSFT, 55, c, 3
1388620800, GOOG, 52, d, 4
I expect the output:
new DataFrame
tag | data
'a' | "{'A':{'stock':'GOOG', 'price': 50}, 'B':{'date':1388534400, 'num':1}}"
'b' | "{'A':{'stock':'FB', 'price': 60}, 'B':{'date':1388534400, 'num':2}}"
'c' | "{'A':{'stock':'MSFT', 'price': 55}, 'B':{'date':1388534400, 'num':3}}"
'd' | "{'A':{'stock':'GOOG', 'price': 52}, 'B':{'date':1388620800, 'num':4}}"
from pyspark.sql import SparkSession
from pyspark.sql.functions import create_map

spark = SparkSession.builder.appName("example").getOrCreate()
df = spark.createDataFrame([
    (1388534400, "GOOG", 50, 'a', 1),
    (1388534400, "FB", 60, 'b', 2),
    (1388534400, "MSFT", 55, 'c', 3),
    (1388620800, "GOOG", 52, 'd', 4)]
).toDF("date", "stock", "price", 'tag', 'num')
df.show()
tag_cols = {'A':['stock', 'price'], 'B':['date', 'num']}
# todo, change the Dataframe columns format
IIUC, just use pyspark.sql.functions.struct and pyspark.sql.functions.to_json (both should be available in spark 2.1)
from pyspark.sql import functions as F
# skip df initialization
df_new = df.withColumn('A', F.struct('stock', 'price')) \
.withColumn('B', F.struct('date', 'num')) \
.select('tag', F.to_json(F.struct('A', 'B')).alias('data'))
>>> df_new.show(5,0)
+---+-----------------------------------------------------------------+
|tag|data |
+---+-----------------------------------------------------------------+
|a |{"A":{"stock":"GOOG","price":50},"B":{"date":1388534400,"num":1}}|
|b |{"A":{"stock":"FB","price":60},"B":{"date":1388534400,"num":2}} |
|c |{"A":{"stock":"MSFT","price":55},"B":{"date":1388534400,"num":3}}|
|d |{"A":{"stock":"GOOG","price":52},"B":{"date":1388620800,"num":4}}|
+---+-----------------------------------------------------------------+
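If you want the tag_cols mapping from the question to drive the layout instead of hard-coding the two struct columns, a slightly more generic sketch along the same lines (still only struct and to_json, as above):
# build one aliased struct column per entry in tag_cols, then wrap them all in one JSON string
struct_cols = [F.struct(*cols).alias(name) for name, cols in tag_cols.items()]
df_new = df.select('tag', F.to_json(F.struct(*struct_cols)).alias('data'))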

Flatmap a collect_set in pyspark dataframe

I have two dataframes and I'm using collect_set() in agg after a groupby. What's the best way to flatMap the resulting arrays after aggregating?
from pyspark.sql.functions import collect_set

schema = ['col1', 'col2', 'col3', 'col4']
a = [[1, [23, 32], [11, 22], [9989]]]
df1 = spark.createDataFrame(a, schema=schema)

b = [[1, [34], [43, 22], [888, 777]]]
df2 = spark.createDataFrame(b, schema=schema)

df = df1.union(
    df2
).groupby(
    'col1'
).agg(
    collect_set('col2').alias('col2'),
    collect_set('col3').alias('col3'),
    collect_set('col4').alias('col4')
)
df.collect()
I'm getting this as output:
[Row(col1=1, col2=[[34], [23, 32]], col3=[[11, 22], [43, 22]], col4=[[9989], [888, 777]])]
But, I want this as output:
[Row(col1=1, col2=[23, 32, 34], col3=[11, 22, 43], col4=[9989, 888, 777])]
You can use a UDF:
from itertools import chain
from pyspark.sql.types import *
from pyspark.sql.functions import udf
flatten = udf(lambda x: list(chain.from_iterable(x)), ArrayType(IntegerType()))
df.withColumn('col2_flat', flatten('col2'))
Without a UDF, I suppose this should work:
from pyspark.sql.functions import array_distinct, flatten
df.withColumn('col2_flat', array_distinct(flatten('col2')))
It flattens the nested arrays and then deduplicates; note that both flatten and array_distinct require Spark 2.4+.
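To reproduce the Row shown above, the same expression can be applied to each aggregated column; a short sketch (element order inside the arrays may vary):
df_flat = df.withColumn('col2', array_distinct(flatten('col2'))) \
            .withColumn('col3', array_distinct(flatten('col3'))) \
            .withColumn('col4', array_distinct(flatten('col4')))
df_flat.collect()
# e.g. [Row(col1=1, col2=[23, 32, 34], col3=[11, 22, 43], col4=[9989, 888, 777])]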
