How can I convert data to a map column in PySpark when the columns are dynamic?
Input dataframe:

key_column  Column_1  Column_2  .....  Column_N
1           Value_1   Value_2   .....  Value_N
1           Value_a   Value_2   .....  Value_Z
2           Value_1   Value_2   .....  Value_N
Expected output dataframe:

key_column  Map_output
1           {"Column_1":"Value_1, Value_a", "Column_2":"Value_2", ......, "Column_N":"Value_N, Value_Z"}
2           {"Column_1":"Value_1", "Column_2":"Value_2", ......, "Column_N":"Value_N"}
We can use the create_map function together with functools.reduce().
from functools import reduce
from pyspark.sql import functions as func

col_list = ['col1', 'col2', 'col3']  # can use sdf.columns for all columns in dataframe

spark.sparkContext.parallelize([('val01', 'val02', 'val03'), ('val11', 'val12', 'val13')]). \
    toDF(['col1', 'col2', 'col3']). \
    withColumn('allcol_map',
               func.create_map(*reduce(lambda x, y: x + y, [[func.lit(k), func.col(k)] for k in col_list]))
               ). \
    show(truncate=False)
# +-----+-----+-----+---------------------------------------------+
# |col1 |col2 |col3 |allcol_map |
# +-----+-----+-----+---------------------------------------------+
# |val01|val02|val03|{col1 -> val01, col2 -> val02, col3 -> val03}|
# |val11|val12|val13|{col1 -> val11, col2 -> val12, col3 -> val13}|
# +-----+-----+-----+---------------------------------------------+
# root
# |-- col1: string (nullable = true)
# |-- col2: string (nullable = true)
# |-- col3: string (nullable = true)
# |-- allcol_map: map (nullable = false)
# | |-- key: string
# | |-- value: string (valueContainsNull = true)
We can also use the map_from_entries function, which takes an array of structs; each struct's two fields become a key and a value in the map. It produces the same result as above.
col_list = ['col1', 'col2', 'col3']  # can use sdf.columns for all columns in dataframe

spark.sparkContext.parallelize([('val01', 'val02', 'val03'), ('val11', 'val12', 'val13')]). \
    toDF(['col1', 'col2', 'col3']). \
    withColumn('allcol_map',
               func.map_from_entries(func.array(*[func.struct(func.lit(k).alias('key'), func.col(k).alias('val')) for k in col_list]))
               ). \
    show(truncate=False)
Based on the updated situation, you'd like to group by a key column. Looking at the new expected output, you can use concat_ws with collect_list / collect_set to combine all / unique column values per group.
col_list = ['col1', 'col2', 'col3']

spark.sparkContext.parallelize([('part0', 'val01', 'val02', 'val03'), ('part0', 'val11', 'val12', 'val13'), ('part1', 'val21', 'val22', 'val23')]). \
    toDF(['key_column', 'col1', 'col2', 'col3']). \
    groupBy('key_column'). \
    agg(*[func.concat_ws(',', func.collect_set(k)).alias(k) for k in col_list]). \
    withColumn('allcol_map',
               func.map_from_entries(func.array(*[func.struct(func.lit(k).alias('key'), func.col(k).alias('val')) for k in col_list]))
               ). \
    show(truncate=False)
# +----------+-----------+-----------+-----------+---------------------------------------------------------------+
# |key_column|col1 |col2 |col3 |allcol_map |
# +----------+-----------+-----------+-----------+---------------------------------------------------------------+
# |part1 |val21 |val22 |val23 |{col1 -> val21, col2 -> val22, col3 -> val23} |
# |part0 |val01,val11|val02,val12|val03,val13|{col1 -> val01,val11, col2 -> val02,val12, col3 -> val03,val13}|
# +----------+-----------+-----------+-----------+---------------------------------------------------------------+
Another option is to serialize each row to JSON and parse it back as a map of strings:
F.from_json(F.to_json(F.struct(df.columns)), 'map<string,string>')
Example:
from pyspark.sql import functions as F
df = spark.createDataFrame(
    [('Value_1', 'Value_2', 'Value_N'),
     ('Value_a', 'Value_b', 'Value_M')],
    ['Column_1', 'Column_2', 'Column_N'])

df = df.select(F.from_json(F.to_json(F.struct(df.columns)), 'map<string,string>').alias('Map_output'))
df.show(truncate=0)
# +---------------------------------------------------------------+
# |Map_output |
# +---------------------------------------------------------------+
# |{Column_1 -> Value_1, Column_2 -> Value_2, Column_N -> Value_N}|
# |{Column_1 -> Value_a, Column_2 -> Value_b, Column_N -> Value_M}|
# +---------------------------------------------------------------+
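If you also need the key_column grouping from the question, the same to_json / from_json trick can be combined with the groupBy shown earlier. A minimal sketch, assuming the column names (key_column, Column_1, Column_2) from the question:
from pyspark.sql import functions as F

df = spark.createDataFrame(
    [(1, 'Value_1', 'Value_2'), (1, 'Value_a', 'Value_2'), (2, 'Value_1', 'Value_2')],
    ['key_column', 'Column_1', 'Column_2'])
value_cols = [c for c in df.columns if c != 'key_column']

# club the unique values per key, then pack the result into a map
grouped = df.groupBy('key_column').agg(
    *[F.concat_ws(', ', F.collect_set(c)).alias(c) for c in value_cols])

result = grouped.select(
    'key_column',
    F.from_json(F.to_json(F.struct(*value_cols)), 'map<string,string>').alias('Map_output'))
result.show(truncate=False)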
I want to modify/filter a property inside a struct.
Let's say I have a dataframe with the following column:
#+------------------------------------------+
#| arrayCol |
#+------------------------------------------+
#| {"a" : "some_value", "b" : [1, 2, 3]} |
#+------------------------------------------+
Schema:
struct<a:string, b:array<int>>
I want to filter out values in the 'b' property when a value inside the array == 1.
The desired result is the following:
#+------------------------------------------+
#| arrayCol |
#+------------------------------------------+
#| {"a" : "some_value", "b" : [2, 3]} |
#+------------------------------------------+
Is it possible to do this without extracting the property, filtering the values, and re-building another struct?
Update:
For Spark 3.1+, withField can be used to update the struct column without having to recreate the whole struct. In your case, you can update the field b, using the filter function to filter the array values, like this:
import pyspark.sql.functions as F

df1 = df.withColumn(
    'arrayCol',
    F.col('arrayCol').withField('b', F.filter(F.col("arrayCol.b"), lambda x: x != 1))
)
df1.show()
#+--------------------+
#| arrayCol|
#+--------------------+
#|{some_value, [2, 3]}|
#+--------------------+
For older versions, Spark doesn’t support adding/updating fields in nested structures. To update a struct column, you'll need to create a new struct using the existing fields and the updated ones:
import pyspark.sql.functions as F

df1 = df.withColumn(
    "arrayCol",
    F.struct(
        F.col("arrayCol.a").alias("a"),
        F.expr("filter(arrayCol.b, x -> x != 1)").alias("b")
    )
)
One way would be to define a UDF:
Example:
import ast

from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.types import StringType, MapType


def remove_value(col):
    # 'b' is stored as a string, so parse it, drop the 1s, and stringify it again
    col["b"] = str([x for x in ast.literal_eval(col["b"]) if x != 1])
    return col


if __name__ == "__main__":
    spark = SparkSession.builder.getOrCreate()

    df = spark.createDataFrame(
        [
            {
                "arrayCol": {
                    "a": "some_value",
                    "b": "[1, 2, 3]",
                },
            },
        ]
    )

    remove_value_udf = spark.udf.register(
        "remove_value_udf", remove_value, MapType(StringType(), StringType())
    )

    df = df.withColumn(
        "result",
        remove_value_udf(F.col("arrayCol")),
    )
Result:
root
|-- arrayCol: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
|-- result: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
+---------------------------------+------------------------------+
|arrayCol |result |
+---------------------------------+------------------------------+
|{a -> some_value, b -> [1, 2, 3]}|{a -> some_value, b -> [2, 3]}|
+---------------------------------+------------------------------+
I have to read data from a path which is partitioned by region.
US region has columns a,b,c,d,e
EUR region has only a,b,c,d
When I read data from the path and do a printSchema, I see only a, b, c, d; 'e' is missing.
Is there any way to handle this situation, e.g. so that column e automatically gets populated with null for the EUR data?
You can use the mergeSchema option, which should do exactly what you are looking for, as long as columns with the same name have the same type.
Example:
spark.read.option("mergeSchema", "true").format("parquet").load(...)
Once you have read the data from the path, you can check whether the dataframe contains column 'e'. If it does not, you can add it with a default value, which is None in this case.
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder \
    .appName('example') \
    .getOrCreate()

df = spark.createDataFrame(data=data, schema=columns)  # stand-in for reading your partitioned path

if 'e' not in df.columns:
    df = df.withColumn('e', lit(None))
You can collect all the possible columns from both datasets, then fill None where a column is not available in a given dataset.
from pyspark.sql import functions as F

df_ab = (spark
    .sparkContext
    .parallelize([
        ('a1', 'b1'),
        ('a2', 'b2'),
    ])
    .toDF(['a', 'b'])
)
df_ab.show()
# +---+---+
# | a| b|
# +---+---+
# | a1| b1|
# | a2| b2|
# +---+---+
df_abcd = (spark
    .sparkContext
    .parallelize([
        ('a3', 'b3', 'c3', 'd3'),
        ('a4', 'b4', 'c4', 'd4'),
    ])
    .toDF(['a', 'b', 'c', 'd'])
)
df_abcd.show()
# +---+---+---+---+
# | a| b| c| d|
# +---+---+---+---+
# | a3| b3| c3| d3|
# | a4| b4| c4| d4|
# +---+---+---+---+
unique_columns = list(set(df_ab.columns + df_abcd.columns))
# ['d', 'b', 'a', 'c']
for col in unique_columns:
    if col not in df_ab.columns:
        df_ab = df_ab.withColumn(col, F.lit(None))
    if col not in df_abcd.columns:
        df_abcd = df_abcd.withColumn(col, F.lit(None))
df_ab.printSchema()
# root
# |-- a: string (nullable = true)
# |-- b: string (nullable = true)
# |-- d: null (nullable = true)
# |-- c: null (nullable = true)
df_ab.show()
# +---+---+----+----+
# | a| b| d| c|
# +---+---+----+----+
# | a1| b1|null|null|
# | a2| b2|null|null|
# +---+---+----+----+
df_abcd.printSchema()
# root
# |-- a: string (nullable = true)
# |-- b: string (nullable = true)
# |-- c: string (nullable = true)
# |-- d: string (nullable = true)
df_abcd.show()
# +---+---+---+---+
# | a| b| c| d|
# +---+---+---+---+
# | a3| b3| c3| d3|
# | a4| b4| c4| d4|
# +---+---+---+---+
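One follow-up worth noting: lit(None) produces a null-typed column (visible in the printSchema above), so it is usually worth casting the filled-in columns before unioning the two dataframes. A sketch, assuming string is the right type for c and d:
from pyspark.sql import functions as F

# overwrite the null-typed placeholders with properly typed nulls, then union by name
for c in ['c', 'd']:
    df_ab = df_ab.withColumn(c, F.lit(None).cast('string'))

df_ab.unionByName(df_abcd).show()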
I used PySpark with temp views and Spark SQL. I hope this implementation helps you get the idea; Spark SQL is very convenient for this type of thing.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType


class getData(object):
    """Builds the US and EU region dataframes and unions them, defaulting the missing column."""

    def get_data(self, n):
        spark = SparkSession.builder.appName('YourProjectName').getOrCreate()

        data2 = [("region 1", "region 2", "region 3", "region 4"),
                 ("region 5", "region 6", "region 7", "region 8")]
        schema = StructType([
            StructField("a", StringType(), True),
            StructField("b", StringType(), True),
            StructField("c", StringType(), True),
            StructField("d", StringType(), True)
        ])

        data3 = [("EU region 1", "EU region 2", "EU region 3"),
                 ("EU region 5", "EU region 6", "EU region 7")]
        schema3 = StructType([
            StructField("a", StringType(), True),
            StructField("b", StringType(), True),
            StructField("c", StringType(), True)
        ])

        df = spark.createDataFrame(data=data2, schema=schema)
        df.createOrReplaceTempView("USRegion")
        spark.sql("SELECT * FROM USRegion").show(n=n)

        df1 = spark.createDataFrame(data=data3, schema=schema3)
        df1.createOrReplaceTempView("EURegion")
        spark.sql("SELECT * FROM EURegion").show(n=n)

        # fill the missing column d with an empty string for the EU region
        sql_union_df = spark.sql(
            "SELECT a, b, c, d FROM USRegion UNION ALL SELECT a, b, c, '' AS d FROM EURegion")
        sql_union_df.show(n=n)
        return sql_union_df


# call the class
conn = getData()
# call the method implemented inside the class
conn.get_data(10)
I have two DataFrames called DF1 and DF2, the content of each DataFrame is as follows:
df1:
line_item_usage_account_id line_item_unblended_cost name
100000000001 12.05 account1
200000000001 52 account2
300000000003 12.03 account3
df2:
accountname accountproviderid clustername app_pmo app_costcenter
account1 100000000001 cluster1 111111 11111111
account2 200000000001 cluster2 222222 22222222
I need to join on the fields df1.line_item_usage_account_id and df2.accountproviderid.
When both fields have the same ID, the value of the DF1 line_item_unblended_cost column must be added to the DF2 row.
And when the value of the line_item_usage_account_id field of DF1 is not in the accountproviderid column of DF2, the DF1 fields must still be appended, as follows:
accountname accountproviderid clustername app_pmo app_costcenter line_item_unblended_cost
account1 100000000001 cluster1 111111 11111111 12.05
account2 200000000001 cluster2 222222 22222222 52
account3 300000000003 NA NA NA 12.03
The account3 data is added at the end of the new DataFrame, filling the DF2 columns with "NA".
Any help is appreciated, thanks in advance.
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df1 = spark.createDataFrame([
    [100000000001, 12.05, 'account1'],
    [200000000001, 52.00, 'account2'],
    [300000000003, 12.03, 'account3']],
    schema=['line_item_usage_account_id', 'line_item_unblended_cost', 'name'])
df1.show()
df1.printSchema()

df2 = spark.createDataFrame([
    ['account1', 100000000001, 'cluster1', 111111, 11111111],
    ['account2', 200000000001, 'cluster2', 222222, 22222222]],
    schema=['accountname', 'accountproviderid', 'clustername', 'app_pmo', 'app_costcenter'])
df2.printSchema()
df2.show()
cols = ['name', 'line_item_usage_account_id', 'clustername', 'app_pmo', 'app_costcenter', 'line_item_unblended_cost']

resDF = (df1
    .join(df2, df1.line_item_usage_account_id == df2.accountproviderid, "leftouter")
    .select(*cols)
    .withColumnRenamed('name', 'accountname')
    .withColumnRenamed('line_item_usage_account_id', 'accountproviderid')
    .orderBy('accountname'))
resDF.printSchema()
# |-- accountname: string (nullable = true)
# |-- accountproviderid: long (nullable = true)
# |-- clustername: string (nullable = true)
# |-- app_pmo: long (nullable = true)
# |-- app_costcenter: long (nullable = true)
# |-- line_item_unblended_cost: double (nullable = true)
resDF.show()
# +-----------+-----------------+-----------+-------+--------------+------------------------+
# |accountname|accountproviderid|clustername|app_pmo|app_costcenter|line_item_unblended_cost|
# +-----------+-----------------+-----------+-------+--------------+------------------------+
# | account1| 100000000001| cluster1| 111111| 11111111| 12.05|
# | account2| 200000000001| cluster2| 222222| 22222222| 52.0|
# | account3| 300000000003| null| null| null| 12.03|
# +-----------+-----------------+-----------+-------+--------------+------------------------+
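If you want the literal "NA" strings from the question's expected output rather than nulls, here is a small follow-up sketch, assuming the resDF built above; app_pmo and app_costcenter are numeric here, so they are cast to string first:
from pyspark.sql import functions as F

resDF_na = (resDF
    .withColumn('app_pmo', F.col('app_pmo').cast('string'))
    .withColumn('app_costcenter', F.col('app_costcenter').cast('string'))
    .fillna('NA', subset=['clustername', 'app_pmo', 'app_costcenter']))
resDF_na.show()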
Below is the sample dataframe; I want to split this into multiple dataframes or RDDs based on column datatype:
ID:Int
Name:String
Joining_Date: Date
I have 100+ columns in my dataframe. Is there any built-in method to achieve this?
As far as I know, there is no built-in functionality to achieve that; nevertheless, here is a way to separate one dataframe into multiple dataframes based on column type.
First let's create some data:
from pyspark.sql.functions import col
from pyspark.sql.types import StructType, StructField, StringType, LongType, DateType

df = spark.createDataFrame([
    (0, 11, "t1", "s1", "2019-10-01"),
    (0, 22, "t2", "s2", "2019-02-11"),
    (1, 23, "t3", "s3", "2018-01-10"),
    (1, 24, "t4", "s4", "2019-10-01")], ["i1", "i2", "s1", "s2", "date"])
df = df.withColumn("date", col("date").cast("date"))
# df.printSchema()
# root
# |-- i1: long (nullable = true)
# |-- i2: long (nullable = true)
# |-- s1: string (nullable = true)
# |-- s2: string (nullable = true)
# |-- date: date (nullable = true)
Then we will group the columns of the previous dataframe into a dictionary, where the key will be the column type and the value a list of the columns of that type:
d = {}
# group cols into a dict by type
for c in df.schema:
    key = c.dataType
    if key not in d:
        d[key] = [c.name]
    else:
        d[key].append(c.name)

d
# {DateType: ['date'], StringType: ['s1', 's2'], LongType: ['i1', 'i2']}
Then we iterate through the keys (column types) and generate the schema, along with a corresponding empty dataframe, for each item of the dictionary:
type_dfs = {}
# create a schema for each type
for k in d.keys():
    schema = StructType([
        StructField(cname, k) for cname in d[k]
    ])
    # finally create an empty df with that schema
    type_dfs[str(k)] = spark.createDataFrame(spark.sparkContext.emptyRDD(), schema)

type_dfs
# {'DateType': DataFrame[date: date],
#  'StringType': DataFrame[s1: string, s2: string],
#  'LongType': DataFrame[i1: bigint, i2: bigint]}
Finally we can use the generated dataframes by accessing each item of the type_dfs:
type_dfs['StringType'].printSchema()
# root
# |-- s1: string (nullable = true)
# |-- s2: string (nullable = true)
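If the goal is to split the original rows rather than just build empty frames, a simpler variant is to select the columns of each type directly from df, reusing the d dictionary built above (a sketch):
# one populated dataframe per column type
populated_dfs = {str(k): df.select(*cols) for k, cols in d.items()}

populated_dfs['StringType'].show()  # just the string columns s1 and s2, with their rows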
I have a Spark dataframe with two columns, of type Integer and Map, and I wanted to know the best way to update the values for all the keys of the map column.
With the help of a UDF, I am able to update the values:
def modifyValues = (map_data: Map[String, Int]) => {
  val divideWith = 10
  map_data.mapValues(_ / divideWith)
}
val modifyMapValues = udf(modifyValues)
df.withColumn("updatedValues", modifyMapValues($"data_map"))
scala> dF.printSchema()
root
|-- id: integer (nullable = true)
|-- data_map: map (nullable = true)
| |-- key: string
| |-- value: integer (valueContainsNull = true)
Sample data:
val ds = Seq(
  (1, Map("foo" -> 100, "bar" -> 200)),
  (2, Map("foo" -> 200)),
  (3, Map("bar" -> 200))
).toDF("id", "data_map")
Expected output:
+---+-----------------------+
|id |data_map |
+---+-----------------------+
|1 |[foo -> 10, bar -> 20] |
|2 |[foo -> 20] |
|3  |[bar -> 20]            |
+---+-----------------------+
I wanted to know: is there any way to do this without a UDF?
One possible way to do it (without a UDF) is this:
extract the keys into an array using map_keys
extract the values into an array using map_values
transform the extracted values using TRANSFORM (available since Spark 2.4)
re-create the map using map_from_arrays
import org.apache.spark.sql.functions.{expr, map_from_arrays, map_values, map_keys}

ds
  .withColumn("values", map_values($"data_map"))
  .withColumn("keys", map_keys($"data_map"))
  .withColumn("values_transformed", expr("TRANSFORM(values, v -> v/10)"))
  .withColumn("data_map_transformed", map_from_arrays($"keys", $"values_transformed"))
import pyspark.sql.functions as sp
from pyspark.sql.types import StringType, FloatType, MapType
Add a new key with any value:
my_update_udf = sp.udf(lambda x: {**x, **{'new_key':77}}, MapType(StringType(), FloatType()))
sdf = sdf.withColumn('updated', my_update_udf(sp.col('to_be_updated')))
Update value for all/one key(s):
# divide the value for 'my_key' by 77 and keep the other entries as-is;
# drop the k == 'my_key' test to update every key
my_update_udf = sp.udf(lambda x: {k: (v / 77 if k == 'my_key' and v is not None else v) for k, v in x.items()}, MapType(StringType(), FloatType()))
sdf = sdf.withColumn('updated', my_update_udf(sp.col('to_be_updated')))
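On Spark 3.1+, the same per-value update can be done without a UDF at all via transform_values; a sketch, assuming a map column named data_map as in the question:
from pyspark.sql import functions as F

sdf = sdf.withColumn('updated', F.transform_values('data_map', lambda k, v: v / 10))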
There is another way available in Spark 3:
Seq(
  Map("keyToUpdate" -> 11, "someOtherKey" -> 12),
  Map("keyToUpdate" -> 21, "someOtherKey" -> 22)
).toDF("mapColumn")
  .withColumn(
    "mapColumn",
    map_concat(
      map(lit("keyToUpdate"), col("mapColumn.keyToUpdate") * 10), // <- transformation
      map_filter(col("mapColumn"), (k, _) => k =!= "keyToUpdate")
    )
  )
  .show(false)
output:
+----------------------------------------+
|mapColumn |
+----------------------------------------+
|{someOtherKey -> 12, keyToUpdate -> 110}|
|{someOtherKey -> 22, keyToUpdate -> 210}|
+----------------------------------------+
map_filter(expr, func) - Filters entries in a map using the function
map_concat(map, ...) - Returns the union of all the given maps
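For reference, a PySpark (3.1+) sketch of the same map_concat / map_filter approach; the toy dataframe below is an assumption built to mirror the Scala example:
from pyspark.sql import functions as F

df = spark.createDataFrame(
    [({'keyToUpdate': 11, 'someOtherKey': 12},),
     ({'keyToUpdate': 21, 'someOtherKey': 22},)],
    ['mapColumn'])

updated = df.withColumn(
    'mapColumn',
    F.map_concat(
        F.create_map(F.lit('keyToUpdate'), F.col('mapColumn')['keyToUpdate'] * 10),  # <- transformation
        F.map_filter('mapColumn', lambda k, _: k != 'keyToUpdate')))
updated.show(truncate=False)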