Efficient way to add UUID in pyspark [duplicate] - python-3.x

This question already has answers here:
Pyspark add sequential and deterministic index to dataframe
(2 answers)
Closed 2 years ago.
I have a DataFrame that I want to add a column of distinct uuid4() rows. My code:
from pyspark.sql import SparkSession
from pyspark.sql import functions as f
from pyspark.sql.types import StringType
from uuid import uuid4
spark_session = SparkSession.builder.getOrCreate()
df = spark_session.createDataFrame([
[1, 1, 'teste'],
[2, 2, 'teste'],
[3, 0, 'teste'],
[4, 5, 'teste'],
],
list('abc'))
df = df.withColumn("_tmp", f.lit(1))
uuids = [str(uuid4()) for _ in range(df.count())]
df1 = spark_session.createDataFrame(uuids, StringType())
df1 = df_1.withColumn("_tmp", f.lit(1))
df2 = df.join(df_1, "_tmp", "inner").drop("_tmp")
df2.show()
But I've got this ERROR:
Py4JJavaError: An error occurred while calling o1571.showString.
: org.apache.spark.sql.AnalysisException: Detected implicit cartesian product for INNER join between logical plans
I already try with alias and using monotonically_increasing_id as the join column, but I see
here that I cannot trust in monotonically_increasing_id as merge column.
I'm expecting:
+---+---+-----+------+
| a| b| c| value|
+---+---+-----+------+
| 1| 1|teste| uuid4|
| 2| 2|teste| uuid4|
| 3| 0|teste| uuid4|
| 4| 5|teste| uuid4|
+---+---+-----+------+
what's the correct approach in this case?

I use row_number as #Tetlanesh suggest. I have to create an ID column to ensure that row_number count every row of Window.
from pyspark.sql import SparkSession
from pyspark.sql import functions as f
from uuid import uuid4
from pyspark.sql.window import Window
from pyspark.sql.types import StringType
from pyspark.sql.functions import row_number
spark_session = SparkSession.builder.getOrCreate()
df = spark_session.createDataFrame([
[1, 1, 'teste'],
[1, 2, 'teste'],
[2, 0, 'teste'],
[2, 5, 'teste'],
],
list('abc'))
df = df.alias("_tmp")
df.registerTempTable("_tmp")
df2 = self.spark_session.sql("select *, uuid() as uuid from _tmp")
df2.show()
Another approach is using windows, but It's not efficient as the first one:
df = df.withColumn("_id", f.lit(1))
df = df.withColumn("_tmp", row_number().over(Window.orderBy('_id')))
uuids = [(str(uuid4()), 1) for _ in range(df.count())]
df1 = spark_session.createDataFrame(uuids, ['uuid', '_id'])
df1 = df1.withColumn("_tmp", row_number().over(Window.orderBy('_id')))
df2 = df.join(df1, "_tmp", "inner").drop('_id')
df2.show()
both outputs:
+---+---+-----+------+
| a| b| c| uuid|
+---+---+-----+------+
| 1| 1|teste| uuid4|
| 2| 2|teste| uuid4|
| 3| 0|teste| uuid4|
| 4| 5|teste| uuid4|
+---+---+-----+------+

Related

Pyspark eval or expr - Concatenating multiple dataframe columns using when statement

I am trying to concatenate multiple dataframe columns I am not able to perform pyspark eval or expr on the below when statement inside concat_ws.
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
from pyspark.sql.functions import concat_ws,concat,when,col,expr
from pyspark.sql.functions import lit
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("foo", "bar"), ("ba z", None)],
('a', 'b'))
keys = ['a','b']
key_val = ''
for key in keys:
key_val = key_val + 'when(df["{0}"].isNull(), lit("_")).otherwise(df["{0}"]),'.format(key)
key_val_exp = key_val.rsplit(',', 1)[0]
spaceDeleteUDF = udf(lambda s: str(s).replace(" ", "_").strip(), StringType())
df=df.withColumn("unique_id", spaceDeleteUDF(concat_ws("-",eval(key_val_exp))))
Error:
"TypeError: Invalid argument, not a string or column: (Column<b'CASE WHEN (a IS NULL) THEN _ ELSE a END'>, Column<b'CASE WHEN (b IS NULL) THEN _ ELSE b END'>) of type <class 'tuple'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function."
Expected output:
+----+----+---------+
| a| b|unique_id|
+----+----+---------+
| foo| bar| foo-bar|
|ba z|null| ba_z-_|
+----+----+---------+
check this out.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("foo", "bar"), ("ba z", None)],
('a', 'b'))
df.show()
# +----+----+
# | a| b|
# +----+----+
# | foo| bar|
# |ba z|null|
# +----+----+
df1 = df.select( *[F.col(column) for column in df.columns],*[ F.when(F.col(column).isNull(),F.lit('_')).otherwise(F.col(column)).alias(column+'_mod') for column in df.columns])
df2 = df1.select(*[F.col(column) for column in df1.columns if '_mod' not in column], *[ F.regexp_replace(column, r'\s', '_').alias(column) for column in df1.columns if '_mod' in column])
df3 = df2.select( *[F.col(column) for column in df1.columns if '_mod' not in column],F.concat_ws('-',*[F.col(column) for column in df2.columns if '_mod' in column]).alias('unique_id'))
df3.show()
# +----+----+---------+
# | a| b|unique_id|
# +----+----+---------+
# | foo| bar| foo-bar|
# |ba z|null| ba_z-_|
# +----+----+---------+

How do I replace a full stop with a zero in PySpark?

I am trying to replace a full stop in my raw data with the value 0 in PySpark.
I tried to use a .when and .otherwise statement.
I tried to use regexp_replace to change the '.' to 0.
Code tried:
from pyspark.sql import functions as F
#For #1 above:
dataframe2 = dataframe1.withColumn("test_col", F.when(((F.col("test_col") == F.lit(".")), 0).otherwise(F.col("test_col")))
#For #2 above:
dataframe2 = dataframe1.withColumn('test_col', F.regexp_replace(dataframe1.test_col, '.', 0))
Instead of "." it should rewrite the column with numbers only (i.e. there is a number in non full stop rows, otherwise, it's a full stop that should be replaced with 0).
pyspark version
from pyspark.sql import SparkSession
from pyspark.sql.types import (StringType, IntegerType, StructField, StructType)
from pyspark.sql import functions
column_schema = StructType([StructField("num", IntegerType()), StructField("text", StringType())])
data = [[3, 'r1'], [9, 'r2.'], [27, '.']]
spark = SparkSession.builder.master("local").getOrCreate()
spark.conf.set("spark.executor.memory", '1g')
spark.conf.set('spark.executor.cores', '1')
spark.conf.set('spark.cores.max', '2')
spark.conf.set("spark.driver.memory", '1g')
spark_context = spark.sparkContext
data_frame = spark.createDataFrame(data, schema=column_schema)
data_frame.show()
filtered_data_frame = data_frame.withColumn('num',
functions.when(data_frame['num'] == 3, -3).otherwise(data_frame['num']))
filtered_data_frame.show()
filtered_data_frame = data_frame.withColumn('text',
functions.when(data_frame['text'] == '.', '0').otherwise(
data_frame['text']))
filtered_data_frame.show()
output
+---+----+
|num|text|
+---+----+
| 3| r1|
| 9| r2.|
| 27| .|
+---+----+
+---+----+
|num|text|
+---+----+
| -3| r1|
| 9| r2.|
| 27| .|
+---+----+
+---+----+
|num|text|
+---+----+
| 3| r1|
| 9| r2.|
| 27| 0|
+---+----+
sample code does query properly
package otz.scalaspark
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
object ValueReplacement {
def main(args: Array[String]) {
val sparkConfig = new SparkConf().setAppName("Value-Replacement").setMaster("local[*]").set("spark.executor.memory", "1g");
val sparkContext = new SparkContext(sparkConfig)
val someData = Seq(
Row(3, "r1"),
Row(9, "r2"),
Row(27, "r3"),
Row(81, "r4")
)
val someSchema = List(
StructField("number", IntegerType, true),
StructField("word", StringType, true)
)
val sqlContext = new SQLContext(sparkContext)
val dataFrame = sqlContext.createDataFrame(
sparkContext.parallelize(someData),
StructType(someSchema)
)
val filteredDataFrame = dataFrame.withColumn("number", when(col("number") === 3, -3).otherwise(col("number")));
filteredDataFrame.show()
}
}
output
+------+----+
|number|word|
+------+----+
| -3| r1|
| 9| r2|
| 27| r3|
| 81| r4|
+------+----+
You attempt #2 was almost correct, if you have a dataframe1 like:
+--------+
|test_col|
+--------+
| 1.0|
| 2.0|
| 2|
+--------+
Your attempt must be yielding:
dataframe2 = dataframe1.withColumn('test_col', F.regexp_replace(dataframe1.test_col, '.', 0))
dataframe2.show()
+--------+
|test_col|
+--------+
| 000|
| 000|
| 0|
+--------+
Here the . means all the letter are to be replaced and not just '.'.
However, if you add an escape sequence (\) before the dot then things should work fine.
dataframe2 = dataframe1.withColumn('test_col', F.regexp_replace(dataframe1.test_col, '\.', '0'))
dataframe2.show()
+--------+
|test_col|
+--------+
| 100|
| 200|
| 2|
+--------+

Spark Aggregating multiple columns (possible to array) from join output

I've below datasets
Table1
Table2
Now I would like to get below dataset. I've tried with left outer join Table1.id == Table2.departmentid but, I am not getting the desired output.
Later, I need to use this table to get several counts and convert the data into an xml . I will be doing this convertion using map.
Any help would be appreciated.
Only joining is not enough to get the desired output. Probably You are missing something and last element of each nested array might be departmentid. Assuming the last element of nested array is departmentid, I've generated the output by the following way:
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.collect_list
case class department(id: Integer, deptname: String)
case class employee(employeid:Integer, empname:String, departmentid:Integer)
val spark = SparkSession.builder().getOrCreate()
import spark.implicits._
val department_df = Seq(department(1, "physics")
,department(2, "computer") ).toDF()
val emplyoee_df = Seq(employee(1, "A", 1)
,employee(2, "B", 1)
,employee(3, "C", 2)
,employee(4, "D", 2)).toDF()
val result = department_df.join(emplyoee_df, department_df("id") === emplyoee_df("departmentid"), "left").
selectExpr("id", "deptname", "employeid", "empname").
rdd.map {
case Row(id:Integer, deptname:String, employeid:Integer, empname:String) => (id, deptname, Array(employeid.toString, empname, id.toString))
}.toDF("id", "deptname", "arrayemp").
groupBy("id", "deptname").
agg(collect_list("arrayemp").as("emplist")).
orderBy("id", "deptname")
The output looks like this:
result.show(false)
+---+--------+----------------------+
|id |deptname|emplist |
+---+--------+----------------------+
|1 |physics |[[2, B, 1], [1, A, 1]]|
|2 |computer|[[4, D, 2], [3, C, 2]]|
+---+--------+----------------------+
Explanation: If i break down the last dataframe transformation into multiple steps, it'll probably make clear how the output is generated.
left outer join between department_df and employee_df
val df1 = department_df.join(emplyoee_df, department_df("id") === emplyoee_df("departmentid"), "left").
selectExpr("id", "deptname", "employeid", "empname")
df1.show()
+---+--------+---------+-------+
| id|deptname|employeid|empname|
+---+--------+---------+-------+
| 1| physics| 2| B|
| 1| physics| 1| A|
| 2|computer| 4| D|
| 2|computer| 3| C|
+---+--------+---------+-------+
creating array using some column's values from the df1 dataframe
val df2 = df1.rdd.map {
case Row(id:Integer, deptname:String, employeid:Integer, empname:String) => (id, deptname, Array(employeid.toString, empname, id.toString))
}.toDF("id", "deptname", "arrayemp")
df2.show()
+---+--------+---------+
| id|deptname| arrayemp|
+---+--------+---------+
| 1| physics|[2, B, 1]|
| 1| physics|[1, A, 1]|
| 2|computer|[4, D, 2]|
| 2|computer|[3, C, 2]|
+---+--------+---------+
create new list aggregating multiple arrays using df2 dataframe
val result = df2.groupBy("id", "deptname").
agg(collect_list("arrayemp").as("emplist")).
orderBy("id", "deptname")
result.show(false)
+---+--------+----------------------+
|id |deptname|emplist |
+---+--------+----------------------+
|1 |physics |[[2, B, 1], [1, A, 1]]|
|2 |computer|[[4, D, 2], [3, C, 2]]|
+---+--------+----------------------+
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
val df = spark.sparkContext.parallelize(Seq(
(1,"Physics"),
(2,"Computer"),
(3,"Maths")
)).toDF("ID","Dept")
val schema = List(
StructField("EMPID", IntegerType, true),
StructField("EMPNAME", StringType, true),
StructField("DeptID", IntegerType, true)
)
val data = Seq(
Row(1,"A",1),
Row(2,"B",1),
Row(3,"C",2),
Row(4,"D",2) ,
Row(5,"E",null)
)
val df_emp = spark.createDataFrame(
spark.sparkContext.parallelize(data),
StructType(schema)
)
val newdf = df_emp.withColumn("CONC",array($"EMPID",$"EMPNAME",$"DeptID")).groupBy($"DeptID").agg(expr("collect_list(CONC) as emplist"))
df.join(newdf,df.col("ID") === df_emp.col("DeptID")).select($"ID",$"Dept",$"emplist").show()
---+--------+--------------------+
| ID| Dept| listcol|
+---+--------+--------------------+
| 1| Physics|[[1, A, 1], [2, B...|
| 2|Computer|[[3, C, 2], [4, D...|

Spark: Find the value with the highest occurrence per group over rolling time window

Starting from the following spark data frame:
from io import StringIO
import pandas as pd
from pyspark.sql.functions import col
pd_df = pd.read_csv(StringIO("""device_id,read_date,id,count
device_A,2017-08-05,4041,3
device_A,2017-08-06,4041,3
device_A,2017-08-07,4041,4
device_A,2017-08-08,4041,3
device_A,2017-08-09,4041,3
device_A,2017-08-10,4041,1
device_A,2017-08-10,4045,2
device_A,2017-08-11,4045,3
device_A,2017-08-12,4045,3
device_A,2017-08-13,4045,3"""),infer_datetime_format=True, parse_dates=['read_date'])
df = spark.createDataFrame(pd_df).withColumn('read_date', col('read_date').cast('date'))
df.show()
Output:
+--------------+----------+----+-----+
|device_id | read_date| id|count|
+--------------+----------+----+-----+
| device_A|2017-08-05|4041| 3|
| device_A|2017-08-06|4041| 3|
| device_A|2017-08-07|4041| 4|
| device_A|2017-08-08|4041| 3|
| device_A|2017-08-09|4041| 3|
| device_A|2017-08-10|4041| 1|
| device_A|2017-08-10|4045| 2|
| device_A|2017-08-11|4045| 3|
| device_A|2017-08-12|4045| 3|
| device_A|2017-08-13|4045| 3|
+--------------+----------+----+-----+
I would like to find the most frequent id for each (device_id, read_date) combination, over a 3 day rolling window. For each group of rows selected by the time window, I need to find the most frequent id by summing up the counts per id, then return the top id.
Expected Output:
+--------------+----------+----+
|device_id | read_date| id|
+--------------+----------+----+
| device_A|2017-08-05|4041|
| device_A|2017-08-06|4041|
| device_A|2017-08-07|4041|
| device_A|2017-08-08|4041|
| device_A|2017-08-09|4041|
| device_A|2017-08-10|4041|
| device_A|2017-08-11|4045|
| device_A|2017-08-12|4045|
| device_A|2017-08-13|4045|
+--------------+----------+----+
I am starting to think this is only possible using a custom aggregation function. Since spark 2.3 is not out I will have to write this in Scala or use collect_list. Am I missing something?
Add window:
from pyspark.sql.functions import window, sum as sum_, date_add
df_w = df.withColumn(
"read_date", window("read_date", "3 days", "1 day")["start"].cast("date")
)
# Then handle the counts
df_w = df_w.groupBy('device_id', 'read_date', 'id').agg(sum_('count').alias('count'))
Use one of the solutions from Find maximum row per group in Spark DataFrame for example
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number
rolling_window = 3
top_df = (
df_w
.withColumn(
"rn",
row_number().over(
Window.partitionBy("device_id", "read_date")
.orderBy(col("count").desc())
)
)
.where(col("rn") == 1)
.orderBy("read_date")
.drop("rn")
)
# results are calculated on the start of the time window - adjust read_date as needed
final_df = top_df.withColumn('read_date', date_add('read_date', rolling_window - 1))
final_df.show()
# +---------+----------+----+-----+
# |device_id| read_date| id|count|
# +---------+----------+----+-----+
# | device_A|2017-08-05|4041| 3|
# | device_A|2017-08-06|4041| 6|
# | device_A|2017-08-07|4041| 10|
# | device_A|2017-08-08|4041| 10|
# | device_A|2017-08-09|4041| 10|
# | device_A|2017-08-10|4041| 7|
# | device_A|2017-08-11|4045| 5|
# | device_A|2017-08-12|4045| 8|
# | device_A|2017-08-13|4045| 9|
# | device_A|2017-08-14|4045| 6|
# | device_A|2017-08-15|4045| 3|
# +---------+----------+----+-----+
I managed to find a very inefficient solution. Hopefully someone can spot improvements to avoid the python udf and call to collect_list.
from pyspark.sql import Window
from pyspark.sql.functions import col, collect_list, first, udf
from pyspark.sql.types import IntegerType
def top_id(ids, counts):
c = Counter()
for cnid, count in zip(ids, counts):
c[cnid] += count
return c.most_common(1)[0][0]
rolling_window = 3
days = lambda i: i * 86400
# Define a rolling calculation window based on time
window = (
Window()
.partitionBy("device_id")
.orderBy(col("read_date").cast("timestamp").cast("long"))
.rangeBetween(-days(rolling_window - 1), 0)
)
# Use window and collect_list to store data matching the window definition on each row
df_collected = df.select(
'device_id', 'read_date',
collect_list(col('id')).over(window).alias('ids'),
collect_list(col('count')).over(window).alias('counts')
)
# Get rid of duplicate rows where necessary
df_grouped = df_collected.groupBy('device_id', 'read_date').agg(
first('ids').alias('ids'),
first('counts').alias('counts'),
)
# Register and apply udf to return the most frequently seen id
top_id_udf = udf(top_id, IntegerType())
df_mapped = df_grouped.withColumn('top_id', top_id_udf(col('ids'), col('counts')))
df_mapped.show(truncate=False)
returns:
+---------+----------+------------------------+------------+------+
|device_id|read_date |ids |counts |top_id|
+---------+----------+------------------------+------------+------+
|device_A |2017-08-05|[4041] |[3] |4041 |
|device_A |2017-08-06|[4041, 4041] |[3, 3] |4041 |
|device_A |2017-08-07|[4041, 4041, 4041] |[3, 3, 4] |4041 |
|device_A |2017-08-08|[4041, 4041, 4041] |[3, 4, 3] |4041 |
|device_A |2017-08-09|[4041, 4041, 4041] |[4, 3, 3] |4041 |
|device_A |2017-08-10|[4041, 4041, 4041, 4045]|[3, 3, 1, 2]|4041 |
|device_A |2017-08-11|[4041, 4041, 4045, 4045]|[3, 1, 2, 3]|4045 |
|device_A |2017-08-12|[4041, 4045, 4045, 4045]|[1, 2, 3, 3]|4045 |
|device_A |2017-08-13|[4045, 4045, 4045] |[3, 3, 3] |4045 |
+---------+----------+------------------------+------------+------+

How to change case of whole pyspark dataframe to lower or upper

I am trying to apply pyspark sql functions hash algorithm for every row in two dataframes to identify the differences. Hash algorithm is case sensitive .i.e. if column contains 'APPLE' and 'Apple' are considered as two different values, so I want to change the case for both dataframes to either upper or lower. I am able to achieve only for dataframe headers but not for dataframe values.Please help
#Code for Dataframe column headers
self.df_db1 =self.df_db1.toDF(*[c.lower() for c in self.df_db1.columns])
Assuming df is your dataframe, this should do the work:
from pyspark.sql import functions as F
for col in df.columns:
df = df.withColumn(col, F.lower(F.col(col)))
Both answers seems to be ok with one exception - if you have numeric column, it will be converted to string column. To avoid this, try:
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
val fields = df.schema.fields
val stringFields = df.schema.fields.filter(f => f.dataType == StringType)
val nonStringFields = df.schema.fields.filter(f => f.dataType != StringType).map(f => f.name).map(f => col(f))
val stringFieldsTransformed = stringFields .map (f => f.name).map(f => upper(col(f)).as(f))
val df = sourceDF.select(stringFieldsTransformed ++ nonStringFields: _*)
Now types are correct also when you have non-string fields, i.e. numeric fields).
If you know that each column is of String type, use one of the other answers - they are correct in that cases :)
Python code in PySpark:
from pyspark.sql.functions import *
from pyspark.sql.types import *
sourceDF = spark.createDataFrame([(1, "a")], ['n', 'n1'])
fields = sourceDF.schema.fields
stringFields = filter(lambda f: isinstance(f.dataType, StringType), fields)
nonStringFields = map(lambda f: col(f.name), filter(lambda f: not isinstance(f.dataType, StringType), fields))
stringFieldsTransformed = map(lambda f: upper(col(f.name)), stringFields)
allFields = [*stringFieldsTransformed, *nonStringFields]
df = sourceDF.select(allFields)
You can generate an expression using list comprehension:
from pyspark.sql import functions as psf
expression = [ psf.lower(psf.col(x)).alias(x) for x in df.columns ]
And then just call it over your existing dataframe
>>> df.show()
+---+---+---+---+
| c1| c2| c3| c4|
+---+---+---+---+
| A| B| C| D|
+---+---+---+---+
>>> df.select(*select_expression).show()
+---+---+---+---+
| c1| c2| c3| c4|
+---+---+---+---+
| a| b| c| d|
+---+---+---+---+

Resources