Multiline values in a column when Spark reads a file - apache-spark

I have data as below and I need to split it based on ",".
Input file: 1,2,4,371003\,5371022\,87200000\,U
The desired result should be:
a b c e f
1 2 4 371003,5371022,87200000 U
This is what I tried:
val df = spark.read.option("inferSchema","true").option("escape","\\").option("delimiter",",").csv("/user/txt.csv")

try this:
val df = spark.read.csv("/user/txt.csv")
df.show()
+---+---+---+-------+--------+---------+---+
|_c0|_c1|_c2|    _c3|     _c4|      _c5|_c6|
+---+---+---+-------+--------+---------+---+
|  1|  2|  4|371003\|5371022\|87200000\|  U|
+---+---+---+-------+--------+---------+---+
import org.apache.spark.sql.functions._
import spark.implicits._
df.select(
    '_c0, '_c1, '_c2,
    // join the three escaped fields with "," and strip the backslashes
    regexp_replace(concat_ws(",", '_c3, '_c4, '_c5), "\\\\", ""),
    '_c6
).toDF("a","b","c","e","f").show(false)
+---+---+---+-----------------------+---+
|a  |b  |c  |e                      |f  |
+---+---+---+-----------------------+---+
|1  |2  |4  |371003,5371022,87200000|U  |
+---+---+---+-----------------------+---+
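For readers on the Python API, here is a rough PySpark sketch of the same approach (assuming the same /user/txt.csv path and a SparkSession):
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.read.csv("/user/txt.csv")
result = df.select(
    "_c0", "_c1", "_c2",
    # join the three escaped fields back together and strip the backslashes
    F.regexp_replace(F.concat_ws(",", "_c3", "_c4", "_c5"), r"\\", ""),
    "_c6",
).toDF("a", "b", "c", "e", "f")
result.show(truncate=False)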

Related

Employing PySpark, how to determine the frequency of each event and its event-by-event frequency

I have a dataset like:
Data
a
a
a
a
a
b
b
b
a
a
b
I would like to include a column like the one below. The values will be of the form a1,1, where the first element (a1) represents the event frequency, i.e. which run of "a" this is in the field, and the second element (,1) is the frequency within each event, i.e. how many times "a" has repeated before any other element (b) appears. Can we carry this out with PySpark?
Data Frequency
a    a1,1
a    a1,2
a    a1,3
a    a1,4
a    a1,5
b    b1,1
b    b1,2
b    b1,3
a    a2,1
a    a2,2
b    b2,1
You can achieve your desired result by doing this:
from pyspark.sql import Window
import pyspark.sql.functions as F
df = spark.createDataFrame(['a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'a', 'a', 'b'], 'string').toDF("Data")
print("Original Data:")
df.show()
print("Result:")
df.withColumn("ID", F.monotonically_increasing_id()) \
  .withColumn("group",
              F.row_number().over(Window.orderBy("ID"))
              - F.row_number().over(Window.partitionBy("Data").orderBy("ID"))
  ) \
  .withColumn("element_freq", F.when(F.col('Data') != 'abcd', F.row_number().over(Window.partitionBy("group").orderBy("ID"))).otherwise(F.lit(0)))\
  .withColumn("event_freq", F.when(F.col('Data') != 'abcd', F.dense_rank().over(Window.partitionBy("Data").orderBy("group"))).otherwise(F.lit(0)))\
  .withColumn("Frequency", F.concat_ws(',', F.concat(F.col("Data"), F.col("event_freq")), F.col("element_freq"))) \
  .orderBy("ID")\
  .drop("ID", "group", "event_freq", "element_freq")\
  .show()
Original Data:
+----+
|Data|
+----+
|   a|
|   a|
|   a|
|   a|
|   a|
|   b|
|   b|
|   b|
|   a|
|   a|
|   b|
+----+
Result:
+----+---------+
|Data|Frequency|
+----+---------+
|   a|     a1,1|
|   a|     a1,2|
|   a|     a1,3|
|   a|     a1,4|
|   a|     a1,5|
|   b|     b1,1|
|   b|     b1,2|
|   b|     b1,3|
|   a|     a2,1|
|   a|     a2,2|
|   b|     b2,1|
+----+---------+
Use Window functions. I give you two options, just in case.
Option 1, separating groups and Frequency
from pyspark.sql import Window
from pyspark.sql.functions import array, array_join, col, concat, lag, monotonically_increasing_id, rank, sum, when
#Variable to use in the groupby
k = Window.partitionBy().orderBy('index')
(
    #Create an index of df to order by
    df1.withColumn('index', monotonically_increasing_id())
    #Create a column that puts the previous Data value next to the current row
    .withColumn('group', lag('Data').over(k))
    #Where current and previous don't match, conditionally assign a 1, else 0
    .withColumn('group', when(col('Data') != col('group'), 1).otherwise(0))
    #Concat Data and the running sum of the outcome above, per Data value and ordered by index
    .withColumn('group', concat('Data', sum('group').over(Window.partitionBy('Data').orderBy('index')) + 1))
    #Rank the outcome above in the order in which the rows appeared in the initial df
    .withColumn('Frequency', rank().over(Window.partitionBy('group').orderBy('index')))
).sort('index').drop('index').show(truncate=False)
+----+-----+---------+
|Data|group|Frequency|
+----+-----+---------+
|a   |a1   |1        |
|a   |a1   |2        |
|a   |a1   |3        |
|a   |a1   |4        |
|a   |a1   |5        |
|b   |b2   |1        |
|b   |b2   |2        |
|b   |b2   |3        |
|a   |a2   |1        |
|a   |a2   |2        |
|b   |b3   |1        |
+----+-----+---------+
Option 2 gives output in the format you wanted
#Variable to use in the groupby
k = Window.partitionBy().orderBy('index')
(
    #Create an index of df to order by
    df1.withColumn('index', monotonically_increasing_id())
    #Create a column that puts the previous Data value next to the current row
    .withColumn('Frequency', lag('Data').over(k))
    #Where current and previous don't match, conditionally assign a 1, else 0
    .withColumn('Frequency', when(col('Data') != col('Frequency'), 1).otherwise(0))
    #Concat Data and the running sum of the outcome above, per Data value and ordered by index
    .withColumn('Frequency', concat('Data', sum('Frequency').over(Window.partitionBy('Data').orderBy('index')) + 1))
    #Join the group label above with its rank, in the order in which the rows appeared in the initial df
    .withColumn('Frequency', array_join(array('Frequency', rank().over(Window.partitionBy('Frequency').orderBy('index'))), ','))
).sort('index').drop('index').show(truncate=False)
+----+---------+
|Data|Frequency|
+----+---------+
|a   |a1,1     |
|a   |a1,2     |
|a   |a1,3     |
|a   |a1,4     |
|a   |a1,5     |
|b   |b2,1     |
|b   |b2,2     |
|b   |b2,3     |
|a   |a2,1     |
|a   |a2,2     |
|b   |b3,1     |
+----+---------+

When dynamically generating a join condition as a list in PySpark, how to apply "OR" between the elements instead of "AND"?

I am joining two dataframes site_bs and site_wrk_int1 and creating site_wrk using a dynamic join condition.
My code is like below:
join_cond = [col(v_col) == col('wrk_' + v_col) for v_col in primaryKeyCols]
site_wrk = site_bs.join(site_wrk_int1, join_cond, 'inner').select(*site_bs.columns)
join_cond is dynamic and its value will be something like [col(id) == col(wrk_id), col(id) == col(wrk_parentId)]
In the above join condition, the join happens only when both conditions are satisfied, i.e. the join condition will be
id = wrk_id and id = wrk_parentId
But I want an OR condition to be applied, like below:
id = wrk_id or id = wrk_parentId
How to achieve this in PySpark?
Since logical operations on PySpark columns return column objects, you can chain these conditions in the join statement, such as:
from pyspark.sql import SparkSession
import pyspark.sql.functions as f
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([
        (1, "A", "A"),
        (2, "C", "C"),
        (3, "E", "D"),
    ],
    ['id', 'col1', 'col2']
)
df.show()
+---+----+----+
| id|col1|col2|
+---+----+----+
|  1|   A|   A|
|  2|   C|   C|
|  3|   E|   D|
+---+----+----+
df.alias("t1").join(
df.alias("t2"),
(f.col("t1.col1") == f.col("t2.col2")) | (f.col("t1.col1") == f.lit("E")),
"left_outer"
).show(truncate=False)
+---+----+----+---+----+----+
|id |col1|col2|id |col1|col2|
+---+----+----+---+----+----+
|1  |A   |A   |1  |A   |A   |
|2  |C   |C   |2  |C   |C   |
|3  |E   |D   |1  |A   |A   |
|3  |E   |D   |2  |C   |C   |
|3  |E   |D   |3  |E   |D   |
+---+----+----+---+----+----+
As you can see, rows with IDs 1 and 2 match on col1 == col2, while the row with ID 3 matches every right-hand row because col1 == 'E' is true for it. In terms of syntax, it's important that each comparison combined with the Python operators (|, &, ...) is wrapped in parentheses, as in the example above; otherwise you might get confusing py4j errors.
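A minimal illustration of that point (cond is just a hypothetical variable name, reusing the f alias from above):
import pyspark.sql.functions as f
# Without parentheses, Python's | binds more tightly than ==, so an expression like
#   f.col("t1.col1") == f.col("t2.col2") | f.col("t1.col1") == f.lit("E")
# is parsed as a chained comparison over (col2 | col1) and typically raises an error.
# Wrapping each comparison in parentheses keeps the intended meaning:
cond = (f.col("t1.col1") == f.col("t2.col2")) | (f.col("t1.col1") == f.lit("E"))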
Alternatively, if you wish to keep a notation similar to the one in your question, you can use functools.reduce and operator.or_ to apply this logic to your list of conditions.
For comparison, when I pass the list directly, the conditions are combined with AND and I get only NULLs, as expected:
df.alias("t1").join(
df.alias("t2"),
[f.col("t1.col1") == f.col("t2.col2"), f.col("t1.col1") == f.lit("E")],
"left_outer"
).show(truncate=False)
+---+----+----+----+----+----+
|id |col1|col2|id |col1|col2|
+---+----+----+----+----+----+
|3  |E   |D   |null|null|null|
|1  |A   |A   |null|null|null|
|2  |C   |C   |null|null|null|
+---+----+----+----+----+----+
In this example, I leverage functools and operator to get the same result as above:
import functools
import operator
df.alias("t1").join(
    df.alias("t2"),
    functools.reduce(
        operator.or_,
        [f.col("t1.col1") == f.col("t2.col2"), f.col("t1.col1") == f.lit("E")]),
    "left_outer"
).show(truncate=False)
+---+----+----+---+----+----+
|id |col1|col2|id |col1|col2|
+---+----+----+---+----+----+
|1  |A   |A   |1  |A   |A   |
|2  |C   |C   |2  |C   |C   |
|3  |E   |D   |1  |A   |A   |
|3  |E   |D   |2  |C   |C   |
|3  |E   |D   |3  |E   |D   |
+---+----+----+---+----+----+
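Tying this back to the dynamically generated list in your question, the same reduce idea would look roughly like this (primaryKeyCols, site_bs and site_wrk_int1 are the variables from your question):
import functools
import operator
from pyspark.sql.functions import col

# Combine the dynamically generated equality conditions with OR instead of AND
join_cond = functools.reduce(
    operator.or_,
    [col(v_col) == col('wrk_' + v_col) for v_col in primaryKeyCols]
)
site_wrk = site_bs.join(site_wrk_int1, join_cond, 'inner').select(*site_bs.columns)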
I am quite new to Spark SQL. Please let me know whether this could be a solution:
site_wrk = site_bs.join(
    site_wrk_int1,
    (site_bs.id == site_wrk_int1.wrk_id) | (site_bs.id == site_wrk_int1.wrk_parentId),
    how="inner"
)

How to get the info in table header (schema)?

Environment: Spark 2.4.5
Source: id-name.json
{"1": "a", "2": "b", "3": "c", ..., "n": "z"}
I load the .json file into a Spark Dataset as JSON, and it is stored like:
+---+---+---+---+---+
| 1 | 2 | 3 |...| n |
+---+---+---+---+---+
| a | b | c |...| z |
+---+---+---+---+---+
And I want it to be transformed into a result like this:
+----+------+
| id | name |
+----+------+
| 1  | a    |
| 2  | b    |
| 3  | c    |
| .  | .    |
| .  | .    |
| .  | .    |
| n  | z    |
+----+------+
My solution using Spark SQL:
select stack(n, '1', `1`, '2', `2`, ..., 'n', `n`) as ('id', 'name') from table_name;
It doesn't meet my needs because I don't want to hard-code all the ids in the SQL.
Maybe using 'show columns from table_name' together with 'stack()' could help?
I would be very grateful if you could give me some suggestions.
Create the values required for stack dynamically and use them wherever required. Please check the code below, which generates those values dynamically.
scala> val js = Seq("""{"1": "a", "2": "b","3":"c","4":"d","5":"e"}""").toDS
js: org.apache.spark.sql.Dataset[String] = [value: string]
scala> val df = spark.read.json(js)
df: org.apache.spark.sql.DataFrame = [1: string, 2: string ... 3 more fields]
scala> val stack = s"""stack(${df.columns.size},${df.columns.flatMap(c => Seq(s"'${c}'",s"`${c}`")).mkString(",")}) as (id,name)"""
stack: String = stack(5,'1',`1`,'2',`2`,'3',`3`,'4',`4`,'5',`5`) as (id,name)
scala> df.select(expr(stack)).show(false)
+---+----+
|id |name|
+---+----+
|1  |a   |
|2  |b   |
|3  |c   |
|4  |d   |
|5  |e   |
+---+----+
scala> spark.sql(s"""select ${stack} from table """).show(false)
+---+----+
|id |name|
+---+----+
|1  |a   |
|2  |b   |
|3  |c   |
|4  |d   |
|5  |e   |
+---+----+
scala>
Updated code to fetch the data from a JSON file:
scala> "hdfs dfs -cat /tmp/sample.json".!
{"1": "a", "2": "b","3":"c","4":"d","5":"e"}
res4: Int = 0
scala> val df = spark.read.json("/tmp/sample.json")
df: org.apache.spark.sql.DataFrame = [1: string, 2: string ... 3 more fields]
scala> val stack = s"""stack(${df.columns.size},${df.columns.flatMap(c => Seq(s"'${c}'",s"`${c}`")).mkString(",")}) as (id,name)"""
stack: String = stack(5,'1',`1`,'2',`2`,'3',`3`,'4',`4`,'5',`5`) as (id,name)
scala> df.select(expr(stack)).show(false)
+---+----+
|id |name|
+---+----+
|1  |a   |
|2  |b   |
|3  |c   |
|4  |d   |
|5  |e   |
+---+----+
scala> df.createTempView("table")
scala> spark.sql(s"""select ${stack} from table """).show(false)
+---+----+
|id |name|
+---+----+
|1  |a   |
|2  |b   |
|3  |c   |
|4  |d   |
|5  |e   |
+---+----+
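If you are on the Python API instead, a rough equivalent sketch (assuming the same /tmp/sample.json file and the spark session of a pyspark shell) would be:
import pyspark.sql.functions as F

df = spark.read.json("/tmp/sample.json")
# Build the stack() expression from the actual column names
pairs = ",".join("'{0}',`{0}`".format(c) for c in df.columns)
stack_expr = "stack({0},{1}) as (id, name)".format(len(df.columns), pairs)
df.select(F.expr(stack_expr)).show(truncate=False)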

How to get a subset of columns based on row values in Spark Dataframe?

I have the following dataframe in Spark (it has only one row):
df.show
+---+---+---+---+---+---+
|  A|  B|  C|  D|  E|  F|
+---+---+---+---+---+---+
|  1|4.4|  2|  3|  7|2.6|
+---+---+---+---+---+---+
I want to get the columns whose values are greater than 2.8 (just as an example). The outcome should be:
List(B, D, E)
Here is my own solution:
val cols = df.columns
val threshold = 2.8
val values = df.rdd.collect.toList
val res = values
  .flatMap(x => x.toSeq)
  .map(x => x.toString.toDouble)
  .zip(cols)
  .filter(x => x._1 > threshold)
  .map(x => x._2)
A simple udf function should give you the correct result:
import org.apache.spark.sql.functions.{array, col, udf}
val columns = df.columns
def getColumns = udf((cols: Seq[Double]) => cols.zip(columns).filter(_._1 > 2.8).map(_._2))
df.withColumn("columns > 2.8", getColumns(array(columns.map(col(_)): _*))).show(false)
So even if you have multiple rows, as below:
+---+---+---+---+---+---+
|A  |B  |C  |D  |E  |F  |
+---+---+---+---+---+---+
|1  |4.4|2  |3  |7  |2.6|
|4  |2.7|2  |3  |1  |2.9|
+---+---+---+---+---+---+
you will get the result for each row:
+---+---+---+---+---+---+-------------+
|A  |B  |C  |D  |E  |F  |columns > 2.8|
+---+---+---+---+---+---+-------------+
|1  |4.4|2  |3  |7  |2.6|[B, D, E]    |
|4  |2.7|2  |3  |1  |2.9|[A, D, F]    |
+---+---+---+---+---+---+-------------+
I hope the answer is helpful
You could use the explode and array functions:
df.select(
    explode(
      array(
        df.columns.map(c => struct(lit(c).alias("key"), col(c).alias("val"))): _*
      )
    ).as("kv")
  )
  .where($"kv.val" > 2.8)
  .select($"kv.key")
  .show()
+---+
|key|
+---+
|  B|
|  D|
|  E|
+---+
You could then collect this result. But I don't see any issue with collecting the dataframe first, as it has only one row:
df.columns.zip(df.first().toSeq.map(_.asInstanceOf[Double]))
  .collect{ case (c, v) if v > 2.8 => c } // Array(B, D, E)
val c = df.columns.foldLeft(df){ (a, b) => a.withColumn(b, when(col(b) > 2.8, b)) }
c.collect
You can then remove the nulls from the collected rows to get the matching column names.
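For readers using the Python API, here is a rough PySpark sketch of the explode-based approach shown above (assuming df holds the same numeric columns):
import pyspark.sql.functions as F

kv = F.explode(
    F.array(*[
        # pair each column name with its value, cast to double so the structs line up
        F.struct(F.lit(c).alias("key"), F.col(c).cast("double").alias("val"))
        for c in df.columns
    ])
).alias("kv")
df.select(kv).where(F.col("kv.val") > 2.8).select("kv.key").show()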

Sorting DataFrame within rows and getting the ranking

I have the following PySpark DataFrame:
+----+----------+----------+----------+
|  id|         a|         b|         c|
+----+----------+----------+----------+
|2346|2017-05-26|      null|2016-12-18|
|5678|2013-05-07|2018-05-12|      null|
+----+----------+----------+----------+
My ideal output is:
+----+---+---+---+
|id  |a  |b  |c  |
+----+---+---+---+
|2346|2  |0  |1  |
|5678|1  |2  |0  |
+----+---+---+---+
I.e. the more recent the date within the row, the higher the score.
I have looked at similar posts suggesting the use of window functions. The problem is that I need to order my values within the row, not within a column.
You can put the values in each row into an array and use pyspark.sql.functions.sort_array() to sort it.
import pyspark.sql.functions as f
cols = ["a", "b", "c"]
df = df.select("*", f.sort_array(f.array([f.col(c) for c in cols])).alias("sorted"))
df.show(truncate=False)
#+----+----------+----------+----------+------------------------------+
#|id  |a         |b         |c         |sorted                        |
#+----+----------+----------+----------+------------------------------+
#|2346|2017-05-26|null      |2016-12-18|[null, 2016-12-18, 2017-05-26]|
#|5678|2013-05-07|2018-05-12|null      |[null, 2013-05-07, 2018-05-12]|
#+----+----------+----------+----------+------------------------------+
Now you can use a combination of pyspark.sql.functions.coalesce() and pyspark.sql.functions.when() to loop over each of the columns in cols and find the corresponding index in the sorted array.
df = df.select(
    "id",
    *[
        f.coalesce(
            *[
                f.when(
                    f.col("sorted").getItem(i) == f.col(c),
                    f.lit(i)
                )
                for i in range(len(cols))
            ]
        ).alias(c)
        for c in cols
    ]
)
df.show(truncate=False)
#+----+---+----+----+
#|id  |a  |b   |c   |
#+----+---+----+----+
#|2346|2  |null|null|
#|5678|1  |2   |null|
#+----+---+----+----+
Finally fill the null values with 0:
df = df.na.fill(0)
df.show(truncate=False)
#+----+---+---+---+
#|id  |a  |b  |c  |
#+----+---+---+---+
#|2346|2  |0  |1  |
#|5678|1  |2  |0  |
#+----+---+---+---+
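If you are on Spark 2.4 or later, array_position can replace the coalesce/when loop; a rough sketch under that assumption, starting again from the original df and the same cols list:
import pyspark.sql.functions as F

cols = ["a", "b", "c"]
df = df.withColumn("sorted", F.sort_array(F.array(*[F.col(c) for c in cols])))
df = df.select(
    "id",
    # array_position is 1-based and returns null when the value itself is null
    *[(F.expr("array_position(sorted, {0})".format(c)) - 1).alias(c) for c in cols]
).na.fill(0)
df.show()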
