Spark: Conditionally Joining/Concatenating Columns Based on Leading Characters - apache-spark

I've got a data set with unclean data that has been split incorrectly. This results in an uneven number of columns: the number of columns per row depends on the number of errors arising from one field. You can tell a column is incorrect if it has 3 leading double quotes; in that case it should be joined to the previous column, with the remaining columns shifted left.
I import the CSV of the data into a DataFrame, which creates something similar to the example below.
Example:
INPUT:
+---+--------+----------+----------+---------+
|id | detail | context  | _c3      | _c4     |
+---+--------+----------+----------+---------+
| 1 | {blah} | service  | null     | null    |
| 2 | { blah | """ blah | """blah} | service |
| 3 | { blah | """blah} | service  | null    |
+---+--------+----------+----------+---------+
DESIRED OUTPUT:
+---+------------------------+---------+
|id | detail                 | context |
+---+------------------------+---------+
| 1 | {blah}                 | service |
| 2 | { blah""" blah"""blah} | service |
| 3 | { blah"""blah}         | service |
+---+------------------------+---------+
I've tried something like the following, as well as a bunch of other approaches:
`df.filter(col("context").startsWith("\"\"\"")).select($"detail", lit(" "), $"context").collect()`
This doesn't work, and doesn't fully do what I need it to do. Any ideas? Help is much appreciated :)
Thanks!

I think the easiest way to fix this would be to put the columns back together and then parse them correctly. One way to do this is to use concat to combine all the columns, then use regexp_extract to pull out the pieces you want as individual columns. For example:
import org.apache.spark.sql.functions._  // for concat, col and regexp_extract

case class MyRow(id: Int, detail: String, context: String, _c3: String, _c4: String)

val data = Seq(
  MyRow(1, "{blah}", "service", "", ""),
  MyRow(2, "{ blah", " \"\"\" blah", " \"\"\"blah}", "service"),
  MyRow(3, "{ blah", "\"\"\"blah}", "service", "")
)

val df = sc.parallelize(data).toDF
val columns = df.columns.filterNot(_ == "id")
val nonulls = df.na.fill("")
val combined = nonulls.select($"id", concat(columns.map(col): _*) as "data")
val fixed = combined
  .withColumn("left", regexp_extract($"data", "(\\{.*\\})", 1))
  .withColumn("right", regexp_extract($"data", "([^}]+$)", 1))
fixed.show(10, false)
Which should output:
+---+-------------------------------+------------------------+-------+
|id |data |left |right |
+---+-------------------------------+------------------------+-------+
|1 |{blah}service |{blah} |service|
|2 |{ blah """ blah """blah}service|{ blah """ blah """blah}|service|
|3 |{ blah"""blah}service |{ blah"""blah} |service|
+---+-------------------------------+------------------------+-------+
In the code above I'm assuming that the columns are already in the right order.
This is just splitting on the last }. If you need more complicated parsing, you can write a UDF that parses it however you want and returns a tuple of fields.
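To make that concrete, here is a minimal sketch of such a UDF (my own illustration, not part of the original answer). It reuses the combined DataFrame built above and assumes the same split-at-the-last-} rule, returning the two pieces as a tuple:
import org.apache.spark.sql.functions.udf

// Hypothetical parser: split the combined string at the last '}' and
// return (detail, context) as a tuple. Adjust the logic to whatever
// parsing rules your data actually needs.
val parseData = udf { data: String =>
  val idx = data.lastIndexOf("}")
  if (idx >= 0) (data.substring(0, idx + 1), data.substring(idx + 1))
  else (data, "")
}

val parsed = combined
  .withColumn("parsed", parseData($"data"))
  .select($"id", $"parsed._1" as "detail", $"parsed._2" as "context")
parsed.show(10, false)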

Related

Spark text parsing with dynamic delimiter

I have a text file which looks like this:
:1: some first row of first attribute
second row of first attribute
:55: first row of fifty fifth
:100: some other text
also other
another one
I would like to parse it in the following manner:
+----------+-----------------------------------+
| AttrNr | Row |
+----------+-----------------------------------+
| 1 | some first row of first attribute |
+----------+-----------------------------------+
| 1 | second row of first attribute |
+----------+-----------------------------------+
| 1 | 3rd value with test: 1,2,3 |
+----------+-----------------------------------+
| 55 | first row of fifty fifth |
+----------+-----------------------------------+
| 100 | some other text |
+----------+-----------------------------------+
| 100 | also other |
+----------+-----------------------------------+
| 100 | another one |
+----------+-----------------------------------+
Parsing should be done according to the :n: delimiter. The ":" symbol might also appear inside values.
The final output can be achieved by using the Window functions available in Spark, but your data lacks essential details such as a partitioning column and a column by which to order the data, so that we know which row comes after which.
Assuming you are working on a distributed system, the following answer might not work at all. It works for the provided example, but things will be different in a distributed environment with a huge file.
Creating a DataFrame from the text file:
Reading the text file as an RDD:
val rdd = sc.parallelize(Seq(
  ":1: some first row of first attribute",
  "second row of first attribute",
  ":55: first row of fifty fifth",
  ":100: some other text",
  "also other",
  "another one"
))
// Or use spark.sparkContext.textFile if you are reading from a file
Iterate over the RDD to split the columns in the required format and generate a DataFrame
val df = rdd.map { c =>
  if (c.startsWith(":")) (c.split(" ", 2)(0), c.split(" ", 2)(1))
  else (null.asInstanceOf[String], c)
}.toDF("AttrNr", "Row")
//df: org.apache.spark.sql.DataFrame = [AttrNr: string, Row: string]
df.show(false)
// +------+---------------------------------+
// |AttrNr|Row |
// +------+---------------------------------+
// |:1: |some first row of first attribute|
// |null |second row of first attribute |
// |:55: |first row of fifty fifth |
// |:100: |some other text |
// |null |also other |
// |null |another one |
// +------+---------------------------------+
The following set of commands is just a hack, is not efficient at all, and shouldn't be used in a production-like environment. last gives you the last non-null value of the column. Partitioning and ordering are done manually here because your data does not provide such columns.
df.withColumn("p", lit(1))
.withColumn("AttrNr",
last($"AttrNr", true).over(Window.partitionBy($"p").orderBy(lit(1)).rowsBetween(Window.unboundedPreceding, 0) ) )
// +------+---------------------------------+
// |AttrNr|Row |
// +------+---------------------------------+
// |:1: |some first row of first attribute|
// |:1: |second row of first attribute |
// |:55: |first row of fifty fifth |
// |:100: |some other text |
// |:100: |also other |
// |:100: |another one |
// +------+---------------------------------+
Actually, I solved it with SQL, but I was wondering whether there is a simpler way. I'm using Spark 2.3, without higher-order functions.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.last

val df = Seq(
  ":1: some first row of first attribute",
  "second row of first attribute",
  "3rd value with test: 1,2,3",
  ":55: first row of fifty fifth",
  ":100: some other text",
  "also other",
  "another one"
).toDF("_c0")

df.createOrReplaceTempView("test1")
spark.sql("""select _c0, split(_c0, ":") arr from test1""").createOrReplaceTempView("test2")

val testDF = spark.sql("""
  select arr[1] t0,
         cast(arr[1] as int) t1,
         case when arr[1] = cast(arr[1] as int)
              then replace(concat_ws(":", arr), concat(concat(":", arr[1]), ":"), "")
              else concat_ws(":", arr)
         end Row,
         monotonically_increasing_id() mrn
  from test2""")

val fnc = Window.orderBy("mrn")
val testDF2 = testDF.withColumn("AttrNr", last('t1, true).over(fnc))
testDF2.drop("t0", "t1", "mrn").show(false)
+----------------------------------+------+
|Row |AttrNr|
+----------------------------------+------+
| some first row of first attribute|1 |
|second row of first attribute |1 |
|3rd value with test: 1,2,3 |1 |
| first row of fifty fifth |55 |
| some other text |100 |
|also other |100 |
|another one |100 |
+----------------------------------+------+
Column "AttrNr" can be received with "regexp_extract" function:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
df
  .withColumn("AttrNr", regexp_extract($"_c0", "^:([\\d].*):", 0))
  .withColumn("Row", when(length($"AttrNr") === lit(0), $"_c0")
    .otherwise(expr("substring(_c0, length(AttrNr) + 2)")))
  .withColumn("AttrNr", when(length($"AttrNr") === lit(0), null.asInstanceOf[String])
    .otherwise(expr("substring(_c0, 2, length(AttrNr) - 2)")))
  // Window with no partitioning, bad for performance
  .withColumn("AttrNr", last($"AttrNr", true)
    .over(Window.orderBy(lit(1)).rowsBetween(Window.unboundedPreceding, 0)))
  .drop("_c0")

How to identify if a particular string/pattern exists in a column using pySpark

Below is my sample dataframe for household things.
Here W represents Wooden, G represents Glass, and P represents Plastic, and different items are classified into those categories.
I want to identify which of the W, G, P categories each item falls into. As an initial step, I tried classifying it for Chair.
M = sqlContext.createDataFrame([('W-Chair-Shelf;G-Vase;P-Cup',''),
('W-Chair',''),
('W-Shelf;G-Cup;P-Chair',''),
('G-Cup;P-ShowerCap;W-Board','')],
['Household_chores_arrangements','Chair'])
M.createOrReplaceTempView('M')
+-----------------------------+-----+
|Household_chores_arrangements|Chair|
+-----------------------------+-----+
| W-Chair-Shelf;G-Vase;P-Cup| |
| W-Chair| |
| W-Shelf;G-Cup;P-Chair| |
| G-Cup;P-ShowerCap;W-Board| |
+-----------------------------+-----+
I tried to do it for one condition where I can mark it as W, but I am not getting the expected results; maybe my condition is wrong.
df = sqlContext.sql("select * from M where Household_chores_arrangements like '%W%Chair%'")
display(df)
Is there a better way to do this in pySpark?
Expected output
+-----------------------------+-----+
|Household_chores_arrangements|Chair|
+-----------------------------+-----+
| W-Chair-Shelf;G-Vase;P-Cup| W|
| W-Chair| W|
| W-Shelf;G-Cup;P-Chair| P|
| G-Cup;P-ShowerCap;W-Board| NULL|
+-----------------------------+-----+
Thanks @mck for the solution.
Update
In addition to that, I was trying to analyse the regexp_extract option some more, so I altered the sample set:
M = sqlContext.createDataFrame([('Wooden|Chair',''),
('Wooden|Cup;Glass|Chair',''),
('Wooden|Cup;Glass|Showercap;Plastic|Chair','') ],
['Household_chores_arrangements','Chair'])
M.createOrReplaceTempView('M')
df = spark.sql("""
select
Household_chores_arrangements,
nullif(regexp_extract(Household_chores_arrangements, '(Wooden|Glass|Plastic)(|Chair)', 1), '') as Chair
from M
""")
display(df)
Result:
+----------------------------------------+------+
|Household_chores_arrangements           |Chair |
+----------------------------------------+------+
|Wooden|Chair                            |Wooden|
|Wooden|Cup;Glass|Chair                  |Wooden|
|Wooden|Cup;Glass|Showercap;Plastic|Chair|Wooden|
+----------------------------------------+------+
I changed the delimiter to | instead of -, and made changes in the query as well. I was expecting the results below, but got the wrong result.
+----------------------------------------+-------+
|Household_chores_arrangements           |Chair  |
+----------------------------------------+-------+
|Wooden|Chair                            |Wooden |
|Wooden|Cup;Glass|Chair                  |Glass  |
|Wooden|Cup;Glass|Showercap;Plastic|Chair|Plastic|
+----------------------------------------+-------+
If only the delimiter is changed, do we need to change any other values?
Update 2
I have got the solution for the above-mentioned update.
For a pipe delimiter the pipe has to be escaped with four backslashes in the Python source (\\\\|): Python reduces them to two (\\|), Spark SQL's string-literal parsing reduces those to one (\|), and the regex engine then treats the pipe as a literal character instead of alternation.
You can use regexp_extract to extract the categories, and if no match is found, replace empty string with null using nullif.
df = spark.sql("""
select
Household_chores_arrangements,
nullif(regexp_extract(Household_chores_arrangements, '([A-Z])-Chair', 1), '') as Chair
from M
""")
df.show(truncate=False)
+-----------------------------+-----+
|Household_chores_arrangements|Chair|
+-----------------------------+-----+
|W-Chair-Shelf;G-Vase;P-Cup |W |
|W-Chair |W |
|W-Shelf;G-Cup;P-Chair |P |
|G-Cup;P-ShowerCap;W-Board |null |
+-----------------------------+-----+

How to find position of substring column in a another column using PySpark?

I have a PySpark DataFrame with two columns, text and subtext, where subtext is guaranteed to occur somewhere within text. How would I calculate the position of subtext in the text column?
Input data:
+---------------------------+---------+
| text | subtext |
+---------------------------+---------+
| Where is my string? | is |
| Hm, this one is different | on |
+---------------------------+---------+
Expected output:
+---------------------------+---------+----------+
| text | subtext | position |
+---------------------------+---------+----------+
| Where is my string? | is | 6 |
| Hm, this one is different | on | 9 |
+---------------------------+---------+----------+
Note: I can do this using static text/regex without issue; I have not been able to find any resources on doing this with a row-specific text/regex.
You can use locate. You need to subtract 1 because the string index starts at 1, not 0.
import pyspark.sql.functions as F
df2 = df.withColumn('position', F.expr('locate(subtext, text) - 1'))
df2.show(truncate=False)
+-------------------------+-------+--------+
|text |subtext|position|
+-------------------------+-------+--------+
|Where is my string? |is |6 |
|Hm, this one is different|on |9 |
+-------------------------+-------+--------+
Another way, using the position SQL function:
from pyspark.sql.functions import expr
df1 = df.withColumn('position', expr("position(subtext in text) -1"))
df1.show(truncate=False)
#+-------------------------+-------+--------+
#|text |subtext|position|
#+-------------------------+-------+--------+
#|Where is my string? |is |6 |
#|Hm, this one is different|on |9 |
#+-------------------------+-------+--------+
pyspark.sql.functions.instr(str, substr)
Locate the position of the first occurrence of substr column in the given string. Returns null if either of the arguments are null.
import pyspark.sql.functions as F
df.withColumn('pos',F.instr(df["text"], df["subtext"]))
You can use locate itself. The problem is that the first parameter of locate (substr) has to be a literal string in the DataFrame API, so you can use the expr function to express the call in SQL, where both arguments can be columns.
Please find the corrected code below:
import pyspark.sql.functions as F
df = input_df.withColumn("poss", F.expr("locate(subtext, text, 1)"))

Get all rows after doing GroupBy in SparkSQL

I tried to do a group by in SparkSQL, which works well, but most of the rows went missing.
spark.sql(
"""
| SELECT
| website_session_id,
| MIN(website_pageview_id) as min_pv_id
|
| FROM website_pageviews
| GROUP BY website_session_id
| ORDER BY website_session_id
|
|
|""".stripMargin).show(10,truncate = false)
I am getting output like this:
+------------------+---------+
|website_session_id|min_pv_id|
+------------------+---------+
|1 |1 |
|10 |15 |
|100 |168 |
|1000 |1910 |
|10000 |20022 |
|100000 |227964 |
|100001 |227966 |
|100002 |227967 |
|100003 |227970 |
|100004 |227973 |
+------------------+---------+
The same query in MySQL gives the desired result.
What is the best way to do this, so that all rows are fetched by my query?
Please note I already checked other answers related to this, like joining to get all rows etc., but I want to know if there is any other way by which we can get the result like we get in MySQL.
It looks like it is ordered alphabetically, in which case 10 comes before 2.
You might want to check that the column's type is a number, not a string.
What datatypes do the columns have (printSchema())?
I think website_session_id is of string type. Cast it to an integer type and see what you get:
spark.sql(
"""
| SELECT
| CAST(website_session_id AS int) as website_session_id,
| MIN(website_pageview_id) as min_pv_id
|
| FROM website_pageviews
| GROUP BY website_session_id
| ORDER BY website_session_id
|
|
|""".stripMargin).show(10,truncate = false)

Conditional Explode in Spark Structured Streaming / Spark SQL

I'm trying to do a conditional explode in Spark Structured Streaming.
For instance, my streaming dataframe looks as follows (I'm totally making the data up here). I want to explode the employees array into separate rows, each holding a single-element array, when contingent = 1. When contingent = 0, I need to leave the array as is.
|----------------|---------------------|------------------|
| Dept ID | Employees | Contingent |
|----------------|---------------------|------------------|
| 1 | ["John", "Jane"] | 1 |
|----------------|---------------------|------------------|
| 4 | ["Amy", "James"] | 0 |
|----------------|---------------------|------------------|
| 2 | ["David"] | 1 |
|----------------|---------------------|------------------|
So, my output should look like this (I do not need to display the contingent column):
|----------------|---------------------|
| Dept ID | Employees |
|----------------|---------------------|
| 1 | ["John"] |
|----------------|---------------------|
| 1 | ["Jane"] |
|----------------|---------------------|
| 4 | ["Amy", "James"] |
|----------------|---------------------|
| 2 | ["David"] |
|----------------|---------------------|
There are a couple of challenges I'm currently facing:
Exploding arrays conditionally
Exploding arrays into arrays (rather than into strings, in this case)
In Hive, there was the concept of a UDTF (user-defined table function) that would allow me to do this. Is there anything comparable to it?
Use flatMap to explode and specify whatever condition you want.
case class Department(Dept_ID: String, Employees: Array[String], Contingent: Int)
case class DepartmentExp(Dept_ID: String, Employees: Array[String])

val ds = df.as[Department]
ds.flatMap(dept => {
  if (dept.Contingent == 1) {
    dept.Employees.map(emp => DepartmentExp(dept.Dept_ID, Array(emp)))
  } else {
    Array(DepartmentExp(dept.Dept_ID, dept.Employees))
  }
}).as[DepartmentExp]
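For completeness, here is a minimal sketch of how this could be wired into a streaming query. The JSON source, the DDL schema string, and the input path are assumptions for illustration only; it reuses the Department and DepartmentExp case classes defined above and assumes a spark-shell-like environment where spark is in scope:
import spark.implicits._

// Hypothetical streaming source; replace the schema and path with your own.
val streamingDf = spark.readStream
  .schema("Dept_ID STRING, Employees ARRAY<STRING>, Contingent INT")
  .json("/path/to/departments")

val exploded = streamingDf.as[Department].flatMap { dept =>
  if (dept.Contingent == 1)
    dept.Employees.toSeq.map(emp => DepartmentExp(dept.Dept_ID, Array(emp)))
  else
    Seq(DepartmentExp(dept.Dept_ID, dept.Employees))
}

// Console sink, just to inspect the exploded rows.
exploded.writeStream
  .format("console")
  .outputMode("append")
  .start()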
