Spark text parsing with dynamic delimiter - apache-spark

I have a text file which looks like:
:1: some first row of first attribute
second row of first attribute
3rd value with test: 1,2,3
:55: first row of fifty fifth
:100: some other text
also other
another one
I would like to parse it in the following manner:
+--------+-----------------------------------+
| AttrNr | Row                               |
+--------+-----------------------------------+
| 1      | some first row of first attribute |
+--------+-----------------------------------+
| 1      | second row of first attribute     |
+--------+-----------------------------------+
| 1      | 3rd value with test: 1,2,3        |
+--------+-----------------------------------+
| 55     | first row of fifty fifth          |
+--------+-----------------------------------+
| 100    | some other text                   |
+--------+-----------------------------------+
| 100    | also other                        |
+--------+-----------------------------------+
| 100    | another one                       |
+--------+-----------------------------------+
Parsing should be done according to the :n: delimiter. The ":" symbol might also appear inside values.

The final output can be achieved using the Window functions available in Spark, but your data lacks essential details such as a partitioning column and a column to order the data by, so we can know which row comes after which.
If you are working on a distributed system, the following answer might not work at all. It works for the provided example, but things will be different in a distributed environment with a huge file.
Creating a DataFrame from the text file:
Reading the text file as an RDD:
val rdd = sc.parallelize(Seq(
(":1: some first row of first attribute"),
("second row of first attribute"),
(":55: first row of fifty fifth"),
(":100: some other text"),
("also other"),
("another one")
))
// Or use spark.sparkContext.textFile if you are reading from a file
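If the data were read from an actual file instead, a minimal sketch (the path below is a placeholder, not something given in the question):
// Hypothetical path; each line of the file becomes one element of the RDD.
val rddFromFile = spark.sparkContext.textFile("/path/to/attributes.txt")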
Map over the RDD to split each line into the required columns and generate a DataFrame:
import spark.implicits._ // already in scope in spark-shell; needed for toDF and $

val df = rdd.map { c =>
  if (c.startsWith(":")) (c.split(" ", 2)(0), c.split(" ", 2)(1))
  else (null.asInstanceOf[String], c)
}.toDF("AttrNr", "Row")
//df: org.apache.spark.sql.DataFrame = [AttrNr: string, Row: string]
df.show(false)
// +------+---------------------------------+
// |AttrNr|Row |
// +------+---------------------------------+
// |:1: |some first row of first attribute|
// |null |second row of first attribute |
// |:55: |first row of fifty fifth |
// |:100: |some other text |
// |null |also other |
// |null |another one |
// +------+---------------------------------+
The following set of commands is just a hack, is not performant at all, and shouldn't be used in a production-like environment. last gives you the last non-null value seen so far. Partitioning and ordering are done manually here because your data does not provide such columns.
df.withColumn("p", lit(1))
.withColumn("AttrNr",
last($"AttrNr", true).over(Window.partitionBy($"p").orderBy(lit(1)).rowsBetween(Window.unboundedPreceding, 0) ) )
// +------+---------------------------------+
// |AttrNr|Row |
// +------+---------------------------------+
// |:1: |some first row of first attribute|
// |:1: |second row of first attribute |
// |:55: |first row of fifty fifth |
// |:100: |some other text |
// |:100: |also other |
// |:100: |another one |
// +------+---------------------------------+
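To match the desired output exactly, the surrounding colons can also be stripped from AttrNr. A minimal follow-up sketch, assuming the result of the previous step was assigned to a value named dfFilled (a name introduced only for this sketch):
import org.apache.spark.sql.functions.regexp_replace
// dfFilled stands for the window-filled DataFrame shown above.
dfFilled
  .withColumn("AttrNr", regexp_replace($"AttrNr", ":", "").cast("int"))
  .show(false)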

Actually I solved it with SQL, but I was wondering whether there is a simpler way. I'm using Spark 2.3, without higher-order functions.
import org.apache.spark.sql.expressions.Window
val df = Seq((":1: some first row of first attribute"),
("second row of first attribute"),
("3rd value with test: 1,2,3"),
(":55: first row of fifty fifth"),
(":100: some other text"),
("also other"),
("another one")).toDF("_c0")
df.createOrReplaceTempView("test1")
spark.sql("""select _c0, split(_c0, ":") arr from test1""").createOrReplaceTempView("test2")
val testDF = spark.sql("""
  select arr[1] t0,
         cast(arr[1] as int) t1,
         case when arr[1] = cast(arr[1] as int)
              then replace(concat_ws(":", arr), concat(concat(":", arr[1]), ":"), "")
              else concat_ws(":", arr)
         end Row,
         monotonically_increasing_id() mrn
  from test2""")
val fnc = Window.orderBy("mrn")
val testDF2 = testDF.withColumn("AttrNr", last('t1,true).over(fnc))
testDF2.drop("t0","t1","mrn").show(false)
+----------------------------------+------+
|Row |AttrNr|
+----------------------------------+------+
| some first row of first attribute|1 |
|second row of first attribute |1 |
|3rd value with test: 1,2,3 |1 |
| first row of fifty fifth |55 |
| some other text |100 |
|also other |100 |
|another one |100 |
+----------------------------------+------+

Column "AttrNr" can be received with "regexp_extract" function:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

df
  .withColumn("AttrNr", regexp_extract($"_c0", "^:(\\d+):", 0))
  .withColumn("Row", when(length($"AttrNr") === lit(0), $"_c0").otherwise(expr("substring(_c0, length(AttrNr) + 2)")))
  .withColumn("AttrNr", when(length($"AttrNr") === lit(0), null.asInstanceOf[String]).otherwise(expr("substring(_c0, 2, length(AttrNr) - 2)")))
  // Window with no partitioning, bad for performance
  .withColumn("AttrNr", last($"AttrNr", true).over(Window.orderBy(lit(1)).rowsBetween(Window.unboundedPreceding, 0)))
  .drop("_c0")

Related

How to identify if a particular string/pattern exists in a column using PySpark

Below is my sample dataframe for household things.
Here W represents Wooden, G represents Glass, and P represents Plastic, and different items are classified into those categories.
I want to identify which of the W, G, P categories each item falls in. As an initial step, I tried classifying it for Chair.
M = sqlContext.createDataFrame([('W-Chair-Shelf;G-Vase;P-Cup',''),
('W-Chair',''),
('W-Shelf;G-Cup;P-Chair',''),
('G-Cup;P-ShowerCap;W-Board','')],
['Household_chores_arrangements','Chair'])
M.createOrReplaceTempView('M')
+-----------------------------+-----+
|Household_chores_arrangements|Chair|
+-----------------------------+-----+
| W-Chair-Shelf;G-Vase;P-Cup| |
| W-Chair| |
| W-Shelf;G-Cup;P-Chair| |
| G-Cup;P-ShowerCap;W-Board| |
+-----------------------------+-----+
I tried to do it for one condition where I can mark it as W, but I am not getting the expected results; maybe my condition is wrong.
df = sqlContext.sql("select * from M where Household_chores_arrangements like '%W%Chair%'")
display(df)
Is there a better way to do this in PySpark?
Expected output
+-----------------------------+-----+
|Household_chores_arrangements|Chair|
+-----------------------------+-----+
| W-Chair-Shelf;G-Vase;P-Cup| W|
| W-Chair| W|
| W-Shelf;G-Cup;P-Chair| P|
| G-Cup;P-ShowerCap;W-Board| NULL|
+-----------------------------+-----+
Thanks @mck for the solution.
Update
In addition to that, I was trying to analyse the regexp_extract option further, so I altered the sample set:
M = sqlContext.createDataFrame([('Wooden|Chair',''),
('Wooden|Cup;Glass|Chair',''),
('Wooden|Cup;Glass|Showercap;Plastic|Chair','') ],
['Household_chores_arrangements','Chair'])
M.createOrReplaceTempView('M')
df = spark.sql("""
select
Household_chores_arrangements,
nullif(regexp_extract(Household_chores_arrangements, '(Wooden|Glass|Plastic)(|Chair)', 1), '') as Chair
from M
""")
display(df)
Result:
+----------------------------------------+------+
|Household_chores_arrangements           |Chair |
+----------------------------------------+------+
|Wooden|Chair                            |Wooden|
|Wooden|Cup;Glass|Chair                  |Wooden|
|Wooden|Cup;Glass|Showercap;Plastic|Chair|Wooden|
+----------------------------------------+------+
I changed the delimiter to | instead of - and made changes in the query as well. I was expecting the results below, but got a wrong result.
+----------------------------------------+-------+
|Household_chores_arrangements           |Chair  |
+----------------------------------------+-------+
|Wooden|Chair                            |Wooden |
|Wooden|Cup;Glass|Chair                  |Glass  |
|Wooden|Cup;Glass|Showercap;Plastic|Chair|Plastic|
+----------------------------------------+-------+
If only the delimiter is changed, do we need to change anything else?
Update 2
I have got the solution for the above-mentioned update.
For the pipe delimiter, we have to escape it using 4 backslashes (\\\\).
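A minimal sketch of the corrected query under that assumption; the four backslashes in the Python string reach the regex engine as \| so the pipe before the item name is matched literally:
df = spark.sql("""
    select
        Household_chores_arrangements,
        nullif(regexp_extract(Household_chores_arrangements, '(Wooden|Glass|Plastic)\\\\|Chair', 1), '') as Chair
    from M
""")
display(df)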
You can use regexp_extract to extract the categories, and if no match is found, replace empty string with null using nullif.
df = spark.sql("""
select
Household_chores_arrangements,
nullif(regexp_extract(Household_chores_arrangements, '([A-Z])-Chair', 1), '') as Chair
from M
""")
df.show(truncate=False)
+-----------------------------+-----+
|Household_chores_arrangements|Chair|
+-----------------------------+-----+
|W-Chair-Shelf;G-Vase;P-Cup |W |
|W-Chair |W |
|W-Shelf;G-Cup;P-Chair |P |
|G-Cup;P-ShowerCap;W-Board |null |
+-----------------------------+-----+
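The same pattern should generalize to the other items mentioned in the question; a hedged sketch adding a hypothetical Cup column alongside Chair (column name chosen here for illustration only):
df = spark.sql("""
    select
        Household_chores_arrangements,
        nullif(regexp_extract(Household_chores_arrangements, '([A-Z])-Chair', 1), '') as Chair,
        nullif(regexp_extract(Household_chores_arrangements, '([A-Z])-Cup', 1), '') as Cup
    from M
""")
df.show(truncate=False)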

How to find the position of a substring column within another column using PySpark?

I have a PySpark DataFrame with two columns, text and subtext, where subtext is guaranteed to occur somewhere within text. How would I calculate the position of subtext within the text column?
Input data:
+---------------------------+---------+
| text | subtext |
+---------------------------+---------+
| Where is my string? | is |
| Hm, this one is different | on |
+---------------------------+---------+
Expected output:
+---------------------------+---------+----------+
| text | subtext | position |
+---------------------------+---------+----------+
| Where is my string? | is | 6 |
| Hm, this one is different | on | 9 |
+---------------------------+---------+----------+
Note: I can do this using a static text/regex without issue, but I have not been able to find any resources on doing this with a row-specific text/regex.
You can use locate. You need to subtract 1 because string indices start from 1, not 0.
import pyspark.sql.functions as F
df2 = df.withColumn('position', F.expr('locate(subtext, text) - 1'))
df2.show(truncate=False)
+-------------------------+-------+--------+
|text |subtext|position|
+-------------------------+-------+--------+
|Where is my string? |is |6 |
|Hm, this one is different|on |9 |
+-------------------------+-------+--------+
Another way, using the position SQL function:
from pyspark.sql.functions import expr
df1 = df.withColumn('position', expr("position(subtext in text) -1"))
df1.show(truncate=False)
#+-------------------------+-------+--------+
#|text |subtext|position|
#+-------------------------+-------+--------+
#|Where is my string? |is |6 |
#|Hm, this one is different|on |9 |
#+-------------------------+-------+--------+
pyspark.sql.functions.instr(str, substr)
Locate the position of the first occurrence of substr column in the given string. Returns null if either of the arguments are null.
import pyspark.sql.functions as F
df.withColumn('pos',F.instr(df["text"], df["subtext"]))
You can use locate itself. The problem is that the substr parameter of locate must be a string literal, not a column, so you can use the expr function and write the whole call in SQL, where both arguments can be columns. Please find the code below (note it returns a 1-based position; subtract 1 as above if you need a 0-based one):
import pyspark.sql.functions as F

df = input_df.withColumn("poss", F.expr("locate(subtext, text, 1)"))

Groupby one dataframe based on tags in second dataframe

I have a DataFrame (actually 105k DataFrames) representing a SWOT table (it should be a 4x4 grid containing strings). The problem is that not all frames in the total set (105k) have the same shape, and the position of the SWOT elements also varies. My approach is to create a copy of each frame, then look for specific strings in the original frame and add tags to the copy, plus some forward filling.
Now I have the original frames and the copies (same size, same indices, and same column names), where the original frame has the strings I want to group and the copy is basically a mask (not boolean, but a mask of tags).
When looping over the whole set, I merge the original frame with the copy based on the index and get a merged frame that has columns from both frames with _x and _y suffixes. How can I concatenate the strings in the _y columns that share the same tag in the corresponding _x column into a new DataFrame?
+---+-------+-------------+-------+-------------+-----+-----+
| | 0_x | 0_y | 1_x | 1_y | n_x | n_y |
+---+-------+-------------+-------+-------------+-----+-----+
| 0 | tag_1 | some_string | tag_2 | some_string | | |
| 1 | tag_1 | some_string | tag_2 | some_string | | |
| 2 | tag_2 | some_string | tag_3 | some_string | | |
| n | tag_2 | some_string | tag_3 | some_string | | |
+---+-------+-------------+-------+-------------+-----+-----+
Here is my code so far. Basically it does what I want for pairs of columns (n_x, n_y), but it does not work if I have the same tag in more than one column (say in 0_x and 1_x). I could do a second iteration over columns a and b and do the same as in the first case (see the sketch after the code), but is there a more efficient way?
df_tmp contains the tags and df_orig contains the strings.
import pandas as pd

df_merged = pd.merge(df_tmp, df_orig, left_index=True, right_index=True, validate="1:1")
df_merged = df_merged.applymap(str)
columns_list = df_merged.columns.to_list()
x_columns = [elem for elem in columns_list if elem.endswith("x")]
y_columns = [elem for elem in columns_list if elem.endswith("y")]
df_collect = pd.DataFrame()
for col_x, col_y in zip(x_columns, y_columns):
    df_final = df_merged.groupby([col_x])[col_y].apply(','.join).reset_index()
    df_final = df_final.rename(columns={col_x: 'a', col_y: 'b'})
    df_collect = df_collect.append(df_final, ignore_index=True)
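The second iteration mentioned above amounts to one more groupby over the collected frame; a minimal sketch, assuming df_collect from the code above (df_result is a name introduced only for this sketch):
# Collapse rows that ended up with the same tag across different column pairs
# by grouping the collected frame once more and re-joining the strings.
df_result = (
    df_collect.groupby("a")["b"]
    .apply(",".join)
    .reset_index()
)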

How to combine two columns into one in Sqlite and also get the underlying value of the Foreign Key?

I want to be able to combine two columns from a table into one column and then be able to get the actual value of the foreign keys. I can do these things individually, but not together.
Following the answer linked below, I was able to combine the two columns into one using the first SQL statement below.
How to combine 2 columns into a new one in sqlite
The combining process is shown below:
+---+---+
|HT | AT|
+---+---+
|1 | 2 |
|5 | 7 |
|9 | 5 |
+---+---+
into one column as shown:
+---+
|HT |
+---+
| 1 |
| 5 |
| 9 |
| 2 |
| 7 |
| 5 |
+---+
The second SQL statement shows the actual value corresponding to each foreign key id. The foreign key table:
+-----+------------------------+
|T_id | TN                     |
+-----+------------------------+
| 1   | 'Dallas Cowboys'       |
| 2   | 'Chicago Bears'        |
| 5   | 'New England Patriots' |
| 7   | 'New York Giants'      |
| 9   | 'New York Jets'        |
+-----+------------------------+
sql = "SELECT * FROM (SELECT M.HT FROM M UNION SELECT M.AT FROM Match)t"
The second sql statement lets me get the foreign key values for each value in M.HT.
sql = "SELECT M.HT, T.TN FROM M INNER JOIN T ON M.HT = T.Tid WHERE strftime('%Y-%m-%d', M.ST) BETWEEN \'2015-08-01\' AND \'2016-06-30\' AND M.Comp = 6 ORDER BY M.ST"
Result of second SQL statement:
+-----+------------------------+
| HT  | TN                     |
+-----+------------------------+
| 1   | 'Dallas Cowboys'       |
| 5   | 'New England Patriots' |
| 9   | 'New York Jets'        |
+-----+------------------------+
But try as I might I have not been able to combine these queries!
I believe the following will work (assuming that the tables are Match and T, and barring the WHERE and ORDER BY clauses for brevity/ease) :-
SELECT DISTINCT(m.ht), t.tn
FROM
(SELECT Match.HT FROM Match UNION SELECT Match.AT FROM Match) AS m
JOIN T ON t.tid = m.ht
JOIN Match ON (m.ht = Match.ht OR m.ht = Match.at)
/* WHERE and ORDER BY clauses using Match, as m only has columns ht and at */
WHERE strftime('%Y-%m-%d', Match.ST)
BETWEEN '2015-08-01' AND '2016-06-30' AND Match.Comp = 6
ORDER BY Match.ST
;
Note: only tested without the WHERE and ORDER BY clauses.
That is using :-
DROP TABLE IF EXISTS Match;
DROP TABLE IF EXISTS T;
CREATE TABLE IF NOT EXISTS Match (ht INTEGER, at INTEGER, st TEXT DEFAULT (datetime('now')));
CREATE TABLE IF NOT EXISTS t (tid INTEGER PRIMARY KEY, tn TEXT);
INSERT INTO T (tn) VALUES('Cows'),('Bears'),('a'),('b'),('Pats'),('c'),('Giants'),('d'),('Jets');
INSERT INTO Match (ht,at) VALUES (1,2),(5,7),(9,5);
/* Directly without the Common Table Expression */
SELECT
DISTINCT(m.ht), t.tn,
Match.st /*<<<<< Added to show results of obtaining other values from Matches >>>>> */
FROM
(SELECT Match.HT FROM Match UNION SELECT Match.AT FROM Match) AS m
JOIN T ON t.tid = m.ht
JOIN Match ON (m.ht = Match.ht OR m.ht = Match.at)
/* WHERE and ORDER BY clauses here using Match */
;
Noting that limited data (just the one extra column) was used for brevity
Results in :-

Spark: Conditionally Joining/Concatting Columns Based Leading Characters

I've got a data set with unclean data that has been split incorrectly. This results in an uneven number of columns: the number of columns per row depends on the number of errors arising from one field. You know a column is incorrect if it has 3 leading double quotes, and if it does, you want to join it with the previous column and shift left.
I import the csv of the data into a dataframe, which creates something similar to the example below.
Example:
INPUT:
+---+--------+----------+----------+---------+
|id | detail | context  | _c3      | _c4     |
+---+--------+----------+----------+---------+
| 1 | {blah} | service  | null     | null    |
| 2 | { blah | """ blah | """blah} | service |
| 3 | { blah | """blah} | service  | null    |
+---+--------+----------+----------+---------+
DESIRED OUTPUT:
+---+------------------------+---------+
|id | detail                 | context |
+---+------------------------+---------+
| 1 | {blah}                 | service |
| 2 | { blah""" blah"""blah} | service |
| 3 | { blah"""blah}         | service |
+---+------------------------+---------+
I've tried something like the following, as well as a bunch of other approaches:
df.filter(col("context").startsWith("\"\"\"")).select($"detail", lit(" "), $"context").collect()
This doesn't work, and doesn't fully do what I need it to do. Any ideas? Help is much appreciated :)
Thanks!
I think the easiest way to fix this, would be to put the columns back together, and then parse them correctly. One way to do this is use concat to combine all the columns, then use regexp_extract to pull out the pieces you want as individual columns. For example:
import org.apache.spark.sql.functions.{col, concat, regexp_extract}
import spark.implicits._ // already in scope in spark-shell

case class MyRow(id: Int, detail: String, context: String, _c3: String, _c4: String)
val data = Seq(
  MyRow(1, "{blah}", "service", "", ""),
  MyRow(2, "{ blah", " \"\"\" blah", " \"\"\"blah}", "service"),
  MyRow(3, "{ blah", "\"\"\"blah}", "service", "")
)
val df = sc.parallelize(data).toDF
val columns = df.columns.filterNot(_ == "id")
val nonulls = df.na.fill("")
val combined = nonulls.select($"id", concat(columns.map(col): _*) as "data")
val fixed = combined.withColumn("left", regexp_extract($"data", "(\\{.*\\})", 1)).
  withColumn("right", regexp_extract($"data", "([^}]+$)", 1))
fixed.show(10, false)
Which should output:
+---+-------------------------------+------------------------+-------+
|id |data |left |right |
+---+-------------------------------+------------------------+-------+
|1 |{blah}service |{blah} |service|
|2 |{ blah """ blah """blah}service|{ blah """ blah """blah}|service|
|3 |{ blah"""blah}service |{ blah"""blah} |service|
+---+-------------------------------+------------------------+-------+
In the code above I'm assuming that the columns are already in the right order.
This is just splitting on the last }. If you need more complicated parsing, you can write a UDF that parses it however you want and returns a tuple of fields.
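For instance, a minimal sketch of that UDF approach, reusing the same "split on the last }" rule from above; the parse logic inside the UDF is only illustrative:
import org.apache.spark.sql.functions.udf

// Illustrative rule only: everything up to and including the last '}' is the
// detail, the remainder is the context.
val parseData = udf { s: String =>
  val idx = s.lastIndexOf("}")
  if (idx >= 0) (s.substring(0, idx + 1), s.substring(idx + 1)) else (s, "")
}

combined
  .withColumn("parsed", parseData($"data"))
  .select($"id", $"parsed._1".as("detail"), $"parsed._2".as("context"))
  .show(10, false)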
