Perform NGram on Spark DataFrame - apache-spark

I'm using Spark 2.3.1 and have a Spark DataFrame like this:
+----------+
| values|
+----------+
|embodiment|
| present|
| invention|
| include|
| pairing|
| two|
| wireless|
| device|
| placing|
| least|
| one|
| two|
+----------+
I want to apply the Spark ML NGram feature transformer like this:
from pyspark.ml.feature import NGram
bigram = NGram(n=2, inputCol="values", outputCol="bigrams")
bigramDataFrame = bigram.transform(tokenized_df)
The following error occurred on the line bigramDataFrame = bigram.transform(tokenized_df):
pyspark.sql.utils.IllegalArgumentException: 'requirement failed: Input type must be ArrayType(StringType) but got StringType.'
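So I changed my code to wrap each value in an array:
from pyspark.sql.functions import array
df_new = tokenized_df.withColumn("testing", array(tokenized_df["values"]))
# inputCol must point at the array column, otherwise the same type error is raised
bigram = NGram(n=2, inputCol="testing", outputCol="bigrams")
bigramDataFrame = bigram.transform(df_new)
bigramDataFrame.show()
This gave me the following final DataFrame: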
+----------+------------+-------+
| values| testing|bigrams|
+----------+------------+-------+
|embodiment|[embodiment]| []|
| present| [present]| []|
| invention| [invention]| []|
| include| [include]| []|
| pairing| [pairing]| []|
| two| [two]| []|
| wireless| [wireless]| []|
| device| [device]| []|
| placing| [placing]| []|
| least| [least]| []|
| one| [one]| []|
| two| [two]| []|
+----------+------------+-------+
Why is my bigrams column empty?
I want the bigrams column to look like this:
+--------------------+
|bigrams             |
+--------------------+
|embodiment present  |
|present invention   |
|invention include   |
|include pairing     |
|pairing two         |
|two wireless        |
|wireless device     |
|device placing      |
|placing least       |
|least one           |
|one two             |
+--------------------+

Your bigrams column is empty because no row of your 'values' column contains a bi-gram.
If the values in your input DataFrame looked like this:
+--------------------------------------------+
|values |
+--------------------------------------------+
|embodiment present invention include pairing|
|two wireless device placing |
|least one two |
+--------------------------------------------+
Then you can get the output in bi-grams as below:
+--------------------------------------------+--------------------------------------------------+---------------------------------------------------------------------------+
|values |testing |ngrams |
+--------------------------------------------+--------------------------------------------------+---------------------------------------------------------------------------+
|embodiment present invention include pairing|[embodiment, present, invention, include, pairing]|[embodiment present, present invention, invention include, include pairing]|
|two wireless device placing |[two, wireless, device, placing] |[two wireless, wireless device, device placing] |
|least one two |[least, one, two] |[least one, one two] |
+--------------------------------------------+--------------------------------------------------+---------------------------------------------------------------------------+
The Scala Spark code to do this is:
import org.apache.spark.ml.feature.NGram
import org.apache.spark.sql.functions.split
val df_new = df.withColumn("testing", split(df("values")," "))
val ngram = new NGram().setN(2).setInputCol("testing").setOutputCol("ngrams")
val ngramDataFrame = ngram.transform(df_new)
A bi-gram is a sequence of two adjacent elements from a string of
tokens, which are typically letters, syllables, or words.
But in your input data frame, you have only one token in each row, hence you are not getting any bi-grams out of it.
So, for your question, you can do something like this:
Input: df1
+----------+
|values |
+----------+
|embodiment|
|present |
|invention |
|include |
|pairing |
|two |
|wireless |
|device |
|placing |
|least |
|one |
|two |
+----------+
Output: ngramDataFrameInRows
+------------------+
|ngrams |
+------------------+
|embodiment present|
|present invention |
|invention include |
|include pairing |
|pairing two |
|two wireless |
|wireless device |
|device placing |
|placing least |
|least one |
|one two |
+------------------+
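Spark Scala code:
import org.apache.spark.ml.feature.NGram
import org.apache.spark.sql.functions.{col, collect_list, explode}
// Gather every token into one array row; collect_list does not guarantee row order after a shuffle
val df_new = df1.agg(collect_list("values").alias("testing"))
val ngram = new NGram().setN(2).setInputCol("testing").setOutputCol("ngrams")
val ngramDataFrame = ngram.transform(df_new)
// Turn the array of bi-grams back into one row per bi-gram
val ngramDataFrameInRows = ngramDataFrame.select(explode(col("ngrams")).alias("ngrams"))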
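To make the difference concrete, here is a rough plain-Python sketch of the sliding-window idea (not Spark code; the helper name ngrams is made up for illustration). It shows why a one-token array yields no bi-grams, while collecting all tokens into one list first yields the eleven pairs from the question.

```python
def ngrams(tokens, n=2):
    """Return space-joined n-grams from a list of tokens, in order."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


values = ["embodiment", "present", "invention", "include", "pairing", "two",
          "wireless", "device", "placing", "least", "one", "two"]

# One token per row: each single-element list is too short to hold a bi-gram.
per_row = [ngrams([v]) for v in values]

# All tokens collected into one list first: adjacent pairs become bi-grams.
collected = ngrams(values)

print(per_row[0])
print(collected[:3])
```

The first case corresponds to wrapping each value with array(...), the second to the collect_list approach above.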

Related

Geospark IllegalArgumentException: Number of partitions must be >= 0

I'm trying to run a simple intersect on a couple of tables with geometries and get this error.
IllegalArgumentException: Number of partitions must be >= 0
My script:
tableA.
  join(tableB, expr("ST_Intersects(geom, point)")).
  show
This is table A. It has a few million rows.
spark.table("ta").
  withColumn("point", expr("ST_Point(CAST(lon AS Decimal(24,20)), CAST(lat AS Decimal(24,20)))"))
And the result.
+-----------+-----------+--------------------+
| lat| lon| point|
+-----------+-----------+--------------------+
| 44.978577| 30.172431|POINT (30.172431 ...|
| 44.707343| 30.794019|POINT (30.794019 ...|
| 44.817301| 30.704576|POINT (30.704576 ...|
| 44.710767| 30.657547|POINT (30.657547 ...|
| 44.88699| 30.521111|POINT (30.521111 ...|
| 44.779| 30.6296|POINT (30.6296 55...|
| 44.653987| 30.572032|POINT (30.572032 ...|
| 44.763931| 30.601646|POINT (30.601646 ...|
|44.44440079|30.50870132|POINT (30.5087013...|
| 44.707493| 30.575095|POINT (30.575095 ...|
| 44.566665| 30.56598|POINT (30.56598 5...|
| 44.58322| 30.209977|POINT (30.209977 ...|
| 44.687525| 30.665842|POINT (30.665842 ...|
|44.90000153|30.62870026|POINT (30.6287002...|
| 44.85094| 30.560021|POINT (30.560021 ...|
| 44.83429| 30.49514|POINT (30.49514 5...|
| 44.740523| 30.890627|POINT (30.890627 ...|
| 44.544804| 30.328373|POINT (30.328373 ...|
| 44.46986| 30.5456|POINT (30.5456 55...|
| 44.8912| 30.6089|POINT (30.6089 55...|
+-----------+-----------+--------------------+
This is table B. It has only 1 row.
spark.table("tb").
  withColumn("geom", expr("ST_GeomFromWKT(wkt)"))
And what show gives me.
+--------------------+--------------------+
| wkt| geom|
+--------------------+--------------------+
|MULTIPOLYGON (((3...|MULTIPOLYGON (((3...|
+--------------------+--------------------+
What's with this error? How do I fix it?
I had the argument order wrong. According to the docs, the signature is:
boolean ST_Intersects( geometry geomA , geometry geomB )
Changing to expr("ST_Intersects(point, geom)") solved it.

Mapping column from arrays in Pyspark

I'm new to working with PySpark DataFrames that store arrays in columns, and I'm looking for help mapping a column based on two PySpark DataFrames, one of which is a reference DataFrame.
Reference DataFrame (the number of Subgroups varies for each Group):
| Group | Subgroup | Size | Type |
| ---- | -------- | ------------------| --------------- |
|A | A1 |['Small','Medium'] | ['A','B'] |
|A | A2 |['Small','Medium'] | ['C','D'] |
|B | B1 |['Small'] | ['A','B','C','D']|
Source Dataframe:
| ID | Size | Type |
| ---- | -------- | ---------|
|ID_001 | 'Small' |'A' |
|ID_002 | 'Medium' |'B' |
|ID_003 | 'Small' |'D' |
In the result, each ID belongs to every Group but to only one of that Group's Subgroups, based on the reference DataFrame. The result should look something like this:
| ID | Size | Type | A_Subgroup | B_Subgroup |
| ---- | -------- | ---------| ---------- | ------------- |
|ID_001 | 'Small' |'A' | 'A1' | 'B1' |
|ID_002 | 'Medium' |'B' | 'A1' | Null |
|ID_003 | 'Small' |'D' | 'A2' | 'B1' |
You can do a join using array_contains conditions, and pivot the result:
import pyspark.sql.functions as F
result = source.alias('source').join(
    ref.alias('ref'),
    F.expr("""
        array_contains(ref.Size, source.Size) and
        array_contains(ref.Type, source.Type)
    """),
    'left'
).groupBy(
    'ID', source['Size'], source['Type']
).pivot('Group').agg(F.first('Subgroup'))
result.show()
+------+------+----+---+----+
| ID| Size|Type| A| B|
+------+------+----+---+----+
|ID_003| Small| D| A2| B1|
|ID_002|Medium| B| A1|null|
|ID_001| Small| A| A1| B1|
+------+------+----+---+----+
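For anyone unsure what the join condition and the pivot are doing, here is a small plain-Python sketch of the same matching logic (not PySpark; map_subgroups is a hypothetical helper, and the rows are the sample data from the question). Each source row is checked against every reference row, and the first matching Subgroup is kept per Group.

```python
ref = [
    {"Group": "A", "Subgroup": "A1", "Size": ["Small", "Medium"], "Type": ["A", "B"]},
    {"Group": "A", "Subgroup": "A2", "Size": ["Small", "Medium"], "Type": ["C", "D"]},
    {"Group": "B", "Subgroup": "B1", "Size": ["Small"], "Type": ["A", "B", "C", "D"]},
]
source = [
    {"ID": "ID_001", "Size": "Small", "Type": "A"},
    {"ID": "ID_002", "Size": "Medium", "Type": "B"},
    {"ID": "ID_003", "Size": "Small", "Type": "D"},
]


def map_subgroups(source_rows, ref_rows):
    """Attach one column per Group holding the first matching Subgroup, or None."""
    groups = sorted({r["Group"] for r in ref_rows})
    result = []
    for s in source_rows:
        row = dict(s, **{g: None for g in groups})
        for r in ref_rows:
            # Same test as the two array_contains conditions in the join.
            matches = s["Size"] in r["Size"] and s["Type"] in r["Type"]
            if matches and row[r["Group"]] is None:
                row[r["Group"]] = r["Subgroup"]
        result.append(row)
    return result


mapped = map_subgroups(source, ref)
for row in mapped:
    print(row)
```

The None for ID_002 in Group B plays the role of the null produced by the left join.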

pyspark pivot without aggregation

I am looking to essentially pivot without requiring an aggregation at the end, to keep the DataFrame intact and not create a grouped object.
As an example, I have this:
+---------+------------+--------+-----+
| country | code       | Value  | ids |
+---------+------------+--------+-----+
| Mexico  | food_1_3   | apple  | 1   |
| Mexico  | food_1_3   | banana | 2   |
| Canada  | beverage_2 | milk   | 1   |
| Mexico  | beverage_2 | water  | 2   |
+---------+------------+--------+-----+
I need this:
+---------+----+----------+------------+
| country | id | food_1_3 | beverage_2 |
+---------+----+----------+------------+
| Mexico  | 1  | apple    |            |
| Mexico  | 2  | banana   | water      |
| Canada  | 1  |          | milk       |
+---------+----+----------+------------+
I have tried
(df.groupby(df.country, df.id).pivot("code").agg(first('Value').alias('Value')))
but I essentially just get a top 1. In my real case I have 20 columns, some with just integers and others with strings, so sums, counts, and collect_list have not worked out.
That's because your 'ids' values are not unique. Add a unique index column and that should work:
import pyspark.sql.functions as F
pivoted = df.groupby(
    df.country, df.ids, F.monotonically_increasing_id().alias('index')
).pivot("code").agg(F.first('Value').alias('Value')).drop('index')
pivoted.show()
+-------+---+----------+--------+
|country|ids|beverage_2|food_1_3|
+-------+---+----------+--------+
| Mexico| 1| null| apple|
| Mexico| 2| water| null|
| Canada| 1| milk| null|
| Mexico| 2| null| banana|
+-------+---+----------+--------+
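As a rough illustration of what the extra index changes, here is a plain-Python model of "group, pivot, take first" (not PySpark; pivot_first is a made-up helper and the rows are the sample data from the question). Without the index, rows sharing a (country, ids) key merge into one group; with it, every input row stays in its own group.

```python
rows = [
    ("Mexico", "food_1_3", "apple", 1),
    ("Mexico", "food_1_3", "banana", 2),
    ("Canada", "beverage_2", "milk", 1),
    ("Mexico", "beverage_2", "water", 2),
]


def pivot_first(rows, with_index=False):
    """Group rows, then keep the first Value seen for each (group, code) cell."""
    table = {}
    for index, (country, code, value, ids) in enumerate(rows):
        # The row position stands in for monotonically_increasing_id().
        key = (country, ids, index) if with_index else (country, ids)
        table.setdefault(key, {}).setdefault(code, value)
    return table


grouped = pivot_first(rows)
indexed = pivot_first(rows, with_index=True)

print(len(grouped), "groups without the index")
print(len(indexed), "groups with the index")
```

With the index, the number of output rows equals the number of input rows, which matches the four rows shown above.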

Optimize spark dataframe operation

I have a Spark (version 2.4) DataFrame with the following pattern:
+----------+
| ColumnA |
+----------+
| 1000#Cat |
| 1001#Dog |
| 1000#Cat |
| 1001#Dog |
| 1001#Dog |
+----------+
I conditionally apply a regex to remove the number that is prepended to the string, using the following code:
dataset.withColumn("ColumnA",when(regexp_extract(dataset.col("ColumnA"), "\\#(.*)", 1)
.equalTo(""), dataset.col("ColumnA"))
.otherwise(regexp_extract(dataset.col("ColumnA"), "\\#(.*)", 1)));
which results in a DataFrame in the following format:
+---------+
| ColumnA |
+---------+
| Cat |
| Dog |
| Cat |
| Dog |
| Dog |
+---------+
This runs correctly and produces the desired output.
However, the regexp_extract operation is applied twice: once to check whether the returned string is empty and, if it is not, again to produce the column value.
Is there any optimization that can make this code perform better?
Use the split function instead of regexp_extract.
Please check the code below, with execution times:
scala> df.show(false)
+--------+
|columna |
+--------+
|1000#Cat|
|1001#Dog|
|1000#Cat|
|1001#Dog|
|1001#Dog|
+--------+
scala> spark.time(df.withColumn("parsed",split($"columna","#")(1)).show(false))
+--------+------+
|columna |parsed|
+--------+------+
|1000#Cat|Cat |
|1001#Dog|Dog |
|1000#Cat|Cat |
|1001#Dog|Dog |
|1001#Dog|Dog |
+--------+------+
Time taken: 14 ms
scala> spark.time { df.withColumn("ColumnA",when(regexp_extract($"columna", "\\#(.*)", 1).equalTo(""), $"columna").otherwise(regexp_extract($"columna", "\\#(.*)", 1))).show(false) }
+-------+
|ColumnA|
+-------+
|Cat |
|Dog |
|Cat |
|Dog |
|Dog |
+-------+
Time taken: 22 ms
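You can also use the contains function to check whether the column value has a #. Note that otherwise("") returns an empty string for values without a #; use otherwise($"columna") to keep the original value as in the question: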
scala> spark.time(df.withColumn("parsed",when($"columna".contains("#"), lit(split($"columna","#")(1))).otherwise("")).show(false))
+--------+------+
|columna |parsed|
+--------+------+
|1000#Cat|Cat |
|1001#Dog|Dog |
|1000#Cat|Cat |
|1001#Dog|Dog |
|1001#Dog|Dog |
+--------+------+
Time taken: 14 ms
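The rule the question is after can be sketched in plain Python (not Spark; strip_prefix is a hypothetical helper). It splits once on the first # and falls back to the original value when there is no # or nothing after it, which mirrors the when/otherwise logic in the question rather than split(...)(1).

```python
def strip_prefix(value, sep="#"):
    """Return the text after the first sep, or value unchanged if nothing follows."""
    _, found, rest = value.partition(sep)
    # Keep the original when sep is missing or the remainder is empty,
    # like the equalTo("") fallback in the question.
    return rest if found and rest else value


column_a = ["1000#Cat", "1001#Dog", "1000#Cat", "1001#Dog", "1001#Dog"]
parsed = [strip_prefix(v) for v in column_a]
print(parsed)
```

The point is the same as in the answer: one pass over the string is enough, so there is no need to run the extraction twice.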

SparkSQL Get all prefixes of a word

Say I have a column in a SparkSQL DataFrame like this:
+-------+
| word |
+-------+
| chair |
| lamp |
| table |
+-------+
I want to explode out all the prefixes like so:
+--------+
| prefix |
+--------+
| c |
| ch |
| cha |
| chai |
| chair |
| l |
| la |
| lam |
| lamp |
| t |
| ta |
| tab |
| tabl |
| table |
+--------+
Is there a good way to do this WITHOUT using udfs, or functional programming methods such as flatMap in spark sql? (I'm talking about a solution using the codegen optimal functions in org.apache.spark.sql.functions._)
Technically it is possible but I doubt it will perform any better than a simple flatMap (if performance is the reason to avoid flatMap):
// sequence is available from Spark 2.4 onward
val df = Seq("chair", "lamp", "table").toDF("word")
df.withColumn("len", explode(sequence(lit(1), length($"word"))))
  .select($"word".substr(lit(1), $"len") as "prefix")
  .show()
Output:
+------+
|prefix|
+------+
| c|
| ch|
| cha|
| chai|
| chair|
| l|
| la|
| lam|
| lamp|
| t|
| ta|
| tab|
| tabl|
| table|
+------+
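The same idea in plain Python, as a minimal sketch (not Spark; prefixes is a made-up helper): generate the lengths 1 to len(word), take the substring for each length, then flatten across words, which is what sequence, substr and explode do together above.

```python
def prefixes(word):
    """Return every non-empty prefix of word, shortest first."""
    return [word[:n] for n in range(1, len(word) + 1)]


words = ["chair", "lamp", "table"]

# Flatten the per-word prefix lists, like explode does for array columns.
exploded = [p for w in words for p in prefixes(w)]
print(exploded)
```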
