SparkSQL Get all prefixes of a word - apache-spark

Say I have a column in a SparkSQL DataFrame like this:
+-------+
| word |
+-------+
| chair |
| lamp |
| table |
+-------+
I want to explode out all the prefixes like so:
+--------+
| prefix |
+--------+
| c |
| ch |
| cha |
| chai |
| chair |
| l |
| la |
| lam |
| lamp |
| t |
| ta |
| tab |
| tabl |
| table |
+--------+
Is there a good way to do this WITHOUT using UDFs or functional programming methods such as flatMap in Spark SQL? (I'm talking about a solution using the codegen-optimized functions in org.apache.spark.sql.functions._.)

Technically it is possible, but I doubt it will perform any better than a simple flatMap (if performance is the reason to avoid flatMap):
import org.apache.spark.sql.functions._  // explode, sequence, lit, length
import spark.implicits._                  // $"..." syntax and toDF

val df = Seq("chair", "lamp", "table").toDF("word")
df.withColumn("len", explode(sequence(lit(1), length($"word"))))
  .select($"word".substr(lit(1), $"len") as "prefix")
  .show()
Output:
+------+
|prefix|
+------+
| c|
| ch|
| cha|
| chai|
| chair|
| l|
| la|
| lam|
| lamp|
| t|
| ta|
| tab|
| tabl|
| table|
+------+
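For comparison, the flatMap version alluded to above could look roughly like this (a sketch, assuming the same df and spark.implicits._ in scope):
// Dataset flatMap alternative (sketch): emit every prefix of each word
val prefixes = df.as[String]
  .flatMap(w => (1 to w.length).map(w.substring(0, _)))
  .toDF("prefix")
prefixes.show()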

Related

Mapping column from arrays in Pyspark

I'm new to working with PySpark DataFrames that have arrays stored in columns, and I'm looking for some help mapping a column based on two PySpark DataFrames, one of which is a reference df.
Reference Dataframe (Number of Subgroups varies for each Group):
| Group | Subgroup | Size               | Type              |
| ----- | -------- | ------------------ | ----------------- |
| A     | A1       | ['Small','Medium'] | ['A','B']         |
| A     | A2       | ['Small','Medium'] | ['C','D']         |
| B     | B1       | ['Small']          | ['A','B','C','D'] |
Source Dataframe:
| ID     | Size     | Type |
| ------ | -------- | ---- |
| ID_001 | 'Small'  | 'A'  |
| ID_002 | 'Medium' | 'B'  |
| ID_003 | 'Small'  | 'D'  |
In the result, each ID belongs to every Group, but is exclusive to its subgroups based on the reference df, with the result looking something like this:
| ID     | Size     | Type | A_Subgroup | B_Subgroup |
| ------ | -------- | ---- | ---------- | ---------- |
| ID_001 | 'Small'  | 'A'  | 'A1'       | 'B1'       |
| ID_002 | 'Medium' | 'B'  | 'A1'       | Null       |
| ID_003 | 'Small'  | 'D'  | 'A2'       | 'B1'       |
You can do a join using array_contains conditions, and pivot the result:
import pyspark.sql.functions as F

result = source.alias('source').join(
    ref.alias('ref'),
    F.expr("""
        array_contains(ref.Size, source.Size) and
        array_contains(ref.Type, source.Type)
    """),
    'left'
).groupBy(
    'ID', source['Size'], source['Type']
).pivot('Group').agg(F.first('Subgroup'))

result.show()
+------+------+----+---+----+
| ID| Size|Type| A| B|
+------+------+----+---+----+
|ID_003| Small| D| A2| B1|
|ID_002|Medium| B| A1|null|
|ID_001| Small| A| A1| B1|
+------+------+----+---+----+
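Note that pivot('Group') names the output columns A and B, as in the show() output above; if the A_Subgroup / B_Subgroup names from the expected result are needed, the pivoted columns can be renamed afterwards, for example with withColumnRenamed.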

Optimize spark dataframe operation

I have a Spark (version 2.4) DataFrame of the following pattern:
+----------+
| ColumnA |
+----------+
| 1000#Cat |
| 1001#Dog |
| 1000#Cat |
| 1001#Dog |
| 1001#Dog |
+----------+
I am conditionally removing the numeric prefix from the string with a regex, using the following code:
dataset.withColumn("ColumnA",
    when(regexp_extract(dataset.col("ColumnA"), "\\#(.*)", 1).equalTo(""),
        dataset.col("ColumnA"))
    .otherwise(regexp_extract(dataset.col("ColumnA"), "\\#(.*)", 1)));
which results in a DataFrame of the following format:
+---------+
| ColumnA |
+---------+
| Cat |
| Dog |
| Cat |
| Dog |
| Dog |
+---------+
This runs correctly and produces the desired output.
However, the regexp_extract operation is applied twice: once to check whether the extracted string is empty, and if it is not, again to extract the value from the column.
Is there any optimization that can be done on this code to make it perform better?
Use the split function instead of regexp_extract.
Please check the code below, with execution times:
scala> df.show(false)
+--------+
|columna |
+--------+
|1000#Cat|
|1001#Dog|
|1000#Cat|
|1001#Dog|
|1001#Dog|
+--------+
scala> spark.time(df.withColumn("parsed",split($"columna","#")(1)).show(false))
+--------+------+
|columna |parsed|
+--------+------+
|1000#Cat|Cat |
|1001#Dog|Dog |
|1000#Cat|Cat |
|1001#Dog|Dog |
|1001#Dog|Dog |
+--------+------+
Time taken: 14 ms
scala> spark.time { df.withColumn("ColumnA",when(regexp_extract($"columna", "\\#(.*)", 1).equalTo(""), $"columna").otherwise(regexp_extract($"columna", "\\#(.*)", 1))).show(false) }
+-------+
|ColumnA|
+-------+
|Cat |
|Dog |
|Cat |
|Dog |
|Dog |
+-------+
Time taken: 22 ms
Using the contains function to check for a # in the column:
scala> spark.time(df.withColumn("parsed",when($"columna".contains("#"), lit(split($"columna","#")(1))).otherwise("")).show(false))
+--------+------+
|columna |parsed|
+--------+------+
|1000#Cat|Cat |
|1001#Dog|Dog |
|1000#Cat|Cat |
|1001#Dog|Dog |
|1001#Dog|Dog |
+--------+------+
Time taken: 14 ms
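One difference worth noting: the contains variant above falls back to an empty string when there is no #, whereas the original code keeps the whole value. A sketch that preserves the original behaviour while still using split:
scala> df.withColumn("parsed", when($"columna".contains("#"), split($"columna", "#")(1)).otherwise($"columna")).show(false)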

Sorting a dataset with concatenated string by splitting as well as a single string for column in spark

I have a column which looks as below. Its data is concatenated with :, some values may be null, and some are the string N/A.
Sample data:
+--------+
| column |
+--------+
| 11:88  |
| 1:45   |
| 17:456 |
| 2:45   |
| 11:9   |
| N/A    |
+--------+
I want to order it ascending based on the first element before the :. Nulls and N/A should come first, followed by the numerical order of the first element, as below.
Required data in ascending order:
+--------+
| column |
+--------+
| N/A    |
| 1:45   |
| 2:45   |
| 11:88  |
| 11:9   |
| 17:456 |
+--------+
Similarly, when in descending order:
+--------+
| column |
+--------+
| 17:456 |
| 11:9   |
| 11:88  |
| 2:45   |
| 1:45   |
| N/A    |
+--------+
scala> val df3 = Seq("11:88","1:45","17:456","2:45","11:9","N/A").toDF("column")
scala> df3.show
+------+
|column|
+------+
| 11:88|
| 1:45|
|17:456|
| 2:45|
| 11:9|
| N/A|
+------+
scala> df3.withColumn("NEW",regexp_replace(col("column"),":",".").cast("Double")).orderBy(col("NEW").desc).drop("NEW").show
+------+
|column|
+------+
|17:456|
| 11:9|
| 11:88|
| 2:45|
| 1:45|
| N/A|
+------+
scala> df3.withColumn("NEW",regexp_replace(col("column"),":",".").cast("Double")).orderBy(col("NEW").asc).drop("NEW").show
+------+
|column|
+------+
| N/A|
| 1:45|
| 2:45|
| 11:9|
| 11:88|
|17:456|
+------+
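An alternative sketch that keys the sort directly on the first element before the : (cast to int, so N/A becomes null and sorts first ascending), with the second part as a tiebreaker:
scala> df3.orderBy(split(col("column"), ":")(0).cast("int").asc, split(col("column"), ":")(1).asc).show
For the descending order, use .desc on both columns instead.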
Please let me know if it helps you.

Pandas sort not sorting data properly

I am trying to sort the results of sklearn.ensemble.RandomForestRegressor's feature_importances_.
I have the following function:
def get_feature_importances(cols, importances):
    feats = {}
    for feature, importance in zip(cols, importances):
        feats[feature] = importance
    importances = pd.DataFrame.from_dict(feats, orient='index').rename(columns={0: 'Gini-importance'})
    importances.sort_values(by='Gini-importance')
    return importances
I use it like so:
importances = get_feature_importances(X_test.columns, rf.feature_importances_)
print()
print(importances)
And I get the following results:
| PART | 0.035034 |
| MONTH1 | 0.02507 |
| YEAR1 | 0.020075 |
| MONTH2 | 0.02321 |
| YEAR2 | 0.017861 |
| MONTH3 | 0.042606 |
| YEAR3 | 0.028508 |
| DAYS | 0.047603 |
| MEDIANDIFF | 0.037696 |
| F2 | 0.008783 |
| F1 | 0.015764 |
| F6 | 0.017933 |
| F4 | 0.017511 |
| F5 | 0.017799 |
| SS22 | 0.010521 |
| SS21 | 0.003896 |
| SS19 | 0.003894 |
| SS23 | 0.005249 |
| SS20 | 0.005127 |
| RR | 0.021626 |
| HI_HOURS | 0.067584 |
| OI_HOURS | 0.054369 |
| MI_HOURS | 0.062121 |
| PERFORMANCE_FACTOR | 0.033572 |
| PERFORMANCE_INDEX | 0.073884 |
| NUMPA | 0.022445 |
| BUMPA | 0.024192 |
| ELOH | 0.04386 |
| FFX1 | 0.128367 |
| FFX2 | 0.083839 |
I thought the line importances.sort_values(by='Gini-importance') would sort them, but it does not. Why is this not working correctly?
importances.sort_values(by='Gini-importance') returns the sorted DataFrame, but that returned value is discarded by your function.
You want return importances.sort_values(by='Gini-importance').
Or you could make sort_values inplace:
importances.sort_values(by='Gini-importance', inplace=True)
return importances

Calc/Spreadsheet: combine/assign/migrate cols

Sorry, I can't use the right terms, but I'll try to explain my task:
In Calc or another spreadsheet program, I have two worksheets with columns like this:
| ID|
| 32|
| 51|
| 51|
| 63|
| 70|
and
| ID|Name |
| 01|name1 |
| 02|name2 |
...
| 69|name69 |
| 70|name70 |
I need to combine/assign/migrate these together, like:
| ID|Name |
| 32|name32 |
| 51|name51 |
| 51|name51 |
| 63|name63 |
| 70|name70 |
I have no idea how I can start to solve it. Please help!
Thank you @PsysicalChemist, the VLOOKUP function is working in Calc too.
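For reference, a minimal Calc formula sketch, assuming the ID column of the first sheet starts at A2 and the name table sits on Sheet2 in A2:B71 (adjust the range to your data):
=VLOOKUP(A2; Sheet2.$A$2:$B$71; 2; 0)
The final 0 forces an exact match; fill the formula down next to each ID.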
