Alternatives for regexp_extract_all in Spark SQL 2.4.0 - apache-spark

Is there any way to do regexp_extract_all in Spark SQL 2.4.0?
I have a column with phone numbers, and I have to split them, but I don't know which delimiter is used. So I need to find all substrings that match the regular expression.
For example, from column with string value:
|Column |
|------------------------------------------------------|
|"+1(493) 140-26-90,+1(495) 140-26-92" |
|"+1(491) 140-24-71,+1(495) 284-99-38" |
|"1(492) 232-93-71,+1(411) 222-93-54,+1(499) 214-83-88"|
|"4959906451 4923691956" |
|"+1(926) 456-65-04,+1(926) 345-36-71" |
|"84999590956" |
|"8 915 608 41 73" |
|------------------------------------------------------|
I need to get a column with array values:
|Column |
|--------------------------------------------------------------|
|["+1(493) 140-26-90", "+1(495) 140-26-92"] |
|["+1(491) 140-24-71", "+1(495) 284-99-38"] |
|["1(492) 232-93-71", "+1(411) 222-93-54", "+1(499) 214-83-88"]|
|["4959906451", "4923691956"] |
|["+1(926) 456-65-04", "+1(926) 345-36-71"] |
|["84999590956"] |
|["8 915 608 41 73"] |
|--------------------------------------------------------------|
By using this regular expression:
((8|\+7|7)[\- ]?)?(\(?\d{3}\)?[\- ]?)?([\d]{3}[\- ]?[\d]{2}[\- ]?[\d]{2})
I can find all the numbers and split them into an array with a regexp_extract_all function.
But there is no such function in Spark SQL 2.4.0.
How can I do this in Spark SQL 2.4.0?

You can try this:
val pattern = """((8|\+7|7)[\- ]?)?(\(?\d{3}\)?[\- ]?)?([\d]{3}[\- ]?[\d]{2}[\- ]?[\d]{2})""".r

// collect every match of the pattern in the first (string) column of each row
import spark.implicits._  // for .toDF on the mapped RDD (already in scope in spark-shell)
df.rdd.map(x => pattern.findAllIn(x.getString(0)).toList)
  .toDF("num_list")
  .show(false)
Output below:
+---------------------------------------------------+
|num_list |
+---------------------------------------------------+
|[(493) 140-26-90, (495) 140-26-92] |
|[(491) 140-24-71, (495) 284-99-38] |
|[(492) 232-93-71, (411) 222-93-54, (499) 214-83-88]|
|[4959906451, 4923691956] |
|[(926) 456-65-04, (926) 345-36-71] |
|[84999590956] |
|[8 915 608 41 73] |
+---------------------------------------------------+
The idea is to use Scala's findAllIn to get all the matches.
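If you would rather stay in the DataFrame/SQL API than drop down to RDDs, another option until you can use a Spark version that ships regexp_extract_all is to wrap the regex in a UDF. A minimal PySpark sketch, assuming a SparkSession named spark and an input column named value (both names are illustrative, not from the question):
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType
import re

spark = SparkSession.builder.getOrCreate()

# the question's regular expression, kept as-is
pattern = r"((8|\+7|7)[\- ]?)?(\(?\d{3}\)?[\- ]?)?([\d]{3}[\- ]?[\d]{2}[\- ]?[\d]{2})"

def find_all(s):
    # full text of every match; empty list for null input
    return [m.group(0) for m in re.finditer(pattern, s)] if s else []

find_all_udf = F.udf(find_all, ArrayType(StringType()))

df = spark.createDataFrame([("+1(493) 140-26-90,+1(495) 140-26-92",),
                            ("8 915 608 41 73",)], ["value"])
df.select(find_all_udf("value").alias("num_list")).show(truncate=False)

# the same function can be registered and called from plain SQL
spark.udf.register("find_all_numbers", find_all, ArrayType(StringType()))
df.createOrReplaceTempView("phones")
spark.sql("SELECT find_all_numbers(value) AS num_list FROM phones").show(truncate=False)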

Related

PySpark: Filtering duplicates of a union, keeping only the groupby rows with the maximum value for a specified column

I want to create a DataFrame that contains all the rows from two DataFrames, keeping only the row with the maximum value of a column wherever there are duplicates.
For example, given two tables with the same schema, like below, merge them into one table that keeps only the row with the highest score for each group of rows sharing the same value in another column ("name" in the example below).
Table A
+--------+---------+-------+
| name | source | score |
+--------+---------+-------+
| Finch | Acme | 62 |
| Jones | Acme | 30 |
| Lewis | Acme | 59 |
| Smith | Acme | 98 |
| Starr | Acme | 87 |
+--------+---------+-------+
Table B
+--------+---------+-------+
| name | source | score |
+--------+---------+-------+
| Bryan | Beta | 93 |
| Jones | Beta | 75 |
| Lewis | Beta | 59 |
| Smith | Beta | 64 |
| Starr | Beta | 81 |
+--------+---------+-------+
Final Table
+--------+---------+-------+
| name | source | score |
+--------+---------+-------+
| Bryan | Beta | 93 |
| Finch | Acme | 62 |
| Jones | Beta | 75 |
| Lewis | Acme | 59 |
| Smith | Acme | 98 |
| Starr | Acme | 87 |
+--------+---------+-------+
Here's what seems to work:
from pyspark.sql import functions as F
schema = ["name", "source", "score"]
rows1 = [("Smith", "Acme", 98),
("Jones", "Acme", 30),
("Finch", "Acme", 62),
("Lewis", "Acme", 59),
("Starr", "Acme", 87)]
rows2 = [("Smith", "Beta", 64),
("Jones", "Beta", 75),
("Bryan", "Beta", 93),
("Lewis", "Beta", 59),
("Starr", "Beta", 81)]
df1 = spark.createDataFrame(rows1, schema)
df2 = spark.createDataFrame(rows2, schema)
df_union = df1.unionAll(df2)
df_agg = df_union.groupBy("name").agg(F.max("score").alias("score"))
df_final = (df_union.join(df_agg, on="score", how="leftsemi")
            .orderBy("name", F.col("score").desc())
            .dropDuplicates(["name"]))
The above produces the DataFrame I expect, but it seems like a convoluted way to do this, and I'm relatively new to Spark. Can this be done in a more efficient, elegant, or "Pythonic" manner?
You can use window functions. Partition by name and choose the record with the highest score.
from pyspark.sql.functions import *
from pyspark.sql.window import Window
w = Window().partitionBy("name").orderBy(desc("score"))
df_union.withColumn("rank", row_number().over(w))\
    .filter(col("rank") == 1).drop("rank").show()
+-----+------+-----+
| name|source|score|
+-----+------+-----+
|Bryan| Beta| 93|
|Finch| Acme| 62|
|Jones| Beta| 75|
|Lewis| Acme| 59|
|Smith| Acme| 98|
|Starr| Acme| 87|
+-----+------+-----+
I don't see anything wrong with your answer except for the last line: you cannot join on score alone, you need to join on the combination of "name" and "score". You can also use an inner join, which eliminates the need to remove rows with lower scores for the same name:
df_final = (df_union.join(df_agg, on=["name", "score"], how="inner")
            .orderBy("name")
            .dropDuplicates(["name"]))
Notice that there is no need to order by score, and .dropDuplicates(["name"]) is only needed if you want to avoid showing two rows for name = Lewis, who has the same score in both DataFrames.

Prometheus count frequency distinct values of gauge

wmi_cpu_core_frequency_mhz is a gauge that returns a few distinct values {a, b, c, ...} and has a core label with values p, q, r, s.
I want a breakdown of the count of each gauge value a, b, c for each core value p, q, r, s.
Something like this:
| core | count(a) | count(b) | count(c) |...
+------+----------+----------+----------+
| p | 10 | 35 | 5 |...
+------+----------+----------+----------+
| q | 15 | 15 | 20 |...
+------+----------+----------+----------+
| r | 2 | 13 | 35 |...
+------+----------+----------+----------+
| s | 10 | 10 | 30 |...
+------+----------+----------+----------+
Any idea how to tackle this, or where I should start?
You want to use count_values here, so your query would be something like count_values by (core) ("frequency", wmi_cpu_core_frequency_mhz).
You won't get a 2d table with this, but you should get the data you're after, at least.

Combine two columns values in one Spark- Python

I have the table below:
FrameForm | Sections | Framefrom_section | FrameFrom_echelon
----------|----------|-------------------|------------------
70 | 11/12 | 11/12 | 50004
70 | 13/14 | 13/14 | 60003
How can I, via pySpark, group on the FrameForm column and combine the values of Framefrom_section and FrameFrom_echelon into arrays, to obtain this result:
FrameForm | Framefrom_section | FrameFrom_echelon
----------|-------------------|------------------
70 | [11/12,13/14] | [50004,60003]
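A minimal PySpark sketch of one way to get that shape (assuming the column names shown above): group by FrameForm and gather the other two columns with collect_list.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# sample data matching the table above
df = spark.createDataFrame(
    [(70, "11/12", "11/12", 50004),
     (70, "13/14", "13/14", 60003)],
    ["FrameForm", "Sections", "Framefrom_section", "FrameFrom_echelon"])

# one row per FrameForm, with the section and echelon values gathered into arrays
result = (df.groupBy("FrameForm")
            .agg(F.collect_list("Framefrom_section").alias("Framefrom_section"),
                 F.collect_list("FrameFrom_echelon").alias("FrameFrom_echelon")))

result.show(truncate=False)
Note that collect_list does not guarantee element order; if the order matters, sort within each group (for example with a window) before collecting.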

Add a new Column to my DataSet in spark Java API

I'm new to Spark.
My DataSet contains two columns. I want to add a third column that is the sum of the two.
My DataSet is:
+---------+-------------------+
|C1 | C2 |
+---------+-------------------+
| 44 | 10|
| 55 | 10|
+---------+-------------------+
I want to obtain a DataSet like this:
+---------+-------------------+---------+
|C1 | C2 | C3 |
+---------+-------------------+---------+
| 44 | 10| 54 |
| 55 | 10| 65 |
+---------+-------------------+---------+
Any help will be appreciated.
The correct solution is:
df.withColumn("C3", df.col1("C1").plus(df.col("C2")));
or
df.selectExpr("*", "C1 + C2 AS C3");
For more arithmetic operators, check the Java-specific expression operators in the Column documentation.

Spark dataframe decimal precision

I have one dataframe:
val groupby = df.groupBy($"column1", $"Date")
  .agg(sum("amount").as("amount"))
  .orderBy($"column1", desc("cob_date"))
When applying the window function to add a new column difference:
val windowspec= Window.partitionBy("column1").orderBy(desc("DATE"))
groupby.withColumn("diffrence" ,lead($"amount", 1,0).over(windowspec)).show()
+--------+------------+-----------+--------------------------+
| Column | Date       | Amount    | Difference               |
+--------+------------+-----------+--------------------------+
| A      | 3/31/2017  | 12345.45  | 3456.540000000000000000  |
| A      | 2/28/2017  | 3456.54   | 34289.430000000000000000 |
| A      | 1/31/2017  | 34289.43  | 45673.987000000000000000 |
| A      | 12/31/2016 | 45673.987 | 0.00E+00                 |
+--------+------------+-----------+--------------------------+
I'm getting decimals with trailing zeros. When I run printSchema() on the above dataframe, the datatype for difference is decimal(38,18). Can someone tell me how to change the datatype to decimal(38,2) or remove the trailing zeros?
You can cast the data to a specific decimal size, like below:
lead($"amount", 1,0).over(windowspec).cast(DataTypes.createDecimalType(32,2))
In pure SQL, you can use the well-known technique:
SELECT ceil(100 * column_name_double)/100 AS cost ...
In PySpark, you can cast the column instead:
from pyspark.sql.types import DecimalType
df = df.withColumn(column_name, df[column_name].cast(DecimalType(10,2)))
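For the original question specifically, the cast can be applied directly to the lead(...) expression so the new column comes out as decimal(38,2). A minimal PySpark sketch, using the column names from the question with illustrative sample data:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import DecimalType
from pyspark.sql.window import Window
from decimal import Decimal

spark = SparkSession.builder.getOrCreate()

# sample data; Decimal values are inferred as decimal(38,18), like in the question
df = spark.createDataFrame(
    [("A", "2017-03-31", Decimal("12345.45")),
     ("A", "2017-02-28", Decimal("3456.54"))],
    ["column1", "Date", "amount"])

w = Window.partitionBy("column1").orderBy(F.desc("Date"))

# cast the lead column down to two decimal places
result = df.withColumn(
    "difference",
    F.lead("amount", 1, 0).over(w).cast(DecimalType(38, 2)))

result.printSchema()
result.show(truncate=False)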

Resources