I have a parquet file with the following schema
|-- Name: string (nullable = true)
|-- Attendance: long (nullable = true)
|-- Efficiency: map (nullable = true)
| |-- key: string
| |-- value: double (valueContainsNull = true)
The efficiency values range from -1 to +1, and the keys are various categories such as Sports, Academics, etc. I have up to 20 different keys.
I am trying to fetch the top 100 names, ordered by Attendance descending, where Efficiency[key] is less than 0.
I am able to do this for one key, but I'm not able to figure out how to implement this for all my keys simultaneously.
Code snippet for one key:
spark.sql("select Name,Attendance,Efficiency['Sports'] from data where Efficiency['Sports'] < 0 order by Attendance desc limit 100")
From some analysis I found that we would need to explode the map. But whenever I explode it, the number of rows in my table goes up and I am unable to fetch the top 100 names.
Sample data for one key (the actual table has a map instead of the single column shown here):
+----+----------+------------------+
|Name|Attendance|Efficiency[Sports]|
+----+----------+------------------+
|A   |1000      |0.002             |
|B   |365       |0.0               |
|C   |1080      |0.193             |
|D   |245       |-0.002            |
|E   |1080      |-0.515            |
|F   |905       |0.0               |
|G   |900       |-0.001            |
+----+----------+------------------+
Expected output : List of 100 names for each key
+------+---------+
|Sports|Academics|
+------+---------+
|A     |A        |
|B     |C        |
|C     |D        |
|D     |E        |
+------+---------+
Any help on solving this would be really helpful.
Thanks
I hope this is what you are looking for
import org.apache.spark.sql.functions._
import spark.implicits._  // needed for toDF and the $ column syntax

//dummy data
val d = Seq(
  ("a", 10, Map("Sports" -> -0.2, "Academics" -> 0.1)),
  ("b", 20, Map("Sports" -> -0.1, "Academics" -> -0.1)),
  ("c", 5, Map("Sports" -> -0.2, "Academics" -> 0.5)),
  ("d", 15, Map("Sports" -> -0.2, "Academics" -> 0.0))
).toDF("Name", "Attendance", "Efficiency")

//explode the map into key and value columns
val result = d.select($"Name", $"Attendance", explode($"Efficiency"))

//keep values less than 0 and show up to 100 rows
result.where($"value".lt(0))
  .sort($"Attendance".desc)
  .show(100)
Output:
+----+----------+---------+-----+
|Name|Attendance|key      |value|
+----+----------+---------+-----+
|b   |20        |Sports   |-0.1 |
|b   |20        |Academics|-0.1 |
|d   |15        |Sports   |-0.2 |
|a   |10        |Sports   |-0.2 |
|c   |5         |Sports   |-0.2 |
+----+----------+---------+-----+
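If you need the top 100 names per key rather than overall (as in the expected output), one option is a window partitioned by the exploded key. This is a minimal sketch building on the result dataframe above; the column name rn is just illustrative:

import org.apache.spark.sql.expressions.Window

// rank rows within each key by Attendance and keep the 100 best per key
val byKey = Window.partitionBy($"key").orderBy($"Attendance".desc)

val top100PerKey = result
  .where($"value".lt(0))                        // negative efficiency only
  .withColumn("rn", row_number().over(byKey))   // 1 = highest Attendance within the key
  .where($"rn" <= 100)

top100PerKey.show(false)

From there, a pivot such as top100PerKey.groupBy("rn").pivot("key").agg(first("Name")) should lay the names out as one column per key, as in the expected output table.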
Hope this helps!
Given the input dataframe as
+----+----------+-----------------------------------------+
|Name|Attendance|Efficiency |
+----+----------+-----------------------------------------+
|A |1000 |Map(Sports -> 0.002, Academics -> 0.002) |
|B |365 |Map(Sports -> 0.0, Academics -> 0.0) |
|C |1080 |Map(Sports -> 0.193, Academics -> 0.193) |
|D |245 |Map(Sports -> -0.002, Academics -> -0.46)|
|E |1080 |Map(Sports -> -0.515, Academics -> -0.5) |
|F |905 |Map(Sports -> 0.0, Academics -> 0.0) |
|G |900 |Map(Sports -> -0.001, Academics -> -0.0) |
+----+----------+-----------------------------------------+
Use a udf function to iterate over the Map and check for values less than zero. This can be done as below:
import org.apache.spark.sql.functions._
import spark.implicits._
val isLessThan0 = udf((maps: Map[String, Double]) => maps.values.exists(_ < 0))
df.withColumn("lessThan0", isLessThan0('Efficiency))
.filter($"lessThan0" === true)
.orderBy($"Attendance".desc)
.drop("lessThan0")
.show(100, false)
You will have the output as:
+----+----------+-----------------------------------------+
|Name|Attendance|Efficiency |
+----+----------+-----------------------------------------+
|E |1080 |Map(Sports -> -0.515, Academics -> -0.5) |
|G |900 |Map(Sports -> -0.001, Academics -> -0.0) |
|D |245 |Map(Sports -> -0.002, Academics -> -0.46)|
+----+----------+-----------------------------------------+
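For reference, the same rows can also be selected without a UDF by exploding the map and flagging the names that have at least one negative value. This is a minimal sketch, assuming the same df and column names as above; hasNegative is just an illustrative name:

import org.apache.spark.sql.functions._
import spark.implicits._

// flag each Name that has at least one Efficiency value below zero
val flags = df
  .select($"Name", explode($"Efficiency").as(Seq("key", "value")))
  .groupBy("Name")
  .agg(max(($"value" < 0).cast("int")).as("hasNegative"))

df.join(flags, Seq("Name"))
  .where($"hasNegative" === 1)
  .orderBy($"Attendance".desc)
  .drop("hasNegative")
  .show(100, false)

Built-in expressions like this stay visible to the Catalyst optimizer, whereas a UDF is a black box to it, so this form can be preferable on large data.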
This is what the dataframe looks like:
+---+-----------------------------------------+-----+
|eco|eco_name |count|
+---+-----------------------------------------+-----+
|B63|Sicilian, Richter-Rauzer Attack |5 |
|D86|Grunfeld, Exchange |3 |
|C99|Ruy Lopez, Closed, Chigorin, 12...cd |5 |
|A44|Old Benoni Defense |3 |
|C46|Three Knights |1 |
|C08|French, Tarrasch, Open, 4.ed ed |13 |
|E59|Nimzo-Indian, 4.e3, Main line |2 |
|A20|English |2 |
|B20|Sicilian |4 |
|B37|Sicilian, Accelerated Fianchetto |2 |
|A33|English, Symmetrical |8 |
|C77|Ruy Lopez |8 |
|B43|Sicilian, Kan, 5.Nc3 |10 |
|A04|Reti Opening |6 |
|A59|Benko Gambit |1 |
|A54|Old Indian, Ukrainian Variation, 4.Nf3 |3 |
|D30|Queen's Gambit Declined |19 |
|C01|French, Exchange |3 |
|D75|Neo-Grunfeld, 6.cd Nxd5, 7.O-O c5, 8.dxc5|1 |
|E74|King's Indian, Averbakh, 6...c5 |2 |
+---+-----------------------------------------+-----+
Schema:
root
|-- eco: string (nullable = true)
|-- eco_name: string (nullable = true)
|-- count: long (nullable = false)
I want to filter it so that only two rows with minimum and maximum counts remain.
The output dataframe should look something like:
+---+-----------------------------------------+--------------------+
|eco|eco_name |number_of_occurences|
+---+-----------------------------------------+--------------------+
|D30|Queen's Gambit Declined |19 |
|C46|Three Knights |1 |
+---+-----------------------------------------+--------------------+
I'm a beginner, I'm really sorry if this is a stupid question.
No need to apologize, since this is the place to learn! One of the solutions is to use a Window with rank to find the min/max rows:
from pyspark.sql import functions as func
from pyspark.sql.window import Window

df = spark.createDataFrame(
[('a', 1), ('b', 1), ('c', 2), ('d', 3)],
schema=['col1', 'col2']
)
df.show(10, False)
+----+----+
|col1|col2|
+----+----+
|a |1 |
|b |1 |
|c |2 |
|d |3 |
+----+----+
Just use filtering to find the min/max count row after the ranking:
df\
.withColumn('min_row', func.rank().over(Window.orderBy(func.asc('col2'))))\
.withColumn('max_row', func.rank().over(Window.orderBy(func.desc('col2'))))\
.filter((func.col('min_row') == 1) | (func.col('max_row') == 1))\
.show(100, False)
+----+----+-------+-------+
|col1|col2|min_row|max_row|
+----+----+-------+-------+
|d |3 |4 |1 |
|a |1 |1 |3 |
|b |1 |1 |3 |
+----+----+-------+-------+
Please note that if several rows tie for the minimum or maximum count (like rows a and b above, which share rank 1), they will all be kept.
You can use the row_number function twice to order the records by count, ascending and descending.
SELECT eco, eco_name, count
FROM (SELECT *,
row_number() over (order by count asc) as rna,
row_number() over (order by count desc) as rnd
FROM df)
WHERE rna = 1 or rnd = 1;
Note there's a tie for count = 1. If you care about it, add a secondary sort to control which record is selected, or use rank instead to select all tied records.
I have a PySpark dataframe-
df1 = spark.createDataFrame([
("u1", 1),
("u1", 2),
("u2", 1),
("u2", 1),
("u2", 1),
("u3", 3),
],
['user_id', 'var1'])
print(df1.printSchema())
df1.show(truncate=False)
Output-
root
|-- user_id: string (nullable = true)
|-- var1: long (nullable = true)
None
+-------+----+
|user_id|var1|
+-------+----+
|u1 |1 |
|u1 |2 |
|u2 |1 |
|u2 |1 |
|u2 |1 |
|u3 |3 |
+-------+----+
Now I want to group all the unique users and show the number of unique var1 values for each of them in a new column. The desired output would look like-
+-------+---------------+
|user_id|num_unique_var1|
+-------+---------------+
|u1 |2 |
|u2 |1 |
|u3 |1 |
+-------+---------------+
I can use collect_set and make a udf to find the set's length. But I think there must be a better way to do it.
How do I achieve this in one line of code?
df1.groupBy('user_id').agg(F.countDistinct('var1').alias('num')).show()
countDistinct is exactly what I needed.
Output-
+-------+---+
|user_id|num|
+-------+---+
| u3| 1|
| u2| 1|
| u1| 2|
+-------+---+
countDistinct is surely the best way to do it, but for the sake of completeness, what you said in your question is also possible without using a UDF. You can use size to get the length of the collect_set:
df1.groupBy('user_id').agg(F.size(F.collect_set('var1')).alias('num'))
This is helpful if you want to use it in a window function, because countDistinct is not supported in window functions. For example,
df1.withColumn('num', F.countDistinct('var1').over(Window.partitionBy('user_id')))
would fail, but
df1.withColumn('num', F.size(F.collect_set('var1')).over(Window.partitionBy('user_id')))
would work.
env: spark2.4.5
source: id-name.json
{"1": "a", "2": "b", "3":, "c"..., "n": "z"}
I load the .json file into a Spark Dataset in JSON format, and it is stored like:
+---+---+---+---+---+
| 1 | 2 | 3 |...| n |
+---+---+---+---+---+
| a | b | c |...| z |
+---+---+---+---+---+
And I want to generate a result like this:
+------------+------+
| id | name |
+------------+------+
| 1 | a |
| 2 | b |
| 3 | c |
| . | . |
| . | . |
| . | . |
| n | z |
+------------+------+
My solution using spark-sql:
select stack(n, '1', `1`, '2', `2`... ,'n', `n`) as ('id', 'name') from table_name;
It doesn't meet my needs because I don't want to hard-code all the ids in the SQL.
Maybe using 'show columns from table_name' with 'stack()' can help?
I would be very grateful if you could give me some suggestions.
Create the required stack expression dynamically and use it wherever required. Please check the code below, which generates the expression dynamically.
scala> val js = Seq("""{"1": "a", "2": "b","3":"c","4":"d","5":"e"}""").toDS
js: org.apache.spark.sql.Dataset[String] = [value: string]
scala> val df = spark.read.json(js)
df: org.apache.spark.sql.DataFrame = [1: string, 2: string ... 3 more fields]
scala> val stack = s"""stack(${df.columns.size},${df.columns.flatMap(c => Seq(s"'${c}'",s"`${c}`")).mkString(",")}) as (id,name)"""
stack: String = stack(5,'1',`1`,'2',`2`,'3',`3`,'4',`4`,'5',`5`) as (id,name)
scala> df.select(expr(stack)).show(false)
+---+----+
|id |name|
+---+----+
|1 |a |
|2 |b |
|3 |c |
|4 |d |
|5 |e |
+---+----+
scala> df.createTempView("table")

scala> spark.sql(s"""select ${stack} from table """).show(false)
+---+----+
|id |name|
+---+----+
|1 |a |
|2 |b |
|3 |c |
|4 |d |
|5 |e |
+---+----+
scala>
Updated code to fetch the data from a JSON file:
scala> import sys.process._
import sys.process._

scala> "hdfs dfs -cat /tmp/sample.json".!
{"1": "a", "2": "b","3":"c","4":"d","5":"e"}
res4: Int = 0
scala> val df = spark.read.json("/tmp/sample.json")
df: org.apache.spark.sql.DataFrame = [1: string, 2: string ... 3 more fields]
scala> val stack = s"""stack(${df.columns.size},${df.columns.flatMap(c => Seq(s"'${c}'",s"`${c}`")).mkString(",")}) as (id,name)"""
stack: String = stack(5,'1',`1`,'2',`2`,'3',`3`,'4',`4`,'5',`5`) as (id,name)
scala> df.select(expr(stack)).show(false)
+---+----+
|id |name|
+---+----+
|1 |a |
|2 |b |
|3 |c |
|4 |d |
|5 |e |
+---+----+
scala> df.createTempView("table")
scala> spark.sql(s"""select ${stack} from table """).show(false)
+---+----+
|id |name|
+---+----+
|1 |a |
|2 |b |
|3 |c |
|4 |d |
|5 |e |
+---+----+
I have data as below and I need to separate it based on ",".
Input file: 1,2,4,371003\,5371022\,87200000\,U
The desired result should be:
a  b  c  d                        e
1  2  4  371003,5371022,87200000  U
val df = spark.read.option("inferSchema","true").option("escape","\\").option("delimiter",",").csv("/user/txt.csv")
try this:
val df = spark.read.csv("/user/txt.csv")
df.show()
+---+---+---+-------+--------+---------+---+
|_c0|_c1|_c2| _c3| _c4| _c5|_c6|
+---+---+---+-------+--------+---------+---+
| 1| 2| 4|371003\|5371022\|87200000\| U|
+---+---+---+-------+--------+---------+---+
import org.apache.spark.sql.functions._
import spark.implicits._  // for the 'colName column syntax

// join the escaped pieces back with "," and strip the backslashes
df.select(
  '_c0, '_c1, '_c2,
  regexp_replace(concat_ws(",", '_c3, '_c4, '_c5), "\\\\", ""),
  '_c6
).toDF("a","b","c","e","f").show(false)
+---+---+---+-----------------------+---+
|a |b |c |e |f |
+---+---+---+-----------------------+---+
|1 |2 |4 |371003,5371022,87200000|U |
+---+---+---+-----------------------+---+
I have an input dataframe of the format
+----+------+-----+----------+
|name|values|score|row_number|
+----+------+-----+----------+
|A   |1000  |0    |1         |
|B   |947   |0    |2         |
|C   |923   |1    |3         |
|D   |900   |2    |4         |
|E   |850   |3    |5         |
|F   |800   |1    |6         |
+----+------+-----+----------+
I need to get sum(values) when score > 0 and row_number < K, i.e., the sum of all values where score > 0 among the top K rows in the dataframe.
I am able to achieve this by running the following query for the top 100 rows:
val top_100_data = df.select(
  count(when(col("score") > 0 and col("row_number") <= 100, col("values"))).alias("count_100"),
  sum(when(col("score") > 0 and col("row_number") <= 100, col("values"))).alias("sum_filtered_100"),
  sum(when(col("row_number") <= 100, col("values"))).alias("total_sum_100")
)
However, I need to fetch data for the top 100, 200, 300, ..., 2500 rows, meaning I would need to run this query 25 times and finally union 25 dataframes.
I'm new to Spark and still figuring lots of things out. What would be the best approach to solve this problem?
Thanks!!
You can create an Array of limits as
val topFilters = Array(100, 200, 300) // you can add more
Then you can loop through the topFilters array and create the dataframe you require. I suggest you use join rather than union, as join will give you separate columns while union will give you separate rows. You can do the following.
Given your dataframe as
+----+------+-----+----------+
|name|values|score|row_number|
+----+------+-----+----------+
|A |1000 |0 |1 |
|B |947 |0 |2 |
|C |923 |1 |3 |
|D |900 |2 |200 |
|E |850 |3 |150 |
|F |800 |1 |250 |
+----+------+-----+----------+
You can do this by using the topFilters array defined above as:
import sqlContext.implicits._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.DataFrame

var finalDF : DataFrame = Seq("1").toDF("rowNum")
for(k <- topFilters) {
val top_100_data = df.select(lit("1").as("rowNum"), sum(when(col("score") > 0 && col("row_number") < k, col("values"))).alias(s"total_sum_$k"))
finalDF = finalDF.join(top_100_data, Seq("rowNum"))
}
finalDF.show(false)
Which should give you final dataframe as
+------+-------------+-------------+-------------+
|rowNum|total_sum_100|total_sum_200|total_sum_300|
+------+-------------+-------------+-------------+
|1 |923 |1773 |3473 |
+------+-------------+-------------+-------------+
You can do the same for all 25 of the limits that you have.
If you intend to use union, the idea is similar to the above.
I hope the answer is helpful.
Updated
If you require a union, then you can apply the following logic with the same limits array defined above:
var finalDF : DataFrame = Seq((0, 0, 0, 0)).toDF("limit", "count", "sum_filtered", "total_sum")
for(k <- topFilters) {
val top_100_data = df.select(lit(k).as("limit"), count(when(col("score") > 0 and col("row_number")<=k, col("values"))).alias("count"),
sum(when(col("score") > 0 and col("row_number")<=k, col("values"))).alias("sum_filtered"),
sum(when(col("row_number") <=k, col("values"))).alias("total_sum"))
finalDF = finalDF.union(top_100_data)
}
finalDF.filter(col("limit") =!= 0).show(false)
which should give you
+-----+-----+------------+---------+
|limit|count|sum_filtered|total_sum|
+-----+-----+------------+---------+
|100 |1 |923 |2870 |
|200 |3 |2673 |4620 |
|300 |4 |3473 |5420 |
+-----+-----+------------+---------+
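For completeness, all the limits can also be computed in a single pass by putting one conditional aggregate per limit into a single select. This is a minimal sketch assuming the same df and topFilters as above; the generated column names such as count_100 are just illustrative:

import org.apache.spark.sql.functions._

// build three conditional aggregates (count, filtered sum, total sum) per limit
val aggCols = topFilters.toSeq.flatMap { k =>
  Seq(
    count(when(col("score") > 0 and col("row_number") <= k, col("values"))).alias(s"count_$k"),
    sum(when(col("score") > 0 and col("row_number") <= k, col("values"))).alias(s"sum_filtered_$k"),
    sum(when(col("row_number") <= k, col("values"))).alias(s"total_sum_$k")
  )
}

// one aggregation job over the data instead of one per limit
df.select(aggCols: _*).show(false)

The result is a single row with three columns per limit, so the data is scanned only once, which matters more as the number of limits grows.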