env: Spark 2.4.5
source: id-name.json
{"1": "a", "2": "b", "3": "c", ..., "n": "z"}
I load the .json file into a Spark Dataset with the JSON format, and it is stored like:
+---+---+---+---+---+
| 1 | 2 | 3 |...| n |
+---+---+---+---+---+
| a | b | c |...| z |
+---+---+---+---+---+
And I want to generate a result like this:
+------------+------+
| id | name |
+------------+------+
| 1 | a |
| 2 | b |
| 3 | c |
| . | . |
| . | . |
| . | . |
| n | z |
+------------+------+
My solution using spark-sql:
select stack(n, '1', `1`, '2', `2`... ,'n', `n`) as ('id', 'name') from table_name;
This doesn't meet my needs, because I don't want to hard-code all the column names in the SQL.
Maybe combining 'show columns from table_name' with 'stack()' could help?
I would be very grateful if you could give me some suggestions.
Build the values required for stack dynamically and use the resulting expression wherever it is required. The code below shows how to generate the expression from the DataFrame's columns.
scala> val js = Seq("""{"1": "a", "2": "b","3":"c","4":"d","5":"e"}""").toDS
js: org.apache.spark.sql.Dataset[String] = [value: string]
scala> val df = spark.read.json(js)
df: org.apache.spark.sql.DataFrame = [1: string, 2: string ... 3 more fields]
scala> import org.apache.spark.sql.functions.expr
import org.apache.spark.sql.functions.expr

scala> val stack = s"""stack(${df.columns.length},${df.columns.flatMap(c => Seq(s"'${c}'",s"`${c}`")).mkString(",")}) as (id,name)"""
stack: String = stack(5,'1',`1`,'2',`2`,'3',`3`,'4',`4`,'5',`5`) as (id,name)
scala> df.select(expr(stack)).show(false)
+---+----+
|id |name|
+---+----+
|1 |a |
|2 |b |
|3 |c |
|4 |d |
|5 |e |
+---+----+
scala> df.createOrReplaceTempView("table")

scala> spark.sql(s"""select ${stack} from table """).show(false)
+---+----+
|id |name|
+---+----+
|1 |a |
|2 |b |
|3 |c |
|4 |d |
|5 |e |
+---+----+
Updated code to read the data from a JSON file:
scala> "hdfs dfs -cat /tmp/sample.json".!
{"1": "a", "2": "b","3":"c","4":"d","5":"e"}
res4: Int = 0
scala> val df = spark.read.json("/tmp/sample.json")
df: org.apache.spark.sql.DataFrame = [1: string, 2: string ... 3 more fields]
scala> val stack = s"""stack(${df.columns.length},${df.columns.flatMap(c => Seq(s"'${c}'",s"`${c}`")).mkString(",")}) as (id,name)"""
stack: String = stack(5,'1',`1`,'2',`2`,'3',`3`,'4',`4`,'5',`5`) as (id,name)
scala> df.select(expr(stack)).show(false)
+---+----+
|id |name|
+---+----+
|1 |a |
|2 |b |
|3 |c |
|4 |d |
|5 |e |
+---+----+
scala> df.createTempView("table")
scala> spark.sql(s"""select ${stack} from table """).show(false)
+---+----+
|id |name|
+---+----+
|1 |a |
|2 |b |
|3 |c |
|4 |d |
|5 |e |
+---+----+
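If you prefer to stay in the DataFrame API and avoid building a SQL string, the same unpivot can be sketched with explode over an array of structs (a sketch reusing the df loaded above; the id/name aliases are just illustrative):

import org.apache.spark.sql.functions._

// Build one (id, name) struct per source column, then explode them into rows
val pairs = explode(array(df.columns.map(c => struct(lit(c).as("id"), col(c).as("name"))): _*))

df.select(pairs.as("p")).select("p.id", "p.name").show(false)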
This is what the dataframe looks like:
+---+-----------------------------------------+-----+
|eco|eco_name |count|
+---+-----------------------------------------+-----+
|B63|Sicilian, Richter-Rauzer Attack |5 |
|D86|Grunfeld, Exchange |3 |
|C99|Ruy Lopez, Closed, Chigorin, 12...cd |5 |
|A44|Old Benoni Defense |3 |
|C46|Three Knights |1 |
|C08|French, Tarrasch, Open, 4.ed ed |13 |
|E59|Nimzo-Indian, 4.e3, Main line |2 |
|A20|English |2 |
|B20|Sicilian |4 |
|B37|Sicilian, Accelerated Fianchetto |2 |
|A33|English, Symmetrical |8 |
|C77|Ruy Lopez |8 |
|B43|Sicilian, Kan, 5.Nc3 |10 |
|A04|Reti Opening |6 |
|A59|Benko Gambit |1 |
|A54|Old Indian, Ukrainian Variation, 4.Nf3 |3 |
|D30|Queen's Gambit Declined |19 |
|C01|French, Exchange |3 |
|D75|Neo-Grunfeld, 6.cd Nxd5, 7.O-O c5, 8.dxc5|1 |
|E74|King's Indian, Averbakh, 6...c5 |2 |
+---+-----------------------------------------+-----+
Schema:
root
|-- eco: string (nullable = true)
|-- eco_name: string (nullable = true)
|-- count: long (nullable = false)
I want to filter it so that only the two rows with the minimum and maximum counts remain.
The output dataframe should look something like:
+---+-----------------------------------------+--------------------+
|eco|eco_name |number_of_occurences|
+---+-----------------------------------------+--------------------+
|D30|Queen's Gambit Declined |19 |
|C46|Three Knights |1 |
+---+-----------------------------------------+--------------------+
I'm a beginner, I'm really sorry if this is a stupid question.
No need to apologize, since this is the place to learn! One of the solutions is to use a Window with rank to find the min/max rows:
from pyspark.sql import functions as func
from pyspark.sql.window import Window

df = spark.createDataFrame(
    [('a', 1), ('b', 1), ('c', 2), ('d', 3)],
    schema=['col1', 'col2']
)
df.show(10, False)
+----+----+
|col1|col2|
+----+----+
|a |1 |
|b |1 |
|c |2 |
|d |3 |
+----+----+
Just use filtering to find the min/max count row after the ranking:
df\
.withColumn('min_row', func.rank().over(Window.orderBy(func.asc('col2'))))\
.withColumn('max_row', func.rank().over(Window.orderBy(func.desc('col2'))))\
.filter((func.col('min_row') == 1) | (func.col('max_row') == 1))\
.show(100, False)
+----+----+-------+-------+
|col1|col2|min_row|max_row|
+----+----+-------+-------+
|d |3 |4 |1 |
|a |1 |1 |3 |
|b |1 |1 |3 |
+----+----+-------+-------+
Please note that if several rows tie for the min/max count, they will all pass the filter (as rows a and b do above).
You can use the row_number function twice to rank records by count, ascending and descending.
SELECT eco, eco_name, count
FROM (SELECT *,
row_number() over (order by count asc) as rna,
row_number() over (order by count desc) as rnd
FROM df)
WHERE rna = 1 or rnd = 1;
Note there's a tie for count = 1. If you care about it, add a secondary sort to control which record is selected, or use rank instead to select all tied rows.
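For completeness, here is the same idea with the Scala DataFrame API (a sketch, assuming the DataFrame from the question is available as df with the count column shown above):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Rank rows by count in both directions, then keep the top row of each ordering
val byAsc  = Window.orderBy(col("count").asc)
val byDesc = Window.orderBy(col("count").desc)

df.withColumn("rna", row_number().over(byAsc))
  .withColumn("rnd", row_number().over(byDesc))
  .filter(col("rna") === 1 || col("rnd") === 1)
  .drop("rna", "rnd")
  .show(false)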
I am joining two dataframes site_bs and site_wrk_int1 and creating site_wrk using a dynamic join condition.
My code is like below:
join_cond = [col(v_col) == col('wrk_' + v_col) for v_col in primaryKeyCols]
site_wrk=site_bs.join(site_wrk_int1,join_cond,'inner').select(*site_bs.columns)
join_cond is built dynamically, and its value will be something like [col('id') == col('wrk_id'), col('id') == col('wrk_parentId')].
With that list, the join only matches rows that satisfy both conditions, i.e. the effective join condition is
id = wrk_id and id = wrk_parentId
But I want an OR condition to be applied instead, like below:
id = wrk_id or id = wrk_parentId
How can I achieve this in PySpark?
Since logical operations on PySpark columns return Column objects, you can chain these conditions in the join statement, such as:
from pyspark.sql import SparkSession
import pyspark.sql.functions as f
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([
(1, "A", "A"),
(2, "C", "C"),
(3, "E", "D"),
], ['id', 'col1', 'col2']
)
df.show()
+---+----+----+
| id|col1|col2|
+---+----+----+
| 1| A| A|
| 2| C| C|
| 3| E| D|
+---+----+----+
df.alias("t1").join(
df.alias("t2"),
(f.col("t1.col1") == f.col("t2.col2")) | (f.col("t1.col1") == f.lit("E")),
"left_outer"
).show(truncate=False)
+---+----+----+---+----+----+
|id |col1|col2|id |col1|col2|
+---+----+----+---+----+----+
|1 |A |A |1 |A |A |
|2 |C |C |2 |C |C |
|3 |E |D |1 |A |A |
|3 |E |D |2 |C |C |
|3 |E |D |3 |E |D |
+---+----+----+---+----+----+
As you can see, rows with IDs 1 and 2 match because col1 == col2, while the row with ID 3 matches every right-hand row because col1 == E is true for it. In terms of syntax, it's important that conditions combined with the Python operators (|, &, ...) are wrapped in parentheses as in the example above, otherwise you might get confusing py4j errors.
Alternatively, if you wish to keep a notation similar to the one in your question (a list of conditions), you can use functools.reduce and operator.or_ to apply this logic to your list.
First, for contrast, here is what happens when the conditions are passed as a list: they are combined with AND, and I get NULLs only, as expected:
df.alias("t1").join(
df.alias("t2"),
[f.col("t1.col1") == f.col("t2.col2"), f.col("t1.col1") == f.lit("E")],
"left_outer"
).show(truncate=False)
+---+----+----+----+----+----+
|id |col1|col2|id |col1|col2|
+---+----+----+----+----+----+
|3 |E |D |null|null|null|
|1 |A |A |null|null|null|
|2 |C |C |null|null|null|
+---+----+----+----+----+----+
In this example, I leverage functools and operator to get the same result as above:
import functools
import operator

df.alias("t1").join(
df.alias("t2"),
functools.reduce(
operator.or_,
[f.col("t1.col1") == f.col("t2.col2"), f.col("t1.col1") == f.lit("E")]),
"left_outer"
).show(truncate=False)
+---+----+----+---+----+----+
|id |col1|col2|id |col1|col2|
+---+----+----+---+----+----+
|1 |A |A |1 |A |A |
|2 |C |C |2 |C |C |
|3 |E |D |1 |A |A |
|3 |E |D |2 |C |C |
|3 |E |D |3 |E |D |
+---+----+----+---+----+----+
I am quite new to Spark SQL.
Please let me know if this can be a solution:
site_wrk = site_bs.join(site_work_int1, [(site_bs.id == site_work_int1.wrk_id) | (site_bs.id == site_work_int1.wrk_parentId)], how = "inner")
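For reference, the same pattern in the Scala API boils down to reducing the list of conditions with || (a sketch; site_bs, site_wrk_int1 and the wrk_ column names come from the question):

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.col

// Combine an arbitrary list of join conditions with OR
val conditions: Seq[Column] = Seq(
  col("id") === col("wrk_id"),
  col("id") === col("wrk_parentId")
)
val joinCond = conditions.reduce(_ || _)

val site_wrk = site_bs.join(site_wrk_int1, joinCond, "inner")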
I am creating a DataFrame as per the given schema; after that, I want to create a new DataFrame by reordering the columns of the existing one.
Is it possible to re-order the columns of a Spark DataFrame?
object Demo extends Context {
def main(args: Array[String]): Unit = {
val emp = Seq((1,"Smith",-1,"2018","10","M",3000),
(2,"Rose",1,"2010","20","M",4000),
(3,"Williams",1,"2010","10","M",1000),
(4,"Jones",2,"2005","10","F",2000),
(5,"Brown",2,"2010","40","",-1),
(6,"Brown",2,"2010","50","",-1)
)
val empColumns = Seq("emp_id","name","superior_emp_id","year_joined",
"emp_dept_id","gender","salary")
import sparkSession.sqlContext.implicits._
val empDF = emp.toDF(empColumns: _*)
empDF.show(false)
}
}
Current DF:
+------+--------+---------------+-----------+-----------+------+------+
|emp_id|name |superior_emp_id|year_joined|emp_dept_id|gender|salary|
+------+--------+---------------+-----------+-----------+------+------+
|1 |Smith |-1 |2018 |10 |M |3000 |
|2 |Rose |1 |2010 |20 |M |4000 |
|3 |Williams|1 |2010 |10 |M |1000 |
|4 |Jones |2 |2005 |10 |F |2000 |
|5 |Brown |2 |2010 |40 | |-1 |
|6 |Brown |2 |2010 |50 | |-1 |
+------+--------+---------------+-----------+-----------+------+------+
I want the output to be the following DataFrame, where the gender and salary columns are re-ordered:
New DF:
+------+--------+------+------+---------------+-----------+-----------+
|emp_id|name |gender|salary|superior_emp_id|year_joined|emp_dept_id|
+------+--------+------+------+---------------+-----------+-----------+
|1 |Smith |M |3000 |-1 |2018 |10 |
|2 |Rose |M |4000 |1 |2010 |20 |
|3 |Williams|M |1000 |1 |2010 |10 |
|4 |Jones |F |2000 |2 |2005 |10 |
|5 |Brown | |-1 |2 |2010 |40 |
|6 |Brown | |-1 |2 |2010 |50 |
+------+--------+------+------+---------------+-----------+-----------+
Just use select() to re-order the columns:
df = df.select('emp_id','name','gender','salary','superior_emp_id','year_joined','emp_dept_id')
The columns will appear in the order you pass them to select().
The Scala way of doing it:
import org.apache.spark.sql.functions.col

// Order the column names as you want
val columns = Array("emp_id","name","gender","salary","superior_emp_id","year_joined","emp_dept_id")
  .map(col)
//Pass it to select
df.select(columns: _*)
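If you only want to move a few columns to the front and keep the rest in their original order, you can build the column list instead of typing every name (a sketch, reusing empDF from the question):

import org.apache.spark.sql.functions.col

// Columns to pull to the front; everything else keeps its original relative order
val front = Seq("emp_id", "name", "gender", "salary")
val rest  = empDF.columns.filterNot(front.contains)

empDF.select((front ++ rest).map(col): _*).show(false)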
I have data as below, and I need to split it on ",".
Input file: 1,2,4,371003\,5371022\,87200000\,U
The desired result should be:
a  b  c  d  e                        f
1  2  3  4  371003,5371022,87000000  U
val df = spark.read.option("inferSchma","true").option("escape","\\").option("delimiter",",").csv("/user/txt.csv")
try this:
val df = spark.read.csv("/user/txt.csv")
df.show()
+---+---+---+-------+--------+---------+---+
|_c0|_c1|_c2| _c3| _c4| _c5|_c6|
+---+---+---+-------+--------+---------+---+
| 1| 2| 4|371003\|5371022\|87200000\| U|
+---+---+---+-------+--------+---------+---+
import org.apache.spark.sql.functions._
import spark.implicits._  // needed for the 'colName symbol syntax outside the shell

df.select(
  '_c0, '_c1, '_c2,
  regexp_replace(concat_ws(",", '_c3, '_c4, '_c5), "\\\\", ""),
  '_c6
).toDF("a","b","c","e","f").show(false)
+---+---+---+-----------------------+---+
|a |b |c |e |f |
+---+---+---+-----------------------+---+
|1 |2 |4 |371003,5371022,87200000|U |
+---+---+---+-----------------------+---+
I want to compare two dataframes that have the same schema and a primary key column.
For each primary key, if any other column differs (there could be multiple differing columns, so I need a dynamic way to scan all the other columns), I want to output the column name and the values from both dataframes.
Also, I want to output a row when a primary key exists in only one of the dataframes (so a "full outer join" will be used). Here is an example:
dataframe1:
+-----------+------+------+
|primary_key|book |number|
+-----------+------+------+
|1 |book1 | 1 |
|2 |book2 | 2 |
|3 |book3 | 3 |
|4 |book4 | 4 |
+-----------+------+------+
dataframe2:
+-----------+------+------+
|primary_key|book |number|
+-----------+------+------+
|1 |book1 | 1 |
|2 |book8 | 8 |
|3 |book3 | 7 |
|5 |book5 | 5 |
+-----------+------+------+
The result would be:
+-----------+----------------+----------+----------+
|primary_key|diff_column_name|dataframe1|dataframe2|
+-----------+----------------+----------+----------+
|2          |book            |book2     |book8     |
|2          |number          |2         |8         |
|3          |number          |3         |7         |
|4          |book            |book4     |null      |
|4          |number          |4         |null      |
|5          |book            |null      |book5     |
|5          |number          |null      |5         |
+-----------+----------------+----------+----------+
I know the first step is to join both dataframes on the primary key:
// joining the two DFs on primary_key
val result = df1.as("l")
.join(df2.as("r"), "primary_key", "fullouter")
But I am not sure how to proceed. Can someone give me some advice? Thanks.
Data:
val df1 = Seq(
(1, "book1", 1), (2, "book2", 2), (3, "book3", 3), (4, "book4", 4)
).toDF("primary_key", "book", "number")
val df2 = Seq(
(1, "book1", 1), (2, "book8", 8), (3, "book3", 7), (5, "book5", 5)
).toDF("primary_key", "book", "number")
Imports:
import org.apache.spark.sql.functions._
import spark.implicits._  // for the $"colName" syntax
Define list of columns:
val cols = Seq("book", "number")
Join as you do right now:
val joined = df1.as("l").join(df2.as("r"), Seq("primary_key"), "fullouter")
Define the comparison expression:
val comp = explode(array(cols.map(c => struct(
lit(c).alias("diff_column_name"),
// Value left
col(s"l.${c}").cast("string").alias("dataframe1"),
// Value right
col(s"r.${c}").cast("string").alias("dataframe2"),
// Differs
not(col(s"l.${c}") <=> col(s"r.${c}")).alias("diff")
)): _*))
Select and filter:
joined
.withColumn("comp", comp)
.select($"primary_key", $"comp.*")
// Filter out mismatches and get rid of obsolete diff
.where($"diff").drop("diff")
.orderBy("primary_key").show
// +-----------+----------------+----------+----------+
// |primary_key|diff_column_name|dataframe1|dataframe2|
// +-----------+----------------+----------+----------+
// | 2| book| book2| book8|
// | 2| number| 2| 8|
// | 3| number| 3| 7|
// | 4| book| book4| null|
// | 4| number| 4| null|
// | 5| book| null| book5|
// | 5| number| null| 5|
// +-----------+----------------+----------+----------+
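If you don't want to hard-code the columns to compare, the list can be derived from the schema (a sketch, assuming both DataFrames share the same columns and primary_key is the only key column):

// Every column except the key is compared; plug this in as `cols` above
val cols = df1.columns.filterNot(_ == "primary_key").toSeq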