I have created a view from two tables in Hive, and let's say I have
df = spark.sql("select * from view")
and
df1 = spark.sql("select * from table1")
df2 = spark.sql("select * from table2")
df3 = df1.join(df2)
I assume that the condition to join in both cases is the same.
Now the question is, will the output of df and df3 be identical? Why and how?
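I assume one way to check this is to compare the plans of the two DataFrames, e.g.:
// extended explain prints the analyzed, optimized and physical plans for comparison
df.explain(true)
df3.explain(true)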
I have 2 df's
df1:
columns: col1, col2, col3
partitioned on col1
no of partitions: 120000
df2:
columns: col1, col2, col3
partitioned on col1
no of partitions: 80000
Now I want to join df1 and df2 on (df1.col1 = df2.col1 and df1.col2 = df2.col2) without much shuffling.
I tried the join but it is taking a lot of time.
How do I do it? Can anyone help?
IMO you can try to use a broadcast join if one of your datasets is small (let's say a few hundred MB); in that case the smaller dataset will be broadcast and you will skip the shuffle.
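For illustration, a minimal broadcast-join sketch using the question's join keys (dfBig and dfSmall are placeholder names):
import org.apache.spark.sql.functions.broadcast
// hint Spark to ship the smaller side to every executor, so the larger side is not shuffled
val joined = dfBig.join(broadcast(dfSmall), Seq("col1", "col2"))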
Without a broadcast hint, Catalyst is probably going to pick SMJ (sort-merge join), and during this join algorithm the data needs to be repartitioned by the join key and then sorted. I prepared a quick example:
import org.apache.spark.sql.functions._
import spark.implicits._ // needed for .toDF on local Seqs
spark.conf.set("spark.sql.shuffle.partitions", "10")
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
val data = Seq(("test", 3),("test", 3), ("test2", 5), ("test3", 7), ("test55", 86))
val data2 = Seq(("test", 3),("test", 3), ("test2", 5), ("test3", 6), ("test33", 76))
val df = data.toDF("Name", "Value").repartition(5, col("Name"))
df.show
val df2 = data2.toDF("Name", "Value").repartition(5, col("Name"))
df2.show
df.join(df2, Seq("Name", "Value")).show
autoBroadcastJoinThreshold is set to -1 to disable broadcast joins.
spark.sql.shuffle.partitions is set to 10 to show that the join is going to use this value during repartitioning.
I repartitioned the DataFrames into 5 partitions before the join and called an action to be sure that they are partitioned by the same column before the join.
And in the SQL tab I can see that Spark is repartitioning the data again.
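The same exchange also shows up in the plan if you prefer to check it outside the UI; for example:
// expect an Exchange hashpartitioning(Name, Value, 10) on both sides of the SortMergeJoin
df.join(df2, Seq("Name", "Value")).explain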
If you can't broadcast and your join is taking a lot of time, you may check whether you have some skew.
You may read this blog post by Dima Statz to find more information about skew in joins.
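A quick way to spot skew (a sketch, assuming the column names from the question) is to count rows per join key and look at the largest groups:
import org.apache.spark.sql.functions.desc
// if a handful of (col1, col2) keys hold most of the rows, the join is skewed
df1.groupBy("col1", "col2").count().orderBy(desc("count")).show(20)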
Does it make sense to drop unneeded columns before joining Spark DataFrames?
For example:
DF1 has 10 columns, DF2 has 15 columns, DF3 has 25 columns.
I want to join them, select the 10 columns I need and save the result as .parquet.
Does it make sense to transform the DataFrames by selecting only the needed columns before the join, or will the Spark engine optimize the join by itself and not operate on all 50 columns during the join operation?
Yes, it makes perfect sense because it reduces the amount of data shuffled between executors. And it's better to select only the necessary columns as early as possible; in most cases, if the file format allows it (Parquet, Delta Lake), Spark will read data only for the necessary columns, not for all of them. I.e.:
df1 = spark.read.parquet("file1") \
.select("col1", "col2", "col3")
df2 = spark.read.parquet("file2") \
.select("col1", "col5", "col6")
joined = df1.join(df2, "col1")
I have df1 and df2 and I want to make a left join using pandas.
I also tried this:
data_2 = pd.merge(df1, df2, on=['var1', 'var2', 'var3'])
but it is not really what I want to do.
I wrote the following join in SQL just to show what I really want to do (please notice that the two DataFrames have different column names):
create df3 as
select a.* , b.*
from df1 as a left join df2 as b
on a.id=b.id_var
and a.speciality=b.speciality
and upcase(a.global_name)= upcase(b.product_name)
How can I do it using pandas?
Equivalent:
(df1.assign(upcase=df1.global_name.str.upper())
.merge(df2.assign(upcase=df2.product_name.str.upper()),
left_on=['id', 'speciality', 'upcase'],
right_on=['id_var', 'speciality', 'upcase'],
how='left')
.drop('upcase', axis=1)
)
I want to get data from only df2 (all columns) by comparing the 'no' field in both df1 and df2.
My 3-line code is below; with it I'm getting all columns from df1 and df2 and I am not able to trim the fields from df1. How do I achieve this?
I have 2 pandas DataFrames like below:
df1:
no,name,salary
1,abc,100
2,def,105
3,abc,110
4,def,115
5,abc,120
df2:
no,name,salary,dept,addr
1,abc,100,IT1,ADDR1
2,abc,101,IT2,ADDR2
3,abc,102,IT3,ADDR3
4,abc,103,IT4,ADDR4
5,abc,104,IT5,ADDR5
6,abc,105,IT6,ADDR6
7,abc,106,IT7,ADDR7
8,abc,107,IT8,ADDR8
df1 = pd.read_csv("D:\\data\\data1.csv")
df2 = pd.read_csv("D:\\data\\data2.csv")
resDF = pd.merge(df1, df2, on='no' , how='inner')
I think you need to filter only the no column; then the on and how parameters are not necessary:
resDF = pd.merge(df1[['no']], df2)
Or use boolean indexing with filtering by isin:
resDF = df2[df2['no'].isin(df1['no'])]
I have a dataframe df1 with 150 columns and many rows. I also have a dataframe df2 with the same schema but very few rows, containing edits that should be applied to df1 (there's a key column id to identify which row to update). df2 has only the columns with updates populated; the other columns are null. What I want to do is update the rows in df1 with the corresponding rows from df2 in the following way:
if a column in df2 is null, it should not cause any changes in df1
if a column in df2 contains a tilde "~", it should result in nullifying that column in df1
otherwise the value in the column in df1 should be replaced with the value from df2
How can I best do it? Can it be done in a generic way without listing all the columns but rather iterating over them? Can it be done using the DataFrame API or do I need to switch to RDDs?
(Of course by updating dataframe df1 I mean creating a new, updated dataframe.)
Example
Let's say the schema is: id: Int, name: String, age: Int.
df1 is:
1,"Greg",18
2,"Kate",25
3,"Chris",30
df2 is:
1,"Gregory",null
2,~,26
The updated dataframe should look like this:
1,"Gregory",18
2,null,26
3,"Chris",30
You can also use case/when or coalesce with a full outer join to merge the two dataframes; see the link below for an explanation, and the sketch after it.
Spark incremental loading overwrite old record
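A minimal sketch of the coalesce approach, using the example schema from the question ("~" is translated to null, and missing edits fall back to df1's value):
import org.apache.spark.sql.functions.{coalesce, col, lit, when}
val merged = df1.as("a")
  .join(df2.as("b"), Seq("id"), "full_outer")
  .select(
    col("id"),
    when(col("b.name") === "~", lit(null))
      .otherwise(coalesce(col("b.name"), col("a.name"))).alias("name"),
    coalesce(col("b.age"), col("a.age")).alias("age")
  )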
I figured out how to do it with an intermediate conversion to RDD. First, create a map idsToEdits where keys are row ids and values are maps of column numbers to values (only the non-null ones).
val idsToEdits = df2.rdd.map { row =>
  (row(0),
   row.getValuesMap[AnyVal](row.schema.fieldNames.filterNot(colName => row.isNullAt(row.fieldIndex(colName))))
     .map { case (k, v) => (row.fieldIndex(k), if (v == "~") null else v) })
}.collectAsMap()
Broadcast that map and define an editRow function that updates a row.
val idsToEditsBr = sc.broadcast(idsToEdits)
import org.apache.spark.sql.Row
val editRow: Row => Row = { row =>
  idsToEditsBr
    .value
    .get(row(0))
    .map { edits =>
      Row.fromSeq(edits.foldLeft(row.toSeq) { case (rowSeq, (idx, newValue)) =>
        rowSeq.updated(idx, newValue)
      })
    }
    .getOrElse(row)
}
Finally, use that function on the RDD derived from df1 and convert back to a DataFrame.
val updatedDF = spark.createDataFrame(df1.rdd.map(editRow), df1.schema)
It sounds like your question is how to perform this without explicitly naming all the columns, so I will assume you have some "doLogic" udf function or dataframe functions to perform your logic after joining.
import org.apache.spark.sql.functions.{col, when}
import org.apache.spark.sql.types.StringType
// Build the output columns: for string columns apply the custom logic,
// for all other columns take the df2 value whenever it is present.
val cols = df1.schema.filterNot(x => x.name == "id").map { x =>
  if (x.dataType == StringType) {
    doLogicUdf(col(x.name), col(x.name + "2"))
  } else {
    when(col(x.name + "2").isNotNull, col(x.name + "2")).otherwise(col(x.name))
  }
} :+ col("id")
// Rename df2's columns so they do not clash with df1's after the join.
val df2Renamed = df2.select(df2.columns.map(x => col(x).alias(x + "2")): _*)
// A left join keeps the df1 rows that have no matching edit in df2.
df1.join(df2Renamed, col("id") === col("id2"), "left_outer").select(cols: _*)