I have two dataframes, df and df1. In df some columns contain NULL, while df1 has non-null values for those columns. I just need to overwrite the NULL values in df with the values from df1.
The df is below:
+------------+--------------------+-------+---------------+--------------------+----------+------------+
| Id| Name|Country| City| Address| Latitude| Longitude|
+------------+--------------------+-------+---------------+--------------------+----------+------------+
| 42949672960|Americana Resort ...| US| Dillon| 135 Main St| null| null|
| 42949672965|Comfort Inn Delan...| US| Deland|400 E Internation...| 29.054737| -81.297208|
| 60129542147|Ubaa Old Crawford...| US| Des Plaines| 5460 N River Rd| null| null|
The df1 is below:
+-------------+--------------------+-------+------------+--------------------+----------+------------+
| Id| Name|Country| City| Address| Latitude| Longitude|
+-------------+--------------------+-------+------------+--------------------+----------+------------+
| 42949672960|Americana Resort ...| US| Dillon| 135 Main St|39.6286685|-106.0451009|
| 60129542147|Ubaa Old Crawford...| US| Des Plaines| 5460 N River Rd|42.0654049| -87.8916252|
+-------------+--------------------+-------+------------+--------------------+----------+------------+
I want this result:
+------------+--------------------+-------+---------------+--------------------+----------+------------+
| Id| Name|Country| City| Address| Latitude| Longitude|
+------------+--------------------+-------+---------------+--------------------+----------+------------+
| 42949672960|Americana Resort ...| US| Dillon| 135 Main St|39.6286685|-106.0451009|
| 42949672965|Comfort Inn Delan...| US| Deland|400 E Internation...| 29.054737| -81.297208|
...
...
You can left join them and then use coalesce to pick the first non-null lat/lon. (An inner join also works if you only need the rows present in both dataframes, but it drops the unmatched ones.)
df1
+-----------+---------+----------+
| id| lat| lon|
+-----------+---------+----------+
|42949672960| null| null|
|42949672965|29.054737|-81.297208|
|60129542147| null| null|
+-----------+---------+----------+
df2
+-----------+----------+------------+
| id| lat| lon|
+-----------+----------+------------+
|42949672960|39.6286685|-106.0451009|
|60129542147|42.0654049| -87.8916252|
+-----------+----------+------------+
Join them together
from pyspark.sql import functions as F
(df1
    .join(df2, on=['id'], how='left')
    .select(
        F.col('id'),
        F.coalesce(df1['lat'], df2['lat']).alias('lat'),
        F.coalesce(df1['lon'], df2['lon']).alias('lon')
    )
    .show()
)
# +-----------+----------+------------+
# | id| lat| lon|
# +-----------+----------+------------+
# |42949672965| 29.054737| -81.297208|
# |60129542147|42.0654049| -87.8916252|
# |42949672960|39.6286685|-106.0451009|
# +-----------+----------+------------+
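To make the semantics of the left join followed by coalesce concrete without a Spark session, here is a minimal plain-Python model. It is only a sketch: the dictionaries and the `coalesce` helper are illustrative stand-ins rather than Spark APIs, and `None` plays the role of null.

```python
# Plain-Python model of "left join on id, then coalesce"; not Spark code.
def coalesce(*values):
    """Return the first value that is not None, like SQL COALESCE."""
    return next((v for v in values if v is not None), None)

# Hypothetical stand-ins for df1 and df2, keyed by id -> (lat, lon).
left = {
    42949672960: (None, None),
    42949672965: (29.054737, -81.297208),
    60129542147: (None, None),
}
right = {
    42949672960: (39.6286685, -106.0451009),
    60129542147: (42.0654049, -87.8916252),
}

result = {}
for key, (lat, lon) in left.items():
    # A left join keeps every left row; unmatched rows see nulls on the right.
    r_lat, r_lon = right.get(key, (None, None))
    result[key] = (coalesce(lat, r_lat), coalesce(lon, r_lon))

for key, (lat, lon) in result.items():
    print(key, lat, lon)
```

The row with id 42949672965 has no match on the right side, which is why a left join (rather than an inner join) is needed to keep it.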
Related
I have two dataframes like this:
df2.show()
+----+-------+------+
|NAME|BALANCE|SALARY|
+----+-------+------+
|PPan| 11| 500|
|Liza| 20| 900|
+----+-------+------+
df3.show()
+----+-------+------+
|NAME|BALANCE|SALARY|
+----+-------+------+
|PPan| 10| 700|
| Cal| 70| 888|
+----+-------+------+
Here df2 represents the existing database records and df3 represents new or updated records (any column may change) that need to be inserted or updated in the database. For example, for NAME=PPan the new balance is 10 as per df3, so the entire PPan row has to be replaced in df2. For NAME=Cal a new row has to be added, and the row for NAME=Liza stays untouched, like this:
+----+-------+------+
|NAME|BALANCE|SALARY|
+----+-------+------+
|PPan| 10| 700|
|Liza| 20| 900|
| Cal| 70| 888|
+----+-------+------+
How can I achieve this use case?
First, join both dataframes with a full (outer) join so that unmatched rows (the new ones) are kept. Then, to update the matched records, I prefer to use a select with the coalesce function:
joined_df = df2.alias('rec').join(df3.alias('upd'), on='NAME', how='full')
# +----+-------+------+-------+------+
# |NAME|BALANCE|SALARY|BALANCE|SALARY|
# +----+-------+------+-------+------+
# |Cal |null |null |70 |888 |
# |Liza|20 |900 |null |null |
# |PPan|11 |500 |10 |700 |
# +----+-------+------+-------+------+
output_df = joined_df.selectExpr(
    'NAME',
    'COALESCE(upd.BALANCE, rec.BALANCE) BALANCE',
    'COALESCE(upd.SALARY, rec.SALARY) SALARY'
)
output_df.sort('BALANCE').show(truncate=False)
+----+-------+------+
|NAME|BALANCE|SALARY|
+----+-------+------+
|PPan|10 |700 |
|Liza|20 |900 |
|Cal |70 |888 |
+----+-------+------+
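The upsert logic of the full join plus coalesce can also be sketched in plain Python. This is a model under stated assumptions, not Spark code: `existing` and `updates` are hypothetical dictionaries standing in for df2 and df3, and `first_non_null` is an illustrative helper that mimics a two-argument COALESCE.

```python
# Plain-Python model of "full outer join on NAME, prefer the update side".
existing = {"PPan": (11, 500), "Liza": (20, 900)}  # stands in for df2
updates = {"PPan": (10, 700), "Cal": (70, 888)}    # stands in for df3

def first_non_null(a, b):
    """Two-argument COALESCE: a unless it is None, otherwise b."""
    return a if a is not None else b

merged = {}
# A full join keeps the keys from both sides.
for name in existing.keys() | updates.keys():
    rec = existing.get(name, (None, None))
    upd = updates.get(name, (None, None))
    # The update side comes first, so it wins whenever it is not null.
    merged[name] = tuple(first_non_null(u, r) for u, r in zip(upd, rec))

for name, (balance, salary) in sorted(merged.items(), key=lambda kv: kv[1][0]):
    print(name, balance, salary)
```

Because the coalesce is applied per column, a null in an updated row falls back to the existing value instead of overwriting it. If updates must be able to set a column to null, this approach needs an extra marker for which side matched.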
I have two dataframes that need to be joined in a particular way I am struggling with.
dataframe 1:
+--------------------+---------+----------------+
| asset_domain| eid| oid|
+--------------------+---------+----------------+
| test-domain...| 126656| 126656|
| nebraska.aaa.com| 335660| 335660|
| netflix.com| 460| 460|
+--------------------+---------+----------------+
dataframe 2:
+--------------------+--------------------+---------+--------------+----+----+------------+
| asset| asset_domain|dns_count| ip| ev|post|form_present|
+--------------------+--------------------+---------+--------------+----+----+------------+
| sub1.test-domain...| test-domain...| 6354| 11.11.111.111| 1| 1| null|
| netflix.com| netflix.com| 3836| 22.22.222.222|null|null| null|
+--------------------+--------------------+---------+--------------+----+----+------------+
desired result:
+--------------------+---------+-------------+----+----+------------+---------+----------------+
| asset|dns_count| ip| ev|post|form_present| eid| oid|
+--------------------+---------+-------------+----+----+------------+---------+----------------+
| netflix.com| 3836|22.22.222.222|null|null| null| 460| 460|
| sub1.test-domain...| 5924|111.11.111.11| 1| 1| null| 126656| 126656|
| nebraska.aaa.com| null| null|null|null| null| 335660| 335660|
+--------------------+---------+-------------+----+----+------------+---------+----------------+
Basically – it should join df1 and df2 on asset_domain but if that doesn't exist in df2, then the resulting asset should be the asset_domain from df1.
I tried df = df2.join(df1, ["asset_domain"], "right").drop("asset_domain") but that obviously leaves null in the asset column for nebraska.aaa.com since it does not have a matching domain in df2. How do I go about adding those to the asset column for this particular case?
You can use the coalesce function after the join to create the asset column.
from pyspark.sql.functions import coalesce
(df2.join(df1, ["asset_domain"], "right")
    .select(coalesce("asset", "asset_domain").alias("asset"),
            "dns_count", "ip", "ev", "post", "form_present", "eid", "oid")
    .orderBy("asset")
    .show())
#+----------------+---------+-------------+----+----+------------+------+------+
#| asset|dns_count| ip| ev|post|form_present| eid| oid|
#+----------------+---------+-------------+----+----+------------+------+------+
#|nebraska.aaa.com| null| null|null|null| null|335660|335660|
#| netflix.com| 3836|22.22.222.222|null|null| null| 460| 460|
#|sub1.test-domain| 6354|11.11.111.111| 1| 1| null|126656|126656|
#+----------------+---------+-------------+----+----+------------+------+------+
After the join, you can use the isNull() function together with when/otherwise:
import pyspark.sql.functions as F
tst1 = sqlContext.createDataFrame([('netflix',1),('amazon',2)],schema=("asset_domain",'xtra1'))
tst2= sqlContext.createDataFrame([('netflix','yahoo',1),('amazon','yahoo',2),('flipkart',None,2)],schema=("asset_domain","asset",'xtra'))
tst_j = tst1.join(tst2,on='asset_domain',how='right')
#%%
tst_res = tst_j.withColumn("asset",F.when(F.col('asset').isNull(),F.col('asset_domain')).otherwise(F.col('asset')))
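The when/isNull/otherwise expression and coalesce express the same fallback rule. The plain-Python sketch below models both on data shaped like the joined rows; the `rows` list and the two helper functions are illustrative assumptions, with `None` standing in for null.

```python
# Plain-Python model; None stands in for null. Not Spark code.
rows = [
    {"asset_domain": "netflix", "asset": "yahoo"},
    {"asset_domain": "amazon", "asset": "yahoo"},
    {"asset_domain": "flipkart", "asset": None},
]

def with_when(row):
    # Mirrors when(asset.isNull(), asset_domain).otherwise(asset)
    return row["asset_domain"] if row["asset"] is None else row["asset"]

def with_coalesce(row):
    # Mirrors coalesce(asset, asset_domain): first non-null value wins
    candidates = (row["asset"], row["asset_domain"])
    return next((v for v in candidates if v is not None), None)

filled = [with_when(r) for r in rows]
print(filled)
print(filled == [with_coalesce(r) for r in rows])
```

With only two candidate columns the two forms are interchangeable; coalesce tends to read better once there are more than two fallbacks.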
I have the following sample dataframe
df = spark.createDataFrame([('start','end'), ('start1','end1')] ,["start", "end"])
and I want to explode the values in each row and associate alternating 1-0 values in the generated rows. This way I can identify the start/end entries in each row.
I am able to achieve the desired result this way
from pyspark.sql import functions as fn
from pyspark.sql.window import Window
w = Window().orderBy(fn.lit('A'))
df = (df.withColumn('start_end', fn.array('start', 'end'))
.withColumn('date', fn.explode('start_end'))
.withColumn('row_num', fn.row_number().over(w)))
df = (df.withColumn('is_start', fn.when(fn.col('row_num')%2 == 0, 0).otherwise(1))
.select('date', 'is_start'))
which gives
| date | is_start |
|--------|----------|
| start | 1 |
| end | 0 |
| start1 | 1 |
| end1 | 0 |
but it seems overly complicated for such a simple task.
Is there any better/cleaner way without using UDFs?
You can use pyspark.sql.functions.posexplode along with pyspark.sql.functions.array.
First create an array out of your start and end columns, then explode this with the position:
from pyspark.sql.functions import array, posexplode
df.select(posexplode(array("end", "start")).alias("is_start", "date")).show()
#+--------+------+
#|is_start| date|
#+--------+------+
#| 0| end|
#| 1| start|
#| 0| end1|
#| 1|start1|
#+--------+------+
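What posexplode does here can be modelled in plain Python with `enumerate`: each array element is emitted together with its position. The sketch below is only an illustration of the idea (the `rows` list is a hypothetical stand-in for the dataframe), including a variant that keeps the start entry first.

```python
# Plain-Python model of posexplode(array(...)); not Spark code.
rows = [("start", "end"), ("start1", "end1")]  # (start, end) per row

# "end" is placed first so that it gets position 0 and "start" gets 1,
# which lets the position double as the is_start flag.
exploded = [
    (pos, value)
    for start, end in rows
    for pos, value in enumerate([end, start])
]
print(exploded)

# Variant: keep (start, end) order and derive the flag as 1 - position.
start_first = [
    (value, 1 - pos)
    for start, end in rows
    for pos, value in enumerate([start, end])
]
print(start_first)
```

The second variant reproduces the row order asked for in the question, at the cost of one extra arithmetic step on the position.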
You can try union:
from pyspark.sql import functions as F
df = spark.createDataFrame([('start','end'), ('start1','end1')] ,["start", "end"])
df = df.withColumn('startv', F.lit(1))
df = df.withColumn('endv', F.lit(0))
df = df.select(['start', 'startv']).union(df.select(['end', 'endv']))
df.show()
+------+------+
| start|startv|
+------+------+
| start| 1|
|start1| 1|
| end| 0|
| end1| 0|
+------+------+
From here you can rename the columns and re-order the rows as needed.
I had a similar situation in my use case. I had a huge dataset (~50 GB), and any self join or heavy transformation used more memory and made execution unstable.
I went one level down from the DataFrame API and used flatMap on the underlying RDD. This is a map-side transformation, so it is cost-effective in terms of shuffle, CPU, and memory.
df = spark.createDataFrame([('start','end'), ('start1','end1')] ,["start", "end"])
df.show()
+------+----+
| start| end|
+------+----+
| start| end|
|start1|end1|
+------+----+
final_df = df.rdd.flatMap(lambda row: [(row.start, 1), (row.end, 0)]).toDF(['date', 'is_start'])
final_df.show()
+------+--------+
| date|is_start|
+------+--------+
| start| 1|
| end| 0|
|start1| 1|
| end1| 0|
+------+--------+
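The flatMap step maps each row to a list of output tuples and then flattens the lists. A plain-Python sketch of the same idea, using `itertools.chain.from_iterable` as the flattening step, is shown below; the `Row` namedtuple and the `expand` helper are illustrative assumptions, not Spark classes.

```python
from collections import namedtuple
from itertools import chain

# Hypothetical stand-in for Spark's Row, with the same field names.
Row = namedtuple("Row", ["start", "end"])
rows = [Row("start", "end"), Row("start1", "end1")]

def expand(row):
    """One input row becomes two (date, is_start) tuples."""
    return [(row.start, 1), (row.end, 0)]

# map each row to a list, then flatten: this is what flatMap does
flat = list(chain.from_iterable(expand(r) for r in rows))
print(flat)
```

Each output depends only on its own input row, which is why this kind of transformation needs no data movement between partitions.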
A crossJoin can be done as follows:
df1 = pd.DataFrame({'subgroup':['A','B','C','D']})
df2 = pd.DataFrame({'dates':pd.date_range(date_today, date_today + timedelta(3), freq='D')})
sdf1 = spark.createDataFrame(df1)
sdf2 = spark.createDataFrame(df2)
sdf1.crossJoin(sdf2).toPandas()
In this example there are two dataframes each containing 4 rows, in the end, I get 16 rows.
However, for my problem, I would like to do a cross join per user, and the user is another column in the two dataframes, e.g.:
df1 = pd.DataFrame({'user':[1,1,1,1,2,2,2,2],'subgroup':['A','B','C','D','A','B','D','E']})
df2 = pd.DataFrame({'user':[1,1,1,1,2,2,2,2],'dates':np.hstack([np.array(pd.date_range(date_today, date_today + timedelta(3), freq='D')),np.array(pd.date_range(date_today+timedelta(1), date_today + timedelta(4), freq='D'))])})
The result of applying the per-user crossJoin should be a dataframe with 32 rows. Is this possible in pyspark and how can this be done?
A cross join is a join that multiplies rows because the joining key does not identify rows uniquely (in our case the joining key is trivial, or there is no joining key at all).
Let's start with sample data frames:
import numpy as np
import pyspark.sql.functions as psf
import pyspark.sql.types as pst
df1 = spark.createDataFrame(
[[user, value] for user, value in zip(5 * list(range(2)), np.random.randint(0, 100, 10).tolist())],
schema=pst.StructType([pst.StructField(c, pst.IntegerType()) for c in ['user', 'value1']]))
df2 = spark.createDataFrame(
[[user, value] for user, value in zip(5 * list(range(2)), np.random.randint(0, 100, 10).tolist())],
schema=pst.StructType([pst.StructField(c, pst.IntegerType()) for c in ['user', 'value2']]))
+----+------+
|user|value1|
+----+------+
| 0| 76|
| 1| 59|
| 0| 14|
| 1| 71|
| 0| 66|
| 1| 61|
| 0| 2|
| 1| 22|
| 0| 16|
| 1| 83|
+----+------+
+----+------+
|user|value2|
+----+------+
| 0| 65|
| 1| 81|
| 0| 60|
| 1| 69|
| 0| 21|
| 1| 61|
| 0| 98|
| 1| 76|
| 0| 40|
| 1| 21|
+----+------+
Let's try joining the data frames on a constant column to see the equivalence between a cross join and regular join on a constant (trivial) column:
df = df1.withColumn('key', psf.lit(1)) \
.join(df2.withColumn('key', psf.lit(1)), on=['key'])
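On Spark 2.x this raises an error, because Spark realises we're trying to do a cross join (cartesian product). From Spark 3.0 on, spark.sql.crossJoin.enabled defaults to true, so the same join runs without this error: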
Py4JJavaError: An error occurred while calling o1865.showString.
: org.apache.spark.sql.AnalysisException: Detected implicit cartesian product for INNER join between logical plans
LogicalRDD [user#1538, value1#1539], false
and
LogicalRDD [user#1542, value2#1543], false
Join condition is missing or trivial.
Either: use the CROSS JOIN syntax to allow cartesian products between these
relations, or: enable implicit cartesian products by setting the configuration
variable spark.sql.crossJoin.enabled=true;
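The equivalence that the optimizer detects here can be checked on a small scale in plain Python: an inner join on a constant key matches every left row with every right row, which is exactly the cartesian product. The lists below are hypothetical miniature versions of the two dataframes.

```python
from itertools import product

# Hypothetical two-row stand-ins for df1 and df2.
left = [{"user": 0, "value1": 76}, {"user": 1, "value1": 59}]
right = [{"user": 0, "value2": 65}, {"user": 1, "value2": 81}]

# Cartesian product: every left row paired with every right row.
cross = list(product(left, right))

# Inner join on a constant key column, as in the withColumn('key', lit(1)) attempt.
left_k = [dict(row, key=1) for row in left]
right_k = [dict(row, key=1) for row in right]
joined = [(l, r) for l in left_k for r in right_k if l["key"] == r["key"]]

print(len(cross), len(joined))
```

Both results have len(left) * len(right) pairs, since the join condition on the constant key is always true.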
If your joining key (user here) is not a column that uniquely identifies rows, you'll get a multiplication of lines as well but within each user group:
df = df1.join(df2, on='user')
print("Number of rows : \tdf1: {} \tdf2: {} \tdf: {}".format(df1.count(), df2.count(), df.count()))
Number of rows : df1: 10 df2: 10 df: 50
+----+------+------+
|user|value1|value2|
+----+------+------+
| 1| 59| 81|
| 1| 59| 69|
| 1| 59| 61|
| 1| 59| 76|
| 1| 59| 21|
| 1| 71| 81|
| 1| 71| 69|
| 1| 71| 61|
| 1| 71| 76|
| 1| 71| 21|
| 1| 61| 81|
| 1| 61| 69|
| 1| 61| 61|
| 1| 61| 76|
| 1| 61| 21|
| 1| 22| 81|
| 1| 22| 69|
| 1| 22| 61|
| 1| 22| 76|
| 1| 22| 21|
+----+------+------+
5 * 5 rows for user 0 + 5 * 5 rows for user 1, hence 50
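The row count of such a join can be predicted from the key frequencies alone: each key contributes (rows on the left) times (rows on the right). The sketch below works this out in plain Python; `joined_row_count` is a hypothetical helper written for this illustration.

```python
from collections import Counter

def joined_row_count(left_keys, right_keys):
    """Rows produced by an inner join, given the key column of each side."""
    left_counts = Counter(left_keys)
    right_counts = Counter(right_keys)
    # Only keys present on both sides survive an inner join.
    shared = left_counts.keys() & right_counts.keys()
    return sum(left_counts[k] * right_counts[k] for k in shared)

# The answer's sample data: users 0 and 1, five rows each, on both sides.
answer_total = joined_row_count([0, 1] * 5, [0, 1] * 5)

# The question's data: users 1 and 2, four rows each, on both sides.
question_total = joined_row_count([1] * 4 + [2] * 4, [1] * 4 + [2] * 4)

print(answer_total, question_total)
```

For the question's data this gives 4 * 4 + 4 * 4 = 32, the per-user cross join size that was asked for, so a plain join on user is enough.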
Note: Using a self join followed by a filter usually means you should be using window functions instead.
I have the following two Spark data frames:
sale_df:
+-------+----------+
|user_id|total_sale|
+-------+----------+
| a| 1100|
| b| 2100|
| c| 3300|
| d| 4400|
+-------+----------+
and target_df:
+-------+-------------------+
|user_id|personalized_target|
+-------+-------------------+
| b| 1000|
| c| 2000|
| d| 3000|
| e| 4000|
+-------+-------------------+
How can I join them in a way that output is:
user_id total_sale personalized_target
a 1100 NA
b 2100 1000
c 3300 2000
d 4400 3000
e NA 4000
I have tried almost all the join types, but it seems that a single join cannot produce the desired output.
A solution in PySpark, or in SQL with HiveContext, would help.
You can use the equi-join syntax in Scala
val output = sale_df.join(target_df,Seq("user_id"),joinType="outer")
The Python equivalent should look like this:
output = sale_df.join(target_df,['user_id'],"outer")
You need to perform an outer equi-join:
data1 = [['a', 1100], ['b', 2100], ['c', 3300], ['d', 4400]]
sales = sqlContext.createDataFrame(data1,['user_id','total_sale'])
data2 = [['b', 1000],['c',2000],['d',3000],['e',4000]]
target = sqlContext.createDataFrame(data2,['user_id','personalized_target'])
sales.join(target, 'user_id', "outer").show()
# +-------+----------+-------------------+
# |user_id|total_sale|personalized_target|
# +-------+----------+-------------------+
# | e| null| 4000|
# | d| 4400| 3000|
# | c| 3300| 2000|
# | b| 2100| 1000|
# | a| 1100| null|
# +-------+----------+-------------------+
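The effect of the outer equi-join can be modelled in a few lines of plain Python: take the union of the keys and look each one up on both sides, with `None` where a side has no match. The dictionaries below are illustrative stand-ins for the two dataframes, not Spark objects.

```python
# Plain-Python model of a full outer join on user_id; not Spark code.
sales = {"a": 1100, "b": 2100, "c": 3300, "d": 4400}
target = {"b": 1000, "c": 2000, "d": 3000, "e": 4000}

outer = [
    # dict.get returns None for a missing key, which models the null fill
    (user, sales.get(user), target.get(user))
    for user in sorted(sales.keys() | target.keys())
]

for user, total_sale, personalized_target in outer:
    print(user, total_sale, personalized_target)
```

Users a and e each appear on only one side, so they keep their row with a null in the other column, which is what the left, right, and inner joins each fail to do on their own.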