I have two spark datasets that I'm trying to join. The join keys are nested in dataset A, so I must flatmap it out first before joining with dataset B. The problem is that as soon as I flatmap that field, the column name becomes the default "_1", "_2", etc. Is it possible to change the alias somehow?
A.flatMap(a => a.keys).join(B).where(...)
After applying the transformation like flatMap you lose the columns as which is logical as after applying transformation like flatMap or map it does not guarantee that the number of column or datatype inside each column remain the same.That's why we lose the column name there.
What you can do is you can fetch all previous column and then apply it to the dataset like this:-
val columns = A.columns
A.flatMap(a => a.keys).toDF(columns:_ *).join(B).where(...)
this will only work if the number of columns is same after applying flatmap
Hope this clears your issue
Thanks
Related
I have 2 different dataframes and I was able to join them together based on g_id. Just like below:
df1 = dfx.join(df_gi, regexp_extract(trim(dfx.LOCATION), ".*/GDocs/([0-9]{1,5})/.*", 1) == df_gi.g_id, "inner")\
.select (dfx["*"], df_gi["G_Number2"])
Now, dfx daraframe has a column called G_Number1 and df_gi dataframe has a similar column called G_Number2, Both of these columns combined solves the missing pieces ... Meaning one column has some information and the other has some. Combining both together is the output needed.
How can I achieve in pyspark?? I tried the concat function .. but i was way off.
Thank you in advance.
You can use coalesce:
import pyspark.sql.functions as f
df.withColumn('Output', f.coalesce('G_Number2', 'G_Number1'))
Notice this will prioritize G_Number2 column when both are not null, if you need the other way, just switch the order of the two columns.
I need to create a column called sim_count for every row in my spark dataframe, whose value is the count of all other rows from the dataframe that match some conditions based on the current row's values. Is it possible to access row values while using when?
Is something like this possible? I have already implemented this logic using a UDF, but serialization of the dataframe's rdd map is very costly and I am trying to see if there is a faster alternative to find this count value.
Edit
<Row's col_1 val> refer's to the outer scope row I am calculating the count for, NOT the inner scope row inside the df.where. For example, I know this is incorrect syntax, but I'm looking for something like:
df.withColumn('sim_count',
f.when(
f.col("col_1").isNotNull(),
(
df.where(
f.col("price_list").between(f.col("col1"), f.col("col2"))
).count()
)
).otherwise(f.lit(None).cast(LongType()))
)
I have two dataframes that are large here are sample examples..
first
firstnames|lastnames|age
tom|form|24
bob|lip|36
....
second
firstnames|lastnames|age
mary|gu|24
jane|lip|36
...
I would like to take both dataframes and combine them into one that look like:
firstnames|lastnames|age
tom|form|24
bob|lip|36
mary|gu|24
jane|lip|36
...
now I could write them both out and them read them together but that's a huge waste.
If both dataframes are identical in structure then it's straight forward -union()
df1.union(df2)
In case any dataframe have any missing column then you have add dummy column in that dataframe on that specific column position else union will throw column mismatch exception. in below example column 'c3' is missing in df1 so I am adding dummy column in df1 in last position.
from pyspark.sql.functions import lit
df1.select('c1','c2',lit('dummy')).union(df2.select('c1','c2','c3'))
this is a simple as shown here : union https://docs.databricks.com/spark/latest/faq/append-a-row-to-rdd-or-dataframe.html
I am trying to join two dataframes with the same column names and compute some new values. after that i need to drop all columns of second table. The number of columns is huge. How can I do it in easier way? I tried to .drop("table2.*"),but this dont work.
You can use select with aliases:
df1.alias("df1")
.join(df2.alias("df2"), Seq("someJoinColumn"))
.select($"df1.*", $"someComputedColumn", ...)
reference with the parent DataFrame:
df1.join(df2, Seq("someJoinColumn")).select(df1("*"), $"someComputedColumn", ...)
Instead of dropping, you can select all the necessary columns that you want hold for further operations something like below
val newDataFrame = joinedDataFrame.select($"col1", $"col4", $"col6")
I am trying to deduplicate values in a Spark dataframe column based on values in another dataframe column. It seems that withColumn() only works within a single dataframe, and subqueries won't be fully available until version 2. I suppose I could try to join the tables but that seems a bit messy. Here is the general idea:
df.take(1)
[Row(TIMESTAMP='20160531 23:03:33', CLIENT ID=233347, ROI NAME='my_roi', ROI VALUE=1, UNIQUE_ID='173888')]
df_re.take(1)
[Row(UNIQUE_ID='6866144:ST64PSIMT5MB:1')]
Basically just want to take the values from df and remove any that are found in df_re and then return the whole dataframe with the rows containing those duplicates removed. I'm sure I could iterate each one, but I am wondering if there is a better way.
Any ideas?
The way to do this is to do a left_outer join, and then filter for where the right-hand side of the join is empty. Something like:
val df1 = Seq((1,2),(2,123),(3,101)).toDF("uniq_id", "payload")
val df2 = Seq((2,432)).toDF("uniq_id", "other_data")
df1.as("df1").join(
df2.as("df2"),
col("df1.uniq_id") === col("df2.uniq_id"),
"left_outer"
).filter($"df2.uniq_id".isNull)