I have two large dataframes; here are sample examples.
first
firstnames|lastnames|age
tom|form|24
bob|lip|36
....
second
firstnames|lastnames|age
mary|gu|24
jane|lip|36
...
I would like to take both dataframes and combine them into one that looks like:
firstnames|lastnames|age
tom|form|24
bob|lip|36
mary|gu|24
jane|lip|36
...
Now I could write them both out and then read them back in together, but that's a huge waste.
If both dataframes are identical in structure then it's straightforward: union()
df1.union(df2)
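For the sample rows in the question, a minimal runnable sketch (assuming an existing SparkSession named spark):

first = spark.createDataFrame(
    [("tom", "form", 24), ("bob", "lip", 36)],
    ["firstnames", "lastnames", "age"],
)
second = spark.createDataFrame(
    [("mary", "gu", 24), ("jane", "lip", 36)],
    ["firstnames", "lastnames", "age"],
)

combined = first.union(second)  # stacks the rows; columns are matched by position
combined.show()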
If either dataframe is missing a column, you have to add a dummy column to that dataframe at that specific column position, otherwise union will throw a column-mismatch exception. In the example below, column 'c3' is missing from df1, so I add a dummy column to df1 in the last position.
from pyspark.sql.functions import lit
df1.select('c1', 'c2', lit('dummy').alias('c3')).union(df2.select('c1', 'c2', 'c3'))
This is as simple as shown here: union https://docs.databricks.com/spark/latest/faq/append-a-row-to-rdd-or-dataframe.html
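On Spark 3.1+ there is also unionByName with allowMissingColumns=True, which matches columns by name and fills the missing ones with null, so you don't have to position the dummy column yourself (reusing the c1/c2/c3 names from the example above):

# df1 lacks c3, so its rows get null in that column; columns are matched by name, not position
combined = df1.unionByName(df2, allowMissingColumns=True)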
I want to get all rows of a dataframe (df2) where the city column value and postcode column value also exist in another dataframe (df1).
The important thing here is that I want the combination of both columns, not each column looked at individually.
My approach was this:
import numpy as np
from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

# 1. Get all combinations and broadcast them
df_combinations = np.array(df1.select("Ort", "Postleitzahl").dropDuplicates().collect())
bc_combinations = sc.broadcast(df_combinations)

# 2. Define udf that checks a (city, postcode) pair against the broadcast array
def combination_in_vx(ort, plz):
    for arr_el in bc_combinations.value:
        if str(arr_el[0]) == ort and int(arr_el[1]) == plz:
            return True
    return False

combination_in_vx = udf(combination_in_vx, BooleanType())

# 3. Flag the matching rows and keep only the matches
df_tmp = df2.withColumn("Combination_Exists", combination_in_vx('city', 'postcode'))
df_result = df_tmp.filter(df_tmp.Combination_Exists)
Although this should theoretically work, it takes forever!
Does anybody know about a better solution here? Thank you very much!
You can do a left semi join on the two columns. This will keep the rows in df2 whose combination of values in the two specified columns also exists in df1:
import pyspark.sql.functions as F
df_result = df2.join(df1, ["Ort", "Postleitzahl"], 'left_semi')
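If df1 (the lookup side) is small, a broadcast hint can speed this up further. A minimal sketch with made-up city/postcode rows, assuming an existing SparkSession spark and that df2 carries the same Ort/Postleitzahl column names (otherwise rename them or spell out the join condition):

import pyspark.sql.functions as F

df1 = spark.createDataFrame(
    [("Berlin", 10115), ("Hamburg", 20095)],
    ["Ort", "Postleitzahl"],
)
df2 = spark.createDataFrame(
    [("Berlin", 10115, "keep"), ("Berlin", 99999, "drop"), ("Hamburg", 20095, "keep")],
    ["Ort", "Postleitzahl", "note"],
)

# left_semi keeps only df2 rows whose (Ort, Postleitzahl) pair also appears in df1
df_result = df2.join(F.broadcast(df1), ["Ort", "Postleitzahl"], "left_semi")
df_result.show()  # the row with Postleitzahl 99999 is dropped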
I have 2 different dataframes and I was able to join them together based on g_id. Just like below:
from pyspark.sql.functions import regexp_extract, trim

df1 = dfx.join(df_gi, regexp_extract(trim(dfx.LOCATION), ".*/GDocs/([0-9]{1,5})/.*", 1) == df_gi.g_id, "inner")\
    .select(dfx["*"], df_gi["G_Number2"])
Now, the dfx dataframe has a column called G_Number1 and the df_gi dataframe has a similar column called G_Number2. Both of these columns combined solve the missing pieces ... meaning one column has some of the information and the other has some. Combining both together is the output I need.
How can I achieve this in PySpark? I tried the concat function, but I was way off.
Thank you in advance.
You can use coalesce:
import pyspark.sql.functions as f
df.withColumn('Output', f.coalesce('G_Number2', 'G_Number1'))
Notice this will prioritize the G_Number2 column when both are not null; if you need it the other way around, just switch the order of the two columns.
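A small sketch with made-up values (assuming an existing SparkSession spark) to show which side wins:

import pyspark.sql.functions as f

df = spark.createDataFrame(
    [("a", None), (None, "b"), ("c", "d")],
    "G_Number1 string, G_Number2 string",
)
df.withColumn("Output", f.coalesce("G_Number2", "G_Number1")).show()
# Output is "a", "b", "d": G_Number2 wins in the last row because it is not null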
I have a DataFrame which contains 55000 rows and 3 columns.
I want to return every row of this big DataFrame as a DataFrame, to use it as a parameter of a different function.
My idea was to iterate over the big DataFrame with iterrows() or iloc, but I can't get each row as a DataFrame; it comes back as a Series. How can I solve this?
I think it is obviously not necessary, because the index of the Series is the same as the columns of the DataFrame.
But it is possible by:
df1 = s.to_frame().T
Or:
df1 = pd.DataFrame([s.to_numpy()], columns=s.index)
Also, you can try to avoid iterrows, because it is obviously very slow.
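For a toy Series (made-up values, not the question's 55000-row data), both forms above give a one-row DataFrame:

import pandas as pd

s = pd.Series({"a": 1, "b": 2, "c": 3}, name=0)

df1 = s.to_frame().T                                 # one row labelled 0, columns a, b, c
df2 = pd.DataFrame([s.to_numpy()], columns=s.index)  # same shape, fresh 0-based index
print(df1)
#    a  b  c
# 0  1  2  3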
I suspect you're doing something not optimal if you need what you describe. That said, if you need each row as a dataframe:
l = [df.iloc[[i]] for i in range(len(df))]  # df.iloc[[i]] returns a one-row DataFrame, not a Series
This makes a list with a single-row DataFrame for each row in df.
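For example, passing each single-row frame to a hypothetical function f that expects a DataFrame parameter:

import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3], "y": [10, 20, 30]})

def f(frame: pd.DataFrame) -> int:
    # toy stand-in for the real function that expects a DataFrame parameter
    return int(frame["x"].iloc[0] + frame["y"].iloc[0])

results = [f(df.iloc[[i]]) for i in range(len(df))]  # each df.iloc[[i]] is a one-row DataFrame
print(results)  # [11, 22, 33]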
I have 2 different dataframes like so -
and
I need to add a column "Present In" to the first dataframe, that lists all the items in C that correspond to the K ID in the second dataframe. So, the first table should look something like -
How can I do this using Pandas? Thanks! :)
I would do a groupby on df2, then map:
s = df2.groupby('K ID')['C'].apply(','.join)
df1['Present In'] = df1['K ID'].map(s).fillna('')
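A small sketch with made-up rows (the K ID / C values are assumptions, not the asker's tables):

import pandas as pd

df1 = pd.DataFrame({"K ID": [1, 2, 3]})
df2 = pd.DataFrame({"K ID": [1, 1, 2], "C": ["apple", "pear", "plum"]})

s = df2.groupby("K ID")["C"].apply(",".join)       # one comma-joined string per K ID
df1["Present In"] = df1["K ID"].map(s).fillna("")  # K IDs with no match get an empty string
print(df1)
#    K ID  Present In
# 0     1  apple,pear
# 1     2        plum
# 2     3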
I am trying to deduplicate values in a Spark dataframe column based on values in another dataframe column. It seems that withColumn() only works within a single dataframe, and subqueries won't be fully available until version 2. I suppose I could try to join the tables but that seems a bit messy. Here is the general idea:
df.take(1)
[Row(TIMESTAMP='20160531 23:03:33', CLIENT ID=233347, ROI NAME='my_roi', ROI VALUE=1, UNIQUE_ID='173888')]
df_re.take(1)
[Row(UNIQUE_ID='6866144:ST64PSIMT5MB:1')]
Basically just want to take the values from df and remove any that are found in df_re and then return the whole dataframe with the rows containing those duplicates removed. I'm sure I could iterate each one, but I am wondering if there is a better way.
Any ideas?
The way to do this is to do a left_outer join, and then filter for rows where the right-hand side of the join is null. Something like:
val df1 = Seq((1,2),(2,123),(3,101)).toDF("uniq_id", "payload")
val df2 = Seq((2,432)).toDF("uniq_id", "other_data")
df1.as("df1").join(
df2.as("df2"),
col("df1.uniq_id") === col("df2.uniq_id"),
"left_outer"
).filter($"df2.uniq_id".isNull)