I am trying to deduplicate values in a Spark dataframe column based on values in another dataframe column. It seems that withColumn() only works within a single dataframe, and subqueries won't be fully available until version 2. I suppose I could try to join the tables but that seems a bit messy. Here is the general idea:
df.take(1)
[Row(TIMESTAMP='20160531 23:03:33', CLIENT ID=233347, ROI NAME='my_roi', ROI VALUE=1, UNIQUE_ID='173888')]
df_re.take(1)
[Row(UNIQUE_ID='6866144:ST64PSIMT5MB:1')]
Basically, I just want to take the values from df, remove any that are found in df_re, and return the whole dataframe with the rows containing those duplicates removed. I'm sure I could iterate over each one, but I am wondering if there is a better way.
Any ideas?
The way to do this is to do a left_outer join, and then filter for where the right-hand side of the join is empty. Something like:
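import org.apache.spark.sql.functions.col
import spark.implicits._ // assumes a SparkSession named spark (sqlContext.implicits._ on Spark 1.x); needed for toDF and $"..."

val df1 = Seq((1,2),(2,123),(3,101)).toDF("uniq_id", "payload")
val df2 = Seq((2,432)).toDF("uniq_id", "other_data")
// Left outer join, then keep only the df1 rows that found no match in df2
df1.as("df1").join(
  df2.as("df2"),
  col("df1.uniq_id") === col("df2.uniq_id"),
  "left_outer"
).filter($"df2.uniq_id".isNull)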
Related
I have a DataFrame which contains 55000 rows and 3 columns.
I want to get every row of this big DataFrame as a DataFrame of its own, so I can pass it as a parameter to a different function.
My idea was to iterate over the big DataFrame with iterrows() or iloc, but I can't get a DataFrame out of it; each row comes back as a Series. How can I solve this?
I think this is not really necessary, because the index of the Series is the same as the columns of the DataFrame.
But it is possible with:
df1 = s.to_frame().T
Or:
df1 = pd.DataFrame([s.to_numpy()], columns=s.index)
Also, try to avoid iterrows, because it is very slow.
I suspect you're doing something not optimal if you need what you describe. That said, if you need each row as a dataframe:
l = [df.iloc[[i]] for i in range(len(df))]  # double brackets keep each row as a one-row DataFrame
This makes a list of one-row DataFrames, one for each row in df.
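A small sketch comparing the ways of turning a single row into a DataFrame may help here. The sample data and column names are made up for illustration; the point is that selecting with a list of positions keeps the original dtypes, while going through a Series and transposing does not when the columns have mixed types.

```python
import pandas as pd

# Hypothetical sample data; the column names are assumptions.
df = pd.DataFrame({"a": [1, 2], "b": [0.5, 1.5], "c": ["x", "y"]})

s = df.iloc[0]                   # a Series whose index holds the column names
via_transpose = s.to_frame().T   # one-row DataFrame, but columns become dtype object
via_slice = df.iloc[[0]]         # one-row DataFrame that keeps the original dtypes

# One single-row DataFrame per row of df.
rows = [df.iloc[[i]] for i in range(len(df))]
```

If the receiving function cares about numeric dtypes, the `df.iloc[[i]]` form is probably the safer choice.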
I have two large dataframes; here are small samples.
first
firstnames|lastnames|age
tom|form|24
bob|lip|36
....
second
firstnames|lastnames|age
mary|gu|24
jane|lip|36
...
I would like to take both dataframes and combine them into one that looks like:
firstnames|lastnames|age
tom|form|24
bob|lip|36
mary|gu|24
jane|lip|36
...
Now, I could write them both out and then read them back in together, but that's a huge waste.
If both dataframes are identical in structure, then it's straightforward: use union().
df1.union(df2)
If either dataframe is missing a column, you have to add a dummy column to that dataframe at the matching position; otherwise union will throw a column mismatch exception. In the example below, column 'c3' is missing in df1, so I add a dummy column to df1 in the last position.
from pyspark.sql.functions import lit
# union matches columns by position, so the placeholder goes last and is named c3
df1.select('c1','c2',lit('dummy').alias('c3')).union(df2.select('c1','c2','c3'))
This is as simple as shown here: union https://docs.databricks.com/spark/latest/faq/append-a-row-to-rdd-or-dataframe.html
I have two spark datasets that I'm trying to join. The join keys are nested in dataset A, so I must flatmap it out first before joining with dataset B. The problem is that as soon as I flatmap that field, the column name becomes the default "_1", "_2", etc. Is it possible to change the alias somehow?
A.flatMap(a => a.keys).join(B).where(...)
After applying a transformation like flatMap you lose the column names. That is logical: a transformation like flatMap or map does not guarantee that the number of columns, or the datatype of each column, stays the same, so the names cannot be carried over.
What you can do is fetch the previous column names and apply them to the resulting dataset, like this:
val columns = A.columns
A.flatMap(a => a.keys).toDF(columns: _*).join(B).where(...)
This will only work if the number of columns is the same after applying flatMap.
Hope this clears up your issue.
Thanks
How can we split a dataframe, operate on each split individually, and union the individual results back together?
Let's say I have a dataframe with the columns below. I need to split the dataframe by channel and operate on each split, which adds a new column called bucket. Then I need to union the results back.
account,channel,number_of_views
groupBy only allows simple aggregate operations. On each split dataframe I need to do feature extraction.
Currently all the feature transformers in spark-mllib support only a single dataframe.
You can split randomly like this:
val Array(training_data, validat_data, test_data) = raw_data_rating_before_spilt.randomSplit(Array(0.6,0.2,0.2))
This creates three DataFrames. Do what you need on each of them, then join or union them back together:
val finalDF = df1.join(df2, df1.col("col_name")===df2.col("col_name"))
You can also join more than two DataFrames by chaining joins.
Is this what you want, or something else?
I am trying to join two dataframes with the same column names and compute some new values. After that I need to drop all columns of the second table. The number of columns is huge. How can I do it in an easier way? I tried .drop("table2.*"), but this doesn't work.
You can use select with aliases:
df1.alias("df1")
.join(df2.alias("df2"), Seq("someJoinColumn"))
.select($"df1.*", $"someComputedColumn", ...)
Or reference the columns through the parent DataFrame:
df1.join(df2, Seq("someJoinColumn")).select(df1("*"), $"someComputedColumn", ...)
Instead of dropping, you can select just the columns you want to keep for further operations, something like this:
val newDataFrame = joinedDataFrame.select($"col1", $"col4", $"col6")