Spark Union vs adding columns using lit in spark - apache-spark

This is a Spark-related question. I have to add static data to various types of records, each type of record being processed as a different dataframe (say df1, df2, .. df6).
The static data that I intend to add has to be repeated across all 6 dataframes.
Which would be the more performant way:
For each of the 6 dataframes, use:
.witColumn("testA", lit("somethingA"))
.witColumn("testB", lit("somethingB"))
.witColumn("testC", lit("somethingC"))
or
Create a new DF, say staticDF, which has all the columns that I intend to append to each of the 6 dataframes, and use a union?
or
Any other option that I have not considered?

The first way is correct. The second way wouldn't work because union adds rows to a dataframe, not columns.
Another option is to use select to add all the new columns at the same time:
from pyspark.sql.functions import lit

df2 = df.select(
    '*',
    lit('somethingA').alias('testA'),
    lit('somethingB').alias('testB'),
    lit('somethingC').alias('testC')
)
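Since the same three literal columns have to be repeated across all 6 dataframes, one tidy option is to factor the select into a small helper and apply it to each dataframe. A minimal sketch, assuming the dataframes are named df1 .. df6 as in the question and with add_static_columns as a hypothetical helper:

from pyspark.sql import DataFrame
from pyspark.sql.functions import lit

def add_static_columns(df: DataFrame) -> DataFrame:
    # Adds the same static columns to any dataframe.
    return df.select(
        '*',
        lit('somethingA').alias('testA'),
        lit('somethingB').alias('testB'),
        lit('somethingC').alias('testC'),
    )

df1, df2, df3, df4, df5, df6 = [add_static_columns(d) for d in (df1, df2, df3, df4, df5, df6)]

Either way, adding literal columns is just a projection (no shuffle), so the performance difference between a withColumn chain and a single select is minor; the single select simply avoids stacking up several intermediate projections in the plan as the number of added columns grows.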

Related

Merge, Combine 2 columns in spark dataframe

I have 2 different dataframes and I was able to join them together based on g_id. Just like below:
df1 = dfx.join(df_gi, regexp_extract(trim(dfx.LOCATION), ".*/GDocs/([0-9]{1,5})/.*", 1) == df_gi.g_id, "inner")\
    .select(dfx["*"], df_gi["G_Number2"])
Now, the dfx dataframe has a column called G_Number1 and the df_gi dataframe has a similar column called G_Number2. Both of these columns combined solve the missing pieces ... meaning one column has some of the information and the other has the rest. Combining the two gives the output needed.
How can I achieve this in pyspark? I tried the concat function, but I was way off.
Thank you in advance.
You can use coalesce:
import pyspark.sql.functions as f
df.withColumn('Output', f.coalesce('G_Number2', 'G_Number1'))
Note that this will prioritize the G_Number2 column when both are non-null; if you need it the other way, just switch the order of the two columns.
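A minimal, self-contained illustration of how coalesce behaves (the toy data below is invented for the example; only the two column names come from the question):

import pyspark.sql.functions as f
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy data: each of the first two rows is missing one of the two columns.
df = spark.createDataFrame(
    [("A1", None), (None, "B2"), ("A3", "B3")],
    ["G_Number1", "G_Number2"],
)

# coalesce returns the first non-null value, left to right.
df.withColumn("Output", f.coalesce("G_Number2", "G_Number1")).show()
# Row 1 -> A1, row 2 -> B2, row 3 -> B3 (G_Number2 wins when both are present)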

How to create a single row pandas DataFrame with headers from a big pandas DataFrame

I have a DataFrame which contains 55000 rows and 3 columns.
I want to return every row as a DataFrame from this big DataFrame, to use it as a parameter to a different function.
My idea was to iterate over the big DataFrame with iterrows() or iloc, but I can't get each row as a DataFrame; it shows up as a Series. How can I solve this?
I think it is generally not necessary, because the index of the Series is the same as the columns of the DataFrame.
But it is possible by:
df1 = s.to_frame().T
Or:
df1 = pd.DataFrame([s.to_numpy()], columns=s.index)
Also, try to avoid iterrows, because it is very slow.
I suspect you're doing something suboptimal if you need what you describe. That said, if you need each row as a DataFrame:
l = [df.iloc[[i]] for i in range(len(df))]  # double brackets keep each row as a one-row DataFrame
This makes a list of one-row DataFrames, one per row of df.
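For concreteness, a tiny self-contained example of both conversions (the sample DataFrame here is invented):

import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

s = df.iloc[0]           # a single row as a Series (index = column names)
row_df = s.to_frame().T  # back to a one-row DataFrame with the original headers

# Equivalent without the transpose:
row_df2 = pd.DataFrame([s.to_numpy()], columns=s.index)

print(row_df)
print(row_df2)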

split, operate and union dataframe in spark

How can we split a dataframe, operate on the individual splits, and union all the individual results back together?
Let's say I have a dataframe with the columns below. I need to split the dataframe based on channel and operate on each individual split, which adds a new column called bucket. Then I need to union the results back together.
account,channel,number_of_views
groupBy only allows simple aggregate operations. On each split dataframe I need to do feature extraction.
Currently, all feature transformers in spark-mllib support only a single dataframe.
You can randomly split like this:
val Array(training_data, validat_data, test_data) = raw_data_rating_before_spilt.randomSplit(Array(0.6,0.2,0.2))
This will create 3 dataframes; do what you want with each, and then you can join or union them back:
val finalDF = df1.join(df2, df1.col("col_name")===df2.col("col_name"))
You can also join multiple dataframes at the same time.
Is this what you want, or something else?
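The answer above shows a random split in Scala; if the goal is specifically to split by the channel column, run a per-split transformation, and union the pieces back, a PySpark sketch could look like the following (the toy data and the add_bucket logic are placeholders, not the asker's actual feature extraction):

from functools import reduce
from pyspark.sql import DataFrame, SparkSession
import pyspark.sql.functions as f

spark = SparkSession.builder.getOrCreate()

# Toy data with the columns from the question.
df = spark.createDataFrame(
    [("a1", "web", 10), ("a2", "mobile", 5), ("a3", "web", 7)],
    ["account", "channel", "number_of_views"],
)

def add_bucket(split_df: DataFrame) -> DataFrame:
    # Placeholder for the per-split feature extraction; here it just tags a bucket.
    return split_df.withColumn(
        "bucket", f.when(f.col("number_of_views") > 6, "high").otherwise("low")
    )

channels = [r["channel"] for r in df.select("channel").distinct().collect()]
splits = [add_bucket(df.filter(f.col("channel") == c)) for c in channels]

# Union all the per-channel results back into one dataframe.
result = reduce(DataFrame.unionByName, splits)
result.show()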

Spark sql dataframe drop all columns from alias table after join

I am trying to join two dataframes with the same column names and compute some new values. After that I need to drop all columns of the second table. The number of columns is huge. How can I do it in an easier way? I tried .drop("table2.*"), but this doesn't work.
You can use select with aliases:
df1.alias("df1")
.join(df2.alias("df2"), Seq("someJoinColumn"))
.select($"df1.*", $"someComputedColumn", ...)
Or reference columns via the parent DataFrame:
df1.join(df2, Seq("someJoinColumn")).select(df1("*"), $"someComputedColumn", ...)
Instead of dropping columns, you can select just the columns that you want to keep for further operations, something like below:
val newDataFrame = joinedDataFrame.select($"col1", $"col4", $"col6")
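For reference, the same alias-and-select pattern in PySpark (someJoinColumn and someComputedColumn are placeholders carried over from the answer above, and df2.someValue is an invented stand-in for whatever you compute):

import pyspark.sql.functions as f

# Keep every column from df1 plus the newly computed column, and nothing from df2.
joined = (
    df1.alias("df1")
    .join(df2.alias("df2"), on="someJoinColumn")
    .select("df1.*", (f.col("df2.someValue") * 2).alias("someComputedColumn"))
)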

Spark Deduplicate column in dataframe based on column in other dataframe

I am trying to deduplicate values in a Spark dataframe column based on values in another dataframe column. It seems that withColumn() only works within a single dataframe, and subqueries won't be fully available until version 2. I suppose I could try to join the tables but that seems a bit messy. Here is the general idea:
df.take(1)
[Row(TIMESTAMP='20160531 23:03:33', CLIENT ID=233347, ROI NAME='my_roi', ROI VALUE=1, UNIQUE_ID='173888')]
df_re.take(1)
[Row(UNIQUE_ID='6866144:ST64PSIMT5MB:1')]
Basically, I just want to take the values from df, remove any that are found in df_re, and then return the whole dataframe with the rows containing those duplicates removed. I'm sure I could iterate over each one, but I am wondering if there is a better way.
Any ideas?
The way to do this is to do a left_outer join, and then filter for where the right-hand side of the join is empty. Something like:
val df1 = Seq((1,2),(2,123),(3,101)).toDF("uniq_id", "payload")
val df2 = Seq((2,432)).toDF("uniq_id", "other_data")

df1.as("df1").join(
  df2.as("df2"),
  col("df1.uniq_id") === col("df2.uniq_id"),
  "left_outer"
).filter($"df2.uniq_id".isNull)
