This is a Spark-related question. I have to add static data to various types of records, each type of record being processed as a different dataframe (say df1, df2, .. df6).
The static data that I intend to add has to be repeated across all 6 dataframes.
Which would be the more performant way:
For each of the 6 dataframes, use:
.witColumn("testA", lit("somethingA"))
.witColumn("testB", lit("somethingB"))
.witColumn("testC", lit("somethingC"))
or
Create a new DF, say staticDF, which has all the columns that I intend to append to each of the 6 dataframes, and use a union?
or
Any other option that I have not considered?
The first way is correct. The second way wouldn't work, because union adds rows to a dataframe, not columns.
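For instance, here is a minimal PySpark sketch of the first approach (assuming df1 through df6 already exist; the helper name is mine, not from the question):

from pyspark.sql.functions import lit

# Helper that appends the same static columns to any dataframe.
def with_static_cols(df):
    return (df.withColumn("testA", lit("somethingA"))
              .withColumn("testB", lit("somethingB"))
              .withColumn("testC", lit("somethingC")))

# Apply it to each of the six dataframes from the question.
df1, df2, df3, df4, df5, df6 = [with_static_cols(d) for d in (df1, df2, df3, df4, df5, df6)]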
Another way is to use select to select all new columns at the same time:
from pyspark.sql.functions import lit

df2 = df.select(
    '*',
    lit('somethingA').alias('testA'),
    lit('somethingB').alias('testB'),
    lit('somethingC').alias('testC')
)
Related
I have 2 different dataframes and I was able to join them together based on g_id. Just like below:
df1 = dfx.join(df_gi, regexp_extract(trim(dfx.LOCATION), ".*/GDocs/([0-9]{1,5})/.*", 1) == df_gi.g_id, "inner")\
    .select(dfx["*"], df_gi["G_Number2"])
Now, the dfx dataframe has a column called G_Number1 and the df_gi dataframe has a similar column called G_Number2. Both of these columns combined solve the missing pieces: one column has some of the information and the other has the rest, and combining the two gives the output I need.
How can I achieve this in PySpark? I tried the concat function, but I was way off.
Thank you in advance.
You can use coalesce:
import pyspark.sql.functions as f
df.withColumn('Output', f.coalesce('G_Number2', 'G_Number1'))
Notice this will prioritize the G_Number2 column when both are not null; if you need it the other way around, just switch the order of the two columns.
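A quick sketch with made-up data (assuming an active spark session) to show the behavior:

import pyspark.sql.functions as f

# Made-up rows: coalesce returns the first non-null value per row.
df = spark.createDataFrame(
    [("A1", None), (None, "B2"), ("A3", "B3")],
    ["G_Number1", "G_Number2"],
)
df.withColumn("Output", f.coalesce("G_Number2", "G_Number1")).show()
# Output column: A1, B2, B3 (G_Number2 wins in the last row)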
I have a DataFrame which contains 55000 rows and 3 columns.
I want to return every row as a DataFrame from this big DataFrame, for use as a parameter to different functions.
My idea was to iterate over the big DataFrame with iterrows() or iloc, but I can't get each row as a DataFrame: it comes back as a Series. How can I solve this?
I think it is usually not necessary, because the index of the Series is the same as the columns of the DataFrame.
But it is possible with:
df1 = s.to_frame().T
Or:
df1 = pd.DataFrame([s.to_numpy()], columns=s.index)
Also, try to avoid iterrows, because it is notoriously slow.
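A small sketch with a made-up frame, showing the difference between the Series you get from iloc and a one-row DataFrame:

import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
s = df.iloc[0]        # a Series; its index is df's columns ('a', 'b')
row_df = s.to_frame().T                                 # one-row DataFrame via transpose
row_df = pd.DataFrame([s.to_numpy()], columns=s.index)  # same result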
I suspect you're doing something suboptimal if you need what you describe. That said, if you need each row as a dataframe:
l = [df.iloc[[i]] for i in range(len(df))]
This makes a list of one-row dataframes, one for each row in df; iloc[[i]] (with a list) keeps each slice a DataFrame rather than a Series.
How can we split a dataframe, operate on each individual split, and union the results of all the individual dataframes back together?
Let's say I have a dataframe with the columns below. I need to split the dataframe based on channel and operate on each individual split, which adds a new column called bucket. Then I need to union the results back together.
account,channel,number_of_views
groupBy only allows simple aggregated operations. On each split dataframe I need to do feature extraction, and currently all feature transformers of spark-mllib support only a single dataframe.
You can randomly split like this:
val Array(training_data, validat_data, test_data) = raw_data_rating_before_spilt.randomSplit(Array(0.6,0.2,0.2))
This will create 3 dataframes. Do what you want with each of them, then you can join or union them back:
val finalDF = df1.join(df2, df1.col("col_name")===df2.col("col_name"))
You can also join multiple dataframes at the same time.
Is this what you want, or something else?
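For the original split-by-channel use case, here is a hedged PySpark sketch (Spark 2.3+ for unionByName; add_bucket is a stand-in for whatever feature extraction you run per split):

from functools import reduce
from pyspark.sql import functions as F

# Stand-in for the real per-split feature extraction.
def add_bucket(split_df):
    return split_df.withColumn("bucket", F.lit("some_bucket"))

# df is assumed to have the columns account, channel, number_of_views.
channels = [r["channel"] for r in df.select("channel").distinct().collect()]
splits = [add_bucket(df.filter(F.col("channel") == c)) for c in channels]
result = reduce(lambda a, b: a.unionByName(b), splits)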
I am trying to join two dataframes with the same column names and compute some new values. After that I need to drop all columns of the second table. The number of columns is huge. How can I do it in an easier way? I tried .drop("table2.*"), but this doesn't work.
You can use select with aliases:
df1.alias("df1")
.join(df2.alias("df2"), Seq("someJoinColumn"))
.select($"df1.*", $"someComputedColumn", ...)
or reference columns through the parent DataFrame:
df1.join(df2, Seq("someJoinColumn")).select(df1("*"), $"someComputedColumn", ...)
Instead of dropping, you can select all the necessary columns that you want to keep for further operations, something like below:
val newDataFrame = joinedDataFrame.select($"col1", $"col4", $"col6")
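If you are on PySpark, the aliased-select pattern looks much the same (someComputedColumn and the colA columns are placeholders, not from the question):

from pyspark.sql.functions import col

result = (
    df1.alias("df1")
       .join(df2.alias("df2"), on="someJoinColumn")
       .select("df1.*", (col("df1.colA") + col("df2.colA")).alias("someComputedColumn"))
)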
I am trying to deduplicate values in a Spark dataframe column based on values in another dataframe column. It seems that withColumn() only works within a single dataframe, and subqueries won't be fully available until version 2. I suppose I could try to join the tables but that seems a bit messy. Here is the general idea:
df.take(1)
[Row(TIMESTAMP='20160531 23:03:33', CLIENT ID=233347, ROI NAME='my_roi', ROI VALUE=1, UNIQUE_ID='173888')]
df_re.take(1)
[Row(UNIQUE_ID='6866144:ST64PSIMT5MB:1')]
Basically I just want to take the values from df, remove any that are found in df_re, and then return the whole dataframe with the rows containing those duplicates removed. I'm sure I could iterate each one, but I am wondering if there is a better way.
Any ideas?
The way to do this is to do a left_outer join, and then filter for where the right-hand side of the join is empty. Something like:
import org.apache.spark.sql.functions.col
import spark.implicits._

val df1 = Seq((1,2),(2,123),(3,101)).toDF("uniq_id", "payload")
val df2 = Seq((2,432)).toDF("uniq_id", "other_data")

df1.as("df1").join(
  df2.as("df2"),
  col("df1.uniq_id") === col("df2.uniq_id"),
  "left_outer"
).filter($"df2.uniq_id".isNull)
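On Spark 2.0 and later there is also a built-in anti join, which keeps exactly the left-side rows with no match on the right; with the UNIQUE_ID column from the question it reduces to one line (PySpark shown here):

# Rows of df whose UNIQUE_ID appears in df_re are dropped.
result = df.join(df_re, on="UNIQUE_ID", how="left_anti")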