Merge two columns in a Spark dataframe to form a single column - apache-spark

I have a Spark dataframe with two columns, src_edge and dest_edge. I simply want to create a new Spark dataframe that contains a single column id with the values from src_edge and dest_edge.
src dst
  1   2
  1   3
I want to create df2 as:
id
1
1
2
3
If possible, I would also like to create df2 with no duplicate values. Does anyone have any idea how to do this?
id
1
2
3

Update
The simplest thing may be to select each column, union them, and call distinct:
from pyspark.sql.functions import col
df2 = df.select(col("src").alias("id")).union(df.select(col("dst").alias("id"))).distinct()
df2.show()
#+---+
#| id|
#+---+
#| 1|
#| 3|
#| 2|
#+---+
You can also accomplish this with an outer join:
df2 = df.select(col("src").alias("id"))\
.join(
df.select(col("dst").alias("id")),
on="id",
how="outer"
)\
.distinct()

Create a new column using array and explode to combine and flatten the two columns. Then, to remove duplicates use dropDuplicates:
from pyspark.sql.functions import array, explode
df2 = df.select(explode(array("src", "dst")).alias("id")) \
    .dropDuplicates()
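For reference, here is a minimal end-to-end sketch of that approach on the sample data from the question (it assumes an existing SparkSession named spark; the row order of the result is not guaranteed):
from pyspark.sql.functions import array, explode

# sample data from the question
df = spark.createDataFrame([(1, 2), (1, 3)], ["src", "dst"])

# flatten both columns into a single "id" column and drop duplicates
df2 = df.select(explode(array("src", "dst")).alias("id")).dropDuplicates()
df2.show()
# expected values (in some order): 1, 2, 3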

Related

Pyspark replace string in every column name

I am converting Pandas commands into Spark ones. I bumped into wanting to convert this line into Apache Spark code:
This line replaces every two spaces with one.
df = df.columns.str.replace('  ', ' ')
Is it possible to replace a string in all column names using Spark?
I came up with this, but it is not quite right:
df = df.withColumnRenamed('--', '-')
To be clear, I want to change this:
//+---+----------------------+-----+
//|id |address__test |state|
//+---+----------------------+-----+
to this
//+---+----------------------+-----+
//|id |address_test |state|
//+---+----------------------+-----+
You can apply the replace method on all columns by iterating over them and then selecting, like so:
df = spark.createDataFrame([(1, 2, 3)], "id: int, address__test: int, state: int")
df.show()
+---+-------------+-----+
| id|address__test|state|
+---+-------------+-----+
|  1|            2|    3|
+---+-------------+-----+
from pyspark.sql.functions import col
new_cols = [col(c).alias(c.replace("__", "_")) for c in df.columns]
df.select(*new_cols).show()
+---+------------+-----+
| id|address_test|state|
+---+------------+-----+
|  1|           2|    3|
+---+------------+-----+
As a side note: calling withColumnRenamed makes Spark create a Projection for each call, while a select creates just a single Projection; hence, for a large number of columns, select will be much faster.
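If you want to check this yourself, you can compare the plans Spark builds for the two approaches (a rough sketch; the exact plan output depends on your Spark version):
from pyspark.sql.functions import col

# chained withColumnRenamed: one projection per rename call
df_renamed = df
for c in df.columns:
    df_renamed = df_renamed.withColumnRenamed(c, c.replace("__", "_"))
df_renamed.explain()

# single select: one projection covering every rename
df.select(*[col(c).alias(c.replace("__", "_")) for c in df.columns]).explain()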
Here's a suggestion.
We get all the target columns:
columns_to_edit = [col for col in df.columns if "__" in col]
Then we use a for loop to edit them all one by one:
for column in columns_to_edit:
    new_column = column.replace("__", "_")
    df = df.withColumnRenamed(column, new_column)
Would this solve your issue?
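If you prefer not to reassign df inside a loop, the same renaming can be folded into a single expression with functools.reduce (just a sketch of an alternative way to write it, not a different result):
from functools import reduce

columns_to_edit = [c for c in df.columns if "__" in c]
df = reduce(
    lambda acc, c: acc.withColumnRenamed(c, c.replace("__", "_")),
    columns_to_edit,
    df,
)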

pyspark value of column when other column has first nonmissing value

Suppose I have the following pyspark dataframe df:
id  date  var1  var2
 1     1  NULL     2
 1     2     b     3
 2     1     a  NULL
 2     2     a     1
I want the first non-missing observation for all var* columns and, additionally, the value of date it comes from, i.e. the final result should look like:
id  var1  dt_var1  var2  dt_var2
 1     b        2     2        1
 2     a        1     1        2
Getting the values is straightforward using
df.orderBy(['id', 'date']).groupby('id').agg(
    *[F.first(x, ignorenulls=True).alias(x) for x in ['var1', 'var2']]
)
But I fail to see how I could get the respective dates. I could loop variable by variable, drop the missing rows, and keep the first row. But that sounds like a poor solution that will not scale well, as it would require a separate dataframe for each variable.
I would prefer a solution that scales to many columns (var3, var4, ...).
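For illustration, the per-variable loop I have in mind would look roughly like this (a sketch only; it builds one aggregate per var column and joins them all back together):
from functools import reduce
import pyspark.sql.functions as F

var_cols = ["var1", "var2"]
parts = []
for v in var_cols:
    # one aggregated dataframe per variable: drop its missing rows, keep the first value and its date
    part = (df.dropna(subset=[v])
              .orderBy("id", "date")
              .groupby("id")
              .agg(F.first(v).alias(v), F.first("date").alias(f"dt_{v}")))
    parts.append(part)

result = reduce(lambda a, b: a.join(b, on="id", how="outer"), parts)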
You should not use groupby if you want to get the first non-null value according to the date ordering. The order is not guaranteed after a groupby operation, even if you called orderBy just before.
You need to use window functions instead. To get the date associated with each var value you can use this trick with structs:
from pyspark.sql import Window, functions as F
w = (Window.partitionBy("id").orderBy("date")
     .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing))

df1 = df.select(
    "id",
    *[F.first(
          F.when(F.col(x).isNotNull(), F.struct(x, F.col("date").alias(f"dt_{x}"))),
          ignorenulls=True
      ).over(w).alias(x)
      for x in ["var1", "var2"]]
).distinct().select("id", "var1.*", "var2.*")
df1.show()
#+---+----+-------+----+-------+
#| id|var1|dt_var1|var2|dt_var2|
#+---+----+-------+----+-------+
#|  1|   b|      2|   2|      1|
#|  2|   a|      1|   1|      2|
#+---+----+-------+----+-------+
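To make this scale to var3, var4, and so on, the column list can be derived instead of hard-coded (a sketch that reuses w and F from the snippet above and assumes every value column starts with the prefix "var"):
var_cols = [c for c in df.columns if c.startswith("var")]

df1 = df.select(
    "id",
    *[F.first(
          F.when(F.col(x).isNotNull(), F.struct(x, F.col("date").alias(f"dt_{x}"))),
          ignorenulls=True
      ).over(w).alias(x)
      for x in var_cols]
).distinct().select("id", *[f"{x}.*" for x in var_cols])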

Promote Row 1 as Column Heading - Spark DataFrame

I have a Spark DataFrame like the one below.
I want to promote row 1 to the column headings in a new Spark DataFrame.
I know this can be done easily in pandas:
new_header = pandaDF.iloc[0]
pandaDF = pandaDF[1:]
pandaDF.columns = new_header
But I don't want to convert to a pandas DF, since I have to persist this to a database, which would mean converting the pandas DF back to a Spark DF, registering it as a table, and then writing it to the DB.
Try with .toDF and then filter out the row that holds the column values.
Example:
#sample dataframe
df.show()
#+----------+------------+----------+
#|    prop_0|      prop_1|    prop_2|
#+----------+------------+----------+
#|station_id|station_name|sample_num|
#|       101|  Station101| Sample101|
#|       102|  Station102| Sample102|
#+----------+------------+----------+
from pyspark.sql.functions import *
# use the values of the first row as the new column names
cols = [str(x) for x in df.first()]
df.toDF(*cols).filter(~col("station_id").isin(*cols)).show()
#+----------+------------+----------+
#|station_id|station_name|sample_num|
#+----------+------------+----------+
#|       101|  Station101| Sample101|
#|       102|  Station102| Sample102|
#+----------+------------+----------+
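If you would rather not hard-code the station_id column, a slightly more general sketch takes the new names from the first row and then filters that row out (this assumes the header values really sit in the first row returned and do not collide with real data):
from pyspark.sql.functions import col

header = df.first()                # Row holding the desired column names
new_df = df.toDF(*header)          # promote the row's values to column names
new_df = new_df.filter(col(new_df.columns[0]) != header[0])
new_df.show()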

How to use"select" and "withColumn" together- Pyspark

I have two dataframes, df1 and df2.
I have to join the two dataframes and create a new one.
The join is carried out using df1.col1 == df2.col1 as an inner join.
My question here is: can I use "select" and "withColumn" statements together?
For example:
df3 = df1.join(df2, df1.col1 == df2.col1, 'inner') \
    .select(df1.col4, df2.col4) \
    .withColumn("col2", (df1.col1 + df2.col2)) \
    .withColumn("col3", (df1.col1 / df2.col2))
How can I achieve this? Separately, select and withColumn work.
You need to select all the required columns in .select; only those columns will then be available to .withColumn.
Example:
df1=spark.createDataFrame([("a","1","4","t"),("b","2","5","v"),("c","3","6","v")],["col1","col2","col3","col4"])
df2=spark.createDataFrame([("a","1","4","ord2"),("b","2","5","ord1"),("c","3","6","ord3")],["col1","col2","col3","col4"])
df1.join(df2, df1.col1 == df2.col1, 'inner') \
    .select(df1.col1, df2.col2, df1.col3, df1.col2, df2.col4) \
    .withColumn("col3", (df1.col3 / df2.col2).cast("double")) \
    .withColumn("col2", (df1.col2 + df2.col2).cast("int")) \
    .show()
#+----+----+----+----+----+
#|col1|col2|col3|col2|col4|
#+----+----+----+----+----+
#|   a|   2| 4.0|   2|ord2|
#|   b|   4| 2.5|   4|ord1|
#|   c|   6| 2.0|   6|ord3|
#+----+----+----+----+----+
Perhaps you want to rearrange the order of your operations. Out of all the columns in the dataframe, select filters down to the ones you list. If you intend to use withColumn, make sure the columns it references are available (selected). As a rule of thumb, leave select statements at the end of your transformations.
# make sure to use the keyword attributes so you don't get confused
df3 = df1.join(df2, on='col1', how='inner') \
    .withColumn("col2", (df1.col2 + df2.col2)) \
    .withColumn("col3", (df2.col3 / df1.col2)) \
    .select('col1', 'col2', 'col3', df2.col4)
To see what is happening in each of the transformations, add a .show() statement after each step and it will all be much clearer.

Trim in a Pyspark Dataframe

I have a Pyspark dataframe (the original dataframe) with the data below (all columns have string datatype). In my use case I am not sure which columns are present in this input dataframe. The user just passes me the name of the dataframe and asks me to trim all of its columns. The data in a typical dataframe looks like this:
id  Value    Value1
 1  "Text "  "Avb"
 2  1504     " Test"
 3  1        2
Is there any way I can do this without depending on which columns are present in the dataframe, and get all of its columns trimmed? After trimming all the columns, the data should look like:
id  Value   Value1
 1  "Text"  "Avb"
 2  1504    "Test"
 3  1       2
Can someone help me out? How can I achieve this with a Pyspark dataframe? Any help will be appreciated.
input:
df.show()
+---+-----+------+
| id|Value|Value1|
+---+-----+------+
|  1|Text |   Avb|
|  2| 1504|  Test|
|  3|    1|     2|
+---+-----+------+
Code:
import pyspark.sql.functions as func
for col in df.columns:
    df = df.withColumn(col, func.ltrim(func.rtrim(df[col])))
Output:
df.show()
+---+-----+------+
| id|Value|Value1|
+---+-----+------+
|  1| Text|   Avb|
|  2| 1504|  Test|
|  3|    1|     2|
+---+-----+------+
Using the trim() function from @osbon123's answer:
from pyspark.sql.functions import col, trim

for c_name in df.columns:
    df = df.withColumn(c_name, trim(col(c_name)))
You should avoid using withColumn in a loop because each call creates a new DataFrame, which is time-consuming for very large dataframes. I created the following function based on this solution, but it works with any dataframe, even one with a mix of string and non-string columns.
from pyspark.sql import DataFrame, functions as F
from pyspark.sql.types import StringType

def trim_string_columns(of_data: DataFrame) -> DataFrame:
    data_trimmed = of_data.select([
        (F.trim(c.name).alias(c.name) if isinstance(c.dataType, StringType) else c.name)
        for c in of_data.schema
    ])
    return data_trimmed
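A quick usage sketch on the sample data from this question (assuming an existing SparkSession named spark):
df = spark.createDataFrame(
    [("1", "Text ", "Avb"), ("2", "1504", " Test"), ("3", "1", "2")],
    ["id", "Value", "Value1"],
)
trim_string_columns(df).show()
# the leading/trailing spaces in the string columns are removed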
This is the cleanest (and most computationally efficient) way I've seen to remove all spaces from all column names. If you want underscores in place of the spaces, simply replace "" with "_".
# Standardize column names: drop spaces (swap "" for "_" to use underscores instead)
new_column_name_list = list(map(lambda x: x.replace(" ", ""), df.columns))
df = df.toDF(*new_column_name_list)
You can use the dtypes function in the DataFrame API to get the list of column names along with their datatypes, and then, for all string columns, use the "trim" function to trim the values.
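A minimal sketch of that suggestion, using df.dtypes to pick out only the string columns and trim just those (non-string columns pass through unchanged):
from pyspark.sql.functions import col, trim

string_cols = {name for name, dtype in df.dtypes if dtype == "string"}
df = df.select(*[
    trim(col(c)).alias(c) if c in string_cols else col(c)
    for c in df.columns
])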
Regards,
Neeraj
