I am using PySpark.
My input data looks like below:
COL1|COL2
TYCO|130003
EMC|120989
VOLVO|102329
BMW|130157
FORD|503004
TYCO|130003
I have created a DataFrame and am querying it for duplicates like below.
from pyspark.sql import Row
from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.appName("Test") \
.getOrCreate()
# Read the pipe-delimited file and use the first line as the header
data = spark.read.csv("filepath", sep="|", header=True)
data.createOrReplaceTempView("data")
spark.sql("SELECT count(col2) AS CNT, col2 FROM data GROUP BY col2").show()
This gives the correct result, but can we get the duplicate values in a separate temp table?
Output data in temp1:
+----+------+
| CNT|  col2|
+----+------+
|   1|120989|
|   1|102329|
|   1|130157|
|   1|503004|
+----+------+
Output data in temp2:
+----+------+
| CNT|  col2|
+----+------+
|   2|130003|
+----+------+
sqlDF = spark.sql("SELECT count(col2) AS CNT, col2 FROM data GROUP BY col2 HAVING count(col2) > 1")
sqlDF.createOrReplaceTempView("temp2")
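The question also asks for the non-duplicate values in a separate temp1; the same pattern with the opposite HAVING condition should work. A minimal sketch, assuming the data view registered above:

# Values of col2 that occur exactly once go to temp1 (duplicates already go to temp2 above)
uniqueDF = spark.sql("SELECT count(col2) AS CNT, col2 FROM data GROUP BY col2 HAVING count(col2) = 1")
uniqueDF.createOrReplaceTempView("temp1")
spark.sql("SELECT * FROM temp1").show()
spark.sql("SELECT * FROM temp2").show()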
I have an initial PySpark DataFrame from which I would like to take the MIN and MAX of a date column, and then create a new PySpark DataFrame containing a daily time series between that MIN and MAX.
I will then join it with my initial DataFrame to find the missing days (null in the rest of the columns of my initial DF).
I tried many different ways to build the time series DF, but it doesn't seem to work in PySpark. Any suggestions?
The max value of a column can be extracted like this:
df.agg(F.max('col_name')).head()[0]
A date range df can be created like this:
df2 = spark.sql("SELECT sequence(to_date('2000-01-01'), to_date('2000-02-02'), interval 1 day) as date_col").withColumn('date_col', F.explode('date_col'))
And then join.
Full example:
from pyspark.sql import SparkSession, functions as F
spark = SparkSession.builder.getOrCreate()
df1 = spark.createDataFrame([(1, '2022-04-01'),(2, '2022-04-05')], ['id', 'df1_date']).select('id', F.col('df1_date').cast('date'))
df1.show()
# +---+----------+
# | id| df1_date|
# +---+----------+
# | 1|2022-04-01|
# | 2|2022-04-05|
# +---+----------+
min_date = df1.agg(F.min('df1_date')).head()[0]
max_date = df1.agg(F.max('df1_date')).head()[0]
df2 = spark.sql(f"SELECT sequence(to_date('{min_date}'), to_date('{max_date}'), interval 1 day) as df2_date").withColumn('df2_date', F.explode('df2_date'))
df3 = df2.join(df1, df1.df1_date == df2.df2_date, 'left')
df3.show()
# +----------+----+----------+
# | df2_date| id| df1_date|
# +----------+----+----------+
# |2022-04-01| 1|2022-04-01|
# |2022-04-02|null| null|
# |2022-04-03|null| null|
# |2022-04-04|null| null|
# |2022-04-05| 2|2022-04-05|
# +----------+----+----------+
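The question also mentions finding the missing days; with df3 from the example above, those are simply the generated dates whose join found no match. A small follow-up sketch:

missing_days = df3.filter(F.col('id').isNull()).select('df2_date')
missing_days.show()
# +----------+
# |  df2_date|
# +----------+
# |2022-04-02|
# |2022-04-03|
# |2022-04-04|
# +----------+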
This may be a very simple question. I want to transpose all the rows of a DataFrame to columns, i.e. convert the input DF shown below into the output DF. What are the ways in Spark to achieve this?
Note: I have a single column in the input DF.
import sparkSession.sqlContext.implicits._
val df = Seq(("row1"), ("row2"), ("row3"), ("row4"), ("row5")).toDF("COLUMN_NAME")
df.show(false)
Input DF:
+-----------+
|COLUMN_NAME|
+-----------+
|row1 |
|row2 |
|row3 |
|row4 |
|row5 |
+-----------+
Output DF
+----+----+----+----+----+
|row1|row2|row3|row4|row5|
+----+----+----+----+----+
Does this help you?
import org.apache.spark.sql.functions._
df.withColumn("group", monotonically_increasing_id()).groupBy("group").pivot("COLUMN_NAME").agg(first("COLUMN_NAME")).show()
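For anyone doing the same thing from PySpark, a rough sketch of the same pivot trick (my own translation, not the answer's code):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("row1",), ("row2",), ("row3",), ("row4",), ("row5",)], ["COLUMN_NAME"])

# Each distinct COLUMN_NAME value becomes its own column after the pivot.
# monotonically_increasing_id() gives every input row its own group, so the result
# has one row per input value (nulls elsewhere); use F.lit(1) as the group instead
# if you want everything collapsed into a single output row.
transposed = (df.withColumn("group", F.monotonically_increasing_id())
                .groupBy("group")
                .pivot("COLUMN_NAME")
                .agg(F.first("COLUMN_NAME"))
                .drop("group"))
transposed.show()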
I have a DataFrame named DF (with columns id, nm, txt and uppertxt) and the below code:
from pyspark.sql import Row

def func(row):
    # Turn the row into a dict and add a pipe-delimited concatenation of all values
    temp = row.asDict()
    temp["concat_val"] = "|".join([str(x) for x in row])
    return Row(**temp)

DF.show()
row_rdd = DF.rdd.map(func)
concat_df = row_rdd.toDF()
concat_df.show()
With this I am getting a result where concat_val contains the values of all the columns. However, I want an output in which the id and nm column values are removed from the concat_val column, so that concat_val contains only the txt and uppertxt values. Please suggest a way to remove the id and nm values.
So here you are trying to concat the columns txt and uppertxt, with the values delimited by "|". You can try the below code.
# Load required libraries
from pyspark.sql.functions import *
# Create DataFrame
df = spark.createDataFrame([(1,"a","foo","qwe"), (2,"b","bar","poi"), (3,"c","mnc","qwe")], ["id", "nm", "txt", "uppertxt"])
# Concat column txt and uppertxt delimited by "|"
# Approach - 1 : using concat function.
df1 = df.withColumn("concat_val", concat(df["txt"] , lit("|"), df["uppertxt"]))
# Approach - 2 : Using concat_ws function
df1 = df.withColumn("concat_val", concat_ws("|", df["txt"] , df["uppertxt"]))
# Display Output
df1.show()
Output
+---+---+---+--------+----------+
| id| nm|txt|uppertxt|concat_val|
+---+---+---+--------+----------+
| 1| a|foo| qwe| foo|qwe|
| 2| b|bar| poi| bar|poi|
| 3| c|mnc| qwe| mnc|qwe|
+---+---+---+--------+----------+
You can find more info on concat and concat_ws in the Spark docs.
I hope this helps.
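If the set of columns to concatenate may change, one option (a sketch using the same df as above) is to build the concat_ws call from every column except id and nm:

from pyspark.sql.functions import concat_ws, col

# Concatenate every column except id and nm, delimited by "|"
cols_to_concat = [c for c in df.columns if c not in ("id", "nm")]
df2 = df.withColumn("concat_val", concat_ws("|", *[col(c) for c in cols_to_concat]))
df2.show()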
I want to add a column to a Spark DataFrame which has been registered as a table. This column needs to hold an auto-incrementing long.
df = spark.sql(query)
df.createOrReplaceTempView("user_stories")
df = spark.sql("ALTER TABLE user_stories ADD COLUMN rank int AUTO_INCREMENT")
df.show(5)
This throws the following error,
Py4JJavaError: An error occurred while calling o72.sql.
: org.apache.spark.sql.catalyst.parser.ParseException:
no viable alternative at input 'ALTER TABLE user_stories ADD COLUMN'(line 1, pos 29)
== SQL ==
ALTER TABLE user_stories ADD COLUMN rank int AUTO_INCREMENT
-----------------------------^^^
What am I missing here?
If you want to add a new incremental column to the DF, you can do it in the following way.
df.show()
+-------+
| name|
+-------+
|gaurnag|
+-------+
from pyspark.sql.functions import monotonically_increasing_id
new_df = df.withColumn("id", monotonically_increasing_id())
new_df.show()
+-------+---+
| name| id|
+-------+---+
|gaurnag| 0|
+-------+---+
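Note that monotonically_increasing_id() produces unique, increasing IDs but not necessarily consecutive ones (the values can jump between partitions). If you need a strictly consecutive 1, 2, 3, ... rank, one common workaround is row_number over a window; a sketch (it pulls all rows into a single partition, so only use it on reasonably small DataFrames):

from pyspark.sql.functions import row_number, monotonically_increasing_id
from pyspark.sql.window import Window

# row_number assigns consecutive values; ordering by monotonically_increasing_id
# roughly preserves the original row order but forces everything into one partition.
w = Window.orderBy(monotonically_increasing_id())
new_df = df.withColumn("rank", row_number().over(w))
new_df.show()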
I want to merge several DataFrames that have a few different columns.
Suppose:
DataFrame A has 3 columns: Column_1, Column_2, Column_3
DataFrame B has 3 columns: Column_1, Column_2, Column_4
DataFrame C has 3 columns: Column_1, Column_2, Column_5
I want to merge these DataFrames such that I get a DataFrame with the columns:
Column_1, Column_2, Column_3, Column_4, Column_5
The number of DataFrames may increase. Is there any way to do this merge such that, for a particular Column_1/Column_2 combination, I get the values of the other three columns in the same row, and if for a particular Column_1/Column_2 combination there is no data in some columns, it shows null there?
DataFrame A:
Column_1 Column_2 Column_3
1 x abc
2 y def
DataFrame B:
Column_1 Column_2 Column_4
1 x xyz
2 y www
3 z sdf
The merge of A and B:
Column_1 Column_2 Column_3 Column_4
1 x abc xyz
2 y def www
3 z null sdf
If I understand your question correctly, you'll need to perform an outer join using a sequence of columns as keys.
I have used the data provided in your question to illustrate how it is done with an example:
scala> val df1 = Seq((1,"x","abc"),(2,"y","def")).toDF("Column_1","Column_2","Column_3")
// df1: org.apache.spark.sql.DataFrame = [Column_1: int, Column_2: string, Column_3: string]
scala> val df2 = Seq((1,"x","xyz"),(2,"y","www"),(3,"z","sdf")).toDF("Column_1","Column_2","Column_4")
// df2: org.apache.spark.sql.DataFrame = [Column_1: int, Column_2: string, Column_4: string]
scala> val df3 = df1.join(df2, Seq("Column_1","Column_2"), "outer")
// df3: org.apache.spark.sql.DataFrame = [Column_1: int, Column_2: string, Column_3: string, Column_4: string]
scala> df3.show
// +--------+--------+--------+--------+
// |Column_1|Column_2|Column_3|Column_4|
// +--------+--------+--------+--------+
// | 1| x| abc| xyz|
// | 2| y| def| www|
// | 3| z| null| sdf|
// +--------+--------+--------+--------+
This is an equi-join with another DataFrame using the given columns. Unlike other join functions, the join columns will only appear once in the output, i.e. similar to SQL's JOIN USING syntax.
Note: outer equi-joins are available since Spark 1.6.
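Since the number of DataFrames may grow, the same outer join can also be folded over a whole list of them. A rough PySpark sketch (the row for DataFrame C below is made up purely for illustration, since the question only shows data for A and B):

from functools import reduce
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data from the question; DataFrame C's row is hypothetical
dfA = spark.createDataFrame([(1, "x", "abc"), (2, "y", "def")], ["Column_1", "Column_2", "Column_3"])
dfB = spark.createDataFrame([(1, "x", "xyz"), (2, "y", "www"), (3, "z", "sdf")], ["Column_1", "Column_2", "Column_4"])
dfC = spark.createDataFrame([(1, "x", "c5_val")], ["Column_1", "Column_2", "Column_5"])

# Fold an outer join on the shared key columns over however many DataFrames there are
merged = reduce(lambda left, right: left.join(right, ["Column_1", "Column_2"], "outer"), [dfA, dfB, dfC])
merged.show()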
First use the following for all three DataFrames, so that SQL queries can be run against them:
DF1.createOrReplaceTempView("df1view")
DF2.createOrReplaceTempView("df2view")
DF3.createOrReplaceTempView("df3view")
Then use this join command to merge:
val intermediateDF = spark.sql("SELECT a.column_1, a.column_2, a.column_3, b.column_4 FROM df1view a LEFT JOIN df2view b ON a.column_1 = b.column_1 AND a.column_2 = b.column_2")
intermediateDF.createOrReplaceTempView("imDFview")
val resultDF = spark.sql("SELECT a.column_1, a.column_2, a.column_3, a.column_4, b.column_5 FROM imDFview a LEFT JOIN df3view b ON a.column_1 = b.column_1 AND a.column_2 = b.column_2")
These joins can also be done together in a single join. Also, since you want all values of Column_1 and Column_2, you can use a full outer join instead of a left join.
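For reference, a PySpark-style sketch of that single full-outer-join query (reusing the temp views registered above; the coalesce calls keep key combinations that exist in only some of the views):

# A sketch, assuming df1view/df2view/df3view are registered as above.
# FULL OUTER JOIN keeps every Column_1/Column_2 combination from all three views;
# coalesce picks the key value from whichever side is non-null.
full_outer_df = spark.sql("""
    SELECT coalesce(a.column_1, b.column_1, c.column_1) AS column_1,
           coalesce(a.column_2, b.column_2, c.column_2) AS column_2,
           a.column_3, b.column_4, c.column_5
    FROM df1view a
    FULL OUTER JOIN df2view b
      ON a.column_1 = b.column_1 AND a.column_2 = b.column_2
    FULL OUTER JOIN df3view c
      ON coalesce(a.column_1, b.column_1) = c.column_1
     AND coalesce(a.column_2, b.column_2) = c.column_2
""")
full_outer_df.show()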