How to add Id to all the rows of dataframe in spark - apache-spark

Hi, my csv file structure is like this:
File-1
id col1 col2
a_1 sd fd
ds emd
sf jhk
File-2
id col1 col2
b_2 sd fd
ds emd
sf jhk
Now, when I load my csv files into the dataframe, I want the id column for all the rows of file 1 to have the value a_1 and for file 2 the value b_2. So my dataframe should look like:
id col1 col2
a_1 sd fd
a_1 ds emd
a_1 sf jhk
b_2 sd fd
b_2 ds emd
b_2 sf jhk
I want to do this so I can identify the rows by file id when I am reading multiple csv files. Please note that I don't want to add the filename as the id; I want to take the id value from the first row of each file and extend it to all the corresponding rows of that file in the dataframe.

If you are sure the id is going to be in the first row, you can extract it and then apply it to every row.
Below is pseudo code:
file1_id = df_file1.filter(col('id').isNotNull()).select('id').collect()[0][0]  # first non-null id value in the file
and then use the above calculated id for the file as
df_file1 = df_file1.drop('id').withColumn('id', lit(file1_id))
Follow the same for the second dataframe df_file2
then do a union
df_file = df_file1.union(df_file2)
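A minimal end-to-end sketch of this approach (the paths file1.csv and file2.csv and the header option are assumptions based on the question):
from functools import reduce
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit

spark = SparkSession.builder.getOrCreate()

def load_with_file_id(path):
    # read one csv and copy the id found in its only non-null id cell onto every row
    df = spark.read.option("header", True).csv(path)
    file_id = df.filter(col("id").isNotNull()).select("id").collect()[0][0]
    return df.drop("id").withColumn("id", lit(file_id))

# hypothetical file list; extend it for as many csv files as needed
frames = [load_with_file_id(p) for p in ["file1.csv", "file2.csv"]]
df_file = reduce(lambda a, b: a.union(b), frames)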

Related

How can I count unique values and groupby?

I have been trying to count and group, per row, the number of unique values. Perhaps it will be easier to explain by showing a table. Should I first transpose before counting and grouping?
Box1     Box2     Box3     Count Result 1   Count Result 2   Count Result 3
Data A   Data A   Data B   Data A = 2       Data B = 1
Data C   Data D   Data B   Data C = 1       Data D = 1       Data B = 1
In Google Sheets try:
=ARRAYFORMULA(TRIM(SPLIT(FLATTEN(QUERY(QUERY(
QUERY(SPLIT(FLATTEN(A2:C3&" = ×"&ROW(A2:C3)), "×"),
"select max(Col1) group by Col1 pivot Col2")&
QUERY(SPLIT(FLATTEN(A2:C3&" = ×"&ROW(A2:C3)), "×"),
"select count(Col1) group by Col1 pivot Col2")&"​",
"offset 1", ),,9^9)), "​")))

Spark: How to merge two similar columns from two DataFrames in one column by doing join?

I have an SQL table that I have to update using data that I calculate.
For this purpose, I build a DataFrame.
So I have two DataFrames: one that I calculate and one that I get from the database.
val myDF = spark.read.<todo something>.load()
val dbDF = spark.read.format("jdbc").<...>.load()
Finally, both DataFrames have the same structure.
For example:
myDF
key   column
key1  1
key2  2
key3  3

dbDF
key   column
key1  5
key2  5
key3  5
I need to get a new DF that will have only one column named column.
newDF
key   column
key1  6
key2  7
key3  8
For this purpose, I do the following:
myDF
.as("left")
.join(dbDF.as("right"), "key")
.withColumn("column_temp", $"left.column" + $"right.column")
.drop($"left.column")
.drop($"right.column")
.withColumnRenamed("column_temp", "column")
I have to do these actions for each column that I have to calculate.
In other words, my joins aren't meant to add new columns; I have to merge similar columns into one column.
I can calculate the new column by summing the two columns, or I can just choose the non-null column of the two, like this:
myDF
.as("left")
.join(dbDF.as("right"), Seq("key"), "outer")
.withColumn("column_temp", coalesce($"left.column", $"right.column"))
.drop($"left.column")
.drop($"right.column")
.withColumnRenamed("column_temp", "column")
And when my DataFrames have many columns and only 1 or 2 key columns, I have to repeat the above actions for each column.
My question is:
Is there a more effective way to do what I do? Or am I doing it right?
myDF.join(dbDF, myDF.col("key").equalTo(dbDF.col("key")))
  .select(myDF.col("key"), myDF.col("column").plus(dbDF.col("column")).alias("column"))
Can you try this? It is an inner join so only those rows in the left table that have a match in the right are selected. Is that your case?
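If the same merge has to be repeated for many value columns, the per-column boilerplate can also be avoided by building the select list in a loop. A minimal PySpark sketch of that idea (the same DataFrame API calls exist in the Scala API; the data is made up to mirror the example above):
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

my_df = spark.createDataFrame([("key1", 1), ("key2", 2), ("key3", 3)], ["key", "column"])
db_df = spark.createDataFrame([("key1", 5), ("key2", 5), ("key3", 5)], ["key", "column"])

value_cols = [c for c in my_df.columns if c != "key"]

joined = my_df.alias("l").join(db_df.alias("r"), on="key", how="outer")

# merge each pair of value columns in one pass:
# sum when both sides are present, otherwise keep whichever side is not null
merged = joined.select(
    "key",
    *[F.coalesce(F.col(f"l.{c}") + F.col(f"r.{c}"), F.col(f"l.{c}"), F.col(f"r.{c}")).alias(c)
      for c in value_cols],
)
merged.show()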

Pandas: Compare between same columns for same Ids between 2 dataframes and create a new dataframe with the differences for each column

Hello
I have 2 dataframes, one old and one new. After comparing the two dataframes, I want to generate output with the column names for each Id and only the changed values, as shown below.
I could merge the 2 dataframes and find the differences for each column separately like
a=df1.merge(df2, on='Ids')
a[a['ColA_x'] != a['ColA_y']]
But I have 80 columns and I want to get the difference with column names and values as shown in the output. Is there a way to do this?
Stack each dataframe to convert column names into row indexes. Concatenate the dataframes side by side:
combined = pd.concat([df1.stack(), df2.stack()], axis=1)
Now, extract the rows with the values that do not match:
combined[combined[0] != combined[1]]
# 0 1
#Ids
#123 ColA AH AB
#234 ColB GO MO
#456 ColA GH AB
#...
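A small runnable sketch of this approach; both frames are assumed to be indexed by Ids, and the values below are made up just to reproduce the output shown above:
import pandas as pd

idx = pd.Index([123, 234, 456], name='Ids')
df1 = pd.DataFrame({'ColA': ['AH', 'XX', 'GH'], 'ColB': ['YY', 'GO', 'ZZ']}, index=idx)
df2 = pd.DataFrame({'ColA': ['AB', 'XX', 'AB'], 'ColB': ['YY', 'MO', 'ZZ']}, index=idx)

# stack() turns every (Ids, column) pair into one row, so the two frames can be compared cell by cell
combined = pd.concat([df1.stack(), df2.stack()], axis=1)
diff = combined[combined[0] != combined[1]]
print(diff)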

Splitting dataframe based on multiple column values

I have a dataframe with 1M+ rows. A sample of the dataframe is shown below:
df
ID Type File
0 123 Phone 1
1 122 Computer 2
2 126 Computer 1
I want to split this dataframe based on Type and File. If the total count of Type values is 2 (Phone and Computer) and the total number of files is 2 (1, 2), then the total number of splits will be 4.
In short, total splits is as given below:
total_splits=len(set(df['Type']))*len(set(df['File']))
In this example, total_splits=4. Now, I want to split the dataframe df in 4 based on Type and File.
So the new dataframes should be:
df1 (having data of type=Phone and File=1)
df2 (having data of type=Computer and File=1)
df3 (having data of type=Phone and File=2)
df4 (having data of type=Computer and File=2)
The splitting should be done inside a loop.
I know we can split a dataframe based on one condition (shown below), but how do you split it based on two ?
My Code:
data = {'ID': ['123', '122', '126'], 'Type': ['Phone', 'Computer', 'Computer'], 'File': [1, 2, 1]}
df = pd.DataFrame(data)
types = list(set(df['Type']))
total_splits = len(set(df['Type'])) * len(set(df['File']))
cnt = 1
for i in range(0, total_splits):
    for j in types:
        locals()["df" + str(cnt)] = df[df['Type'] == j]
        cnt += 1
The result of the above code gives 2 dataframes, df1 and df2. df1 will have data of Type='Phone' and df2 will have data of Type='Computer'.
But this is just half of what I want to do. Is there a way we can make 4 dataframes here based on 2 conditions ?
Note: I know I can first split on 'Type' and then split the resulting dataframe based on 'File' to get the output. However, I want to know of a more efficient way of performing the split instead of having to create multiple dataframes to get the job done.
EDIT
This is not a duplicate question as I want to split the dataframe based on multiple column values, not just one!
You can make do with groupby:
dfs = {}
for k, d in df.groupby(['Type', 'File']):
    type, file = k
    # do whatever you want here
    # d is the dataframe corresponding with type, file
    dfs[k] = d
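For example, once the dictionary is built, each split can be pulled out by its (Type, File) key; with the sample data from the question these lookups would work (only combinations that actually occur in the data show up as keys):
df_phone_1 = dfs[('Phone', 1)]        # rows with Type == 'Phone' and File == 1
df_computer_1 = dfs[('Computer', 1)]  # rows with Type == 'Computer' and File == 1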
You can also create a mask:
df['mask'] = df['File'].eq(1) * 2 + df['Type'].eq('Phone')
Then, for example:
df[df['mask'].eq(3)]
gives you the first dataframe you want, i.e. Type=='Phone' and File==1 (mask value 3); mask 2 is Computer with File 1, mask 1 is Phone with File 2, and mask 0 is Computer with File 2.

How to Groupby based on multiple field and display all columns

I am trying to create a csv file where, if a few columns are the same, I merge the rows with those matching values into one row.
eg:
Input :
Party_No install_date Start_date End_date Product_family Version City state
111 24-05-2018 25-05-2019 21-03-2020 storage 1 LA USA
111 24-05-2018 25-05-2019 21-03-2020 storage 1 KA USA
111 24-05-2018 25-05-2019 21-03-2020 storage 2 PA UK
Output
Party_No install_date Start_date End_date Product_family Version City state
111 24-05-2018 25-05-2019 21-03-2020 storage 1,2 LA,KA,PA UK,USA
E.g., in my case:
if the values of party_number, item_install_date, Contract_subline_date, Contract_Subline_end_date, and Instance_family are the same,
I will merge those rows into one row. The other columns, apart from the ones mentioned above, will have comma-separated values.
Input CSV file link
Expected output CSV link
Code I tried:
import pandas as pd
import numpy as np

df = pd.read_csv("Export.csv")
df.fillna(0, inplace=True)
pf = df.groupby(['PARTY_NUMBER', 'ITEM_INSTALL_DATE', 'CONTRACT_SUBLINE_START_DATE',
                 'CONTRACT_SUBLINE_END_DATE', 'INSTANCE_PRODUCT_FAMILY']).agg([','.join])
pf.to_csv("result1.csv", index=False)
Adding the unique (or set when order is not important):
df.groupby(['...']).agg(lambda x: ','.join(x.unique()))  # or set(x)
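A minimal end-to-end sketch, assuming the column names from the question and the same hypothetical Export.csv; values are cast to str so the join also works on numeric columns:
import pandas as pd

df = pd.read_csv("Export.csv")

key_cols = ['PARTY_NUMBER', 'ITEM_INSTALL_DATE', 'CONTRACT_SUBLINE_START_DATE',
            'CONTRACT_SUBLINE_END_DATE', 'INSTANCE_PRODUCT_FAMILY']

# group on the key columns and join the unique values of every other column,
# keeping the keys as regular columns in the output
pf = (df.groupby(key_cols, as_index=False)
        .agg(lambda x: ','.join(x.astype(str).unique())))

pf.to_csv("result1.csv", index=False)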
