This question already has answers here:
How to split a dataframe into dataframes with same column values?
(3 answers)
Concatenate two PySpark dataframes
(12 answers)
Closed 4 years ago.
I have two PySpark dataframes with different values that I want to merge on some condition. Below is what I have:
DF-1

date        person_surname  person_order_number  item
2017-08-09  pearson         1                    shoes
2017-08-09  zayne           3                    clothes

DF-2

date        person_surname  person_order_number  person_salary
2017-08-09  pearson         2                    $1000
2017-08-09  zayne           5                    $2000
I want to merge DF1 and DF2 such that the surnames of the people match and the person_order_number values are merged correctly. So I want the following returned:
DF_pearson

date        person_surname  person_order_number  item    salary
2017-08-09  pearson         1                    shoes
2017-08-09  pearson         2                            $1000

DF_Zayne

date        person_surname  person_order_number  item     salary
2017-08-09  zayne           3                    clothes
2017-08-09  zayne           5                             $2000
How do I achieve this? I want to then perform operations on each of these dataframes as well.
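A minimal PySpark sketch of one way to get that result, assuming the two frames are bound to df1 and df2 and that the "merge" amounts to a row-wise union of the two schemas followed by a per-surname filter:

from pyspark.sql import functions as F

# Align the two schemas: rename the salary column, add whichever column each
# frame is missing, then stack the rows by column name.
df1_aligned = df1.withColumn("salary", F.lit(None).cast("string"))
df2_aligned = (df2.withColumnRenamed("person_salary", "salary")
                  .withColumn("item", F.lit(None).cast("string")))

combined = df1_aligned.unionByName(df2_aligned)

# One dataframe per surname, ready for further per-person operations.
df_pearson = combined.filter(F.col("person_surname") == "pearson")
df_zayne = combined.filter(F.col("person_surname") == "zayne")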
This question already has answers here:
Pandas Melt Function
(2 answers)
Closed 7 months ago.
I want to turn two column headers into row values under a new column, and have their values appear under another column, using Python Pandas. I searched for this but could not find a solution.
First table: a dataframe with the columns name, place, weight, and numbers.
I want to make a table like this:
Second table: the same data with weight and numbers turned into row values under one new column, and their values under another column.
Can anyone give me a solution to this?
Try this:
df1 = df.melt(id_vars=['name', 'place'], value_vars=['weight', 'numbers'], var_name='measure', value_name='measuring_values')
print(df1)
name place measure measuring_values
0 apple delhi weight 2
1 orange up weight 3
2 onion goa weight 4
3 apple delhi numbers 6
4 orange up numbers 8
5 onion goa numbers 25
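For reference, a self-contained version of the same call, with sample data assumed from the output shown above:

import pandas as pd

# Sample frame matching the melted output above.
df = pd.DataFrame({
    'name': ['apple', 'orange', 'onion'],
    'place': ['delhi', 'up', 'goa'],
    'weight': [2, 3, 4],
    'numbers': [6, 8, 25],
})

# Keep name/place as identifiers; stack weight and numbers into rows.
df1 = df.melt(id_vars=['name', 'place'], value_vars=['weight', 'numbers'],
              var_name='measure', value_name='measuring_values')
print(df1)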
This question already has answers here:
How can I pivot a dataframe?
(5 answers)
Closed 2 years ago.
I have data in the below format stored in a pandas dataframe:
ID Block
MGKfdkldr Product 1
MGKfdkldr Product 2
MGKfdkldr Product 3
GLOsdasd Product 2
GLOsdasd Product 3
NewNew Product 1
OldOld Product 4
OldOld Product 8
Here is the sample dataframe code
df1 = pd.DataFrame({'ID': ['MGKfdkldr', 'MGKfdkldr', 'MGKfdkldr', 'GLOsdasd', 'GLOsdasd', 'NewNew', 'OldOld', 'OldOld'],
                    'Block': ['Product 1', 'Product 2', 'Product 3', 'Product 2', 'Product 3', 'Product 1', 'Product 4', 'Product 8']})
I am looking for below data format from this(Expected output):
ID Block-1 Block-2 Block-3
MGKfdkldr Product 1 Product 2 Product 3
GLOsdasd Product 2 Product 3
NewNew Product 1
OldOld Product 4 Product 8
I have tried to reshape it with the pd.melt function, but that just moves data into column headers, and I am looking for something slightly different. Is there another method to get my expected output?
Can anyone help me with this, please?
The function you're looking for is pivot, not melt. You'll also need to provide a "counter" column that simply counts the repeated "ID"s to get everything to align properly.
df1["Block_id"] = df1.groupby("ID").cumcount() + 1

new_df = (df1.pivot(index="ID", columns="Block_id", values="Block")  # reshapes our data
             .add_prefix("Block-")         # adds "Block-" to our column names
             .rename_axis(columns=None)    # fixes funky column index name
             .reset_index())               # inserts "ID" as a regular column instead of an Index
print(new_df)
ID Block-1 Block-2 Block-3
0 GLOsdasd Product 2 Product 3 NaN
1 MGKfdkldr Product 1 Product 2 Product 3
2 NewNew Product 1 NaN NaN
3 OldOld Product 4 Product 8 NaN
If you want actual blanks (e.g. the empty string "") instead of NaN, you can use new_df.fillna("").
This question already has answers here:
How to unnest (explode) a column in a pandas DataFrame, into multiple rows
(16 answers)
Split (explode) pandas dataframe string entry to separate rows
(27 answers)
Closed 2 years ago.
I am reading a csv file into a pandas dataframe, and the data inside the dataframe is as below:
item seq_no db_xml
0 28799179 5 ['<my_xml>....</my_xml>']
1 28839888 1 ['<my_xml>....</my_xml>']
2 28840113 75 ['<my_xml>....</my_xml>']
3 28852466 20,22 ['<my_xml1>....</my_xml1>', '<my_xml2>....</my_xml2>']
I need to convert the above dataframe as below, i.e. each seq_no for the same item, together with its db_xml, should be in a separate row. I need to split the seq_no values of the same item into subsequent rows.
item seq_no db_xml
0 28799179 5 ['<my_xml>....</my_xml>']
1 28839888 1 ['<my_xml>....</my_xml>']
2 28840113 75 ['<my_xml>....</my_xml>']
3 28852466 20 ['<my_xml1>....</my_xml1>']
4 28852466 22 ['<my_xml2>....</my_xml2>']
Please let me know how to achieve this in pandas, so that the seq_no values are also split into separate rows.
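A minimal sketch of one approach, assuming seq_no is a comma-separated string and db_xml has already been parsed into Python lists (e.g. with ast.literal_eval after read_csv), using DataFrame.explode on multiple columns (pandas >= 1.3):

import pandas as pd

# Hypothetical sample mirroring the question's layout.
df = pd.DataFrame({
    'item': [28799179, 28839888, 28840113, 28852466],
    'seq_no': ['5', '1', '75', '20,22'],
    'db_xml': [['<my_xml>....</my_xml>'],
               ['<my_xml>....</my_xml>'],
               ['<my_xml>....</my_xml>'],
               ['<my_xml1>....</my_xml1>', '<my_xml2>....</my_xml2>']],
})

# Split the comma-separated seq_no so each row holds two equal-length lists,
# then explode both columns together into separate rows.
df['seq_no'] = df['seq_no'].astype(str).str.split(',')
out = df.explode(['seq_no', 'db_xml'], ignore_index=True)
print(out)

After the explode, db_xml holds plain strings; wrap them back into single-element lists if the original list format is required.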
This question already has answers here:
Splitting dataframe into multiple dataframes
(13 answers)
Closed 4 years ago.
I have a pandas dataframe that contains two columns: one column has many values, and the other column contains only a few values, most of which repeat. This is what the dataframe looks like:
Item Price
Apple 10
Banana 5
Mango 10
Pineapple 7
Kiwi 5
Tomatoes 2
Eggs 10
Potatoes 7
Burgers 5
Milk 2
Chicken 10
Coffee 7
Noodles 5
The values in the Price column change. I want to be able to filter the items whose prices are the same and create a new dataframe from them. What I am not able to do is handle the different prices that occur: if the prices always stayed the same I could filter the items on a fixed value, but I cannot when the prices change arbitrarily.
This is what I want to achieve; it is only the dataframe for one particular price, as an example.
Item Price
Apple 10
Mango 10
Eggs 10
Chicken 10
You can do this using the loop below:
df_new = []
for i in df['Price'].unique():
    df1 = df[df['Price'] == i]
    df_new.append(df1)

print(df_new[0])
Output:
Item Price
Apple 10
Mango 10
Eggs 10
Chicken 10
You can use boolean indexing on the dataframe to get each subset:
price_list = df.Price.unique()
sets = []
for price in price_list:
    sets.append(df[df.Price == price])
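Both answers build the subsets with an explicit loop. As a sketch of an equivalent, more idiomatic alternative (not taken from either answer), groupby yields one sub-dataframe per distinct price:

# One sub-dataframe per distinct price, keyed by that price.
subsets = {price: group for price, group in df.groupby('Price')}
print(subsets[10])   # the items priced at 10, as in the example above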
This question already has answers here:
How do I Pandas group-by to get sum?
(11 answers)
Closed 4 years ago.
There is a dataframe which includes one column of time and another column of bill. As can be seen from the table, there can be multiple records for a single day, and the order of the time values can be random.
time bill
2006-1-18 10.11
2006-1-18 9.02
2006-1-19 12.34
2006-1-20 6.86
2006-1-12 10.05
Based on this information, I would like to generate a time series dataframe with two columns, Time and total bill.
The Time column will hold the dates in order, and total bill will hold the sum of the bill records belonging to each day.
newdf = df.groupby('time', as_index=False)['bill'].sum()
newdf.rename(columns={'time': 'Time', 'bill': 'total bill'}, inplace=True)
newdf
output:

        Time  total bill
0  2006-1-12       10.05
1  2006-1-18       19.13
2  2006-1-19       12.34
3  2006-1-20        6.86
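One extra step worth considering, since the question notes the time values can arrive in random order (a sketch assuming the same column names): converting the strings to real datetimes before grouping guarantees chronological order rather than string order.

import pandas as pd

df['time'] = pd.to_datetime(df['time'])   # parses strings like '2006-1-18'
newdf = (df.groupby('time', as_index=False)['bill'].sum()
           .sort_values('time')
           .rename(columns={'time': 'Time', 'bill': 'total bill'}))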