How to do join including nulls in pyspark?

How to do join including nulls in pyspark? - apache-spark

I have two data frames
first data frame has 8 columns - col1,col2,col3,col4.col5,col6,col7,col8
second data frame has 10 columns - t_col1,tcol_2,t_col3,t_col4,t_col5,t_col6,t_col7,t_col8,answer_1,answer_2
I am trying for a new merged data frame - that has all the columns from the first data frame and the last two columns from the second dataframe.
The code I used is:
res = res.join(df2,
(res.col_1== df2.t_col1) &
(res.col_2== df2.t_col2) &
(res.col_3== df2.t_col3) &
(res.col_4 == df2_t_col4) &
(res.col_5== df2.t_col5) &
(res.col_6== df2.t_col6) &
(res.col_7== df2.t_col7) &
(res.col_8== df2.t_col8),
how='left')
But somehow the null values are getting ignored in the merged data frame. Also, I am getting many NaNs
The result data frame should have cols - col1,col2,col3,col4,col5,col6,col7,col8,answer_1,answer_2

Related

How to update the empty Pandas dataframe from the the sum of the bottom x rows of a column from an another Pandas dataframe

I would like to sum the bottom x rows of each column of the dataframe and update it in a another empty data frame.
I tried the below code but i could not update the dataframe.
The master DataFrame is ‘df_new_final’ and it contains numerical values.
I want to update in a ‘df_new_final_tail’ as an input of the sum of tail 15 rows from Master DataFrame. But df_new_final_tail is still an empty but i can see that ‘sum_x’ is getting calculated. Not sure why it is not getting updated.
Master DataFrame ——> df_new_final
Child DataFrame ——-> df_new_final_tail
df_series_list = df_series.columns.values.tolist()
df_new_final_tail = pd.DataFrame(columns=df_series_list)
for items in df_series_list:
sum_x = df_new_final.tail(15)[items+’_buy’].sum()
df_new_final_tail[items]=sum_x
Thanks

Convert Series after sum to one column DataFrame by Series.to_frame and for one row DataFrame use transpose by DataFrame.T:
df_new_final_tail = df_new_final.tail(15).sum().to_frame().T
If df_series is another DataFrame and columns names are same with suffix _buy for parse df_new_final use:
items = df_series.columns
df_new_final_tail = df_new_final.tail(15)[items+'_buy'].sum().to_frame().T

Using Pandas: How do I combine multiple rows of data into a single row based on a common key?

Need help merging multiple rows of data with various datatypes for multiple columns
I have a dataframe that contains 14 columns and x number of rows of data. An example slice of the dataframe is linked below:
Current Example of my dataframe
I want to be able to merge all four rows of data into a single row based on the "work order" column. See linked image below. I am currently using pandas to take data from four different data sources and create a dataframe that has all the relevant data I want based on each work order number. I have tried various methods including groupby, merge, join, and others without any good results.
How I want my dataframe to look in the end
I essentially want to groupby the work order value, merge all the site names into a single value, then have all data essentially condense to a single row. If there is identical data in a column then I just want it to merge together. If there are values in a column that are different (such as in "Operator Ack Timestamp") then I don't mind the data being a continuous string of data (ex. one date after the next within the same cell).
example dataframe data:
df = pd.DataFrame({'Work Order': [10025,10025,10025,10025],
'Site': ['SC1', 'SC1', 'SC1', 'SC1'],
'Description_1':['','','Inverter 10A-1 - No Comms',''],
'Description_2':['','','Inverter 10A-1 - No Comms',''],
'Description_3':['Inverter 10A-1 has lost communications.','','',''],
'Failure Type':['','','Communications',''],
'Failure Class':['','','2',''],
'Start of Fault':['','','2021-05-30 06:37:00',''],
'Operator Ack Timestamp':['2021-05-30 8:49:21','','2021-05-30 6:47:57',''],
'Timestamp of Notification':['2021-05-30 07:18:58','','',''],
'Actual Start Date':['','2021-05-30 6:37:00','','2021-05-30 6:37:00'],
'Actual Start Time':['','06:37:00','','06:37:00'],
'Actual End Date':['','2021-05-30 08:24:00','',''],
'Actual End Time':['','08:24:00','','']})
df.head()

4 steps to get expected output:
Replace empty values by pd.NA,
Group your data by Work Order column because it seems to be the index key,
For each group, fill NA value by last valid observation and keep the last record,
Reset index to have the same format as input.
I choose to group by "Work Order" because it seems to be the index key of your dataframe.
The index of your dataframe is "Work Order":
df = df.set_index("Work Order")
out = df.replace({'': pd.NA}) \
.groupby("Work Order", as_index=False) \
.apply(lambda x: x.ffill().tail(1)) \
.reset_index(level=0, drop=True)```
>>> out.T # transpose for better vizualisation
Work Order 10025
Site SC1
Description_1 Inverter 10A-1 - No Comms
Description_2 Inverter 10A-1 - No Comms
Description_3 Inverter 10A-1 has lost communications.
Failure Type Communications
Failure Class 2
Start of Fault 2021-05-30 06:37:00
Operator Ack Timestamp 2021-05-30 6:47:57
Timestamp of Notification 2021-05-30 07:18:58
Actual Start Date 2021-05-30 6:37:00
Actual Start Time 06:37:00
Actual End Date 2021-05-30 08:24:00
Actual End Time 08:24:00

pandas: populate all rows in a dataframe column with the values from one cell in a separate data frame

I want to populate all rows in a dataframe column with the values from one cell in a separate data frame.
Both dfs are based on data read in from the same CSV.
data_description = pd.read_csv(file.csv, nrows=1)
#this two rows of data: one row of column headers and one row of values. The value I want to use later is under the header "average duration"
data_table = pd.read_csv(file.csv, skiprows=3)
#this is a multi row data table located directly below the description. I to want add a "duration" column will all rows populated by "average duration" from above.
df1 = pd.DataFrame(data_description)
df2 = pd.DataFram(data_table)
df2['duration'] = df1['average duration']
The final line only works for the first from in the column. How can extend down all rows?
If I directly assign the 'average duration' value it works, e.g. df2['duration'] = 60, but I want it to be dynamic.

You have to extract the value from df1 and then assign the value to df2. What you're assigning is a Series, not the value.
data_description = pd.read_csv(file.csv, nrows=1)
data_table = pd.read_csv(file.csv, skiprows=3)
df1 = pd.DataFrame(data_description)
df2 = pd.DataFram(data_table)
df2['duration'] = df1['average duration'][0]

Why dataframe describe() method dosn't show all of the columns details after merge method?

I have two csv file of datasets and I select some datas from first and marge them with the second dataset and replace to first one. so now I have new columns in my first datasets and now I need the average of one column what is new after marge. But when I use mean method or describe method they dosen't show that column details.
dfgdp = pd.read_csv('../datasets/2014_world_gdp_with_codes.csv')
dfgdp.rename(columns = {'GDP (BILLIONS)':'gdp', 'COUNTRY':'country'}, inplace = True)
dfgdp = dfgdp[dfgdp['gdp']>1000]
dfgdp = pd.merge(dfgdp, dfmig, how='inner', on='country')
dfgdp.describe()
That show the 4 columns but I have 8 columns in my table.

Joining a temporary DataFrame to permanent DataFrame

I am trying to join a predefined Data Frame to a temporary Data Frame which is located within a loop. My code template is as follows:
df1_date = pd.date_range('2008-01-01', '2017-01-01')
df1 = pd.DataFrame(index=final_date)
#A loop:
dates = pd.date_range('2008-01-01', '2009-01-01')
df2 = pd.DataFrame(data=data, index=dates)
df1 = df1.join(df2, how='left')
When the loop runs for the first time the index values of DataFrame which is calculated in loop joins with the DataFrame which was defined in begining. But when the loop runs for the next time it gives the following error:
ValueError: columns overlap but no suffix specified: Index(['Valuation'], dtype='object')
The next time loop runs it calculates values for the next time period and I want it to join with the permanent Data Frame, rather than giving the above error.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

How to do join including nulls in pyspark? - apache-spark

Related

How to update the empty Pandas dataframe from the the sum of the bottom x rows of a column from an another Pandas dataframe

Using Pandas: How do I combine multiple rows of data into a single row based on a common key?

pandas: populate all rows in a dataframe column with the values from one cell in a separate data frame

Why dataframe describe() method dosn't show all of the columns details after merge method?

Joining a temporary DataFrame to permanent DataFrame

Categories

Resources