generate time series dataframe based on a given dataframe [duplicate] - python-3.x

This question already has answers here:
How do I Pandas group-by to get sum?
(11 answers)
Closed 4 years ago.
There is a dataframe which includes one column of time and another column of bill. As can be seen from the table, there can be multiple records for a single day, and the order of the time values can be random:
time       bill
2006-1-18  10.11
2006-1-18   9.02
2006-1-19  12.34
2006-1-20   6.86
2006-1-12  10.05
Based on this information, I would like to generate a time series dataframe with two columns, Time and total bill. The Time column holds the dates in order, and total bill holds the sum of all bill records belonging to each day.

# as_index=False keeps 'time' as a regular column, so the rename below works
newdf = df.groupby('time', as_index=False)['bill'].sum()
newdf.rename(columns={'time': 'Time', 'bill': 'total bill'}, inplace=True)
newdf
output:
        Time  total bill
0  2006-1-12       10.05
1  2006-1-18       19.13
2  2006-1-19       12.34
3  2006-1-20        6.86
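Note that with plain strings the rows only happen to sort correctly here; if the dates should be ordered chronologically in general, converting the column with pd.to_datetime first is safer. A minimal self-contained sketch of the same aggregation (the data below just mirrors the sample above):

import pandas as pd

df = pd.DataFrame({
    'time': ['2006-1-18', '2006-1-18', '2006-1-19', '2006-1-20', '2006-1-12'],
    'bill': [10.11, 9.02, 12.34, 6.86, 10.05],
})

# Parse the strings into real datetimes so ordering is chronological,
# not lexicographic.
df['time'] = pd.to_datetime(df['time'])

# Group by day, sum the bills, and rename to the requested headers.
newdf = (df.groupby('time', as_index=False)['bill'].sum()
           .rename(columns={'time': 'Time', 'bill': 'total bill'}))
print(newdf)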

Related

Groupby in Python without aggregation

I am having a problem selecting observations for each unique index.
I would like to extract data from each group, as in the following table:
The table consists of 12 unique months that repeat 10 times across 10 years. I would like to group by month and look at the distribution of the 12 different months, so the final table would have 12 columns (one per month) and 10 rows (one per year).
I am thinking of starting from the groupby function and using a for loop to print out the different groups; a pivot-based alternative is sketched after the code.
data1 = data.groupby(by='month')
for name, group in data1:
    print(name)
    print(group)
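Printing the groups shows them one by one, but the 10-rows-by-12-columns table described above drops out more directly from a pivot. A minimal sketch, assuming hypothetical year, month, and value columns (these column names are not from the question):

import pandas as pd

# Hypothetical data: one value per (year, month) over 10 years.
data = pd.DataFrame({
    'year':  [y for y in range(2010, 2020) for _ in range(12)],
    'month': list(range(1, 13)) * 10,
    'value': range(120),
})

# Rows become the 10 years, columns become the 12 months.
table = data.pivot_table(index='year', columns='month', values='value')
print(table.shape)  # (10, 12)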

How to select the first row of every 3 rows in pandas [duplicate]

This question already has answers here:
Pandas every nth row
(7 answers)
Closed 1 year ago.
I have one dataframe and want to take the first row of every 3 rows, saving the result as a new dataframe.
Here is the input data:
df=pd.DataFrame({'x':[1,2,5,6,7,8,9,9,6]})
Expected output:
df_out=pd.DataFrame({'x':[1,6,9]})
Use DataFrame.iloc with slicing:
print(df.iloc[::3])
x
0 1
3 6
6 9
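If the new dataframe should carry a fresh 0..n-1 index like df_out, rather than the original labels 0, 3, 6, a reset_index on top of the slice does it (a small sketch of the same idea):

import pandas as pd

df = pd.DataFrame({'x': [1, 2, 5, 6, 7, 8, 9, 9, 6]})

# Every 3rd row, then drop the old index labels 0, 3, 6.
df_out = df.iloc[::3].reset_index(drop=True)
print(df_out)  # x column: 1, 6, 9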

How to get the percentage of each column relative to the row sum in Python [duplicate]

This question already has an answer here:
Normalize rows of pandas data frame by their sums [duplicate]
(1 answer)
Closed 2 years ago.
I have very high-dimensional data with more than 100 columns. As an example, I am sharing a simplified version of it below:
date        product  price  amount
11/17/2019  A        10     20
11/24/2019  A        10     20
12/22/2020  A        20     30
15/12/2019  C        40     50
02/12/2020  C        40     50
I am trying to express each numeric column as a percentage of the total row sum, as illustrated below:
date        product  price       amount
11/17/2019  A        10/(10+20)  20/(10+20)
11/24/2019  A        10/(10+20)  20/(10+20)
12/22/2020  A        20/(20+30)  30/(20+30)
15/12/2019  C        40/(40+50)  50/(40+50)
02/12/2020  C        40/(40+50)  50/(40+50)
Is there any way to do this efficiently for high dimensional data? Thank you.
In addition to the linked duplicate (Normalize rows of pandas data frame by their sums), you need to select the specific columns, because your first two columns are non-numeric:
cols = df.columns[2:]
df[cols] = df[cols].div(df[cols].sum(axis=1), axis=0)
Out[1]:
         date product     price    amount
0  11/17/2019       A  0.333333  0.666667
1  11/24/2019       A  0.333333  0.666667
2  12/22/2020       A  0.400000  0.600000
3  15/12/2019       C  0.444444  0.555556
4  02/12/2020       C  0.444444  0.555556
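For a frame with 100+ columns, hard-coding df.columns[2:] is brittle; selecting the numeric columns by dtype is more robust. A sketch of the same normalization, assuming every numeric column should take part:

import pandas as pd

df = pd.DataFrame({
    'date':    ['11/17/2019', '11/24/2019', '12/22/2020'],
    'product': ['A', 'A', 'A'],
    'price':   [10, 10, 20],
    'amount':  [20, 20, 30],
})

# Pick out every numeric column, however many there are.
cols = df.select_dtypes(include='number').columns

# Divide each row by its own sum across those columns.
df[cols] = df[cols].div(df[cols].sum(axis=1), axis=0)
print(df)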

How to extract an entire column from a df based on a substring of the column name? [duplicate]

This question already has answers here:
Find column whose name contains a specific string
(8 answers)
Closed 3 years ago.
I have 2 dfs:
Sample of df1: s12
BacksGas_Flow_sccm ContextID StepID Time_Elapsed
46.6796875 7289972 12 25.443
46.6796875 7289972 12 26.443
Sample of df2: step12
ContextID BacksGas_Flow_sccm StepID Time_Elapsed
7289973 46.6796875 12 26.388
7289973 46.6796875 12 27.388
Since BacksGas_Flow_sccm is in a different position in each df, I would like to know how I can extract the column using df.columns.str.contains('Flow').
I tried doing:
s12.columns[s12.columns.str.contains('Flow')]
but it just gives the following output:
Index(['BacksGas_Flow_sccm'], dtype='object')
I would like the entire column to be extracted. How can this be done?
You are close; use DataFrame.loc with : to select all rows and only the columns matching the condition:
s12.loc[:, s12.columns.str.contains('Flow')]
Another idea is to select by column names:
cols = s12.columns[s12.columns.str.contains('Flow')]
s12[cols]
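pandas also ships DataFrame.filter, which does the substring match in one call; a small sketch, with values made up to mirror the shape of s12:

import pandas as pd

s12 = pd.DataFrame({
    'BacksGas_Flow_sccm': [46.6796875, 46.6796875],
    'ContextID': [7289972, 7289972],
    'StepID': [12, 12],
    'Time_Elapsed': [25.443, 26.443],
})

# Keep every column whose name contains 'Flow'.
flow = s12.filter(like='Flow')
print(flow)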

Merging two different pyspark dataframes [duplicate]

This question already has answers here:
How to split a dataframe into dataframes with same column values?
(3 answers)
Concatenate two PySpark dataframes
(12 answers)
Closed 4 years ago.
I have two pyspark dataframes with different columns that I want to merge on some condition. Below is what I have:
DF-1
date person_surname person_order_number item
2017-08-09 pearson 1 shoes
2017-08-09 zayne 3 clothes
DF-2
date person_surname person_order_number person_salary
2017-08-09 pearson 2 $1000
2017-08-09 zayne 5 $2000
I want to merge DF-1 and DF-2 so that the surnames match and the person_order_number values are merged correctly, i.e. I want the following returned:
DF_pearson
date person_surname person_order_number item salary
2017-08-09 pearson 1 shoes
2017-08-09 pearson 2 $1000
DF_Zayne
date person_surname person_order_number item salary
2017-08-09 zayne 3 clothes
2017-08-09 zayne 5 $2000
How do I achieve this? I then want to perform operations on each of these dataframes as well.
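The linked duplicates cover the two pieces: concatenating the frames and splitting by a column value. A minimal PySpark sketch combining them, assuming Spark 3.1+ for unionByName's allowMissingColumns flag:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Toy frames mirroring DF-1 and DF-2 above.
df1 = spark.createDataFrame(
    [("2017-08-09", "pearson", 1, "shoes"),
     ("2017-08-09", "zayne", 3, "clothes")],
    ["date", "person_surname", "person_order_number", "item"])
df2 = spark.createDataFrame(
    [("2017-08-09", "pearson", 2, "$1000"),
     ("2017-08-09", "zayne", 5, "$2000")],
    ["date", "person_surname", "person_order_number", "person_salary"])

# Stack the two frames; columns missing on either side are filled with nulls.
combined = df1.unionByName(df2, allowMissingColumns=True)

# One dataframe per surname, e.g. for "pearson":
df_pearson = combined.filter(F.col("person_surname") == "pearson")
df_pearson.orderBy("person_order_number").show()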
