Pandas dataframe row: divide the row elements by a fixed float value (the dataframe has only one row) - python-3.x

QTY 1 QTY 2 QTY 3 QTY 4 QTY 5 QTY 6 QTY 7 QTY 8 QTY 9 QTY 10 QTY 11 QTY 12 QTY 13 QTY 14 QTY 15 QTY 16 QTY 17 QTY 18 QTY 19 QTY 20 REV Price
28 39 41 44 33 33 44 36 41 46 29 34 35 31 28 38 31 36 45 30 250 25
I have only one row in a pandas DataFrame, which has 22 columns.
I want to divide all the columns that start with QTY by "mean_val", which is equal to 25.3654.
How can I do this?

You can do this; let's say your dataframe is in the variable df:
df.loc[:,df.columns.str.startswith('QTY')] /= mean_val
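For reference, a minimal runnable sketch of that one-liner, using the single-row frame and values from the question (mean_val is the 25.3654 mentioned above):
import pandas as pd

# One-row frame with QTY 1..QTY 20 plus REV and Price, as in the question
qty = [28, 39, 41, 44, 33, 33, 44, 36, 41, 46, 29, 34, 35, 31, 28, 38, 31, 36, 45, 30]
data = {f'QTY {i}': [float(q)] for i, q in enumerate(qty, start=1)}
data.update({'REV': [250], 'Price': [25]})
df = pd.DataFrame(data)

mean_val = 25.3654
# Divide only the columns whose names start with 'QTY'; REV and Price are left untouched
df.loc[:, df.columns.str.startswith('QTY')] /= mean_val
print(df.round(4))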

import pandas as pd
# Sample data frame defined as per your requirement and the mean value.
data = { 'qty 1': [23] , 'qty 2': [26], 'price': [25], 'rate': [250]}
df = pd.DataFrame(data=data, index=None)
mean_value = 25.3654
# Using dataframe's update() and filter() functions to achieve the outcome
df.update(df.filter(regex='qty.*') / mean_value)
print(df)
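Note that filter(regex='qty.*') is case-sensitive, so it only matches the lowercase column names used in this sample; with the original uppercase QTY columns you would pass regex='QTY.*' instead (or an inline-flag pattern such as '(?i)qty' for a case-insensitive match).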

Related

Extract weekly data from daily and reshape it from long to wide format using Pandas

Given the sample data below, I hope to extract one data entry for each week; if a week has multiple entries, I will use the entry from the largest (latest) weekday as the data for that week:
date variable value
0 2020-11-4 quantity 564.0000
1 2020-11-11 quantity 565.0000
2 2020-11-18 quantity 566.0000
3 2020-11-25 quantity 566.0000
4 2020-11-2 price 1829.1039
5 2020-11-3 price 1789.5883
6 2020-11-4 price 1755.4307
7 2020-11-5 price 1750.0727
8 2020-11-6 price 1746.7239
9 2020-11-9 price 1756.1005
10 2020-11-10 price 1752.0820
11 2020-11-11 price 1814.3693
12 2020-11-12 price 1833.7922
13 2020-11-13 price 1833.7922
14 2020-11-16 price 1784.2302
15 2020-11-17 price 1764.1376
16 2020-11-18 price 1770.1654
17 2020-11-19 price 1757.4400
18 2020-11-20 price 1770.1654
To get the week number of each date, I use df['week_number'] = pd.to_datetime(df['date']).dt.week.
date variable value week_number
0 2020-11-4 quantity 564.0000 45 --> to keep
1 2020-11-11 quantity 565.0000 46 --> to keep
2 2020-11-18 quantity 566.0000 47 --> to keep
3 2020-11-25 quantity 566.0000 48 --> to keep
4 2020-11-2 price 1829.1039 45
5 2020-11-3 price 1789.5883 45
6 2020-11-4 price 1755.4307 45
7 2020-11-5 price 1750.0727 45
8 2020-11-6 price 1746.7239 45 --> to keep, since it's the largest weekday for this week
9 2020-11-9 price 1756.1005 46
10 2020-11-10 price 1752.0820 46
11 2020-11-11 price 1814.3693 46
12 2020-11-12 price 1833.7922 46
13 2020-11-13 price 1833.7922 46 --> to keep, since it's the largest weekday for this week
14 2020-11-16 price 1784.2302 47
15 2020-11-17 price 1764.1376 47
16 2020-11-18 price 1770.1654 47
17 2020-11-19 price 1757.4400 47
18 2020-11-20 price 1770.1654 47 --> to keep, since it's the largest weekday for this week
Finally, I will reshape the rows marked "to keep" into the expected result, as follows:
variable the_45th_week the_46th_week the_47th_week the_48th_week
0 quantity 564.0000 565.0000 566.0000 566.0
1 price 1756.1005 1833.7922 1770.1654 NaN
How could I manipulate the data to get the expected result? Sincere thanks.
EDIT:
df = df.sort_values(by=['variable','date'], ascending=False)
df.drop_duplicates(['variable', 'week_number'], keep='last')
Out:
date variable value week_number
0 2020-11-4 quantity 564.0000 45
3 2020-11-25 quantity 566.0000 48
2 2020-11-18 quantity 566.0000 47
1 2020-11-11 quantity 565.0000 46
4 2020-11-2 price 1829.1039 45
14 2020-11-16 price 1784.2302 47
10 2020-11-10 price 1752.0820 46
In your solution it is possible to add pivot with rename:
df['week_number'] = pd.to_datetime(df['date']).dt.week
df = df.sort_values(by=['variable','date'], ascending=False)
df = df.drop_duplicates(['variable', 'week_number'], keep='last')
f = lambda x: f'the_{x}th_week'
out = df.pivot(index='variable', columns='week_number', values='value').rename(columns=f)
print(out)
week_number the_45th_week the_46th_week the_47th_week the_48th_week
variable
price 1829.1039 1752.082 1784.2302 NaN
quantity 564.0000 565.000 566.0000 566.0
Or remove DataFrame.drop_duplicates, so it is possible to use DataFrame.pivot_table with the aggregate function last:
df['week_number'] = pd.to_datetime(df['date']).dt.week
df = df.sort_values(by=['variable','date'], ascending=False)
f = lambda x: f'the_{x}th_week'
out = df.pivot_table(index='variable',columns='week_number',values='value', aggfunc='last').rename(columns=f)
EDIT: to get exactly the same result as the expected one:
out.reset_index().rename_axis(None, axis=1)
Out:
variable the_45th_week the_46th_week the_47th_week the_48th_week
0 price 1829.1039 1752.082 1784.2302 NaN
1 quantity 564.0000 565.000 566.0000 566.0
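For reference, a self-contained sketch of the same pipeline (sorting, keeping the latest entry per week, pivoting and renaming) on an abbreviated version of the sample data. It assumes a recent pandas version, where Series.dt.week has been removed in favour of dt.isocalendar().week:
import pandas as pd

# Abbreviated long-format sample from the question
df = pd.DataFrame({
    'date': ['2020-11-4', '2020-11-11', '2020-11-18', '2020-11-25',
             '2020-11-2', '2020-11-6', '2020-11-13', '2020-11-20'],
    'variable': ['quantity'] * 4 + ['price'] * 4,
    'value': [564.0, 565.0, 566.0, 566.0,
              1829.1039, 1746.7239, 1833.7922, 1770.1654],
})

df['date'] = pd.to_datetime(df['date'])
# dt.week was removed in pandas 2.0; isocalendar().week gives the same ISO week number
df['week_number'] = df['date'].dt.isocalendar().week

# Sort ascending by date so that aggfunc='last' keeps the latest entry per (variable, week)
out = (df.sort_values(['variable', 'date'])
         .pivot_table(index='variable', columns='week_number',
                      values='value', aggfunc='last')
         .rename(columns=lambda w: f'the_{w}th_week')
         .reset_index()
         .rename_axis(None, axis=1))
print(out)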

Most frequently occurring numbers across multiple columns using pandas

I have a data frame with numbers in multiple columns listed by date. What I'm trying to do is find the most frequently occurring numbers across the whole data set, and also grouped by date.
import pandas as pd
import glob
def lotnorm(pdobject):
    # Clean up special characters in the column names and make the date column the index as a date type.
    pdobject["Date"] = pd.to_datetime(pdobject["Date"])
    pdobject = pdobject.set_index('Date')
    for column in pdobject:
        if '#' in column:
            pdobject = pdobject.rename(columns={column: column.replace('#', '')})
    return pdobject

def lotimport():
    lotret = {}
    # List files in the data directory with a csv filename.
    for lotpath in [f for f in glob.glob("data/*.csv")]:
        lotname = lotpath.split('\\')[1].split('.')[0]
        lotret[lotname] = lotnorm(pd.read_csv(lotpath))
    return lotret
print(lotimport()['ozlotto'])
------------- Output ---------------------
1 2 3 4 5 6 7 8 9
Date
2020-07-07 4 5 7 9 12 13 32 19 35
2020-06-30 1 17 26 28 38 39 44 14 41
2020-06-23 1 3 9 13 17 20 41 28 45
2020-06-16 1 2 13 21 22 27 38 24 33
2020-06-09 8 11 26 27 31 38 39 3 36
... .. .. .. .. .. .. .. .. ..
2005-11-15 7 10 13 17 30 32 41 20 14
2005-11-08 12 18 22 28 33 43 45 23 13
2005-11-01 1 3 11 17 24 34 43 39 4
2005-10-25 7 16 23 29 36 39 42 19 43
2005-10-18 5 9 12 30 33 39 45 7 19
The output I am aiming for is
Number frequency
45 201
32 195
24 187
14 160
48 154
--------------- Updated with append experiment -----------
I tried using append to create a single series from the dataframe, which worked for individual lines of code, but I got a really odd result when I ran it inside a for loop.
temp = lotimport()['ozlotto']['1']
print(temp)
temp = temp.append(lotimport()['ozlotto']['2'], ignore_index=True, verify_integrity=True)
print(temp)
temp = temp.append(lotimport()['ozlotto']['3'], ignore_index=True, verify_integrity=True)
print(temp)
lotcomb = pd.DataFrame()
for i in (lotimport()['ozlotto'].columns.tolist()):
    print(f"{i} - {type(i)}")
    lotcomb = lotcomb.append(lotimport()['ozlotto'][i], ignore_index=True, verify_integrity=True)
print(lotcomb)
This solution might be the one you are looking for.
import numpy as np

freqvalues = np.unique(df.to_numpy(), return_counts=True)
df2 = pd.DataFrame(index=freqvalues[0], data=freqvalues[1], columns=["Frequency"])
df2.index.name = "Numbers"
df2
Output:
Frequency
Numbers
1 6
2 5
3 5
5 8
6 4
7 7
8 2
9 7
10 3
11 4
12 2
13 8
14 1
15 4
16 4
17 6
18 4
19 5
20 9
21 3
22 4
23 2
24 4
25 5
26 4
27 6
28 1
29 6
30 3
31 3
... ...
70 6
71 6
72 5
73 5
74 2
75 8
76 5
77 3
78 3
79 2
80 3
81 4
82 6
83 9
84 5
85 4
86 1
87 3
88 4
89 3
90 4
91 4
92 3
93 5
94 1
95 4
96 6
97 6
98 1
99 6
97 rows × 1 columns
df.max(axis=0)  # for columns
df.max(axis=1)  # for index
OK, so the final answer I came up with was a mix of a few things, including some of the great input from people in this thread. Essentially I do the following:
1. Pull in the CSV file, clean up the dates and the column names, and convert it to a pandas dataframe.
2. Create a new pandas series and append each column to it, ignoring dates to prevent conflicts.
3. Use Vioxini's suggestion of numpy to get counts of unique values and turn the values into the index; after that, sort the column by count in descending order and return the top 10 values.
Below is the resulting code; I hope it helps someone else.
import pandas as pd
import glob
import numpy as np
def lotnorm(pdobject):
    # Clean up special characters in the column names and make the date column the index as a date type.
    pdobject["Date"] = pd.to_datetime(pdobject["Date"])
    pdobject = pdobject.set_index('Date')
    for column in pdobject:
        if '#' in column:
            pdobject = pdobject.rename(columns={column: column.replace('#', '')})
    return pdobject

def lotimport():
    lotret = {}
    # List files in the data directory with a csv filename.
    for lotpath in [f for f in glob.glob("data/*.csv")]:
        lotname = lotpath.split('\\')[1].split('.')[0]
        lotret[lotname] = lotnorm(pd.read_csv(lotpath))
    return lotret

lotcomb = pd.Series([], dtype=object)
for i in (lotimport()['ozlotto'].columns.tolist()):
    lotcomb = lotcomb.append(lotimport()['ozlotto'][i], ignore_index=True, verify_integrity=True)

freqvalues = np.unique(lotcomb.to_numpy(), return_counts=True)
lotop = pd.DataFrame(index=freqvalues[0], data=freqvalues[1], columns=["Frequency"])
lotop.index.name = "Numbers"
lotop.sort_values(by=['Frequency'], ascending=False).head(10)
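As an aside, a shorter alternative sketch (not from the thread) that builds the same frequency table with plain pandas; it also avoids Series.append, which was removed in pandas 2.0. The small draws frame here stands in for lotimport()['ozlotto']:
import pandas as pd

# Hypothetical all-numeric draw history; in the real data this would be lotimport()['ozlotto']
draws = pd.DataFrame({
    '1': [4, 1, 1],
    '2': [5, 17, 3],
    '3': [7, 26, 9],
})

# Stack every column into one long Series, count each number, and keep the most frequent
top = (draws.stack()
            .value_counts()
            .rename_axis('Numbers')
            .to_frame('Frequency')
            .head(10))
print(top)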

Split dates into time ranges in pandas

14 [2018-03-14, 2018-03-13, 2017-03-06, 2017-02-13]
15 [2017-07-26, 2017-06-09, 2017-02-24]
16 [2018-09-06, 2018-07-06, 2018-07-04, 2017-10-20]
17 [2018-10-03, 2018-09-13, 2018-09-12, 2018-08-3]
18 [2017-02-08]
This is my data; every ID has its own dates, which range between 2017-02-05 and 2018-06-30. I need to split the dates into 5 time ranges of 4 months each, so that for the first 4 months every ID has only the dates that fall in that range (from 2017-02-05 to 2017-06-05), like this:
14 [2017-03-06, 2017-02-13]
15 [2017-02-24]
16 [null] # or delete empty rows, it doesn't matter
17 [null]
18 [2017-02-08]
then for 2017-06-05 to 2017-10-05, and so on for every 4-month range. Also, I can't use nested for loops because the data is too big. This is what I tried so far:
months_4 = individual_dates.copy()
for _ in months_4['Date']:
    _ = np.where(pd.to_datetime(_) <= pd.to_datetime('2017-9-02'), _, np.datetime64('NaT'))
and
months_8 = individual_dates.copy()
range_8 = pd.date_range(start='2017-9-02', end='2017-11-02')
for _ in months_8['Date']:
    _ = _[np.isin(_, range_8)]
This achieved absolutely no result; the data stays the same no matter what.
Update: I did what you said:
individual_dates['Date'] = individual_dates['Date'].str.strip('[]').str.split(', ')
df = pd.DataFrame({
    'Date': list(chain.from_iterable(individual_dates['Date'].tolist())),
    'ID': individual_dates['ClientId'].repeat(individual_dates['Date'].str.len())
})
df
and here is the result
Date ID
0 '2018-06-30T00:00:00.000000000' '2018-06-29T00... 14
1 '2017-03-28T00:00:00.000000000' '2017-03-27T00... 15
2 '2018-03-14T00:00:00.000000000' '2018-03-13T00... 16
3 '2017-12-14T00:00:00.000000000' '2017-03-28T00... 17
4 '2017-05-30T00:00:00.000000000' '2017-05-22T00... 18
5 '2017-03-28T00:00:00.000000000' '2017-03-27T00... 19
6 '2017-03-27T00:00:00.000000000' '2017-03-26T00... 20
7 '2017-12-15T00:00:00.000000000' '2017-11-20T00... 21
8 '2017-07-05T00:00:00.000000000' '2017-07-04T00... 22
9 '2017-12-12T00:00:00.000000000' '2017-04-06T00... 23
10 '2017-05-21T00:00:00.000000000' '2017-05-07T00... 24
For better performance I suggest converting the lists to a flat column and then filtering by isin with boolean indexing:
from itertools import chain
df = pd.DataFrame({
    'Date': list(chain.from_iterable(individual_dates['Date'].tolist())),
    'ID': individual_dates['ID'].repeat(individual_dates['Date'].str.len())
})
range_8 = pd.date_range(start='2017-02-05', end='2017-06-05')
df['Date'] = pd.to_datetime(df['Date'])
df = df[df['Date'].isin(range_8)]
print (df)
Date ID
0 2017-03-06 14
0 2017-02-13 14
1 2017-02-24 15
4 2017-02-08 18
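The answer above filters a single 4-month window. To cover all five windows from the question, one possible follow-up (an assumption, not part of the original answer) is to bin the flattened dates with pd.cut and group on the resulting interval label:
import pandas as pd

# Assumes df is the flattened frame from above, with a datetime 'Date' column and an 'ID' column
start = pd.Timestamp('2017-02-05')
edges = [start + pd.DateOffset(months=4 * i) for i in range(6)]  # 2017-02-05, 2017-06-05, ..., 2018-10-05

# Label each date with its 4-month window; right=False makes the windows [start, end)
df['window'] = pd.cut(df['Date'], bins=edges, right=False,
                      labels=[f'range_{i}' for i in range(1, 6)])

# One group per window; empty windows simply yield empty frames
for name, part in df.groupby('window', observed=False):
    print(name)
    print(part)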

How can I delete duplicates across 3 columns using two criteria (the first two columns)?

This is my data set:
Year created Week created SUM_New SUM_Closed SUM_Open
0 2018 1 17 0 82
1 2018 6 62 47 18
2 2018 6 62 47 18
3 2018 6 62 47 18
4 2018 6 62 47 18
In the last three columns there is already the sum for the year and week. I need to get rid of duplicates so that the table contains only unique values (for the example above):
Year created Week created SUM_New SUM_Closed SUM_Open
0 2018 1 17 0 82
4 2018 6 62 47 18
I tried to group the data, but it somehow works incorrectly and does what I need for just one column.
df.groupby(['Year created', 'Week created']).size()
And output:
Year created Week created
2017 48 2
49 25
50 54
51 36
52 1
2018 1 17
2 50
3 37
But it is just one column, and I don't know which one, because even if I split the data into three parts and do the same procedure for each part, I get the same result (as above) for all of them.
I believe you need drop_duplicates:
df = df.drop_duplicates(['Year created', 'Week created'])
print (df)
Year created Week created SUM_New SUM_Closed SUM_Open
0 2018 1 17 0 82
1 2018 6 62 47 18
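If the goal is to keep the last row of each duplicated group instead (index 4 in the expected output), drop_duplicates also accepts keep='last':
df = df.drop_duplicates(['Year created', 'Week created'], keep='last')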
df2 = df.drop_duplicates(['Year created', 'Week created', 'SUM_New', 'SUM_Closed'])
print(df2)
hope this helps.

Row subtraction with lambda in a pandas dataframe

I have a dataframe with multiple columns. One of the columns is the cumulative revenue column. If the year has not ended, then the revenue will stay constant for the rest of the period because the incoming daily revenue is 0.
The dataframe looks like this
Now I want to create a new column where the previous row is subtracted from the current row; if the result is 0, then put 0 for that row in the new column, otherwise use the row value. The new dataframe should look like this:
My idea was to do this with the apply lambda method. So this is the thinking:
df['2017new'] = df['2017'].apply(lambda x: 0 if row - lastrow == 0 else x)
But I do not know how to write the row - lastrow part of the code. How can I do this? Thanks in advance!
By using np.where:
import numpy as np

df2['New'] = np.where(df2['2017'].diff().eq(0), 0, df2['2017'])
df2
Out[190]:
2016 2017 New
0 10 21 21
1 15 34 34
2 70 40 40
3 90 53 53
4 93 53 0
5 99 53 0
We can shift the data and fill the values based on a condition using np.where, i.e.
df['new'] = np.where(df['2017']-df['2017'].shift(1)==0,0,df['2017'])
or with df.where, i.e.
df['new'] = df['2017'].where(df['2017']-df['2017'].shift(1)!=0,0)
2016 2017 new
0 10 21 21
1 15 34 34
2 70 40 40
3 90 53 53
4 93 53 0
5 99 53 0
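For completeness, a small self-contained sketch of the same idea using Series.diff together with Series.mask, built on the 2016/2017 sample shown above:
import pandas as pd

df = pd.DataFrame({'2016': [10, 15, 70, 90, 93, 99],
                   '2017': [21, 34, 40, 53, 53, 53]})

# diff() gives current row minus previous row; mask() replaces rows where that change is 0
df['new'] = df['2017'].mask(df['2017'].diff().eq(0), 0)
print(df)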
