Pandas Data Frame, find max value and return adjacent column value, not the entire row - python-3.x

New to Pandas so I'm sorry if there is an obvious solution...
I imported a CSV that only had 2 columns and I created a 3rd column.
Here's a screenshot of the top 10 rows and the header:
[Screenshot of DataFrame]
I've figured out how to find the min and max values in the ['Amount Changed'] column, but I also need to pull the date associated with each - not the index and not the ['Profit/Losses'] value. I've tried iloc and loc and read about groupby, but I can't get any of them to return a single value (in this case a date) that I can use again.
My goal is to create a new variable 'Gi_Date' that is in the same row as the max value in ['Amount Changed'] but tied to the date in the ['Date'] column.
I'm trying to keep the variables separate so I can use them in print statements, write them to txt files, etc.
import os
import csv
import pandas as pd
import numpy as np
#path for CSV file
csvpath = ("budget_data.csv")
#Read CSV into Pandas and give it a variable name Bank_pd
Bank_pd = pd.read_csv(csvpath, parse_dates=True)
#Number of month records in the CSV
Months = Bank_pd["Date"].count()
#Total amount of money captured in the data converted to currency
Total_Funds = '${:.0f}'.format(Bank_pd["Profit/Losses"].sum())
#Determine the amount of increase or decrease from the previous month
AmtChange = Bank_pd["Profit/Losses"].diff()
Bank_pd["Amount Changed"] = AmtChange
#Identify the greatest positive change
GreatestIncrease = '${:.0f}'.format(Bank_pd["Amount Changed"].max())
Gi_Date = Bank_pd[Bank_pd["Date"] == GreatestIncrease]
#Identify the greatest negative change
GreatestDecrease = '${:.0f}'.format(Bank_pd["Amount Changed"].min())
Gd_Date = Bank_pd[Bank_pd['Date'] == GreatestDecrease]
print(f"Total Months: {Months}")
print(f"Total: {Total_Funds}")
print(f"Greatest Increase in Profits: {Gi_Date} ({GreatestIncrease})")
print(f"Greatest Decrease in Profits: {Gd_Date} ({GreatestDecrease})")
When I run the script in Git Bash I don't get an error anymore, so I think I'm getting close, but rather than showing the date it says:
$ python PyBank.py
Total Months: 86
Total: $38382578
Greatest Increase in Profits: Empty DataFrame
Columns: [Date, Profit/Losses, Amount Changed]
Index: [] ($1926159)
Greatest Decrease in Profits: Empty DataFrame
Columns: [Date, Profit/Losses, Amount Changed]
Index: [] ($-2196167)
I want it to print out like this:
$ python PyBank.py
Total Months: 86
Total: $38382578
Greatest Increase in Profits: Feb-2012 ($1926159)
Greatest Decrease in Profits: Sept-2013 ($-2196167)
Here is one year's worth of the original DataFrame:
bank_pd = pd.DataFrame({'Date':['Jan-10', 'Feb-10', 'Mar-10', 'Apl-10', 'May-10', 'Jun-10', 'Jul-10', 'Aug-10', 'Sep-10', 'Oct-10', 'Nov-10', 'Dec-10'],
'Profit/Losses':[867884, 984655, 322013, -69417, 310503, 522857, 1033096, 604885, -216386, 477532, 893810, -80353]})
The expected output with the sample df would be:
Total Months: 12
Total Funds: $5651079
Greatest Increase in Profits: Oct-10 ($693918)
Greatest Decrease in Profits: Dec-10 ($-974163)
I also had an error in the sample dataframe above - I was missing a month when I typed it out quickly; it's fixed now.
Thanks!

I'm seeing a few glitches in the variables used.
Bank_pd["Amount Changed"] = AmtChange
This statement adds the "Amount Changed" column to the dataframe; after it runs you can use that column for any manipulation.
Below is the updated code with the newly added lines. You could add further formatting:
import pandas as pd
csvpath = ("budget_data.csv")
Bank_pd = pd.read_csv(csvpath, parse_dates=True)
inp_bank_pd = pd.DataFrame(Bank_pd)
Months = Bank_pd["Date"].count()
Total_Funds = '${:.0f}'.format(Bank_pd["Profit/Losses"].sum())
AmtChange = Bank_pd["Profit/Losses"].diff()
Bank_pd["Amount Changed"] = AmtChange
GreatestIncrease = Bank_pd["Amount Changed"].max()
Gi_Date = inp_bank_pd.loc[Bank_pd["Amount Changed"] == GreatestIncrease]
print(Months)
print(Total_Funds)
print(Gi_Date['Date'].values[0])
print(GreatestIncrease)
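For reference - a sketch of the expected prints, assuming the 12-month sample frame from the question as the input data - the output would be roughly:
12
$5651079
Oct-10
693918.0
(The unformatted max of a .diff() column comes back as a float; the OP's '${:.0f}'.format() wrapper would turn it back into $693918.)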

In your example code, Gi_Date and Gd_Date are initializing new DataFrames (your boolean filters match nothing, hence the empty frames) instead of pulling values. Change Gi_Date and Gd_Date:
Gi_Date = Bank_pd.sort_values('Profit/Losses').tail(1).Date
Gd_Date = Bank_pd.sort_values('Profit/Losses').head(1).Date
Check outputs:
Gi_Date
Jul-10
Gd_Date
Sep-10
To print how you want to print using string formatting:
print("Total Months: %s" %(Months))
print("Total: %s" %(Total_Funds))
print("Greatest Increase in Profits: %s %s" %(Gi_Date.to_string(index=False), GreatestIncrease))
print("Greatest Decrease in Profits: %s %s" %(Gd_Date.to_string(index=False), GreatestDecrease))
Note that if you don't use Gd_Date.to_string(index=False), the pandas object information will be included in the print output, like it is in your example where you see the DataFrame info.
Output for the 12-month sample DF:
Total Months: 12
Total: $5651079
Greatest Increase in Profits: Jul-10 $693918
Greatest Decrease in Profits: Sep-10 $-974163
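Note that sorting on 'Profit/Losses' picks the months with the highest and lowest raw values (Jul-10 and Sep-10 above). To match the expected output in the question, which is about the month-over-month change, a sketch that sorts on 'Amount Changed' instead - first dropping the NaN that .diff() leaves in the first row, since sort_values would otherwise place it last:
changed = Bank_pd.dropna(subset=['Amount Changed']).sort_values('Amount Changed')
Gi_Date = changed.tail(1).Date  # Oct-10 for the sample frame
Gd_Date = changed.head(1).Date  # Dec-10 for the sample frame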

Use Series.idxmin and Series.idxmax with loc:
df.loc[df['Amount Changed'].idxmin(), 'Date']
df.loc[df['Amount Changed'].idxmax(), 'Date']
Full example based on your sample DataFrame:
import pandas as pd

df = pd.DataFrame({'Date': ['Jan-2010', 'Feb-2010', 'Mar-2010', 'Apr-2010', 'May-2010',
                            'Jun-2010', 'Jul-2010', 'Aug-2010', 'Sep-2010', 'Oct-2010'],
                   'Profit/Losses': [867884, 984655, 322013, -69417, 310503, 522857,
                                     1033096, 604885, -216386, 477532]})
df['Amount Changed'] = df['Profit/Losses'].diff()
print(df)
Date Profit/Losses Amount Changed
0 Jan-2010 867884 NaN
1 Feb-2010 984655 116771.0
2 Mar-2010 322013 -662642.0
3 Apr-2010 -69417 -391430.0
4 May-2010 310503 379920.0
5 Jun-2010 522857 212354.0
6 Jul-2010 1033096 510239.0
7 Aug-2010 604885 -428211.0
8 Sep-2010 -216386 -821271.0
9 Oct-2010 477532 693918.0
print(df.loc[df['Amount Changed'].idxmin(), 'Date'])
print(df.loc[df['Amount Changed'].idxmax(), 'Date'])
Sep-2010
Oct-2010
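Applied back to your own variables, a sketch (idxmax and idxmin skip the NaN that .diff() leaves in the first row, so no extra handling is needed):
GreatestIncrease = '${:.0f}'.format(Bank_pd['Amount Changed'].max())
Gi_Date = Bank_pd.loc[Bank_pd['Amount Changed'].idxmax(), 'Date']
GreatestDecrease = '${:.0f}'.format(Bank_pd['Amount Changed'].min())
Gd_Date = Bank_pd.loc[Bank_pd['Amount Changed'].idxmin(), 'Date']
print(f"Greatest Increase in Profits: {Gi_Date} ({GreatestIncrease})")
print(f"Greatest Decrease in Profits: {Gd_Date} ({GreatestDecrease})")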

Related

Pandas converts some numbers into zeros or other fixed values

I'm using Python with Pandas through Google Colab for some data analysis. I was analyzing some data through plots and noticed some missing data. However, when I looked at the original Excel data before any Python work, there was no missing data in these places. Somehow it's turning the first four days of a month of hourly data into zeros, but only for some of the files and some of the time periods. Following the zeros is also a period of other constant values.
I have four similar data files and two of them seem to be working just fine, but the other two get these zeros at the start of SOME (consecutive) months, while nothing is wrong with the original data. Is there some feature in Pandas that could cause some numbers to turn into zeros or other constant values? The same code is used for all the different files, which are all in the same format.
I thought it could be just a problem with using 'resample' during plotting, but even when I just print the values without 'resample', the values are still missing. I included a figure here to show what the data problem looks like.
Function to read the data:
def read_elec_data(data_file_name):
    df = pd.read_excel(data_file_name)  # Read the original data
    # Convert the time value (30.11.2018 0:00-1:00) into a Pandas-compatible timestamp format (2018-11-30 0:00)
    new = df["Päivämäärä ja tunti"].str.split("-", n=1, expand=True)  # Split the time column by the delimiter into two new columns [0, 1]. The ending hour [1] can be ignored.
    time_data = new[0]
    time_data_fixed = pd.to_datetime(time_data)  # Convert the modified time data into datetime format
    df['Aika'] = time_data_fixed  # Add the new time column to the dataframe
    # Remove all columns except the new timestamp and energy consumption columns. Rename the consumption column after the building name
    building_name = df['Kohde'][0]
    df.drop(columns=["Päivämäärä ja tunti", "Tunti", 'Kohde', 'Mittarin nimi'], inplace=True)
    df = df.rename(columns={'Kulutus[kWh]': building_name})
    df = df.set_index('Aika')  # Set the timestamp as the index for the final DataFrame used in the calculations
    return df
Calling the function:
all_electricity_data_list = []
for buildingname in list_of_electricity_data:
    df = read_elec_data(buildingname)  # Use the file reading and modification function
    all_electricity_data_list.append(df)

all_electricity_data = pd.concat(all_electricity_data_list, axis=1)
Some numbers are converted to zeros or other constant values even though the original data is fine:
[Figure: plot of the hourly data showing a stretch of zeros followed by constant values]
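A debugging sketch worth trying (an assumption, not a confirmed diagnosis): with dotted day-first dates like 30.11.2018, pd.to_datetime can guess the format per element, so an ambiguous value such as 4.11.2018 may be parsed as April 11th instead of November 4th; and because pd.concat(..., axis=1) aligns rows on the index, misparsed or duplicated timestamps in one file can show up as misplaced values. Forcing the parse and inspecting each index narrows this down:
time_data_fixed = pd.to_datetime(time_data, dayfirst=True)  # force day-first parsing inside read_elec_data

for df in all_electricity_data_list:
    print(df.index.duplicated().sum())     # duplicate timestamps break axis=1 alignment
    print(df.index.min(), df.index.max())  # compare the parsed range against the source file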

How to count the number of times an SSN number occurs in a particular cell in a dataframe

I have a dataframe column with random unstructured entries. I need to count the number of times the SSN numbers are appearing in a particular cell.
Consider the below example as entries in a dataframe column:
the SSN 569-458-555 has to be replaced by 8965-78-698
SSN:25-965-9654 has to be coverage of$59 and 8968-65965 of $85
please find SSN#256-8695-65
payment completed for SSN= 569856-548, 55866-89-96,56478-9658
Output:
2
2
1
3
Text used as input
Text
the SSN 569-458-555 has to be replaced by 8965-78-698
SSN:25-965-9654 has to be coverage of$59 and 8968-65965 of $85
please find SSN#256-8695-65
payment completed for SSN= 569856-548, 55866-89-96,56478-9658
import pandas as pd
import re

df = pd.read_csv("data.csv", sep="\n")
print(df)

# convert to nine-digit number in the format "AAA-GG-SSSS"
def format_ssn(x):
    ssn_l = []
    for ssn in x:
        res = re.split(r"(\d{3})(\d{2})(\d{4})", ssn)
        # res[1:-1] clears empty matches from begin/end of string
        ssn_l.append("-".join(res[1:-1]))
    return ssn_l

dft = df["Text"].str.replace("-", "")
df["Count"] = dft.str.count(pat=r"(\d{9})")
df["SSNs"] = dft.str.findall(pat=r"(\d{9})")
df["SSNs"] = df["SSNs"].apply(lambda x: format_ssn(x))
print(df)
Output from df
Text Count SSNs
0 the SSN 569-458-555 has to be replaced ... 2 [569-45-8555, 896-57-8698]
1 SSN:25-965-9654 has to be coverage of$5... 2 [259-65-9654, 896-86-5965]
2 please find SSN#2... 1 [256-86-9565]
3 payment completed for SSN= 569856-548, ... 3 [569-85-6548, 558-66-8996, 564-78-9658]
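The re.split call with capture groups keeps each group in the result; splitting on a full match leaves empty strings at both ends, which is why res[1:-1] is taken before joining. A quick illustration on one cleaned value:
re.split(r"(\d{3})(\d{2})(\d{4})", "569458555")
# returns ['', '569', '45', '8555', ''] -> joined as '569-45-8555'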

Cumulatively sum all columns except the date column in python with cumsum()

I have a stock data set like:
Date Open High ... Close Adj Close Volume
0 2014-09-17 465.864014 468.174011 ... 457.334015 457.334015 21056800
1 2014-09-18 456.859985 456.859985 ... 424.440002 424.440002 34483200
2 2014-09-19 424.102997 427.834991 ... 394.795990 394.795990 37919700
3 2014-09-20 394.673004 423.295990 ... 408.903992 408.903992 36863600
4 2014-09-21 408.084991 412.425995 ... 398.821014 398.821014 26580100
I need to cumulatively sum the columns Open, High, Close, Adj Close and Volume.
I tried df.cumsum(), but it shows a timestamp error.
I think for processing trade data it is best to create a DatetimeIndex:
#if necessary
#df['Date'] = pd.to_datetime(df['Date'])
df = df.set_index('Date')
And then, if necessary, take the cumulative sum over all columns:
df = df.cumsum()
If you want the cumulative sum only for some columns:
cols = ['Open','High','Close','Adj Close','Volume']
df[cols] = df[cols].cumsum()
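A minimal runnable sketch of the whole pattern, using two rows of the sample above:
import pandas as pd

df = pd.DataFrame({'Date': ['2014-09-17', '2014-09-18'],
                   'Open': [465.864014, 456.859985],
                   'Volume': [21056800, 34483200]})
df['Date'] = pd.to_datetime(df['Date'])
df = df.set_index('Date')

cols = ['Open', 'Volume']
df[cols] = df[cols].cumsum()  # second row becomes 922.723999 / 55540000
print(df)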

Convert pandas column from object type [] in python 3

I have read this Pandas: convert type of column and this How to convert datatype:object to float64 in python?
I have the current output of df:
Day object
Time object
Open float64
Close float64
High float64
Low float64
Day Time Open Close High Low
0 ['2019-03-25'] ['02:00:00'] 882.2 882.6 884.0 882.1
1 ['2019-03-25'] ['02:01:00'] 882.9 882.9 883.4 882.9
2 ['2019-03-25'] ['02:02:00'] 882.8 882.8 883.0 882.7
So I can not use this:
day_=df.loc[df['Day'] == '2019-06-25']
My final goal is to extract rows of df by filtering the value of column "Day" with a specific condition.
I think the reason the df.loc above fails to execute is that the dtype of Day is object, so I can not filter on it directly.
So I am trying to convert the above df to something like this:
Day Time Open Close High Low
0 2019-03-25 ['02:00:00'] 882.2 882.6 884.0 882.1
1 2019-03-25 ['02:01:00'] 882.9 882.9 883.4 882.9
2 2019-03-25 ['02:02:00'] 882.8 882.8 883.0 882.7
I have tried:
df=pd.read_csv('output.csv')
df = df.convert_objects(convert_numeric=True)
#df['Day'] = df['CTR'].str.replace('[','').astype(np.float64)
df['Day'] = pd.to_numeric(df['Day'].str.replace(r'[,.%]',''))
But it does not work, and fails with an error like this:
ValueError: Unable to parse string "['2019-03-25']" at position 0
I am a novice at pandas and this may be a duplicate!
Please help me find a solution. Thanks a lot.
Try this, I hope it works.
First remove the list brackets from Day, then filter using .loc:
df = pd.DataFrame(data={'Day':[['2016-05-12']],
'day2':[['2016-01-01']]})
df['Day'] = df['Day'].apply(''.join)
df['Day'] = pd.to_datetime(df['Day']).dt.date.astype(str)
days_df=df.loc[df['Day'] == '2016-05-12']
Second solution, if the list is stored as a string:
from ast import literal_eval
df2 = pd.DataFrame(data={'Day':["['2016-05-12']"],
'day2':["['2016-01-01']"]})
df2['Day'] = df2['Day'].apply(literal_eval)
df2['Day'] = df2['Day'].apply(''.join)
df2['Day'] = pd.to_datetime(df2['Day']).dt.date.astype(str)
days_df=df2.loc[df2['Day'] == '2016-05-12']
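A shorter alternative sketch, assuming every cell is literally a string like "['2016-05-12']": strip the bracket and quote characters instead of parsing the list:
df2['Day'] = df2['Day'].str.strip("[]'")
days_df = df2.loc[df2['Day'] == '2016-05-12']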

How to get specific attributes of a df that has been grouped

I'm printing out the frequency of murders in each state in each particular decade. However, I just want to print the state, decade, and its victim count. Right now it's printing out all the columns with the same frequencies. How do I change it so that I just have 3 columns: State, Decade, and Victim Count?
I'm currently using the groupby function to group by state and decade and setting the result equal to a variable called counts.
xl = pd.ExcelFile('Wyoming.xlsx')
df = xl.parse('Sheet1')
df['Decade'] = (df['Year'] // 10) * 10
counts = df.groupby(['State', 'Decade']).count()
print(counts)
The outcome is printing out all the columns in the file with the same frequencies whereas I just want 3 columns: State Decade Victim Count
Sample Text File
You should reset_index on the groupby result, and then select the columns from the new dataframe.
Something like
xl = pd.ExcelFile('Wyoming.xlsx')
df = xl.parse('Sheet1')
df['Decade'] = (df['Year'] // 10) * 10
counts = df.groupby(['State', 'Decade']).count()
counts = counts.reset_index()[['State', 'Decade', 'Victim Count']]
print(counts)
Select the columns that you want:
counts = df.loc[:, ['State', 'Decade', 'Victim Count']].groupby(['State', 'Decade']).count()
or
print(counts.loc[:, ['State', 'Decade', 'Victim Count']])
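An alternative sketch that avoids counting every column: groupby .size() gives one count per group directly (it counts all rows in the group, whereas .count() counts non-null values per column, so the two agree only when the victim column has no missing values):
counts = df.groupby(['State', 'Decade']).size().reset_index(name='Victim Count')
print(counts)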
