I am trying to figure out how to make my pandas DataFrame group data together. Currently, if you input data, for example: 1, 1, 2, 2, 3 - it will come out like this:
column1 column2
1: 2
2: 2
3: 1
I would like it to group the data when rows have the same value in column2; for this example it should just show 2: 2, meaning two of the numbers inserted were each inserted twice, so we just count that. Here is my current code:
from collections import Counter
import time
filename = input("Enter name to save this file as:\n\n")
print("Your file will be saved as: " + filename + ".csv\n\n")
time.sleep(0.5)
print("Please enter information")
contents = []
def amount():
    while True:
        try:
            line = input()
        except EOFError:
            break
        contents.append(line)
    return
amount()
count = Counter(contents)
print(count)
import pandas as pd
d = count
df = pd.DataFrame.from_dict(d, orient='index').reset_index()
df.columns =['column1', 'column2']
df = df.sort_values(by='column2', ascending=False)  # sort_values returns a new frame
df.to_csv(filename + ".csv", encoding='utf-8', index=False)
Any help would be appreciated.
I have tried this using value_counts() in pandas; not sure if it'll work for you!
values = [1, 1, 2, 2, 3]
df = pd.DataFrame(values, columns=['values'])
res = df['values'].value_counts().rename_axis('Values').reset_index(name='Count')
res = res.sort_values('Values')  # order by the original value
display(res)
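The answer above stops at per-value counts. To get the grouped view the question actually asks for (2: 2, i.e. how many values share each count), one option - a sketch, not part of the original answer - is to apply value_counts a second time:

```python
import pandas as pd

values = [1, 1, 2, 2, 3]
df = pd.DataFrame(values, columns=['values'])

# First pass: how often each value appears (1 -> 2, 2 -> 2, 3 -> 1).
counts = df['values'].value_counts()

# Second pass: how many distinct values share each count (2 -> 2, 1 -> 1).
grouped = counts.value_counts()
print(grouped)
```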
I wrote a small python script to read from a CSV, check for certain values, then write results to another CSV file.
Here is my code
import pandas as pd
import os

current_directory = os.getcwd()
pf = pd.read_csv(current_directory + '\Appointments.csv')
total_kept = {}
for index, row in pf.iterrows():
    if total_kept.get(row['Acct#']) not in total_kept:
        values = {"row['Acct#']": 0}
        total_kept.update(values)  # this line and the above one are supposed to initialize values to 0
        if row['Status'] == 'K':
            total_kept[row['Acct#']] = total_kept.get(row['Acct#'], 0) + 1
    else:
        if row['Status'] == 'K':
            total_kept[row['Acct#']] = total_kept.get(row['Acct#'], 0) + 1
with open("results.csv", 'w') as f:
    for key in total_kept.keys():
        f.write("%s, %s\n" % (key, total_kept[key]))
For some reason, it is not adding IDs with no 'K' to the dictionary, when they should have a value of 0. I am trying to count how many 'K' statuses each ID has. My thought was to create a dictionary with the key being the ID and the value being the number of 'K' instances.
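For what it's worth, the membership test above compares a dictionary *value* (`total_kept.get(row['Acct#'])`) against the dictionary's keys, and `{"row['Acct#']": 0}` inserts the literal string `row['Acct#']` as a key. A minimal sketch of the intended logic, using `dict.setdefault` so every account starts at 0 (demo data stands in for Appointments.csv):

```python
import pandas as pd

def count_kept(df: pd.DataFrame) -> dict:
    """Count 'K' statuses per account, keeping accounts with zero 'K's."""
    total_kept = {}
    for _, row in df.iterrows():
        acct = row['Acct#']
        total_kept.setdefault(acct, 0)   # initialize unseen accounts to 0
        if row['Status'] == 'K':
            total_kept[acct] += 1
    return total_kept

# Demo with in-memory data instead of Appointments.csv:
demo = pd.DataFrame({'Acct#': [101, 101, 202], 'Status': ['K', 'N', 'N']})
print(count_kept(demo))   # {101: 1, 202: 0}
```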
Is there an option in pandas to keep the formatting of the file when I use df.to_excel to save data to it?
The only workaround that I found is:
from openpyxl import load_workbook
from openpyxl.utils.dataframe import dataframe_to_rows
import pandas as pd

# df_data is a pd.DataFrame; fout_file is the path of the existing workbook
wb = load_workbook(fout_file)
sheet = wb.active
for r, row in enumerate(dataframe_to_rows(df_data, index=False, header=False), 2):
    for c in range(len(df_data.columns)):
        sheet.cell(row=r, column=c + 1).value = row[c]
wb.save(fout_file)
Is there a better way, so that I don't have to copy cell by cell?
Thanks
stefano G.
@DSteman thanks for the idea, I just tried to use StyleFrame as you advised.
def main():
    ...
    # df_new_data = pd.DataFrame(columns=self.df_values.columns)
    df_new_data = StyleFrame.read_excel(self.template_fout, read_style=True)
    ...
    cr_dfnewdata = 0
    for j, row_data in data.iterrows():
        original_row = row_data.copy(deep=True)
        # df_new_data = df_new_data.append(original_row)
        cr_dfnewdata += 1
        df_new_data[cr_dfnewdata] = original_row
        ...
        compensa_row = row_data.copy(deep=True)
        compensa_row[self.importo_col] = importo * -1
        # compensa_row[self.qta_col] = qta * -1
        compensa_row[self.cod_ribal_col] = f"{cod_ribal}-{j}"
        # df_new_data = df_new_data.append(compensa_row)
        cr_dfnewdata += 1
        df_new_data[cr_dfnewdata] = compensa_row
    ...

def save_working_data(self, cod_ribalt: str, df_data):
    fout_working_name = f"{self.working_dir}/working_{cod_ribalt}.xlsx"
    df_data.to_excel(fout_working_name).save()
But I got this error:
export_df.index = [row_index.value for row_index in export_df.index]
AttributeError: 'int' object has no attribute 'value'
You can do this using df.to_clipboard(index=False)
from win32com.client import Dispatch
import pandas as pd
xlApp = Dispatch("Excel.Application")
xlApp.Visible = 1
xlApp.Workbooks.Open(r'c:\Chadee\test.xlsx')
xlApp.ActiveSheet.Cells(1, 1).Select()
d = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data=d)
df.to_clipboard(index=False)
xlApp.ActiveWorkbook.ActiveSheet.PasteSpecial()
Output: the dataframe is pasted into the opened workbook. Note that the cell colors are still the same.
Hope that helps! :-)
Use the styleframe module to preserve most of the styling in your sheet. If you have a styled sheet with columns ['Entry 1', 'Entry 2'] for example, you can enter values like this:
from styleframe import StyleFrame
sf = StyleFrame.read_excel('test.xlsx', read_style=True)
sf.loc[0,'Entry 1'].value = 'Modified 1'
sf.to_excel('test.xlsx').save()
Make sure that the cell you are trying to fill already has a placeholder value like 0. My script returned errors if it attempted to fill an empty cell.
Check out this thread too: Overwriting excel columns while keeping format using pandas
I have a dataframe given below.
I want to extract all the non-zero values from each column and summarize them like this:
if a value repeats for a period of time, its starting time should go in the 'FROM' column and its end time in the 'TO' column, with the column name in the 'BLK-ASB-INV' column and the value itself in the 'Scount' column. For this I have started to write the code like this:
import pandas as pd
df = pd.read_excel("StringFault_Bagewadi_16-01-2020.xlsx")
df = df.set_index(['Date (+05:30)'])
cols=['BLK-ASB-INV', 'Scount', 'FROM', 'TO']
res=pd.DataFrame(columns=cols)
for col in df.columns:
    ss = df[col].iloc[df[col].to_numpy().nonzero()[0]]
    .......
After that, I am unable to figure out how to approach getting the desired output. Is there any way to do this in python? Thanks in advance for any help.
Finally I have solved my problem; the code given below works perfectly for me.
import pandas as pd
df = pd.read_excel("StringFault.xlsx")
df = df.set_index(['Date (+05:30)'])
cols=['BLK-ASB-INV', 'Scount', 'FROM', 'TO']
res=pd.DataFrame(columns=cols)
for col in df.columns:
    device = []
    for i in range(len(df[col])):
        if df[col][i] == 0:
            pass
        else:
            if i < len(df[col]) - 1 and df[col][i] == df[col][i+1]:
                try:
                    if df[col].index[i] > device[2]:
                        continue
                except IndexError:
                    device.append(df[col].name)
                    device.append(df[col][i])
                    device.append(df[col].index[i])
                    continue
            else:
                if len(device) == 3:
                    device.append(df[col].index[i])
                    res = res.append({'BLK-ASB-INV': device[0], 'Scount': device[1], 'FROM': device[2], 'TO': device[3]}, ignore_index=True)
                    device = []
                else:
                    device.append(df[col].name)
                    device.append(df[col][i])
                    if i == 0:
                        device.append(df[col].index[i])
                    else:
                        device.append(df[col].index[i-1])
                    device.append(df[col].index[i])
                    res = res.append({'BLK-ASB-INV': device[0], 'Scount': device[1], 'FROM': device[2], 'TO': device[3]}, ignore_index=True)
                    device = []
For reference, here is the output dataframe.
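As a side note (not part of the original solution), the run detection can also be expressed with a shift/cumsum grouping trick. The FROM/TO semantics in this sketch take the first and last timestamp of each run, which differs slightly from the index[i-1] handling above, so treat it as an alternative rather than a drop-in replacement:

```python
import pandas as pd

def summarize_runs(df: pd.DataFrame) -> pd.DataFrame:
    """Collapse each consecutive run of equal non-zero values into one row."""
    rows = []
    for col in df.columns:
        s = df[col]
        run_id = (s != s.shift()).cumsum()   # increments whenever the value changes
        for _, run in s.groupby(run_id):
            if run.iloc[0] != 0:
                rows.append({'BLK-ASB-INV': col, 'Scount': run.iloc[0],
                             'FROM': run.index[0], 'TO': run.index[-1]})
    return pd.DataFrame(rows, columns=['BLK-ASB-INV', 'Scount', 'FROM', 'TO'])

# Demo with a made-up one-column frame:
idx = pd.date_range('2020-01-16 10:00', periods=5, freq='min')
demo = pd.DataFrame({'INV-1': [0, 3, 3, 0, 2]}, index=idx)
print(summarize_runs(demo))
```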
I am using python3 and pandas to create a script that will:
Be dynamic across different dataset lengths(rows) and unique values - completed
Take unique values from column A and create separate dataframes as variables for each unique entry - completed
Add totals to the bottom of each dataframe - completed
Concatenate the separate dataframes back together - incomplete
The issue is that I am unable to formulate a way to create a list of the variables in use and pass them as arguments to pd.concat.
The sample dataset. The dataset may have more unique BrandFlavors or less which is why the script must be flexible and dynamic.
Script:
import pandas as pd
import warnings
warnings.simplefilter(action='ignore')
excel_file = ('testfile.xlsx')
df = pd.read_excel(excel_file)
df = df.sort_values(by='This', ascending=False)
colarr = df.columns.values
arr = df[colarr[0]].unique()
for i in range(len(arr)):
    globals()['var%s' % i] = df.loc[df[colarr[0]] == arr[i]]
for i in range(len(arr)):
    if globals()['var%s' % i].empty:
        pass
    else:
        globals()['var%s' % i] = globals()['var%s' % i].append({'BrandFlavor': 'Total',
            'This': globals()['var%s' % i]['This'].sum(),
            'Last': globals()['var%s' % i]['Last'].sum(),
            'Diff': globals()['var%s' % i]['Diff'].sum(),
            '% Chg': globals()['var%s' % i]['Diff'].sum() / globals()['var%s' % i]['Last'].sum() * 100}, ignore_index=True)
        globals()['var%s' % i]['% Chg'].fillna(0, inplace=True)
        globals()['var%s' % i].fillna(' ', inplace=True)
I have tried the below, however the list is a series of strings:
vararr = []
count = 0
for x in range(len(arr)):
    vararr.append('var' + str(count))
    count = count + 1
df = pd.concat([vararr])
pd.concat does not recognize a string. I tried to build a class with an arg defined but had the same issue.
The desired outcome would be a code snippet that generates a list of variables matching the ones created by lines 9/10 (the globals() loops) and can be referenced as pd.concat([list, of, vars, here]). It must be dynamic. Thank you.
Just fixing the issue at hand: you shouldn't use globals() to create variables, as that is not considered good practice. Your code should work with some minor modifications.
import pandas as pd
import warnings
warnings.simplefilter(action='ignore')
excel_file = ('testfile.xlsx')
df = pd.read_excel(excel_file)
df = df.sort_values(by='This', ascending=False)
def good_dfs(dataframe):
    if dataframe.empty:
        return dataframe
    this = dataframe.This.sum()
    last = dataframe.Last.sum()
    diff = dataframe.Diff.sum()
    data = {
        'BrandFlavor': 'Total',
        'This': this,
        'Last': last,
        'Diff': diff,
        'Pct Change': diff / last * 100
    }
    # append returns a new frame, so the result has to be assigned back
    dataframe = dataframe.append(data, ignore_index=True)
    dataframe['Pct Change'].fillna(0.0, inplace=True)
    dataframe.fillna(' ', inplace=True)
    return dataframe
colarr = df.columns.values
arr = df[colarr[0]].unique()
dfs = []
for i in range(len(arr)):
    temp = df.loc[df[colarr[0]] == arr[i]]
    dfs.append(temp)
final_dfs = [good_dfs(d) for d in dfs]
final_df = pd.concat(final_dfs)
Although I will say, there are far easier ways to accomplish what you want without doing all of this; that can be a separate question, however.
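For the record, one of those easier ways (a sketch, assuming the same column names as the question) is a plain groupby, which avoids building per-group dataframes entirely:

```python
import pandas as pd

def add_totals(df: pd.DataFrame, key: str = 'BrandFlavor') -> pd.DataFrame:
    """Append a 'Total' row after each group of `key` values."""
    pieces = []
    for name, group in df.groupby(key, sort=False):
        total = {key: 'Total', 'This': group['This'].sum(),
                 'Last': group['Last'].sum(), 'Diff': group['Diff'].sum()}
        total['% Chg'] = total['Diff'] / total['Last'] * 100
        pieces.append(pd.concat([group, pd.DataFrame([total])], ignore_index=True))
    out = pd.concat(pieces, ignore_index=True)
    out['% Chg'] = out['% Chg'].fillna(0)   # data rows have no % Chg yet
    return out

# Demo with made-up data:
demo = pd.DataFrame({'BrandFlavor': ['A', 'A', 'B'],
                     'This': [2, 3, 4], 'Last': [1, 2, 4], 'Diff': [1, 1, 0]})
print(add_totals(demo))
```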
I have a CSV file with numerous URLs. I read it into a pandas dataframe for convenience. I need to do some statistical work later - and pandas is just handy. It looks a little like this:
import pandas as pd
csv = [{"URLs" : "www.mercedes-benz.de", "electric" : 1}, {"URLs" : "www.audi.de", "electric" : 0}]
df = pd.DataFrame(csv)
My task is to check whether the websites contain certain strings, and to add an extra column with 1 if so, else 0. For example: I want to check whether www.mercedes-benz.de contains the string car.
import requests

page_content = requests.get("http://www.mercedes-benz.de")  # requests needs the scheme
if "car" in page_content.text:
    print('1')
else:
    print('0')
How do I iterate/loop through pd.URLs and store the information in the pandas dataframe?
I think you need to loop over the rows with DataFrame.iterrows and then create the new values with loc:
for i, row in df.iterrows():
    page_content = requests.get(row['URLs'])
    if "car" in page_content.text:
        df.loc[i, 'car'] = '1'
    else:
        df.loc[i, 'car'] = '0'

print(df)
URLs electric car
0 http://www.mercedes-benz.de 1 1
1 http://www.audi.de 0 1
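If it helps, the keyword check can also be split from the fetching so the logic is testable without hitting the network. The `flag_keyword` helper and the demo frame below are made up for illustration:

```python
import pandas as pd

def flag_keyword(text: str, keyword: str = "car") -> str:
    """Return '1' if the keyword occurs in the page text, else '0'."""
    return "1" if keyword in text else "0"

# In the real loop you would fetch each page first, e.g.:
#     df.loc[i, 'car'] = flag_keyword(requests.get(row['URLs']).text)

# Demo with canned page text instead of live requests:
df = pd.DataFrame({"URLs": ["www.mercedes-benz.de", "www.audi.de"],
                   "page_text": ["our car lineup", "unsere Autos"]})
df["car"] = df["page_text"].map(flag_keyword)
print(df[["URLs", "car"]])
```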