Python - CSV - Calculate average of column values by a column id - python-3.x

I have a very large CSV file (columns: SchoolCode, Lat, Long, Cluster) that I managed to sort by the cluster id column, but I cannot work out how to calculate the average of the column values that share that id. A sample:
88741,42.84286022,16.41829224,1
88797,42.78081536,16.40743455,1
88797,42.78081536,16.21153455,1
88823,42.51512511,16.43304948,2
88885,42.88204193,16.12412548,2
87227,42.88204193,16.64223948,3
and so on...
I need to get a new csv without the SchoolCode column, with Lat and Long averaged for each cluster. The number of decimal digits should also stay the same. I tried pandas, but it throws the error shown below.
The output should be something like this:
Lat,Long,Cluster
<average_lat_forCluster1>,<average_long_forCluster1>,1
<average_lat_forCluster2>,<average_long_forCluster2>,2
<average_lat_forCluster3>,<average_long_forCluster3>,3
and so on...
My code:
import pandas as pd
df = pd.read_csv('SortedCluster.csv',
                 names=['SchoolCode', 'Lat', 'Long', 'Cluster'])
df2 = df.groupby('Cluster')['Lat','Long'].mean()
df2.to_csv('AverageOutput.csv')
Error:
Traceback (most recent call last):
File "averager.py", line 6, in <module>
df2 = df.groupby('Cluster')['Lat','Long'].mean()
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/groupby/groupby.py", line 1306, in mean
return self._cython_agg_general('mean', **kwargs)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/groupby/groupby.py", line 3974, in _cython_agg_general
how, alt=alt, numeric_only=numeric_only, min_count=min_count)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/groupby/groupby.py", line 4046, in _cython_agg_blocks
raise DataError('No numeric types to aggregate')
pandas.core.base.DataError: No numeric types to aggregate

I believe you need to convert the values to numeric first, if necessary:
df[['Lat','Long']] = df[['Lat','Long']].apply(pd.to_numeric, errors='coerce')
Then aggregate the mean per group (selecting the columns with a list):
df.groupby('Cluster')[['Lat','Long']].mean()
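Putting it together, a minimal sketch; float_format='%.8f' is an assumption, chosen to match the 8 decimal places in the sample data:
import pandas as pd

df = pd.read_csv('SortedCluster.csv',
                 names=['SchoolCode', 'Lat', 'Long', 'Cluster'])
# coerce in case stray non-numeric values (e.g. a header row) slipped into the columns
df[['Lat', 'Long']] = df[['Lat', 'Long']].apply(pd.to_numeric, errors='coerce')
# SchoolCode is dropped automatically because it is not in the selection
out = df.groupby('Cluster')[['Lat', 'Long']].mean()
# float_format keeps a fixed number of decimal digits in the output
out.to_csv('AverageOutput.csv', float_format='%.8f')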

Related

Changing a csv row value

This is my code:
import pandas as pd
import re
# reading the csv file
patients = pd.read_csv("partial.csv")
# updating the column value/data
for patient in patients.iterrows():
    cip = patient['VALOR_ID']
    new_cip = re.sub('^(\w+|)', r'FIXED_REPLACED_STRING', cip)
    patient['VALOR_ID'] = new_cip
# writing into the file
df.to_csv("partial-writer.csv", index=False)
print(df)
I'm getting this message:
Traceback (most recent call last):
File "/home/jeusdi/projects/workarea/salut/load-testing/load.py", line 28, in
cip=patient['VALOR_ID']
TypeError: tuple indices must be integers or slices, not str
EDIT
From the code above you might think I need to set the same fixed value for all rows.
Actually I need to loop over the rows, generate a random string, and set a different value on each row.
The code above would become:
for patient in patients.iterrows():
    new_cip = generate_cip()
    patient['VALOR_ID'] = new_cip
Use Series.str.replace, though I am not sure about the | in the regex; maybe it should be removed:
df = pd.read_csv("partial.csv")
df['VALOR_ID'] = df['VALOR_ID'].str.replace(r'^(\w+|)', 'FIXED_REPLACED_STRING')
# if the function returns scalars (apply passes each existing value as its argument)
df['VALOR_ID'] = df['VALOR_ID'].apply(generate_cip)
df.to_csv("partial-writer.csv", index=False)
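Note that apply passes each existing value into the function, so if generate_cip takes no arguments it will fail. A minimal sketch of the per-row variant; the body of generate_cip is an assumption, since the question never shows it:
import random
import string

import pandas as pd

def generate_cip():
    # hypothetical generator: 8 random uppercase letters and digits
    return ''.join(random.choices(string.ascii_uppercase + string.digits, k=8))

df = pd.read_csv("partial.csv")
# one fresh random string per row
df['VALOR_ID'] = [generate_cip() for _ in range(len(df))]
df.to_csv("partial-writer.csv", index=False)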

How to calculate values in an array imported from csv.reader?

I have this csv file
Germany,1,5,10,20
UK,0,2,4,10
Hungary,6,11,22,44
France,8,22,33,55
and this script.
I would like to perform arithmetic operations on the values in the 2D array (data):
for example, print the value data[1][3] increased by 10.
It seems that I need some conversion to integer, right?
What is the best solution?
import csv

datafile = open('sample.csv', 'r')
datareader = csv.reader(datafile, delimiter=',')
data = []
for row in datareader:
    data.append(row)
print(data[1][3] + 10)
I got this error
/python$ python3 read6.py
Traceback (most recent call last):
File "read6.py", line 8, in <module>
print ((data[1][3])+10)
TypeError: must be str, not int
You'll have to manually convert to integers as you suspected:
import csv

datafile = open('sample.csv', 'r')
datareader = csv.reader(datafile, delimiter=',')
data = []
for row in datareader:
    data.append([row[0]] + list(map(int, row[1:])))
print(data[1][3] + 10)
Specifically this modification on line 7 of your code:
data.append([row[0]] + list(map(int, row[1:])))
The csv package docs mention that
No automatic data type conversion is performed unless the QUOTE_NONNUMERIC format option is specified (in which case unquoted fields are transformed into floats).
Since the strings in your CSV are not quoted (i.e. Germany rather than "Germany"), this isn't useful for your case, so converting manually is the way to go.
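For illustration, a sketch of how QUOTE_NONNUMERIC would behave if the string fields were quoted; sample_quoted.csv is a hypothetical file written as "Germany",1,5,10,20 and so on:
import csv

with open('sample_quoted.csv', 'r') as f:
    reader = csv.reader(f, quoting=csv.QUOTE_NONNUMERIC)
    data = [row for row in reader]

# unquoted fields are converted to floats automatically,
# so data[1][3] is 4.0 and arithmetic works directly
print(data[1][3] + 10)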

AttributeError: 'DataFrame' object has no attribute 'NET_NAME'

Python 3.7
The task: add a new column to the resulting data frame based on two conditions:
if the value in the NET_NAME column equals one of the values in a list and the value in the ECELL_TYPE column is LTE, then assign the value from the ENODEB_NAME column to the SHARING column.
import csv
import os
import pandas as pd
import datetime
import numpy as np
from time import gmtime, strftime
WCOUNT=strftime("%V", gmtime())
WCOUNT = int(WCOUNT)
WCOUNT_last = int(WCOUNT)-1
os.environ['NLS_LANG'] = 'Russian.AL32UTF8'
cell_file_list=pd.read_excel('cdt_config.xlsx',sheet_name ='cdt_config',index_col='para_name')
filial_name_list=pd.read_excel('FILIAL_NAME.xlsx')
gcell_file_name1=cell_file_list.para_value.loc['ucell_file_name']
ecell_file_name=cell_file_list.para_value.loc['ecell_file_name']
cols_simple=['RECDATE','REGION_PHOENIX_NAME','NET_NAME','CELL_NAME_IN_BSC','ENODEB_NAME','ECELL_TYPE','NRI_ADDRESS', 'NRI_BS_NUMBER','NRI_SITEID','STOPTIME', ]
cols_export=['GSM', 'UMTS', 'LTE', 'TOTAL', 'NWEEK', 'SHARING' ]
ecell_df = df = pd.read_csv(ecell_file_name, sep=",", encoding='cp1251',
                            dtype={'NRI_SITEID': str})
ecell_df=ecell_df.rename(columns={"RECDATE.DATE": "RECDATE"})
ecell_df=ecell_df.rename(columns={"ECELL_MNEMONIC": "CELL_NAME_IN_BSC"})
#replace ","
ecell_df.STOPTIME=pd.to_numeric(ecell_df.STOPTIME.replace(',', '', regex=True), errors='coerce')/3600
ecell_df=ecell_df[cols_simple]
#pivot ecell table
ecell_sum_df=pd.pivot_table(ecell_df,values='STOPTIME',index=['RECDATE','NRI_SITEID','REGION_PHOENIX_NAME','NET_NAME','ENODEB_NAME','ECELL_TYPE'],aggfunc='sum')
ecell_sum_df=ecell_sum_df.fillna(0)
#create an empty column with the same index as the pivot table.
ecell_export_df= pd.DataFrame(index=ecell_sum_df.index.copy())
ecell_export_df=ecell_export_df.assign(LTE=0)
ecell_export_df.LTE=ecell_sum_df.STOPTIME
ecell_export_df['SHARING'] = 0
ecell_export_df.SHARING.replace(ecell_export_df.NET_NAME in filial_name_list, ENODEB_NAME,inplace=True)
print(ecell_export_df)
#print (ecell_export_df)
del ecell_df
del ecell_sum_df
export_df=pd.concat([ecell_export_df],join='outer',axis=1)
export_df=export_df.fillna(0)
export_df['TOTAL'] = export_df.sum(axis=1)
export_df['NWEEK'] = WCOUNT_last
del ecell_export_df
#################################################
Below is the error message:
Traceback (most recent call last):
File "C:/Users/PycharmProjects/ReportCDT/CDT 4G_power pivot.py", line 43, in <module>
ecell_export_df.SHARING.replace(ecell_sum_df.NET_NAME in filial_name_list, ENODEB_NAME,inplace=True)
File "C:\Users\vavrumyantsev\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\core\generic.py", line 5067, in __getattr__
return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'NET_NAME'
Your traceback contains DataFrame object has no attribute NET_NAME,
meaning that this DataFrame has no column of that name.
The message pertains to ecell_sum_df.NET_NAME (also contained in
the traceback), so let's look at how you created this DataFrame (slightly
reformatted for readability):
ecell_sum_df = pd.pivot_table(ecell_df, values='STOPTIME',
                              index=['RECDATE', 'NRI_SITEID', 'REGION_PHOENIX_NAME',
                                     'NET_NAME', 'ENODEB_NAME', 'ECELL_TYPE'],
                              aggfunc='sum')
Note that NET_NAME is part of the index list, so in the DataFrame
created it is a level of the MultiIndex, not an "ordinary" column.
So Python is right to display this message.
Maybe you should move this level of the MultiIndex to a "normal" column? See the sketch below.
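A minimal sketch of that fix, assuming the pivot table above: reset_index turns every index level back into a regular column, after which the attribute access works:
ecell_sum_df = ecell_sum_df.reset_index()
# NET_NAME is now an ordinary column, so this no longer raises AttributeError
print(ecell_sum_df.NET_NAME.head())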

Creating a big pandas Dataframe [duplicate]

This question already has answers here:
TypeError: first argument must be an iterable of pandas objects, you passed an object of type "DataFrame"
My code retrieves historical data from 365 days back through today for 50 different stocks.
I want to store all that data in one dataframe to make it easier to analyse; I then want to filter the data by date and calculate the number of advancing/declining stocks on a given date.
My code:
import datetime
from datetime import date, timedelta
import pandas as pd
import nsepy as ns
#setting default dates
end_date = date.today()
start_date = end_date - timedelta(365)
#Deriving the names of 50 stocks in Nifty 50 Index
nifty_50 = pd.read_html('https://en.wikipedia.org/wiki/NIFTY_50')
nifty50_symbols = nifty_50[1][1]
for x in nifty50_symbols:
    data = ns.get_history(symbol=x, start=start_date, end=end_date)
    big_df = pd.concat(data)
Output:
Traceback (most recent call last):
File "F:\My\Getting data from NSE\advances.py", line 27, in <module>
big_df = pd.concat(data)
File "C:\Users\Abinash\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\core\reshape\concat.py", line 212, in concat
copy=copy)
File "C:\Users\Abinash\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\core\reshape\concat.py", line 227, in __init__
'"{name}"'.format(name=type(objs).__name__))
TypeError: first argument must be an iterable of pandas objects, you passed an object of type "DataFrame"
I am very new to Python. I went through the pandas tutorial and saw that pandas.concat is used to merge multiple dataframes into one; I might have understood that wrong.
The data passed to pd.concat has to be an iterable of pandas objects, for example a list of DataFrames:
results = []
for x in nifty50_symbols:
    data = ns.get_history(symbol=x, start=start_date, end=end_date)
    results.append(data)
big_df = pd.concat(results)
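As a design note, collecting the frames in a list and concatenating once at the end is also much faster than growing a DataFrame with pd.concat inside the loop, because each concat copies all the accumulated data.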

I am trying to read a .csv file which contains data in the order of timestamp, number plate, vehicle type and exit/entry

I need to store the timestamps in a list for further operations and have written the following code:
import csv
from datetime import datetime
from collections import defaultdict
t = []
columns = defaultdict(list)
fmt = '%Y-%m-%d %H:%M:%S.%f'
with open('log.csv', 'r') as f:
    reader = csv.reader(f, delimiter=',')
    for row in reader:
        #t = row[1]
        for i in range(len(row)):
            columns[i].append(row[i])
        if (row):
            t = list(datetime.strptime(row[0], fmt))
columns = dict(columns)
print(columns)
for i in range(len(row)-1):
    print(t)
But I am getting the error :
Traceback (most recent call last):
File "parking.py", line 17, in <module>
t = list(datetime.strptime(row[0],fmt))
TypeError: 'datetime.datetime' object is not iterable
What can I do to store each timestamp in the column in a list?
Edit 1:
Here is the sample log file
2011-06-26 21:27:41.867801,KA03JI908,Bike,Entry
2011-06-26 21:27:42.863209,KA02JK1029,Car,Exit
2011-06-26 21:28:43.165316,KA05K987,Bike,Entry
If you have a csv file, then why not use pandas to get what you want? The code for your problem may look something like this:
import pandas as pd

# header=None because the log file has no header row
df = pd.read_csv('log.csv', header=None)
timestamp = df[0]
If the first column of the csv holds the timestamps, you now have a Series (timestamp) containing all the entries of the first column.
After this you can convert each entry into a datetime object using datetime.datetime.strptime().
Hope this is helpful.
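A sketch of the whole conversion using the vectorized pd.to_datetime instead; the column names are assumptions added for readability:
import pandas as pd

df = pd.read_csv('log.csv', header=None,
                 names=['timestamp', 'plate', 'vehicle', 'direction'])
# the format string matches the sample log lines above
df['timestamp'] = pd.to_datetime(df['timestamp'], format='%Y-%m-%d %H:%M:%S.%f')
timestamps = df['timestamp'].tolist()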
I can't comment for clarifications yet.
Would this code get you the timestamps in a list? If yes, give me a few lines of data from the csv file.
from datetime import datetime
timestamps = []
with open(csv_path, 'r') as readf_obj:
    for line in readf_obj:
        timestamps.append(line.split(',')[0])

fmt = '%Y-%m-%d %H:%M:%S.%f'
datetimes_timestamps = [datetime.strptime(timestamp_, fmt) for timestamp_ in timestamps]
