Python 3.7
A task: add a new column to the received DataFrame based on two conditions:
if the value in the NET_NAME column equals one of the values in a list and the value in the ECELL_TYPE column is LTE, then assign the value from the ENODEB_NAME column to the SHARING column.
import csv
import os
import pandas as pd
import datetime
import numpy as np
from time import gmtime, strftime
WCOUNT=strftime("%V", gmtime())
WCOUNT = int(WCOUNT)
WCOUNT_last = int(WCOUNT)-1
os.environ['NLS_LANG'] = 'Russian.AL32UTF8'
cell_file_list=pd.read_excel('cdt_config.xlsx',sheet_name ='cdt_config',index_col='para_name')
filial_name_list=pd.read_excel('FILIAL_NAME.xlsx')
gcell_file_name1=cell_file_list.para_value.loc['ucell_file_name']
ecell_file_name=cell_file_list.para_value.loc['ecell_file_name']
cols_simple=['RECDATE','REGION_PHOENIX_NAME','NET_NAME','CELL_NAME_IN_BSC','ENODEB_NAME','ECELL_TYPE','NRI_ADDRESS', 'NRI_BS_NUMBER','NRI_SITEID','STOPTIME', ]
cols_export=['GSM', 'UMTS', 'LTE', 'TOTAL', 'NWEEK', 'SHARING' ]
ecell_df = pd.read_csv(ecell_file_name, sep=",", encoding='cp1251',
                       dtype={'NRI_SITEID': str})
ecell_df=ecell_df.rename(columns={"RECDATE.DATE": "RECDATE"})
ecell_df=ecell_df.rename(columns={"ECELL_MNEMONIC": "CELL_NAME_IN_BSC"})
#replace ","
ecell_df.STOPTIME=pd.to_numeric(ecell_df.STOPTIME.replace(',', '', regex=True), errors='coerce')/3600
ecell_df=ecell_df[cols_simple]
#pivot ecell table
ecell_sum_df=pd.pivot_table(ecell_df,values='STOPTIME',index=['RECDATE','NRI_SITEID','REGION_PHOENIX_NAME','NET_NAME','ENODEB_NAME','ECELL_TYPE'],aggfunc='sum')
ecell_sum_df=ecell_sum_df.fillna(0)
#create an empty DataFrame with the same index as the pivot table.
ecell_export_df= pd.DataFrame(index=ecell_sum_df.index.copy())
ecell_export_df=ecell_export_df.assign(LTE=0)
ecell_export_df.LTE=ecell_sum_df.STOPTIME
ecell_export_df['SHARING'] = 0
ecell_export_df.SHARING.replace(ecell_export_df.NET_NAME in filial_name_list, ENODEB_NAME,inplace=True)
print(ecell_export_df)
#print (ecell_export_df)
del ecell_df
del ecell_sum_df
export_df=pd.concat([ecell_export_df],join='outer',axis=1)
export_df=export_df.fillna(0)
export_df['TOTAL'] = export_df.sum(axis=1)
export_df['NWEEK'] = WCOUNT_last
del ecell_export_df
#################################################
Below is the error message:
Traceback (most recent call last):
File "C:/Users/PycharmProjects/ReportCDT/CDT 4G_power pivot.py", line 43, in <module>
ecell_export_df.SHARING.replace(ecell_sum_df.NET_NAME in filial_name_list, ENODEB_NAME,inplace=True)
File "C:\Users\vavrumyantsev\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\core\generic.py", line 5067, in __getattr__
return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'NET_NAME'
Your traceback contains: DataFrame object has no attribute NET_NAME,
which actually means that this DataFrame has no column with that name.
The message pertains to ecell_sum_df.NET_NAME (also contained in
the traceback), so let's look at how you created this DataFrame (slightly
reformatted for readability):
ecell_sum_df=pd.pivot_table(ecell_df, values='STOPTIME',\
index=['RECDATE', 'NRI_SITEID', 'REGION_PHOENIX_NAME', 'NET_NAME',
'ENODEB_NAME', 'ECELL_TYPE'], aggfunc='sum')
Note that NET_NAME is part of the index list, so in the DataFrame
created it is a level of the MultiIndex, not an "ordinary" column.
So Python is right to display this message.
Maybe you should move this level of the MultiIndex to a "normal" column, e.g. with reset_index()?
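For example, a minimal sketch of that idea (the column name 'FILIAL_NAME' in filial_name_list is an assumption; adjust it to whatever your Excel file actually uses):
import numpy as np

# move the MultiIndex levels back into ordinary columns
ecell_export_df = ecell_export_df.reset_index()

# both conditions from the task: NET_NAME is in the filial list and ECELL_TYPE is LTE
mask = (ecell_export_df['NET_NAME'].isin(filial_name_list['FILIAL_NAME'])
        & (ecell_export_df['ECELL_TYPE'] == 'LTE'))

# where the mask holds, take the value from ENODEB_NAME, otherwise keep 0
ecell_export_df['SHARING'] = np.where(mask, ecell_export_df['ENODEB_NAME'], 0)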
This is my code:
import pandas as pd
import re
# reading the csv file
patients = pd.read_csv("partial.csv")
# updating the column value/data
for patient in patients.iterrows():
    cip = patient['VALOR_ID']
    new_cip = re.sub('^(\w+|)', r'FIXED_REPLACED_STRING', cip)
    patient['VALOR_ID'] = new_cip
# writing into the file
df.to_csv("partial-writer.csv", index=False)
print(df)
I'm getting this message:
Traceback (most recent call last):
File "/home/jeusdi/projects/workarea/salut/load-testing/load.py", line 28, in
cip=patient['VALOR_ID']
TypeError: tuple indices must be integers or slices, not str
EDIT
From the code above you might think I need to set the same fixed value on all rows.
Actually I need to loop over the rows, generate a random string, and set a different value on each row.
Code above would be:
for patient in patients.iterrows():
    new_cip = generate_cip()
    patient['VALOR_ID'] = new_cip
Use Series.str.replace, though I'm not sure about the | in the regex; maybe it should be removed:
df = pd.read_csv("partial.csv")
df['VALOR_ID'] = df['VALOR_ID'].str.replace('^(\w+|)',r'FIXED_REPLACED_STRING')
#if the function returns scalars
df['VALOR_ID'] = df['VALOR_ID'].apply(generate_cip)
df.to_csv("partial-writer.csv", index=False)
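For the per-row random string case, a rough sketch of the apply approach (generate_cip here is a hypothetical placeholder that ignores the passed-in value and returns a fresh random string, since its real definition wasn't shown):
import random
import string

def generate_cip(_value):
    # hypothetical helper: build a random 10-character alphanumeric string
    return ''.join(random.choices(string.ascii_uppercase + string.digits, k=10))

df['VALOR_ID'] = df['VALOR_ID'].apply(generate_cip)
df.to_csv("partial-writer.csv", index=False)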
I am trying to vectorize the dask.dataframe with dask HashingVectorizer. I want the vectorization results to stay in the cluster (distributed system). That's why I am using client.persist when I try to transform the data. But for some reason, I am getting the error below.
Traceback (most recent call last):
File "/home/dodzilla/my_project/components_with_adapter/vectorizers/base_vectorizer.py", line 112, in hybrid_feature_vectorizer
CLUSTERING_FEATURES=self.clustering_features)
File "/home/dodzilla/my_project/components_with_adapter/vectorizers/text_vectorizer.py", line 143, in vectorize
X = self.client.persist(fitted_vectorizer.transform, combined_data)
File "/home/dodzilla/.local/lib/python3.6/site-packages/distributed/client.py", line 2860, in persist
assert all(map(dask.is_dask_collection, collections))
AssertionError
I can't share the data but all of the necessary information about the data is as below:
>>>type(combined_data)
<class 'dask.dataframe.core.Series'>
>>>type(combined_data.compute())
<class 'pandas.core.series.Series'>
>>>combined_data.compute().shape
12
A minimal working example can be found below. In the code snippet, combined_data holds the merged columns: all of the columns are merged into one column. The data has 12 rows, and every value in the rows is a string. This is the code where I am getting the error:
from stop_words import get_stop_words
from dask_ml.feature_extraction.text import HashingVectorizer as daskHashingVectorizer
import pandas as pd
import dask
import dask.dataframe as dd
from dask.distributed import Client
def convert_dataframe_to_single_text(documents):
    """
    Combine all of the columns into 1 column.
    """
    if type(documents) is dask.dataframe.core.DataFrame:
        cols = documents.columns
        documents['combined'] = documents[cols].apply(func=(lambda row: ' '.join(row.values.astype(str))), axis=1,
                                                      meta=('str'))
        document_texts = documents.drop(cols, axis=1)
    else:
        raise TypeError('Wrong type of data. Expected Pandas DF or Dask DF but received ', type(documents))
    return document_texts
# Init the client.
client = Client('localhost:8786')
# Get stopwords
stopwords = get_stop_words(language="english")
# Create dask dataframe from pandas dataframe
data = {'Name':['Tom', 'nick', 'krish', 'jack'], 'Age':["twenty", "twentyone", "nineteen", "eighteen"]}
df = pd.DataFrame(data)
df = dd.from_pandas(df, npartitions=1)
# Init the vectorizer
vectorizer = daskHashingVectorizer(stop_words=stopwords, alternate_sign=False,
norm=None, binary=False,
n_features=10000)
# Combine all of the columns into 1 column.
combined_data = convert_dataframe_to_single_text(df)
# Fit the vectorizer.
fitted_vectorizer = client.persist(vectorizer.fit(combined_data))
# Transform the data.
X = client.persist(fitted_vectorizer.transform, combined_data)
I hope this information is enough.
Important note: I am not getting any kind of error when I use client.compute, but from what I understand that doesn't run on the cluster of machines; instead it runs on the local machine, and it returns a csr matrix instead of a lazily evaluated dask.array.
This is not how client.persist is supposed to be used. The functions I was looking for are client.submit and client.map... In my case client.submit solved my issue.
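A rough sketch of what that looks like, assuming the same client, fitted_vectorizer and combined_data as in the question (details may differ in your setup):
# submit the transform call as a task; the result stays on the cluster as a future
future = client.submit(fitted_vectorizer.transform, combined_data)

# pass the future on to further tasks, or pull the result back only when needed
X = future.result()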
I have a very large CSV file that I managed to order by an id column, but I cannot calculate the average of the column values that share that id.
88741,42.84286022,16.41829224,1
88797,42.78081536,16.40743455,1
88797,42.78081536,16.21153455,1
88823,42.51512511,16.43304948,2
88885,42.88204193,16.12412548,2
87227,42.88204193,16.64223948,3
and so on...
I need to get a new CSV without the SchoolCode column, with Lat and Long averaged for each cluster, and with the same number of decimal digits. When I tried pandas it threw me the error below.
The output should be something like this:
Lat,Long,Cluster
<average_lat_forCluster1>,<average_long_forCluster1>,1
<average_lat_forCluster2>,<average_long_forCluster2>,2
<average_lat_forCluster3>,<average_long_forCluster3>,3
and so on...
My code:
import pandas as pd
df = pd.read_csv('SortedCluster.csv', names=[
'SchoolCode', 'Lat', 'Long', 'Cluster'])
df2 = df.groupby('Cluster')['Lat','Long'].mean()
df2.to_csv('AverageOutput.csv')
Error:
Traceback (most recent call last):
File "averager.py", line 6, in <module>
df2 = df.groupby('Cluster')['Lat','Long'].mean()
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/groupby/groupby.py", line 1306, in mean
return self._cython_agg_general('mean', **kwargs)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/groupby/groupby.py", line 3974, in _cython_agg_general
how, alt=alt, numeric_only=numeric_only, min_count=min_count)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/groupby/groupby.py", line 4046, in _cython_agg_blocks
raise DataError('No numeric types to aggregate')
pandas.core.base.DataError: No numeric types to aggregate
I believe you need to convert the values to numeric first, if necessary:
df[['Lat','Long']] = df[['Lat','Long']].apply(pd.to_numeric, errors='coerce')
And then aggregate the mean per group:
df.groupby('Cluster')['Lat','Long'].mean()
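Putting it together with the write-out step, a minimal sketch (rounding to 8 decimal digits mirrors the sample data and is an assumption; adjust as needed):
import pandas as pd

df = pd.read_csv('SortedCluster.csv',
                 names=['SchoolCode', 'Lat', 'Long', 'Cluster'])
# make sure Lat/Long are numeric before averaging
df[['Lat', 'Long']] = df[['Lat', 'Long']].apply(pd.to_numeric, errors='coerce')

# average per cluster, keep Cluster as a regular column, and drop SchoolCode
out = df.groupby('Cluster')[['Lat', 'Long']].mean().round(8).reset_index()
out[['Lat', 'Long', 'Cluster']].to_csv('AverageOutput.csv', index=False)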
My code retrieves historical data from 365 days back from today for 50 different stocks.
I want to store all of that data in one DataFrame to make it easier to analyse; then I want to filter the data date-wise and calculate the number of advancing/declining stocks on a given date.
My code:
import datetime
from datetime import date, timedelta
import pandas as pd
import nsepy as ns
#setting default dates
end_date = date.today()
start_date = end_date - timedelta(365)
#Deriving the names of 50 stocks in Nifty 50 Index
nifty_50 = pd.read_html('https://en.wikipedia.org/wiki/NIFTY_50')
nifty50_symbols = nifty_50[1][1]
for x in nifty50_symbols:
    data = ns.get_history(symbol = x, start=start_date, end=end_date)
    big_df = pd.concat(data)
Output:
Traceback (most recent call last):
File "F:\My\Getting data from NSE\advances.py", line 27, in <module>
big_df = pd.concat(data)
File "C:\Users\Abinash\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\core\reshape\concat.py", line 212, in concat
copy=copy)
File "C:\Users\Abinash\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\core\reshape\concat.py", line 227, in __init__
'"{name}"'.format(name=type(objs).__name__))
TypeError: first argument must be an iterable of pandas objects, you passed an object of type "DataFrame"
I am very new to Python. I went through the pandas tutorial and saw that pandas.concat is used to merge multiple DataFrames into one; I might have understood that wrong.
The data passed for concatenation has to be an iterable of pandas objects, for example a list:
results = []
for x in nifty50_symbols:
    data = ns.get_history(symbol = x, start=start_date, end=end_date)
    results.append(data)
big_df = pd.concat(results)
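If it also helps to keep track of which symbol each block of rows came from (an optional variation, not required for the fix), pd.concat accepts a keys argument:
results = []
symbols = []
for x in nifty50_symbols:
    results.append(ns.get_history(symbol=x, start=start_date, end=end_date))
    symbols.append(x)

# keys adds an outer index level holding the symbol for each stock's rows
big_df = pd.concat(results, keys=symbols)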
I need to store the timestamps in a list for further operations and have written the following code:
import csv
from datetime import datetime
from collections import defaultdict
t = []
columns = defaultdict(list)
fmt = '%Y-%m-%d %H:%M:%S.%f'
with open('log.csv', 'r') as f:
    reader = csv.reader(f, delimiter=',')
    for row in reader:
        #t = row[1]
        for i in range(len(row)):
            columns[i].append(row[i])
        if (row):
            t=list(datetime.strptime(row[0],fmt))
columns = dict(columns)
print (columns)
for i in range(len(row)-1):
    print (t)
But I am getting this error:
Traceback (most recent call last):
File "parking.py", line 17, in <module>
t = list(datetime.strptime(row[0],fmt))
TypeError: 'datetime.datetime' object is not iterable
What can I do to store each timestamp in the column in a list?
Edit 1:
Here is the sample log file
2011-06-26 21:27:41.867801,KA03JI908,Bike,Entry
2011-06-26 21:27:42.863209,KA02JK1029,Car,Exit
2011-06-26 21:28:43.165316,KA05K987,Bike,Entry
If you have a CSV file, then why not use pandas to get what you want? The code for your problem may be something like this:
import pandas as pd
df = pd.read_csv('log.csv', header=None)  # the sample log has no header row
timestamp = df[0]
If the first column of the CSV holds the timestamps, then you have a Series containing all the entries of that column in the variable named timestamp.
After this you can convert all of those entries into datetime objects using datetime.datetime.strptime().
Hope this is helpful.
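A small sketch of that conversion, assuming the format shown in the sample log:
from datetime import datetime

fmt = '%Y-%m-%d %H:%M:%S.%f'
timestamp_objects = [datetime.strptime(ts, fmt) for ts in timestamp]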
I can't comment for clarifications yet.
Would this code get you the timestamps in a list? If yes, give me a few lines of data from the csv file.
from datetime import datetime
timestamps = []
with open(csv_path, 'r') as readf_obj:
    for line in readf_obj:
        timestamps.append(line.split(',')[0])
fmt = '%Y-%m-%d %H:%M:%S.%f'
datetimes_timestamps = [datetime.strptime(timestamp_, fmt) for timestamp_ in timestamps]