Saving pandas DataFrame to CSV changes values - python-3.x

I want to save a bunch of values in a DataFrame to CSV, but I keep running into the problem that something changes the values while saving. Let's have a look at the MWE:
import pandas as pd

df = pd.DataFrame({
    "value1": [110.589, 222.534, 390.123],
    "value2": [50.111, 40.086, 45.334]
})
df = df.round(1)
# checkpoint
df.to_csv(some_path)
If I debug it and look at the values of df at the point I marked "checkpoint", i.e. after rounding, they look like
[110.6000, 222.5000, 390.1000],
[50.1000, 40.1000, 45.3000]
In reality, my data frame is much larger and when I open the csv after saving, some values (usually in a random block of a couple of dozen rows) have changed! They then look like
[110.600000000001, 222.499999999999, 390.099999999999],
[50.099999999999, 40.100000000001, 45.300000000001]
So it's always a 0.000000000001 offset from the "real"/rounded values. Does anybody know what's going on here/how I can avoid this?

This is a typical floating-point problem: most decimal fractions have no exact binary representation, so 110.6 is stored as the nearest representable double. pandas gives you the option to define a float_format for the output:
df.to_csv(some_path, float_format='%.4f')
This forces 4 decimals in the written text (the %-formatting rounds to 4 places; it does not truncate). Note that the values now pass through a string format, so if you write with quoting=csv.QUOTE_NONNUMERIC, these columns will be quoted as well.
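A minimal sketch of the fix, using the frame from the question (writing to an in-memory string instead of some_path so the output text can be inspected directly):

```python
import pandas as pd

df = pd.DataFrame({
    "value1": [110.589, 222.534, 390.123],
    "value2": [50.111, 40.086, 45.334],
}).round(1)

# float_format controls how each float is rendered in the output text,
# so representation noise like 110.600000000001 never reaches the file
csv_text = df.to_csv(float_format='%.4f')
print(csv_text)
```

Every row in the written text now carries exactly four decimals, regardless of how the doubles are stored internally.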

Related

Better way to read columns from excel file as variables in Python

I have an .xlsx file with 5 columns (X, Y, Z, Row_Cog, Col_Cog), always in the same order. I would like to have each column as a variable in Python. I am using the method below but would like to know if there is a better way to do it.
I am also writing the range manually (in the for loop); I would like a robust way to determine the length of each column in Excel (the number of rows) and use that instead.
# READ THE TEST DATA from Excel file
import xlrd

workbook = xlrd.open_workbook(r"C:\Desktop\SawToothCalib\TestData.xlsx")
worksheet = workbook.sheet_by_index(0)
X_Test = []
Y_Test = []
Row_Test = []
Col_Test = []
for i in range(1, 29):
    x_val = worksheet.cell_value(i, 0)
    X_Test.append(x_val)
    y_val = worksheet.cell_value(i, 2)
    Y_Test.append(y_val)
    row_val = worksheet.cell_value(i, 3)
    Row_Test.append(row_val)
    col_val = worksheet.cell_value(i, 4)
    Col_Test.append(col_val)
Do you really need this package? You can easily do this kind of operation with pandas.
You can read your file as a DataFrame with:
import pandas as pd
df = pd.read_excel(path + 'file.xlsx', sheet_name=the_sheet_you_want)
and access the list of columns with df.columns. You can access each column with df['column name']. Empty entries are stored as NaN; you can count them with df['column_name'].isnull().sum().
If you are uncomfortable with DataFrames, you can then convert the columns to lists or arrays, like
df['my_col'].tolist()
or
df['my_col'].to_numpy()
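Putting the pieces together, a minimal sketch (an in-memory frame stands in for pd.read_excel, since the .xlsx from the question isn't available here; the column names mirror its layout):

```python
import pandas as pd

# Stand-in for pd.read_excel(r"C:\Desktop\SawToothCalib\TestData.xlsx")
df = pd.DataFrame({
    "X": [1.0, 2.0, 3.0],
    "Y": [4.0, 5.0, 6.0],
    "Z": [0.0, 0.0, 0.0],
    "Row_Cog": [7, 8, 9],
    "Col_Cog": [10, 11, 12],
})

# One list per column; len(df) replaces the hard-coded range(1, 29)
X_Test = df["X"].tolist()
Row_Test = df["Row_Cog"].tolist()
print(len(df), X_Test)
```

The row count comes from the frame itself, so nothing has to be counted by hand when the file grows.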

How to query this CSV in Python: enter a key like "psg" and get its Count (e.g. 9)

This is the example CSV. I already tried another way, but it still prints things I don't need; I just want to get the values (Count).
You can just read it with pandas:
import pandas as pd
df = pd.read_csv(r'FILE.csv')
print(df['Count'])
If you need to print only the Counts:
print(*df['Count'].values.tolist(),sep="\n")
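If the goal is to look up the Count for a single key (like "psg"), a boolean mask does it. Note the column names here ("Character", "Count") and the inline data are assumptions about the CSV's layout, since the real file isn't shown:

```python
import pandas as pd
from io import StringIO

# hypothetical stand-in for FILE.csv; real column names may differ
csv_text = "Character,Count\npsg,9\nbar,4\n"
df = pd.read_csv(StringIO(csv_text))

def count_for(frame, key):
    # select rows whose Character matches, then take that row's Count
    return frame.loc[frame["Character"] == key, "Count"].iloc[0]

print(count_for(df, "psg"))
```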

Need to sort a numpy array based on column value, but value is present in String format

I am a newbie in Python and need to extract info from a CSV file containing terrorism data.
I need to extract the top 5 cities in India with the maximum casualties, where Casualty = Killed (given in the CSV) + Wounded (given in the CSV).
The City column is also given in the CSV file.
The output should list the cities in descending order of casualties:
city_1 casualty_1
city_2 casualty_2
city_3 casualty_3
city_4 casualty_4
city_5 casualty_5
Link to CSV- https://ninjasdatascienceprod.s3.amazonaws.com/3571/terrorismData.csv?AWSAccessKeyId=AKIAIGEP3IQJKTNSRVMQ&Expires=1554719430&Signature=7uYCQ6pAb1xxPJhI%2FAfYeedUcdA%3D&response-content-disposition=attachment%3B%20filename%3DterrorismData.csv
import numpy as np
import csv

file_obj = open("terrorismData.csv", encoding="utf8")
file_data = csv.DictReader(file_obj, skipinitialspace=True)
country = []
killed = []
wounded = []
city = []
final = []
# Making lists
for row in file_data:
    if row['Country'] == 'India':
        country.append(row['Country'])
        killed.append(row['Killed'])
        wounded.append(row['Wounded'])
        city.append(row['City'])
        final.append([row['City'], row['Killed'], row['Wounded']])
# Making numpy arrays out of lists
np_month = np.array(country)
np_killed = np.array(killed)
np_wounded = np.array(wounded)
np_city = np.array(city)
np_final = np.array(final)
# Fixing blank values in final arr
for i in range(len(np_final)):
    for j in range(len(np_final[0])):
        if np_final[i][j] == '':
            np_final[i][j] = '0.0'
# Counting casualties (killed+wounded) and storing in 1st column of final array
for i in range(len(np_final)):
    np_final[i, 1] = float(np_final[i, 1]) + float(np_final[i, 2])
# Descending sort on casualties column
np_final = np_final[np_final[:, 1].argsort()[::-1]]
I expect np_final to get sorted on the casualties column, but it isn't happening because the casualty entries come out with type string.
Any help is appreciated.
I would suggest using pandas; it makes the data easier to manipulate.
Read everything into a DataFrame: it parses numeric columns into numeric dtypes for you.
If you must use NumPy with the csv module, simply cast your values to float while reading (guarding against the blank cells), and the sort should work, barring other bugs.
Something like this:
for row in file_data:
    if row['Country'] == 'India':
        # cast while reading; blank cells become 0.0
        killed_val = float(row['Killed'] or 0)
        wounded_val = float(row['Wounded'] or 0)
        country.append(row['Country'])
        killed.append(killed_val)
        wounded.append(wounded_val)
        city.append(row['City'])
        final.append([row['City'], killed_val, wounded_val])
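The whole task can also be sketched in pandas directly: filter to India, sum Killed and Wounded, group by city, and take the five largest totals. The column names come from the question; a tiny inline frame stands in for the real CSV here:

```python
import pandas as pd

# tiny stand-in for pd.read_csv("terrorismData.csv")
df = pd.DataFrame({
    "Country": ["India", "India", "India", "India", "Other"],
    "City":    ["A", "B", "A", "C", "D"],
    "Killed":  [3, 1, 2, 5, 9],
    "Wounded": [1, 0, 4, 2, 9],
})

india = df[df["Country"] == "India"].copy()
# blank cells become NaN on read; treat them as zero before summing
india["Casualty"] = india["Killed"].fillna(0) + india["Wounded"].fillna(0)
top = india.groupby("City")["Casualty"].sum().nlargest(5)
for city, casualty in top.items():
    print(city, casualty)
```

Because read_csv parses the numeric columns as numbers, the string-sort problem from the question never arises.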

Pandas - iterating to fill values of a dataframe

I'm trying to build a DataFrame of time-series data. I have to retrieve the data from an API, and every (i, j) entry in the DataFrame (where i is the row and j is the column) has to be iterated through and filled individually.
Here's an idea of the kind of thing I'm trying to do (note: the API I'm using doesn't have historical data for what I'm trying to analyze):
import pandas as pd
import numpy as np
import time
def retrievedata(label):
    # take the label, do some stuff with the API,
    # and return a float
    ...

label_list = ['label1', 'label2', 'label3', etc...]
discrete_points = 720
df = pd.DataFrame(index=np.arange(0, discrete_points), columns=label_list)
So at this point I've pre-allocated a data frame. What comes next is the issue.
Now, I want to iterate over it and assign values to every (i,j) entry in the data frame based on a function i wrote to pull data. Note that the function I wrote has to be specific to a certain column (as it is taking as input the column label). And on top of that, each row will have different values b/c it is time-series data.
EDIT: Yuck, I found a gross way to make it work:
for row in range(discrete_points):
    for label in label_list:
        df.at[row, label] = retrievedata(label)
This is obviously a non-Pythonic, non-NumPy, non-pandas way of doing things, so I'd like to find a nicer, more efficient, less compute-intensive way of doing this.
I'm assuming it's going to involve some combination of iterrows(), itertuples(), df.loc and df.at.
I'm stumped though.
Any ideas?
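One option, sketched below: since every cell needs its own API call anyway, the loop can't be vectorized away, but you can collect each column as a plain list and construct the frame in one shot instead of growing it cell by cell with df.at. The retrievedata here is a counting stand-in for the real API call:

```python
import itertools
import pandas as pd

_samples = itertools.count()

def retrievedata(label):
    # stand-in for the real API call; returns a fresh float each time
    return float(next(_samples))

label_list = ["label1", "label2", "label3"]
discrete_points = 5

# one call per (row, label), collected column-wise, then one DataFrame
data = {label: [retrievedata(label) for _ in range(discrete_points)]
        for label in label_list}
df = pd.DataFrame(data)
print(df.shape)
```

Building the dict first avoids the per-cell indexing overhead; the DataFrame is allocated once with the right dtypes.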

using split() to split values in an entire column in a python dataframe

I am trying to clean a list of URLs that contains garbage, as shown:
/gradoffice/index.aspx(
/gradoffice/index.aspx-
/gradoffice/index.aspxjavascript$
/gradoffice/index.aspx~
I have a CSV file with over 190k records of different URLs. I loaded the CSV into a pandas DataFrame and took the entire column of URLs with the statement
str = df['csuristem']
which clearly gave me all the values in the column. When I use the following code, it only prints 40k records, starting somewhere in the middle. I don't know where I'm going wrong: the program runs fine but shows only part of the results. Any help would be much appreciated.
import pandas

df = pandas.read_csv("SS3.csv", dtype=object)
str = df['csuristem']
for s in str:
    s = s.split(".")[0]
    print s
I am looking to get an output like this
/gradoffice/index.
/gradoffice/index.
/gradoffice/index.
/gradoffice/index.
Thank you,
Santhosh.
You need to do the following, so call .str.split on the column and then .str[0] to access the first portion of the split string of interest:
In [6]:
df['csuristem'].str.split('.').str[0]
Out[6]:
0 /gradoffice/index
1 /gradoffice/index
2 /gradoffice/index
3 /gradoffice/index
Name: csuristem, dtype: object
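To clean the whole column in place, assign the split result back; then write the frame out rather than judging by console output, which terminals truncate long before 190k rows (a sketch, with two inline rows standing in for SS3.csv):

```python
import pandas as pd

df = pd.DataFrame({"csuristem": [
    "/gradoffice/index.aspx(",
    "/gradoffice/index.aspxjavascript$",
]})

# overwrite the column with everything before the first "."
df["csuristem"] = df["csuristem"].str.split(".").str[0]
# df.to_csv("SS3_clean.csv", index=False)  # then save the cleaned frame
print(df["csuristem"].tolist())
```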
