Better way to read columns from an Excel file as variables in Python - python-3.x

I have an .xlsx file with 5 columns (X, Y, Z, Row_Cog, Col_Cog), always in the same order. I would like to have each column as a variable in Python. I am using the method below, but would like to know if there is a better way to do it.
I am also writing the range manually (in the for loop); I would like a robust way to determine the length of each column in Excel (the number of rows) and use that instead.
#READ THE TEST DATA from Excel file
import xlrd

workbook = xlrd.open_workbook(r"C:\Desktop\SawToothCalib\TestData.xlsx")
worksheet = workbook.sheet_by_index(0)
X_Test = []
Y_Test = []
Row_Test = []
Col_Test = []
for i in range(1, 29):
    x_val = worksheet.cell_value(i, 0)
    X_Test.append(x_val)
    y_val = worksheet.cell_value(i, 2)
    Y_Test.append(y_val)
    row_val = worksheet.cell_value(i, 3)
    Row_Test.append(row_val)
    col_val = worksheet.cell_value(i, 4)
    Col_Test.append(col_val)

Do you really need this package? You can easily do this kind of operation with pandas.
You can read your file as a DataFrame with:
import pandas as pd
df = pd.read_excel(path + 'file.xlsx', sheet_name=the_sheet_you_want)
and access the list of columns with df.columns. You can access each column with df['column name']. If there are empty entries, they are stored as NaN. You can count them with df['column_name'].isnull().sum().
If you are uncomfortable with DataFrames, you can then convert the columns to lists or arrays, like
df['my_col'].tolist()
or
df['my_col'].to_numpy()
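Putting it together for the original file, a minimal sketch might look like this (assuming the column headers in the sheet match the names given in the question; pandas infers the number of rows, so no hard-coded range is needed):

import pandas as pd

# the header row supplies the column names; every data row is read automatically
df = pd.read_excel(r"C:\Desktop\SawToothCalib\TestData.xlsx", sheet_name=0)

X_Test = df['X'].tolist()
Y_Test = df['Y'].tolist()
Row_Test = df['Row_Cog'].tolist()
Col_Test = df['Col_Cog'].tolist()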

Related

Saving pandas dataframe to csv changes values

I want to save a bunch of values in a DataFrame to CSV, but I keep running into the problem that something changes the values while saving. Let's have a look at the MWE:
import pandas as pd

df = pd.DataFrame({
    "value1": [110.589, 222.534, 390.123],
    "value2": [50.111, 40.086, 45.334],
})
df = df.round(1)
# checkpoint
df.to_csv(some_path)
If I debug it and look at the values of df at the step I marked "checkpoint", thus after rounding, they look like
[110.6000, 222.5000, 390.1000],
[50.1000, 40.1000, 45.3000]
In reality, my data frame is much larger and when I open the csv after saving, some values (usually in a random block of a couple of dozen rows) have changed! They then look like
[110.600000000001, 222.499999999999, 390.099999999999],
[50.099999999999, 40.100000000001, 45.300000000001]
So it's always a 0.000000000001 offset from the "real"/rounded values. Does anybody know what's going on here/how I can avoid this?
This is a typical floating point problem. pandas gives you the option to define a float_format:
df.to_csv(some_path, float_format='%.4f')
This forces four decimals (or actually, cuts off at four decimals). Note that the values are written as formatted strings now, so if you set quoting on strings, these columns are also quoted.
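As a quick illustration (a minimal sketch with made-up values), float_format only affects how the floats are rendered on the way out; the in-memory values are unchanged:

import pandas as pd

# binary floats cannot represent most decimals exactly; 0.1 + 0.2 shows the effect
df = pd.DataFrame({"value": [0.1 + 0.2, 110.589, 45.334]})
df.to_csv("plain.csv")                           # writes 0.30000000000000004
df.to_csv("formatted.csv", float_format="%.4f")  # writes 0.3000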

Accessing columns in a csv using python

I am trying to access data from a CSV using Python. I am able to access entire columns of data values; however, I also want to access rows, using an indexed coordinate system where (0, 1) is column 0, row 1. So far I have this:
#Lukas Robin
#25.07.2021
import csv

with open("sun_data.csv") as sun_data:
    sunData = csv.reader(sun_data, delimiter=',')
    global data
    for data in sunData:
        print(data)
I don't normally use data tables or CSV, so this is a new area for me.
As mentioned in the comment, you could make the jump to using pandas and spend a little time learning that. It would be a good investment of time if you plan to do much data analysis or work with data tables regularly.
If you just want to pull in a table of numbers and access it as you describe, you are perfectly fine using the csv package and doing that. Below is an example...
If your .csv file has a header in it, you can simply add a next(sun_data) call before starting the loop to advance the iterator and let that data fall on the floor (see the variant after the example)...
import csv

f_in = 'data_table.csv'
data = []  # a container to hold the results

with open(f_in, 'r') as source:
    sun_data = csv.reader(source, delimiter=',')
    for row in sun_data:
        # convert the read-in values to float data types (or ints or ...)
        row = [float(t) for t in row]
        # append it to the data table
        data.append(row)

print(data[1][0])
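If data_table.csv did start with a header, the same loop would need only one extra call (a sketch under the same assumptions as above):

import csv

f_in = 'data_table.csv'
data = []

with open(f_in, 'r') as source:
    sun_data = csv.reader(source, delimiter=',')
    next(sun_data)  # advance past the header row and discard it
    for row in sun_data:
        data.append([float(t) for t in row])

# indexed access works as data[row][column]: data[0][1] is row 0, column 1
print(data[0][1])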

Need to sort a numpy array based on a column value, but the values are in string format

I am a newbie in Python and need to extract info from a CSV file containing terrorism data.
I need to extract the top 5 cities in India with the maximum casualties, where casualty = Killed (given in the CSV) + Wounded (given in the CSV).
A City column is also given in the CSV file.
The output should look like the following, in descending order of casualties:
city_1 casualty_1 city_2 casualty_2 city_3 casualty_3 city_4 casualty_4 city_5 casualty_5
Link to CSV- https://ninjasdatascienceprod.s3.amazonaws.com/3571/terrorismData.csv?AWSAccessKeyId=AKIAIGEP3IQJKTNSRVMQ&Expires=1554719430&Signature=7uYCQ6pAb1xxPJhI%2FAfYeedUcdA%3D&response-content-disposition=attachment%3B%20filename%3DterrorismData.csv
import numpy as np
import csv

file_obj = open("terrorismData.csv", encoding="utf8")
file_data = csv.DictReader(file_obj, skipinitialspace=True)
country = []
killed = []
wounded = []
city = []
final = []
# Making lists
for row in file_data:
    if row['Country'] == 'India':
        country.append(row['Country'])
        killed.append(row['Killed'])
        wounded.append(row['Wounded'])
        city.append(row['City'])
        final.append([row['City'], row['Killed'], row['Wounded']])
# Making numpy arrays out of lists
np_month = np.array(country)
np_killed = np.array(killed)
np_wounded = np.array(wounded)
np_city = np.array(city)
np_final = np.array(final)
# Fixing blank values in final arr
for i in range(len(np_final)):
    for j in range(len(np_final[0])):
        if np_final[i][j] == '':
            np_final[i][j] = '0.0'
# Counting casualties (killed + wounded) and storing in 1st column of final array
for i in range(len(np_final)):
    np_final[i, 1] = float(np_final[i, 1]) + float(np_final[i, 2])
# Descending sort on casualties column
np_final = np_final[np_final[:, 1].argsort()[::-1]]
I expect np_final to be sorted on the casualties column, but that's not happening, because the casualties are stored as strings.
Any help is appreciated.
I would suggest using pandas; it would make it easier for you to manipulate the data.
Read everything into a DataFrame; it will parse numeric columns into number formats.
If you must use numpy, you could simply cast your values to float or integer while reading the data, and everything should work, if there are no other bugs.
Something like this:
for row in file_data:
    if row['Country'] == 'India':
        country.append(row['Country'])
        # cast at read time ('' becomes 0.0) so the lists hold numbers, not strings
        killed.append(float(row['Killed'] or 0))
        wounded.append(float(row['Wounded'] or 0))
        city.append(row['City'])
        final.append([row['City'], row['Killed'], row['Wounded']])
# note: np.array(final) still coerces everything to strings because of the city
# column, so sort on a purely numeric array instead, e.g.:
order = (np.array(killed) + np.array(wounded)).argsort()[::-1]
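For the pandas route, a minimal sketch (assuming the CSV has the Country, City, Killed and Wounded columns used in the question, and that the last two parse as numeric) could be:

import pandas as pd

df = pd.read_csv("terrorismData.csv", encoding="utf8", skipinitialspace=True)
india = df[df['Country'] == 'India'].copy()
# blank cells are read as NaN; treat them as zero casualties
india['Casualty'] = india['Killed'].fillna(0) + india['Wounded'].fillna(0)
# sum per city and take the five largest totals
top5 = india.groupby('City')['Casualty'].sum().nlargest(5)
for city, casualty in top5.items():
    print(city, int(casualty))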

How to load multiple excel files with multiple sheets in to one dataframe in python

We're trying to make an automatic program that can take multiple Excel files, with multiple sheets, from a folder and append them to one data frame.
Our problem is that we're not quite sure how to do this so that the process is as automatic as possible. And since the sheets vary in name, we can't specify any variable for them.
All of the files are *.xlsx, and the code has to load an arbitrary number of files.
We have tried different kinds of code, primarily using pandas, but we can't seem to append everything into one data frame.
import numpy as np
import pandas as pd
import glob

all_data = pd.DataFrame()
for f in glob.glob("*.xlsx"):
    df = pd.read_excel(f)
    all_data = all_data.append(df, ignore_index=True)

# now save the data frame
writer = pd.ExcelWriter('output.xlsx')
all_data.to_excel(writer)
writer.save()

sheet1 = xls.parse(0)
We expect to have one data frame with all data, such that we can use data and extract different features and make statistics.
The documentation of pandas.read_excel states:
sheet_name : str, int, list, or None, default 0
Strings are used for sheet names. Integers are used in zero-indexed sheet positions. Lists of strings/integers are used to request multiple sheets. Specify None to get all sheets.
Available cases:
Defaults to 0: 1st sheet as a DataFrame
1: 2nd sheet as a DataFrame
"Sheet1": Load sheet with name “Sheet1”
[0, 1, "Sheet5"]: Load first, second and sheet named “Sheet5” as a dict of DataFrame
None: All sheets.
I would suggest trying the last option: pd.read_excel(f, sheet_name=None). Otherwise you could create a loop and pass sheet indexes instead of the actual sheet names; that way you don't need prior knowledge of the .xlsx files.
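A minimal sketch combining both suggestions (sheet_name=None returns a dict mapping sheet names to DataFrames; pd.concat replaces DataFrame.append, which was removed in recent pandas versions):

import glob
import pandas as pd

frames = []
for f in glob.glob("*.xlsx"):
    # sheet_name=None loads every sheet of the file as {sheet_name: DataFrame}
    sheets = pd.read_excel(f, sheet_name=None)
    frames.extend(sheets.values())

# stack all sheets from all files into one DataFrame
all_data = pd.concat(frames, ignore_index=True)
all_data.to_excel('output.xlsx')

Note that this assumes every sheet shares the same columns; otherwise pd.concat will union the columns and fill the gaps with NaN.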

Pandas: Iterate through a list of DataFrames and export each to excel sheets

Trying to teach myself coding to automate some tedious tasks at work. I apologize for any unintentional ignorance.
I have created DataFrames in pandas (Python 3.x). I want to print each DataFrame to a different Excel sheet. Here is what I have for two DataFrames; it works perfectly, but I want to scale it to loop through a list of DataFrames so that I can make it a bit more dynamic.
writer = pandas.ExcelWriter("MyData.xlsx", engine='xlsxwriter')
Data.to_excel(writer, sheet_name="Data")
ByBrand.to_excel(writer, sheet_name="ByBrand")
writer.save()
Easy enough, but when there are 50+ sheets that need to be created it will get tedious.
Here is what I tried, it did not work:
writer = pandas.ExcelWriter("MyData.xlsx", engine='xlsxwriter')
List = [Data, ByBrand]
for i in List:
    i.to_excel(writer, sheet_name=i)
writer.save()
I think the issue is that the sheet_name field must be a string because as-is it creates an error. But if I put sheet_name= "i", it only creates one sheet called "i" with the data from Data, but doesn't iterate to ByBrand. Also, the excel file would be a nightmare if the sheets weren't named to their corresponding data frame, so please no suggestions for things like numbered sheets.
Thank you so much in advance, this website has been invaluable for my journey into coding.
-Stephen
It is easier to go from the string 'Data' to the value Data than the other way around. You can use locals()['Data'] to access the value associated to the variable whose string name is 'Data':
import pandas as pd

writer = pd.ExcelWriter("MyData.xlsx", engine='xlsxwriter')
seq = ['Data', 'ByBrand']
for name in seq:
    df = locals()[name]
    df.to_excel(writer, sheet_name=name)
writer.save()
locals() returns a dictionary containing the current scope's local variables (note that modifying this dictionary does not reliably change the variables themselves).
globals() returns a dictionary containing the current scope's global variables. (Thus, if Data and ByBrand are defined in the global namespace rather than the local namespace, use globals() instead of locals().)
Another option is to collect the DataFrames in a dict. Instead of creating a variable for each DataFrame, make one dict, and let the keys be the sheet names and the values be DataFrames:
import pandas as pd

dfs = dict()
dfs['Data'] = ...
dfs['ByBrand'] = ...

writer = pd.ExcelWriter("MyData.xlsx", engine='xlsxwriter')
for name, df in dfs.items():
    df.to_excel(writer, sheet_name=name)
writer.save()
I think this is preferable since it does not require introspection tools like locals() or globals(). This second approach just uses a dict the way dicts are intended to be used: mapping keys to values.
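As a usage sketch, if the 50+ DataFrames come from splitting one table (the Brand column and toy values here are hypothetical), the dict can be built in a single step:

import pandas as pd

data = pd.DataFrame({'Brand': ['A', 'A', 'B'], 'Sales': [1, 2, 3]})
# one DataFrame per brand, keyed by a string sheet name
dfs = {str(brand): group for brand, group in data.groupby('Brand')}

writer = pd.ExcelWriter("MyData.xlsx", engine='xlsxwriter')
for name, df in dfs.items():
    df.to_excel(writer, sheet_name=name)
writer.save()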
