I am trying to access data from a CSV using Python. I am able to access entire columns of data values; however, I also want to access rows, and use an indexed coordinate system, with (0, 1) being column 0, row 1. So far I have this:
#Lukas Robin
#25.07.2021
import csv
with open("sun_data.csv") as sun_data:
sunData = csv.reader(sun_data, delimiter=',')
global data
for data in sunData:
print(data)
I don't normally use data tables or CSV, so this is a new area for me.
As mentioned in the comment, you could make the jump to using pandas and spend a little time learning that. It would be a good investment of time if you plan to do much data analysis or work with data tables regularly.
If you just want to pull in a table of numbers and access it the way you describe, you are perfectly fine using the csv package. Below is an example...
If your .csv file has a header row, you can simply call next(sun_data) before starting the loop to advance the iterator and let that row fall on the floor...
import csv

f_in = 'data_table.csv'
data = []   # a container to hold the results

with open(f_in, 'r') as source:
    sun_data = csv.reader(source, delimiter=',')
    for row in sun_data:
        # convert the read-in values to float data types (or ints or ...)
        row = [float(t) for t in row]
        # append it to the data table
        data.append(row)

print(data[1][0])
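For instance, a minimal sketch of the header-skipping variant mentioned above (assuming data_table.csv has a single header row followed by numeric data):

import csv

f_in = 'data_table.csv'
data = []

with open(f_in, 'r') as source:
    sun_data = csv.reader(source, delimiter=',')
    next(sun_data)   # discard the header row
    for row in sun_data:
        data.append([float(t) for t in row])

print(data[1][0])   # row 1, column 0 of the numeric table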
Related
I have an .xlsx file with 5 columns (X, Y, Z, Row_Cog, Col_Cog), which will be in the same order each time. I would like to have each column as a variable in Python. I am using the method below, but would like to know if there is a better way to do it.
Also, I am writing the range manually (in the for loop), while I would like a robust way to know the length of each column in Excel (the number of rows) and assign it.
# READ THE TEST DATA from Excel file
import xlrd

workbook = xlrd.open_workbook(r"C:\Desktop\SawToothCalib\TestData.xlsx")
worksheet = workbook.sheet_by_index(0)

X_Test = []
Y_Test = []
Row_Test = []
Col_Test = []

for i in range(1, 29):
    x_val = worksheet.cell_value(i, 0)
    X_Test.append(x_val)
    y_val = worksheet.cell_value(i, 2)
    Y_Test.append(y_val)
    row_val = worksheet.cell_value(i, 3)
    Row_Test.append(row_val)
    col_val = worksheet.cell_value(i, 4)
    Col_Test.append(col_val)
Do you really need this package? You can easily do this kind of operation with pandas.
You can read your file as a DataFrame with:
import pandas as pd
df = pd.read_excel(path + 'file.xlsx', sheet_name=the_sheet_you_want)
and access the list of columns with df.columns. You can access each column with df['column name']. If there are empty entries, they are stored as NaN. You can count how many you have with df['column_name'].isnull().sum().
If you are uncomfortable with DataFrames, you can then convert the columns to lists or arrays, like
df['my_col'].tolist()
or
df['my_col'].to_numpy()
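For the case in the question, a minimal sketch (assuming the file path and column headers from the question):

import pandas as pd

df = pd.read_excel("TestData.xlsx", sheet_name=0)

# no hard-coded range needed: len(df) is the number of data rows
print(len(df))

X_Test = df['X'].tolist()
Y_Test = df['Y'].tolist()
Row_Test = df['Row_Cog'].tolist()
Col_Test = df['Col_Cog'].tolist()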
We're trying to make an automatic program that can take multiple Excel files, with multiple sheets, from a folder and append them to one data frame.
Our problem is that we're not quite sure how to do this so that the process is as automatic as possible. And since the sheet names vary, we can't hard-code them.
All of the files are *.xlsx, and the code has to load an arbitrary number of files.
We have tried different pieces of code, primarily using pandas, but we can't seem to append them into one data frame.
import numpy as np
import pandas as pd
import glob

all_data = pd.DataFrame()
for f in glob.glob("*.xlsx"):
    df = pd.read_excel(f)
    all_data = all_data.append(df, ignore_index=True)

# now save the data frame
writer = pd.ExcelWriter('output.xlsx')
all_data.to_excel(writer)
writer.save()
We also tried parsing individual sheets, e.g.:
sheet1 = xls.parse(0)
We expect to end up with one data frame with all the data, so that we can extract different features and compute statistics.
The documentation of pandas.read_excel states:
sheet_name : str, int, list, or None, default 0
Strings are used for sheet names. Integers are used in zero-indexed sheet positions. Lists of strings/integers are used to request multiple sheets. Specify None to get all sheets.
Available cases:
Defaults to 0: 1st sheet as a DataFrame
1: 2nd sheet as a DataFrame
"Sheet1": Load sheet with name "Sheet1"
[0, 1, "Sheet5"]: Load first, second and sheet named "Sheet5" as a dict of DataFrame
None: All sheets.
I would suggest trying the last option: pd.read_excel(f, sheet_name=None). Otherwise you might want to create a loop and pass indexes rather than the actual sheet names; this way you don't need prior knowledge of the .xlsx files.
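A minimal sketch of that approach (using pd.concat, since DataFrame.append has been deprecated in newer pandas versions):

import glob
import pandas as pd

frames = []
for f in glob.glob("*.xlsx"):
    # with sheet_name=None, read_excel returns a dict of sheet name -> DataFrame
    sheets = pd.read_excel(f, sheet_name=None)
    frames.extend(sheets.values())

all_data = pd.concat(frames, ignore_index=True)
all_data.to_excel('output.xlsx', index=False)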
I have a DataFrame with a column named 'UserNbr' and a column named 'Spclty', which is composed of elements like this:
[['104', '2010-01-31'], ['215', '2014-11-21'], ['352', '2016-07-13']]
where there can be 0 or more elements in the list.
Some UserNbr keys appear in multiple rows, and I wish to collapse each such group into a single row such that 'Spclty' contains all the unique dicts like those in the list shown above.
To save the overhead of appending to a DataFrame, I'm appending each output row to a list instead of to the DataFrame.
My code is working, but it's taking hours to run on 0.7M rows of input. (Actually, I've never been able to keep my laptop open long enough for it to finish executing.)
Is there a better way to aggregate into a structure like this, maybe using a library that provides more data-reshaping options, instead of looping over UserNbr? (In R, I'd use the data.table and dplyr libraries.)
import pandas as pd

# loop over all UserNbr:
#   consolidate specialty fields into dict-like sets (to remove redundant codes);
#   output one row per user to a new data frame
out_rows = list()
spcltycol = df_tmp.columns.get_loc('Spclty')
all_UserNbr = df_tmp['UserNbr'].unique()

for user in all_UserNbr:
    df_user = df_tmp.loc[df_tmp['UserNbr'] == user]
    if df_user.shape[0] > 0:
        open_combined = df_user.iloc[0, spcltycol]   # capture 1st row
        for row in range(1, df_user.shape[0]):       # union with any subsequent rows
            open_combined = open_combined.union(df_user.iloc[row, spcltycol])
        new_row = df_user.drop(['Spclty', 'StartDt'], axis=1).iloc[0].tolist()
        new_row.append(open_combined)
        out_rows.append(new_row)

# construct new dataframe with no redundant UserID rows:
df_out = pd.DataFrame(out_rows, columns=['UserNbr', 'Spclty'])

# convert Spclty sets to dicts:
df_out['Spclty'] = [dict(df_out['Spclty'][row]) for row in range(df_out.shape[0])]
The conversion to dict gets rid of specialties that are repeated between rows. In the output, a Spclty value should look like this:
{'104': '2010-01-31', '215': '2014-11-21', '352': '2016-07-13'}
except that there may be more key-value pairs than in any corresponding input row (resulting from aggregation over UserNbr).
I withdraw this question.
I had hoped there was an efficient way to use groupby with something else, but I haven't found any examples with a complex data structure like this one and have received no guidance.
For anyone who gets similarly stuck with very slow aggregation problems in Python, I suggest stepping up to PySpark. I am now tackling this problem with a Databricks notebook and am making headway with the pyspark.sql.window Window functions. (Now, it only takes minutes to run a test instead of hours!)
A partial solution is in the answer here:
PySpark list() in withColumn() only works once, then AssertionError: col should be Column
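For anyone attempting the pandas route anyway, here is a hedged groupby sketch of the aggregation described above, using a hypothetical miniature input (not tested at the 0.7M-row scale):

import pandas as pd

# hypothetical miniature input
df_tmp = pd.DataFrame({
    'UserNbr': [1, 1, 2],
    'Spclty': [[['104', '2010-01-31']],
               [['215', '2014-11-21']],
               [['352', '2016-07-13']]],
})

def merge_spclty(series):
    # union the pairs across all rows for one user, then convert to a dict
    combined = {}
    for lst in series:
        combined.update(dict(lst))
    return combined

df_out = df_tmp.groupby('UserNbr')['Spclty'].apply(merge_spclty).reset_index()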
I'm trying to build a data frame of time-series data. I have to retrieve the data from an API, and every (i, j) entry in the data frame (where "i" is the row and "j" is the column) has to be iterated through and filled individually.
Here's an idea of the type of thing I'm trying to do (note: the API I'm using doesn't have historical data for what I'm trying to analyze):
import pandas as pd
import numpy as np
import time

def retrievedata(label):
    # take the label string, do some stuff with the API,
    # and return a float (placeholder for the real call)
    ...

label_list = ['label1', 'label2', 'label3']   # etc...
discrete_points = 720

df = pd.DataFrame(index=np.arange(0, discrete_points), columns=label_list)
So at this point I've pre-allocated a data frame. What comes next is the issue.
Now, I want to iterate over it and assign values to every (i, j) entry in the data frame based on a function I wrote to pull data. Note that the function I wrote has to be specific to a certain column (as it takes the column label as input). And on top of that, each row will have different values because it is time-series data.
EDIT: Yuck, I found a gross way to make it work:
for row in range(discrete_points):
    for label in label_list:
        df.at[row, label] = retrievedata(label)
This is obviously a non-Pythonic, non-numpy, non-pandas way of doing things. So I'd like to find a nicer, more efficient way of doing this that is less computing-power intensive.
I'm assuming it's going to have to be some combination of: df.iterrows(), df.itertuples(), df.loc[], df.at[].
I'm stumped, though.
Any ideas?
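One possibility, sketched below, is to build each column in one go instead of assigning cell by cell (it still makes one retrievedata call per value, since the API presumably has to be hit for each one; it reuses label_list, discrete_points, and retrievedata from the snippet above):

import pandas as pd

data = {label: [retrievedata(label) for _ in range(discrete_points)]
        for label in label_list}
df = pd.DataFrame(data)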
Trying to teach myself coding to automate some tedious tasks at work. I apologize for any unintentional ignorance.
I have created data frames in pandas (Python 3.x). I want to print each data frame to a different Excel sheet. Here is what I have for 2 data frames; it works perfectly, but I want to scale it to loop through a list of data frames so that I can make it a bit more dynamic.
writer = pandas.ExcelWriter("MyData.xlsx", engine='xlsxwriter')
Data.to_excel(writer, sheet_name="Data")
ByBrand.to_excel(writer, sheet_name="ByBrand")
writer.save()
Easy enough, but when there are 50+ sheets that need to be created it will get tedious.
Here is what I tried, it did not work:
writer = pandas.ExcelWriter("MyData.xlsx", engine='xlsxwriter')

List = [Data, ByBrand]
for i in List:
    i.to_excel(writer, sheet_name=i)

writer.save()
I think the issue is that the sheet_name argument must be a string, because as-is it raises an error. But if I put sheet_name="i", it only creates one sheet called "i" with the data from Data, and doesn't iterate to ByBrand. Also, the Excel file would be a nightmare if the sheets weren't named after their corresponding data frames, so please no suggestions for things like numbered sheets.
Thank you so much in advance, this website has been invaluable for my journey into coding.
-Stephen
It is easier to go from the string 'Data' to the value Data than the other way around. You can use locals()['Data'] to access the value associated with the variable whose string name is 'Data':
import pandas as pd

writer = pd.ExcelWriter("MyData.xlsx", engine='xlsxwriter')

seq = ['Data', 'ByBrand']
for name in seq:
    df = locals()[name]
    df.to_excel(writer, sheet_name=name)

writer.save()
locals() returns a dictionary containing the current scope's local variables (treat it as read-only; modifying it does not reliably change the underlying variables).
globals() returns a dictionary containing the current scope's global variables. (Thus, if Data and ByBrand are defined in the global namespace rather than the local namespace, use globals() instead of locals().)
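As a quick illustration of the lookup itself (a sketch, assuming Data is defined at module level):

Data = ...   # some DataFrame defined at module level

# look the variable up by its string name:
same_object = globals()['Data']
assert same_object is Data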
Another option is to collect the DataFrames in a dict. Instead of creating a variable for each DataFrame, make one dict, and let the keys be the sheet names and the values be DataFrames:
import pandas as pd

dfs = dict()
dfs['Data'] = ...
dfs['ByBrand'] = ...

writer = pd.ExcelWriter("MyData.xlsx", engine='xlsxwriter')
for name, df in dfs.items():
    df.to_excel(writer, sheet_name=name)
writer.save()
I think this is preferable since it does not require introspection tools like locals() or globals(). This second approach just uses a dict the way dicts are intended to be used: mapping keys to values.