Live updating graph from an increasing number of csv files - python-3.x

I need to analyse some spectral data in real-time and plot it as a self-updating graph.
The program I use outputs a text file every two seconds.
Usually I do the analysis after gathering the data, and the code works just fine. I create a dataframe in which each csv file represents a column. The problem is that with several thousand csv files the import becomes very slow, and creating a dataframe out of all the csv files usually takes more than half an hour.
Below is the code for creating the dataframe from multiple csv files.
''' import, append and concat files into one dataframe '''
all_files = glob.glob(os.path.join(path, filter + "*.txt"))  # path to the files by joining path and file name
all_files.sort(key=os.path.getmtime)
data_frame = []
name = []
for file in all_files:
    creation_time = os.path.getmtime(file)
    readible_date = datetime.fromtimestamp(creation_time)
    df = pd.read_csv(file, index_col=0, header=None, sep='\t', engine='python', decimal=",", skiprows=15)
    df.rename(columns={1: readible_date}, inplace=True)
    data_frame.append(df)
full_spectra = pd.concat(data_frame, axis=1)
for column in full_spectra.columns:
    time_step = column - full_spectra.columns[0]
    minutes = time_step.total_seconds() / 60
    name.append(minutes)
full_spectra.columns = name
return full_spectra
The solution I thought of was to use the watchdog module: every time a new text file is created, it gets appended as a new column to the existing dataframe, and the updated dataframe is plotted. That way I would not need to loop over all the csv files every time.
I found a very nice example of how to use watchdog here
My problem is that I could not find a way to read the new file after watchdog detects it and append it to the existing dataframe.
A minimal example should look something like this:
def latest_filename():
    """a function that checks within a directory for new text files"""
    return filename

df = pd.DataFrame()  # create a dataframe
newdata = pd.read_csv(latest_filename())  # the new file is found by watchdog
df["newcolumn"] = newdata["desiredcolumn"]  # append the new data as a column
df.plot()  # plot the data
The plotting part should be easy; my plan was to adapt the code presented here. I am more concerned with the self-updating dataframe.
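A rough, untested sketch of the direction I have in mind is below (the handler class, the global dataframe and the plotting call are my own assumptions; the read_csv settings are the ones from the code above):

import os
from datetime import datetime

import pandas as pd
from watchdog.observers import Observer
from watchdog.events import PatternMatchingEventHandler

full_spectra = pd.DataFrame()  # grows by one column per incoming file

class NewSpectrumHandler(PatternMatchingEventHandler):
    """Append every newly created *.txt file as a new column."""
    def on_created(self, event):
        global full_spectra
        creation_time = os.path.getmtime(event.src_path)
        readible_date = datetime.fromtimestamp(creation_time)
        # same read settings as in the batch import above
        df = pd.read_csv(event.src_path, index_col=0, header=None, sep='\t',
                         engine='python', decimal=",", skiprows=15)
        df.rename(columns={1: readible_date}, inplace=True)
        full_spectra = pd.concat([full_spectra, df], axis=1)
        full_spectra.plot()  # or update an existing matplotlib figure here

observer = Observer()
observer.schedule(NewSpectrumHandler(patterns=["*.txt"]), path, recursive=False)  # path as above
observer.start()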
I appreciate any help or other solutions that would solve my issue!

Related

DataFrames getting distorted while unpacking list of dataframes in Python Streamlit, with each dataframe looking different from the other

I am working on Streamlit in Python.
I have created a function which allows the user to upload multiple files, reads them one by one, and stores each in a dataframe. I want this function to return all those dataframes to a separate .py file, so I appended all the dataframes to a list, i.e. a list of dataframes, and here my problem started!
The list named 'all_df_list' has all the dataframes in it.
But when I unpack this list in the other Python script and look at the dataframes there, the result is distorted, i.e. the dataframes do not keep the shape and look they had before being appended to the list.
Here is the code snippet:
import streamlit as st
import pandas as pd

def user_selection(action):
    operation_mode = action
    mydf = pd.DataFrame()
    with st.sidebar:
        if operation_mode == 'Multiple_files':
            uploaded_files = st.file_uploader("Choose required files", accept_multiple_files=True)
            all_df_list = []
            for uploaded_file in uploaded_files:
                mydf = mydf.iloc[0:0]
                if uploaded_file.name is not None:
                    # For now, assuming the user will only upload multiple csv files
                    mydf = pd.read_csv(uploaded_file)
                    all_df_list.append(mydf)
            st.write(all_df_list)
            return all_df_list
Now, let's say I call this function in another script and unpack the list 'all_df_list' as:
*all_dfs = user_selection('Multiple_files')
st.write(all_dfs[0]) # the dataframe all_dfs[0] has lost its shape and original orientation.
How can I successfully return multiple dataframes from a function, such that when I unpack them I see the dataframes exactly as they were before they were appended to the list?
Any lead on this will help. Thank you.
Just like I wrote in the comment section, the issue looks to me like irrelevant lines of code and inappropriate code construction rather than an undesired format of the dataframes returned by user_selection.
If you are only looking to return a list of dfs:
def user_selection(action):
    operation_mode = action
    mydf = pd.DataFrame()
    all_df_list = []
    with st.sidebar:
        if operation_mode == 'Multiple_files':
            uploaded_files = st.file_uploader("Choose required files", accept_multiple_files=True)
            if uploaded_files:
                for uploaded_file in uploaded_files:
                    mydf = pd.read_csv(uploaded_file)
                    all_df_list.append(mydf)
                st.write(all_df_list)
    return all_df_list
Call this function in another script:
all_dfs = user_selection('Multiple_files')
if all_dfs:
    st.write(all_dfs[0])
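If you want to display every uploaded file rather than only the first one, a small loop over the returned list works as well (just a sketch of the usage):

all_dfs = user_selection('Multiple_files')
for i, df in enumerate(all_dfs):
    st.write(f"File {i + 1}, shape {df.shape}")  # quick check that each dataframe kept its shape
    st.write(df)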

Split CSV File into two files keeping header in both files

I am trying to split a large CSV file into two files. I am using the code below:
import pandas as pd

# csv file name to be read in
in_csv = 'Master_file.csv'

# get the number of lines of the csv file to be read
number_lines = sum(1 for row in open(in_csv))

# size of rows of data to write to the csv,
# you can change the row size according to your need
rowsize = 600000

# start looping through data writing it to a new file for each set
for i in range(0, number_lines, rowsize):
    df = pd.read_csv(in_csv,
                     nrows=rowsize,  # number of rows to read at each loop
                     skiprows=i)     # skip rows that have been read
    # csv to write data to a new file with indexed name. input_1.csv etc.
    out_csv = 'File_Number' + str(i) + '.csv'
    df.to_csv(out_csv,
              index=False,
              header=True,
              mode='a',  # append data to csv file
              chunksize=rowsize)  # size of data to append for each loop
It is splitting the file, but the header is missing in the second file. How can I fix it?
.read_csv() returns an iterator when used with chunksize and then keeps track of the header. The following is an example. This should be much faster, since the original code above reads the entire file to count the lines and then re-reads all previous lines in each chunk iteration, whereas the code below reads through the file only once:
import pandas as pd

with pd.read_csv('Master_file.csv', chunksize=60000) as reader:
    for i, chunk in enumerate(reader):
        chunk.to_csv(f'File_Number{i}.csv', index=False, header=True)
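If you really need exactly two output files, as in the title, and the file fits into memory, a simpler variant (a sketch, not tested against your data) is to read it once and write the two halves; each to_csv call writes its own header:

import math
import pandas as pd

df = pd.read_csv('Master_file.csv')
half = math.ceil(len(df) / 2)
df.iloc[:half].to_csv('File_Number0.csv', index=False)  # header is written by default
df.iloc[half:].to_csv('File_Number1.csv', index=False)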

Read excel files and append to make one data frame in Databricks from azure data lake without specific file names

I am storing Excel files in Azure Data Lake (gen 1). The filenames follow the same pattern, "2021-06-18T09_00_07ONR_Usage_Dataset", "2021-06-18T09_00_07DSS_Usage_Dataset", etc., depending on the date and time. I want to read all the files in the folder located in Azure Data Lake into Databricks without having to name a specific file, so that in the future new files are read and appended to make one big dataset. The files all have the same schema, the columns are in the same order, etc.
So far I have tried for loops with regex expressions:
path = dbutils.fs.ls('/mnt/adls/40_project/UBC/WIP/Mercury/UUR_PS_raw_temp/')
for fi in path:
    print(fi)
    read = spark.read.format("com.crealytics.spark.excel").option("header", "True").option("inferSchema", "true").option("dataAddress", "'Usage Dataset'!A2").load(fi.path)
    display(read)
    print(read.count())
The output prints all the paths and counts each dataset that is being read, but it only displays the last one. I understand this is because I'm not storing or appending the results in the for loop, but when I add the append it breaks.
appended_data = []
path = dbutils.fs.ls('/mnt/adls/40_project/UBC/WIP/Mercury/UUR_PS_raw_temp/')
for fi in path:
    print(fi)
    read = spark.read.format("com.crealytics.spark.excel").option("header", "True").option("inferSchema", "true").option("dataAddress", "'Usage Dataset'!A2").load(fi.path)
    display(read)
    print(read.count())
    appended_data.append(read)
But I get this error:
FileInfo(path='dbfs:/mnt/adls/40_project/UBC/WIP/Mercury/UUR_PS_raw_temp/Initialization_DSS.xlsx', name='Initialization_DSS.xlsx', size=39781)
TypeError: not supported type: <class 'py4j.java_gateway.JavaObject'>
The final way I tried:
li = []
for f in glob.glob('/mnt/adls/40_project/UBC/WIP/Mercury/UUR_PS_raw_temp/*_Usage_Dataset.xlsx'):
    df = pd.read_xlsx(f)
    li.append(df)
frame = pd.concat(li, axis=0, ignore_index=True)
This says that there are no objects to concatenate. I have been researching everywhere and trying everything. Please help.
If you want to use pandas to read an Excel file in Databricks, the path should be like /dbfs/mnt/....
For example
import os
import glob
import pandas as pd

li = []
os.chdir(r'/dbfs/mnt/<mount-name>/<>')
allFiles = glob.glob("*.xlsx")  # match your xlsx files
for file in allFiles:
    df = pd.read_excel(file)
    li.append(df)
frame = pd.concat(li, axis=0, ignore_index=True)
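Alternatively, if you would rather stay in Spark instead of pandas, the loop you already have only needs a union at the end; a sketch (assuming all files really share the same schema, as you say):

from functools import reduce
from pyspark.sql import DataFrame

frames = []
for fi in dbutils.fs.ls('/mnt/adls/40_project/UBC/WIP/Mercury/UUR_PS_raw_temp/'):
    frames.append(
        spark.read.format("com.crealytics.spark.excel")
             .option("header", "True")
             .option("inferSchema", "true")
             .option("dataAddress", "'Usage Dataset'!A2")
             .load(fi.path)
    )

# stack all per-file DataFrames into one big dataset
full = reduce(DataFrame.unionByName, frames)
display(full)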

Issue when exporting dataframe to csv

I'm working on a mechanical engineering project. For the following code, the user enters the number of cylinders that their compressor has. A dataframe is then created with the correct number of columns and is exported to Excel as a CSV file.
The outputted dataframe looks exactly like I want it to as shown in the first link, but when opened in Excel it looks like the image in the second link:
1.my dataframe
2.Excel Table
Why is my dataframe not exporting properly to Excel and what can I do to get the same dataframe in Excel?
import pandas as pd

CylinderNo = int(input('Enter CylinderNo: '))
new_number = CylinderNo * 3

list1 = []
for i in range(1, CylinderNo + 1):
    for j in range(0, 3):
        Cylinder_name = str('CylinderNo ') + str(i)
        list1.append(Cylinder_name)
df = pd.DataFrame(list1, columns=['Kurbel/Zylinder'])

list2 = ['Triebwerk', 'Packung', 'Ventile'] * CylinderNo
Bauteil = {'Bauteil': list2}
df2 = pd.DataFrame(Bauteil, columns=['Bauteil'])
new = pd.concat([df, df2], axis=1)

list3 = ['Nan', 'Nan', 'Nan'] * CylinderNo
Bewertung = {'Bewertung': list3}
df3 = pd.DataFrame(Bewertung, columns=['Bewertung'])
new2 = pd.concat([new, df3], axis=1)

Empfehlung = {'Empfehlung': list3}
df4 = pd.DataFrame(Empfehlung, columns=['Empfehlung'])
new3 = pd.concat([new2, df4], axis=1)

new3.set_index('Kurbel/Zylinder')
new3 = new3.set_index('Kurbel/Zylinder', append=True).swaplevel(0, 1)

# export dataframe to csv
new3.to_csv('new3.csv')
To be clear, a comma-separated values (CSV) file is not an Excel format type or table. It is a delimited text file that Excel, like other applications, can open.
What you are comparing is simply presentation. Both data frames are exactly the same. For MultiIndex data frames, Pandas print output does not repeat index values, for readability on the console or in an IDE like Jupyter. But such values are not removed from the underlying data frame, only from its presentation. If you re-order the indexes, you will see this presentation change. The full, complete data frame is what is exported to CSV. And ideally, for data integrity, you want the full data set exported with to_csv to be importable back into Pandas with read_csv (which can set indexes) or into other languages and applications.
Essentially, CSV is an industry format for storing and transferring data. Consider using Excel spreadsheets, HTML, markdown, or other reporting formats for your presentation needs; to_csv may not be the best method. You can try to build the text file manually with Python I/O write methods (with open('new.csv', 'w') as f), but that would be an extensive workaround. See also @Jeff's answer here, but do note that the latter part of that solution does remove data.
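For example, if the goal is an Excel report that keeps the merged-looking index cells, exporting with to_excel preserves the MultiIndex presentation, and read_csv can restore the indexes from the CSV file (a small sketch; the file names are just placeholders):

# Excel output: MultiIndex levels are written as merged cells by default
new3.to_excel('new3.xlsx')

# CSV round-trip: the data is complete, the index levels simply become regular columns
restored = pd.read_csv('new3.csv', index_col=[0, 1])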

Transfer cell values from different columns and sheets from multiple excel files with same structure into a single dataframe

I have a reporting sheet in excel that contains a set of datapoints that I want to compile from multiple files with the same format into a master dataset.
The initial step I undertook was to extract the data points I need from multiple sheets into one pandas dataframe. See the steps below.
I initially imported the Excel file and parsed it:
import pandas as pd
xl = pd.ExcelFile(r"C:\Users\Nicola\Desktop\ISP 2016-20 Ops-Technical Form.xlsm")
df = xl.parse("FSL, WASH, DRM") #name of sheet #1
Then I located the data points needed for synthesis
a=df.iloc[5:20,3:5]
a1=df.iloc[6:9,10:12]
b=df.iloc[31:35,3:5]
b1=df.iloc[31:35,10:12]
Then I concatenated and equalised the column positions to keep the whole list of values within the same columns:
dfcon=pd.concat(([a,b]))
dfcon2=pd.concat(([a1,b1]))
new_cols = {x: y for x, y in zip(dfcon.columns, dfcon2.columns)}
dfcont2=dfcon2.append(dfcon.rename(columns=new_cols))
And lastly I created a dataframe with the string of values I need:
master=pd.DataFrame(dfcont2)
finalmaster=master.transpose()
The next two steps I wish to pursue are:
1) Replicate the same code for 50 excel files
2) Compile all the strings of values from this set of Excel files into one single pandas dataframe, without running this code over again and compiling manually by exporting to Excel.
Any support would be greatly appreciated. Thanks
I believe you need to loop over the file names created by glob and concat everything together at the end (all files have the same structure):
import glob

dfs = []
for f in glob.glob('*.xlsm'):
    df = pd.read_excel(io=f, sheet_name=1)

    a = df.iloc[5:20, 3:5]
    a1 = df.iloc[6:9, 10:12]
    b = df.iloc[31:35, 3:5]
    b1 = df.iloc[31:35, 10:12]

    dfcon = pd.concat([a, b])
    dfcon2 = pd.concat([a1, b1])
    new_cols = {x: y for x, y in zip(dfcon.columns, dfcon2.columns)}
    dfcont2 = dfcon2.append(dfcon.rename(columns=new_cols))

    dfs.append(dfcont2.T)

out = pd.concat(dfs, ignore_index=True)
Found the solution that works for me, thank you for the input, jezrael.
To further explain:
1) Imported the files with the same structure from my Desktop directory, then parsed and selected the Excel sheet from which data can be extracted from different locations (iloc):
import glob

dfs = []
for f in glob.glob('C:/Users/Nicola/Desktop/OPS Form/*.xlsm'):
    xl = pd.ExcelFile(f)
    df = xl.parse("FSL, WASH, DRM")

    a = df.iloc[5:20, 3:5]
    a1 = df.iloc[7:9, 10:12]
    b = df.iloc[31:35, 3:5]
    b1 = df.iloc[31:35, 10:12]
    c = df.iloc[50:56, 3:5]
    c1 = df.iloc[38:39, 10:12]
    d = df.iloc[57:61, 3:5]
    e = df.iloc[63:71, 3:5]
2) Concatenated and repositioned the column order to compose the first version of the dataframe (output), still inside the loop above:
    dfcon = pd.concat([a, b, c, d, e])
    dfcon2 = pd.concat([a1, b1, c1])
    new_cols = {x: y for x, y in zip(dfcon.columns, dfcon2.columns)}
    dfcont2 = dfcon2.append(dfcon.rename(columns=new_cols))
    dfs.append(dfcont2.T)
3) The output presented the same string of values twice [the label and the form-specific entry], because of the repeated data pull-outs linked to the iloc locations:
output = pd.concat(dfs, ignore_index=True)
4) This last snippet simply allowed me to extract the labels only once and to select all the odd-numbered rows. With the last concatenation, I generated the dataframe I sought, ready to be processed analytically:
a=output[2:3]
b=output[1::2]
pd.concat([a,b], axis=0, ignore_index=True)
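One optional addition (not part of the code above, just an idea) would be to tag each transposed frame with the workbook it came from before concatenating, so the master dataframe keeps track of its sources:

import os

# inside the loop, instead of dfs.append(dfcont2.T):
dfs.append(dfcont2.T.assign(source_file=os.path.basename(f)))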
