IllegalArgumentException: transpose requires all collections have the same size - apache-spark

I have parquet files under different folders in ADLS and I want to merge them into one.
The folder structure / path is as shown in the code below.
I ran the code below to do this, but I am getting this error:
"IllegalArgumentException: transpose requires all collections have the same size"
files = dbutils.fs.ls('abfss://udl-container#container-name.dfs.core.windows.net/UserData/folder1/folder2/folder3/')
combined_df = None
for fi in files:
    df = spark.read.parquet(fi.path)
    if combined_df == None:
        combined_df = df
    else:
        combined_df = combined_df.union(df)
Can anyone help resolve this error?
Any help would be appreciated!
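Not a confirmed fix, but two alternatives worth sketching, assuming the entries under that path all hold parquet files with compatible schemas: let Spark discover the files itself, or union the per-file frames by column name so that differing column order does not break the merge (unionByName with allowMissingColumns needs Spark 3.1+, recursiveFileLookup needs Spark 3.0+):
from functools import reduce

base = 'abfss://udl-container#container-name.dfs.core.windows.net/UserData/folder1/folder2/folder3/'

# Option 1: let Spark read every parquet file under the folder, including nested subfolders
combined_df = spark.read.option("recursiveFileLookup", "true").parquet(base)

# Option 2: read per entry and union by column name instead of by position
dfs = [spark.read.parquet(fi.path) for fi in dbutils.fs.ls(base)]
combined_df = reduce(lambda a, b: a.unionByName(b, allowMissingColumns=True), dfs)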

Related

DataFrames getting distorted while unpacking list of dataframes in Python Streamlit, with each dataframe looking different from the other

I am working on Streamlit in Python.
I have created a function which allows the user to upload multiple files, reads those files one by one and stores each of them in a dataframe. I want this function to return all those dataframes to a separate .py file, so I appended all the dataframes to a list, i.e. a list of dataframes, and here my problem started!
The list named 'all_df_list' has all the dataframes in it.
But when I unpack this list in the other Python script and look at the dataframes there, the result is distorted, i.e. the dataframes do not hold the shape and look they had before being appended to the list.
Here is the code snippet:
import streamlit as st
import pandas as pd

def user_selection(action):
    operation_mode = action
    mydf = pd.DataFrame()
    with st.sidebar:
        if operation_mode == 'Multiple_files':
            uploaded_files = st.file_uploader("Choose required files", accept_multiple_files = True)
            all_df_list = []
            for uploaded_file in uploaded_files:
                mydf = mydf.iloc[0:0]
                if uploaded_file.name is not None:
                    # For now, assuming the user will only upload multiple csv files
                    mydf = pd.read_csv(uploaded_file)
                    all_df_list.append(mydf)
            st.write(all_df_list)
    return all_df_list
Now, let's say I call this function in another script and unpack the list 'all_df_list' as:
*all_dfs = user_selection('Multiple_files')
st.write(all_dfs[0]) # the dataframe all_dfs[0] has lost its shape and original orientation.
How can I return multiple dataframes from a function so that, when I unpack them, I see the dataframes exactly as they were before being appended to the list?
Any lead on this will help. Thank you.
As I wrote in the comment section, this looks more like a case of irrelevant lines of code and awkward code construction than of user_selection returning dataframes in an undesired format.
If you are only looking to return a list of dfs:
def user_selection(action):
    operation_mode = action
    mydf = pd.DataFrame()
    all_df_list = []
    with st.sidebar:
        if operation_mode == 'Multiple_files':
            uploaded_files = st.file_uploader("Choose required files", accept_multiple_files=True)
            if uploaded_files:
                for uploaded_file in uploaded_files:
                    mydf = pd.read_csv(uploaded_file)
                    all_df_list.append(mydf)
                st.write(all_df_list)
    return all_df_list
Call this function in another script:
all_dfs = user_selection('Multiple_files')
if all_dfs:
    st.write(all_dfs[0])
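A side note that is my own observation rather than part of the answer above: the starred unpacking in the question is worth double-checking, because a starred assignment target is only valid inside a tuple or list, and a plain assignment is all that is needed when the function already returns a list:
# plain assignment: all_dfs is exactly the list the function returned
all_dfs = user_selection('Multiple_files')

# starred unpacking needs a tuple target, e.g. a leading name before the starred one
first_df, *rest_dfs = user_selection('Multiple_files')

# "*all_dfs = user_selection('Multiple_files')" on its own raises a SyntaxError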

Read excel files and append to make one data frame in Databricks from azure data lake without specific file names

I am storing excel files in Azure Data Lake (gen 1). The filenames follow the same pattern, e.g. "2021-06-18T09_00_07ONR_Usage_Dataset", "2021-06-18T09_00_07DSS_Usage_Dataset", etc., depending on the date and time. I want to read all the files in the folder located in Azure Data Lake into Databricks without having to name each specific file, so that in the future new files are read and appended to make one big dataset. The files all have the same schema, columns in the same order, etc.
So far I have tried for loops with regex expressions:
path = dbutils.fs.ls('/mnt/adls/40_project/UBC/WIP/Mercury/UUR_PS_raw_temp/')
for fi in path:
    print(fi)
    read = spark.read.format("com.crealytics.spark.excel").option("header", "True").option("inferSchema", "true").option("dataAddress", "'Usage Dataset'!A2").load(fi.path)
    display(read)
    print(read.count())
The output prints all the paths and the count of each dataset being read, but only the last one is displayed. I understand that is because I'm not storing or appending them in the for loop, but when I add append it breaks.
appended_data = []
path = dbutils.fs.ls('/mnt/adls/40_project/UBC/WIP/Mercury/UUR_PS_raw_temp/')
for fi in path:
    print(fi)
    read = spark.read.format("com.crealytics.spark.excel").option("header", "True").option("inferSchema", "true").option("dataAddress", "'Usage Dataset'!A2").load(fi.path)
    display(read)
    print(read.count())
    appended_data.append(read)
But I get this error:
FileInfo(path='dbfs:/mnt/adls/40_project/UBC/WIP/Mercury/UUR_PS_raw_temp/Initialization_DSS.xlsx', name='Initialization_DSS.xlsx', size=39781)
TypeError: not supported type: <class 'py4j.java_gateway.JavaObject'>
The final way I tried:
li = []
for f in glob.glob('/mnt/adls/40_project/UBC/WIP/Mercury/UUR_PS_raw_temp/*_Usage_Dataset.xlsx'):
    df = pd.read_xlsx(f)
    li.append(df)
frame = pd.concat(li, axis=0, ignore_index=True)
This says there are no objects to concatenate. I have been researching everywhere and trying everything. Please help.
If you want to use pandas to read Excel files in Databricks, the path should be of the form /dbfs/mnt/....
For example
import os
import glob
import pandas as pd

li = []
os.chdir(r'/dbfs/mnt/<mount-name>/<>')
allFiles = glob.glob("*.xlsx") # match your xlsx files
for file in allFiles:
    df = pd.read_excel(file)
    li.append(df)
frame = pd.concat(li, axis=0, ignore_index=True)
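If a Spark DataFrame is ultimately needed (an assumption on my part, since the question mixes Spark and pandas attempts), the concatenated pandas frame can be handed back to Spark afterwards:
frame_spark = spark.createDataFrame(frame)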

Load tables from excel sheets with both multilevel column data frames and single level column data frames

I have an Excel workbook with tables in multiple worksheets.
I want to load all the tables from the workbook into pandas dataframes using the pandas.read_excel() function.
The problem is that a few tables have single-level columns and a few tables have multilevel columns.
I wrote the following code to handle this issue. It works fine, but I want to know if there is a more standard, pythonic way of handling it.
xl = r"D:\\xl_tables.xlsx"
f = pd.ExcelFile(xl)
f.sheet_names = ["multilevel_column", "single_level_column"]
dfl = []
for i in f.sheet_names:
    try:
        df = pd.read_excel(xl, sheet_name=i, header=[0,1])
        df.iloc[0,1]
        dfl.append(df)
    except:
        df = pd.read_excel(xl, sheet_name=i)
        df.iloc[0,1]
        dfl.append(df)
Your approach seems good, but avoid catching all exceptions; you can probably replace the bare except with except ValueError.
Also, use ExcelFile rather than reopening the file multiple times:
dfl = []
with pd.ExcelFile("sample.xlsx") as reader:
    for sheet_name in reader.book.sheetnames:
        df = reader.parse(sheet_name, header=None)
        if pd.isnull(df.iloc[1, 0]):
            idx = pd.MultiIndex.from_arrays(df.iloc[:2].values)
            i = 2
        else:
            idx = pd.Index(*df.iloc[:1].values)
            i = 1
        dfl.append(pd.DataFrame(df[i:].values, columns=idx))
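To spell out the idea behind this version: each sheet is read with header=None, and the first cell of the second row is used to sniff the header shape. If that cell is blank, the sheet is assumed to carry a two-row header and a MultiIndex is built from the first two rows; otherwise a single-level Index is built from the first row. The header rows are then dropped and the remaining values are rebuilt into a dataframe with that index as its columns.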

Live updating graph from increasing amount of csv files

I need to analyse some spectral data in real-time and plot it as a self-updating graph.
The program I use outputs a text file every two seconds.
Usually I do the analysis after gathering the data, and the code works just fine. I create a dataframe where each csv file represents a column. The problem is that with several thousand csv files the import becomes very slow, and creating a dataframe out of all the csv files usually takes more than half an hour.
Below the code for creating the dataframe from multiple csv files.
''' import, append and concat files into one dataframe '''
all_files = glob.glob(os.path.join(path, filter + "*.txt")) # path to the files by joining path and file name
all_files.sort(key=os.path.getmtime)

data_frame = []
name = []

for file in all_files:
    creation_time = os.path.getmtime(file)
    readible_date = datetime.fromtimestamp(creation_time)
    df = pd.read_csv(file, index_col=0, header=None, sep='\t', engine='python', decimal=",", skiprows=15)
    df.rename(columns={1: readible_date}, inplace=True)
    data_frame.append(df)

full_spectra = pd.concat(data_frame, axis=1)

for column in full_spectra.columns:
    time_step = column - full_spectra.columns[0]
    minutes = time_step.total_seconds()/60
    name.append(minutes)

full_spectra.columns = name
return full_spectra
The solution I thought of was to use the watchdog module: every time a new text file is created, it gets appended as a new column to the existing dataframe and the updated dataframe is plotted. That way I would not need to loop over all the csv files every time.
I found a very nice example on how to use watchdog here.
My problem is that I could not find out how, after watchdog detects the new file, to read it and append it to the existing dataframe.
A minimalistic example code should look something like this:
def latest_filename():
"""a function that checks within a directoy for new textfiles"""
return(filename)
df = pd.DataFrame() #create a dataframe
newdata = pd.read_csv(latest_filename) #The new file is found by watchdog
df["newcolumn"] = newdata["desiredcolumn"] #append the new data as column
df.plot() #plot the data
The plotting part should be easy and my thoughts were to adapt the code presented here. I am more concerned with the self-updating dataframe.
I appreciate any help or other solutions that would solve my issue!
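Not an answer from the thread, just a rough sketch of how the watchdog piece could be wired up, assuming the same read settings as the batch import above and a dataframe that is shared with the plotting code; one caveat is that on_created can fire before the instrument software has finished writing the file, so a short delay or retry may be needed:
import time
import pandas as pd
from watchdog.observers import Observer
from watchdog.events import PatternMatchingEventHandler

class SpectrumHandler(PatternMatchingEventHandler):
    """Append every newly created text file as a new column of a shared dataframe."""
    def __init__(self, spectra):
        super().__init__(patterns=["*.txt"], ignore_directories=True)
        self.spectra = spectra  # the growing spectra dataframe

    def on_created(self, event):
        # read with the same settings as the batch import above (assumed)
        new_col = pd.read_csv(event.src_path, index_col=0, header=None,
                              sep='\t', decimal=",", skiprows=15)
        # column 1 holds the intensities; the file path becomes the column name
        self.spectra[event.src_path] = new_col[1]

full_spectra = pd.DataFrame()
observer = Observer()
observer.schedule(SpectrumHandler(full_spectra), path, recursive=False)  # path = watched folder
observer.start()
try:
    while True:
        time.sleep(2)  # refresh the plot here from full_spectra
finally:
    observer.stop()
    observer.join()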

Extract excel data from multiple folders to a DataFrame and extract the folder names for each rows to the Dataframe

I am trying to extract the data from xlsx files in multiple folders and also to get the folder name for each row, to identify where the data was extracted from. I am able to extract the data from all the folders, however I am unable to get the folder names into the dataframe. Please help.
Folder structure -
Month-Year - 2020-02
Day folder - 2020-02-01
The day folders contain the xlsx files.
paths = []
arr = []
for root, dirs, files in os.walk(Full_Path):
    for file in files:
        if file.endswith(".xlsx"):
            #print(os.path.join(root, file))
            ab = os.path.join(root, file)
            print(ab)
            arr.append(paths)
            paths.append(ab)

for lm in arr:
    print(lm)

all_data = pd.DataFrame()
for f in paths:
    df = pd.read_excel(f, sheet_name='In Queue', usecols=fields)
    df['Date'] = lm
    all_data = all_data.append(df, ignore_index=True)
I have also tried different approaches but am not getting the desired output.
Consider building a list of data frames to be concatenated once outside the loop. Also, use assign to generate the Date column during the loop:
df_list = []
for root, dirs, files in os.walk(Full_Path):
    for file in files:
        if file.endswith((".xls", ".xlsx", ".xlsb", ".xlsm", ".odf")):
            xl_file = os.path.join(root, file)
            df = (pd.read_excel(xl_file, sheet_name='In Queue', usecols=fields)
                    .assign(Date=xl_file))
            df_list.append(df)

final_df = pd.concat(df_list, ignore_index=True)
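Since the question asks for the folder name per row rather than the full file path, one further tweak (my suggestion, not part of the answer above) would be to assign the containing folder's name instead:
            .assign(Date=os.path.basename(root))  # e.g. '2020-02-01', the day folder
os.path.basename(root) returns the name of the directory the file was found in, which matches the day-folder naming described in the question.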
