How to write summary of spark sql dataframe to excel file - apache-spark

I have a very large DataFrame with 8000 columns and 50000 rows.
I want to write its summary statistics to an Excel file.
I think the describe() method can produce them, but how do I write the result to Excel in a readable format? Thanks.

describe() returns a PySpark DataFrame. The easiest way to get it into a format Excel can read is to convert it to a pandas DataFrame and write that out as a CSV file:
import pandas
df.describe().toPandas().to_csv('fileOutput.csv')
If you want a native Excel file, write an .xlsx workbook instead:
import pandas
df.describe().toPandas().to_excel('fileOutput.xlsx', sheet_name='Sheet1', index=False)
Note: writing .xlsx requires the openpyxl package (pip install openpyxl on the command line). The older .xls format is limited to 256 columns, so it cannot hold a summary of 8000 columns, and the xlwt engine that wrote it was removed in pandas 2.0.
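With 8000 columns the summary is only five rows tall but thousands of columns wide, which is awkward to read in a spreadsheet. One option is to transpose it so that each original column becomes a row. The sketch below is only an illustration: the small `summary` frame is made-up data standing in for df.describe().toPandas(), and `transpose_summary` is a hypothetical helper name.

```python
import pandas as pd


def transpose_summary(summary: pd.DataFrame) -> pd.DataFrame:
    """Turn a describe()-style table into one row per original column.

    Assumes `summary` has a 'summary' column holding the statistic names
    (count, mean, stddev, min, max), as Spark's describe() produces.
    """
    tall = summary.set_index("summary").T
    tall.index.name = "column"
    tall.columns.name = None
    return tall.reset_index()


# Stand-in for df.describe().toPandas(); Spark returns every value as a string.
summary = pd.DataFrame(
    {
        "summary": ["count", "mean", "stddev", "min", "max"],
        "age": ["3", "30.0", "5.0", "25", "35"],
        "salary": ["3", "2000.0", "1000.0", "1000", "3000"],
    }
)

tall = transpose_summary(summary)
print(tall)
# To save it: tall.to_excel("fileOutput.xlsx", index=False)  # needs openpyxl
```

The transposed table has one row per column of the original data and one column per statistic, so 8000 columns become 8000 rows, which fits comfortably in a worksheet.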

Related

How to merge cells in HTML output of pandas dataframe in Python

How can I merge the following pandas dataframe
into the following in HTML format?
Edit: I know about the df.to_html() method, but it returns
whereas what I want is

Read excel using pandas and join two pandas dataframes without losing formatting styles

Initially I have two Excel files. Input file 1 has colors applied to some of its columns.
The other Excel file looks like this.
I have to join these two Excel files using openpyxl, xlsxwriter (Python libraries), or any other method, and I don't want to lose the colors in the output file. The output file should look like the image below.
Please use the code below to create the pandas DataFrames for the two input files.
import pandas as pd
df = pd.DataFrame({'id': [1, 2, 3, 4],
                   'name': ['rahul', 'raju', 'mohan', 'ram'],
                   'salary': [20000, 34000, 10000, 998765]
                   })
print(df)
df1 = pd.DataFrame({'id': [1, 2, 3, 4],
                    'state': ['gujrat', 'bhopal', 'mumbai', 'kolkata']
                    })
print(df1)

Using a for loop in pandas

I have two tabular files in Excel format. I want to know whether an ID number from the "ID" column of the first file exists in the proteome file in a specific column (take "IHD" for example) and, if so, to display the value associated with it. Is there a way to do this in pandas, possibly using a for loop?
After loading the Excel files with read_excel(), you should merge() the DataFrames, matching ID in the first against protein in the second. With pandas, this is the recommended approach rather than looping.
import pandas as pd
clusters = pd.read_excel('clusters.xlsx')
proteins = pd.read_excel('proteins.xlsx')
clusters.merge(proteins, left_on='ID', right_on='protein')
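To also see which IDs were not found, a left merge with an indicator column is one option. The following is a minimal sketch on made-up in-memory data: the `cluster` column and all values are assumptions for illustration, while ID, protein and IHD are the column names from the question and answer.

```python
import pandas as pd

# Made-up data standing in for the two Excel files.
clusters = pd.DataFrame({"ID": ["P1", "P2", "P3"], "cluster": [1, 1, 2]})
proteins = pd.DataFrame({"protein": ["P1", "P3", "P4"], "IHD": [0.5, 1.25, 2.0]})

# how="left" keeps every row of `clusters`; indicator=True adds a "_merge"
# column saying whether a match was found in `proteins`.
merged = clusters.merge(
    proteins, left_on="ID", right_on="protein", how="left", indicator=True
)
merged["found"] = merged["_merge"] == "both"
print(merged[["ID", "found", "IHD"]])
```

IDs without a match keep their row, with `found` set to False and a missing value in IHD, so nothing is silently dropped as it would be with the default inner merge.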

Using constant memory with pandas xlsxwriter

I'm trying to use the code below to write large pandas DataFrames to Excel worksheets. If I write them directly, the system runs out of RAM. Is this a viable option, or are there any alternatives?
writer = pd.ExcelWriter('Python Output Analysis.xlsx', engine='xlsxwriter',options=dict(constant_memory=True))
The constant_memory mode of XlsxWriter can be used to write very large Excel files with low, constant memory usage. The catch is that the data needs to be written in row order, and (as @Stef points out in the comments above) pandas writes to Excel in column order. So constant_memory mode won't work with the pandas ExcelWriter.
As an alternative, you could avoid ExcelWriter and write the data from the DataFrame directly to XlsxWriter, row by row. However, that will be slower than letting pandas do the writing.
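The row-by-row alternative described above can be sketched as follows. This is an illustration rather than a tested recipe for very large data: `iter_rows` and `write_constant_memory` are hypothetical helper names, the small frame is made-up data, and the xlsxwriter package is assumed to be installed before the writer function is called.

```python
import pandas as pd


def iter_rows(df: pd.DataFrame):
    """Yield the header, then each data row, as plain Python lists."""
    yield list(df.columns)
    for row in df.itertuples(index=False, name=None):
        yield list(row)


def write_constant_memory(df: pd.DataFrame, path: str) -> None:
    """Stream `df` to `path` one row at a time using XlsxWriter directly."""
    import xlsxwriter  # imported here so iter_rows works without the package

    workbook = xlsxwriter.Workbook(path, {"constant_memory": True})
    worksheet = workbook.add_worksheet()
    # Rows must be written in increasing order in constant_memory mode.
    for row_idx, values in enumerate(iter_rows(df)):
        worksheet.write_row(row_idx, 0, values)
    workbook.close()


df = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})
rows = list(iter_rows(df))
# write_constant_memory(df, "output.xlsx")  # hypothetical file name
```

Missing values would likely need handling first (for example with fillna), since XlsxWriter rejects NaN by default.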
If your data is large, consider saving it as a plain text file instead, e.g. CSV or TSV.
df.to_csv('file.csv', index=False, sep=',')
df.to_csv('file.tsv', index=False, sep='\t')
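Or split the DataFrame and save it as several smaller files.
df_size = df.shape[0]
chunksize = max(1, df_size // 10)  # never 0, so range() gets a valid step
for i in range(0, df_size, chunksize):
    # rows i .. i+chunksize-1
    dfn = df.iloc[i:i+chunksize, :]
    dfn.to_excel('...')  # give each chunk its own file name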
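A variant of the splitting idea that also produces a distinct file name per chunk might look like the sketch below. The `split_frame` helper and the `part_NN.xlsx` naming scheme are assumptions for illustration, not part of the original answer. It uses ceiling division so the number of files never exceeds `n_parts`.

```python
import pandas as pd


def split_frame(df: pd.DataFrame, n_parts: int = 10):
    """Yield (file_name, chunk) pairs covering every row of `df` once."""
    chunksize = max(1, -(-len(df) // n_parts))  # ceiling division, at least 1
    for part, start in enumerate(range(0, len(df), chunksize)):
        yield f"part_{part:02d}.xlsx", df.iloc[start:start + chunksize]


df = pd.DataFrame({"value": range(25)})
parts = list(split_frame(df, n_parts=10))
# for file_name, chunk in parts:
#     chunk.to_excel(file_name, index=False)  # needs openpyxl or xlsxwriter
```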

Create a Dataframe from an excel file

I want to create a DataFrame from an Excel file. I am using the pandas read_excel function. My requirement is to create a DataFrame containing only the rows where a column matches some value.
For example, below is my Excel file, and I want to create a DataFrame with all rows whose Module equals 'DC-Prod'.
[Excel file image]
Welcome, Saagar Sheth!
To make a DataFrame, first import pandas like so:
import pandas as pd
Then create a variable for the file to access, like this:
file_var_pandas = 'customer_data.xlsx'
Then create the DataFrame using read_excel:
customers = pd.read_excel(file_var_pandas,
                          sheet_name=0,
                          header=0,
                          index_col=False,
                          keep_default_na=True
                          )
Finally, use the head() method to preview the first rows:
customers.head()
If you want to know more, just go to this website:
Packet Pandas Dataframe
And have fun!
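The filter the question asks for (rows whose Module equals 'DC-Prod') can be applied with a boolean mask once the sheet is loaded. A minimal sketch on made-up data follows. The `Server` column and all row values are assumptions, and in practice `modules` would come from pd.read_excel.

```python
import pandas as pd

# Made-up rows standing in for the result of pd.read_excel(...).
modules = pd.DataFrame(
    {
        "Module": ["DC-Prod", "DC-Test", "DC-Prod"],
        "Server": ["srv1", "srv2", "srv3"],
    }
)

# Boolean mask: True where the Module column equals 'DC-Prod'.
dc_prod = modules[modules["Module"] == "DC-Prod"].reset_index(drop=True)
print(dc_prod)
```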
