Writing a PySpark dataframe to a text file - apache-spark

I have a PySpark dataframe which I created from a table in SQL Server. I did
some transformations on it, and now I am going to convert it to a dynamic
dataframe in order to be able to save it as a text file in an S3 bucket.
When writing the dataframe to the text file, I need to add an extra header
to that file.
This is my dynamic data frame that will be saved as a file:
AT_DATE    | AMG_INS | MONTHLY_AVG
2021-03-21 | MT.0000 | 234.543
2021_02_12 | MT.1002 | 34.567
While saving the text file, I want to add another header row on top of it, like this:
HDR,FTP,PC
AT_DATE,AMG_INS,MONTHLY_AVG
2021-03-21,MT.0000,234.543
2021_02_12,MT.1002,34.567
The HDR,FTP,PC line is a separate row that I need to add at the top of my text file.

To save your dataframe as a text file with additional header lines, you have to perform the following steps:
Prepare your data dataframe
since you can only write single-column dataframes to text, you first concatenate all values into one value column, using the concat_ws Spark SQL function
then you drop all columns except the value column, using the select dataframe method
you add an order column with the literal value 2; it will be used later to ensure that the headers end up at the top of the output text file
Prepare your headers dataframe
You create a headers dataframe, containing one row per desired header. Each row has two columns:
a value column containing the header line as a string
an order column containing the header order as an int (0 for the first header and 1 for the second header)
Write the union of the headers and data dataframes
you union your data dataframe with the headers dataframe using the union dataframe method
you use the coalesce(1) dataframe method to get only one text file as output
you order your dataframe by the order column using the orderBy dataframe method
you drop the order column
and you write the resulting dataframe
Complete code
Translated into code, this gives you the snippet below. I call your dynamic dataframe output_dataframe and your Spark session spark, and I write to /tmp/to_text_file:
from pyspark.sql import functions as F

# Concatenate all columns into a single 'value' column and tag data rows with order 2
data = output_dataframe \
    .select(F.concat_ws(',', F.col("AT_DATE"), F.col("AMG_INS"), F.col("MONTHLY_AVG")).alias('value')) \
    .withColumn('order', F.lit(2))

# One row per header line, with orders 0 and 1 so they sort before the data rows
headers = spark.createDataFrame([('HDR,FTP,PC', 0), ('AT_DATE,AMG_INS,MONTHLY_AVG', 1)], ['value', 'order'])

# Union headers and data, collapse to one partition, sort headers first, drop the helper column, write
headers.union(data) \
    .coalesce(1) \
    .orderBy('order') \
    .drop('order') \
    .write.text("/tmp/to_text_file")
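Note that Spark writes a directory, not a single file: /tmp/to_text_file will contain one part file (because of coalesce(1)) whose content should match the desired output:
HDR,FTP,PC
AT_DATE,AMG_INS,MONTHLY_AVG
2021-03-21,MT.0000,234.543
2021_02_12,MT.1002,34.567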

Related

Write dataframe and strings in one single csv file

I want to export a dataframe (500 rows, 2 columns) from Python to a CSV file.
However, I need to ensure that the first 20 rows have some text/strings written, and that the dataframe (500 rows, 2 columns) then starts from the 21st row onwards.
I referred to the following link: Skip first rows when writing csv (pandas.DataFrame.to_csv). However, it does not satisfy my requirements.
Can somebody please let me know how to do this?
Get the first 20 rows and save them to another dataframe
Check if there are any null values
If there are no null values, remove the first 20 rows
Save df as a CSV file
# take the first 20 rows
df2 = df.head(20)
# check whether they contain any null values
has_nulls = df2.isnull().values.any()
if not has_nulls:
    # remove the first 20 rows
    df = df[20:]
df.to_csv('updated.csv')
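If the goal is literally to have 20 lines of text first and the dataframe starting at the 21st row, one minimal sketch is to write the text lines to the file yourself and then append the dataframe to the same handle (the placeholder strings and the dataframe contents here are hypothetical):

import pandas as pd

df = pd.DataFrame({'col1': range(500), 'col2': range(500)})  # hypothetical 500x2 dataframe

with open('updated.csv', 'w', newline='') as f:
    for i in range(20):
        f.write(f'text line {i + 1}\n')  # hypothetical placeholder strings
    df.to_csv(f, index=False)            # header lands on row 21, data follows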

Python3 - Delete Top Rows from CSV before headers

I have a CSV file that is a vulnerability report, and at the top of the CSV there are 4 rows with one to two columns of text about the report, before the headers. I am trying to write a script that will delete these rows so that I can combine multiple files for calculations, reporting, etc.
I have tried using pandas to convert the CSV into a dataframe and then delete the rows, but because these top rows are not headers, the dataframe conversion fails. Any advice on how I can delete these top four rows from the CSV? Thanks!
Use the skiprows parameter of read_csv. For example:
import pandas as pd

# Skip the first 2 rows of the csv and initialize a dataframe
usersDf = pd.read_csv('users.csv', skiprows=2)
# Skip rows at specific indices
usersDf = pd.read_csv('users.csv', skiprows=[0, 2, 5])
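For the report described in the question, with four rows of text before the headers, that would presumably be (report.csv is a hypothetical filename):

reportDf = pd.read_csv('report.csv', skiprows=4)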

How to append a dataframe to a csv file when the column order changes, without iterating over each column?

I am trying to append new dataframes iteratively to a single CSV file. The problem is that the dataframes always have the same column names, but their order is random. Currently, I am using the following code to append new data to the csv:
with open(name+'.csv', 'a') as f:
    df.to_csv(f, header=f.tell()==0)
But this does not work when the column order is changed. It keeps appending values in the order they come, without taking the headers into account. E.g. if the column order in the first dataframe is [A,B,C,D] and in the second dataframe the order is [D,C,A,B], the CSV becomes:
A,B,C,D
a,b,c,d
a,b,c,d
a,b,c,d
...
d,c,a,b
d,c,a,b
d,c,a,b
...
Any suggestions?
Use the reindex function:
with open(name+'.csv', 'a') as f:
    # force a fixed column order before appending
    df.reindex(columns=list('ABCD')).to_csv(f, header=f.tell()==0)
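One caveat: reindex fills a column with NaN if one of the requested labels is missing from df, rather than raising an error. If the real column names are not literally A, B, C and D, the same pattern works with an explicit list, e.g. df.reindex(columns=['A', 'B', 'C', 'D']) with your own names substituted.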

Splitting a dataframe (csv)

How do I split a dataframe (CSV) in the ratio of 4:1 randomly and store the parts in two different variables?
E.g. if there are ten rows, from 1 to 10, in the dataframe, I want any 8 of them in variable 'a' and the remaining 2 rows in variable 'b'.
I've never done this on a random basis, but the basic approach would be:
import pandas
read in your csv
drop empty/null columns (to avoid issues with these)
create a new dataframe to put the split values into
assign names to your new columns
split values and combine them (using apply/combine/lambda)
Code sample:
# import the pandas module
import pandas as pd
# read in the csv file
data = pd.read_csv("https://mydata.csv")
# drop null values
data.dropna(inplace=True)
# create a new data frame by splitting one column into two
new = data["ColumnName"].str.split(" ", n=1, expand=True)
# assign the first split value to a new column A
data["A"] = new[0]
# assign the second split value to a separate new column B
data["B"] = new[1]
# other/different code is required for concatenation of column values - see the linked SO question
# display the df
data
Hope this helps
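For the random 4:1 row split the question actually asks about, a minimal sketch using pandas' DataFrame.sample (the input filename and variable names are hypothetical):

import pandas as pd

df = pd.read_csv('mydata.csv')            # hypothetical input file
a = df.sample(frac=0.8, random_state=42)  # 80% of the rows, chosen at random
b = df.drop(a.index)                      # the remaining 20%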

How to keep row and column headers when applying an operation using Matlab

I have a data set stored in an Excel file. When I import the data using the Matlab function:
A = xlsread(xls_filename)
matrix A only stores the numeric values of my table. And when I use another function such as:
B = readtable(xls_filename)
the table shows the complete data, including row and column headers, but when I apply an operation to it, like:
Bnorm = normc(B)
it is unable to perform the normalization due to the row and column headers.
My questions are:
is there any way to avoid row and column headers in table B?
is there any way to store row and column headers when reading the table using the xlsread function, such that
column headers = the first row in xls_filename
row headers = the first column in xls_filename
Thanks for any suggestion.
[screenshots: dataset table; normalized matrix after xlsread]
The answers to your specific questions are:
With a table, you can avoid row labels but column labels always exist.
As per the doc for xlsread, the first output is the numeric data, and the second output is the text data, which in this case would include your header information.
But, in this case, you just need to learn how to work with tables properly. You want something like:
>> Bnorm = normc(B{:,2:end});
which extracts all the numeric elements of table B and uses them as input to normc.
If you want the result to be a table then use
Bnorm = B;
Bnorm{:,2:end} = normc(B{:,2:end});
