Pyspark- Save each dataframe to a single file - apache-spark

I am trying to save a filtered dataframe back to the same source file.
I wrote the code below to transform the content of each file in a directory into a separate dataframe, filter it, and save it back to the same file:
rdd = spark.sparkContext.wholeTextFiles("/content/sample_data/test_data")
# collect the RDD to a list
list_elements = rdd.collect()
for element in list_elements:
    path, data = element
    df = spark.read.json(spark.sparkContext.parallelize([data]))
    df = df.filter('d != 721')
    df.write.save(path, format="json", mode="overwrite")
I was expecting it to overwrite the file with the updated data, but instead it creates a folder with the file name and writes the usual part files and metadata inside it.
How can I save each updated dataframe back to the same source file (.txt)?
Thanks in advance.

To save it to one file, call .coalesce(1) or .repartition(1) before .save(); that will still produce the same folder-like structure, but with a single JSON part file inside.
To save it under a "normal" name, after saving you need to cut the single JSON part file out of that folder, paste it, and rename it to the desired name. You can see example code for how this could look for CSV files here.
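A minimal sketch of that cut-and-rename step for the loop in the question, assuming the files live on the local filesystem (as the /content/... path suggests); the tmp_dir name is illustrative, and for HDFS or S3 you would use the Hadoop FileSystem API instead:
import glob
import os
import shutil

# inside the loop, instead of df.write.save(path, ...):
local_path = path.replace("file:", "")   # wholeTextFiles returns file: URIs for local files
tmp_dir = local_path + "_tmp"            # temporary output folder (illustrative name)

df.coalesce(1).write.json(tmp_dir, mode="overwrite")

part_file = glob.glob(os.path.join(tmp_dir, "part-*"))[0]  # the single JSON part file
os.remove(local_path)                    # remove the original source file
shutil.move(part_file, local_path)       # rename the part file back to the original name
shutil.rmtree(tmp_dir)                   # clean up the temporary folder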

Related

Split CSV File into two files keeping header in both files

I am trying to split a large CSV file into two files. I am using the code below:
import pandas as pd
# csv file name to be read in
in_csv = 'Master_file.csv'
# get the number of lines of the csv file to be read
number_lines = sum(1 for row in open(in_csv))
# size of rows of data to write to the csv,
# you can change the row size according to your need
rowsize = 600000
# start looping through data, writing it to a new file for each set
for i in range(0, number_lines, rowsize):
    df = pd.read_csv(in_csv,
                     nrows=rowsize,  # number of rows to read at each loop
                     skiprows=i)     # skip rows that have been read
    # csv to write data to a new file with indexed name. input_1.csv etc.
    out_csv = 'File_Number' + str(i) + '.csv'
    df.to_csv(out_csv,
              index=False,
              header=True,
              mode='a',            # append data to csv file
              chunksize=rowsize)   # size of data to append for each loop
It splits the file, but the second file is missing the header. How can I fix it?
pd.read_csv() returns an iterator when used with chunksize and keeps track of the header for you. The following is an example. It should also be much faster, since the original code above reads the entire file just to count the lines and then re-reads all previously read lines on each chunk iteration, whereas the version below reads through the file only once:
import pandas as pd

with pd.read_csv('Master_file.csv', chunksize=60000) as reader:
    for i, chunk in enumerate(reader):
        chunk.to_csv(f'File_Number{i}.csv', index=False, header=True)

Live updating graph from increasing amount of csv files

I need to analyse some spectral data in real-time and plot it as a self-updating graph.
The program I use outputs a text file every two seconds.
Usually I do the analysis after gathering the data and the code works just fine. I create a dataframe where each csv file represents a column. The problem is that with several thousand csv files the import becomes very slow, and creating a dataframe out of all the csv files usually takes more than half an hour.
Below is the code for creating the dataframe from multiple csv files.
''' import, append and concat files into one dataframe '''
all_files = glob.glob(os.path.join(path, filter + "*.txt"))  # path to the files by joining path and file name
all_files.sort(key=os.path.getmtime)

data_frame = []
name = []
for file in all_files:
    creation_time = os.path.getmtime(file)
    readible_date = datetime.fromtimestamp(creation_time)
    df = pd.read_csv(file, index_col=0, header=None, sep='\t', engine='python', decimal=",", skiprows=15)
    df.rename(columns={1: readible_date}, inplace=True)
    data_frame.append(df)

full_spectra = pd.concat(data_frame, axis=1)
for column in full_spectra.columns:
    time_step = column - full_spectra.columns[0]
    minutes = time_step.total_seconds() / 60
    name.append(minutes)
full_spectra.columns = name
return full_spectra
The solution I thought of was to use the watchdog module: every time a new text file is created, it gets appended as a new column to the existing dataframe, and the updated dataframe is plotted. That way I do not need to loop over all csv files every time.
I found a very nice example of how to use watchdog here.
My problem is that I could not find a solution for how, after detecting the new file with watchdog, to read it and append it to the existing dataframe.
A minimalistic example would look something like this:
def latest_filename():
    """a function that checks within a directory for new text files"""
    return filename

df = pd.DataFrame()                          # create a dataframe
newdata = pd.read_csv(latest_filename())     # the new file is found by watchdog
df["newcolumn"] = newdata["desiredcolumn"]   # append the new data as a column
df.plot()                                    # plot the data
The plotting part should be easy and my thoughts were to adapt the code presented here. I am more concerned with the self-updating dataframe.
I appreciate any help or other solutions that would solve my issue!
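A hedged sketch of how such a watchdog handler might look, reusing the read settings from the batch code above; the class name, the use of the file path as the column label, and the watched `path` are illustrative assumptions, not a tested solution:
import time

import pandas as pd
from watchdog.events import FileSystemEventHandler
from watchdog.observers import Observer

class NewSpectrumHandler(FileSystemEventHandler):
    """Append every newly created .txt file as a column of a shared dataframe."""

    def __init__(self):
        self.full_spectra = pd.DataFrame()

    def on_created(self, event):
        if event.is_directory or not event.src_path.endswith(".txt"):
            return
        # same read settings as in the batch code above
        df = pd.read_csv(event.src_path, index_col=0, header=None, sep='\t',
                         engine='python', decimal=",", skiprows=15)
        self.full_spectra[event.src_path] = df[1]   # append as a new column
        # re-plot self.full_spectra here

handler = NewSpectrumHandler()
observer = Observer()
observer.schedule(handler, path, recursive=False)   # `path` is the watched directory (assumption)
observer.start()
try:
    while True:
        time.sleep(1)
finally:
    observer.stop()
    observer.join()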

How to convert dataframe to a text file in spark?

I unloaded a Snowflake table and created a data frame.
This table has data of various data types.
I tried to save it as a text file but got this error:
Text data source does not support Decimal(10,0).
So to resolve the error, I cast all columns in my select query to the string data type.
Then I got the error below:
Text data source supports only single column, and you have 5 columns.
My requirement is to create a text file as follows:
"column1value column2value column3value and so on"
You can use a CSV output with a space delimiter:
import pyspark.sql.functions as F
df.select([F.col(c).cast('string') for c in df.columns]).write.csv('output', sep=' ')
If you want only 1 output file, you can add .coalesce(1) before .write.
You need to have one column if you want to write using spark.write.text. You can use csv instead, as suggested in #mck's answer, or you can concatenate all columns into one before you write:
df.select(
  concat_ws(" ", df.columns.map(c => col(c).cast("string")): _*).as("value")
).write
  .text("output")
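A PySpark version of the same concatenation idea would be along these lines (a sketch, not taken from the original answer):
import pyspark.sql.functions as F

df.select(
    F.concat_ws(" ", *[F.col(c).cast("string") for c in df.columns]).alias("value")
).write.text("output")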

Read parquet files from S3 folder using wildcard

I have S3 folders as below, each with parquet files:
s3://bucket/folder1/folder2/2020-02-26-12/key=Boston_20200226/
s3://bucket/folder1/folder2/2020-02-26-12/key=Springfield_20200223/
s3://bucket/folder1/folder2/2020-02-26-12/key=Toledo_20200226/
s3://bucket/folder1/folder2/2020-02-26-12/key=Philadelphia_20191203/
My goal is to be able to open the parquet files from the '*_20200226' folders only.
Currently I use a loop to first gather a list of all matching path prefixes and then pass it to the read operation into a DF in Spark 2.4.
s3_files = []
PREFIX = "folder1/folder2/"
min_datetime = current_datetime - timedelta(hours=72)
while current_datetime >= min_datetime:
    each_hour_prefix = min_datetime.strftime('%Y-%m-%d-%H')
    if any(fname.key.endswith('.parquet') for fname in s3_bucket.objects.filter(Prefix=(PREFIX + each_hour_prefix))):
        s3_files.append('s3://{bucket}/{prefix}'.format(bucket=INPUT_BUCKET_NAME, prefix=(PREFIX + each_hour_prefix + '/*')))
    min_datetime = min_datetime + timedelta(hours=1)

spark.read.option('basePath', ('s3://' + INPUT_BUCKET_NAME)).schema(fileSchema).parquet(*s3_files)
where fileSchema is the schema struct of the parquet files and s3_files is an array of all the paths I picked up by walking through the S3 folders above. The loop above works, but my goal is to read only the Boston_20200226 and Toledo_20200226 folders. Is it possible to do wildcard searches like "folder1/folder2/2020-02-26-12/key=*_20200226*", or perhaps change the 'read.parquet' command in some way?
Thanks in advance.
Update:
I resorted to a rudimentary way of walking through all the folders and only keeping files that match the pattern '20200226' (not the most efficient way). I collect the keys in a list, read each parquet file into a DF, and perform a union at the end. Everything works fine except that the 'key' column is not present in the final DF. It is part of the partitionBy() code that created these parquet files. Any idea how I can capture the 'key'?
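For reference, Spark's file readers accept Hadoop-style glob patterns in the path, and setting basePath tells partition discovery where the partitioned layout starts, so the key= column is kept. A sketch using the example paths from the question (the exact pattern is an assumption and worth verifying against the real layout; spark and fileSchema are as in the code above):
df = (
    spark.read
    .option("basePath", "s3://bucket/folder1/folder2/2020-02-26-12/")
    .schema(fileSchema)
    .parquet("s3://bucket/folder1/folder2/2020-02-26-12/key=*_20200226/")
)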

SparkSQL: Am I doing it right?

Here is how I use Spark SQL in a little application I am working on.
I have two HBase tables, say t1 and t2.
My input is a csv file; I parse each and every line and query (Spark SQL) table t1. I write the output to another file.
Now I parse the second file, query the second table, apply certain functions over the result, and output the data.
Table t1 has the purchase details, and t2 has the list of items each user added to the cart along with the time frame.
Input -> CustomerID (a list of them in a csv file)
Output -> A csv file in the particular format mentioned below:
CustomerID, details of the item he bought, first item he added to the cart, all the items he added to the cart until purchase.
For an input of 1100 records, it takes two hours to complete the whole process!
I was wondering if I could speed up the process, but I am stuck.
Any help?
How about this DataFrame approach...
1) Create a dataframe from the CSV.
how-to-read-csv-file-as-dataframe
or something like this, for example:
val csv = sqlContext.sparkContext.textFile(csvPath).map {
  case (txt) =>
    try {
      val reader = new CSVReader(new StringReader(txt), delimiter, quote, escape, headerLines)
      val parsedRow = reader.readNext()
      Row(mapSchema(parsedRow, schema): _*)
    } catch {
      case e: IllegalArgumentException => throw new UnsupportedOperationException("converted from Arg to Op except")
    }
}
2) Create another DataFrame from the HBase data (if you are using Hortonworks) or Phoenix.
3) Do the join and apply your functions (maybe UDFs or when/otherwise, etc.); the result can be a dataframe again.
4) Join the result dataframe with the second table and output the data as CSV, as in the pseudo code example below...
It should be possible to prepare a dataframe with custom columns and corresponding values and save it as a CSV file.
You can do this kind of thing in the spark shell as well.
val df = sqlContext.read.format("com.databricks.spark.csv").
  option("header", "true").
  option("inferSchema", "true").
  load("cars93.csv")
val df2 = df.filter("quantity <= 4.0")
val col = df2.col("cost") * 0.453592
val df3 = df2.withColumn("finalcost", col)
df3.write.format("com.databricks.spark.csv").
  option("header", "true").
  save("output-csv")
Hope this helps.. Good luck.
