I want to partition my results and save them as CSV files in a specified location. However, I couldn't find any option to specify the file format using the code below; all the files are created with names of the form part-000**. How can I specify the required file format here?
records.repartition(partitionNum).saveAsTextFile(path)
You can try this:
df.coalesce(1).write.option("header",true).csv(path)
This path will be a folder, and it must not already exist; you cannot directly produce a CSV file with a specific name this way. But you can rename the resulting HDFS file afterwards with the Hadoop API (which is bundled with Spark):
import org.apache.hadoop.fs._
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val file = fs.globStatus(new Path(s"$path/part*"))(0).getPath().getName()
val result:Boolean = fs.rename(new Path(s"$path/$file"), new Path(s"$hdfsFolder/${fileName}"))
When writing out my dataframe, it drops the result into a folder called "file.csv" containing a "part-000..." file. I need to take this resulting file and write it out/copy it as its own CSV file with a proper name. I am using the logic here, but it appears this won't suffice for Azure Blob Storage, as it's not recognizing a WASB path.
Code to create the dataframe:
val dfOutput = spark.sql("""SELECT * FROM Query""")
dfOutput.coalesce(1).write.option("header","true").mode("overwrite").format("csv").save(OutputFile)
This creates and outputs the dataframe as a "part-000..." CSV file inside a folder. The output path in this case is wasb://mycontainer#myexamplestorage.blob.core.windows.net/file.csv (example).
The next part should grab the "part-000..." file and copy it out as its own file using FileUtil, then remove the "file.csv" path.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}
import java.io.{File}
val hadoopConfig = new Configuration()
val hdfs = FileSystem.get(hadoopConfig)
val srcPath = new Path(OutputFile)
val destPath = new Path("wasb://mycontainer#myexamplestorage.blob.core.windows.net/resultfile.csv")
val srcFile = FileUtil.listFiles(new File(OutputFile))
.filterNot(f=>f.getPath.endsWith(".csv"))(0)
FileUtil.copy(srcFile,hdfs,destPath,true,hadoopConfig)
hdfs.delete(srcPath,true)
This next part fails on the listFiles call with the error "IOException: Invalid directory or I/O error occurred for dir: wasb://mycontainer#myexamplestorage.blob.core.windows.net/file.csv", and from what I can tell this is because it's not able to list files from Azure Blob Storage.
I need to be able to get the CSV file from Azure Blob Storage, then copy it to blob storage as its own file, without the folder and the "part-000..." file. I played around with the file configuration settings, but this entire approach appears to be incompatible with Azure Blob Storage, or there is a configuration missing somewhere that would allow these APIs to query blob storage.
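One likely culprit is that FileUtil.listFiles(new File(OutputFile)) goes through java.io.File, which only understands the local filesystem and cannot enumerate a wasb:// location. Below is a minimal sketch of one possible workaround, reusing the OutputFile, spark and destination values from the question (they are assumptions here, not tested code): list and copy the part file through the Hadoop FileSystem resolved from the WASB path itself.
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

// Resolve the FileSystem that backs the wasb:// URI instead of using java.io.File,
// which only sees the local disk.
val srcDir = new Path(OutputFile)
val fs = srcDir.getFileSystem(spark.sparkContext.hadoopConfiguration)

// Find the single part-* file Spark wrote inside the output folder.
val partFile = fs.globStatus(new Path(s"$OutputFile/part*"))(0).getPath

// Copy it out under the desired name, then remove the original folder.
val destPath = new Path("wasb://mycontainer#myexamplestorage.blob.core.windows.net/resultfile.csv")
FileUtil.copy(fs, partFile, fs, destPath, false, spark.sparkContext.hadoopConfiguration)
fs.delete(srcDir, true)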
I am using a Databricks notebook to read a file and write it back to the same location. But when I write the file I get a lot of files with different names.
I am not sure why these files are created in the location I specified.
Also, another file with the name "new_location" was created after I performed the write operation.
What I want is that, after reading the file from Azure Blob Storage, I write it back to the same location with the same name as the original. But I am unable to do so; please help me out, as I am new to PySpark.
I have already mounted the storage, and I am now reading the CSV file stored in an Azure Blob Storage container.
The overwritten file is created with the name "part-00000-tid-84371752119947096-333f1e37-6fdc-40d0-97f5-78cee0b108cf-31-1-c000.csv"
Code:
df = spark.read.csv("/mnt/ndemo/nsalman/addresses.csv", inferSchema = True)
df = df.toDF("firstName","lastName","street","town","city","code")
df.show()
file_location_new = "/mnt/ndemo/nsalman/new_location"
# write the dataframe as a single file to blob storage
df.write.format('com.databricks.spark.csv') \
.mode('overwrite').option("header", "true").save(file_location_new)
Spark saves one partial CSV file for each partition of your dataset. To generate a single CSV file, you can convert it to a pandas dataframe and then write it out.
Try to change these lines:
df.write.format('com.databricks.spark.csv') \
.mode('overwrite').option("header", "true").save(file_location_new)
to this line
df.toPandas().to_csv(file_location_new, header=True)
You might need to prepend "/dbfs/" to file_location_new for this to work.
Here is a minimal self-contained example that demonstrates how to write a CSV file with pandas:
df = spark.createDataFrame([(1,3),(2,2),(3,1)], ["Testing", "123"])
df.show()
df.toPandas().to_csv("/dbfs/" + "/mnt/ndemo/nsalman/" + "testfile.csv", header=True)
I'm using Spark 2.3.1 and I'm running NLTK over thousands of input files.
From the input files I'm extracting unigram, bigram and trigram words and saving them in different dataframes.
Now I want to save each dataframe into its respective file in HDFS (appending the output to the same file every time).
So at the end I have three CSV files named unigram.csv, bigram.csv and trigram.csv containing the results of thousands of input files.
If this scenario isn't possible with HDFS, can you suggest how to do it using the local disk as the storage path?
Appending to a file in a normal programming language is not the same as what a DataFrame's append write mode does. Whenever you ask a DataFrame to save to a folder, it creates a new file for every append. The only way you can achieve this is (a fuller sketch follows the snippet below):
Read the old file into dfOld : Dataframe
Combine the old and new DataFrames with dfOld.union(dfNewToAppend)
Coalesce to a single output file with .coalesce(1)
Write to a new temporary location /tempWrite
Delete the old HDFS location
Rename the /tempWrite folder to your output folder name
import org.apache.spark.sql.SparkSession
import org.apache.hadoop.fs._

val spark = SparkSession.builder.master("local[*]").getOrCreate
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)

// Write your unigram Dataframe, then rename its part file
fs.rename(new Path(".../achyuttest.csv/part-00000..."), new Path("yourNewHDFSDir/unigram.csv"))
// Write your bigram Dataframe, then rename its part file
fs.rename(new Path(".../achyuttest.csv/part-00000..."), new Path("yourNewHDFSDir/bigram.csv"))
// Write your trigram Dataframe, then rename its part file
fs.rename(new Path(".../achyuttest.csv/part-00000"), new Path("yourNewHDFSDir/trigram.csv"))
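For completeness, here is a minimal sketch of the full read-union-coalesce-write-rename cycle described above for a single output (unigram.csv). The outputDir and tempDir paths are placeholders, dfNewToAppend stands for the DataFrame extracted from the current batch of input files, and the spark session is the one defined in the snippet above:
import org.apache.hadoop.fs._

// Assumed placeholder paths; adjust them to your HDFS layout.
val outputDir = "hdfs:///ngrams/unigram.csv"   // folder currently holding the data
val tempDir   = "hdfs:///ngrams/tempWrite"     // temporary write location

// 1. Read the existing output and union it with the new batch (dfNewToAppend).
val dfOld = spark.read.option("header", "true").csv(outputDir)
val dfCombined = dfOld.union(dfNewToAppend)

// 2. Write the combined data as a single part file to the temporary location.
dfCombined.coalesce(1).write.option("header", "true").csv(tempDir)

// 3. Replace the old folder with the freshly written one.
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
fs.delete(new Path(outputDir), true)
fs.rename(new Path(tempDir), new Path(outputDir))
Note that this rewrites the full combined dataset on every append; that is the cost of ending up with a single CSV file per n-gram.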
Using Spark, I am trying to push some data (in CSV and Parquet format) to an S3 bucket.
df.write.mode("OVERWRITE").format("com.databricks.spark.csv").options(nullValue=options['nullValue'], header=options['header'], delimiter=options['delimiter'], quote=options['quote'], escape=options['escape']).save(destination_path)
In the above code, the destination_path variable holds the S3 bucket location where the data needs to be exported.
Eg. destination_path = "s3://some-test-bucket/manish/"
Suppose that in the folder manish of some-test-bucket I have several files and sub-folders. The above command will delete all of them and Spark will write new output files. But I want to overwrite just one file with this new file.
Even if I could only overwrite the contents of this folder while the sub-folders remained intact, that would solve the problem to a certain extent.
How can this be achieved?
I tried using append mode instead of overwrite.
In this case the sub-folder names remain intact, but again all the contents of the manish folder and its sub-folders are overwritten.
Short answer: Set the Spark configuration parameter spark.sql.sources.partitionOverwriteMode to dynamic instead of static. This will only overwrite the necessary partitions and not all of them.
PySpark example:
from pyspark import SparkConf, SparkContext
from pyspark import sql

conf = SparkConf().setAppName("test").set("spark.sql.sources.partitionOverwriteMode", "dynamic").setMaster("yarn")
sc = SparkContext(conf=conf)
sqlContext = sql.SQLContext(sc)
The files can be deleted first, and then append mode can be used to insert the data instead of overwriting, so that the sub-folders are retained. Below is a PySpark example.
import subprocess
subprocess.call(["hadoop", "fs", "-rm", "{}*.csv.deflate".format(destination_path)])
df.write.mode("append").format("com.databricks.spark.csv").options(nullValue=options['nullValue'], header=options['header'], delimiter=options['delimiter'], quote=options['quote'], escape=options['escape']).save(destination_path)
So, when writing parquet files to s3, I'm able to change the directory name using the following code:
spark_NCDS_df.coalesce(1).write.parquet(s3locationC1+"parquet")
Now, when I output this, the directory contains the part-0000....snappy.parquet file along with _SUCCESS, _committed and _started files.
I'd like to make two changes:
Can I update the file name for the part-0000....snappy.parquet file?
Can I output this file without the _SUCCESS, _committed and _started files?
The documentation I've found online hasn't been very helpful.
out_file_name = "snappy.parquet"  # desired output file name
path = "mnt/s3locationC1/"
tmp_path = "mnt/s3locationC1/tmp_data"
df = spark_NCDS_df

def copy_file(path, tmp_path, df, out_file_name):
    # write a single part file into a temporary folder
    df.coalesce(1).write.parquet(tmp_path)
    # the part-0000....snappy.parquet file sorts last in the listing
    file = dbutils.fs.ls(tmp_path)[-1][0]
    # copy it out under the desired name, then drop the temp folder
    dbutils.fs.cp(file, path + out_file_name)
    dbutils.fs.rm(tmp_path, True)

copy_file(path, tmp_path, df, out_file_name)
This function copies your required output file to the destination and then deletes the temp files; the _SUCCESS, _committed and _started files are removed along with them.
If you need anything more, please let me know.