Saving Files with tsv Extension with a Spark DataFrameWriter - apache-spark

Is it possible to save files and specify the extension with a DataFrameWriter? In the example below, I save my dataframe using a tab as the delimiter, but the output files are '.csv'.
my_dataframe.write.option("delimiter", "\t").csv(output_path)
Is there a way to specify that the extension should be '.tsv'?

output_path needs to be a filename with a '.tsv' extension.
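If the part files inside output_path also need a '.tsv' suffix, a commonly used workaround is to rename them after the write. Below is a minimal PySpark sketch, assuming an existing SparkSession named spark and a Hadoop-compatible filesystem; it reaches the Hadoop FileSystem API through Spark's JVM gateway, which is an internal interface, so treat it as illustrative rather than definitive:
# write tab-delimited output; Spark still names the part files with a .csv suffix
my_dataframe.write.option("delimiter", "\t").csv(output_path)
# rename each part file from .csv to .tsv via the Hadoop FileSystem API
hadoop = spark.sparkContext._jvm.org.apache.hadoop
fs = hadoop.fs.FileSystem.get(spark.sparkContext._jsc.hadoopConfiguration())
for status in fs.listStatus(hadoop.fs.Path(output_path)):
    name = status.getPath().getName()
    if name.endswith(".csv"):
        fs.rename(status.getPath(), hadoop.fs.Path(output_path + "/" + name[:-4] + ".tsv"))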

Related

Read only specific csv files in azure dataflow source

I have a data flow source, a delimited text dataset that points to a folder containing many csv files.
So the source reads all the csv files inside the folder2. The files inside folder2 are
abc.csv
someFile.csv
otherFile_2021.csv
predicted_file_1.csv
predicted_file_2.csv
predicted_file_99.csv
The aim is to read data only from files like predicted_file_*.csv, i.e. to read only the last three files. Is it possible to add dynamic content to the dataset so that it reads only files matching a specific pattern?
In the source transformation, under source options, you can provide a wildcard path with the filename prefix to read only the required files.
Example (for debugging, a column that stores the file name was added to verify which files were read; the source settings and source data preview screenshots from the original answer are not reproduced here).
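For example, with the file names from the question, a wildcard path such as folder2/predicted_file_*.csv matches only the three predicted files.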
Refer to the Azure Data Factory documentation for more information.

pyspark read multiple csv files at once

I'm using Spark to read files in HDFS. There is a scenario where we receive files in chunks from a legacy system in CSV format.
ID1_FILENAMEA_1.csv
ID1_FILENAMEA_2.csv
ID1_FILENAMEA_3.csv
ID1_FILENAMEA_4.csv
ID2_FILENAMEA_1.csv
ID2_FILENAMEA_2.csv
ID2_FILENAMEA_3.csv
These files are loaded into FILENAMEA in Hive using the Hive Warehouse Connector, with a few transformations such as adding default values. Similarly, we have around 70 tables. The Hive tables are created in ORC format and partitioned on ID. Right now I'm processing all these files one by one, which takes a lot of time.
I want to make this process much faster. The files will be in GBs.
Is there any way to read all the FILENAMEA files at the same time and load them into the Hive tables?
You have two methods to read several CSV files in PySpark. If all CSV files are in the same directory and all have the same schema, you can read them at once by passing the directory path directly as the argument, as follows:
spark.read.csv('hdfs://path/to/directory')
If the CSV files are in different locations, or in the same directory but mixed with other CSV/text files, you can pass them as a single string representing a comma-separated list of paths in the .csv() method argument, as follows:
spark.read.csv('hdfs://path/to/filename1,hdfs://path/to/filename2')
More information about reading CSV files with Spark is available in the Spark documentation.
If you need to build this list of paths from the files in an HDFS directory, you can list the directory first (see the sketch below); once you have your list of paths, you can turn it into a single string to pass to the .csv() method with ','.join(your_file_list).
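A hedged sketch of that listing step, reaching the Hadoop FileSystem API through Spark's JVM gateway (the directory path reuses the placeholder above, and the FILENAMEA filter follows the question; adapt both to your layout):
# list the chunk files for one table, then read them in a single pass
hadoop = spark.sparkContext._jvm.org.apache.hadoop
fs = hadoop.fs.FileSystem.get(spark.sparkContext._jsc.hadoopConfiguration())
paths = [str(s.getPath()) for s in fs.listStatus(hadoop.fs.Path("hdfs://path/to/directory")) if "FILENAMEA" in s.getPath().getName()]
df = spark.read.csv(paths, header=True)
# ','.join(paths) gives the comma-separated string mentioned above, if you prefer that form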
Using spark.read.csv(["path1", "path2", "path3", ...]) you can read multiple files from different paths. But that means you first have to build a list of the paths: a list, not a comma-separated string of file paths.

How to read specific files from a directory based on a file name in spark?

I have a directory of CSV files. The files are named based on date (the example listing from the original question is not shown here).
I have many CSV files that go back to 2012.
So, I would like to read only the CSV files that correspond to a certain date. How could that be done in Spark? In other words, I don't want my Spark engine to bother reading all the CSV files, because my data is huge (TBs).
Any help is much appreciated!
You can specify a list of files to be processed when calling the load(paths) or csv(paths) methods from DataFrameReader.
So an option would be to list and filter files on the driver, then load only the "recent" files:
val files: Seq[String] = ???
spark.read.option("header","true").csv(files:_*)
Edit:
You can use this Python code, passing the list itself (not unpacked) to csv():
files = ['foo', 'bar']
df = spark.read.csv(files)
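If the date is encoded in the file names, Spark's file sources also accept glob patterns directly in the path, which avoids listing files on the driver. A sketch, using a hypothetical naming scheme such as 2012-01-15.csv, since the actual names from the question's listing are not shown:
# read only the January 2012 files, assuming date-named files
df = spark.read.option("header", "true").csv("path/to/directory/2012-01-*.csv")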

Azure Data Factory - CSV to Parquet - Changing file extension

I have created a Data Factory pipeline to convert a CSV file to Parquet format. As I needed to retain the original file name, I am using 'Preserve Hierarchy' in the pipeline. The conversion works fine, but the output file is generated with the csv extension (an expected output). Is there any out-of-the-box option I could use to generate the file name without the csv extension? I scanned through the system variables currently supported by ADF and the list does not include the input file name as an option to mask the file extension: https://learn.microsoft.com/en-us/azure/data-factory/control-flow-system-variables. Is writing a custom component the only option?
Thanks for your inputs.
If you are using Azure Data Factory v2, you could use a Get Metadata activity to get the file name of your source dataset and then pass it to both the input and output datasets of your copy activity.
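For example, the sink file name could then be set with a dynamic-content expression that swaps the extension; the activity name 'Get Metadata1' below is hypothetical, and 'Item name' must be selected in the Get Metadata field list:
@replace(activity('Get Metadata1').output.itemName, '.csv', '.parquet')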

How to load JSON(path saved in csv) with Spark?

I am new to Spark.
I can load a .json file in Spark, but what if there are thousands of .json files in a folder?
I also have a CSV file that classifies the .json files with labels (the original question included screenshots of the JSON folder and of the CSV file).
What should I do with Spark if I want to load and save the data? For example, I want to load the first entry in the CSV, but it is only text: it gives the path of a .json file, and I want to load that .json and then save the output, so that I know the JSON information for the first 'Trusted label graph' entry.
For the JSON:
jsonRDD = sql_context.read.json("path/to/json_folder/")
For CSV, install Databricks' spark-csv package (needed only on Spark 1.x; Spark 2+ reads CSV natively with spark.read.csv):
csvRDD = sql_context.read.load("path/to/csv_folder/", format='com.databricks.spark.csv', header='true', inferSchema='true')
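A hedged sketch of the overall flow with the current SparkSession API, assuming the CSV has hypothetical columns named label and json_path (adjust to your real column names; the output path is also a placeholder):
# read the CSV that maps labels to .json paths
labels_df = spark.read.csv("path/to/csv_folder/", header=True, inferSchema=True)
# collect the paths of interest on the driver and load those JSON files
json_paths = [row["json_path"] for row in labels_df.select("json_path").collect()]
json_df = spark.read.json(json_paths)
# save the result (placeholder location)
json_df.write.json("path/to/output/")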
