Rename written CSV file Spark - apache-spark

I'm running Spark 2.1 and I want to write a CSV with results into Amazon S3.
After repartitioning, the CSV file has a long, cryptic name, and I want to change it to a specific filename.
I'm using the Databricks library for writing into S3.
dataframe
.repartition(1)
.write
.format("com.databricks.spark.csv")
.option("header", "true")
.save("folder/dataframe/")
Is there a way to rename the file afterwards, or even save it directly with the correct name? I've already looked for solutions and haven't found much.
Thanks

You can use the code below to rename the output file.
dataframe.repartition(1).write.format("com.databricks.spark.csv").option("header", "true").save("folder/dataframe/")
import org.apache.hadoop.fs._
val fs = FileSystem.get(sc.hadoopConfiguration)
val filePath = "folder/dataframe/"
val fileName = fs.globStatus(new Path(filePath+"part*"))(0).getPath.getName
fs.rename(new Path(filePath+fileName), new Path(filePath+"file.csv"))
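The same find-then-rename step can be sketched in plain Python for output written to a local directory. This is only an illustration: the helper name rename_single_part_file and the sample file names are made up, and for S3 or HDFS you would still go through the Hadoop FileSystem API as in the Scala code above.

```python
import tempfile
from pathlib import Path


def rename_single_part_file(output_dir, new_name):
    """Rename the only part-* file in output_dir to new_name and return the new path."""
    out = Path(output_dir)
    # Marker files such as _SUCCESS and hidden .crc files do not match "part-*".
    parts = sorted(p for p in out.glob("part-*") if p.is_file())
    if len(parts) != 1:
        raise ValueError(f"expected exactly one part file, found {len(parts)}")
    target = out / new_name
    parts[0].rename(target)
    return target


# Simulate what a single-partition write leaves behind (hypothetical file names).
demo_dir = Path(tempfile.mkdtemp())
(demo_dir / "part-00000-abc123.csv").write_text("id,value\n1,a\n")
(demo_dir / "_SUCCESS").write_text("")

renamed = rename_single_part_file(demo_dir, "file.csv")
remaining = sorted(p.name for p in demo_dir.iterdir())
```

Raising when there is not exactly one part file is a deliberate choice here: it catches the case where repartition(1) was forgotten and several part files exist.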

The code you mentioned here (repeated below) returns Unit. You would need to confirm that your Spark application has completed its run (assuming this is a batch job) and then rename the file.
dataframe
.repartition(1)
.write
.format("com.databricks.spark.csv")
.option("header", "true")
.save("folder/dataframe/")

You can rename the part files to any specific name with the dbutils command. The code below renames the generated part CSV files; it works in PySpark on Databricks.
x = 'dbfs:mnt/source_path' # your source path
y = 'dbfs:mnt/destination_path' # your destination path
files = dbutils.fs.ls(x)
# move or rename each part-000 CSV file to a specific name
i = 0
for file in files:
    print(file.name)
    i = i + 1
    if file.name.endswith('.csv'): # you can use any file extension: parquet, JSON, etc.
        # file.path is the full source path, so no manual concatenation is needed
        dbutils.fs.mv(file.path, y + '/OutputData-' + str(i) + '.csv') # you can provide any specific name here
dbutils.fs.rm(x, True) # optionally remove the source path after renaming all the part files

Related

Reading json file causing corrupt_record in pyspark

I have been trying to read an open-source JSON file, available here in zip format (https://healthy.kaiserpermanente.org/pricing/innetwork/co/2022-08-01_KPIC_CO-COMMERCIAL_in-network-rates.zip). The zipped file is 50MB and the unzipped file is about 700MB. When I try to read this file, I get a ['_corrupt_record'] issue.
df = (spark.read.format("json")
.option("multiline", "true")
.load(file_path)
)
Based on this blog https://medium.com/#sasidharan-r/how-to-handle-corrupt-or-bad-record-in-apache-spark-custom-logic-pyspark-aws-430ddec9bb41, I also tried the following method:
df = (spark.read.format("json")
.schema(schema_in_network)
.option("multiline", "true").option("mode", "PERMISSIVE")
.option("columnNameOfCorruptRecord", "_corrupt_record")
.load(file_path)
)
In addition, based on the Databricks website, I attempted the following:
df = (spark.read.format("json")
.option("multiline", "true").option("mode", "PERMISSIVE")
.option("rescuedDataColumn", "_rescued_data")
.load(file_path)
)
None of the above methods rectified the issue. I am not sure if the problem exists in the json file or my approach. In either case, I want to read the content ignoring corrupted value, if any. Thank you in advance for your help.
PS: I am new to Spark.
Try this:
df = (spark.read.format("json")
.option("multiline", "true").option("mode", "DROPMALFORMED")
.option("rescuedDataColumn", "_rescued_data").load(file_path)
)
DROPMALFORMED mode drops corrupted records entirely.
Also, try reading the documentation for JSON files.
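To find out whether the problem is in the file or in the approach, it can help to check the unzipped file with a plain JSON parser before involving Spark. The sketch below is not a Spark API: find_json_error and the sample field name in_network are made up for illustration. It loads the whole document into memory, so for a 700MB file it assumes the machine has enough RAM.

```python
import json
import os
import tempfile


def find_json_error(path):
    """Return None if the file parses as one JSON document, else a short description."""
    try:
        with open(path, "r", encoding="utf-8") as handle:
            json.load(handle)
    except UnicodeDecodeError:
        # Typical when the path still points at a compressed archive.
        return "not valid UTF-8 text (is the file still compressed?)"
    except json.JSONDecodeError as err:
        return f"line {err.lineno}, column {err.colno}: {err.msg}"
    return None


def _write_temp(text):
    """Write text to a temporary .json file and return its path."""
    fd, path = tempfile.mkstemp(suffix=".json")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        handle.write(text)
    return path


# Hypothetical sample documents: one complete, one truncated.
good_path = _write_temp('{"in_network": [{"rate": 1}]}')
bad_path = _write_temp('{"in_network": [{"rate": 1}')
good_result = find_json_error(good_path)
bad_result = find_json_error(bad_path)
```

If the check returns None, the file itself is well-formed and the reader options or the path are the more likely culprit.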

How to iterate in Databricks to read hundreds of files stored in different subdirectories in a Data Lake?

I have to read hundreds of avro files in Databricks from an Azure Data Lake Gen2, extract data from the Body field inside every file, and concatenate all the extracted data in a unique dataframe. The point is that all avro files to read are stored in different subdirectories in the lake, following the pattern:
root/YYYY/MM/DD/HH/mm/ss.avro
This forces me to loop the ingestion and selection of data. I'm using this Python code, in which list_avro_files is the list of paths to all files:
list_data = []
for file_avro in list_avro_files:
    df = spark.read.format('avro').load(file_avro)
    data1 = spark.read.json(df.select(df.Body.cast('string')).rdd.map(lambda x: x[0]))
    list_data.append(data1)
data = reduce(DataFrame.unionAll, list_data)
Is there any way to do this more efficiently? How can I parallelize/speed up this process?
As long as your list_avro_files can be expressed through standard wildcard syntax, you can probably use Spark's own ability to parallelize the read operation. All you need is to specify a basepath and a filename pattern for your avro files:
scala> var df = spark.read
.option("basepath","/user/hive/warehouse/root")
.format("avro")
.load("/user/hive/warehouse/root/*/*/*/*.avro")
And, in case you find that you need to know exactly which file any given row came from, use input_file_name() built-in function to enrich your dataframe:
scala> df = df.withColumn("source",input_file_name())
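One thing to watch with such patterns: each * covers a single directory level, so the number of wildcards has to match the depth of the layout. For root/YYYY/MM/DD/HH/mm/ss.avro that means five directory wildcards followed by *.avro. The sketch below shows this with Python's glob on a temporary local directory; the file names and the avro_pattern helper are made up, and local globbing is only a stand-in for how Spark resolves the pattern.

```python
import glob
import os
import tempfile

root = tempfile.mkdtemp()

# Hypothetical files laid out as root/YYYY/MM/DD/HH/mm/ss.avro
for rel in ("2023/01/15/10/30/05.avro", "2023/02/01/00/00/59.avro"):
    full = os.path.join(root, *rel.split("/"))
    os.makedirs(os.path.dirname(full), exist_ok=True)
    open(full, "w").close()


def avro_pattern(base, dir_levels):
    """Build base/*/.../*.avro with one * per directory level."""
    return os.path.join(base, *(["*"] * dir_levels), "*.avro")


too_shallow = glob.glob(avro_pattern(root, 3))  # stops at the DD level, finds nothing
full_depth = glob.glob(avro_pattern(root, 5))   # reaches the ss.avro files
```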

Reading multiple parquet files from S3 Bucket [duplicate]

I need to read parquet files from multiple paths that are not parent or child directories.
for example,
dir1 ---
|
------- dir1_1
|
------- dir1_2
dir2 ---
|
------- dir2_1
|
------- dir2_2
sqlContext.read.parquet(dir1) reads parquet files from dir1_1 and dir1_2
Right now I'm reading each dir and merging dataframes using "unionAll".
Is there a way to read parquet files from dir1_2 and dir2_1 without using unionAll, or is there any fancy way using unionAll?
Thanks
A little late but I found this while I was searching and it may help someone else...
You might also try unpacking the argument list to spark.read.parquet()
paths=['foo','bar']
df=spark.read.parquet(*paths)
This is convenient if you want to pass a few globs into the path argument:
basePath='s3://bucket/'
paths=['s3://bucket/partition_value1=*/partition_value2=2017-04-*',
's3://bucket/partition_value1=*/partition_value2=2017-05-*'
]
df=spark.read.option("basePath",basePath).parquet(*paths)
This is cool because you don't need to list all the files in the basePath, and you still get partition inference.
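Both the parquetFile method of SQLContext and the parquet method of DataFrameReader take multiple paths. Note that parquetFile was deprecated in Spark 1.4 and removed in 2.0, so prefer the second form on newer versions. Either of these works on Spark 1.x: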
df = sqlContext.parquetFile('/dir1/dir1_2', '/dir2/dir2_1')
or
df = sqlContext.read.parquet('/dir1/dir1_2', '/dir2/dir2_1')
In case you have a list of files you can do:
files = ['file1', 'file2',...]
df = spark.read.parquet(*files)
For ORC
spark.read.orc("/dir1/*","/dir2/*")
Spark goes inside the dir1/ and dir2/ folders and loads all the ORC files.
For Parquet,
spark.read.parquet("/dir1/*","/dir2/*")
Just taking John Conley's answer, and embellishing it a bit and providing the full code (used in Jupyter PySpark) as I found his answer extremely useful.
from hdfs import InsecureClient
client = InsecureClient('http://localhost:50070')
import posixpath as psp
fpaths = [
psp.join("hdfs://localhost:9000" + dpath, fname)
for dpath, _, fnames in client.walk('/eta/myHdfsPath')
for fname in fnames
]
# At this point fpaths contains all hdfs files
parquetFile = sqlContext.read.parquet(*fpaths)
import pandas
pdf = parquetFile.toPandas()
# display the contents nicely formatted.
pdf
In Spark-Scala you can do this.
val df = spark.read.option("header","true").option("basePath", "s3://bucket/").csv("s3://bucket/{sub-dir1,sub-dir2}/")

How to read only n rows of large CSV file on HDFS using spark-csv package?

I have a big distributed file on HDFS and each time I use sqlContext with spark-csv package, it first loads the entire file which takes quite some time.
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load("file_path")
Now, as I just want to do a quick check at times, all I need is a few (any n) rows of the entire file.
df_n = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load("file_path").take(n)
df_n = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load("file_path").head(n)
But all of these run after the file load is done. Can't I just restrict the number of rows while reading the file itself? I am referring to an equivalent of pandas' nrows in spark-csv, like:
pd_df = pandas.read_csv("file_path", nrows=20)
Or it might be the case that Spark does not actually load the file in the first step, but in that case, why is my file load step taking so much time?
I want
df.count()
to give me only n and not all rows. Is that possible?
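You can use limit(n).
sqlContext.read.format('com.databricks.spark.csv') \
    .options(header='true', inferschema='true').load("file_path").limit(20)
This returns only 20 rows. Note that with inferschema='true' Spark may still scan the file to infer the schema before the limit applies.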
My understanding is that reading just a few lines is not supported by spark-csv module directly, and as a workaround you could just read the file as a text file, take as many lines as you want and save it to some temporary location. With the lines saved, you could use spark-csv to read the lines, including inferSchema option (that you may want to use given you are in exploration mode).
val numberOfLines = ...
spark.
read.
text("myfile.csv").
limit(numberOfLines).
write.
text(s"myfile-$numberOfLines.csv")
val justFewLines = spark.
read.
option("inferSchema", true). // <-- you are in exploration mode, aren't you?
csv(s"myfile-$numberOfLines.csv")
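The idea behind this workaround (take the first lines as raw text, then parse only those) can be sketched without Spark using the standard library. This is a local-file analogy rather than the spark-csv API; read_first_rows and the sample data are made up, and the sketch assumes no quoted field spans more than one line.

```python
import csv
import itertools
import os
import tempfile


def read_first_rows(path, n):
    """Parse the header plus at most n data rows, leaving the rest of the file unread."""
    with open(path, newline="", encoding="utf-8") as handle:
        # n + 1 lines: one header line followed by n data lines.
        head = list(itertools.islice(handle, n + 1))
    return list(csv.DictReader(head))


# Hypothetical sample file with three data rows.
fd, sample_path = tempfile.mkstemp(suffix=".csv")
with os.fdopen(fd, "w", newline="", encoding="utf-8") as handle:
    handle.write("id,name\n1,a\n2,b\n3,c\n")

rows = read_first_rows(sample_path, 2)
```

Because the file handle is consumed lazily, only the first n + 1 lines are read from disk, which is the same property the text-then-limit approach relies on.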
Not inferring schema and using limit(n) worked for me, in all aspects.
f_schema = StructType([
StructField("col1",LongType(),True),
StructField("col2",IntegerType(),True),
StructField("col3",DoubleType(),True)
...
])
df_n = sqlContext.read.format('com.databricks.spark.csv').options(header='true').schema(f_schema).load(data_path).limit(10)
Note: If we use inferschema='true', it takes the same long time again, presumably because the file is still read to infer the schema.
But if we don't know the schema, Jacek Laskowski's solution works well too. :)
The solution given by Jacek Laskowski works well. Presenting an in-memory variation below.
I recently ran into this problem. I was using databricks and had a huge csv directory (200 files of 200MB each)
I originally had
val df = spark.read.format("csv")
.option("header", true)
.option("sep", ",")
.option("inferSchema", true)
.load("dbfs:/huge/csv/files/in/this/directory/")
display(df)
which took a lot of time (10+ minutes), but then I changed it to the code below and it ran almost instantly (2 seconds)
val lines = spark.read.text("dbfs:/huge/csv/files/in/this/directory/").as[String].take(1000)
val df = spark.read
.option("header", true)
.option("sep", ",")
.option("inferSchema", true)
.csv(spark.createDataset(lines))
display(df)
Inferring a schema for text formats is hard, and it can be done this way for CSV and JSON (but not for multi-line JSON).
Since PySpark 2.3 you can simply load data as text, limit, and apply csv reader on the result:
(spark
.read
.options(inferSchema="true", header="true")
.csv(
spark.read.text("/path/to/file")
.limit(20) # Apply limit
.rdd.flatMap(lambda x: x))) # Convert to RDD[str]
Scala counterpart is available since Spark 2.2:
spark
.read
.options(Map("inferSchema" -> "true", "header" -> "true"))
.csv(spark.read.text("/path/to/file").limit(20).as[String])
In Spark 3.0.0 or later one can also apply limit and use from_csv function, but it requires a schema, so it probably won't fit your requirements.
Since I didn't see that solution in the answers, the pure SQL-approach is working for me:
df = spark.sql("SELECT * FROM csv.`/path/to/file` LIMIT 10000")
If there is no header the columns will be named _c0, _c1, etc. No schema required.
Maybe this will be helpful for anyone working in Java.
Applying limit alone will not help to reduce the time. You have to collect the n rows from the file first.
DataFrameReader frameReader = spark
.read()
.format("csv")
.option("inferSchema", "true");
//set framereader options, delimiters etc
List<String> dataset = spark.read().textFile(filePath).limit(MAX_FILE_READ_SIZE).collectAsList();
return frameReader.csv(spark.createDataset(dataset, Encoders.STRING()));

How to import multiple csv files in a single load?

Consider that I have a defined schema for loading 10 CSV files in a folder. Is there a way to load the tables automatically using Spark SQL? I know this can be done by using an individual dataframe for each file (shown below), but can it be automated with a single command, pointing at a folder rather than at a single file?
df = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true")
.load("../Downloads/2008.csv")
Use wildcard, e.g. replace 2008 with *:
df = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true")
.load("../Downloads/*.csv") // <-- note the star (*)
Spark 2.0
// these lines are equivalent in Spark 2.0
spark.read.format("csv").option("header", "true").load("../Downloads/*.csv")
spark.read.option("header", "true").csv("../Downloads/*.csv")
Notes:
Replace format("com.databricks.spark.csv") with format("csv") or the csv method. The com.databricks.spark.csv format has been integrated into Spark 2.0.
Use spark, not sqlContext.
Ex1:
Reading a single CSV file. Provide complete file path:
val df = spark.read.option("header", "true").csv("C:spark\\sample_data\\tmp\\cars1.csv")
Ex2:
Reading multiple CSV files passing names:
val df=spark.read.option("header","true").csv("C:spark\\sample_data\\tmp\\cars1.csv", "C:spark\\sample_data\\tmp\\cars2.csv")
Ex3:
Reading multiple CSV files passing list of names:
val paths = List("C:spark\\sample_data\\tmp\\cars1.csv", "C:spark\\sample_data\\tmp\\cars2.csv")
val df = spark.read.option("header", "true").csv(paths: _*)
Ex4:
Reading multiple CSV files in a folder ignoring other files:
val df = spark.read.option("header", "true").csv("C:spark\\sample_data\\tmp\\*.csv")
Ex5:
Reading multiple CSV files from multiple folders:
val folders = List("C:spark\\sample_data\\tmp", "C:spark\\sample_data\\tmp1")
val df = spark.read.option("header", "true").csv(folders: _*)
Note that you can use other tricks like :
-- One or more wildcard:
.../Downloads20*/*.csv
-- braces and brackets
.../Downloads201[1-5]/book.csv
.../Downloads201{11,15,19,99}/book.csv
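To see which paths such patterns cover, here is a small stand-alone Python sketch. It is not how Spark or Hadoop implement globbing: expand_braces is a made-up helper that handles non-nested braces only, and fnmatch is used just to show the bracket range. One difference to keep in mind is that fnmatch lets * match across /, so it is only suitable here for checking the bracket part.

```python
import fnmatch
import re


def expand_braces(pattern):
    """Expand {a,b,c} alternatives into separate patterns (non-nested braces only)."""
    match = re.search(r"\{([^{}]*)\}", pattern)
    if match is None:
        return [pattern]
    head, tail = pattern[:match.start()], pattern[match.end():]
    expanded = []
    for alternative in match.group(1).split(","):
        # Recurse so that a second brace group later in the pattern is expanded too.
        expanded.extend(expand_braces(head + alternative + tail))
    return expanded


brace_patterns = expand_braces("Downloads201{11,15,19,99}/book.csv")

# [1-5] matches exactly one character in the range 1 to 5.
candidates = ("Downloads2013/book.csv", "Downloads2017/book.csv")
bracket_hits = [
    name for name in candidates
    if fnmatch.fnmatchcase(name, "Downloads201[1-5]/book.csv")
]
```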
Reader's Digest: (Spark 2.x)
For Example, if you have 3 directories holding csv files:
dir1, dir2, dir3
You then define paths as a comma-delimited string of paths, as follows:
paths = "dir1/,dir2/,dir3/*"
Then use the following function and pass it this paths variable
def get_df_from_csv_paths(paths):
    df = spark.read.format("csv").option("header", "false").\
        schema(custom_schema).\
        option('delimiter', '\t').\
        option('mode', 'DROPMALFORMED').\
        load(paths.split(','))
    return df
By then running:
df = get_df_from_csv_paths(paths)
You will obtain in df a single spark dataframe containing the data from all the csvs found in these 3 directories.
===========================================================================
Full Version:
In case you want to ingest multiple CSVs from multiple directories you simply need to pass a list and use wildcards.
For Example:
if your data_path looks like this:
's3://bucket_name/subbucket_name/2016-09-*/184/*,
s3://bucket_name/subbucket_name/2016-10-*/184/*,
s3://bucket_name/subbucket_name/2016-11-*/184/*,
s3://bucket_name/subbucket_name/2016-12-*/184/*, ... '
you can use the above function to ingest all the csvs in all these directories and subdirectories at once.
This would ingest all directories in s3 bucket_name/subbucket_name/ according to the wildcard patterns specified. For example, the first pattern would look in
bucket_name/subbucket_name/
for all directories with names starting with
2016-09-
and for each of those take only the directory named
184
and within that subdirectory look for all csv files.
And this would be executed for each of the patterns in the comma delimited list.
This works way better than union.
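One caveat with a multi-line data_path like the one shown earlier: a bare paths.split(',') keeps the newlines and leading spaces on each entry. A minimal sketch of a stricter splitter follows; split_paths is a made-up helper name, and the sample string reuses three of the patterns from the example.

```python
def split_paths(paths):
    """Split a comma-delimited path string, dropping surrounding whitespace and empty entries."""
    return [part.strip() for part in paths.split(",") if part.strip()]


# A path list written across several lines, as in the example above.
data_path = """s3://bucket_name/subbucket_name/2016-09-*/184/*,
 s3://bucket_name/subbucket_name/2016-10-*/184/*,
 s3://bucket_name/subbucket_name/2016-11-*/184/*"""

cleaned = split_paths(data_path)
```

The cleaned list could then be passed to load(...) in place of paths.split(',').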
Using Spark 2.0+, we can load multiple CSV files from different directories using
df = spark.read.csv(['directory_1','directory_2','directory_3'.....], header=True). For more information, refer to the documentation.
val df = spark.read.option("header", "true").csv("C:spark\\sample_data\\*.csv")
will consider files tmp, tmp1, tmp2, ....

Resources