Spark : Modify CSV file and write to other folder - apache-spark

Folks,
We have a requirement to do a minor transformation on a CSV file and write the result to another HDFS folder using Spark.
e.g /input/csv1.txt (at least 4 GB file)
ID,Name,Address
100,john,some street
The output should be written to a file (output/csv1.txt). Basically, two new columns will be added after analyzing the address (the order of records should be the same as in the input file):
ID,Name,Address,Country,ZipCode
100,john,some street,India,560001
Looks like there is no easy way to do this with Spark.

Ehm, I don't know what you mean by no easy way - the spark-csv package makes it very easy IMHO. Depending on which version of Spark you are running, you need to do one of the following:
Spark 2.x
val df = spark.read.csv("/path/to/files/")
df
.withColumn("country", ...)
.withColumn("zip_code", ...)
.write
.csv("/my/output/path/")
Spark 1.x
val df = sqlContext.read.format("com.databricks.spark.csv").load("/path/to/my/files/")
df
.withColumn("country", ...)
.withColumn("zip_code", ...)
.write
.format("com.databricks.spark.csv")
.save("/my/output/path/")
Note that I just put withColumn here - you are probably joining with some other dataframe containing the country and zip code, but my example is just to illustrate how you read and write it with the spark-csv package (which has been built into Spark 2.x).
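For completeness, here is a minimal sketch of what the Spark 2.x variant could look like if the country and zip code come from a hypothetical lookup DataFrame; the lookup file, column names and paths are assumptions for illustration, not part of the original question:
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().appName("csv-enrich").getOrCreate()
// read the input CSV, keeping the header row as column names
val df = spark.read.option("header", "true").csv("/input/")
// hypothetical lookup table mapping Address to Country and ZipCode
val lookup = spark.read.option("header", "true").csv("/lookup/address_lookup.csv")
df.join(lookup, Seq("Address"), "left")
  .write
  .option("header", "true")
  .csv("/output/")
Note that a join does not guarantee the original record order; if that matters, you would probably need to carry an explicit ordering column (e.g. from monotonically_increasing_id) and sort on it before writing.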

Related

How to convert multiple parquet files into TFrecord files using SPARK?

I would like to produce stratified TFrecord files from a large DataFrame based on a certain condition, for which I use write.partitionBy(). I'm also using the tensorflow-connector in Spark, but this apparently does not work together with a write.partitionBy() operation. Therefore, I have not found any way other than working in two steps:
Repartition the dataframe according to my condition, using partitionBy(), and write the resulting partitions to parquet files.
Read those parquet files to convert them into TFrecord files with the tensorflow-connector plugin.
It is the second step that I'm unable to do efficiently. My idea was to read in the individual parquet files on the executors and immediately write them into TFrecord files. But this needs access to the SQLContext, which can only be used in the driver (discussed here), so not in parallel. I would like to do something like this:
# List all parquet files to be converted
import glob, os
from pyspark.sql import SparkSession

files = glob.glob('/path/*.parquet')
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext
sc.parallelize(files, 2).foreach(lambda parquetFile: convert_parquet_to_tfrecord(parquetFile))
Could I construct the function convert_parquet_to_tfrecord that would be able to do this on the executors?
I've also tried just using the wildcard when reading all the parquet files:
SQLContext(sc).read.parquet('/path/*.parquet')
This indeed reads all parquet files, but unfortunately not into individual partitions. It appears that the original structure gets lost, so it doesn't help me if I want the exact contents of the individual parquet files converted into TFrecord files.
Any other suggestions?
Try spark-tfrecord.
Spark-TFRecord is a tool similar to spark-tensorflow-connector, but it supports partitionBy. The following example shows how to partition a dataset.
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.col
import spark.implicits._ // for toDF (imported by default in spark-shell)
// create a dataframe
val df = Seq((8, "bat"), (8, "abc"), (1, "xyz"), (2, "aaa")).toDF("number", "word")
val tf_output_dir = "/tmp/tfrecord-test"
// dump the tfrecords to files, partitioned by the "number" column
df.repartition(3, col("number")).write.mode(SaveMode.Overwrite)
  .partitionBy("number").format("tfrecord").option("recordType", "Example")
  .save(tf_output_dir)
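If you later need to read those records back into a DataFrame, a minimal sketch of the corresponding read path (assuming the spark-tfrecord package is on the classpath and the same output directory as above):
// read the tfrecord files back into a DataFrame
val importedDf = spark.read
  .format("tfrecord")
  .option("recordType", "Example")
  .load(tf_output_dir)
importedDf.show()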
More information can be found in the GitHub repo: https://github.com/linkedin/spark-tfrecord
If I understood your question correctly, you want to write the partitions locally on the workers' disk.
If that is the case then I would recommend looking at spark-tensorflow-connector's instructions on how to do so.
This is the code that you are looking for (as stated in the documentation linked above):
myDataFrame.write.format("tfrecords").option("writeLocality", "local").save("/path")
On a side note, if you are worried about efficiency, why are you using PySpark? It would be better to use Scala instead.

Spark - Stream kafka to file that changes every day?

I have a Kafka stream I will be processing in Spark. I want to write the output of this stream to a file. However, I want to partition these files by day, so every day it will start writing to a new file. Can something like this be done? I want this to be left running, and when a new day occurs, it should switch to writing to a new file.
val streamInputDf = spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", "XXXX")
  .option("subscribe", "XXXX")
  .load()
val streamSelectDf = streamInputDf.select(...)
streamSelectDf.writeStream.format("parquet")
  .option("path", "xxx")
  ???
Adding partitions from Spark can be done with partitionBy, provided in DataFrameWriter for non-streamed data or in DataStreamWriter for streamed data.
Below are the signatures:
public DataFrameWriter partitionBy(scala.collection.Seq<String> colNames)
DataStreamWriter partitionBy(scala.collection.Seq<String> colNames)
Partitions the output by the given columns on the file system.
DataStreamWriter partitionBy(String... colNames)
Partitions the output by the given columns on the file system.
Description:
partitionBy public DataStreamWriter partitionBy(String... colNames)
Partitions the output by the given columns on the file system. If specified, the output is laid out on the file system similar to Hive's partitioning scheme. As an example, when we partition a dataset by year and then month, the directory layout would look like:
- year=2016/month=01/
- year=2016/month=02/
Partitioning is one of the most widely used techniques to optimize physical data layout. It provides a coarse-grained index for skipping unnecessary data reads when queries have predicates on the partitioned columns. In order for partitioning to work well, the number of distinct values in each column should typically be less than tens of thousands.
Parameters: colNames - (undocumented) Returns: (undocumented) Since: 2.0.0
So if you want to partition data by year, month and day, Spark will save the data to folders like:
year=2019/month=01/day=05
year=2019/month=02/day=05
Option 1 (Direct write):
You mentioned parquet - you can save in parquet format with:
df.write.partitionBy('year', 'month','day').format("parquet").save(path)
Option 2 (insert into Hive using the same partitionBy):
You can also insert into a Hive table like:
df.write.partitionBy('year', 'month', 'day').insertInto(tableName)
Getting all Hive partitions:
Spark SQL is based on the Hive query language, so you can use SHOW PARTITIONS to get the list of partitions in a specific table.
sparkSession.sql("SHOW PARTITIONS partitionedHiveParquetTable")
Conclusion:
I would suggest option 2, since the advantage is that you can later query data by partition (i.e. query the raw data to know what you have received), and the underlying files can be parquet or orc.
Note:
Just make sure you have .enableHiveSupport() when you are creating the session with SparkSession.builder, and also make sure hive-site.xml etc. are configured properly.
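For reference, a minimal sketch of creating such a session (the app name is just a placeholder):
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder()
  .appName("kafka-to-partitioned-table") // placeholder name
  .enableHiveSupport() // needed for insertInto / SHOW PARTITIONS on Hive tables
  .getOrCreate()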
Based on this answer, Spark should be able to write to a folder based on the year, month and day, which seems to be exactly what you are looking for. I have not tried it in Spark Streaming, but hopefully this example gets you on the right track:
df.write.partitionBy("year", "month", "day").format("parquet").save(outPath)
If not, you might be able to put in a variable filepath based on current_date()
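Either way, here is a minimal structured streaming sketch of the partitionBy approach, deriving year/month/day columns from the Kafka message timestamp so that a new directory is created per day (topic name, output path and checkpoint location are placeholders, not from the original question):
import org.apache.spark.sql.functions.{col, year, month, dayofmonth}
val streamInputDf = spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", "XXXX")
  .option("subscribe", "XXXX")
  .load()
// keep the payload and derive partition columns from the Kafka record timestamp
val streamSelectDf = streamInputDf
  .selectExpr("CAST(value AS STRING) AS value", "timestamp")
  .withColumn("year", year(col("timestamp")))
  .withColumn("month", month(col("timestamp")))
  .withColumn("day", dayofmonth(col("timestamp")))
streamSelectDf.writeStream
  .format("parquet")
  .option("path", "/output/path") // placeholder
  .option("checkpointLocation", "/checkpoint/path") // placeholder; required for file sinks
  .partitionBy("year", "month", "day")
  .start()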

spark parquet too many small files [duplicate]

I've got a fairly simple job converting log files to parquet. It's processing 1.1 TB of data (chunked into 64 MB - 128 MB files - our block size is 128 MB), which is approximately 12 thousand files.
Job works as follows:
val events = spark.sparkContext
  .textFile(s"$stream/$sourcetype")
  .map(_.split(" \\|\\| ").toList)
  .collect{ case List(date, y, "Event") => MyEvent(date, y, "Event") }
  .toDF()
events.write.mode(SaveMode.Append).partitionBy("date").parquet(s"$path")
It collects the events with a common schema, converts to a DataFrame, and then writes out as parquet.
The problem I'm having is that this can create a bit of an IO explosion on the HDFS cluster, as it's trying to create so many tiny files.
Ideally I want to create only a handful of parquet files within the partition 'date'.
What would be the best way to control this? Is it by using 'coalesce()'?
How will that affect the number of files created in a given partition? Is it dependent on how many executors I have working in Spark? (currently set at 100).
You have to repartition your DataFrame to match the partitioning of the DataFrameWriter.
Try this:
df
.repartition($"date")
.write.mode(SaveMode.Append)
.partitionBy("date")
.parquet(s"$path")
In Python you can rewrite Raphael Roth's answer as:
(df
.repartition("date")
.write.mode("append")
.partitionBy("date")
.parquet("{path}".format(path=path)))
You might also consider adding more columns to .repartition to avoid problems with very large partitions:
(df
.repartition("date", another_column, yet_another_colum)
.write.mode("append")
.partitionBy("date)
.parquet("{path}".format(path=path)))
The simplest solution would be to replace your actual partitioning by:
df
.repartition(to_date($"date"))
.write.mode(SaveMode.Append)
.partitionBy("date")
.parquet(s"$path")
You can also use more precise partitioning for your DataFrame, i.e. the day and maybe the hour of an hour range, and then be less precise for the writer.
That actually depends on the amount of data: you can reduce entropy by partitioning the DataFrame and then writing with the partitionBy clause.
I came across the same issue, and using coalesce solved my problem.
df
.coalesce(3) // number of parts/files
.write.mode(SaveMode.Append)
.parquet(s"$path")
For more information on using coalesce or repartition, you can refer to the following: spark: coalesce or repartition
Duplicating my answer from here: https://stackoverflow.com/a/53620268/171916
This is working for me very well:
data.repartition(n, "key").write.partitionBy("key").parquet("/location")
It produces N files in each output partition (directory), and is (anecdotally) faster than using coalesce and (again, anecdotally, on my data set) faster than only repartitioning on the output.
If you're working with S3, I also recommend doing everything on local drives (Spark does a lot of file creation/rename/deletion during write outs) and once it's all settled use hadoop FileUtil (or just the aws cli) to copy everything over:
import java.net.URI
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}
import org.apache.spark.sql.SparkSession
// ...
def copy(
    in: String,
    out: String,
    sparkSession: SparkSession
) = {
  FileUtil.copy(
    FileSystem.get(new URI(in), sparkSession.sparkContext.hadoopConfiguration),
    new Path(in),
    FileSystem.get(new URI(out), sparkSession.sparkContext.hadoopConfiguration),
    new Path(out),
    false,
    sparkSession.sparkContext.hadoopConfiguration
  )
}
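A hypothetical call, assuming the job staged its output on a local path and you want it on S3 (both paths are placeholders):
// copy the locally written output up to S3 once the Spark job has finished
copy("file:///tmp/staged-output", "s3a://my-bucket/output", sparkSession)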
How about trying to run a script like this as a MapReduce job, consolidating all the parquet files into one:
$ hadoop jar /usr/hdp/2.3.2.0-2950/hadoop-mapreduce/hadoop-streaming-2.7.1.2.3.2.0-2950.jar \
-Dmapred.reduce.tasks=1 \
-input "/hdfs/input/dir" \
-output "/hdfs/output/dir" \
-mapper cat \
-reducer cat

Efficient way to read specific columns from parquet file in spark

What is the most efficient way to read only a subset of columns in Spark from a parquet file that has many columns? Is using spark.read.format("parquet").load(<parquet>).select(...col1, col2) the best way to do that? I would also prefer to use a typesafe dataset with case classes to pre-define my schema, but I'm not sure.
val df = spark.read.parquet("fs://path/file.parquet").select(...)
This will only read the corresponding columns. Indeed, parquet is a columnar storage format and it is exactly meant for this type of use case. Try running df.explain and Spark will tell you that only the corresponding columns are read (it prints the execution plan). explain would also tell you which filters are pushed down to the physical plan of execution in case you also use a where condition. Finally, use the following code to convert the dataframe (a dataset of rows) to a dataset of your case class.
case class MyData...
val ds = df.as[MyData]
At least in some cases, getting a dataframe with all columns and then selecting a subset won't work. E.g., the following will fail if the parquet file contains at least one field with a type that is not supported by Spark:
spark.read.format("parquet").load("<path_to_file>").select("col1", "col2")
One solution is to provide a schema that contains only the requested columns to load:
spark.read.format("parquet").load("<path_to_file>", schema="col1 bigint, col2 float")
Using this you will be able to load a subset of Spark-supported parquet columns even if loading the full file is not possible. I'm using pyspark here, but would expect Scala version to have something similar.
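For reference, a rough Scala sketch of the same idea; DataFrameReader.schema accepts a DDL string in Spark 2.3+ (the column names are the hypothetical ones from above):
// only the columns named in the schema are requested from the parquet files
val df = spark.read
  .schema("col1 BIGINT, col2 FLOAT")
  .parquet("<path_to_file>")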
Spark supports pushdowns with Parquet so
load(<parquet>).select(...col1, col2)
is fine.
I would also prefer to use typesafe dataset with case classes to pre-define my schema but not sure.
This could be an issue, as it looks like some optimizations don't work in this context Spark 2.0 Dataset vs DataFrame
Parquet is a columnar file format. It is designed exactly for this kind of use case.
val df = spark.read.parquet("<PATH_TO_FILE>").select(...)
should do the job for you.

Spark How to create a DataFrame from a list of file in another DataFrame

In Apache Spark, if I have a DataFrame that contains a list of CSV files, how can I create a DataFrame from the content of all the files listed in the first DataFrame?
From your description, I think the number of files should be small. You can just collect the file paths to the driver and use them to create a DataFrame. E.g.,
val filePathDF = sc.parallelize(Seq("a.txt", "b.txt", "c.txt")).toDF("path")
val df = sqlContext.read.text(filePathDF.collect().map(_.getString(0)): _*)
df.show()
text is a 1.6 API. If you are using a pre-1.6 Spark, you can use format("text").load(...) instead.
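On Spark 2.x, the same idea would look roughly like this; a sketch assuming spark is the SparkSession:
import spark.implicits._ // for toDF
// collect the paths to the driver and pass them all to the reader
val filePathDF = Seq("a.txt", "b.txt", "c.txt").toDF("path")
val df = spark.read.text(filePathDF.collect().map(_.getString(0)): _*)
df.show()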
