Can multiple datasets write to the same folder one after another? - apache-spark

I have code which generates a series of datasets from the same SparkSession object and writes them to a folder as Parquet files. I want to see each write materializing a new Parquet file within that folder, but the code seems to hang after the first write.
The code looks like this:
// Called in a loop with different values for the dataset parameter
void writeDataset(Dataset<Row> dataset) {
    DataFrameWriter<Row> writer = dataset.write();
    writer.format("parquet");
    writer.save("/tmp/folder");
}
The first write does generate a Parquet file and a _SUCCESS file in /tmp/folder, but subsequent calls to the method seem to hang at save().
How do I make multiple datasets each generate one Parquet (or Avro or JSON) file in a folder, when called in a loop?

I am able to get it to add new files using the SaveMode.Append option on the writer: https://spark.apache.org/docs/3.2.1/api/java/org/apache/spark/sql/SaveMode.html
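For readers working in Python, here is a minimal PySpark sketch of the same append approach. The function name write_dataset and the default path are illustrative assumptions; mode("append") is the PySpark counterpart of SaveMode.Append. The default mode, errorifexists, refuses to write into a path that already exists.

```python
def write_dataset(dataset, path="/tmp/folder"):
    """Append one DataFrame to `path` as Parquet.

    `dataset` is expected to be a pyspark.sql.DataFrame. With mode("append"),
    each call adds new part files to the folder instead of being rejected
    because the folder already exists.
    """
    dataset.write.mode("append").format("parquet").save(path)
```

Each call may produce more than one part file, depending on how many partitions the dataset has. If exactly one file per dataset is needed, calling coalesce(1) before the write is a common option, at the cost of write parallelism.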

Related

pyspark read multiple csv files at once

I'm using Spark to read files in HDFS. In one scenario, we receive files in chunks from a legacy system in CSV format.
ID1_FILENAMEA_1.csv
ID1_FILENAMEA_2.csv
ID1_FILENAMEA_3.csv
ID1_FILENAMEA_4.csv
ID2_FILENAMEA_1.csv
ID2_FILENAMEA_2.csv
ID2_FILENAMEA_3.csv
These files are loaded into FILENAMEA in Hive using the Hive Warehouse Connector, with a few transformations such as adding default values. We have around 70 tables like this. The Hive tables are created in ORC format and partitioned on ID. Right now, I'm processing all these files one by one, which takes a long time.
I want to make this process much faster. The files will be several GB in size.
Is there any way to read all the FILENAMEA files at the same time and load them into the Hive tables?
There are two ways to read several CSV files in PySpark. If all the CSV files are in the same directory and share the same schema, you can read them at once by passing the directory path as the argument, as follows:
spark.read.csv('hdfs://path/to/directory')
If the CSV files are in different locations, or in a directory that also contains other CSV or text files, you can pass them to the .csv() method as a string representing a list of paths, as follows:
spark.read.csv('hdfs://path/to/filename1,hdfs://path/to/filename2')
You can find more information about how to read a CSV file with Spark here.
If you need to build this list of paths from the files in an HDFS directory, you can look at this answer. Once you have created your list of paths, you can turn it into a string to pass to the .csv() method with ','.join(your_file_list).
Using spark.read.csv(["path1", "path2", "path3", ...]) you can read multiple files from different paths. That means you first have to build a list of the paths: a Python list, not a single string of comma-separated file paths.
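Building the list of paths can be sketched in plain Python. The example below assumes the chunk files follow the naming scheme <ID>_<TABLE>_<CHUNK>.csv seen in the question; the helper name group_paths_by_table, the hdfs://path/to/ prefix, and the FILENAMEB entry are hypothetical and only there for illustration.

```python
import re
from collections import defaultdict

# Assumed naming scheme: <ID>_<TABLE>_<CHUNK>.csv, e.g. ID1_FILENAMEA_1.csv
CHUNK_PATTERN = re.compile(r"^(?P<id>[^_]+)_(?P<table>.+)_(?P<chunk>\d+)\.csv$")


def group_paths_by_table(paths):
    """Map each table name to the sorted list of chunk paths that belong to it."""
    groups = defaultdict(list)
    for path in paths:
        file_name = path.rsplit("/", 1)[-1]
        match = CHUNK_PATTERN.match(file_name)
        if match:  # skip files that do not follow the naming scheme
            groups[match.group("table")].append(path)
    return {table: sorted(chunks) for table, chunks in groups.items()}


paths = [
    "hdfs://path/to/ID1_FILENAMEA_1.csv",
    "hdfs://path/to/ID1_FILENAMEA_2.csv",
    "hdfs://path/to/ID2_FILENAMEA_1.csv",
    "hdfs://path/to/ID1_FILENAMEB_1.csv",  # hypothetical second table
]
groups = group_paths_by_table(paths)
```

Each value in groups is a Python list, so it can be handed to the reader directly, for example spark.read.csv(groups["FILENAMEA"]), giving one read per table instead of one read per chunk.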

Convert CSV files from multiple directory into parquet in PySpark

I have CSV files from multiple paths that are not parent directories in s3 bucket. All the tables have the same partition keys.
The directory layout in the S3 bucket:
table_name_1/partition_key_1 = <pk_1>/partition_key_2 = <pk_2>/file.csv
table_name_2/partition_key_1 = <pk_1>/partition_key_2 = <pk_2>/file.csv
...
I need to convert these csv files into parquet files and store them in another s3 bucket that has the same directory structure.
The directory layout in the other S3 bucket:
table_name_1/partition_key_1 = <pk_1>/partition_key_2 = <pk_2>/file.parquet
table_name_2/partition_key_1 = <pk_1>/partition_key_2 = <pk_2>/file.parquet
...
My current solution iterates through the S3 bucket, finds each CSV file, converts it to Parquet, and saves it to the other S3 path. This is inefficient because the loop converts one file at a time.
I want to utilize the spark library to improve the efficiency.
Then, I tried:
spark.read.csv('s3n://bucket_name/table_name_1/').write.partitionBy('partition_key_1', 'partition_key_2').parquet('s3n://another_bucket/table_name_1')
This works well for each table, but to optimize further, I want to pass the table names as a parameter, something like:
TABLE_NAMES = [table_name_1, table_name_2, ...]
spark.read.csv('s3n://bucket_name/{*TABLE_NAMES}/').write.partitionBy('partition_key_1', 'partition_key_2').parquet('s3n://another_bucket/{*TABLE_NAMES}')
Thanks
The mentioned question provides solutions for reading multiple files at once. The method spark.read.csv(...) accepts one or multiple paths, so the same logic applies to reading your files. When it comes to writing, however, Spark merges all the given paths into one DataFrame, so it is not possible to split that single DataFrame back into several without applying custom logic first. In short, there is no method that writes the initial DataFrame directly into multiple directories, i.e. nothing like df.write.csv(*TABLE_NAMES).
The good news is that Spark provides a dedicated function, input_file_name(), which returns the file path of the current record. You can use it in combination with TABLE_NAMES to filter on the table name.
Here is one possible, untested PySpark solution:
from pyspark.sql.functions import input_file_name
TABLE_NAMES = ["table_name_1", "table_name_2"]  # ... and so on
source_path = "s3n://bucket_name"
# One input path per table directory
input_paths = [f"{source_path}/{t}" for t in TABLE_NAMES]
# csv() takes the paths as a single list, not as unpacked arguments
all_df = spark.read.csv(input_paths) \
    .withColumn("file_name", input_file_name()) \
    .cache()
dest_path = "s3n://another_bucket"
def write_table(table_name: str) -> None:
    # Keep only the rows whose source file path contains the table name
    (all_df.where(all_df["file_name"].contains(table_name))
        .write
        .partitionBy('partition_key_1', 'partition_key_2')
        .parquet(f"{dest_path}/{table_name}"))
for t in TABLE_NAMES:
    write_table(t)
Explanation:
We generate and store the input paths into input_paths. This will create paths such as: s3n://bucket_name/table1, s3n://bucket_name/table2 ... s3n://bucket_name/tableN.
Then we load all the paths into one DataFrame and add a new column called file_name, which holds the source file path of each row. Notice that we also use cache here. This is important because the following code triggers len(TABLE_NAMES) actions, and caching prevents Spark from loading the data source again for each of them.
Next we create write_table, which is responsible for saving the data for a given table. It filters on the table name using all_df["file_name"].contains(table_name), which returns only the records whose file_name column contains the value of table_name. Finally we save the filtered data as you already did.
In the last step we call write_table for every item of TABLE_NAMES.
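One caveat with substring matching: a test for "table_name_1" also matches paths under table_name_10. The plain-Python sketch below shows a stricter comparison on the first path segment under the bucket. The helper name table_of and the partition values a and b are hypothetical, and it assumes the table directory sits directly under the bucket as in the question.

```python
from urllib.parse import urlparse


def table_of(file_path):
    """Return the first path segment under the bucket, i.e. the table directory."""
    segments = [s for s in urlparse(file_path).path.split("/") if s]
    return segments[0] if segments else None


path = "s3n://bucket_name/table_name_10/partition_key_1=a/partition_key_2=b/file.csv"

# Substring test: True, even though the file belongs to table_name_10
substring_match = "table_name_1" in path

# Exact comparison on the table directory: False, as intended
exact_match = table_of(path) == "table_name_1"
```

The same idea could be expressed in Spark with a column expression or a UDF applied to the file_name column; that part is not shown here.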
Related links
How to import multiple csv files in a single load?
Get HDFS file path in PySpark for files in sequence file format

Spark unzip and write CSV as parquet in executor

My issue is that my CSV files come as ZIP archives (.csv.zip), so I cannot just use Spark's .csv reader directly as I could with .csv.gzip or .csv.gz. This means I need to decompress each file, read its contents (the files are quite big, ~5 GB), and write them as Parquet files.
My approach is as such:
String paths = "s3a://...,s3a://...,...";
JavaRDD<Tuple2<String, PortableDataStream>> zipRDD = context.binaryFiles(paths, sparkContext.context.defaultParallelism()).toJavaRDD();
JavaRDD<Tuple2<String, List<Row>>> filenameRowsRDD = zipRDD.flatMap(new ConvertLinesToRows());
The first JavaRDD holds pairs of (filename, input stream). It is passed to the class ConvertLinesToRows, which uses ZipInputStream to read the contents of the CSV files, creates a new Spark Row for each line, and finally returns a tuple of (filename, List<Row>), where the list contains all lines of the CSV converted to Row.
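The ConvertLinesToRows class itself is not shown. As a rough Python sketch of the same idea, assuming UTF-8 encoded CSV members and that the compressed archive bytes fit in memory, a function like the hypothetical rows_from_zip below yields one (member name, row) pair per CSV line:

```python
import csv
import io
import zipfile


def rows_from_zip(zip_bytes):
    """Yield (member_name, row) for every CSV line of every file in the archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            with archive.open(name) as member:
                # Decode as a stream so the decompressed member is read line by line.
                text = io.TextIOWrapper(member, encoding="utf-8", newline="")
                for row in csv.reader(text):
                    yield name, row


# Build a small in-memory archive to demonstrate the function.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("sample.csv", "id,name\n1,alpha\n2,beta\n")

parsed = list(rows_from_zip(buffer.getvalue()))
```

In PySpark, sc.binaryFiles(paths) yields (path, bytes) pairs, so a function like this could be applied with flatMap on the executors. The resulting RDD of rows would then be turned into a DataFrame with spark.createDataFrame on the driver before calling write.parquet, since a SparkSession is only usable on the driver.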
I now want to save each read CSV as parquet file.
filenameRowsRDD.foreach(tuple -> {
    SparkContext newContext = MySparkConfig.createNewContext();
    SparkSession newSpark = SparkSession.builder()
        .sparkContext(newContext)
        .getOrCreate();
    Dataset<Row> dataset = newSpark.createDataFrame(tuple._2, datasetSchema.value());
    dataset.write().parquet("s3a://...");
});
I recreate the SparkSession in my executor so that I can use SparkSession.write.
The idea is that this will all run in an executor (I'm hoping). With this approach, reading the file is not an issue; the problem comes when the executor wants to write the output file. It throws an exception: A master URL must be set in your configuration.
It seems like I'm doing this in an anti-Spark way, and it does not work. I have also tried broadcasting my SparkSession, but that throws an NPE inside SparkSession before it even tries to write.
What would be the correct way to approach my problem here?
What would be the Spark way of doing this?
All of the above code is in my main() method. Am I correct in assuming that the first zipRDD runs on the master node, and that the second filenameRowsRDD, as well as the .foreach, runs on the executor nodes?

Dataframe not able to write on S3

I am creating a DataFrame from an existing Hive table. The table is partitioned on the date and site columns. When I overwrite the data in this same table after some computation with the previous day's data, it loads successfully.
But when I try to write the final DataFrame to an S3 bucket, I get a file-not-found error. The file it mentions is the previous day's file, which has now been overwritten.
If I write the DataFrame first and then overwrite the table, it runs fine.
What does writing to an S3 location have to do with the table's partition files?
Below are the error and the code.
java.io.FileNotFoundException: No such file or directory: s3://bucket_1/DM/web_fact_tbl/local_dt=2018-05-10/site_name=ABC/part-00000-882a6e29-eb6a-477c-8b88-6fe853956674.c000
fact_tbl = spark.table('db.web_fact_tbl')
fact_lkp = fact_tbl.filter(fact_tbl['local_dt']=='2018-05-10')
fact_join = fact_lkp.alias('a').join(fact_tbl.alias('b'),(col('a.id') == col('b.id')),"inner").select('a.*')
fact_final = fact_join.union(fact_tbl)
fact_final.coalesce(2).createOrReplaceTempView('cwf')
spark.sql('INSERT OVERWRITE TABLE dm.web_fact_tbl PARTITION (local_dt, site_name) \
SELECT * FROM cwf')
fact_final.write.csv('s3://bucket_1/yahoo')
Before the last line, fact_final is just a "lazy" DataFrame object that contains definitions only. It does not contain any data, but it holds pointers to the exact data files where the data is actually stored.
When you perform an actual operation (whether that is writing to S3 or executing a query like fact_final.count()), you get the error above. It looks like the partition local_dt=2018-05-10 does not exist anymore (the files or folder behind it no longer exist).
You can try re-initializing the DataFrame before the final write (that is another lazy operation; in your case all the work is done when you write to S3).
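The ordering that the question reports as working can be sketched as follows: write the export to S3 first, while the files behind the DataFrame still exist, and only then overwrite the table. The function name export_then_overwrite and its parameters are illustrative assumptions; the SQL mirrors the statement in the question.

```python
def export_then_overwrite(spark, df, export_path, target_table, view_name="cwf"):
    """Write `df` to `export_path`, then overwrite `target_table` from it.

    `spark` is expected to be a SparkSession and `df` a lazy DataFrame that
    still points at the table's current data files.
    """
    # 1. Export first: this action reads the source files while they still exist.
    df.write.csv(export_path)

    # 2. Only then replace the partitions the DataFrame was built from.
    df.coalesce(2).createOrReplaceTempView(view_name)
    spark.sql(
        f"INSERT OVERWRITE TABLE {target_table} "
        "PARTITION (local_dt, site_name) "
        f"SELECT * FROM {view_name}"
    )
```

With the names from the question, the call would be export_then_overwrite(spark, fact_final, 's3://bucket_1/yahoo', 'dm.web_fact_tbl').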

Write each row of a spark dataframe as a separate file

I have a Spark DataFrame with a single column, where each row is a long string (actually an XML file).
I want to go through the DataFrame and save the string from each row as a text file. The files can simply be called 1.xml, 2.xml, and so on.
I cannot seem to find any information or examples on how to do this.
And I am just starting to work with Spark and PySpark.
Maybe I could map a function over the DataFrame, but the function would have to write a string to a text file, and I can't find how to do this.
When saving a dataframe with Spark, one file will be created for each partition. Hence, one way to get a single row per file would be to first repartition the data to as many partitions as you have rows.
There is a library on github for reading and writing XML files with Spark. However, the dataframe needs to have a special format to produce correct XML. In this case, since you have everything as a string in a single column, the easiest way to save would probably be as csv.
The repartition and saving can be done as follows:
rows = df.count()
df.repartition(rows).write.csv('save-dir')
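I would do it this way using Java and the Hadoop FileSystem API. You can write similar code using Python.
List<String> strings = Arrays.asList("file1", "file2", "file3");
JavaRDD<String> stringrdd = new JavaSparkContext().parallelize(strings);
// collect() brings the strings to the driver, so the files are created from the driver
stringrdd.collect().forEach(x -> {
    Path outputPath = new Path(x);
    // getConf() is assumed to be available, e.g. from a class extending Hadoop's Configured
    Configuration conf = getConf();
    try (OutputStream os = FileSystem.get(conf).create(outputPath)) {
        // write the contents for this file to os here
    } catch (IOException e) {
        throw new UncheckedIOException(e);
    }
});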
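A similar approach in Python might look like the sketch below. It assumes the rows are brought to the driver and written to the local filesystem rather than HDFS; the function name write_rows_as_files and its parameters are hypothetical.

```python
import os


def write_rows_as_files(rows, out_dir, extension="xml"):
    """Write each string in `rows` to its own file: 1.xml, 2.xml, and so on."""
    os.makedirs(out_dir, exist_ok=True)
    written = []
    for index, content in enumerate(rows, start=1):
        path = os.path.join(out_dir, f"{index}.{extension}")
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(content)
        written.append(path)
    return written
```

With a single-column DataFrame df, the rows could be streamed to the driver one partition at a time, for example write_rows_as_files((row[0] for row in df.toLocalIterator()), "save-dir"). This runs on the driver only, so it is best suited to cases where driver-side writing is fast enough.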
