Write each row of a spark dataframe as a separate file - apache-spark

I have a Spark DataFrame with a single column, where each row is a long string (actually an XML document).
I want to go through the DataFrame and save the string from each row as a text file; the files can simply be named 1.xml, 2.xml, and so on.
I cannot find any information or examples on how to do this, and I am just starting to work with Spark and PySpark.
Maybe I could map a function over the DataFrame, but the function would have to write the string to a text file, and I can't find out how to do that.

When saving a dataframe with Spark, one file will be created for each partition. Hence, one way to get a single row per file would be to first repartition the data to as many partitions as you have rows.
There is a library on GitHub (spark-xml) for reading and writing XML files with Spark. However, the dataframe needs to have a special format to produce correct XML. In this case, since you have everything as a string in a single column, the easiest way to save would probably be as CSV.
The repartition and saving can be done as follows:
rows = df.count()
df.repartition(rows).write.csv('save-dir')
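If the dataframe is small enough to collect to the driver, another option (outside Spark's writers entirely) is to collect the rows and write each string with plain Python. A minimal sketch, using an in-memory list of strings as a stand-in for the collected rows (`[row[0] for row in df.collect()]` on your single-column dataframe):

```python
import os
import tempfile

# Stand-in for [row[0] for row in df.collect()] on the single-column dataframe
xml_strings = ["<a>1</a>", "<b>2</b>", "<c>3</c>"]

out_dir = tempfile.mkdtemp()
for i, xml in enumerate(xml_strings, start=1):
    # Name the files 1.xml, 2.xml, ... as requested
    with open(os.path.join(out_dir, f"{i}.xml"), "w") as f:
        f.write(xml)

print(sorted(os.listdir(out_dir)))
# → ['1.xml', '2.xml', '3.xml']
```

Note this only works when all rows fit in driver memory; for large data, stick with the repartition-and-write approach above.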

I would do it this way using Java and the Hadoop FileSystem API; you can write similar code in Python.
List<String> strings = Arrays.asList("file1", "file2", "file3");
JavaRDD<String> stringRdd = new JavaSparkContext().parallelize(strings);
stringRdd.collect().forEach(x -> {
    try {
        Path outputPath = new Path(x);
        FileSystem fs = FileSystem.get(new Configuration());
        // Create the file, write the row's contents, and close the stream
        try (OutputStream os = fs.create(outputPath)) {
            os.write(x.getBytes(StandardCharsets.UTF_8));
        }
    } catch (IOException e) {
        throw new UncheckedIOException(e);
    }
});

Related

Spark remove special characters from column name read from a parquet file [duplicate]

This question already has answers here:
Spark Dataframe validating column names for parquet writes
(7 answers)
Closed 2 years ago.
I have parquet files which I have read using the following spark command
lazy val out = spark.read.parquet("/tmp/oip/logprint_poc/feb28eb24ffe44cab60f2832a98795b1.parquet")
The column names of a lot of the columns contain the special character "(", e.g. WA_0_DWHRPD_Purge_Date_(TOD) and WA_0_DWHRRT_Record_Type_(80=Index). How can I remove this special character?
My end goal is to remove these special characters and write the parquet file back using the following command:
df_hive.write.format("parquet").save("hdfs:///tmp/oip/logprint_poc_cleaned/")
Also, I am using Scala spark shell.
I am new to Spark; I saw similar questions, but nothing works in my case. Any help is appreciated.
First thing you can do is read the parquet files into the data frame as you are doing.
val out = spark.read.parquet("/tmp/oip/logprint_poc/feb28eb24ffe44cab60f2832a98795b1.parquet")
Once you have created the data frame, fetch its schema and walk through it, removing all the special characters as below:
import org.apache.spark.sql.types.{StructField, StructType}

val schema = StructType(out.schema.map(x =>
  StructField(
    x.name.toLowerCase().replace(" ", "_").replace("#", "").replace("-", "_")
      .replace(")", "").replace("(", "").trim(),
    x.dataType, x.nullable)))
Now you can read the data back from the parquet files by specifying the schema that you have created.
val newDF = spark.read.format("parquet").schema(schema).load("/tmp/oip/logprint_poc/feb28eb24ffe44cab60f2832a98795b1.parquet")
Now you can go ahead and save the data frame with the cleaned column names.
newDF.write.format("parquet").save("hdfs:///tmp/oip/logprint_poc_cleaned/")
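For PySpark users, the same renaming logic can be mirrored as a plain Python function and applied with df.toDF(*[clean_name(c) for c in df.columns]) (clean_name is a hypothetical helper name); a minimal sketch, checked on the column names from the question:

```python
def clean_name(name: str) -> str:
    # Mirror of the Scala answer: lowercase, spaces/hyphens to underscores,
    # drop '#' and parentheses.
    return (name.lower()
                .replace(" ", "_")
                .replace("#", "")
                .replace("-", "_")
                .replace(")", "")
                .replace("(", "")
                .strip())

cols = ["WA_0_DWHRPD_Purge_Date_(TOD)", "WA_0_DWHRRT_Record_Type_(80=Index)"]
print([clean_name(c) for c in cols])
# → ['wa_0_dwhrpd_purge_date_tod', 'wa_0_dwhrrt_record_type_80=index']
```

Note that "=" is left untouched here, exactly as in the Scala snippet; if parquet also rejects it in your version, add another replace for it.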

Convert CSV files from multiple directory into parquet in PySpark

I have CSV files from multiple paths that are not parent directories in s3 bucket. All the tables have the same partition keys.
the directory of the s3:
table_name_1/partition_key_1 = <pk_1>/partition_key_2 = <pk_2>/file.csv
table_name_2/partition_key_1 = <pk_1>/partition_key_2 = <pk_2>/file.csv
...
I need to convert these csv files into parquet files and store them in another s3 bucket that has the same directory structure.
the directory of another s3:
table_name_1/partition_key_1 = <pk_1>/partition_key_2 = <pk_2>/file.parquet
table_name_2/partition_key_1 = <pk_1>/partition_key_2 = <pk_2>/file.parquet
...
I have a solution: iterate through the s3 bucket, find each CSV file, convert it to parquet, and save it to the other S3 path. I find this inefficient, because I have a loop and do the conversion one file at a time.
I want to utilize the spark library to improve the efficiency.
Then, I tried:
spark.read.csv('s3n://bucket_name/table_name_1/').write.partitionBy('partition_key_1', 'partition_key_2').parquet('s3n://another_bucket/table_name_1')
This works well for each table, but to optimize it further I want to take the table_name as a parameter, something like:
TABLE_NAMES = [table_name_1, table_name_2, ...]
spark.read.csv('s3n://bucket_name/{*TABLE_NAMES}/').write.partitionBy('partition_key_1', 'partition_key_2').parquet('s3n://another_bucket/{*TABLE_NAMES}')
Thanks
The mentioned question provides solutions for reading multiple files at once. The method spark.read.csv(...) accepts one or multiple paths, as shown here. For reading the files you can apply the same logic. However, when it comes to writing, Spark will merge all the given datasets/paths into one dataframe, so it is not possible to write one dataframe out to multiple directories without applying custom logic first. To conclude, there is no method for writing the initial dataframe directly into multiple directories, i.e. df.write.csv(*TABLE_NAMES) does not exist.
The good news is that Spark provides a dedicated function namely input_file_name() which returns the file path of the current record. You can use it in combination with TABLE_NAMES to filter on the table name.
Here is one possible untested PySpark solution:
from pyspark.sql.functions import input_file_name

TABLE_NAMES = [table_name_1, table_name_2, ...]
source_path = "s3n://bucket_name"
input_paths = [f"{source_path}/{t}" for t in TABLE_NAMES]

all_df = spark.read.csv(input_paths) \
              .withColumn("file_name", input_file_name()) \
              .cache()

dest_path = "s3n://another_bucket"

def write_table(table_name: str) -> None:
    all_df.where(all_df["file_name"].contains(table_name)) \
          .write \
          .partitionBy("partition_key_1", "partition_key_2") \
          .parquet(f"{dest_path}/{table_name}")

for t in TABLE_NAMES:
    write_table(t)
Explanation:
We generate and store the input paths into input_paths. This will create paths such as: s3n://bucket_name/table1, s3n://bucket_name/table2 ... s3n://bucket_name/tableN.
Then we load all the paths into one dataframe, to which we add a new column called file_name that holds the path of each row. Notice that we also cache here; this is important since we perform multiple (len(TABLE_NAMES)) actions in the code that follows, and caching prevents us from loading the data source again and again.
Next we create write_table, which is responsible for saving the data for a given table. It filters on the table name with all_df["file_name"].contains(table_name), which returns only the records whose file_name column contains the value of table_name. Finally we save the filtered data as you already did.
In the last step we call write_table for every item of TABLE_NAMES.
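Since the path generation and the contains filter are plain string operations, they can be sanity-checked outside Spark. A hypothetical sketch, with made-up file paths standing in for what input_file_name() would return:

```python
TABLE_NAMES = ["table_name_1", "table_name_2"]
source_path = "s3n://bucket_name"

# Paths that would be passed to spark.read.csv(...)
input_paths = [f"{source_path}/{t}" for t in TABLE_NAMES]
print(input_paths)

# Made-up values that input_file_name() might attach to two records
file_names = [
    "s3n://bucket_name/table_name_1/partition_key_1=a/file.csv",
    "s3n://bucket_name/table_name_2/partition_key_1=b/file.csv",
]

# Mirrors all_df["file_name"].contains(table_name) for one table
rows_for_table_1 = [f for f in file_names if "table_name_1" in f]
print(rows_for_table_1)
```

One caveat of substring matching: it assumes no table name is a substring of another (e.g. "table_1" and "table_12" would collide); matching on a delimited path segment avoids that.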
Related links
How to import multiple csv files in a single load?
Get HDFS file path in PySpark for files in sequence file format

Spark unzip and write CSV as parquet in executor

My issue is that my CSV files come zipped (.csv.zip), so I cannot just use Spark's .csv reader directly as I could with .csv.gzip / .csv.gz. This means I need to decompress each file, read the contents (the files are quite big, ~5 GB), and write them out as parquet files.
My approach is as such:
String paths = "s3a://...,s3a://...,...";
JavaRDD<Tuple2<String, PortableDataStream>> zipRDD = context.binaryFiles(paths, sparkContext.context.defaultParallelism()).toJavaRDD();
JavaRDD<Tuple2<String, List<Row>>> filenameRowsRDD = zipRDD.flatMap(new ConvertLinesToRows());
The first JavaRDD returns pairs of (filename, InputStream). These are passed to the ConvertLinesToRows class, which uses ZipInputStream to read the contents of the CSV files, creates a new Spark Row for each line, and finally returns a tuple pair (filename, List<Row>), where the list contains all lines of the CSV converted to Rows.
I now want to save each read CSV as parquet file.
filenameRowsRDD.foreach(tuple -> {
SparkContext newContext = MySparkConfig.createNewContext();
SparkSession newSpark = SparkSession.builder()
.sparkContext(newContext)
.getOrCreate();
Dataset<Row> dataset = newSpark.createDataFrame(tuple._2, datasetSchema.value());
dataset.write().parquet("s3a://...");
});
I recreate the SparkSession in my executor so as to be able to use SparkSession.write.
The idea is that this will all run in an executor (I'm hoping). However, with this approach, reading the file is not an issue; the problem comes when the executor wants to write the output file, throwing the exception: A master URL must be set in your configuration.
It seems like I'm doing something the anti-Spark way, and it also does not work. I have also tried broadcasting my SparkSession, but that throws an NPE inside SparkSession before I even try to write.
What would be the correct way to approach my problem here ?
What would be the spark way of doing this.
All of the above code is in my main() method. Am I correct in assuming that the first zipRDD runs on the master node, while the second filenameRowsRDD, as well as the .foreach, runs on the executor nodes?

I want to process 20 TB of PDF files in Hadoop in such a way that there is one output per input PDF file

I want to process 20 TB of PDF files in Spark using Tika, in such a way that there is one output file per input PDF.
I am able to do it sequentially, but it takes a lot of time. When doing it in parallel (by giving the whole directory containing the PDF files as input), it takes much less time, but the output consists of part files with overlapping values. Is there any way I can do it in parallel and still get one output per input?
Below is my code :-
val binRDD = sc.binaryFiles("/data")
val textRDD = binRDD.map(file => new org.apache.tika.Tika().parseToString(file._2.open()))
textRDD.saveAsTextFile("/output/")
Get the list of file names into an RDD and then parallelize over it, something like below. I haven't run the code, but it should work, or you can tweak it accordingly.
EDIT: I have run the below code and it's working for me
val files = new File("C:/Users/mavais/Desktop/test").listFiles().filter(_.isFile()).toList
val filesRDD = sc.parallelize(files, 10)

filesRDD.map(r => {
  sc.textFile(r.getPath)
    .map(x => x.toInt * x.toInt)
    .coalesce(1)
    .saveAsTextFile("C:/Users/mavais/Desktop/test/" + r.getAbsolutePath.split("\\\\").last.split("\\.")(0))
}).collect()
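The per-output naming in the snippet above (take the input file's path, keep the base name, drop the extension) is plain string handling and can be sketched in Python; output_dir_for is a hypothetical helper name:

```python
def output_dir_for(input_path: str, out_root: str) -> str:
    # Keep the file's base name without its extension, mirroring
    # r.getAbsolutePath.split("\\\\").last.split("\\.")(0) in the Scala code,
    # but handling both / and \ separators.
    base = input_path.replace("\\", "/").rsplit("/", 1)[-1]
    stem = base.split(".", 1)[0]
    return f"{out_root}/{stem}"

print(output_dir_for("C:\\Users\\mavais\\Desktop\\test\\data1.txt",
                     "C:/Users/mavais/Desktop/test"))
# → C:/Users/mavais/Desktop/test/data1
```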

Apache Spark: Using folder structures to reduce run-time of analyses

I want to optimize the run-time of a Spark application by subdividing a huge csv file into different partitions, dependent of their characteristics.
E.g. I have a column with customer ids (integer, a), a column with dates (month+year, e.g. 01.2015, b), and a column with product ids (integer, c) (and more columns with product specific data, not needed for the partitioning).
I want to build a folder structure like /customer/a/date/b/product/c. When a user wants to know information about products from customer X, sold in January 2016, he could load and analyse the file saved in /customer/X/date/01.2016/*.
Is there a possibility to load such folder structures via wildcards? It should also be possible to load all customers or products for a specific time range, e.g. 01.2015 till 09.2015. Is it possible to use wildcards like /customer/*/date/*.2015/product/c? Or how could a problem like this be solved?
I want to partition the data once, and later load the specific files in the analysis, to reduce the run-time for these jobs (disregarded the additional work for the partitioning).
SOLUTION: Working with Parquet files
I changed my Spark application to save my data to Parquet files; now everything works fine and I can pre-select the data via the folder structure. Here is my code snippet:
JavaRDD<Article> goodRdd = ...
SQLContext sqlContext = new SQLContext(sc);
List<StructField> fields = new ArrayList<StructField>();
fields.add(DataTypes.createStructField("keyStore", DataTypes.IntegerType, false));
fields.add(DataTypes.createStructField("textArticle", DataTypes.StringType, false));
StructType schema = DataTypes.createStructType(fields);
JavaRDD<Row> rowRDD = goodRdd.map(new Function<Article, Row>() {
public Row call(Article article) throws Exception {
return RowFactory.create(article.getKeyStore(), article.getTextArticle());
}
});
DataFrame storeDataFrame = sqlContext.createDataFrame(rowRDD, schema);
// WRITE PARQUET FILES
storeDataFrame.write().partitionBy(fields.get(0).name()).parquet("hdfs://hdfs-master:8020/user/test/parquet/");
// READ PARQUET FILES
DataFrame read = sqlContext.read().option("basePath", "hdfs://hdfs-master:8020/user/test/parquet/").parquet("hdfs://hdfs-master:8020/user/test/parquet/keyStore=1/");
System.out.println("READ : " + read.count());
IMPORTANT
Don't try this out with a table that has only one column! You will get exceptions when you call the partitionBy method!
So, in Spark you can save and read partitioned data much in the way you are looking for. However, rather than creating the path like /customer/a/date/b/product/c, Spark will use the convention /customer=a/date=b/product=c when you save data using:
df.write.partitionBy("customer", "date", "product").parquet("/my/base/path/")
When you need to read in the data, you need to specify the basepath-option like this:
sqlContext.read.option("basePath", "/my/base/path/").parquet("/my/base/path/customer=*/date=*.2015/product=*/")
and everything following /my/base/path/ will be interpreted as columns by Spark. In the example given here, Spark would add the three columns customer, date and product to the dataframe. Note that you can use wildcards for any of the columns as you like.
As for reading in data in a specific time range, you should be aware that Spark uses predicate pushdown, so it will only actually load into memory the data that fits the criteria (as specified by some filter transformation). But if you really want to specify a range explicitly, you can generate a list of path names and pass that to the read function, like this:
val pathsInMyRange = List("/my/path/customer=*/date=01.2015/product=*",
"/my/path/customer=*/date=02.2015/product=*",
"/my/path/customer=*/date=03.2015/product=*"...,
"/my/path/customer=*/date=09.2015/product=*")
sqlContext.read.option("basePath", "/my/base/path/").parquet(pathsInMyRange:_*)
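The explicit path list above does not have to be typed out by hand. A small Python sketch producing the same nine monthly paths (using the base path from the answer):

```python
# Generate partition paths for months 01.2015 through 09.2015
paths = [f"/my/path/customer=*/date={m:02d}.2015/product=*"
         for m in range(1, 10)]
print(paths[0])   # /my/path/customer=*/date=01.2015/product=*
print(paths[-1])  # /my/path/customer=*/date=09.2015/product=*
```

The resulting list can be splatted into the read call exactly as in the Scala snippet (pathsInMyRange:_*).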
Anyway, I hope this helps :)
