Convert CSV files from multiple directories into Parquet in PySpark


I have CSV files under multiple paths in an S3 bucket that do not share a parent directory. All the tables have the same partition keys.
The directory layout of the S3 bucket:
table_name_1/partition_key_1 = <pk_1>/partition_key_2 = <pk_2>/file.csv
table_name_2/partition_key_1 = <pk_1>/partition_key_2 = <pk_2>/file.csv
...
I need to convert these CSV files into Parquet files and store them in another S3 bucket that has the same directory structure.
The directory layout of the other S3 bucket:
table_name_1/partition_key_1 = <pk_1>/partition_key_2 = <pk_2>/file.parquet
table_name_2/partition_key_1 = <pk_1>/partition_key_2 = <pk_2>/file.parquet
...
My current solution iterates through the S3 bucket, finds each CSV file, converts it to Parquet, and saves it to the other S3 path. This is not efficient, because the loop converts one file at a time.
I want to use Spark to improve the efficiency.
So I tried:
spark.read.csv('s3n://bucket_name/table_name_1/').write.partitionBy('partition_key_1', 'partition_key_2').parquet('s3n://another_bucket/table_name_1')
This works well for each table, but to optimize it further, I want to take the table_name as a parameter, something like:
TABLE_NAMES = [table_name_1, table_name_2, ...]
spark.read.csv('s3n://bucket_name/{*TABLE_NAMES}/').write.partitionBy('partition_key_1', 'partition_key_2').parquet('s3n://another_bucket/{*TABLE_NAMES}')
Thanks

The mentioned question provides solutions for reading multiple files at once. The method spark.read.csv(...) accepts one path or a list of paths, so you can apply the same logic for reading the files. When it comes to writing, however, Spark merges all the given datasets/paths into one DataFrame, so it is not possible to produce multiple DataFrames from that single DataFrame without applying custom logic first. In short, there is no method for writing the initial DataFrame directly into multiple directories, i.e. nothing like df.write.csv(*TABLE_NAMES).
The good news is that Spark provides a dedicated function, input_file_name(), which returns the file path of the current record. You can use it in combination with TABLE_NAMES to filter on the table name.
Here is one possible untested PySpark solution:
from pyspark.sql.functions import input_file_name

TABLE_NAMES = ["table_name_1", "table_name_2"]  # ... one entry per table
source_path = "s3n://bucket_name"
input_paths = [f"{source_path}/{t}" for t in TABLE_NAMES]

# spark.read.csv takes a single path or a list of paths
all_df = spark.read.csv(input_paths) \
    .withColumn("file_name", input_file_name()) \
    .cache()

dest_path = "s3n://another_bucket"

def write_table(table_name: str) -> None:
    # keep only the rows whose source file path contains the table name
    all_df.where(all_df["file_name"].contains(table_name)) \
        .write \
        .partitionBy('partition_key_1', 'partition_key_2') \
        .parquet(f"{dest_path}/{table_name}")

for t in TABLE_NAMES:
    write_table(t)
Explanation:
We generate the input paths and store them in input_paths. This creates paths such as s3n://bucket_name/table1, s3n://bucket_name/table2 ... s3n://bucket_name/tableN.
Then we load all the paths into one DataFrame and add a new column called file_name, which holds the source file path of each row. Notice that we also use cache here. This is important because the following code triggers len(TABLE_NAMES) actions, and caching prevents Spark from loading the data source again for each of them.
Next we create write_table, which is responsible for saving the data for the given table. It filters on the table name using all_df["file_name"].contains(table_name), which returns only the records whose file_name column contains the value of table_name. Finally we save the filtered data as you already did.
In the last step we call write_table for every item of TABLE_NAMES.
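One caveat with the substring filter: contains(table_name) also matches the paths of any table whose name merely starts with the same text (table_name_1 is a substring of table_name_10). A minimal sketch of a stricter check, assuming every file path has the form scheme://bucket/table_name/..., is to compare the first path component after the bucket. The helper names below are hypothetical and not part of the answer above; they work on plain strings so the logic can be tried without a cluster.

```python
from urllib.parse import urlparse


def table_name_from_path(file_path: str) -> str:
    """Return the first path component after the bucket (assumed to be the table directory)."""
    # urlparse splits "s3n://bucket/table/key=1/file.csv" into
    # netloc="bucket" and path="/table/key=1/file.csv".
    path = urlparse(file_path).path
    return path.lstrip("/").split("/", 1)[0]


def matches_table(file_path: str, table_name: str) -> bool:
    """Exact match on the table directory instead of a substring match."""
    return table_name_from_path(file_path) == table_name


sample = "s3n://bucket_name/table_name_10/partition_key_1=a/partition_key_2=b/file.csv"
print("table_name_1" in sample)               # substring check: True
print(matches_table(sample, "table_name_1"))  # exact check: False
```

In Spark the same idea could be expressed by extracting that component into a column (for example with regexp_extract from pyspark.sql.functions) and filtering on equality instead of contains.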
Related links
How to import multiple csv files in a single load?
Get HDFS file path in PySpark for files in sequence file format

Related

Add the creation date of a parquet file into a DataFrame

Currently I load multiple Parquet files with this code:
df = spark.read.parquet("/mnt/dev/bronze/Voucher/*/*")
(Inside the Voucher folder there is one folder per date, with one Parquet file inside it.)
How can I add the creation date of each Parquet file to my DataFrame?
Thanks
EDIT 1:
Thanks rainingdistros, I wrote this:
import os
from datetime import datetime, timedelta
Path = "/dbfs/mnt/dev/bronze/Voucher/2022-09-23/"
fileFull = Path +'/'+'XXXXXX.parquet'
statinfo = os.stat(fileFull)
create_date = datetime.fromtimestamp(statinfo.st_ctime)
display(create_date)
Now I must find a way to loop through all the files and add a column in the DataFrame.

The information returned by os.stat might not be accurate unless adding the additional column with the creation time is the first operation performed on these files.
Each time a file is modified, both st_mtime and st_ctime are updated to the modification time. When I modify a file, the change can be observed in the information returned by os.stat.
So, if adding this column is the first operation that is going to be performed on these files, you can use the following code to add this date as a column to your files.
import os
from datetime import datetime
import pandas as pd

path = "/dbfs/mnt/repro/2022-12-01"
for file in os.listdir(path):
    file_path = f"{path}/{file}"
    pdf = pd.read_csv(file_path)
    display(pdf)  # Databricks notebook helper
    # stat the file being processed, not a hard-coded sample file;
    # on Unix st_ctime is the last metadata change, not a true creation time
    statinfo = os.stat(file_path)
    create_date = datetime.fromtimestamp(statinfo.st_ctime)
    pdf['creation_date'] = create_date.date()
    pdf.to_csv(file_path, index=False)
After running the code, each file contains the new creation_date column.
In this case it might be better to take the value directly from the folder name, since the date is already available there; all that needs to be done is to extract it and add it as a column to the files, in a similar manner to the code above.
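Taking the date from the folder name can be sketched as follows. This assumes the layout from the question, where each file sits in a folder named YYYY-MM-DD; date_from_folder is a hypothetical helper name.

```python
from datetime import date, datetime
from pathlib import PurePosixPath


def date_from_folder(file_path: str) -> date:
    """Parse the name of the file's parent folder (e.g. '2022-09-23') as a date."""
    folder = PurePosixPath(file_path).parent.name
    # strptime raises ValueError if the folder name is not a YYYY-MM-DD date
    return datetime.strptime(folder, "%Y-%m-%d").date()


example = date_from_folder("/dbfs/mnt/dev/bronze/Voucher/2022-09-23/XXXXXX.parquet")
print(example)  # 2022-09-23
```

Unlike the os.stat timestamps, this value does not change when the file is rewritten.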

See if the steps below help.
Refer to the link to get the list of files in DBFS: SO - Loop through Files in DBFS
Once you have the files, loop through them and, for each file, use the code you have written in your question.
Please note that dbutils exposes the mtime of a file. The os module provides a way to get the ctime, i.e. the time of the most recent metadata change on Unix. Ideally this should have been st_birthtime, but that did not seem to work in my trials. Hope it works for you.
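A minimal sketch of the loop described above, assuming the files are reachable through the local file API (for example under /dbfs on Databricks). list_file_times is a hypothetical helper; it reports both st_mtime and st_ctime so the two can be compared, and the temporary directory only stands in for the real folder.

```python
import os
import tempfile
from datetime import datetime
from typing import List, Tuple


def list_file_times(root: str) -> List[Tuple[str, datetime, datetime]]:
    """Walk root and return (path, modification time, ctime) for every file."""
    rows = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            full = os.path.join(dirpath, name)
            info = os.stat(full)
            rows.append((
                full,
                datetime.fromtimestamp(info.st_mtime),
                # on Unix this is the last metadata change, not the creation time
                datetime.fromtimestamp(info.st_ctime),
            ))
    return rows


# Stand-in for a folder such as /dbfs/mnt/dev/bronze/Voucher
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "2022-09-23"))
    with open(os.path.join(tmp, "2022-09-23", "a.parquet"), "wb") as fh:
        fh.write(b"demo")
    rows = list_file_times(tmp)
    for path, mtime, ctime in rows:
        print(path, mtime, ctime)
```

The resulting list of tuples could then be turned into a small DataFrame and joined to the data on the file path.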

Pyspark NLTK save output

I'm using Spark 2.3.1 and I'm running NLTK on thousands of input files.
From the input files I'm extracting unigrams, bigrams and trigrams and saving them in different DataFrames.
Now I want to save the DataFrames into their respective files in HDFS (every time appending the output to the same file).
So at the end I have three CSV files named unigram.csv, bigram.csv and trigram.csv containing the results of thousands of input files.
If this scenario isn't possible with HDFS, can you suggest how to do it using local disk as the storage path?

Appending to a file in a normal programming language is not the same as the DataFrame write mode append. Whenever we ask a DataFrame to save to a location, it creates a new file in that folder for every append. The only way you can achieve it is:
Read the old file into dfOld: DataFrame
Combine the old and new DataFrames: dfOld.union(dfNewToAppend)
Combine into a single output file: .coalesce(1)
Write to a new temporary location: /tempWrite
Delete the old HDFS location
Rename the /tempWrite folder to your output folder name
import org.apache.spark.sql.SparkSession
import org.apache.hadoop.fs._

val spark = SparkSession.builder.master("local[*]").getOrCreate
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
// Write your unigram DataFrame, then rename its part file
fs.rename(new Path(".../achyuttest.csv/part-00000..."), new Path("yourNewHDFSDir/unigram.csv"))
// Write your bigram DataFrame, then rename its part file
fs.rename(new Path(".../achyuttest.csv/part-00000..."), new Path("yourNewHDFSDir/bigram.csv"))
// Write your trigram DataFrame, then rename its part file
fs.rename(new Path(".../achyuttest.csv/part-00000"), new Path("yourNewHDFSDir/trigram.csv"))
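For the local-disk variant the question asks about, the same "write to a temporary location, then rename" idea can be sketched in plain Python. This is an illustration under stated assumptions, not the method from the answer above: it assumes Spark wrote header-less part-* files to a local directory, that each part file ends with a newline, and that append_part_files is a hypothetical helper name. The temporary directory in the demo only stands in for real output folders.

```python
import os
import shutil
import tempfile


def append_part_files(part_dir: str, target_csv: str) -> None:
    """Append every part-* file in part_dir to target_csv via a temp file and rename."""
    target_dir = os.path.dirname(os.path.abspath(target_csv))
    fd, tmp_path = tempfile.mkstemp(dir=target_dir, suffix=".tmp")
    with os.fdopen(fd, "wb") as out:
        # copy the existing content first, if there is any
        if os.path.exists(target_csv):
            with open(target_csv, "rb") as old:
                shutil.copyfileobj(old, out)
        # then add the new part files in name order, skipping markers like _SUCCESS
        for name in sorted(os.listdir(part_dir)):
            if name.startswith("part-"):
                with open(os.path.join(part_dir, name), "rb") as part:
                    shutil.copyfileobj(part, out)
    # swap the finished file into place
    os.replace(tmp_path, target_csv)


with tempfile.TemporaryDirectory() as work:
    parts = os.path.join(work, "unigram_batch")
    os.makedirs(parts)
    for name, text in [("part-00000.csv", "the,3\n"),
                       ("part-00001.csv", "of,2\n"),
                       ("_SUCCESS", "")]:
        with open(os.path.join(parts, name), "w") as fh:
            fh.write(text)
    target = os.path.join(work, "unigram.csv")
    append_part_files(parts, target)  # first batch creates the file
    append_part_files(parts, target)  # a later batch is appended
    with open(target) as fh:
        merged = fh.read()
    print(merged)
```

Building the new file next to the target and renaming it at the end means a reader never sees a half-written unigram.csv.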

How to load different files into different tables, based on file pattern?

I'm running a simple PySpark script, like this.
base_path = '/mnt/rawdata/'
file_names = ['2018/01/01/ABC1_20180101.gz',
              '2018/01/02/ABC2_20180102.gz',
              '2018/01/03/ABC3_20180103.gz',
              '2018/01/01/XYZ1_20180101.gz',
              '2018/01/02/XYZ1_20180102.gz']
for f in file_names:
    print(f)
So, just testing this, I can find the files and print the strings just fine. Now, I'm trying to figure out how to load the contents of each file into a specific table in SQL Server. The thing is, I want to do a wildcard search for files that match a pattern, and load specific files into specific tables. So, I would like to do the following:
load all files with 'ABC' in the name, into my 'ABC_Table' and all files with 'XYZ' in the name, into my 'XYZ_Table' (all data starts on row 2, not row 1)
load the file name into a field named 'file_name' in each respective table (I'm totally fine with the entire string from 'file_names' or the part of the string after the last '/' character; doesn't matter)
I tried to use Azure Data Factory for this, and it can recursively loop through all files just fine, but it doesn't get the file names loaded, and I really need the file names in the table to distinguish which records are coming from which files & dates. Is it possible to do this using Azure Databricks? I feel like this is an achievable ETL process, but I don't know enough about ADB to make this work.
Update based on Daniel's recommendation
dfCW = sc.sequenceFile('/mnt/rawdata/2018/01/01/ABC%.gz/').toDF()
dfCW.withColumn('input', input_file_name())
print(dfCW)
Gives me:
com.databricks.backend.daemon.data.common.InvalidMountException:
What can I try next?

You can use input_file_name from pyspark.sql.functions, e.g.
from pyspark.sql.functions import col, input_file_name
withFiles = df.withColumn("file", input_file_name())
Afterwards you can create multiple DataFrames by filtering on the new column:
abc = withFiles.filter(col("file").like("%ABC%"))
xyz = withFiles.filter(col("file").like("%XYZ%"))
Then use the regular writer for each of them.
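The routing rule behind those two filters can be sketched in plain Python, using the file names from the question. The glob pattern *ABC* plays the role of the SQL pattern %ABC%; TABLE_PATTERNS and table_for_file are hypothetical names, and the table names are the ones the question mentions.

```python
from fnmatch import fnmatchcase

# Hypothetical mapping from target table to file-name pattern
TABLE_PATTERNS = {"ABC_Table": "*ABC*", "XYZ_Table": "*XYZ*"}

file_names = ['2018/01/01/ABC1_20180101.gz',
              '2018/01/02/ABC2_20180102.gz',
              '2018/01/03/ABC3_20180103.gz',
              '2018/01/01/XYZ1_20180101.gz',
              '2018/01/02/XYZ1_20180102.gz']


def table_for_file(file_name: str):
    """Return the first table whose pattern matches the file name, or None."""
    for table, pattern in TABLE_PATTERNS.items():
        if fnmatchcase(file_name, pattern):
            return table
    return None


# Group the files by their target table
grouped = {}
for name in file_names:
    grouped.setdefault(table_for_file(name), []).append(name)

for table, names in grouped.items():
    print(table, len(names))
```

Keeping the patterns in one mapping makes it easy to add a third table without touching the filtering code.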

Dataframe not able to write on S3

I am creating a DataFrame from an existing Hive table. The table is partitioned on the date and site columns. When I overwrite the data in this same table after some computation with the previous day's data, it loads successfully.
But when I try to write the final DataFrame to an S3 bucket, I get an error saying the file was not found. The file it mentions is the previous day's file, which has now been overwritten.
If I write the DataFrame first and then overwrite the table, it runs fine.
What does writing to an S3 location have to do with the table's partition files?
Below are the error and the code.
java.io.FileNotFoundException: No such file or directory: s3://bucket_1/DM/web_fact_tbl/local_dt=2018-05-10/site_name=ABC/part-00000-882a6e29-eb6a-477c-8b88-6fe853956674.c000
fact_tbl = spark.table('db.web_fact_tbl')
fact_lkp = fact_tbl.filter(fact_tbl['local_dt']=='2018-05-10')
fact_join = fact_lkp.alias('a').join(fact_tbl.alias('b'),(col('a.id') == col('b.id')),"inner").select('a.*')
fact_final = fact_join.union(fact_tbl)
fact_final.coalesce(2).createOrReplaceTempView('cwf')
spark.sql('INSERT OVERWRITE TABLE dm.web_fact_tbl PARTITION (local_dt, site_name) \
SELECT * FROM cwf')
fact_final.write.csv('s3://bucket_1/yahoo')

Before the last line, fact_final is just a "lazy" DataFrame object that contains definitions only. It does not contain any data, but it holds pointers to the exact data files where the data is actually stored.
When you perform an actual operation (whether writing to S3 or executing a query like fact_final.count()), you get the error above. It looks like the partition local_dt=2018-05-10 no longer exists, i.e. the files/folders behind it no longer exist.
You can try to re-initialize the DataFrame before the final write (that is another lazy operation; in your case all the work is done while writing to S3).

Write each row of a spark dataframe as a separate file

I have a Spark DataFrame with a single column, where each row is a long string (actually an XML file).
I want to go through the DataFrame and save the string from each row as a text file; they can simply be called 1.xml, 2.xml, and so on.
I cannot seem to find any information or examples on how to do this.
I am just starting to work with Spark and PySpark.
Maybe I could map a function over the DataFrame, but the function would have to write a string to a text file, and I can't find how to do this.

When saving a dataframe with Spark, one file will be created for each partition. Hence, one way to get a single row per file would be to first repartition the data to as many partitions as you have rows.
There is a library on github for reading and writing XML files with Spark. However, the dataframe needs to have a special format to produce correct XML. In this case, since you have everything as a string in a single column, the easiest way to save would probably be as csv.
The repartition and saving can be done as follows:
rows = df.count()
df.repartition(rows).write.csv('save-dir')
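If the data is small enough to pass through the driver, an alternative sketch is to write the files with plain Python, which also gives the 1.xml, 2.xml naming the question asks for. This is an assumption-laden illustration rather than the approach above: write_rows_as_files is a hypothetical helper, and the list of strings stands in for the DataFrame rows.

```python
import os
import tempfile


def write_rows_as_files(rows, out_dir, extension=".xml"):
    """Write each string in rows to its own file named 1.xml, 2.xml, ..."""
    os.makedirs(out_dir, exist_ok=True)
    written = []
    for index, text in enumerate(rows, start=1):
        path = os.path.join(out_dir, f"{index}{extension}")
        with open(path, "w", encoding="utf-8") as fh:
            fh.write(text)
        written.append(path)
    return written


with tempfile.TemporaryDirectory() as tmp:
    docs = ["<a>first</a>", "<b>second</b>"]  # stand-in for the DataFrame rows
    paths = write_rows_as_files(docs, tmp)
    names = [os.path.basename(p) for p in paths]
    with open(paths[1], encoding="utf-8") as fh:
        second = fh.read()
    print(names)
```

With a single-column DataFrame df, the rows could be streamed to the driver with something like write_rows_as_files((row[0] for row in df.toLocalIterator()), "save-dir"). It may also be worth checking the output of the repartition approach, since repartition(n) aims for an even spread and I would not rely on it yielding exactly one row per file.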

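I would do it this way in Java with the Hadoop FileSystem API. You can write similar code in Python.
List<String> strings = Arrays.asList("file1", "file2", "file3");
JavaRDD<String> stringrdd = new JavaSparkContext().parallelize(strings);
// collect() brings the rows to the driver; java.util.List uses forEach
stringrdd.collect().forEach(x -> {
    try {
        Path outputPath = new Path(x);
        Configuration conf = getConf();
        FileSystem fs = FileSystem.get(conf);
        // try-with-resources closes the stream once the file is created
        try (OutputStream os = fs.create(outputPath)) {
            // write the row's content to os here
        }
    } catch (IOException e) {
        throw new UncheckedIOException(e);
    }
});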
