Get source files for directory of parquet tables in spark - apache-spark

I have some code where I read in many parquet tables via a directory and wildcard, like this:
df = sqlContext.read.load("some_dir/*")
Is there some way I can get the source file for each row in the resulting DataFrame, df?

Let's create some dummy data and save it in parquet format.
spark.range(1,1000).write.save("./foo/bar")
spark.range(1,2000).write.save("./foo/bar2")
spark.range(1,3000).write.save("./foo/bar3")
Now we can read the data as desired:
import org.apache.spark.sql.functions.input_file_name

spark.read.load("./foo/*")
  .select(input_file_name(), $"id")
  .show(3, false)
// +---------------------------------------------------------------------------------------+---+
// |INPUT_FILE_NAME() |id |
// +---------------------------------------------------------------------------------------+---+
// |file:/home/eliasah/foo/bar/part-r-00002-9554d123-23fc-4524-a900-1cdbd9274cc3.gz.parquet|500|
// |file:/home/eliasah/foo/bar/part-r-00002-9554d123-23fc-4524-a900-1cdbd9274cc3.gz.parquet|501|
// |file:/home/eliasah/foo/bar/part-r-00002-9554d123-23fc-4524-a900-1cdbd9274cc3.gz.parquet|502|
// +---------------------------------------------------------------------------------------+---+
Since Spark 1.6 you can combine the parquet data source and the input_file_name function, as shown above.
This seems to be buggy before Spark 2.x with PySpark, but this is how it's done:
from pyspark.sql.functions import input_file_name
spark.read.load("./foo/*") \
    .select(input_file_name(), "id") \
    .show(3, truncate=False)
# +---------------------------------------------------------------------------------------+---+
# |INPUT_FILE_NAME() |id |
# +---------------------------------------------------------------------------------------+---+
# |file:/home/eliasah/foo/bar/part-r-00002-9554d123-23fc-4524-a900-1cdbd9274cc3.gz.parquet|500|
# |file:/home/eliasah/foo/bar/part-r-00002-9554d123-23fc-4524-a900-1cdbd9274cc3.gz.parquet|501|
# |file:/home/eliasah/foo/bar/part-r-00002-9554d123-23fc-4524-a900-1cdbd9274cc3.gz.parquet|502|
# +---------------------------------------------------------------------------------------+---+
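If you want to keep the source file attached to every row as a regular column rather than selecting it ad hoc, a minimal PySpark variation (the source_file column name is just an example) is:
from pyspark.sql.functions import input_file_name

df = spark.read.load("./foo/*")
df_with_source = df.withColumn("source_file", input_file_name())
df_with_source.show(3, truncate=False)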

Related

Spark Structured Streaming rate limit

I am trying to control the number of records per trigger in Structured Streaming. Is there any function for it? I tried different properties but nothing seems to be working.
import org.apache.spark.sql.streaming.Trigger
val checkpointPath = "/user/akash-singh.bisht#unilever.com/dbacademy/developer-foundations-capstone/checkpoint/orders"
// val outputPath = "/user/akash-singh.bisht#unilever.com/dbacademy/developer-foundations-capstone/raw/orders/stream"
val devicesQuery = df.writeStream
  .outputMode("append")
  .format("delta")
  .queryName("orders")
  .trigger(Trigger.ProcessingTime("1 second"))
  .option("inputRowsPerSecond", 1)
  .option("maxFilesPerTrigger", 1)
  // .option("checkpointLocation", checkpointPath)
  // .start(orders_checkpoint_path)
  .option("checkpointLocation", checkpointPath)
  .table("orders")
Delta uses two options, maxFilesPerTrigger & maxBytesPerTrigger. You already use the first one, and it takes precedence over the second. The actual number of records processed per trigger depends on the size of the input files and the number of records inside them, as Delta processes complete files rather than splitting them into multiple chunks.
But these options need to be specified on the source Delta table, not on the sink as you do right now:
spark.readStream.format("delta")
  .option("maxFilesPerTrigger", "1")
  .load("/delta/events")
  .writeStream
  .format("delta")
  .outputMode("append")
  .option("checkpointLocation", "...")
  .table("orders")
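For completeness, a rough PySpark sketch of the second option, maxBytesPerTrigger, also set on the source (the "10g" threshold here is only an example; Delta treats it as a soft maximum):
spark.readStream.format("delta") \
    .option("maxBytesPerTrigger", "10g") \
    .load("/delta/events") \
    .writeStream \
    .format("delta") \
    .outputMode("append") \
    .option("checkpointLocation", "...") \
    .table("orders")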
Update, just to show that the option works.
Generate test data in directory /Users/user/tmp/abc/:
for i in {1..100}; do echo "{\"id\":$i}" > $i.json; done
Then run the test, using foreachBatch to record which file was processed in which trigger/batch:
import pyspark.sql.functions as F

df = spark.readStream.format("json").schema("id int") \
    .option("maxFilesPerTrigger", "1").load("/Users/user/tmp/abc/")
df2 = df.withColumn("file", F.input_file_name())

def feb(d, e):
    d.withColumn("batch", F.lit(e)).write.format("parquet") \
        .mode("append").save("2.parquet")

stream = df2.writeStream.outputMode("append").foreachBatch(feb).start()
# wait a minute or so
stream.stop()
bdf = spark.read.parquet("2.parquet")
# check content
>>> bdf.show(5, truncate=False)
+---+-----------------------------------+-----+
|id |file                               |batch|
+---+-----------------------------------+-----+
|100|file:///Users/user/tmp/abc/100.json|94   |
|99 |file:///Users/user/tmp/abc/99.json |19   |
|78 |file:///Users/user/tmp/abc/78.json |87   |
|81 |file:///Users/user/tmp/abc/81.json |89   |
|34 |file:///Users/user/tmp/abc/34.json |69   |
+---+-----------------------------------+-----+
# check that each file came in a separate batch
>>> bdf.select("batch").dropDuplicates().count()
100
If I increase maxFilesPerTrigger to 2, then I'll get 50 batches, etc.

How to Convert Pyspark DF to fixedwidth and save

I have a requirement to scan a fixed-width file using a specific schema, and once this is done, the resulting DF with filters applied needs to be converted back to fixed width. How can we apply such a transformation before the file is saved to S3? Below is what I have done.
df = spark.read.text(dataset_path)
# Dataframe with applied selection logic
df = df.select(
    df.value.substr(1, 10).alias('name'),
    df.value.substr(11, 20).alias('another_name'),
    df.value.substr(31, 60).alias('address')
)
df = df.filter(df.name.isin('some_name'))
# Here is the dataframe which I need to convert to FixedWidth before saving.
df.save('s3a://somebucket/somepath')
Is there a way to get this done in PySpark?
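One possible approach (just a sketch, not a definitive solution): pad each column back to its fixed width with rpad, concatenate the pieces into a single string column, and write it with the text writer. The widths below mirror the substr calls above.
from pyspark.sql.functions import rpad, concat, col

# rebuild one fixed-width line per row (widths taken from the substr calls: 10, 20, 60)
fixed_df = df.select(
    concat(
        rpad(col('name'), 10, ' '),
        rpad(col('another_name'), 20, ' '),
        rpad(col('address'), 60, ' ')
    ).alias('value')
)

# the text writer expects a single string column
fixed_df.write.mode('overwrite').text('s3a://somebucket/somepath')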

Can I get metadata of files read by Spark

Let's suppose we have 2 files, file#1 created at 12:55 and file#2 created at 12:58. While reading these two files I want to add a new column "creation_time": rows belonging to file#1 should have 12:55 in the "creation_time" column and rows belonging to file#2 should have 12:58.
new_data = spark.read.option("header", "true").csv("s3://bucket7838-1/input")
I'm using the above code snippet to read the files in the "input" directory.
Use the input_file_name() function to get the filename, then use the HDFS FileSystem API to get the file timestamp, and finally join both dataframes on the filename.
Example:
from pyspark.sql.types import *
from pyspark.sql.functions import *

URI = sc._gateway.jvm.java.net.URI
Path = sc._gateway.jvm.org.apache.hadoop.fs.Path
FileSystem = sc._gateway.jvm.org.apache.hadoop.fs.FileSystem
Configuration = sc._gateway.jvm.org.apache.hadoop.conf.Configuration

fs = FileSystem.get(URI("hdfs://<namenode_address>:8020"), Configuration())
status = fs.listStatus(Path('<hdfs_directory>'))

filestatus_df = spark.createDataFrame(
    [[str(i.getPath()), i.getModificationTime()/1000] for i in status],
    ["filename", "modified_time"]) \
    .withColumn("modified_time", to_timestamp(col("modified_time")))

input_df = spark.read.csv("<hdfs_directory>") \
    .withColumn("filename", input_file_name())

# join both dataframes on filename to get the file timestamp
df = input_df.join(filestatus_df, ['filename'], "left")
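If you need the column to be called creation_time as in the question, a final rename (just a sketch) would be:
df = df.withColumnRenamed("modified_time", "creation_time")
df.show(truncate=False)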
Here are the steps:
Use sparkContext.wholeTextFiles("/path/to/folder/containing/all/files").
The above returns an RDD where the key is the path of the file and the value is the content of the file.
rdd.map(lambda x: x[1]) - this gives you an RDD with only the file contents.
rdd.map(lambda x: customFunctionToProcessFileContent(x)) - this applies your processing function to each file's content.
Since the map function works in parallel, any operations you do will be faster and not sequential - as long as your tasks don't depend on each other, which is the main criterion for parallelism.
import os
import time
import pyspark
from pyspark.sql.functions import udf
from pyspark.sql.types import *
# reading all the files to create PairRDD
input_rdd = sc.wholeTextFiles("file:///home/user/datatest/*",2)
#convert RDD to DF
input_df=spark.createDataFrame(input_rdd)
input_df.show(truncate=False)
'''
+---------------------------------------+------------+
|_1 |_2 |
+---------------------------------------+------------+
|file:/home/user/datatest/test.txt |1,2,3 1,2,3|
|file:/home/user/datatest/test.txt1 |4,5,6 6,7,6|
+---------------------------------------+------------+
'''
input_df.select("_2").take(2)
#[Row(_2=u'1,2,3\n1,2,3\n'), Row(_2=u'4,5,6\n6,7,6\n')]
# function to get the creation time of a file
def time_conversion(filename):
    return time.ctime(os.path.getmtime(filename.split(":")[1]))

# udf registration
time_conversion_udf = udf(time_conversion, StringType())

# apply the udf over the DF
final_df = input_df.withColumn("created_time", time_conversion_udf(input_df['_1']))
final_df.show(2,truncate=False)
'''
+---------------------------------------+------------+------------------------+
|_1 |_2 |created_time |
+---------------------------------------+------------+------------------------+
|file:/home/user/datatest/test.txt |1,2,3 1,2,3|Sat Jul 11 18:31:03 2020|
|file:/home/user/datatest/test.txt1 |4,5,6 6,7,6|Sat Jul 11 18:32:43 2020|
+---------------------------------------+------------+------------------------+
'''
# proceed with the next steps for the implementation
The above works with the default partitioning, though, so you might not get an output file count equal to the input file count (the number of output files equals the number of partitions).
You can repartition the RDD based on the file count or any other unique value in your data, so that the number of output files equals the number of input files. This approach gives you parallelism, but not the performance achieved with an optimal number of partitions.
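For example (a sketch; the output path is arbitrary), repartitioning by the number of input files gives roughly one file's rows per partition and hence per output file:
# each row of input_df corresponds to one input file, so count() is the number of files
num_files = input_df.count()

final_df.repartition(num_files) \
    .write.mode("overwrite").csv("file:///home/user/dataout")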

Problem in reading string NULL values from BigQuery

Currently I am using Spark to read data from BigQuery tables and write it to a storage bucket as CSV. One issue I am facing is that null string values are not being read properly by Spark from BQ. It reads the null string values, but in the CSV it writes them as empty strings wrapped in double quotes (i.e. like this: "").
# Load data from BigQuery.
bqdf = spark.read.format('bigquery') \
    .option('table', <bq_dataset> + <bq_table>) \
    .load()
bqdf.createOrReplaceTempView('bqdf')
# Select required data into another df
bqdf2 = spark.sql('SELECT * FROM bqdf')
# write to GCS
bqdf2.write.csv(<gcs_data_path> + <bq_table> + '/', mode='overwrite', sep='|')
I have tried the emptyValue='' and nullValue options with df.write.csv() while writing to CSV, but it doesn't work.
I need a solution for this problem; if anyone else has faced this issue and can help, thanks!
I was able to reproduce your case and found a solution that worked with a sample table I created in BigQuery, containing a name column and an age column.
According to the PySpark documentation, the class pyspark.sql.DataFrameWriter has an option called nullValue:
nullValue – sets the string representation of a null value. If None is set, it uses the default value, empty string.
This is what you are looking for. I then simply applied the nullValue option, as shown below.
from pyspark import SparkContext
from pyspark.sql import SparkSession

sc = SparkContext()
spark = SparkSession(sc)

# Read the data from BigQuery as a Spark Dataframe.
data = spark.read.format("bigquery").option("table", "dataset.table").load()

# Create a view so that Spark SQL queries can be run against the data.
data.createOrReplaceTempView("data_view")

# Select required data into another df
data_view2 = spark.sql('SELECT * FROM data_view')

# Write to GCS, representing nulls as empty strings without quotes.
data_view2.write.csv('gs://bucket/folder', header=True, nullValue='')

data_view2.show()
Notice that I have used data_view2.show() to print out the view in order to check if it was correctly read. The output was:
+------+---+
|name  |age|
+------+---+
|Robert| 25|
|null  | 23|
+------+---+
Therefore, the null value was precisely interpreted. In addition, I also checked the .csv file:
name,age
Robert,25
,23
As you can see, the null value is correct and not represented as an empty string between double quotes. Finally, as a last check, I created a load job from this .csv file to BigQuery. The table was created and the null value was interpreted accurately.
Note: I ran the pyspark job from the DataProc job's console in a DataProc cluster, previously created. Also, the cluster was at the same location as the dataset in BigQuery.

Spark: Reading CSV files from list of paths in a DataFrame Row

I have a Spark DataFrame as follows:
# ---------------------------------
# - column 1 - ... - column 5 -
# ---------------------------------
# - ... - Array of paths
Columns 1 to 4 contain strings, and the fifth column contains a list of strings that are actually paths to CSV files I wish to read as Spark DataFrames. I cannot find any way to read them. Here's a simplified version with just a single column and the column with the list of paths:
from pyspark.sql import SparkSession,Row
spark = SparkSession \
.builder \
.appName('test') \
.getOrCreate()
simpleRDD = spark.sparkContext.parallelize(range(10))
simpleRDD = simpleRDD.map(lambda x: Row(**{'a':x,'paths':['{}_{}.csv'.format(y**2,y+1) for y in range(x+1)]}))
simpleDF = spark.createDataFrame(simpleRDD)
print(simpleDF.head(5))
This gives:
[Row(a=0, paths=['0_1.csv']),
Row(a=1, paths=['0_1.csv', '1_2.csv']),
Row(a=2, paths=['0_1.csv', '1_2.csv', '4_3.csv']),
Row(a=3, paths=['0_1.csv', '1_2.csv', '4_3.csv', '9_4.csv']),
Row(a=4, paths=['0_1.csv', '1_2.csv', '4_3.csv', '9_4.csv', '16_5.csv'])]
I would like then to do something like this:
simpleDF = simpleDF.withColumn('data',spark.read.csv(simpleDF.paths))
...but this of course, does not work.
from pyspark.sql import SparkSession,Row
from pyspark.sql.types import *
spark = SparkSession \
.builder \
.appName('test') \
.getOrCreate()
inp = [['a', 'b', 'c', 'd', ['abc\t1.txt', 'abc\t2.txt', 'abc\t3.txt', 'abc\t4.txt', 'abc\t5.txt']],
       ['f', 'g', 'h', 'i', ['def\t1.txt', 'def\t2.txt', 'def\t3.txt', 'def\t4.txt', 'def\t5.txt']],
       ['k', 'l', 'm', 'n', ['ghi\t1.txt', 'ghi\t2.txt', 'ghi\t3.txt', 'ghi\t4.txt', 'ghi\t5.txt']]]
inp_data = spark.sparkContext.parallelize(inp)

## Defining the schema
schema = StructType([StructField('field1', StringType(), True),
                     StructField('field2', StringType(), True),
                     StructField('field3', StringType(), True),
                     StructField('field4', StringType(), True),
                     StructField('field5', ArrayType(StringType(), True))])

## Create the Data frames
dataframe = spark.createDataFrame(inp_data, schema)
dataframe.createOrReplaceTempView("dataframe")
dataframe.select("field5").filter("field1='a'").show()
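To actually read the files listed in field5 for a given row, you can collect that row's list and pass it straight to spark.read.csv, which accepts a list of paths (a sketch, assuming the files above exist):
# pull the list of paths out of the matching row
paths = dataframe.filter("field1='a'").select("field5").collect()[0]['field5']

# spark.read.csv accepts a list of paths and reads them into one DataFrame
csv_df = spark.read.csv(paths)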
I'm not sure how you intend to store the DataFrame objects once you read them in from their path, but if it's a matter of accessing the values in your DataFrame column, you can use the .collect() method to return your DataFrame as a list of Row objects (just like an RDD).
Each Row object has a .asDict() method that converts it to a Python dictionary object. Once you're there, you can access the values by indexing the dictionary using its key.
Assuming you're content storing the returned DataFrames in a list, you could try the following:
# collect the DataFrame into a list of Rows
rows = simpleDF.collect()

# collect all the values in your `paths` column
# (note that this will return a list of lists)
paths = map(lambda row: row.asDict().get('paths'), rows)

# flatten the list of lists
paths_flat = [path for path_list in paths for path in path_list]

# get the unique set of paths
paths_unique = list(set(paths_flat))

# instantiate an empty dictionary in which to collect DataFrames
dfs_dict = {}
for path in paths_unique:
    dfs_dict[path] = spark.read.csv(path)
Your dfs_dict will now contain all of your DataFrames. To get the DataFrame of a particular path, you can access it using the path as the dictionary key:
df_0_01 = dfs_dict['0_1.csv']
