Spark s3 csv files read order - apache-spark

Let's say there are three files in an S3 folder. Does spark.read.csv("s3://bucketname/folder1/*.csv") read the files in order or not?
If not, is there a way to order the files while reading the whole folder, given that the files arrive at different time intervals?
File name                            S3 file uploaded / last modified time
s3://bucketname/folder1/file1.csv    01:00:00
s3://bucketname/folder1/file2.csv    01:10:00
s3://bucketname/folder1/file3.csv    01:20:00

You can achieve this as follows.
Iterate over all the files in the bucket and load each CSV, adding a new column modified_date with the object's last-modified time. Keep all the DataFrames in dfs_list. Since PySpark evaluates lazily, the data will not be loaded instantly.
import boto3
from pyspark.sql.functions import lit

s3 = boto3.resource('s3')
my_bucket = s3.Bucket('bucketname')
dfs_list = []
for file_object in my_bucket.objects.filter(Prefix="folder1/"):
    # read each CSV and tag every row with the object's last-modified timestamp
    df = spark.read.csv('s3a://' + file_object.bucket_name + '/' + file_object.key) \
        .withColumn("modified_date", lit(file_object.last_modified))
    dfs_list.append(df)
Now take the union of all the DataFrames using PySpark's unionAll function and then sort the data by modified_date.
from functools import reduce
from pyspark.sql import DataFrame
df_combined = reduce(DataFrame.unionAll, dfs_list)
df_combined = df_combined.orderBy('modified_date')
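If you are on a recent Spark release (3.2 or later), a simpler alternative, sketched below on the assumption that the hidden _metadata column is available for file sources, is to read the whole folder once and order by the per-file modification time:
from pyspark.sql.functions import col
# Minimal sketch, assuming Spark 3.2+ where file sources expose a hidden _metadata column
df = (spark.read.option("header", "true")
      .csv("s3a://bucketname/folder1/*.csv")
      .withColumn("modified_date", col("_metadata.file_modification_time"))
      .orderBy("modified_date"))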

Related

Filter threshold size data to read using PySpark or Python

I have n ORC files in a path; around 150 of them are null or of incomplete size, and I want to ignore those while reading through PySpark.
I have written the following, but I need some help as it's not working.
path = "/home/data/raw_data/"
file_list = os.listdir(path)
for file in file_list:
size=os.path.getsize(os.path.join(path, file))
if size > 6500: # want to import which is greater than 6.5 Mb
file_list.append(size)
raw_df = spark.read.format("orc").load(path)
The issues in your code above are that
file_list.append(size) is not required, and
the Spark read should happen inside the loop.
import os
from pyspark.sql import DataFrame
from functools import reduce

df_list = []
path = "/home/data/raw_data/"
file_list = os.listdir(path)
for file in file_list:
    size = os.path.getsize(os.path.join(path, file))
    if size > 6500:
        # read only the files above the size threshold
        raw_df = spark.read.format("orc").load(path + file)
        df_list.append(raw_df)
df_fnl = reduce(DataFrame.unionByName, df_list)
The number of files can be very large, which makes the loop inefficient.
An alternative is to load all the files and then keep only the ones you need.
You can see the source file of each row with the input_file_name() function.
If you build a helper DataFrame of all the filenames you need, an inner join on input_file_name keeps only the entries from the required files.
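A minimal sketch of that approach, assuming a hypothetical helper DataFrame wanted_files_df built from the same size check (the exact URI format returned by input_file_name may differ depending on the file system):
import os
from pyspark.sql.functions import input_file_name

path = "/home/data/raw_data/"
# Hypothetical helper DataFrame of the files that pass the size check
wanted = [("file://" + os.path.join(path, f),)
          for f in os.listdir(path)
          if os.path.getsize(os.path.join(path, f)) > 6500]
wanted_files_df = spark.createDataFrame(wanted, ["filename"])

# Load everything once, tag each row with its source file, and keep only the wanted files
raw_df = spark.read.format("orc").load(path).withColumn("filename", input_file_name())
df_fnl = raw_df.join(wanted_files_df, on="filename", how="inner").drop("filename")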

how to copy data from variable list of parquet files using pyspark

I have saved the list of parquet files (to be read) in a variable, say listOffilteredFiles.
Now I want to read all the files from this list and write all the data into a single parquet file in another path. How can I do this? I have written the code below and I'm stuck here. Any help would be appreciated.
import os
import time
import datetime
from datetime import datetime
import pandas as pd
import glob
import pyspark
from pyspark.sql import SQLContext

dirName = 'dbfs:/mnt/abc/def/efg'
now = datetime.utcnow()
# Get the list of all files in the directory tree at the given path
listOfFiles = list()
listOffilteredFiles = list()
for (dirpath, dirnames, filenames) in os.walk(dirName):
    listOfFiles += [os.path.join(dirpath, file) for file in filenames]
listOffilteredFiles = filter(lambda x: datetime.utcfromtimestamp(os.path.getmtime(x)) < now, listOfFiles)
Let's assume that the files you're trying to read are parquet files.
You can read all the parquet files from a directory using * syntax.
Suppose you have a directory like this:
/abc/def/[file1.parquet, file2.parquet, file3.parquet]
/abc/ghi/[file1.parquet, file2.parquet, file3.parquet]
and you wanna read all the parquet files under /abc directory. The spark read statement would be:
df = spark.read.parquet('/abc/*/*')
In another scenario, when you need to read some files after filtering them, you can do:
listOffilteredFiles = list(filter(lambda x: datetime.utcfromtimestamp(os.path.getmtime(x)) < now, listOfFiles))
df = spark.read.parquet(*listOffilteredFiles)
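To cover the second half of the question (writing everything back out as a single parquet file), one option is to coalesce to a single partition before writing; the output path below is just a placeholder:
# Coalescing to 1 partition yields a single output file, at the cost of
# funnelling all the data through one task; fine for modest data volumes.
df.coalesce(1).write.mode("overwrite").parquet("dbfs:/mnt/abc/def/output")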

Can I get metadata of files read by Spark

Let's suppose we have 2 files, file#1 created at 12:55 and file#2 created at 12:58. While reading these two files I want to add a new column "creation_time". Rows belonging to file#1 should have 12:55 in the "creation_time" column and rows belonging to file#2 should have 12:58.
new_data = spark.read.option("header", "true").csv("s3://bucket7838-1/input")
I'm using the above code snippet to read the files in the "input" directory.
Use the input_file_name() function to get the filename, then use the HDFS file API to get the file timestamp, and finally join both DataFrames on filename.
Example:
from pyspark.sql.types import *
from pyspark.sql.functions import *
URI = sc._gateway.jvm.java.net.URI
Path = sc._gateway.jvm.org.apache.hadoop.fs.Path
FileSystem = sc._gateway.jvm.org.apache.hadoop.fs.FileSystem
Configuration = sc._gateway.jvm.org.apache.hadoop.conf.Configuration
fs = FileSystem.get(URI("hdfs://<namenode_address>:8020"), Configuration())
status = fs.listStatus(Path('<hdfs_directory>'))
filestatus_df = spark.createDataFrame(
    [[str(i.getPath()), i.getModificationTime() / 1000] for i in status],
    ["filename", "modified_time"]
).withColumn("modified_time", to_timestamp(col("modified_time")))
input_df = spark.read.csv("<hdfs_directory>").withColumn("filename", input_file_name())
# join both dataframes on filename to get the file timestamp
df = input_df.join(filestatus_df, ['filename'], "left")
Here are the steps:
Use sparkContext.wholeTextFiles("/path/to/folder/containing/all/files").
This returns an RDD where the key is the path of the file and the value is the content of the file.
rdd.map(lambda x: x[1]) gives you an RDD with only the file contents.
rdd.map(lambda x: customFunctionToProcessFileContent(x)) applies your processing function to each file's content.
Since the map function works in parallel, any operations you do will run faster and not sequentially, as long as your tasks don't depend on each other, which is the main criterion for parallelism.
import os
import time
import pyspark
from pyspark.sql.functions import udf
from pyspark.sql.types import *
# reading all the files to create PairRDD
input_rdd = sc.wholeTextFiles("file:///home/user/datatest/*",2)
#convert RDD to DF
input_df=spark.createDataFrame(input_rdd)
input_df.show(truncate=False)
'''
+---------------------------------------+------------+
|_1 |_2 |
+---------------------------------------+------------+
|file:/home/user/datatest/test.txt |1,2,3 1,2,3|
|file:/home/user/datatest/test.txt1 |4,5,6 6,7,6|
+---------------------------------------+------------+
'''
input_df.select("_2").take(2)
#[Row(_2=u'1,2,3\n1,2,3\n'), Row(_2=u'4,5,6\n6,7,6\n')]
# function to get the creation (modification) time of a file
def time_conversion(filename):
    # strip the "file:" scheme prefix before calling os.path.getmtime
    return time.ctime(os.path.getmtime(filename.split(":")[1]))

# udf registration
time_conversion_udf = udf(time_conversion, StringType())

# apply the udf over the DF
final_df = input_df.withColumn("created_time", time_conversion_udf(input_df['_1']))
final_df.show(2,truncate=False)
'''
+---------------------------------------+------------+------------------------+
|_1 |_2 |created_time |
+---------------------------------------+------------+------------------------+
|file:/home/user/datatest/test.txt |1,2,3 1,2,3|Sat Jul 11 18:31:03 2020|
|file:/home/user/datatest/test.txt1 |4,5,6 6,7,6|Sat Jul 11 18:32:43 2020|
+---------------------------------------+------------+------------------------+
'''
# proceed with the next steps for the implementation
The above works with the default partitioning, though, so you might not get an output file count equal to the input file count (the output count equals the number of partitions).
You can re-partition the RDD based on the file count, or on any other value unique to your data, so that you end up with an output file count equal to the input count; see the sketch below. This approach gives you parallelism only, not the performance you would get with an optimal number of partitions.
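A minimal sketch of the re-partitioning idea, assuming you want one output file per input file; the output path is a hypothetical placeholder:
# one row per input file in the wholeTextFiles pair RDD, so count() gives the file count
num_input_files = input_df.count()
# re-partition so the write produces as many files as there were inputs
final_df.repartition(num_input_files) \
    .write.mode("overwrite") \
    .csv("file:///home/user/datatest_out")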

Reading Unzipped Shapefiles stored in AWS S3 from AWS EMR Cluster using PySpark in Jupyter Notebook

I'm completely new to AWS EMR and Apache Spark. I'm trying to assign GeoIDs to residential properties using shapefiles. I'm not able to read the shapefiles from my S3 bucket. Please help me understand what is going on, as I couldn't find any answer on the internet that explains the exact problem.
import shapefile
import pandas as pd
from shapely.geometry import shape  # shape() is assumed to come from shapely

def read_shapefile(shp_path):
    """
    Read a shapefile into a Pandas dataframe with a 'coords' column holding
    the geometry information. This uses the pyshp package
    """
    # read file, parse out the records and shapes
    sf = shapefile.Reader(shp_path)
    fields = [x[0] for x in sf.fields][1:]
    records = sf.records()
    shps = [s.points for s in sf.shapes()]
    center = [shape(s).centroid.coords[0] for s in sf.shapes()]
    # write into a dataframe
    df = pd.DataFrame(columns=fields, data=records)
    df = df.assign(coords=shps, centroid=center)
    return df

read_shapefile("s3a://uim-raw-datasets/census-bureau/tabblock-2010/tabblock-by-fips/tl_2010_01001_tabblock10")
[screenshot: the files I want to read]
[screenshot: the error I'm getting while reading from the bucket]
I really want to read these shapefiles in AWS EMR cluster, as it's not possible for me to work locally on them individually. Any kind of help is appreciated.
I was able to read my shapefiles from the S3 bucket as binary objects to begin with, then built a wrapper function around that, and finally passed the individual file objects to the shapefile.Reader() method as .shp, .shx and .dbf separately.
This was happening because PySpark cannot read formats that are not provided to the SparkContext. I found this link helpful: Using pyshp to read a file-like object from a zipped archive.
My solution
def read_shapefile(shp_path):
    import io
    import shapefile
    import pandas as pd
    from shapely.geometry import shape  # shape() is assumed to come from shapely

    # read the raw bytes of every object matching the path onto the driver
    blocks = sc.binaryFiles(shp_path)
    block_dict = dict(blocks.collect())

    # hand the .shp, .shx and .dbf byte streams to pyshp separately
    sf = shapefile.Reader(
        shp=io.BytesIO(block_dict[[i for i in block_dict.keys() if i.endswith(".shp")][0]]),
        shx=io.BytesIO(block_dict[[i for i in block_dict.keys() if i.endswith(".shx")][0]]),
        dbf=io.BytesIO(block_dict[[i for i in block_dict.keys() if i.endswith(".dbf")][0]]))
    fields = [x[0] for x in sf.fields][1:]
    records = sf.records()
    shps = [s.points for s in sf.shapes()]
    center = [shape(s).centroid.coords[0] for s in sf.shapes()]
    # write into a dataframe
    df = pd.DataFrame(columns=fields, data=records)
    df = df.assign(coords=shps, centroid=center)
    return df
block_shapes = read_shapefile("s3a://uim-raw-datasets/census-bureau/tabblock-2010/tabblock-by-fips/tl_2010_01001_tabblock10*")
This works fine without breaking.

Reading multiple parquet files from S3 Bucket [duplicate]

I need to read parquet files from multiple paths that are not parent or child directories.
for example,
dir1 ---
       |
       ------- dir1_1
       |
       ------- dir1_2
dir2 ---
       |
       ------- dir2_1
       |
       ------- dir2_2
sqlContext.read.parquet(dir1) reads parquet files from dir1_1 and dir1_2
Right now I'm reading each dir and merging dataframes using "unionAll".
Is there a way to read parquet files from dir1_2 and dir2_1 without using unionAll, or is there some fancy way using unionAll?
Thanks
A little late but I found this while I was searching and it may help someone else...
You might also try unpacking the argument list to spark.read.parquet()
paths=['foo','bar']
df=spark.read.parquet(*paths)
This is convenient if you want to pass a few glob patterns into the path argument:
basePath='s3://bucket/'
paths=['s3://bucket/partition_value1=*/partition_value2=2017-04-*',
's3://bucket/partition_value1=*/partition_value2=2017-05-*'
]
df=spark.read.option("basePath",basePath).parquet(*paths)
This is cool cause you don't need to list all the files in the basePath, and you still get partition inference.
Both the parquetFile method of SQLContext and the parquet method of DataFrameReader take multiple paths. So either of these works:
df = sqlContext.parquetFile('/dir1/dir1_2', '/dir2/dir2_1')
or
df = sqlContext.read.parquet('/dir1/dir1_2', '/dir2/dir2_1')
In case you have a list of files you can do:
files = ['file1', 'file2',...]
df = spark.read.parquet(*files)
For ORC
spark.read.orc("/dir1/*","/dir2/*")
Spark goes inside the dir1/ and dir2/ folders and loads all the ORC files.
For Parquet,
spark.read.parquet("/dir1/*","/dir2/*")
Just taking John Conley's answer and embellishing it a bit, providing the full code (used in Jupyter PySpark), as I found his answer extremely useful.
from hdfs import InsecureClient
client = InsecureClient('http://localhost:50070')
import posixpath as psp
fpaths = [
psp.join("hdfs://localhost:9000" + dpath, fname)
for dpath, _, fnames in client.walk('/eta/myHdfsPath')
for fname in fnames
]
# At this point fpaths contains all hdfs files
parquetFile = sqlContext.read.parquet(*fpaths)
import pandas
pdf = parquetFile.toPandas()
# display the contents nicely formatted.
pdf
In Spark-Scala you can do this.
val df = spark.read.option("header","true").option("basePath", "s3://bucket/").csv("s3://bucket/{sub-dir1,sub-dir2}/")
