Can I get metadata of files read by Spark - apache-spark

Let's suppose we have 2 files: file#1 created at 12:55 and file#2 created at 12:58. While reading these two files I want to add a new column "creation_time". Rows belonging to file#1 should have 12:55 in the "creation_time" column and rows belonging to file#2 should have 12:58.
new_data = spark.read.option("header", "true").csv("s3://bucket7838-1/input")
I'm using the above code snippet to read the files in the "input" directory.

Use the input_file_name() function to get the filename, then use the HDFS FileSystem API to get the file timestamps, and finally join both DataFrames on the filename.
Example:
from pyspark.sql.types import *
from pyspark.sql.functions import *
URI = sc._gateway.jvm.java.net.URI
Path = sc._gateway.jvm.org.apache.hadoop.fs.Path
FileSystem = sc._gateway.jvm.org.apache.hadoop.fs.FileSystem
Configuration = sc._gateway.jvm.org.apache.hadoop.conf.Configuration
# list the file statuses of the input directory through the Hadoop FileSystem API
fs = FileSystem.get(URI("hdfs://<namenode_address>:8020"), Configuration())
status = fs.listStatus(Path('<hdfs_directory>'))
# build a (filename, modified_time) lookup dataframe from the file statuses
filestatus_df = spark.createDataFrame(
    [[str(i.getPath()), i.getModificationTime() / 1000] for i in status],
    ["filename", "modified_time"]) \
    .withColumn("modified_time", to_timestamp(col("modified_time")))
# read the data and tag each row with the file it came from
input_df = spark.read.csv("<hdfs_directory>") \
    .withColumn("filename", input_file_name())
# join both dataframes on filename to get the file timestamp
df = input_df.join(filestatus_df, ['filename'], "left")
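To get the column name the question asks for, you can simply rename the joined timestamp; a minimal sketch, assuming the joined df from above:
# rename the joined modification timestamp to the requested "creation_time"
result = df.withColumnRenamed("modified_time", "creation_time")
result.show(truncate=False)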

Here are the steps:
Use sparkContext.wholeTextFiles("/path/to/folder/containing/all/files").
The above returns an RDD where the key is the path of the file and the value is the content of the file.
rdd.map(lambda x: x[1]) - this gives you an RDD with only the file contents.
rdd.map(lambda x: customFunctionToProcessFileContent(x))
Since the map function works in parallel, any operations you do will be faster and not sequential - as long as your tasks don't depend on each other, which is the main criterion for parallelism.
import os
import time
import pyspark
from pyspark.sql.functions import udf
from pyspark.sql.types import *
# reading all the files to create PairRDD
input_rdd = sc.wholeTextFiles("file:///home/user/datatest/*",2)
#convert RDD to DF
input_df=spark.createDataFrame(input_rdd)
input_df.show(truncate=False)
'''
+---------------------------------------+------------+
|_1 |_2 |
+---------------------------------------+------------+
|file:/home/user/datatest/test.txt |1,2,3 1,2,3|
|file:/home/user/datatest/test.txt1 |4,5,6 6,7,6|
+---------------------------------------+------------+
'''
input_df.select("_2").take(2)
#[Row(_2=u'1,2,3\n1,2,3\n'), Row(_2=u'4,5,6\n6,7,6\n')]
# function to get the creation (modification) time of a file
def time_conversion(filename):
    # strip the "file:" scheme prefix and look up the mtime on the local filesystem
    return time.ctime(os.path.getmtime(filename.split(":")[1]))
# udf registration
time_conversion_udf = udf(time_conversion, StringType())
# apply the udf over the DF
final_df = input_df.withColumn("created_time", time_conversion_udf(input_df['_1']))
final_df.show(2,truncate=False)
'''
+---------------------------------------+------------+------------------------+
|_1 |_2 |created_time |
+---------------------------------------+------------+------------------------+
|file:/home/user/datatest/test.txt |1,2,3 1,2,3|Sat Jul 11 18:31:03 2020|
|file:/home/user/datatest/test.txt1 |4,5,6 6,7,6|Sat Jul 11 18:32:43 2020|
+---------------------------------------+------------+------------------------+
'''
# proceed with the next steps for the implementation
The above works with the default partitioning, though, so you might not get an output file count equal to the input file count (the number of output files equals the number of partitions).
You can repartition the RDD based on that count, or on any other unique value in your data, so that you end up with an output file count equal to the input count; a sketch follows below. This approach still gives you parallelism, but not the performance you would get with an optimal number of partitions.
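A minimal sketch of that repartitioning, assuming final_df from above and a hypothetical local output path:
# one partition (and hence one output file) per distinct input file path in "_1"
num_input_files = final_df.select("_1").distinct().count()
final_df.repartition(num_input_files, "_1") \
    .write.mode("overwrite").csv("file:///home/user/dataout")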

Related

Spark s3 csv files read order

Let's say there are three files in an S3 folder. Does reading them with spark.read.csv(s3:bucketname/folder1/*.csv) read the files in order or not?
If not, is there a way to order the files while reading the whole folder, with multiple files received at different time intervals?
File name                           s3 file uploaded/Last modified time
s3:bucketname/folder1/file1.csv     01:00:00
s3:bucketname/folder1/file2.csv     01:10:00
s3:bucketname/folder1/file3.csv     01:20:00
You can achieve this as follows.
Iterate over all the files in the bucket and load each CSV, adding a new column last_modified. Keep a list of all the DataFrames that will be loaded in dfs_list. Since PySpark does lazy evaluation, it will not load the data instantly.
import boto3
from pyspark.sql import functions as f
s3 = boto3.resource('s3')
my_bucket = s3.Bucket('bucketname')
dfs_list = []
for file_object in my_bucket.objects.filter(Prefix="folder1/"):
    # file_object.key is the object path inside the bucket,
    # file_object.last_modified is its last-modified timestamp
    df = spark.read.csv('s3a://' + my_bucket.name + '/' + file_object.key) \
        .withColumn("modified_date", f.lit(str(file_object.last_modified)))
    dfs_list.append(df)
Now take the union of all the DataFrames using PySpark's unionAll function and then sort the data by modified_date.
from functools import reduce
from pyspark.sql import DataFrame
df_combined = reduce(DataFrame.unionAll, dfs_list)
df_combined = df_combined.orderBy('modified_date')

Pyspark - Index from monotonically_increasing_id changes after list aggregation

I'm creating an index using the monotonically_increasing_id() function in Pyspark 3.1.1.
I'm aware of the specific characteristics of that function, but they don't explain my issue.
After creating the index I do a simple aggregation applying the collect_list() function on the created index.
If I compare the results, the index changes in certain cases, specifically at the upper end of the long range, when the input data is not too small.
Full example code:
import random
import string
from pyspark.sql import SparkSession
from pyspark.sql import functions as f
from pyspark.sql.types import StructType, StructField, StringType
spark = SparkSession.builder\
.appName("test")\
.master("local")\
.config('spark.sql.shuffle.partitions', '8')\
.getOrCreate()
# Create random input data of around length 100000:
input_data = []
ii = 0
while ii <= 100000:
    L = random.randint(1, 3)
    B = ''.join(random.choices(string.ascii_uppercase, k=5))
    for i in range(L):
        C = random.randint(1, 100)  # C is generated but not used further
        input_data.append((B,))
    ii += 1
# Create Spark DataFrame:
input_rdd = spark.sparkContext.parallelize(tuple(input_data))
schema = StructType([StructField("B", StringType())])
dg = spark.createDataFrame(input_rdd, schema=schema)
# Create id and aggregate:
dg = dg.sort("B").withColumn("ID0", f.monotonically_increasing_id())
dg2 = dg.groupBy("B").agg(f.collect_list("ID0"))
Output:
dg.sort('B', ascending=False).show(10, truncate=False)
dg2.sort('B', ascending=False).show(5, truncate=False)
This of course creates different data with every run, but if the length is large enough (the problem appears slightly at 10000, but not at 1000), it should appear every time. Here's an example result:
+-----+-----------+
|B |ID0 |
+-----+-----------+
|ZZZVB|60129554616|
|ZZZVB|60129554617|
|ZZZVB|60129554615|
|ZZZUH|60129554614|
|ZZZRW|60129554612|
|ZZZRW|60129554613|
|ZZZNH|60129554611|
|ZZZNH|60129554609|
|ZZZNH|60129554610|
|ZZZJH|60129554606|
+-----+-----------+
only showing top 10 rows
+-----+---------------------------------------+
|B |collect_list(ID0) |
+-----+---------------------------------------+
|ZZZVB|[60129554742, 60129554743, 60129554744]|
|ZZZUH|[60129554741] |
|ZZZRW|[60129554739, 60129554740] |
|ZZZNH|[60129554736, 60129554737, 60129554738]|
|ZZZJH|[60129554733, 60129554734, 60129554735]|
+-----+---------------------------------------+
only showing top 5 rows
The entry ZZZVB has the three IDs 60129554615, 60129554616, and 60129554617 before aggregation, but after aggregation the numbers have changed to 60129554742, 60129554743, 60129554744.
Why? I can't imagine this is supposed to happen. Isn't the result of monotonically_increasing_id() a simple long that keeps its value after having been created?
EDIT: As expected a workaround is to coalesce(1) the DataFrame before creating the id.
dg and dg2 are two different dataframes, each with their own DAG. These DAGs are executed independently from each other when an action on one of the dataframes is called. So each time show() is called, the DAG of the respective dataframe is evaluated, and during that evaluation f.monotonically_increasing_id() is called.
To prevent f.monotonically_increasing_id() being called twice, you could add a cache after the withColumn transformation:
dg = dg.sort("B").withColumn("ID0", f.monotonically_increasing_id()).cache()
With the cache, the result of the first evaluation of f.monotonically_increasing_id() is cached and reused when evaluating the second dataframe.
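A minimal sketch of that, using the variables from the question; the extra count() is just one way to force the cache to be populated before either DataFrame is shown:
dg = dg.sort("B").withColumn("ID0", f.monotonically_increasing_id()).cache()
dg.count()  # action that materializes the cache, fixing the ID0 values
dg2 = dg.groupBy("B").agg(f.collect_list("ID0"))
dg.sort('B', ascending=False).show(10, truncate=False)
dg2.sort('B', ascending=False).show(5, truncate=False)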

A quick way to get the mean of each position in large RDD

I have a large RDD (more than 1,000,000 lines), where each line has four elements A, B, C, D in a tuple. A head scan of the RDD looks like:
[(492,3440,4215,794),
(6507,6163,2196,1332),
(7561,124,8558,3975),
(423,1190,2619,9823)]
Now I want to find the mean of each position in this RDD. For example, for the data above I need an output list with the values:
(492+6507+7561+423)/4
(3440+6163+124+1190)/4
(4215+2196+8558+2619)/4
(794+1332+3975+9823)/4
which is:
[(3745.75,2729.25,4397.0,3981.0)]
Since the RDD is very large, it is not convenient to calculate the sum of each position and then divide by the length of the RDD. Is there any quick way to get the output? Thank you very much.
I don't think there is anything faster than calculating the mean (or sum) for each column.
If you are using the DataFrame API you can simply aggregate multiple columns:
import os
from pyspark.sql import functions as f
from pyspark.sql import SparkSession
# start local spark session
spark = SparkSession.builder.getOrCreate()
# load as rdd and parse each comma-separated line into a list of floats
# (adjust the parsing to your actual file format)
def localpath(path):
    return 'file://' + os.path.join(os.path.abspath(os.path.curdir), path)
rdd = spark.sparkContext.textFile(localpath('myPosts/')) \
    .map(lambda line: [float(x) for x in line.split(',')])
# create data frame from rdd
df = spark.createDataFrame(rdd)
means_df = df.agg(*[f.avg(c) for c in df.columns])
means_dict = means_df.first().asDict()
print(means_dict)
Note that the dictionary keys will be the default Spark column names ('0', '1', ...). If you want more descriptive column names you can pass them as an argument to the createDataFrame command, as in the sketch below.
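A minimal sketch of that, assuming the parsed rdd from above; the column names A, B, C, D are illustrative:
# pass explicit column names so the aggregated keys are readable
df = spark.createDataFrame(rdd, ["A", "B", "C", "D"])
means_dict = df.agg(*[f.avg(c) for c in df.columns]).first().asDict()
print(means_dict)  # e.g. {'avg(A)': 3745.75, 'avg(B)': 2729.25, ...}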

Spark: Reading CSV files from list of paths in a DataFrame Row

I have a Spark DataFrame as follows:
# ---------------------------------
# - column 1 - ... - column 5 -
# ---------------------------------
# - ... - Array of paths
Columns 1 to 4 contain strings and the fifth column contains a list of strings that are actually paths to CSV files I wish to read as Spark DataFrames. I cannot find any way to read them. Here's a simplified version with just a single column and the column with the list of paths:
from pyspark.sql import SparkSession,Row
spark = SparkSession \
.builder \
.appName('test') \
.getOrCreate()
simpleRDD = spark.sparkContext.parallelize(range(10))
simpleRDD = simpleRDD.map(lambda x: Row(**{'a':x,'paths':['{}_{}.csv'.format(y**2,y+1) for y in range(x+1)]}))
simpleDF = spark.createDataFrame(simpleRDD)
print(simpleDF.head(5))
This gives:
[Row(a=0, paths=['0_1.csv']),
Row(a=1, paths=['0_1.csv', '1_2.csv']),
Row(a=2, paths=['0_1.csv', '1_2.csv', '4_3.csv']),
Row(a=3, paths=['0_1.csv', '1_2.csv', '4_3.csv', '9_4.csv']),
Row(a=4, paths=['0_1.csv', '1_2.csv', '4_3.csv', '9_4.csv', '16_5.csv'])]
I would then like to do something like this:
simpleDF = simpleDF.withColumn('data',spark.read.csv(simpleDF.paths))
...but this, of course, does not work.
from pyspark.sql import SparkSession,Row
from pyspark.sql.types import *
spark = SparkSession \
.builder \
.appName('test') \
.getOrCreate()
inp=[['a','b','c','d',['abc\t1.txt','abc\t2.txt','abc\t3.txt','abc\t4.txt','abc\t5.txt',]],
['f','g','h','i',['def\t1.txt','def\t2.txt','def\t3.txt','def\t4.txt','def\t5.txt',]],
['k','l','m','n',['ghi\t1.txt','ghi\t2.txt','ghi\t3.txt','ghi\t4.txt','ghi\t5.txt',]]
]
inp_data=spark.sparkContext.parallelize(inp)
## Defining the schema
schema = StructType([
    StructField('field1', StringType(), True),
    StructField('field2', StringType(), True),
    StructField('field3', StringType(), True),
    StructField('field4', StringType(), True),
    StructField('field5', ArrayType(StringType(), True))
])
## Create the Data frames
dataframe=spark.createDataFrame(inp_data,schema)
dataframe.createOrReplaceTempView("dataframe")
dataframe.select("field5").filter("field1='a'").show()
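One possible continuation, not shown in the answer above and assuming the listed paths point to CSV files that actually exist: collect the path array for one row and pass the whole list to spark.read.csv, which accepts a list of paths.
paths = dataframe.filter("field1='a'").select("field5").first()[0]
csv_df = spark.read.csv(paths)  # reads all listed files into one DataFrame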
I'm not sure how you intend to store the DataFrame objects once you read them in from their path, but if it's a matter of accessing the values in your DataFrame column, you can use the .collect() method to return your DataFrame as a list of Row objects (just like an RDD).
Each Row object has a .asDict() method that converts it to a Python dictionary object. Once you're there, you can access the values by indexing the dictionary using its key.
Assuming you're content storing the returned DataFrames in a list, you could try the following:
# collect the DataFrame into a list of Rows
rows = simpleDF.collect()
# collect all the values in your `paths` column
# (note that this will return a list of lists)
paths = map(lambda row: row.asDict().get('paths'), rows)
# flatten the list of lists
paths_flat = [path for path_list in paths for path in path_list]
# get the unique set of paths
paths_unique = list(set(paths_flat))
# instantiate an empty dictionary in which to collect DataFrames
dfs_dict = {}
for path in paths_unique:
    dfs_dict[path] = spark.read.csv(path)
Your dfs_dict will now contain all of your DataFrames. To get the DataFrame of a particular path, you can access it using the path as the dictionary key:
df_0_01 = dfs_dict['0_1.csv']

Generate single json file for pyspark RDD

I am building a Python script in which I need to generate a JSON file from a JSON RDD.
The following is the code snippet for saving the JSON file:
jsonRDD.map(lambda x: json.loads(x)) \
    .coalesce(1, shuffle=True) \
    .saveAsTextFile('examples/src/main/resources/demo.json')
But I need to write the JSON data to a single file instead of having the data distributed across several partitions.
So please suggest an appropriate solution for this.
Without the use of additional libraries like pandas, you could save your RDD of several jsons by reducing them to one big string of jsons, each separated by a new line:
# perform your operation
# note that you do not need a lambda expression for json.loads
jsonRDD = jsonRDD.map(json.loads).coalesce(1, shuffle=True)
# map jsons back to string
jsonRDD = jsonRDD.map(json.dumps)
# reduce to one big string with one json on each line
json_string = jsonRDD.reduce(lambda x, y: x + "\n" + y)
# write your string to a file
with open("path/to/your.json", "w", encoding="utf-8") as f:
    f.write(json_string)
I have had issues with PySpark saving off JSON files once I have them in an RDD or DataFrame, so what I do is convert them to a pandas DataFrame and save them to a non-distributed directory.
import pandas
df1 = sqlContext.createDataFrame(yourRDD)
df2 = df1.toPandas()
df2.to_json(yourpath)
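If you want the output to be newline-delimited JSON (one object per line, matching what Spark itself writes), a minimal sketch with an illustrative path:
# pandas can write newline-delimited JSON records directly
df2.to_json('examples/src/main/resources/demo.json', orient='records', lines=True)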
