Conditional loading of partitions from file-system - apache-spark

I am aware that there have been questions regarding wildcards in pySpark's .load() function, like here or here.
Anyhow, none of the questions/answers I found dealt with my variation of it.
Context
In pySpark I want to load files directly from HDFS because I have to use the Databricks avro library for Spark 2.3.x. I'm doing so like this:
partition_stamp = "202104"

df = spark.read.format("com.databricks.spark.avro") \
    .load(f"/path/partition={partition_stamp}*") \
    .select("...")
As you can see, the partitions derive from timestamps in the format yyyyMMdd.
Question
Currently I only get the partitions for April 2021 (partition_stamp = "202104").
However, I need all partitions starting from April 2021.
Written as pseudo-code, I'd need something like this:
.load(f"/path/partition >= {partition_stamp}*")
Since there actually exist several hundred partitions, any approach that requires hard-coding them is not an option.
So my question is: Is there a function for conditional file-loading?

As I learned, there are only the following options for dynamically processing paths inside the .load() function (illustrated in the sketch after this list):
*: Wildcard for any character or sequence of characters until the end of the line or a new sub-directory ('/') -> (/path/20200*)
[1-3]: Regex-like inclusion of a defined character-range -> (/path/20200[1-3]/...)
{1,2,3}: Set-like inclusion of a defined set of characters -> (/path/20200{1,2,3}/...)
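For illustration, this is roughly how each pattern would look inside a full .load() call (the paths are hypothetical):
# Hypothetical examples of the three glob patterns accepted by .load()
df_any   = spark.read.format("com.databricks.spark.avro").load("/path/partition=20200*")        # wildcard: every partition starting with 20200
df_range = spark.read.format("com.databricks.spark.avro").load("/path/partition=20200[1-3]")    # character range: 202001, 202002, 202003
df_set   = spark.read.format("com.databricks.spark.avro").load("/path/partition=20200{1,2,3}")  # explicit set: 202001, 202002, 202003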
Thus, to answer my question: There is no built-in function for conditional file-loading.
Anyhow, I want to share my solution:
import pandas as pd  # utilize pandas date functions

# start_date and end_date can be date strings or datetime objects accepted by pd.date_range
partition_stamp = ",".join(set(
    str(_range.year) + "{:02}".format(_range.month)
    for _range in pd.date_range(start=start_date, end=end_date, freq="D")
))

df = spark.read.format("com.databricks.spark.avro") \
    .load(f"/path/partition={{{partition_stamp}}}*") \
    .select("...")
This way the set of yyyyMM partition stamps is generated dynamically for a given start and end date, and the string-based .load() remains usable.
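For illustration, here is what the generated argument looks like for a hypothetical range from 2021-04-01 to 2021-06-15 (the stamps come from a set, so their order may vary):
import pandas as pd

start_date, end_date = "2021-04-01", "2021-06-15"  # hypothetical example range
partition_stamp = ",".join(set(
    str(d.year) + "{:02}".format(d.month)
    for d in pd.date_range(start=start_date, end=end_date, freq="D")
))
print(partition_stamp)                             # e.g. 202104,202105,202106
print(f"/path/partition={{{partition_stamp}}}*")   # /path/partition={202104,202105,202106}*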

Related

read only non-merged files in pyspark

I have N deltas in N folders (ex. /user/deltas/1/delta1.csv, /user/deltas/2/delta2.csv, ..., /user/deltas/n/deltaN.csv).
All deltas have the same columns; only the information in the columns differs.
I have code for reading my CSV files from the folder "deltas":
dfTable = spark.read.format("csv").option("recursiveFileLookup", "true")\
    .option("header", "true").load("/home/user/deltas/")
I am going to use deltaTable.merge to merge and update the information from the deltas and write the updated information into a table (main_table.csv).
For example, tomorrow I will have a new delta with further updated information, and I will run my code again to refresh the data in my main_table.csv.
How can I avoid re-reading deltas that have already been merged into main_table.csv by deltaTable.merge earlier?
Is it maybe possible to change the file type after a delta has been processed, for example to Parquet, and avoid re-using the deltas that way, since I am reading CSV files and not Parquet? Or something like log files, etc.?
I think a time path filter might work well for your use case. If you are running your code daily (either manually or with a job), then you could use the modifiedAfter parameter to only load files that were modified within the last day (or however often you rerun this code).
from datetime import datetime, timedelta

timestamp_last_run = (datetime.now() - timedelta(days=1)).strftime("%Y-%m-%dT%H:%M:%S")

dfTable = spark.read.format("csv").option("recursiveFileLookup", "true")\
    .option("header", "true").load("/home/user/deltas/", modifiedAfter=timestamp_last_run)
## ...perform merge operation and save data in main_table.csv
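For completeness, a minimal sketch of the merge step, assuming main_table is stored in Delta format (DeltaTable.merge requires a Delta table, not a plain CSV file) and that rows are matched on a hypothetical key column id:
from delta.tables import DeltaTable

# Hypothetical: the main table was previously written as a Delta table at this path,
# and "id" is the key column used to match rows between the table and the new deltas.
main_table = DeltaTable.forPath(spark, "/home/user/main_table")

main_table.alias("main") \
    .merge(dfTable.alias("updates"), "main.id = updates.id") \
    .whenMatchedUpdateAll() \
    .whenNotMatchedInsertAll() \
    .execute()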

es.read.source.filter vs. es.read.field.include when reading data with elasticsearch-hadoop

When reading data from Elasticsearch with elasticsearch-hadoop, there are two options to specify how to read a subset of fields from the source, according to the official documentation, i.e.:
es.read.field.include: Fields/properties that are parsed and considered when reading the documents from Elasticsearch...;
es.read.source.filter: ...this property allows you to specify a comma delimited string of field names that you would like to return from Elasticsearch.
Both can be set as a comma-separated string.
I have tested the two options and found that only es.read.field.include works as expected, while es.read.source.filter has no effect (all data fields are returned).
Question: What is the difference between these two options, and why does es.read.source.filter not take effect?
Here is the code used to read the data:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('TestSparkElasticsearch').getOrCreate()

es_options = {"nodes": "node1,node2,node3", "port": 9200}

df = spark.read\
    .format("org.elasticsearch.spark.sql")\
    .options(**es_options)\
    .option("es.read.source.filter", 'sip,dip,sport,dport,protocol,tcp_flag')\
    .load("test_flow_tcp")

df.printSchema()  # returns all the fields; the option does not work

df = spark.read\
    .format("org.elasticsearch.spark.sql")\
    .options(**es_options)\
    .option("es.read.field.include", 'sip,dip,sport,dport,protocol,tcp_flag')\
    .load("test_flow_tcp")

df.printSchema()  # only returns the specified fields, as expected
Update: es.read.source.filter was added in Elasticsearch 5.4 and I am using version 5.1, hence the option is ignored. Nevertheless, the difference between these two options in newer versions is still not clearly explained in the documentation.

Different delimiters on different lines in the same file for Databricks Spark

I have a file that has a mix of comma delimited lines and pipe delimited lines I need to import into Databricks.
Is it possible to indicate the use of two or more different separators when creating a SQL table in Databricks/Spark?
I see lots of posts for multiple character separators, but nothing on different separators.
https://forums.databricks.com/questions/10743/how-to-read-file-in-pyspark-with-delimiter.html
Possible to handle multi character delimiter in spark
http://blog.madhukaraphatak.com/spark-3-introduction-part-1
etc.
I'm currently using something like this.
create table myschema.mytable (
  foo string,
  bar string
)
using csv
options (
  header = "true",
  delimiter = ","
);
One method you could try is to create a Spark DataFrame first and then make a table out of it. Below is an example for a hypothetical case using PySpark where the delimiters are | and -.
BEWARE: we are using split, which means it will split everything, e.g. 2000-12-31 is a single value, yet it will be split. Therefore we should be very sure that no such case ever occurs in the data. As general advice, one should never accept these types of files, as they are accidents waiting to happen.
How the sample data looks: in this case we have 2 files in our directory, with | and - occurring randomly as delimiters.
# Create RDD. Basically read as a simple text file.
# sc is the Spark context
rddRead = sc.textFile("/mnt/adls/RI_Validation/ReadMulktipleDelimerFile/Sample1/")
rddRead.collect()  # For debugging

import re  # Import for the usual Python regex

# Create another RDD using simple string operations. This will be similar to a list of lists.
# Give a regex expression to split your string based on the anticipated delimiters (this could be
# dangerous if those delimiters occur as part of a value, e.g. 2021-12-31 is a single value in
# reality. But this is a price we have to pay for not having good data).
# For each iteration, k represents 1 element which would eventually become 1 row (e.g. A|33-Mech)
rddSplit = rddRead.map(lambda k: re.split("[|-]+", k))  # Anticipated delimiters are | OR - in this case.
rddSplit.collect()  # For debugging

# This block is applicable only if you have headers
lsHeader = rddSplit.first()  # Get the first element from the RDD as the header.
print(lsHeader)  # For debugging
print()

# Remove rows representing the header. (Note: this assumes the column names in all files are
# the same. If not, you will have to filter by manually specifying all of them, which would be
# a nightmare from the point of view of good code as well as maintenance.)
rddData = rddSplit.filter(lambda x: x != lsHeader)
rddData.collect()  # For debugging

# Convert the RDD to a Spark DataFrame.
# Utilise the header we got in the earlier step. Otherwise we can give our own headers.
dfSpark = rddData.toDF(lsHeader)
dfSpark.display()  # For debugging
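To finish the approach described above (DataFrame first, then a table), one possible final step, assuming the target name myschema.mytable from the question, is to register the cleaned DataFrame as a table:
# Write the cleaned DataFrame as a managed table so it can be queried with SQL.
# The schema/table name is taken from the question; adjust the write mode as needed.
dfSpark.write.mode("overwrite").saveAsTable("myschema.mytable")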

Get HDFS file path in PySpark for files in sequence file format

My data on HDFS is in Sequence file format. I am using PySpark (Spark 1.6) and trying to achieve 2 things:
The data path contains a timestamp in yyyy/mm/dd/hh format that I would like to bring into the data itself. I tried SparkContext.wholeTextFiles, but I think that might not support the Sequence file format.
How do I deal with the point above if I want to crunch data for a day and want to bring the date into the data? In this case I would be loading data with a path like yyyy/mm/dd/*.
Appreciate any pointers.
If the stored types are compatible with SQL types and you use Spark 2.0, it is quite simple. Import input_file_name:
from pyspark.sql.functions import input_file_name
Read file and convert to a DataFrame:
df = sc.sequenceFile("/tmp/foo/").toDF()
Add file name:
df = df.withColumn("input", input_file_name())
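To additionally pull the yyyy/mm/dd/hh part of the path into its own column (the first point in the question), one possible approach, assuming the timestamp appears in the path in exactly that layout, is a regex extract on the file name:
from pyspark.sql.functions import regexp_extract

# Hypothetical pattern: extract "yyyy/mm/dd/hh" from the full file path;
# adjust it if your directory layout differs.
df = df.withColumn("path_ts", regexp_extract("input", r"(\d{4}/\d{2}/\d{2}/\d{2})", 1))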
If this solution is not applicable in your case, then a universal one is to list the files directly (for HDFS you can use the hdfs3 library):
files = ...
read them one by one, adding the file name:
def read(f):
    """Just to avoid problems with late binding"""
    return sc.sequenceFile(f).map(lambda x: (f, x))
rdds = [read(f) for f in files]
and union:
sc.union(rdds)

Existing column can't be found by DataFrame#filter in PySpark

I am using PySpark to perform SparkSQL on my Hive tables.
records = sqlContext.sql("SELECT * FROM my_table")
which retrieves the contents of the table.
When I use the filter argument as a string, it works okay:
records.filter("field_i = 3")
However, when I try to use the filter method, as documented here
records.filter(records.field_i == 3)
I am encountering this error
py4j.protocol.Py4JJavaError: An error occurred while calling o19.filter.
: org.apache.spark.sql.AnalysisException: resolved attributes field_i missing from field_1,field_2,...,field_i,...field_n
even though the field_i column clearly exists in the DataFrame object.
I prefer to use the second way because I need to use Python functions to perform record and field manipulations.
I am using Spark 1.3.0 in Cloudera Quickstart CDH-5.4.0 and Python 2.6.
From the Spark DataFrame documentation:
In Python it’s possible to access a DataFrame’s columns either by attribute (df.age) or by indexing (df['age']). While the former is convenient for interactive data exploration, users are highly encouraged to use the latter form, which is future proof and won’t break with column names that are also attributes on the DataFrame class.
It seems that the name of your field might be a reserved word; try:
records.filter(records['field_i'] == 3)
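An equivalent form that avoids attribute access altogether, in case it is more convenient, uses pyspark.sql.functions.col:
from pyspark.sql.functions import col

# The same filter expressed with col(), which does not rely on attribute lookup.
records.filter(col("field_i") == 3)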
What I did was upgrade my Spark from 1.3.0 to 1.4.0 in Cloudera Quickstart CDH-5.4.0, and the second filtering form now works. Although I still can't explain why 1.3.0 has problems with it.
