Generate multiple outputs from a dataframe, reading data only once - apache-spark

I have an input of multiple files and I need to apply different processing rules and output to multiple files.
How can do it so that the files are only read once, but in a way that I can apply different filtering, grouping, etc and save to different output files?
Something similar to the diagram below:
INPUT
FILES
|
|
/ \
/ \
/ \
FILTER X FILTER Y
| |
| |
GROUP BY A GROUP BY B
| |
| |
OUTPUT OUTPUT
FILE 1 FILE 2
I tried to use a code similar as below, but it seems to be reading the input files multiple times.
rd = spark.read.format('dbf').load(os.path.join(
sih_data,
'RD??{10,11,12,13,14,15,16,17,18,19,20,21}??.{DBC,dbc}'
))
out1 = rd.where(rd['UF_ZI'] == '52')
out1 = out1.groupBy('UF_ZI').count()
out1.write.format("com.databricks.spark.csv")\
.option("header", "true").save("out1")
out2 = rd.where(rd['IDENT'] == '1')
out2 = out2.groupBy('MUNIC_RES').count()
out2.write.format("com.databricks.spark.csv")\
.option("header", "true").save("out2")

To answer your question specifically, as mentioned in the comments, you can use .cache() once you have read the data and use your exact same code. cache() is just a shortcut for persist('MEMORY_AND_DISK') so your data will be stored in memory if possible, and the rest on the disk.
That being said, it is worth thinking about what you are doing before deciding anything. I don't know about the dbf format but if it allows predicate pushdown and that your filters are quite restrictive, it could be better to leave your code as it is to load less data (predicate pushdown on cached data is less effective in spark). If the source does not allow predicate pushdown or if you have many filters, other approaches could be interesting.
For instance, you could wrap all your filters into one job to 1. avoid the overhead of each job 2. read the data only once.
# you could define as many filters as a list of 3-tuples.
# the 1st element is the id of the filter, the second the filter itself
# and the third, the grouping column.
filters = [(1, rd['UF_ZI'] == '52', 'UF_ZI'), (2, rd['IDENT'] == '1', 'MUNIC_RES')]
# Then you define an array of structs. If the filter passes, the element is
# a struct with the grouping column and the filter. If not, it is null.
cols = [F.when(filter,
F.struct(F.col(group).alias("group"),
F.lit(id).alias("filter")))
for (id, filter, group) in filters]
# Finally, we explode the array of filters defined just before and filter out
# null values that correspond to non matching filters.
rd\
.withColumn("s", F.explode(F.array(*cols)))\
.where(F.col('s').isNotNull())\
.groupBy(F.col("s.group").alias("group"), F.col("s.filter").alias("filter"))\
.count()\
.write.partitionBy("filter").csv("test.csv")
# In spark>=2.4, you could avoid the filter and use array_except to remove null
# values before exploding the array. It might be more efficient.
rd\
.withColumn("s", F.explode(F.array_except(
F.array(*cols),
F.array(F.lit(None))
)))\
.groupBy(F.col("s.group").alias("group"), F.col("s.filter").alias("filter"))\
.count()\
.write.partitionBy("filter").csv("test.csv")
Having a look at test.csv, it looks like this:
> ls test.csv
'filter=1' 'filter=2'

Related

(KNN ) row compute use outer DataFrame on pyspark

question
my data structure is like this:
train_info:(over 30000 rows)
----------
odt:string (unique)
holiday_type:string
od_label:string
array:array<double> (with variable length depend on different odt and holiday_type )
useful_index:array<int> (length same as vectors)
...(other not important cols)
label_data:(over 40000 rows)
----------
holiday_type:string
od_label: string
l_origin_array:array<double> (with variable length)
...(other not important cols)
my expected result is like this(length same with train_info):
--------------
odt:string
holiday_label:string
od_label:string
prediction:int
my solution is like this:
if __name__=='__main __'
loop_item = train_info.collect()
result = knn_for_loop(spark, loop_item,train_info.schema,label_data)
----- do something -------
def knn_for_loop(spark, predict_list, schema, label_data):
result = list()
for i in predict_list:
# turn this Row col to Data Frame and join on label data
# across to this row data pick label data array data
predict_df = spark.sparkContext.parallelize([i]).toDF(schema) \
.join(label_data, on=['holiday_type', "od_label"], how='left') \
.withColumn("l_array",
UDFuncs.value_from_array_by_index(f.col('l_origin_array'), f.col("useful_index"))) \
.toPandas()
# pandas execute
train_x = predict_df.l_array.values
train_y = predict_df.label.values
test_x = predict_df.array.values[0]
test_y = KNN(train_x, train_y, test_x)
result.append((i['odt'], i['holiday_type'], i['od_label'], test_y))
return result
it's worked but is really slow, I estimate each row need 18s.
in R language I can do this easily using do function:
train_info%>%group_by(odt)%>%do(.,knn_loop,label_data)
something my tries
I tried to join them before use,and query them when I compute, but the data is too large to run (these two df have 400 million rows after join and It takes up 180 GB disk space on hive and query really slowly).
I tried to use pandas_udf, but it only allows one pd.data.frame parameter and slow).
I tried to use UDF, but UDF can't receive data frame obj.
I tried to use spark-knn package ,but I run with error,may be my offline
installation is wrong .
thanks for your help.

How to dynamically know if pySpark DF has a null/empty value for given columns?

I have to check if incoming data is having any null or "" or " " value or not. The column for which I have to check is not fixed. I am reading from a config where the column name is stored for different files with permissible null-ability.
+----------+------------------+--------------------------------------------+
| FileName | Nullable | Columns |
+----------+------------------+--------------------------------------------+
| Sales | Address2,Phone2 | OrderID,Address1,Address2,Phone1,Phone2 |
| Invoice | Bank,OfcAddress | InvoiceNo,InvoiceID,Amount,Bank,OfcAddress |
+----------+------------------+--------------------------------------------+
So for each data/file I have to see which field shouldn't contain null. On basis of that process/error out the file. Is there any pythonic way to do this?
The table structure you’re showing makes me believe you have read the file containing these job details as a Spark DataFrame. You probably shouldn’t, as it’s very likely not big data. If you have it as a Spark DataFrame, collect it to the driver, so that you can create separate Spark jobs for each file.
Then, each job is fairly straightforward: you have a certain file location from which you must read. That info is captured by the FileName, I presume. Now, I will also presume the file format for each of these files is identical. If not, you’ll have to add meta data indicating the file format. For now, I assume it’s CSV.
Next, you must determine the subset of columns that needs to be checked for the presence of nulls. That’s easy: given that you have a list of all columns in the DataFrame (which could’ve been derived from the DataFrame generated by the previous step (the loading)) and a list of all columns that can contain nulls, the list of columns that can’t contain nulls is simply the difference between these two.
Finally, you aggregate over the DataFrame the number of nulls within each of these columns. As this is a DataFrame aggregate, there’s only one row in the result set, so you can take head to bring it to the driver. Cast is to a dict for easier access to the attributes.
I’ve added a function, summarize_positive_counts, that returns the columns where there was at least one null record found, thereby invalidating the claim in the original table.
df.show(truncate=False)
# +--------+---------------+------------------------------------------+
# |FileName|Nullable |Columns |
# +--------+---------------+------------------------------------------+
# |Sales |Address2,Phone2|OrderID,Address1,Address2,Phone1,Phone2 |
# |Invoice |Bank,OfcAddress|InvoiceNo,InvoiceID,Amount,Bank,OfcAddress|
# +--------+---------------+------------------------------------------+
jobs = df.collect() # bring it to the driver, to create new Spark jobs from its
from pyspark.sql.functions import col, sum as spark_sum
def report_null_counts(frame, job):
cols_to_verify_not_null = (set(job.Columns.split(","))
.difference(job.Nullable.split(",")))
null_counts = frame.agg(*(spark_sum(col(_).isNull().cast("int")).alias(_)
for _ in cols_to_verify_not_null))
return null_counts.head().asDict()
def summarize_positive_counts(filename, null_counts):
return {filename: [colname for colname, nbr_of_nulls in null_counts.items()
if nbr_of_nulls > 0]}
for job in jobs: # embarassingly parallellizable
frame = spark.read.csv(job.FileName, header=True)
null_counts = report_null_counts(frame, job)
print(summarize_positive_counts(job.FileName, null_counts))

SPARK Combining Neighbouring Records in a text file

very new to SPARK.
I need to read a very large input dataset, but I fear the format of the input files would not be amenable to read on SPARK. Format is as follows:
RECORD,record1identifier
SUBRECORD,value1
SUBRECORD2,value2
RECORD,record2identifier
RECORD,record3identifier
SUBRECORD,value3
SUBRECORD,value4
SUBRECORD,value5
...
Ideally what I would like to do is pull the lines of the file into a SPARK RDD, and then transform it into an RDD that only has one item per record (with the subrecords becoming part of their associated record item).
So if the example above was read in, I'd want to wind up with an RDD containing 3 objects: [record1,record2,record3]. Each object would contain the data from their RECORD and any associated SUBRECORD entries.
The unfortunate bit is that the only thing in this data that links subrecords to records is their position in the file, underneath their record. That means the problem is sequentially dependent and might not lend itself to SPARK.
Is there a sensible way to do this using SPARK (and if so, what could that be, what transform could be used to collapse the subrecords into their associated record)? Or is this the sort of problem one needs to do off spark?
There is a somewhat hackish way to identify the sequence of records and sub-records. This method assumes that each new "record" is identifiable in some way.
import org.apache.spark.sql.types.LongType
import org.apache.spark.sql.expressions.Window
val df = Seq(
("RECORD","record1identifier"),
("SUBRECORD","value1"),
("SUBRECORD2","value2"),
("RECORD","record2identifier"),
("RECORD","record3identifier"),
("SUBRECORD","value3"),
("SUBRECORD","value4"),
("SUBRECORD","value5")
).toDS().rdd.zipWithIndex.map(r => (r._1._1, r._1._2, r._2)).toDF("record", "value", "id")
val win = Window.orderBy("id")
val recids = df.withColumn("newrec", ($"record" === "RECORD").cast(LongType))
.withColumn("recid", sum($"newrec").over(win))
.select($"recid", $"record", $"value")
val recs = recids.where($"record"==="RECORD").select($"recid", $"value".as("recname"))
val subrecs = recids.where($"record" =!= "RECORD").select($"recid", $"value".as("attr"))
recs.join(subrecs, Seq("recid"), "left").groupBy("recname").agg(collect_list("attr").as("attrs")).show()
This snippet will first zipWithIndex to identify each row, in order, then add a boolean column that is true every time a "record" is identified, and false otherwise. We then cast that boolean to a long, and then can do a running sum, which has the neat side-effect of essentially labeling every record and it's sub-records with a common identifier.
In this particular case, we then split to get the record identifiers, re-join only the sub-records, group by the record ids, and collect the sub-record values to a list.
The above snippet results in this:
+-----------------+--------------------+
| recname| attrs|
+-----------------+--------------------+
|record1identifier| [value1, value2]|
|record2identifier| []|
|record3identifier|[value3, value4, ...|
+-----------------+--------------------+

Spark - Performing union of Dataframes inside a for loop starting from empty DataFrame

I have a Dataframe with a column called "generationId" and other fields. Field "generationId" takes a range of integer values from 1 to N (upper bound to N is known and is small, between 10 and 15) and I want to process the DataFrame in the following way (pseudo code):
results = emptyDataFrame <=== how do I do this ?
for (i <- 0 until getN(df)) {
val input = df.filter($"generationId" === i)
results.union(getModel(i).transform(input))
}
Here getN(df) gives the N for that data frame based on some criteria. In the loop, input is filtered based on matching against "i" and then fed to some model (some internal library) which transforms the input by adding 3 more columns to it.
Ultimately I would like to get union of all those transformed data frames, so I have all columns of the original data frame plus the 3 additional columns added by the model for each row. I am not able to figure out how to initialize results and unionize the results in each iteration. I do know the exact schema of the result ahead of time. So I did
val newSchema = ...
but I am not sure how to pass that to emptyRDD function and build a empty Dataframe and use it inside the loop.
Also, if there is a much efficient way to do this inside map operation, please suggest.
you can do something like this:
(0 until getN(df))
.map(i => {
val input = df.filter($"generationId" === i)
getModel(i).transform(input)
})
.reduce(_ union _)
that way you don't need to worry about the empty df

How to loop through each row of dataFrame in pyspark

E.g
sqlContext = SQLContext(sc)
sample=sqlContext.sql("select Name ,age ,city from user")
sample.show()
The above statement prints theentire table on terminal. But I want to access each row in that table using for or while to perform further calculations.
You simply cannot. DataFrames, same as other distributed data structures, are not iterable and can be accessed using only dedicated higher order function and / or SQL methods.
You can of course collect
for row in df.rdd.collect():
do_something(row)
or convert toLocalIterator
for row in df.rdd.toLocalIterator():
do_something(row)
and iterate locally as shown above, but it beats all purpose of using Spark.
To "loop" and take advantage of Spark's parallel computation framework, you could define a custom function and use map.
def customFunction(row):
return (row.name, row.age, row.city)
sample2 = sample.rdd.map(customFunction)
or
sample2 = sample.rdd.map(lambda x: (x.name, x.age, x.city))
The custom function would then be applied to every row of the dataframe. Note that sample2 will be a RDD, not a dataframe.
Map may be needed if you are going to perform more complex computations. If you just need to add a simple derived column, you can use the withColumn, with returns a dataframe.
sample3 = sample.withColumn('age2', sample.age + 2)
Using list comprehensions in python, you can collect an entire column of values into a list using just two lines:
df = sqlContext.sql("show tables in default")
tableList = [x["tableName"] for x in df.rdd.collect()]
In the above example, we return a list of tables in database 'default', but the same can be adapted by replacing the query used in sql().
Or more abbreviated:
tableList = [x["tableName"] for x in sqlContext.sql("show tables in default").rdd.collect()]
And for your example of three columns, we can create a list of dictionaries, and then iterate through them in a for loop.
sql_text = "select name, age, city from user"
tupleList = [{name:x["name"], age:x["age"], city:x["city"]}
for x in sqlContext.sql(sql_text).rdd.collect()]
for row in tupleList:
print("{} is a {} year old from {}".format(
row["name"],
row["age"],
row["city"]))
It might not be the best practice, but you can simply target a specific column using collect(), export it as a list of Rows, and loop through the list.
Assume this is your df:
+----------+----------+-------------------+-----------+-----------+------------------+
| Date| New_Date| New_Timestamp|date_sub_10|date_add_10|time_diff_from_now|
+----------+----------+-------------------+-----------+-----------+------------------+
|2020-09-23|2020-09-23|2020-09-23 00:00:00| 2020-09-13| 2020-10-03| 51148 |
|2020-09-24|2020-09-24|2020-09-24 00:00:00| 2020-09-14| 2020-10-04| -35252 |
|2020-01-25|2020-01-25|2020-01-25 00:00:00| 2020-01-15| 2020-02-04| 20963548 |
|2020-01-11|2020-01-11|2020-01-11 00:00:00| 2020-01-01| 2020-01-21| 22173148 |
+----------+----------+-------------------+-----------+-----------+------------------+
to loop through rows in Date column:
rows = df3.select('Date').collect()
final_list = []
for i in rows:
final_list.append(i[0])
print(final_list)
Give A Try Like this
result = spark.createDataFrame([('SpeciesId','int'), ('SpeciesName','string')],["col_name", "data_type"]);
for f in result.collect():
print (f.col_name)
If you want to do something to each row in a DataFrame object, use map. This will allow you to perform further calculations on each row. It's the equivalent of looping across the entire dataset from 0 to len(dataset)-1.
Note that this will return a PipelinedRDD, not a DataFrame.
above
tupleList = [{name:x["name"], age:x["age"], city:x["city"]}
should be
tupleList = [{'name':x["name"], 'age':x["age"], 'city':x["city"]}
for name, age, and city are not variables but simply keys of the dictionary.

Resources