Would a forced Spark DataFrame materialization work as a checkpoint?

Would a forced Spark DataFrame materialization work as a checkpoint? - apache-spark

I have a large and complex DataFrame with nested structures in Spark 2.1.0 (pySpark) and I want to add an ID column to it. The way I did it was to add a column like this:
df= df.selectExpr('*','row_number() OVER (PARTITION BY File ORDER BY NULL) AS ID')
So it goes e.g. from this:
File A B
a.txt valA1 [valB11,valB12]
a.txt valA2 [valB21,valB22]
to this:
File A B ID
a.txt valA1 [valB11,valB12] 1
a.txt valA2 [valB21,valB22] 2
After I add this column, I don't immediately trigger a materialization in Spark, but I first branch the DataFrame to a new variable:
dfOutput = df.select('A','ID')
with only columns A and ID and I write dfOutput to Hive, so I get e.g. Table1:
A ID
valA1 1
valA2 2
So far so good. Then I continue using df for further transformations, namely I explode some of the nested arrays in the columns and drop the original, like this:
df = df.withColumn('Bexpl',explode('B')).drop('B')
and I get this:
File A Bexpl ID
a.txt valA1 valB11 1
a.txt valA1 valB12 1
a.txt valA2 valB21 2
a.txt valA2 valB22 2
and output other tables from it, sometimes after creating a second ID column since there are more rows from the exploded arrays. E.g. I create Table2:
df= df.selectExpr('*','row_number() OVER (PARTITION BY File ORDER BY NULL) AS ID2')
to get:
File A Bexpl ID ID2
a.txt valA1 valB11 1 1
a.txt valA1 valB12 1 2
a.txt valA2 valB21 2 3
a.txt valA2 valB22 2 4
and output as earlier:
dfOutput2 = df.select('Bexpl','ID','ID2')
to get:
Bexpl ID ID2
valB11 1 1
valB12 1 2
valB21 2 3
valB22 2 4
I would expect that the values of the first ID column remain the same and match the data for each row from the point that this column was created. This would allow me to keep a relation between Table1 created from dfOutput and subsequent tables from df, like dfOutput2 and the resulting Table2.
The problem is that ID and ID2 are not as they should be in the example above, but mixed up, and I'm trying to find out why. My guess is that the values of the first ID column are not deterministic because df is not materialized before branching to dfOutput. So when the data is actually materialized when saving the table from dfOutput, the rows are shuffled and IDs are different from the data that is materialized on a later point from df, as in dfOutput2. I am however not sure, so my questions are:
Is my assumption correct, that IDs are generated differently for the different branches although I add the ID column before branching?
Would materializing the DataFrame before branching to dfOutput, e.g. through df.cache().count(), ensure a fixed ID column which I can later branch however I want from df, so that I can use this as a checkpoint?
If not, how can I solve this?
I would appreciate any help or at least quick confirmation because I can't test it properly. Spark would shuffle the data only if it doesn't have enough memory, and reaching that point would mean loading a lot of data and in turn need a long time, and may still provide coincidentally good results (already tried with smaller datasets).

I don't full understand the problem but the address some points:
With window definition as:
(PARTITION BY ... ORDER BY NULL)
order of values in the window is effectively random. There is no reason for it to stable.
Spark would shuffle the data only if it doesn't have enough memory,
Spark will shuffle every time it is required by the execution plan (grouping, window functions). You confused this with disk spills, which don't induce randomness.
Would materializing the DataFrame before branching to dfOutput, e.g. through df.cache().count()
No. You cannot depend on caching for correctness.

Related

Custom partitioning on JDBC in PySpark

I have a huge table in an oracle database that I want to work on in pyspark. But I want to partition it using a custom query, for example imagine there is a column in the table that contains the user's name, and I want to partition the data based on the first letter of the user's name. Or imagine that each record has a date, and I want to partition it based on the month. And because the table is huge, I absolutely need the data for each partition to be fetched directly by its executor and NOT by the master. So can I do that in pyspark?
P.S.: The reason that I need to control the partitioning, is that I need to perform some aggregations on each partition (partitions have meaning, not just to distribute the data) and so I want them to be on the same machine to avoid any shuffles. Is this possible? or am I wrong about something?
NOTE
I don't care about even or skewed partitioning! I want all the related records (like all the records of a user, or all the records from a city etc.) to be partitioned together, so that they reside on the same machine and I can aggregate them without any shuffling.

It turned out that the spark has a way of controlling the partitioning logic exactly. And that is the predicates option in spark.read.jdbc.
What I came up with eventually is as follows:
(For the sake of the example, imagine that we have the purchase records of a store, and we need to partition it based on userId and productId so that all the records of an entity is kept together on the same machine, and we can perform aggregations on these entities without shuffling)
First, produce the histogram of every column that you want to partition by (count of each value):
userId
count
123456
1640
789012
932
345678
1849
901234
11
...
...
productId
count
123456789
5435
523485447
254
363478326
2343
326484642
905
...
...
Then, use the multifit algorithm to divide the values of each column into n balanced bins (n being the number of partitions that you want).
userId
bin
123456
1
789012
1
345678
1
901234
2
...
...
productId
bin
123456789
1
523485447
2
363478326
2
326484642
3
...
...
Then, store these in the database
Then update your query and join on these tables to get the bin numbers for every record:
url = 'jdbc:oracle:thin:username/password#address:port:dbname'
query = ```
(SELECT
MY_TABLE.*,
USER_PARTITION.BIN as USER_BIN,
PRODUCT_PARTITION.BIN AS PRODUCT_BIN
FROM MY_TABLE
LEFT JOIN USER_PARTITION
ON my_table.USER_ID = USER_PARTITION.USER_ID
LEFT JOIN PRODUCT_PARTITION
ON my_table.PRODUCT_ID = PRODUCT_PARTITION.PRODUCT_ID) MY_QUERY```
df = spark.read\
.option('driver', 'oracle.jdbc.driver.OracleDriver')\
jdbc(url=url, table=query, predicates=predicates)
And finally, generate the predicates. One for each partition, like these:
predicates = [
'USER_BIN = 1 OR PRODUCT_BIN = 1',
'USER_BIN = 2 OR PRODUCT_BIN = 2',
'USER_BIN = 3 OR PRODUCT_BIN = 3',
...
'USER_BIN = n OR PRODUCT_BIN = n',
]
The predicates are added to the query as WHERE clauses, which means that all the records of the users in partition 1 go to the same machine. Also, all the records of the products in partition 1 go to that same machine as well.
Note that there are no relations between the user and the product here. We don't care which products are in which partition or are sent to which machine.
But since we want to perform some aggregations on both the users and the products (separately), we need to keep all the records of an entity (user or product) together. And using this method, we can achieve that without any shuffles.
Also, note that if there are some users or products whose records don't fit in the workers' memory, then you need to do a sub-partitioning. Meaning that you should first add a new random numeric column to your data (between 0 and some chunk_size like 10000 or something), then do the partitioning based on the combination of that number and the original IDs (like userId). This causes each entity to be split into fixed-sized chunks (i.e., 10000) to ensure it fits in the workers' memory.
And after the aggregations, you need to group your data on the original IDs to aggregate all the chunks together and make each entity whole again.
The shuffle at the end is inevitable because of our memory restriction and the nature of our data, but this is the most efficient way you can achieve the desired results.

How can I groupby rows by the columns in which they actually posses a data point?

I don't even know if groupby is the correct function to use for this. It's a bit hard to understand so Ill include a screenshot of my dataframe: screenshot
Basically, this dataframe has way too many columns because each column is specific to only one or a few rows. You can see in the screenshot that the first few columns are specific towards the first row and the last few columns are specific to the last row. I want to make it so that each row only has the columns that actually pertain to it. I've tried several methods of using groupby('equipment name') and several methods using dropna but none work in the way I need it to. I'm also open to separating it into multiple dataframes.
Any method is acceptable, this bug has been driving me crazy. It took me a while to get to this point because this started out as an unintelligible 10,000 line json. I'm pretty new to programming as well.

This is a very cool answer that could be one option - and it does use groupby so sorry for dismissing!!! This will group your data into DataFrames where each DataFrame has a unique group of columns, and any row which only contains values for those columns will be in that DataFrame. If your data are such that there are multiple groups of rows which share the exact same columns, this solution is ideal I think.
Just to note, though, if your null values are more randomly spread out throughout the dataset, or if one row in a group of rows is missing a single entry (compared to related rows), you will end up with more combinations of unique non-null columns, and then more output DataFrames.
There are also (in my opinion) nice ways to search a DataFrame, even if it is very sparse. You can check the non-null values for a row:
df.loc[index_name].dropna()
Or for an index number:
df.iloc[index_number].dropna()
You could further store these values, say in a dictionary (this is a dictionary of Series, but could be converted to DataFrame:
row_dict = {row : df.loc[row].dropna() for row in df.index}
I could imagine some scenarios where something based off these options is more helpful for searching. But that linked answer is slick, I would try that.
EDIT: Expanding on the answer above based on comments with OP.
The dictionary created in the linked post contain the DataFrames . Basically you can use this dictionary to do comparisons with the original source data. My only issue with that answer was that it may be hard to search the dictionary if the column names are janky (as it looks like in your data), so here's a slight modification:
for i, (name,df) in enumerate(df.groupby(df.isnull().dot(df.columns))):
d['df' + str(i)] = df.dropna(1)
Now the dictionary keys are "df#", and the values are the DataFrames. So if you wanted to inspect the content one DataFrame, you can call:
d['df1'].head()
#OR
print(d['df0'])
If you wanted to look at all the DataFrames, you could call
for df in d.values():
print(df.head()) #you can also pass an integer to head to show more rows than 5
Or if you wanted to save each DataFrame you could call:
for name in sorted(d.keys()):
d[name].to_csv('path/to/file/' + name + '.csv')
The point is, you've gotten to a data structure where you can look at the original data, separated into DataFrames without missing data. Joining these back into a single DataFrame would be redundant, as it would create a single DataFrame (equal to the original) or multiple with some amount of missing data.
I think it comes down to what you are looking for and how you need to search the data. You could rename the dictionary keys / output .CSV files based on the types of machinery inside, for example.
I thought your last comment might mean that objects of similar type might not share the same columns; say for example if not all "Exhaust Fans" have the same columns, they will end up in different DataFrames in the dictionary. This maybe the type of case where it might be easier to just look at individual rows, rather than grouping them into weird categories:
df_dict = {row : pd.DataFrame(df.loc[row].dropna()).transpose() for row in df.index}
You could again then save these DataFrames as CSV files or look at them one by one (or e.g. search for Exhaust Fans by seeing if "Exhaust" is in they key). You could also print them all at once:
import pandas as pd
import numpy as np
import natsort
#making some randomly sparse data
columns = ['Column ' + str(i+1) for i in range(10)]
index = ['Row ' + str(i+1) for i in range(100)]
df = pd.DataFrame(np.random.rand(100,10), columns=columns,index=index)
df[df<.7] = np.nan
#creating the dictionary where each key is a row name
df_dict = {row : pd.DataFrame(df.loc[row].dropna()).transpose() for row in df.index}
#printing all the output
for key in natsort.natsorted(df_dict.keys())[:5]: #using [:5] to limit output
print(df_dict[key], '\n')
Out[1]:
Column 1 Column 4 Column 7 Column 9 Column 10
Row 1 0.790282 0.710857 0.949141 0.82537 0.998411
Column 5 Column 8 Column 10
Row 2 0.941822 0.722561 0.796324
Column 2 Column 4 Column 5 Column 6
Row 3 0.8187 0.894869 0.997043 0.987833
Column 1 Column 7
Row 4 0.832628 0.8349
Column 1 Column 4 Column 6
Row 5 0.863212 0.811487 0.924363
Instead of printing, you could write the output to a text file; maybe that's the type of document that you could look at (and search) to compare to the input tables. Bute not that even though the printed data are tabular, they can't be made into a DataFrame without accepting that there will be missing data for rows which don't have entries for all columns.

Splitting dataframe based on multiple column values

I have a dataframe with 1M+ rows. A sample of the dataframe is shown below:
df
ID Type File
0 123 Phone 1
1 122 Computer 2
2 126 Computer 1
I want to split this dataframe based on Type and File. If the total count of Type is 2 (Phone and Computer), total number of files is 2 (1,2), then the total number of splits will be 4.
In short, total splits is as given below:
total_splits=len(set(df['Type']))*len(set(df['File']))
In this example, total_splits=4. Now, I want to split the dataframe df in 4 based on Type and File.
So the new dataframes should be:
df1 (having data of type=Phone and File=1)
df2 (having data of type=Computer and File=1)
df3 (having data of type=Phone and File=2)
df4 (having data of type=Computer and File=2)
The splitting should be done inside a loop.
I know we can split a dataframe based on one condition (shown below), but how do you split it based on two ?
My Code:
data = {'ID' : ['123', '122', '126'],'Type' :['Phone','Computer','Computer'],'File' : [1,2,1]}
df=pd.DataFrame(data)
types=list(set(df['Type']))
total_splits=len(set(df['Type']))*len(set(df['File']))
cnt=1
for i in range(0,total_splits):
for j in types:
locals()["df"+str(cnt)] = df[df['Type'] == j]
cnt += 1
The result of the above code gives 2 dataframes, df1 and df2. df1 will have data of Type='Phone' and df2 will have data of Type='Computer'.
But this is just half of what I want to do. Is there a way we can make 4 dataframes here based on 2 conditions ?
Note: I know I can first split on 'Type' and then split the resulting dataframe based on 'File' to get the output. However, I want to know of a more efficient way of performing the split instead of having to create multiple dataframes to get the job done.
EDIT
This is not a duplicate question as I want to split the dataframe based on multiple column values, not just one!

You can make do with groupby:
dfs = {}
for k, d in df.groupby(['Type','File']):
type, file = k
# do want ever you want here
# d is the dataframe corresponding with type, file
dfs[k] = d
You can also create a mask:
df['mask'] = df['File'].eq(1) * 2 + df['Type'].eq('Phone')
Then, for example:
df[df['mask'].eq(0)]
gives you the first dataframe you want, i.e. Type==Phone and File==1, and so on.

Difference (if there is one) between spark.sql.shuffle.partitions and df.repartition

I'm having a bit of difficulty reconciling the difference (if one exists) between sqlContext.sql("set spark.sql.shuffle.partitions=n") and re-partitioning a Spark DataFrame utilizing df.repartition(n).
The Spark documentation indicates that set spark.sql.shuffle.partitions=n configures the number of partitions that are used when shuffling data, while df.repartition seems to return a new DataFrame partitioned by the number of key specified.
To make this question clearer, here is a toy example of how I believe df.reparition and spark.sql.shuffle.partitions to work:
Let's say we have a DataFrame, like so:
ID | Val
--------
A | 1
A | 2
A | 5
A | 7
B | 9
B | 3
C | 2
Scenario 1: 3 Shuffle Partitions, Reparition DF by ID:
If I were to set sqlContext.sql("set spark.sql.shuffle.partitions=3") and then did df.repartition($"ID"), I would expect my data to be repartitioned into 3 partitions, with one partition holding 3 vals of all the rows with ID "A", another holding 2 vals of all the rows with ID "B", and the final partition holding 1 val of all the rows with ID "C".
Scenario 2: 5 shuffle partitions, Reparititon DF by ID: In this scenario, I would still expect each partition to ONLY hold data tagged with the same ID. That is to say, there would be NO mixing of rows with different IDs within the same partition.
Is my understanding off base here? In general, my questions are:
I am trying to optimize my partitioning of a dataframe as to avoid
skew, but to have each partition hold as much of the same key
information as possible. How do I achieve that with set
spark.sql.shuffle.partitions and df.repartiton?
Is there a link
between set spark.sql.shuffle.partitions and df.repartition? If
so, what is that link?
Thanks!

I would expect my data to be repartitioned into 3 partitions, with one partition holding 3 vals of all the rows with ID "A", another holding 2 vals of all the rows with ID "B", and the final partition holding 1 val of all the rows with ID "C".
No
5 shuffle partitions, Reparititon DF by ID: In this scenario, I would still expect each partition to ONLY hold data tagged with the same ID. That is to say, there would be NO mixing of rows with different IDs within the same partition.
and no.
This is not how partitioning works. Partitioners map values to partitions, but mapping in general case is not unique (you can check How does HashPartitioner work? for a detailed explanation).
Is there a link between set spark.sql.shuffle.partitions and df.repartition? If so, what is that link?
Indeed there is. If you df.repartition, but don't provide number of partitions then spark.sql.shuffle.partitions is used.

How to filter out duplicate rows based on some columns in spark dataframe?

Suppose, I have a Dataframe like below:
Here, you can see that transaction number 1,2 and 3 have same value for columns A,B,C but different value for column D and E. Column E has date entries.
For same A,B and C combination (A=1,B=1,C=1), we have 3 rows. I want to take only one row based on the recent transaction date of column E means the rows which have the most recent date. But for the most recent date, there are 2 transactions. But i want to take just one of them if two or more rows found for the same combination of A,B,C and most recent date in column E.
So my expected output for this combination will be row number 3 or 4(any one will do).
For same A,B and C combination (A=2,B=2,C=2), we have 2 rows. But based on column E, the most recent date is the date of row number 5. So we will just take this row for this combination of A,B and C.
So my expected output for this combination will be row number 5
So the final output will be (3 and 5) or (4 and 5).
Now how should i approach:
I read this:
Both reduceByKey and groupByKey can be used for the same purposes but
reduceByKey works much better on a large dataset. That’s because Spark
knows it can combine output with a common key on each partition before
shuffling the data.
I tried with groupBy on Column A,B,C and max on column E. But it can't give me the head of the rows if multiple rows present for the same date.
What is the most optimized approach to solve this? Thanks in advance.
EDIT: I need get back my filtered transactions. How to do it also?

I have used spark window functions to get my solution:
val window = Window
.partitionBy(dataframe("A"), dataframe("B"),dataframe("C"))
.orderBy(dataframe("E") desc)
val dfWithRowNumber = dataframe.withColumn("row_number", row_number() over window)
val filteredDf = dfWithRowNumber.filter(dfWithRowNumber("row_number") === 1)

Link possible by several steps. Agregated Dataframe:
val agregatedDF=initialDF.select("A","B","C","E").groupBy("A","B","C").agg(max("E").as("E_max"))
Link intial-agregated:
initialDF.join(agregatedDF, List("A","B","C"))
If initial DataFrame comes from Hive, all can be simplified.

val initialDF = Seq((1,1,1,1,"2/28/2017 0:00"),(1,1,1,2,"3/1/2017 0:00"),
(1,1,1,3,"3/1/2017 0:00"),(2,2,2,1,"2/28/2017 0:00"),(2,2,2,2,"2/25/20170:00"))
This will miss out on corresponding col(D)
initialDF
.toDS.groupBy("_1","_2","_3")
.agg(max(col("_5"))).show
In case you want the corresponding colD for the max col:
initialDF.toDS.map(x=>x._1,x._2,x._3,x._5,x._4))).groupBy("_1","_2","_3")
.agg(max(col("_4")).as("_4")).select(col("_1"),col("_2"),col("_3"),col("_4._2"),col("_4._1")).show
For ReduceByKey you can convert the dataset to pairRDD and then work off it.Should be faster in case the Catalyst is not able to optimize the groupByKey in the first one. Refer Rolling your own reduceByKey in Spark Dataset

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string