My data is as shown below:
Store  ID  Amount  ...
1      1   10
1      2   20
2      1   10
3      4   50
I have to create a separate directory for each store:
Store 1/accounts directory:
ID  Amount
1   10
2   20
Store 2/accounts directory:
ID  Amount
1   10
For this purpose, can I use a loop over the Spark dataframe, as below? It works on my local machine. Will it be a problem on a cluster?
storecount = 1
while storecount <= 50:
    query = "SELECT * FROM Sales WHERE Store = {}".format(storecount)
    DF = spark.sql(query)
    # write each store's rows to its own directory
    DF.write.format("csv").save("{}/Store {}/accounts".format(path, storecount))
    storecount = storecount + 1
If I understood the problem correctly, what you really want to do is partition the dataframe. I would suggest doing this:
df.write.partitionBy("Store").mode(SaveMode.Append).csv("..")
This will write the dataframe out into one sub-directory per store value, like:
Store=1/
Store=2/
....
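In PySpark, that write would look roughly like this (a minimal sketch; the output path is just a placeholder):
df.write.partitionBy("Store").mode("append").csv("/output/accounts")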
Yes, you can run a loop here, as it is not a nested operation on the dataframe. Nested operations on an RDD or dataframe are not allowed, because the SparkContext is not serializable.
I need to get the difference in data between two dataframes. I'm using subtract() for this.
# Dataframes that need to be compared
df1
df2
#df1-df2
new_df1 = df1.subtract(df2)
#df2-df1
new_df2 = df2.subtract(df1)
It works fine and the output is what I needed, but my only issue is with the performance.
Even for comparing 1 GB of data, it takes around 50 minutes, which is far from ideal.
Is there any other optimised method to perform the same operation?
Following are some of the details regarding the dataframes:
df1 size = 9397995 rows x 30 columns
df2 size = 1500000 rows x 30 columns
All 30 columns are of dtype string.
Both dataframes are being loaded from a database through a JDBC connection.
Both dataframes have same column names and in same order.
You could use the "WHERE" statement on a dataset to filter out rows you don't need. For example through PK, if you know that df1 has pk ranging from 1 to 100, you just filter in df2 for all those pk's. Obviously after a union
I have a huge table in an oracle database that I want to work on in pyspark. But I want to partition it using a custom query, for example imagine there is a column in the table that contains the user's name, and I want to partition the data based on the first letter of the user's name. Or imagine that each record has a date, and I want to partition it based on the month. And because the table is huge, I absolutely need the data for each partition to be fetched directly by its executor and NOT by the master. So can I do that in pyspark?
P.S.: The reason that I need to control the partitioning, is that I need to perform some aggregations on each partition (partitions have meaning, not just to distribute the data) and so I want them to be on the same machine to avoid any shuffles. Is this possible? or am I wrong about something?
NOTE
I don't care about even or skewed partitioning! I want all the related records (like all the records of a user, or all the records from a city etc.) to be partitioned together, so that they reside on the same machine and I can aggregate them without any shuffling.
It turned out that Spark has a way of controlling the partitioning logic exactly: the predicates option of spark.read.jdbc.
What I came up with eventually is as follows:
(For the sake of the example, imagine that we have the purchase records of a store, and we need to partition them based on userId and productId so that all the records of an entity are kept together on the same machine and we can perform aggregations on these entities without shuffling.)
First, produce the histogram of every column that you want to partition by (count of each value):
userId   count
123456   1640
789012   932
345678   1849
901234   11
...      ...
productId  count
123456789  5435
523485447  254
363478326  2343
326484642  905
...        ...
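For reference, histograms like these can be produced with a plain groupBy().count() (a minimal sketch, assuming the table, or a representative sample of it, is readable as a dataframe named raw_df with matching column names):
# count of records per key; these counts feed the binning step below
user_hist = raw_df.groupBy("userId").count()
product_hist = raw_df.groupBy("productId").count()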
Then, use the multifit algorithm to divide the values of each column into n balanced bins (n being the number of partitions that you want).
userId   bin
123456   1
789012   1
345678   1
901234   2
...      ...
productId  bin
123456789  1
523485447  2
363478326  2
326484642  3
...        ...
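The balanced binning above can be approximated with a simple greedy heuristic (a rough stand-in for multifit, assuming the histogram fits comfortably on the driver):
def balanced_bins(histogram_rows, n):
    """Greedily assign each key to the currently lightest of n bins.

    histogram_rows: list of (key, count) pairs taken from a histogram above.
    Returns a dict mapping key -> bin number (1..n).
    """
    loads = [0] * n                  # running total count per bin
    assignment = {}
    # placing the heaviest keys first keeps the bins reasonably balanced
    for key, count in sorted(histogram_rows, key=lambda kc: kc[1], reverse=True):
        b = loads.index(min(loads))  # lightest bin so far
        assignment[key] = b + 1      # 1-based bin numbers, as in the tables above
        loads[b] += count
    return assignment

# e.g. user_bins = balanced_bins([(r["userId"], r["count"]) for r in user_hist.collect()], n)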
Then, store these bin assignments in the database (as the USER_PARTITION and PRODUCT_PARTITION tables used below).
Then update your query and join on these tables to get the bin numbers for every record:
url = 'jdbc:oracle:thin:username/password@address:port:dbname'

query = """
(SELECT
     MY_TABLE.*,
     USER_PARTITION.BIN AS USER_BIN,
     PRODUCT_PARTITION.BIN AS PRODUCT_BIN
 FROM MY_TABLE
 LEFT JOIN USER_PARTITION
     ON MY_TABLE.USER_ID = USER_PARTITION.USER_ID
 LEFT JOIN PRODUCT_PARTITION
     ON MY_TABLE.PRODUCT_ID = PRODUCT_PARTITION.PRODUCT_ID) MY_QUERY
"""

# predicates is the list built below, one entry per partition
df = spark.read \
    .option('driver', 'oracle.jdbc.driver.OracleDriver') \
    .jdbc(url=url, table=query, predicates=predicates)
And finally, generate the predicates. One for each partition, like these:
predicates = [
'USER_BIN = 1 OR PRODUCT_BIN = 1',
'USER_BIN = 2 OR PRODUCT_BIN = 2',
'USER_BIN = 3 OR PRODUCT_BIN = 3',
...
'USER_BIN = n OR PRODUCT_BIN = n',
]
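For n bins this list can simply be generated instead of written out by hand:
# n is the number of bins/partitions chosen earlier
predicates = ['USER_BIN = {0} OR PRODUCT_BIN = {0}'.format(i) for i in range(1, n + 1)]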
The predicates are added to the query as WHERE clauses, which means that all the records of the users in partition 1 go to the same machine. Also, all the records of the products in partition 1 go to that same machine as well.
Note that there are no relations between the user and the product here. We don't care which products are in which partition or are sent to which machine.
But since we want to perform some aggregations on both the users and the products (separately), we need to keep all the records of an entity (user or product) together. And using this method, we can achieve that without any shuffles.
Also, note that if there are some users or products whose records don't fit in a worker's memory, then you need to do sub-partitioning. That means first adding a new random numeric column to your data (between 0 and some chunk_size, like 10000), then partitioning on the combination of that number and the original ID (like userId). This splits each entity into fixed-size chunks, ensuring each chunk fits in a worker's memory.
And after the aggregations, you need to group your data on the original IDs to aggregate all the chunks together and make each entity whole again.
The shuffle at the end is inevitable because of our memory restriction and the nature of our data, but this is the most efficient way you can achieve the desired results.
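For illustration, here is roughly how the chunk column and the two-stage aggregation described above could look (the column names, the chunk count and the sum() aggregation are placeholders, not the exact method used):
from pyspark.sql import functions as F

num_chunks = 10000   # placeholder, plays the role of the chunk_size mentioned above

# add a random chunk id so each (userId, chunk) group has a bounded size
chunked = df.withColumn("chunk", (F.rand() * num_chunks).cast("int"))

# aggregate per (userId, chunk) first ...
partial = chunked.groupBy("userId", "chunk").agg(F.sum("amount").alias("partial_sum"))

# ... then merge the chunks per userId; this is the final, unavoidable shuffle
totals = partial.groupBy("userId").agg(F.sum("partial_sum").alias("total_amount"))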
I have a huge parquet table, partitioned on the registration_ts column, named stored.
I'd like to filter this table based on data obtained from a small table named stream.
In sql world the query would look like:
spark.sql("select * from stored where exists (select 1 from stream where stream.registration_ts = stored.registration_ts)")
In Dataframe world:
stored.join(broadcast(stream), Seq("registration_ts"), "leftsemi")
This all works, but performance suffers because partition pruning is not applied: Spark full-scans the stored table, which is too expensive.
For example, this takes 2 minutes:
stream.count
res45: Long = 3
//takes 2 minutes
stored.join(broadcast(stream), Seq("registration_ts"), "leftsemi").collect
[Stage 181:> (0 + 1) / 373]
This runs in 3 seconds:
val stream = stream.where("registration_ts in (20190516204l, 20190515143l,20190510125l, 20190503151l)")
stream.count
res44: Long = 42
//takes 3 seconds
stored.join(broadcast(stream), Seq("registration_ts"), "leftsemi").collect
The reason is that in the second example the partition filter on registration_ts is propagated from the filtered stream table through the join to the stored table.
I'd like to achieve partition filtering on dynamic set of partitions.
The only solution I was able to come up with:
val partitions = stream.select('registration_ts).distinct.collect.map(_.getLong(0))
stored.where('registration_ts.isin(partitions:_*))
This collects the partition values to the driver and issues a second query. It works fine only for a small number of partitions; when I tried this solution with 500k distinct partitions, the delay was significant.
But there must be a better way ...
Here's one way you can do it in PySpark; I've verified in Zeppelin that it uses the set of values to prune the partitions.
from pyspark.sql.functions import col, collect_set

# collect_set returns a distinct list of values and collect() returns a list of rows;
# [0][0] picks the first column of the first row, i.e. the list of distinct values
filter_list = (spark.read.orc(HDFS_PATH)
               .agg(collect_set(COLUMN_WITH_FILTER_VALUES))
               .collect()[0][0])

# use filter_list with isin() to prune the partitions
(spark.read.orc(HDFS_PATH)
 .filter(col(PARTITION_COLUMN).isin(filter_list))
 .show(5))

# you may want to check filter_list to make sure the first spark.read actually returned
# a valid list of values before doing the second spark.read and pruning the partitions
I have 2 Spark dataframes that I read from Hive using the sqlContext. Let's call these dataframes df1 and df2. The data in both dataframes is sorted on a column called PolicyNumber at the Hive level. PolicyNumber also happens to be the primary key for both dataframes. Below are sample values for both dataframes, although in reality both my dataframes are huge and spread across 5 executors as 5 partitions. For simplicity's sake, I will assume that each partition has one record.
Sample df1
PolicyNumber FirstName
1 A
2 B
3 C
4 D
5 E
Sample df2
PolicyNumber PremiumAmount
1 450
2 890
3 345
4 563
5 2341
Now, I want to join df1 and df2 on the PolicyNumber column. I can run the piece of code below and get my required output.
df1.join(df2, df1.PolicyNumber == df2.PolicyNumber)
Now, I want to avoid as much shuffle as possible to make this join efficient. So, to avoid shuffle while reading from Hive, I want to partition df1 based on the values of the PolicyNumber column in such a way that the row with PolicyNumber 1 goes to Executor 1, the row with PolicyNumber 2 goes to Executor 2, the row with PolicyNumber 3 goes to Executor 3, and so on. And I want to partition df2 in exactly the same way.
This way, Executor 1 will now have the row from df1 with PolicyNumber=1 and also the row from df2 with PolicyNumber=1 as well.
Similarly, Executor 2 will have the row from df1 with PolicyNumber=2 and also the row from df2 with PolicyNumber=2, and so on.
This way, there will not be any shuffle required as now, the data is local to that executor.
My question is, is there a way to control the partitions in this granularity? And if yes, how do I do it.
Unfortunately there is no direct control over which data ends up on each executor. However, while reading the data into each dataframe, you can use CLUSTER BY on the join column, which helps distribute the sorted data onto the right executors.
ex:
df1 = sqlContext.sql("SELECT * FROM TABLE1 CLUSTER BY JOIN_COLUMN")
df2 = sqlContext.sql("SELECT * FROM TABLE2 CLUSTER BY JOIN_COLUMN")
hope it helps.
I'm having a bit of difficulty reconciling the difference (if one exists) between sqlContext.sql("set spark.sql.shuffle.partitions=n") and re-partitioning a Spark DataFrame utilizing df.repartition(n).
The Spark documentation indicates that set spark.sql.shuffle.partitions=n configures the number of partitions that are used when shuffling data, while df.repartition seems to return a new DataFrame partitioned by the expression or number of partitions specified.
To make this question clearer, here is a toy example of how I believe df.repartition and spark.sql.shuffle.partitions work:
Let's say we have a DataFrame, like so:
ID | Val
--------
A | 1
A | 2
A | 5
A | 7
B | 9
B | 3
C | 2
Scenario 1: 3 Shuffle Partitions, Repartition DF by ID:
If I were to set sqlContext.sql("set spark.sql.shuffle.partitions=3") and then did df.repartition($"ID"), I would expect my data to be repartitioned into 3 partitions, with one partition holding 3 vals of all the rows with ID "A", another holding 2 vals of all the rows with ID "B", and the final partition holding 1 val of all the rows with ID "C".
Scenario 2: 5 Shuffle Partitions, Repartition DF by ID: In this scenario, I would still expect each partition to ONLY hold data tagged with the same ID. That is to say, there would be NO mixing of rows with different IDs within the same partition.
Is my understanding off base here? In general, my questions are:
I am trying to optimize the partitioning of a dataframe so as to avoid skew, but to have each partition hold as much of the same key information as possible. How do I achieve that with set spark.sql.shuffle.partitions and df.repartition?
Is there a link between set spark.sql.shuffle.partitions and df.repartition? If so, what is that link?
Thanks!
I would expect my data to be repartitioned into 3 partitions, with one partition holding 3 vals of all the rows with ID "A", another holding 2 vals of all the rows with ID "B", and the final partition holding 1 val of all the rows with ID "C".
No
5 Shuffle Partitions, Repartition DF by ID: In this scenario, I would still expect each partition to ONLY hold data tagged with the same ID. That is to say, there would be NO mixing of rows with different IDs within the same partition.
and no.
This is not how partitioning works. Partitioners map values to partitions, but in the general case the mapping is not unique: several distinct keys can land in the same partition (you can check How does HashPartitioner work? for a detailed explanation).
Is there a link between set spark.sql.shuffle.partitions and df.repartition? If so, what is that link?
Indeed there is. If you call df.repartition with partitioning expressions but don't provide a number of partitions, then spark.sql.shuffle.partitions is used.
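A small sketch (not from the original answer) that makes this visible, reusing the toy data from the question:
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master("local[*]").getOrCreate()
spark.conf.set("spark.sql.shuffle.partitions", "3")

df = spark.createDataFrame(
    [("A", 1), ("A", 2), ("A", 5), ("A", 7), ("B", 9), ("B", 3), ("C", 2)],
    ["ID", "Val"])

# repartition by column only: the number of partitions comes from spark.sql.shuffle.partitions
repartitioned = df.repartition(F.col("ID"))

# each ID lands in exactly one partition, but two IDs may share a partition
# while another partition stays empty
repartitioned.withColumn("partition", F.spark_partition_id()) \
    .groupBy("partition", "ID").count().show()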