Suppose we have a dataset with the following structure:
df = sc.parallelize([
    ['a', '2015-11-27', 1],
    ['a', '2015-12-27', 0],
    ['a', '2016-01-29', 0],
    ['b', '2014-09-01', 1],
    ['b', '2015-05-01', 1],
]).toDF(["user", "date", "category"])
I want to analyze user attributes with regard to each user's lifetime in months. For example, I want to sum up the column "category" for each month of a user's lifetime. For user 'a', this would look like:
output = sc.parallelize([['a',0, 1], ['a',1,0], ['a',2,0]]).toDF(("user", "user_lifetime_in_months", "sum(category)"))
What is the most efficient way in Spark to do that? E.g., window functions?
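One possible approach (a minimal sketch, not necessarily the most efficient, assuming the month offset can be derived with months_between from each user's first date):
from pyspark.sql import functions as F

# Compute each user's first date, derive the month offset of every row from it,
# then aggregate "category" per (user, month offset).
df2 = df.withColumn("date", F.to_date("date"))
first_dates = df2.groupBy("user").agg(F.min("date").alias("first_date"))
result = (df2.join(first_dates, "user")
             .withColumn("user_lifetime_in_months",
                         F.floor(F.months_between("date", "first_date")).cast("int"))
             .groupBy("user", "user_lifetime_in_months")
             .agg(F.sum("category").alias("sum(category)")))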
I am learning Python and trying to understand best practices for data queries.
Here is some dummy data (customer sales) to test with:
import pandas as pd
df = pd.DataFrame({'Name':['tom', 'bob', 'bob', 'jack', 'jack', 'jack'],'Amount':[3, 2, 5, 1, 10, 100], 'Date':["01.02.2022", "02.02.2022", "03.02.2022", "01.02.2022", "03.02.2022", "05.02.2022"]})
df.Date = pd.to_datetime(df.Date, format='%d.%m.%Y')
I want to investigate 2 kinds of queries:
1. How long has a person been our customer?
2. What is the period between the first and last purchase?
How can I run the first query without writing loops manually?
What I have done so far for the second query is this:
result = df.groupby("Name").max() - df.groupby("Name").min()
Is it possible to combine these two groupby queries into one to improve the performance?
P.S. I am trying to understand pandas and the key concepts of how to optimize queries. Different approaches and explanations are highly appreciated.
You can use GroupBy.agg with a custom function to get the difference between the max and min date.
df.groupby('Name')['Date'].agg(lambda x: x.max()-x.min())
As you already have datetime type, this will nicely yield a Timedelta object, which by default is shown as a string in the form 'x days'.
You can also save the GroupBy object in a variable and reuse it. This way, computation of the groups occurs only once:
g = df.groupby("Name")['Date']
g.max() - g.min()
output:
Name
bob 1 days
jack 4 days
tom 0 days
Name: Date, dtype: timedelta64[ns]
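For the first query, a short sketch, assuming "how long has a person been our customer" means the time from their first purchase until today:
import pandas as pd

# Tenure = today minus each customer's first purchase date (assumed definition)
tenure = pd.Timestamp.today().normalize() - df.groupby("Name")["Date"].min()
print(tenure)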
I am trying to read parquet files using Spark.
If I want to read the data for June, I'll do the following:
"gs://bucket/Data/year=2021/month=6/file.parquet"
If I want to read the data for all the months, I'll do the following:
"gs://bucket/Data/year=2021/month=*/file.parquet"
If I want to read the first two days of May:
"gs://bucket/Data/year=2021/month=5/day={1,2}/file.parquet"
If I want to read November and December:
"gs://bucket/Data/year=2021/month={11,12}/file.parquet"
You get the idea... but what if I have a dictionary of month/days key-value pairs?
For example, {1: [1,2,3], 4: [10,11,12,13]} means that I need to read days [1,2,3] from January and days [10,11,12,13] from April. How would I express that as a wildcard in the path?
Thank you
You can pass a list of paths to DataFrameReader:
months_dict = {1: [1, 2, 3], 4: [10, 11, 12, 13]}
paths = [
    f"gs://bucket/Data/year=2021/month={k}/day={{{','.join([str(d) for d in v])}}}/*.parquet"
    for k, v in months_dict.items()
]
print(paths)
# ['gs://bucket/Data/year=2021/month=1/day={1,2,3}/*.parquet', 'gs://bucket/Data/year=2021/month=4/day={10,11,12,13}/*.parquet']
df = spark.read.parquet(*paths)
I want to add a column of random values to a dataframe (which has an id for each row) for something I am testing. I am struggling to get reproducible results across Spark sessions, i.e. the same random value for each row id. I am able to reproduce the results by using
from pyspark.sql.functions import rand
new_df = my_df.withColumn("rand_index", rand(seed = 7))
but it only works when I am running it in the same Spark session. I am not getting the same results once I relaunch Spark and run my script.
I also tried defining a UDF, to see if I can generate random integers within an interval using Python's random module with random.seed set:
import random
from pyspark.sql.types import LongType

random.seed(7)
spark.udf.register("getRandVals", lambda x, y: random.randint(x, y), LongType())
but to no avail.
Is there a way to ensure reproducible random number generation across Spark sessions, such that a row id gets the same random value? I would really appreciate some guidance :)
Thanks for the help!
I suspect that you are getting the same underlying values for the seed, but in a different order, because your partitioning is influenced by the data distribution when reading from disk, and there could be more or less data each time. But I am not privy to your actual code.
The rand function generates the same random sequence for a given seed (what is the point of the seed otherwise), and each partition gets a slice of it. If you look closely you should spot the pattern.
Here is an example with two dataframes of different cardinality. You can see that the same seed gives the same values, or a superset of them. So ordering and partitioning play a role, in my opinion.
from pyspark.sql.functions import col, rand
df1 = spark.range(1, 5).select(col("id").cast("double"))
df1 = df1.withColumn("rand_index", rand(seed = 7))
df1.show()
df1.rdd.getNumPartitions()
print('Partitioning distribution: '+ str(df1.rdd.glom().map(len).collect()))
returns:
+---+-------------------+
| id| rand_index|
+---+-------------------+
|1.0|0.06498948189958098|
|2.0|0.41371264720975787|
|3.0|0.12030715258495939|
|4.0| 0.2731073068483362|
+---+-------------------+
8 partitions & Partitioning distribution: [0, 1, 0, 1, 0, 1, 0, 1]
The same again with more data:
...
df1 = spark.range(1, 10).select(col("id").cast("double"))
...
returns:
+---+-------------------+
| id| rand_index|
+---+-------------------+
|1.0| 0.9147159860432812|
|2.0|0.06498948189958098|
|3.0| 0.7069655052310547|
|4.0|0.41371264720975787|
|5.0| 0.1982919638208397|
|6.0|0.12030715258495939|
|7.0|0.44292918521277047|
|8.0| 0.2731073068483362|
|9.0| 0.7784518091224375|
+---+-------------------+
8 partitions & Partitioning distribution: [1, 1, 1, 1, 1, 1, 1, 2]
You can see the same 4 random values appear in both runs, whether within one Spark session or across sessions.
I know it's a bit late, but have you considered hashing IDs, dates, etc., which is deterministic, instead of using built-in random functions? I'm encountering a similar issue, but I believe my problem can be solved using, for example, xxhash64, which is a built-in PySpark hash function. You can then use the last few digits, or normalize if you know the total range of the hash values, which I couldn't find in the documentation.
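A minimal sketch of that idea, assuming the dataframe has an "id" column and a value in [0, 1) is wanted; the literal 7 plays the role of a fixed seed mixed into the hash:
from pyspark.sql import functions as F

# Deterministic pseudo-random value per row id, stable across Spark sessions:
# xxhash64 always returns the same long for the same inputs, so re-running the
# script yields identical values for identical ids.
new_df = my_df.withColumn(
    "rand_index",
    (F.abs(F.xxhash64(F.col("id"), F.lit(7))) % 1000000) / 1000000.0
)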
I have a for loop generating three values: age (data type int64), dayofMonth (data type numpy.ndarray), and Gender (data type str). I would like to store these three values from each iteration in a pandas data frame with columns Age, Day & Gender. Can you suggest how to do that? I'm using Python 3.x.
I tried this code inside for loop
df = pd.DataFrame(columns=["Age", "Day", "Gender"])
for i in range(100):
    df.loc[i] = [age, day, gender]
I'm not able to share sample data, but here is one example:
age=38,day=array([[1],
[3],
[5],
...,
[25],
[26],
[30]], dtype=int64) and Gender='M'
But I'm getting the error ValueError: setting an array element with a sequence.
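A sketch of one way around this, assuming each iteration produces one scalar age, one ndarray of days, and one gender string; collecting rows in a list and building the DataFrame once keeps the array in a single cell instead of trying to broadcast it:
import numpy as np
import pandas as pd

rows = []
for i in range(100):
    # age, day, gender would come from your loop; placeholder values shown here
    age, day, gender = 38, np.array([[1], [3], [5], [25], [26], [30]]), 'M'
    # flatten the array so each row stores a plain list of days in one cell
    rows.append({"Age": age, "Day": day.ravel().tolist(), "Gender": gender})

df = pd.DataFrame(rows, columns=["Age", "Day", "Gender"])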
I intend to implement the Apriori algorithm according to the YAFIM article with pySpark. Its processing workflow contains two phases:
Phase 1: the set of frequent 1-itemsets is found by scanning the database to accumulate the count for each item and collecting the items that satisfy minimum support.
Phase 1 diagram: http://bayanbox.ir/download/2410687383784490939/phase1.png
Phase 2: in this phase, we iteratively use the frequent k-itemsets to generate the frequent (k+1)-itemsets.
Phase 2 diagram: http://bayanbox.ir/view/6380775039559120652/phase2.png
To implement the first phase I have written the following code:
from operator import add

transactions = sc.textFile("/FileStore/tables/wo7gkiza1500361138466/T10I4D100K.dat").cache()
minSupport = 0.05 * transactions.count()
items = transactions.flatMap(lambda line: line.split(" "))
itemCount = items.map(lambda item: (item, 1)).reduceByKey(add)
l1 = itemCount.filter(lambda ic: ic[1] > minSupport)
l1.take(5)
output: [(u'', 100000), (u'494', 5102), (u'829', 6810), (u'368', 7828), (u'766', 6265)]
My problem is that I have no idea how to implement the second phase, especially how to generate the candidate itemsets.
For example, suppose we have the following RDD (frequent 3-itemsets):
([1, 4, 5], 7), ([1, 4, 6], 6), ...
We want to find the candidate 4-itemsets: if two 3-itemsets share the same first two items, they are joined to give a candidate with four items, as follows:
[1, 4, 5, 6], ...
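A minimal sketch of that join step, assuming l_k is an RDD of (sorted k-itemset tuple, count) pairs; two k-itemsets sharing their first k-1 items are merged into a (k+1)-item candidate:
from itertools import combinations

def generate_candidates(l_k):
    # key each frequent k-itemset by its first k-1 items (its prefix)
    keyed = l_k.map(lambda kv: (kv[0][:-1], kv[0][-1]))
    # itemsets with the same prefix are joined: every pair of differing last
    # items, combined with the shared prefix, forms a (k+1)-item candidate
    return (keyed.groupByKey()
                 .flatMap(lambda kv: [kv[0] + tuple(sorted(pair))
                                      for pair in combinations(sorted(kv[1]), 2)]))

# Example: ((1, 4, 5), 7) and ((1, 4, 6), 6) share the prefix (1, 4)
# and produce the candidate (1, 4, 5, 6).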