pyspark rdd taking the max frequency with the least age - apache-spark
I have an rdd like the following:
[{'age': 2.18430371791803,
  'code': u'"315.320000"',
  'id': u'"00008RINR"'},
 {'age': 2.80033330216659,
  'code': u'"315.320000"',
  'id': u'"00008RINR"'},
 {'age': 2.8222365762732,
  'code': u'"315.320000"',
  'id': u'"00008RINR"'},
 {...}]
I am trying to reduce each id to just one record by taking the highest-frequency code, using code like:
rdd.map(lambda x: (x["id"], [(x["age"], x["code"])]))\
   .reduceByKey(lambda x, y: x + y)\
   .map(lambda x: [i[1] for i in x[1]])\
   .map(lambda x: [max(zip((x.count(i) for i in set(x)), set(x)))])
There is one problem with this implementation: it doesn't consider age, so if, for example, one id had multiple codes with a frequency of 2, it would take the last code.
To illustrate this issue, please consider this reduced id:
(u'"000PZ7S2G"',
 [(4.3218651186303, u'"388.400000"'),
  (4.34924421126357, u'"388.400000"'),
  (4.3218651186303, u'"389.900000"'),
  (4.34924421126357, u'"389.900000"'),
  (13.3667102491139, u'"794.310000"'),
  (5.99897016368982, u'"995.300000"'),
  (6.02634923989903, u'"995.300000"'),
  (4.3218651186303, u'"V72.19"'),
  (4.34924421126357, u'"V72.19"'),
  (13.3639723398581, u'"V81.2"'),
  (13.3667102491139, u'"V81.2"')])
my code would output:
[(2, u'"V81.2"')]
when I would like it to output:
[(2, u'"388.400000"')]
because although the frequency is the same for both of these codes, code 388.400000 has a lesser age and appears first.
By adding this line after the .reduceByKey():
.map(lambda x: (x[0], [i for i in x[1] if i[0] == min(x[1])[0]]))
I'm able to filter out the records with greater than the minimum age, but then I'm only considering those with the minimum age, not all codes, when calculating their frequency. I can't apply the same/similar logic after [max(zip((x.count(i) for i in set(x)), set(x)))] because set(x) is the set of x[1], which doesn't include the age.
I should add that I don't want to just take the first code with the highest frequency; I'd like to take the highest-frequency code with the least age, or the code that appears first, if this is possible using only RDD operations.
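To make the tie-breaking rule concrete, here is a minimal, driver-side sketch of the selection I'm after for a single id (the pick_code helper and the use of collections.Counter are only for illustration; this isn't the Spark code I want):
from collections import Counter

def pick_code(pairs):
    # pairs: list of (age, code) tuples for a single id
    counts = Counter(code for _, code in pairs)
    min_age = {}
    for age, code in pairs:
        min_age[code] = min(age, min_age.get(code, age))
    # highest count first, then smallest minimum age as the tie-breaker
    return min((-counts[c], min_age[c], c) for c in counts)[2]

pairs = [(4.32, '"388.400000"'), (4.35, '"388.400000"'),
         (4.32, '"389.900000"'), (4.35, '"389.900000"'),
         (13.36, '"V81.2"'), (13.37, '"V81.2"')]
print(pick_code(pairs))   # '"388.400000"'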
The equivalent SQL for what I'm trying to get would be something like:
SELECT code, count(*) as code_frequency
FROM (SELECT id, code, age
      FROM (SELECT id, code, MIN(age) AS age, COUNT(*) as cnt,
                   ROW_NUMBER() OVER (PARTITION BY id ORDER BY COUNT(*) DESC, MIN(age)) as seqnum
            FROM tbl
            GROUP BY id, code
           ) t
      WHERE seqnum = 1) a
GROUP BY code
ORDER BY code_frequency DESC
LIMIT 5;
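To check the intended output of this SQL against the sample data, it could be run by registering the RDD as a temporary view named tbl (the name used in the FROM clause); here query is assumed to hold the SQL text above:
df = rdd.toDF()                       # columns age, code, id inferred from the dicts
df.createOrReplaceTempView("tbl")     # the view name the FROM clause expects
spark.sql(query).show()               # query holds the SQL string shown above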
And as a DataFrame (though I'm trying to avoid this):
wc = Window().partitionBy("id", "code").orderBy("age")
wc2 = Window().partitionBy("id")
df = rdd.toDF()
df = df.withColumn("count", F.count("code").over(wc))\
       .withColumn("max", F.max("count").over(wc2))\
       .filter("count = max")\
       .groupBy("id").agg(F.first("age").alias("age"),
                          F.first("code").alias("code"))\
       .orderBy("id")\
       .groupBy("code")\
       .count()\
       .orderBy("count", ascending=False)
I'd really appreciate any help with this.
Based on the SQL equivalent of your code, I converted the logic into the following rdd1 plus some post-processing (starting from the original RDD):
rdd = sc.parallelize([{'age': 4.3218651186303, 'code': '"388.400000"', 'id': '"000PZ7S2G"'},
                      {'age': 4.34924421126357, 'code': '"388.400000"', 'id': '"000PZ7S2G"'},
                      {'age': 4.3218651186303, 'code': '"389.900000"', 'id': '"000PZ7S2G"'},
                      {'age': 4.34924421126357, 'code': '"389.900000"', 'id': '"000PZ7S2G"'},
                      {'age': 13.3667102491139, 'code': '"794.310000"', 'id': '"000PZ7S2G"'},
                      {'age': 5.99897016368982, 'code': '"995.300000"', 'id': '"000PZ7S2G"'},
                      {'age': 6.02634923989903, 'code': '"995.300000"', 'id': '"000PZ7S2G"'},
                      {'age': 4.3218651186303, 'code': '"V72.19"', 'id': '"000PZ7S2G"'},
                      {'age': 4.34924421126357, 'code': '"V72.19"', 'id': '"000PZ7S2G"'},
                      {'age': 13.3639723398581, 'code': '"V81.2"', 'id': '"000PZ7S2G"'},
                      {'age': 13.3667102491139, 'code': '"V81.2"', 'id': '"000PZ7S2G"'}])
rdd1 = rdd.map(lambda x: ((x['id'], x['code']), (x['age'], 1))) \
          .reduceByKey(lambda x, y: (min(x[0], y[0]), x[1] + y[1])) \
          .map(lambda x: (x[0][0], (-x[1][1], x[1][0], x[0][1]))) \
          .reduceByKey(lambda x, y: x if x < y else y)
# [('"000PZ7S2G"', (-2, 4.3218651186303, '"388.400000"'))]
Where:
(1) use map to initialize the pair-RDD with key=(x['id'], x['code']), value=(x['age'], 1)
(2) use reduceByKey to calculate min_age and count
(3) use map to reset the pair-RDD with key=id and value=(-count, min_age, code)
(4) use reduceByKey to find the min value of tuples (-count, min_age, code) for the same id
The above steps are similar to:
Step (1) + (2): groupby('id', 'code').agg(min('age'), count())
Step (3) + (4): groupby('id').agg(min(struct(negative('count'),'min_age','code')))
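For comparison, a rough DataFrame sketch of those two grouped steps (my own illustration of the mapping above, not code from the original logic; it assumes pyspark.sql.functions is imported as F):
from pyspark.sql import functions as F

best = (rdd.toDF()
           .groupBy('id', 'code')
           .agg(F.min('age').alias('min_age'), F.count('*').alias('cnt'))
           .withColumn('key', F.struct((-F.col('cnt')).alias('neg_cnt'), 'min_age', 'code'))
           .groupBy('id')
           .agg(F.min('key').alias('key'))
           .select('id', 'key.code', 'key.min_age'))
# best holds one row per id with the chosen code and its minimum age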
You can then get the derived table a in your SQL by doing rdd1.map(lambda x: (x[0], x[1][2], x[1][1])), but this step is not necessary. The code can be counted directly from the above rdd1 with another map function plus the countByKey() method, and the result then sorted:
sorted(rdd1.map(lambda x: (x[1][2],1)).countByKey().items(), key=lambda y: -y[1])
# [('"388.400000"', 1)]
However, if what you are looking for is the sum(count) across all ids, then do the following:
rdd1.map(lambda x: (x[1][2],-x[1][0])).reduceByKey(lambda x,y: x+y).collect()
# [('"388.400000"', 2)]
If converting the rdd to a dataframe is an option, I think this approach may solve your problem:
from pyspark.sql.functions import row_number, col
from pyspark.sql import Window
df = rdd.toDF()
w = Window.partitionBy('id').orderBy('age')
df = df.withColumn('row_number', row_number().over(w)) \
       .where(col('row_number') == 1) \
       .drop('row_number')
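If you then also want the code frequencies across all ids, as in the outer query of your SQL, one possible follow-up under the same assumptions is:
df.groupBy('code').count().orderBy('count', ascending=False).show(5)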
Related
Pyspark Dataframe Lambda Map Function of SQL Query
Suppose we have a pyspark.sql.dataframe.DataFrame object:
df = sc.parallelize([['John', 'male', 26], ['Teresa', 'female', 25], ['Jacob', 'male', 6]]).toDF(['name', 'gender', 'age'])
I have a function that runs a SQL query for each row of the DataFrame:
def getInfo(data):
    param_name = data['name']
    param_gender = data['gender']
    param_age = data['age']
    sql_query = "SELECT * FROM people_info WHERE name = '{0}' AND gender = '{1}' AND age = {2}".format(param_name, param_gender, param_age)
    info = info.append(spark.sql(sql_query))
    return info
I am trying to run the function on each row with map:
df_info = df.rdd.map(lambda x: getInfo(x))
I get the error:
PicklingError: Could not serialize object: Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
The error message is telling you exactly what is wrong: your function is trying to access the SparkContext (spark.sql(sql_query)) from inside a transformation (df.rdd.map(lambda x: getInfo(x))). Here's what I think you are trying to do:
df = sc.parallelize([['John', 'male', 26], ['Teresa', 'female', 25], ['Jacob', 'male', 6]]).toDF(['name', 'gender', 'age'])
people = spark.table("people_info")
people.join(df, on=[people.name == df.name, people.gender == df.gender, people.age == df.age], how="inner")
There are a couple of other ways to do a join as well.
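As one example of such an alternative (a sketch of my own, assuming the join columns carry the same names on both sides), you can pass the column names directly and let Spark match them:
people.join(df, on=["name", "gender", "age"], how="inner").show()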
error with adding a new list-typed column using pyspark.sql.functions.array
I get an error when I try to add a new array-typed column using pyspark.sql.functions.array:
column = ['id', 'fname', 'age', 'avg_wh']
data = [('1', 'user_1', '40', 8.5), ('2', 'user_2', '6', 1.5), ('3', 'user_3', '4', 5.5), ('10', 'user_10', '4', 2.5)]
from pyspark.sql import functions as F
df = spark.createDataFrame(data, column)
df.withColumn("lsitColumn", F.array(["1", "2", "3"]))
df.show()
The error:
raise_from(converted)
File "<string>", line 3, in raise_from
pyspark.sql.utils.AnalysisException: cannot resolve '1' given input columns: [age, avg_wh, fname, id];;
'Project [id#0, fname#1, age#2, avg_wh#3, array('1, '2, '3) AS lsitColumn#8]
+- LogicalRDD [id#0, fname#1, age#2, avg_wh#3], false
Could you please advise what the root cause of this error is? I managed to create the column by using a UDF, but I don't understand why the basic approach failed. The UDF:
extract = f.udf(lambda x: list(["1", "2", "3"]), ArrayType(StringType()))
percentielDF = df.withColumn("lsitColumn", extract("id"))
I expected to get a new DF with a list-typed column, and instead I get the error.
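For what it's worth, the failure happens because F.array interprets the plain strings "1", "2", "3" as column names; a sketch of the usual workaround (not part of the original post) is to wrap the literals in F.lit:
df.withColumn("lsitColumn", F.array(F.lit("1"), F.lit("2"), F.lit("3"))).show()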
How to Perform a Complicated Join in Pandas with Interaction Terms from Statsmodel output
This is an extension of this question: To join complicated pandas tables. I have three different interactions in a statsmodels GLM. I need a final table that pairs coefficients with other univariate analysis results. Below is an example of what the tables look like with a marital status and age interaction in the model. The final_table is the table that has the univariate results in it. I want to join coefficient values (among other statistics: p_values, standard_error, etc.) from the model results to that final table (this is model_results in the code below).
df = {'variable': ['CLded_model', 'CLded_model', 'CLded_model', 'CLded_model', 'CLded_model', 'CLded_model', 'CLded_model',
                   'married_age', 'married_age', 'married_age', 'class_cc', 'class_cc', 'class_cc', 'class_cc',
                   'class_v_age', 'class_v_age', 'class_v_age', 'class_v_age'],
      'level': [0, 100, 200, 250, 500, 750, 1000, 'M_60', 'M_61', 'S_62', 'Harley_100', 'Harley_1200', 'Sport_1500', 'other_100',
                'Street_10', 'other_20', 'Harley_15', 'Sport_10'],
      'value': [460955.7793, 955735.0532, 586308.4028, 12216916.67, 48401773.87, 1477842.472, 14587994.92, 10493740.36, 36388470.44,
                31805316.37, 123.4, 4546.50, 439854.23, 2134.4, 2304.5, 2032.30, 159.80, 22]}
final_table1 = pd.DataFrame(df)
final_table1
Join the above with the model results, which statsmodels communicates in its own format:
df2 = {'variable': ['intercept', 'driver_age_model:C(marital_status_model)[M]', 'driver_age_model:C(marital_status_model)[S]',
                    'CLded_model', 'C(class_model)[Harley]:v_age_model', 'C(class_model)[Sport]:v_age_model',
                    'C(class_model)[Street]:v_age_model', 'C(class_model)[other]:v_age_model',
                    'C(class_model)[Harley]:cc_model', 'C(class_model)[Sport]:cc_model',
                    'C(class_model)[Street]:cc_model', 'C(class_model)[other]:cc_model'],
       'coefficient': [-2.36E-14, -1.004648e-02, -1.071730e-02, 0.00174356, -0.07222433, -0.146594998, -0.168168491, -0.084420399,
                       -0.000181233, 0.000872798, 0.001229771, 0.001402564]}
model_results = pd.DataFrame(df2)
model_results
With the desired final result:
df3 = {'variable': ['intercept', 'CLded_model', 'CLded_model', 'CLded_model', 'CLded_model', 'CLded_model', 'CLded_model', 'CLded_model',
                    'married_age', 'married_age', 'married_age', 'class_cc', 'class_cc', 'class_cc', 'class_cc',
                    'class_v_age', 'class_v_age', 'class_v_age', 'class_v_age'],
       'level': [None, 0, 100, 200, 250, 500, 750, 1000, 'M_60', 'M_61', 'S_62', 'Harley_100', 'Harley_1200', 'Sport_1500', 'other_100',
                 'Street_10', 'other_20', 'Harley_15', 'Sport_10'],
       'value': [None, 460955.7793, 955735.0532, 586308.4028, 12216916.67, 48401773.87, 1477842.472, 14587994.92, 10493740.36, 36388470.44,
                 31805316.37, 123.4, 4546.50, 439854.23, 2134.4, 2304.5, 2032.30, 159.80, 22],
       'coefficient': [-2.36E-14, 0.00174356, 0.00174356, 0.00174356, 0.00174356, 0.00174356, 0.00174356, 0.00174356,
                       -1.004648e-02, -1.004648e-02, -1.071730e-02, -1.812330e-04, -1.812330e-04, 8.727980e-04, 1.402564e-03,
                       -1.681685e-01, -8.442040e-02, -1.812330e-04, -1.465950e-01]}
results = pd.DataFrame(df3)
results
When I implemented the first answer, it affected this answer.
df = {'variable': ['CLded_model', 'CLded_model', 'CLded_model', 'CLded_model', 'CLded_model', 'CLded_model', 'CLded_model',
                   'married_age', 'married_age', 'married_age'],
      'level': [0, 100, 200, 250, 500, 750, 1000, 'M_60', 'M_61', 'S_62'],
      'value': [460955.7793, 955735.0532, 586308.4028, 12216916.67, 48401773.87, 1477842.472, 14587994.92, 10493740.36, 36388470.44, 31805316.37]}
df2 = {'variable': ['intercept', 'driver_age_model:C(marital_status_model)[M]', 'driver_age_model:C(marital_status_model)[S]', 'CLded_model'],
       'coefficient': [-2.36E-14, -1.004648e-02, -1.071730e-02, 0.00174356]}
df3 = {'variable': ['intercept', 'CLded_model', 'CLded_model', 'CLded_model', 'CLded_model', 'CLded_model', 'CLded_model', 'CLded_model',
                    'married_age', 'married_age', 'married_age'],
       'level': [None, 0, 100, 200, 250, 500, 750, 1000, 'M_60', 'M_61', 'S_62'],
       'value': [None, 60955.7793, 955735.0532, 586308.4028, 12216916.67, 48401773.87, 1477842.472, 14587994.92, 10493740.36, 36388470.44, 31805316.37],
       'coefficient': [-2.36E-14, 0.00174356, 0.00174356, 0.00174356, 0.00174356, 0.00174356, 0.00174356, 0.00174356, -1.004648e-02, -1.004648e-02, -1.071730e-02]}
final_table = pd.DataFrame(df)
model_results = pd.DataFrame(df2)
results = pd.DataFrame(df3)

# Change df slightly to match what we're going to merge
final_table.loc[final_table['variable'] == 'married_age', 'variable'] = 'married_age-' + final_table.loc[final_table['variable'] == 'married_age', 'level'].str[0]

# Clean df2 and get it ready for the merge
model_results['variable'] = model_results['variable'].str.replace('driver_age_model:C\(marital_status_model\)\[', 'married_age-')\
                                                     .str.strip('\]')

# Merge
df4 = final_table.merge(model_results, how='outer', left_on='variable', right_on='variable')

# Clean
df4['variable'] = df4['variable'].str.replace('-.*', '', regex=True)
Pretty much the same thing as last time; the only difference is how you clean df2.
Python dictionary to sqlite database
I'm trying to write a dictionary into an existing sqlite database, but without success; it gives me:
sqlite3.InterfaceError: Error binding parameter 0 - probably unsupported type.
Based on my minimal example, does anybody have some useful hints? (python3)
Command to create the empty db3 anywhere on your machine:
CREATE TABLE "testTable" (
    sID  INTEGER NOT NULL UNIQUE PRIMARY KEY,
    colA REAL,
    colB TEXT,
    colC INTEGER);
And the code for putting my dictionary into the database looks like:
import sqlite3

def main():
    path = '***anywhere***/test.db3'
    data = {'sID': [1, 2, 3],
            'colA': [0.3, 0.4, 0.5],
            'colB': ['A', 'B', 'C'],
            'colC': [4, 5, 6]}
    db = sqlite3.connect(path)
    c = db.cursor()
    writeDict2Table(c, 'testTable', data)
    db.commit()
    db.close()
    return

def writeDict2Table(cursor, tablename, dictionary):
    qmarks = ', '.join('?' * len(dictionary))
    cols = ', '.join(dictionary.keys())
    values = tuple(dictionary.values())
    query = "INSERT INTO %s (%s) VALUES (%s)" % (tablename, cols, qmarks)
    cursor.execute(query, values)
    return

if __name__ == "__main__":
    main()
I had already had a look at Python : How to insert a dictionary to a sqlite database? but unfortunately I did not succeed.
You must not use a dictionary with question marks as parameter markers, because there is no guarantee about the order of the values. To handle multiple rows, you must use executemany(). And executemany() expects each item to contain the values for one row, so you have to rearrange the data:
>>> print(*zip(data['sID'], data['colA'], data['colB'], data['colC']), sep='\n')
(1, 0.3, 'A', 4)
(2, 0.4, 'B', 5)
(3, 0.5, 'C', 6)
cursor.executemany(query, zip(data['sID'], data['colA'], data['colB'], data['colC']))
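As a possible variant (a sketch, not part of the original answer), sqlite3 also accepts named placeholders, which avoids relying on column order; the rows list comprehension here is my own helper:
rows = [dict(zip(data.keys(), vals)) for vals in zip(*data.values())]   # one dict per row
cursor.executemany(
    "INSERT INTO testTable (sID, colA, colB, colC) VALUES (:sID, :colA, :colB, :colC)",
    rows)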
How to substring the column name in python
I have a column named 'comment1abc'. I am writing a piece of code to check whether a column contains a certain string 'abc':
df['col1'].str.contains('abc') == True
Now, instead of hard-coding 'abc', I want to use a substring-like operation on 'comment1abc' (to be precise, on the column name, not the column values) so that I can get the 'abc' part out of it. For example, the code below does a similar job:
x = 'comment1abc'
x[8:11]
But how do I implement that for a column name? I tried the code below but it's not working:
for col in ['comment1abc']:
    df['col123'].str.contains('col.names[8:11]')
Any suggestion will be helpful. Sample dataframe:
f = {'name': ['john', 'tom', None, 'rock', 'dick'],
     'DoB': [None, '01/02/2012', '11/22/2014', '11/22/2014', '09/25/2016'],
     'location': ['NY', 'NJ', 'PA', 'NY', None],
     'code': ['abc1xtr', '778abc4', 'a2bcx98', None, 'ab786c3'],
     'comment1abc': ['99', '99', '99', '99', '99'],
     'comment2abc': ['99', '99', '99', '99', '99']}
df1 = pd.DataFrame(data = f)
And sample code:
for col in ['comment1abc', 'comment2abc']:
    df1[col][df1['code'].str.contains('col.names[8:11]') == True] = '1'
I think the answer would be simple, like this:
for col in ['comment1abc', 'comment2abc']:
    x = col[8:11]
    df1[col][df1['code'].str.contains(x) == True] = '1'
Trying to use a column name inside .str.contains() wasn't a good idea; better to use a plain string variable.
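A slightly more idiomatic variant of the same idea (a sketch of my own, with na=False added so missing values in 'code' don't cause trouble):
for col in ['comment1abc', 'comment2abc']:
    substr = col[8:11]                                   # 'abc' from 'comment1abc'
    mask = df1['code'].str.contains(substr, na=False)    # rows whose code contains the substring
    df1.loc[mask, col] = '1'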