Spark: Find preceding value from the list - Databricks

I'm struggling to find the preceding item from a list based on a variable.
Let's say I have a list
date = ['20190501','20190502','20190503','20190507','20190508']
and I have another variable stored as:
start_date = ['20190507']
What I would like to find is the value preceding start_date and store it as previous_date, which I call further down in my code.
So in this case, previous_date would be ['20190503'].
In another case, if my start_date = ['20190503'] and the list is the same, previous_date would be ['20190502'].

You have a list, so you can use its built-in methods, like index(). It returns the 0-based index of the first item equal to its argument, something like:
%py
date = ['20190501','20190502','20190503','20190507','20190508']
start_date = ['20190507']
## So in this case, the previous_date would be ['20190503'].
x = date.index(start_date[0])
date[x - 1]
My result: '20190503'
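If start_date could be missing from the list, or could be its first element, a small guard keeps index() from raising a ValueError and keeps date[x - 1] from silently wrapping around to the last element. A minimal sketch (the previous_date helper name is my own):
date = ['20190501','20190502','20190503','20190507','20190508']
start_date = ['20190507']
def previous_date(dates, start):
    # Return the element preceding `start`, or None if there is no predecessor.
    try:
        i = dates.index(start)
    except ValueError:
        return None                              # start is not in the list
    return dates[i - 1] if i > 0 else None       # avoid wrapping to the last element
previous_date(date, start_date[0])  # '20190503'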

Related

Pandas groupby: coordinates of current group

Suppose I have a data frame
import pandas as pd
df = pd.DataFrame({'group':['A','A','B','B','C','C'],'score':[1,2,3,4,5,6]})
At first, say, I want to compute the groups' sums of scores. I usually do
def group_func(x):
    d = {}
    d['sum_scores'] = x['score'].sum()
    return pd.Series(d)
df.groupby('group').apply(group_func).reset_index()
Now suppose I want to modify group_func, but this modification requires that I know the group identity of the current input x. I tried x['group'] and x['group'].iloc[0] within the function's definition and neither worked.
Is there a way for the function group_func(x) to know the defining coordinates of the current input x?
In this toy example, say, I just want to get:
pd.DataFrame({'group':['A','B','C'],'sum_scores':[3,7,11],'name_of_group':['A','B','C']})
where obviously the last column just repeats the first one. I'd like to know how to make this last column using a function like group_func(x). Like: as group_func processes the x that corresponds to group 'A' and generates the value 3 for sum_scores, how do I extract the current identity 'A' within the local scope of group_func?
Just add .name
def group_func(x):
    d = {}
    d['sum_scores'] = x['score'].sum()
    d['group_name'] = x.name  # or: d['group_name'] = x['group'].iloc[0]
    return pd.Series(d)
df.groupby('group').apply(group_func)
Out[63]:
       sum_scores group_name
group
A               3          A
B               7          B
C              11          C
Your own attempt is fixed on the marked (commented) line above: adding the missing '' quotes, i.e. x['group'].iloc[0], works as well.
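To get back to the flat frame shown in the question (group as a regular column plus the extra name_of_group column), it should be enough to rename the key and call .reset_index() on the result; a minimal sketch building on the answer above:
def group_func(x):
    d = {}
    d['sum_scores'] = x['score'].sum()
    d['name_of_group'] = x.name  # the group key for this chunk
    return pd.Series(d)
df.groupby('group').apply(group_func).reset_index()
#   group  sum_scores name_of_group
# 0     A           3             A
# 1     B           7             B
# 2     C          11             C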

Iterate and return based on priority - Python 3

I have a df with multiple rows. What I need is to check each value in a column for a specific substring and return the matching record. I have a set of rules, which take priority based on order.
My Sample df:
                           file_name       fil_name
0  02qbhIPSYiHmV_sample_file-MR-job1  02qbhIPSYiHmV
1  02qbhIPSYiHmV_sample_file-MC-job2  02qbhIPSYiHmV
2     02qbhIPSYiHmV_sample_file-job3  02qbhIPSYiHmV
For me, MC takes first priority: if MC is present in the file_name value, take that record. If MC is not there, then take the record that has MR in it. If there is no MC or MR, then just take whatever is there, in my case just the third row.
I came up with a function like this,
def choose_best_record(df_t):
    file_names = df_t['file_name']
    for idx, fn in enumerate(file_names):
        lw_fn = fn.lower()
        if '-mc-' in lw_fn:
            get_mc_row = df_t.iloc[idx:idx+1]
            print("Returning MC row")
            return get_mc_row
        else:
            if '-mr-' in lw_fn:
                get_mr_row = df_t.iloc[idx:idx+1]
                print('Returning MR row')
                return get_mr_row
            else:
                normal_row = df_t.iloc[idx:idx+1]
                print('Returning normal row')
                return normal_row
However, this does not behave the way I want: I need the MC row (row index 1), but instead it returns the MR row.
It only works if my rows happen to be ordered with the MC row first. How can I change my function to produce the output I need?
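The early return inside the loop is what breaks the priority: the function decides on the very first row it sees instead of scanning the whole frame for '-mc-' before falling back to '-mr-'. One way to express the rule order explicitly is one pass per priority level; a minimal sketch, not tested against the real data:
def choose_best_record(df_t):
    names = df_t['file_name'].str.lower()
    for token in ('-mc-', '-mr-'):       # priority order: MC first, then MR
        mask = names.str.contains(token)
        if mask.any():
            return df_t[mask].head(1)    # first row matching this priority level
    return df_t.head(1)                  # no MC or MR: take whatever is there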

How to get index based on value in pyspark list

I have a list like below
[[[Row(cola=53273831, colb=1197), Row(cola=15245438, colb=1198)], [Row(cola=53273831, colb=1198)]]]
Here I want to search for a particular element and get the index value of it. Ex:
mylist.index((([['53273831', '1198']])))
should give me index as 1. But I'm getting error
ValueError: [['53273831', '1198']] is not in list.
This is the code I'm using:
df2 = df.groupBy("order").agg(collect_list(struct(["id","node_id"])).alias("res"))
newrdd = df2.rdd.map(lambda x: (x))
order_info = newrdd.collectAsMap()
dict_values = list(order_info.values())
dict_keys = list(order_info.keys())
a = [[53273831, 1198]]
k2 = dict_keys[dict_values.index(a)]  # This line is giving me the error: ValueError: [['53273831', '1198']] is not in list
order_info dict looks like this
{10160700: [Row(id=53273831, node_id=1197), Row(id=15245438, node_id=1198)], 101600201: [Row(id=53273831, node_id=1198)]}
Can you please help me to get the index value from this struct type list?
The element is a Row object, not a list, so you need to search for the Row object itself. Also, you should take the index from mylist[0], because mylist is a nested (multi-level) list.
from pyspark.sql import Row
mylist = [[[Row(cola=53273831, colb=1197), Row(cola=15245438, colb=1198)], [Row(cola=53273831, colb=1198)]]]
id = mylist[0].index([Row(cola=53273831, colb=1198)])
will give you an id of 1.
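Applied to the order_info dictionary from the question, the same idea means searching for a list containing a Row (with the id/node_id field names and integer values actually stored there), not a nested list of strings. A minimal sketch, assuming the dict shown above:
from pyspark.sql import Row
order_info = {10160700: [Row(id=53273831, node_id=1197), Row(id=15245438, node_id=1198)],
              101600201: [Row(id=53273831, node_id=1198)]}
dict_keys = list(order_info.keys())
dict_values = list(order_info.values())
target = [Row(id=53273831, node_id=1198)]    # a list of Rows with ints, not strings
k2 = dict_keys[dict_values.index(target)]    # -> 101600201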

How to populate a dataframe column based on the value of another column

Suppose I have 3 dataframe variables: res_df_union is the main dataframe, and df_res and df_vacant are sub-dataframes created from res_df_union. They all share 2 columns called uniqueid and vacant_has_res. My goal is to compare the uniqueid column values in df_res and df_vacant, and if they match, to assign vacant_has_res in res_df_union the value 1.
*Note: I am using GeoPandas (a gpd DataFrame) instead of plain pandas because I am working with spatial data, but the concept is the same.
res_df_union = gpd.read_file(union, layer=union_feat)
df_parc_res = res_df_union[res_df_union.Parc_Res == 1]
unq_id_res = df_parc_res.uniqueid.unique()
df_parc_vacant = res_df_union[res_df_union.Parc_Vacant == 1]
unq_id_vac = df_parc_vacant.uniqueid.unique()
vacant_res_ids = []
for id_a in unq_id_vac:
    for id_b in unq_id_res:
        if id_a == id_b:
            vacant_res_ids.append(id_a)
The code up to this point works. I have a list of uniqueid's that match. Now I just want to look for those unique id's in res_df_union and then assign res_df_union['vacant_has_res'] = 1. When I run the following, it either causes my IDE to crash, or never finishes running (after several hours). What am I doing wrong and is there a more efficient way to do this?
def u_row(row, id_val):
    if row['uniqueid'] == id_val:
        return 1

for item in res_df_union['uniqueid']:
    if item in vacant_res_ids:
        res_df_union['Has_Res_Association'] = res_df_union.apply(lambda row: u_row(row, item), axis=1)
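For what it's worth, the loop above re-applies u_row to the entire frame once per matching uniqueid, which is why it takes so long. A vectorized membership test does the assignment in a single pass; a minimal sketch, assuming vacant_res_ids and res_df_union as built above:
# Mark every row whose uniqueid appears in the matched id list.
mask = res_df_union['uniqueid'].isin(vacant_res_ids)
res_df_union.loc[mask, 'vacant_has_res'] = 1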

How do I build a string of variable names?

I'm trying to build a string that contains all attributes of a class object. The object name is jsonData and it has a few attributes, some of them being
jsonData.Serial,
jsonData.InstrumentSerial,
jsonData.Country
I'd like to build a string that has those attribute names in the format of this:
'Serial InstrumentSerial Country'
End goal is to define a schema for a Spark dataframe.
I'm open to alternatives, as long as I know order of the string/object because I need to map the schema to appropriate values.
You'll have to be careful about filtering out unwanted attributes, but try this:
' '.join([x for x in dir(jsonData) if '__' not in x])
That filters out all the "magic methods" like __init__ or __new__.
To include those, do
' '.join(dir(jsonData))
These take advantage of Python's built-in dir() function, which returns a list of all attributes of an object.
I don't quite understand why you want to group the attribute names in a single string.
You could simply keep a list of attribute names, since the order of a Python list is preserved.
attribute_names = [x for x in dir(jsonData) if '__' not in x]
From there you can create your dataframe. If you don't need to specify the Spark types, you can just do:
df = spark.createDataFrame(data, schema=attribute_names)  # spark being your SparkSession
You could also create a StructType and specify the types in your schema.
I guess that you are going to have a list of jsonData records that you want to treat as Rows.
Let's consider it as a list of objects; the logic would still be the same.
You can do that as follows:
from operator import attrgetter

my_object_list = [
    jsonDataClass(Serial=1, InstrumentSerial='TDD', Country='France'),
    jsonDataClass(Serial=2, InstrumentSerial='TDI', Country='Suisse'),
    jsonDataClass(Serial=3, InstrumentSerial='TDD', Country='Grece')]

def build_record(obj, attr_names):
    return attrgetter(*attr_names)(obj)
So the data argument referred to previously would be constructed as:
data = [build_record(x, attribute_names) for x in my_object_list]
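Putting the two answers together, the records and the attribute-name schema would be combined roughly like this. A sketch only: it assumes an active SparkSession bound to the usual name spark and the my_object_list defined above:
attribute_names = [x for x in dir(my_object_list[0]) if '__' not in x]  # e.g. ['Country', 'InstrumentSerial', 'Serial']
data = [build_record(x, attribute_names) for x in my_object_list]       # tuples in the same attribute order
df = spark.createDataFrame(data, schema=attribute_names)
df.show()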
