How to get index based on value in pyspark list - apache-spark

I have a list like below
[[[Row(cola=53273831, colb=1197), Row(cola=15245438, colb=1198)], [Row(cola=53273831, colb=1198)]]]
Here I want to search for a particular element and get the index value of it. Ex:
mylist.index((([['53273831', '1198']])))
should give me index as 1. But I'm getting error
ValueError: [['53273831', '1198']] is not in list.
This is the code I'm using:
df2=df.groupBy("order").agg(collect_list(struct(["id","node_id"])).alias("res"))
newrdd = df2.rdd.map(lambda x : (x))
order_info = newrdd.collectAsMap()
dict_values=(list(order_info.values()))
dict_keys=(list(order_info.keys()))
a=[[53273831, 1198]]
k2= dict_keys[dict_values.index(((a)))] # This line is giving me the error: ValueError: [['53273831', '1198']] is not in list
order_info dict looks like this
{10160700: [Row(id=53273831, node_id=1197), Row(id=15245438, node_id=1198)], 101600201: [Row(id=53273831, node_id=1198)]}
Can you please help me to get the index value from this struct type list?

The element is a Row object, not a list, so you need to search for the Row object itself. Also, you should take the index from mylist[0], because mylist is a nested list.
from pyspark.sql import Row
mylist = [[[Row(cola=53273831, colb=1197), Row(cola=15245438, colb=1198)], [Row(cola=53273831, colb=1198)]]]
id = mylist[0].index([Row(cola=53273831, colb=1198)])
will give you an id of 1.
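Applying the same idea to the order_info dict from the question, here is a minimal sketch (assuming the Row fields are id and node_id, as in the dict shown above):
from pyspark.sql import Row

order_info = {
    10160700: [Row(id=53273831, node_id=1197), Row(id=15245438, node_id=1198)],
    101600201: [Row(id=53273831, node_id=1198)],
}
dict_values = list(order_info.values())
dict_keys = list(order_info.keys())

# search for a list of Row objects (with integer values), not a list of lists of strings
a = [Row(id=53273831, node_id=1198)]
k2 = dict_keys[dict_values.index(a)]
print(k2)  # 101600201
Row objects compare by value (they are tuples under the hood), so list.index can find the matching entry.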

Related

Replace items like A2 as AA in the dataframe

I have a list of items, like "A2BCO6" and "ABC2O6". I want to replace them as A2BCO6 --> AABCO6 and ABC2O6 --> ABCCO6. The number of items is much larger than presented here.
My dataframe is like:
listAB:
Finctional_Group
0 Ba2NbFeO6
1 Ba2ScIrO6
3 MnPb2WO6
I created a duplicate array and tried to replace the values in the following way:
B = ["Ba2", "Pb2"]
C = ["BaBa", "PbPb"]
for i,j in range(len(B)), range(len(C)):
listAB["Finctional_Group"]= listAB["Finctional_Group"].str.strip().str.replace(B[i], C[j])
But it does not produce correct output. The output is like:
listAB:
Finctional_Group
0 PbPbNbFeO6
1 PbPbScIrO6
3 MnPb2WO6
Please suggest the necessary correction in the code.
Many thanks in advance.
For simplicity, I used the chemparse package, which seems to suit your needs.
As always, we import the required packages, in this case chemparse and pandas.
import chemparse
import pandas as pd
Then we create a pandas.DataFrame object with your example data.
df = pd.DataFrame(
    columns=["Finctional_Group"], data=["Ba2NbFeO6", "Ba2ScIrO6", "MnPb2WO6"]
)
Our parser function will use chemparse.parse_formula, which returns a dict mapping each element in a molecular formula to its count.
def parse_molecule(molecule: str) -> str:
    # initialize an empty string
    molecule_in_string = ""
    # iterate over all keys & values in the parsed-formula dict
    for key, value in chemparse.parse_formula(molecule).items():
        # repeat each element symbol `value` times and append it to the string
        molecule_in_string += key * int(value)
    return molecule_in_string
molecule_in_string now contains the molecular formula with each element repeated instead of numbered. We just need to map this function over every element in our dataframe column. For that we can do
df = df.applymap(parse_molecule)
print(df)
which returns:
0 BaBaNbFeOOOOOO
1 BaBaScIrOOOOOO
2 MnPbPbWOOOOOO
dtype: object
Source code for chemparse: https://gitlab.com/gmboyer/chemparse
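For completeness: the bug in the original loop is that for i,j in range(len(B)), range(len(C)): unpacks each range into i and j instead of pairing the two lists, so i is always 0 and j is always 1 and "Ba2" gets replaced by "PbPb". A minimal sketch of the direct fix using zip (assuming the B, C and listAB from the question):
import pandas as pd

listAB = pd.DataFrame({"Finctional_Group": ["Ba2NbFeO6", "Ba2ScIrO6", "MnPb2WO6"]})
B = ["Ba2", "Pb2"]
C = ["BaBa", "PbPb"]
# pair each pattern with its replacement and apply them one by one
for old, new in zip(B, C):
    listAB["Finctional_Group"] = listAB["Finctional_Group"].str.strip().str.replace(old, new, regex=False)
print(listAB)  # BaBaNbFeO6, BaBaScIrO6, MnPbPbWO6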

Get dict values based on a filtered list with a varying number of returned elements

I am filtering elements from a list that include '"' by the following code:
def sizes():
    new_list = [x for x in mid_item_size_one if '"' in x]
    return new_list
This will return any element with '"' as desired. Example strings below.
Random Text 0.5" Random Text
0.25" Random Text
1.5" x 0.5" Random Text
I .split the strings and then apply the above function, which returns:
['0.5"']
['0.25"']
['1.5"', '0.5"']
I now need to look up each of the elements in a dictionary and return the value from the key:value pair as new individual variables, so I can add them to a new string that contains a number of other variables. See the example of the desired result below:
val_1 = '0.5"'
val_1 = '0.5"'
val_1 = '1.5"' and val_2 = '0.5"'
Random Text val_1 Random Text
Random Text val_1 Random Text
Random Text val_1 Random Text val_2
I already have a function to look up/retrieve the value from the dictionary; however, since I started retrieving the values via the filter above, I haven't been able to figure out how to retrieve the dict value.
def item_size_one_final(size_dict):
    for x in sizes():
        for key in size_dict:
            if key in sizes():
                return size_dict[key]
            return "Hmmmm"
    return "Not Working"
The above for loops result in ['Hmmmm'] on all of it. Does anyone have any suggestions for how to do this?
@Riedler - Sure, hopefully this example helps.
size_dict = {
    '1/4"': '8mm - 1/4"',
    '1.5"': '40mm - 11/2"',
    '0.5"': '15mm - 1/2"',
}
Raw Data Input:
0.5" Pipe
1/4" Flange
1.5" x 0.5" Reducer
My company uses SAP with set item codes and item description formats, so I am taking those three descriptions and putting them into our format. This:
def sizes():
    new_list = [x for x in mid_item_size_one if '"' in x]
    return new_list
with a step in between to split each description into a list, returns:
['0.5"']
['0.25"']
['1.5"', '0.5"']
From this point, I need to run these elements through my dictionary and get the value (key:value pair). My current for loop doesn't work and I'm not sure why or what I can alter to correct it.
The final result should be:
var_1 = 15mm - 1/2"
var_1 = 8mm - 1/4"
var_1 = 40mm - 11/2", var_2 = 15mm - 1/2"
If I understand correctly, your mistake is that you are cycling over the dictionary for every value in the list of sizes. Why?
The idea of a dictionary is that when you have a value that matches a key, you can get the matching value from the dictionary.
Also, there is no need to call sizes twice to receive the same values, save the result before and then use that.
def item_size_one_final(size_dict):
    sizes_lst = sizes()
    res = []
    for x in sizes_lst:
        if x in size_dict:
            res.append(size_dict[x])
    return res
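A quick usage sketch, with sizes() inlined for the '1.5" x 0.5" Reducer' line and the size_dict from the question:
size_dict = {
    '1/4"': '8mm - 1/4"',
    '1.5"': '40mm - 11/2"',
    '0.5"': '15mm - 1/2"',
}
sizes_lst = ['1.5"', '0.5"']  # what sizes() returns for '1.5" x 0.5" Reducer'
res = [size_dict[x] for x in sizes_lst if x in size_dict]
print(res)  # ['40mm - 11/2"', '15mm - 1/2"'] -> var_1 and var_2 from the question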

How can I make my dictionary be able to be indexed by a function in python 3.x

I am trying to make a program that finds out how many integers in a list are not the integer that appears most often in that list. To do that, I have a command which creates a dictionary with every value in the list and the number of times it appears. Next, I try to create a new list with all items from the old list except the most represented value, so I can count its length. The problem is that I cannot access the most represented value in the dictionary; I get an error.
import operator
import collections
a = [7, 155, 12, 155]
dictionary = collections.Counter(a).items()
b = []
for i in a:
    if a != dictionary[max(iter(dictionary), key=operator.itemgetter(1))[0]]:
        b.append(a)
I get this error code: TypeError: 'dict_items' object does not support indexing
The variable you called dictionary is not a dict but a dict_items.
>>> type(dictionary)
<class 'dict_items'>
>>> help(dict.items)
items(...)
D.items() -> a set-like object providing a view on D's items
and sets are iterable, not indexable:
for di in dictionary: print(di) # is ok
dictionary[0] # triggers the error you saw
Note that Counter is very rich, maybe using Counter.most_common would do the trick.
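For example, a minimal sketch of that approach applied to the list from the question:
import collections

a = [7, 155, 12, 155]
# most_common(1) returns [(value, count)] for the most frequent value
most_common_value, most_common_count = collections.Counter(a).most_common(1)[0]
b = [x for x in a if x != most_common_value]
print(len(b))  # 2 -> how many integers are not the most represented value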

Remove duplicates from a list in pandas

I have a list like this:
['35UP\nPLx', '35UP']
I need a list of unique elements:
['PLx', '35UP']
I have tried this:
veh_line = list(dict.fromkeys(filter['p_Mounting_Location'].replace('\n',',', regex=True).tolist()))
This is one approach using str.splitlines with set.
Ex:
data = ['35UP\nPLx', '35UP']
result = list(set(j for i in data for j in i.splitlines()))
print(result)
Output:
['35UP', 'PLx']
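Note that set does not preserve order. If the original order matters, dict.fromkeys (which the attempt in the question already uses) keeps insertion order on Python 3.7+:
result = list(dict.fromkeys(j for i in data for j in i.splitlines()))
print(result)  # ['35UP', 'PLx']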

Look up a number inside a list within a pandas cell, and return corresponding string value from a second DF

(I've edited the first column name in the labels_df for clarity)
I have two DataFrames, train_df and labels_df. train_df has integers that map to attribute names in the labels_df. I would like to look up each number within a given train_df cell and return in the adjacent cell, the corresponding attribute name from the labels_df.
So for example, the first observation in train_df has attribute_ids of 147, 616 and 813, which map to (in the labels_df) culture::french, tag::dogs, tag::men. And I would like to place those strings inside one cell on the same row as the corresponding integers.
I've tried variations of the function below but fear I am way off:
def my_mapping(df1, df2):
    tags = df1['attribute_ids']
    for i in tags.iteritems():
        df1['new_col'] = df2.iloc[i]
    return df1
The data are originally from two csv files:
train.csv
labels.csv
I tried this from @Danny:
sample_train_df['attribute_ids'].apply(
    lambda x: [sample_labels_df[sample_labels_df['attribute_name'] == i]['attribute_id_num'] for i in x]
)
*please note - I am running the above code on samples of each DF due to run times on the original DFs.
which returned:
I hope this is what you are looking for. I am sure there's a much more efficient way using a lookup.
df['new_col'] = df['attribute_ids'].apply(lambda x: [labels_df[labels_df['attribute_id'] == i]['attribute_name'] for i in x])
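A possibly faster variant of the same idea, sketched under the assumption that labels_df has attribute_id and attribute_name columns and that each attribute_ids cell is a list of integers: build a plain dict once and look every id up in it.
# build the lookup table once instead of filtering labels_df for every id
id_to_name = dict(zip(labels_df['attribute_id'], labels_df['attribute_name']))
df['new_col'] = df['attribute_ids'].apply(lambda ids: [id_to_name.get(i) for i in ids])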
This is super ugly, and one day, hopefully sooner rather than later, I'll be able to accomplish this task in an elegant fashion. Until then, this is what got me the result I need.
split train_df['attribute_ids'] into their own cell/column
helper_df = train_df['attribute_ids'].str.split(expand=True)
combine train_df with the helper_df so I have the id column (they are photo id's)
train_df2 = pd.concat([train_df, helper_df], axis=1)
drop the original attribute_ids column
train_df2.drop(columns = 'attribute_ids', inplace=True)
rename the new columns
train_df2 = train_df2.rename(columns={0: 'attr1', 1: 'attr2', 2: 'attr3', 3: 'attr4', 4: 'attr5', 5: 'attr6',
                                      6: 'attr7', 7: 'attr8', 8: 'attr9', 9: 'attr10', 10: 'attr11'})
convert the labels_df into a dictionary
def create_file_mapping(df):
    mapping = dict()
    for i in range(len(df)):
        name, tags = df['attribute_id_num'][i], df['attribute_name'][i]
        mapping[str(name)] = tags
    return mapping
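(Presumably the mapping used in the next step is built with something like the line below; my_map is simply the name the following step assumes.)
my_map = create_file_mapping(labels_df)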
map and replace the tag numbers with their corresponding tag names
train_df3 = train_df2.applymap(lambda s: my_map.get(s) if s in my_map else s)
create a new column with each observation's tags joined into a single comma-separated value
helper1['new_col'] = helper1[helper1.columns[0:10]].apply(lambda x: ','.join(x.astype(str)), axis = 1)
