Read indices stored in a list - python-3.x

I have a dataset which is stored as a list. I want to be able to retrieve different pieces of the data and alter them. The indices of pieces I need are stored in a different list.
For example:
data_list = [[[1,2],[3,4]],[5,6]]
indices = [[0,0,1],[1,0]]
In this case I might want to retrieve data_list[0][0][1] and data_list[1][0] and change them to value 6, but I cannot simply do data_list[indices[0]] = 6. Is there a good way to do this?

You can walk down the nested lists one index at a time until you reach the element you need.
Set a variable to reference data_list, then loop over all but the last index of each path, rebinding the variable to the sublist at each step until it points to the innermost list.
Then assign the new value at the last index of that innermost list.
data_list = [[[1,2],[3,4]],[5,6]]
indices = [[0,0,1],[1,0]]
for *path, final in indices:
    val = data_list
    for i in path:
        val = val[i]
    val[final] = 6
print(data_list)
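If you need this in more than one place, the same idea can be wrapped in two small helpers. This is only a sketch: the names get_by_path and set_by_path are made up for illustration, and it assumes every path is a non-empty list of valid indices.

```python
from functools import reduce
from operator import getitem

def get_by_path(nested, path):
    """Return the element reached by following each index in path."""
    return reduce(getitem, path, nested)

def set_by_path(nested, path, value):
    """Assign value at the position described by path (modifies nested in place)."""
    *parents, last = path
    get_by_path(nested, parents)[last] = value

data_list = [[[1, 2], [3, 4]], [5, 6]]
indices = [[0, 0, 1], [1, 0]]

for path in indices:
    set_by_path(data_list, path, 6)

print(data_list)  # [[[1, 6], [3, 4]], [6, 6]]
```

An invalid path raises the usual IndexError or TypeError, so wrap the call in try/except if the index lists are not trusted.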

Related

How to avoid the use of DataFrame.iterrows() in the following situations

I have the following code,
temp = dict()
for _, row in df_A.iterrows():
    if row["anchor"] not in temp: # anchor, id, and name are columns in df_A
        temp[row["anchor"]] = [row["id"], row["name"]]
''' Will do the same on df_B, df_B, etc... '''
for index, row in df_Main.iterrows():
    if row["anchor"] in temp:
        df_Main.at[index, "id"] = temp[row["anchor"]][0]
        df_Main.at[index, "name"] = temp[row["anchor"]][1]
But here, df_Main can have more than 1 million rows and df_A, df_B, etc... can have 50,000 to 100,000 entries. In this case, will it be inefficient to use iterrows()?
Also, how can I do the following operations in a single line? I am fairly new to python and I don't know how to achieve my requirement using lambda and apply.
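In the first loop, you look for the first occurrence of each anchor and save its corresponding id and name in a dictionary. You can achieve the same with this line (the snippets here use 'A', 'B', and 'C' as column names, presumably standing in for anchor, id, and name):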
df = df_A.drop_duplicates('A', keep='first').set_index('A')
The second loop can be optimized like this:
bool_index = df_main['A'].isin(df.index)
values_A = df_main[bool_index]['A']
df_main.loc[bool_index,'B'] = df.loc[values_A,'B'].values
df_main.loc[bool_index,'C'] = df.loc[values_A,'C'].values
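For reference, here is a self-contained variant of the same idea using the column names from the question. It is a sketch on made-up toy data, and it swaps the .loc[...].values lookup for Series.map, which is one of several ways to do this.

```python
import pandas as pd

# Hypothetical toy frames; the real ones come from the question.
df_A = pd.DataFrame({
    "anchor": ["a", "b", "a"],
    "id": [1, 2, 3],
    "name": ["x", "y", "z"],
})
df_Main = pd.DataFrame({
    "anchor": ["b", "c", "a"],
    "id": [0, 0, 0],
    "name": ["", "", ""],
})

# Keep the first row per anchor and index by anchor for fast lookups.
lookup = df_A.drop_duplicates("anchor", keep="first").set_index("anchor")

# Only rows whose anchor exists in the lookup table get updated.
mask = df_Main["anchor"].isin(lookup.index)
for col in ["id", "name"]:
    df_Main.loc[mask, col] = df_Main.loc[mask, "anchor"].map(lookup[col])

print(df_Main)
```

Rows whose anchor is missing from df_A (the "c" row here) keep their original values.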

Iterating thru a not so ordinary Dictionary in python 3.x

This may be an ordinary issue about iterating through a dict. Below is the imovel.txt file, whose content is as follows:
{'Andar': ['primeiro', 'segundo', 'terceiro'], 'Apto': ['101','201','301']}
As you can see, this is not an ordinary dictionary of simple key-value pairs: the list under one key holds what should be the keys, and the list under the other key holds the values.
My code is:
#!/usr/bin/python
def load_dict_from_file():
    f = open('../txt/imovel.txt','r')
    data=f.read()
    f.close()
    return eval(data)
thisdict = load_dict_from_file()
for key,value in thisdict.items():
    print(value)
and yields :
['primeiro', 'segundo', 'terceiro']
['101', '201', '301']
I would like to print a key,value pair like
{'primeiro':'101', 'segundo':'201', 'terceiro':'301'}
Given the txt file above, is this possible?
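You should parse the file with a proper parser rather than eval, for example the built-in json module (note that JSON requires double-quoted strings, so the file as shown would need its quotes changed first). Either way, you'll still have the same structure.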
There are a few things you can do.
If you know both of the base key names ('Andar' and 'Apto'), you can do it as a one-line dict comprehension by zipping the values together.
# what you'll get from the file
thisdict = {'Andar': ['primeiro', 'segundo', 'terceiro'], 'Apto': ['101','201','301']}
# One line dict comprehension
newdict = {key: value for key, value in zip(thisdict['Andar'], thisdict['Apto'])}
print(newdict)
If you don't know the names of the keys, you could call next on an iterator assuming they're the first 2 lists in your structure.
# what you'll get from the file
thisdict = {'Andar': ['primeiro', 'segundo', 'terceiro'], 'Apto': ['101','201','301']}
# create an iterator of the values since the keys are meaningless here
iterator = iter(thisdict.values())
# the first group of values are the keys
keys = next(iterator, None)
# and the second are the values
values = next(iterator, None)
# zip them together and have dict do the work for you
newdict = dict(zip(keys, values))
print(newdict)
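On the parsing side, the file content shown in the question uses single quotes, so json.loads would reject it as written. A minimal sketch of a safer alternative to eval is ast.literal_eval from the standard library. The text variable below stands in for what f.read() would return.

```python
import ast

# Same text as the imovel.txt content shown in the question.
text = "{'Andar': ['primeiro', 'segundo', 'terceiro'], 'Apto': ['101','201','301']}"

# literal_eval only accepts Python literals (dicts, lists, strings, numbers, ...),
# so it does not run arbitrary code the way eval can.
thisdict = ast.literal_eval(text)

newdict = dict(zip(thisdict['Andar'], thisdict['Apto']))
print(newdict)  # {'primeiro': '101', 'segundo': '201', 'terceiro': '301'}
```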
As other folks have noted, that looks like JSON, and it'd probably be easier to parse it and read through it as such. But if that's not an option for some reason, you can loop through your dictionary this way, provided the lists at each key are all the same length:
for i, _ in enumerate(thisdict[list(thisdict)[0]]):
    ith_values = [elem[i] for elem in thisdict.values()]
    print(ith_values)
If they're all different lengths, then you'll need to put some logic to check for that and print a blank or do some error handling for looking past the end of the list.
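One way to handle the different-length case without writing the checks by hand is itertools.zip_longest, which pads the shorter lists. A small sketch, using a made-up dictionary (ragged) whose lists deliberately differ in length:

```python
from itertools import zip_longest

# Hypothetical data: 'Apto' is one element shorter than 'Andar'.
ragged = {'Andar': ['primeiro', 'segundo', 'terceiro'], 'Apto': ['101', '201']}

# zip_longest pads the shorter lists with fillvalue instead of stopping early.
rows = [list(group) for group in zip_longest(*ragged.values(), fillvalue='')]

for row in rows:
    print(row)
```

The fillvalue of '' gives the "print a blank" behaviour. Use None instead if you need to tell a missing entry apart from an empty string.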

Dynamically generating an object's name in a pandas column using a for loop (fuzzywuzzy)

Low-level python skills here (learned programming with SAS).
I am trying to apply a series of fuzzy string matching (fuzzywuzzy lib) formulas on pairs of strings, stored in a base dataframe. Now I'm conflicted about the way to go about it.
Should I write a loop that creates a specific dataframe for each formula and then append all these sub-dataframes in a single one? The trouble with this approach seems to be that, since I cannot dynamically name the sub-dataframe, the resulting value gets overwritten at each turn of the loop.
Or should I create one dataframe in a single loop, taking my formulas' names and expressions from a dict? This approach gives me the same problem as above.
Here is my formulas dict:
# ratios dict: all ratios names and functions
ratios = {"ratio": fuzz.ratio,
          "partial ratio": fuzz.partial_ratio,
          "token sort ratio": fuzz.token_sort_ratio,
          "partial token sort ratio": fuzz.partial_token_sort_ratio,
          "token set ratio": fuzz.token_set_ratio,
          "partial token set ratio": fuzz.partial_token_set_ratio
          }
And here is the loop I am currently sweating over:
# for loop iterating over ratios
for r, rn in ratios.items():
    # fuzzing function definition
    def do_the_fuzz(row):
        return rn(row[base_column], row[target_column])
    # new base df containing ratio data and calculations for current loop turn
    df_out1 = pd.DataFrame(data = df_out, columns = [base_column, target_column, 'mesure', 'valeur', 'drop'])
    df_out1['mesure'] = r
    df_out1['valeur'] = df_out.apply(do_the_fuzz, axis = 1)
It gives me the same problem, namely that the 'mesure' column gets overwritten, and I end up with a column full of the last value (here: 'partial token set ratio').
My overall problem is that I cannot understand if and how I can dynamically name dataframes, columns or values in a python loop (or if I'm even supposed to do it).
I've been trying to come up with a solution myself for too long and I just can't figure it out. Any insight would be very much appreciated! Many thanks in advance!
I would create a dataframe that is updated at each loop iteration:
final_df = pd.DataFrame()
for r, rn in ratios.items():
    ...
    df_out1 = pd.DataFrame(data = df_out, columns = [base_column, target_column, 'mesure', 'valeur', 'drop'])
    df_out1['mesure'] = r
    df_out1['valeur'] = df_out.apply(do_the_fuzz, axis = 1)
    final_df = pd.concat([final_df, df_out1], axis=0)
I hope this can help you.
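A runnable sketch of the same accumulate-then-concatenate idea follows. To keep it self-contained it does not import fuzzywuzzy: the two scorer functions are stand-ins built on difflib, and the toy df_out and its column names are assumptions for illustration. It also collects the per-ratio frames in a list and calls pd.concat once at the end, which avoids copying the growing frame on every iteration.

```python
from difflib import SequenceMatcher

import pandas as pd

# Stand-in scorers (assumptions); the question uses fuzzywuzzy's fuzz.* functions here.
def similarity(a, b):
    return round(100 * SequenceMatcher(None, a, b).ratio())

def lower_similarity(a, b):
    return similarity(a.lower(), b.lower())

ratios = {"ratio": similarity, "lowercase ratio": lower_similarity}

# Hypothetical base frame with one pair of strings per row.
df_out = pd.DataFrame({"base": ["Apple", "pear"], "target": ["apple", "pear"]})

frames = []
for name, func in ratios.items():
    # A fresh copy per ratio, so earlier results are never overwritten.
    part = df_out[["base", "target"]].copy()
    part["mesure"] = name
    part["valeur"] = [func(a, b) for a, b in zip(part["base"], part["target"])]
    frames.append(part)

# One long frame: a block of rows for each ratio name.
final_df = pd.concat(frames, ignore_index=True)
print(final_df)
```

There is no need to generate variable names dynamically: the ratio name lives in the 'mesure' column, and the list holds the intermediate frames.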

Display and save contents of a data frame with multi-dimensional array elements

I have created and updated a pandas dataframe to fill details of a section of an image and its corresponding features.
slice_sq_dim = 200
df_slice = pd.DataFrame({'Sample': str,
                         'Slice_ID':int,
                         'Slice_Array': [np.zeros((slice_sq_dim,slice_sq_dim))],
                         'Interface_Array': [np.zeros((slice_sq_dim,slice_sq_dim))],
                         'Slice_Array_Threshold': [np.zeros((slice_sq_dim,slice_sq_dim))]})
I added individual elements of this dataframe by updating the value of each cell through row by row iteration. Once I have completed my dataframe (with around 200 rows), I cannot seem to display more than the first row of its contents. I assume that this is due to the inclusion of multi-dimensional numpy arrays (image slices) as a component. I have also exported this data into a JSON file so that it can act as an input file during the next run. The following code shows how I exactly tried this and also how I fill my dataframe.
Slices_data_file = os.path.join(os.getcwd(), "Slices_dataframe.json")
if os.path.isfile(Slices_data_file):
    print("Using the saved data of slices from previous run..")
    df_slice = pd.read_json(Slices_data_file, orient='records')
else:
    print("No previously saved slice data found..")
    no_of_slices = 20
    for index, row in df_files.iterrows(): # df_files is the previous dataframe with image path details
        path = row['image_path']
        slices, slices_thresh, slices_interface = slice_image(path, slice_sq_dim, no_of_slices)
        # each of the output is a list of 20 image slices
        for n, arr in enumerate(slices):
            indx = (indx_row - 1 ) * no_of_slices + n
            df_slice.Sample[indx] = path
            df_slice.Slice_ID[indx] = n+1
            df_slice.Slice_Array[indx] = arr
            df_slice.Interface_Array[indx] = slices_interface[n]
            df_slice.Slice_Array_Threshold[indx] = slices_thresh[n]
    df_slice.to_json(Slices_data_file, orient='records')
I would like to do the following things:
Complete the dataframe with the possibility to add further columns of scalar values
View the dataframe normally with multiple rows and iterate using functions such as df_slice.iterrows() which is currently not supported
Save and reuse the database so as to avoid the repeated and time-consuming operations
Any advice or better suggestions?
After some searching, I found a few topics that helped. pd.Series was very appropriate here. Also, I think there was a "SettingWithCopyWarning" somewhere along the way that I chose to ignore. The final code is given below:
Slices_data_file = os.path.join(os.getcwd(), "Slices_dataframe.json")
if os.path.isfile(Slices_data_file):
    print("Using the saved data of slices from previous run..")
    df_slice = pd.read_json(Slices_data_file, orient = 'columns')
else:
    print("No previously saved slice data found..")
    # one plain Python list per column, filled while slicing
    Sample_col = []
    Slice_ID_col = []
    Slice_Array_col = []
    Interface_Array_col = []
    Slice_Array_Threshold_col = []
    no_of_slices = 20
    slice_sq_dim = 200
    df_slice = pd.DataFrame({'Sample': str,
                             'Slice_ID':int,
                             'Slice_Array': [],
                             'Interface_Array': [],
                             'Slice_Array_Threshold': []})
    for index, row in df_files.iterrows():
        path = row['image_path']
        slices, slices_thresh, slices_interface = slice_image(path, slice_sq_dim, no_of_slices)
        for n, arr in enumerate(slices):
            Sample_col.append(Image_Unique_ID)
            Slice_ID_col.append(n+1)
            Slice_Array_col.append(arr)
            Interface_Array_col.append(slices_interface[n])
            Slice_Array_Threshold_col.append(slices_thresh[n])
        print("Slicing -> ", Image_Unique_ID, " Complete")
    # assign each collected list as a whole column
    df_slice['Sample'] = pd.Series(Sample_col)
    df_slice['Slice_ID'] = pd.Series(Slice_ID_col)
    df_slice['Slice_Array'] = pd.Series(Slice_Array_col)
    df_slice['Interface_Array'] = pd.Series(Interface_Array_col)
    df_slice['Slice_Array_Threshold'] = pd.Series(Slice_Array_Threshold_col)
    df_slice.to_json(os.path.join(os.getcwd(), "Slices_dataframe.json"), orient='columns')
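As a possible alternative for the save-and-reuse step, a DataFrame whose cells hold NumPy arrays can be written with to_pickle and read back with read_pickle, which keeps the arrays as arrays. The sketch below is not the code used above: the records, the tiny 2x2 arrays, and the temporary file path are all made up for illustration.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical stand-in for the slices that slice_image produces in the question.
records = [
    {"Sample": "img_001", "Slice_ID": n + 1, "Slice_Array": np.full((2, 2), n)}
    for n in range(3)
]

# Building the frame from a list of records gives one row per slice.
df_slice = pd.DataFrame(records)

# Pickle stores the ndarray objects as they are (only load pickles you trust).
path = os.path.join(tempfile.mkdtemp(), "slices.pkl")
df_slice.to_pickle(path)
restored = pd.read_pickle(path)

print(restored[["Sample", "Slice_ID"]])
```

Scalar columns can be added later in the usual way, for example restored["area"] = ..., and iterrows() works on the restored frame like on any other.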

Look up a number inside a list within a pandas cell, and return corresponding string value from a second DF

(I've edited the first column name in the labels_df for clarity)
I have two DataFrames, train_df and labels_df. train_df has integers that map to attribute names in the labels_df. I would like to look up each number within a given train_df cell and return in the adjacent cell, the corresponding attribute name from the labels_df.
So, for example, the first observation in train_df has attribute_ids of 147, 616 and 813, which map (in the labels_df) to culture::french, tag::dogs, tag::men. I would like to place those strings inside one cell on the same row as the corresponding integers.
I've tried variations of the function below but fear I am wayyy off:
def my_mapping(df1, df2):
    tags = df1['attribute_ids']
    for i in tags.iteritems():
        df1['new_col'] = df2.iloc[i]
    return df1
The data are originally from two csv files:
train.csv
labels.csv
I tried this from #Danny :
sample_train_df['attribute_ids'].apply(lambda x: [sample_labels_df[sample_labels_df['attribute_name'] == i]['attribute_id_num'] for i in x])
*please note - I am running the above code on samples of each DF due to run times on the original DFs.
which returned:
I hope this is what you are looking for. I am sure there's a much more efficient way using a lookup.
df['new_col'] = df['attribute_ids'].apply(lambda x: [labels_df[labels_df['attribute_id'] == i]['attribute_name'] for i in x])
This is super ugly, and one day, hopefully sooner rather than later, I'll be able to accomplish this task in an elegant fashion. Until then, this is what got me the result I need.
split train_df['attribute_ids'] into their own cell/column
helper_df = train_df['attribute_ids'].str.split(expand=True)
combine train_df with the helper_df so I have the id column (they are photo id's)
train_df2 = pd.concat([train_df, helper_df], axis=1)
drop the original attribute_ids column
train_df2.drop(columns = 'attribute_ids', inplace=True)
rename the new columns
train_df2 = train_df2.rename(columns = {0:'attr1', 1:'attr2', 2:'attr3', 3:'attr4', 4:'attr5', 5:'attr6',
                                        6:'attr7', 7:'attr8', 8:'attr9', 9:'attr10', 10:'attr11'})
convert the labels_df into a dictionary
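def create_file_mapping(df):
    mapping = dict()
    for i in range(len(df)):
        name, tags = df['attribute_id_num'][i], df['attribute_name'][i]
        mapping[str(name)] = tags
    return mapping
# my_map is used in the next step; presumably it was built from labels_df like this
my_map = create_file_mapping(labels_df)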
map and replace the tag numbers with their corresponding tag names
train_df3 = train_df2.applymap(lambda s: my_map.get(s) if s in my_map else s)
create a new column of the observations tags in a list of concatenated values
helper1['new_col'] = helper1[helper1.columns[0:10]].apply(lambda x: ','.join(x.astype(str)), axis = 1)
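For comparison, here is a shorter sketch of the same lookup that builds one dictionary and applies it per row. The two miniature frames are made up, using the ids and names from the question. It assumes attribute_ids is a space-separated string, as the str.split step above suggests, and the column names attribute_id and attribute_names are illustrative.

```python
import pandas as pd

# Hypothetical miniature versions of the two frames.
labels_df = pd.DataFrame({
    "attribute_id": [147, 616, 813],
    "attribute_name": ["culture::french", "tag::dogs", "tag::men"],
})
train_df = pd.DataFrame({
    "id": ["photo_1", "photo_2"],
    "attribute_ids": ["147 616 813", "616"],
})

# Plain dict lookup: attribute id -> attribute name.
id_to_name = dict(zip(labels_df["attribute_id"], labels_df["attribute_name"]))

def names_for(id_string):
    # Unknown ids are kept as their original text rather than raising.
    return [id_to_name.get(int(tok), tok) for tok in id_string.split()]

train_df["attribute_names"] = train_df["attribute_ids"].apply(names_for)
print(train_df)
```

If a single string per row is preferred over a list, join the result with ",".join(...) inside names_for.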
