Look up a number inside a list within a pandas cell, and return corresponding string value from a second DF - python-3.x

(I've edited the first column name in the labels_df for clarity)
I have two DataFrames, train_df and labels_df. train_df has integers that map to attribute names in the labels_df. I would like to look up each number within a given train_df cell and return in the adjacent cell, the corresponding attribute name from the labels_df.
So fore example, the first observation in train_df has attribute_ids of 147, 616 and 813 which map to (in the labels_df) culture::french, tag::dogs, tag::men. And I would like to place those strings inside one cell on the same row as the corresponding integers.
I've tried variations of the function below but fear I am wayyy off:
def my_mapping(df1, df2):
tags = df1['attribute_ids']
for i in tags.iteritems():
df1['new_col'] = df2.iloc[i]
return df1
The data are originally from two csv files:
train.csv
labels.csv
I tried this from #Danny :
sample_train_df['attribute_ids'].apply(lambda x: [sample_labels_df[sample_labels_df['attribute_name'] == i]
['attribute_id_num'] for i in x])
*please note - I am running the above code on samples of each DF due to run times on the original DFs.
which returned:

I hope this is what you are looking for. i am sure there's a much more efficient way using look up.
df['new_col'] = df['attribute_ids'].apply(lambda x: [labels_df[labels_df['attribute_id'] == i]['attribute_name'] for i in x])

This is super ugly and one day, hopefully sooner than later, i'll be able to accomplish this task in an elegant fashion though, until then, this is what got me the result I need.
split train_df['attribute_ids'] into their own cell/column
helper_df = train_df['attribute_ids'].str.split(expand=True)
combine train_df with the helper_df so I have the id column (they are photo id's)
train_df2 = pd.concat([train_df, helper_df], axis=1)
drop the original attribute_ids column
train_df2.drop(columns = 'attribute_ids', inplace=True)
rename the new columns
train_df2.rename(columns = {0:'attr1', 1:'attr2', 2:'attr3', 3:'attr4', 4:'attr5', 5:'attr6',
6:'attr7', 7:'attr8', 8:'attr9', 9:'attr10', 10:'attr11'})
convert the labels_df into a dictionary
def create_file_mapping(df):
mapping = dict()
for i in range(len(df)):
name, tags = df['attribute_id_num'][i], df['attribute_name'][i]
mapping[str(name)] = tags
return mapping
map and replace the tag numbers with their corresponding tag names
train_df3 = train_df2.applymap(lambda s: my_map.get(s) if s in my_map else s)
create a new column of the observations tags in a list of concatenated values
helper1['new_col'] = helper1[helper1.columns[0:10]].apply(lambda x: ','.join(x.astype(str)), axis = 1)

Related

Fill csv data lists with for loop

I am manipulating .csv files. I have to loop through each column of numeric data in the file and enter them into different lists. The code I have is the following:
import csv
salto_linea = "\n"
csv_file = "02_CSV_data1.csv"
with open(csv_file, 'r') as csv_doc:
doc_reader = csv.reader(csv_doc, delimiter = ",")
mpg = []
cylinders = []
displacement = []
horsepower = []
weight = []
acceleration = []
year = []
origin = []
lt = [mpg, cylinders, displacement, horsepower,
weight, acceleration, year, origin]
for i,ln in zip(range (0,9),lt):
print(f"{i} -> {ln}")
for row in doc_reader:
y = row[i]
ln.append(y)
In the loop, try to have range() serve me as an index so that in the nested for loop, it loops through the first column (the first element of each row in the csv) and feeds it into the first list of 'lt'. The problem I have is that I go through the data column and enter it, but range() continues to advance in the first loop, ending the nesting, thinking that it would iterate i = 1, and that new value of 'i' would enter again. the nested loop traversing the next column and vice versa. I also tried it with some other while loop to iterate a counter that adds to each iteration and serves as an index but it didn't work either.
How I can fill the sublists in 'lt' with the data which is inside the csv file??
without seing the ontents of the CSV file itself, the best way of reading the data into a table is with the pandas module, which can be done in one line of code.
import pandas as pd
df = pd.read_csv('02_CSV_data1.csv')
this would have read all the data into a dataframe and you can work with this.
Alternatively, ammend the for loop like this:
for row in doc_reader:
for i, ln in enumerate(lt):
ln.append(row[i])
for bigger data, i would prefer pandas which has vectorised methods.

How to avoid the use of DataFrame.iterrows() in the following situations

I have the following code,
temp = dict()
for _, row in df_A.iterrows():
if row["anchor"] not in temp: # anchor, id, and name are columns in df_A
temp[row["anchor"]] = [row["id"], row["name"]]
''' Will do the same on df_B, df_B, etc... '''
for index, row in df_Main.iterrows():
if row["anchor"] in temp:
self.df_Main.at[index, "id"] = temp[row["anchor"]][0]
self.df_Main.at[index, "name"] = temp_map[row["anchor"]][1]
But here, df_Main can have more than 1 million rows and df_A, df_B, etc... can have 50,000 to 100,000 entries. In this case, will it be inefficient to use iterrows()?
Also, how can I do the following operations in a single line? I am fairly new to python and I don't know how to achieve my requirement using lambda and apply.
In the first loop, you are looking for the first occurrence of anchor and their corresponding id and nameand save them in a dictionary. You can achieve it by this line:
df = df_A.drop_duplicates('A', keep='first').set_index('A')
The second loop can be optimized like this:
bool_index = df_main['A'].isin(df.index)
values_A = df_main[bool_index]['A']
df_main.loc[bool_index,'B'] = df.loc[values_A,'B'].values
df_main.loc[bool_index,'C'] = df.loc[values_A,'C'].values

split time series dataframe when value change

I'have a Dataframe, that correspond to lat/long of an object in movement.
This object go from one place to another, and I created a column that reference what place he is at every second.
I want to split that dataframe, so when the object go in one place, the leave to another, I'll have two separate dataframe.
'None' mean he is between places
My actual code :
def cut_df2(df):
df_copy = df.copy()
#check if change of place
df_copy['changed'] = df_copy['place'].ne(df_copy['place'].shift().bfill()).astype(int)
last = 0
dfs= []
for num, line in df_copy.iterrows():
if line.changed:
dfs.append(df.iloc[last:num,:])
last = num
# Check if last line was in a place
if line.place != 'None':
dfs.append(df.iloc[last:,:])
df_outs= []
# Delete empty dataframes
for num, dataframe in enumerate(dfs):
if not dataframe.empty :
if dataframe.reset_index().place.iloc[0] != 'None':
df_outs.append(dataframe)
return df_outs
It won't work on big dataset, but work on simple examples and I've no idea why, anyone can help me?
Try using this instead:
https://www.geeksforgeeks.org/split-pandas-dataframe-by-rows/
iloc can be a good way to split a dataframe
df1 = datasX.iloc[:, :72]
df2 = datasX.iloc[:, 72:]

How do I extract specific values from a DataFrame and add them to a list?

Sample DataFrame:
id date price
93 6021501535 2014-07-25 430000
93 6021501535 2014-12-23 700000
313 4139480200 2014-06-18 1384000
313 4139480200 2014-12-09 1400000
first_list = []
second_list = []
I need to add the first price that corresponds to a specific ID to the first list and the second price for that same ID to the second list.
Example:
first_list = [430,000, 1,384,000]
second_list = [700,000, 1,400,000]
After which, I'm going to plot the values from both lists on a lineplot to compare the difference in price between the first and second list.
I've tried doing this with groupby and loc and I kept running into errors. I then tried iterating over each row using a simple for loop but ran into more problems...
I would appreciate some help.
Based on your question I think it's not necessary to save them into a list because you could also store them somewhere else (e.g. another DataFrame) and plot them. The functions below should help with filling wherever you want to store your data.
def date(your_id):
first_date = df.loc[(df['id']==your_id)].iloc[0,1]
second_date = df.loc[(df['id']==your_id)].iloc[1,1]
return first_date, second_date
def price(your_id):
first_date, second_date = date(your_id)
price_first_date = df.loc[(df['id']==6021501535) & (df['date']==first_date)].iloc[0,2]
price_second_date = df.loc[(df['id']==6021501535) & (df['date']==second_date)].iloc[0,2]
return price_first_date, price_second_date
price_first_date, price_second_date = price(6021501535)
If now for example you want to store your data in a new df you could do something like:
selected_ids = [6021501535, 4139480200]
new_df = pd.DataFrame(index=np.arange(1,len(selected_ids)+1), columns=['price_first_date', 'price_second_date'])
for i in range(len(selected_ids)):
your_id = selected_ids[i]
new_df.iloc[i, 0], new_df.iloc[i, 1] = price(your_id)
new_df then contains all 'first date prices' in the first column and all 'second date prices' in the second column. Plotting should work out.

How to save tuples output form for loop to DataFrame Python

I have some data 33k rows x 57 columns.
In some columns there is a data which I want to translate with dictionary.
I have done translation, but now I want to write back translated data to my data set.
I have problem with saving tuples output from for loop.
I am using tuples for creating good translation. .join and .append is not working in my case. I was trying in many case but without any success.
Looking for any advice.
data = pd.read_csv(filepath, engine="python", sep=";", keep_default_na=False)
for index, row in data.iterrows():
row["translated"] = (tuple(slownik.get(znak) for znak in row["1st_service"]))
I just want to see in print(data["1st_service"] a translated data not the previous one before for loop.
First of all, if your csv doesn't already have a 'translated' column, you'll have to add it:
import numpy as np
data['translated'] = np.nan
The problem is the row object you're trying to write to is only a view of the dataframe, it's not the dataframe itself. Plus you're missing square brackets for your list comprehension, if I'm understanding what you're doing. So change your last line to:
data.loc[index, "translated"] = tuple([slownik.get(znak) for znak in row["1st_service"]])
and you'll get a tuple written into that one cell.
In future, posting the exact error message you're getting is very helpful!
I have manage it, below working code:
data = pd.read_csv(filepath, engine="python", sep=";", keep_default_na=False)
data.columns = []
slownik = dict([ ])
trans = ' '
for index, row in data.iterrows():
trans += str(tuple([slownik.get(znak) for znak in row["1st_service"]]))
data['1st_service'] = trans.split(')(')
data.to_csv("out.csv", index=False)
Can you tell me if it is well done?
Maybe there is an faster way to do it?
I am doing it for 12 columns in one for loop, as shown up.

Resources