Expanding a DataFrame from a column full of dictionaries - python-3.x

I have a big DataFrame where one of the columns (event_properties) contains one dictionary per row (with different keys).
I would like to expand the DataFrame with one column per unique key, with a None value in the rows where that particular key does not exist.
I tried something like:
p = df.copy()
y = df.event_properties.values.tolist()
event_properties = pd.DataFrame(y)
df = pd.concat([p, event_properties])
But there is something wrong with this code because if I do:
df[['event_properties', 'author']][:2]
I see something like:
event_properties author
120 {'author': 'Larsson, Stieg', 'product_id': '6a... NaN
128 {'author': 'Larsson, Stieg', 'product_id': '40... NaN
and this is clearly wrong.
I am quite annoyed by this totally unexpected behaviour; can anyone clarify why this is happening and how I should fix it?
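The likely cause: pd.concat concatenates along axis=0 by default, so event_properties gets stacked below p as new rows rather than attached as new columns, and its fresh RangeIndex does not align with row labels like 120 and 128. A minimal sketch of a fix, assuming event_properties really holds plain dicts (note that missing keys come back as NaN rather than None):
import pandas as pd

p = df.copy()
# one column per unique key, aligned to the original row labels
event_properties = pd.DataFrame(p.event_properties.tolist(), index=p.index)
# concatenate side by side (axis=1) instead of stacking rows
df = pd.concat([p, event_properties], axis=1)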

Related

pandas: group near similar string data

I am trying to use groupby on a string column whose values are nearly similar, and to get a count of them:
for example:
col A    col year    col C
abc      2009        no plan today
abc2     2009        wrong plan today
I'd like to get a count of 2 in this case.
I thought of something like:
df.groupby(['col year', 'col C'], as_index = False)
but this would not work, considering there's a difference in the col C values as well. What could possibly be an elegant way of handling this?
I saw an answer with cosine similarity here: Calculate similarity between list of words
and perhaps this could be used somehow?
I will point you in the right direction, but will leave the actual implementation to you.
You can use the Levenshtein distance. There is a Python package for this that takes two strings as input and returns a number describing how "close" those strings are. Simple as that:
from Levenshtein import distance
text_distance = distance(text_1, text_2)
Then you iterate over the rows of the DataFrame and, for each row, check whether the Levenshtein distance between the current text value and the text of any previously built group is less than a given threshold. If it is, the row is appended to that group; if not, a new group is created with the current row as its first member.
The threshold is something you need to experiment with to find the value that gives you the best results; a minimal sketch of the loop follows.
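Here is such a sketch, assuming the python-Levenshtein package is installed and that the text lives in 'col C' (the threshold of 5 is only a placeholder you would tune):
from Levenshtein import distance

threshold = 5
group_ids = []        # group label assigned to each row
representatives = []  # first text seen in each group

for text in df["col C"]:
    for group_id, rep in enumerate(representatives):
        # attach the row to the first existing group that is close enough
        if distance(text, rep) < threshold:
            group_ids.append(group_id)
            break
    else:
        # no close group found: start a new one with this row as its representative
        representatives.append(text)
        group_ids.append(len(representatives) - 1)

df["group"] = group_ids
print(df.groupby("group").size())  # count of rows per near-similar group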

Can I use pandas.DataFrame.apply() to make more than one column at once?

I have been given a function that takes the values in a row in a dataframe and returns a bunch of additional values.
It looks a bit like this:
my_func(row) -> (value1, value2, value3... valueN)
I'd like each of these values to be assigned to new columns in my dataframe. Can I use DataFrame.apply() to add multiple columns in one go, or do I have to add the columns one at a time?
It's obvious how I can use apply to generate one column at a time:
from eunrg_utils.testing_functions.random_dataframe import random_dataframe
df = random_dataframe(columns=2,rows=4)
df["X"] = df.apply(axis=1, func=lambda row:(row.A + row.B))
df["Y"] = df.apply(axis=1, func=lambda row:(row.A - row.B))
But what if the two columns I am adding are something that are more easily calculated together? In this case, I already have a function that gives me everything I need in one shot. I'd rather not have to call it multiple times or add a load of caching.
Is there a syntax I can use that would allow me to use apply to generate 2 columns at the same time? Something like this, but less broken:
# Broken Pseudo-code
from eunrg_utils.testing_functions.random_dataframe import random_dataframe
df = random_dataframe(columns=2,rows=4)
df["X"], df["Y"] = df.apply(axis=1, func=lambda row:(row.A + row.B, row.B-row.A))
What is the correct way to do something like this?
You can assign a list of column names; result_type="expand" spreads each returned tuple across those columns:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(10, size=(2, 2)), columns=list('AB'))
df[["X", "Y"]] = df.apply(lambda row: (row.A + row.B, row.B - row.A), axis=1, result_type="expand")
print(df)
   A  B   X  Y
0  2  8  10  6
1  4  3   7 -1
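If you are on a pandas version that predates result_type, a commonly used alternative (a sketch, not part of the original answer) is to unpack the tuples with zip:
df["X"], df["Y"] = zip(*df.apply(lambda row: (row.A + row.B, row.B - row.A), axis=1))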

Pandas: get first datetime-in and last datetime-out in one row

First of all, thanks in advance; there are always answers here, so we learn a lot from the experts. I'm a noob using pandas (it's super handy for what I've tried and achieved so far).
I have this data, handed to me like this (I don't have access to the origin), 20k rows or more sometimes. The 'in' and 'out' columns may have one or more entries per date, so after an 'in' the next entry could be an 'out' or another 'in', which leaves me a blank cell; that's the problem (see first image).
I want to filter the first datetime-in into one column and the last datetime-out into another, with the two in one row (see second image); the data comes in a CSV file. I am currently doing this work manually with LibreOffice Calc (yeap).
So far I have tried locating and relocating, merging, grouping... nothing works for me, so I feel frustrated. Would you please lend me a hand? Here is a minimal sample of the file.
By the way, English is not my language. Thanks so much!
First:
out_column = df["out"].tolist()
This gives you all the out dates as a list; we will need that later.
in_column = df["in"].tolist()  # 'in' is a reserved word in Python, so I suggest renaming that column
I treat NaT as NaN (null) in this case.
Now we have to find what rows to keep, which we do by going through the in column and only keeping the rows after a NaN (and the first one):
filtered_df = []
tracker = False
for index, element in enumerate(in_column):  # enumerate(in) would be a syntax error, since `in` is a keyword
    if index == 0 or tracker is True:
        filtered_df.append(True)
        tracker = False
        continue
    if pd.isna(element):  # covers both NaN and NaT; `element is None` would miss them
        tracker = True
    filtered_df.append(False)
Then you filter your df by this Boolean List:
df = df[filtered_df]
Now you fix up your out column by removing the null values:
out_column = [date for date in out_column if pd.notna(date)]  # `null` is not a Python name; this drops the NaN/NaT entries
Last but not least you overwrite your old out column with the new one:
df["out"] = out_column

Dropna is surprisingly not working?

It is super weird. I was going to drop the NaNs for the features with less than 5% missing values. After I dropped them, I wanted to see whether it had worked, and I surprisingly found out that I couldn't drop the NaNs for these variables; there are even more NaN values now??
Please tell me where I am wrong.
Thank you so much.
Why the code doesn't work: training_set has more columns than drop_list. You tell pandas to delete the rows in training_set[drop_list], but that expression is a new DataFrame holding only the drop_list columns, so the rows are still kept in training_set, which contains more columns. Basically, when the modified columns are aligned back with the columns outside drop_list, the dropped rows come back as missing values.
How to fix it? Use masks. You can build a mask of the rows that survive:
# keep only the rows where every column in drop_list is non-null
mask = training_set[drop_list[0]].notna()
for col in drop_list[1:]:
    mask = mask & training_set[col].notna()
training_set = training_set[mask]
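For the record, this is exactly what the subset argument of dropna does, as long as you assign the result back:
training_set = training_set.dropna(subset=drop_list)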

Adding zero columns to a dataframe

I have a strange problem which I am not able to figure out. I have a dataframe subset that looks like this.
In the dataframe, I add "zero" columns using the following code:
subset['IRNotional'] = pd.DataFrame(numpy.zeros(shape=(len(subset), 1)))
subset['IPNotional'] = pd.DataFrame(numpy.zeros(shape=(len(subset), 1)))
and I get a result similar to this.
Now when I do similar things to another dataframe I get zero columns with a mix of NaN and zero rows, as shown below. This is really strange.
subset['IRNotional'] = pd.DataFrame(numpy.zeros(shape=(len(subset), 1)))
subset['IPNotional'] = pd.DataFrame(numpy.zeros(shape=(len(subset), 1)))
I don't understand why I sometimes get zeros and other times get either NaNs or a mix of NaNs and zeros. Please help if you can.
Thanks
I believe you need assign with a dictionary to set the new column names:
subset = subset.assign(**dict.fromkeys(['IRNotional','IPNotional'], 0))
#you can define each column separately
#subset = subset.assign(**{'IRNotional': 0, 'IPNotional': 1})
Or simpler:
subset['IRNotional'] = 0
subset['IPNotional'] = 0
Now when I do similar things to another dataframe I get zero columns with a mix of NaN and zero rows, as shown below. This is really strange.
I think the problem is different index values, so it is necessary to create the same index; otherwise the non-matching indices produce NaNs:
subset['IPNotional'] = pd.DataFrame(numpy.zeros(shape=(len(subset), 1)), index=subset.index)
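A minimal sketch of why this happens, using a hypothetical subset whose index is not the default RangeIndex (e.g. after filtering):
import numpy as np
import pandas as pd

subset = pd.DataFrame({'A': [1, 2, 3]}, index=[10, 20, 30])

# the new values carry a fresh RangeIndex (0, 1, 2); no label matches
# subset's index, so alignment fills the whole column with NaN
subset['IRNotional'] = pd.Series(np.zeros(len(subset)))

# with index=subset.index the labels match and every row gets its zero
subset['IPNotional'] = pd.Series(np.zeros(len(subset)), index=subset.index)
print(subset)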
