How to extract several dataframes from dictionary - python-3.x

I'm currently trying to extract several dataframes from a dictionary. The problem is that the number of dataframes will vary: sometimes I'll have two dataframes in there and sometimes 30.
At the beginning I create a dictionary (dict_of_exceptions) from a dataframe (exceptions_df). In this dictionary I'll have several dataframes, depending on how many different 'Source Well' values I have. With the current code I can extract the first dataframe from the dictionary, which is j:
dict_of_exceptions = {k: v for k, v in exceptions_df.groupby('Source Well')}
print(dict_of_exceptions)
for k in dict_of_exceptions.keys():
    j = dict_of_exceptions[k]
Could someone help me modify the last line to go through the dictionary and extract each dataframe (and name them like the corresponding key)?

I think I get your intention, but I could not really read it from your code. Currently, as #cyrilb38 stated in the comments, your loop is overwriting j, so you would only see the result of the last iteration. Rather than transforming it, work with the dataframe directly; I also think (I may be wrong) that what you are calling a dataframe is really a row. Replacing a groupby object with a dict is not what you want, it just prolongs the process for nothing.
If, for example, you want to see the info for Well X only, try this:
exceptions_df[exceptions_df['Source Well'] == 'Well X']
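If the goal really is one dataframe per 'Source Well', a minimal sketch of iterating over the groups directly (using exceptions_df and the 'Source Well' column from the question; frames_by_well and 'Well X' are hypothetical names for illustration) could look like this; keeping the frames in a dict keyed by well name is usually preferable to creating one variable per well:
# iterate over the groups directly; each group_df is the sub-dataframe for one well
for well_name, group_df in exceptions_df.groupby('Source Well'):
    print(well_name)        # e.g. 'Well X'
    print(group_df.head())  # the rows belonging to that well

# if separately named objects are needed, a dict keyed by well name does the same job
frames_by_well = {well: df for well, df in exceptions_df.groupby('Source Well')}
frames_by_well['Well X']    # the dataframe for 'Well X', assuming such a well exists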

Related

I'm trying to remove certain words from a column on each row in my dataframe

I'm still trying to understand how pandas works, please bear with me. In this exercise, I'm trying to access a particular column ['Without Stop Words'] on each row, which holds a list of words. I wish to remove certain words from each row of that column. The words to be removed have been specified in a dictionary called stop_words_dict. Here's my code, but the dataframe seems to be unchanged after running it.
def stop_words_remover(df):
    # your code here
    df['Without Stop Words'] = df['Tweets'].str.lower().str.split()
    for i, r in df.iterrows():
        for word in df['Without Stop Words']:
            if word in stop_words_dict.items():
                df['Without Stop Words'][i] = df['Without Stop Words'].str.remove(word)
    return df
The input and expected output were shown as images in the original post.
In Pandas, it's generally a bad idea to loop over your dataframe row by row to try to change rows. Instead, try using methods like .apply().
An example for stopwords, together with list comprehension:
test['Tweets'].apply(lambda x: [item for item in x if item not in stop_words_dict.items()])
See https://stackoverflow.com/a/29523440/12904151 for more context.
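A fuller sketch along those lines, assuming (since the post does not show it) that stop_words_dict has the shape {'stopwords': [...]} with the actual stop words in a list:
import pandas as pd

# hypothetical stop-word dictionary; the real one is defined elsewhere in the exercise
stop_words_dict = {'stopwords': ['a', 'an', 'the', 'is', 'on', 'to']}

def stop_words_remover(df):
    stop_words = set(stop_words_dict['stopwords'])
    # split each tweet into lowercase words, then drop the stop words from each list
    df['Without Stop Words'] = df['Tweets'].str.lower().str.split().apply(
        lambda words: [w for w in words if w not in stop_words]
    )
    return df

# usage
df = pd.DataFrame({'Tweets': ['The cat is on the mat', 'An apple a day']})
print(stop_words_remover(df))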

Iterate over 4 pandas data frame columns and store them into 4 lists with one for loop instead of 4 for loops

I am currently working with pandas structures in Python. I wrote a function that extracts data from a pandas data frame and stores it in lists. The code is working, but I feel like there is a part that I could write with one for loop instead of four for loops. I will give you an example below. The idea of this part of the code is to extract four columns from a pandas data frame into four lists. I did it with 4 separate for loops, but I want to have one loop that does the same thing.
col1, col2, col3, col4 = [], [], [], []
for j in abc['col1']:
    col1.append(j)
for k in abc['col2']:
    col2.append(k)
for l in abc['col3']:
    col3.append(l)
for n in abc['col4']:
    col4.append(n)
And my idea is to write one for loop that does all of that. I tried to do something like this, but it doesn't work.
col1, col2, col3, col4 = [], [], [], []
for j, k, l, n in abc[['col1', 'col2', 'col3', 'col4']]:
    col1.append(j)
    col2.append(k)
    col3.append(l)
    col4.append(n)
Can you help me with this idea to wrap four for loops into the one? I would appreciate your help!
You don't need to use loops at all; you can just convert each column into a list directly.
list_1 = df["col"].to_list()
Have a look at this previous question.
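Applied to the four columns from the question (column names assumed from the post), that idea collapses to a single line:
# build all four lists at once, one to_list() call per column
col1, col2, col3, col4 = (abc[c].to_list() for c in ['col1', 'col2', 'col3', 'col4'])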
Treating a pandas dataframe like a list usually works, but is very bad for performance. I'd consider using the iterrows() function instead.
This would work as in the following example:
col1, col2, col3, col4 = [], [], [], []
for index, row in df.iterrows():
    col1.append(row['col1'])
    col2.append(row['col2'])
    col3.append(row['col3'])
    col4.append(row['col4'])
It's probably easier to use pandas .values and then numpy.ndarray.tolist():
col = ['col1', 'col2', 'col3']
data = [None] * len(col)
for i in range(len(col)):
    data[i] = df[col[i]].values.tolist()
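A dict comprehension keyed by column name is an equivalent, slightly tidier variant of the same idea:
# one list per column, looked up by name instead of by position
data = {c: df[c].values.tolist() for c in col}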

How to save tuples output from for loop to DataFrame Python

I have some data: 33k rows x 57 columns.
In some columns there is data which I want to translate with a dictionary.
I have done the translation, but now I want to write the translated data back into my data set.
I have a problem with saving the tuple output from the for loop.
I am using tuples to create a good translation. .join and .append are not working in my case. I tried many approaches but without any success.
Looking for any advice.
data = pd.read_csv(filepath, engine="python", sep=";", keep_default_na=False)
for index, row in data.iterrows():
    row["translated"] = tuple(slownik.get(znak) for znak in row["1st_service"])
I just want print(data["1st_service"]) to show the translated data, not the data from before the for loop.
First of all, if your csv doesn't already have a 'translated' column, you'll have to add it:
import numpy as np
data['translated'] = np.nan
The problem is that the row object you're trying to write to is only a view of the dataframe, not the dataframe itself. Plus, you're missing square brackets for your list comprehension, if I'm understanding what you're doing. So change your last line to:
data.loc[index, "translated"] = tuple([slownik.get(znak) for znak in row["1st_service"]])
and you'll get a tuple written into that one cell.
In future, posting the exact error message you're getting is very helpful!
I have managed it; below is the working code:
data = pd.read_csv(filepath, engine="python", sep=";", keep_default_na=False)
data.columns = []
slownik = dict([ ])
trans = ' '
for index, row in data.iterrows():
    trans += str(tuple([slownik.get(znak) for znak in row["1st_service"]]))
data['1st_service'] = trans.split(')(')
data.to_csv("out.csv", index=False)
Can you tell me if it is well done?
Maybe there is a faster way to do it?
I am doing this for 12 columns in one for loop, as shown above.
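A faster and simpler variant, offered as a sketch since the real column names and the contents of slownik are not shown in the post, is to let .apply build the tuple per cell instead of concatenating and re-splitting a string:
# translate each cell directly into a tuple, without the string round-trip
translated_columns = ['1st_service']  # hypothetical list; the post mentions 12 such columns
for column in translated_columns:
    data[column] = data[column].apply(lambda cell: tuple(slownik.get(znak) for znak in cell))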

Slow loop aggregating rows and columns

I have a DataFrame with a column named 'UserNbr' and a column named 'Spclty', which is composed of elements like this:
[['104', '2010-01-31'], ['215', '2014-11-21'], ['352', '2016-07-13']]
where there can be 0 or more elements in the list.
Some UserNbr keys appear in multiple rows, and I wish to collapse each such group into a single row such that 'Spclty' contains all the unique dicts like those in the list shown above.
To save overhead on appending to a DataFrame, I'm appending each output row to a list instead of to the DataFrame.
My code is working, but it's taking hours to run on 0.7M rows of input. (Actually, I've never been able to keep my laptop open long enough for it to finish executing.)
Is there a better way to aggregate into a structure like this, maybe using a library that provides more data reshaping options instead of looping over UserNbr? (In R, I'd use the data.table and dplyr libraries.)
# loop over all UserNbr:
# consolidate specialty fields into dict-like sets (to remove redundant codes);
# output one row per user to new data frame
out_rows = list()
spcltycol = df_tmp.columns.get_loc('Spclty')
all_UserNbr = df_tmp['UserNbr'].unique()
for user in all_UserNbr:
    df_user = df_tmp.loc[df_tmp['UserNbr'] == user]
    if df_user.shape[0] > 0:
        open_combined = df_user.iloc[0, spcltycol]  # capture 1st row
        for row in range(1, df_user.shape[0]):  # union with any subsequent rows
            open_combined = open_combined.union(df_user.iloc[row, spcltycol])
        new_row = df_user.drop(['Spclty', 'StartDt'], axis=1).iloc[0].tolist()
        new_row.append(open_combined)
        out_rows.append(new_row)
# construct new dataframe with no redundant UserID rows:
df_out = pd.DataFrame(out_rows, columns=['UserNbr', 'Spclty'])
# convert Spclty sets to dicts:
df_out['Spclty'] = [dict(df_out['Spclty'][row]) for row in range(df_out.shape[0])]
The conversion to dict gets rid of specialties that are repeated between rows. In the output, a Spclty value should look like this:
{'104': '2010-01-31', '215': '2014-11-21', '352': '2016-07-13'}
except that there may be more key-value pairs than in any corresponding input row (resulting from aggregation over UserNbr).
I withdraw this question.
I had hoped there was an efficient way to use groupby with something else, but I haven't found any examples with a complex data structure like this one and have received no guidance.
For anyone who gets similarly stuck with very slow aggregation problems in Python, I suggest stepping up to PySpark. I am now tackling this problem with a Databricks notebook and am making headway with the pyspark.sql.window Window functions. (Now, it only takes minutes to run a test instead of hours!)
A partial solution is in the answer here:
PySpark list() in withColumn() only works once, then AssertionError: col should be Column
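For completeness, a plain-pandas groupby sketch of the aggregation the question describes (untested against the original data, and assuming 'Spclty' holds lists of [code, date] pairs as shown above) might look like this:
import pandas as pd

# toy data in the shape described in the question
df_tmp = pd.DataFrame({
    'UserNbr': [1, 1, 2],
    'Spclty': [[['104', '2010-01-31'], ['215', '2014-11-21']],
               [['215', '2014-11-21'], ['352', '2016-07-13']],
               [['104', '2010-01-31']]],
})

def combine_spclty(lists_of_pairs):
    # merge the [code, date] pairs from every row of the group into one dict,
    # which also removes entries repeated between rows
    combined = {}
    for pairs in lists_of_pairs:
        combined.update(dict(pairs))
    return combined

df_out = df_tmp.groupby('UserNbr')['Spclty'].apply(combine_spclty).reset_index()
print(df_out)
# UserNbr 1 -> {'104': '2010-01-31', '215': '2014-11-21', '352': '2016-07-13'}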

Selecting a specific row from an rpy2 DataFrame

My data frame is survey data that I have got from a .csv file. One of the columns is age and I am looking to remove all respondents under 18 years of age. I'll then need to isolate age groups (18-24, 25-35, etc) into their own dataframes that I can do frequency distributions for.
The R code is simple enough:
x.sub <- subset(x.df, y > 2)
But I can't figure out how to use the r() function to get my dataframe variable from python into an R statement. It feels as though there ought to be a .subset() function in the rpy2 DataFrame class. But if it exists, I can't find it.
Using rpy2 2.2.0-dev (should be the same with 2.1.x)
from rpy2.robjects.vectors import DataFrame
dataf = DataFrame.from_csvfile("my/file.csv")
dataf_subset = dataf.rx(dataf.rx2("age").ro >= 18, True)
That exact example is not in the documentation (and maybe it should be there), but its constituent elements are: extracting elements, and R operators on vectors.
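Building on that same pattern (a sketch only, assuming the 'age' column name from the question and a CSV path of your own), the 18-24 group could be isolated by filtering twice:
from rpy2.robjects.vectors import DataFrame

dataf = DataFrame.from_csvfile("my/file.csv")

# keep respondents aged 18 or over, then narrow down to the 18-24 group
adults = dataf.rx(dataf.rx2("age").ro >= 18, True)
group_18_24 = adults.rx(adults.rx2("age").ro <= 24, True)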
