I am in Python using Pandas for this manipulation.
The shape of my data is (1488000, 10).
I need to create a loop that groups by each separate category in the 3rd column, WORKLOAD_NAME, and creates its own separate DF. The 3rd column has about 66 categories, so I need 66 correspondingly named, separate DFs as the result of this loop.
Here is what the data looks like (image omitted):
Here is what I want for each separate DF, i.e. the DFs split by WORKLOAD_NAME (image omitted):
Please note:
1) I did this for a single DF, but doing it manually 65 more times would be unsatisfactory:
EDWARDLOAD_WL = data[data.WORKLOAD_NAME == 'EDWARDLOAD']
2) I created a set of the unique names of the categories and then tried to create a loop like this:
for i in workload_set:
    [i]_WL = data[data.WORKLOAD_NAME == i]
but it didn't do anything for me. Any thoughts?
3) Lastly, I tried this .groupBy():
data_grouped = tuple(data.groupBy('WORKLOAD_NAME'))
data_grouped.head()
But it didn't work either: "AttributeError: 'DataFrame' object has no attribute 'groupBy'"
I managed to save separate files to Excel:
names = df['WORKLOAD_NAME'].unique()
for n in names:
    df[df['WORKLOAD_NAME'] == n].to_excel(n + '.xlsx')
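If the goal is to keep 66 separate DataFrames in memory rather than 66 separately named variables, the usual pandas idiom is a dictionary keyed by the group name. A minimal sketch, assuming data is the full DataFrame from the question:

# one sub-DataFrame per WORKLOAD_NAME, keyed by that name
workload_dfs = {name: group for name, group in data.groupby('WORKLOAD_NAME')}

# access any single workload by its name, e.g. the EDWARDLOAD rows
edwardload_wl = workload_dfs['EDWARDLOAD']
print(len(workload_dfs))   # should be about 66

Dynamically creating 66 named variables (EDWARDLOAD_WL and so on) is possible via globals(), but a dict gives the same name-based access without polluting the namespace.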
So I am trying to combine multiple CSV files. I have one CSV with a current part-number list of products we stock. Sorry, I can't embed images as I am new. I've seen many similar posts, but not any with both a merge and a groupby together.
current_products (image omitted)
I have another CSV with a list of image files that are associated with that part but are split up onto multiple rows. This list also has many more parts listed than we offer, so merging based on the current_products sku is important.
product_images (image omitted)
I would like to reference the first CSV for parts I currently use and combine the image files in the following format.
newestproducts (image omitted)
I get an AttributeError: 'function' object has no attribute 'to_csv', although when I just print the output in the terminal it appears to be the way I want it.
import pandas as pd

current_products = 'currentproducts.csv'
product_images = 'productimages.csv'
image_list = 'newestproducts.csv'
df_current_products = pd.read_csv(current_products)
df_product_images = pd.read_csv(product_images)
df_current_products['sku'] = df_current_products['sku'].astype(str)
df_product_images['sku'] = df_product_images['sku'].astype(str)
df_merged = pd.merge(df_current_products, df_product_images[['sku', 'images']], on='sku', how='left')
df_output = df_merged.groupby(['sku'])['images_y'].apply('&&'.join).reset_index
#print(df_output)
df_output.to_csv(image_list, index=False)
You are missing () after reset_index:
df_output = df_merged.groupby(['sku'])['images_y'].apply('&&'.join).reset_index()
Without the parentheses, df_output ends up bound to the reset_index method itself rather than to a DataFrame (just print type(df_output) to see that), so it obviously has no method named to_csv.
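For reference, a consolidated version of the corrected pipeline might look like the sketch below. It is only a sketch: the file names come from the question, 'images_y' is kept on the assumption that both input frames carry an 'images' column (otherwise the merged column is simply 'images'), and dropna() guards against NaN values introduced by the left merge, which the original code did not handle.

import pandas as pd

df_current_products = pd.read_csv('currentproducts.csv')
df_product_images = pd.read_csv('productimages.csv')

# align the join-key dtype on both sides
df_current_products['sku'] = df_current_products['sku'].astype(str)
df_product_images['sku'] = df_product_images['sku'].astype(str)

df_merged = pd.merge(df_current_products, df_product_images[['sku', 'images']],
                     on='sku', how='left')

# join all image names per sku with '&&', skipping rows with no image
df_output = (df_merged.groupby('sku')['images_y']
             .apply(lambda s: '&&'.join(s.dropna().astype(str)))
             .reset_index())

df_output.to_csv('newestproducts.csv', index=False)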
I want to create a function that creates 3 data frames and then takes the element-wise average of the three. The data frames are generated in a loop from a dictionary that was defined in an earlier step, like this:
# extracting and organizing data
def density_dataP(filenames):
datasets = ["df_1", "df_2", "df_3"]
for num in filenames:
for index in range(len(datasets)):
datasets[index] = pd.DataFrame({
#excluding the edges b/c nothing interesting happens there
"z-coordinate (nm)": mda.auxiliary.XVG.XVGReader(filenames[num]["water"])._auxdata_values[7:43:1,0],
"water": mda.auxiliary.XVG.XVGReader(filenames[num]["water"])._auxdata_values[7:43:1,1],
"acyl": mda.auxiliary.XVG.XVGReader(filenames[num]["acyl"])._auxdata_values[7:43:1,1],
"headgroups": mda.auxiliary.XVG.XVGReader(filenames[num]["head"])._auxdata_values[7:43:1,1],
"ester": mda.auxiliary.XVG.XVGReader(filenames[num]["ester"])._auxdata_values[7:43:1,1],
"protein": mda.auxiliary.XVG.XVGReader(filenames[num]["proa"])._auxdata_values[7:43:1,1]
})
master_data = (df_1 + df_2 + df_3)/3
return master_data
However, when I try to run the function with a valid input I get the following error:
---> 16 master_data = (df_1 + df_2 + df_3)/3
17 return master_data
NameError: name 'df_1' is not defined
The input to the XVGReader method needs to be the path to an XVG file, and I have those paths contained in a dictionary. The first layer of the dictionary has a number, and the second layer has the paths to the files. Each number is associated with all the paths for one of the three dataframes (i.e. all paths under key 1 are for df_1, etc.). The dictionary I am using looks roughly like this:
{1: {'water': '$PATH_TO_water1.xvg', 'acyl': '$PATH_TO_acyl1.xvg', 'head': '$PATH_TO_head1.xvg', 'ester': '$PATH_TO_ester1.xvg', 'proa': '$PATH_TO_proa1.xvg'},
 2: {'water': '$PATH_TO_water2.xvg', 'acyl': '$PATH_TO_acyl2.xvg', 'head': '$PATH_TO_head2.xvg', 'ester': '$PATH_TO_ester2.xvg', 'proa': '$PATH_TO_proa2.xvg'},
 3: {'water': '$PATH_TO_water3.xvg', 'acyl': '$PATH_TO_acyl3.xvg', 'head': '$PATH_TO_head3.xvg', 'ester': '$PATH_TO_ester3.xvg', 'proa': '$PATH_TO_proa3.xvg'}}
How do I get python to recognize the DataFrames created in this loop or at least get the final result of master_data?
In your example, "df_1" is a string in the list datasets, not a variable. If you want to access by name, then you would want datasets to be a dict with a key of df_1, etc. and the value a dataframe.
But you don't need to name items here because all you want is an average.
So I think you should simplify the function. For example, the inner loop over datasets just overwrites each slot with the same value three times, so it can be omitted. Also, since filenames is a dict whose values are the inner dicts of paths, you can iterate over those values directly.
def density_dataP(filenames):
    datasets = []
    for paths in filenames.values():   # each value is the inner dict of file paths
        datasets.append(pd.DataFrame({
            "z-coordinate (nm)": mda.auxiliary.XVG.XVGReader(paths["water"])._auxdata_values[7:43:1,0],
            "water": mda.auxiliary.XVG.XVGReader(paths["water"])._auxdata_values[7:43:1,1],
            "acyl": mda.auxiliary.XVG.XVGReader(paths["acyl"])._auxdata_values[7:43:1,1],
            "headgroups": mda.auxiliary.XVG.XVGReader(paths["head"])._auxdata_values[7:43:1,1],
            "ester": mda.auxiliary.XVG.XVGReader(paths["ester"])._auxdata_values[7:43:1,1],
            "protein": mda.auxiliary.XVG.XVGReader(paths["proa"])._auxdata_values[7:43:1,1]
        }))
    # element-wise average of the per-replicate frames
    return sum(datasets) / len(datasets)
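Usage would then be a single call, assuming filenames is the three-key dictionary of paths shown in the question:

master_data = density_dataP(filenames)
master_data.head()

Returning sum(datasets) / len(datasets) reproduces the intended (df_1 + df_2 + df_3)/3 element-wise average, provided the three frames share the same index and columns, which they do here because each is built from the same 7:43 slice and the same column labels.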
I have 2 data frames (final_combine_df & acs_df) that share a column ('CBG'). Dataframe acs_df has 2 additional columns that I want to add to the combined dataframe (acs_total_persons & acs_total_housing_units). For the 'CBG' values in acs_df that match those in final_combine_df, I want to add the acs_total_persons & acs_total_housing_units values to that row.
acs_df.head()
           CBG  acs_total_persons  acs_total_housing_units
0  10010211001             1925.0                   1013.0
1  10030114011             2668.0                   1303.0
2  10070100043              930.0                    532.0
3  10139534001             1570.0                    763.0
4  10150021023             1059.0                    379.0
I tried combine_acs_merge = pd.concat([final_combine,acs_df], sort=True) but it did not seem to match them up. I also tried combine_acs_merge = final_combine.merge(acs_df, on='CBG') and got
ValueError: You are trying to merge on object and int64 columns. If
you wish to proceed you should use pd.concat
What do I need to do here?
Note: Column acs_df['CBG'] is of type numpy.float64, not a string, but the merge should still work. Oddly, when I run print(acs_df.loc[acs_df['CBG'] == '01030114011']) it returns an empty dataframe. I created acs_df from a CSV file (see below). Is that creating the problem?
acs_df = pd.read_csv(acs_data)
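No answer is shown here, but the ValueError itself says the two 'CBG' key columns have different dtypes (object vs int64), and the empty result for '01030114011' suggests the codes are really zero-padded 11-digit strings whose leading zeros were lost when the CSV was parsed as numbers. A hedged sketch of aligning the keys before merging (names taken from the question; the int cast assumes CBG has no missing values):

# cast both key columns to zero-padded strings so the merge keys share a dtype
final_combine['CBG'] = final_combine['CBG'].astype('int64').astype(str).str.zfill(11)
acs_df['CBG'] = acs_df['CBG'].astype('int64').astype(str).str.zfill(11)

combine_acs_merge = final_combine.merge(
    acs_df[['CBG', 'acs_total_persons', 'acs_total_housing_units']],
    on='CBG', how='left')

Alternatively, passing dtype={'CBG': str} to pd.read_csv would preserve the leading zeros at load time and avoid the cast entirely.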
I am trying to create a list, which will be fed as input to the neural network of a Deep Reinforcement Learning model.
What I would like to achieve:
This list should have the properties of this code's output
vec = []
lines = open("data/" + "GSPC" + ".csv", "r").read().splitlines()
for line in lines[1:]:
    vec.append(float(line.split(",")[4]))
i.e. just a flat list of float values.
The original dataframe looks like:
Out[0]:
Close sma15
0 1.26420 1.263037
1 1.26465 1.263193
2 1.26430 1.263350
3 1.26450 1.263533
but by using df.transpose() I obtained the following:
0 1 2 3
Close 1.264200 1.264650 1.26430 1.26450
sma15 1.263037 1.263193 1.26335 1.263533
from here I would like to obtain a list grouped by column, of the type:
[1.264200, 1.263037, 1.264650, 1.263193, 1.26430, 1.26335, 1.26450, 1.263533]
I tried
x = np.array(df.values.tolist(), dtype = np.float32).reshape(1,-1)
but this gives me a float32 array with 1 row and 6 columns; how could I achieve a result that has the properties I am looking for?
From what I can understand, you just want a flattened version of the DataFrame's values. That can be done simply with the ndarray.flatten() method rather than reshaping it.
import pandas as pd

# Creating your DataFrame object
a = [[1.26420, 1.263037],
     [1.26465, 1.263193],
     [1.26430, 1.263350],
     [1.26450, 1.263533]]
df = pd.DataFrame(a, columns=['Close', 'sma15'])

df.values.flatten()
This gives array([1.2642, 1.263037, 1.26465, 1.263193, 1.2643, 1.26335, 1.2645, 1.263533]) as is (presumably) desired.
PS: I am not sure why you have not included the last row of the DataFrame as the output of your transpose operation. Is that an error?
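If an actual Python list is needed (to mirror the vec list produced by the CSV-reading loop in the question), the flattened array converts directly with tolist(); to_numpy() is also the more current accessor than .values:

flat_list = df.to_numpy().flatten().tolist()
# [1.2642, 1.263037, 1.26465, 1.263193, 1.2643, 1.26335, 1.2645, 1.263533]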
I have a DataFrame with a column named 'UserNbr' and a column named 'Spclty', which is composed of elements like this:
[['104', '2010-01-31'], ['215', '2014-11-21'], ['352', '2016-07-13']]
where there can be 0 or more elements in the list.
Some UserNbr keys appear in multiple rows, and I wish to collapse each such group into a single row such that 'Spclty' contains all the unique dicts like those in the list shown above.
To save the overhead of appending to a DataFrame, I'm appending each output row to a list instead of to the DataFrame.
My code is working, but it's taking hours to run on 0.7M rows of input. (Actually, I've never been able to keep my laptop open long enough for it to finish executing.)
Is there a better way to aggregate into a structure like this, maybe using a library that provides more data reshaping options instead of looping over UserNbr? (In R, I'd use the data.table and dplyr libraries.)
# loop over all UserNbr:
# consolidate specialty fields into dict-like sets (to remove redundant codes);
# output one row per user to new data frame
out_rows = list()
spcltycol = df_tmp.columns.get_loc('Spclty')
all_UserNbr = df_tmp['UserNbr'].unique()
for user in all_UserNbr:
    df_user = df_tmp.loc[df_tmp['UserNbr'] == user]
    if df_user.shape[0] > 0:
        open_combined = df_user.iloc[0, spcltycol]      # capture 1st row
        for row in range(1, df_user.shape[0]):          # union with any subsequent rows
            open_combined = open_combined.union(df_user.iloc[row, spcltycol])
        new_row = df_user.drop(['Spclty', 'StartDt'], axis=1).iloc[0].tolist()
        new_row.append(open_combined)
        out_rows.append(new_row)

# construct new dataframe with no redundant UserID rows:
df_out = pd.DataFrame(out_rows, columns=['UserNbr', 'Spclty'])

# convert Spclty sets to dicts:
df_out['Spclty'] = [dict(df_out['Spclty'][row]) for row in range(df_out.shape[0])]
The conversion to dict gets rid of specialties that are repeated between rows. In the output, a Spclty value should look like this:
{'104': '2010-01-31', '215': '2014-11-21', '352': '2016-07-13'}
except that there may be more key-value pairs than in any corresponding input row (resulting from aggregation over UserNbr).
I withdraw this question.
I had hoped there was an efficient way to use groupby with something else, but I haven't found any examples with a complex data structure like this one and have received no guidance.
For anyone who gets similarly stuck with very slow aggregation problems in Python, I suggest stepping up to PySpark. I am now tackling this problem with a Databricks notebook and am making headway with the pyspark.sql.window Window functions. (Now, it only takes minutes to run a test instead of hours!)
A partial solution is in the answer here:
PySpark list() in withColumn() only works once, then AssertionError: col should be Column
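For completeness, a plain-pandas sketch of the aggregation described in the original question (untested at the 0.7M-row scale, and assuming 'Spclty' holds lists of [code, date] pairs exactly as shown there): a single groupby-apply over UserNbr avoids the explicit per-user Python loop.

# one row per UserNbr; building a dict dedupes repeated codes (later dates overwrite earlier ones)
df_out = (df_tmp.groupby('UserNbr')['Spclty']
          .apply(lambda s: dict(tuple(pair) for lst in s for pair in lst))
          .reset_index())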