So I am trying to combine multiple CSV files. I have one CSV with the current part number list of products we stock. Sorry, I can't embed images as I am new. I've seen many similar posts, but none with both a merge and a groupby together.
current_products
I have another CSV with a list of image files that are associated with each part but are split across multiple rows. This list also has many more parts than we offer, so merging based on the current_products sku is important.
product_images
I would like to reference the first CSV for the parts I currently use and combine the image files into the following format.
newestproducts
I get an AttributeError: 'function' object has no attribute 'to_csv', although when I just print the output in the terminal it looks the way I want.
import pandas as pd

current_products = 'currentproducts.csv'
product_images = 'productimages.csv'
image_list = 'newestproducts.csv'
df_current_products = pd.read_csv(current_products)
df_product_images = pd.read_csv(product_images)
df_current_products['sku'] = df_current_products['sku'].astype(str)
df_product_images['sku'] = df_product_images['sku'].astype(str)
df_merged = pd.merge(df_current_products, df_product_images[['sku','images']], on = 'sku', how='left')
df_output = df_merged.groupby(['sku'])['images_y'].apply('&&'.join).reset_index
#print(df_output)
df_output.to_csv(image_list, index=False)
You are missing () after reset_index:
df_output = df_merged.groupby(['sku'])['images_y'].apply('&&'.join).reset_index()
Without the parentheses, df_output ends up being the method itself rather than a DataFrame (just print type(df_output) to see that), so of course it has no method named to_csv.
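With the parentheses added, the whole script might look like this (a sketch: it assumes, as in the original, that both CSVs carry an images column so the merged one is suffixed images_y, and it fills NaN for skus with no images so the '&&'.join doesn't fail):

import pandas as pd

current_products = 'currentproducts.csv'
product_images = 'productimages.csv'
image_list = 'newestproducts.csv'

df_current_products = pd.read_csv(current_products)
df_product_images = pd.read_csv(product_images)

# Cast both keys to str so the merge doesn't silently miss rows
df_current_products['sku'] = df_current_products['sku'].astype(str)
df_product_images['sku'] = df_product_images['sku'].astype(str)

df_merged = pd.merge(df_current_products, df_product_images[['sku', 'images']],
                     on='sku', how='left')

# The left merge leaves NaN for skus with no images; fill so the join doesn't raise
df_merged['images_y'] = df_merged['images_y'].fillna('')

# Note the () on reset_index: without it, df_output is the bound method itself
df_output = df_merged.groupby('sku')['images_y'].apply('&&'.join).reset_index()
df_output.to_csv(image_list, index=False)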
We're trying to make an automated program that can take multiple Excel files, each with multiple sheets, from a folder and append them into one data frame.
Our problem is that we're not quite sure how to do this so that the process is as automatic as possible. And since the sheet names vary, we can't hard-code them.
All of the files are *.xlsx, and the code has to load an arbitrary number of files.
We have tried different approaches, primarily using pandas, but we can't seem to append them into one data frame.
import numpy as np
import pandas as pd
import glob
all_data = pd.DataFrame()
for f in glob.glob("*.xlsx"):
    df = pd.read_excel(f)
    all_data = all_data.append(df, ignore_index=True)
# now save the data frame
writer = pd.ExcelWriter('output.xlsx')
all_data.to_excel(writer)
writer.save()
(We also tried sheet1 = xls.parse(0) on a pd.ExcelFile, but that only reads a single sheet.)
We expect to have one data frame with all data, such that we can use data and extract different features and make statistics.
The documentation of pandas.read_excel states:
sheet_name : str, int, list, or None, default 0
Strings are used for sheet names. Integers are used in zero-indexed sheet positions. Lists of strings/integers are used to request multiple sheets. Specify None to get all sheets.
Available cases:
Defaults to 0: 1st sheet as a DataFrame
1: 2nd sheet as a DataFrame
"Sheet1": Load sheet with name “Sheet1”
[0, 1, "Sheet5"]: Load first, second and sheet named “Sheet5” as a dict of DataFrame
None: All sheets.
I would suggest trying the last option: pd.read_excel(f, sheet_name=None). Otherwise you might loop and pass sheet indexes rather than the actual sheet names; that way you don't need prior knowledge of the .xlsx files.
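For example, a sketch along those lines, assuming the sheets share compatible columns (the output file name is a placeholder):

import glob
import pandas as pd

frames = []
for f in glob.glob("*.xlsx"):
    # sheet_name=None returns {sheet_name: DataFrame} for every sheet in the file
    sheets = pd.read_excel(f, sheet_name=None)
    frames.extend(sheets.values())

# One DataFrame with all sheets from all files
all_data = pd.concat(frames, ignore_index=True)
all_data.to_excel('output.xlsx', index=False)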
I have a DataFrame with a column named 'UserNbr' and a column named 'Spclty', which is composed of elements like this:
[['104', '2010-01-31'], ['215', '2014-11-21'], ['352', '2016-07-13']]
where there can be 0 or more elements in the list.
Some UserNbr keys appear in multiple rows, and I wish to collapse each such group into a single row such that 'Spclty' contains all the unique entries like those in the list shown above.
To save the overhead of appending to a DataFrame, I'm appending each output row to a list instead of to the DataFrame.
My code is working, but it's taking hours to run on 0.7M rows of input. (Actually, I've never been able to keep my laptop open long enough for it to finish executing.)
Is there a better way to aggregate into a structure like this, maybe using a library that provides more data-reshaping options instead of looping over UserNbr? (In R, I'd use the data.table and dplyr libraries.)
# loop over all UserNbr:
# consolidate specialty fields into dict-like sets (to remove redundant codes);
# output one row per user to new data frame
out_rows = list()
spcltycol = df_tmp.columns.get_loc('Spclty')
all_UserNbr = df_tmp['UserNbr'].unique()
for user in all_UserNbr:
    df_user = df_tmp.loc[df_tmp['UserNbr'] == user]
    if df_user.shape[0] > 0:
        open_combined = df_user.iloc[0, spcltycol]  # capture 1st row
        for row in range(1, df_user.shape[0]):  # union with any subsequent rows
            open_combined = open_combined.union(df_user.iloc[row, spcltycol])
        new_row = df_user.drop(['Spclty', 'StartDt'], axis=1).iloc[0].tolist()
        new_row.append(open_combined)
        out_rows.append(new_row)
# construct new dataframe with no redundant UserID rows:
df_out = pd.DataFrame(out_rows,
columns = ['UserNbr', 'Spclty'])
# convert Spclty sets to dicts:
df_out['Spclty'] = [dict(df_out['Spclty'][row]) for row in range(df_out.shape[0])]
The conversion to dict gets rid of specialties that are repeated between rows. In the output, a Spclty value should look like this:
{'104': '2010-01-31', '215': '2014-11-21', '352': '2016-07-13'}
except that there may be more key-value pairs than in any corresponding input row (resulting from aggregation over UserNbr).
I withdraw this question.
I had hoped there was an efficient way to use groupby with something else, but I haven't found any examples with a complex data structure like this one and have received no guidance.
For anyone who gets similarly stuck with very slow aggregation problems in Python, I suggest stepping up to PySpark. I am now tackling this problem with a Databricks notebook and am making headway with the pyspark.sql.window Window functions. (Now, it only takes minutes to run a test instead of hours!)
A partial solution is in the answer here:
PySpark list() in withColumn() only works once, then AssertionError: col should be Column
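For anyone who still wants to try this in plain pandas, a groupby-based sketch along these lines may work (an assumption on my part: each 'Spclty' value is a set of (code, date) tuples, as in the code above; untested at 0.7M rows):

import pandas as pd

# Toy frame in the question's shape: each row holds a set of (code, date) tuples
df_tmp = pd.DataFrame({
    'UserNbr': [1, 1, 2],
    'Spclty': [{('104', '2010-01-31')}, {('215', '2014-11-21')},
               {('352', '2016-07-13')}],
})

# Union the per-row sets within each user, then turn each union into a dict
df_out = pd.DataFrame(
    [(user, dict(set().union(*s)))
     for user, s in df_tmp.groupby('UserNbr')['Spclty']],
    columns=['UserNbr', 'Spclty'],
)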
I am in Python using Pandas for this manipulation.
The shape of my data is (1488000, 10).
I need to create a loop that will group by each separate category in the 3rd column, WORKLOAD_NAME, and create its own separate DF. That column has about 66 categories, so I need 66 correspondingly named, separate DFs as the result of this loop.
Here is what the data looks like:
Here is what I want for each separate DF, split by WORKLOAD_NAME:
Please note:
1) I did this for a single df, but it would be unsatisfactory to do it manually 65 more times:
EDWARDLOAD_WL = data[data.WORKLOAD_NAME == 'EDWARDLOAD']
2) I created a set of the unique names of the categories and then tried to create a loop like this:
for i in workload_set:
    [i]_WL = data[data.WORKLOAD_NAME == i]
but it didn't do anything for me. Any thoughts?
3) Lastly, I tried this .groupBy():
data_grouped = tuple(data.groupBy('WORKLOAD_NAME'))
data_grouped.head()
But it didn't work either: "AttributeError: 'DataFrame' object has no attribute 'groupBy'"
I managed to save separate files to excel:
names = df['WORKLOAD_NAME'].unique()
for n in names:
    df[df['WORKLOAD_NAME'] == n].to_excel(n + '.xlsx')
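If you want the 66 DataFrames in memory rather than as files, a dict keyed by workload name avoids generating variable names in a loop (a sketch against the same df):

# One DataFrame per workload, keyed by its name
workload_dfs = {name: group for name, group in df.groupby('WORKLOAD_NAME')}

# Look up any single workload by name, e.g.:
workload_dfs['EDWARDLOAD'].head()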
I am importing data from a file, which is working correctly. I have appended the data from this file into 3 different lists: name, mark, and mark2. However, I don't understand how, or whether, I can make a new list called total_marks by adding mark + mark2 element by element. I tried looking around for help on this and couldn't find much relating to it. The plan is to add the two lists together and work out a percentage, where the total marks would be 150.
To add the two lists item by item:
combined = []
for m1, m2 in zip(mark, mark2):
    combined.append(m1 + m2)
The zip function returns paired items from the two lists, one pair per position:
https://docs.python.org/3/library/functions.html#zip
Then you can perform the final operation this way:
final = []
for m in combined:
    final.append(m / 150 * 100)
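Or, more compactly, the same two steps as list comprehensions:

combined = [m1 + m2 for m1, m2 in zip(mark, mark2)]
final = [m / 150 * 100 for m in combined]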
As I said in my comment, I highly recommend that once you've gotten past the basics, you take the time to learn two libraries: pandas and xlwings. They will greatly help your ability to move data between Python and Excel. An operation like the one you have here becomes much simpler once you learn pandas DataFrames.
Here is a better way, using pandas.
import pandas
df = pandas.read_csv('Classmarks.csv', index_col = 'student_name', names = ('student_name', 'mark1', 'mark2'), header = None)
df['combined'] = df['mark1'] + df['mark2']
df['final'] = df['combined'] / 150 * 100
print(df)
Don't have to do any loops using pandas. And you can then write it back to a csv file:
df.to_csv('Classmarksout.csv')
I have a csv file with a number of columns in it. It is for students. I want to display only male students and their names. I used 1 for male students and 0 for female students. My code is:
import pandas as pd
data = pd.read_csv('normalizedDataset.csv')
results = pd.concat([data['name'], ['students']==1])
print results
I have got this error:
TypeError: cannot concatenate a non-NDFrame object
Can anyone help please. Thanks.
You can specify to read only certain column names of your data when you load your csv. Then use loc to locate all values where students equals 1.
data = pd.read_csv('normalizedDataset.csv', usecols=['name', 'students'])
data = data.loc[data.students == 1, :]
BTW, your original error is because you are trying to concatenate a dataframe with False.
>>> ['students']==1
False
No need to concat; you're stripping things away, not building.
Try:
data[data['students']==1]['name']
To provide clarity on why you were getting the error:
The second thing you were trying to concat was:
['students']==1
This is not an NDFrame object. You'd want to replace it with:
data[data['students']==1]['students']
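For completeness, a minimal runnable sketch using the boolean filter (column names as in the question):

import pandas as pd

data = pd.read_csv('normalizedDataset.csv')

# Keep rows where students == 1 (male), then select their names
male_names = data[data['students'] == 1]['name']
print(male_names)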