Save big lists to csv/pickle files - python-3.x

I'm learning to use pandas to save data to csv and pickle files, using the following script:
import pandas as pd

data = {'Product': [['Desktop Computer' * 30]],
        'Price': [['850' * 30]]
        }
df = pd.DataFrame(data, columns=['Product', 'Price'])
df.to_csv('sample_csv.csv')
df.to_pickle('sample_pickle.pkl')
The csv file could be saved correctly, but the pickle file had some trash in it. Please see the attached pictures "correct_small_csv.png" and "pickle_withtrash.png".
I also found that if the list size in data increases from 30 to 3000, the saved csv file gets messed up as well: the list of 3000 'Desktop Computer' entries is split across two cells in the csv file. Please see the picture "Messed_big_csv.png".

If you want a list of strings, write 'Price': [['10'] * 30] instead of 'Price': [['10' * 30]]. Otherwise you get a list with a single string element like ['10101010101....'] instead of ['10', '10', ....].
I tried your saving code and it worked well for me with both 30 and 3k items. The code itself looks fine, so I suspect the problem is in how Excel displays the file. Check the maximum length of a cell in csv/Excel: Excel limits a cell to 32,767 characters, and 3000 copies of the 16-character 'Desktop Computer' come to 48,000, so it probably can't hold them all in one cell.
Moreover, pickle is a binary format, so there is no point in opening it as a text file.
Try reading both files back into pandas and see what happens (everything should be just fine).
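The round trip suggested above could be checked with something like the following sketch. It assumes the corrected list syntax (['Desktop Computer'] * 30) and writes to a temporary folder, which is my choice for the example rather than something from the question.

```python
import os
import tempfile

import pandas as pd

# Corrected data: each cell holds a list of 30 separate strings.
data = {
    "Product": [["Desktop Computer"] * 30],
    "Price": [["850"] * 30],
}
df = pd.DataFrame(data, columns=["Product", "Price"])

# Temporary folder used only for this sketch.
tmp_dir = tempfile.mkdtemp()
csv_path = os.path.join(tmp_dir, "sample_csv.csv")
pkl_path = os.path.join(tmp_dir, "sample_pickle.pkl")

df.to_csv(csv_path, index=False)
df.to_pickle(pkl_path)

# Read both files back with pandas instead of opening them by hand.
from_csv = pd.read_csv(csv_path)
from_pkl = pd.read_pickle(pkl_path)

# Pickle restores the original Python list object.
restored_list = from_pkl.loc[0, "Product"]
# CSV is plain text, so the list comes back as its string representation.
csv_cell = from_csv.loc[0, "Product"]

print(type(restored_list), len(restored_list))
print(type(csv_cell))
```

Note that the csv cell comes back as a single string, not a list, so pickle is the better fit if the lists need to survive the round trip unchanged.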

Related

How to read the most recent Excel export into a Pandas dataframe without specifying the file name?

I frequent a real estate website that shows recent transactions, from which I will download data to parse within a Pandas dataframe. Everything about this dataset remains identical every time I download it (regarding the column names, that is).
The name of the Excel output may change, though. For example, if I have already downloaded a few of these to my Downloads folder, the exported file may be named "Generic_File_(3)" or "Generic_File_(21)" because older "Generic_File" exports are already in that folder.
Ideally, I'd like my workflow to look like this: export this Excel file of real estate sales, then run a Python script to read in the most recent export as a Pandas dataframe. The catch is, I don't want to have to go in and change the filename in the script to match the appended number of the Excel export every time. I want the pd.read_excel method to simply read the "Generic_File" that is appended with the largest number (which will obviously correspond to the most recent export).
I suppose I could always just delete old exports out of my Downloads folder so the newest, freshest export is always named the same ("Generic_File", in this case), but I'm looking for a way to ensure I don't have to do this. Are wildcards the best path forward, or is there some other method to always read in the most recently downloaded Excel file from my Downloads folder?
I would use the os module and write a function that reads the file names in the Downloads folder. By parsing the filenames you can then find the file that follows your naming format and has the highest copy number. Something like the following might help you get started.
import os

downloads = os.listdir('C:/Users/[username here]/Downloads/')
# Rough heuristic: treat every name that contains a dot as a file
files = [item for item in downloads if '.' in item]
# ** INSERT CODE HERE TO IDENTIFY THE FILE OF INTEREST **
Regex might be the best way to find matches if you have a diverse listing of files in your downloads folder.
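One way the regex step might look is sketched below. The function name find_latest_export, the default stem "Generic_File", and the ".xlsx" suffix are assumptions for illustration; the question does not state the actual extension.

```python
import re
from pathlib import Path


def find_latest_export(folder, stem="Generic_File", suffix=".xlsx"):
    """Return the path of the export with the highest copy number, or None.

    Hypothetical helper: stem and suffix are assumed, adjust them to the
    real export name.
    """
    # Matches "Generic_File.xlsx" as well as "Generic_File_(21).xlsx";
    # the optional group captures the copy number.
    pattern = re.compile(
        rf"^{re.escape(stem)}(?:_\((\d+)\))?{re.escape(suffix)}$"
    )
    best_path, best_number = None, -1
    for path in Path(folder).iterdir():
        match = pattern.match(path.name)
        if match is None or not path.is_file():
            continue
        # A file without a copy number counts as copy 0.
        number = int(match.group(1) or 0)
        if number > best_number:
            best_path, best_number = path, number
    return best_path
```

The returned path can then be passed straight to pd.read_excel. If the browser does not number the copies consistently, picking the file with the newest modification time (path.stat().st_mtime) may be a more robust alternative.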

Problem with .xls file validation on e-commerce platform

You may have noticed that this is a long question. That is because I made a real effort to explain the many WTFs I am facing; it may still not be clear enough, but I appreciate your help anyway!
Context
I'm doing an integration project for a client that processes a bunch of data to generate Excel files in .xls format. Note that extension!
While developing the project I was using the xlrd and xlwt Python packages because, again, I need to create a .xls file. At some point, though, I had to download and extract a file that was in .csv format (but in reality the file contains an HTML table :c).
So I decided to use pandas to read the HTML and create a data frame that I can manipulate and then write out as a .xls Excel file.
The Problem
After coding the logic and checking that the data was correct, I tried to upload this file to the e-commerce platform.
What happened is that the platform doesn't validate my file.
First I will briefly explain how the site works: it accepts .xls files and only .xls files, and probably processes them to update the database. I have no access to the source code.
When I upload the file, the site takes me to a configuration page where, if I want to or if the site didn't match them correctly, I can map Excel columns to the id or to the values that will be updated in the database.
The 'generico4' field expects the type 'smallint(5) unsigned'.
An important fact: I sent the file to my client so he could validate the data, and after many conversations we discovered that if he just downloads my file, opens it, and saves it, the upload works fine (the second image from my slide). Note that he has a MacBook and I use Ubuntu. I tried to do the same thing, but it did not work.
He sent me this file and I compared the two and found no difference: the numbers have the same type, 'float', and the Excel formula =TYPE(cell) returned 1 for them.
I have already tried many other things, but nothing works :c
The code
Here is the code, to give you an idea of the logic:
import os
from pathlib import Path

import pandas as pd
import xlwt

def stock_xls(data_file_path):
    # This is my logic to manipulate the data
    df = pd.read_html(data_file_path)[0]
    df = df[[1, 2]]
    df.rename(columns={1: 'sku', 2: 'stock'}, inplace=True)
    df = df.groupby(['sku']).sum()
    df.reset_index(inplace=True)
    df.loc[df['stock'] > 0, 'stock'] = 1
    df.loc[df['stock'] == 0, 'stock'] = 2
    # I create a new Workbook (via pandas was not working either)
    wb_out = xlwt.Workbook()
    ws_out = wb_out.add_sheet(sheetname='stock')
    # Set the column names
    ws_out.write(0, 0, 'sku')
    ws_out.write(0, 1, 'generico4')
    # Copy DataFrame data to the Workbook
    for index, value in df.iterrows():
        ws_out.write(index + 1, 0, str(value['sku']))
        ws_out.write(index + 1, 1, int(value['stock']))
    # BASE_DIR is assumed to be defined elsewhere in the project
    path = os.path.join(BASE_DIR, 'src/xls/temp/')
    Path(path).mkdir(parents=True, exist_ok=True)
    file_path = os.path.join(path, "stock.xls")
    wb_out.save(file_path)
    return file_path

Is there a way to split a DF using column name comparison?

I am extremely new to Python. I've created a DataFrame using a csv file. My file is a complex nested json file having header values at the lowest granular level.
[Example] df.columns = [ID1, fullID2, total.count, total.value, seedValue.id, seedValue.value1, seedValue.value2, seedValue.largeFile.id, seedValue.largeFile.value1, seedValue.largeFile.value2......]
Requirement: I have to create multiple smaller csvs, one per group of granular columns, each also containing ID1 and fullID2.
The approach I came up with: save the smaller slices by splitting on the header value.
Problem 1: I am not able to split the value correctly or get at the first element for comparison.
[Example]
I'm using df.columns.str.split('.').tolist(). Suppose I get the values listed below; I want to compare the seedValue of id with the seedValue of value1 and pull out this entire part as a new df.
[['seedValue', 'id'], ['seedValue', 'value1'], ['seedValue', 'value2']]
Problem 2: Adding ID1 and fullID2 to this new df.
Any help or direction to achieve this would be super helpful!
[Final output]
df.columns = [ID1, fullID2, total.count, total.value, seedValue.id, seedValue.value1, seedValue.value2, seedValue.largeFile.id, seedValue.largeFile.value1, seedValue.largeFile.value2......]
post-processing the file -
seedValue.columns = ID1,fullID2,id,value1,value2
total.columns = ID1,fullID2,count,value
seedValue.largeFile.columns = ID1,fullID2,id,value1,value2
I do not have your complex data, so I cannot provide a more specific solution, but I was able to reproduce a similar case with sample .csv data, which shows how to achieve what you are aiming for.
To save each ID in a different file, we need to loop through the IDs. Since there may be duplicate IDs, the script saves each group of IDs into its own .csv file. Below is the script, with sample data included:
import pandas as pd

my_dict = {'ids': [11, 11, 33, 55, 55],
           'info': ["comment_1", "comment_2", "comment_3", "comment_4", "comment_5"],
           'other_column': ["something", "something", "something", "", "something"]}
# Creating a dataframe from the sample data
df = pd.DataFrame(my_dict)
# Sorting by id
df = df.sort_values('ids')
# Looping through each group of ids and saving it into its own file
for group_id, g in df.groupby('ids'):
    g.to_csv('id_{}.csv'.format(group_id), index=False)
And the output,
id_11.csv
id_33.csv
id_55.csv
For instance, within id_11.csv:
ids,info,other_column
11,comment_1,something
11,comment_2,something
Notice that we use the value of the ids field in the name of each file. Moreover, index=False means that the row index is not written to the file as an extra column.
ADDITIONAL INFO: I have used the Notebook in AI Platform within GCP to execute and test the code.
Compared to the more widely known pd.read_csv, pandas offers more granular json support through pd.json_normalize, which allows you to specify how to unnest the data, which meta-data to use, etc.
Apart from this, reading nested fields from a csv into a two-dimensional dataframe might not be the ideal solution here, and having nested objects inside a dataframe can often be tricky to work with.
Try to read the file as a plain dictionary or list of dictionaries. You can then loop through the keys and write custom logic that decides how many more levels to go down, how to return the values, and so on. Once you reach a level that you would like to have in a dataframe, create a new temporary dataframe there, then append these parts together inside the loop.
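As a rough sketch of how pd.json_normalize and a split on the dotted column names could fit together: the record below is hypothetical sample data shaped like the column names in the question, and grouping on everything before the last dot is an assumption about what counts as one output file.

```python
import pandas as pd

# Hypothetical record shaped like the column names in the question.
records = [
    {
        "ID1": 1,
        "fullID2": "a-1",
        "total": {"count": 2, "value": 10.5},
        "seedValue": {
            "id": 7,
            "value1": "x",
            "value2": "y",
            "largeFile": {"id": 70, "value1": "p", "value2": "q"},
        },
    }
]

# json_normalize flattens nested keys into dotted column names
# such as "seedValue.largeFile.id".
df = pd.json_normalize(records)

id_cols = ["ID1", "fullID2"]

# Group the remaining columns by everything before the last dot.
groups = {}
for col in df.columns:
    if col in id_cols:
        continue
    prefix, _leaf = col.rsplit(".", 1)
    groups.setdefault(prefix, []).append(col)

# Build one smaller frame per prefix, always keeping the id columns.
frames = {}
for prefix, cols in groups.items():
    part = df[id_cols + cols].copy()
    part.columns = id_cols + [c.rsplit(".", 1)[1] for c in cols]
    frames[prefix] = part
    # To write each slice to disk, uncomment the next line:
    # part.to_csv(f"{prefix}.csv", index=False)

print(sorted(frames))
```

Each entry of frames then has the layout described in the question, for example ID1, fullID2, count, value for the "total" group.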

Read .csv that contains commas

I have a .csv file that contains multiple columns with texts in it. These texts contain commas, which makes things messy when I try to read the file into Python.
When I tried:
import pandas as pd
directory = 'some directory'
dataset = pd.read_csv(directory)
I got the following error:
ParserError: Error tokenizing data. C error: Expected 3 fields in line 42, saw 5
After doing some research, I found the clevercsv package.
So, I ran:
import clevercsv as csv
dataset = csv.read_csv(directory)
Running this, I got the error:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 4359705: character maps to <undefined>
To overcome this, I tried:
dataset = csv.read_csv(directory, encoding="utf8")
However, 10 hours later my computer was still working on reading it. So I expect that something went wrong there.
Furthermore, when I open the file in Excel, it splits the cells correctly. So I tried to save the .csv file as .xlsx and then write it back out as .csv from Python with an uncommon delimiter ('~'). However, when I save my .csv file as a .xlsx file, the file gets smaller, which suggests that only part of the file is saved, and that is not what I want.
Lastly, I have tried the solutions given here and here. But neither seem to work for my problem.
Given that Excel reads in the file without problems, I do expect that it should be possible to read it into Python as well. Who can help me with this?
UPDATE:
When using dataset = pd.read_csv(directory, sep = ',', error_bad_lines=False) I manage to read in the .csv, but many lines are skipped. Is there a better way to do this?
pandas should work; see https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
Have you tried something like dataset = pd.read_csv(directory, sep = ',', header = None)?
Regards
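For reference, here is a small sketch of how pd.read_csv treats commas inside text fields. It assumes the text fields are wrapped in double quotes, which the question does not confirm; the sample strings are made up. The second call uses on_bad_lines="skip", which replaces error_bad_lines in pandas 1.3 and later.

```python
import io

import pandas as pd

# Made-up sample: commas inside quoted fields are not treated as separators.
raw = (
    'id,title,comment\n'
    '1,"Hello, world","fine"\n'
    '2,"No comma here","a, b, c"\n'
)
dataset = pd.read_csv(io.StringIO(raw), sep=",", quotechar='"')
print(dataset)

# Made-up sample with an unquoted row that has too many fields.
# on_bad_lines="skip" drops such rows instead of raising ParserError.
broken = "a,b,c\n1,2,3\n4,5,6,7,8\n9,10,11\n"
skipped = pd.read_csv(io.StringIO(broken), on_bad_lines="skip")
print(skipped)
```

If the real file is not quoted, skipping bad lines loses data, so checking a few of the offending lines by hand (line 42 in the error message, for instance) is probably the first thing to try.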

np.save is converting floats to weird characters

I am attempting to append results to an ongoing csv file. Each result comes out as a NumPy ndarray:
[IN]: print(savearray)
[OUT]: [[ 0.55219001 0.39838119]]
Initially I tried
np.savetxt('flux_ratios.csv', savearray,delimiter=",")
But this overwrites the old data every time I save, so instead I am attempting to append the data like this:
f = open('flux_ratios.csv', 'ab')
np.save(f, 'a',savearray)
f.close()
This is (in a sense) appending, however it is saving the numerical data as weird characters, as can be seen in this screenshot:
I have no idea why or how this is happening so any help would be greatly appreciated!
First off, np.save writes a binary format, whereas np.savetxt writes text. You are mixing binary data into a text file, which is why you get the odd characters when you try to read the file.
You could just change np.save(f, 'a', savearray) to np.savetxt(f, savearray, delimiter=',').
Otherwise you could also consider using pandas' DataFrame.to_csv in append mode (mode='a').
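Put together, the np.savetxt variant might look like the sketch below. The second array and the temporary folder are made up for the example; only the first array comes from the question.

```python
import os
import tempfile

import numpy as np

# Temporary location used only for this sketch.
csv_path = os.path.join(tempfile.mkdtemp(), "flux_ratios.csv")

results = [
    np.array([[0.55219001, 0.39838119]]),  # value from the question
    np.array([[0.1, 0.2]]),                # made-up second result
]

for savearray in results:
    # 'ab' opens the file for appending, so earlier rows are kept.
    with open(csv_path, "ab") as f:
        np.savetxt(f, savearray, delimiter=",")

# Reading the file back gives one row per appended result.
loaded = np.loadtxt(csv_path, delimiter=",")
print(loaded)
```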
