Is this example data cleaning code updating the pandas dataframe? [duplicate] - python-3.x

This question already has answers here:
How to deal with SettingWithCopyWarning in Pandas
(20 answers)
Closed 3 years ago.
In this article on predicting values with linear regression there's a cleaning step
# For beginning, transform train['FullDescription'] to lowercase using text.lower()
train['FullDescription'].str.lower()
# Then replace everything except letters and numbers with spaces.
# This will make it easier to split the text into words later.
train['FullDescription'].replace('[^a-zA-Z0-9]', ' ', regex = True)
This isn't actually assigning the changes to the dataframe, is it? But if I try something like this...
train['FullDescription'] = train['FullDescription'].str.lower()
train['FullDescription'] = train['FullDescription'].replace('[^a-zA-Z0-9]', ' ', regex = True)
Then I get a warning...
SettingWithCopyWarning: A value is trying to be set on a copy of a
slice from a DataFrame
See the caveats in the documentation:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
What's the right way to apply these transformations? Are they in fact already being applied? A print(train['FullDescription']) seems to show me they're not.
EDIT: @EdChum and @jezrael are very much onto something about missing code. When I'm actually trying to run this, my data needs to be split into test and train sets.
from sklearn.model_selection import train_test_split
all_data = pandas.read_csv('salary.csv')
train, test = train_test_split(all_data, test_size=0.1)
That's what seems to be causing the warning. If I add the lines
train = train.copy()
test = test.copy()
then everything is happy.
You may be wondering whether I shouldn't then just apply this step to all_data, which works, but lower down in the code train['Body'].fillna('nan', inplace=True) still triggers the warning. So it seems the problem is indeed that train_test_split does not create copies.

The right way to apply these transformations would be...
df.loc[:, 'FullDescription'] = ...
More information about this can be found here, near the bottom of this page in the pandas documentation. Quoting...
def do_something(df):
    foo = df[['bar', 'baz']]  # Is foo a view? A copy? Nobody knows!
    # ... many lines here ...
    # We don't know whether this will modify df or not!
    foo['quux'] = value
    return foo
You can also find extra reasons to use .loc here. Long story short: explicit is better than implicit. And while df['some_column'] is not immediately clear about the intent, df.loc[:, 'some_column'] is.
I don't really know how to explain it in a simple way, but if you have further questions or if you think I could make my answer more explicit/eloquent, tell me. :)
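To tie this back to the question, here is a minimal sketch combining the asker's .copy() workaround with .loc assignment (file name and column names are taken from the question itself):

from sklearn.model_selection import train_test_split
import pandas

all_data = pandas.read_csv('salary.csv')
train, test = train_test_split(all_data, test_size=0.1)
train = train.copy()  # the asker's workaround: make train an independent frame

# Assigning through .loc makes the intent (modify this frame, this column) explicit.
train.loc[:, 'FullDescription'] = train['FullDescription'].str.lower()
train.loc[:, 'FullDescription'] = train['FullDescription'].replace('[^a-zA-Z0-9]', ' ', regex=True)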

Related

How to solve this data set

I am importing a data set from Quandl using its API. Everything works, except that the time series I am importing is reversed. By this I mean that if I use the .head method to print the first elements in the data set, I get the latest figures, and printing the tail gives the oldest figures.
df = pd.read_csv("https://www.quandl.com/api/v3/datasets/CHRIS/CME_CD4.csv?api_key=H32H8imfVNVm9fcEX6kB",parse_dates=['Date'],index_col='Date')
df.head()
This should be a pretty easy fix if I understand correctly. Credit to behzad.nouri and this answer: Right way to reverse pandas.DataFrame?
You just need to reverse the order of your dataframe using the line below.
df = df.reindex(index=df.index[::-1])
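Since Date is the index here, another option (a small sketch, not from the linked answer) is to sort by the index, which gives oldest-first regardless of the order the rows arrive in:

# Equivalent in this case, and independent of the order Quandl returns rows in.
df = df.sort_index()
df.head()  # now shows the oldest figures first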

How to write this type of data to csv

I'm a newbie in Python. I want to read some data from a CSV file with pandas and then write a new CSV file with extra data, in the format
"type";"currency";"amount";"comment"
"type1";"currency1";"amount1";"comment1"
etc
import pandas as pd
import csv

req = pd.read_csv('/Users/user/web/python/Bookcopy.csv')
type = "type"
comment = "2week"
i = 0
while i < 3:
    Currency = req['Currency'].values[i]
    ReqAmount = req['Request'].values[i]
    r = round(ReqAmount, -1)
    i += 1
    data = [type, Currency, r, comment]
    #print(data)
csv_file = open('data2.csv', 'w')
with csv_file:
    writer = csv.writer(csv_file)
    writer.writerow(data)
    print("DONE")
    writer.writerows(data)
_csv.Error: iterable expected, not numpy.float64
I have multiple things to criticize here. I hope it doesn't come across as mean, but rather as educational. While your code would work regardless of these points, it is good coding style to follow them.
Variables should not start with a capital letter. Currency and ReqAmount should be currency and reqAmount.
Don't shadow built-in names with your variables. type is a Python built-in function.
Make sure your formatting doesn't get destroyed when posting here. This matters especially for Python, which relies on indentation. Read here for more information: https://stackoverflow.com/editing-help#code
That said, let me try to go through your code and give you tips and tricks:
Don't run code at module level directly. Always wrap it in a main() function; it's just better coding style.
When looping, don't use the i=0; while i<3; i+=1 construct, rather use for i in range(3). While it works, it is not very pythonic and a lot harder to read.
Never assume anything that cannot be guaranteed in your code. In this case, you assume that the csv-file has at least 3 lines, otherwise your program would crash. Instead, read the number of lines from the csv file, with len(req).
data =[type,Currency,r,comment] keeps overwriting your data variable. You could either append to data and then write everything to the output file at the end, or directly write to the output file in every iteration.
Don't use open to create a variable (except when absolutely necessary). Instead, use open in a with statement. This ensures that the file will get closed properly. I can see that you do use a with statement, but you would usually write it as with open(...) as variable_name:.
CSV files usually start with the column names. Therefore you should write the column names before you write data.
I won't fix this, because it would change the appearance of the program completely, but normally, don't mix different libraries. If you use pandas for CSV reading, also use it for writing. If you use the csv library for writing, also use it for reading. While it isn't wrong to mix them, it is bad style and creates more dependencies than necessary. (A pandas-only sketch follows after the corrected code below.)
I don't really understand what your code is supposed to do, so I just guess and hope it goes in the right direction.
When fixing all those points, you might have something like that:
import pandas as pd
import csv

def main():
    req = pd.read_csv('/Users/user/web/python/Bookcopy.csv')
    transferType = "type"
    comment = "2week"
    with open('data2.csv', 'w') as csv_file:
        writer = csv.writer(csv_file)
        writer.writerow(["type", "currency", "amount", "comment"])
        for i in range(len(req)):
            currency = req['Currency'].values[i]
            reqAmount = req['Request'].values[i]
            r = round(reqAmount, -1)
            data = [transferType, currency, r, comment]
            #print(data)
            writer.writerow(data)
    print("DONE")

# Whenever you run a program, __name__ will be set to '__main__' in the initial
# script. This makes it easier later when you work with multiple code files.
if __name__ == '__main__':
    main()
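Following up on the point above about not mixing libraries: staying within pandas for both reading and writing collapses the whole task to a few lines. This is only a rough sketch, assuming the same file path and column names as above:

import csv
import pandas as pd

def main():
    req = pd.read_csv('/Users/user/web/python/Bookcopy.csv')
    out = pd.DataFrame({
        'type': 'type',                      # constant column
        'currency': req['Currency'],
        'amount': req['Request'].round(-1),  # round to the nearest ten
        'comment': '2week',                  # constant column
    })
    # sep=';' and QUOTE_ALL reproduce the "a";"b";"c" style from the question.
    out.to_csv('data2.csv', sep=';', index=False, quoting=csv.QUOTE_ALL)
    print("DONE")

if __name__ == '__main__':
    main()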

How to remove special/unusual characters from a pandas dataframe using Python

I have a file with some crazy stuff in it. I attempted to get rid of it using this:
df['firstname'] = map(lambda x: x.decode('utf-8','ignore'), df['firstname'])
But I wound up with this in my dataframe: <map object at 0x0000022141F637F0>
I got that example from another question and this seems to be the Python3 method for doing this but I'm not sure what I'm doing wrong.
Edit: For some odd reason someone thinks that this has something to do with getting a map to return a list. The central issue is getting rid of non UTF-8 characters. Whether or not I'm even doing that correctly has yet to be established.
As I understand it, I have to apply an operation to every character in a column of the dataframe. Is there another technique or is map the correct way and if it is, why am I getting the output I've indicated?
Edit 2: For some reason, my machine wouldn't let me create an example earlier. I can now. This is what I'm dealing with. All those weird characters need to go.
import pandas as pd
data = [['🦎Ale','Αλέξανδρα'],['��Grain','Girl🌾'],['Đỗ Vũ','ên Anh'],['Don','Johnson']]
df = pd.DataFrame(data,columns=['firstname','lastname'])
print(df)
Edit 3: I tried doing this using a regex and for some reason, it still didn't work.
df['firstname'] = df['firstname'].replace('[^a-zA-z\s]',' ')
This regex works FINE in another process, but here, it still leaves the ugly characters.
Edit 4: It turns out that it's image data that we're looking at.
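No working fix is shown above, but one hedged guess about the Edit 3 attempt: Series.replace without regex=True only matches whole cell values, so the pattern never fires, and [a-zA-z] (lower-case z) accidentally spans some punctuation as well. A minimal sketch using str.replace against the example frame:

import pandas as pd

data = [['🦎Ale', 'Αλέξανδρα'], ['Đỗ Vũ', 'ên Anh'], ['Don', 'Johnson']]
df = pd.DataFrame(data, columns=['firstname', 'lastname'])

# str.replace applies the pattern inside each string; plain replace() would only
# match a cell whose entire value equals the literal pattern, so nothing changes.
df['firstname'] = df['firstname'].str.replace(r'[^a-zA-Z\s]', ' ', regex=True)
print(df)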

Pandas Drop and Replace functions won't work within a UDF

I looked around at other questions but couldn't find one that addresses the issue I'm having. I am cleaning a data set in an IPython notebook. When I run the cleaning tasks individually they work as expected, but I am having trouble with the replace() and drop() functions when they are included in a UDF. Specifically, these lines aren't doing anything within the UDF; however, a dataframe is returned that completes the other tasks as expected (i.e. reads in the file, sets the index, and filters select dates out).
Any help is much appreciated!
Note that in this problem the df.drop() and df.replace() commands both work as expected when executed outside of the UDF. The function is below for your reference. The issue is with the last two lines "station.replace()" and "station.drop()".
def read_file(file_path):
    '''Function to read in daily x data'''
    if os.path.exists(os.getcwd()+'/'+file_path) == True:
        station = pd.read_csv(file_path)
    else:
        !unzip alldata.zip
        station = pd.read_csv(file_path)
    station.set_index('date',inplace=True) #put date in the index
    station = station_data[station_data.index > '1984-09-29'] #removes days where there is no y-data
    station.replace('---','0',inplace=True)
    station.drop(columns=['Unnamed: 0'],axis=1,inplace=True) #drop non-station columns
There was a mistake here:
station = station_data[station_data.index > '1984-09-29']
I was using an old table index. I corrected it to:
station = station[station.index > '1984-09-29']
Note, I had to restart the notebook and re-run it from the top for it to work. I believe it was an issue with conflicting table names in the UDF vs. what was already stored in memory.
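Putting the fix in context, the corrected function would look roughly like this. This is only a sketch: the unzip step is IPython-specific (os.system stands in for the notebook's "!unzip"), and the final return is assumed since the post says a dataframe is returned.

import os
import pandas as pd

def read_file(file_path):
    '''Read in daily x data (corrected sketch).'''
    if not os.path.exists(os.path.join(os.getcwd(), file_path)):
        os.system('unzip alldata.zip')  # stand-in for the notebook's "!unzip alldata.zip"
    station = pd.read_csv(file_path)
    station.set_index('date', inplace=True)           # put date in the index
    station = station[station.index > '1984-09-29']   # fixed: was station_data
    station.replace('---', '0', inplace=True)
    station.drop(columns=['Unnamed: 0'], inplace=True)  # drop non-station columns
    return station  # assumed: the post implies the cleaned frame reaches the caller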

How do I save a scipy distribution in a list or array to call? [duplicate]

I wonder how to save and load numpy.array data properly. Currently I'm using the numpy.savetxt() method. For example, if I have an array markers, which looks like this:
I try to save it by the use of:
numpy.savetxt('markers.txt', markers)
In other script I try to open previously saved file:
markers = np.fromfile("markers.txt")
And that's what I get...
Saved data first looks like this:
0.000000000000000000e+00
0.000000000000000000e+00
0.000000000000000000e+00
0.000000000000000000e+00
0.000000000000000000e+00
0.000000000000000000e+00
0.000000000000000000e+00
0.000000000000000000e+00
0.000000000000000000e+00
0.000000000000000000e+00
But when I save the just-loaded data using the same method, i.e. numpy.savetxt(), it looks like this:
1.398043286095131769e-76
1.398043286095288860e-76
1.396426376485745879e-76
1.398043286055061908e-76
1.398043286095288860e-76
1.182950697433698368e-76
1.398043275797188953e-76
1.398043286095288860e-76
1.210894289234927752e-99
1.398040649781712473e-76
What am I doing wrong? PS: there are no other "backstage" operations that I perform. Just saving and loading, and that's what I get. Thank you in advance.
The most reliable way I have found to do this is to use np.savetxt with np.loadtxt and not np.fromfile which is better suited to binary files written with tofile. The np.fromfile and np.tofile methods write and read binary files whereas np.savetxt writes a text file.
So, for example:
a = np.array([1, 2, 3, 4])
np.savetxt('test1.txt', a, fmt='%d')
b = np.loadtxt('test1.txt', dtype=int)
a == b
# array([ True, True, True, True], dtype=bool)
Or:
a.tofile('test2.dat')
c = np.fromfile('test2.dat', dtype=int)
c == a
# array([ True, True, True, True], dtype=bool)
I use the former method even if it is slower and creates bigger files (sometimes): the binary format can be platform dependent (for example, the file format depends on the endianness of your system).
There is a platform independent format for NumPy arrays, which can be saved and read with np.save and np.load:
np.save('test3.npy', a) # .npy extension is added if not given
d = np.load('test3.npy')
a == d
# array([ True, True, True, True], dtype=bool)
np.save('data.npy', num_arr) # save
new_num_arr = np.load('data.npy') # load
The short answer is: you should use np.save and np.load.
The advantage of using these functions is that they are made by the developers of the Numpy library and they already work (plus are likely optimized nicely for processing speed).
For example:
import numpy as np
from pathlib import Path
path = Path('~/data/tmp/').expanduser()
path.mkdir(parents=True, exist_ok=True)
lb,ub = -1,1
num_samples = 5
x = np.random.uniform(low=lb,high=ub,size=(1,num_samples))
y = x**2 + x + 2
np.save(path/'x', x)
np.save(path/'y', y)
x_loaded = np.load(path/'x.npy')
y_load = np.load(path/'y.npy')
print(x is x_loaded) # False
print(x == x_loaded) # [[ True True True True True]]
Expanded answer:
In the end it really depends on your needs, because you can also save it in a human-readable format (see Dump a NumPy array into a csv file) or even with other libraries if your files are extremely large (see best way to preserve numpy arrays on disk for an expanded discussion).
However (expanding a bit, since you use the word "properly" in your question), I still think using the NumPy functions out of the box most likely satisfies most user needs, and most code. The most important reason is that they already work. Trying to use something else for any other reason might take you on an unexpectedly LONG rabbit hole to figure out why it doesn't work and force it to work.
Take for example trying to save it with pickle. I tried that just for fun and it took me at least 30 minutes to realize that pickle wouldn't save my stuff unless I opened & read the file in bytes mode with wb. It took time to google the problem, test potential solutions, understand the error message, etc... It's a small detail, but the fact that it already required me to open a file complicated things in unexpected ways. To add to that, it required me to re-read this (which btw is sort of confusing): Difference between modes a, a+, w, w+, and r+ in built-in open function?.
So if there is an interface that meets your needs, use it, unless you have a (very) good reason not to (e.g. compatibility with MATLAB, or for some reason you really want to read the file and printing in Python really doesn't meet your needs, which might be questionable). Furthermore, if you do need to optimize it, you'll most likely find out later down the line (rather than spending ages debugging useless stuff like opening a simple NumPy file).
So use the interface NumPy provides. It might not be perfect, but it's most likely fine, especially for a library that's been around as long as NumPy.
I've already spelled out saving and loading data with NumPy in a bunch of ways, so have fun with it. Hope this helps!
import numpy as np
import pickle
from pathlib import Path
path = Path('~/data/tmp/').expanduser()
path.mkdir(parents=True, exist_ok=True)
lb,ub = -1,1
num_samples = 5
x = np.random.uniform(low=lb,high=ub,size=(1,num_samples))
y = x**2 + x + 2
# using save (to npy), savez (to npz)
np.save(path/'x', x)
np.save(path/'y', y)
np.savez(path/'db', x=x, y=y)
with open(path/'db.pkl', 'wb') as db_file:
    pickle.dump(obj={'x':x, 'y':y}, file=db_file)
## using loading npy, npz files
x_loaded = np.load(path/'x.npy')
y_load = np.load(path/'y.npy')
db = np.load(path/'db.npz')
with open(path/'db.pkl', 'rb') as db_file:
    db_pkl = pickle.load(db_file)
print(x is x_loaded)
print(x == x_loaded)
print(x == db['x'])
print(x == db_pkl['x'])
print('done')
Some comments on what I learned:
np.save: as expected, this works out of the box without any file opening, and the resulting .npy files are compact enough for most uses (see https://stackoverflow.com/a/55750128/1601580 for a size comparison). Clean. Easy. Efficient. Use it.
np.savez uses an uncompressed format (the docs say: "Save several arrays into a single file in uncompressed .npz format"). If you decide to use this (you were warned about going away from the standard solution, so expect bugs!) you might discover that you need to pass keyword argument names when saving, unless you want to use the default names. So don't use this if the first option already works (or whichever one works, use that!).
Pickle also allows for arbitrary code execution. Some people might not want to use this for security reasons.
Human-readable files are expensive to make etc. Probably not worth it.
For large files there is something called HDF5. Cool! https://stackoverflow.com/a/9619713/1601580 (see the short sketch at the end of this answer)
Note that this is not an exhaustive answer. But for other resources check this:
For pickle (guess the top answer is don't use pickle, use np.save): Save Numpy Array using Pickle
For large files (great answer! compares storage size, loading save and more!): https://stackoverflow.com/a/41425878/1601580
For matlab (we have to accept matlab has some freakin' nice plots!): "Converting" Numpy arrays to Matlab and vice versa
For saving in human-readable format: Dump a NumPy array into a csv file
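For completeness, here is a tiny HDF5 round-trip sketch. This assumes the third-party h5py package is installed; it is not part of NumPy itself:

import numpy as np
import h5py

a = np.arange(10.0)

# Write the array into an HDF5 file; datasets are addressed by name.
with h5py.File('test.h5', 'w') as f:
    f.create_dataset('a', data=a)

# Read the whole dataset back into memory.
with h5py.File('test.h5', 'r') as f:
    b = f['a'][:]

print(np.array_equal(a, b))  # True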
np.fromfile() has a sep= keyword argument:
Separator between items if file is a text file. Empty (“”) separator means the file should be treated as binary. Spaces (” ”) in the separator match zero or more whitespace characters. A separator consisting only of spaces must match at least one whitespace.
The default value of sep="" means that np.fromfile() tries to read it as a binary file rather than a space-separated text file, so you get nonsense values back. If you use np.fromfile('markers.txt', sep=" ") you will get the result you are looking for.
However, as others have pointed out, np.loadtxt() is the preferred way to convert text files to numpy arrays, and unless the file needs to be human-readable it is usually better to use binary formats instead (e.g. np.load()/np.save()).
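A small round-trip sketch of the text-mode options, reusing the markers.txt name from the question:

import numpy as np

markers = np.zeros(10)
np.savetxt('markers.txt', markers)

# Wrong: the default sep="" treats the file as binary, giving nonsense values.
bad = np.fromfile('markers.txt')

# Either of these reads the text file back correctly.
ok1 = np.fromfile('markers.txt', sep=" ")
ok2 = np.loadtxt('markers.txt')

print(np.array_equal(markers, ok1), np.array_equal(markers, ok2))  # True True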
