How to remove special/unusual characters from a pandas dataframe using Python - python-3.x

I have a file with some crazy stuff in it (an example is under Edit 2 below).
I attempted to get rid of it using this:
df['firstname'] = map(lambda x: x.decode('utf-8','ignore'), df['firstname'])
But I wound up with this in my dataframe: <map object at 0x0000022141F637F0>
I got that example from another question, and this seems to be the Python 3 way of doing it, but I'm not sure what I'm doing wrong.
Edit: For some odd reason someone thinks that this has something to do with getting a map to return a list. The central issue is getting rid of non-UTF-8 characters. Whether or not I'm even doing that correctly has yet to be established.
As I understand it, I have to apply an operation to every character in a column of the dataframe. Is there another technique or is map the correct way and if it is, why am I getting the output I've indicated?
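For context, two things are going on here. In Python 3, map returns a lazy iterator, so assigning it to a column stores the map object itself rather than the mapped values; and str values have no .decode method, so the lambda would fail even if materialized. A hedged sketch of one common workaround (assuming the column holds str values; the encode/ignore/decode trick is a substitute for the bytes-only .decode call, not the asker's original approach):
# map() is lazy in Python 3; either wrap it in list() or use Series.apply
# so pandas receives actual values instead of the map object.
# encode('ascii', 'ignore').decode() drops any non-ASCII characters.
df['firstname'] = df['firstname'].apply(lambda x: x.encode('ascii', 'ignore').decode())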
Edit 2: For some reason, my machine wouldn't let me create an example. I can now. This is what I'm dealing with. All those weird characters need to go.
import pandas as pd
data = [['🦎Ale','Αλέξανδρα'],['��Grain','Girl🌾'],['Đỗ Vũ','ên Anh'],['Don','Johnson']]
df = pd.DataFrame(data,columns=['firstname','lastname'])
print(df)
Edit 3: I tried doing this using a regex and, for some reason, it still didn't work.
df['firstname'] = df['firstname'].replace('[^a-zA-z\s]',' ')
This regex works FINE in another process, but here, it still leaves the ugly characters.
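For what it's worth, there are two likely culprits in that line: the character class [^a-zA-z\s] contains A-z, which in ASCII order also spans the punctuation between 'Z' and 'a', and Series.replace treats the pattern as a literal string unless regex=True is passed, so nothing matches at all. A hedged sketch of a corrected version:
# .str.replace operates on the string contents of each cell;
# regex=True makes the pattern a regular expression
df['firstname'] = df['firstname'].str.replace(r'[^a-zA-Z\s]', ' ', regex=True)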
Edit 4: It turns out that it's image data that we're looking at.

Related

Is there any way I can fix the error string indices must be integers?

I tried to make graphs for my csv dataset in Jupyter Notebook, using this line of code:
bank['marital'].value_counts().plot(kind='pie',autopct='%.2f')
plt.show()
However, the system returns "string indices must be integers".
I have tried many different methods, like changing the string to a number..., but nothing really worked.
I tried to reproduce it and it worked fine. So it's not something wrong with the code itself.
I suggest experimenting with:
restarting Jupyter Notebook
playing with a tiny synthetic dataset
cutting the real dataset down until it works
attaching the failing dataset contents to the question
Attaching my results:
[input.csv]
name,smth
Maria,12
Anton,2
Maria,3
...
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('input.csv')
df['name'].value_counts().plot(kind='pie', autopct='%.2f')
plt.show()
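One additional hedged observation, not in the original answer: Python raises exactly this TypeError when the object being indexed with a string is itself a string rather than a DataFrame, so checking what bank actually is makes a cheap first step:
# If bank is accidentally a string (e.g. a filename) rather than a DataFrame,
# bank['marital'] raises "TypeError: string indices must be integers".
print(type(bank))  # should be <class 'pandas.core.frame.DataFrame'>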

Is this example data cleaning code updating the pandas dataframe? [duplicate]

This question already has answers here:
How to deal with SettingWithCopyWarning in Pandas
(20 answers)
Closed 3 years ago.
In this article on predicting values with linear regression, there's a cleaning step:
# To begin, transform train['FullDescription'] to lowercase using str.lower()
train['FullDescription'].str.lower()
# Then replace everything except letters and numbers with spaces;
# this will make it easier to split the text into words later.
train['FullDescription'].replace('[^a-zA-Z0-9]', ' ', regex=True)
This isn't actually assigning the changes to the dataframe, is it? But if I try something like this...
train['FullDescription'] = train['FullDescription'].str.lower()
train['FullDescription'] = train['FullDescription'].replace('[^a-zA-Z0-9]', ' ', regex = True)
Then I get a warning...
SettingWithCopyWarning: A value is trying to be set on a copy of a
slice from a DataFrame
See the caveats in the documentation:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
What's the right way to apply these transformations? Are they in fact already being applied? A print(train['FullDescription']) seems to show me they're not.
EDIT: @EdChum and @jezrael are very much onto something about missing code. When I'm actually trying to run this, my data needs to be split into test and train sets.
import pandas
from sklearn.model_selection import train_test_split

all_data = pandas.read_csv('salary.csv')
train, test = train_test_split(all_data, test_size=0.1)
That's what seems to be causing this error. If I add the following lines
train = train.copy()
test = test.copy()
then everything is happy.
You may be wondering whether I shouldn't just apply this step to all_data instead, which works, but then further down in the code train['Body'].fillna('nan', inplace=True) still causes an error. So it seems the problem is indeed with train_test_split not creating copies.
The right way to apply these transformations would be...
df.loc[:, 'FullDescription'] = ...
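Applied to the two cleaning steps above, that pattern would look roughly like this (a sketch, assuming train came from slicing a larger frame as in the question):
# Assigning through .loc signals to pandas that you intend to modify
# train itself, not a temporary copy of a slice.
train.loc[:, 'FullDescription'] = train['FullDescription'].str.lower()
train.loc[:, 'FullDescription'] = train['FullDescription'].replace('[^a-zA-Z0-9]', ' ', regex=True)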
More information about this is available here, at the very bottom of the pandas documentation page on indexing. Quoting...
def do_something(df):
    foo = df[['bar', 'baz']]  # Is foo a view? A copy? Nobody knows!
    # ... many lines here ...
    # We don't know whether this will modify df or not!
    foo['quux'] = value
    return foo
You can also find extra reasons for why to use .loc here. Long story short: explicit is better than implicit. And while df['some_column'] is not immediately clear about the intent, df.loc[:, 'some_column'] is.
I don't really know how to explain it in a simple way, but if you have further questions or if you think I could make my answer more explicit/eloquent, tell me. :)

How to write this type of data to csv

I'm a noobie in Python. I want to get some data from a CSV with pandas and afterwards write a new CSV file with extra data in the format
"type";"currency";"amount";"comment"
"type1";"currency1";"amount1";"comment1"
etc
import pandas as pd
import csv

req = pd.read_csv('/Users/user/web/python/Bookcopy.csv')
type = "type"
comment = "2week"
i = 0
while i < 3:
    Currency = req['Currency'].values[i]
    ReqAmount = req['Request'].values[i]
    r = round(ReqAmount, -1)
    i += 1
    data = [type, Currency, r, comment]
    #print(data)

csv_file = open('data2.csv', 'w')
with csv_file:
    writer = csv.writer(csv_file)
    writer.writerow(data)
    print("DONE")
    writer.writerows(data)
_csv.Error: iterable expected, not numpy.float64
I have multiple things to criticize here. I hope it doesn't come across as mean, but rather as educational. While your code would work regardless of these points, it is good coding style to follow them.
Variables should not start with a capital letter. Currency and ReqAmount should be currency and reqAmount.
Don't shadow built-in names with your variables. type is a Python built-in, so it shouldn't be used as a variable name.
Make sure your formatting doesn't get destroyed when posting here. Especially for python, which relies on tab formatting. Read here for more information: https://stackoverflow.com/editing-help#code
That said, let me try to go through your code and give you tips and tricks:
Don't run code at the top level of the script directly. Wrap it in a main() function instead; it's just better coding style.
When looping, don't use the i=0; while i<3; i+=1 construct, rather use for i in range(3). While it works, it is not very pythonic and a lot harder to read.
Never assume anything that cannot be guaranteed in your code. In this case, you assume that the csv-file has at least 3 lines, otherwise your program would crash. Instead, read the number of lines from the csv file, with len(req).
data = [type, Currency, r, comment] keeps overwriting your data variable on every iteration. You could either append to a list and write everything to the output file at the end, or write directly to the output file in every iteration.
Don't use open to create a variable (except when absolutely necessary). Instead, use open in a with statement; this ensures that the file gets closed properly. I can see that you do use a with statement, but it is usually written as with open(...) as variable_name:.
CSV files usually start with the column names. Therefore you should write the column names before you write data.
I won't fix this here, because it would change the appearance of the program completely, but normally, don't mix different libraries. If you use pandas for CSV reading, also use it for writing; if you use the csv library for writing, also use it for reading. While it isn't wrong to mix them, it is bad style and creates more dependencies than necessary. (A pandas-only sketch follows the rewritten code below.)
I don't really understand what your code is supposed to do, so I just guess and hope it goes in the right direction.
When fixing all those points, you might end up with something like this:
import pandas as pd
import csv

def main():
    req = pd.read_csv('/Users/user/web/python/Bookcopy.csv')
    transferType = "type"
    comment = "2week"
    with open('data2.csv', 'w') as csv_file:
        writer = csv.writer(csv_file)
        writer.writerow(["type", "currency", "amount", "comment"])
        for i in range(len(req)):
            currency = req['Currency'].values[i]
            reqAmount = req['Request'].values[i]
            r = round(reqAmount, -1)
            data = [transferType, currency, r, comment]
            #print(data)
            writer.writerow(data)
    print("DONE")

# Whenever you run a program, __name__ will be set to '__main__' in the initial
# script. This makes it easier later when you work with multiple code files.
if __name__ == '__main__':
    main()
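On the earlier point about not mixing libraries: a pandas-only version is possible too. A minimal sketch, assuming the same input file and column names as in the question, and using sep=';' with QUOTE_ALL to approximate the quoted, semicolon-separated format the asker showed:
import csv
import pandas as pd

def main():
    req = pd.read_csv('/Users/user/web/python/Bookcopy.csv')
    out = pd.DataFrame({
        'type': 'type',                      # constant column
        'currency': req['Currency'],
        'amount': req['Request'].round(-1),  # round to the nearest ten, as in the loop
        'comment': '2week',                  # constant column
    })
    out.to_csv('data2.csv', index=False, sep=';', quoting=csv.QUOTE_ALL)

if __name__ == '__main__':
    main()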

np.save is converting floats to weird characters

I am attempting to append results to an ongoing CSV file. Each result comes out as an ndarray:
[IN]: print(savearray)
[OUT]: [[ 0.55219001 0.39838119]]
Initially I tried
np.savetxt('flux_ratios.csv', savearray,delimiter=",")
But this overwrites the old data every time I save, so instead I am attempting to append the data like this:
f = open('flux_ratios.csv', 'ab')
np.save(f, 'a',savearray)
f.close()
This is (in a sense) appending; however, it saves the numerical data as weird, unreadable characters.
I have no idea why or how this is happening so any help would be greatly appreciated!
First off, np.save does not write text whereas np.savetxt does. You are trying to combine binary with text, which is why you get the odd characters when you try to read the file.
You could just change np.save(f, 'a', savearray) to np.savetxt(f, savearray, delimiter=',').
Otherwise you could also consider using pandas DataFrame.to_csv in append mode.
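A minimal sketch of both suggestions (the filename and array values come from the question; the pandas variant assumes you're happy to construct a DataFrame just for the write):
import numpy as np
import pandas as pd

savearray = np.array([[0.55219001, 0.39838119]])

# Option 1: np.savetxt writes text; opening in append mode keeps earlier rows.
with open('flux_ratios.csv', 'ab') as f:
    np.savetxt(f, savearray, delimiter=',')

# Option 2: pandas appends with mode='a'; header=False avoids repeated headers.
pd.DataFrame(savearray).to_csv('flux_ratios.csv', mode='a', header=False, index=False)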

Unpickling from converted string in python/numpy

I have a ton of numpy ndarrays that are stored pickled to strings. That may have been a poor design choice, but it's what I did, and now the pickled strings seem to have been converted somewhere along the way. When I try to unpickle, I notice they are of type str and I get the following error:
TypeError: 'str' does not support the buffer interface
when I invoke
numpy.loads(bin_str)
Where bin_str is the thing I'm trying to unpickle. If I print out bin_str, it looks like
b'\x80\x02cnumpy.core.multiarray\n_reconstruct\nq\x00cnumpy\nndarray\nq\x01K\x00\x85q\x02c_codecs\nencode\nq\x03X\x01\x00\x00\ ...
continuing for some time, so the info seems to be there; I'm just not quite sure how to convert it into whatever format numpy/pickle need. On a whim I tried
numpy.loads( bytearray(bin_str, encoding='utf-8') )
and
numpy.loads( bin_str.encode() )
which both throw an error _pickle.UnpicklingError: unpickling stack underflow. Any ideas?
PS: I'm on python 3.3.2 and numpy 1.7.1
Edit
I discovered that if I do the following:
open('temp.txt', 'wb').write(...)
return numpy.load( 'temp.txt' )
I get back my array, where ... denotes copying and pasting the output of print(bin_str) from another window. I've tried writing bin_str to a file directly in order to unpickle it, but that doesn't work; it complains that TypeError: 'str' does not support the buffer interface. A few sane ways of converting bin_str to something that can be written directly to a binary file result in pickle errors when trying to read it back.
Edit 2
So I guess what's happened is that my binary pickle string ended up encoded inside of a normal string, something like:
"b'pickle'"
which is unfortunate and I haven't figured out how to deal with that, except this ridiculous and convoluted way to get it back:
open('temp.py', 'w').write('foo = ' + bin_str)
from temp import foo
numpy.loads( foo )
This seems like a very shameful solution to the problem, so please give me a better one!
It sounds like your saved strings are the reprs of the original bytes instances returned by your pickling code. That's a bit unfortunate, but not too bad. repr is intended to return a "machine friendly" representation of an object, and it can often be reversed by using eval:
import numpy as np
import pickle
# this part has already happened
orig_obj = np.array([1,2,3])
orig_pickle = pickle.dumps(orig_obj)
saved_str = repr(orig_pickle) # this was a mistake, but it's already done
# this is what you need to do to get something equivalent to orig_obj back
reconstructed_pickle = eval(saved_str)
reconstructed_obj = pickle.loads(reconstructed_pickle)
# test
if np.all(reconstructed_obj == orig_obj):
    print("It worked!")
Obligatory note that using eval can be dangerous: Be aware that eval can run any Python code it wants, so don't call it with untrusted data. However, pickle data has the same risks (a malicious Pickle string can run arbitrary code upon unpickling), so you're not losing much safety in this situation. I'm guessing that you trust your data in this case anyway.
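One possible refinement, not in the original answer: ast.literal_eval parses only Python literals (including bytes literals such as the repr output above), so it can stand in for eval here without the arbitrary-code-execution risk, though pickle.loads itself still carries the usual pickle risks:
import ast
import pickle

# literal_eval accepts the repr of a bytes object, e.g. "b'\x80\x02...'",
# but refuses anything that is not a plain literal expression
reconstructed_pickle = ast.literal_eval(saved_str)
reconstructed_obj = pickle.loads(reconstructed_pickle)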
