Remove unicode '\xa0' from pandas column - python-3.x

I was given a latin-1 .txt dataset, which I am trying to clean up for proper analysis using Python 3 and pandas. The dataset, scraped from HTML, contains a number of \xa0 occurrences, which I can't seem to deal with using decode, strip, replace, or any other method I found on Stack Overflow. All of my attempts seem to be ignored by Python, still printing the same results. As I am new to data scraping, chances are I missed something obvious, but right now I don't see a way forward.
I have tried decoding to ASCII, stripping to str and then replacing, and replacing with a u'' literal, none of which led anywhere.
import pandas as pd

filepath = 'meow3.txt'
outF = open("myOutFile.txt", "a")
with open(filepath) as fp:
    line = fp.readline()
    for line in fp:
        if line.strip().startswith(','):
            line = line.replace(',', '', 1)
        line = line.replace(u'\xa0', u' ')
        print(line)

df = pd.read_csv('myOutFile.txt', sep=",", encoding="latin-1", header=None,
                 names=["Company name", "Total", "Invested since-to"])
print(df)
3sun Group, £10m ,Feb 2014
,Abacus, £4.15m ,Aug 2013
,Accsys Group ,£12m, Mar 2017,
Acro ,\xa0£7.8m, Nov 2015 – Sep 2017,
ACS Clothing, £25.3m ,Jan 2014
This is what the dataset looks like, and why my code removes the first comma when it appears at the start of a line. But none of the suggested answers I tried helped with removing the \xa0 part of the dataset, still giving the same result (seen above). If anyone has any clue how I could make this work, I would be very grateful.
Cheers,
Jericho
Edit: While I know this would be best dealt with by pre-processing before turning it into a txt file, I have no access to or control of that process, and I have to work with the data I was given.

I got stuck on this problem today and finally found a quick and neat solution.
Say your pandas dataframe has a column with values like 'South\xa0Carolina'.
Use the following code to remove all '\xa0'. I had tried .replace("\xa0", " ") and .replace(r'\xa0', r' '), but neither of them worked.
data['col'].str.split().str.join(' ')
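A quick check of this trick on made-up sample values: str.split() with no argument splits on any Unicode whitespace, which includes \xa0, so joining the pieces back with a plain space drops the non-breaking spaces.

```python
import pandas as pd

# Hypothetical sample values containing non-breaking spaces
data = pd.DataFrame({'col': ['South\xa0Carolina', 'North\xa0\xa0Dakota']})

# split() with no argument splits on any Unicode whitespace, \xa0 included;
# join(' ') then rebuilds each string with plain single spaces
cleaned = data['col'].str.split().str.join(' ')
print(cleaned.tolist())  # ['South Carolina', 'North Dakota']
```

Note this also collapses runs of ordinary spaces into one, which is usually what you want with scraped text.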

Do this after reading the file:
df['col'] = df['col'].apply(lambda x: str(x).replace(u'\xa0', u''))

Maybe decoding the line will help; note that .decode exists on bytes objects, not str, so this applies when the file is opened in binary mode:
line = line.decode('UTF-8')
Then do the string replacement after that, if necessary.
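A minimal sketch of the bytes path, using one sample line from the dataset. Since the OP's file is latin-1, that codec is used here rather than UTF-8 (decoding latin-1 bytes as UTF-8 would fail on \xa0):

```python
# A sample latin-1 encoded line, as it would arrive from open(..., 'rb')
raw = 'Acro ,\xa0£7.8m, Nov 2015\n'.encode('latin-1')

line = raw.decode('latin-1')       # bytes -> str (.decode exists only on bytes)
clean = line.replace('\xa0', ' ')  # the replacement now works on the str
print(repr(clean))
```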

Related

Remove double quotes while printing string in dataframe to text file

I have a dataframe which contains one column with multiple strings. Here is what the data looks like:
Value
EU-1050-22345,201908 XYZ DETAILS, CD_123_123;CD_123_124,2;1
There are almost 100,000 such rows in the dataframe. I want to write this data into a text file.
For this, I tried the following:
df.to_csv(filename, header=None,index=None,mode='a')
But I am getting the entire string in quotes when I do this. The output I obtain is:
"EU-1050-22345,201908 XYZ DETAILS, CD_123_123;CD_123_124,2;1"
But what I want is:
EU-1050-22345,201908 XYZ DETAILS, CD_123_123;CD_123_124,2;1 -> No Quotes
I also tried this:
df.to_csv(filename, header=None, index=None, mode='a',
          quoting=csv.QUOTE_NONE)
However, I get an error that an escapechar is required. If I add escapechar='/' to the code, I get '/' in multiple places (but no quotes). I don't want the '/' either.
Is there any way I can remove the quotes while writing to a text file WITHOUT adding any other escape characters?
Based on the OP's comment, I believe the semicolon is messing things up. I no longer get unwanted \ when using tabs to delimit the csv.
import pandas as pd
import csv
df = pd.DataFrame(columns=['col'])
df.loc[0] = "EU-1050-22345,201908 XYZ DETAILS, CD_123_123;CD_123_124,2;1"
df.to_csv("out.csv", sep="\t", quoting=csv.QUOTE_NONE, quotechar="", escapechar="")
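A self-contained check of the tab-delimited approach, writing to an in-memory buffer instead of out.csv: with sep="\t" nothing in the value needs escaping, so csv.QUOTE_NONE succeeds even without the empty quotechar/escapechar overrides (which may not be accepted by every csv/pandas version).

```python
import csv
import io

import pandas as pd

df = pd.DataFrame(columns=['col'])
df.loc[0] = "EU-1050-22345,201908 XYZ DETAILS, CD_123_123;CD_123_124,2;1"

buf = io.StringIO()  # in-memory stand-in for the output file
# With a tab separator, the commas and semicolons in the value are no
# longer special characters, so QUOTE_NONE needs no escapechar at all
df.to_csv(buf, sep="\t", header=False, index=False, quoting=csv.QUOTE_NONE)
print(buf.getvalue())
```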
Original Answer:
According to this answer, you need to specify escapechar="\\" to use csv.QUOTE_NONE.
Have you tried:
df.to_csv("out.csv", sep=",", quoting=csv.QUOTE_NONE, quotechar="", escapechar="\\")
I was able to write a df to a csv using a single space as the separator, with the quotes around strings removed, by replacing the existing in-string spaces in the dataframe with non-breaking spaces before writing it as a csv.
df = df.applymap(lambda x: str(x).replace(' ', u"\u00A0"))
df.to_csv(outpath+filename, header=True, index=None, sep=' ', mode='a')
I couldn't use a tab-delimited file for the output I was writing, though that solution also works using additional keywords to df.to_csv(): quoting=csv.QUOTE_NONE, quotechar="", escapechar="".
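A sketch of that non-breaking-space trick on made-up data, using .apply with .str.replace instead of applymap (same effect, and avoids the deprecation of applymap in newer pandas):

```python
import io

import pandas as pd

df = pd.DataFrame({'a': ['hello world'], 'b': ['foo bar']})

# Swap real spaces for non-breaking spaces so the space separator never
# occurs inside a field and to_csv has nothing to quote
df = df.apply(lambda col: col.str.replace(' ', '\u00A0', regex=False))

buf = io.StringIO()
df.to_csv(buf, header=True, index=False, sep=' ')
print(buf.getvalue())
```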

Writing in columns of an excel file from a list of texts

I have a list of texts (reviews_train) which I gathered from a text file (train.txt).
reviews_train = []
for line in open('C:\\Users\\Dell\\Desktop\\New Beginnings\\movie_data\\train.txt', 'r', encoding="utf8"):
    reviews_train.append(line.strip())
Suppose reviews_train = ["Nice movie","Bad film",....]
I have another result.csv file which looks like
company year
a 2000
b 2001
.
.
.
What I want to do is add another column text to the existing file to look something like this.
company year text
a 2000 Nice movie
b 2001 Bad film
.
.
.
The items of the list should get appended in the new column one after the other.
I am really new to Python. Can someone please tell me how to do it? Any help is really appreciated.
EDIT: My question is not just about adding another column in the .csv file. The column should have the texts in the list appended row wise.
EDIT: I used the solution given by @J_H but I get this error
Use zip():
import csv

def get_rows(infile='result.csv'):
    with open(infile) as fin:
        sheet = csv.reader(fin)
        for row in sheet:
            yield list(row)

def get_lines(infile=r'C:\Users\Dell\Desktop\New Beginnings\movie_data\train.txt'):
    return open(infile).readlines()

for row, line in zip(get_rows(), get_lines()):
    row.append(line)
    print(row)
With those 3-element rows in hand,
you could e.g. writerow().
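A self-contained sketch of the zip() approach, with small in-memory stand-ins for result.csv and train.txt:

```python
import csv
import io

rows = [['a', '2000'], ['b', '2001']]    # stand-in for result.csv rows
lines = ['Nice movie\n', 'Bad film\n']   # stand-in for train.txt lines

buf = io.StringIO()
writer = csv.writer(buf)
for row, line in zip(rows, lines):
    row.append(line.strip())  # 3-element row: company, year, text
    writer.writerow(row)

print(buf.getvalue())
```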
EDIT
The open() in your question mentions 'r' and encoding='utf8',
which I suppressed since open() should default to using those.
Apparently you're not using the python3 mentioned in your tag,
or perhaps an ancient version.
PEPs 529 & 540 suggest that since 3.6, Windows will default to UTF-8,
just like most platforms.
If your host manages to default to something crazy like CP1252,
then you will certainly want to override that:
return open(infile, encoding='utf8').readlines()

Python 3: Removing u200b (zwsp) and newlines (\n) and spaces - chaining List operations?

I'm really stumped as to why this doesn't work. All I want to do is remove the zwsp (u200b), newlines, and extra spaces from content read from a file.
Ultimately, I want to write this out to a new file, which I have functional, just not in the desired format yet.
My input (a short test file, which has zwsp / u200b in it) consists of the following:
Australia 1975
​Adelaide ​ 2006 ​ 23,500
Brisbane (Logan) 2006 29,700
​Brisbane II (North Lakes) ​ 2016 ​ 29,000
Austria 1977
Graz 1989 26,100
Innsbruck 2000 16,000
Klagenfurt 2008 27,000
My code so is as follows:
input_file = open('/home/me/python/info.txt', 'r')
file_content = input_file.read()
input_file.close()
output_nospace = file_content.replace('\u200b' or '\n' or ' ', '')
print(output_nospace)
f = open('nospace_u200b.txt', 'w')
f.write(output_nospace)
f.close()
However, this doesn't work as I expect.
Whilst it removes u200b, it does not remove newlines or spaces. I have to test for absence of u200b by checking the output file produced as part of my script.
If I remove one of the operations, e.g. \u200b, like so:
output_nospace = file_content.replace('\n' or ' ', '')
...then sure enough the resulting file is without newlines or spaces, but u200b remains as expected. Revert back to the original described at the top of this post, and it doesn't remove u200b, newlines and spaces.
Can anyone advise what I'm doing wrong here? Can you chain list operations like this? How can I get this to work?
Thanks.
The result of code like "a or b or c" is just the first thing of a, b, or c that isn't considered false by Python (None, 0, "", [], and False are some false values). In this case the result is the first value, the zwsp character. It doesn't convey to the replace function that you're looking to replace a or b or c with ''; the replace code isn't informed you used 'or' at all. You can chain replacements like this, though: s.replace('a', '').replace('b', '').replace('c', ''). (Also, replace is a string operation, not a list operation, here.)
Based on this question, I'd suggest a tutorial like learnpython.org. Statements in Python or other programming languages are different from human-language sentences in ways that can confuse you when you're just starting out.
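To make that concrete, here is the or expression evaluated on its own, next to the chained-replace form that actually works:

```python
# 'a or b or c' evaluates to the first truthy operand, so the replace
# call only ever sees the zwsp character
expr = '\u200b' or '\n' or ' '
assert expr == '\u200b'

# Chaining replace calls handles each character explicitly
s = 'a\u200b b\nc'
cleaned = s.replace('\u200b', '').replace('\n', '').replace(' ', '')
print(cleaned)  # abc
```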
As indicated by @twotwotwo, the following implementation of a .replace chain solves the issue.
output_nospace = \
file_content.replace('\u200b', '').replace('\n', '').replace(' ', '')
Thanks so much for pointing me in the right direction. :)

Gap Analysis/Report for CSV in Python 3.6.2

Start End
MM0001 MM0009
MM0010 MM0020
MM0030 MM0039
MM0059 MM0071
Good afternoon, I wanted to create code in Python in 3.6.2 that will allow me to essentially look for gaps in rows of consecutive numbers, such as with this one. It would then output to the screen for the missing numbers in a format similar to below:
MM0021 MM0029
MM0040 MM0051
MM0052 MM0058
I've created some code for this program based on an answer I found around here, but I don't believe it's complete, as well as it being done in Python 2.7 I believe. I however used it as a basis for what I was trying to do.
import csv
with open('thedata.csv') as csvfile:
    reader = csv.reader(csvfile)
    for line, row in enumerate(reader, 1):
        if not row:
            print 'Start of line', line, 'Contents', row
Any help will be greatly appreciated.
import csv

def out(*args):
    print('{},{}'.format(*(str(i).rjust(4, "0") for i in args)))

prev = 0
data = csv.reader(open('thedata.csv'))
print(*next(data), sep=', ')  # header
for line in data:
    start, end = (int(s.strip()[2:]) for s in line)
    if start != prev + 1:
        out(prev + 1, start - 1)
    prev = end
    out(start, end)
It's really ugly, sorry, but it should work.
It outputs comma-separated text.
If something doesn't work, or you want it to save to a file, just comment.
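The gap-finding step can also be written as a small standalone function. The numeric ranges below are the numeric parts of the MM codes from the question's sample table; note it reports each contiguous gap as a single span (the question's sample output splits one gap across two rows):

```python
def gaps(ranges):
    """Yield the missing (start, end) spans between consecutive ranges."""
    prev_end = None
    for start, end in ranges:
        if prev_end is not None and start > prev_end + 1:
            yield (prev_end + 1, start - 1)
        prev_end = end

# Numeric parts of the MM codes from the question
rows = [(1, 9), (10, 20), (30, 39), (59, 71)]
print(list(gaps(rows)))  # [(21, 29), (40, 58)]
```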

Python code to read first 14 characters, uniquefy based on them, and parse duplicates

I have a list of more than 10k strings that look like different versions of this: HN5ML6A02FL4UI_3 (14 numbers or letters, then _1 to _6), where some are duplicates except for the _1 to _6.
I am trying to find a way to list these and remove the duplicates based on the 14 characters that come before the _1 to _6.
Example of part of the list:
HN5ML6A02FL4UI_3
HN5ML6A02FL4UI_1
HN5ML6A01BDVDN_6
HN5ML6A01BDVDN_1
HN5ML6A02GVTSV_3
HN5ML6A01CUDA2_1
HN5ML6A01CUDA2_5
HN5ML6A02JPGQ9_5
HN5ML6A02JI8VU_1
HN5ML6A01AJOJU_5
I have tried versions of scripts using regular expressions, e.g. var n = /\d+/.exec(info)[0];, that were posted on my previous question. I also used a modified version of the code from: How can I strip the first 14 characters in a list element using python?
More recently I used this script, and I am still not getting the correct output.
import os, re

def trunclist('rhodopsins_play', 'hope4'):
    with open('rhodopsins_play', 'r') as f:
        newlist = []
        trunclist = []
        for line in f:
            if line.strip().split('_')[0] not in trunclist:
                newlist.append(line)
                trunclist.append(line.split('_')[0])
        print newlist, trunclist
    # write newlist to file, with carriage returns
    with open('hope4', 'w') as out:
        for line in newlist:
            out.write(line)
My inputfile.txt contains more than 10k lines of data that look like the list above, where the important part is the characters in front of the '_' (underscore). The goal is to output a file of the uniquified entries, e.g. ABCD12356_1.
Can someone help?
Thank you for your help
Open Python and run this script, which is similar to the above; it splits at the '_'. This worked on the file:
def trunclist(inputfile, outputfile):
    with open(inputfile, 'r') as f:
        newlist = []
        trunclist = []
        for line in f:
            if line.strip().split('_')[0] not in trunclist:
                newlist.append(line)
                trunclist.append(line.split('_')[0])
        print(newlist, trunclist)
    # write newlist to file, with carriage returns
    with open(outputfile, 'w') as out:
        for line in newlist:
            out.write(line)
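An equivalent, slightly more idiomatic sketch tracks the seen prefixes in a set instead of a list, since membership tests on a set are O(1), which matters with 10k+ lines:

```python
def unique_by_prefix(lines):
    """Keep the first line seen for each prefix before the '_'."""
    seen = set()
    kept = []
    for line in lines:
        prefix = line.strip().split('_')[0]
        if prefix not in seen:   # set lookup instead of scanning a list
            seen.add(prefix)
            kept.append(line)
    return kept

ids = ['HN5ML6A02FL4UI_3', 'HN5ML6A02FL4UI_1', 'HN5ML6A01BDVDN_6']
print(unique_by_prefix(ids))  # ['HN5ML6A02FL4UI_3', 'HN5ML6A01BDVDN_6']
```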
