How to remove from the Pandas series the characters contained in the list (or series) - python-3.x

Good afternoon. Checking the text in the column, I came across characters that I didn't need:
"|,|.|
|2|5|0|1|6|ё|–|8|3|-|c|t|r|l|+|e|n|g|i|w|s|k|z|«|(|)|»|—|9|7|?|o|b|a|/|f|v|:|%|4|!|;|h|y|u|d|&|j|p|x|m|і|№|ұ|…|қ|$|_|[|]|“|”|ғ|||​|>|−|„|*|¬|ү|ң|#|©|―|q|→|’|∙|·| |ә| |ө|š|é|=|­|×|″|⇑|⇐|⇒|‑|′|\|<|#|'|˚| |ü|̇|̆|•|½|¾|ń|¤|һ|ý|{|}| |‘|ā|í||ī|‎|ќ|ђ|°|‚|ѓ|џ|ļ|▶|新|千|歳|空|港|全|日|機|が|曲|り|き|れ|ず|に|雪|突|っ|込|む|ニ|ュ|ー|ス|¼|ù|~|ə|ў|ҳ|ό||€|🙂|¸|⠀|ä|¯|ツ|ї|ş|è|`|́|ҹ|®|²|‪|ç| |☑|️|‼|ú|‒||👊|🏽|👁|ó|±|ñ|ł|ش|ا|ه|ن|م|›|
|£||||º
Text encoding - UTF8.
How do I correctly remove all these characters from a specific column (series) of a Pandas data frame?
I tried:
template = bad_symbols[0].str.cat(sep='|')
print(template)
template = re.compile(template, re.UNICODE)
test = label_data['text'].str.replace(template, '', regex=True)
And I get the following error:
"|,|.|
|2|5|0|1|6|ё|–|8|3|-|c|t|r|l|+|e|n|g|i|w|s|k|z|«|(|)|»|—|9|7|?|o|b|a|/|f|v|:|%|4|!|;|h|y|u|d|&|j|p|x|m|і|№|ұ|…|қ|$|_|[|]|“|”|ғ|||​|>|−|„|*|¬|ү|ң|#|©|―|q|→|’|∙|·| |ә| |ө|š|é|=|­|×|″|⇑|⇐|⇒|‑|′|\|<|#|'|˚| |ü|̇|̆|•|½|¾|ń|¤|һ|ý|{|}| |‘|ā|í||ī|‎|ќ|ђ|°|‚|ѓ|џ|ļ|▶|新|千|歳|空|港|全|日|機|が|曲|り|き|れ|ず|に|雪|突|っ|込|む|ニ|ュ|ー|ス|¼|ù|~|ə|ў|ҳ|ό||€|🙂|¸|⠀|ä|¯|ツ|ї|ş|è|`|́|ҹ|®|²|‪|ç| |☑|️|‼|ú|‒||👊|🏽|👁|ó|±|ñ|ł|ش|ا|ه|ن|م|›|
|£||||º
---------------------------------------------------------------------------
error Traceback (most recent call last)
<ipython-input-105-36817f343a8a> in <module>
5 print(template)
6
----> 7 template = re.compile(template, re.UNICODE)
8
9 test = label_data['text'].str.replace(template, '', regex=True)
5 frames
/usr/lib/python3.7/sre_parse.py in _parse(source, state, verbose, nested, first)
643 if not item or item[0][0] is AT:
644 raise source.error("nothing to repeat",
--> 645 source.tell() - here + len(this))
646 if item[0][0] in _REPEATCODES:
647 raise source.error("multiple repeat",
error: nothing to repeat at position 36 (line 2, column 30)

You need to escape your characters; use re.escape:
import re
template = '|'.join(map(re.escape, bad_symbols[0]))
Then there is no need to compile; pandas will handle it for you:
test = label_data['text'].str.replace(template, '', regex=True, flags=re.UNICODE)
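A minimal, self-contained sketch of the escaping step, using a hypothetical bad_chars list in place of bad_symbols[0]; the resulting template works the same way when passed to Series.str.replace:

```python
import re

# Hypothetical stand-in for bad_symbols[0]: characters that are regex
# metacharacters and would break an unescaped alternation pattern.
bad_chars = ['|', '.', '+', '*', '(', ')', '?', '$']

# re.escape backslash-escapes each character, so '|'.join() builds a
# safe alternation like \||\.|\+|...
template = '|'.join(map(re.escape, bad_chars))

cleaned = re.sub(template, '', 'a+b*(c)?|d.$', flags=re.UNICODE)
print(cleaned)  # abcd
```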

Related

How can I make a list of three sentences to a string?

I have a target word and the left and right context that I have to join together. I am using pandas, and I am trying to join the sentences and the target word together into a list, which I can then turn into a string so that it works with my vectorizer. Basically, I am just trying to turn a list of three sentences into a string.
This is the error that I get:
AttributeError Traceback (most recent call last)
<ipython-input-195-ae09731d3572> in <module>()
3
4 vectorizer=CountVectorizer(max_features=100000,binary=True,ngram_range=(1,2))
----> 5 feature_matrix=vectorizer.fit_transform(trainTexts)
6 print("shape=",feature_matrix.shape)
3 frames
/usr/local/lib/python3.6/dist-packages/sklearn/feature_extraction/text.py in _preprocess(doc, accent_function, lower)
66 """
67 if lower:
---> 68 doc = doc.lower()
69 if accent_function is not None:
70 doc = accent_function(doc)
AttributeError: 'list' object has no attribute 'lower'
I have tried using .join and .split, but they are not working for me, so I am doing something wrong.
import sys
import csv
import random
csv.field_size_limit(sys.maxsize)
trainLabels = []
trainTexts = []
with open("myTsvFile.tsv") as train:
    trainData = [row for row in csv.reader(train, delimiter='\t')]
random.shuffle(trainData)
for example in trainData:
    trainLabels.append(example[1])
    trainTexts.append(example[3:6])
The slice example[3:6] means that index 3 is the left context, 4 is the target word, and 5 is the right context.
print('Text:', trainTexts[3])
print('Label:', trainLabels[1])
Edit: a few printed lines from the code:
['Visa electron käy aika monessa paikassa luottokortista . Mukaanlukien ', 'Paypal', ' , mikä avaa taas lisää ovia .']
['Nyt pistän pääni pölkyllä : ', 'WinForms', ' on ihan ok .']
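Judging by the printed rows above, each element of trainTexts is a three-string list [left context, target word, right context], while CountVectorizer needs each document to be a single str (it calls doc.lower()). A minimal sketch of the join step, using one of the sample rows from the output above:

```python
# Each row is [left context, target word, right context], as printed above.
trainTexts = [
    ['Nyt pistän pääni pölkyllä : ', 'WinForms', ' on ihan ok .'],
]

# Joining the three pieces produces the single str that the
# vectorizer expects, fixing the AttributeError on .lower().
trainTexts = [''.join(parts) for parts in trainTexts]
print(trainTexts[0])
```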

Pandas & Dataframe: ValueError: can only convert an array of size 1 to a Python scalar

I tried to find a similar question but still couldn't find the solution.
I am working on a dataframe with pandas.
The following code is not working: it only works for the first row of the dataframe; already on the second row I get the error below. Maybe somebody sees the mistake and can help :)
census2 = census_df[census_df["SUMLEV"] == 50]
list = census2["CTYNAME"].tolist()
max = 0
for county1 in list:
    countylist = []
    df1 = census2[census2["CTYNAME"] == county1]
    countylist.append(df1["POPESTIMATE2010"].item())
    countylist.append(df1["POPESTIMATE2011"].item())
    countylist.append(df1["POPESTIMATE2012"].item())
    countylist.append(df1["POPESTIMATE2013"].item())
    countylist.append(df1["POPESTIMATE2014"].item())
    countylist.append(df1["POPESTIMATE2015"].item())
    countylist.sort()
    difference = countylist[5] - countylist[0]
    if difference > max:
        max = difference
        maxcounty = county1
print(maxcounty)
print(max)
[54660, 55253, 55175, 55038, 55290, 55347]
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-5-340aeaf28039> in <module>()
12 countylist=[]
13 df1=census2[census2["CTYNAME"]==county1]
---> 14 countylist.append(df1["POPESTIMATE2010"].item())
15 countylist.append(df1["POPESTIMATE2011"].item())
16 countylist.append(df1["POPESTIMATE2012"].item())
/opt/conda/lib/python3.6/site-packages/pandas/core/base.py in item(self)
829 """
830 try:
--> 831 return self.values.item()
832 except IndexError:
833 # copy numpy's message here because Py26 raises an IndexError
ValueError: can only convert an array of size 1 to a Python scalar
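A likely cause, sketched with a toy frame: Series.item() requires exactly one matching row, but county names in CTYNAME are not unique nationwide (many states have, for example, a Washington County), so the equality filter can return several rows and .item() raises exactly this ValueError. Making the filter unique (for example by also matching a state column, if the frame has one) keeps each selection to a single row.

```python
import pandas as pd

# Toy frame: the same county name appears in two different states.
census2 = pd.DataFrame({
    "CTYNAME": ["Washington County", "Washington County"],
    "POPESTIMATE2010": [54660, 40000],
})

sel = census2[census2["CTYNAME"] == "Washington County"]["POPESTIMATE2010"]
try:
    sel.item()  # two rows -> cannot convert to a single scalar
except ValueError as exc:
    print(exc)  # can only convert an array of size 1 to a Python scalar
```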

TypeError: sequence item 0: expected str instance, NoneType found

Issue: In the above code, I have used two specific print statements to do the same thing. While the first one does its job, the second one throws an exception when executed. I have brainstormed a lot but am not able to find exactly where the NoneType object is coming from inside join:
import numpy as np
from sklearn import preprocessing
input_labels=['red','black','red','green','black','yellow','white']
encoder=preprocessing.LabelEncoder()
encoder.fit(input_labels)
print("\nLabel Mapping:")
for i, item in enumerate(encoder.classes_):
    print(item, '--->', i)
print("\nLabel Mapping:", ''.join(print(item, '--->', i) for i, item in
      enumerate(encoder.classes_)))
Here is the output:
Label Mapping:
black ---> 0
green ---> 1
red ---> 2
white ---> 3
yellow ---> 4
Traceback (most recent call last):
File "C:\Users\satyaranjan.rout\workspace\archival script\bokehtest.py", line 12, in <module>
Label Mapping:
black ---> 0
green ---> 1
red ---> 2
white ---> 3
yellow ---> 4
print("\nLabel Mapping:"),''.join(print(item, '--->',i) for i,item in enumerate(encoder.classes_))
TypeError: sequence item 0: expected str instance, NoneType found
Question: Both the code block (lines 8, 9, 10) and line 12 do the same thing. What is the issue with the one-liner (line 12) that makes it return a NoneType object from within join? If I want to fix it, what replacement can be made?
Change the line
print("\nLabel Mapping:",''.join(print(item, '--->',i) for i,item in enumerate(encoder.classes_)))
into:
print("\nLabel Mapping:",''.join('%s--->%s' % (item, i) for i,item in enumerate(encoder.classes_)))
The return value of the print function is None, so your code tries to join None elements; that is why it raises an error. When you format them as strings instead, the problem is solved.
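The difference can be shown without sklearn; classes below is a hypothetical stand-in for encoder.classes_:

```python
classes = ['black', 'green', 'red']  # stand-in for encoder.classes_

# print() always returns None, so joining its return values fails:
# ''.join(print(c) for c in classes)  ->  TypeError: ... NoneType found

# Formatting each pair to a str first gives join what it expects:
mapping = ''.join('%s--->%s\n' % (item, i) for i, item in enumerate(classes))
print(mapping)
```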

Compiler error, while creating pdf using pylatex

I am using PyLaTeX to create a pdf document. I'm facing some issues with compilers.
I am running my program on macOS High Sierra and have installed the basic version of MacTeX (MacTeX download); latexmk is also installed using
sudo tlmgr install latexmk
For the following starter code, I'm getting an error in the compilers loop. The error log is attached after the code.
import numpy as np
from pylatex import Document, Section, Subsection, Tabular, Math, TikZ, Axis, \
    Plot, Figure, Matrix, Alignat
from pylatex.utils import italic
import os

if __name__ == '__main__':
    # image_filename = os.path.join(os.path.dirname(__file__), 'kitten.jpg')
    geometry_options = {"tmargin": "1cm", "lmargin": "10cm"}
    doc = Document(geometry_options=geometry_options)
    with doc.create(Section('The simple stuff')):
        doc.append('Some regular text and some')
        doc.append(italic('italic text. '))
        doc.append('\nAlso some crazy characters: $&#{}')
        with doc.create(Subsection('Math that is incorrect')):
            doc.append(Math(data=['2*3', '=', 9]))
        with doc.create(Subsection('Table of something')):
            with doc.create(Tabular('rc|cl')) as table:
                table.add_hline()
                table.add_row((1, 2, 3, 4))
                table.add_hline(1, 2)
                table.add_empty_row()
                table.add_row((4, 5, 6, 7))
    doc.generate_pdf('full', clean_tex=False, compiler_args='--latexmk')
Error code:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-10-dbe7f407e095> in <module>()
27
28
---> 29 doc.generate_pdf('full', clean_tex=False, compiler_args='--latexmk')
~/anaconda3/lib/python3.6/site-packages/pylatex/document.py in generate_pdf(self, filepath, clean, clean_tex, compiler, compiler_args, silent)
227
228 for compiler, arguments in compilers:
--> 229 command = [compiler] + arguments + compiler_args + main_arguments
230
231 try:
TypeError: can only concatenate list (not "str") to list
Please help me understand the error and how to fix it.
Regards,
Looks like a confusion between the compiler keyword argument, which accepts a string, and compiler_args, which accepts a list.
Maybe something like this is what you're after:
doc.generate_pdf('full', clean_tex=False, compiler='latexmk', compiler_args=['-c'])
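The root cause can be reproduced without PyLaTeX. The variable names below are hypothetical stand-ins mirroring the traceback, where pylatex builds command = [compiler] + arguments + compiler_args + main_arguments:

```python
compiler = 'latexmk'
arguments = ['-interaction=nonstopmode']
main_arguments = ['full.tex']

compiler_args = '--latexmk'  # a str, as in the failing call
try:
    command = [compiler] + arguments + compiler_args + main_arguments
except TypeError as exc:
    print(exc)  # can only concatenate list (not "str") to list

compiler_args = ['-c']       # a list, as generate_pdf expects
command = [compiler] + arguments + compiler_args + main_arguments
print(command)  # ['latexmk', '-interaction=nonstopmode', '-c', 'full.tex']
```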

applying a lambda function to pandas dataframe

First time posting on stackoverflow, so bear with me if I'm making some faux pas please :)
I'm trying to calculate the distance between two points, using geopy, but I can't quite get the actual application of the calculation to work.
Here's the head of the dataframe I'm working with (there are some missing values later in the dataframe, not sure if this is the issue or how to handle it in general):
start lat start long end_lat end_long
0 38.902760 -77.038630 38.880300 -76.986200
2 38.895914 -77.026064 38.915400 -77.044600
3 38.888251 -77.049426 38.895914 -77.026064
4 38.892300 -77.043600 38.888251 -77.049426
I've set up a function:
def dist_calc(st_lat, st_long, fin_lat, fin_long):
    from geopy.distance import vincenty
    start = (st_lat, st_long)
    end = (fin_lat, fin_long)
    return vincenty(start, end).miles
This one works fine when given manual input.
However, when I try to apply() the function, I run into trouble with the below code:
distances = df.apply(lambda row: dist_calc(row[-4], row[-3], row[-2], row[-1]), axis=1)
I'm fairly new to python, any help will be much appreciated!
Edit: error message:
distances = df.apply(lambda row: dist_calc2(row[-4], row[-3], row[-2], row[-1]), axis=1)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/frame.py", line 4262, in apply
ignore_failures=ignore_failures)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/frame.py", line 4358, in _apply_standard
results[i] = func(v)
File "<stdin>", line 1, in <lambda>
File "<stdin>", line 5, in dist_calc2
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/geopy/distance.py", line 322, in __init__
super(vincenty, self).__init__(*args, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/geopy/distance.py", line 115, in __init__
kilometers += self.measure(a, b)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/geopy/distance.py", line 414, in measure
u_sq = cos_sq_alpha * (major ** 2 - minor ** 2) / minor ** 2
UnboundLocalError: ("local variable 'cos_sq_alpha' referenced before assignment", 'occurred at index 10')
The default settings for the pandas functions typically used to import text data like this (pd.read_table() etc.) will interpret the spaces in the first two column names as separators, so you'll end up with 6 columns instead of 4, and your data will be misaligned:
In [23]: df = pd.read_clipboard()
In [24]: df
Out[24]:
start lat start.1 long end_lat end_long
0 0 38.902760 -77.038630 38.880300 -76.986200 NaN
1 2 38.895914 -77.026064 38.915400 -77.044600 NaN
2 3 38.888251 -77.049426 38.895914 -77.026064 NaN
3 4 38.892300 -77.043600 38.888251 -77.049426 NaN
In [25]: df.columns
Out[25]: Index(['start', 'lat', 'start.1', 'long', 'end_lat', 'end_long'], dtype='object')
Notice that the column names are wrong, the last column is full of NaNs, etc. If I apply your function to the dataframe in this form, I get the same error as you did.
It's usually better to fix this before it gets imported as a dataframe. I can think of 2 methods:
1. clean the data before importing, for example copy it into an editor and replace the offending spaces with underscores. This is the easiest.
2. use a regex to fix it during import. This may be necessary if the dataset is very large, or if it is pulled from a website and has to be refreshed regularly.
Here's an example of case (2):
In [35]: df = pd.read_clipboard(sep=r'\s{2,}|\s(?=-)', engine='python')
In [36]: df = df.rename_axis({'start lat': 'start_lat', 'start long': 'start_long'}, axis=1)
In [37]: df
Out[37]:
start_lat start_long end_lat end_long
0 38.902760 -77.038630 38.880300 -76.986200
2 38.895914 -77.026064 38.915400 -77.044600
3 38.888251 -77.049426 38.895914 -77.026064
4 38.892300 -77.043600 38.888251 -77.049426
The regex specifies that separators must contain either 2+ whitespace characters, or 1 whitespace character followed by a hyphen (minus sign). Then I rename the columns to what I assume are the expected values.
From this point your function / apply works fine, but I've changed it a little:
PEP8 recommends putting imports at the top of each file, rather than in a function
Extracting the columns by name is more robust, and would have given a much more understandable error than the weird error thrown by geopy.
For example:
In [51]: def dist_calc(row):
...: start = row[['start_lat','start_long']]
...: end = row[['end_lat', 'end_long']]
...: return vincenty(start, end).miles
...:
In [52]: df.apply(lambda row: dist_calc(row), axis=1)
Out[52]:
0 3.223232
2 1.674780
3 1.365851
4 0.420305
dtype: float64
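The same import can be reproduced without the clipboard by feeding the raw text through io.StringIO; the sep regex is the one used above, and the rename uses rename(columns=...) (note that rename_axis with a mapper, as in the session above, is removed in recent pandas):

```python
import io
import pandas as pd

raw = (
    "start lat  start long  end_lat  end_long\n"
    "0  38.902760 -77.038630  38.880300 -76.986200\n"
    "2  38.895914 -77.026064  38.915400 -77.044600\n"
)

# Split on 2+ whitespace characters, or on 1 whitespace followed by a
# minus sign; the extra leading field in each data row becomes the index.
df = pd.read_csv(io.StringIO(raw), sep=r'\s{2,}|\s(?=-)', engine='python')
df = df.rename(columns={'start lat': 'start_lat', 'start long': 'start_long'})
print(df)
```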
