Is there a Python function that removes characters (with a digit) from a string? - python-3.x

I am working on a project about gentrification. My teammates pulled data from the census and cleaned it to get the values we need. The issue is that the zip code values lose their leading 0s (e.g. "2322" when it should be "02322"). We managed to find the tact value that prints the full zip code with the tact codes ("ZCTA5 02322"). I want to remove "ZCTA5" to get the zip code alone.
I've tried the below code but it only gets rid of the "ZCTA" instead of "ZCTA5" (i.e. "502322"). I'm also concerned that if I manage to remove the 5 with the characters, it will remove all 5's in the zip codes as well.
From there I will be pulling from pgeocode to access the respective lat & lng values to create the heatmap. Please help?
I've tried the .replace() and .translate() functions. Replace still prints the zip codes with the 5. Translate raises an AttributeError.
Sample data
Zipcode | Name | Change_In_Value | Change_In_Income | Change_In_Degree | Change_In_Rent
2322 | ZCTA5 02322 | -0.050242 | -0.010953 | 0.528509 | -0.013263
2324 | ZCTA5 02324 | 0.012279 | -0.022949 | -0.040456 | 0.210664
2330 | ZCTA5 02330 | 0.020438 | 0.087415 | -0.095076 | -0.147382
2332 | ZCTA5 02332 | 0.035024 | 0.054745 | 0.044315 | 1.273772
2333 | ZCTA5 02333 | -0.012588 | 0.079819 | 0.182517 | 0.156093
Translate
zipcode = []
test2 = gent_df['Name'] = gent_df['Name'].astype(str).translate({ord('ZCTA5'): None}).astype(int)
zipcode.append(test2)
test2.head()
Replace
zipcode = []
test2 = gent_df['Name'] = gent_df['Name'].astype(str).replace(r'\D', '').astype(int)
zipcode.append(test2)
test2.head()
Replace
Expected:
24093
26039
34785
38944
29826
Actual:
524093
526039
534785
538944
529826
Translate
Expected:
24093
26039
34785
38944
29826
Actual:
AttributeError Traceback (most recent call last)
<ipython-input-71-0e5ff4660e45> in <module>
3 zipcode = []
4
----> 5 test2 = gent_df['Name'] = gent_df['Name'].astype(str).translate({ord('ZCTA5'): None}).astype(int)
6 # zipcode.append(test2)
7 test2.head()
~\Anaconda3\envs\MyPyEnv\lib\site-packages\pandas\core\generic.py in __getattr__(self, name)
5178 if self._info_axis._can_hold_identifiers_and_holds_name(name):
5179 return self[name]
-> 5180 return object.__getattribute__(self, name)
5181
5182 def __setattr__(self, name, value):
AttributeError: 'Series' object has no attribute 'translate'

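It looks like you are using pandas, so you can use the .str string methods. Be careful with .str.lstrip(): it treats its argument as a set of characters, not as a prefix. lstrip('ZCTA5') stops at the space and leaves ' 02322', and adding the space to the set would also strip leading 5s from a zip code such as 55102. Replacing the literal prefix avoids both problems:
gent_df.Name = gent_df.Name.str.replace('ZCTA5 ', '', regex=False)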
Here is a link to the library page for .strip(), .lstrip(), and .rstrip()
I hope this helps!

There are many ways to do this. I can think of 2 off the top of my head.
If you want to keep the last 5 characters of the zipcode string, regardless of whether they are digits or not:
gent_df['Name'] = gent_df['Name'].str[-5:]
If you want to get the last 5 digits of the zipcode string:
gent_df['Name'] = gent_df['Name'].str.extract(r'(\d{5})$')[0]
Include some sample data for a more specific answer.
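One point worth checking in the question's own code: the trailing .astype(int) would drop the leading zero again, since int('02322') is 2322, so the codes probably need to stay strings. Below is a minimal standard-library sketch of the extraction. zip_from_name is a hypothetical helper name, and the sketch assumes every label ends in a five-digit code.

```python
import re


def zip_from_name(name):
    """Return the trailing 5-digit code from a label like 'ZCTA5 02322'.

    Hypothetical helper: returns None when the label does not end in
    exactly the pattern we expect (five digits at the end).
    """
    match = re.search(r"(\d{5})$", str(name).strip())
    return match.group(1) if match else None


names = ["ZCTA5 02322", "ZCTA5 55102"]
codes = [zip_from_name(n) for n in names]  # kept as strings on purpose

# str.lstrip strips a *set of characters*, not a prefix, so zip codes
# that start with 5 lose digits when the set contains '5':
naive = "ZCTA5 55102".lstrip("ZCTA5 ")

# Converting to int is what loses the leading zero in the first place.
as_int = int("02322")
```

With pandas, the same function could be applied as gent_df['Name'].map(zip_from_name), assuming the column holds labels like those in the sample data.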

Related

Bad value during floating point read

I'm running some simulation code on Ubuntu and I keep running into the same error. I am trying to read data from a .dat file, but there is an error whose cause I cannot find.
This is the error message:
At line 1939 of file CompoundMPIBSC20200823.f90 (unit = 11, file = 'C-340120b.dat')
Fortran runtime error: Bad value during floating point read
And C-340120b.dat file looks like this:
C-340120b.dat file
6 10 1.531581196563372e-15
0.0014553174 0.0055615333 0.0119703978 0.0203850084 0.0305528957 0.0422600997 0.0553257997 0.0695976542 0.0849475255 0.1012676622
0.1184670631 0.1364683308 0.1552047081 0.1746171362 0.1946516651 0.2152575030 0.2363847713 0.2579826431 0.2799978445 0.3023733948
0.3250475021 0.3479528735 0.3710162901 0.3941586119 0.4172949742 0.4403351781 0.4631846022 0.4857450840 0.5079162267 0.5295967855
0.5506862041 0.5710862773 0.5907027347 0.6094469646 0.6272378474 0.6440030470 0.6596808652 0.6742212377 0.6875870601 0.6997550335
0.7107163254 0.7204769965 0.7290581527 0.7364958286 0.7428406051 0.7481569620 0.7525223874 0.7560262383 0.7587683705 0.7608575997
0.7624099924 0.7635469966 0.7643935146 0.7650758792 0.7657198055 0.7664483527 0.7673799150 0.7686263022 0.7702908977 0.7724669898
0.7752362202 0.7786672785 0.7828147402 0.7877181912 0.7934015601 0.7998727305 0.8071233638 0.8151290525 0.8238496666 0.8332299862
0.8432005800 0.8536788860 0.8645704950 0.8757706624 0.8871659719 0.8986361128 0.9100557539 0.9212965992 0.9322292652 0.9427254733
0.9526599682 0.9619125838 0.9703700425 0.9779277777 0.9844915839 0.9899791031 0.9943211058 0.9974625834 0.9993636185 1.0000000000
6 11 1.475893077189510e-15
0.0016525844 0.0062956494 0.0135059282 0.0229208176 0.0342303569 0.0471704147 0.0615165460 0.0770787515 0.0936966304 0.1112350678
0.1295802432 0.1486359799 0.1683205560 0.1885633307 0.2093018003 0.2304793480 0.2520427451 0.2739399970 0.2961185348 0.3185236648
0.3410972443 0.3637767806 0.3864948806 0.4091790904 0.4317517598 0.4541305840 0.4762291287 0.4979578467 0.5192252758 0.5399391683
0.5600082344 0.5793434896 0.5978599757 0.6154784463 0.6321271026 0.6477428598 0.6622731054 0.6756767045 0.6879251243 0.6990032865
0.7089101737 0.7176591771 0.7252781911 0.7318094543 0.7373091004 0.7418464594 0.7455031054 0.7483717148 0.7505546359 0.7521623630
0.7533118077 0.7541244615 0.7547244701 0.7552366540 0.7557845046 0.7564881945 0.7574626361 0.7588156121 0.7606460387 0.7630423235
0.7660809513 0.7698252039 0.7743241221 0.7796116798 0.7857062078 0.7926100427 0.8003094738 0.8087748795 0.8179612002 0.8278085409
0.8382431405 0.8491784407 0.8605164079 0.8721490905 0.8839601234 0.8958267320 0.9076214149 0.9192140226 0.9304737122 0.9412709632
0.9514795610 0.9609785827 0.9696542131 0.9774014787 0.9841259295 0.9897450178 0.9941894123 0.9974040480 0.9993489837 1.0000000000
...
...
...
6 30500 2.203435261320421e-18
0.5647132406 0.8435296561 0.9197993603 0.9501219587 0.9657979424 0.9751478483 0.9812026747 0.9853454006 0.9882967561 0.9904674356
0.9921063590 0.9933715119 0.9943670960 0.9951635956 0.9958099778 0.9963412928 0.9967830203 0.9971540150 0.9974684566 0.9977371653
0.9979685074 0.9981690312 0.9983439172 0.9984973073 0.9986325426 0.9987523421 0.9988589360 0.9989541675 0.9990395697 0.9991164266
0.9991858197 0.9992486651 0.9993057428 0.9993577205 0.9994051720 0.9994485928 0.9994884125 0.9995250049 0.9995586968 0.9995897744
0.9996184895 0.9996450644 0.9996696956 0.9996925575 0.9997138054 0.9997335777 0.9997519982 0.9997691778 0.9997852163 0.9998002034
0.9998142198 0.9998273389 0.9998396267 0.9998511432 0.9998619429 0.9998720754 0.9998815859 0.9998905154 0.9998989017 0.9999067791
0.9999141790 0.9999211303 0.9999276594 0.9999337906 0.9999395461 0.9999449463 0.9999500100 0.9999547546 0.9999591958 0.9999633483
0.9999672254 0.9999708395 0.9999742019 0.9999773228 0.9999802118 0.9999828775 0.9999853278 0.9999875699 0.9999896101 0.9999914544
0.9999931080 0.9999945754 0.9999958609 0.9999969680 0.9999978997 0.9999986585 0.9999992466 0.9999996655 0.9999999164 1.0000000000
The 3 dots in the above data file are just to tell you that there are many more entries in the file. These dots are not there in the original file.
And the program:
file CompoundMPIBSC20200823.f90
open(11,file=fname(mel),status='old',form='formatted')
open(12,file=fname1(mel),status='old',form='formatted')
do men=1,nen !nen=75:energy intervals Energy split number cycle
!!write(iw,*) 'mel,men:',mel,men
read (11,'(i2,I7,d22.15/(10f13.10))') na,nenerg,tcrpc,(rpw(i),i=1,ith)
read (12,'(i2,I7,d22.15/(10f15.10))') na,nenerg,tcrpc,(rpw1(i),i=1,ith)
!!write(iw,*) 'mel,men,na,nenerg:',mel,men,na,nenerg,tcrpc
ftcs(men)=tcrpc ! corresponds to the total elastic scattering cross section at the energy
penergy(men)=nenerg/1000.
rpw(ith)=1.
!---------------------------
line 1939 is
read (11,'(i2,I7,d22.15/(10f13.10))') na,nenerg,tcrpc,(rpw(i),i=1,ith)
I've tried different modifications of the code but didn't get any results.
Any help would be greatly appreciated!
You appear to have spaces padding the fields in your data file, but not in your read format. The first line of your file (with column labels) is
000000000111111111122222222223333
123456789012345678901234567890123
6 10 1.531581196563372e-15
so splitting this into i2,I7,d22.15 gives
i2 | I7 | d22.15 |
00 | 0000000 | 1111111111222222222233 | 33
12 | 3456789 | 0123456789012345678901 | 23
6 | 1 | 0 1.531581196563372e- | 15
which is clearly not as intended.
There are two ways around this problem:
As Ian Bush points out, you can forgo the read format entirely and use list-directed input, as
read (11,*) na,nenerg,tcrpc,(rpw(i),i=1,ith)
This will parse your file token by token rather than relying on column widths, and is usually a much better option for parsing data files.
If you must use a read format, you need to add space padding to it, e.g.
'(i2,X,I7,X,d22.15/10(X,f13.10))', which will then split the input string as
i2 | X | I7 | X | d22.15
00 | 0 | 0000001 | 1 | 1111111122222222223333
12 | 3 | 4567890 | 1 | 2345678901234567890123
6 | | 10 | | 1.531581196563372e-15
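The column arithmetic above can be checked with a short Python sketch that mimics fixed-width field splitting. The padded line below is a reconstruction, assumed from the 33-column ruler in the answer (the data as shown in the question has lost its padding), and slice_fields is a hypothetical helper, not part of any Fortran tooling.

```python
# Assumed reconstruction of the 33-column header line: a 2-wide field,
# one pad space, a 7-wide field, one pad space, a 22-wide field.
line = " 6" + " " + "     10" + " " + " 1.531581196563372e-15"


def slice_fields(text, widths):
    """Cut text into consecutive fixed-width fields, as a format would."""
    fields, pos = [], 0
    for width in widths:
        fields.append(text[pos:pos + width])
        pos += width
    return fields


# Mimics (i2,I7,d22.15): the pad spaces shift every later field.
fixed = slice_fields(line, [2, 7, 22])

# Mimics (i2,X,I7,X,d22.15): the pad spaces are skipped explicitly.
padded = slice_fields(line, [2, 1, 7, 1, 22])

# Mimics list-directed input, read (11,*): split on whitespace.
tokens = line.split()

for label, fields in (("fixed", fixed), ("padded", padded), ("tokens", tokens)):
    print(label, fields)
```

In the unpadded split, the third field comes out as "0  1.531581196563372e-", which is not a valid number; that is consistent with the "Bad value during floating point read" message.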

Let the user create a custom format using a string

I'd like a user to be able to create a custom format in QtWidgets.QPlainTextEdit() and it would format the string and split out the results in another QtWidgets.QPlainTextEdit().
For example:
movie = {
"Title":"The Shawshank Redemption",
"Year":"1994",
"Rated":"R",
"Released":"14 Oct 1994",
"Runtime":"142 min",
"Genre":"Drama",
"Director":"Frank Darabont",
"Writer":"Stephen King (short story \"Rita Hayworth and Shawshank Redemption\"),Frank Darabont (screenplay)",
"Actors":"Tim Robbins, Morgan Freeman, Bob Gunton, William Sadler",
"Plot":"Two imprisoned men bond over a number of years, finding solace and eventual redemption through acts of common decency.",
"Language":"English",
"Country":"USA",
"Awards":"Nominated for 7 Oscars. Another 21 wins & 36 nominations.",
"Poster":"https://m.media-amazon.com/images/M/MV5BMDFkYTc0MGEtZmNhMC00ZDIzLWFmNTEtODM1ZmRlYWMwMWFmXkEyXkFqcGdeQXVyMTMxODk2OTU#._V1_SX300.jpg",
"Ratings": [
{
"Source":"Internet Movie Database",
"Value":"9.3/10"
},
{
"Source":"Rotten Tomatoes",
"Value":"91%"
},
{
"Source":"Metacritic",
"Value":"80/100"
}
],
"Metascore":"80",
"imdbRating":"9.3",
"imdbVotes":"2,367,380",
"imdbID":"tt0111161",
"Type":"movie",
"DVD":"15 Aug 2008",
"BoxOffice":"$28,699,976",
"Production":"Columbia Pictures, Castle Rock Entertainment",
"Website":"N/A"
}
custom_format = '[ {Title} | ⌚ {Runtime} | ⭐ {Genre} | 📅 {Released} | {Rated} ]'.format(Title=movie['Title'], Runtime=movie['Runtime'], Genre=movie['Genre'],Released=movie['Released'],Rated=movie['Rated'])
print(custom_format)
This code above, would easily print [ The Shawshank Redemption | ⌚ 142 min | ⭐ Drama | 📅 14 Oct 1994 | R ].
However, if I change this code from:
custom_format = '[ {Title} | ⌚ {Runtime} | ⭐ {Genre} | 📅 {Released} | {Rated} ]'.format(Title=movie['Title'], Runtime=movie['Runtime'], Genre=movie['Genre'],Released=movie['Released'],Rated=movie['Rated'])
To:
custom_format = "'[ {Title} | ⌚ {Runtime} | ⭐ {Genre} | 📅 {Released} | {Rated} ]'.format(Title=movie['Title'], Runtime=movie['Runtime'], Genre=movie['Genre'],Released=movie['Released'],Rated=movie['Rated'])"
Notice that the whole thing is wrapped in "". Therefore it's a string, and doing this will not print out the format that I want.
The reason I wrapped it in "" is that when I add my original custom_format to a QtWidgets.QPlainTextEdit(), it is converted into a string, so it won't be formatted later on.
So my original idea was: the user creates a custom format for themselves in a QtWidgets.QPlainTextEdit(). Then I copy that format, open a new window where the movie JSON variable is available, and paste the format into another QtWidgets.QPlainTextEdit(), where it would hopefully be shown formatted correctly.
Any help on this would be appreciated.
ADDITIONAL INFORMATION:
User creates their format inside QtWidgets.QPlainTextEdit().
Then the user clicks Test Format which should display [ The Shawshank Redemption | ⌚ 142 min | ⭐ Drama | 📅 14 Oct 1994 | R ] but instead it displays
Trying to use the full format command would require an eval(), which is normally considered not only bad practice, but also a serious security issue, especially when the input argument is completely set by the user.
Since the fields are known, I see little point in providing the whole format line, and it is better to parse the format string looking for keywords, then use keyword lookup to create the output.
import re
from PyQt5 import QtWidgets  # assumption: PyQt5; adjust the import for PyQt6/PySide

# `movie` is the dictionary defined in the question

class Formatter(QtWidgets.QWidget):
    def __init__(self):
        super().__init__()
        layout = QtWidgets.QVBoxLayout(self)
        self.formatBase = QtWidgets.QPlainTextEdit(
            '[ {Title} | ⌚ {Runtime} | ⭐ {Genre} | 📅 {Released} | {Rated} ]')
        self.formatOutput = QtWidgets.QPlainTextEdit()
        layout.addWidget(self.formatBase)
        layout.addWidget(self.formatOutput)
        # re-render the output every time the format text changes
        self.formatBase.textChanged.connect(self.processFormat)
        self.processFormat()

    def processFormat(self):
        format_str = self.formatBase.toPlainText()
        # drop escaped (doubled) braces so they are not mistaken for fields
        clean = re.sub('{{', '', re.sub('}}', '', format_str))
        # capture keyword arguments
        tokens = re.split(r'\{(.*?)\}', clean)
        keywords = tokens[1::2]
        try:
            # build the dictionary with given arguments, unrecognized keywords
            # are just printed back in the {key} form, in order to let the
            # user know that the key wasn't valid;
            values = {k: movie.get(k, '{{{}}}'.format(k)) for k in keywords}
            self.formatOutput.setPlainText(format_str.format(**values))
        except (ValueError, KeyError, IndexError):
            # unmatched braces, or positional fields such as {} or {0}
            pass
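The same keyword-lookup idea can also be sketched without Qt, using only str.format_map from the standard library. KeepUnknown and render are hypothetical names introduced for illustration, and the two-field dictionary is a shortened stand-in for the movie data in the question.

```python
class KeepUnknown(dict):
    """Dictionary that echoes unknown fields back in {key} form."""

    def __missing__(self, key):
        return "{" + key + "}"


def render(template, data):
    """Fill a user-supplied template without eval().

    Malformed templates (unmatched braces, positional fields, bad
    attribute access) are returned unchanged instead of raising.
    """
    try:
        return template.format_map(KeepUnknown(data))
    except (ValueError, KeyError, IndexError, AttributeError, TypeError):
        return template


sample = {"Title": "The Shawshank Redemption", "Runtime": "142 min"}

print(render("[ {Title} | {Runtime} ]", sample))
print(render("{Title} ({Bogus})", sample))
print(render("{Title", sample))
```

Because the template is only ever passed to format_map, the user's text is treated as data and never executed.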

Katalon: Possible to assert response = data?

I store JSON test data in an Excel file.
I use Apache POI to read the JSON data, parse it as the request body, and call it from Katalon.
Then I write many lines of assertions (Groovy assert) to verify that each response field equals the test data.
Example:
assert test.responseText.fieldA == 'abc'
assert test.responseText.fieldB == 'xyz'
And so on, for a total of 20 fields.
I'm wondering whether there is a better way to use the JSON data stored in the data file to assert that the response equals the test data, so that I can save a lot of time keying in each line and modifying the lines if the test data changes.
Please advise whether this can be achieved.
Here is an example: you have two excel sheets - current values and expected values (values you are testing against).
Current values:
No. | key | value
----+-----+------
1 a 100
2 b 6
3 c 13
Expected values:
No. | key | value
----+-----+------
1 a 100
2 b 6
3 c 14
You need to add those to Data Files:
The following code will compare the values in the for loop and the assertion will fail on the third run (13!=14):
def expectedData = findTestData("expected")
def currentData = findTestData("current")
for(i=1; i<=currentData.getRowNumbers(); i++){
    assert currentData.getValue(2, i) == expectedData.getValue(2, i)
}
Failure message should look like this:
2020-07-02 15:16:40.471 ERROR c.k.katalon.core.main.TestCaseExecutor - ❌ Test Cases/table comparison FAILED.
Reason:
Assertion failed:
assert currentData.getValue(2, i) == expectedData.getValue(2, i)
| | | | | | |
| 14 3 | | 13 3
| | com.kms.katalon.core.testdata.reader.SheetPOI@5aabbb29
| false
com.kms.katalon.core.testdata.reader.SheetPOI@72c927f1
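The row-by-row loop stops at the first failing assertion. If the goal is to compare a whole JSON response against the stored test data in one step, the idea can be sketched as below. This is a language-neutral illustration written in Python, not Katalon or Groovy API; diff_fields is a hypothetical helper and the sample values mirror the two tables above.

```python
import json


def diff_fields(expected, actual):
    """Return {key: (expected_value, actual_value)} for every mismatch.

    Keys present on only one side are reported with None for the
    missing value.
    """
    keys = sorted(set(expected) | set(actual))
    return {
        key: (expected.get(key), actual.get(key))
        for key in keys
        if expected.get(key) != actual.get(key)
    }


# Stand-ins for the stored test data and the parsed response body.
expected = json.loads('{"a": 100, "b": 6, "c": 14}')
actual = json.loads('{"a": 100, "b": 6, "c": 13}')

mismatches = diff_fields(expected, actual)
print(mismatches)
```

Collecting all mismatches first and asserting once that the result is empty reports every differing field in a single run, rather than only the first one.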

How to get parents and grand parents tags given specific attribute in XML in python?

I have an xml with a structure like this one:
<cat>
<foo>
<fooID>1</fooID>
<fooName>One</fooName>
<bar>
<barID>a</barID>
<barName>small_a</barName>
<barClass>
<baz>
<qux>
<corge>
<corgeName>...</corgeName>
<corgeType>
<corgeReport>
<corgeReportRes Reference="x" Channel="High">
<Pos>1</Pos>
</corgeReportRes>
</corgeReport>
</corgeType>
</corge>
</qux>
</baz>
</barClass>
</bar>
<bar>
<barID>b</barID>
<barName>small_b</barName>
<barClass>
<baz>
<qux>
<corge>
<corgeName>...</corgeName>
<corgeType>
<corgeReport>
<corgeReportRes Reference="y" Channel="High">
<Pos>1</Pos>
</corgeReportRes>
</corgeReport>
</corgeType>
</corge>
</qux>
</baz>
</barClass>
</bar>
</foo>
<foo>
<fooID>2</fooID>
<fooName>Two</fooName>
<bar>
<barID>c</barID>
<barName>small_c</barName>
<barClass>
<baz>
<qux>
<corge>
<corgeName>...</corgeName>
<corgeType>
<corgeReport>
<corgeReportRes Reference="z" Channel="High">
<Pos>1</Pos>
</corgeReportRes>
</corgeReport>
</corgeType>
</corge>
</qux>
</baz>
</barClass>
</bar>
</foo>
</cat>
And, I would like to obtain the values of specific parent/grand parent/grand grand parent tags that have a node with attribute Channel="High". I would like to obtain only fooID value, fooName value, barID value, barName value.
I have the following code in Python 3:
import xml.etree.ElementTree as xmlET
root = xmlET.parse('file.xml').getroot()
test = root.findall(".//*[@Channel='High']")
Which is actually giving me a list of elements that match, however, I still need the information of the specific parents/grand parents/grand grand parents.
How could I do that?
fooID | fooName | barID | barName
- - - - - - - - - - - - - - - - -
1 | One | a | small_a <-- This is the information I'm interested
1 | One | b | small_b <-- Also this
2 | Two | c | small_c <-- And this
Edit: fooID and fooName nodes are siblings of the grand-grand-parent bar, the one that contains the Channel="High". It's almost the same case for barID and barName, they are siblings of the grand-parent barClass, the one that contains the Channel="High". Also, what I want to obtain is the values 1, One, a and small_a, not filtering by it, since there will be multiple foo blocks.
If I understand you correctly, you are probably looking for something like this (using lxml):
from lxml import etree
foos = """[your xml above]"""
doc = etree.fromstring(foos)
rows = []
# select every bar that contains a corgeReportRes with Channel="High"
for bar in doc.xpath('//bar[.//corgeReportRes[@Channel="High"]]'):
    foo = bar.getparent()  # the enclosing foo element
    rows.append([foo.findtext('fooID'), foo.findtext('fooName'),
                 bar.findtext('barID'), bar.findtext('barName')])
print('fooID | fooName | barID | barName')
for row in rows:
    print(' | '.join(row))
Output:
fooID | fooName | barID | barName
1 | One | a | small_a
1 | One | b | small_b
2 | Two | c | small_c
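The question uses xml.etree.ElementTree, whose elements have no getparent() method. A minimal sketch with the standard library walks down from each foo instead of up from the match. The XML below is a shortened stand-in for the structure in the question (the intermediate baz/qux/corge levels are omitted, and the Channel="Low" entry is added only to show the filter working); high_channel_rows is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Shortened, partly invented sample; not the asker's actual file.
XML = """
<cat>
  <foo>
    <fooID>1</fooID><fooName>One</fooName>
    <bar>
      <barID>a</barID><barName>small_a</barName>
      <barClass><corgeReportRes Reference="x" Channel="High"/></barClass>
    </bar>
    <bar>
      <barID>b</barID><barName>small_b</barName>
      <barClass><corgeReportRes Reference="y" Channel="Low"/></barClass>
    </bar>
  </foo>
  <foo>
    <fooID>2</fooID><fooName>Two</fooName>
    <bar>
      <barID>c</barID><barName>small_c</barName>
      <barClass><corgeReportRes Reference="z" Channel="High"/></barClass>
    </bar>
  </foo>
</cat>
"""


def high_channel_rows(root):
    """Collect (fooID, fooName, barID, barName) for bars with Channel='High'."""
    rows = []
    for foo in root.iter("foo"):
        for bar in foo.findall("bar"):
            # any descendant of this bar carrying Channel="High" qualifies
            if bar.find(".//corgeReportRes[@Channel='High']") is not None:
                rows.append((foo.findtext("fooID"), foo.findtext("fooName"),
                             bar.findtext("barID"), bar.findtext("barName")))
    return rows


rows = high_channel_rows(ET.fromstring(XML))
for row in rows:
    print(" | ".join(row))
```

Iterating top-down keeps the foo and bar elements in scope, so no parent lookup is needed.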

How to remove '/5' from CSV file

I am cleaning a restaurant data set using Pandas' read_csv.
I have columns like this:
name, online_order, book_table, rate, votes
xxxx, Yes, Yes, 4.5/5, 705
I expect them to be like this:
name, online_order, book_table, rate, votes
xxxx, Yes, Yes, 4.5, 705
You need to split each item in dataframe["rate"] on "/" and keep the first part. Define a helper and apply it to the column with .apply:
def getRate(x):
    return str(x).split("/")[0]
To use it with the rate column:
dataframe["rate"] = dataframe["rate"].apply(getRate)
You can use Python's str.split() method to remove specific text, provided that the text is consistently "/5" and there are no instances of "/5" that you want to keep in that string. You can use it like this:
num = "4.5/5"
num.split("/5")[0]
output: '4.5'
If this isn't exactly what you need, there's more regex python functions here
You can use Series.apply() to run your replacement operation on the rate column:
def clean(x):
    if "/" not in x:
        return x
    else:
        # keep everything before the first "/"
        return x[0:x.index('/')]
df.rate = df.rate.apply(clean)
print(df)
Output
+----+-------+---------------+-------------+-------+-------+
| | name | online_order | book_table | rate | votes |
+----+-------+---------------+-------------+-------+-------+
| 0 | xxxx | Yes | Yes | 4.5 | 705 |
+----+-------+---------------+-------------+-------+-------+
EDIT
Edited to handle situations in which there could be multiple / or a denominator other than /5 (e.g. /4 or /1/3 ...)
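The answers above leave the rating as a string. If a numeric column is wanted, a small helper can convert it at the same time. This is a sketch under assumptions: parse_rate is a hypothetical name, and the handling of non-numeric cells is a guess at what a real data set might contain, not something shown in the question.

```python
def parse_rate(value):
    """Return the number before the first '/' as a float.

    Returns None when the text before the slash is not numeric
    (for example a placeholder such as 'NEW' or '-').
    """
    head = str(value).split("/", 1)[0].strip()
    try:
        return float(head)
    except ValueError:
        return None


samples = ["4.5/5", "3.9 /5", "4.1/1/3", "NEW"]
parsed = [parse_rate(s) for s in samples]
print(parsed)
```

With pandas this could be applied as df["rate"] = df["rate"].map(parse_rate), assuming the column was read as text.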
