Looping over Tuple of Bytes object gives ASCII value - python-3.x

The code below in Python 3:
for val in A:
    print(val)
Returns:
[(b'3 (RFC822 {821}', b'MIME-Version: 1.0\r\nDate: Sun, 2 Feb 2020
22:12:19 +0530\r\nMessage-ID:
\r\nSubject:
code\r\nFrom: abc >\r\nTo: abc \r\nContent-Type:
multipart/alternative;
boundary="43434343"\r\n\r\n--0000000000008ecb2e059d9a7dfe\r\nContent-Type:
text/plain; charset="UTF-8"\r\n\r\n1. 4549 3867 6. 1755 6816\r\n2.
3068 0287 7. 8557 7000\r\n3. 3827 1727 8. 4177 1609\r\n4. 5093 4909 9.
9799 3366\r\n5. 1069 7992 10. 5141
2029\r\n\r\n--0000000000008ecb2e059d9a7dfe\r\nContent-Type: text/html;
charset="UTF-8"\r\n\r\ntest
code\r\n\r\n--0000000000008ecb2e059d9a7dfe--'), b')']
whereas:
for val in A:
    for v in val:
        print(v)
returns:
b'3 (RFC822 {821}' b'MIME-Version: 1.0\r\nDate: Sun, 2 Feb 2020
22:12:19 +0530\r\nMessage-ID:
\r\nSubject:
code\r\nFrom: >\r\nTo: \r\nContent-Type: multipart/alternative;
boundary="0000000000008ecb2e059d9a7dfe"\r\n\r\n--0000000000008ecb2e059d9a7dfe\r\nContent-Type:
text/plain; charset="UTF-8"\r\n\r\n1. 4549 3867 6. 1755 6816\r\n2.
3068 0287 7. 8557 7000\r\n3. 3827 1727 8. 4177 1609\r\n4. 5093 4909 9.
9799 3366\r\n5. 1069 7992 10. 5141
2029\r\n\r\n--0000000000008ecb2e059d9a7dfe\r\nContent-Type: text/html;
charset="UTF-8"\r\n\r\n\r\n\r\n--0000000000008ecb2e059d9a7dfe--' 41
I don't understand why I am getting the ASCII value of ')' (i.e. 41), and how can I just read it as ')'?

It would appear that when you iterate over a bytes object, it yields the integer (ASCII) value of each byte in it, e.g.:
>>> phrase = b'Hello, world!'
>>> chars = list(phrase)
>>> chars
[72, 101, 108, 108, 111, 44, 32, 119, 111, 114, 108, 100, 33]
With your example, you only have one character, so only one number is printed. The reason none of the other bytes objects are printed as integers is that they're enclosed in an extra tuple, so it's the tuple that gets iterated over instead.
Fixing your code:
A = [
    (
        b'3 (RFC822 {821}',
        b'MIME-Version: 1.0\r\nDate: Sun, 2 Feb 2020 22:12:19 +0530\r\nMessage-ID: \r\nSubject: code\r\nFrom: abc >\r\nTo: abc \r\nContentType: multipart/alternative; boundary="43434343"\r\n\r\n--0000000000008ecb2e059d9a7dfe\r\nContent-Type: text/plain; charset="UTF-8"\r\n\r\n1. 4549 3867 6. 1755 6816\r\n2. 3068 0287 7. 8557 7000\r\n3. 3827 1727 8. 4177 1609\r\n4. 5093 4909 9. 9799 3366\r\n5. 1069 7992 10. 5141 2029\r\n\r\n--0000000000008ecb2e059d9a7dfe\r\nContent-Type: text/html; charset="UTF-8"\r\n\r\ntest code\r\n\r\n--0000000000008ecb2e059d9a7dfe--'
    ),
    b')'
]
for val in A:
    if isinstance(val, bytes):
        print(val)
    else:
        for v in val:
            print(v)
Output:
b'3 (RFC822 {821}'
b'MIME-Version: 1.0\r\nDate: Sun, 2 Feb 2020 22:12:19 +0530\r\nMessage-ID: \r\nSubject: code\r\nFrom: abc >\r\nTo: abc \r\nContentType: multipart/alternative; boundary="43434343"\r\n\r\n--0000000000008ecb2e059d9a7dfe\r\nContent-Type: text/plain; charset="UTF-8"\r\n\r\n1. 4549 3867 6. 1755 6816\r\n2. 3068 0287 7. 8557 7000\r\n3. 3827 1727 8. 4177 1609\r\n4. 5093 4909 9. 9799 3366\r\n5. 1069 7992 10. 5141 2029\r\n\r\n--0000000000008ecb2e059d9a7dfe\r\nContent-Type: text/html; charset="UTF-8"\r\n\r\ntest code\r\n\r\n--0000000000008ecb2e059d9a7dfe--'
b')'
Basically, if the object we encounter in A is a bytes object, we print it as-is. Otherwise (e.g. if it's a tuple), we iterate over it and print its items.
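If you do end up with a single integer like 41 and want it back as a character, a minimal sketch (plain standard-library Python, nothing assumed beyond that) looks like this:
v = 41                      # the integer produced by iterating over b')'
print(chr(v))               # ')'  - the corresponding str character
print(bytes([v]))           # b')' - a one-byte bytes object
print(bytes([v]).decode())  # ')'  - decode that single byte to str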

Related

Set the extracted text in a column As a Single String in Pytesseract

So I extracted a string from an image with 3 columns.
The extracted text is:
SUBJECT GRADE FINALGRADE CREDITS
ADVANCED CALCULUS 1 1.54 A 3
I want to put a separator between these items so that it looks like this:
SUBJECT, GRADE, FINALGRADE, CREDITS
ADVANCED CALCULUS 1, 1.54, A, 3
We can achieve the solution in two steps:
Specify the starting keyword.
Split the line using space as the separator.
If we look at the example provided in the comments, we don't need any image preprocessing, since there are no artifacts in the image.
Assume we want to separate the row starting with "state" with commas.
Specify the starting keyword:
start_word = line.split(" ")[0]
Split the line using space as the separator:
if start_word == "state":
    line = line.split(" ")
Now, for each word in the line, append a comma:
for word in line:
    result += word + ", "
But we need to remove the last two characters, otherwise the result will end with "2000, ":
result = result[:-2]
print(result)
Result:
state, 1983, 1987, 1988, 1993, 1994, 1999, 2000
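As a side note, the joining step can also be done with str.join, which inserts the separators for you and avoids trimming the trailing ", " (a small sketch, not part of the original answer):
line = "state 1983 1987 1988 1993 1994 1999 2000"  # example OCR line
result = ", ".join(line.split())                   # join the words with ", "
print(result)  # state, 1983, 1987, 1988, 1993, 1994, 1999, 2000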
Code:
import cv2
import pytesseract

img = cv2.imread("15f8U.png")
gry = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Thresholding is shown for completeness but not used below,
# since no preprocessing is needed for this image.
thr = cv2.adaptiveThreshold(gry, 255,
                            cv2.ADAPTIVE_THRESH_MEAN_C,
                            cv2.THRESH_BINARY, 11, 2)

txt = pytesseract.image_to_string(gry)
txt = txt.split("\n")

result = ""
for line in txt:
    start_word = line.split(" ")[0]
    if start_word == "state":
        line = line.split(" ")
        for word in line:
            result += word + ", "
        result = result[:-2]
        print(result)
        continue
    if line != "":
        print(line)
Result:
Table 1: WAGE SAMPLE STATISTICS, by year and state (1983-2000)
Logged mean wages
in year
state, 1983, 1987, 1988, 1993, 1994, 1999, 2000
Andhra Pradesh 5.17 5.49 5.53 6.28 6.24 5.77 5.80
Gujarat 9 6.04 5.92 6.64 6.58 6.09 6.04
Haryana 12 6.25 6.43 6.80 6.60 6.54 6.74
Manipur 54 6.31 6.73 7.15 7.09 6.90 6.83
Orissa 5.24 5.90 5.96 6.16 6.26 5.57 5.58
Tamil Nadu 5.19 5.67 5.68 6.31 633 6.02 5.97
Uttar Pradesh 5.55 6.06 3 6.61 2 6.00 6.07
Mizoram 6.43 5.44 6.03 681 6.76 8 7

how do I show the correct text in the PowerShell console

My process1.txt was generated with ps > process1.txt in Windows PowerShell.
It looks like this:
Handles NPM(K) PM(K) WS(K) CPU(s) Id SI ProcessName
------- ------ ----- ----- ------ -- -- -----------
167 12 2044 8024 0.02 14640 1 acrotray
448 29 46584 33768 0.30 14692 1 acwebbrowser
I use ipython3 on Windows 10.
In [12]: f = open('process1.txt',encoding = 'unicode_escape')
In [13]: line = f.readlines()
I want to show the third line like this:
167 12 2044 8024 0.02 14640 1 acrotray
but when I type line[3] it shows this:
In [15]: line[3]
Out[15]: '\x00\n'
My system is a UTF-8 system. How do I show the correct line?
I used ConEmu and regenerated the file in the ConEmu console with
tasklist > process3.txt
and got this:
In [1]: f = open('process3.txt')
In [2]: line = f.readlines()
In [3]: line[3]
Out[3]: 'System Idle Process 0 Services 0 8 K\n'
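A likely explanation, offered here as an assumption rather than something stated in the thread: Windows PowerShell's > redirection writes UTF-16LE ("Unicode") text by default, which is why opening the file with encoding='unicode_escape' yields stray '\x00' characters. A minimal sketch that reads the original file under that assumption:
# Assumes process1.txt was written by Windows PowerShell's ">" redirection,
# which by default produces UTF-16LE text with a byte-order mark.
with open('process1.txt', encoding='utf-16') as f:  # 'utf-16' consumes the BOM
    lines = f.readlines()
print(lines[3].rstrip())  # should print the "167 ... acrotray" row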

Groovy Script: List / Map Collection

I've created a script which I'm using to simulate the behaviour of a SOAP service in SOAP UI (as a mock service) for sprint testing purposes but am having problems when trying to iterate over a List I've created. The List is made up of a number of Maps, and each Map contains a List. It should look something like this:
[[Rec: 1, Items: ["AB123","BC234","CD345"]],[Rec: 2, Items: ["AB123","BC234","CD345","DE456"]]]
And this is the code I have to build up the List:
def offerMap = [:]
def outputList = []
def offerItemList = []
def outputMap = [:]
def outList = []
def item = ""
def rec = ""
offerItemList.add("AB123")
offerItemList.add("BC234")
offerItemList.add("CD345")
offerMap.put("Rec",1)
offerMap.put("Items",offerItemList)
outputList.add(offerMap)
log.info "OUT: outputList.size ${outputList.size()}"
log.info "OUT: offerItemList.size ${offerItemList.size()}"
offerMap.clear()
offerItemList.clear()
offerItemList.add("AB123")
offerItemList.add("BC234")
offerItemList.add("CD345")
offerItemList.add("DE456")
offerMap.put("Rec",2)
offerMap.put("Items",offerItemList)
outputList.add(offerMap)
log.info "OUT: outputList.size ${outputList.size()}"
log.info "OUT: offerItemList.size ${offerItemList.size()}"
And this is the code I have to iterate over the list:
outputList.each {
    log.info "OUT: outputList.size ${outputList.size()}"
    outputMap.clear()
    outputMap = it
    rec = outputMap.get("Rec")
    log.info "OUT: REC ${rec}"
    outList.clear()
    outList = outputMap.get("Items")
    outList.each {
        item = it
        log.info "OUT: Item ${item}"
    }
}
But this is not giving me the results I expect. The first problem is that the outputList.each loop appears to jump straight to the second entry in the list, as seen in the output:
Fri Nov 03 17:54:32 GMT 2017:INFO:OUT: outputList.size 1
Fri Nov 03 17:54:32 GMT 2017:INFO:OUT: offerItemList.size 3
Fri Nov 03 17:54:32 GMT 2017:INFO:OUT: outputList.size 2
Fri Nov 03 17:54:32 GMT 2017:INFO:OUT: offerItemList.size 4
Fri Nov 03 17:54:32 GMT 2017:INFO:OUT: outputList.size 2
Fri Nov 03 17:54:32 GMT 2017:INFO:OUT: REC 2
Fri Nov 03 17:54:32 GMT 2017:INFO:OUT: Item AB123
Fri Nov 03 17:54:32 GMT 2017:INFO:OUT: Item BC234
Fri Nov 03 17:54:32 GMT 2017:INFO:OUT: Item CD345
Fri Nov 03 17:54:32 GMT 2017:INFO:OUT: Item DE456
Fri Nov 03 17:54:32 GMT 2017:INFO:OUT: outputList.size 2
Fri Nov 03 17:54:32 GMT 2017:INFO:OUT: REC null
I'm running out of ideas and fear I may be missing something fundamental due to my lack of experience with Groovy.
Consider the following. Note that the goal isn't entirely clear, but this is an educated guess (it's also unclear whether Rec or Row is intended in your example at the top). A likely cause of the output you're seeing is that outputList ends up holding two references to the same offerMap object: clearing and re-filling a map does not create a new map, so both list entries reflect whatever that shared map last contained, and calling outputMap.clear() inside the each loop then empties it for both entries (hence REC null on the second pass). Building the structure from literals avoids that:
def outputList = [
['Rec': 1, 'Items': ["AB123","BC234","CD345"]],
['Rec': 2, 'Items': ["AB123","BC234","CD345","DE456"]]
]
outputList.each { outputMap ->
    // is row == Rec ???
    def row = outputMap['Rec']
    println "OUT ROW: ${row}"
    def items = outputMap['Items']
    items.each { item ->
        println "OUT Item: ${item}"
    }
}
Output:
$ groovy Example.groovy
OUT ROW: 1
OUT Item: AB123
OUT Item: BC234
OUT Item: CD345
OUT ROW: 2
OUT Item: AB123
OUT Item: BC234
OUT Item: CD345
OUT Item: DE456

using shift() to compare row elements

I have the sample data and code below. I'm trying to loop through the dataDF column with the function, find the first case of increasing values, and then return the Quarter value corresponding to that first increasing value in the dataDF column. I'm planning to use the function with apply, but I don't think I'm using shift() properly; if I just try to return dataDF.shift() I get an error. I'm new to Python, so any tips on how to compare a row to the next row, or on what I'm doing wrong with shift(), are greatly appreciated.
Sample Data:
return dataDF.head(20).to_dict()
{'Quarter': {246: '2008q3',
247: '2008q4',
248: '2009q1',
249: '2009q2',
250: '2009q3',
251: '2009q4',
252: '2010q1',
253: '2010q2',
254: '2010q3',
255: '2010q4',
256: '2011q1',
257: '2011q2',
258: '2011q3',
259: '2011q4',
260: '2012q1',
261: '2012q2',
262: '2012q3',
263: '2012q4',
264: '2013q1',
265: '2013q2'},
'dataDF': {246: 14843.0,
247: 14549.9,
248: 14383.9,
249: 14340.4,
250: 14384.1,
251: 14566.5,
252: 14681.1,
253: 14888.6,
254: 15057.700000000001,
255: 15230.200000000001,
256: 15238.4,
257: 15460.9,
258: 15587.1,
259: 15785.299999999999,
260: 15973.9,
261: 16121.9,
262: 16227.9,
263: 16297.299999999999,
264: 16475.400000000001,
265: 16541.400000000001}}
Code:
def find_end(x):
    qrts = []
    if (dataDF < dataDF.shift()):
        qrts.append(dataDF.iloc[0,:].shift(1))
    return qrts
Try
df.Quarter[df.dataDF > df.dataDF.shift()].iloc[0]
Returns
'2009q3'
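As a quick illustration of what the shift() comparison is doing, here is a small sketch using the first few dataDF values from the sample above: shift() moves every value down one row, so s > s.shift() flags each row that is larger than the row before it.
import pandas as pd

s = pd.Series([14843.0, 14549.9, 14383.9, 14340.4, 14384.1])  # first five dataDF values
mask = s > s.shift()  # compare each row with the previous row
print(mask.tolist())  # [False, False, False, False, True] - only the last row increased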
IIUC:
In [46]: x.loc[x.dataDF.diff().gt(0).idxmax(), 'Quarter']
Out[46]: '2009q3'
Explanation:
In [43]: x
Out[43]:
Quarter dataDF
246 2008q3 14843.0
247 2008q4 14549.9
248 2009q1 14383.9
249 2009q2 14340.4
250 2009q3 14384.1
251 2009q4 14566.5
252 2010q1 14681.1
253 2010q2 14888.6
254 2010q3 15057.7
255 2010q4 15230.2
256 2011q1 15238.4
257 2011q2 15460.9
258 2011q3 15587.1
259 2011q4 15785.3
260 2012q1 15973.9
261 2012q2 16121.9
262 2012q3 16227.9
263 2012q4 16297.3
264 2013q1 16475.4
265 2013q2 16541.4
In [44]: x.dataDF.diff()
Out[44]:
246 NaN
247 -293.1
248 -166.0
249 -43.5
250 43.7 # <-------------------
251 182.4
252 114.6
253 207.5
254 169.1
255 172.5
256 8.2
257 222.5
258 126.2
259 198.2
260 188.6
261 148.0
262 106.0
263 69.4
264 178.1
265 66.0
Name: dataDF, dtype: float64
In [45]: x.dataDF.diff().gt(0).idxmax()
Out[45]: 250
Using numpy to find the argmax of diff greater than 0, then using get_value to retrieve the value we need:
import numpy as np

v = dataDF.dataDF.values
j = dataDF.columns.get_loc('Quarter')
# Note: DataFrame.get_value is deprecated in recent pandas; .iat / .iloc can be used instead.
dataDF.get_value((np.diff(v) > 0).argmax() + 1, j, takeable=True)
'2009q3'

How to combine multiple character columns into a single column in an R data frame

I am working with Census data and I need to combine four character columns into a single column.
Example:
LOGRECNO STATE COUNTY TRACT BLOCK
60 01 001 021100 1053
61 01 001 021100 1054
62 01 001 021100 1055
63 01 001 021100 1056
64 01 001 021100 1057
65 01 001 021100 1058
I want to create a new column that adds the strings of STATE, COUNTY, TRACT, and BLOCK together into a single string. Example:
LOGRECNO STATE COUNTY TRACT BLOCK BLOCKID
60 01 001 021100 1053 01001021101053
61 01 001 021100 1054 01001021101054
62 01 001 021100 1055 01001021101055
63 01 001 021100 1056 01001021101056
64 01 001 021100 1057 01001021101057
65 01 001 021100 1058 01001021101058
I've tried:
AL_Blocks$BLOCK_ID<- paste(c(AL_Blocks$STATE, AL_Blocks$County, AL_Blocks$TRACT, AL_Blocks$BLOCK), collapse = "")
But this combines all rows of all four columns into a single string.
Try this:
AL_Blocks$BLOCK_ID <- with(AL_Blocks, paste0(STATE, COUNTY, TRACT, BLOCK))
There was a typo in County: it should have been COUNTY. Also, you don't need the collapse parameter; c() flattened the four columns into one long vector and collapse = "" then pasted everything into a single string, which is why you got one big string. paste0 across the columns works element-wise, row by row.
I hope that helps.
You can use do.call and paste0. Try:
AL_Blocks$BLOCK_ID <- do.call(paste0, AL_Blocks[c("STATE", "COUNTY", "TRACT", "BLOCK")])
Example output:
do.call(paste0, AL_Blocks[c("STATE", "COUNTY", "TRACT", "BLOCK")])
# [1] "010010211001053" "010010211001054" "010010211001055" "010010211001056"
# [5] "010010211001057" "010010211001058"
do.call(paste0, AL_Blocks[2:5])
# [1] "010010211001053" "010010211001054" "010010211001055" "010010211001056"
# [5] "010010211001057" "010010211001058"
You can also use unite from "tidyr", like this:
library(tidyr)
library(dplyr)
AL_Blocks %>%
unite(BLOCK_ID, STATE, COUNTY, TRACT, BLOCK, sep = "", remove = FALSE)
# LOGRECNO BLOCK_ID STATE COUNTY TRACT BLOCK
# 1 60 010010211001053 01 001 021100 1053
# 2 61 010010211001054 01 001 021100 1054
# 3 62 010010211001055 01 001 021100 1055
# 4 63 010010211001056 01 001 021100 1056
# 5 64 010010211001057 01 001 021100 1057
# 6 65 010010211001058 01 001 021100 1058
where "AL_Blocks" is provided as:
AL_Blocks <- structure(list(LOGRECNO = c("60", "61", "62", "63", "64", "65"),
STATE = c("01", "01", "01", "01", "01", "01"), COUNTY = c("001", "001",
"001", "001", "001", "001"), TRACT = c("021100", "021100", "021100",
"021100", "021100", "021100"), BLOCK = c("1053", "1054", "1055", "1056",
"1057", "1058")), .Names = c("LOGRECNO", "STATE", "COUNTY", "TRACT",
"BLOCK"), class = "data.frame", row.names = c(NA, -6L))
You can try this too:
AL_Blocks <- transform(AL_Blocks, BLOCKID = paste(STATE, COUNTY, TRACT, BLOCK, sep = ""))
Or try this:
DF$BLOCKID <- paste(DF$LOGRECNO, DF$STATE, DF$COUNTY, DF$TRACT, DF$BLOCK, sep = "")
(Here is a method to set up the dataframe for people coming into this discussion later)
DF <-
data.frame(LOGRECNO = c(60, 61, 62, 63, 64, 65),
STATE = c(1, 1, 1, 1, 1, 1),
COUNTY = c(1, 1, 1, 1, 1, 1),
TRACT = c(21100, 21100, 21100, 21100, 21100, 21100),
BLOCK = c(1053, 1054, 1055, 1056, 1057, 1058))
You can use tidyverse package:
DF %>% unite(new_var, STATE, COUNTY, TRACT, BLOCK)
The new kid on the block is the glue package:
library(glue)
my_data %>%
  glue::glue_data("{STATE}{COUNTY}{TRACT}{BLOCK}")
You can both write and read text files with any specified string separator, not just a single-character separator. This is very useful when the data contains practically every character, so no single character can safely be used as a separator. Here are example write and read functions.
Write out text with a special separator string:
writeSepText <- function(df, fileName, separator) {
  con <- file(fileName)
  data <- apply(df, 1, paste, collapse = separator)
  writeLines(data, con)
  close(con)
  invisible(NULL)
}
Test writing out a text file separated by the string "<break>":
writeSepText(df = as.data.frame(Titanic), fileName = "/Users/user/break_sep.txt", separator = "<break>")
Read in text files with a special separator string:
readSepText <- function(fileName, separator) {
  data <- readLines(con <- file(fileName))
  close(con)
  records <- sapply(data, strsplit, split = separator)
  dataFrame <- data.frame(t(sapply(records, c)))
  rownames(dataFrame) <- 1:nrow(dataFrame)
  return(as.data.frame(dataFrame, stringsAsFactors = FALSE))
}
Test reading in a text file separated by "<break>":
df <- readSepText(fileName = "/Users/user/break_sep.txt", separator = "<break>"); df
