Suppose this is the data:
2011/03/06,17:24:17.100,EUR/USD,1.40200,3000000
I want to extract the last digit of the price, even if it is zero. Here is the code:
library(data.table)
library(stringr)
types<-list(date="TEXT",time="TEXT",curr="TEXT",price="TEXT",volume="INTEGER")
mydata=read.table("data.out", sep="," , header=FALSE)
setnames(mydata, names(types))
str_sub(mydata$price,start=-1)
The result is "2": the trailing zeros are ignored. I read the price as text, so I should get "0".
EDIT, Thanks to jlhoward:
read.table(...) was converting the price and volume columns to numeric automatically. I used mydata <- read.table(..., colClasses="character") instead, and the problem was solved.
You aren't stating where your data come from. If you use read.csv on a file, you'll have the items separated into columns immediately. You can even get them from the keyboard, e.g.:
read.csv(stdin(),header=FALSE)
0: 300,hello,40.2
1:
# V1 V2 V3
# 1 300 hello 40.2
Then you could do foo<- dat$V3%%10 , as in my comment above, to get the last digit.
Here is the method I talked about in the comment.
line = "2011/03/06,17:24:17.100,EUR/USD,1.40200,3000000"
sp = line.split(",")
sp
['2011/03/06', '17:24:17.100', 'EUR/USD', '1.40200', '3000000']
price = sp[-2]
price
'1.40200'
last = price[-1]
last
'0'
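The same steps can be wrapped in a small helper. This is only a sketch: the name last_price_digit and the assumption that the price is always the fourth comma-separated field are mine, not part of the question. The last two lines illustrate why the price has to stay a string, since converting it to a number discards the trailing zeros.

```python
def last_price_digit(line, price_field=3):
    """Return the last character of the price field, kept as text.

    Assumes a comma-separated record with the price in the 4th field
    (index 3); the function name and default index are illustrative.
    """
    fields = line.strip().split(",")
    return fields[price_field][-1]


line = "2011/03/06,17:24:17.100,EUR/USD,1.40200,3000000"
print(last_price_digit(line))      # 0

# Converting to a number first loses the trailing zeros:
print(str(float("1.40200"))[-1])   # 2
```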
I have a text file to import into a dictionary, but I am having trouble getting the program to identify which line of each record is the key, which line is item 1, and which lines make up item 2 of the list stored in the dictionary.
The format of the text file is like this (there are no empty lines inside a record; a blank line appears only at the end of each record):
ProductA
2020-08-03 16:26:21
This painting was done by XNB.
The artist seeks to portray the tragedies caused by event XYZ.
The painting weighs 2kg.
####blank line#####
ProductB
2020-08-03 16:26:21
This painting is done by ONN.
It was stolen during world war 2.
Decades later, it was discovered in the black market of country XYZ.
It was bought for 2 million dollars by ABC.
###blank line###
Desired outcome in dictionary:
{ 'ProductA' : ['2020-08-03 16:26:21', 'This painting was done by XNB.The artist seeks to portray the tragedies caused by event XYZ. The painting weighs 2kg.'], 'ProductB':['2020-08-03 16:26:21','This painting is done by ONN.It was stolen during world war 2.Decades later, it was discovered in the black market of country XYZ.It was bought for 2 million dollars by ABC.']}
where item_2 is a single string that is combined from line 3 onwards till the end of the information where it meets a blank line.
Problem: I don't know how to code the logic in such a way that the program properly assigns each line to where I want it.
header = ""
header = True
for line in records:
    data = line.splitlines()
    if line != '\n':  # check for line break which indicate new record
        if Header:
            # code which will assign 1st line of each record as key to dictionary
        else:
            # This is where I need help.
            # Code which will assign 2nd line as item_1 and then assign 3rd lines onwards till the end of record as item_2 in a single string.
            # items_2 may have different number of lines being combined into 1 string for each record.
            # I try to form a rough idea how the logic might be in code below but I feel that something is missing and I got a bit confused.
            for line in list:  # result in TypeError, 'type' object is not iterable.
                dict[line[1]] = dict[header].append(line[1].strip("\n"))
                # Since the outer if has already done its job of identifying 1st line of record. The line of code seeks to assign the next line (line 2 in text file) which I think would be interpreted by the program as line[1] to item 2.
                dict[line[2:]] = dict[header].append(line[2:].strip("\n"))
                # Assign 3rd line of text file onwards as a single string which is item_2 in the list of value for dictionary.
    else:
        # code which reset boolean for header
Try this:
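with open('data.txt') as fp:
    # Records are separated by a blank line
    data = fp.read().split('\n\n')
res = {}
for x in data:
    if not x.strip():
        continue  # skip the empty chunk left by a trailing blank line
    # First line is the key, everything after it is the value
    k, v = x.strip().split('\n', 1)
    v = v.split('\n')
    # v[0] is the timestamp; the remaining lines are joined into one string
    res[k] = [v[0], ' '.join(v[1:])]
print(res)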
Output:
{'ProductA': ['2020-08-03 16:26:21', 'This painting was done by XNB. The artist seeks to portray the tragedies caused by event XYZ. The painting weighs 2kg.'], 'ProductB': ['2020-08-03 16:26:21', 'This painting is done by ONN. It was stolen during world war 2. Decades later, it was discovered in the black market of country XYZ. It was bought for 2 million dollars by ABC.']}
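If you would rather keep the line-by-line approach from the question, the idea can be sketched as follows. The function name parse_records is hypothetical, and the sketch assumes every record has at least two lines (key and timestamp) and that a blank line ends a record.

```python
def parse_records(lines):
    """Build {key: [item_1, item_2]} from an iterable of text lines.

    Line 1 of a record is the key, line 2 is item_1, and every later
    line up to the next blank line is joined with spaces into item_2.
    """
    result = {}
    record = []  # lines of the record currently being collected

    def flush():
        # Store the collected record, if any, and start a new one.
        if record:
            key, item_1, *rest = record
            result[key] = [item_1, " ".join(rest)]
            record.clear()

    for raw in lines:
        line = raw.strip()
        if line:
            record.append(line)
        else:
            flush()  # blank line: the record is complete
    flush()  # the last record may not be followed by a blank line
    return result
```

A file object can be passed directly, for example res = parse_records(fp) inside the with block, because iterating over a file yields its lines.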
I have a string containing thousands of lines of this data without line breaks (only a few lines are shown here, with line breaks added for readability):
5BengaluruUrban4598962MSARICoughBreathlessnessDM23.07.2020atGovernmenthospital
7DakshinaKannada4786665FSARICoughDMHTN23-07-2020atPrivatehospital
Format is
(entry number)(district)(patient number)(age)(gender)(case of)(symptoms)(comorbidity)(date of death)(place of death)
without spaces, or brackets.
Problem: The data I want to collect is the age.
However, I can't seem to find a way to single out the age, since it is surrounded by a lot of other numbers in the data. I have tried various iterations of count, limiting it to 1 to 99, separating the data, etc., and failed.
My idea: The gender is always either 'M' or 'F', and the two digits before the gender are the age. Isolating the two digits before the gender seems like an ideal solution.
xxM
xxF
My goal: I would like to collect all the xx numbers irrespective of gender and store them in a list. How do I go about this?
import re
input_str = '5BengaluruUrban4598962MSARICoughBreathlessnessDM23.07.2020atGovernmenthospital7DakshinaKannada4786665FSARICoughDMHTN23-07-2020atPrivatehospital'
ages = [found[-3:-1] for found in re.findall(r'[0-9]+[MF]', input_str, re.I)]
print(ages)
# ['62', '65']
This works fine with the sample, but if a district name starts with 'M' or 'F', the entry number in front of it will be collected as well.
A workaround is to match exactly seven digits (if the patient number is always 5 digits and the age is generally 2 digits).
ages = [found[-3:-1] for found in re.findall(r'\d{7}[MF]', input_str, re.I)]
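A capture group can replace the slicing. The sketch below relies on the same assumption as the workaround (a 5-digit patient number followed by a 2-digit age), which the question does not guarantee; it captures only the two age digits and converts them to integers.

```python
import re

input_str = (
    "5BengaluruUrban4598962MSARICoughBreathlessnessDM23.07.2020atGovernmenthospital"
    "7DakshinaKannada4786665FSARICoughDMHTN23-07-2020atPrivatehospital"
)

# \d{5}   patient number (assumed to be exactly 5 digits)
# (\d{2}) age, the only captured part, so findall returns just the ages
# [MF]    gender letter
ages = [int(age) for age in re.findall(r"\d{5}(\d{2})[MF]", input_str)]
print(ages)  # [62, 65]
```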
With the structure you gave, I've built a dict of regular expressions to match the components, then put the matched tokens back into a dict.
There are ways I can imagine this will not work:
if age < 10, it is only 1 digit, so you will pick up a digit of the patient number
there may be strings that don't match the regular expressions, which will mean odd results
It's the most structured way I can think of to go about it.
import re
data = "5BengaluruUrban4598962MSARICoughBreathlessnessDM23.07.2020atGovernmenthospital7DakshinaKannada4786665FSARICoughDMHTN23-07-2020atPrivatehospital"
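# One capturing group per component, in the order they appear in a record
md = {
    "entrynum": "([0-9]+)",
    "district": "([A-Za-z]+)",
    "patnum_age": "([0-9]+)",
    "sex": "([MF])",
    "remainder": "(.*)$"
}
# re.split returns the captured groups; drop the empty strings around them
data_dict = {list(md.keys())[i]: tk
             for i, tk in
             enumerate([tk for tk in re.split("".join(md.values()), data) if tk != ""])
             }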
print(f"Assumed age:{data_dict['patnum_age'][-2:]}\nparsed:{data_dict}\n")
output
Assumed age:62
parsed:{'entrynum': '5', 'district': 'BengaluruUrban', 'patnum_age': '4598962', 'sex': 'M', 'remainder': 'SARICoughBreathlessnessDM23.07.2020atGovernmenthospital7DakshinaKannada4786665FSARICoughDMHTN23-07-2020atPrivatehospital'}
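As the output shows, the remainder still contains the second record. One way to walk through every record, sketched here under the assumption that the patient number is always 5 digits and the age 2 digits, is a compiled pattern with named groups and re.finditer. The names RECORD and parse_patients are illustrative only.

```python
import re

# Named groups for the leading fields of each record.
# Assumption: 5-digit patient number followed by a 2-digit age.
RECORD = re.compile(
    r"(?P<entrynum>\d+)"
    r"(?P<district>[A-Za-z]+)"
    r"(?P<patnum>\d{5})"
    r"(?P<age>\d{2})"
    r"(?P<sex>[MF])"
)


def parse_patients(data):
    """Return one dict of named groups per record found in data."""
    return [m.groupdict() for m in RECORD.finditer(data)]


data = (
    "5BengaluruUrban4598962MSARICoughBreathlessnessDM23.07.2020atGovernmenthospital"
    "7DakshinaKannada4786665FSARICoughDMHTN23-07-2020atPrivatehospital"
)
for rec in parse_patients(data):
    print(rec["entrynum"], rec["district"], rec["age"], rec["sex"])
```

The dates inside each record are skipped because their digits are never followed by letters and then seven more digits, but real data may contain cases this pattern does not anticipate.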
I have successfully set up Tesseract and can translate the images to text...
text = pytesseract.image_to_string(Image.open(image))
However, I need to get the confidence value for every line. I cannot find a way to do this using pytesseract. Anyone know how to do this?
I know this is possible using PyTessBaseAPI, but I cannot use that, I've spent hours attempting to set it up with no luck, so I need a way to do this using pytesseract.
After much searching, I have figured out a way. Instead of image_to_string, one should use image_to_data. However, this will give you statistics for each word, not each line...
text = pytesseract.image_to_data(Image.open(file_image), output_type='data.frame')
So I saved it as a dataframe and then used pandas to group by block_num, as each line is grouped into blocks by the OCR. I also removed all rows with no confidence value (-1)...
text = text[text.conf != -1]
lines = text.groupby('block_num')['text'].apply(list)
Using this same logic, you can also calculate the confidence per line by calculating the mean confidence of all words within the same block...
conf = text.groupby(['block_num'])['conf'].mean()
@Srikar Appalaraju is right. Take an example image (not reproduced here) and use the following code:
text = pytesseract.image_to_data(gray, output_type='data.frame')
text = text[text.conf != -1]
text.head()
Notice that all five rows have the same block_num, so if we group by that column alone, all five words (texts) will be grouped together. That is not what we want: we want to group only the first three words, which belong to the first line. To do that properly (in a generic manner) for a large enough image, we need to group by all four columns page_num, block_num, par_num and line_num simultaneously to compute the confidence of each line, as shown in the following code snippet:
lines = text.groupby(['page_num', 'block_num', 'par_num', 'line_num'])['text'] \
.apply(lambda x: ' '.join(list(x))).tolist()
confs = text.groupby(['page_num', 'block_num', 'par_num', 'line_num'])['conf'].mean().tolist()
line_conf = []
for i in range(len(lines)):
    if lines[i].strip():
        line_conf.append((lines[i], round(confs[i],3)))
with the following desired output:
[('Ying Thai Kitchen', 91.667),
('2220 Queen Anne AVE N', 88.2),
('Seattle WA 98109', 90.333),
('« (206) 285-8424 Fax. (206) 285-8427', 83.167),
('‘uw .yingthaikitchen.com', 40.0),
('Welcome to Ying Thai Kitchen Restaurant,', 85.333),
('Order#:17 Table 2', 94.0),
('Date: 7/4/2013 7:28 PM', 86.25),
('Server: Jack (1.4)', 83.0),
('44 Ginger Lover $9.50', 89.0),
('[Pork] [24#]', 43.0),
('Brown Rice $2.00', 95.333),
('Total 2 iten(s) $11.50', 89.5),
('Sales Tax $1.09', 95.667),
('Grand Total $12.59', 95.0),
('Tip Guide', 95.0),
('TEK=$1.89, 18%=62.27, 20%=82.52', 6.667),
('Thank you very much,', 90.75),
('Cone back again', 92.667)]
The current accepted answer is not entirely correct. The correct way to get each line using pytesseract is
text.groupby(['block_num','par_num','line_num'])['text'].apply(list)
We need to do this based on this answer: Does anyone knows the meaning of output of image_to_data, image_to_osd methods of pytesseract?
Column block_num: Block number of the detected text or item
Column par_num: Paragraph number of the detected text or item
Column line_num: Line number of the detected text or item
Column word_num: word number of the detected text or item
All four of the columns above are interconnected. If the item comes from a new line, the word number starts counting again from 0; it does not continue from the last word number of the previous line. The same goes for line_num, par_num and block_num.
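Because the counters restart, a line is only identified by the combination of page_num, block_num, par_num and line_num. The sketch below shows that grouping without pandas. It assumes the data was obtained with pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT), which returns a dict of equal-length lists; the helper name line_confidences is hypothetical.

```python
from collections import defaultdict


def line_confidences(data):
    """Group word-level OCR rows into lines and average their confidence.

    data is assumed to be a dict of equal-length lists with the keys
    page_num, block_num, par_num, line_num, conf and text.
    Returns [(line_text, mean_conf), ...] in order of first appearance.
    """
    words = defaultdict(list)
    confs = defaultdict(list)
    for i, text in enumerate(data["text"]):
        conf = float(data["conf"][i])
        if conf < 0 or not str(text).strip():
            continue  # skip non-word rows (conf == -1) and blank text
        key = tuple(
            data[col][i]
            for col in ("page_num", "block_num", "par_num", "line_num")
        )
        words[key].append(str(text))
        confs[key].append(conf)
    return [
        (" ".join(words[key]), round(sum(confs[key]) / len(confs[key]), 3))
        for key in words
    ]
```

Grouping on the full four-column key keeps two lines apart even when they share a block_num or a line_num.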
I am trying to work out how I can compare a list of words against a string and report back the word number from list one when they match. I can easily get the unique list of words from a sentence - just removing duplicates, and with enumerate I can get a value for each word, so Mary had a little lamb becomes 1, Mary, 2, had, 3, a etc. But I cannot work out how to then search the original list again and replace each word with its number value (so it becomes 1 2 3 etc).
Any ideas greatly received!
my_list.index(word)
will return the index of the item word within my_list. You can start digging into the documentation here
Thank you for this info. I can see the logic for this and it should work, however I get:
line 27, in output=words.index(result)
ValueError: ['word1', 'word2'] is not in list
With the following code:
def remove_duplicates(words):
    output = []
    seen = set()
    for value in words:
        # If value has not been encountered yet,
        # ... add it to both list and set.
        if value not in seen:
            output.append(value)
            seen.add(value)
    return output
# Remove duplicates from this list.
sentence = input("Enter a sentence ")
words = sentence.split(' ')
result = remove_duplicates(words)
print(result)
Very confusing :(
I have found an answer on here:
positions = [ i+1 for i in range(len(result)) if each == result[i]]
Which works well.
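For completeness, here is a minimal sketch of the whole task in one pass: build the list of unique words and replace every word of the sentence with its 1-based position in that list. The function name word_positions is hypothetical, and the sketch treats words as case-sensitive and splits on whitespace only.

```python
def word_positions(sentence):
    """Return (unique_words, positions).

    positions[i] is the 1-based index in unique_words of the i-th word
    of the sentence. Relies on dicts keeping insertion order (Python 3.7+).
    """
    words = sentence.split()
    first_seen = {}
    for word in words:
        if word not in first_seen:
            first_seen[word] = len(first_seen) + 1
    unique_words = list(first_seen)
    positions = [first_seen[word] for word in words]
    return unique_words, positions


unique_words, positions = word_positions("Mary had a little lamb Mary had")
print(unique_words)  # ['Mary', 'had', 'a', 'little', 'lamb']
print(positions)     # [1, 2, 3, 4, 5, 1, 2]
```

Looking each word up in a dict avoids calling list.index repeatedly, and it also avoids the ValueError above, which came from passing a whole list to index instead of a single word.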
How do I find out which person stayed the maximum number of nights? I need the name and the total number of days (date format MM/DD).
For example,
the text file contains
Robin 01/11 01/15
Mike 02/10 02/12
John 01/15 02/15
output expected
('john', 30 )
my code
def longest_stay(fpath):
    with open(fpath,'r')as f_handle:
        stay=[]
        for line in f_handle:
            name, a_date, d_date = line.strip().split()
            diff = datetime.strptime(d_date, "%m/%d") -datetime.strptime(a_date, "%m/%d")
            stay.append(abs(diff.days+1))
    return name,max(stay)
It always returns the first name.
This can also be implemented using pandas, which I think makes it much simpler.
One open issue is how you want to handle several guests tying for the maximum number of nights. I have addressed that in the following code.
import pandas as pd
def longest_stay(fpath):
    # Read the space-separated text file as a DataFrame
    data = pd.read_csv(fpath, sep=" ", header=None)
    # Add column names to the DataFrame
    data.columns = ['Name', 'a_date', 'd_date']
    # Calculate the nights for each customer (MM/DD dates, same year assumed)
    arrive = pd.to_datetime(data['a_date'], format="%m/%d")
    depart = pd.to_datetime(data['d_date'], format="%m/%d")
    data['nights'] = (depart - arrive).dt.days
    # Select the rows whose nights equal the maximum, keeping Name and nights
    longest = data.loc[data['nights'] == data['nights'].max(), ['Name', 'nights']]
    # In case several customers tie for the longest stay, return a list of tuples
    return [tuple(x) for x in longest.itertuples(index=False)]
Your code fails because it does not store the name alongside the day count: you only store the days as you go, so name is always left set to the last name in the file, and that is the name you see.
You also add + 1, which does not seem correct: you should not count the last day, as the person does not stay that night. Your code would actually output ('John', 32): the correct name by chance, because it is the last in your sample file, and a day count that is off by 1.
Just keep track of the best stay, both name and day count, as you go, using the days stayed as the measure, and return it at the end:
from datetime import datetime
from csv import reader
def longest_stay(fpath):
    with open(fpath,'r')as f_handle:
        mx,best = None, None
        for name, a_date, d_date in reader(f_handle,delimiter=" "):
            days = (datetime.strptime(d_date, "%m/%d") - datetime.strptime(a_date, "%m/%d")).days
            # first iteration or we found a longer stay
            if best is None or mx < days:
                # remember both the new maximum and who it belongs to
                mx, best = days, (name, days)
    return best
Output:
In [13]: cat test.txt
Robin 01/11 01/15
Mike 02/10 02/12
John 01/15 02/15
In [14]: longest_stay("test.txt")
# 31 days not including the last day as a stay
Out[14]: ('John', 31)
You only need to use abs if the dates are not always in start-end order, but be aware that you could get the wrong output using the abs value if your dates had years.
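If several guests can tie for the longest stay, a pure-Python variant can return all of them. This is a sketch under stated assumptions: the name longest_stays and the year parameter are mine, all dates are taken to fall in the same non-leap year, and stays that cross New Year are not handled.

```python
from datetime import datetime


def longest_stays(lines, year=2001):
    """Return [(name, nights), ...] for every guest tied for the longest stay.

    year is an assumed value added so that MM/DD dates parse unambiguously.
    """
    stays = []
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        name, a_date, d_date = line.split()
        arrive = datetime.strptime(f"{year}/{a_date}", "%Y/%m/%d")
        depart = datetime.strptime(f"{year}/{d_date}", "%Y/%m/%d")
        # nights stayed: the departure day itself is not counted
        stays.append((name, (depart - arrive).days))
    if not stays:
        return []
    longest = max(nights for _, nights in stays)
    return [(name, nights) for name, nights in stays if nights == longest]


sample = ["Robin 01/11 01/15", "Mike 02/10 02/12", "John 01/15 02/15"]
print(longest_stays(sample))  # [('John', 31)]
```

An open file can be passed in place of the list, since iterating over it yields the same kind of lines.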