Can the Microsoft Cognitive Speech to Text service recognize dates and output the text in a formatted way?
For example, when the input speech is "date of birth is five fifteen sixty-four",
the text output should show "date of birth is 5/15/64".
Thank you!
When you use Speech-To-Text functionality, you can get results in different formats for a sentence, for example here for a year (1932):
Here is the documentation about each format: https://learn.microsoft.com/fr-fr/azure/cognitive-services/speech-service/rest-apis#nbest
If the output does not fit your needs, I would highly suggest running a DateTime recognizer (the same one used in other solutions like LUIS) on the text generated by the STT output. The sources are here and there are packages available depending on your programming language.
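As a rough illustration of post-processing the STT text, here is a minimal Python sketch that rewrites three consecutive spoken numbers as a month/day/year date. It is not the DateTime recognizer mentioned above; the function names `words_to_number` and `format_spoken_date` are made up for this example, and it only handles numbers below 100 in the "month day year" order from the question.

```python
# Hypothetical helper names; a sketch, not the Microsoft recognizer.
UNITS = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
    "eleven": 11, "twelve": 12, "thirteen": 13, "fourteen": 14,
    "fifteen": 15, "sixteen": 16, "seventeen": 17, "eighteen": 18,
    "nineteen": 19,
}
TENS = {
    "twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
    "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90,
}


def words_to_number(token):
    """Convert a token such as 'fifteen' or 'sixty-four' to an int, else None."""
    parts = token.lower().split("-")
    if len(parts) == 1:
        if parts[0] in UNITS:
            return UNITS[parts[0]]
        return TENS.get(parts[0])
    if len(parts) == 2 and parts[0] in TENS and 1 <= UNITS.get(parts[1], 0) <= 9:
        return TENS[parts[0]] + UNITS[parts[1]]
    return None


def format_spoken_date(text):
    """Replace runs of three spoken numbers (month day year) with m/d/yy."""
    tokens = text.split()
    out = []
    i = 0
    while i < len(tokens):
        # Punctuation attached to the number words is dropped in this sketch.
        nums = [words_to_number(t.strip(".,")) for t in tokens[i:i + 3]]
        if (len(nums) == 3 and all(n is not None for n in nums)
                and 1 <= nums[0] <= 12 and 1 <= nums[1] <= 31):
            out.append("{}/{}/{:02d}".format(*nums))
            i += 3
        else:
            out.append(tokens[i])
            i += 1
    return " ".join(out)


print(format_spoken_date("date of birth is five fifteen sixty-four"))
```

A real recognizer covers far more phrasings (month names, "nineteen sixty-four", other cultures), so this is only meant to show where such a step would sit in the pipeline.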
I am currently working on a digitalisation project which consists in extracting specific information from pdf-formatted electricity invoices. Once the data is extracted, I would like to store it in an Excel spreadsheet.
The objectives are the following:
First of all, the data to be extracted would be the following:
https://i.stack.imgur.com/6RLo2.png
In this case, the data to be extracted is the information outlined in red: the CUPS, the total amount and the electricity consumed per period (P1-P6).
Once this is extracted, I would like to display this in an Excel Spreadsheet.
Could you please give me any ideas/tips regarding the extraction of this data? I understand that OCR software would do this best, but do not know how I could extract this specific information.
Thanks for your help and advice.
If there is no text data in your PDF then I don't believe there is a clean and consistent way to do this yet. If your invoice templates are always the same format and resolution, then the pixel coordinates of the text positions should be the same.
This means that you can create a cropped image with only the text you're interested in. Then you can use your OCR tool to extract all the text and you have extracted your data field. You would have to do this for all the data fields that you want to extract.
This would only work for invoices that always have the same format and resolution. So scanned invoices wouldn't work, and dynamic tables make things exponentially more complex as well.
I would first check whether it is possible to simply extract the text with a PDF-to-text tool, then build my cmd text parsing around that output and loop from file to file.
I don't have your sample to test, so you would need to adjust this to suit your bills:
pdftotext -nopgbrk -layout electric.pdf - |findstr /i "cups factura" & pdftotext -nopgbrk -layout -y 200 -W 300 -H 200 electric.pdf
Personally, I would handle the two parts as separate passes. For the first pair, replace the , with a safe CSV character such as *, then inject a , at the large gap to make a 2-column CSV (perhaps replace the Γé¼ with ,€ if necessary, since your captured text may already be in euros).
For the second group, I would possibly inject a , by numeric position to form the desired columns. I only demo 4 columns by 2 rows, but you want 7 columns by 4 rows, so adjust those values to suit. However, you can use any language you are familiar with, such as VBA, to split the text however you want before importing it into Excel.
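For anyone who prefers a script over cmd, here is a minimal Python sketch of the same idea. It assumes the text was produced with `pdftotext -layout`, so columns are separated by runs of two or more spaces; the sample line and the function names are invented for illustration and will not match a real bill.

```python
import csv
import io
import re


def layout_line_to_fields(line):
    """Split one layout-preserved line on gaps of two or more spaces."""
    return [field for field in re.split(r"\s{2,}", line.strip()) if field]


def layout_text_to_csv(text):
    """Turn layout-preserved text into CSV, one row per non-empty line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for line in text.splitlines():
        fields = layout_line_to_fields(line)
        if fields:
            writer.writerow(fields)
    return buffer.getvalue()


# Hypothetical sample; real invoices will need their own adjustments.
sample = "P1      1.234,50      P2   987,10\n"
print(layout_text_to_csv(sample))
```

The `csv` module quotes any field that contains a comma, so decimal commas in amounts do not need to be swapped for a placeholder character first.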
In Excel you may want to use PowerQuery to read the pdf:
https://learn.microsoft.com/en-us/power-query/connectors/pdf
Then you can further process to extract the data you want within PowerQuery.
If you are interested in further data analysis after extraction you may want to consider KNIME as well:
https://hub.knime.com/jyotendra/spaces/Public/latest/Reading%20PDF%20and%20extracting%20information~pNh3GdorF0Z9WGm8
From there export to Excel is also supported.
edit:
after extracting, regex helps to filter for the specific data, e.g. look for key words, length and structure of the data item (e.g. the CUPS number), is it a currency with decimal etc.
edit 2: regex in Excel
How to use Regular Expressions (Regex) in Microsoft Excel both in-cell and loops
e.g. look for a new line starting with CUPS followed by a sequence of 15 characters (if you have more details, you can make the matching pattern more specific: e.g. starting with E, or the 5th character is X or 5, etc.)
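The same pattern can be sketched in Python for illustration. The length of 15 comes from the description above and should be adjusted to the codes on your invoices; the sample text and the function name `find_cups` are hypothetical.

```python
import re


def find_cups(text, length=15):
    """Return the code that follows 'CUPS' at the start of a line, or None.

    `length` is the expected number of characters in the code (assumed 15).
    """
    pattern = re.compile(
        r"^CUPS\W*([A-Z0-9]{%d})\b" % length,
        re.MULTILINE,
    )
    match = pattern.search(text)
    return match.group(1) if match else None


# Invented sample text, not taken from a real invoice.
sample = "Factura 123\nCUPS: ES0021000012345\nTotal 10,00"
print(find_cups(sample))
```

The trailing `\b` rejects codes that are longer than expected, which makes a wrong `length` show up as a missing match instead of a silently truncated value.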
I have a dataset (df_test) consisting of several news articles (Text_4). Using spaCy, I've extracted the 'DATE' entities. For those, I want to see whether they are in the future or in the past (to identify news articles that reference future events such as product launches) compared to the article's publication date (RP_DateFormatted).
My current code is
for index, row in df_test.iterrows():
    doc = nlp(row.Text_4)
    entities = {key: list(g) for key, g in groupby(sorted(doc.ents, key=lambda x: x.label_), lambda x: x.label_)}
    # ... some other steps ... then:
    ListDATE3 = [dateparser.parse(replace_all((i.text), od), languages=['en'],
                                  settings={'RELATIVE_BASE': datetime.strptime(row.RP_DateFormatted, '%Y-%m-%d'),
                                            'PREFER_DAY_OF_MONTH': 'last',
                                            'PREFER_DATES_FROM': 'future'}) for i in entities['DATE']]
    df_test.PY_Entities_DatesParsed[index] = ListDATE3
I have trouble with the line 'PREFER_DATES_FROM': 'future', for example:
Article was written on August 15th 2005 but no year is given in the text. SpaCy extracts "Aug 15" as Date. The dateparser sets the year to 2006 (because it is in the future). Consequently, I would then believe that the news article talks about the future - which it does not.
Setting 'PREFER_DATES_FROM': 'past' would also not help me in a case when an event is described that happens in February (without a year given in the text). This is likely to be next February but the dateparser would set it to this year's February.
Is there a way to add an if statement to the settings or to create a new function based on the dateparser? Please note that each news article can have multiple dates (entities['DATE'] is a list for each row in my dataframe).
I am using Python 3.8
I don't think you're going to be able to solve this just with options to DateParser. That interprets dates mechanically given a string, but in order to tell whether these dates are in the past or future you're using knowledge of the surrounding words and context of the article ("at next February's festival...").
This is a pretty hard thing to get right in an automated system. In NLP research this is referred to as "grounding", and includes related problems, like telling who "President of the United States" refers to (what year was it?), or what color "red" is (is it red like a stop sign, or red like red hair?).
What I would do is start by using rule-based techniques to identify whether dates are in the past or future before passing them to date parser. So take some words from around date entities, and if "last" is there then it's in the past, if "next" is there then it's in the future, that sort of thing. See how well it does. (You might think you could just take words before the date entity, but you can also have "February last year was really cold" or something.)
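To make the rule-based idea concrete, here is a minimal sketch. The cue word lists, the window size and the function name `classify_date_context` are assumptions chosen for illustration. It works on a plain list of tokens plus the start and end token indices of the date entity (with spaCy, a span's `start` and `end` attributes could be used), and it checks words on both sides of the entity as well as inside it.

```python
# Assumed cue words; extend them after looking at your own articles.
PAST_CUES = {"last", "ago", "previous"}
FUTURE_CUES = {"next", "upcoming", "coming"}


def classify_date_context(tokens, start, end, window=3):
    """Classify a date entity as 'past', 'future' or 'unknown'.

    tokens: list of word strings for the sentence or document.
    start, end: token indices of the date entity (end is exclusive).
    window: how many tokens to inspect on each side of the entity.
    """
    lo = max(0, start - window)
    context = [t.lower().strip(".,") for t in tokens[lo:end + window]]
    has_past = any(t in PAST_CUES for t in context)
    has_future = any(t in FUTURE_CUES for t in context)
    if has_past and not has_future:
        return "past"
    if has_future and not has_past:
        return "future"
    return "unknown"  # no cue, or conflicting cues


tokens = "The festival opens next February in Berlin".split()
print(classify_date_context(tokens, 4, 5))
```

The result could then decide which `PREFER_DATES_FROM` value to pass for that particular entity, falling back to a default when the answer is "unknown".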
If you want to try a statistical system after that, you could look at using the spancat in spaCy with different kinds of context windows to classify dates as "future" or "past".
I've searched the internet and couldn't find an answer. The only solution I found was to download Kutools, and I can't do that.
I've made a macro that gets some values from an intranet, but I need the cell type to be a date so I can work with it and filter it.
I don't know if I explained it correctly, and sorry if my English isn't the best.
I made an image to better explain it.
How it currently is / How I want it to be:
You can use the DATE and TIME functions to convert your text to date/time format.
=DATE(2018,MID(A2,4,2),LEFT(A2,2))+TIME(MID(A2,7,2),RIGHT(A2,2),0)
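To show what the formula is doing, here is the same slicing written as a small Python sketch. It assumes the text looks like "dd/mm hh:mm" (inferred from the character positions the formula uses) and that the year is fixed at 2018 as in the formula; the function name is hypothetical.

```python
from datetime import datetime


def text_to_datetime(text, year=2018):
    """Mirror the DATE/TIME formula: slice day, month, hour and minute."""
    day = int(text[0:2])      # LEFT(A2,2)
    month = int(text[3:5])    # MID(A2,4,2)
    hour = int(text[6:8])     # MID(A2,7,2)
    minute = int(text[-2:])   # RIGHT(A2,2)
    return datetime(year, month, day, hour, minute)


print(text_to_datetime("15/08 13:45"))
```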
Using Filters:
I have already extracted all the tweets into a CSV file. I want to separate the tweet text from the hashtags and URLs. So far I have separated the hashtags in Excel using
Data -> Text to Column
First, I don't know how to separate the URLs using this method.
Second, is there a better way to do this? All the online resources separate both things at scraping time.
TEXT
Learned a new concept today : metamorphic testing. http:/t.co/0is1IUs3aW
variant identification in pooled DNA using R http:/t.co/4PQfUaU
Meta-All: a system for managing metabolic pathway information http:/t.co/2PfJXUxq2X
Here is what it should look like
TEXT                                                            URL
Learned a new concept today : metamorphic testing.              http:/t.co/0is1IUs3aW
variant identification in pooled DNA using R                    http:/t.co/4PQfUaU
Meta-All: a system for managing metabolic pathway information   http:/t.co/2PfJXUxq2X
Right now both the text and the URL are in one column; I want to put them in different columns.
Extract the URL from A2: =MID(A2,FIND("http",A2),500)
The rest from A2: =MID(A2,1,FIND("http",A2)-1)
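If you would rather do this outside Excel, the same split can be sketched in Python. The function name `split_text_and_url` is made up for this example, and unlike the MID formula it takes everything from "http" to the end of the string instead of at most 500 characters.

```python
def split_text_and_url(tweet):
    """Split a tweet into (text, url) at the first occurrence of 'http'."""
    position = tweet.find("http")
    if position == -1:
        # No URL found: keep the whole tweet as text.
        return tweet.strip(), ""
    return tweet[:position].strip(), tweet[position:].strip()


tweet = "Learned a new concept today : metamorphic testing. http:/t.co/0is1IUs3aW"
text, url = split_text_and_url(tweet)
print(text)
print(url)
```

Applied row by row to the CSV, this gives the two columns from the desired output. Tweets with more than one link would keep all of them in the URL part.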
I would use a simple set of formulas.
=find()
=left()
=Right()
Here are the formulas I used
Here are the results of those formulas
Basically, the find() formula lets you find where "http:" is in your string (find() is case-sensitive, so match the case used in your data). Left() returns everything to the left of that position, and Right() returns everything to the right.
Here is what I want to do. I want to search within articles for a text string "XXXX" and a number on the same line, such as "22663", and have the search results listed in ascending order of the number that was found.
For example, I want to search for the "highest year" a human has ever lived. So there should be a number on the same line or near it, and I want the results sorted in ascending order of the year.
I am thinking about using Google's Custom Search API (or Web Search API) to search for the keyword, saving the first couple thousand results, and then searching through each file with the above process.
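The per-file step could be sketched roughly like this in Python, assuming the saved results are already available as plain text lines. The function name and the sample lines are invented, and only the first number on each matching line is used.

```python
import re


def find_and_sort(lines, keyword):
    """Return (number, line) pairs for lines containing keyword and a number,
    sorted in ascending order of the number."""
    hits = []
    for line in lines:
        if keyword.lower() in line.lower():
            match = re.search(r"\d+", line)  # first run of digits on the line
            if match:
                hits.append((int(match.group()), line))
    return sorted(hits, key=lambda hit: hit[0])


# Invented sample lines for illustration.
lines = [
    "XXXX scored 22663 points",
    "no match 5",
    "xxxx had 120 items",
    "XXXX without number",
]
for number, line in find_and_sort(lines, "XXXX"):
    print(number, line)
```

Handling "near the line" would mean looking at a small window of neighbouring lines instead of a single one, which this sketch does not attempt.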