Place a symbol after every three commas in a text string

Python.
I need to place a symbol (suppose "#") after every three commas in a text.
For example: text = "Reading practice to help you understand simple texts and find specific information in everyday material. Texts include emails, invitations, personal messages, tips, notices and signs. Texts include articles, reports, messages, short stories and reviews."
I guess I should start by splitting on commas: st_text = text.split(",") Is that right?
Then add the symbol after every three of the split pieces.
Something like this? print('#'.join(st_text[i:i + 3] for i in range(0, len(st_text), 3)))
But something is going wrong...
The result must be: "Reading practice to help you understand simple texts and find specific information in everyday material. Texts include emails, invitations, personal messages,# tips, notices and signs. Texts include articles, reports,# messages, short stories and reviews."

Splitting removes the commas from the text, so joining with '#' alone loses them. I tried other approaches and this works:
text = "Reading practice to help you understand simple texts and find specific information in everyday material. Texts include emails, invitations, personal messages, tips, notices and signs. Texts include articles, reports, messages, short stories and reviews."
formatted_text = ''
count = 0
for i in text:
    formatted_text = formatted_text + i  # copy each character across
    if i == ',':
        count += 1
        if count == 3:
            formatted_text = formatted_text + '#'
            count = 0
print(formatted_text)
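
For what it's worth, the split idea from the question can also be made to work: split on commas, re-join each group of three pieces with the commas that split() removed, then join the groups with ",#" so that '#' lands right after every third comma. A minimal sketch, reusing the text variable from above:

st_text = text.split(",")
# groups of three pieces, re-joined with the commas restored
chunks = [",".join(st_text[i:i + 3]) for i in range(0, len(st_text), 3)]
# ",#" between groups puts '#' right after every third comma
print(",#".join(chunks))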


Regular expression match / break

I am doing text analysis on SEC filings (e.g., 10-K), and the documents I have are the complete submission. The complete filing submission includes the 10-K, plus several other documents. Each document resides within the tags ‘<DOCUMENT>’ and ‘</DOCUMENT>’.
What I want: To count the number of words in the 10-K only before the first instance of ‘</DOCUMENT>’
How I want to accomplish it: I want to use a for loop, with a regex (regex_end10k) to indicate where to stop the for loop.
What is happening: No matter where I put my regex match break, the program counts all of the words in the entire document. I get no error; however, I cannot get the desired results.
How I know this: I have manually trimmed one filing, while retaining the full document (results below). When I manually remove the undesired documents after the first instance of ‘</DOCUMENT>’, I yield about 750,000 fewer words.
Current output: (screenshot linked in the original post, not reproduced here)
What I have tried: several variations of where to put the regex match break. No matter what, it almost always counts the entire document. I believe the two functions may be running over the entire document. I have tried putting the break statement within get_text_from_html() so that count_words() only runs on the 10-K, but I have had no luck.
The code below is a snippet from a larger function. Its purpose is to (1) strip HTML tags and (2) count the number of words in the text. If I can provide any additional information, please let me know and I'll update my post.
The remaining code (not shown) extracts firm and report identifiers (e.g., ‘file’ or ‘cik’) from the header section between the tags ‘<SEC-HEADER>’ and ‘</SEC-HEADER>’. When extracting header information I use the same regex match/break logic and it works perfectly. I need help understanding why that logic isn't working when I try to count the number of words, and how to correct my code. Any help is appreciated.
import re
import lxml.html  # imports assumed by this snippet

regex_end10k = re.compile(r'</DOCUMENT>', re.IGNORECASE)

for line in f:
    def get_text_from_html(html: str):
        doc = lxml.html.fromstring(html)
        for table in doc.xpath('.//table'):  # optional: removes tables from HTML source code
            table.getparent().remove(table)
        for tag in ["a", "p", "div", "br", "h1", "h2", "h3", "h4", "h5"]:
            for element in doc.findall(tag):
                if element.text:
                    element.text = element.text + "\n"
                else:
                    element.text = "\n"
        return doc.text_content()

    to_clean = f.read()
    clean = get_text_from_html(to_clean)
    #print(clean[:20000])

    def count_words(clean):
        words = re.findall(r"\b[a-zA-Z\'\-]+\b", clean)
        word_count = len(words)
        return word_count

    header_vars["words"] = count_words(clean)

    match = regex_end10k.search(line)  # This should do it, but it doesn't.
    if match:
        break
You don't need regex; just split your original string, then count the words in the part before the tag. Simple example below:
text = 'Text before <DOCUMENT> text after'
splited_text = text.split('<DOCUMENT>')
splited_text_before = splited_text[0]
count_words = len(splited_text_before.split())
print(splited_text_before)
print(count_words)
output
Text before
2
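
Applied to the question's actual goal of counting the words before the first ‘</DOCUMENT>’, a minimal sketch, assuming the whole filing has already been read into a string named filing_text (an illustrative name, not from the original code):

import re

first_document = filing_text.split('</DOCUMENT>', 1)[0]  # keep only the 10-K portion
words = re.findall(r"\b[a-zA-Z\'\-]+\b", first_document)  # same pattern as count_words()
print(len(words))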

Returns only returning single value from list

I'm trying to return values from a list using return, but it's only returning a single value.
I want to get just the numbers (e.g., the 3 in "3 tickets") from a list of different tickets.
My code is:
tickets = find_elements(locator)
for ticket in tickets:
    # Ticket name includes full details like section, row and number of tickets
    txt = ticket.text
    # splitting text, e.g. "2 tickets" from "2 tickets · e-ticket"
    qty = txt.split()
    return qty[0]
It's returning a single number only.
Please help me solve this. Also, this is my first question ever posted, so please accept my apologies if the guidelines are not followed.
Split on the '·' character rather than on the space.
return txt.split('·')[0].strip()
In the above code we split the text on '·' and take the first part with [0]. Finally, strip() removes the leading and trailing whitespace.
Edit 1:
return exits the function on the first pass through the loop, which is why you only get one value. Here is an approach that stores the tickets in a list instead:
ticketsList = []
for ticket in tickets:
    # Ticket name includes full details like section, row and number of tickets
    txt = ticket.text
    # splitting text, e.g. "2 tickets" from "2 tickets · e-ticket"
    qty = txt.split()
    ticketsList.append(qty[0])
print(ticketsList)
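
If the values need to come back from a function, a minimal sketch of the same idea (assuming each ticket's text looks like "2 tickets · e-ticket"):

def get_ticket_quantities(tickets):
    quantities = []
    for ticket in tickets:
        # "2 tickets · e-ticket" -> "2"
        quantities.append(ticket.text.split()[0])
    return quantities  # return once, after the loop, so every value is kept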

FuzzyWuzzy for very similar records in Python

I have a dataset in which I want to find the closest string match. For that purpose I'm using FuzzyWuzzy this way:
sol=process.extract(t,dev2,scorer=fuzz.token_sort_ratio)
Here t is the string and dev2 is the list to compare against. My problem is that sometimes there are very similar records and the options provided by FuzzyWuzzy seem to be lacking. I've tested token_sort, token_set, partial_token_sort, partial_token_set, ratio, partial_ratio, and WRatio.
For example, the string Italy - Serie A gives me the following 2 closest matches.
Token_sort_ratio: (92, 'Italy - Serie D');(86, 'Italian - Serie A')
The one I want is obviously the second, but character by character the first is closer, and it is a different league.
This happens with teams as well. If, say, I have the string Buchtholz, I obtain Buchtholz II before I get TSV Buchtholz.
My main idea now is to weight the presence or absence of certain characters more heavily, like single capital letters at the end of the string, so that a difference or absence there counts as less close; likewise for parentheses and special characters.
I don't know if there is a way to take this into account, or whether you have a better approach to get the string that really matches.
Similarity matches often require knowledge of the data being analysed. i.e. it is not just a blind single round of matching. I recommend that you pass your results through more steps of matching, starting with inclusive/optimistic approaches (like token_set_ratio) with low cut off scores and working toward more exclusive/pessimistic approaches with higher cut off scores until you have a clear winner. If you know more about the text you're analyzing, you can even modify the strings as you progress.
In a case I worked on, I did similarity matches of goods movement descriptions. In the descriptions the numbers sequences were more important than the text. e.g. when looking for a match for "SLURRY VALVE 250MM RAGMAX 2000" the 250 and 2000 part of the string are important, otherwise I get a "SLURRY VALVE 50MM RAGMAX 2000" as the best match instead of "VALVE B/F 250MM,RAGMAX 250RAG2000 RAGON" which is a better result.
I put the similarity match process through two steps: 1. Get a bunch of similar matches using an optimistic matching scorer (token_set_ratio). 2. Get the number sequences of these results and pass them through another round of matching with a stricter scorer (token_sort_ratio). Doing this gave me the better result in the example shown above.
Below are some blocks of code that could be of assistance.
Here's a function to get the numbers out of a string. (In your case you might use it to exclude numbers from your string instead?)
def get_numbers_from_string(description):
    # keep digits, dots and dashes; turn every other character into a space
    numbers = ''.join((ch if ch in '0123456789.-' else ' ') for ch in description)
    # collapse the runs of spaces left behind
    numbers = ' '.join(numbers.split())
    return numbers
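
For example, an illustrative call on the valve description from earlier:

print(get_numbers_from_string("SLURRY VALVE 250MM RAGMAX 2000"))  # -> '250 2000'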
And here is a portion of the code I used to put the description match through two rounds:
# imports assumed by this excerpt
import numpy as np
import pandas as pd
from fuzzywuzzy import fuzz, process

try:
    # get close matches from goods movements that have material numbers
    df_material = pd.DataFrame(process.extract(description,
                                               corpus_material,
                                               scorer=fuzz.token_set_ratio),
                               columns=['Similar Text', 'Score'])
    if df_material['Score'][df_material['Score'] >= cut_off_accuracy_materials].count() >= 1:
        similar_text = df_material['Similar Text'].iloc[0]
        score = df_material['Score'].iloc[0]
        if nr_description_numbers > 4:
            # if there are multiple matches found, then get the best number-combination match
            df_material = df_material[df_material['Score'] >= cut_off_accuracy_materials]
            new_corpus = list(df_material['Similar Text'])
            new_corpus = np.vectorize(get_numbers_from_string)(new_corpus)
            df_material['numbers'] = new_corpus
            df_numbers = pd.DataFrame(process.extract(description_numbers,
                                                      new_corpus,
                                                      scorer=fuzz.token_sort_ratio),
                                      columns=['numbers', 'Score'])
            similar_text = df_material['Similar Text'][df_material['numbers'] == df_numbers['numbers'].iloc[0]].iloc[0]
            nr_score = df_numbers['Score'].iloc[0]
except Exception:
    # exception handling not shown in the original excerpt
    raise
Hope it helps, and good luck.

Tell if specific char in string is a long char or a short char

Be prepared, this is one of those hard questions.
In the Farsi (Persian) language, ی, which sounds like y or i, is written in 4 different shapes according to its place in the word. I'll call ی YA from now on, for simplicity.
Take a look at this image (from the original post):
All YA characters are painted in red. In the first word the YA is attached to its previous character (the one on its right; in Farsi we write from RIGHT to LEFT) and is free at the end, whereas the last YA (3rd word, left-most red character) is free on both sides.
Having told this long story: I want to find out whether a part of a string ends with long YA (YA without dots) or short YA (YA with two dots beneath it).
For example, تحصیلداری (the 3rd word) ends with long YA, but تحصیـ, which is a part of the 3rd word, does not; it ends with short YA.
Question: How can I tell which Unicode code point تحصیلداری ends with? I just have a simple string, "تحصیلداری"; how can I convert its characters to Unicode code points?
I tried printing the code points:
string unicodes = "";
foreach (char c in "تحصیلداری")
{
    unicodes += c + " " + ((int)c).ToString() + Environment.NewLine;
}
MessageBox.Show(unicodes);
Result (screenshot in the original post): at the end of the day, unfortunately, all the YAs have the same code point.
Bad news: YA was only an example (a real one, though). There are a dozen other characters that, like YA, have different appearances.
Additional info:
Using this useful link about Unicode, I found the code points of the different YAs.
We solved a similar problem in the way below:
We had a core banking application; the customer sub-system needed full-text search on customers' names, family names, fathers' names, etc.
Different encodings, legacy migrated data, keyboard layouts and Farsi fonts made the search process inaccurate.
We overcame the problem by replacing problematic characters with standard ones and saving the standardized string for search purposes.
After several iterations, the replacement ended up as below; it may come in handy:
Formula="UPPER(REPLACE(REPLACE(REPLACE
(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE
(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE
(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE
(REPLACE(REPLACE(REPLACE(REPLACE
(REPLACE(FirsName || LastName || FatherName,
chr(32),''),
chr(13),''),
chr(9),''),
chr(10),''),
'-',''),
'-',''),
'آ','ا'),
'أ', 'ا'),
'ئ', 'ي'),
'ي', 'ي'),
'ك', 'ک'),
'آإئؤةي','اايوهي'),
'ء',''),
'شأل','شاال'),
'ا.','اله'),
'.',''),
'الله','اله'),
'ؤ','و'),
'إ','ا'),
'ة','ه'),
' ا لله','اله'),
'ا لله','اله'),
' ا لله','اله'))"
Although there are different YEHs in Unicode, it must be noticed that all presentation forms of this YEH are the same Unicode character, with code 0x06cc (U+06CC). You cannot determine the presentation form from the Unicode code point.
But you can reach your goal by checking which characters come before or after the YEH.
You can also use Fardis to see the Unicode code points of strings.
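
To illustrate the point, a minimal Python sketch (added here for illustration, not from the original answers): every ی in the word maps to the same code point, and whether it renders joined or final depends on its neighbours, not on its code.

word = "تحصیلداری"
for ch in word:
    print(ch, f"U+{ord(ch):04X}")
# Both occurrences of ی print U+06CC; the rendered shapes differ only
# because the first ی is followed by a joining letter (ل) while the
# last one ends the word.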

Looking for resources for ICD-9 codes [closed]

We have been asked by a client to incorporate ICD-9 codes into a system.
I'm looking for a good resource to get a complete listing of codes and descriptions that will end up in a SQL database.
Unfortunately a web service is out of the question, as a fair amount of the time folks will be using the application offline.
I've found http://icd9cm.chrisendres.com/ and http://www.icd9data.com/ but neither offer downloads/exports of the data that I could find.
I also found http://www.cms.hhs.gov/MinimumDataSets20/07_RAVENSoftware.asp, which has a database of the ICD-9 codes, but they are not in the correct format and I'm not 100% sure how to properly convert them. (It shows the code 5566, which is really 556.6, but I can't find a rule for how/when to convert a code to include a decimal.)
I'm tagging this with medical and data since I'm not 100% sure where it should really be tagged...any help there would also be appreciated.
Just wanted to chime in on how to correct the code decimal places. First, there are four broad points to consider:
Standard codes have the decimal format XXX.XX
Some codes do not have trailing decimal places
V codes also follow the XXX.XX format --> V54.31
E codes follow XXXX.X --> E850.9
Thus the general logic for fixing the codes is:
If first character = E:
    If 5th character = '':
        ignore
    Else:
        replace XXXXX with XXXX.X
Else if 4th or 5th character is not '':  (XXXX or XXXXX)
    replace with XXX + '.' + remainder  (XXX.X or XXX.XX)
(All remaining codes are plain XXX)
I implemented this with two SQL Update statements:
Number 1, for Non E-codes:
USE MainDb;
UPDATE "dbo"."icd9cm_diagnosis_codes"
SET "DIAGNOSIS CODE" = SUBSTRING("DIAGNOSIS CODE",1,3)+'.'+SUBSTRING("DIAGNOSIS CODE",4,5)
FROM "dbo"."icd9cm_diagnosis_codes"
WHERE
SUBSTRING("DIAGNOSIS CODE",4,5) != ''
AND
LEFT("DIAGNOSIS CODE",1) != 'E'
Number 2 - For E Codes:
UPDATE "dbo"."icd9cm_diagnosis_codes"
SET "DIAGNOSIS CODE" = SUBSTRING("DIAGNOSIS CODE",1,4)+'.'+SUBSTRING("DIAGNOSIS CODE",5,5)
FROM "dbo"."icd9_Diagnosis_table"
WHERE
LEFT("DIAGNOSIS CODE",1) = 'E'
AND
SUBSTRING("DIAGNOSIS CODE",5,5) != ''
Seemed to do the trick for me (Using SQL Server 2008).
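The same correction logic as a minimal Python sketch, checked against the example codes mentioned in this thread (the function name is illustrative, not from the original answer):

def add_decimal(code):
    # E codes: decimal after the 4th character (XXXX.X), e.g. E8509 -> E850.9
    # everything else, including V codes: after the 3rd (XXX.XX), e.g. 5566 -> 556.6
    code = code.strip()
    if code.startswith('E'):
        return code if len(code) <= 4 else code[:4] + '.' + code[4:]
    return code if len(code) <= 3 else code[:3] + '.' + code[3:]

assert add_decimal('5566') == '556.6'
assert add_decimal('E8509') == 'E850.9'
assert add_decimal('V5431') == 'V54.31'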
I ran into this same issue a while back and ended up building my own solution from scratch. Recently, I put up an open API for the codes for others to use: http://aqua.io/codes/icd9/documentation
You can just download all codes in JSON (http://api.aqua.io/codes/beta/icd9.json) or pull an individual code (http://api.aqua.io/codes/beta/icd9/250-1.json). Pulling a single code not only gives you the ICD-10 "crosswalk" (equivalents), but also some extra goodies, like relevant Wikipedia links.
I finally found the following:
"The field for the ICD-9-CM Principal and Other Diagnosis Codes is six characters in length, with the decimal point implied between the third and fourth digit for all diagnosis codes other than the V codes. The decimal is implied for V codes between the second and third digit."
So I was able to get a hold of a complete ICD-9 list and reformat as required.
You might find that the ICD-9 codes follow this format:
All codes are 6 characters long
The decimal point comes between the 3rd and 4th characters
If the code starts with a V character the decimal point comes between the 2nd and 3rd characters
Check this out: http://en.wikipedia.org/wiki/List_of_ICD-9_codes
I struggled with this issue myself for a long time as well. The best resource I have been able to find is the zip files here:
https://www.cms.gov/ICD9ProviderDiagnosticCodes/06_codes.asp
It's unfortunate because they (oddly) are missing the decimal places, but as several other posters have pointed out, adding them is fairly easy since the rules are known. I was able to use a regular expression based "find and replace" in my text editor to add them. One thing to watch out for if you go that route is that you can end up with codes that have a trailing "." but no zero after it. That's not valid, so you might need to go through and do another find/replace to clean those up.
The annoying thing about the data files in the link above is that there is no relationship to categories. Which you might need depending on your application. I ended up taking one of the RTF-based category files I found online and re-formatting it to get the ranges of each category. That was still doable in a text editor with some creative regular expressions.
I was able to use the helpful answers here and create a Groovy script to decimalize the codes and combine the long and short descriptions into a tab-separated list. In case this helps anyone, I'm including my code here:
import org.apache.log4j.BasicConfigurator
import org.apache.log4j.Level
import org.apache.log4j.Logger

import java.util.regex.Matcher
import java.util.regex.Pattern

Logger log = Logger.getRootLogger()
BasicConfigurator.configure();
Logger.getRootLogger().setLevel(Level.INFO);

Map shortDescMap = [:]
new File('CMS31_DESC_SHORT_DX.txt').eachLine { String l ->
    int split = l.indexOf(' ')
    String code = l[0..split].trim()
    String desc = l[split+1..-1].trim()
    shortDescMap.put(code, desc)
}

int shortLenCheck = 40 // arbitrary lengths, but provide some sanity checking...
int longLenCheck = 300

File longDescFile = new File('CMS31_DESC_LONG_DX.txt')
Map cmsRows = [:]
Pattern p = Pattern.compile(/^(\w*)\s+(.*)$/)

new File('parsedICD9.csv').withWriter { out ->
    out.write('ICD9 Code\tShort Description\tLong Description\n')
    longDescFile.eachLine { String row ->
        Matcher m = row =~ p
        if (m.matches()) {
            String code = m.group(1)
            String shortDescription = shortDescMap.get(code)
            String longDescription = m.group(2)
            if (shortDescription.size() > shortLenCheck) {
                log.info("Not short? $shortDescription")
            }
            if (longDescription.size() > longLenCheck) {
                log.info("${longDescription.size()} == Too long? $longDescription")
            }
            log.debug("Match 1:${code} -- 2:${longDescription} -- orig:$row")
            if (code.startsWith('V')) {
                if (code.size() > 3) {
                    code = code[0..2] + '.' + code[3..-1]
                }
                log.info("Code: $code")
            } else if (code.startsWith('E')) {
                if (code.size() > 4) {
                    code = code[0..3] + '.' + code[4..-1]
                }
                log.info("Code: $code")
            } else if (code.size() > 3) {
                code = code[0..2] + '.' + code[3..-1]
            }
            if (code) {
                cmsRows.put(code, ['longDesc': longDescription])
            }
            out.write("$code\t$shortDescription\t$longDescription\n")
        } else {
            log.warn "No match for row: $row"
        }
    }
}
I hope this helps someone.
Sean
