combine pandas text rows based on condition - python-3.x

I have this kind of df:
df = pd.DataFrame({"text_column" : ['question: everybody is kongfu fighting', 'panda: of course', 'question: Why is the world so great ?', 'friend: Everybody is smart', 'and everybody is cool', 'enemy: no that is just not true', 'jordan: i want to add one thing: please', 'do not talk about this.', ' 2nd question : are you sure ?', 'yeah sure' ]})
text_column
0 question: everybody is kongfu fighting
1 panda: of course
2 question: Why is the world so great ?
3 friend: Everybody is smart
4 and everybody is cool
5 enemy: no that is just not true
6 jordan: i want to add one thing: please
7 do not talk about this.
8 2nd question : are you sure ?
9 messi: yeah sure
10 question: you are sure about this ?
11 donald: youre questions are stupid!
I want the following output
type_column new_text_column
0 question: panda: everybody is kongfu fighting of course
1 question: friend: enemy: jordan: 2nd question : messi: Why is the world so great ? Everybody is smart and everybody is cool no that is just not true i want to add one thing: please do not talk about this. are you sure ? yeah sure
2 question: donald: youre questions are stupid!
Basically each question and answer (topic) have to be in one cell.
I could write a function that works but would use apply, which is in general not an optimal solution.
Does anybody have a good idea how to do it?

Define the following functions:
"Specialized" split of the source text field into 2 parts:
def mySplit(txt):
tbl = re.split(': ?', txt, 1)
if len(tbl) == 1:
tbl.insert(0, '')
return pd.Series(tbl, index=['Qn', 'Ans'])
Reformat a group of rows:
def reformat(grp):
t1 = ': '.join(grp.Qn.tolist()) + ':'
t2 = ' '.join(grp.Ans.tolist())
return pd.Series([t1, t2], index=['type_column', 'new_text_column'])
Then, to get the result run:
df.text_column.apply(mySplit)\
.groupby(df2.Qn.str.startswith('question').cumsum())\
.apply(reformat).reset_index(drop=True)
It performs:
Specialized split of text_column into 2 columns (Qn and Ans).
Cut into groups starting on each row with Qn starting with question.
Apply reformat to each group.
Reset the index (discarding the old index).

It's hard to tell from the example what the criteria is for separating.
I'm guessing it's splitting on the colon, so you could try list comprehension
df["type_column"] = [x.split(":")[0] for x in df["text_column"]]
df["new_text_column"] = [x.split(":")[1] for x in df["text_column"]]

Related

Python: (partial) matching elements of a list to DataFrame columns, returning entry of a different column

I am a beginner in python and have encountered the following problem: I have a long list of strings (I took 3 now for the example):
ENSEMBL_IDs = ['ENSG00000040608',
'ENSG00000070371',
'ENSG00000070413']
which are partial matches of the data in column 0 of my DataFrame genes_df (first 3 entries shown):
genes_list = (['ENSG00000040608.28', 'RTN4R'],
['ENSG00000070371.91', 'CLTCL1'],
['ENSG00000070413.17', 'DGCR2'])
genes_df = pd.DataFrame(genes_list)
The task I want to perform is conceptually not that difficult: I want to compare each element of ENSEMBL_IDs to genes_df.iloc[:,0] (which are partial matches: each element of ENSEMBL_IDs is contained within column 0 of genes_df, as outlined above). If the element of EMSEMBL_IDs matches the element in genes_df.iloc[:,0] (which it does, apart from the extra numbers after the period ".XX" ), I want to return the "corresponding" value that is stored in the first column of the genes_df Dataframe: the actual gene name, 'RTN4R' as an example.
I want to store these in a list. So, in the end, I would be left with a list like follows:
`genenames = ['RTN4R', 'CLTCL1', 'DGCR2']`
Some info that might be helpful: all of the entries in ENSEMBL_IDs are unique, and all of them are for sure contained in column 0 of genes_df.
I think I am looking for something along the lines of:
`genenames = []
for i in ENSEMBL_IDs:
if i in genes_df.iloc[:,0]:
genenames.append(# corresponding value in genes_df.iloc[:,1])`
I am sorry if the question has been asked before; I kept looking and was not able to find a solution that was applicable to my problem.
Thank you for your help!
Thanks also for the edit, English is not my first language, so the improvements were insightful.
You can get rid of the part after the dot (with str.extract or str.replace) before matching the values with isin:
m = genes_df[0].str.extract('([^.]+)', expand=False).isin(ENSEMBL_IDs)
# or
m = genes_df[0].str.replace('\..*$', '', regex=True).isin(ENSEMBL_IDs)
out = genes_df.loc[m, 1].tolist()
Or use a regex with str.match:
pattern = '|'.join(ENSEMBL_IDs)
m = genes_df[0].str.match(pattern)
out = genes_df.loc[m, 1].tolist()
Output: ['RTN4R', 'CLTCL1', 'DGCR2']

Python Warning Panda Dataframe "Simple Issue!" - "A value is trying to be set on a copy of a slice from a DataFrame"

first post / total Python novice so be patient with my slow understanding!
I have a dataframe containing a list of transactions by order of transaction date.
I've appended an additional new field/column called ["DB/CR"], that dependant on the presence of "-" in the ["Amount"] field populates 'Debit', else 'Credit' in the absence of "-".
Noting the transactions are in date order, I've included another new field/column called [Top x]. The output of which is I want to populate and incremental independent number (starting at 1) for both debits and credits on a segregated basis.
As such, I have created a simple loop with a associated 'if' / 'elif' (prob could use else as it's binary) statement that loops through the data sent row 0 to the last row in the df and using an if statement 1) "Debit" or 2) "Credit" increments the number for each independently by "Debit" 'i' integer, and "Credit" 'ii' integer.
The code works as expected in terms of output of the 'Top x'; however, I always receive a warning "A value is trying to be set on a copy of a slice from a DataFrame".
Trying to perfect my script, without any warnings I've been trying to understand what I'm doing incorrect but not getting it in terms of my use case scenario.
Appreciate if someone can kindly shed light on / propose how the code needs to be refactored to avoid receiving this error.
Code (the df source data is an imported csv):
#top x debits/credits
i = 0
ii = 0
for ind in df.index:
if df["DB/CR"][ind] == "Debit":
i = i+1
df["Top x"][ind] = i
elif df["DB/CR"][ind] == "Credit":
ii = ii+1
df["Top x"][ind] = ii
Interpreter
df["Top x"][ind] = i
G:\Finances Backup\venv\Statementsv.03.py:173: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
df["Top x"][ind] = ii
Many thanks :)
You should use df.loc["DB/CR", ind] = "Debit"
Use iterrows() to iterate over the DF. However, updating DF while iterating is not preferable
see documentation here
Refer to the documentation here Iterrows()
You should never modify something you are iterating over. This is not
guaranteed to work in all cases. Depending on the data types, the
iterator returns a copy and not a view, and writing to it will have no
effect.

Adding numbers from a string

So, my problem is that I can get the numbers to print as integers, but can't get them to add together to form a one number total. right now, I am using the following code:
if command == 4:
total = 0
for item in items:
number = (item.split()[-1])
total += float(number)
print(total)
and my output ends up being:
2.99
1.87
when I would just want:
4.86
it's pretty simple code, and I feel like I'm missing something fairly obvious, but for the life of me I can't figure it out. Not sure why this is printing all weird, but here is a picture of the code as wellpartial
The list of items is populated from user input, here is a picture of the rest of the code I am using full code. I have been entering the items as 'socks, 2.99' if that helps clarifying things
Despite the fact that you're printing the total value in each iteration, you can improve your code and do the sum in one line, just if you wanna implement and you're interested I'll let the code right here:
total = sum(map(float, items))

knex.raw with existing columns array using .first statement

Here is existing code:
knex("products")
.first("id", "name", "ingredients")
...
So, currently it just uses array of column names.
Now I want to add calculated column here. It would consists of "constant" + product.id.
For product with id 1 it would be "api/v1/img/1".
For product with id 222 it would be "api/v1/img/222".
Alias of it should be "image".
I have to use knex.raw somehow. Do not understand how and what is the correct syntax to use it with .first().
I'm sorry, I'm unable to understand the question. What kind of result are you trying to achieve? maybe something like this?
knex("products")
.select('*', knex.raw(`'api/v1/img' || ?? as computed`, ['products.id']))
.first()
Like this: https://runkit.com/embed/9okme0czge8z

Looking for resources for ICD-9 codes [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 8 years ago.
Improve this question
We have been asked by a client to incorporate ICD-9 codes into a system.
I'm looking for a good resource to get a complete listing of codes and descriptions that will end up in a SQL database.
Unfortunately a web service is out of the question as a fair amount of the time folks will be off line using the application.
I've found http://icd9cm.chrisendres.com/ and http://www.icd9data.com/ but neither offer downloads/exports of the data that I could find.
I also found http://www.cms.hhs.gov/MinimumDataSets20/07_RAVENSoftware.asp which has a database of the ICD-9 codes but they are not in the correct format and I'm not 100% sure how to properly convert (It shows the code 5566 which is really 556.6 but I can't find a rule as to how/when to convert the code to include a decimal)
I'm tagging this with medical and data since I'm not 100% sure where it should really be tagged...any help there would also be appreciated.
Just wanted to chime in on how to correct the code decimal places. First, there are four broad points to consider:
Standard codes have Decimal place XXX.XX
Some Codes Do not have trailing decimal places
V Codes also follow the XXX.XX format --> V54.31
E Codes follow XXXX.X --> E850.9
Thus the general logic of how to fix the errors is
If first character = E:
If 5th character = '':
Ignore
Else replace XXXXX with XXXX.X
Else If 4th-5th Char is not '': (XXXX or XXXXX)
replace XXXXX with XXX + . + remainder (XXX.XX or XXX.X)
(All remaining are XXX)
I implemented this with two SQL Update statements:
Number 1, for Non E-codes:
USE MainDb;
UPDATE "dbo"."icd9cm_diagnosis_codes"
SET "DIAGNOSIS CODE" = SUBSTRING("DIAGNOSIS CODE",1,3)+'.'+SUBSTRING("DIAGNOSIS CODE",4,5)
FROM "dbo"."icd9cm_diagnosis_codes"
WHERE
SUBSTRING("DIAGNOSIS CODE",4,5) != ''
AND
LEFT("DIAGNOSIS CODE",1) != 'E'
Number 2 - For E Codes:
UPDATE "dbo"."icd9cm_diagnosis_codes"
SET "DIAGNOSIS CODE" = SUBSTRING("DIAGNOSIS CODE",1,4)+'.'+SUBSTRING("DIAGNOSIS CODE",5,5)
FROM "dbo"."icd9_Diagnosis_table"
WHERE
LEFT("DIAGNOSIS CODE",1) = 'E'
AND
SUBSTRING("DIAGNOSIS CODE",5,5) != ''
Seemed to do the trick for me (Using SQL Server 2008).
I ran into this same issue a while back and ended up building my own solution from scratch. Recently, I put up an open API for the codes for others to use: http://aqua.io/codes/icd9/documentation
You can just download all codes in JSON (http://api.aqua.io/codes/beta/icd9.json) or pull an individual code (http://api.aqua.io/codes/beta/icd9/250-1.json). Pulling a single code not only gives you the ICD-10 "crosswalk" (equivalents), but also some extra goodies, like relevant Wikipedia links.
I finally found the following:
"The field for the ICD-9-CM Principal and Other Diagnosis Codes is six characters in length, with the decimal point implied between the third and fourth digit for all diagnosis codes other than the V codes. The decimal is implied for V codes between the second and third digit."
So I was able to get a hold of a complete ICD-9 list and reformat as required.
You might find that the ICD-9 codes follow the following format:
All codes are 6 characters long
The decimal point comes between the 3rd and 4th characters
If the code starts with a V character the decimal point comes between the 2nd and 3rd characters
Check this out: http://en.wikipedia.org/wiki/List_of_ICD-9_codes
I struggled with this issue myself for a long time as well. The best resource I have been able to find for these are the zip files here:
https://www.cms.gov/ICD9ProviderDiagnosticCodes/06_codes.asp
It's unfortunate because they (oddly) are missing the decimal places, but as several other posters have pointed out, adding them is fairly easy since the rules are known. I was able to use a regular expression based "find and replace" in my text editor to add them. One thing to watch out for if you go that route is that you can end up with codes that have a trailing "." but no zero after it. That's not valid, so you might need to go through and do another find/replace to clean those up.
The annoying thing about the data files in the link above is that there is no relationship to categories. Which you might need depending on your application. I ended up taking one of the RTF-based category files I found online and re-formatting it to get the ranges of each category. That was still doable in a text editor with some creative regular expressions.
I was able to use the helpful answers here an create a groovy script to decimalize the code and combine long and short descriptions into a tab separated list. In case this helps anyone, I'm including my code here:
import org.apache.log4j.BasicConfigurator
import org.apache.log4j.Level
import org.apache.log4j.Logger
import java.util.regex.Matcher
import java.util.regex.Pattern
Logger log = Logger.getRootLogger()
BasicConfigurator.configure();
Logger.getRootLogger().setLevel(Level.INFO);
Map shortDescMap = [:]
new File('CMS31_DESC_SHORT_DX.txt').eachLine {String l ->
int split = l.indexOf(' ')
String code = l[0..split].trim()
String desc = l[split+1..-1].trim()
shortDescMap.put(code, desc)
}
int shortLenCheck = 40 // arbitrary lengths, but provide some sanity checking...
int longLenCheck = 300
File longDescFile = new File('CMS31_DESC_LONG_DX.txt')
Map cmsRows = [:]
Pattern p = Pattern.compile(/^(\w*)\s+(.*)$/)
new File('parsedICD9.csv').withWriter { out ->
out.write('ICD9 Code\tShort Description\tLong Description\n')
longDescFile.eachLine {String row ->
Matcher m = row =~ p
if (m.matches()) {
String code = m.group(1)
String shortDescription = shortDescMap.get(code)
String longDescription = m.group(2)
if(shortDescription.size() > shortLenCheck){
log.info("Not short? $shortDescription")
}
if(longDescription.size() > longLenCheck){
log.info("${longDescription.size()} == Too long? $longDescription")
}
log.debug("Match 1:${code} -- 2:${longDescription} -- orig:$row")
if (code.startsWith('V')) {
if (code.size() > 3) {
code = code[0..2] + '.' + code[3..-1]
}
log.info("Code: $code")
} else if (code.startsWith('E')) {
if (code.size() > 4) {
code = code[0..3] + '.' + code[4..-1]
}
log.info("Code: $code")
} else if (code.size() > 3) {
code = code[0..2] + '.' + code[3..-1]
}
if (code) {
cmsRows.put(code, ['longDesc': longDescription])
}
out.write("$code\t$shortDescription\t$longDescription\n")
} else {
log.warn "No match for row: $row"
}
}
}
I hope this helps someone.
Sean

Resources