Search and get row from large single string - python-3.x

Hi, I have a single large string. I need to search it for a set of strings, get the rows that contain them, and create a data frame from those rows.
Large string:
This is democracy’s day.
A day of history and hope.
Of renewal and resolve.
Through a crucible for the ages America has been tested anew and America has risen to the challenge.
Today, we celebrate the triumph not of a candidate, but of a cause, the cause of democracy.
The will of the people has been heard and the will of the people has been heeded.
We have learned again that democracy is precious.
Now I want to search for a few strings in the text above,
and my final output data frame should look like this:
Searching string
democracy’s day
America has been tested
celebrate the triumph
democracy is precious
Thanks in advance

You can create a regex out of your search strings and compare them for a match against the Large String column using extract. Where there's a match, the match string will be the value in the Searching String column, otherwise it will be null. The dataframe can then be filtered on the Searching String value being not null:
import re
import pandas as pd
df = pd.DataFrame({ 'Large String': ["This is democracy's day.", "A day of history and hope.","Of renewal and resolve.","Through a crucible for the ages America has been tested anew and America has risen to the challenge.","Today, we celebrate the triumph not of a candidate, but of a cause, the cause of democracy.","The will of the people has been heard and the will of the people has been heeded.","We have learned again that democracy is precious."] })
search_strings = ["democracy's day", "America has been tested", "celebrate the triumph", "democracy is precious"]
regex = '|'.join(map(re.escape, search_strings))
df['Searching String'] = df['Large String'].str.extract(f'({regex})')
df = df[~df['Searching String'].isna()]
print(df)
Output:
Large String Searching String
0 This is democracy's day. democracy's day
3 Through a crucible for the ages America has be... America has been tested
4 Today, we celebrate the triumph not of a candi... celebrate the triumph
6 We have learned again that democracy is precious. democracy is precious
Note:
We use re.escape on the search strings in case they contain characters that are special in a regex, e.g. . or (.
If one of the search strings is contained in another (for example as a prefix), the list should be sorted in order of decreasing length to ensure the longer match is captured.
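A minimal sketch of why the ordering matters, using only the standard re module. The two search strings here are hypothetical and chosen so that one is a prefix of the other; the text is one row from the question.

```python
import re

# Hypothetical search strings where one is a prefix of the other.
search_strings = ["democracy", "democracy is precious"]
text = "We have learned again that democracy is precious."

def build_regex(strings):
    # Escape each string, then join them into one alternation pattern.
    return "|".join(map(re.escape, strings))

# Unsorted: the shorter alternative is tried first and wins.
unsorted_match = re.search(build_regex(search_strings), text).group(0)

# Sorted longest-first: the longer alternative gets the first chance to match.
by_length = sorted(search_strings, key=len, reverse=True)
sorted_match = re.search(build_regex(by_length), text).group(0)

print(unsorted_match)  # democracy
print(sorted_match)    # democracy is precious
```

Regex alternation tries the alternatives left to right at each position, so sorting with key=len and reverse=True is enough to prefer the longer string.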

Related

Extract value from column with pandas lib (data frame)

original data frame:
| Date | Detail |
| 31/03/22 | I watch Netflix at home with my family 4 hours |
| 01/04/22 | I walk to the market for 3km and I spent 11.54 dollar |
| 02/04/22 | my dog bite me, I go to hospital, spend 29.99 dollar |
| 03/04/22 | I bought a game on steam 7 games spen 19.23 dollar |
result data frame:
| Date | Detail | Cost |
| 31/03/22 | I watch Netflix at home with my family 4 hours | 0 |
| 01/04/22 | I walk to the market for 3km and I spent 11.54 dollar | 11.54 |
| 02/04/22 | my dog bite me, I go to hospital, spend 29.99 dollar | 29.99 |
| 03/04/22 | I bought a game on steam 7 games spen 19.23 dollar | 19.23 |
My question:
If the Detail column does not contain a substring that begins with sp.. and ends with dollar,
then the value in the Cost column should be zero.
If the Detail column does contain a substring that begins with sp.. and ends with dollar,
then the value in the Cost column should be the number in the middle of that substring.
I tried to use a regex, but it returned the first integer in the column, like this:
| 01/04/22 | I walk to the market for 3km and I spent 11.54 dollar| 3 |
You should be able to use a regex pattern of a form such as:
df['Cost'] = df['Detail'].str.extract(r'sp\D*([\d\.]*)\D*dollar')
This will look for the literal string sp and then any non-digit characters after it. The capture group (denoted by the ()) looks for any digits or period characters, representing the dollar amount. This is what is returned to the Cost column. The final part of the pattern allows any number of non-digit characters after the dollar amount, followed by the literal string dollar.
The pd.NA for rows which don't have a cost can then be replaced with 0:
df['Cost'] = df['Cost'].replace({pd.NA: 0})
If you want to make any enhancements I used this site to test the regex: https://regexr.com/6ir6o
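For reference, here is a self-contained sketch that applies the pattern to the sample rows from the question. Note that str.extract returns strings, so this sketch swaps in pd.to_numeric and fillna (rather than the replace call above) to end up with a numeric Cost column; that substitution is an assumption about what is wanted, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["31/03/22", "01/04/22", "02/04/22", "03/04/22"],
    "Detail": [
        "I watch Netflix at home with my family 4 hours",
        "I walk to the market for 3km and I spent 11.54 dollar",
        "my dog bite me, I go to hospital, spend 29.99 dollar",
        "I bought a game on steam 7 games spen 19.23 dollar",
    ],
})

# expand=False returns a Series rather than a one-column DataFrame.
cost = df["Detail"].str.extract(r"sp\D*([\d\.]*)\D*dollar", expand=False)

# Extracted values are strings; convert to numbers, then fill the
# rows without a match with 0.
df["Cost"] = pd.to_numeric(cost).fillna(0)

print(df[["Date", "Cost"]])
```

The third row also contains sp inside "hospital", but \D* skips ahead to the first digit, so the captured amount is still 29.99.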

Excel: How to find six different combinations of words in string?

I have been working for several days on this and have researched everything looking for this answer. I'd appreciate any help you can give.
In Excel I am searching a string of text in column A:
Bought 1 HD Sep 3 2021 325.0 Call # 2.75
I am detecting the first word (in this case "Bought") and detecting the last word before "#" symbol (in this case "Call").
I am then detecting the price following the "#" symbol (in this case "2.75"). This number will go into column B (header "Open") or column C (header "Close") depending on the combination of words found:
Sold/Put=Close
Sold/Call=Open
Bought/Put=Open
Bought/Call=Close
Sold (by itself)=Open
Sold (by itself)=Close.
Bought 1 HD Sep 3 2021 325.0 Call # 2.75
The combination found in the above string is: "Bought Call". Therefore the number at the end ("2.75"), goes into "Open" column.
Here's another example:
Sold 4 AI Sep 17 2021 50.0 Put # 1.5
The combination found in the above string is: "Sold Put". Therefore the number at the end ("1.5") goes into "Close" column.
I am currently using this formula to determine if the string contains "Sold" and "Call" and get the desired number and it does work:
=IF(AND(
ISNUMBER(SEARCH({"Sold","Call"},A10))),
TRIM(MID(A10,SEARCH("#",A10)+LEN("#"),255))," ")
But, I don't know how to search for all the other possible combinations.
The point behind this is to be able to paste the transaction from the broker and have most of the entry process automated. I'm sure many will benefit from this as I've not found anything like this.
I'd appreciate any help and if possible, an explanation of the formula so I can better learn.
Thanks!
I think you have the right idea, but would just extend the IF statement.
Something like the below might work for you:
=IF(ISNUMBER(SEARCH("Call", $A1)),
IF(ISNUMBER(SEARCH({"Bought","Sold"}, $A1)),
NUMBERVALUE(RIGHT($A1, LEN($A1)-SEARCH("#", $A1))),""),
IF(ISNUMBER(SEARCH({"!!!","!!!","Bought","Sold"}, $A1)),
NUMBERVALUE(RIGHT($A1, LEN($A1)-SEARCH("#", $A1))),""))
Just enter in column B and drag down; columns B through E should fill as needed.
Note that "!!!" in the search is just a placeholder made of arbitrary characters; it can be anything that is unlikely to appear in the string.
Here/screenshots refer:
(Requires an Office 365-compatible version of Excel.)
Main lookup
=LET(fn_1,MATCH("*"&$H$7:$H$12&"*",B4,0),fn_2,MATCH("*"&$I$7:$I$12&"*",B4,0),IFERROR(INDEX($J$7:$J$12,MATCH(1,IF($I$7:$I$12="",fn_1*ISNUMBER(fn_2),fn_1*fn_2),0)),))
EDIT:
Other Excel versions:
=IFERROR(INDEX($J$7:$J$12,MATCH(1,IF($I$7:$I$12="",MATCH("*"&$H$7:$H$12&"*",B4,0)*ISNUMBER(MATCH("*"&$I$7:$I$12&"*",B4,0)),MATCH("*"&$H$7:$H$12&"*",B4,0)*MATCH("*"&$I$7:$I$12&"*",B4,0)),0)),)
(The only thing that falls away is the LET function: fn_1 and fn_2 are replaced by their respective MATCH expressions inside the INDEX formula, which makes the formula somewhat longer but otherwise identical.)
Example applications
I have provided two examples of how one might customize this to insert a number into one of the columns (the key part of this question is really how to do the lookup in the first place; from there on it's a matter of fine-tuning and taking the appropriate action).
Assuming calls/buys are a "long" position with the strike price going in the first column (here, D), and puts/sales are a "short" position with the strike price going in the second column (here, E):
Long - insert strike price col D
=IF(LET(fn_1,MATCH("*"&$H$7:$H$12&"*",B4,0),fn_2,MATCH("*"&$I$7:$I$12&"*",B4,0),IFERROR(INDEX($K$7:$K$12,MATCH(1,IF($I$7:$I$12="",fn_1*ISNUMBER(fn_2),fn_1*fn_2),0)),))=1,MID(SUBSTITUTE(B4," ",""),SEARCH("#",SUBSTITUTE(B4," ",""))+1,LEN(SUBSTITUTE(B4," ",""))),"")
EDIT
Other Excel versions:
=IF(IFERROR(INDEX($K$7:$K$12,MATCH(1,IF($I$7:$I$12="",MATCH("*"&$H$7:$H$12&"*",B4,0)*ISNUMBER(MATCH("*"&$I$7:$I$12&"*",B4,0)),MATCH("*"&$H$7:$H$12&"*",B4,0)*MATCH("*"&$I$7:$I$12&"*",B4,0)),0)),)=1,MID(SUBSTITUTE(B4," ",""),SEARCH("#",SUBSTITUTE(B4," ",""))+1,LEN(SUBSTITUTE(B4," ",""))),"")
Short - insert strike price col E
=IF(LET(fn_1,MATCH("*"&$H$7:$H$12&"*",B4,0),fn_2,MATCH("*"&$I$7:$I$12&"*",B4,0),IFERROR(INDEX($K$7:$K$12,MATCH(1,IF($I$7:$I$12="",fn_1*ISNUMBER(fn_2),fn_1*fn_2),0)),))=2,MID(SUBSTITUTE(B4," ",""),SEARCH("#",SUBSTITUTE(B4," ",""))+1,LEN(SUBSTITUTE(B4," ",""))),"")
EDIT
Other Excel versions:
Follow the same routine as in the previous edits (remove LET and replace fn_1 and fn_2 with their respective formulae).
Note the similarity between all three equations above: the second and third contain the first. Effectively they just wrap a big IF statement around the first, use the lookup_2 column (here, column K), and use MID/SEARCH to extract the rate after the # symbol.
This assumes you don't have other # symbols in the sentence.
Customize as required.

Losing records when converting DataFrame to dictionary

I parse a CSV file into a Dataframe. 10,000 records go in, no problems.
Two columns one 'ID', one 'Reviews'.
I try to convert the DF into a dictionary where keys = 'ID', and values = 'Reviews'.
For some reason the new dictionary only contains 680 records.
import pandas as pd

# read csv data file
data = pd.read_csv("Movie_reviews.csv",
                   delimiter='\t',
                   header=None, names=['ID', 'Reviews'])
reviews = data.set_index('ID').to_dict().get('Reviews')
len(reviews)
output is 680
If I don't append '.get('Reviews')' everything is one big record.
the Dataframe 'data' looks like this
ID Reviews
1 076780192X it always amazes me how people can rate the DV...
2 0767821599 This movie is okay, but, its not worth what th...
3 0782008380 If you love the Highlander 1 movie and the ser...
4 0767726227 This is a great classic collection, if you lik...
5 0780621832 This is the second of John Ford and John Wayne...
6 0310263662 I am an evangelical Christian who believes in ...
7 0767809270 Federal law, in one of its numerous unfunded m...
In case it helps anyone else:
The IDs for the movie reviews were not all unique. The .nunique() function revealed that, as suggested by #YOLO.
Assigning only the values (Reviews) to the dictionary automatically added unique keys as suggested by #JackHoman resolving my issue.
I think you can do:
Method 1:
reviews = data.set_index('ID')['Reviews'].to_dict()
Method 2: Here we convert reviews to a list for each ID so that we don't lose any information.
reviews = data.groupby('ID')['Reviews'].apply(list).to_dict()
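The difference between the two methods can be seen on a small made-up frame with one repeated ID (the review texts below are hypothetical; only the column names come from the question):

```python
import pandas as pd

# Hypothetical miniature of the reviews data, with one repeated ID.
data = pd.DataFrame({
    "ID": ["076780192X", "0767821599", "076780192X"],
    "Reviews": ["first review", "second review", "third review"],
})

# Comparing distinct IDs with the row count reveals the duplicates.
print(data["ID"].nunique(), len(data))  # 2 3

# Method 1: one value per ID; a later duplicate overwrites an earlier one.
last_only = data.set_index("ID")["Reviews"].to_dict()

# Method 2: every review is kept, grouped into a list per ID.
grouped = data.groupby("ID")["Reviews"].apply(list).to_dict()

print(len(last_only))         # 2
print(grouped["076780192X"])  # ['first review', 'third review']
```

Both dictionaries have one key per distinct ID, which is why the original 10,000 rows shrank to 680 entries; only Method 2 preserves every review.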

Any simple way to do VLOOKUP combine "linear interpolation" in excel?

I'm making an Excel sheet for calculating the z-score of infant weight for age (inputs: "Baby Month Age" and "Baby weight"). To do that, I first need to get the LMS parameters for a specific month from the table below.
http://www.who.int/childgrowth/standards/tab_wfa_boys_p_0_5.txt
(For an integer month number, this can be done with VLOOKUP without issue.) For a non-integer month number, I need to use some kind of "linear interpolation" to get approximate LMS data.
The problem is that neither the TREND approach nor the VLOOKUP approach works for me. TREND does not work because the raw data, such as the L parameter, is not linear; for the first several months, the returned values are far from the existing data. VLOOKUP just finds the closest month's data.
I had to use multiple MATCH and INDEX calls to do the linear interpolation myself. However, I wonder whether there is an existing function for that?
My current formula for L parameters is below:
=MOD([Month Age],1)*(INDEX('WHO BOY AGE WEIGHT'!A:D,MATCH([Month Age],'WHO BOY AGE WEIGHT'!A:A)+1,2)-INDEX('WHO BOY AGE WEIGHT'!A:D,MATCH([Month Age],'WHO BOY AGE WEIGHT'!A:A),2))+INDEX('WHO BOY AGE WEIGHT'!A:D,MATCH([Month Age],'WHO BOY AGE WEIGHT'!A:A),2)
If we assume that months always increment by 1 (no gaps in the month data), you can use something like this formula to interpolate between the two values surrounding the given non-integer value:
=(1-MOD(2.3, 1))*VLOOKUP(2.3,A:S,2)+MOD(2.3, 1)*VLOOKUP(2.3+1,A:S, 2)
Which interpolates L(2.3) from data of L(2) = .197 and L(3) = .1738, resulting in .19004.
You can replace 2.3 by any cell reference. You can also change the lookup column 2 for L into 3 for M, 4 for S etc.
As for whether Excel has a direct "interpolate" function: not that I know of, although it has a good arsenal of functions for statistical estimation.
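To check the arithmetic outside Excel, the same weighting can be sketched in Python. The dictionary holds only the two L values quoted above, and the function name is hypothetical; a real table would hold every month.

```python
import math

# L values for months 2 and 3, taken from the example above.
l_by_month = {2: 0.197, 3: 0.1738}

def interpolate(table, month):
    """Linearly interpolate between the two integer months around `month`."""
    lower = math.floor(month)
    fraction = month - lower  # same role as MOD(month, 1) in the formula
    if fraction == 0:
        return table[lower]
    return (1 - fraction) * table[lower] + fraction * table[lower + 1]

print(round(interpolate(l_by_month, 2.3), 5))  # 0.19004
```

Like the formula, this assumes the table has an entry for every integer month on both sides of the requested value.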

How do I pull a substring from a string in Python?

Good evening. Today, I was writing a piece of code in Python. I have a string that is called date. It holds the following data:
date='05/04/2014'
So. I want to split this string into several substrings, each holding the day, month or year. These substrings will be called day, month and year, with the respective number in each string. How could I do this?
Also, I would like this method to work for any other date strings, such as:
02/07/2012
Simply use:
day,month,year = date.split('/')
Here you .split(..) the string on the slash (/) and you use sequence unpacking to store the first, second and third group in day, month and year respectively. Here date is the string that contains the date ('05/04/2014') and '/' is the split pattern.
>>> day,month,year = date.split('/')
>>> day
'05'
>>> month
'04'
>>> year
'2014'
Note that day, month and year are still strings (not integers).
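If integers are needed, one possible follow-up is to convert during unpacking, or to parse the whole string with the standard datetime module. The day/month/year order in the format string is an assumption about how the dates are written.

```python
from datetime import datetime

date = '05/04/2014'

# Convert the three substrings to integers in one step.
day, month, year = map(int, date.split('/'))
print(day, month, year)  # 5 4 2014

# Alternatively, parse the whole string into a date object.
# '%d/%m/%Y' assumes the day comes first, then the month.
parsed = datetime.strptime(date, '%d/%m/%Y').date()
print(parsed.year, parsed.month, parsed.day)  # 2014 4 5
```

Parsing with strptime also validates the input: a string such as '32/01/2014' raises a ValueError instead of silently producing an impossible day.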
