regular expression using pandas string match - python-3.x

Input data:
name Age Zodiac Grade City pahun
0 /extract 30 Aries A Aura a_b_c
1 /abc/236466/touchbar.html 20 Leo AB Somerville c_d_e
2 Brenda4 25 Virgo B Hendersonville f_g
3 /abc/256476/mouse.html 18 Libra AA Gannon h_i_j
I am trying to filter rows using a regex on the name column. The regex should match numbers that are 6 digits long.
For example:
/abc/236466/touchbar.html - 236466
Here is the code I have used
df=df[df['name'].str.match(r'\d{6}') == True]
The line above does not match any rows.
Expected:
name Age Zodiac Grade City pahun
0 /abc/236466/touchbar.html 20 Leo AB Somerville c_d_e
1 /abc/256476/mouse.html 18 Libra AA Gannon h_i_j
Can anyone tell me what I am doing wrong?

str.match only searches for a match at the start of the string.
Use str.contains with a regex like
df=df[df['name'].str.contains(r'/\d{6}/')]
to find entries containing / + 6 digits + /.
Or, to make sure you just match 6 digit chunks and not 7+ digit chunks:
df=df[df['name'].str.contains(r'(?<!\d)\d{6}(?!\d)')]
where
(?<!\d) - makes sure there is no digit on the left
\d{6} - any six digits
(?!\d) - no digit on the right is allowed.
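The difference between the two methods can be checked on the sample data. This is a minimal sketch, assuming only the name column matters; the DataFrame is rebuilt by hand from the question, and the variable names are illustrative.

```python
import pandas as pd

# Only the "name" column from the question is needed for the filter.
df = pd.DataFrame({
    "name": [
        "/extract",
        "/abc/236466/touchbar.html",
        "Brenda4",
        "/abc/256476/mouse.html",
    ]
})

# str.match anchors the pattern at the start of each string,
# so a path that begins with "/" can never match r"\d{6}".
at_start = df["name"].str.match(r"\d{6}")

# str.contains searches anywhere in the string; the lookarounds
# reject runs of seven or more digits.
anywhere = df["name"].str.contains(r"(?<!\d)\d{6}(?!\d)")

result = df[anywhere].reset_index(drop=True)
print(result)
```

Resetting the index is only there to reproduce the 0, 1 numbering shown in the expected output.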

You are almost there; use str.contains instead (note that \d{6,} matches runs of six or more digits):
df[df['name'].str.contains(r'\d{6,}')]
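The pandas documentation describes str.match as relying on re.match and str.contains as relying on re.search, so the patterns from both answers can be compared with the standard re module alone. A rough sketch; the 7-digit path is a made-up example for contrast, not data from the question.

```python
import re

six_digit_path = "/abc/236466/touchbar.html"   # from the question
seven_digit_path = "/abc/1234567/other.html"   # hypothetical, for contrast

loose = re.compile(r"\d{6,}")                  # six or more digits
exact = re.compile(r"(?<!\d)\d{6}(?!\d)")      # exactly six digits

# re.match only tries position 0, where the path has "/", so it fails.
anchored_hit = re.match(r"\d{6}", six_digit_path)

# re.search scans the whole string.
loose_hits = [bool(loose.search(p)) for p in (six_digit_path, seven_digit_path)]
exact_hits = [bool(exact.search(p)) for p in (six_digit_path, seven_digit_path)]

# The matched number itself, as in the question's example.
number = exact.search(six_digit_path).group()
print(anchored_hit, loose_hits, exact_hits, number)
```

Which pattern is appropriate depends on whether longer digit runs should count as matches.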

Optical differences between characters within a string of equal length

I have a data set with strings of different lengths. They are concatenated into a separate column and padded to equal length using LEN(), TRIM() and REPT().
The formulas I used can be seen in the last row for each column (B:E).
Although the final strings all have the same length, the strings in the "Name with equal length" column are not optically identical, i.e. they do not look equally wide.
As I want to use this column to create new file names via VBA, I would like file names that look visually uniform. (I hope you get what I mean.)
How can I achieve this? Do I have to calculate the pixel widths of the (case-sensitive) letters? If so, how can I do this?
Text         Place  Length of String  Needed Spaces  Name with equal length  Length of Name
SaMPLE_TEXT  P 1    12                2              SaMPLE_TEXT--P 1_.pdf   22
SaMPLE_TexT  P 2    13                1              SaMPLE_TexT-P 2_.pdf    22
SaMPLE_text  P 3    13                1              SaMPLE_text-P 3_.pdf    22
sample_TEXT  P 4    12                2              sample_TEXT--P 4_.pdf   22
SaMPLE_TEXT  P 5    12                2              SaMPLE_TEXT--P 5_.pdf   22
Formulas (last row, in column order):
=LEN(TRIM(B1))
=MAX($D$1:$D$6)-LEN(TRIM(B2))+1
=TRIM(A2)&REPT("-";D2)&TRIM(B2)&"_.pdf"
=LEN(E2)
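For readers who prefer code to worksheet formulas, the padding logic can be mirrored roughly in Python. This is only a sketch of the TRIM/REPT approach described in the question: the function name pad_names and the sample inputs are hypothetical, and the sketch equalizes character counts only, which is exactly why the results can still look uneven in a proportional font.

```python
def excel_trim(text):
    # Approximates Excel's TRIM: drop leading/trailing spaces and
    # collapse internal runs of whitespace to a single space.
    return " ".join(text.split())


def pad_names(pairs, fill="-", suffix="_.pdf"):
    # pairs: iterable of (text, place) tuples; both names are illustrative.
    cleaned = [(excel_trim(text), excel_trim(place)) for text, place in pairs]
    widest = max(len(text) for text, _ in cleaned)
    # The "+ 1" guarantees at least one fill character, as in the formula.
    return [
        text + fill * (widest - len(text) + 1) + place + suffix
        for text, place in cleaned
    ]


# Hypothetical inputs of different lengths.
names = pad_names([("abc", "P 1"), ("abcde", "P 2")])
print(names)
```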

dataframe - Check condition on a number column and modify column

My dataframe is like below:
NameA 401016815
NameB
NameC 414969141
NameD 0403 612 699
How do I check a condition on the number column [the first character is 4 and the number is 9 digits long] and add a zero at the start if the condition is met?
A second check: if the value is 12 characters long only because it contains spaces (e.g. 0403 612 699), the spaces in between should be removed.
We can use Series.str.len to check the length of the string, Series.str.startswith to check the beginning of the string, and Series.str.replace to remove blanks. Series.mask then replaces the values where a condition holds:
#df=df.reset_index() #if Names is the index
# Assign the result back: mask(..., inplace=True) on df['Number'] may act on a copy and leave df unchanged
df['Number']=df['Number'].mask(df['Number'].str.len()>=12,df['Number'].str.replace(' ',''))
start=df['Number'].str.startswith('4',na=False)
df['Number']=df['Number'].mask(start,'0'+df['Number'])
print(df)
Output
Names Number
0 NameA 0401016815
1 NameB NaN
2 NameC 0414969141
3 NameD 0403612699
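The answer above prefixes every value that starts with 4. A variant that also enforces the 9-digit length from the question could look like the sketch below. It assumes the numbers are stored as strings with missing values as None/NaN; the helper names cleaned and needs_zero are illustrative.

```python
import pandas as pd

# Sample data rebuilt by hand from the question.
df = pd.DataFrame({
    "Names": ["NameA", "NameB", "NameC", "NameD"],
    "Number": ["401016815", None, "414969141", "0403 612 699"],
})

# Remove embedded spaces first (literal replacement, no regex).
cleaned = df["Number"].str.replace(" ", "", regex=False)

# Both conditions from the question: starts with "4" and is 9 characters long.
# na=False treats missing values as "condition not met".
needs_zero = cleaned.str.startswith("4", na=False) & (cleaned.str.len() == 9)

# Where the condition holds, take the zero-prefixed value instead.
df["Number"] = cleaned.mask(needs_zero, "0" + cleaned)
print(df)
```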

Retrieve column 4 from Column 2 and 3 which contains minimum and maximum conditions along with Column 1 which is a separate value?

Hello, I have the table shown below, with letters in column 1 and the min and max of a range in columns 2 and 3. I am trying to retrieve the final number in column 4.
I know I can use a VLOOKUP with the range lookup set to TRUE to get the last column. However, how would I factor in multiple columns/criteria to match the correct range with the correct letter?
For example, I would like to get the value 4 from the last column. I would have to match "B" and an amount between 0 and $50,000.
A 0 $50,000 1
A $50,001 $100,000 2
A $100,001 $250,000 3
B 0 $50,000 4
B $50,001 $100,000 5
B $100,001 $250,000 6
C 0 $50,000 7
C $50,001 $100,000 8
C $100,001 $250,000 9
Thank you!
Two ways:
If every letter uses the same pattern of dollar breakpoints, then use this:
=INDEX(D:D,MATCH(G1,A:A,0)+MATCH(H1,$B$1:$B$3)-1)
Where MATCH(G1,A:A,0) returns the first row where the ID is located and MATCH(H1,$B$1:$B$3) finds the relative location of the price in the first pattern. Change $B$1:$B$3 to encompass the whole pattern.
If the patterns are different then you can use this:
=SUMIFS(D:D,A:A,G1,B:B,"<=" & H1,C:C,">=" & H1)
One more for versions of Excel that provide FILTER() (Microsoft 365 and Excel 2021):
=FILTER(D:D,(A:A=G1)*(B:B<=H1)*(C:C>=H1))
This is entered normally and works regardless of the pattern.
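The logic shared by the SUMIFS and FILTER formulas (same letter, amount between min and max) can be sketched outside Excel as well. This is an illustrative Python version, assuming the amounts are plain numbers without the dollar formatting; the function name lookup is hypothetical.

```python
# Rows copied from the question: (letter, min, max, result).
rows = [
    ("A", 0, 50_000, 1),
    ("A", 50_001, 100_000, 2),
    ("A", 100_001, 250_000, 3),
    ("B", 0, 50_000, 4),
    ("B", 50_001, 100_000, 5),
    ("B", 100_001, 250_000, 6),
    ("C", 0, 50_000, 7),
    ("C", 50_001, 100_000, 8),
    ("C", 100_001, 250_000, 9),
]


def lookup(letter, amount):
    # Keep rows whose letter matches and whose range contains the amount.
    matches = [
        result
        for key, low, high, result in rows
        if key == letter and low <= amount <= high
    ]
    # Unlike SUMIFS, which returns 0, this sketch returns None on no match.
    return matches[0] if matches else None


print(lookup("B", 25_000))
```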

SAS 9.4 Character functions - Why might one cell return unexpected results?

I am manipulating some string variables and certain cells are returning unexpected values with substring and length functions. These cells hold character-formatted dates, as I need to do a few manipulations before converting them to SAS dates, because of the nature of the Excel file they're coming from. Here is an example:
HAVE:
Obs _orig
1 4/3
2 12/16
3 1/13
4 6/2
5 3/10
6 5/4
7 10/14
WANT:
Obs _orig _length _sub_1_2
1 4/3 3 4/
2 12/16 5 12
3 1/13 4 1/
4 6/2 3 6/
5 3/10 4 3/
6 5/4 3 5/
7 10/14 5 10
I am using this code:
data want;
set have;
_strip=strip(_orig);
_sub_1_2=substr(_strip,1,2);
_length=length(_strip);
run;
This is what I get. The discrepancies are in observations 1 and 6.
Obs _orig _length _sub_1_2
1 4/3 5
2 12/16 5 12
3 1/13 4 1/
4 6/2 3 6/
5 3/10 4 3/
6 5/4 5
7 10/14 5 10
In both cases SAS calculates length=5 when the length should be 3, and the value of the substring-derived variable is blank altogether. Results are the same if I use compress(), trim(), or trimn() in my code rather than strip(). Thank you for any help you can provide.
Sounds like unprintable characters may have gotten into your data. If you PUT _orig $hex.; to the log, what do you see? For the value 4/3 stored in a $5 variable it should be 342F332020:
data want;
length orig $5;
orig='4/3';
len=length(orig);
put orig= len=;
put orig hex.;
run;
orig=4/3 len=3
342F332020
To get rid of non-printable characters, you could try:
_strip=compress(_orig,,'kw');
It seems pretty clear to me that your variables have leading spaces or other leading characters that look like spaces on the screen. So for OBS=6 the value of the string is more like "  5/4", which has a length of 5 and whose first two characters both look like spaces. If LENGTHN() of your new _sub_1_2 variable is not 0 then it contains some non-printing character. Perhaps something like 'A0'X, which some webpages use as a non-breaking space, or a tab character ('09'x).
I suspect that you don't want the first two characters, but instead want the first word when using / as the delimiter. You can use the LEFT() or STRIP() function to remove the leading blanks. Or COMPRESS() to remove other junk. So you might use COMPRESS() with the k and d modifiers to only keep the digits and slashes.
data want;
set have;
length first $5 ;
first = scan(compress(_orig,'/','kd'),1,'/');
run;
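The same diagnosis can be illustrated outside SAS. The Python sketch below is not SAS code and makes an assumption: the suspect value is invented with two non-breaking spaces ('A0'x) in front, since the real bytes in the data are unknown. It shows how a hex dump exposes characters that look like blanks, and how keeping only digits and slashes recovers the first field.

```python
def hex_dump(value):
    # One byte per character in Latin-1, shown as uppercase hex pairs.
    return value.encode("latin-1").hex().upper()


def first_field(value):
    # Keep only digits and "/" (similar in spirit to the 'kd' modifiers),
    # then take the text before the first slash.
    kept = "".join(ch for ch in value if ch in "0123456789/")
    return kept.split("/")[0]


clean = "4/3  "            # "4/3" padded with ordinary blanks to 5 bytes
suspect = "\xa0\xa05/4"    # hypothetical: two non-breaking spaces, then "5/4"

print(hex_dump(clean))     # ordinary blanks show up as 20
print(hex_dump(suspect))   # non-breaking spaces show up as A0
print(len(suspect), first_field(suspect))
```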

Excel - Treat Greater Than or Less Than Symbol as part of Text Criteria

I have the following table data in MS Excel:
Doctor         Patient Age Group
Doctor_Name 1  > 12 yrs old
Doctor_Name 2  < 06 yrs old
Doctor_Name 3  > 12 yrs old
When the formula =COUNTIF(B2:B4,"> 12 yrs old") is executed using this data, it will return 3 when in fact it should only return 2.
I don't know where it is documented, but the function is interpreting the > (greater than) character as an operator and not as part of the string. Try:
=COUNTIF($B$1:$B$8,"=" &"> 12 yrs old")
If an operator is the first character(s) in the text string, it will be interpreted as an operator and not as a character.
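The behaviour described above can be imitated with a toy model. The Python sketch below is not Excel's implementation: it assumes a criterion is split into a leading comparison operator plus an operand and that text is compared case-insensitively, and it ignores wildcards, numeric coercion and Excel's real collation rules. The function names are hypothetical.

```python
import operator

# Longer symbols come first so "<=" is not mistaken for "<".
OPERATORS = [
    ("<=", operator.le),
    (">=", operator.ge),
    ("<>", operator.ne),
    ("<", operator.lt),
    (">", operator.gt),
    ("=", operator.eq),
]


def parse_criterion(criterion):
    # A leading operator is stripped off; the remainder is the operand.
    for symbol, func in OPERATORS:
        if criterion.startswith(symbol):
            return func, criterion[len(symbol):]
    # Without a leading operator the criterion means "equals".
    return operator.eq, criterion


def countif(cells, criterion):
    compare, operand = parse_criterion(criterion)
    return sum(1 for cell in cells if compare(cell.casefold(), operand.casefold()))


cells = ["> 12 yrs old", "< 06 yrs old", "> 12 yrs old"]
as_operator = countif(cells, "> 12 yrs old")    # ">" is read as greater-than
as_text = countif(cells, "=" + "> 12 yrs old")  # "=" forces a literal match
print(as_operator, as_text)
```

In this model the unprefixed criterion compares each cell against the text " 12 yrs old", which every cell exceeds, while the "=" prefix makes the ">" part of the text to match.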
