dataframe - Check condition on a number column and modify column - python-3.x

My dataframe is like below:
NameA 401016815
NameB
NameC 414969141
NameD 0403 612 699
How do I check a condition (the first character is 4 and the number is 9 digits long) and add a zero at the start when the condition is met?
A second check: if the value is 12 characters long only because it contains spaces (as in NameD), the spaces in between should be removed.

We can use Series.str.len to check the length of the string, Series.str.startswith
to check the beginning of the string, and Series.str.replace to remove blanks. We use Series.mask
to replace or add characters in the rows where a condition holds:
#df=df.reset_index() #if Names is the index
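# remove blanks from values that are 12 or more characters long; assign the result back to the column
df['Number']=df['Number'].mask(df['Number'].str.len()>=12,df['Number'].str.replace(' ',''))
# rows that start with '4' (missing values count as False)
start=df['Number'].str.startswith('4',na=False)
df['Number']=df['Number'].mask(start,'0'+df['Number'])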
print(df)
Output
Names Number
0 NameA 0401016815
1 NameB NaN
2 NameC 0414969141
3 NameD 0403612699
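The code above checks only the leading character. A minimal sketch that also applies the 9-digit length test from the question might look like the following. The frame is rebuilt from the sample data, and the names cleaned and needs_zero are illustrative, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question (NameB has no number).
df = pd.DataFrame({
    "Names": ["NameA", "NameB", "NameC", "NameD"],
    "Number": ["401016815", np.nan, "414969141", "0403 612 699"],
})

# Step 1: drop embedded spaces; missing values stay missing.
cleaned = df["Number"].str.replace(" ", "", regex=False)

# Step 2: prefix "0" only where the value starts with "4" AND has 9 characters.
needs_zero = cleaned.str.startswith("4", na=False) & cleaned.str.len().eq(9)
df["Number"] = cleaned.mask(needs_zero, "0" + cleaned)

print(df)
```

Removing the spaces first means the length test runs on the cleaned value, so a number such as "414 969 141" (a hypothetical input) would also receive the leading zero.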

Related

Optical differences between characters within a string of equal length

I have a data set with strings of different lengths. They are concatenated into a separate column and padded to equal length using LEN(), TRIM() and REPT().
The formulas I used can be seen in the last row of each column (B:E).
Although the final strings all have the same length, the strings in the "Name with equal length" column do not look visually identical, i.e. they do not appear to have the same width.
Because I want to use this column to create new file names via VBA, I would like the file names to look visually even. (I hope you get what I mean.)
How can I achieve this? Do I have to calculate the pixel width differences between (case-sensitive) letters? If so, how can I do this?
Text | Place | Length of String | Needed Spaces | Name with equal length | Length of Name
SaMPLE_TEXT | P 1 | 12 | 2 | SaMPLE_TEXT--P 1_.pdf | 22
SaMPLE_TexT | P 2 | 13 | 1 | SaMPLE_TexT-P 2_.pdf | 22
SaMPLE_text | P 3 | 13 | 1 | SaMPLE_text-P 3_.pdf | 22
sample_TEXT | P 4 | 12 | 2 | sample_TEXT--P 4_.pdf | 22
SaMPLE_TEXT | P 5 | 12 | 2 | SaMPLE_TEXT--P 5_.pdf | 22
 |  | =LEN(TRIM(B1)) | =MAX($D$1:$D$6)-LEN(TRIM(B2))+1 | =TRIM(A2)&REPT("-";D2)&TRIM(B2)&"_.pdf" | =LEN(E2)

How to extract numbers from mixed dataframe column and replace with numbers only (inplace)?

Given the following toy dataframe:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A':['1a',np.nan,'10a','100b','0b'],
})
df
A
0 1a
1 NaN
2 10a
3 100b
4 0b
I want to remove all the non-numeric characters and keep only the numbers in column A.
There is an inplace=True option, but how can I extract the numbers and replace the column values in place?
I want to get:
A
0 1
1 NaN
2 10
3 100
4 0
Here is how I am doing it now:
df.A = df.A.str.extract(r'(\d+)')
str.extract, as the name suggests, only extracts; it doesn't replace. Try:
# assign the result back; inplace=True on df['A'] may not update df in recent pandas versions
df['A'] = df['A'].replace(r'(\D.*)','', regex=True)
Output:
A
0 1
1 NaN
2 10
3 100
4 0
The regex pattern works as follows:
\D matches any non-digit character.
.* matches all the characters that follow \D.
So the pattern replaces everything from the first non-digit character onward with the empty string ''.
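As a quick check of that explanation, the sketch below rebuilds the toy frame and compares the replace pattern with str.extract. The variable names trimmed and extracted are illustrative, and the results are kept in separate variables so the two approaches can be compared side by side.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": ["1a", np.nan, "10a", "100b", "0b"]})

# Remove everything from the first non-digit character onward.
trimmed = df["A"].replace(r"\D.*", "", regex=True)

# Alternative: pull out the first run of digits; expand=False returns a Series.
extracted = df["A"].str.extract(r"(\d+)", expand=False)

print(trimmed.tolist())
print(extracted.tolist())
```

For this sample both approaches leave the same digits and keep the missing value in row 1 as missing.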
With your shown samples, please try the following. A simple explanation: it uses the pandas replace function with regex=True, and the regex replaces everything that is not a digit with the empty string.
df['A'] = df['A'].replace('([^0-9]*)','', regex=True)
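The two patterns agree on the sample data but not in general: \D.* drops everything from the first non-digit onward, while [^0-9]* removes only the non-digit characters themselves. A small sketch with the standard re module shows the difference; the input "a1b2" is a hypothetical value that does not appear in the question.

```python
import re

sample = "10a"      # value from the question's data
mixed = "a1b2"      # hypothetical value with digits after a letter

# Same result on the question's data.
cut_sample = re.sub(r"\D.*", "", sample)
strip_sample = re.sub(r"[^0-9]*", "", sample)

# Different results once digits follow a non-digit.
cut_mixed = re.sub(r"\D.*", "", mixed)        # everything from "a" onward is removed
strip_mixed = re.sub(r"[^0-9]*", "", mixed)   # only the letters are removed

print(cut_sample, strip_sample, repr(cut_mixed), strip_mixed)
```

Which behaviour is right depends on whether digits after the first letter should be kept.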

regular expression using pandas string match

Input data:
name Age Zodiac Grade City pahun
0 /extract 30 Aries A Aura a_b_c
1 /abc/236466/touchbar.html 20 Leo AB Somerville c_d_e
2 Brenda4 25 Virgo B Hendersonville f_g
3 /abc/256476/mouse.html 18 Libra AA Gannon h_i_j
I am trying to select rows based on a regex applied to the name column. The regex should match a number that is 6 digits long.
For example:
/abc/236466/touchbar.html - 236466
Here is the code I have used:
df=df[df['name'].str.match(r'\d{6}') == True]
The above line does not match any rows.
Expected:
name Age Zodiac Grade City pahun
0 /abc/236466/touchbar.html 20 Leo AB Somerville c_d_e
1 /abc/256476/mouse.html 18 Libra AA Gannon h_i_j
Can anyone tell me what I am doing wrong?
str.match only searches for a match at the start of the string.
Use str.contains with a regex like
df=df[df['name'].str.contains(r'/\d{6}/')]
to find entries containing / + 6 digits + /.
Or, to make sure you just match 6 digit chunks and not 7+ digit chunks:
df=df[df['name'].str.contains(r'(?<!\d)\d{6}(?!\d)')]
where
(?<!\d) - makes sure there is no digit on the left
\d{6} - any six digits
(?!\d) - no digit on the right is allowed.
You are almost there; use str.contains instead (note that \d{6,} matches runs of six or more digits):
df[df['name'].str.contains(r'\d{6,}')]
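To see how the patterns above differ, the following sketch runs each of them over the name values from the question. The last entry, with a 7-digit number, is a hypothetical row added only to separate the "exactly six" and "six or more" patterns.

```python
import pandas as pd

names = pd.Series([
    "/extract",
    "/abc/236466/touchbar.html",
    "Brenda4",
    "/abc/256476/mouse.html",
    "/abc/1234567/extra.html",  # hypothetical 7-digit row, not in the original data
])

# str.match anchors at the start of the string, so nothing matches here.
anchored = names.str.match(r"\d{6}")

# str.contains searches anywhere in the string.
between_slashes = names.str.contains(r"/\d{6}/")
exactly_six = names.str.contains(r"(?<!\d)\d{6}(?!\d)")
six_or_more = names.str.contains(r"\d{6,}")

print(pd.DataFrame({
    "name": names,
    "match": anchored,
    "slashes": between_slashes,
    "exactly_six": exactly_six,
    "six_or_more": six_or_more,
}))
```

Only the \d{6,} pattern accepts the 7-digit row, so the choice between the patterns depends on whether longer numbers should count.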

Compare row with all other previous string in one column and change value of another column in Python

I have a csv file named namelist.csv, it includes:
Index String Size Name
1 AAA123000DDD 10 One
2 AAA123DDDQQQ 20 One
3 AAA123000DDD 25 One
4 AAA123D 20 One
5 ABA 15 One
6 FFFrrrSSSBBB 60 Two
7 FFFrrrSSSBBB 30 Two
8 FFFrrrSS 50 Two
9 AAA12 70 Two
I want to compare the rows in column String within each Name group: if the string in a row matches, or is a substring of, the string in any of the rows above it, then remove those previous rows and add their Size values to the Size of the substring row.
Example: take the 3rd row, AAA123000DDD. Comparing it with the 1st and 2nd rows shows that it matches the 1st row, so the 1st row is removed and its Size is added to the Size of the 3rd row.
Then the table will look like:
Index String Size Name
2 AAA123DDDQQQ 20 One
3 AAA123000DDD 35 One
4 AAA123D 20 One
...
The final result will be:
Index String Size Name
3 AAA123000DDD 35 One
4 AAA123D 40 One
5 ABA 15 One
8 FFFrrrSS 140 Two
9 AAA12 70 Two
I thought of using pandas groupby to group by the Name column, but I don't know how to apply the comparison on the String column and the sum on the Size column.
I am new to Python, so I would really appreciate any help.
Assuming each String belongs to a single Name, here's how you would do the aggregation. I kept Name in the grouping so that it also shows in the final DataFrame.
df_group = df.groupby(['String', 'Name'])['Size'].sum().reset_index()
Edit:
To match the substrings (assuming, as the example above suggests, that a substring will not match multiple strings), you can make a mapping of substrings to full strings and then group by the full string column as before:
all_strings = set(df['String'])
substring_dict = dict()
for row in df.itertuples():
    # iterate from shortest to longest so the longest containing string is kept
    for item in sorted(all_strings, key=len):
        if row.String in item:
            substring_dict[row.String] = item
def match_substring(x):
    return substring_dict[x]
df['full_strings'] = df.String.apply(match_substring)
df_group = df.groupby(['full_strings', 'Name'])['Size'].sum().reset_index()
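The grouping above merges rows by string, but the rule in the question also depends on row order: a row absorbs only the rows above it within the same Name group. A rough sketch of that rule follows, assuming the sample data from the question; the helper collapse_group is a hypothetical name, and this is one possible reading of the requirement rather than a definitive implementation.

```python
import pandas as pd

df = pd.DataFrame({
    "Index": [1, 2, 3, 4, 5, 6, 7, 8, 9],
    "String": ["AAA123000DDD", "AAA123DDDQQQ", "AAA123000DDD", "AAA123D", "ABA",
               "FFFrrrSSSBBB", "FFFrrrSSSBBB", "FFFrrrSS", "AAA12"],
    "Size": [10, 20, 25, 20, 15, 60, 30, 50, 70],
    "Name": ["One"] * 5 + ["Two"] * 4,
})

def collapse_group(group):
    """Walk the rows in order; each row absorbs earlier rows that contain its string."""
    kept = []
    for row in group.to_dict("records"):
        survivors = []
        for prev in kept:
            if row["String"] in prev["String"]:
                # Equal string or substring: drop the earlier row, keep its size.
                row["Size"] += prev["Size"]
            else:
                survivors.append(prev)
        survivors.append(row)
        kept = survivors
    return kept

rows = []
for _, group in df.groupby("Name", sort=False):
    rows.extend(collapse_group(group))

result = pd.DataFrame(rows, columns=df.columns)
print(result)
```

On the sample data this keeps rows 3, 4, 5, 8 and 9 with sizes 35, 40, 15, 140 and 70, which is the final table given in the question. The nested loop is quadratic in the group size, which should be acceptable for small groups.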

How to make vlookups in Excel select the EXACT match instead of the first occurrence of matching digits

Is there a way to make an excel vlookup use the entire number rather than the first set of matching digits? I am vlookup-ing a concatenated number that has several occurrences of the first 10 digits. When I vlookup 49272480517 from the below table, I get back the first occurrence of a 4927248051 match instead of the full concatenated number 49272480517.
RefDoc.No. DocLi concat
4927248051 1 49272480511
4927248051 2 49272480512
4927248051 3 49272480513
4927248051 4 49272480514
4927248051 5 49272480515
4927248051 6 49272480516
4927248051 7 49272480517
If your concatenated strings are in column C and your lookup value is in H1, use the formula below. The 0 passed as the third argument of MATCH requests an exact match.
=INDEX(C:C,MATCH(H1,C:C,0),1)
