I have a data frame (df) with emails and numbers, like:
                         email  euro
0    firstname@firstdomain.com   150
1  secondname@seconddomain.com    50
2    thirdname@thirddomain.com   300
3                     kjfslkfj     0
4  fourthname@fourthdomain.com   200
I need one list of all rows with valid emails and euro greater than or equal to 100, and another list with valid emails and euro lower than 100. I know that I can filter by euro like this:
df_gt_100 = df.euro >= 100
and
df_lt_100 = df.euro < 100
But I can't find a way to filter the email addresses. I imported the validate_email package and tried things like this:
validate_email(df.email)
which gives me a TypeError: expected string or bytes-like object.
Can anyone please give me a hint on how to approach this issue? It would be nice if I could do this all in one filter with AND and OR operators.
Thanks in advance,
Manuel
Use apply, chain the masks with & for AND, and filter by boolean indexing:
from validate_email import validate_email

df1 = df[(df['euro'] >= 100) & df['email'].apply(validate_email)]
print(df1)
                         email  euro
0    firstname@firstdomain.com   150
2    thirdname@thirddomain.com   300
4  fourthname@fourthdomain.com   200
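For the second list from the question (valid emails with euro below 100), flip the comparison; a minimal sketch with the same frame:
df2 = df[(df['euro'] < 100) & df['email'].apply(validate_email)]
print(df2)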
Another approach with a regex and str.contains:
df1 = df[(df['euro'] >= 100) & df['email'].str.contains(r'[^@]+@[^@]+\.[^@]+')]
print(df1)
                         email  euro
0    firstname@firstdomain.com   150
2    thirdname@thirddomain.com   300
4  fourthname@fourthdomain.com   200
In [30]: from validate_email import validate_email

In [31]: df
Out[31]:
                       email
0  firstname@firstdomain.com
1                   kjfslkfj

In [32]: df['is_valid_email'] = df['email'].apply(lambda x: validate_email(x))

In [33]: df
Out[33]:
                       email  is_valid_email
0  firstname@firstdomain.com            True
1                   kjfslkfj           False

In [34]: df['email'][df['is_valid_email']]
Out[34]:
0    firstname@firstdomain.com
You can use a regular expression to find a match and then use apply on the email column to create a True/False column marking where a valid email exists:
import re
import pandas as pd

# regex for a basic email shape: local part, @ sign, dot-separated domain
pattern = re.compile(r"(^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)")

df = pd.DataFrame({'email': ['firstname@domain.com', 'avicii@heaven.com', 'this.is.a.dot@email.com', 'email1234@112.com', 'notanemail'],
                   'euro': [123, 321, 150, 0, 133]})
df['isemail'] = df['email'].apply(lambda x: bool(pattern.match(x)))
Result:
                     email  euro  isemail
0     firstname@domain.com   123     True
1        avicii@heaven.com   321     True
2  this.is.a.dot@email.com   150     True
3        email1234@112.com     0     True
4               notanemail   133    False
Now you can filter on the isemail column.
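For example, to build the two lists from the question (a sketch reusing the isemail column):
df_ge_100 = df[df['isemail'] & (df['euro'] >= 100)]
df_lt_100 = df[df['isemail'] & (df['euro'] < 100)]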
validate_email from the email_validator package returns a lot of information about the address and, for invalid emails, raises an EmailNotValidError exception. You can write a wrapper function and apply it to the pandas Series:
from email_validator import validate_email, EmailNotValidError

def validate_e(x):
    try:
        validate_email(x)
        return True
    except EmailNotValidError:
        return False

df["Email_validate"] = df['email'].apply(validate_e)
I am new to the pandas library.
I am working on a data set which looks like this:
Suppose I want to subtract from the point score in the table: subtract 100 from the **point score** column if the score is below 1000, and subtract 200 if the score is 1000 or above. How do I do this?
Code:
import pandas as pd
df = pd.read_csv("files/Soccer_Football Clubs Ranking.csv")
df.head(4)
Use np.where:
import numpy as np

np.where(df['point score'] < 1000, df['point score'] - 100, df['point score'] - 200)
Demonstration:
test = pd.DataFrame({'point score': [14001, 200, 1500, 750]})
np.where(test['point score'] < 1000, test['point score'] - 100, test['point score'] - 200)
Output:
array([13801,   100,  1300,   650])
Based on the comment:
temp = test[test['point score'] < 1000]
temp = temp['point score'] - 100
temp2 = test[test['point score'] >= 1000]
temp2 = temp2['point score'] - 200
# Series.append was removed in pandas 2.0, so concatenate the two pieces instead
pd.concat([temp, temp2]).sort_index()
Another solution:
out = []
for cell in test['point score']:
    if cell < 1000:
        out.append(cell - 100)
    else:
        out.append(cell - 200)
test['res'] = out
Fourth solution: subtract a per-row offset built from the Boolean comparison:
test['point score'] - (test['point score'] < 1000).replace(False, 200).replace(True, 100)
You can write vectorized code without explicitly using numpy (although numpy is used by pandas under the hood anyway):
# example 1
df['map'] = df['point score'] - df['point score'].gt(1000).map({True: 200, False: 100})
# example 2, multiple criteria
# -100 up to 1000, -200 up to 2000, -300 above 2000 (pd.cut bins are right-inclusive)
df['cut'] = df['point score'] - pd.cut(df['point score'], bins=[0,1000,2000,float('inf')], labels=[100, 200, 300]).astype(int)
Output:
   point score   map   cut
0          800   700   700
1          900   800   800
2         1000   900   900
3         2000  1800  1800
4         3000  2800  2700
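np.select is another common vectorized pattern for multiple thresholds; a sketch reproducing the same cutoffs:
import numpy as np

conds = [df['point score'] <= 1000, df['point score'] <= 2000]
df['sel'] = df['point score'] - np.select(conds, [100, 200], default=300)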
In my Excel sheet I have data of the kind below...
sys_id  Domain                   location  tax_amount  tp_category
8746                             BLR       60000       link:IT,value:63746
2864    link:EFT,value:874887    HYD       50000
3624    link:Cred,value:897076   CHN       55000
7354                             BLR       60000       link:sales,value:83746
I want the output in my Excel in the format below (a dot marks an empty cell)...
sys_id  Domain_link  Domain_value  location  tp_category_link  tp_category_value
8746    .            .             BLR       IT                63746
2864    EFT          874887        HYD       .                 .
3624    Cred         897076        CHN       .                 .
7354    .            .             BLR       sales             83746
Please help me with the method or logic I should follow to get the data into the above format.
I have a huge amount of data, and I need to compare it with another Excel file of target data.
You can use pandas, assuming that your input file is called b1.xlsx:
import pandas as pd
import re
from pandas.api.types import is_string_dtype

VALUE_LINK_REGEX = re.compile('^.*link:(.*),value:(.*)$')

df = pd.read_excel('b1.xlsx', engine='openpyxl')
cols_to_drop = []
for col in filter(lambda c: is_string_dtype(df[c]), df.columns):
    m = df[col].map(lambda x: None if pd.isna(x) else VALUE_LINK_REGEX.match(x))
    # skip columns with strings different from the expected format
    if m.notna().sum() == 0:
        continue
    cols_to_drop.append(col)
    df[f'{col}_link'] = m.map(lambda x: None if x is None else x.groups()[0])
    df[f'{col}_value'] = m.map(lambda x: None if x is None else x.groups()[1])
df.drop(columns=cols_to_drop, inplace=True)
df.to_excel('b2.xlsx', index=False, engine='openpyxl')
This is the resulting df:
  sys_id location tax_amount Domain_link Domain_value tp_category_link  \
0   8746      BLR      60000        None         None               IT
1   2864      HYD      50000         EFT       874887             None
2   3624      CHN      55000        Cred       897076             None
3   7354      BLR      60000        None         None            sales

  tp_category_value
0             63746
1              None
2              None
3             83746
And this is b2.xlsx:
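If the columns holding the link/value strings are known in advance, str.extract gives a shorter sketch of the same split (the column names are taken from the question):
for col in ['Domain', 'tp_category']:
    df[[f'{col}_link', f'{col}_value']] = df[col].str.extract(r'link:(.*),value:(.*)')
df = df.drop(columns=['Domain', 'tp_category'])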
I am unable to understand the mask variable in the code below. The code filters the given series down to the words that contain two or more vowels.
# Input
ser = pd.Series(['Apple', 'Orange', 'Plan', 'Python', 'Money'])
# Solution
from collections import Counter
mask = ser.map(lambda x: sum([Counter(x.lower()).get(i, 0) for i in list('aeiou')]) >= 2)
ser[mask]
Can someone please help me in understanding this in a better way?
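The mask maps every word to a Boolean: Counter(x.lower()) counts each character, .get(i, 0) reads the count for each vowel, and the summed vowel count is compared against 2. A step-by-step sketch on one value:
from collections import Counter

counts = Counter('Orange'.lower())                  # {'o': 1, 'r': 1, 'a': 1, 'n': 1, 'g': 1, 'e': 1}
vowels = [counts.get(i, 0) for i in list('aeiou')]  # [1, 1, 0, 1, 0]
print(sum(vowels) >= 2)                             # True, so 'Orange' passes the mask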
Use Series.str.count with a regex character class to count all matches of [aeiou], with the inline (?i) flag to ignore case:
print(ser[ser.str.count('(?i)[aeiou]') >= 2])
0 Apple
1 Orange
4 Money
dtype: object
Another solution:
import re
print(ser[ser.str.count('[aeiou]', re.I) >= 2])
0 Apple
1 Orange
4 Money
dtype: object
Try this:
import re
mask = ser.str.count('a|e|i|o|u', re.IGNORECASE) >= 2
ser[mask]
Output:
0 Apple
1 Orange
4 Money
dtype: object
import pandas as pd

series = pd.Series(['red', 'Green', 'orange', 'pink', 'yellow', 'white'])
for i in series:
    a = 0
    for j in i:
        if j in ['a', 'e', 'i', 'o', 'u', 'A', 'E', 'I', 'O', 'U']:
            a = a + 1
        if a >= 2:
            print(i)
            break
I am working with a large pandas dataframe where I have created a new empty column. What I want to do is iterate over every value within a specific column of the dataframe, do a Boolean check, and then assign a value to the new column based on the output of that check.
I would think I need to use a for loop to check the individual contents of each cell in my specified column. The problem is that I can't seem to figure out the correct syntax to correctly write a for loop that checks values in a specific column. This is what I have so far.
call_info['% of Net Capital'] = call_info['Call Amount'] / call_info['Net Capital']

for (ColumnData) in call_info['Call Amount']:
    columnSeriesObj = call_info[ColumnData]
    if columnSeriesObj.any - call_info['Excess Deficit'].any > 0:
        call_info['Sufficient Excess?'][ColumnData] = True
    else:
        call_info['Sufficient Excess?'][ColumnData] = False
I get a KeyError: 38749372.
call_info is a pandas dataframe. I am trying to compare call_info['Call Amount'] against call_info['Excess Deficit'] and put a True or false value in call_info['Sufficient Excess?']
**Updated to include an example of my dataframe and the expected output.**
This is a snip of a larger csv file:
I have loaded the data from this CSV file using openpyxl load_workbook
From there, I converted the data into a Pandas Dataframe using the following code :
import pandas as pd
from itertools import islice

data = sheet_ranges.values
cols = next(data)[1:]
data = list(data)
idx = [r[0] for r in data]
data = (islice(r, 1, None) for r in data)
df = pd.DataFrame(data, index=idx, columns=cols)
An example of the expected output is a column within the dataframe that looks like this:
I've been able to do this in Excel, but I am looking to automate the process
I made some demo data, which hopefully represents the problem.
import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame(np.random.randint(1000, size = [20, 2]), columns = ['call_amount', 'excess_deficit'])
Then you can use the following code to get the result you're looking for.
df['sufficient_excess'] = (df['call_amount'] - df['excess_deficit']) > 0
which gives
   call_amount  excess_deficit  sufficient_excess
0          684             559               True
1          629             192               True
2          835             763               True
3          707             359               True
4            9             723              False
5          277             754              False
6          804             599               True
7           70             472              False
8          600             396               True
9          314             705              False
If you need the result changed to show Yes instead of True, let me know.
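Translated back to the question's column names, with the optional Yes/No formatting (a sketch, assuming those columns exist in call_info):
call_info['Sufficient Excess?'] = (call_info['Call Amount'] - call_info['Excess Deficit']) > 0
# optional: show Yes/No strings instead of booleans
call_info['Sufficient Excess?'] = call_info['Sufficient Excess?'].map({True: 'Yes', False: 'No'})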
After forming the pandas dataframe below (for example):
import pandas
data = [['Alex',10],['Bob',12],['Clarke',13]]
df = pandas.DataFrame(data,columns=['Name','Age'])
If I iterate through it, I get
In [62]: for i in df.itertuples():
...: print( i.Index, i.Name, i.Age )
...:
0 Alex 10
1 Bob 12
2 Clarke 13
What I would like to achieve is to replace the value of a particular cell
In [67]: for i in df.itertuples():
...: if i.Name == "Alex":
...: df.at[i.Index, 'Age'] = 100
...:
Which seems to work
In [64]: df
Out[64]:
Name Age
0 Alex 100
1 Bob 12
2 Clarke 13
The problem appears when I use a larger, different dataset and do the following. First, I create a new column named NETELEMENT with a default value of "", and then I try to replace that default value with the string that the function lookup_netelement returns:
df['NETELEMENT'] = ""
for i in df.itertuples():
    df.at[i.Index, 'NETELEMENT'] = lookup_netelement(i.PEER_SRC_IP)
    print(i, lookup_netelement(i.PEER_SRC_IP))
But what I get as a result is:
Pandas(Index=769, SRC_AS='', DST_AS='', COMMS='', SRC_COMMS=nan, AS_PATH='', SRC_AS_PATH=nan, PREF='', SRC_PREF='0', MED='0', SRC_MED='0', PEER_SRC_AS='0', PEER_DST_AS='', PEER_SRC_IP='x.x.x.x', PEER_DST_IP='', IN_IFACE='', OUT_IFACE='', PROTOCOL='udp', TOS='0', BPS=35200.0, SRC_PREFIX='', DST_PREFIX='', NETELEMENT='', IN_IFNAME='', OUT_IFNAME='') routerX
meaning that it should be:
NETELEMENT='routerX' instead of NETELEMENT=''
Could you please advise what I am doing wrong?
EDIT: for completeness, lookup_netelement is defined as
def lookup_netelement(ipaddr):
    try:
        x = LOOKUP['conn'].hget('ipaddr;{}'.format(ipaddr), 'dev') or b""
    except Exception as e:  # bind the exception so it can be logged
        logger.error('looking up `ipaddr` for netelement caused `{}`'.format(repr(e)), exc_info=True)
        x = b""
    x = x.decode("utf-8")
    return x
It sounds like you are looking for where for conditional replacement, i.e.
def wow(x):
    return x ** 10

df['new'] = df['Age'].where(~(df['Name'] == 'Alex'), wow(df['Age']))
Output:
     Name  Age           new
0    Alex   10   10000000000
1     Bob   12            12
2  Clarke   13            13
3    Alex   15  576650390625
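If the goal is simply to overwrite Age in place for the matching rows, boolean indexing with .loc avoids both the loop and the helper function; a minimal sketch:
df.loc[df['Name'] == 'Alex', 'Age'] = 100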
Based on your edit, you are trying to apply the function, i.e.
df['new'] = df['PEER_SRC_IP'].apply(lookup_netelement)
Edit: to pass two columns to the function, use a lambda with axis=1, i.e.
def wow(x, y):
    return '{} {}'.format(x, y)

df.apply(lambda x: wow(x['Name'], x['Age']), axis=1)
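With the small frame from earlier in this question (after Alex's age was set to 100), this returns something like:
0     Alex 100
1       Bob 12
2    Clarke 13
dtype: object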