Subtracting specific values from column in pandas

Subtracting specific values from column in pandas - python-3.x

I am new to pandas library.
I am working on a data set which looks like this :
suppose I want to subtract point score in the table.
I want to subtract 100 if the score in **point score** column if the score is below 1000
and subtract 200 if the score is above 1000 . How do I do this.
code :
import pandas as pd
df = pd.read_csv("files/Soccer_Football Clubs Ranking.csv")
df.head(4)

Use:
np.where(df['point score']<1000, df['point score']-100, df['point score']-200)
Demonstration:
test = pd.DataFrame({'point score':[14001,200,1500,750]})
np.where(test['point score']<1000, test['point score']-100, test['point score']-200)
Output:
Based on the comment:
temp = test[test['point score']<1000]
temp=temp['point score']-100
temp2 = test[test['point score']>=1000]
temp2=temp2['point score']-200
temp.append(temp2)
Another solution:
out = []
for cell in test['point score']:
if cell < 1000:
out.append(cell-100)
else:
out.append(cell-200)
test['res'] = out
Fourth solution:
test['point score']-(test['point score']<1000).replace(False,200).replace(True,100)

You can achieve vectorial code without explicitly using numpy (although numpy is used by pandas anyways), and keeping vectorial code:
# example 1
df['map'] = df['point score'] - df['point score'].gt(1000).map({True: 200, False: 100})
# example 2, multiple criteria
# -100 below 1000, -200 below 2000, -300 above 3000
df['cut'] = df['point score'] - pd.cut(df['point score'], bins=[0,1000,2000,float('inf')], labels=[100, 200, 300]).astype(int)
Output:
point score map cut
0 800 700 700
1 900 800 800
2 1000 900 900
3 2000 1800 1800
4 3000 2800 2700

Related

Python code to split my excel column value based on delimiter & write 1st split value to same column and 2nd to new column created next to that

In my excel sheet I have data of below kind...
sys_id Domain location tax_amount tp_category
8746 BLR 60000 link:IT,value:63746
2864 link:EFT,value:874887 HYD 50000
3624 link:Cred,value:897076 CHN 55000
7354 BLR 60000 link:sales,value:83746
I want output in my excel in below format...
sys_id Domain_link Domain_value location . tp_category_link tp_category_value
8746 BLR . IT 63746
2864 EFT 874887 HYD .
3624 Cred 897076 CHN .
7354 BLR . sales 83746
please help me with method or logic I should follow to have data in above format.
I have a huge amount of data and this I need to compare with other excel which is of Target data.

You can use pandas, assuming that your input file is called b1.xlsx:
import pandas as pd
import re
VALUE_LINK_REGEX = re.compile('^.*link:(.*),value:(.*)$')
df = pd.read_excel('b1.xlsx', engine='openpyxl')
cols_to_drop = []
for col in filter(lambda c: is_string_dtype(df[c]), df.columns):
m = df[col].map(lambda x: None if pd.isna(x) else VALUE_LINK_REGEX.match(x))
# skip columns with strings different from the expected format
if m.notna().sum() == 0:
continue
cols_to_drop.append(col)
df[f'{col}_link'] = m.map(lambda x: None if x is None else x.groups()[0])
df[f'{col}_value'] = m.map(lambda x: None if x is None else x.groups()[1])
df.drop(columns=cols_to_drop, inplace=True)
df.to_excel('b2.xlsx', index=False, engine='openpyxl')
This is the resulting df:
sys_id location tax_amount Domain_link Domain_value tp_category_link \
0 8746 BLR 60000 None None IT
1 2864 HYD 50000 EFT 874887 None
2 3624 CHN 55000 Cred 897076 None
3 7354 BLR 60000 None None sales
tp_category_value
0 63746
1 None
2 None
3 83746
And this is b2.xlsx:

Python validate_email package not producing correct result [duplicate]

I have a data frame (df) with emails and numbers like
email euro
0 firstname#firstdomain.com 150
1 secondname#seconddomain.com 50
2 thirdname#thirddomain.com 300
3 kjfslkfj 0
4 fourthname#fourthdomain.com 200
I need to filter all rows with correct emails and euro equal to or greater than 100 and another list with correct emails and euro lower than 100. I know that I can filter by euro like this
df_gt_100 = df.euro >= 100
and
df_lt_100 = df.euro < 100
But I can't find a way to filter the email addresses. I imported the email_validate package and tried things like this
validate_email(df.email)
which gives me a TypeError: expected string or bytes-like object.
Can anyone pls give me a hint how to approach this issue. It'd be nice if I could do this all in one filter with the AND and OR operators.
Thanks in advance,
Manuel

Use apply, chain mask by & for AND and filter by boolean indexing:
from validate_email import validate_email
df1 = df[(df['euro'] > 100) & df['email'].apply(validate_email)]
print (df1)
email euro
0 firstname#firstdomain.com 150
2 thirdname#thirddomain.com 300
4 fourthname#fourthdomain.com 200
Another approach with regex and contains:
df1 = df[(df['euro'] > 100) &df['email'].str.contains(r'[^#]+#[^#]+\.[^#]+')]
print (df1)
email euro
0 firstname#firstdomain.com 150
2 thirdname#thirddomain.com 300
4 fourthname#fourthdomain.com 200

In [30]: from validate_email import validate_email
In [31]: df
Out[31]:
email
0 firstname#firstdomain.com
1 kjfslkfj
In [32]: df['is_valid_email'] = df['email'].apply(lambda x:validate_email(x))
In [33]: df
Out[33]:
email is_valid_email
0 firstname#firstdomain.com True
1 kjfslkfj False
In [34]: df['email'][df['is_valid_email']]
Out[34]:
0 firstname#firstdomain.com

You can use regex expressions to find a match and then use apply on the email column to create a T/F column for where an email exists:
import re
import pandas as pd
pattern = re.compile(r"(^[a-zA-Z0-9_.+-]+#[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)") # this is the regex expression to search on
df = pd.DataFrame({'email': ['firstname#domain.com', 'avicii#heaven.com', 'this.is.a.dot#email.com', 'email1234#112.com', 'notanemail'], 'euro': [123, 321, 150, 0, 133]})
df['isemail'] = df['email'].apply(lambda x: True if pattern.match(x) else False)
Result:
email euro isemail
0 firstname#domain.com 123 True
1 avicii#heaven.com 321 True
2 this.is.a.dot#email.com 150 True
3 email1234#112.com 0 True
4 notanemail 133 False
Now you can filter on the isemail column.

validate_email returns a lot of information e.g. smtp etc., and for invalid emails, it throws an EmailNotValidError exception. You can write a function, and apply on pandas series -
from email_validator import validate_email, EmailNotValidError
def validate_e(x):
try:
v = validate_email(x)
return True
except EmailNotValidError as e:
return False
df["Email_validate"] = df['email'].apply(validate_e)

Horizontal grouped Barplot with Bokeh using dataframe

I have a Dataframe and I want to group by Type, and then Flag and plot a graph for count of ID and another graph grouped by Type , Flag and sum of Total column in Bokeh.
')
p.hbar(df,
plot_width=800,
plot_height=800,
label='Type',
values='ID',
bar_width=0.4,
group = ' Type', 'Flag'
legend='top_right')
[![Expected Graph ][2]][2]
If it's not possible with Bokeh what other package can I use to get a good looking graph( Vibrant colours with white background)

You can do this with the holoviews library, which uses bokeh as a backend.
import pandas as pd
import holoviews as hv
from holoviews import opts
hv.extension("bokeh")
df = pd.DataFrame({
"type": list("ABABCCAD"),
"flag": list("YYNNNYNY"),
"id": list("DEFGHIJK"),
"total": [40, 100, 20, 60, 77, 300, 60, 50]
})
# Duplicate the dataframe
df = pd.concat([df] * 2)
print(df)
type flag id total
0 A Y 1 40
1 B Y 2 100
2 A N 3 20
3 B N 4 60
4 C N 5 77
5 C Y 6 300
6 A N 7 60
7 D Y 8 50
Now that we have our data, lets work on plotting it:
def mainplot_hook(plot, element):
plot.state.text(
y="xoffsets",
x="total",
text="total",
source=plot.handles["source"],
text_align="left",
y_offset=9,
x_offset=5
)
def sideplot_hook(plot, element):
plot.state.text(
y="xoffsets",
x="count",
text="count",
source=plot.handles["source"],
text_align="left",
y_offset=9,
x_offset=5
)
# Create single bar plot for sum of the total column
total_sum = df.groupby(["type", "flag"])["total"].sum().reset_index()
total_sum_bars = hv.Bars(total_sum, kdims=["type", "flag"], vdims="total")
# Create our multi-dimensional bar plot
all_ids = sorted(df["id"].unique())
counts = df.groupby(["type", "flag"])["id"].value_counts().rename("count").reset_index()
id_counts_hmap = hv.Bars(counts, kdims=["type", "flag", "id"], vdims="count").groupby("type")
main_plot = (total_sum_bars
.opts(hooks=[mainplot_hook],
title="Total Sum",
invert_axes=True)
)
side_plots = (
id_counts_hmap
.redim.values(id=all_ids, flag=["Y", "N"])
.redim.range(count=(0, 3))
.opts(
opts.NdLayout(title="Counts of ID"),
opts.Bars(color="#1F77B4", height=250, width=250, invert_axes=True, hooks=[sideplot_hook]))
.layout("type")
.cols(2)
)
final_plot = main_plot + side_plots
# Save combined output as html
hv.save(final_plot, "my_plot.html")
# Save just the main_plot as html
hv.save(main_plot, "main_plot.html")
As you can see, the code to make plots in holoviews can be a little tricky but it's definitely a tool I would recommend you pick up. Especially if you deal with high dimensional data regularly, it makes plotting it a breeze once you get the syntax down.

Identifying groups of two rows that satisfy three conditions in a dataframe

I have the df below and want to identify any two orders that satisfy all the following condtions:
Distance between pickups less than X miles
Distance between dropoffs less Y miles
Difference between order creation times less Z minutes
Would use haversine import haversine to calculate the difference in pickups for each row and difference in dropoffs for each row or order.
The df I currently have looks like the following:
DAY  Order pickup_lat pickup_long dropoff_lat dropoff_long created_time
1/3/19 234e 32.69 -117.1 32.63 -117.08 3/1/19 19:00
1/3/19 235d 40.73 -73.98 40.73 -73.99 3/1/19 23:21
1/3/19 253w 40.76 -73.99 40.76 -73.99 3/1/19 15:26
2/3/19 231y 36.08 -94.2 36.07 -94.21 3/2/19 0:14
3/3/19 305g 36.01 -78.92 36.01 -78.95 3/2/19 0:09
3/3/19 328s 36.76 -119.83 36.74 -119.79 3/2/19 4:33
3/3/19 286n 35.76 -78.78 35.78 -78.74 3/2/19 0:43
I want my output df to be any 2 orders or rows that satisfy the above conditions. What I am not sure of is how to calculate that for each row in the dataframe to return any two rows that satisfy those condtions.
I hope I am explaining my desired output correctly. Thanks for looking!

I don't know if it is an optimal solution, but I didn't come up with something different. What I have done:
created dataframe with all possible orders combination,
computed all needed measures and for all of the combinations, I added those measures column to the dataframe,
find the indices of the rows which fulfill the mentioned conditions.
The code:
#create dataframe with all combination
from itertools import combinations
index_comb = list(combinations(trips.index, 2))#trip, your dataframe
col_names = trips.columns
orders1= pd.DataFrame([trips.loc[c[0],:].values for c in index_comb],columns=trips.columns,index = index_comb)
orders2= pd.DataFrame([trips.loc[c[1],:].values for c in index_comb],columns=trips.columns,index = index_comb)
orders2 = orders2.add_suffix('_1')
combined = pd.concat([orders1,orders2],axis=1)
from haversine import haversine
def distance(row):
loc_0 = (row[0],row[1]) # (lat, lon)
loc_1 = (row[2],row[3])
return haversine(loc_0,loc_1,unit='mi')
#pickup diff
pickup_cols = ["pickup_long","pickup_lat","pickup_long_1","pickup_lat_1"]
combined[pickup_cols] = combined[pickup_cols].astype(float)
combined["pickup_dist_mi"] = combined[pickup_cols].apply(distance,axis=1)
#dropoff diff
dropoff_cols = ["dropoff_lat","dropoff_long","dropoff_lat_1","dropoff_long_1"]
combined[dropoff_cols] = combined[dropoff_cols].astype(float)
combined["dropoff_dist_mi"] = combined[dropoff_cols].apply(distance,axis=1)
#creation time diff
combined["time_diff_min"] = abs(pd.to_datetime(combined["created_time"])-pd.to_datetime(combined["created_time_1"])).astype('timedelta64[m]')
#Thresholds
Z = 600
Y = 400
X = 400
#find orders with below conditions
diff_time_Z = combined["time_diff_min"] < Z
pickup_dist_X = combined["pickup_dist_mi"]<X
dropoff_dist_Y = combined["dropoff_dist_mi"]<Y
contitions_idx = diff_time_Z & pickup_dist_X & dropoff_dist_Y
out = combined.loc[contitions_idx,["Order","Order_1","time_diff_min","dropoff_dist_mi","pickup_dist_mi"]]
The output for your data:
Order Order_1 time_diff_min dropoff_dist_mi pickup_dist_mi
(0, 5) 234e 328s 573.0 322.988195 231.300179
(1, 2) 235d 253w 475.0 2.072803 0.896893
(4, 6) 305g 286n 34.0 19.766096 10.233550
Hope I understand you well and that will help.

Using your dataframe as above. Drop the index. I'm presuming your created_time column is in datetime format.
import pandas as pd
from geopy.distance import geodesic
Cross merge the dataframe to get all possible combinations of 'Order'.
df_all = pd.merge(df.assign(key=0), df.assign(key=0), on='key').drop('key', axis=1)
Remove all the rows where the orders are equal.
df_all = df_all[-(df_all['Order_x'] == df_all['Order_y'])].copy()
Drop duplicate rows where Order_x, Order_y == [a, b] and [b, a]
# drop duplicate rows
# first combine Order_x and Order_y into a sorted list, and combine into a string
df_all['dup_order'] = df_all[['Order_x', 'Order_y']].values.tolist()
df_all['dup_order'] = df_all['dup_order'].apply(lambda x: "".join(sorted(x)))
# drop the duplicates and reset the index
df_all = df_all.drop_duplicates(subset=['dup_order'], keep='first')
df_all.reset_index(drop=True)
Create a column calculate the time difference in minutes.
df_all['time'] = (df_all['dt_ceated_x'] - df_all['dt_ceated_y']).abs().astype('timedelta64[m]')
Create a column and calculate the distance between drop offs.
df_all['dropoff'] = df_all.apply(
(lambda row: geodesic(
(row['dropoff_lat_x'], row['dropoff_long_x']),
(row['dropoff_lat_x'], row['dropoff_long_y'])
).miles),
axis=1
)
Create a column and calculate the distance between pickups.
df_all['pickup'] = df_all.apply(
(lambda row: geodesic(
(row['pickup_lat_x'], row['pickup_long_x']),
(row['pickup_lat_x'], row['pickup_long_y'])
).miles),
axis=1
)
Filter the results as desired.
X = 1500
Y = 2000
Z = 100
mask_pickups = df_all['pickup'] < X
mask_dropoff = df_all['dropoff'] < Y
mask_time = df_all['time'] < Z
print(df_all[mask_pickups & mask_dropoff & mask_time][['Order_x', 'Order_y', 'time', 'dropoff', 'pickup']])
Order_x Order_y time dropoff pickup
10 235d 231y 53.0 1059.026620 1059.026620
11 235d 305g 48.0 260.325370 259.275948
13 235d 286n 82.0 249.306279 251.929905
25 231y 305g 5.0 853.308110 854.315567
27 231y 286n 29.0 865.026077 862.126593
34 305g 286n 34.0 11.763787 7.842526

Calculate raster landscape proportion (percentage) within multiple overlaping polygon (shapefiles)?

I think the most easiest way is to extract the raster values within each polygon and calculate the proportion. Is it possible to do so without reading the entire grid as an array?
I have 23 yearly global classified raster (resolution = 0.00277778 degree) from 1992 - 2015 and a polygon vector with 354 shapes (which overlap at some parts). Because of the overlap (Self-intersection) it is not easy to work with them as raster. Both projected in "+proj=longlat +datum=WGS84 +no_defs".
The raster consists of classes from 10 - 220
The polygon has ABC_ID from 1 - 449
For one Year it looks like:
classification and shape example
I need to create a table like:
example table
I already tried to achieve this with:
Zonal Statistics
Pk tools (extract vector sample from raster)
LecoS (Overlay raster metrics)
Cross-Classification and Tabulation" of SAGA GIS (problems with extent)
FRAGSTATS (i was not able to load in the shp file)
Raster --> Extraction --> Clipper dose not work (Ring Self-intersection)
I have heard that Tabulate Area from ArcMap can do this but it would be nice if there is an open source solution to this.

I have managed to do it with Python "rasterio" and "geopandas"
It now creates a table like:
example result
since i did not found something similar like the extract comand in R "raster" it took more than only 2 lines but instead of calculating half the night it now takes only 2 min for one year.
The results are the same. It is based on the ideas of "gene" from "https://gis.stackexchange.com/questions/260304/extract-raster-values-within-shapefile-with-pygeoprocessing-or-gdal/260380"
import rasterio
from rasterio.mask import mask
import geopandas as gpd
import pandas as pd
print('1. Read shapefile')
shape_fn = "D:/path/path/multypoly.shp"
raster_fn = "D:/path/path/class_1992.tif"
# set max and min class
raster_min = 10
raster_max = 230
output_dir = 'C:/Temp/'
write_zero_frequencies = True
show_plot = False
shapefile = gpd.read_file(shape_fn)
# extract the geometries in GeoJSON format
geoms = shapefile.geometry.values # list of shapely geometries
records = shapefile.values
with rasterio.open(raster_fn) as src:
print('nodata value:', src.nodata)
idx_area = 0
# for upslope_area in geoms:
for index, row in shapefile.iterrows():
upslope_area = row['geometry']
lake_id = row['ABC_ID']
print('\n', idx_area, lake_id, '\n')
# transform to GeJSON format
from shapely.geometry import mapping
mapped_geom = [mapping(upslope_area)]
print('2. Cropping raster values')
# extract the raster values values within the polygon
out_image, out_transform = mask(src, mapped_geom, crop=True)
# no data values of the original raster
no_data=src.nodata
# extract the values of the masked array
data = out_image.data[0]
# extract the row, columns of the valid values
import numpy as np
# row, col = np.where(data != no_data)
clas = np.extract(data != no_data, data)
# from rasterio import Affine # or from affine import Affine
# T1 = out_transform * Affine.translation(0.5, 0.5) # reference the pixel centre
# rc2xy = lambda r, c: (c, r) * T1
# d = gpd.GeoDataFrame({'col':col,'row':row,'clas':clas})
range_min = raster_min # min(clas)
range_max = raster_max # max(clas)
classes = range(range_min, range_max + 2)
frequencies, class_limits = np.histogram(clas,
bins=classes,
range=[range_min, range_max])
if idx_area == 0:
# data_frame = gpd.GeoDataFrame({'freq_' + str(lake_id):frequencies})
data_frame = pd.DataFrame({'freq_' + str(lake_id): frequencies})
data_frame.index = class_limits[:-1]
else:
data_frame['freq_' + str(lake_id)] = frequencies
idx_area += 1
print(data_frame)
data_frame.to_csv(output_dir + 'upslope_area_1992.csv', sep='\t')

I can do it with the R command extract and summaries it with table as explained by "Spacedman" see: https://gis.stackexchange.com/questions/23614/get-raster-values-from-a-polygon-overlay-in-opensource-gis-solutions
shapes <- readOGR("C://data/.../shape)
LClass_1992 <- raster("C://.../LClass_1992.tif")
value_list <- extract (LClass, shapes )
stats <- lapply(value_list,table)
[[354]]
10 11 30 40 60 70 80 90 100 110 130 150 180 190 200 201 210
67 303 233 450 1021 8241 65 6461 2823 88 6396 5 35 125 80 70 1027
But it takes very long (half the night).
I will try to do it with Python maybe it will be faster.
Maybe someone had done something similar and can share the code.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

Subtracting specific values from column in pandas - python-3.x

Related

Python code to split my excel column value based on delimiter & write 1st split value to same column and 2nd to new column created next to that

Python validate_email package not producing correct result [duplicate]

Horizontal grouped Barplot with Bokeh using dataframe

Identifying groups of two rows that satisfy three conditions in a dataframe

Calculate raster landscape proportion (percentage) within multiple overlaping polygon (shapefiles)?

Categories

Resources