Extract the string after a particular pattern and before a delimiter - python-3.x
I have a pandas DataFrame and I'd like to extract the value after pb~ and before _, a space, or the end of the string.
So the string looks like pb~value_, pb~value followed by a space, or pb~value at the end of the string.
import pandas as pd
data = {'PName': ['ag~fbai-churnsoon_mk~de_at~lia_sa~fcs_tg~fbai_ts~alldevice-allgender-13-65_md~c_pb~fcbk_rt~cpm',
'pb~precision disclosed desktop_sz~300x600_pd~halfp-dmp-hubble w tablets_ch~dis_dt~dt_fm~ban_it~poe_vv~si_ad~as_rt~cpm_tg~rtg_sa~redc_ts~none_md~w_ff~pr-teas-rt']}
# Creates pandas DataFrame.
df = pd.DataFrame(data)
# Print the data
print(df)
Expected output:
PName Values
ag~fbai-churnsoon_mk~de_at~lia_sa~fcs_tg~fbai_ts~alldevice-allgender-13-65_md~c_pb~fcbk_rt~cpm fcbk
pb~precision disclosed desktop_sz~300x600_pd~halfp-dmp-hubble w tablets_ch~dis_dt~dt_fm~ban_it~poe_vv~si_ad~as_rt~cpm_tg~rtg_sa~redc_ts~none_md~w_ff~pr-teas-rt precision
I tried with
df['value'] = df['PName'].str.extract("")
but I'm not able to figure out how to extract the values.
You can use re.findall with a non-greedy match and take the first hit for each row:
import pandas as pd
import re
data = {'PName': ['ag~fbai-churnsoon_mk~de_at~lia_sa~fcs_tg~fbai_ts~alldevice-allgender-13-65_md~c_pb~fcbk_rt~cpm',
'pb~precision disclosed desktop_sz~300x600_pd~halfp-dmp-hubble w tablets_ch~dis_dt~dt_fm~ban_it~poe_vv~si_ad~as_rt~cpm_tg~rtg_sa~redc_ts~none_md~w_ff~pr-teas-rt']}
# Creates pandas DataFrame.
df = pd.DataFrame(data)
# Lazily capture everything between pb~ and the next _ or space; |$ also accepts end of string.
# Note: findall(...)[0] raises IndexError for any row without a pb~ match.
df['value'] = df['PName'].apply(lambda x: re.findall(r'pb~([\s\S]*?)(?:_| |$)', x)[0])
df
PName value
0 ag~fbai-churnsoon_mk~de_at~lia_sa~fcs_tg~fbai_... fcbk
1 pb~precision disclosed desktop_sz~300x600_pd~h... precision
Try non-greedy (lazy) matching; the (?:[_ ]|$) alternation also covers a pb~value at the end of the string:
df['PName'].str.extract(r'pb~(.+?)(?:[_ ]|$)')
Out[55]:
0
0 fcbk
1 precision
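As an aside (not from either answer above, just a common alternative), a negated character class avoids spelling out the end-of-string case entirely: the greedy match simply stops at the first _ or space, or at the end of the string. A minimal sketch, assuming the same df as above:
# [^_ ]+ matches greedily up to (but not including) the first _ or space,
# and naturally stops at the end of the string as well.
df['value'] = df['PName'].str.extract(r'pb~([^_ ]+)', expand=False)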
Related
Python Pandas Writing Value to a Specific Row & Column in the data frame
I have a Pandas df of Stock Tickers with specific dates, and I want to add the adjusted close for that date using yahoo finance. I iterate through the dataframe, do the yahoo call for that Ticker and Date, and the correct information is returned. However, I am not able to add that information back to the original df. I have tried various loc, iloc, and join methods, and none of them are working for me. The df shows the initialized zero values instead of the new value.
import pandas as pd
import yfinance as yf
from datetime import timedelta

# Build the dataframe
df = pd.DataFrame({'Ticker':['BGFV','META','WIRE','UG'],
                   'Date':['5/18/2021','5/18/2021','4/12/2022','6/3/2019'],
                   })

# Change the Date to Datetime
df['Date'] = pd.to_datetime(df.Date)

# initialize the adjusted close
df['Adj_Close'] = 0.00  # You'll get a column of all 0s

# iterate through the rows of the df and retrieve the Adjusted Close from Yahoo
for i in range(len(df)):
    ticker = df.iloc[i]['Ticker']
    start = df.iloc[i]['Date']
    end = start + timedelta(days=1)
    # YF call
    data = yf.download(ticker, start=start, end=end)
    # Get just the adjusted close
    adj_close = data['Adj Close']
    # Write the adjusted close to the dataframe on the correct row
    df.iloc[i]['Adj_Close'] = adj_close
    print(f'i value is {i} and adjusted close value is {adj_close} \n')

print(df)
The simplest way is to use loc, as below:
# change this line
df.loc[i,'Adj_Close'] = adj_close.values[0]
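For context, the reason the original df.iloc[i]['Adj_Close'] = adj_close assignment silently fails is chained indexing: df.iloc[i] builds a temporary row Series, so the value lands in that copy rather than in the frame. A minimal sketch (toy values, not from the question's data) illustrating the difference:
import pandas as pd

df = pd.DataFrame({'Ticker': ['BGFV', 'META'], 'Adj_Close': [0.0, 0.0]})

# Chained indexing: df.iloc[0] returns a temporary copy of the row,
# so this assignment updates the copy and leaves the frame unchanged.
df.iloc[0]['Adj_Close'] = 27.8
print(df.loc[0, 'Adj_Close'])   # still 0.0

# A single .loc call writes directly into the frame.
df.loc[0, 'Adj_Close'] = 27.8
print(df.loc[0, 'Adj_Close'])   # 27.8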
You can use:
def get_adj_close(x):
    # You needn't specify the end param because period is already set to 1 day
    df = yf.download(x['Ticker'], start=x['Date'], progress=False)
    return df['Adj Close'][0].squeeze()

df['Adj_Close'] = df.apply(get_adj_close, axis=1)
Output:
>>> df
  Ticker       Date   Adj_Close
0   BGFV 2021-05-18   27.808811
1   META 2021-05-18  315.459991
2   WIRE 2022-04-12  104.320045
3     UG 2019-06-03   16.746983
Replacing a string value in Python
I have a column named "status" full of string values, either "legitimate" or "phishing". I'm trying to convert them into a 0 for "legitimate" or a 1 for "phishing". Currently my approach is to replace "legitimate" with a string value of "0" and "phishing" with a string value of "1", then convert the strings "0" and "1" to the int values 0 and 1. With the following code I'm getting the error
TypeError: '(0, status    legitimate Name: 0, dtype: object)' is an invalid key
What am I doing wrong?
df2 = pd.read_csv('dataset_phishing.csv', usecols=[87], dtype=str)
leg = 'legitimate'
phi = 'phishing'
for i in df2.iterrows():
    if df2[i] == leg:
        df2[i].replace('legitimate', '0')
    elif df2[i] == phi:
        df2[i].replace('phishing', '1')
Here iterrows() gives you a tuple, which can't be used as an index; that's why you get that error. Here is a simple solution:
import pandas as pd

df2 = pd.DataFrame([["legitimate"], ["phishing"]], columns=["status"])
leg = 'legitimate'
phi = 'phishing'
for i in range(len(df2)):
    # A single .loc call writes into the frame (chained df2.iloc[i]["status"] = ... would assign to a copy)
    df2.loc[i, "status"] = '1' if df2.loc[i, "status"] == phi else '0'
print(df2)
Here is a more pythonic way to do this:
import pandas as pd
import numpy as np

df2 = pd.DataFrame([["legitimate"], ["phishing"]], columns=["status"])
leg = 'legitimate'
phi = 'phishing'
df2["status"] = np.where(df2["status"] == phi, '1', '0')
print(df2)
Hope this helps you!
Here is another way to do this:
import pandas as pd

data = {'status': ["legitimate", "phishing"]}
df = pd.DataFrame(data)
leg = 'legitimate'
phi = 'phishing'
df.loc[df["status"] == leg, "status"] = 0
df.loc[df["status"] == phi, "status"] = 1
print(df)
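If the eventual goal is the integers 0 and 1 rather than the strings '0' and '1', Series.map with a dict is another common idiom; a minimal sketch, assuming the same toy frame as above:
import pandas as pd

df = pd.DataFrame({'status': ['legitimate', 'phishing']})

# map looks each value up in the dict and yields ints directly;
# any value missing from the dict would become NaN.
df['status'] = df['status'].map({'legitimate': 0, 'phishing': 1})
print(df)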
Altair/Vega-Lite heatmap: Filter top k
I am attempting to create a heatmap and retain only the top 5 samples based on the mean of the relative abundance. I am able to sort the heatmap properly, but I can't figure out how to retain only the top 5, in this case samples c, e, b, y, a. I am pasting a subset of the df with the image. I've tried myriad permutations of the "Top K Items Tutorial" link at the altair-viz website. I'd prefer to use altair to do the filtering if possible, as opposed to filtering the df itself in the python code.
Dataframe:
,Lowest_Taxon,p1,p2,p3,p4,p5,p6,p7
0,a,0.03241281,0.0,0.467738067,3.14456785,0.589519651,13.5744323,0.0
1,b,0.680669,9.315121951,2.848404893,13.99058458,2.139737991,16.60779366,7.574639383
2,c,40.65862829,1.244878049,71.01223315,4.82197541,83.18777293,0.0,0.0
3,d,0.0,0.0,0.0,0.548471137,0.272925764,0.925147183,0.0
4,e,0.090755867,13.81853659,5.205085152,27.75721011,1.703056769,19.6691898,12.27775914
5,f,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,g,0.187994295,0.027317073,0.0,0.0,0.0,0.02242781,0.0
7,h,0.16854661,0.534634146,1.217318302,7.271813154,1.73580786,0.57751612,0.57027843
8,i,0.142616362,2.528780488,1.163348525,0.34279446,0.0,0.0,0.0
9,j,1.711396344,0.694634146,0.251858959,4.273504274,0.087336245,1.721334455,0.899027172
10,k,0.0,1.475121951,0.0,0.0,0.0,5.573310906,0.0
11,l,0.194476857,0.253658537,1.517150396,2.413273002,0.949781659,5.147182506,1.650452868
12,m,0.0,1.736585366,0.0,0.063988299,0.0,8.42724979,0.623951694
13,n,0.0,0.0,0.0,0.0,0.0,0.0,0.0
14,o,4.68689226,0.12097561,0.0,0.0,0.0,0.0,0.0
15,p,0.0,0.885853659,0.0,0.0,0.0,0.913933277,0.046964106
16,q,0.252819914,0.050731707,0.023986568,0.0,0.087336245,0.0,0.0
17,r,0.0,0.0,0.0,0.0,0.0,0.0,0.0
18,s,0.0,0.0,0.0,0.0,0.0,0.0,0.0
19,t,0.0,0.0,0.0,0.0,0.0,0.0,0.0
20,u,0.0,0.058536585,0.089949628,0.356506239,0.0,0.285954584,1.17410265
21,v,0.0,0.0,0.0,0.0,0.0,0.0,0.0
22,w,0.0,0.0,0.0,0.0,0.0,0.0,0.0
23,x,1.471541553,2.396097561,0.593667546,0.278806161,0.065502183,0.280347631,0.952700436
24,y,0.0,0.32,0.0,0.461629873,0.0,7.804878049,18.38980208
25,z,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Code block:
import pandas as pd
import numpy as np
import altair as alt
from vega_datasets import data
from altair_saver import save

# Read in the file and fill empty cells with zero
df = pd.read_excel("path\to\df")

doNotMelt = df.drop(df.iloc[:,1:], axis=1)
df_melted = pd.melt(df, id_vars = doNotMelt, var_name = 'SampleID', value_name = 'Relative_abundance')

# Tell altair to plot as many rows as is necessary
alt.data_transformers.disable_max_rows()

alt.Chart(df_melted).mark_rect().encode(
    alt.X('SampleID:N'),
    alt.Y('Lowest_Taxon:N', sort=alt.EncodingSortField(field='Relative_abundance', op='mean', order='descending')),
    alt.Color('Relative_abundance:Q')
)
If you know what you want to show is the entries with c, e, b, y and a (and it will not change later), you could simply apply a transform_filter on the field Lowest_Taxon. If you want to calculate on the spot which ones make it into the top five, it needs a bit more effort, i.e. a combination of joinaggregate, window and filter transforms. I paste an example of both below. By the way, I converted the original data that you pasted into a csv file which is imported by the code snippets. You can make it easier for others to use your pandas toy data by providing it as a dict, which can then be read directly in the code.
Simple approach:
import pandas as pd
import altair as alt
import numpy as np

alt.data_transformers.disable_max_rows()

df = pd.read_csv('df.csv', index_col=0)
doNotMelt = df.drop(df.iloc[:,1:], axis=1)
df_melted = pd.melt(df, id_vars = doNotMelt, var_name = 'SampleID', value_name = 'Relative_abundance')

alt.Chart(df_melted).mark_rect().encode(
    alt.X('SampleID:N'),
    alt.Y('Lowest_Taxon:N', sort=alt.EncodingSortField(field='Relative_abundance', op='mean', order='descending')),
    alt.Color('Relative_abundance:Q')
).transform_filter(
    alt.FieldOneOfPredicate(field='Lowest_Taxon', oneOf=['c', 'e', 'b', 'y', 'a'])
)
Flexible approach: set n to how many of the top entries you want to see
import pandas as pd
import altair as alt
import numpy as np

alt.data_transformers.disable_max_rows()

df = pd.read_csv('df.csv', index_col=0)
doNotMelt = df.drop(df.iloc[:,1:], axis=1)
df_melted = pd.melt(df, id_vars = doNotMelt, var_name = 'SampleID', value_name = 'Relative_abundance')

n = 5  # number of entries to display

alt.Chart(df_melted).mark_rect().encode(
    alt.X('SampleID:N'),
    alt.Y('Lowest_Taxon:N', sort=alt.EncodingSortField(field='Relative_abundance', op='mean', order='descending')),
    alt.Color('Relative_abundance:Q')
).transform_joinaggregate(
    mean_rel_ab = 'mean(Relative_abundance)',
    count_of_samples = 'valid(Relative_abundance)',
    groupby = ['Lowest_Taxon']
).transform_window(
    rank='rank(mean_rel_ab)',
    sort=[alt.SortField('mean_rel_ab', order='descending')],
    frame = [None, None]
).transform_filter(
    (alt.datum.rank <= (n-1) * alt.datum.count_of_samples + 1)
)
The filter condition works because the window ranks every melted row and rows that tie on mean_rel_ab share a rank, so all rows of the k-th taxon carry rank (k-1) * count_of_samples + 1.
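For comparison only (the question explicitly prefers filtering inside Altair), the same top-n selection can also be done in pandas before plotting; a minimal sketch, assuming the df_melted frame from the snippets above:
n = 5
# Mean relative abundance per taxon, keep the n largest, then filter the melted frame.
top = (df_melted.groupby('Lowest_Taxon')['Relative_abundance']
                .mean()
                .nlargest(n)
                .index)
df_top = df_melted[df_melted['Lowest_Taxon'].isin(top)]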
Why is plot returning "ValueError: could not convert string to float:" when a dataframe column of floats is being passed to the plot function?
I am trying to plot a dataframe I have created from an excel spreadsheet using either matplotlib, or matplotlib and pandas, i.e. df.plot. However, python keeps returning a cannot convert string to float error. This is confusing since when I print the column of the dataframe it appears to be all float values. I've tried printing the values of the dataframe column and using the pandas.plot syntax. I've also tried saving the column to a new variable.
import pandas as pd
from matplotlib import pyplot as plt
import glob
import openpyxl
import math
from openpyxl.utils.dataframe import dataframe_to_rows
from openpyxl.styles import Border, Side, Alignment
import seaborn as sns
import itertools

directory = 'E:\some directory'
#QA_directory = directory + '**/*COPY.xlsx'
wb = openpyxl.load_workbook(directory + '\\Calcs\\' + "excel file.xlsx", data_only = 'True')

plt.figure(figsize=(16,9))
axes = plt.axes()
plt.title('Drag Amplification', fontsize = 16)
plt.xlabel('Time (s)', fontsize = 14)
plt.ylabel('Cf', fontsize = 14)

d = pd.DataFrame()
n=[]

for sheets in wb.sheetnames:
    if '2_1' in sheets and '2%' not in sheets and '44%' not in sheets:
        name = sheets[:8]
        print(name)
        ws = wb[sheets]
        data = ws.values
        cols = next(data)[1:]
        data = list(data)
        idx = [r[0] for r in data]
        data = (itertools.islice(r, 1, None) for r in data)
        df = pd.DataFrame(data, index=idx, columns=cols)
        df = df.dropna()
        #x = df['x/l']
        #y = df.Cf
        print(df.columns)
        print(df.Cf.values)
        x=df['x/l'].values
        plt.plot(x, df.Cf.values)
        """x = [wb[sheets].cell(row=row,column=1).value for row in range(1,2000) if wb[sheets].cell(row=row,column=1).value]
        print(x)
        Cf = [wb[sheets].cell(row=row,column=6).value for row in range(1,2000) if wb[sheets].cell(row=row,column=1).value]
        d[name+ 'x'] = pd.DataFrame(x)
        d[name + '_Cf'] = pd.Series(Cf, index=d.index)
        print(name)"""
        print(df)

plt.show()
I'm expecting a plot of line graphs with the values of x/l on the x axis and Cf on the y, with a line for each of the relevant sheets in the workbook. Any insights as to why I am getting this error would be appreciated!
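As a general debugging note for this kind of error: a column can print like floats while actually holding strings (openpyxl returns cell values as-is, so text-formatted cells come through as str). A minimal sketch for checking and coercing the dtype, assuming the df and Cf column from the question:
# object dtype usually means the cells are strings, not floats
print(df.dtypes)
print(df['Cf'].map(type).value_counts())   # which Python types are actually present

# Coerce to numeric; entries that can't be parsed become NaN instead of raising
df['Cf'] = pd.to_numeric(df['Cf'], errors='coerce')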
pandas dataframe output need to be a string instead of a list
I have a requirement that the result value should be a string, but when I calculate the maximum value of the dataframe it gives the result as a list.
import pandas as pd

def answer_one():
    df_copy = [df['# Summer'].idxmax()]
    return (df_copy)

df = pd.read_csv('olympics.csv', index_col=0, skiprows=1)

for col in df.columns:
    if col[:2]=='01':
        df.rename(columns={col:'Gold'+col[4:]}, inplace=True)
    if col[:2]=='02':
        df.rename(columns={col:'Silver'+col[4:]}, inplace=True)
    if col[:2]=='03':
        df.rename(columns={col:'Bronze'+col[4:]}, inplace=True)
    if col[:1]=='№':
        df.rename(columns={col:'#'+col[1:]}, inplace=True)

names_ids = df.index.str.split('\s\(')

df.index = names_ids.str[0]  # the [0] element is the country name (new index)
df['ID'] = names_ids.str[1].str[:3]  # the [1] element is the abbreviation or ID (take first 3 characters from that)

df = df.drop('Totals')
df.head()

answer_one()
But here answer_one() gives me a list as output and not a string. Can someone help me understand how this can be converted to a string, or how I can get the answer directly from the dataframe as a string? I don't want to convert the list to a string using str(df_copy).
Your first solution would be, as @juanpa.arrivillaga put it, to not wrap it. Your function becomes:
def answer_one():
    df_copy = df['# Summer'].idxmax()
    return (df_copy)

>>> 1
Another thing that you might not be expecting: idxmax() returns the index of the max, so perhaps you want to do:
def answer_one():
    df_copy = df['# Summer'].max()
    return (df_copy)

>>> 30
Since you don't want to do str(df_copy), you can do df_copy.astype(str) instead. Here is how I would write your function:
def get_max_as_string(data, column_name):
    """ Return Max Value from a column as a string."""
    return data[column_name].max().astype(str)

get_max_as_string(df, '# Summer')
>>> '30'