Finding specific words in a pandas column, assigning them to a new column, and replicating the row - python-3.x

I am trying to find specific words in a pandas column and assign them to a new column; a cell may contain two or more of the words. Once a match is found, I wish to replicate the row, creating one row per matched word.
import pandas as pd
import numpy as np
import re
wizard = pd.read_excel(r'C:\Python\L\Book1.xlsx',
                       sheet_name='Sheet1',
                       header=0)
test_set = {'941', '942'}
test_set2 = {'MN', 'OK', '33/3305'}
wizard['ZTYPE'] = wizard['Comment'].apply(lambda x: any(i in test_set for i in x.split()))
wizard['ZJURIS'] = wizard['Comment'].apply(lambda x: any(i in test_set2 for i in x.split()))
wizard_new = pd.DataFrame(np.repeat(wizard.values, 3, axis=0))
wizard_new.columns = wizard.columns
wizard_new.head()
I am getting True and False values; however, I am unable to split out the matched words themselves.
The sample data above shows the shape of the input. I need to find tokens like '33/3305'; the year could be entered as '19' or '2019', the quarter as 'Q1', '1Q', 'Q 1' or '1 Q', plus the entries in my test sets.
import itertools

# NOTE: wizard.comment() is not a valid pandas call; the source of the
# (category, keyword-list) pairs this line expects is unclear as posted.
ZJURIS = dict(list(itertools.chain(*[[(y_, x) for y_ in y] for x, y in wizard.comment()])))

def to_category(x):
    for w in x.lower().split(" "):
        if w in ZJURIS:
            return ZJURIS[w]
    return None
Finally, apply the method on the column and save the result to a new one:
wizard["ZJURIS"] = wizard["comment"].apply(to_category)
I tried the above solution well it did not
Any suggestions how to do I get the code to work.
Sample data.
data = {'ID': ['351362278576', '351539320880', '351582465214', '351609744560', '351708198604'],
        'BU': ['SBS', 'MAS', 'NAS', 'ET', 'SBS'],
        'Comment': ['940/941/w2-W3NYSIT/SUI33/3305/2019/1q',
                    'OK SUI 2Q19',
                    '941 - 3Q2019NJ SIT - 3Q2019NJ SUI/SDI - 3Q2019',
                    'IL,SUI,2016Q4,2017Q1,2017Q2',
                    '1Q2019 PA 39/5659 39/2476']}
df = pd.DataFrame(data)  # all lists must have the same length
Based on the sample data set, the attached image shows the expected output.

Related

Altair/Vega-Lite heatmap: Filter top k

I am attempting to create a heatmap and retain only the top 5 samples based on the mean of the relative abundance. I am able to sort the heatmap properly, but I can't figure out how to retain only the top 5, in this case samples c, e, b, y, a. I am pasting a subset of the df with the image. I've tried myriad permutations of the "Top K Items Tutorial" link at the altair-viz website. I'd prefer to use altair to do the filtering if possible, as opposed to filtering the df itself in the python code.
Dataframe:
,Lowest_Taxon,p1,p2,p3,p4,p5,p6,p7
0,a,0.03241281,0.0,0.467738067,3.14456785,0.589519651,13.5744323,0.0
1,b,0.680669,9.315121951,2.848404893,13.99058458,2.139737991,16.60779366,7.574639383
2,c,40.65862829,1.244878049,71.01223315,4.82197541,83.18777293,0.0,0.0
3,d,0.0,0.0,0.0,0.548471137,0.272925764,0.925147183,0.0
4,e,0.090755867,13.81853659,5.205085152,27.75721011,1.703056769,19.6691898,12.27775914
5,f,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,g,0.187994295,0.027317073,0.0,0.0,0.0,0.02242781,0.0
7,h,0.16854661,0.534634146,1.217318302,7.271813154,1.73580786,0.57751612,0.57027843
8,i,0.142616362,2.528780488,1.163348525,0.34279446,0.0,0.0,0.0
9,j,1.711396344,0.694634146,0.251858959,4.273504274,0.087336245,1.721334455,0.899027172
10,k,0.0,1.475121951,0.0,0.0,0.0,5.573310906,0.0
11,l,0.194476857,0.253658537,1.517150396,2.413273002,0.949781659,5.147182506,1.650452868
12,m,0.0,1.736585366,0.0,0.063988299,0.0,8.42724979,0.623951694
13,n,0.0,0.0,0.0,0.0,0.0,0.0,0.0
14,o,4.68689226,0.12097561,0.0,0.0,0.0,0.0,0.0
15,p,0.0,0.885853659,0.0,0.0,0.0,0.913933277,0.046964106
16,q,0.252819914,0.050731707,0.023986568,0.0,0.087336245,0.0,0.0
17,r,0.0,0.0,0.0,0.0,0.0,0.0,0.0
18,s,0.0,0.0,0.0,0.0,0.0,0.0,0.0
19,t,0.0,0.0,0.0,0.0,0.0,0.0,0.0
20,u,0.0,0.058536585,0.089949628,0.356506239,0.0,0.285954584,1.17410265
21,v,0.0,0.0,0.0,0.0,0.0,0.0,0.0
22,w,0.0,0.0,0.0,0.0,0.0,0.0,0.0
23,x,1.471541553,2.396097561,0.593667546,0.278806161,0.065502183,0.280347631,0.952700436
24,y,0.0,0.32,0.0,0.461629873,0.0,7.804878049,18.38980208
25,z,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Code block:
import pandas as pd
import numpy as np
import altair as alt
from vega_datasets import data
from altair_saver import save
# Read in the file and fill empty cells with zero
df = pd.read_excel(r"path\to\df").fillna(0)  # raw string so backslashes are not escapes
doNotMelt = df.drop(df.iloc[:, 1:], axis=1)  # keep only the identifier column(s)
df_melted = pd.melt(df, id_vars=doNotMelt, var_name='SampleID', value_name='Relative_abundance')

# Tell altair to plot as many rows as is necessary
alt.data_transformers.disable_max_rows()

alt.Chart(df_melted).mark_rect().encode(
    alt.X('SampleID:N'),
    alt.Y('Lowest_Taxon:N', sort=alt.EncodingSortField(field='Relative_abundance', op='mean', order='descending')),
    alt.Color('Relative_abundance:Q')
)
If you know what you want to show is the entries with c, e, b, y and a (and it will not change later) you could simply apply a transform_filter on the field Lowest_Taxon.
If you want to calculate on the spot which ones make it into the top five, it needs a bit more effort, i.e. a combination of joinaggregate, window and filter transforms.
For both I paste an example below. By the way, I converted the original data you pasted into a csv file, which is imported by the code snippets. You can make it easier for others to use your pandas toy data by providing it as a dict, which can then be read directly in the code.
Simple approach:
import pandas as pd
import altair as alt
import numpy as np

alt.data_transformers.disable_max_rows()

df = pd.read_csv('df.csv', index_col=0)
doNotMelt = df.drop(df.iloc[:, 1:], axis=1)
df_melted = pd.melt(df, id_vars=doNotMelt, var_name='SampleID', value_name='Relative_abundance')

alt.Chart(df_melted).mark_rect().encode(
    alt.X('SampleID:N'),
    alt.Y('Lowest_Taxon:N', sort=alt.EncodingSortField(field='Relative_abundance', op='mean', order='descending')),
    alt.Color('Relative_abundance:Q')
).transform_filter(
    alt.FieldOneOfPredicate(field='Lowest_Taxon', oneOf=['c', 'e', 'b', 'y', 'a'])
)
Flexible approach:
Set n to however many of the top entries you want to see.
import pandas as pd
import altair as alt
import numpy as np

alt.data_transformers.disable_max_rows()

df = pd.read_csv('df.csv', index_col=0)
doNotMelt = df.drop(df.iloc[:, 1:], axis=1)
df_melted = pd.melt(df, id_vars=doNotMelt, var_name='SampleID', value_name='Relative_abundance')

n = 5  # number of entries to display

alt.Chart(df_melted).mark_rect().encode(
    alt.X('SampleID:N'),
    alt.Y('Lowest_Taxon:N', sort=alt.EncodingSortField(field='Relative_abundance', op='mean', order='descending')),
    alt.Color('Relative_abundance:Q')
).transform_joinaggregate(
    mean_rel_ab='mean(Relative_abundance)',
    count_of_samples='valid(Relative_abundance)',
    groupby=['Lowest_Taxon']
).transform_window(
    rank='rank(mean_rel_ab)',
    sort=[alt.SortField('mean_rel_ab', order='descending')],
    frame=[None, None]
).transform_filter(
    alt.datum.rank <= (n - 1) * alt.datum.count_of_samples + 1
)
The final filter keeps the top n taxa: every taxon appears once per sample, so tied ranks advance in steps of count_of_samples, and the n-th taxon therefore has rank (n - 1) * count_of_samples + 1.

Is there a loop function for automating data pull request from Google Trends?

I want to create a loop that helps me pull data from Google Trends via PyTrends. I need to iterate through a lot of keywords, but Google Trends allows comparing only five keywords at a time, hence I iterate through the keywords manually and build a dataframe in pandas. However, something seems to be off.
I get data, but pandas builds the dataframe with values shifted into different rows and duplicate NaN values.
Instead of 62 rows I get 372 rows (with duplicate NaNs).
from pytrends.request import TrendReq
import pandas as pd

pytrend = TrendReq()
kw_list = ['cool', 'fun', 'big', 'house', 'phone', 'garden']
df1 = pd.DataFrame()
for i in kw_list:
    kw_list = i
    pytrend.build_payload([kw_list], timeframe='2015-10-14 2015-12-14', geo='FR')
    df1 = df1.append(pytrend.interest_over_time())
print(df1.head())
I want one coherent dataframe with the columns 'cool', 'fun', 'big', 'house', 'phone', 'garden' and their respective values aligned on the same rows, e.g. a dataframe with 62 rows and 6 columns.
I'm probably going to lose some rep points because this is an old question (oldish, anyway), but I was struggling with the same problem and I solved it like this:
import pandas as pd
from pytrends.request import TrendReq

pytrend = TrendReq()
kw_list = ['cool', 'fun', 'big', 'house', 'phone', 'garden']

df_gtrends_kw = {}
for kw in kw_list:
    pytrend.build_payload(kw_list=[kw], timeframe='today 12-m')
    df_gtrends_kw[kw] = pytrend.interest_by_region(resolution='COUNTRY')

df_gtrends = pd.concat([df_gtrends_kw[key] for key in kw_list], join='inner', axis=1)
According to the official docs, one has to specify the axis along which to glue the dataframes; here that is axis=1 (the columns), since the index is the same for each per-keyword dataframe.

Python3 - using pandas to group rows, where two columns contain values in forward or reverse order: v1,v2 or v2,v1

I'm fairly new to python and pandas, but I've written code that reads an excel workbook, and groups rows based on the values contained in two columns.
So where Col_1=A and Col_2=B, or Col_1=B and Col_2=A, both would be assigned a GroupID=1.
sample spreadsheet data, with rows color coded for ease of visibility
I've managed to get this working, but I wanted to know if there's a simpler/more efficient/cleaner/less clunky way to do this.
import pandas as pd

df = pd.read_excel('test.xlsx')

# get column values into a list
col_group = df.groupby(['Header_2', 'Header_3'])
original_list = list(col_group.groups)

# parse list to remove 'reverse-duplicates'
new_list = []
for a, b in original_list:
    if (b, a) not in new_list:
        new_list.append((a, b))

# iterate through each row in the DataFrame and check whether the values
# in new_list exist, in forward or reverse order
for index, row in df.iterrows():
    for a, b in new_list:
        # if the values exist in the forward direction
        if (a in df.loc[index, "Header_2"]) and (b in df.loc[index, "Header_3"]):
            # GroupID is the position of the pair in new_list
            df.loc[index, "GroupID"] = new_list.index((a, b)) + 1
        # else check if the values exist in the reverse direction
        if (b in df.loc[index, "Header_2"]) and (a in df.loc[index, "Header_3"]):
            df.loc[index, "GroupID"] = new_list.index((a, b)) + 1

# Finally write the DataFrame to a new spreadsheet
writer = pd.ExcelWriter('output.xlsx')
df.to_excel(writer, 'Sheet1')
writer.save()
I know of the pandas.groupby([columnA, columnB]) option, but I couldn't figure out a way to create groups that contain both (v1, v2) and (v2, v1).
A boolean mask should do the trick:
import pandas as pd

df = pd.read_excel('test.xlsx')

mask = ((df['Header_2'] == 'A') & (df['Header_3'] == 'B') |
        (df['Header_2'] == 'B') & (df['Header_3'] == 'A'))

# Label each row in the original DataFrame with
# 1 if it matches the specified criteria, and
# 0 if it does not.
# This column can now be used in groupby operations.
df.loc[:, 'match_flag'] = mask.astype(int)

# Get rows that match the criteria
df[mask]

# Get rows that do not match the criteria
df[~mask]
EDIT: updated answer to address the groupby requirement.
I would do something like this.
import pandas as pd

df = pd.read_excel('test.xlsx')

# make the ordering consistent
df["group1"] = df[["Header_2", "Header_3"]].max(axis=1)
df["group2"] = df[["Header_2", "Header_3"]].min(axis=1)

# group them together
df = df.sort_values(by=["group1", "group2"])
If you need to deal with more than two columns, I can write up a more general way to do this.

pandas dataframe output needs to be a string instead of a list

I have a requirement that the result value should be a string, but when I calculate the maximum value of the dataframe I get the result as a list.
import pandas as pd

def answer_one():
    df_copy = [df['# Summer'].idxmax()]
    return df_copy

df = pd.read_csv('olympics.csv', index_col=0, skiprows=1)

for col in df.columns:
    if col[:2] == '01':
        df.rename(columns={col: 'Gold' + col[4:]}, inplace=True)
    if col[:2] == '02':
        df.rename(columns={col: 'Silver' + col[4:]}, inplace=True)
    if col[:2] == '03':
        df.rename(columns={col: 'Bronze' + col[4:]}, inplace=True)
    if col[:1] == '№':
        df.rename(columns={col: '#' + col[1:]}, inplace=True)

names_ids = df.index.str.split(r'\s\(')

df.index = names_ids.str[0]  # the [0] element is the country name (new index)
df['ID'] = names_ids.str[1].str[:3]  # the [1] element is the abbreviation or ID (first 3 characters)

df = df.drop('Totals')
df.head()
answer_one()
But here answer_one() gives me a list as output and not a string. Can someone help me understand how this can be converted to a string, or how I can get the answer directly from the dataframe as a string? I don't want to convert the list to a string using str(df_copy).
Your first solution would be, as @juanpa.arrivillaga put it, to not wrap it in a list. Your function becomes:
def answer_one():
    df_copy = df['# Summer'].idxmax()
    return df_copy
>>> 1
Another thing you might not be expecting: idxmax() returns the index of the max, so perhaps you want the max itself:
def answer_one():
    df_copy = df['# Summer'].max()
    return df_copy
>>> 30
Since you don't want to do str(df_copy) you can do df_copy.astype(str) instead.
Here is how I would write your function:
def get_max_as_string(data, column_name):
    """Return the max value from a column as a string."""
    return data[column_name].max().astype(str)
get_max_as_string(df, '# Summer')
>>> '30'

Python3, with pandas.DataFrame, how to select certain data by some rules

I have a pandas DataFrame, and I want to select certain data by some rules.
The following code generates the dataframe:
import datetime
import pandas as pd
import numpy as np

today = datetime.date.today()
dates = list()
for k in range(10):
    a_day = today - datetime.timedelta(days=k)
    dates.append(np.datetime64(a_day))

np.random.seed(5)
df = pd.DataFrame(np.random.randint(100, size=(10, 3)),
                  columns=('other1', 'actual', 'other2'),
                  index=['{}'.format(i) for i in range(10)])
df.insert(0, 'dates', dates)
df['err_m'] = np.random.rand(10) * 0.1   # 1-D arrays assign cleanly to columns
df['std'] = np.random.rand(10) * 0.05
df['gain'] = np.random.rand(10)
Now, I want to select by the following rules:
1. Compute the sum of 'err_m' and 'std', then sort the df so that the sum is descending.
2. From the result of step 1, select the rows where 'actual' is > 50.
Thanks
Create a new column and then sort by this one:
df['errsum'] = df['err_m'] + df['std']

# Return a sorted dataframe
df_sorted = df.sort_values('errsum', ascending=False)
Then select the lines you want:
# Create a boolean Series with True where the condition is met
selector = df_sorted['actual'] > 50

# Return a view of the sorted dataframe with only the lines you want
df_sorted[selector]
