How to Put Ages into intervals - python-3.x

I have a list of ages in an existing dataframe. I would like to put these ages into intervals/age groups such as (10-20), (20-30), etc. Please see the example below.
I am unsure where to begin coding this, as I get a "bins" error when using any bins-related code.

Here's what you can do:
import pandas as pd

def checkAgeRange(age):
    las_dig = age % 10
    range_age = str.format('{0}-{1}', age - las_dig, (age - las_dig) + 10)
    return range_age

d = {'AGE': [19, 13, 45, 65, 23, 12, 28]}
dataFrame = pd.DataFrame(data=d)
dataFrame['AgeGroup'] = dataFrame['AGE'].apply(checkAgeRange)
print(dataFrame)
# Output:
   AGE AgeGroup
0   19    10-20
1   13    10-20
2   45    40-50
3   65    60-70
4   23    20-30
5   12    10-20
6   28    20-30
Some explanation of the code above:
d = {'AGE': [19, 13, 45, 65, 23, 12, 28]}
dataFrame = pd.DataFrame(data=d)
# Build a simple dataframe here
dataFrame['AgeGroup'] = dataFrame['AGE'].apply(checkAgeRange)
# Apply our checkAgeRange function here
def checkAgeRange(age):
    las_dig = age % 10
    range_age = str.format('{0}-{1}', age - las_dig, (age - las_dig) + 10)
    return range_age
# This function extracts the last digit from age and then forms the range as a string.
# You can change the data structure here according to your needs.
Hope this answers your question. Cheers!
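Since the question mentions hitting errors with bins-related code, here is a minimal pd.cut sketch as an alternative; the bin edges and labels below are assumptions, so adjust them to your own age range:
import pandas as pd

d = {'AGE': [19, 13, 45, 65, 23, 12, 28]}
dataFrame = pd.DataFrame(data=d)

# pd.cut needs explicit bin edges; right=False makes the intervals [10, 20), [20, 30), ...
bins = [10, 20, 30, 40, 50, 60, 70]          # assumed edges covering the sample ages
labels = ['10-20', '20-30', '30-40', '40-50', '50-60', '60-70']
dataFrame['AgeGroup'] = pd.cut(dataFrame['AGE'], bins=bins, labels=labels, right=False)
print(dataFrame)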

Related

PerformanceWarning: DataFrame is highly fragmented. How to convert it to a more efficient form via pd.concat with a designated column name

I got the following warning while running under Python 3.8 with the newest pandas.
PerformanceWarning: DataFrame is highly fragmented.
This is the place where I compile my data into a single dataframe, and it is also where the problem pops up.
def get_all_score():
    df = pd.DataFrame()
    for name, code in get_code().items():
        global count
        count += 1
        print("ticker:" + name, "trade_code:" + code, "The {} data updated".format(count))
        try:
            df[name] = indicator_score(code)['total']
            time.sleep(0.33334)
        except:
            continue
    return df
I tried to look it up in the forum, but I can't figure out how to handle the two variables: df[name] is my column name, and indicator_score(code)['total'] is my column's output data. All the fragmented dataframes are added horizontally, as shown below:
    a   b   c  <<<  zz
1  30  40  10       21
2  41  50  11       33
3  44  66  20       29
4  51  71  19       10
5  31  88  31       60
6  60  95  40       70
...
What would be a neat way to use pd.concat() to solve my issue? Thanks.
This is my workaround for the issue, but it doesn't seem that reliable; one little glitch can totally ruin the whole process. Here is my code:
def get_all_score():
    df = pd.DataFrame()
    name_list = []
    for name, code in get_code().items():
        global count
        count += 1
        print("ticker:" + name, "trade_code:" + code, "The {} data updated".format(count))
        try:
            name_list.append(name)
            df = pd.concat([df, indicator_score(code)['总分']], axis=1)
            # df[name] = indicator_score(code)['总分']
            # time.sleep(0.33334)
        except:
            name_list.remove(name)
            continue
    df.columns = name_list
    return df
I tried to set name as the column name before the concat step, but I failed to do so; I only figured out how to rename the column after the concat. This is such a pain. Does anyone have a better way to do this?
df[name] = indicator_score(code)['总分'].copy()
should solve your poor performance issue, I suppose. Give it a try, mate.
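If the .copy() alone doesn't help, another common way to avoid the fragmentation warning is to collect each column first and call pd.concat only once at the end. A rough sketch, reusing the same assumed helpers (get_code, indicator_score, count) from the question:
def get_all_score():
    columns = {}
    for name, code in get_code().items():
        global count
        count += 1
        print("ticker:" + name, "trade_code:" + code, "The {} data updated".format(count))
        try:
            # collect each Series instead of inserting columns into df one by one
            columns[name] = indicator_score(code)['总分']
            time.sleep(0.33334)
        except Exception:
            continue
    # a single concat builds the frame in one step; the dict keys become the column names
    return pd.concat(columns, axis=1)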

Fastest way to detect and append duplicates based on a specific column in dataframe

Here is some sample data:
name        age  gender  school
Michael Z    21  Male    Lasalle
Lisa M       22  Female  Ateneo
James T      21  Male    UP
Michael Z.   23  Male    TUP
Here is the expected output I need:
name        age  gender  similar name  on_lasalle  on_ateneo  on_up  on_tup
Michael Z    21  Male    Michael Z.    True        False      False  True
Lisa M       22  Female                False       True       False  False
James T      21  Male                  False       False      True   False
I've been trying to use fuzzywuzzy in my Python script. The data I am getting comes from BigQuery; I then convert it to a dataframe to clean some stuff. After that, I convert the dataframe to a list of dictionaries.
Notice in the data above that Michael Z. from TUP was appended to Michael Z from Lasalle, since their names are similar with a 100% similarity rate using fuzz.token_set_ratio.
What I want is to get all similar rows based on names and append them to the current dictionary we are looking at (including their school).
Here is the code and the loop to get similar rows based on names:
data_dict_list = data_df.to_dict('records')

for x in range(0, len(data_dict_list)):
    for y in range(x, len(data_dict_list)):
        if not data_dict_list[x]['is_duplicate']:
            similarity = fuzz.token_set_ratio(data_dict_list[x]['name'], data_dict_list[y]['name'])
            if similarity >= 90:
                data_dict_list[x]['similar_names'].update({'similar_name': data_dict_list[y]['name']})
                ...
                data_dict_list[x]['is_duplicate'] = True
The runtime of this script is very slow, as sometimes I am getting 100,000+ records! So it will loop through all of that data.
How will I be able to speed this process up?
Suggestions using pandas are much appreciated, as I am having a hard time figuring out how to loop over data in it.
As a first step you can simply replace the import of fuzzywuzzy with rapidfuzz:
from rapidfuzz import fuzz
which should already improve the performance quite a bit. You can further improve the performance by comparing complete lists of strings in rapidfuzz in the following way:
>>> import pandas as pd
>>> from rapidfuzz import process, fuzz
>>> df = pd.DataFrame(data={'name': ['test', 'tests']})
>>> process.cdist(df['name'], df['name'], scorer=fuzz.token_set_ratio, score_cutoff=90)
array([[100,   0],
       [  0, 100]], dtype=uint8)
which returns a matrix of results where all elements with a score below 90 are set to 0. For large datasets you can enable multithreading using the workers argument:
process.cdist(df['name'], df['name'], workers=-1, scorer=fuzz.token_set_ratio, score_cutoff=90)
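To turn that score matrix into duplicate pairs, one option is to take the indices of the above-threshold matches off the diagonal. A rough sketch, with the column names assumed to match the sample data above:
import numpy as np
import pandas as pd
from rapidfuzz import process, fuzz

df = pd.DataFrame({'name': ['Michael Z', 'Lisa M', 'James T', 'Michael Z.'],
                   'school': ['Lasalle', 'Ateneo', 'UP', 'TUP']})

scores = process.cdist(df['name'], df['name'], scorer=fuzz.token_set_ratio,
                       score_cutoff=90, workers=-1)

# keep only the upper triangle so each pair is reported once and the diagonal is ignored
matches = np.argwhere(np.triu(scores, k=1) >= 90)
for i, j in matches:
    print(df.loc[i, 'name'], 'is similar to', df.loc[j, 'name'], '(' + df.loc[j, 'school'] + ')')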

df.mean() / jupyter / pandas alternating axis for output

I haven't posted many questions, but I have found a very strange behavior causing alternating output. I'm hoping someone can help shed some light on this.
I am using jupyter and I am creating some data like this:
# Use the following data for this assignment:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
np.random.seed(12345)
df = pd.DataFrame([np.random.normal(32000, 200000, 3650),
                   np.random.normal(43000, 100000, 3650),
                   np.random.normal(43500, 140000, 3650),
                   np.random.normal(48000, 70000, 3650)],
                  index=[1992, 1993, 1994, 1995])
df
Now in the next cell I have a couple of lines to get the transpose of the df and then compute the mean and standard deviations. However, when I run this cell multiple times, it seems that I am getting different output from .mean().
df = df.T
values = df.mean(axis=0)
std = df.std(axis=0)
values
I am using Shift+Enter to run this second cell, and this is what I get:
1992 33312.107476
1993 41861.859541
1994 39493.304941
1995 47743.550969
dtype: float64
And when I run the cell again using Shift+Enter (output truncated, but you should get the idea):
0 5447.716574
1 126449.084350
2 41091.469083
3 -61754.197831
4 223744.364842
5 94746.779056
6 57607.078825
7 109812.089923
8 28283.060354
9 69768.157194
10 32952.030326
11 40222.026635
12 64786.632304
13 17025.266684
14 111334.168830
15 96067.788206
16 -68157.985363
I have tried changing the axis parameter and removing the axis parameter, but the output remains the same.
Here is a screenshot in case anyone is interested in duplicating what I have done:
Jupyter window on my end
Thanks for reading.
Your problem is that in your second cell you are re-assigning df to be df.T, so every time you run it, it transposes your dataframe again. So don't use df = df.T; just say this instead:
values = df.T.mean(axis=0)
std = df.T.std(axis=0)
Or even better, skip the transpose entirely and use axis=1, which computes the statistics across the columns of each row:
values = df.mean(axis=1)
std = df.std(axis=1)
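To see why the output alternates: df = df.T flips the frame every time the cell runs, so every second run you are back to the original orientation and mean() is taken over a different axis of the data. A minimal illustration, starting from the df as created in the first cell:
df = df.T                       # first run: years become columns
print(df.mean(axis=0).shape)    # (4,) -> one mean per year
df = df.T                       # second run: back to the original shape
print(df.mean(axis=0).shape)    # (3650,) -> one mean per column, as in the second output above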
You can use describe:
df.T.describe()
Out[267]:
                1992           1993           1994           1995
count    3650.000000    3650.000000    3650.000000    3650.000000
mean    34922.760627   41574.363827   43186.197526   49355.777683
std    200618.445749   98495.601455  140639.407130   70408.448642
min   -632057.636640 -292484.131067 -435217.159232 -181304.694667
25%    -98715.272565  -24771.835741  -49460.639563    -973.422386
50%     34446.219184   41474.621854   43323.557410   49281.270881
75%    170722.706967  107502.446843  136286.933017   97422.070284
max    714855.084396  453834.306915  516751.566696  295427.273677

Identifying decrease in values in Spark (outliers)

I have a large data set with millions of records, which looks something like this:
Movie  Likes  Comments  Shares  Views
A        100        10      20     30
A        102        11      22     35
A        104        12      25     45
A      *103*        13    *24*     50
B        200        10      20     30
B        205       *9*      21     35
B      *203*        12      29     42
B        210        13    *23*   *39*
Likes, comments, etc. are rolling totals and they are supposed to increase. If there is a drop in any of these for a movie, then it is bad data that needs to be identified (the starred values above).
My initial thought was to group by movie and then sort within each group. I am using dataframes in Spark 1.6 for processing, and this does not seem achievable, as there is no sorting within the grouped data in a dataframe.
Building something for outlier detection could be another approach, but because of time constraints I have not explored it yet.
Is there any way I can achieve this?
Thanks!
You can use the lag window function to bring the previous values into scope:
import org.apache.spark.sql.expressions.Window

val windowSpec = Window.partitionBy('Movie).orderBy('maybesometemporalfield)

dataset.withColumn("lag_likes", lag('Likes, 1) over windowSpec)
       .withColumn("lag_comments", lag('Comments, 1) over windowSpec)
       .show
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-sql-functions.html#lag
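Since the question mentions working with dataframes in Spark, here is a rough PySpark sketch of the same lag idea; the ordering column is a placeholder just like in the Scala snippet, and the two bad_* columns simply flag rows where the rolling total dropped:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# 'maybesometemporalfield' stands in for whatever column defines the row order
window_spec = Window.partitionBy('Movie').orderBy('maybesometemporalfield')

flagged = (dataset
           .withColumn('lag_likes', F.lag('Likes', 1).over(window_spec))
           .withColumn('lag_comments', F.lag('Comments', 1).over(window_spec))
           .withColumn('bad_likes', F.col('Likes') < F.col('lag_likes'))
           .withColumn('bad_comments', F.col('Comments') < F.col('lag_comments')))

flagged.show()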
Another approach would be to assign a row number (if there isn't one already), lag that column, then join each row to its previous row to allow you to do the comparison.
HTH

Reorder Rows into Columns in Pandas (Python 3, Pandas)

Right now, my code takes scraped web data from a file (BigramCounter.txt), and then finds all the bigrams within that file so that the data looks like this:
Counter({('the', 'first'): 45, ('on', 'purchases'): 42, ('cash', 'back'): 39})
After this, I try to feed it into a pandas DataFrame where it spits this df out:
     the         on  cash
   first  purchases  back
0     45         42    39
This is very close to what I need, but not quite. First off, the df does not pick up my attempt to name the columns. Furthermore, I was hoping for something formatted more like this, where there are two columns and the words are not split between cells:
Words          Frequency
the first             45
on purchases          42
cash back             39
For reference, here is my code. I think I may need to reorder an axis somewhere, but I'm not sure how. Any ideas?
import re
from collections import Counter

import pandas as pd

main_c = Counter()
words = re.findall(r'\w+', open('BigramCounter.txt', encoding='utf-8').read())
bigrams = Counter(zip(words, words[1:]))
main_c.update(bigrams)  # at this point it looks like Counter({('the', 'first'): 45, etc...})

comm = [[k, v] for k, v in main_c.items()]
frame = pd.DataFrame(comm)
frame.columns = ['Word', 'Frequency']
frame2 = frame.unstack()
frame2.to_csv('text.csv')
I think I see what you're going for, and there are many ways to get there. You were really close. My first inclination would be to use a series, especially since you'd (presumably) just be getting rid of the df index when you write to csv, but it doesn't make a huge difference.
frequencies = [[" ".join(k), v] for k,v in main_c.items()]
pd.DataFrame(frequencies, columns=['Word', 'Frequency'])
           Word  Frequency
0     the first         45
1     cash back         39
2  on purchases         42
If, as I suspect, you want Word to be the index, add frame.set_index('Word'):
              Frequency
Word
the first            45
cash back            39
on purchases         42
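If you'd rather go the Series route mentioned at the start of this answer, a minimal sketch built from the same main_c Counter as above:
# Build a Series keyed by the joined bigram; the dict keys become the index
freq = pd.Series({' '.join(k): v for k, v in main_c.items()}, name='Frequency')
freq.index.name = 'Words'
freq.sort_values(ascending=False).to_csv('text.csv')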
