How to understand Python data frames better - python-3.x

Being a beginner in Python, I often face this problem: say I am working with a data frame and want to run an operation on one of its columns. It can be as simple as removing the decimal point from a value, or extracting the month from a date column. But the solutions I find online are generally shown for a single value or data point, like this:
a = 11.0
int(a)
11
Now, the same solution can't be applied if I have a data frame or a column. Likewise, if I want to combine a time with a date:
from datetime import date, datetime
d = date.today()
d
datetime.date(2018, 3, 30)
datetime.combine(d, datetime.min.time())
datetime.datetime(2018, 3, 30, 0, 0)
In the same manner, this solution cannot be used on a data frame; it will throw an error. Obviously I have a gap in my knowledge here, since I am not able to make these work on data frames. Can you please point me towards the topic that might help me understand these problems in terms of data frames, or maybe show an example of how it's done?

You should have a look at the pandas library to manipulate dataframes: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html
Here is an example that applies a function to each value of a given column:
import pandas as pd

def myFunction(a_string):
    return a_string.upper()

data = pd.read_csv('data.csv')
print(data)
data['City'] = data['City'].apply(myFunction)
print(data)
Data at the beginning:
Name City Age
Robert Paris 32
Max Dallas 24
Raj Delhi 27
Data after:
Name City Age
Robert PARIS 32
Max DALLAS 24
Raj DELHI 27
Here myFunction uppercases the string, but the same pattern works for other kinds of operations.
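To tie this back to the examples in the question: those scalar operations translate to whole columns through pandas' vectorized methods. A minimal sketch (the column names 'price' and 'date' are made up for illustration):
import pandas as pd

df = pd.DataFrame({'price': [11.0, 12.5], 'date': ['2018-03-30', '2018-04-02']})
df['price'] = df['price'].astype(int)              # drops the decimal part of every value
df['month'] = pd.to_datetime(df['date']).dt.month  # pulls the month out of the date column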
Hope that helps.

Related

Fastest way to detect and append duplicates based on a specific column in dataframe

Here is the sample data:
name age gender school
Michael Z 21 Male Lasalle
Lisa M 22 Female Ateneo
James T 21 Male UP
Michael Z. 23 Male TUP
Here is the expected output I need:
name age gender similar name on_lasalle on_ateneo on_up on_tup
Michael Z 21 Male Michael Z. True False False True
Lisa M 22 Female False True False False
James T 21 Male False False True False
I've been trying to use fuzzywuzzy in my Python script. The data I am getting comes from BigQuery, and I am converting it to a dataframe to clean some stuff. After that, I am converting the dataframe to a list of dictionaries.
Notice in the data above that Michael Z. from TUP was appended to Michael Z from the school Lasalle, since their names are similar with a 100% similarity rate using fuzz.token_set_ratio.
What I want is to get all similar rows based on names and append them to the current dictionary we are looking at (including their school).
Here is the code and the loop to get similar rows based on names:
data_dict_list = data_df.to_dict('records')
for x in range(0, len(data_dict_list)):
    for y in range(x, len(data_dict_list)):
        if not data_dict_list[x]['is_duplicate']:
            similarity = fuzz.token_set_ratio(data_dict_list[x]['name'], data_dict_list[y]['name'])
            if similarity >= 90:
                data_dict_list[x]['similar_names'].update({'similar_name': data_dict_list[y]['name']})
                ...
                data_dict_list[x]['is_duplicate'] = True
The runtime of this script is very slow: sometimes I am getting 100,000+ rows, and it loops through all of that data.
How will I be able to speed this process up?
Suggestions using pandas are much appreciated, as I am having a hard time figuring out how to loop over data in it.
As a first step you can simply replace the import of fuzzywuzzy with rapidfuzz:
from rapidfuzz import fuzz
which should already improve the performance quite a bit. You can improve the performance further by comparing complete lists of strings in rapidfuzz in the following way:
>>> import pandas as pd
>>> from rapidfuzz import process, fuzz
>>> df = pd.DataFrame(data={'name': ['test', 'tests']})
>>> process.cdist(df['name'], df['name'], scorer=fuzz.token_set_ratio, score_cutoff=90)
array([[100,   0],
       [  0, 100]], dtype=uint8)
which returns a matrix of results in which all elements with a score below 90 are set to 0. For large datasets you can enable multithreading using the workers argument:
process.cdist(df['name'], df['name'], workers=-1, scorer=fuzz.token_set_ratio, score_cutoff=90)
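From there, a minimal sketch of how the score matrix could be turned into duplicate pairs (numpy is assumed; taking the upper triangle skips self-matches and mirrored pairs):
import numpy as np

scores = process.cdist(df['name'], df['name'], workers=-1,
                       scorer=fuzz.token_set_ratio, score_cutoff=90)
rows, cols = np.nonzero(np.triu(scores, k=1))  # keep only pairs with i < j
for i, j in zip(rows, cols):
    print(df['name'][i], '~', df['name'][j], scores[i, j])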

Transform dataframe from wide to long while breaking apart column names

I have a dataframe that looks like this:
CustomerID CustomerStatus CustomerTier Order.Blue.Good Order.Green.Bad Order.Red.Good
----------------------------------------------------------------------------------------------------------------
101 ACTIVE PREMIUM NoticeABC: Good 5 NoticeYAF: Bad 1 NoticeAFV: Good 4
102 INACTIVE DIAMOND NoticeTAC: Bad 3
I'm trying to transform it to look like this:
CustomerID CustomerStatus CustomerTier Color Outcome NoticeCode NoticeDesc
----------------------------------------------------------------------------------------------------------------
101 ACTIVE PREMIUM Blue Good NoticeABC Good 5
101 ACTIVE PREMIUM Green Bad NoticeYAF Bad 1
101 ACTIVE PREMIUM Red Good NoticeAFV Good 4
102 INACTIVE DIAMOND Green Bad NoticeTAC Bad 3
I believe this is just a wide-to-long data transformation, which I tried using this approach:
df = pd.wide_to_long(df, ['Order'], i=['CustomerID','CustomerStatus','CustomerTier'], j='Color', sep='.')
However, this is returning an empty dataframe. I'm sure I'm doing something wrong with the separator, perhaps because there are two of them in the column names?
I feel like splitting the column names into Color, Outcome, NoticeCode, and NoticeDesc would be relatively easy once I figure out how to do this conversion, but I'm just struggling with this part!
Any helpful tips to point me in the right direction would be greatly appreciated! Thank you!
I believe this would need to be solved with two separate calls to pd.wide_to_long, like so:
# First call: peel off the "Outcome" level
df = pd.wide_to_long(df,
                     stubnames=['Order.Blue', 'Order.Green', 'Order.Red'],
                     i=['CustomerID', 'CustomerStatus', 'CustomerTier'],
                     j='Outcome',
                     sep='.')
# Second call: peel off the "Color" level
df = pd.wide_to_long(df,
                     stubnames='Order',
                     i=['CustomerID', 'CustomerStatus', 'CustomerTier'],
                     j='Color',
                     sep='.')
Then, to split the notice column, you could use .str.split(), like so:
df[['NoticeCode', 'NoticeDesc']] = df['Outcome'].str.split(': ', expand=True)
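If pd.wide_to_long keeps fighting you, here is an alternative sketch using melt and splitting the column names manually (the reconstruction of df below is hypothetical, assuming exactly the Order.Color.Outcome column layout shown above):
import pandas as pd

df = pd.DataFrame({
    'CustomerID': [101, 102],
    'CustomerStatus': ['ACTIVE', 'INACTIVE'],
    'CustomerTier': ['PREMIUM', 'DIAMOND'],
    'Order.Blue.Good': ['NoticeABC: Good 5', None],
    'Order.Green.Bad': ['NoticeYAF: Bad 1', 'NoticeTAC: Bad 3'],
    'Order.Red.Good': ['NoticeAFV: Good 4', None],
})
long_df = (df.melt(id_vars=['CustomerID', 'CustomerStatus', 'CustomerTier'],
                   var_name='OrderCol', value_name='Notice')
             .dropna(subset=['Notice']))
# 'Order.Blue.Good' -> Color='Blue', Outcome='Good'
parts = long_df['OrderCol'].str.split('.', expand=True)
long_df['Color'], long_df['Outcome'] = parts[1], parts[2]
# 'NoticeABC: Good 5' -> NoticeCode='NoticeABC', NoticeDesc='Good 5'
long_df[['NoticeCode', 'NoticeDesc']] = long_df['Notice'].str.split(': ', expand=True)
print(long_df.drop(columns=['OrderCol', 'Notice']))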
Let me know how this goes and we can workshop a bit!

How to add missing data from one dataframe to another?

I am working on a project that requires filling in missing data in one Excel sheet from another. For example:
table A:
card name address zipcode
123 steve chicago 60601
321 Joy New York 10083
222 Andy San Francisco 43211
table B:
card name address zipcode
321 steve nan nan
123 Joy nan nan
123 nan nan nan
For this project, I need to fill in table B according to table A. I know how to do this with Excel's VLOOKUP function, but I suspect that if the data files get huge in the future (e.g., the same data format but from different branches), I may want to use Python instead.
In Python, the merge function can do this, but it takes too much time. Is there any useful function in pandas, numpy, or any other third-party library that can help me do this? Thanks all!
Here is what I have tried:
merged = pd.merge(table_A, table_B, on='card', how='right')
It does work, but I have to rename the columns so the features match up. I also know this can be done in SQL very quickly and efficiently; I just want to do it in Python :)
Of course the pandas library can do this and more. I am currently writing a business intelligence program, and I do a lot of operations like this with pandas.
There are many ways to do this, but since I don't see your code, here it is in the simplest and most understandable way. Ask again at the point where you get stuck. Thank you.
searchdata = Atabledata[['name', 'address', 'zipcode']]
for i in searchdata['name']:
    match = searchdata.loc[searchdata['name'] == i]
    if not match.empty:
        Btabledata.loc[Btabledata['name'] == i, 'address'] = match['address'].values[0]
        Btabledata.loc[Btabledata['name'] == i, 'zipcode'] = match['zipcode'].values[0]
print(Btabledata)
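For larger files, a loop-free sketch along the same lines (assuming the names in table A are unique and that the missing cells in table B are real NaN values):
lookup = Atabledata.set_index('name')[['address', 'zipcode']]
Btabledata['address'] = Btabledata['address'].fillna(Btabledata['name'].map(lookup['address']))
Btabledata['zipcode'] = Btabledata['zipcode'].fillna(Btabledata['name'].map(lookup['zipcode']))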

Use KDTree/KNN Return Closest Neighbors

I have two python pandas dataframes. One contains all NFL Quarterbacks' College Football statistics since 2007 and a label on the type of player they are (Elite, Average, Below Average). The other dataframe contains all of the college football qbs' data from this season along with a prediction label.
I want to run some sort of analysis to determine the two closest NFL comparisons for every college football QB based on their labels. I'd like to add the two comparable QBs as two new columns to the second dataframe.
The feature names in both dataframes are the same. Here is what the dataframes look like:
Player Year Team GP Comp % YDS TD INT Label
Player A 2020 ASU 12 65.5 3053 25 6 Average
For the example above, I'd like to find the two closest neighbors to Player A that also have the label "Average" in the first dataframe.
The way I thought of doing this was to use SciPy's KDTree and query the tree:
tree = KDTree(nfl[features], leafsize=nfl[features].shape[0]+1)
closest = []
for row in college.iterrows():
    distances, ndx = tree.query(row[features], k=2)
    closest.append(ndx)
print(closest)
However, the print statement returned an empty list. Is this the right way to solve my problem?
.iterrows() will return pairs of (index, Series), where index is the index of the row and the Series holds the row's values, itself indexed by the column names (see below).
As you have it, row is being stored as that whole tuple, so row[features] won't really do anything. What you're really after is the Series that holds the features and values, i.e. row[1]. You can either index into the tuple directly, or just unpack it in your loop by writing for idx, row in df.iterrows():. Then you can work with the Series row.
Scikit-learn is a good package to use here (it is actually built on SciPy, so you'll notice the same syntax). You'll have to edit the code to your specifications (filter to only the "Average" players; if you one-hot encode the category columns you may need to add those to the features; etc.), but to give you an idea, below I use the KD-tree and then take each row in the college dataframe to see which 2 rows it is closest to in the nfl dataframe. (I made up these dataframes just for the example: the nfl one is accurate, but the college one is completely made up.) I have it print out the names, but as you can see with print(closest), the raw arrays are there for you.
import pandas as pd
from sklearn.neighbors import KDTree

nfl = pd.DataFrame([['Tom Brady', '1999', 'Michigan', 11, 61.0, 2217, 16, 6, 'Average'],
                    ['Aaron Rodgers', '2004', 'California', 12, 66.1, 2566, 24, 8, 'Average'],
                    ['Payton Manning', '1997', 'Tennessee', 12, 60.2, 3819, 36, 11, 'Average'],
                    ['Drew Brees', '2000', 'Perdue', 12, 60.4, 3668, 26, 12, 'Average'],
                    ['Dan Marino', '1982', 'Pitt', 12, 58.5, 2432, 17, 23, 'Average'],
                    ['Joe Montana', '1978', 'Notre Dame', 11, 54.2, 2010, 10, 9, 'Average']],
                   columns=['Player', 'Year', 'Team', 'GP', 'Comp %', 'YDS', 'TD', 'INT', 'Label'])
college = pd.DataFrame([['Joe Smith', '2019', 'Illinois', 11, 55.6, 1045, 15, 7, 'Average'],
                        ['Mike Thomas', '2019', 'Wisconsin', 11, 67, 2045, 19, 11, 'Average'],
                        ['Steve Johnson', '2019', 'Nebraska', 12, 57.3, 2345, 9, 19, 'Average']],
                       columns=['Player', 'Year', 'Team', 'GP', 'Comp %', 'YDS', 'TD', 'INT', 'Label'])

features = ['GP', 'Comp %', 'YDS', 'TD', 'INT']
tree = KDTree(nfl[features], leaf_size=nfl[features].shape[0] + 1)

closest = []
for idx, row in college.iterrows():
    X = row[features].values.reshape(1, -1)  # one query point, shaped (1, n_features)
    distances, ndx = tree.query(X, k=2, return_distance=True)
    closest.append(ndx)
    collegePlayer = college.loc[idx, 'Player']
    closestPlayers = [nfl.loc[x, 'Player'] for x in ndx[0]]
    print('%s closest to: %s' % (collegePlayer, closestPlayers))
print(closest)
Output:
Joe Smith closest to: ['Joe Montana', 'Tom Brady']
Mike Thomas closest to: ['Joe Montana', 'Tom Brady']
Steve Johnson closest to: ['Dan Marino', 'Tom Brady']
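Since you only want neighbors that share a label, one way (a sketch reusing the frames above) is to build a separate tree per label group, so the returned indices always point at players with the matching label:
# Build the tree only from NFL players whose label matches the query's label.
avg_nfl = nfl[nfl['Label'] == 'Average'].reset_index(drop=True)
tree = KDTree(avg_nfl[features], leaf_size=avg_nfl.shape[0] + 1)
# Query exactly as before; ndx now indexes into avg_nfl rather than nfl.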

Reorder Rows into Columns in Pandas (Python 3, Pandas)

Right now, my code takes scraped web data from a file (BigramCounter.txt), and then finds all the bigrams within that file so that the data looks like this:
Counter({('the', 'first'): 45, ('on', 'purchases'): 42, ('cash', 'back'): 39})
After this, I try to feed it into a pandas DataFrame where it spits this df out:
the on cash
first purchases back
0 45 42 39
This is very close to what I need, but not quite. First off, the DF does not reflect my attempt to name the columns. Furthermore, I was hoping for something formatted more like this, where there are two columns and the words are not split between cells:
Words Frequency
the first 45
on purchases 42
cash back 39
For reference, here is my code. I think I may need to reorder an axis somewhere, but I'm not sure how. Any ideas?
import re
from collections import Counter
import pandas as pd

main_c = Counter()
words = re.findall(r'\w+', open('BigramCounter.txt', encoding='utf-8').read())
bigrams = Counter(zip(words, words[1:]))
main_c.update(bigrams)  # at this point it looks like Counter({('the', 'first'): 45, etc...})
comm = [[k, v] for k, v in main_c]
frame = pd.DataFrame(comm)
frame.columns = ['Word', 'Frequency']
frame2 = frame.unstack()
frame2.to_csv('text.csv')
I think I see what you're going for, and there are many ways to get there; you were really close. My first inclination would be to use a Series, especially since you'd (presumably) just be getting rid of the df index when you write to CSV, but it doesn't make a huge difference.
frequencies = [[" ".join(k), v] for k,v in main_c.items()]
pd.DataFrame(frequencies, columns=['Word', 'Frequency'])
Word Frequency
0 the first 45
1 cash back 39
2 on purchases 42
If, as I suspect, you want Word to be the index, add frame.set_index('Word'):
              Frequency
Word
the first            45
cash back            39
on purchases         42
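And the Series route mentioned at the top, as a minimal sketch (the joined bigram strings become the index, so no set_index step is needed before writing):
freq = pd.Series({' '.join(k): v for k, v in main_c.items()}, name='Frequency')
freq.rename_axis('Word').to_csv('text.csv', header=True)  # writes Word,Frequency columns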
