PerformanceWarning: DataFrame is highly fragmented. How to build it more efficiently via pd.concat with designated column names

I got the following warning while running under Python 3.8 with the newest pandas:
PerformanceWarning: DataFrame is highly fragmented.
This is the place where I compile my data into one single dataframe, and also where the problem pops up.
def get_all_score():
    df = pd.DataFrame()
    for name, code in get_code().items():
        global count
        count += 1
        print("ticker:" + name, "trade_code:" + code, "The {} data updated".format(count))
        try:
            df[name] = indicator_score(code)['total']
            time.sleep(0.33334)
        except:
            continue
    return df
I tried to look it up in the forum, but I can't figure out how to handle the two variables: df[name] is my column name, and indicator_score(code)['total'] is my column data. All the fragmented dataframes are added horizontally, as shown below:
    a   b   c  ...  zz
1  30  40  10  ...  21
2  41  50  11  ...  33
3  44  66  20  ...  29
4  51  71  19  ...  10
5  31  88  31  ...  60
6  60  95  40  ...  70
...
What would be a neat way to use pd.concat() to solve my issue? Thanks.

This is my workaround for the issue, but it does not seem very reliable; one little glitch can ruin the whole run. Here is my code:
def get_all_score():
    df = pd.DataFrame()
    name_list = []
    for name, code in get_code().items():
        global count
        count += 1
        print("ticker:" + name, "trade_code:" + code, "The {} data updated".format(count))
        try:
            name_list.append(name)
            df = pd.concat([df, indicator_score(code)['总分']], axis=1)
            # df[name] = indicator_score(code)['总分']
            # time.sleep(0.33334)
        except:
            name_list.remove(name)
            continue
    df.columns = name_list
    return df
I tried to set the column name before the concat step, but I could not get it to work; I only figured out how to rename the columns after the concat. This is such a pain. Does anyone have a better way to do it?

df[name] = indicator_score(code)['总分'].copy()
should solve your performance issue, I suppose; give it a try, mate.
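If that is not enough to silence the warning, another option is to avoid growing the frame column by column at all. Below is a minimal sketch (assuming, as in the question, that get_code() returns a name-to-code mapping and that indicator_score(code)['总分'] is a pandas Series); it collects the pieces in a dict and performs a single pd.concat at the end, with the dict keys becoming the column names. The progress printing and rate-limiting sleep from the original code are omitted for brevity:
import pandas as pd

def get_all_score():
    # Sketch only: assumes get_code() and indicator_score() behave as in the
    # question and that indicator_score(code)['总分'] is a pandas Series.
    pieces = {}
    for name, code in get_code().items():
        try:
            pieces[name] = indicator_score(code)['总分']
        except Exception:
            continue  # skip tickers that fail, as in the original code
    # A single concat at the end; the dict keys become the column names.
    return pd.concat(pieces, axis=1)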

Related

How to Put Ages into intervals

I have a list of ages in an existing dataframe. I would like to put these ages into intervals/age groups such as (10-20), (20-30), etc. Please see the example below.
I am unsure where to begin coding this, as I get a "bins" error when using any bins-related code.
Here's what you can do:
import pandas as pd

def checkAgeRange(age):
    las_dig = age % 10
    range_age = str.format('{0}-{1}', age - las_dig, (age - las_dig) + 10)
    return range_age

d = {'AGE': [19, 13, 45, 65, 23, 12, 28]}
dataFrame = pd.DataFrame(data=d)
dataFrame['AgeGroup'] = dataFrame['AGE'].apply(checkAgeRange)
print(dataFrame)
# Output:
   AGE AgeGroup
0   19    10-20
1   13    10-20
2   45    40-50
3   65    60-70
4   23    20-30
5   12    10-20
6   28    20-30
Some explanation of the code above:

d = {'AGE': [19, 13, 45, 65, 23, 12, 28]}
dataFrame = pd.DataFrame(data=d)
# Making a simple dataframe here

dataFrame['AgeGroup'] = dataFrame['AGE'].apply(checkAgeRange)
# Applying our checkAgeRange function here

def checkAgeRange(age):
    las_dig = age % 10
    range_age = str.format('{0}-{1}', age - las_dig, (age - las_dig) + 10)
    return range_age
# This method extracts the last digit from age and then forms the range as a string.
# You can change the data structure here according to your needs.
Hope this answers your question. Cheers!
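Since the question mentions running into errors with bins, it may also be worth sketching the built-in pandas approach with pd.cut (my own addition, not part of the answer above); on the sample data it reproduces the same labels:
import pandas as pd

d = {'AGE': [19, 13, 45, 65, 23, 12, 28]}
dataFrame = pd.DataFrame(data=d)

# Half-open bins [0, 10), [10, 20), ... so 19 falls in "10-20", matching the function above.
bins = list(range(0, 101, 10))
labels = [f"{lo}-{lo + 10}" for lo in bins[:-1]]
dataFrame['AgeGroup'] = pd.cut(dataFrame['AGE'], bins=bins, labels=labels, right=False)
print(dataFrame)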

Pandas: new column using data from multiple other files

I would like to add a new column in a pandas dataframe df, filled with data that are in multiple other files.
Say my df is like this:
Sample Pos
A 5602
A 3069483
B 51948
C 231
And I have three files A_depth-file.txt, B_depth-file.txt, C_depth-file.txt like this (showing A_depth-file.txt):
Pos Depth
1 31
2 33
3 31
... ...
5602 52
... ...
3069483 40
The desired output df would have a new column Depth as follows:
Sample Pos Depth
A 5602 52
A 3069483 40
B 51948 32
C 231 47
I have a method that works but it takes about 20 minutes to fill a df with 712 lines, searching files of ~4 million lines (=positions). Would anyone know a better/faster way to do this?
The code I am using now is:
import pandas as pd
from io import StringIO

with open("mydf.txt") as f:
    next(f)
    List = []
    for line in f:
        df = pd.read_fwf(StringIO(line), header=None)
        df.rename(columns={df.columns[1]: "Pos"}, inplace=True)
        f2basename = df.iloc[:, 0].values[0]
        f2 = f2basename + "_depth-file.txt"
        df2 = pd.read_csv(f2, sep='\t')
        df = pd.merge(df, df2, on="Pos", how="left")
        List.append(df)
    df = pd.concat(List, sort=False)
with open("mydf.txt") as f: to open the file to which I wish to add data
next(f) to pass the header
List=[] to create a new empty array called List
for line in f: to go over mydf.txt line by line and reading them with df = pd.read_fwf(StringIO(line), header=None)
df.rename(columns = {df.columns[1]: "Pos"}, inplace=True) to rename lost header name for Pos column, used later when merging line with associated file f2
f2basename = df.iloc[:, 0].values[0] getting basename of associated file f2 based on 1st column of mydf.txt
f2 = f2basename + "_depth-file.txt"to get full associated file f2 name
df2 = pd.read_csv(f2, sep='\t') to read file f2
df = pd.merge(df, df2, on="Pos", how="left")to merge the two files on column Pos, essentially adding Depth column to mydf.txt
List.append(df)adding modified line to the array List
df = pd.concat(List, sort=False) to concatenate elements of the List array into a dataframe df
Additional NOTES
In reality, I may need to search not only three files but several hundreds.
I didn't test the execution time, but it should be faster if you read your 'mydf.txt' file into a dataframe too, using read_csv, and then use groupby followed by apply.
If you know in advance that you have 3 samples and 3 corresponding files storing the depth, you can build a dictionary that reads and stores the three respective dataframes up front and use them when needed.
df = pd.read_csv('mydf.txt', sep=r'\s+')
files = {basename: pd.read_csv(basename + "_depth-file.txt", sep=r'\s+') for basename in ['A', 'B', 'C']}
res = df.groupby('Sample').apply(lambda x: pd.merge(x, files[x.name], on="Pos", how="left"))
The final res would look like:
          Sample      Pos  Depth
Sample
A      0       A     5602   52.0
       1       A  3069483   40.0
B      0       B    51948    NaN
C      0       C      231    NaN
There are NaN values because I am using the sample provided and I don't have files for B and C (I used a copy of A), so values are missing. Provided that your files contain a 'Depth' for each 'Pos' you should not get any NaN.
To get rid of the multiindex made by groupby you can do:
res.reset_index(drop=True, inplace=True)
and res becomes:
  Sample      Pos  Depth
0      A     5602   52.0
1      A  3069483   40.0
2      B    51948    NaN
3      C      231    NaN
EDIT after comments
Since you have a lot of files, you can use the following solution: same idea, but it does not require reading all the files in advance. Each file is read when needed.
def merging_depth(x):
    td = pd.read_csv(x.name + "_depth-file.txt", sep=r'\s+')
    return pd.merge(x, td, on="Pos", how="left")

res = df.groupby('Sample').apply(merging_depth)
The result is the same.
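For the case with several hundred files, a small variation of the same idea (my sketch, not the answerer's code) derives the file names from the Sample column instead of hard-coding them:
# Sketch only: builds the lookup dict from the samples actually present in
# mydf.txt instead of hard-coding ['A', 'B', 'C']; assumes one
# "<Sample>_depth-file.txt" exists per sample value.
import pandas as pd

df = pd.read_csv('mydf.txt', sep=r'\s+')
files = {s: pd.read_csv(f"{s}_depth-file.txt", sep=r'\s+') for s in df['Sample'].unique()}
res = (df.groupby('Sample')
         .apply(lambda x: pd.merge(x, files[x.name], on="Pos", how="left"))
         .reset_index(drop=True))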

Python Pandas Pivot Table - counting points

I have an issue with a pivot table in Python. Let's say that I have the values below in a list:
team_A_id = [1,5,10]
team_A_result = 0
and the data frame below:
id  points
 3      36
 4       0
 5      11
 7       6
10      23
How could I (perhaps using a for loop) find the points in the data frame for the team A ids in the list and sum them? The output should be:
result_team_A = 34
Thanks for any help
You are looking for isin and sum
team_A_id = [1,5,10]
df.loc[df.id.isin(team_A_id),'points'].sum()
Out[136]: 34
This will return the rows for team A (iloc is positional, so set id as the index first; reindex tolerates ids that are missing from the frame):
df.set_index('id').reindex(team_A_id)
The result for team A can then be obtained by:
df.set_index('id').reindex(team_A_id)['points'].sum()
TLDR:
df.set_index('id').reindex(team_A_id)['points'].sum()
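For reference, here is a self-contained sketch of the isin approach on the sample data from the question (the frame is rebuilt by hand here):
import pandas as pd

# Rebuild the sample frame from the question.
df = pd.DataFrame({'id': [3, 4, 5, 7, 10],
                   'points': [36, 0, 11, 6, 23]})

team_A_id = [1, 5, 10]

# Keep only the rows whose id is in team A, then sum their points.
result_team_A = df.loc[df['id'].isin(team_A_id), 'points'].sum()
print(result_team_A)  # 34  (11 + 23; id 1 does not appear in the frame)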

Assigning variables to cells in a Pandas table (Python)

I'm working on a script that takes test data from a website, assigns the data to a variable, then creates a pie chart of the responses for later analysis. I'm able to pull the data without a problem and format the information into a table, but I can't figure out how to assign a specific variable to a cell in the table.
For example, say question 1 had 20% of students answer A, 20% answer B, 30% answer C, and 30% answer D. I would like to take this information and assign it to variables such as 1A for A, 1B for B, etc.
I think the answer lies in this code. I've tried splitting columns and rows, but it looks like the column header doesn't correlate to the data below it. I'm also attaching the results of 'print(df)' below.
header = table.find_all('tr')[2]
cols = header.find_all('td')
cols = [ele.text.strip() for ele in cols]
cols = cols[0:3] + cols[4:8] + cols[9:]
df = pd.DataFrame(data, columns = cols)
print(df)
   A/1  B/2  C/3  D/4 CORRECT MC ANSWER
0    6   84    1    9                 B
1    6    1   91    2                 C
2   12    1   14   72                 D
3   77    3   11    9                 A
4   82    7    8    2                 A
Do you want to try something like this with 'autopct'?
df1 = df.T.set_axis(['Question '+str(i+1) for i in df.T.columns.values], axis=1, inplace=False).iloc[:4]
ax = df1.plot.pie(subplots=True,autopct='%1.1f%%',layout=(5,1),figsize=(3,15),legend=False)
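On the original ask of assigning one cell to a variable: a minimal sketch, assuming df is the frame shown by print(df) above, would use .loc or .at with the row index and column label (Python names cannot start with a digit, so q1_a stands in for the 1A from the question):
# Assumes df is the frame printed above, with the default integer index 0..4.
q1_a = df.loc[0, 'A/1']   # question 1 (row 0), answer A  -> 6
q1_b = df.at[0, 'B/2']    # .at is the fast scalar equivalent -> 84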

Reorder Rows into Columns in Pandas (Python 3, Pandas)

Right now, my code takes scraped web data from a file (BigramCounter.txt), and then finds all the bigrams within that file so that the data looks like this:
Counter({('the', 'first'): 45, ('on', 'purchases'): 42, ('cash', 'back'): 39})
After this, I try to feed it into a pandas DataFrame where it spits this df out:
    the         on  cash
  first  purchases  back
0    45         42    39
This is very close to what I need but not quite. First off, the DF ignores my attempt to name the columns. Furthermore, I was hoping for something formatted more like this, with two columns and the words not split between cells:
Words Frequency
the first 45
on purchases 42
cash back 39
For reference, here is my code. I think I may need to reorder an axis somewhere but I'm not sure how? Any ideas?
import re
from collections import Counter

import pandas as pd  # needed for pd.DataFrame below

main_c = Counter()
words = re.findall(r'\w+', open('BigramCounter.txt', encoding='utf-8').read())
bigrams = Counter(zip(words, words[1:]))
main_c.update(bigrams)  # at this point it looks like Counter({('the', 'first'): 45, etc...})
comm = [[k, v] for k, v in main_c]  # note: iterating a Counter yields only keys, so this needs .items()
frame = pd.DataFrame(comm)
frame.columns = ['Word', 'Frequency']
frame2 = frame.unstack()
frame2.to_csv('text.csv')
I think I see what you're going for, and there are many ways to get there. You were really close. My first inclination would be to use a series, especially since you'd (presumably) just be getting rid of the df index when you write to csv, but it doesn't make a huge difference.
frequencies = [[" ".join(k), v] for k,v in main_c.items()]
pd.DataFrame(frequencies, columns=['Word', 'Frequency'])
           Word  Frequency
0     the first         45
1     cash back         39
2  on purchases         42
If, as I suspect, you want Word to be the index, add .set_index('Word'):
Word          Frequency
the first            45
cash back            39
on purchases         42
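The Series variant mentioned at the start of this answer could look roughly like this (a sketch built on the sample Counter, not the original poster's data):
import pandas as pd
from collections import Counter

main_c = Counter({('the', 'first'): 45, ('on', 'purchases'): 42, ('cash', 'back'): 39})

# Join each bigram tuple into a single string; it becomes the index ("Word").
freq = pd.Series({" ".join(k): v for k, v in main_c.items()}, name='Frequency')
freq.index.name = 'Word'
freq.sort_values(ascending=False).to_csv('text.csv')  # writes Word,Frequency columns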
