How do I remove square brackets from my dataframe? - python-3.x

I'm trying to create a comparison between my predicted and actual values.
Here is my try:
from sklearn import linear_model

reg = linear_model.LinearRegression()
reg.fit(df[['Op1', 'Op2', 'S2', 'S3', 'S4', 'S7', 'S8', 'S9', 'S11', 'S12', 'S13', 'S14', 'S15', 'S17', 'S20', 'S21']], df.unit)

predicted = []
actual = []
for i in range(1, len(df.unit.unique())):
    xp = df[(df.unit == i) & (df.cycles == len(df[df.unit == i].cycles))]
    xa = xp.cycles.values
    xp = xp.values[0, 2:].reshape(1, -1)  # reshape the feature row to (1, n_features)
    predicted.append(reg.predict(xp))
    actual.append(xa)
and to display the dataframe:
data = {'Actual cycles': actual, 'Predicted cycles': predicted }
df_2 = pd.DataFrame(data)
df_2.head()
I get this output:
Actual cycles Predicted cycles
0 [192] [56.7530579842869]
1 [287] [50.76877712361329]
2 [179] [42.72575900074571]
3 [189] [42.876506912637524]
4 [269] [47.40087182743173]
Ignoring the values that are way off, how do I remove the square brackets in the dataframe? And is there a neater way to write my code? Thank you!

print(df_2)
Actualcycles Predictedcycles
0 [192] [56.7530579842869]
1 [287] [50.76877712361329]
2 [179] [42.72575900074571]
3 [189] [42.876506912637524]
4 [269] [47.40087182743173]
df = df_2.apply(lambda x: x.str.strip('[]'))
print(df)
Actualcycles Predictedcycles
0 192 56.7530579842869
1 287 50.76877712361329
2 179 42.72575900074571
3 189 42.876506912637524
4 269 47.40087182743173
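Note that after stripping, the values are still strings; if you need numbers afterwards, one possible follow-up (a sketch, not part of the original answer) is:
df = df.apply(pd.to_numeric)  # convert the stripped strings to numeric dtypes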

Below is a minimal example of your cycles column with brackets:
import pandas as pd
df = pd.DataFrame({
    'cycles': [[192], [287], [179], [189], [269]]
})
This code gets you the column without brackets:
df['cycles'] = df['cycles'].str[0]
The output looks like this:
print(df)
cycles
0 192
1 287
2 179
3 189
4 269
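For what it's worth, the brackets in your original output appear because reg.predict returns an array and xp.cycles.values is an array, so each append stores a one-element array. Here is a sketch of a loop that collects plain scalars instead, so no stripping is needed afterwards; it iterates over the unit ids directly, assumes the same column layout as your code, and introduces a new name, features, for illustration:
predicted = []
actual = []
for i in df.unit.unique():
    xp = df[(df.unit == i) & (df.cycles == len(df[df.unit == i].cycles))]
    actual.append(xp.cycles.values[0])          # scalar, not a one-element array
    features = xp.values[0, 2:].reshape(1, -1)
    predicted.append(reg.predict(features)[0])  # first (and only) prediction
df_2 = pd.DataFrame({'Actual cycles': actual, 'Predicted cycles': predicted})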

Related

Aggregating the total count with Json Data

I have an API endpoint which has the details of confirmed / recovered / tested count for each state
https://data.covid19india.org/v4/min/data.min.json
I would like to aggregate the total count of confirmed / recovered / tested for each state. What is the easiest way to achieve this?
To write the final results with pandas, we can proceed by adding this to the code:
import pandas as pd

columns = ('Confirmed', 'Deceased', 'Recovered', 'Tested')
Panda = pd.DataFrame(data=StateWiseData).T  # T for transpose
print(Panda)
The output will be:
confirmed deceased recovered tested
AN 7557 129 7420 0
AP 2003342 13735 1975448 9788047
AR 52214 259 50695 398545
AS 584434 5576 570847 326318
BR 725588 9649 715798 17107895
CH 65066 812 64213 652657
CT 1004144 13553 989728 338344
DL 1437334 25079 1411881 25142853
DN 10662 4 10620 72410
GA 173221 3186 169160 0
GJ 825302 10079 815041 10900176
HP 211746 3553 206094 481328
HR 770347 9667 760004 3948145
JH 347730 5132 342421 233773
JK 324295 4403 318838 139552
KA 2939767 37155 2882331 9791334
KL 3814305 19494 3631066 3875002
LA 20491 207 20223 110068
LD 10309 51 10194 234256
MH 6424651 135962 6231999 8421643
ML 74070 1281 69859 0
MN 111212 1755 105751 13542
MP 792101 10516 781499 3384824
MZ 52472 200 46675 0
NL 29589 610 27151 116359
OR 1001698 7479 986334 2774807
PB 600266 16352 583426 2938477
PY 122934 1808 120330 567923
RJ 954023 8954 944917 5852578
SK 29340 367 27185 0
TG 654989 3858 644747 0
TN 2600885 34709 2547005 4413963
TR 82092 784 80150 607962
TT 0 0 0 0
UP 1709119 22792 1685954 23724581
UT 342749 7377 329006 2127358
WB 1543496 18371 1515789 0
Yes, my interpretation was incorrect earlier. We have to take each district's totals and add them up.
import json

file = open('data.min.json')
dictionary = json.load(file)

stateCodes = ['AN', 'AP', 'AR', 'AS', 'BR', 'CH', 'CT', 'DL', 'DN', 'GA', 'GJ', 'HP', 'HR', 'JH', 'JK', 'KA', 'KL', 'LA', 'LD', 'MH', 'ML', 'MN', 'MP', 'MZ', 'NL', 'OR', 'PB', 'PY', 'RJ', 'SK', 'TG', 'TN', 'TR', 'TT', 'UP', 'UT', 'WB']
StateWiseData = {}
for state in stateCodes:
    StateInfo = dictionary[state]
    Confirmed = 0
    Recovered = 0
    Tested = 0
    Deceased = 0
    StateData = {}
    if "districts" in StateInfo:
        for District in StateInfo['districts']:
            DistrictInfo = StateInfo['districts'][District]['total']
            if 'confirmed' in DistrictInfo:
                if type(Confirmed) == type(DistrictInfo['confirmed']):
                    Confirmed += DistrictInfo['confirmed']
            if 'recovered' in DistrictInfo:
                if type(Recovered) == type(DistrictInfo['recovered']):
                    Recovered += DistrictInfo['recovered']
            if 'tested' in DistrictInfo:
                if type(Tested) == type(DistrictInfo['tested']):
                    Tested += DistrictInfo['tested']
            if 'deceased' in DistrictInfo:
                if type(Deceased) == type(DistrictInfo['deceased']):
                    Deceased += DistrictInfo['deceased']
    StateData['confirmed'] = Confirmed
    StateData['deceased'] = Deceased
    StateData['recovered'] = Recovered
    StateData['tested'] = Tested
    StateWiseData[state] = StateData
print(StateWiseData)
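A more compact alternative is to let pandas do the summing per state. This is only a sketch, assuming the same data.min.json structure as above, i.e. dictionary[state]['districts'][district]['total'] holds the counts:
import pandas as pd

rows = {}
for state, info in dictionary.items():
    districts = info.get('districts', {})
    # sum the numeric fields of each district's 'total' block
    rows[state] = pd.DataFrame(
        [d.get('total', {}) for d in districts.values()]
    ).sum(numeric_only=True)

state_totals = (pd.DataFrame(rows).T
                  .reindex(columns=['confirmed', 'deceased', 'recovered', 'tested'])
                  .fillna(0)
                  .astype(int))
print(state_totals)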

Pandas appends duplicate rows even though if statement shouldn't be triggered

I have many csv files that only have one row of data. I need to take data from two of the cells and put them into a master csv file ('new_gal.csv'). Initially this will only contain the headings, but no data.
#The file I am pulling from:
file_name = "N4261_pacs160.csv"
#I have the code written to separate gal_name, cat_name, and cat_num (N4261, pacs, 160)
An example of the csv is given here. I am trying to pull "flux" and "rms" from this file. (Sorry it isn't aligned nicely; I can't figure out the formatting).
name band ra dec raerr decerr flux snr snrnoise stn rms strn fratio fwhmxfit fwhmyfit flag_elong edgeflag flag_blend warmat obsid ssomapflag dist angle
HPPSC160A_J121923.1+054931 red 184.846389 5.8254 0.000151 0.00015 227.036 10.797 21.028 16.507 13.754 37.448 1.074 15.2 11 0.7237 f 0 f 1342199758 f 1.445729 296.577621
I read this csv and pull the data I need
import csv
import pandas as pd

with open(file_name, 'r') as table:
    reader = csv.reader(table, delimiter=',')
    read = iter(reader)
    next(read)  # skip the header row
    for row in read:
        fluxP = row[6]
        errP = row[10]

# Open the master csv with pandas
df = pd.read_csv('new_gal.csv')
The master csv file has format:
Galaxy Cluster Mult. Detect. LumDist z W1 W1 err W2 W2 err W3 W3 err W4 W4 err 70 70 err 100 100 err 160 160 err 250 250 err 350 350 err 500 500 err
The main problem I have is that I want to search the "Galaxy" column in 'new_gal.csv' for the galaxy name. If it is not there, I need to add a new row with the galaxy name and the flux and error measurement. When I run this multiple times, I get duplicate rows even though I have the append command nested in the if statement. I only want it to append a new row if the galaxy name is not already there; otherwise, it should only change the values of the flux and error measurements for that galaxy.
if cat_name == 'pacs':
    if gal_name not in df["Galaxy"]:
        df = df.append({"Galaxy": gal_name}, ignore_index=True)
        if cat_num == "70":
            df.loc[df.Galaxy == gal_name, ["70"]] = fluxP
            df.loc[df.Galaxy == gal_name, ["70 err"]] = errP
        elif cat_num == "100":
            df.loc[df.Galaxy == gal_name, ["100"]] = fluxP
            df.loc[df.Galaxy == gal_name, ["100 err"]] = errP
        elif cat_num == "160":
            df.loc[df.Galaxy == gal_name, ["160"]] = fluxP
            df.loc[df.Galaxy == gal_name, ["160 err"]] = errP
    else:
        if cat_num == "70":
            df.loc[df.Galaxy == gal_name, ["70"]] = fluxP
            df.loc[df.Galaxy == gal_name, ["70 err"]] = errP
        elif cat_num == "100":
            df.loc[df.Galaxy == gal_name, ["100"]] = fluxP
            df.loc[df.Galaxy == gal_name, ["100 err"]] = errP
        elif cat_num == "160":
            df.loc[df.Galaxy == gal_name, ["160"]] = fluxP
            df.loc[df.Galaxy == gal_name, ["160 err"]] = errP
After running the code 5 times with the same file, I have 5 identical lines in the table.
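As an aside, a likely reason for the duplicates is that gal_name not in df["Galaxy"] tests membership against the Series index (0, 1, 2, ...), not against the galaxy names, so the check passes every run and the append always fires. A minimal sketch of a fix for that test alone, using the same column name as above:
# test against the column's values, not its index
if gal_name not in df["Galaxy"].values:
    df = df.append({"Galaxy": gal_name}, ignore_index=True)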
I think I've got something that'll work after tinkering with it this morning...
A couple of points: you shouldn't incrementally build a dataframe in pandas; get the data set up externally, then do one build. In what I have below, I'm building a big dictionary from the small csv files and then using merge to put that together with the master file.
If your .csv files aren't formatted properly, you can either try to replace the split character below or switch over to the csv reader, which is a bit more powerful.
You should put all of the smaller .csv files in a folder called 'orig_data' to make this work.
main prog
# galaxy compiler
import os, re
import pandas as pd

# folder location for the small .csvs, NOT the master
data_folder = 'orig_data'  # this folder should be in same directory as program

result = {}
splitter = r'(.+)_([a-zA-Z]+)([0-9]+)\.'  # regex to break up file name into 3 groups

for file in os.listdir(data_folder):
    file_data = {}
    # split up the filename and process
    galaxy, cat_name, cat_num = re.match(splitter, file).groups()
    #print(galaxy, cat_name, cat_num)
    with open(os.path.join(data_folder, file), 'r') as src:
        src.readline()  # read the header and disregard it
        data = src.readline().replace(' ', '').strip().split(',')  # you can change the split char
        flux = float(data[2])
        rms = float(data[3])
        err_tag = cat_num + ' err'
        file_data = {'cat_name': cat_name,
                     cat_num: flux,
                     err_tag: rms}
    result[galaxy] = file_data

df2 = pd.DataFrame.from_dict(result, orient='index')
df2.index.rename('galaxy', inplace=True)
# check the resulting build!
#print(df2)

# build master dataframe
master_df = pd.read_csv('master_data.csv')
#print(master_df.head())

# merge the 2 dataframes on galaxy name. See the dox on merge for other
# options and whether you want an "outer" join or other type of join...
master_df = master_df.merge(df2, how='outer', on='galaxy')

# convert boolean flags properly
conv = {'t': True, 'f': False}
master_df['flag_nova'] = master_df['flag_nova'].map(conv).astype('bool')

print(master_df)
print()
print(master_df.info())
print()
print(master_df.describe())
example data files in orig_data folder
filename: A99_dbc100.csv
band,weight,flux,rms
junk, 200.44,2e5,2e-8
filename: B250_pacs100.csv
band,weight,flux,rms
nada,2.44,19e-5, 74
...etc.
example master csv
galaxy,color,stars,flag_nova
A99,red,15,f
B250,blue,4e20,t
N1000,green,3e19,f
X99,white,12,t
Result:
galaxy color stars ... 200 err 100 100 err
0 A99 red 1.500000e+01 ... NaN 200000.00000 2.000000e-08
1 B250 blue 4.000000e+20 ... NaN 0.00019 7.400000e+01
2 N1000 green 3.000000e+19 ... 88.0 NaN NaN
3 X99 white 1.200000e+01 ... NaN NaN NaN
[4 rows x 9 columns]
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4 entries, 0 to 3
Data columns (total 9 columns):
galaxy 4 non-null object
color 4 non-null object
stars 4 non-null float64
flag_nova 4 non-null bool
cat_name 3 non-null object
200 1 non-null float64
200 err 1 non-null float64
100 2 non-null float64
100 err 2 non-null float64
dtypes: bool(1), float64(5), object(3)
memory usage: 292.0+ bytes
None
stars 200 200 err 100 100 err
count 4.000000e+00 1.0 1.0 2.000000 2.000000e+00
mean 1.075000e+20 1900000.0 88.0 100000.000095 3.700000e+01
std 1.955121e+20 NaN NaN 141421.356103 5.232590e+01
min 1.200000e+01 1900000.0 88.0 0.000190 2.000000e-08
25% 1.425000e+01 1900000.0 88.0 50000.000143 1.850000e+01
50% 1.500000e+19 1900000.0 88.0 100000.000095 3.700000e+01
75% 1.225000e+20 1900000.0 88.0 150000.000048 5.550000e+01
max 4.000000e+20 1900000.0 88.0 200000.000000 7.400000e+01
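If you also want to persist the merged table as the new master, one extra line does it (not shown in the answer above; adjust the file name to whatever your master csv is called):
master_df.to_csv('master_data.csv', index=False)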

passing parameters in groupby aggregate function

I have a dataframe, referenced as df in the code below, and I'm applying aggregate functions on multiple columns of each group. I also applied the user-defined lambda functions f4, f5, f6, f7. Some of the functions are very similar, like f4, f6 and f7, where only the parameter value differs. Can I pass these parameters from the dictionary d, so that I have to write only one function instead of several?
f4 = lambda x: len(x[x>10]) # count the frequency of bearing greater than threshold value
f4.__name__ = 'Frequency'
f5 = lambda x: len(x[x<3.4]) # count the stop points with velocity less than threshold value 3.4
f5.__name__ = 'stop_frequency'
f6 = lambda x: len(x[x>0.2]) # count the points with velocity greater than threshold value 0.2
f6.__name__ = 'frequency'
f7 = lambda x: len(x[x>0.25]) # count the points with acceleration greater than threshold value 0.25
f7.__name__ = 'frequency'
d = {'acceleration':['mean', 'median', 'min'],
'velocity':[f5, 'sum' ,'count', 'median', 'min'],
'velocity_rate':f6,
'acc_rate':f7,
'bearing':['sum', f4],
'bearing_rate':'sum',
'Vincenty_distance':'sum'}
df1 = df.groupby(['userid','trip_id','Transportation_Mode','segmentid'], sort=False).agg(d)
# flattening MultiIndex in columns
df1.columns = df1.columns.map('_'.join)
#MultiIndex in index to columns
df1 = df1.reset_index(level=2, drop=False).reset_index()
I would like to write a function like
f4(p) = lambda x: len(x[x>p])
f4.__name__ = 'Frequency'
d = {'acceleration':['mean', 'median', 'min'],
'velocity':[f5, 'sum' ,'count', 'median', 'min'],
'velocity_rate':f4(0.2),
'acc_rate':f4(0.25),
'bearing':['sum', f4(10)],
'bearing_rate':'sum',
'Vincenty_distance':'sum'}
The csv file of dataframe df is available at the link below for more clarity about the data.
https://drive.google.com/open?id=1R_BBL00G_Dlo-6yrovYJp5zEYLwlMPi9
It is possible, but not easy; the solution below builds on an idea from neilaronson.
It is also simplified by summing the True values of a boolean mask instead of indexing and taking the length.
def f4(p):
    def ipf(x):
        return (x < p).sum()
        # your solution
        # return len(x[x < p])
    ipf.__name__ = 'Frequency'
    return ipf
d = {'acceleration':['mean', 'median', 'min'],
'velocity':[f4(3.4), 'sum' ,'count', 'median', 'min'],
'velocity_rate':f4(0.2),
'acc_rate':f4(.25),
'bearing':['sum', f4(10)],
'bearing_rate':'sum',
'Vincenty_distance':'sum'}
df1 = df.groupby(['userid','trip_id','Transportation_Mode','segmentid'], sort=False).agg(d)
# flattening MultiIndex in columns
df1.columns = df1.columns.map('_'.join)
#MultiIndex in index to columns
df1 = df1.reset_index(level=2, drop=False).reset_index()
EDIT: You can also pass a parameter to choose between greater and less:
def f4(p, op):
    def ipf(x):
        if op == 'greater':
            return (x > p).sum()
        elif op == 'less':
            return (x < p).sum()
        else:
            raise ValueError("second argument has to be greater or less only")
    ipf.__name__ = 'Frequency'
    return ipf
d = {'acceleration':['mean', 'median', 'min'],
'velocity':[f4(3.4, 'less'), 'sum' ,'count', 'median', 'min'],
'velocity_rate':f4(0.2, 'greater'),
'acc_rate':f4(.25, 'greater'),
'bearing':['sum', f4(10, 'greater')],
'bearing_rate':'sum',
'Vincenty_distance':'sum'}
df1 = df.groupby(['userid','trip_id','Transportation_Mode','segmentid'], sort=False).agg(d)
# flattening MultiIndex in columns
df1.columns = df1.columns.map('_'.join)
#MultiIndex in index to columns
df1 = df1.reset_index(level=2, drop=False).reset_index()
print (df1.head())
userid trip_id segmentid Transportation_Mode acceleration_mean \
0 141 1.0 1 walk 0.061083
1 141 2.0 1 walk 0.109148
2 141 3.0 1 walk 0.106771
3 141 4.0 1 walk 0.141180
4 141 5.0 1 walk 1.147157
acceleration_median acceleration_min velocity_Frequency velocity_sum \
0 -1.168583e-02 -2.994428 1000.0 1506.679506
1 1.665535e-09 -3.234188 464.0 712.429005
2 -3.055414e-08 -3.131293 996.0 1394.746071
3 9.241707e-09 -3.307262 340.0 513.461259
4 -2.609489e-02 -3.190424 493.0 729.702854
velocity_count velocity_median velocity_min velocity_rate_Frequency \
0 1028 1.294657 0.284747 288.0
1 486 1.189650 0.284725 134.0
2 1020 1.241419 0.284733 301.0
3 352 1.326324 0.339590 93.0
4 504 1.247868 0.284740 168.0
acc_rate_Frequency bearing_sum bearing_Frequency bearing_rate_sum \
0 169.0 81604.187066 884.0 -371.276356
1 89.0 25559.589869 313.0 -357.869944
2 203.0 -71540.141199 57.0 946.382581
3 78.0 9548.920765 167.0 -943.184805
4 93.0 -24021.555784 67.0 535.333624
Vincenty_distance_sum
0 1506.679506
1 712.429005
2 1395.328768
3 513.461259
4 731.823664
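Another way to avoid the string comparison is to pass a comparison function from the operator module. This is only a sketch along the same lines as the answer above; count_if is a name introduced here, not part of the original code:
import operator

def count_if(threshold, op=operator.gt):
    # count the values x in the group for which op(x, threshold) is True
    def counter(x):
        return op(x, threshold).sum()
    counter.__name__ = 'Frequency'
    return counter

d = {'acceleration': ['mean', 'median', 'min'],
     'velocity': [count_if(3.4, operator.lt), 'sum', 'count', 'median', 'min'],
     'velocity_rate': count_if(0.2),
     'acc_rate': count_if(0.25),
     'bearing': ['sum', count_if(10)],
     'bearing_rate': 'sum',
     'Vincenty_distance': 'sum'}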

Pandas: create multiple aggregate columns and merge multiple data frames in an elegant way

I am using the following code to create a few new aggregated columns based on the column version, and then I merge the 4 new data frames.
new_df = df[['version','duration']].groupby('version').mean().rename(columns=lambda x: ('mean_' + x)).reset_index().fillna(0)
new_df1 = df[['version','duration']].groupby('version').std().rename(columns=lambda x: ('std_' + x)).reset_index().fillna(0)
new_df2 = df[['version','ts']].groupby('version').min().rename(columns=lambda x: ('min_' + x)).reset_index().fillna(0)
new_df3 = df[['version','ts']].groupby('version').max().rename(columns=lambda x: ('max_' + x)).reset_index().fillna(0)
new_df3
import pandas
df_a = pandas.merge(new_df,new_df1, on = 'version')
df_b = pandas.merge(df_a,new_df2, on = 'version')
df_c = pandas.merge(df_b,new_df3, on = 'version')
df_c
The output looks like below:
version mean_duration std_duration min_ts max_ts
0 1400422 451 1 2018-02-28 09:42:15 2018-02-28 09:42:15
1 7626065 426 601 2018-01-25 11:01:58 2018-01-25 11:15:22
2 7689209 658 473 2018-01-30 11:09:31 2018-02-01 05:19:23
3 7702304 711 80 2018-01-30 17:49:18 2018-01-31 12:27:20
The code works fine, but I am wondering whether there is a more elegant/cleaner way to do this? Thank you!
Using functools.reduce to chain the merges on your existing result:
import functools
import pandas as pd

l = [new_df, new_df1, new_df2, new_df3]
functools.reduce(lambda left, right: pd.merge(left, right, on=['version']), l)
Or, using agg to recreate what you need:
s=df.groupby('version').agg({'duration':['mean','std'],'ts':['min','max']}).reset_index()
s.columns=s.columns.map('_'.join)
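If your pandas version is 0.25 or newer, named aggregation gives the renamed columns directly, without flattening the MultiIndex by hand; a sketch with the same columns:
s = (df.groupby('version')
       .agg(mean_duration=('duration', 'mean'),
            std_duration=('duration', 'std'),
            min_ts=('ts', 'min'),
            max_ts=('ts', 'max'))
       .reset_index())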

Missing value imputation in Python

I have two huge vectors item_clusters and beta. The element item_clusters[i] is the cluster id to which item i belongs. The element beta[i] is a score given to item i. Scores are {-1, 0, 1, 2, 3}.
Whenever the score of a particular item is 0, I have to impute it with the average non-zero score of the other items belonging to the same cluster. What is the fastest possible way to do this?
This is what I have tried so far. I converted item_clusters to a matrix clusters_to_items such that the element clusters_to_items[i][j] = 1 if cluster i contains item j, else 0. After that I am running the following code.
# beta (1x1.3M) csr matrix
# num_clusters = 1000
# item_clusters (1x1.3M) numpy.array
# clust_to_items (1000x1.3M) csr_matrix
import numpy as np

alpha_z = []
for clust in range(0, num_clusters):
    alpha = clust_to_items[clust, :]
    alpha_beta = beta.multiply(alpha)
    sum_row = alpha_beta.sum(1)[0, 0]
    num_nonzero = alpha_beta.nonzero()[1].__len__() + 0.001
    to_impute = sum_row / num_nonzero
    Z = np.repeat(to_impute, beta.shape[1])
    alpha_z = alpha.multiply(Z)
    idx = beta.nonzero()
    alpha_z[idx] = beta.data
    interact_score = alpha_z.tolist()[0]
    # The interact_score is the required modified beta
    # This is used to do some work that is very fast
The problem is that this code has to run 150K times and it is very slow. It will take 12 days to run according to my estimate.
Edit: I believe I need some very different idea in which I can directly use item_clusters and do not need to iterate through each cluster separately.
I don't know if this means I'm the popular kid here or not, but I think you can vectorize your operations in the following way:
import numpy as np

def fast_impute(num_clusters, item_clusters, beta):
    # get counts
    cluster_counts = np.zeros(num_clusters)
    np.add.at(cluster_counts, item_clusters, 1)
    # get complete totals
    totals = np.zeros(num_clusters)
    np.add.at(totals, item_clusters, beta)
    # get number of zeros
    zero_counts = np.zeros(num_clusters)
    z = beta == 0
    np.add.at(zero_counts, item_clusters, z)
    # non-zero means
    cluster_means = totals / (cluster_counts - zero_counts)
    # perform imputations
    imputed_beta = np.where(beta != 0, beta, cluster_means[item_clusters])
    return imputed_beta
which gives me
>>> N = 10**6
>>> num_clusters = 1000
>>> item_clusters = np.random.randint(0, num_clusters, N)
>>> beta = np.random.choice([-1, 0, 1, 2, 3], size=len(item_clusters))
>>> %time imputed = fast_impute(num_clusters, item_clusters, beta)
CPU times: user 652 ms, sys: 28 ms, total: 680 ms
Wall time: 679 ms
and
>>> imputed[:5]
array([ 1.27582017, -1. , -1. , 1. , 3. ])
>>> item_clusters[:5]
array([506, 968, 873, 179, 269])
>>> np.mean([b for b, i in zip(beta, item_clusters) if i == 506 and b != 0])
1.2758201701093561
Note that I did the above manually. It would be a lot easier if you were using higher-level tools, say like those provided by pandas:
>>> df = pd.DataFrame({"beta": beta, "cluster": item_clusters})
>>> df.head()
beta cluster
0 0 506
1 -1 968
2 -1 873
3 1 179
4 3 269
>>> df["beta"] = df["beta"].replace(0, np.nan)
>>> df["beta"] = df["beta"].fillna(df["beta"].groupby(df["cluster"]).transform("mean"))
>>> df.head()
beta cluster
0 1.27582 506
1 -1.00000 968
2 -1.00000 873
3 1.00000 179
4 3.00000 269
My suspicion is that
alpha_beta = beta.multiply(alpha)
is a terrible idea, because you only need the first elements of the row sums, so you're doing a couple million multiply-adds in vain, if I'm not mistaken:
sum_row = alpha_beta.sum(1)[0, 0]
So, write down the discrete formula for beta * alpha, then pick the row you need and derive the formula for its sum.
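For example, since beta is a 1 x N csr_matrix, the per-cluster sums and counts of its non-zero entries can be read straight off its indices and data, with no multiply at all. A minimal sketch, assuming item_clusters is a length-N array as described above:
import numpy as np

cols = beta.indices   # column positions of the non-zero scores
vals = beta.data      # the non-zero scores themselves
sums = np.bincount(item_clusters[cols], weights=vals, minlength=num_clusters)
counts = np.bincount(item_clusters[cols], minlength=num_clusters)
cluster_means = sums / np.maximum(counts, 1)   # non-zero mean per cluster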
