I have the following code:
mean = dataframe.groupby('LABEL')['RESP'].mean()
minimum = dataframe.groupby('LABEL')['RESP'].min()
maximum = dataframe.groupby('LABEL')['RESP'].max()
std = dataframe.groupby('LABEL')['RESP'].std()
df = [mean, minimum, maximum]
And the following output
[LABEL
0.0 -1.193420
1.0 0.713425
2.0 -1.066513
3.0 -0.530640
4.0 -2.130600
6.0 0.084747
7.0 1.190506
Name: RESP, dtype: float64,
LABEL
0.0 -1.396179
1.0 -0.233459
2.0 -1.631165
3.0 -1.271057
4.0 -2.543640
6.0 -0.418091
7.0 -0.004578
Name: RESP, dtype: float64,
LABEL
0.0 0.042247
1.0 0.295534
2.0 0.128233
3.0 0.243975
4.0 0.088077
6.0 0.085615
7.0 0.693196
Name: RESP, dtype: float64
]
However I want the output to be a dictionary as
{label_value: [mean, min, max, std_dev]}
For example
{1: [1, 0, 2, 1], 2: [0, -1, 1, 1], ... }
I'm assuming your starting DataFrame is equivalent to the one I've synthesised below.
Calculate all of the aggregate values in one call to agg(). The values are rounded so the output fits in this answer.
Call reset_index() on the aggregate, then to_dict(orient="records").
Use a dict comprehension to reformat the records to your specification.
import random
import pandas as pd
df = pd.DataFrame([[l, random.random()] for l in range(8) for k in range(500)], columns=["LABEL","RESP"])
d = df.groupby("LABEL")["RESP"].agg(["mean", "min", "max", "std"]).round(4).reset_index().to_dict(orient="records")
{e["LABEL"]:[e["mean"],e["min"],e["max"],e["std"]] for e in d}
output
{0: [0.5007, 0.0029, 0.997, 0.2842],
1: [0.4967, 0.0001, 0.9993, 0.2855],
2: [0.4742, 0.0003, 0.9931, 0.2799],
3: [0.5175, 0.0062, 0.9996, 0.2978],
4: [0.4909, 0.0018, 0.9952, 0.2912],
5: [0.4787, 0.0077, 0.9976, 0.291],
6: [0.4878, 0.0009, 0.9942, 0.2806],
7: [0.4989, 0.0066, 0.9982, 0.278]}
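A variant of the same idea, as a minimal sketch. It assumes a tiny made-up DataFrame (the numbers are illustrative, not the asker's data) and builds the dictionary straight from the aggregated table, one row per label.

```python
import pandas as pd

# Illustrative data only: two labels with a handful of RESP values each
df = pd.DataFrame({
    "LABEL": [0.0, 0.0, 1.0, 1.0, 1.0],
    "RESP": [1.0, 3.0, 2.0, 4.0, 6.0],
})

# One aggregate call; string names keep the column labels predictable
stats = df.groupby("LABEL")["RESP"].agg(["mean", "min", "max", "std"])

# Each row of the aggregate becomes {label: [mean, min, max, std]}
result = {label: row.tolist() for label, row in stats.iterrows()}
print(result)
```

Because the column order comes from the list passed to agg(), reordering that list is enough to change the order of the values in each dictionary entry.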
I am new to Python and pandas.
I am trying to convert data in a dictionary to a CSV file.
Here is the dictionary:
data_new = {'bedrooms': 2.0, 'bathrooms': 3.0, 'sqft_living': 1200,
'sqft_lot': 5800, 'floors': 2.0,
'waterfront': 1, 'view': 1, 'condition': 2, 'sqft_above': 1200,
'sqft_basement': 20,
'yr_built': 1925, 'yr_renovated': 2003, 'city': "Shoreline"}
I use the method below to save the dictionary as a CSV file and read it back:
with open('test.csv', 'w') as f:
    for key in data_new:
        f.write("%s,%s\n" % (key, data_new[key]))
df1 = pd.read_csv("test.csv")
df1
When I read df1 I get it in the format below,
but I want all rows to be columns, so I used the transpose function as below.
However, in the output above, bathrooms is at index 0, but I want the index to start from bedrooms, because when I call tdf1.info() on the output below I do not see the bedrooms data at all.
Could you please guide me on how I can fix this?
Regards
Aravind Viswanathan
I think it would be easier to just use pandas to both write and read your csv file. Does this satisfy what you're trying to do?
import pandas as pd
data_new = {'bedrooms': 2.0, 'bathrooms': 3.0, 'sqft_living': 1200,
'sqft_lot': 5800, 'floors': 2.0,
'waterfront': 1, 'view': 1, 'condition': 2, 'sqft_above': 1200,
'sqft_basement': 20,
'yr_built': 1925, 'yr_renovated': 2003, 'city': "Shoreline"}
df1 = pd.DataFrame([data_new])  # a list with one dict gives a one-row DataFrame
df1.to_csv('test.csv', index=False) # index=False prevents index being added as column 1
df2 = pd.read_csv('test.csv')
print(df1)
print(df2)
Output:
bedrooms bathrooms sqft_living ... yr_built yr_renovated city
0 2.0 3.0 1200 ... 1925 2003 Shoreline
[1 rows x 13 columns]
bedrooms bathrooms sqft_living ... yr_built yr_renovated city
0 2.0 3.0 1200 ... 1925 2003 Shoreline
[1 rows x 13 columns]
Identical.
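As for why the original key,value file seemed to lose bedrooms: by default read_csv treats the first line as the header, so the bedrooms row becomes the column names rather than data. If the key-per-line layout has to be kept, a minimal sketch is to pass header=None and use the first column as the index before transposing. The in-memory buffer and the shortened dictionary below are assumptions made only to keep the example self-contained.

```python
import io
import pandas as pd

# Shortened, illustrative version of the dictionary from the question
data_new = {'bedrooms': 2.0, 'bathrooms': 3.0, 'city': "Shoreline"}

# Write one "key,value" line per entry, as in the question (buffer instead of a file)
buf = io.StringIO()
for key, value in data_new.items():
    buf.write(f"{key},{value}\n")
buf.seek(0)

# header=None stops pandas from consuming the first line as column names;
# index_col=0 makes the keys the index, and .T turns them into columns
tdf1 = pd.read_csv(buf, header=None, index_col=0).T
print(tdf1)
```

Note that a column mixing numbers and text is read as strings in this layout, which is one more reason the one-row DataFrame approach above is usually simpler.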
I am indexing and slicing my data using pandas in Python 3 to calculate spatial statistics.
When I run a for loop over the range of latitudes and longitudes using .loc, I get KeyError: (slice(None, None, None), ) for any combination of latitude and longitude that has no values in the input file. Instead of skipping those combinations, the code raises the error and stops. My code follows.
import numpy as np
import pandas as pd
from scipy import stats
filename='input.txt'
df = pd.read_csv(filename,delim_whitespace=True, header=None, names = ['year','month','lat','lon','aod'], index_col = ['year','month','lat','lon'])
idx=pd.IndexSlice
for i in range(1, 13):
    for lat0 in np.arange(0., 40.25, 0.25, dtype=float):
        for lon0 in np.arange(20.0, 75.25, 0.25, dtype=float):
            tmp = df.loc[idx[:, i, lat0, lon0], :]
            if len(tmp) <= 0:
                continue
            tmp2 = tmp.index.tolist()
In the code above, if I run for tmp = df.loc[idx[:,1,0.0,34.0],:], it works well and provides the following output, which I used for the further calculation.
aod
year month lat lon
2003 1 0.0 34.0 0.032000
2006 1 0.0 34.0 0.114000
2007 1 0.0 34.0 0.035000
2008 1 0.0 34.0 0.026000
2011 1 0.0 34.0 0.097000
2012 1 0.0 34.0 0.106333
2013 1 0.0 34.0 0.081000
2014 1 0.0 34.0 0.038000
2015 1 0.0 34.0 0.278500
2016 1 0.0 34.0 0.033000
2017 1 0.0 34.0 0.036333
2019 1 0.0 34.0 0.064333
2020 1 0.0 34.0 0.109500
But when I run the same code for tmp = df.loc[idx[:,1,0.0,32.75],:], a latitude and longitude for which no values are available in the input file, it does not skip them. Instead it gives me the following error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python3/dist-packages/pandas/core/indexing.py", line 925, in __getitem__
return self._getitem_tuple(key)
File "/usr/lib/python3/dist-packages/pandas/core/indexing.py", line 1100, in _getitem_tuple
return self._getitem_lowerdim(tup)
File "/usr/lib/python3/dist-packages/pandas/core/indexing.py", line 822, in _getitem_lowerdim
return self._getitem_nested_tuple(tup)
File "/usr/lib/python3/dist-packages/pandas/core/indexing.py", line 906, in _getitem_nested_tuple
obj = getattr(obj, self.name)._getitem_axis(key, axis=axis)
File "/usr/lib/python3/dist-packages/pandas/core/indexing.py", line 1157, in _getitem_axis
locs = labels.get_locs(key)
File "/usr/lib/python3/dist-packages/pandas/core/indexes/multi.py", line 3347, in get_locs
indexer = _update_indexer(
File "/usr/lib/python3/dist-packages/pandas/core/indexes/multi.py", line 3296, in _update_indexer
raise KeyError(key)
KeyError: (slice(None, None, None), 1, 0.0, 32.75)
I tried replacing .loc with .iloc, but that gave a "too many indexers" error. I also tried solutions from the internet using .to_numpy(), .values and .as_matrix(), but nothing worked.
The idiomatic pandas solution would be to write this as a groupby. Example:
# split df into groups by the keys month, lat, and lon
for index, tmp in df.groupby(['month','lat','lon']):
    # tmp is a dataframe where all rows have identical month, lat, and lon values
    # ... do something with the tmp dataframe ...
This has three benefits.
Speed. A groupby will be faster because it only needs to loop over the dataframe once, rather than searching the whole dataframe for everything matching the first group, then searching for the second group, etc.
Simplicity.
Robustness. From a robustness perspective, if a dataframe doesn't have, for example, any rows matching "month=1,lat=0.0,lon=32.75", then it will not create that group.
More information: User guide on grouping
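To make the robustness point concrete, here is a minimal sketch. It assumes a small made-up frame with the same four index levels as the question (the aod values are illustrative) and computes a per-group mean. Combinations that never occur in the data, such as lon 32.75, simply produce no group, so no KeyError can arise.

```python
import pandas as pd

# Illustrative data with the same index layout as the question
df = pd.DataFrame({
    "year": [2003, 2006, 2003],
    "month": [1, 1, 2],
    "lat": [0.0, 0.0, 0.0],
    "lon": [34.0, 34.0, 34.0],
    "aod": [0.032, 0.114, 0.050],
}).set_index(["year", "month", "lat", "lon"])

means = {}
# Group on the index levels; only (month, lat, lon) combinations present in df appear
for (month, lat, lon), tmp in df.groupby(level=["month", "lat", "lon"]):
    means[(month, lat, lon)] = tmp["aod"].mean()

print(means)
print((1, 0.0, 32.75) in means)  # this combination is absent from the data
```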
Remark about groupby aggregation functions
You'll also sometimes see groupby used with aggregation functions. For example, suppose you wanted to get the sum of each column within each group.
>>> l = [[1, 2, 3], [1, None, 4], [2, 1, 3], [1, 2, 2]]
>>> df = pd.DataFrame(l, columns=["a", "b", "c"])
>>> df.groupby(by=["b"]).sum()
a c
b
1.0 2 3
2.0 2 5
These aggregation functions are faster and easier to use, but sometimes I need something that is custom and unusual, so I'll write a loop. But if you're doing something common, like getting the average of a group, consider looking for an aggregation function.
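As a hedged sketch of mixing the two styles, named aggregation lets a built-in such as "mean" sit next to a small custom function in a single agg() call. The column names mean_aod and spread, and the data, are made up for illustration.

```python
import pandas as pd

# Illustrative data only
df = pd.DataFrame({
    "month": [1, 1, 2, 2],
    "aod": [0.1, 0.3, 0.2, 0.6],
})

summary = df.groupby("month")["aod"].agg(
    mean_aod="mean",                       # built-in aggregation, referenced by name
    spread=lambda s: s.max() - s.min(),    # custom per-group calculation
)
print(summary)
```

If the custom logic grows beyond a one-liner, falling back to an explicit loop over the groups, as in the answer above, is still a reasonable choice.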
I have a text file that contains strings, floating point numbers and integers, separated by double spaces.
cat input.txt
nms val pet dzl
sdt 2.5 3.5 1
tyu 2.8 7.5 5
I want to load the text file and assign the values of each row to new variables in order to perform some task inside the loop.
My trial script
import numpy as np
import pandas as pd
main_file=np.loadtxt("input.txt")
for file in main_file:
a= should be sdt
b= should be 2.5
c= should be 3.5
d= should be 1
Similarly, for the second row I want to do the same, so that
a= should be tyu
b= should be 2.8
c= should be 7.5
d= should be 5
Error: ValueError: could not convert string to float: 'sdt'
How can I fix this?
I guess in your example a combination of open with readlines and split on whitespace does the job.
# Using readlines()
with open('input.txt', 'r') as file1:
    lines = file1.readlines()
for i, line in enumerate(lines):
    if i == 0:
        continue  # skip the header row
    # split() with no argument splits on any run of whitespace and drops the trailing newline
    a, b, c, d = line.split()
    print(f'In line {i} a is {a}, b is {b}, c is {c} and d is {d}')
This is the output:
In line 1 a is sdt, b is 2.5, c is 3.5 and d is 1
In line 2 a is tyu, b is 2.8, c is 7.5 and d is 5
If you have more than 4 columns, you can use split() and apply this to one variable. This will be a list and you can select the items by zero-based counting.
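A small sketch of that list-based variant, assuming the same column layout as the question (one name, two floats, one integer). split() returns strings, so the numeric fields are converted explicitly.

```python
# One data line as it would come from readlines(), including the newline
line = "sdt  2.5  3.5  1\n"

fields = line.split()                      # list of strings, whitespace removed
name = fields[0]                           # zero-based indexing into the list
values = [float(x) for x in fields[1:3]]   # the two floating point columns
count = int(fields[3])                     # the integer column

print(name, values, count)
```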
In [244]: cat test.csv
nms val pet dzl
sdt 2.5 3.5 1
tyu 2.8 7.5 5
An easy way to load a csv is with pandas - I use the engine and sep to handle the white space separator (rather than the default comma):
In [245]: df = pd.read_csv('test.csv', sep=r'\s+', engine='python')
In [246]: df
Out[246]:
nms val pet dzl
0 sdt 2.5 3.5 1
1 tyu 2.8 7.5 5
Since you tagged dataframe and pandas, I'll assume you can take it from there.
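For completeness, a minimal sketch of one way to take it from there: iterating over the rows with itertuples and unpacking each row into a, b, c and d, as the question asks. The file contents are supplied through an in-memory buffer here, which is an assumption made only so the example runs on its own.

```python
import io
import pandas as pd

# Same contents as the file in the question, held in memory for the example
text = "nms  val  pet  dzl\nsdt  2.5  3.5  1\ntyu  2.8  7.5  5\n"
df = pd.read_csv(io.StringIO(text), sep=r"\s+")

rows = []
# index=False leaves out the row index, so each tuple has exactly four fields
for a, b, c, d in df.itertuples(index=False):
    rows.append((a, b, c, d))
    print(a, b, c, d)
```

Unlike the plain split() approach, the numeric columns arrive already converted to numbers.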
genfromtxt can read it as well, but for strings that aren't floats it inserts a nan:
In [247]: data = np.genfromtxt('test.csv')
In [248]: data
Out[248]:
array([[nan, nan, nan, nan],
[nan, 2.5, 3.5, 1. ],
[nan, 2.8, 7.5, 5. ]])
np.loadtxt given the same thing raises errors because it can't convert those strings to float. That should be clear from the docs.
With a few more parameters you can get a nice structured array:
In [251]: data = np.genfromtxt('test.csv', dtype=None, names=True, encoding=None)
In [252]: data
Out[252]:
array([('sdt', 2.5, 3.5, 1), ('tyu', 2.8, 7.5, 5)],
dtype=[('nms', '<U3'), ('val', '<f8'), ('pet', '<f8'), ('dzl', '<i8')])
data['val'] gives all the val column. Or you can iterate on the rows with:
In [258]: for row in data:
...: print(row)
...:
('sdt', 2.5, 3.5, 1)
('tyu', 2.8, 7.5, 5)
A similar structured array from loadtxt:
In [262]: data = np.loadtxt('test.csv', dtype='str,f,f,i', skiprows=1,encoding=None)
In [263]: data
Out[263]:
array([('', 2.5, 3.5, 1), ('', 2.8, 7.5, 5)],
dtype=[('f0', '<U'), ('f1', '<f4'), ('f2', '<f4'), ('f3', '<i4')])
With base readlines you can get a list of strings for each row, and parse those as you want:
In [264]: with open('test.csv','r') as f: lines = f.readlines()
In [265]: lines
Out[265]: ['nms val pet dzl\n', 'sdt 2.5 3.5 1\n', 'tyu 2.8 7.5 5\n']
In [266]: lines[1]
Out[266]: 'sdt 2.5 3.5 1\n'
In [267]: lines[1].split()
Out[267]: ['sdt', '2.5', '3.5', '1']
In [268]: a,b,c,d = lines[1].split()
In [269]: b
Out[269]: '2.5'
I would like to vectorize an operation on my DataFrame with NumPy arrays, but I get an error.
Here is the code:
First I initialize my DataFrame
df2 = pd.DataFrame({'A': 1.,
'B': pd.Timestamp('20130102'),
'C': pd.Series(1, index=list(range(4)), dtype='float32'),
'D': [True,False,True,False],
'E': pd.Categorical(["test", "Draft", "test", "Draft"]),
'F': 'foo'})
df2
output:
A B C D E F
0 1.0 2013-01-02 1.0 True test foo
1 1.0 2013-01-02 1.0 False Draft foo
2 1.0 2013-01-02 1.0 True test foo
3 1.0 2013-01-02 1.0 False Draft foo
Here I define the function to apply to dataframe's columns
def IsBillingValid(xE,yBilling):
    if(xE not in ['Draft','Cancelled'] and yBilling==True): #Order Edited
        return True
    else:
        return False
Here I launch my function
df2['BillingPostalCode_Det_StageName_Det']=IsBillingValid(df2['E'].values,df2['D'].values)
Here is the Error:
output:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<command-2946139111570059> in <module>
16 return False
17
---> 18 df2['BillingPostalCode_Det_StageName_Det']=IsBillingValid(df2['E'].values,df2['D'].values)
19
<command-2041881674588848> in IsBillingValid(xStageName, yBilling)
207 def IsBillingValid(xStageName,yBilling):
208
--> 209 if(xStageName not in ['Draft','Cancelled'] and yBilling==True): #Order Edited
210 return True
211
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
Thanks for your help
You don't need apply, especially when you want a vectorized operation.
Use pandas.Series.isin:
df2['BillingPostalCode_Det_StageName_Det'] = ~df2["E"].isin({"Draft", "Cancelled"}) & df2["D"]
print(df2)
Output:
A B C D E F BillingPostalCode_Det_StageName_Det
0 1.0 2013-01-02 1.0 True test foo True
1 1.0 2013-01-02 1.0 False Draft foo False
2 1.0 2013-01-02 1.0 True test foo True
3 1.0 2013-01-02 1.0 False Draft foo False
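The original function fails because `not in` and `and` expect a single true/false value, while whole arrays were passed in. As a minimal sketch on a cut-down frame (only the D and E columns, and a hypothetical helper name is_billing_valid), the vectorized expression can be checked against an element-by-element version of the same rule:

```python
import pandas as pd

# Cut-down version of the DataFrame from the question
df2 = pd.DataFrame({
    "D": [True, False, True, False],
    "E": ["test", "Draft", "test", "Draft"],
})

def is_billing_valid(stage, billing):
    """Scalar version of the rule: valid when not Draft/Cancelled and billing is set."""
    return bool(stage not in ("Draft", "Cancelled") and billing)

# Vectorized: isin builds a boolean mask, ~ negates it, & combines element-wise
vectorized = ~df2["E"].isin(["Draft", "Cancelled"]) & df2["D"]

# Element-wise reference, one scalar call per row
elementwise = [is_billing_valid(e, d) for e, d in zip(df2["E"], df2["D"])]

print(vectorized.tolist() == elementwise)
```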
So I have a dataset that looks as such
ACCx ACCy ACCz ECG RESP LABEL BINARY
0 0.9554 -0.2220 -0.5580 0.021423 -1.148987 0.0 0
1 0.9258 -0.2216 -0.5538 0.020325 -1.124573 0.0 0
2 0.9082 -0.2196 -0.5392 0.016525 -1.152039 0.0 0
3 0.8974 -0.2102 -0.5122 0.016708 -1.158142 0.0 0
4 0.8882 -0.2036 -0.4824 0.011673 -1.161194 0.0 0
... ... ... ... ... ... ... ...
695 0.9134 -0.1400 0.1074 0.003479 2.299500 7.0 0
696 0.9092 -0.1394 0.0994 0.000778 2.305603 7.0 0
697 0.9084 -0.1414 0.0934 -0.001694 2.297974 7.0 0
698 0.9116 -0.1416 0.0958 -0.003799 2.354431 7.0 0
699 0.9156 -0.1396 0.1022 -0.006546 2.355957 7.0 0
The value of BINARY is 1 where LABEL is 2, as shown below
ACCx ACCy ACCz ECG RESP LABEL BINARY
200 0.8776 -0.1030 -0.2968 -0.011673 -1.222229 2.0 1
201 0.8758 -0.1018 -0.2952 -0.001556 -1.202393 2.0 1
202 0.8760 -0.1030 -0.2918 0.022385 -1.222229 2.0 1
203 0.8786 -0.1038 -0.2950 0.049622 -1.228333 2.0 1
204 0.8798 -0.1050 -0.2930 0.084457 -1.210022 2.0 1
... ... ... ... ... ... ... ...
295 0.8756 -0.1052 -0.2694 -0.106430 -0.883484 2.0 1
296 0.8760 -0.1036 -0.2680 -0.108719 -0.880432 2.0 1
297 0.8760 -0.1056 -0.2638 -0.106750 -0.888062 2.0 1
298 0.8768 -0.1064 -0.2560 -0.099792 -0.889587 2.0 1
299 0.8792 -0.1064 -0.2510 -0.094894 -0.865173 2.0 1
I need to make a scatter plot of the RESP values, but the colour must be different for the points where BINARY is 1.
I used the following code to plot the scatter plot
def plot_coloured(dataframe):
    """
    Function 2: plot_coloured(dataframe)
    Parameters: dataframe: Stress data DataFrame
    Output: Plot
    """
    plt.figure(figsize=(12, 6))
    plt.scatter(x=[i for i in range(0, 700)],
                y=dataframe["RESP"])
And got the following image:
[Image: scatter plot of RESP against the row indices]
I would like to know how I can change the colour of the points on the plot where the value of BINARY is 1.
I have heard about the c argument in plt.scatter(), but I do not know if it helps here.
Use a Boolean mask to create separate dataframes based upon the desired condition, and then plot both dataframes with different colors.
import pandas as pd
import matplotlib.pyplot as plt
data = {'ACCx': [0.9554, 0.9258, 0.9082, 0.8974, 0.8882, 0.9134, 0.9092, 0.9084, 0.9116, 0.9156, 0.8776, 0.8758, 0.876, 0.8786, 0.8798, 0.8756, 0.876, 0.876, 0.8768, 0.8792],
'ACCy': [-0.222, -0.2216, -0.2196, -0.2102, -0.2036, -0.14, -0.1394, -0.1414, -0.1416, -0.1396, -0.103, -0.1018, -0.103, -0.1038, -0.105, -0.1052, -0.1036, -0.1056, -0.1064, -0.1064],
'ACCz': [-0.558, -0.5538, -0.5392, -0.5122, -0.4824, 0.1074, 0.0994, 0.0934, 0.0958, 0.1022, -0.2968, -0.2952, -0.2918, -0.295, -0.293, -0.2694, -0.268, -0.2638, -0.256, -0.251],
'ECG': [0.021422999999999998, 0.020325, 0.016525, 0.016708, 0.011673000000000001, 0.003479, 0.000778, -0.0016940000000000002, -0.0037990000000000003, -0.006546, -0.011673000000000001, -0.001556, 0.022385, 0.049622, 0.084457, -0.10643, -0.10871900000000001, -0.10675, -0.09979199999999999, -0.094894],
'RESP': [-1.148987, -1.124573, -1.152039, -1.158142, -1.161194, 2.2995, 2.305603, 2.297974, 2.354431, 2.355957, -1.222229, -1.202393, -1.222229, -1.228333, -1.210022, -0.883484, -0.880432, -0.8880620000000001, -0.8895870000000001, -0.865173],
'LABEL': [0.0, 0.0, 0.0, 0.0, 0.0, 7.0, 7.0, 7.0, 7.0, 7.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0],
'BINARY': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
df = pd.DataFrame(data)
# create separate dataframes with desired condition
mask = (df.BINARY == 1)
resp_1 = df[mask]
resp_others = df[~mask]
# plot the two dataframes
plt.figure(figsize=(12, 6))
plt.scatter(x=resp_1.index, y=resp_1.RESP, color='g', label='BINARY=1')
plt.scatter(x=resp_others.index, y=resp_others.RESP, label='BINARY!=1')
plt.legend()
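The c argument mentioned in the question can also do this in a single scatter call. The following is a minimal sketch, assuming a shortened made-up frame and an arbitrary choice of colour names: BINARY is mapped to one colour per point and the resulting list is passed as c. The Agg backend line is only there so the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; drop this line for interactive use
import matplotlib.pyplot as plt
import pandas as pd

# Shortened, illustrative data
df = pd.DataFrame({
    "RESP": [-1.15, -1.12, 2.30, -1.22, -0.88],
    "BINARY": [0, 0, 0, 1, 1],
})

# One colour name per row, chosen from the BINARY value
colours = df["BINARY"].map({0: "tab:blue", 1: "tab:green"}).tolist()

fig, ax = plt.subplots(figsize=(12, 6))
points = ax.scatter(df.index, df["RESP"], c=colours)
fig.savefig("resp_scatter.png")
```

The trade-off is that a single call produces one artist, so a legend needs extra work; the two-call version above gives a legend entry per group for free.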