Slicing Pandas series

Slicing Pandas series - python-3.x

I have the following code:
total_csv = pd.read_csv('total.csv',header=0).iloc[:,:]
column28=total_csv ['28']
column27=total_csv ['27']
column26=total_csv ['26']
column25=total_csv ['25']
column24=total_csv ['24']
column23=total_csv ['23']
master_values=(column23,column24,column25,column26,column27,column28)
In [68]:master_values
Out[68]:
(0 6867.488928
Name: 23, dtype: float64, 0 6960.779317
Name: 24, dtype: float64, 0 7007.540137
Name: 25, dtype: float64, 0 7031.11444
Name: 26, dtype: float64, 0 7127.469389
Name: 27, dtype: float64, 0 7408.207806
Name: 28, dtype: float64)
But I want master_values to be (6867.488928,6960.779317,7007.540137,7031.11444,7127.469389,7408.207806).
Currently, the way I read total_csv is the following:
In [69]: total_csv
Out[69]:
z 23 24 25 ...
0 CCS 6867.488928 6960.779317 7031.11444 ...
How could I read master_values to be (6867.488928,6960.779317,7007.540137,7031.11444,7127.469389,7408.207806)?

Are the columnXX variables necessary ?
Maybe just try following:
master_values = pd.read_csv('total.csv',header=0).iloc[0]
and if you need a tuple as indicated by the parentheses you can do it like that:
master_values = tuple(pd.read_csv('total.csv',header=0).iloc[0])

You could try this:
total_csv.to_numpy()[0][0].split(' ')[1:]

Related

how do I use .assign for values in a column

I have a dataframe which looks like this:
date symbol numerator denominator
4522 2021-10-06 PAG.SG 1.0 18
1016 2020-11-23 IPA.V 1.0 5
412 2020-04-17 LRK.AX 1.0 30
1884 2021-06-03 BOUVETO.ST 1.0 1
2504 2021-04-28 VKGYO.IS 1.0 100
3523 2021-07-08 603355.SS 1.0 1
3195 2021-08-23 IDAI 1.0 1
3238 2021-08-19 6690.TWO 1.0 1000
3430 2021-07-19 CAXPD 1.0 10
2642 2021-04-15 035720.KS 1.0 1
dtypes:
date: object
symbol: object
numerator: float64
denominator: int64
I am trying to use pd.assign to assign a classifier to this df in the form of
df = df.assign(category = ['forward' if numerator > denominator else 'reverse' for numerator in df[['numerator', 'denominator']]])
But I'm receiving a TypeError stating:
TypeError: Invalid comparison between dtype=int64 and str
I have tried casting them explicitly, with:
df = df.assign(category = ['forward' if df['numerator'] > df['denominator'] else 'reverse' for df['numerator'] in df])
But receive another TypeError stating:
TypeError: '>' not supported between instances of 'str' and 'int'
Which is confusing because I'm not comparing strings, I'm comparing int and float.
Any help would be greatly appreciated.

You still can do that with np.where
import numpy as np
df = df.assign(category = np.where(df['numerator']>df['denominator'],
'forward',
'reverse')

Adding band description to rioxarray to_raster()

I've seen that one can add band descriptions to a geotiff image using rasterio [1]. How would I do the same thing when saving an array to a raster with rioxarray?
I tried adding the names as coords, but when I save an re-open the raster, the bands are named [1, 2, 3, 4] instead of ['R', 'G', 'B', 'NIR'].
import numpy as np
import xarray as xa
import rioxarray as rioxa
bands = ['R', 'G', 'B', 'NIR']
im_arr = np.random.randint(0, 255, size=(4, 400, 400))
im_save = xa.DataArray(im_arr, dims=('band', 'y', 'x'),
coords={'x': np.arange(0, 400), 'y': np.arange(0, 400),
'band': bands})
path = 'test.tiff'
im_save.rio.to_raster(path)
im_load = rioxa.open_rasterio(path)
print(im_load)
<xarray.DataArray (band: 4, y: 400, x: 400)> [640000 values with dtype=int32]
Coordinates:
band (band) int32 1 2 3 4
y (y) float64 0.0 1.0 2.0 3.0 4.0 ... 396.0 397.0 398.0 399.0
x (x) float64 0.0 1.0 2.0 3.0 4.0 ... 396.0 397.0 398.0 399.0
spatial_ref int32 0 Attributes:
scale_factor: 1.0
add_offset: 0.0
grid_mapping: spatial_ref

You should consider switching from a 3d DataArray to a Dataset with 4 variables, each representing a separate band.
If you name the variables correctly, it should get written to the tiff:
import numpy as np
import xarray as xa
import rioxarray as rioxa
bands = ['R', 'G', 'B', 'NIR']
xa_dataset = xa.Dataset()
for band in bands:
xa_dataset[band] = xa.DataArray(np.random.randint(0, 255, (400, 400), dtype="uint8"), dims=('y', 'x'),
coords={'x': np.arange(0, 400), 'y': np.arange(0, 400)})
# see the structure
print(xa_dataset)
# <xarray.Dataset>
# Dimensions: (x: 400, y: 400)
# Coordinates:
# * x (x) int64 0 1 2 3 4 5 6 7 8 ... 391 392 393 394 395 396 397 398 399
# * y (y) int64 0 1 2 3 4 5 6 7 8 ... 391 392 393 394 395 396 397 398 399
# Data variables:
# R (y, x) uint8 18 41 126 79 64 215 105 ... 29 137 243 23 150 23 224
# G (y, x) uint8 1 18 90 195 45 8 150 68 ... 96 194 22 58 118 210 198
# B (y, x) uint8 125 90 165 226 153 253 212 ... 162 217 221 162 18 17
# NIR (y, x) uint8 161 195 149 168 40 182 146 ... 18 114 38 119 23 110 26
# write to disk
xa_dataset.rio.to_raster("test.tiff")
# load
im_load = rioxa.open_rasterio('test.tiff')
print(im_load)
# <xarray.DataArray (band: 4, y: 400, x: 400)>
# [640000 values with dtype=uint8]
# Coordinates:
# * band (band) int64 1 2 3 4
# * y (y) float64 0.0 1.0 2.0 3.0 4.0 ... 396.0 397.0 398.0 399.0
# * x (x) float64 0.0 1.0 2.0 3.0 4.0 ... 396.0 397.0 398.0 399.0
# spatial_ref int64 0
# Attributes:
# scale_factor: 1.0
# add_offset: 0.0
# long_name: ('R', 'G', 'B', 'NIR')
# grid_mapping: spatial_ref
You can see the band names are now included in the attributes as long_name.
Running gdalinfo, you can see the band description has been set:
Driver: GTiff/GeoTIFF
Files: test.tiff
Size is 400, 400
Origin = (-0.500000000000000,-0.500000000000000)
Pixel Size = (1.000000000000000,1.000000000000000)
Image Structure Metadata:
INTERLEAVE=PIXEL
Corner Coordinates:
Upper Left ( -0.5000000, -0.5000000)
Lower Left ( -0.500, 399.500)
Upper Right ( 399.500, -0.500)
Lower Right ( 399.500, 399.500)
Center ( 199.500, 199.500)
Band 1 Block=400x5 Type=Byte, ColorInterp=Red
Description = R
Mask Flags: PER_DATASET ALPHA
Band 2 Block=400x5 Type=Byte, ColorInterp=Green
Description = G
Mask Flags: PER_DATASET ALPHA
Band 3 Block=400x5 Type=Byte, ColorInterp=Blue
Description = B
Mask Flags: PER_DATASET ALPHA
Band 4 Block=400x5 Type=Byte, ColorInterp=Alpha
Description = NIR

CRS error while clipping rioxarray to shapefile

I'm trying to clip a rioxarray dataset to a shapefile, but get the following error:
> data_clipped = data.rio.clip(shape.geometry.apply(mapping))
MissingCRS: CRS not found. Please set the CRS with 'set_crs()' or 'write_crs()'. Data variable: precip
This error seems straightforward, but I can't figure out which CRS needs to be set. Both the dataset and the shapefile have CRS values that rio can find:
> print(data.rio.crs)
EPSG:4326
> print(shape.crs)
epsg:4326
The dataarray within the dataset, called 'precip', does not have a CRS, but it also doesn't seem to respond to the set_crs() command:
> print(data.precip.rio.crs)
None
> data.precip.rio.set_crs(data.rio.crs)
> print(data.precip.rio.crs)
None
What am I missing here?
For reference, rioxarray set_crs() documentation - this shows set_crs() working on data arrays, unlike my experience with data.precip
My data, in case I have something unusual:
> print(data)
<xarray.Dataset>
Dimensions: (x: 541, y: 411)
Coordinates:
* y (y) float64 75.0 74.9 74.8 74.7 74.6 ... 34.3 34.2 34.1 34.0
* x (x) float64 -12.0 -11.9 -11.8 -11.7 ... 41.7 41.8 41.9 42.0
time object 2020-01-01 00:00:00
spatial_ref int64 0
Data variables:
precip (y, x) float64 nan nan nan ... 1.388e-17 1.388e-17 1.388e-17
Attributes:
Conventions: CF-1.6
history: 2021-01-05 01:36:52 GMT by grib_to_netcdf-2.16.0: /opt/ecmw...
> print(shape)
ID name orgn_name geometry
0 Albania Shqipëria MULTIPOLYGON (((19.50115 40.96230, 19.50563 40...
1 Andorra Andorra POLYGON ((1.43992 42.60649, 1.45041 42.60596, ...
2 Austria Österreich POLYGON ((16.00000 48.77775, 16.00000 48.78252...

This issue is resolved if the set_crs() is used in the same command as the clip operation:
data_clipped = data.precip.rio.set_crs('WGS84').rio.clip(shape.geometry.apply(mapping))

Series to dictionary

I have the following code and output
mean = dataframe.groupby('LABEL')['RESP'].mean()
minimum = dataframe.groupby('LABEL')['RESP'].min()
maximum = dataframe.groupby('LABEL')['RESP'].max()
std = dataframe.groupby('LABEL')['RESP'].std()
df = [mean, minimum, maximum]
And the following output
[LABEL
0.0 -1.193420
1.0 0.713425
2.0 -1.066513
3.0 -0.530640
4.0 -2.130600
6.0 0.084747
7.0 1.190506
Name: RESP, dtype: float64,
LABEL
0.0 -1.396179
1.0 -0.233459
2.0 -1.631165
3.0 -1.271057
4.0 -2.543640
6.0 -0.418091
7.0 -0.004578
Name: RESP, dtype: float64,
LABEL
0.0 0.042247
1.0 0.295534
2.0 0.128233
3.0 0.243975
4.0 0.088077
6.0 0.085615
7.0 0.693196
Name: RESP, dtype: float64
]
However I want the output to be a dictionary as
{label_value: [mean, min, max, std_dev]}
For example
{1: [1, 0, 2, 1], 2: [0, -1, 1, 1], ... }

I'm assuming your starting Dataframe is equivalent to one I've synthesised.
calculate all of the aggregate values in one call to aggregate. rounded values so output fits in this answer
reset_index() on aggregate then to_dict()
list comprehension to reformat dict to your specification
df = pd.DataFrame([[l, random.random()] for l in range(8) for k in range(500)], columns=["LABEL","RESP"])
d = df.groupby("LABEL")["RESP"].agg([np.mean, np.min, np.max, np.std]).round(4).reset_index().to_dict(orient="records")
{e["LABEL"]:[e["mean"],e["amin"],e["amax"],e["std"]] for e in d}
output
{0: [0.5007, 0.0029, 0.997, 0.2842],
1: [0.4967, 0.0001, 0.9993, 0.2855],
2: [0.4742, 0.0003, 0.9931, 0.2799],
3: [0.5175, 0.0062, 0.9996, 0.2978],
4: [0.4909, 0.0018, 0.9952, 0.2912],
5: [0.4787, 0.0077, 0.9976, 0.291],
6: [0.4878, 0.0009, 0.9942, 0.2806],
7: [0.4989, 0.0066, 0.9982, 0.278]}

Why is my data not recognized as time series?

I have daily (day) data on calories intake for one person (cal2), which I get from a Stata dta file.
I run the code below:
import pandas as pd
import numpy as np
import matplotlib.pylab as plt
from pandas import read_csv
from matplotlib.pylab import rcParams
d = pd.read_stata('time_series_calories.dta', preserve_dtypes=True,
index = 'day', convert_dates=True)
print(d.dtypes)
print(d.shape)
print(d.index)
print(d.head)
plt.plot(d)
This is how the data looks like:
0 2002-01-10 3668.433350
1 2002-01-11 3652.249756
2 2002-01-12 3647.866211
3 2002-01-13 3646.684326
4 2002-01-14 3661.941406
5 2002-01-15 3656.951660
The prints reveal the following:
day datetime64[ns]
cal2 float32
dtype: object
(251, 2)
Int64Index([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9,
...
241, 242, 243, 244, 245, 246, 247, 248, 249, 250],
dtype='int64', length=251)
And here is the problem - the data should identify as dtype='datatime64[ns]'.
However, it clearly does not. Why not?

There is a discrepancy between the code provided, the data and the types shown.
This is because irrespective of the type of cal2, the index = 'day' argument
in pd.read_stata() should always render day the index, albeit not as the
desired type.
With that said, the problem can be reproduce as follows.
First, create the dataset in Stata:
clear
input double day float cal2
15350 3668.433
15351 3652.25
15352 3647.866
15353 3646.684
15354 3661.9414
15355 3656.952
end
format %td day
save time_series_calories
describe
Contains data from time_series_calories.dta
obs: 6
vars: 2
size: 72
----------------------------------------------------------------------------------------------------
storage display value
variable name type format label variable label
----------------------------------------------------------------------------------------------------
day double %td
cal2 float %9.0g
----------------------------------------------------------------------------------------------------
Sorted by:
Second, load the data in Pandas:
import pandas as pd
d = pd.read_stata('time_series_calories.dta', preserve_dtypes=True, convert_dates=True)
print(d.head)
day cal2
0 2002-01-10 3668.433350
1 2002-01-11 3652.249756
2 2002-01-12 3647.866211
3 2002-01-13 3646.684326
4 2002-01-14 3661.941406
5 2002-01-15 3656.951660
print(d.dtypes)
day datetime64[ns]
cal2 float32
dtype: object
print(d.shape)
(6, 2)
print(d.index)
Int64Index([0, 1, 2, 3, 4, 5], dtype='int64')
In order to change the index as desired, you can use pd.set_index():
d = d.set_index('day')
print(d.head)
cal2
day
2002-01-10 3668.433350
2002-01-11 3652.249756
2002-01-12 3647.866211
2002-01-13 3646.684326
2002-01-14 3661.941406
2002-01-15 3656.951660
print(d.index)
DatetimeIndex(['2002-01-10', '2002-01-11', '2002-01-12', '2002-01-13',
'2002-01-14', '2002-01-15'],
dtype='datetime64[ns]', name='day', freq=None)
If day is a string in the Stata dataset, then you can do the following:
d['day'] = pd.to_datetime(d.day)
d = d.set_index('day')

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

Slicing Pandas series - python-3.x

Are the columnXX variables necessary ? Maybe just try following: master_values = pd.read_csv('total.csv',header=0).iloc[0] and if you need a tuple as indicated by the parentheses you can do it like that: master_values = tuple(pd.read_csv('total.csv',header=0).iloc[0])

You could try this: total_csv.to_numpy()[0][0].split(' ')[1:]

Related

how do I use .assign for values in a column

Adding band description to rioxarray to_raster()

CRS error while clipping rioxarray to shapefile

Series to dictionary

Why is my data not recognized as time series?

Categories

Resources