Generating random samples from the data - python-3.x

I have 506 points in the data set. I need to generate a random sample from this data: first I select 303 points without replacement, and then I select the remaining 203 points from these 303 points.
I have written the following code:
def generating_samples(input_data, target_data):
    selected_rows = np.random.choice(len(input_data), 303)
    replacing_rows = np.random.choice(selected_rows,203)
    selected_columns = np.random.choice(3,13,1)
    sample_data = input_data[selected_rows[:,None],selected_columns]
    target_of_sample_data = target_data[selected_rows]
    #replicating data
    replicated_sample_data = sample_data[replacing_rows]
    target_of_replicated_sample_data = target_data[replacing_rows]
    #concatenating data
    sampled_input_data = np.vstack(sample_data, replicated_sample_data)
    target_of_sample_data = target_of_sample_data.reshape(-1,1)
    target_of_replicated_sample_data = target_of_replicated_sample_data.reshape(-1,1)
    sampled_target_data = np.vstack(target_of_sample_data,target_of_replicated_sample_data)
    return sampled_input_data , sampled_target_data, selected_rows,selected_columns

def grader_samples(a,b,c,d):
    length = (len(a)==506 and len(b)==506)
    sampled = (len(a)-len(set([str(i) for i in a]))==203)
    rows_length = (len(c)==303)
    column_length= (len(d)>=3)
    assert(length and sampled and rows_length and column_length)
    return True

a,b,c,d = generating_samples(x, y)
But I am getting the following error:
IndexError Traceback (most recent call last)
<ipython-input-14-ca772632e834> in <module>
7 return True
----> 9 a,b,c,d = generating_samples(x, y)
10 grader_samples(a,b,c,d)
<ipython-input-13-bcf904f160e5> in generating_samples(input_data, target_data)
14 #replicating data
---> 15 replicated_sample_data = sample_data[replacing_rows]
16 target_of_replicated_sample_data = target_data[replacing_rows]
IndexError: index 391 is out of bounds for axis 0 with size 303

Use replicated_sample_data = input_data[replacing_rows], because replacing_rows holds row indices of the original dataset.
sample_data is only a 303-row subset of the original dataset, so indexing it with those original indices (which can be as large as 505) raises the out-of-bounds IndexError.

The error is caused by the indexing and by non-unique samples.
Use:
selected_rows = np.random.choice(len(input_data), 303, replace=False)
With replace=False you get 303 unique row indices, which can be used to extract rows from input_data.
For the replicated sample, select
replacing_rows = np.random.choice(len(selected_rows),203,replace=False)
replacing_rows now holds unique positions within the 303-row sample (values from 0 to 302), not indices of the original dataset.
Now the replicated rows can be taken from the sample dataset (and the matching targets from target_of_sample_data in the same way):
replicated_sample_data = sample_data[replacing_rows]
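Putting both answers together, here is a minimal sketch of the whole function. It is not the only way to do this. The parameter names n_unique, n_repeat and rng are my own additions, and the column selection (a random number of columns, at least 3) is an assumption based on the grader's len(d) >= 3 check. The synthetic x and y only stand in for the real data.

```python
import numpy as np

def generating_samples(input_data, target_data, n_unique=303, n_repeat=203, rng=None):
    """Pick n_unique distinct rows, then repeat n_repeat of those rows."""
    rng = np.random.default_rng() if rng is None else rng
    n_rows, n_cols = input_data.shape

    # Indices into the ORIGINAL data, all distinct.
    selected_rows = rng.choice(n_rows, size=n_unique, replace=False)
    # Positions into the SAMPLE (0 .. n_unique - 1), so they can never be out of bounds.
    repeat_positions = rng.choice(n_unique, size=n_repeat, replace=False)
    # Assumption: keep a random number of columns, at least 3.
    n_selected_cols = rng.integers(3, n_cols + 1)
    selected_columns = rng.choice(n_cols, size=n_selected_cols, replace=False)

    sample_data = input_data[selected_rows[:, None], selected_columns]
    sample_target = target_data[selected_rows]

    # np.vstack / np.concatenate take ONE sequence of arrays, hence the inner tuple.
    sampled_input = np.vstack((sample_data, sample_data[repeat_positions]))
    sampled_target = np.concatenate(
        (sample_target, sample_target[repeat_positions])
    ).reshape(-1, 1)
    return sampled_input, sampled_target, selected_rows, selected_columns

# Synthetic stand-in for the real 506 x 13 data set.
rng = np.random.default_rng(0)
x = rng.random((506, 13))
y = rng.random(506)
a, b, c, d = generating_samples(x, y, rng=rng)
```

Note that the repeated targets are taken from sample_target, not from target_data, so inputs and targets stay paired.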


How to extract many groups of cells separated by a specified number of rows in excel using python and write it to an other file?

I have a CSV file with around 58 million cells of numerical data. I want to extract blocks of 16 cells (4 rows x 4 columns) whose starting rows are 49 rows apart.
Let me describe it clearly.
[Image: The data I need to extract]
The above image shows the first set of data to be extracted (rows 23 to 26, columns 92 to 95). This data has to be written to another CSV file (preferably as a single row).
Then I move down 49 rows (to row 72) and extract the next 4 rows x 4 columns, as shown in the image below.
[Image: Next set of data]
Similarly, I need to keep going until I reach the end of the file.
[Image: Third set]
The next set is shown in the image above.
I have to keep going until I reach the end of the file and extract thousands of such blocks.
I have written some code for this, but it's not working and I don't know where the mistake is. I attach it here.
import pandas as pd
import numpy
df = pd.read_csv('TS_trace31.csv')
# print(numpy.shape(df))
arrY = []
ex = 0
for i in range(len(df)):
    if i == 0:
        for j in range(4):
            l = (df.iloc[j+21+i*(49), 91:95]).tolist()
            arrY.append(l)
    else:
        for j in range(4):
            if j+22+i*(49) >= len(df):
                ex = 1
                break
            # print(j)
            l = (df.iloc[j+21+i*(49), 91:95]).tolist()
            arrY.append(l)
    if ex == 1:
        break
# print(arrY)
a = []
for i in range(len(arrY) - 3):
    p = arrY[i]+arrY[i+1]+arrY[i+2]+arrY[i+3]
    a.append(p)
numpy.savetxt('myfile.csv', a, delimiter=',')
Using the above code, I didn't get the result I wanted.
Please help with this and correct where I have gone wrong.
I couldn't attach my csv file here, Please try to use any sample sheet that you have or can create a simple one.
Thanks in advance! Have a great day.
I don't know exactly what your code is doing, but I wrote my own:
import csv
from itertools import chain

CSV_PATH = 'TS_trace31.csv'
new_data = []
with open(CSV_PATH, 'r') as csvfile:
    reader = csv.reader(csvfile)
    # row_num for storing big jumps e.g. 23, 72, 121 ...
    row_num = 23
    # n for storing the group number 0 - 3
    # with n we can find the 23, 24, 25, 26
    n = 0
    # row_group for storing every 4 group rows
    row_group = []
    # looping over every row in main file
    for row in reader:
        if reader.line_num == row_num + n:
            # for the first time this is going to be 23 + 0
            # then we add one number to the n
            # so the next cycle will be 24 and so on
            n += 1
            # add each row to its group (columns 92 to 95)
            row_group.append(row[91:95])
            # check if we are at the end of the group e.g. 26
            if n == 4:
                # reset the group number
                n = 0
                # add the jump to main row number
                row_num += 49
                # combine all the row_group to a single row
                new_data.append(list(chain.from_iterable(row_group)))
                # clear the row_group for next set of rows
                row_group = []
# and finally write all the rows in a new file
with open('myfile.csv', 'w', newline='') as new_csvfile:
    writer = csv.writer(new_csvfile)
    writer.writerows(new_data)
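If the file fits in memory, the same extraction can also be sketched with NumPy slicing. This is only an illustration: extract_blocks and its parameters are hypothetical names, the indices are zero-based positions in a headerless array (so spreadsheet row 23 / column 92 becomes 22 / 91), and the small synthetic array stands in for the real file.

```python
import numpy as np

def extract_blocks(values, first_row=22, first_col=91, size=4, step=49):
    """Return one flattened size x size block per row of the result."""
    values = np.asarray(values)
    # Start rows 22, 71, 120, ... for which a full block still fits.
    starts = range(first_row, values.shape[0] - size + 1, step)
    blocks = [
        values[r:r + size, first_col:first_col + size].ravel()
        for r in starts
    ]
    return np.array(blocks)

# Synthetic stand-in: 200 rows x 100 columns, each cell holds row * 100 + col.
demo = np.arange(200 * 100).reshape(200, 100)
demo_blocks = extract_blocks(demo)
```

Each row of the result holds the 16 values of one block, so it can be written out in one call, for example with numpy.savetxt('myfile.csv', demo_blocks, delimiter=',').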

perform principal component analysis on moving window of data

This was partly answered by @WhoIsJack but not completely solved, given the errors I get. Basically, I'm trying to perform principal component analysis on a rolling window of data. For example, I'd run PCA on the last 200 days in the df, move forward 1 day and do PCA again on the last 200 days. So as you move forward each day, you'd include the next day's measurement and drop the oldest measurement.
You have a random df:
data = np.random.random(size=(1000,10))
df = pd.DataFrame(data)
Here's window size:
window = 200
Initialize an empty df of appropriate size for the output
df_pca = pd.DataFrame( np.zeros((data.shape[0] - window + 1, data.shape[1])) )
Define PCA fit-transform function. Instead of attempting to return the result, it is written into the previously created output array.
def rolling_pca(window_data):
    pca = PCA()
    transf = pca.fit_transform(df.iloc[window_data])
    df_pca.iloc[int(window_data[0])] = transf[0,:]
    return True
Create a df containing row indices for the workaround
df_idx = pd.DataFrame(np.arange(df.shape[0]))
Use rolling to apply the PCA function
_ = df_idx.rolling(window).apply(rolling_pca)
The results should be contained here:
However, when I generate the results only the first row of data looks to contain PCAs while the rest of the rows are zero.
I also tried the following function:
def rolling_pca(x, window):
    r = x.rolling(window=window)
    pca = PCA(3)
    y =
    z = pca.fit_transform(y)
    return z
window = 200
Which I thought would generate a new df with rolling PCAs:
data = df.apply(rolling_pca, window=window)
But I got the following error: setting an array element with a sequence.
I've also tried calculating it manually with the code below. I get: "unsupported operand type(s) for /: 'Rolling' and 'int'"
def rolling_pca(x, window):
    # create rolling dataframe
    r = x.rolling(window=window)
    # demand data
    X = np.matrix(r)
    X_dm = X - np.mean(X, axis = 0)
    #Eigenvalue decomposition (of covariance matrix)
    Cov_X = np.cov(X_dm, rowvar = False)
    eigen = np.linalg.eig(Cov_X)
    eig_values_X = np.matrix(eigen[0])
    eig_vectors_X = np.matrix(eigen[1])
    #transformed data
    Y_dm = X_dm * eig_vectors_X
    #assign transformed yields
    yields_trans = Y_dm.copy()
    # get PCs
    pc1_yields = x.copy()
    pcas = yields_trans[:,0:3]
    return pcas

#assign window length
window = 300
rolling_pca(data, window=window)
I also tried the code below. I get the error: "LinAlgError: 0-dimensional array given. Array must be at least two-dimensional"
def pca(x):
    # demand data
    X = np.matrix(x.values)
    X_dm = X - np.mean(X, axis = 0)
    #Eigenvalue decomposition (of covariance matrix)
    Cov_X = np.cov(X_dm, rowvar = False)
    eigen = np.linalg.eig(Cov_X)
    eig_values_X = np.matrix(eigen[0])
    eig_vectors_X = np.matrix(eigen[1])
    #transformed data
    Y_dm = X_dm * eig_vectors_X
    #assign transformed yields
    yields_trans = Y_dm.copy()
    # get 3 PCs
    pcas = yields_trans[:,0:3]
    final_pcas = pd.DataFrame(pcas)
    return final_pcas
Any thoughts would be appreciated!
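For reference, one way to sidestep rolling().apply() entirely is a plain loop over window start positions. The sketch below is only that, a sketch: it uses NumPy's SVD instead of scikit-learn's PCA (a swapped-in technique, chosen to keep the example dependency-free), the name rolling_pca_last_scores is hypothetical, and keeping only the scores of the newest row in each window is my assumption about the desired output. The signs of principal components are arbitrary, so scores from neighbouring windows may flip sign.

```python
import numpy as np

def rolling_pca_last_scores(data, window, n_components=3):
    """For each window, fit PCA and keep the scores of the window's newest row."""
    data = np.asarray(data, dtype=float)
    n_windows = data.shape[0] - window + 1
    out = np.empty((n_windows, n_components))
    for start in range(n_windows):
        block = data[start:start + window]
        centered = block - block.mean(axis=0)
        # Rows of vt are the principal axes, ordered by explained variance.
        _, _, vt = np.linalg.svd(centered, full_matrices=False)
        # Project the last (newest) observation onto the leading axes.
        out[start] = centered[-1] @ vt[:n_components].T
    return out

# Synthetic stand-in for the data frame's values.
demo_data = np.random.default_rng(0).random((60, 5))
demo_scores = rolling_pca_last_scores(demo_data, window=20)
```

With the 1000 x 10 frame from the question and window = 200, this would return 801 rows, one per window, which could be wrapped in pd.DataFrame(...) if a df is needed.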

ValueError: Shape of passed values is, indices imply

Reposting because I didn't get a response to my first post.
I have the following data:
desc = pd.DataFrame(description, columns =['new_desc'])
257623 the public safety report is compiled from crim...
161135 police say a sea isle city man ordered two pou...
156561 two people are behind bars this morning, after...
41690 pumpkin soup is a beloved breakfast soup in ja...
70092 right now, 15 states are grappling with how be...
... ...
207258 operation legend results in 59 more arrests, i...
222170 see story, 3a
204064 st. louis — missouri secretary of state jason ...
151443 tony lavell jones, 54, of sunset view terrace,...
97367 walgreens, on the other hand, is still going t...
[9863 rows x 1 columns]
I'm trying to find the dominant topic within the documents. When I run the following code:
best_lda_model = lda_desc
data_vectorized = tfidf
lda_output = best_lda_model.transform(data_vectorized)
topicnames = ["Topic " + str(i) for i in range(best_lda_model.n_components)]
docnames = ["Doc " + str(i) for i in range(len(dataset))]
df_document_topic = pd.DataFrame(np.round(lda_output, 2), columns = topicnames, index = docnames)
dominant_topic = np.argmax(df_document_topic.values, axis = 1)
df_document_topic['dominant_topic'] = dominant_topic
I've tried tweaking the code; however, no matter what I change, I get the following traceback:
ValueError Traceback (most recent call last)
c:\python36\lib\site-packages\pandas\core\internals\ in create_block_manager_from_blocks(blocks, axes)
-> 1674 mgr = BlockManager(blocks, axes)
1675 mgr._consolidate_inplace()
c:\python36\lib\site-packages\pandas\core\internals\ in __init__(self, blocks, axes, do_integrity_check)
148 if do_integrity_check:
--> 149 self._verify_integrity()
c:\python36\lib\site-packages\pandas\core\internals\ in _verify_integrity(self)
328 if block.shape[1:] != mgr_shape[1:]:
--> 329 raise construction_error(tot_items, block.shape[1:], self.axes)
330 if len(self.items) != tot_items:
ValueError: Shape of passed values is (9863, 8), indices imply (0, 8)
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
<ipython-input-41-bd470d69b181> in <module>
4 topicnames = ["Topic " + str(i) for i in range(best_lda_model.n_components)]
5 docnames = ["Doc " + str(i) for i in range(len(dataset))]
----> 6 df_document_topic = pd.DataFrame(np.round(lda_output, 2), columns = topicnames, index = docnames)
7 dominant_topic = np.argmax(df_document_topic.values, axis = 1)
8 df_document_topic['dominant_topic'] = dominant_topic
c:\python36\lib\site-packages\pandas\core\ in __init__(self, data, index, columns, dtype, copy)
495 mgr = init_dict({ data}, index, columns, dtype=dtype)
496 else:
--> 497 mgr = init_ndarray(data, index, columns, dtype=dtype, copy=copy)
499 # For data is list-like, or Iterable (will consume into list)
c:\python36\lib\site-packages\pandas\core\internals\ in init_ndarray(values, index, columns, dtype, copy)
232 block_values = [values]
--> 234 return create_block_manager_from_blocks(block_values, [columns, index])
c:\python36\lib\site-packages\pandas\core\internals\ in create_block_manager_from_blocks(blocks, axes)
1679 blocks = [getattr(b, "values", b) for b in blocks]
1680 tot_items = sum(b.shape[0] for b in blocks)
-> 1681 raise construction_error(tot_items, blocks[0].shape[1:], axes, e)
ValueError: Shape of passed values is (9863, 8), indices imply (0, 8)
The desired result is a list of documents for a specific topic. Below is example code and the desired output.
df_document_topic(df_document_topic['dominant_topic'] == 2).head(10)
When I run this code, I get the following traceback
TypeError Traceback (most recent call last)
<ipython-input-55-8cf9694464e6> in <module>
----> 1 df_document_topic(df_document_topic['dominant_topic'] == 2).head(10)
TypeError: 'DataFrame' object is not callable
Below is the desired output
Any help would be greatly appreciated.
The index you're passing as docnames is empty. It is built from dataset as follows:
docnames = ["Doc " + str(i) for i in range(len(dataset))]
So dataset itself must be empty (len(dataset) is 0), which is why the error says the indices imply (0, 8). As a workaround, you can build the Doc indices from the size of lda_output instead:
docnames = ["Doc " + str(i) for i in range(len(lda_output))]
Let me know if this works.
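Here is a small self-contained sketch of that workaround. The array fake_output is made up and only stands in for best_lda_model.transform(data_vectorized), and build_document_topic_table is a hypothetical helper name. It also shows the row filter written with square brackets: the second traceback in the question ('DataFrame' object is not callable) comes from using parentheses, df_document_topic(...), which calls the frame instead of indexing it.

```python
import numpy as np
import pandas as pd

def build_document_topic_table(lda_output):
    """Build the document-topic table with names derived from lda_output itself."""
    lda_output = np.asarray(lda_output)
    n_docs, n_topics = lda_output.shape
    topicnames = ["Topic " + str(i) for i in range(n_topics)]
    # Index length always matches the data because it comes from the same array.
    docnames = ["Doc " + str(i) for i in range(n_docs)]
    table = pd.DataFrame(np.round(lda_output, 2), columns=topicnames, index=docnames)
    table["dominant_topic"] = np.argmax(lda_output, axis=1)
    return table

# Made-up stand-in for best_lda_model.transform(data_vectorized).
fake_output = np.array([
    [0.10, 0.70, 0.20],
    [0.60, 0.30, 0.10],
    [0.20, 0.20, 0.60],
    [0.05, 0.90, 0.05],
])
table = build_document_topic_table(fake_output)
# Square brackets (boolean indexing), not parentheses.
topic_1_docs = table[table["dominant_topic"] == 1].head(10)
```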

Found array with 0 feature(s) (shape=(268215, 0)) while a minimum of 1 is required by StandardScaler

I am solving a problem where I am pulling data of all the ProductIDs and then I iterate through the dataframe to look at unique ProductIDs to perform a set of functions.
Here, item is the ProductID/Item number:
#looping through the big dataframe to get a dataframe pertaining to the unique ID
for item in df2['Item Nbr'].unique():
    # fetch item data
    df = df2.loc[df2['Item Nbr'] == item]
And then I have a set of custom made python functions:
The first pass through the loop (for one ProductID) works fine. When the loop moves on to the next ProductID, I get this error, even though I am certain that the data being pulled is right:
Found array with 0 feature(s) (shape=(268215, 0)) while a minimum of 1 is required by StandardScaler.
This happens although the X_train and y_train shapes are (268215, 6) and (268215,).
Code Snippet : (Extra Information)
It is a huge file to show. But the initial big dataframe has
[362988 rows x 7 columns] - for first product and
[268215 rows x 7 columns] - for second product
Expansion of the code:
The big dataframe with two unique product IDs:
biqQueryData = get_item_data(verbose=True)
Iterate over each unique product ID to extract the subset of the dataframe that pertains to it:
for item in biqQueryData['Item Nbr'].unique():
    df = biqQueryData.loc[biqQueryData['Item Nbr'] == item]
    df_model = model_all_stores(df, item, n_jobs=n_jobs,
The function model_all_stores:
def model_all_stores(df_raw, item_nbr, n_jobs=1, train_model=False,
                     test_model=False, export_model=False, output=False,
    """Models demand for specified item.
    Predict the demand of specified item for all stores. Does not
    filter for predict hidden demand (the function get_hidden_demand
    should be used for this.)
    Output: data frame output
    """
    # ML model hyperparameters
    impute_with = 'median'
    n_estimators = 100
    min_samples_split = 3
    min_samples_leaf = 3
    max_depth = None
    # load data and subset traited and valid
    dfnew = subset_traited_valid(df_raw)
    # get known demand
    df_ma = get_demand(dfnew)
    # impute missing sales data
    median_sales = df_ma['Sales Qty'].median()
    df_ma['Sales Qty'] = df_ma['Sales Qty'].fillna(median_sales)
    # add moving average features
    df_ma = df_ma.sort_values('Gregorian Days')
    window_list = [7 * x for x in [1, 2, 4, 8, 16, 52]]
    for w in window_list:
        grouped = df_ma.groupby('Store Nbr')['Sales Qty'].shift(1)
        rolling = grouped.rolling(window=w, min_periods=1).mean()
        df_ma['MA' + str(w)] = rolling.reset_index(0, drop=True)
    X_full = df_ma.loc[:, 'MA7':].values
    # use full data if not testing/tuning
    rows_for_model = df_ma['Known Demand'].notnull()
    X = df_ma.loc[rows_for_model, 'MA7':].values
    y = df_ma.loc[rows_for_model, 'Known Demand'].values
    X_train, y_train = X, y
    print(X_train.shape, y_train.shape)
    if train_model:
        # instantiate model components
        imputer = Imputer(missing_values='NaN', strategy=impute_with, axis=0)
        scale = StandardScaler()
        pca = PCA()
        forest = RandomForestRegressor(n_estimators=n_estimators,
        # pipeline for model
        pipeline_steps = [('imputer', imputer),
                          ('scale', scale),
                          ('pca', pca),
                          ('forest', forest)]
        regr = Pipeline(pipeline_steps)
        regr.fit(X_train, y_train)
It fails here
Snippet Of data:
biqQueryData (the entire Dataframe)
Subset DF 1:
Subset DF 2:
Any help here would be great! Thank you
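One thing that may be worth checking (a guess from the snippet, not a confirmed diagnosis): rolling.reset_index(0, drop=True) relabels the rolling result as 0..n-1, while the second product's subset keeps its row labels from the big dataframe. Column assignment in pandas aligns on labels, so the MA columns could end up entirely NaN for the second product. As far as I understand it, an imputer that discards all-NaN columns would then pass zero features on to StandardScaler. The tiny frame below is made up purely to show the alignment effect.

```python
import numpy as np
import pandas as pd

# Made-up miniature of a second product's subset: its index labels
# continue from the big frame (5..9) instead of starting at 0.
df = pd.DataFrame({"Sales Qty": [1.0, 2.0, 3.0, 4.0, 5.0]}, index=range(5, 10))

rolling = df["Sales Qty"].shift(1).rolling(window=2, min_periods=1).mean()

# Relabelled 0..4: none of these labels exist in df, so every row becomes NaN.
df["MA_misaligned"] = rolling.reset_index(drop=True)
# Original labels kept: values land on the rows they were computed for.
df["MA_aligned"] = rolling

all_nan_misaligned = bool(df["MA_misaligned"].isna().all())
n_valid_aligned = int(df["MA_aligned"].notna().sum())
```

If this is what happens, printing df_ma.loc[:, 'MA7':].isna().all() inside the loop for the second product should show True for every MA column.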

Calculate raster landscape proportion (percentage) within multiple overlapping polygons (shapefiles)?

I think the easiest way is to extract the raster values within each polygon and calculate the proportion. Is it possible to do so without reading the entire grid as an array?
I have 23 yearly global classified rasters (resolution = 0.00277778 degree) from 1992 - 2015 and a polygon vector with 354 shapes (which overlap in some parts). Because of the overlap (self-intersection) it is not easy to work with them as a raster. Both are projected in "+proj=longlat +datum=WGS84 +no_defs".
The raster consists of classes from 10 - 220
The polygon has ABC_ID from 1 - 449
For one year it looks like:
[Image: classification and shape example]
I need to create a table like:
[Image: example table]
I already tried to achieve this with:
Zonal Statistics
Pk tools (extract vector sample from raster)
LecoS (Overlay raster metrics)
"Cross-Classification and Tabulation" of SAGA GIS (problems with extent)
FRAGSTATS (I was not able to load the shp file)
Raster --> Extraction --> Clipper does not work (Ring Self-intersection)
I have heard that Tabulate Area from ArcMap can do this, but it would be nice if there were an open source solution.
I have managed to do it with Python, using "rasterio" and "geopandas".
It now creates a table like:
[Image: example result]
Since I did not find anything similar to the extract command in R's "raster" package, it took more than 2 lines, but instead of calculating half the night it now takes only 2 minutes for one year.
The results are the same. It is based on the ideas of "gene" from ""
import numpy as np
import rasterio
from rasterio.mask import mask
from shapely.geometry import mapping
import geopandas as gpd
import pandas as pd

print('1. Read shapefile')
shape_fn = "D:/path/path/multypoly.shp"
raster_fn = "D:/path/path/class_1992.tif"
# set max and min class
raster_min = 10
raster_max = 230
output_dir = 'C:/Temp/'
write_zero_frequencies = True
show_plot = False
shapefile = gpd.read_file(shape_fn)
# extract the geometries in GeoJSON format
geoms = shapefile.geometry.values # list of shapely geometries
records = shapefile.values
with rasterio.open(raster_fn) as src:
    print('nodata value:', src.nodata)
    idx_area = 0
    # for upslope_area in geoms:
    for index, row in shapefile.iterrows():
        upslope_area = row['geometry']
        lake_id = row['ABC_ID']
        print('\n', idx_area, lake_id, '\n')
        # transform to GeoJSON format
        mapped_geom = [mapping(upslope_area)]
        print('2. Cropping raster values')
        # extract the raster values within the polygon
        out_image, out_transform = mask(src, mapped_geom, crop=True)
        # no data values of the original raster
        no_data = src.nodata
        # extract the values of the masked array (first band)
        data = out_image[0]
        # extract the valid values
        # row, col = np.where(data != no_data)
        clas = np.extract(data != no_data, data)
        # from rasterio import Affine # or from affine import Affine
        # T1 = out_transform * Affine.translation(0.5, 0.5) # reference the pixel centre
        # rc2xy = lambda r, c: (c, r) * T1
        # d = gpd.GeoDataFrame({'col':col,'row':row,'clas':clas})
        range_min = raster_min # min(clas)
        range_max = raster_max # max(clas)
        classes = range(range_min, range_max + 2)
        frequencies, class_limits = np.histogram(clas,
                                                 bins=classes,
                                                 range=[range_min, range_max])
        if idx_area == 0:
            # data_frame = gpd.GeoDataFrame({'freq_' + str(lake_id):frequencies})
            data_frame = pd.DataFrame({'freq_' + str(lake_id): frequencies})
            data_frame.index = class_limits[:-1]
        else:
            data_frame['freq_' + str(lake_id)] = frequencies
        idx_area += 1
data_frame.to_csv(output_dir + 'upslope_area_1992.csv', sep='\t')
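The script above writes pixel counts per class. To get the proportions (percentages) asked for in the question, the counts for each polygon still have to be divided by that polygon's total. A minimal NumPy-only sketch of that step follows. The function name class_proportions is hypothetical, and the input is assumed to be the array of class values already extracted for one polygon (like clas or data above).

```python
import numpy as np

def class_proportions(values, nodata=None):
    """Return {class value: percentage of valid pixels} for one polygon."""
    values = np.asarray(values).ravel()
    if nodata is not None:
        # Ignore pixels outside the polygon / without data.
        values = values[values != nodata]
    classes, counts = np.unique(values, return_counts=True)
    percent = 100.0 * counts / counts.sum()
    return dict(zip(classes.tolist(), percent.tolist()))

# Made-up pixel values for one polygon, with 0 as the nodata value.
demo_percent = class_proportions([10, 10, 30, 0, 30, 30, 200, 0], nodata=0)
```

Only classes that actually occur in the polygon appear in the result, so absent classes would need to be filled with 0 when the per-polygon results are combined into one table.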
I can do it with the R command extract and summarise the result with table, as explained by "Spacedman", see:
shapes <- readOGR("C://data/.../shape)
LClass_1992 <- raster("C://.../LClass_1992.tif")
value_list <- extract(LClass_1992, shapes)
stats <- lapply(value_list, table)
   10   11   30   40   60   70   80   90  100  110  130  150  180  190  200  201  210
   67  303  233  450 1021 8241   65 6461 2823   88 6396    5   35  125   80   70 1027
But it takes very long (half the night).
I will try to do it with Python; maybe it will be faster.
Maybe someone has done something similar and can share the code.
