google datastore put_multi didn't insert data - python-3.x

I am trying to insert 6000 rows/entities into Google Cloud Datastore. I am also using the Datastore emulator as the local server.
In the code, I created an insert function that inserts entities in batches using put_multi, with the batch size set to 50. I use Python multiprocessing to spawn processes that execute the function.
A slice function is also used to divide the workload based on how many CPU cores are used, e.g. if there are 3 cores, the workload (6000 entities) is divided into 3 parts of 2000 entities each, and each part is inserted by a spawned process that executes the insert function.
After insertion was done, I checked with the Cloud Datastore Admin console, but couldn't find the kinds that had been inserted.
I am wondering what the issue is here and how to solve it.
The code snippet is as follows:
# cores_to_use is how many cpu cores available for dividing workload
cores_to_use = 3
# a datastore client is passed in as the argument
inserter = FastInsertGCDatastore(client)
# entities is a list of datastore entities to be inserted
# the number of entities is 6000 here
input_size = len(entities)
slice_size = int(input_size / cores_to_use)
entity_blocks = []
iterator = iter(entities)
for i in range(cores_to_use):
    entity_blocks.append([])
    for j in range(slice_size):
        entity_blocks[i].append(next(iterator))

for block in entity_blocks:
    p = multiprocessing.Process(target=inserter.execute, args=(block,))
    p.start()
class FastInsertGCDatastore:
    """
    batch insert entities into gc datastore based on batch_size and number_of_entities
    """

    def __init__(self, client):
        """
        initialize with datastore client
        :param client: the datastore client
        """
        self.client = client

    def execute(self, entities):
        """
        batch insert entities
        :param entities: a list of datastore entities need to be inserted
        """
        number_of_entities = len(entities)
        batch_size = 50
        batch_documents = [0] * batch_size
        rowct = 0  # entity count as index for accessing rows
        for index in range(number_of_entities):
            try:
                batch_documents[index % batch_size] = entities[rowct]
                rowct += 1
                if (index + 1) % batch_size == 0:
                    self.client.put_multi(batch_documents)
                index += 1
            except Exception as e:
                print('Unexpected error for index ', index, ' message reads', str(e))
                raise e
        # insert any remaining entities
        if not index % batch_size == 0:
            self.client.put_multi(batch_documents[:index % batch_size])
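One thing worth double-checking before suspecting Datastore itself: the snippet above starts the worker processes but never joins them, so the parent may go on to verify results while inserts are still in flight. A minimal sketch of waiting for the workers, reusing inserter and entity_blocks from the snippet above:

import multiprocessing

processes = []
for block in entity_blocks:
    p = multiprocessing.Process(target=inserter.execute, args=(block,))
    p.start()
    processes.append(p)

# Block until every worker has finished its put_multi calls
# before checking the emulator or the admin console.
for p in processes:
    p.join()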

Related

What is an efficient way to make a dataset and dataloader for high frequency time series with multiple individuals?

I'm trying to forecast high frequency time series using LSTMs and the PyTorch library. I'm going through the PyTorch tutorial for creating custom datasets and models, and I figured out how to create my Dataset class and my DataLoader. They work fine, but they take too much time to generate one batch.
I want to generate batches of fixed size; each batch contains time series from different individuals, and the input window is the same length as the output window (multi-step prediction).
I think the issue comes from the fact that I'm verifying the windows are correct.
My dataframe has a little more than 3M rows and 6 columns. I have about 100 individuals, and for each individual I have 4 different time series $y_{1}$, $y_{2}$, $y_{3}$ and $y_{4}$. I have no missing values at all, and the time steps are consecutive. For each individual I have the same time steps.
My code is:
class TSDataset(Dataset):
    def __init__(self, train_data, unique_column='unique_id', input_length=3840,
                 target_length=3840, targets=['y1', 'y2', 'y3', 'y4'], transform=None):
        self.train_data = train_data
        self.unique_column = unique_column
        self.input_length = input_length
        self.target_length = target_length
        self.total_window_length = input_length + target_length
        self.targets = targets

    def __len__(self):
        return len(self.train_data)

    def verify_time_steps(self, idx):
        change = False
        # Check if the window doesn't overlap over many individuals
        num_individuals = self.train_data.iloc[np.arange(idx + self.total_window_length), :][self.unique_column].unique().shape[0]
        if num_individuals != 1:
            change = True
        if idx + self.total_window_length >= len(self.train_data):
            change = True
        return change

    def reshuffle(self):
        return np.random.randint(0, len(self.train_data))

    def __getitem__(self, idx):
        if torch.is_tensor(idx):
            idx = idx.tolist()
        change = self.verify_time_steps(idx)
        if change == True:
            while change != False:
                idx = self.reshuffle()
                change = self.verify_time_steps(idx)
        sample = self.train_data.iloc[np.arange(idx, idx + self.input_length), :][self.targets].values
        labels = self.train_data.iloc[np.arange(idx + self.input_length, idx + self.input_length + self.target_length), :][self.targets].values
        sample = torch.from_numpy(sample)
        labels = torch.from_numpy(labels)
        return sample, labels
I've tried using the TimeSeriesDataset from PyTorchForecasting but I had a hard time creating models that suit it.
I've also tried creating the dataset outside, as a numpy array but my RAM can't handle it.
Hope you can help me figure out how to alleviate the computations.
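Not an authoritative fix, but one pattern that often helps here is to precompute the set of valid window start positions once, so __getitem__ only does slicing and never has to verify or reshuffle. A minimal sketch under the same assumptions as the question (a unique_id column, each individual's rows contiguous and time-ordered); the class and attribute names are mine:

import numpy as np
import torch
from torch.utils.data import Dataset

class PrecomputedWindowDataset(Dataset):
    def __init__(self, train_data, unique_column='unique_id', input_length=3840,
                 target_length=3840, targets=('y1', 'y2', 'y3', 'y4')):
        self.values = train_data[list(targets)].to_numpy(dtype=np.float32)
        self.input_length = input_length
        self.target_length = target_length
        total = input_length + target_length
        starts = []
        # Positional row indices per individual; assumes each individual's rows are
        # contiguous and time-ordered in train_data, as described in the question.
        for positions in train_data.groupby(unique_column, sort=False).indices.values():
            if len(positions) >= total:
                starts.append(positions[: len(positions) - total + 1])
        self.valid_starts = (np.concatenate(starts) if starts
                             else np.array([], dtype=np.int64))

    def __len__(self):
        return len(self.valid_starts)

    def __getitem__(self, idx):
        start = int(self.valid_starts[idx])
        mid = start + self.input_length
        end = mid + self.target_length
        sample = torch.from_numpy(self.values[start:mid])
        labels = torch.from_numpy(self.values[mid:end])
        return sample, labels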

Calling a file_write after multiprocessing causes issue in one case and not the other

I'm using a multiprocessing pool with 80 processes on a 16GB machine. The flow is as follows:
Read objects in batches from an input file
Send the entire batch to a multiprocessing pool, and record the time taken by the pool to process the batch
Write the time recorded in step 2 above to an output file
To achieve the above, I wrote code in 2 ways:
Way 1:
with open('input_file', 'r') as input_file, open('output_file', 'a') as of:
    batch = read_next_batch_of_lines()
    start_time = time.time()
    call_api_for_each_item_in_batch(batch)
    end_time = time.time()
    of.write('{}\n'.format(end_time - start_time))
Way 2:
with open('input_file', 'r') as input_file:
    batch = read_next_batch_of_lines()
    start_time = time.time()
    call_api_for_each_item_in_batch(batch)
    end_time = time.time()
    with open('output_file', 'a') as of:
        of.write('{}\n'.format(end_time - start_time))
In the first case, nothing is being appended to the output file despite batches being processed. I'm unable to figure out the reason for this.
Details of call_api_for_each_item_in_batch():
def call_api_for_each_item_in_batch(batch):
    intervals = get_intervals(batch, pool_size)  # this gives intervals. Ex. if batch size is 10 and pool size is 3, then intervals would be (0, 4, 7, 10)
    pool = mp.Pool(pool_size)
    arguments = list(zip(intervals, intervals[1:]))
    pool.starmap(call_api, arguments)
    pool.close()

def call_api(start, end):
    for i in range(start, end):
        item = batch[i]
        call_external_api(item)
How is Way 1 different from Way 2 when a pool.close() is called in the call_api_for_each_item_in_batch itself?
I also tried pool.close() followed by pool.join(), but faced the same issue.
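Not a definitive diagnosis, but one variable worth eliminating is output buffering: in Way 1 the output file stays open for the whole run, so timing lines can sit in the write buffer until the file is eventually closed, whereas Way 2 closes (and therefore flushes) the file after every write. A quick check, keeping Way 1's structure and the question's helper names:

import time

with open('input_file', 'r') as input_file, open('output_file', 'a') as of:
    batch = read_next_batch_of_lines()
    start_time = time.time()
    call_api_for_each_item_in_batch(batch)
    end_time = time.time()
    of.write('{}\n'.format(end_time - start_time))
    of.flush()  # push the buffered timing line to disk immediately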

How do I load a pre-batched dataset in PyTorch?

I have a huge dataset that cannot be stored in memory, so I pre-batched it into several files. How do I write my Dataset and DataLoader classes so that they load one batch at a time?
All the files share the same base name followed by a unique batch number. Example files would be called o3_batch_1.hdf5 or o3_batch_2.hdf5; the largest batch number is o3_batch_102.hdf5.
Here is what I have tried so far. Would it work?
length would be the total length of the data.
batchNum would be the batch number at the end of the file name.
base is the common name shared by the files.
class Data(Dataset):
    # Constructor
    def __init__(self, base, batchNum, length):
        name = base + str(batchNum)
        with h5py.File(name, "r") as f:
            puzz = np.array(f.get('puzzle'))
            sol = np.array(f.get('Sol'))
        self.puzz = torch.from_numpy(puzz)
        self.sol = torch.from_numpy(sol)
        self.len = length

    # Getter
    def __getitem__(self, batchNum, index):
        return self.puzz[index], self.sol[index]

    # Get length
    def __len__(self):
        return self.len
I think you can iterate over the index array and get your data through iteration.
Suppose your files are organized in the following manner:
/yourFileDir
o3_batch_1.hdf5
o3_batch_2.hdf5
...
o3_batch_102.hdf5
And your batch index runs 1, 2, ..., 102:
h5_dir = '/yourFileDir/'
for Index in range(1, 103):
    with h5py.File(h5_dir + 'o3_batch_{}.hdf5'.format(Index), 'r') as f:
        puzz = np.array(f['puzzle'])
        sol = np.array(f['Sol'])  # this depends on how you save your data
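If you would rather have this behind a Dataset so a DataLoader can drive it, one option (only a sketch, and the class name is mine) is to open files lazily in __getitem__ and treat each pre-batched file as a single item, using the 'puzzle' and 'Sol' keys from the question:

import h5py
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class PreBatchedData(Dataset):
    """Each item is one pre-batched file, loaded only when requested."""

    def __init__(self, base, num_batches):
        self.base = base              # e.g. '/yourFileDir/o3_batch_'
        self.num_batches = num_batches

    def __len__(self):
        return self.num_batches

    def __getitem__(self, index):
        # Batch numbers start at 1 (o3_batch_1.hdf5 ... o3_batch_102.hdf5).
        name = '{}{}.hdf5'.format(self.base, index + 1)
        with h5py.File(name, 'r') as f:
            puzz = torch.from_numpy(np.array(f['puzzle']))
            sol = torch.from_numpy(np.array(f['Sol']))
        return puzz, sol

# batch_size=None so the DataLoader does not re-batch the already batched files.
loader = DataLoader(PreBatchedData('/yourFileDir/o3_batch_', 102), batch_size=None)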

How best to parallelize grakn queries with Python?

I run Windows 10, Python 3.7, and have a 6-core CPU. A single Python thread on my machine submits 1,000 inserts per second to grakn. I'd like to parallelize my code to insert and match even faster. How are people doing this?
My only experience with parallelization is on another project, where I submit a custom function to a dask distributed client to generate thousands of tasks. Right now, this same approach fails whenever the custom function receives or generates a grakn transaction object/handle. I get errors like:
Traceback (most recent call last):
File "C:\Users\dvyd\.conda\envs\activefiction\lib\site-packages\distributed\protocol\pickle.py", line 41, in dumps
return cloudpickle.dumps(x, protocol=pickle.HIGHEST_PROTOCOL)
...
File "stringsource", line 2, in grpc._cython.cygrpc.Channel.__reduce_cython__
TypeError: no default __reduce__ due to non-trivial __cinit__
I've never used Python's multiprocessing module directly. What are other people doing to parallelize their queries to grakn?
The easiest approach that I've found to execute a batch of queries is to pass a Grakn session to each thread in a ThreadPool. Within each thread you can manage transactions and of course do some more complex logic:
from grakn.client import GraknClient
from multiprocessing.dummy import Pool as ThreadPool
from functools import partial

def write_query_batch(session, batch):
    tx = session.transaction().write()
    for query in batch:
        tx.query(query)
    tx.commit()

def multi_thread_write_query_batches(session, query_batches, num_threads=8):
    pool = ThreadPool(num_threads)
    pool.map(partial(write_query_batch, session), query_batches)
    pool.close()
    pool.join()

def generate_query_batches(my_data_entries_list, batch_size):
    batch = []
    for index, data_entry in enumerate(my_data_entries_list):
        batch.append(data_entry)
        if (index + 1) % batch_size == 0:
            yield batch
            batch = []
    if batch:
        yield batch

# (Part 2) Somewhere in your application open a client and a session
client = GraknClient(uri="localhost:48555")
session = client.session(keyspace="grakn")

query_batches_iterator = generate_query_batches(my_data_entries_list, batch_size)
multi_thread_write_query_batches(session, query_batches_iterator, num_threads=8)

session.close()
client.close()
The above is a generic method. As a concrete example, you can use the above (omitting part 2) to parallelise batches of insert statements from two files. Appending this to the above should work:
import time

files = [
    {
        "file_path": f"/path/to/your/file.gql",
    },
    {
        "file_path": f"/path/to/your/file2.gql",
    }
]

KEYSPACE = "grakn"
URI = "localhost:48555"
BATCH_SIZE = 10
NUM_BATCHES = 1000

# Entry point where migration starts
def migrate_graql_files():
    start_time = time.time()

    for file in files:
        print('==================================================')
        print(f'Loading from {file["file_path"]}')
        print('==================================================')

        open_file = open(file["file_path"], "r")  # Here we are assuming you have 1 Graql query per line!
        batches = generate_query_batches(open_file.readlines(), BATCH_SIZE)

        with GraknClient(uri=URI) as client:  # Using `with` auto-closes the client
            with client.session(KEYSPACE) as session:  # Using `with` auto-closes the session
                multi_thread_write_query_batches(session, batches, num_threads=16)  # Pick `num_threads` according to your machine

        elapsed = time.time() - start_time
        print(f'Time elapsed {elapsed:.1f} seconds')

    elapsed = time.time() - start_time
    print(f'Time elapsed {elapsed:.1f} seconds')

if __name__ == "__main__":
    migrate_graql_files()
You should also be able to see how you can load from a CSV or any other file type in this way, taking the values you find in that file and substituting them into Graql query string templates. Take a look at the migration example in the docs for more on that.
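For illustration only, a rough sketch of that CSV approach: the template string, file path, and column names (name, age) are made up, while the client, constants, and helper functions are reused from the code above:

import csv

# Hypothetical template and column names, purely for illustration.
QUERY_TEMPLATE = 'insert $p isa person, has name "{name}", has age {age};'

def csv_to_queries(csv_path):
    # Turn each CSV row into one Graql insert statement via the template.
    with open(csv_path) as csv_file:
        for row in csv.DictReader(csv_file):
            yield QUERY_TEMPLATE.format(name=row["name"], age=row["age"])

queries = list(csv_to_queries("/path/to/people.csv"))
batches = generate_query_batches(queries, BATCH_SIZE)
with GraknClient(uri=URI) as client:
    with client.session(KEYSPACE) as session:
        multi_thread_write_query_batches(session, batches, num_threads=16)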
An alternative approach using multi-processing instead of multi-threading follows below.
We empirically found that multi-threading doesn't yield particularly large performance gains, compared to multi-processing. This is probably due to Python's GIL.
This piece of code assumes a file enumerating TypeQL queries that are independent of each other, so they can be parallelised freely.
from typedb.client import TypeDB, TypeDBClient, SessionType, TransactionType
import multiprocessing as mp
import queue

def batch_writer(database, kill_event, batch_queue):
    client = TypeDB.core_client("localhost:1729")
    session = client.session(database, SessionType.DATA)
    while not kill_event.is_set():
        try:
            batch = batch_queue.get(block=True, timeout=1)
            with session.transaction(TransactionType.WRITE) as tx:
                for query in batch:
                    tx.query().insert(query)
                tx.commit()
        except queue.Empty:
            continue
    print("Received kill event, exiting worker.")

def start_writers(database, kill_event, batch_queue, parallelism=4):
    processes = []
    for _ in range(parallelism):
        proc = mp.Process(target=batch_writer, args=(database, kill_event, batch_queue))
        processes.append(proc)
        proc.start()
    return processes

def batch(iterable, n=1000):
    l = len(iterable)
    for ndx in range(0, l, n):
        yield iterable[ndx:min(ndx + n, l)]

if __name__ == '__main__':
    batch_size = 100
    parallelism = 1
    database = "<database name>"
    file_path = "<PATH TO QUERIES FILE - ONE QUERY PER NEW LINE>"

    with open(file_path, "r") as file:
        statements = file.read().splitlines()[:]

    batch_statements = batch(statements, n=batch_size)
    total_batches = int(len(statements) / batch_size)
    if len(statements) % batch_size > 0:
        total_batches += 1

    batch_queue = mp.Queue(parallelism * 4)
    kill_event = mp.Event()
    writers = start_writers(database, kill_event, batch_queue, parallelism=parallelism)

    for i, batch in enumerate(batch_statements):
        batch_queue.put(batch, block=True)
        if i * batch_size % 10000 == 0:
            print("Loaded: {0}/{1}".format(i * batch_size, total_batches * batch_size))

    kill_event.set()
    batch_queue.close()
    batch_queue.join_thread()
    for proc in writers:
        proc.join()
    print("Done loading")

python simpy memory usage with large numbers of objects/processes

I am using simpy to create a DES with a very large number of objects (many millions). I am running into memory issues and have been trying to figure out how to address this. It is possible to work out which objects will not undergo any more interactions with other processes, so I can delete these objects from the simulation, in theory freeing up memory. I created the test below to check this.
import psutil as ps
import simpy
import random

class MemoryUse(object):
    """a class used to output memory usage at various times within the sim"""
    def __init__(self, env, input_dict):
        self.env = env
        self.input_dict = input_dict
        self.env.process(self.before())
        self.env.process(self.during())
        self.env.process(self.after_sr())
        self.env.process(self.after())

    def before(self):
        yield self.env.timeout(0)
        print("full object list and memory events at time: ", self.env.now, " ", ps.virtual_memory())
        print(len(self.input_dict), len(self.env._queue))

    def during(self):
        yield self.env.timeout(2)
        print("full object list and events ar time: ", self.env.now, " ", ps.virtual_memory())
        print(len(self.input_dict), len(self.env._queue))

    def after_sr(self):
        yield self.env.timeout(4)
        print("reduced object list and reduced events at time: ", self.env.now, " ", ps.virtual_memory())
        print(len(self.input_dict), len(self.env._queue))

    def after(self):
        yield self.env.timeout(6)
        print("no objects and no events at time: ", self.env.now, " ", ps.virtual_memory())
        print(len(self.input_dict), len(self.env._queue))

class ExObj(object):
    """a generic object"""
    def __init__(self, env, id, input_dict):
        self.env = env
        self.id = id
        self.input_dict = input_dict
        if random.randint(0, 100) < 70:
            # set as SR
            self.timeout = 2
        else:
            self.timeout = 4

    def action(self):
        yield self.env.timeout(self.timeout)
        del self.input_dict[self.id]

class StartObj(object):
    """this enables me to create the obj events after the sim has started so as to measure memory usage before the events
    associated with the object exists"""
    def __init__(self, env, input_dict):
        self.env = env
        self.input_dict = input_dict
        self.env.process(self.start_obj())

    def start_obj(self):
        yield self.env.timeout(1)
        for k, v in self.input_dict.items():
            self.env.process(v.action())
        yield self.env.timeout(0)

# memory usage before we do anything
print("before all: ", ps.virtual_memory())

# create simpy env
env = simpy.Environment()
obj_dict = {}

# create memory calculation events
memory = MemoryUse(env, obj_dict)

# create objects
for i in range(2500000):
    obj_dict[i] = ExObj(env, i, obj_dict)

# create process that will itself start events associated with the objects
start = StartObj(env, obj_dict)

# run
env.run()

# clear the dict if not already clear
for j in range(2500000):
    obj_dict.clear()

# final memory check
print("after all: ", ps.virtual_memory())
print(len(obj_dict))
I was expecting memory usage to drop by time 4, as many objects had been removed and their processes completed (around 70%). However, memory usage appears to stay the same (see below). Why is this? What is using this memory? Do completed processes stay in the simulation?
before all: svmem(total=42195423232, available=39684155392, percent=6.0, used=2246373376, free=38884859904, active=2390749184, inactive=441712640, buffers=263155712, cached=801034240, shared=28721152)
full object list and memory events at time: 0 svmem(total=42195423232, available=38834251776, percent=8.0, used=3096276992, free=38035181568, active=3241959424, inactive=441466880, buffers=263159808, cached=800804864, shared=28721152)
2500000 4
full object list and events ar time: 2 svmem(total=42195423232, available=35121584128, percent=16.8, used=6808891392, free=34322219008, active=6947561472, inactive=441761792, buffers=263163904, cached=801148928, shared=28774400)
2500000 2500002
reduced object list and reduced events at time: 4 svmem(total=42195423232, available=35120973824, percent=16.8, used=6809530368, free=34321600512, active=6948368384, inactive=441737216, buffers=263168000, cached=801124352, shared=28745728)
767416 767417
no objects and no events at time: 6 svmem(total=42195423232, available=38448134144, percent=8.9, used=3482365952, free=37648760832, active=3627053056, inactive=441733120, buffers=263172096, cached=801124352, shared=28745728)
0 0
after all: svmem(total=42195423232, available=38825793536, percent=8.0, used=3104706560, free=38026420224, active=3250180096, inactive=441733120, buffers=263172096, cached=801124352, shared=28745728)
0
Process finished with exit code 0
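Not an answer, but a way to narrow the question down: process-level numbers from psutil only show what the OS has handed to the process, not what Python has actually freed internally. A small sketch (the helper name is mine) that counts live ExObj instances via the gc module, which would tell you whether the objects themselves are still referenced somewhere (for example by pending simpy events) or have genuinely been collected even though the process footprint has not shrunk:

import gc

def count_live_exobjs():
    # Force a collection, then count ExObj instances still reachable.
    gc.collect()
    return sum(1 for obj in gc.get_objects() if isinstance(obj, ExObj))

# Could be called from the MemoryUse callbacks, e.g. alongside the prints:
# print("live ExObj instances:", count_live_exobjs())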

Resources