Using Python and Google Cloud Engine to process big data - python-3.x

I am new to the world of Python programming and I need help. I have 10GB of data and I have written Python code in Spyder to process it. Part of the code is provided below.
The code works fine with a small sample of the data. However, with the full 10GB my laptop cannot handle it, so I need to use Google Cloud Engine. How can I upload the data and use Google Cloud Engine to run the code?
import os
import pandas as pd
import pickle
import glob
import numpy as np
df = pd.read_pickle(r'C:\user\mydata.pkl')

# Write one pickle per year, keeping only rows with OverlapYearStart <= that year
i = 2018
while i >= 1995:
    df = df[df.OverlapYearStart <= i]
    df.to_pickle(r'C:\user\done\{}.pkl'.format(i))
    i = i - 1

I agree with the previous answer. Just to complement it, you can take a look at AI Platform Notebooks, a managed service that offers an integrated JupyterLab environment; it can also pull your data from BigQuery and lets you scale your application on demand.
On the other hand, I don't know how you have stored your 10GB of data: in CSV files? In a database? As mentioned in the first answer, Cloud Storage allows you to create buckets to store your data. Once the data is in Cloud Storage, you can load it into BigQuery tables and work with it from your app using Google App Engine or, as suggested earlier, AI Platform Notebooks; which one fits will depend on your solution.
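Just as an illustration (not part of the answer itself), once the data is in a BigQuery table you can pull it into pandas with the google-cloud-bigquery client; the project, dataset and table names below are placeholders:

from google.cloud import bigquery  # pip install google-cloud-bigquery[pandas]

client = bigquery.Client(project="my-project")   # hypothetical project ID

# Pull only the rows/columns you need instead of the whole 10GB
query = """
    SELECT *
    FROM `my-project.my_dataset.my_table`
    WHERE OverlapYearStart <= 2018
"""
df = client.query(query).to_dataframe()          # needs pandas (plus pyarrow or db-dtypes)
print(df.head())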

Probably the easiest thing to start digging into is going to be using App Engine to run the code itself:
https://cloud.google.com/appengine/docs/python/
And use Google Cloud Storage to hold your data objects:
https://cloud.google.com/storage/docs/reference/libraries#client-libraries-install-python
I don't know what the output of your application is, so depending on what you want to do with the output, Google Compute Engine may be the right answer if AppEngine doesn't quite fit what you're doing.
https://cloud.google.com/compute/
The first two links take you to the documentation on how to get going with Python for AppEngine and Google Cloud Storage.
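For example, a minimal sketch (not a full solution) of pulling the pickle down from a Cloud Storage bucket with the google-cloud-storage client library and pushing a result back; the bucket and object names are placeholders:

from google.cloud import storage  # pip install google-cloud-storage
import pandas as pd

client = storage.Client()
bucket = client.bucket("my-data-bucket")          # hypothetical bucket name
bucket.blob("mydata.pkl").download_to_filename("/tmp/mydata.pkl")

df = pd.read_pickle("/tmp/mydata.pkl")
# ... process df and write /tmp/2018.pkl, etc. ...

bucket.blob("done/2018.pkl").upload_from_filename("/tmp/2018.pkl")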
Edit to add from the comments: you'll also need to manage the memory footprint of your app. If you're really doing everything in one giant while loop, you'll have memory problems no matter where you run the application, as all 10GB of your data will likely get loaded into memory. Definitely still shift that into the Cloud IMO, but yeah, that data will need to get broken up somehow and handled in smaller chunks.
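For instance, if the data can live as a CSV instead of one big pickle, pandas can stream it in chunks so the whole 10GB never sits in memory at once. This is only a sketch with placeholder file names, and it writes CSV outputs because pickles can't be appended to incrementally:

import os
import pandas as pd

# Stream the input in manageable chunks and append each chunk's filtered rows
# to one output file per year, instead of loading everything at once.
for chunk in pd.read_csv("mydata.csv", chunksize=1_000_000):
    for year in range(1995, 2019):
        out_path = "done/{}.csv".format(year)
        subset = chunk[chunk.OverlapYearStart <= year]
        subset.to_csv(out_path, mode="a", index=False,
                      header=not os.path.exists(out_path))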

Related

Using Pandas to Write to a File within a Samba Share

I am using a GCP Cloud Function to read from a BigQuery table and output the results to a CSV file located on a network drive (all the infrastructure parts necessary to communicate with on-prem are in place). I was wondering whether there is a way to write data out to this location using pandas and pysmb?
I have done a fair bit of reading on the topic and couldn't find a way, but thought someone with more experience may have an idea.
Thank you very much for your help.
Regards,
Scott
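
For what it's worth, one approach that may work (untested sketch; the server, share, credential and file names are placeholders) is to render the DataFrame to CSV in an in-memory buffer and push the bytes to the share with pysmb's SMBConnection.storeFile:

import io
import pandas as pd
from smb.SMBConnection import SMBConnection  # pip install pysmb

df = pd.DataFrame({"col": [1, 2, 3]})        # stand-in for the BigQuery results

# Write the CSV to an in-memory buffer instead of the local file system
buf = io.BytesIO(df.to_csv(index=False).encode("utf-8"))

conn = SMBConnection("user", "password", "gcf-client", "FILESERVER",  # placeholder credentials/names
                     use_ntlm_v2=True, is_direct_tcp=True)
if conn.connect("fileserver.example.com", 445):                       # placeholder host
    conn.storeFile("share_name", "/reports/output.csv", buf)          # stream the buffer to the share
    conn.close()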

How to process large .kryo files for graph data using TinkerPop/Gremlin

I am new to Apache TinkerPop.
I have done some basic things like installing the TinkerPop Gremlin Console, creating a graph .kryo file, loading it in the Gremlin Console, and executing some basic Gremlin queries. All good till now.
But I wanted to check how we can process .kryo files that are very large, say more than 1000GB. If I create a single .kryo file, loading it in the console (or from code) is not feasible, I think.
Is there any way to deal with graph data that is this large?
Basically, I have some graph data stored in an Amazon Neptune DB; I want to export it, store it in files (e.g. .kryo), and process it later with Gremlin queries. Thanks in advance.
Rather than use Kryo, which is Java specific, I would recommend something more language agnostic, such as CSV files. If you are using Amazon Neptune, you can use the Neptune Export tool to export your data as CSV files.
Documentation
Git repo
Cloud Formation template
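
A complementary note, not part of the answer above: if the end goal is simply to run Gremlin queries against data that already lives in Neptune, you may not need to materialise a huge file at all; you can query the cluster directly from Python with gremlinpython. A minimal sketch, assuming a reachable Neptune endpoint (the hostname and label are placeholders):

from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

# Connect to the cluster's Gremlin endpoint (placeholder hostname)
conn = DriverRemoteConnection("wss://your-neptune-endpoint:8182/gremlin", "g")
g = traversal().withRemote(conn)

# Run queries against the remote graph without exporting it first
print(g.V().count().next())
print(g.V().hasLabel("person").limit(5).valueMap().toList())  # placeholder label

conn.close()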

Databricks or Functions with ADF?

I'm using ADF to output some reports to pdf (at least that's the goal.)
I'm using ADF to output a csv to a storage blob and I would like to ingest that, do some formatting and stats work (with scipy and matplotlib in python) and export as a pdf to the same container. This would be run once a month, and I may do a few other things like this, but they are periodical reports at the most, no streaming or anything like that.
From an architectural standpoint, would this be a good application for an Azure Function (which I have some experience with) or Azure Databricks (which I would like to get some experience in)?
My first thought is the Azure Functions, since they are serverless and pay-as-you-go. But I don't know too much about Databricks except that it's primarily used for big data and long running jobs.
Databricks would almost certainly be overkill for this. So yes, an Azure Function for Python sounds like a perfect fit for your scenario.
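To make that concrete, here is a rough sketch (not a tested implementation) of what such a Python Azure Function could look like with the v2 programming model: a blob trigger picks up the CSV that ADF drops, pandas/matplotlib build the figures, and a blob output binding writes the PDF back. The container names, binding paths and the actual stats are placeholders:

import io
import azure.functions as func
import pandas as pd
import matplotlib
matplotlib.use("Agg")            # headless backend, needed in a Functions host
import matplotlib.pyplot as plt

app = func.FunctionApp()

# Container names and the connection setting below are placeholders.
@app.blob_trigger(arg_name="inblob", path="csv-reports/{blobname}.{blobextension}",
                  connection="AzureWebJobsStorage")
@app.blob_output(arg_name="outblob", path="pdf-reports/{blobname}.pdf",
                 connection="AzureWebJobsStorage")
def make_report(inblob: func.InputStream, outblob: func.Out[bytes]):
    df = pd.read_csv(io.BytesIO(inblob.read()))   # the CSV dropped by ADF

    fig, ax = plt.subplots()
    df.describe().T.plot(kind="bar", ax=ax)       # stand-in for the real stats/plots
    buf = io.BytesIO()
    fig.savefig(buf, format="pdf")                # render straight to PDF bytes
    outblob.set(buf.getvalue())                   # write the PDF back to blob storage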

Is it good or necessary to use Blobs when running machine learning algorithms with big data

I know that I can either upload my data files to Azure ML (as new datasets) or use Blobs (and read the data within an ML experiment). I wonder if one of these in particular is recommended when training machine learning models and creating prediction-related ML solutions.
My goal in using Azure is to cluster users based on a variety of features. I have a large dataset (~50GB). I wonder if you have any recommendations.
I appreciate any help!
As stated at Azure Machine Learning Frequently Asked Questions: "For datasets larger than a few GB, you should upload data to Azure Storage or Azure SQL Database or use HDInsight, rather than directly uploading from local file."
Also please note the maximum sizes of datasets for modules in the Machine Learning Studio. These limits are listed as a part of the same FAQ linked above.
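As an illustration of the "upload to Azure Storage" route, a minimal sketch using the azure-storage-blob Python SDK; the connection string, container and file names are placeholders:

from azure.storage.blob import BlobServiceClient  # pip install azure-storage-blob

service = BlobServiceClient.from_connection_string("<your-connection-string>")  # placeholder
container = service.get_container_client("ml-datasets")                         # placeholder container

with open("users_features.csv", "rb") as data:                                  # placeholder local file
    container.upload_blob(name="users_features.csv", data=data, overwrite=True)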

Using existing database in Firefox OS app

I don't know much about FirefoxOS hence this question.
I have an Android app that ships with already prepared data saved in an SQLite database. At runtime the app copies that DB to the device storage and uses it for reading and writing data. This is much more efficient than creating an empty DB file and inserting the data when the app first starts (e.g. from JSON).
I was wondering how I can achieve the same thing in Firefox OS. Is there any way I can create an IndexedDB, fill it with data, and then add it to the app package as an asset?
Unfortunately this behaviour is not yet supported. As Fabrice Desré mentioned in Bugzilla, some of the files needed to achieve this behaviour are specific to Gaia apps, which Gecko does not have access to at the moment.
For now, you will have to stick with the less efficient method (depending on the size of your db, the difference isn't that big).
Hope I was able to help,
cheers
