How to read/load local files in Databricks?

How to read/load local files in Databricks? - apache-spark

is there anyway of reading files located in my local machine other than navigating to 'Data'> 'Add Data' on Databricks.
in my past experience using Databricks, when using s3 buckets, I was able to just read and load a dataframe by just specifying the path like so: i.e
df = spark.read.format('delta').load('<path>')
is there any way i can do something like this using databricks to read local files?

If you use the Databricks Connect client library you can read local files into memory on a remote Databricks Spark cluster. See details here.
The alternative is to use the Databricks CLI (or REST API) and push local data to a location on DBFS, where it can be read into Spark from within a Databricks notebook. A similar idea would be to use the AWS CLI to put local data into an S3 bucket that can be accessed from Databricks.
It sounds like what you are looking for is Databricks Connect, which works with many popular IDEs.

Related

How to access delta tables from azure data lake in jupyter notebook using pyspark

spark.read.format("delta").load("/tmp/delta/people10m")
How to give correct path to adls delta folder?
spark.read.format("delta").load("/tmp/delta/people10m")
tried above command, but not working

if you mean the direct access to ADLS, then you need to give a full path, like, abfss://<container>#<storage-acct>.dfs.core.windows.net/<path> - just following documentation about it (there are different authentication options available, so select what works for you).

Presto local file connector testing

I deployed presto in my local machine and the server is up and running. I'm trying to access a local csv file named "poc.csv" using local file connector. I have created a file called localfile.properties under the etc/catalog folder. So, catalog is localfile and schema is logs(As per the documentation https://prestodb.io/docs/current/connector/localfile.html)
I can also see the catalog created through presto cli using the command show catalogs; So I believe the catalog has created successfully with no issues.
Now, my question is how does local file connector know which file to read in my local machine (In my case it's poc.csv) and how can I query/access the contents of the poc.csv through presto cli. For the simplicity sake, let's assume we have name and employeeId in poc.csv.
Displaying catalogs through presto cli
localfile.properties

Read a Databricks table via Databricks api in Python?

Using Python-3, I am trying to compare an Excel (xlsx) sheet to an identical spark table in Databricks. I want to avoid doing the compare in Databricks. So I am looking for a way to read the spark table via the Databricks api. Is this possible? How can I go on to read a table: DB.TableName?

There is no way to read the table from the DB API as far as I am aware unless you run it as a job as LaTreb already mentioned. However, if you really wanted to, you could use either the ODBC or JDBC drivers to get the data through your databricks cluster.
Information on how to set this up can be found here.
Once you have the DSN set up you can use pyodbc to connect to databricks and run a query. At this time the ODBC driver will only allow you to run Spark-SQL commands.
All that being said, it will probably still be easier to just load the data into Databricks, unless you have some sort of security concern.

I can recomend you write pyspark code in notebook, call the notebook from previously defined job, and establish connection between your local machine and databricks workspace.
You could perfom comaprision directly on spark or convert data frames to pandas if you wish. If noteebok will end comaprision, could retrun result from particular job. I think that sending all databricks tables could be impossible because of API limitation you have spark cluster to perform complex operation, API should be use to send small messages.
Officical documentation:
https://learn.microsoft.com/en-us/azure/databricks/dev-tools/api/latest/jobs#--runs-get-output
Retrieve the output and metadata of a run. When a notebook task
returns a value through the dbutils.notebook.exit() call, you can use
this endpoint to retrieve that value. Azure Databricks restricts this
API to return the first 5 MB of the output. For returning a larger
result, you can store job results in a cloud storage service.

Write a csv file into azure blob storage

I am trying use pyspark to analyze my data on databricks notebooks. Blob storage has been mounted on the databricks cluster and after ananlyzing, would like to write csv back into blob storage. As pyspark working in distributed fashion, csv file is broken into small blocks and written on the blob storage. How to overcome this and write as a single csv file on blob when we do analysis using pyspark. Thanks.

Do you really want a single file? If yes, the only way you can overcome it by merging all the small csv files into a single csv file. You can make use of map function on the databricks cluster to merge it or may be you can use some background job to do the same.
Have a look here: https://forums.databricks.com/questions/14851/how-to-concat-lots-of-1mb-cvs-files-in-pyspark.html

For Hadoop, which data storage to choose, Amazon S3 or Azure Blob Store?

I am working on a Hadoop project and generating lots of data in my local cluster. Sooner later I will be using cloud based Hadoop solution because my Hadoop cluster is very small comparative to real work load, however I dont have a choice as of now which one I will be using i.e. Windows Azure based, EMR or something else. I am generating lots of data locally and want to store this data to some cloud based storage based on the fact that I will use this data with Hadoop later but very soon.
I am looking for suggestion to decided which cloud store to choose based in someone experience. Thanks in advance.

First of all it is a great question. Let's try to understand "How data is processed in Hadoop":
In Hadoop all the data is processed on Hadoop cluster means when you process any data, that data is copied from its sources to HDFS, which is an essential component of Hadoop.
When data is copied to HDFS only after your run Map/Reduce jobs in it to get your results.
That means it does not matter what and where your data sources is(Amazon S3, Azure Blob, SQL Azure, SQL Server, on premise source etc), you will have to move/transfer/copy your data from source to HDFS, within the limits of Hadoop.
Once data is processed in Hadoop cluster, the result will be stored the location you would have configured in your job. The output data source can be HDFS or an outside location accessible from Hadoop Cluster
Once you have data copied to HDFS you can keep it one HDFS as long as you want but you will have to pay the price to use the Hadoop cluster.
In some cases when you are running Hadoop Job between some interval and data move/copy can be done faster, it is good to have a strategy to 1) acquire Hadoop cluster 2) copy data 3) run job 4) release cluster.
So based on above details, when you choose a data source in Cloud for your Hadoop Cluster you would have to consider the following:
If you have large data (which is normal with Hadoop clusters) to process, consider different data sources and the time it will take to copy/move data from those data source to HDFS because this will be your first step.
You would need to choose a data source which must have the lowest network latency so you can get data in and out, as fast as possible.
You also need to consider how you will move large amount of data from your current location to any cloud store. The best option would be to have a storage where you can send your data disk (HDD/Tape etc) because uploading multiple TB data will take great amount of time.
Amazon EMR (already available), Windows Azure (HadoopOnAzure in CTP) and Google (BigQuery in Preview, based on Google Dremel) provides pre-configured Hadoop clusters in cloud so you can choose where you would want to run your Hadoop job then you can consider the cloud storage.
Even if you choose one cloud data storage and decide to move to other because you want to use other Hadoop cluster in cloud, you sure can transfer the data however consider the time and data transfer support available to you.
For example, with HadooponAzure you can connect various data sources i.e. Amazon S3, Azure Blob Storage, SQL Server and SQL Azure etc so a variety of data sources are the best with any cloud Hadoop cluster.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string