Presto local file connector testing

I deployed Presto on my local machine and the server is up and running. I'm trying to access a local CSV file named "poc.csv" using the local file connector. I have created a file called localfile.properties under the etc/catalog folder, so the catalog is localfile and the schema is logs (as per the documentation: https://prestodb.io/docs/current/connector/localfile.html).
I can also see the catalog through the Presto CLI using the command show catalogs; so I believe the catalog has been created successfully with no issues.
Now, my question is: how does the local file connector know which file to read on my local machine (in my case, poc.csv), and how can I query/access the contents of poc.csv through the Presto CLI? For simplicity's sake, let's assume poc.csv contains name and employeeId.
[Screenshot: catalogs displayed through the Presto CLI]
[Screenshot: localfile.properties]
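For reference, a localfile.properties that follows the linked documentation page would look roughly like this (the log directory path below is only a placeholder, not my actual setup):

connector.name=localfile
# Directory (or single file) the connector reads from; placeholder path
presto-logs.http-request-log.location=/var/log/presto
# Optional glob used to match file names when the location is a directory
presto-logs.http-request-log.pattern=*.log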

Related

Petastorm with Databricks Connect failing

Using Azure Databricks.
I have petastorm==0.11.2 and databricks-connect==9.1.0 installed.
My databricks-connect session seems to be working; I'm able to read data into my remote workspace. But when I use petastorm to create a spark converter object, it says it is unable to infer a schema, even though if I take the object I'm passing in and check its .schema attribute, it shows me a schema just fine.
The exact same code works within the Databricks workspace in the notebooks, but it doesn't work when I'm on a separate VM using DBConnect to read in the data.
I think the issue is around setting this configuration: SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF. When working in the Databricks workspace itself, the value 'file:///tmp/petastorm/cache/' works fine. When using databricks-connect, it supposedly builds a Spark context that's linked to the cluster and otherwise behaves fine for read and write paths.
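The relevant part of my code looks roughly like this (spark is the databricks-connect session; the cache URL and table name below are placeholders, not my exact values):

from petastorm.spark import SparkDatasetConverter, make_spark_converter

# Cache dir for the converter's intermediate files; 'file:///tmp/petastorm/cache/'
# works fine inside the workspace, and the path below is just a placeholder for
# a location the remote cluster can also reach.
spark.conf.set(SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF,
               "file:///dbfs/tmp/petastorm/cache/")

df = spark.read.table("my_training_table")  # placeholder; df.schema prints fine
converter = make_spark_converter(df)        # fails here with "unable to infer schema"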
Any ideas?

Cannot read persisted spark warehouse databases on subsequent sessions

I am trying to create a locally persisted spark warehouse database that will be present/loaded/accessible to future spark sessions created by the same application.
I have configured the spark session conf with:
.config("spark.sql.warehouse.dir", "C:/path/to/my/long/lived/mock-hive")
When I create the databases, I see the mock-hive folder get created, and underneath it folders for the two distinct databases that I create: db1.db and db2.db.
However, these folders are EMPTY after the session completes, despite the databases being successfully created and subsequently queried in the run that stands them up.
On a subsequent run with the same spark session configuration, if I call
baseSparkSession.catalog.listDatabases().collect() I only see the default database. The two I created did not persist into the second spark session.
What is the trick to getting these locally persisted databases to be available to read in subsequent executions?
I've noticed that the spark.sql.warehouse.dir *.db folders are empty after creation, which might have something to do with it...
Spark Version: 3.0.1
It turns out spark.sql.warehouse.dir is not where the local database metadata is stored... that lives in the Derby database stored in metastore_db. To relocate it, you need to change a system property:
System.setProperty("derby.system.home", derbyPath)
I didn't even have to set spark.sql.warehouse.dir; just relocate the derbyPath to a common location that all spark sessions use.
NOTE - You don't need to specify the "metastore_db" portion of the derbyPath; it will be auto-appended to the location.
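Put together, a rough sketch (shown here in PySpark; the Derby home path is whatever common location you pick, and the Hive-backed catalog is assumed):

from pyspark.sql import SparkSession

derby_home = "C:/path/to/shared/derby-home"  # common location for all sessions

spark = (
    SparkSession.builder
    .appName("persistent-metastore-sketch")
    .master("local[*]")
    # Forward -Dderby.system.home to the driver JVM. This only takes effect if
    # the driver JVM hasn't started yet; when launching through spark-submit,
    # pass it with --driver-java-options instead.
    .config("spark.driver.extraJavaOptions", f"-Dderby.system.home={derby_home}")
    # Assumption: the Hive-backed catalog is in use, so databases are tracked in
    # the Derby metastore_db created under derby_home.
    .enableHiveSupport()
    .getOrCreate()
)

print(spark.catalog.listDatabases())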

How to read/load local files in Databricks?

Is there any way of reading files located on my local machine other than navigating to 'Data' > 'Add Data' in Databricks?
In my past experience using Databricks with S3 buckets, I was able to read and load a dataframe just by specifying the path, i.e.:
df = spark.read.format('delta').load('<path>')
Is there any way I can do something like this in Databricks to read local files?
If you use the Databricks Connect client library, you can read local files into memory on a remote Databricks Spark cluster. See details here.
The alternative is to use the Databricks CLI (or REST API) and push local data to a location on DBFS, where it can be read into Spark from within a Databricks notebook. A similar idea would be to use the AWS CLI to put local data into an S3 bucket that can be accessed from Databricks.
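For the DBFS route, a rough sketch (file names and paths are placeholders):

# Push the local file to DBFS with the Databricks CLI, e.g.:
#   databricks fs cp ./mydata.csv dbfs:/tmp/mydata.csv
# Then read it from a notebook attached to the cluster:
df = spark.read.format("csv").option("header", "true").load("dbfs:/tmp/mydata.csv")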
It sounds like what you are looking for is Databricks Connect, which works with many popular IDEs.

Read Oracle dump file (.dmp) into a pandas dataframe

I have a testdata.dmp file available in an AWS S3 bucket and want to load the data into a pandas dataframe. I'm looking for a solution; I have boto3 installed.
Your Oracle dump file testdata.dmp has a proprietary binary format maintained by Oracle. This means that Oracle controls which tools can process it correctly. One such tool is Oracle Data Pump.
A workflow to extract data from an Oracle dump file and write it as Parquet files (readable with pandas) could look as follows:
Create an Oracle DB. As you are already using AWS S3, I suggest setting up an AWS RDS instance with the Oracle engine.
Download testdata.dmp from S3 to the created Oracle DB. This can be done with RDS's S3 integration.
Run Oracle Data Pump Import on the RDS instance. This tool is installed by default, and the RDS docs provide a detailed walk-through. After this, the content of testdata.dmp lives as tables with data and other objects inside the Oracle DB.
Dump all tables (and other objects) with a tool that is able to query Oracle DBs and write the result as Parquet. Some choices:
Sqoop (a Hadoop-based command line tool, but deprecated)
(Py)Spark (a popular data processing tool and, imho, the unofficial successor of Sqoop)
python-oracledb + Pandas (a sketch of this option follows below)
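A minimal sketch of the last option (python-oracledb + Pandas), once the data has been imported into the RDS Oracle instance; the credentials, DSN, and table name are placeholders:

import oracledb
import pandas as pd

# Placeholder connection details for the RDS Oracle instance
conn = oracledb.connect(user="admin", password="***",
                        dsn="my-rds-endpoint.amazonaws.com:1521/ORCL")

with conn.cursor() as cur:
    cur.execute("SELECT * FROM my_table")  # placeholder table name
    cols = [d[0] for d in cur.description]
    df = pd.DataFrame(cur.fetchall(), columns=cols)

# Write Parquet (requires pyarrow or fastparquet) for later reading with pandas
df.to_parquet("my_table.parquet")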

Migrating existing MarkLogic Application Server from Linux to AWS

I'd like to migrate a MarkLogic 7 Application Server from a Linux environment to AWS.
I've seen PDFs/tutorials on creating a new server on AWS, but I'm not sure how to migrate existing data and configurations.
There is more than one cluster.
Thanks
NGala
This question has nothing to do with AWS (AWS servers are just standard Linux servers). Consult your MarkLogic documentation on how to migrate between servers.
It makes a big difference whether you need to keep the server online the whole time or not. If you can shut it down, just install MarkLogic on an AWS Linux image and copy /var/opt/MarkLogic and any external data directories.
If you need to keep the system online, export a configuration package for your database and app server(s) from the MarkLogic Configuration Manager on port 8000, then import it on the new host. Then set up database replication as described at http://docs.marklogic.com/guide/database-replication/dbrep_intro. Once replication has synchronized, fail over to the new system.
Specific to AWS, you could back up a database to S3 from one cluster and then restore it on another cluster. This works even outside AWS, as long as the system can access S3.
