We're exploring the possibility of using temporary views in Spark and relating them to actual file storage, or to other temporary views. We want to achieve something like this:
1. Some user uploads data to some S3/HDFS file storage.
2. A (temporary) view is defined over that data so that Spark SQL queries can be run against it.
3. Some other temporary view (referring to some other data) is created.
4. A third temporary view is created that joins the data from (2) and (3).
5. By selecting from the view in (4), the user gets a table that reflects the joined data from (2) and (3). This result can be further processed, stored in a new temporary view, and so on.
So we end up with a tree of temporary views, each querying its parent temporary views until the leaves load data from the filesystem. Basically we want to store transformation steps (selecting, joining, filtering, modifying etc.) on the data without storing new versions of it. Spark SQL and temporary views seem like a good fit.
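Concretely, we imagine a chain roughly like this (the paths and view names are just made-up examples):

// Illustrative sketch of steps (1)-(5) above.
val sparkSession = org.apache.spark.sql.SparkSession.builder().getOrCreate()

// (2) a temporary view over the uploaded data
sparkSession.read.parquet("s3a://bucket/uploads/orders").createOrReplaceTempView("orders")
// (3) another temporary view over some other data
sparkSession.read.parquet("s3a://bucket/uploads/customers").createOrReplaceTempView("customers")
// (4) a third view that joins (2) and (3)
sparkSession.sql(
  """select o.*, c.name
    |from orders o join customers c on o.customer_id = c.id""".stripMargin)
  .createOrReplaceTempView("orders_enriched")
// (5) further transformation steps keep building on previous views
sparkSession.sql("select * from orders_enriched where name is not null")
  .createOrReplaceTempView("orders_clean")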
We did some successful testing. The idea is to store the specification of these temporary views in our application and recreate them during startup (as temporary or global views).
Not sure if this is a viable solution. One problem is that we need to know how the temporary views are related (which one queries which). We create them like this:
sparkSession.sql("select * from other_temp_view").createTempView(name)
So when this runs, we have to make sure that other_temp_view has already been created in the session, and we're not sure how that can be achieved. One idea is to store a timestamp with each view and recreate them in the same order, roughly as sketched below. This could be OK since our views will most likely have to be "immutable": we're not allowed to change a query that other queries rely on.
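Roughly, we picture replaying stored view specifications like this (the ViewSpec class and its fields are only illustrative, not an existing API):

import org.apache.spark.sql.SparkSession

// Each view is stored in our application as (name, sql, creation timestamp).
case class ViewSpec(name: String, sql: String, createdAtMillis: Long)

def recreateViews(sparkSession: SparkSession, specs: Seq[ViewSpec]): Unit = {
  // Views are immutable and only ever reference views created before them,
  // so replaying them in creation order guarantees every parent exists first.
  specs.sortBy(_.createdAtMillis).foreach { spec =>
    sparkSession.sql(spec.sql).createOrReplaceTempView(spec.name)
  }
}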
Any thoughts would be most appreciated.
I would definitely go with the SessionCatalog object:
https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-SessionCatalog.html
You can access it with spark.sessionState.catalog
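For example, a minimal sketch (view names are placeholders; SessionCatalog is an internal API, so method signatures can differ slightly between Spark versions):

val catalog = spark.sessionState.catalog

// Only create the child view once its parent is registered in this session.
if (catalog.getTempView("other_temp_view").isDefined) {
  spark.sql("select * from other_temp_view").createTempView("child_view")
}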
I've seen the following code and I think it is the wrong way to cache a temp view in Spark. What do you think?
spark.sql(
s"""
|...
""".stripMargin).createOrReplaceTempView(s"temp_view")
spark.table(s"temp_view").cache()
In my opinion, this code caches the DataFrame that I create with spark.table("temp_view"), but not the original temp view.
Am I right?
IMO yes, you are caching what you read from this table, but if, for example, you read it again on the next line you will end up with a second scan.
I think you could try to use CACHE TABLE within your SQL:
https://spark.apache.org/docs/latest/sql-ref-syntax-aux-cache-cache-table.html
CACHE TABLE statement caches contents of a table or output of a query
with the given storage level. If a query is cached, then a temp view
will be created for this query. This reduces scanning of the original
files in future queries.
To me it seems promising.
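For example, roughly like this, reusing the placeholder view name from the question:

// Cache the temp view itself by name, so later queries that reference it reuse the in-memory data.
spark.sql("CACHE TABLE temp_view")

// Or defer materialization until the first query actually touches the view.
spark.sql("CACHE LAZY TABLE temp_view")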
I think the caching in your example will actually work. Spark does not cache DataFrame instances; instead, it uses logical plans as the cache key, and the view is transparent for that purpose. For example, here's the code I've just tried using a local table I have:
val df = spark.table("mart.dim_region")
df.createOrReplaceTempView("dim_region")
spark.table("dim_region").cache()
Even though the cache is applied to the view, if I repeatedly invoke df.show the execution plan contains InMemoryTableScan, which is precisely the effect of caching.
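If you want to check this yourself, printing the physical plan is enough:

// After the cache() above, the plan for df should contain an InMemoryTableScan node.
df.explain()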
I have a very long task that creates a bunch of views using Spark SQL and I get the following error at some step: pyspark.sql.utils.AnalysisException: The depth of view 'foobar' exceeds the maximum view resolution depth (100).
I have been searching in Google and SO and couldn't find anyone with a similar error.
I have tried caching the view foobar, but that doesn't help. I'm thinking of creating temporary tables as a workaround, since I would prefer not to change the current Spark configuration if possible, but I'm not sure if I'm missing something.
UPDATE:
I tried creating tables in Parquet format so that I reference tables rather than views, but I still get the same error. I applied that to all the input tables of the SQL query that causes the error.
If it makes a difference, I'm using ANSI SQL, not the Python API.
Solution
Using Parquet tables worked for me after all. I spotted that I was still missing one table to persist, which is why it didn't work at first.
So I changed my SQL statements from this:
CREATE OR REPLACE TEMPORARY VIEW `VIEW_NAME` AS
SELECT ...
To:
CREATE TABLE `TABLE_NAME` USING PARQUET AS
SELECT ...
This moves all the critical views to Parquet tables under spark-warehouse/, or whatever warehouse directory you have configured.
Note:
This will write the tables to the master node's disk. Make sure you have enough disk space, or consider dumping to an external data store like S3 or what have you. Read this as an alternative, and now preferred, solution using checkpoints, roughly sketched below.
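A rough sketch of that checkpoint approach, with placeholder paths and view names (checkpoint() is eager by default and needs a checkpoint directory):

// Point the checkpoint directory at HDFS/S3 in a real cluster.
spark.sparkContext.setCheckpointDir("hdfs:///tmp/spark-checkpoints")

val intermediate = spark.table("SOME_UPSTREAM_VIEW")  // any step in the long chain of views
val materialized = intermediate.checkpoint()          // writes the data out and truncates the lineage
materialized.createOrReplaceTempView("VIEW_NAME")     // downstream views now resolve against a shallow plan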
When I run the DROP DATABASE command, Spark deletes the database directory and all its subdirectories on HDFS. How can I avoid this?
Short answer:
Unless you set up your database so that it contains only external tables that exist outside of the database HDFS directory, there is no way to achieve this without copying all of your data to another location in HDFS.
Long answer:
From the following website:
https://www.oreilly.com/library/view/programming-hive/9781449326944/ch04.html
By default, Hive won’t permit you to drop a database if it contains tables. You can either drop the tables first or append the CASCADE keyword to the command, which will cause Hive to drop the tables in the database first.
Using the RESTRICT keyword instead of CASCADE is equivalent to the default behavior, where existing tables must be dropped before dropping the database.
When a database is dropped, its directory is also deleted.
You can copy the data to another location before dropping the database. I know it's a pain, but that's how Hive operates.
If you were trying to just drop a table without deleting the HDFS directory of the table, there's a solution for this described here: Can I change a table from internal to external in hive?
Dropping an external table preserves the HDFS location for the data.
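For reference, that conversion boils down to flipping the EXTERNAL table property; the table name below is a placeholder, and note that some Hive versions require the upper-case string 'TRUE':

// Mark a managed (internal) table as external so that DROP TABLE keeps its HDFS files.
spark.sql("ALTER TABLE my_db.my_table SET TBLPROPERTIES('EXTERNAL'='TRUE')")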
Cascading the database drop to the tables after converting them to external will not fix this, because the database drop impacts the whole HDFS directory the database resides in. You would still need to copy the data to another location.
If you create a database from scratch in which every table is external and references a location outside of the database's HDFS directory, dropping that database will preserve the data. But if the data currently sits inside the database's HDFS directory, you will not have this behavior; it's something you would have to set up from scratch.
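Sketched out, that from-scratch setup could look roughly like this, assuming Hive support is enabled (database, table names and paths are only examples):

// The database directory and the table data live in separate HDFS locations,
// so dropping the database (even with CASCADE) leaves /data/events untouched.
spark.sql("CREATE DATABASE my_db LOCATION 'hdfs:///warehouse/my_db.db'")
spark.sql(
  """CREATE EXTERNAL TABLE my_db.events (id BIGINT, payload STRING)
    |STORED AS PARQUET
    |LOCATION 'hdfs:///data/events'""".stripMargin)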
I'm playing around with map and reduce through temporary views, but at 1,000,000+ documents it is a bit slow. Rather than creating a separate dataset for testing, is it possible to use only a subset of the data in the temporary view?
A map-reduce view is more like "CREATE INDEX" than it is like "SELECT * FROM".
In other words, when you do a map-reduce view, CouchDB will crunch through every document.
However, for testing, one thing you can do is make a normal (non-temporary) view. Just develop your work in a throwaway design document, e.g. _design/my_experiments.
Save your map-reduce view code and then query the view with the ?stale=update_after option. You will probably get no results at first; however, stale=update_after tells CouchDB to begin processing the view. Now try your query again: you will see the results that have been processed so far. Try a third time and you will see even more data reflected.
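To make that loop concrete, here is a rough sketch; the database, design document and view names are made up, and it simply repeats the same GET while the indexer catches up:

import scala.io.Source

// stale=update_after returns whatever has been indexed so far and tells CouchDB to keep indexing.
val viewUrl =
  "http://localhost:5984/mydb/_design/my_experiments/_view/my_view?stale=update_after"

for (attempt <- 1 to 3) {
  println(s"--- attempt $attempt ---")
  println(Source.fromURL(viewUrl).mkString)  // rows processed so far
  Thread.sleep(5000)                         // give the view indexer some time before asking again
}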
Roughly speaking, views process documents in the same order that a _changes query returns them to you: the earliest update is processed first, the rest follow in order, and the most recent change is processed last.
I am using the Cassandra database for a large-scale application, and I am new to Cassandra. I have a database schema for a particular keyspace, for which I have created columns using the Cassandra Command Line Interface (CLI). Now, when I copied a dataset into the folder /var/lib/cassandra/data/, I was not able to access the values using the key of a particular column; I get the message "zero rows present". But the files are present, all with extensions like XXXX-Data.db, XXXX-Filter.db, and XXXX-Index.db. Can anyone tell me how to access the columns of these existing datasets?
(a) Cassandra doesn't expect you to move its data files around out from underneath it. You'll need to restart if you do any manual surgery like that.
(b) If you didn't also copy the schema definition, Cassandra will ignore data files for unknown column families.
For what you are trying to achieve, it is probably better to export and import your SSTables.
You should have a look at bin/sstable2json and bin/json2sstable.
Documentation is there (near the end of the page): Cassandra Operations
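Roughly, the export/import cycle looks like the sketch below; the keyspace, column family and paths are only examples, and the exact flags should be checked against your Cassandra version:

import scala.sys.process._
import java.io.File

// Dump one SSTable to JSON on the source node...
("bin/sstable2json /var/lib/cassandra/data/MyKeyspace/MyCF-g-1-Data.db" #> new File("MyCF.json")).!

// ...then rebuild an SSTable from that JSON against an existing, matching schema on the target.
Seq("bin/json2sstable", "-K", "MyKeyspace", "-c", "MyCF",
    "MyCF.json", "/var/lib/cassandra/data/MyKeyspace/MyCF-g-1-Data.db").!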