Spark doesn't show all Hive databases - apache-spark

If I list all the databases in Hive, I get the following result (I have two databases, default and sbm):
But if I try to do the same thing in Spark, I get this:
It doesn't show the database sbm.

Are you connected to that Hive metastore? Did you specify the metastore details anywhere (i.e. hive-site.xml in the Spark conf directory)? It seems like you are connected to the local metastore.

I think you need to copy your hive-site.xml into the spark/conf directory.
If you use Ubuntu and have the environment variables defined, use the following command:
cp $HIVE_HOME/conf/hive-site.xml $SPARK_HOME/conf
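As a quick sanity check (a minimal sketch, assuming a Spark 2.x spark-shell; the app name is arbitrary), you can verify that Spark is talking to the real Hive metastore rather than a local Derby one:

import org.apache.spark.sql.SparkSession

// Hive support must be enabled, otherwise Spark falls back to a local
// in-process metastore and typically only shows the default database.
val spark = SparkSession.builder()
  .appName("metastore-check")
  .enableHiveSupport()
  .getOrCreate()

// With hive-site.xml in $SPARK_HOME/conf this should list both default and sbm.
spark.sql("SHOW DATABASES").show()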

Related

Hdfs file access in spark

I am developing an application where I read a file from Hadoop, process it, and store the data back to Hadoop.
I am confused about what the proper HDFS file path format should be. When reading an HDFS file from the spark shell like
val file=sc.textFile("hdfs:///datastore/events.txt")
it works fine and I am able to read it.
But when I submit a jar containing the same code to YARN, it gives the error
org.apache.hadoop.HadoopIllegalArgumentException: Uri without authority: hdfs:/datastore/events.txt
When I add the name node address, as in hdfs://namenodeserver/datastore/events.txt, everything works.
I am a bit confused about this behaviour and need some guidance.
Note: I am using an AWS EMR setup and all the configurations are default.
If you want to use sc.textFile("hdfs://...") you need to give the full (absolute) path; in your example that would be "nn1home:8020/.."
If you want to make it simple, then just use sc.textFile("hdfs:/input/war-and-peace.txt")
That's only one /
I think it will work.
Problem solved. As I debugged further, the fs.defaultFS property from core-site.xml was not being used when I passed the path as hdfs:///path/to/file, even though all the Hadoop config properties were loaded (I logged the sparkContext.hadoopConfiguration object).
As a workaround I manually read the property with sparkContext.hadoopConfiguration().get("fs.defaultFS") and prepended it to the path.
I don't know whether this is the correct way of doing it.
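For reference, a minimal sketch of that workaround in Scala (the path is the one from the question; sc is an existing SparkContext):

// Read the configured default filesystem, e.g. "hdfs://namenodeserver:8020",
// and prepend it so the URI always carries an authority, even under YARN.
val defaultFs = sc.hadoopConfiguration.get("fs.defaultFS")
val file = sc.textFile(s"$defaultFs/datastore/events.txt")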

tables available in Spark-SQL CLI are not available over thriftserver

I'm trying to expose my spark-sql tables over JDBC via the thriftserver, but even though it looks like I've successfully connected, it's not working. Here's what I've tried so far.
database setup:
in pyspark I loaded a parquet file, created a temp view as tableX
performed a .saveAsTable as hive_tableX
then I queried that table: spark.sql("SELECT * FROM hive_tableX LIMIT 1").show() which returned some data
at this point, my code has saved the table information to the Hive metastore, right?
querying from spark-sql:
I then ran spark-sql and the spark sql shell started up
USE default
show tables; --> I see my table in there, hive_tableX
SELECT * FROM hive_tableX LIMIT 1 and I see some successful results.
thus, I believe it is now verified that my table has been saved in the Hive metastore, right?
then I turn on thriftserver
./sbin/start-thriftserver.sh
next, I turn on beeline so I can test the thriftserver connection
!connect jdbc:hive2://localhost:10000 (and enter username and password)
then I select the default db: use default;
and show tables; --> there's nothing there.
So, where are my tables? is beeline or thrift pointing to a different warehouse or something?
Edit: I think my thriftserver isn't using the right warehouse directory, so I'm trying to start it with a config option:
[still nothing] sbin/start-thriftserver.sh --hiveconf spark.sql.warehouse.dir=/code/spark/thrift/spark-warehouse
[still nothing] sbin/start-thriftserver.sh --conf spark.sql.warehouse.dir=/code/spark/thrift/spark-warehouse
Edit: starting it in the same physical directory as where the warehouse was created seems to do the trick. However, I don't know how to programmatically set the path to something else and start it elsewhere.
The solution to this particular problem was that I was starting the thriftserver from a different directory than the one where the spark-warehouse and metastore_db were located.
Once I started it from the correct directory, it worked as expected and my tables were now available.
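If you would rather not depend on the launch directory at all, one option (a sketch, assuming Spark 2.x with the default embedded Derby metastore; the paths below are examples, not the ones from the question) is to pin both locations to absolute paths when the table is created, and pass the same values to start-thriftserver.sh via --conf / --hiveconf:

import org.apache.spark.sql.SparkSession

// Pin the warehouse and the embedded Derby metastore to absolute paths so
// neither depends on the directory the process is launched from.
val spark = SparkSession.builder()
  .appName("save-hive-table")
  .config("spark.sql.warehouse.dir", "/data/spark/spark-warehouse")
  .config("spark.hadoop.javax.jdo.option.ConnectionURL",
    "jdbc:derby:;databaseName=/data/spark/metastore_db;create=true")
  .enableHiveSupport()
  .getOrCreate()

A shared external Hive metastore (configured through hive-site.xml) avoids the problem entirely, since the application and the thriftserver would then resolve tables through the same service.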

writing data to filesystem from hive queries in hdinsight

I see that it's possible to write query results to the filesystem in Hadoop: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-Writingdataintothefilesystemfromqueries
How do I save a query result, in the case of HDInsight, into a folder that is accessible from blob storage?
I tried something as below but was not successful.
INSERT OVERWRITE LOCAL DIRECTORY '/example/distinctconsumers' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' select consumerid from distinctconsumers;
Thanks
The language manual clearly states the following:
If the LOCAL keyword is used, Hive will write data to the directory on the local file system.
If you remove 'LOCAL' from your query then it will work.
NOTE: the result might not be a single file but a list of files (one from each task)
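Concretely, that just means dropping the LOCAL keyword from the statement in the question; on HDInsight the path then resolves against the default (blob-backed) filesystem:

INSERT OVERWRITE DIRECTORY '/example/distinctconsumers' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' select consumerid from distinctconsumers;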

Why do I get the error "Could not retrieve endpoint ranges" when I run sstableloader?

I have used sstableloader many times successfully, but this time I got the following error:
[root@localhost pengcz]# /usr/local/cassandra/bin/sstableloader -u user -pw password -v -d 172.21.0.131 ./currentdata/keyspace/table
Could not retrieve endpoint ranges:
java.lang.IllegalArgumentException
java.lang.RuntimeException: Could not retrieve endpoint ranges:
at org.apache.cassandra.tools.BulkLoader$ExternalClient.init(BulkLoader.java:338)
at org.apache.cassandra.io.sstable.SSTableLoader.stream(SSTableLoader.java:156)
at org.apache.cassandra.tools.BulkLoader.main(BulkLoader.java:106)
Caused by: java.lang.IllegalArgumentException
at java.nio.Buffer.limit(Buffer.java:267)
at org.apache.cassandra.utils.ByteBufferUtil.readBytes(ByteBufferUtil.java:543)
at org.apache.cassandra.serializers.CollectionSerializer.readValue(CollectionSerializer.java:124)
at org.apache.cassandra.serializers.MapSerializer.deserializeForNativeProtocol(MapSerializer.java:101)
at org.apache.cassandra.serializers.MapSerializer.deserializeForNativeProtocol(MapSerializer.java:30)
at org.apache.cassandra.serializers.CollectionSerializer.deserialize(CollectionSerializer.java:50)
at org.apache.cassandra.db.marshal.AbstractType.compose(AbstractType.java:68)
at org.apache.cassandra.cql3.UntypedResultSet$Row.getMap(UntypedResultSet.java:287)
at org.apache.cassandra.config.CFMetaData.fromSchemaNoTriggers(CFMetaData.java:1833)
at org.apache.cassandra.config.CFMetaData.fromThriftCqlRow(CFMetaData.java:1126)
at org.apache.cassandra.tools.BulkLoader$ExternalClient.init(BulkLoader.java:330)
... 2 more
I don't know whether this error is related to a Linux crash on one of the cluster nodes.
Any advice will be appreciated!
Are you running a different version of sstableloader than the version of your cluster? It looks like https://issues.apache.org/jira/browse/CASSANDRA-9324, which occurs when using the 2.1 loader on a 2.0 cluster.
In the command you mention, /usr/local/cassandra/bin/sstableloader -u user -pw password -v -d 172.21.0.131 ./currentdata/keyspace/table, the backup directory is ./currentdata/keyspace/table. sstableloader treats the parent of the backup directory as the keyspace and the backup directory itself as the table, so rename the parent directory (here literally named keyspace) to the keyspace you are restoring into, and rename the table directory to the actual table name; the path should look like .../<actual_keyspace>/<actual_table>. Apart from this, please make sure your keyspace and table names are not Cassandra reserved words.
I realise this is an old question but I'm posting the answer here for posterity. It looks like you're hitting CASSANDRA-10700.
TL;DR - When sstableloader tries to read the schema, it fails when it comes across a dropped collections column.
The problem only exists in the sstableloader utility and you can easily work around it by getting a copy from Cassandra 2.1.13+ as documented here. Cheers!

Path for persistent, managed hive tables in Spark 1.4

I am new to Spark and working on the JavaSqlNetworkWordCount example to append the word count to a persistent table. I understand that I can only do it via HiveContext. HiveContext, however, keeps trying to save the table in /user/hive/warehouse/. I have tried changing the path by adding
hiveContext.setConf("hive.metastore.warehouse.dir", "/home/user_name");
and by adding the property
<property><name>hive.metastore.warehouse.dir</name>
<value>/home/user_name</value></property>
to $SPARK_HOME/conf/hive-site.xml, but nothing seems to work. If anyone else has faced this problem, please let me know if/how you resolved it. I am using Spark 1.4 on my local RHEL5 machine.
I think I solved the problem. It looks like spark-submit was creating a metastore_db directory in the root directory of the jar file. If metastore_db exists, then the hive-site.xml values are ignored. As soon as I removed that directory, the code picked up the values from hive-site.xml. I still cannot set the value of the hive.metastore.warehouse.dir property from code, though.
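For anyone hitting the same thing, here is a minimal sketch of that setup with the Spark 1.4-era API (the warehouse path is the one from the question, the table data is just an example; the key point from above is that a stale metastore_db directory in the working directory overrides hive-site.xml, so remove it first):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("persistent-word-count"))
val hiveContext = new HiveContext(sc)
import hiveContext.implicits._

// Must be set before the first managed table is created; it is ignored if a
// stale metastore_db already exists in the current working directory.
hiveContext.setConf("hive.metastore.warehouse.dir", "/home/user_name")

// Example data standing in for the streaming word counts.
val wordCounts = Seq(("hello", 2L), ("world", 1L)).toDF("word", "count")
wordCounts.write.mode("append").saveAsTable("word_counts")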
