Related
I'm able to successfully pull file metadata from my SharePoint library with the Microsoft Graph API, but am having trouble pulling the properties of an item:
I can get a partial list of properties using this endpoint:
https://graph.microsoft.com/v1.0/sites/{site-id}/drives/{}/items/{}/children?$expand=listItem($expand=fields)
But the list that comes from this endpoint doesn't match the list of properties that exists on the item.
For example, below is a list of fields that come from that endpoint - you can see that '.Push Too Salsify.' (one of the fields I need) is not present. There are also other fields that exist but don't appear in the item properties:
{'ParentLeafNameLookupId': '466', 'CLIPPING_x0020_STATUS': 'Not Started', 'Edit': '0', 'EditorLookupId': '67', '_ComplianceTagWrittenTime': '', 'RequiredField': 'teams/WORKFLOWDEMO/Shared Documents/1062CQP6.Phase4/1062CQP-Phase4-Size.tif', 'PM_x0020_SIGN_x0020_OFF': 'No', 'QA_x0020_APPROVED': 'No', 'ImageWidth': 3648, 'PM_x0020_Approval_x0020_Status': '-', 'AuthorLookupId': '6', 'SelectedFlag': '0', 'NameOrTitle': '1062CQP-Phase4-Size.tif', 'ItemChildCount': '0', 'FolderChildCount': '0', 'LinkFilename': '1062CQP-Phase4-Size.tif', 'ParentVersionStringLookupId': '466', 'PHOTOSTATUS': 'Not Started', '#odata.etag': '"c4b7516e-64df-46d2-b916-a1ee6f29d24a,8"', 'Thumbnail': '3648', '_x002e_Approval_x0020_Status_x002e_': 'Approved', 'Date_x0020_Created': '2019-10-09T04:25:40Z', '_CommentCount': '', 'Created': '2019-10-09T04:25:33Z', 'PreviewOnForm': '0', '_ComplianceTag': '', 'FileLeafRef': '1062CQP-Phase4-Size.tif', 'ImageHeight': 3648, 'LinkFilenameNoMenu': '1062CQP-Phase4-Size.tif', '_ComplianceFlags': '', 'ContentType': 'Document', 'Preview': '3648', 'ImageSize': '3648', 'Product_x0020_Category': 'Baseball', 'DATE_x0020_ASSIGNED': '2019-10-09T04:25:40Z', 'DateCreated': '2019-10-09T04:25:40Z', 'WORKFLOW_x0020_SELECTION': ['Select'], 'Predecessors': [], 'FileType': 'tif', 'LEGAL_x0020_APPROVED': 'No', 'PUSH_x0020_READY': False, 'FileSizeDisplay': '74966432', 'id': '466', '_LikeCount': '', '_ComplianceTagUserId': '', 'Modified': '2019-10-09T14:41:25Z', 'DocIcon': 'tif', '_UIVersionString': '0.7', '_CheckinComment': ''}
Any help would be greatly appreciated. I've scoured the documentation and can't seem to find the correct endpoint to pull item properties from a Sharepoint DriveItem.
I'm using PyCharm 2019.1 Professional and am able to connect to an Oracle JDBC database using a thin driver (jdbc:oracle:thin:#host:PORT:SID). I'm trying to use the cx_Oracle library (version 1.1.9) and Anaconda 3.6, but do not seem to have the functions .connect or .makedsn with the library. I find this unusual, and at a loss.
Do I just have the wrong cx_Oracle version even though I installed using pip?
Is the 1.1.9 version that works with Anaconda 3.6 just not have these functions?
Or is there a different/easier library I can use to connect with jdbc:oracle:thin:#host:PORT:SID?
dir(cx_Oracle)
Outputs:
['ARRAY', 'BIGINT', 'BINARY', 'BLANK_SCHEMA', 'BLOB', 'BOOLEAN',
'BigInteger', 'Binary', 'Boolean', 'CHAR', 'CLOB',
'CheckConstraint', 'Column', 'ColumnDefault', 'Constraint',
'DATE', 'DATETIME', 'DDL', 'DECIMAL', 'Date', 'DateTime',
'DefaultClause', 'Enum', 'FLOAT', 'FetchedValue', 'Float',
'ForeignKey', 'ForeignKeyConstraint', 'INT', 'INTEGER', 'Index',
'Integer', 'Interval', 'JSON', 'LargeBinary', 'MetaData',
'NCHAR', 'NUMERIC', 'NVARCHAR', 'Numeric', 'PassiveDefault',
'PickleType', 'PrimaryKeyConstraint', 'REAL', 'SMALLINT',
'Sequence', 'SmallInteger', 'String', 'TEXT', 'TIME',
'TIMESTAMP', 'Table', 'Text', 'ThreadLocalMetaData', 'Time',
'TypeDecorator', 'Unicode', 'UnicodeText', 'UniqueConstraint',
'VARBINARY', 'VARCHAR', 'all', 'builtins', 'cached',
'doc', 'file', 'go', 'loader', 'name',
'package', 'path', 'spec', 'version', 'alias',
'all', 'and', 'any_', 'asc', 'between', 'bindparam', 'case',
'cast', 'collate', 'column', 'create_engine', 'delete', 'desc',
'distinct', 'engine', 'engine_from_config', 'event', 'events',
'exc', 'except_', 'except_all', 'exists', 'extract', 'false',
'func', 'funcfilter', 'insert', 'inspect', 'inspection',
'interfaces', 'intersect', 'intersect_all', 'join', 'lateral',
'literal', 'literal_column', 'log', 'modifier', 'not_', 'null',
'or_', 'outerjoin', 'outparam', 'over', 'pool', 'processors',
'schema', 'select', 'sql', 'subquery', 'table', 'tablesample',
'text', 'true', 'tuple_', 'type_coerce', 'types', 'union',
'union_all', 'update', 'util', 'within_group']
Print out the value of cx_Oracle.version. The version number 1.1.9 is not a valid cx_Oracle version! The latest version is 7.2.1 and has a much different set of values than the ones you printed! Take a look at the cx_Oracle installation documentation and the top level module cx_Oracle documentation to get an idea of what I am talking about. If you have further questions, adjust your question above and add a comment below and I'll see if I can help further.
To check the version of cx_Oracle
You must have python
Use your python shell, command prompt, or your code/text editor
Print out the following code
C:\Users>python
Python 3.10.1 (tags/v3.10.1:2cd268a, Dec 6 2021, 19:10:37) [MSC v.1929 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import cx_Oracle
>>> print(cx_Oracle.version)
8.3.0
>>> exit()
Platform: RHEL 7, cloudera CDH 6.2 hadoop distrubution, pyspark 3.7.1
What i tried: I could write a table to hive warehouse when I explicitly mention the table name as saveAsTable("tablename"). But, i am getting below error when I try to take the table name from a python variable in a "for loop" as shown below.
Similar to: How to save a dataframe result in hive table with different name on each iteration using pyspark
prefix_list = ["hive_table_name1","hive_table_name2", "hive_table_name3"]
list1 = ["dataframe_content_1", "dataframe_content__2", "dataframe_content_3"]
for index, l in enumerate(list1):
selecteddata = df.select(l)
#Embedding table name within quotations
tablename = '"' + prefix_list[index] + '"'
# write the "selecteddata" dataframe to hive table
selecteddata.write.mode("overwrite").saveAsTable(tablename)
Expected: 3 different hive tables in default hive warehouse
Actual:
"ReturnMessages"
Traceback (most recent call last):
File "/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o86.saveAsTable.
: org.apache.spark.sql.catalyst.parser.ParseException:
mismatched input '"ReturnMessages"' expecting {'SELECT', 'FROM', 'ADD', 'AS', 'ALL', 'ANY', 'DISTINCT', 'WHERE', 'GROUP', 'BY', 'GROUPING', 'SETS', 'CUBE', 'ROLLUP', 'ORDER', 'HAVING', 'LIMIT', 'AT', 'OR', 'AND', 'IN', NOT, 'NO', 'EXISTS', 'BETWEEN', 'LIKE', RLIKE, 'IS', 'NULL', 'TRUE', 'FALSE', 'NULLS', 'ASC', 'DESC', 'FOR', 'INTERVAL', 'CASE', 'WHEN', 'THEN', 'ELSE', 'END', 'JOIN', 'CROSS', 'OUTER', 'INNER', 'LEFT', 'SEMI', 'RIGHT', 'FULL', 'NATURAL', 'ON', 'PIVOT', 'LATERAL', 'WINDOW', 'OVER', 'PARTITION', 'RANGE', 'ROWS', 'UNBOUNDED', 'PRECEDING', 'FOLLOWING', 'CURRENT', 'FIRST', 'AFTER', 'LAST', 'ROW', 'WITH', 'VALUES', 'CREATE', 'TABLE', 'DIRECTORY', 'VIEW', 'REPLACE', 'INSERT', 'DELETE', 'INTO', 'DESCRIBE', 'EXPLAIN', 'FORMAT', 'LOGICAL', 'CODEGEN', 'COST', 'CAST', 'SHOW', 'TABLES', 'COLUMNS', 'COLUMN', 'USE', 'PARTITIONS', 'FUNCTIONS', 'DROP', 'UNION', 'EXCEPT', 'MINUS', 'INTERSECT', 'TO', 'TABLESAMPLE', 'STRATIFY', 'ALTER', 'RENAME', 'ARRAY', 'MAP', 'STRUCT', 'COMMENT', 'SET', 'RESET', 'DATA', 'START', 'TRANSACTION', 'COMMIT', 'ROLLBACK', 'MACRO', 'IGNORE', 'BOTH', 'LEADING', 'TRAILING', 'IF', 'POSITION', 'EXTRACT', 'DIV', 'PERCENT', 'BUCKET', 'OUT', 'OF', 'SORT', 'CLUSTER', 'DISTRIBUTE', 'OVERWRITE', 'TRANSFORM', 'REDUCE', 'SERDE', 'SERDEPROPERTIES', 'RECORDREADER', 'RECORDWRITER', 'DELIMITED', 'FIELDS', 'TERMINATED', 'COLLECTION', 'ITEMS', 'KEYS', 'ESCAPED', 'LINES', 'SEPARATED', 'FUNCTION', 'EXTENDED', 'REFRESH', 'CLEAR', 'CACHE', 'UNCACHE', 'LAZY', 'FORMATTED', 'GLOBAL', TEMPORARY, 'OPTIONS', 'UNSET', 'TBLPROPERTIES', 'DBPROPERTIES', 'BUCKETS', 'SKEWED', 'STORED', 'DIRECTORIES', 'LOCATION', 'EXCHANGE', 'ARCHIVE', 'UNARCHIVE', 'FILEFORMAT', 'TOUCH', 'COMPACT', 'CONCATENATE', 'CHANGE', 'CASCADE', 'RESTRICT', 'CLUSTERED', 'SORTED', 'PURGE', 'INPUTFORMAT', 'OUTPUTFORMAT', DATABASE, DATABASES, 'DFS', 'TRUNCATE', 'ANALYZE', 'COMPUTE', 'LIST', 'STATISTICS', 'PARTITIONED', 'EXTERNAL', 'DEFINED', 'REVOKE', 'GRANT', 'LOCK', 'UNLOCK', 'MSCK', 'REPAIR', 'RECOVER', 'EXPORT', 'IMPORT', 'LOAD', 'ROLE', 'ROLES', 'COMPACTIONS', 'PRINCIPALS', 'TRANSACTIONS', 'INDEX', 'INDEXES', 'LOCKS', 'OPTION', 'ANTI', 'LOCAL', 'INPATH', IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 0)
== SQL ==
"ReturnMessages"
^^^
at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:241)
at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:117)
at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:48)
at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parseTableIdentifier(ParseDriver.scala:49)
at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:400)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/37Pro/files/Partnerdaten.py", line 144, in <module>
dataframe.write.saveAsTable(filename, format="parquet", mode="overwrite")
File "/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/lib/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 775, in saveAsTable
File "/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
File "/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 73, in deco
pyspark.sql.utils.ParseException: '\nmismatched input \'"ReturnMessages"\' expecting {\'SELECT\', \'FROM\', \'ADD\', \'AS\', \'ALL\', \'ANY\', \'DISTINCT\', \'WHERE\', \'GROUP\', \'BY\', \'GROUPING\', \'SETS\', \'CUBE\', \'ROLLUP\', \'ORDER\', \'HAVING\', \'LIMIT\', \'AT\', \'OR\', \'AND\', \'IN\', NOT, \'NO\', \'EXISTS\', \'BETWEEN\', \'LIKE\', RLIKE, \'IS\', \'NULL\', \'TRUE\', \'FALSE\', \'NULLS\', \'ASC\', \'DESC\', \'FOR\', \'INTERVAL\', \'CASE\', \'WHEN\', \'THEN\', \'ELSE\', \'END\', \'JOIN\', \'CROSS\', \'OUTER\', \'INNER\', \'LEFT\', \'SEMI\', \'RIGHT\', \'FULL\', \'NATURAL\', \'ON\', \'PIVOT\', \'LATERAL\', \'WINDOW\', \'OVER\', \'PARTITION\', \'RANGE\', \'ROWS\', \'UNBOUNDED\', \'PRECEDING\', \'FOLLOWING\', \'CURRENT\', \'FIRST\', \'AFTER\', \'LAST\', \'ROW\', \'WITH\', \'VALUES\', \'CREATE\', \'TABLE\', \'DIRECTORY\', \'VIEW\', \'REPLACE\', \'INSERT\', \'DELETE\', \'INTO\', \'DESCRIBE\', \'EXPLAIN\', \'FORMAT\', \'LOGICAL\', \'CODEGEN\', \'COST\', \'CAST\', \'SHOW\', \'TABLES\', \'COLUMNS\', \'COLUMN\', \'USE\', \'PARTITIONS\', \'FUNCTIONS\', \'DROP\', \'UNION\', \'EXCEPT\', \'MINUS\', \'INTERSECT\', \'TO\', \'TABLESAMPLE\', \'STRATIFY\', \'ALTER\', \'RENAME\', \'ARRAY\', \'MAP\',
You are not specifying the name of the database in your write statement.
Here is how I would do what you are trying to do:
database_name = "my_database"
prefix_list = ["hive_table_name1","hive_table_name2", "hive_table_name3"]
list1 = ["dataframe_content_1", "dataframe_content_2", "dataframe_content_3"]
for index, l in enumerate(list1):
selecteddata = df.select(l)
#Embedding table name within quotations
tablename = prefix_list[index]
# map to the correct database and table
db_name_and_corresponding_table = f"{database_name}.{tablename}"
# write the "selecteddata" dataframe to hive table
selecteddata.write.mode("overwrite").saveAsTable(db_name_and_corresponding_table)
Hope that helps.
this works with parquet
val sqlDF = spark.sql("SELECT DISTINCT field FROM parquet.`file-path'")
I tried the same way with Avro but it keeps giving me an error even if i use com.databricks.spark.avro.
When I execute the following query:
val sqlDF = spark.sql("SELECT DISTINCT Source_Product_Classification FROM avro.`file path`")
I get the AnalysisException. Why?
org.apache.spark.sql.AnalysisException: Failed to find data source: avro. Please find an Avro package at http://spark.apache.org/third-party-projects.html;; line 1 pos 51
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at org.apache.spark.sql.execution.datasources.ResolveDataSource$$anonfun$apply$1.applyOrElse(rules.scala:61)
at org.apache.spark.sql.execution.datasources.ResolveDataSource$$anonfun$apply$1.applyOrElse(rules.scala:38)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:60)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$1.apply(LogicalPlan.scala:58)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$1.apply(LogicalPlan.scala:58)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:307)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:305)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:58)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$1.apply(LogicalPlan.scala:58)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$1.apply(LogicalPlan.scala:58)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:307)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:305)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:58)
at org.apache.spark.sql.execution.datasources.ResolveDataSource.apply(rules.scala:38)
at org.apache.spark.sql.execution.datasources.ResolveDataSource.apply(rules.scala:37)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82)
at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124)
at scala.collection.immutable.List.foldLeft(List.scala:84)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:74)
at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:69)
at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:67)
at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:50)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:63)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:592)
Changing the name of the format to com.databricks.spark.avro does not make any difference and queries fail.
val sqlDF = spark.sql("SELECT DISTINCT Source_Product_Classification FROM com.databricks.spark.avro`file-path`")
org.apache.spark.sql.catalyst.parser.ParseException:
extraneous input '.' expecting {<EOF>, ',', 'SELECT', 'FROM', 'ADD', 'AS', 'ALL', 'DISTINCT', 'WHERE', 'GROUP', 'BY', 'GROUPING', 'SETS', 'CUBE', 'ROLLUP', 'ORDER', 'HAVING', 'LIMIT', 'AT', 'OR', 'AND', 'IN', NOT, 'NO', 'EXISTS', 'BETWEEN', 'LIKE', RLIKE, 'IS', 'NULL', 'TRUE', 'FALSE', 'NULLS', 'ASC', 'DESC', 'FOR', 'INTERVAL', 'CASE', 'WHEN', 'THEN', 'ELSE', 'END', 'JOIN', 'CROSS', 'OUTER', 'INNER', 'LEFT', 'RIGHT', 'FULL', 'NATURAL', 'LATERAL', 'WINDOW', 'OVER', 'PARTITION', 'RANGE', 'ROWS', 'UNBOUNDED', 'PRECEDING', 'FOLLOWING', 'CURRENT', 'FIRST', 'LAST', 'ROW', 'WITH', 'VALUES', 'CREATE', 'TABLE', 'VIEW', 'REPLACE', 'INSERT', 'DELETE', 'INTO', 'DESCRIBE', 'EXPLAIN', 'FORMAT', 'LOGICAL', 'CODEGEN', 'CAST', 'SHOW', 'TABLES', 'COLUMNS', 'COLUMN', 'USE', 'PARTITIONS', 'FUNCTIONS', 'DROP', 'UNION', 'EXCEPT', 'MINUS', 'INTERSECT', 'TO', 'TABLESAMPLE', 'STRATIFY', 'ALTER', 'RENAME', 'ARRAY', 'MAP', 'STRUCT', 'COMMENT', 'SET', 'RESET', 'DATA', 'START', 'TRANSACTION', 'COMMIT', 'ROLLBACK', 'MACRO', 'IF', 'DIV', 'PERCENT', 'BUCKET', 'OUT', 'OF', 'SORT', 'CLUSTER', 'DISTRIBUTE', 'OVERWRITE', 'TRANSFORM', 'REDUCE', 'USING', 'SERDE', 'SERDEPROPERTIES', 'RECORDREADER', 'RECORDWRITER', 'DELIMITED', 'FIELDS', 'TERMINATED', 'COLLECTION', 'ITEMS', 'KEYS', 'ESCAPED', 'LINES', 'SEPARATED', 'FUNCTION', 'EXTENDED', 'REFRESH', 'CLEAR', 'CACHE', 'UNCACHE', 'LAZY', 'FORMATTED', 'GLOBAL', TEMPORARY, 'OPTIONS', 'UNSET', 'TBLPROPERTIES', 'DBPROPERTIES', 'BUCKETS', 'SKEWED', 'STORED', 'DIRECTORIES', 'LOCATION', 'EXCHANGE', 'ARCHIVE', 'UNARCHIVE', 'FILEFORMAT', 'TOUCH', 'COMPACT', 'CONCATENATE', 'CHANGE', 'CASCADE', 'RESTRICT', 'CLUSTERED', 'SORTED', 'PURGE', 'INPUTFORMAT', 'OUTPUTFORMAT', DATABASE, DATABASES, 'DFS', 'TRUNCATE', 'ANALYZE', 'COMPUTE', 'LIST', 'STATISTICS', 'PARTITIONED', 'EXTERNAL', 'DEFINED', 'REVOKE', 'GRANT', 'LOCK', 'UNLOCK', 'MSCK', 'REPAIR', 'RECOVER', 'EXPORT', 'IMPORT', 'LOAD', 'ROLE', 'ROLES', 'COMPACTIONS', 'PRINCIPALS', 'TRANSACTIONS', 'INDEX', 'INDEXES', 'LOCKS', 'OPTION', 'ANTI', 'LOCAL', 'INPATH', 'CURRENT_DATE', 'CURRENT_TIMESTAMP', IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 65)
== SQL ==
SELECT DISTINCT Source_Product_Classification FROM com.databricks.spark.avro`/uat/myfile`
-----------------------------------------------------------------^^^
at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:197)
at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:99)
at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:45)
at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:53)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:592)
... 48 elided
Spark SQL supports avro format through a separate spark-avro module.
A library for reading and writing Avro data from Spark SQL.
Please note that spark-avro is a seaprate module that is not included by default in Spark.
You should load the module using spark-submit --packages, e.g.
$ bin/spark-shell --packages com.databricks:spark-avro_2.11:3.2.0
See With spark-shell or spark-submit.
Jaceks answer works in general but in my environment it was not working due to obscure reasons. and spark-shell --packages com.databricks:spark-avro_2.11:3.2.0 is hanging for a long with out producing any result.
I solved this problems using --jars option along with spark-shell
Steps :
1) go to https://mvnrepository.com/artifact/com.databricks/spark-avro_2.11/4.0.0
copy link address of jar http://central.maven.org/maven2/com/databricks/spark-avro_2.11/4.0.0/spark-avro_2.11-4.0.0.jar
2) wget http://central.maven.org/maven2/com/databricks/spark-avro_2.11/4.0.0/spark-avro_2.11-4.0.0.jar .
3) spark-shell --jars <pathwhere you downloaded jar file>/spark-avro_2.11-4.0.0.jar
4)spark.read.format("com.databricks.spark.avro").load("s3://MYAVROLOCATION.avro")
which got converted in to dataframe and was able to print it.
In your case once you get the dataframe you can do sql on your way.
Note : If you are not using spark-shell you can make uber jar using sbt or maven with spark-avro_2.11-4.0.0.jar using below maven coordinates.
<dependency>
<groupId>com.databricks</groupId>
<artifactId>spark-avro_2.11</artifactId>
<version>4.0.0</version>
</dependency>
Note : Avro datasource was introduced in spark 2.4 on wards.. SparkSPARK-24768
Have a built-in AVRO data source implementation
Which means that all the above things are not necessary any more.
See spark-release-2-4-0 release notes
Spark Avro Integration:
By using Spark, we can integrate avro format using spark-avro module. spark-avro library originally developed by databricks as a open source library. spark-avro module is external and not included in the spark-submit or spark-shell by default. So externally we need to specify while submitting spark job.
In the following section, i will explain how to integrate Spark and Avro data format.
Spark version > 2.4
Spark 2.4 release onwards, Spark SQL provides built-in support for reading and writing Apache Avro data.
Maven Dependency:
https://mvnrepository.com/artifact/org.apache.spark/spark-avro
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-avro_2.12</artifactId>
<version>2.4.5</version>
</dependency>
Spark Submit:
./bin/spark-submit --packages org.apache.spark:spark-avro_2.12:2.4.5 ...
SparkShell:
./bin/spark-shell --packages org.apache.spark:spark-avro_2.12:2.4.5 ...
Example:
SparkAvroWriteExample.scala
import org.apache.spark.SparkConf;
import org.apache.spark.sql.SparkSession;
case class Employee( id:Long, name:String, salary:Float, deptId: Int)
object SparkAvroWriteExample {
def main(args: Array[String]): Unit = {
val conf = new SparkConf().setIfMissing("spark.master", "local[*]").setAppName("Spark Avro Read Examples")
val spark = SparkSession.builder().config(conf).getOrCreate();
val employeeList = List(Employee(1, "Ranga", 10000, 1),
Employee(2, "Vinod", 1000, 1),
Employee(3, "Nishanth", 500000, 2),
Employee(4, "Manoj", 25000, 1),
Employee(5, "Yashu", 1600, 1),
Employee(6, "Raja", 50000, 2)
);
val employeeDF = spark.createDataFrame(employeeList);
employeeDF.coalesce(1).write.format("avro").mode("overwrite").save("employees.avro");
spark.close();
}
}
SparkAvroReadExample.scala
import org.apache.spark.SparkConf;
import org.apache.spark.sql.SparkSession;
object SparkAvroReadExample {
def main(args: Array[String]): Unit = {
val conf = new SparkConf().setIfMissing("spark.master", "local[*]").setAppName("Spark Avro Read Examples")
val spark = SparkSession.builder().config(conf).getOrCreate();
val employeeDF = spark.read.format("avro").load("employees.avro");
employeeDF.printSchema();
employeeDF.foreach(employee => {println(employee);});
spark.close();
}
}
Github link
https://github.com/rangareddy/ranga-spark-poc/tree/master/spark-2.4/SparkAvro
Spark version < 2.4
In Spark version < 2.4, explicitly we need to specify avro format as com.databricks.spark.avro otherwise we will get org.apache.spark.sql.AnalysisException: Failed to find data source: avro. error.
Maven Dependency:
Spark Version Compatible version of Avro Data Source for Spark
1.2 0.2.0
1.3 1.0.0
1.4+ 2.0.1
2.0 - 2.1 3.2.0
2.2 - 2.3 4.0.0
https://mvnrepository.com/artifact/com.databricks/spark-avro
<dependency>
<groupId>com.databricks</groupId>
<artifactId>spark-avro_2.11</artifactId>
<version>4.0.0</version>
</dependency>
Spark Submit:
./bin/spark-submit --packages com.databricks:spark-avro_2.11:4.0.0 ...
SparkShell:
./bin/spark-shell --packages com.databricks:spark-avro_2.11:4.0.0 ...
Examples:
SparkAvroWriteExample.scala
import org.apache.spark.SparkConf;
import org.apache.spark.sql.SparkSession;
case class Employee( id:Long, name:String, salary:Float, deptId: Int)
object SparkAvroWriteExample {
def main(args: Array[String]): Unit = {
val conf = new SparkConf().setIfMissing("spark.master", "local[*]").setAppName("Spark Avro Read Examples")
val spark = SparkSession.builder().config(conf).getOrCreate();
val employeeList = List(Employee(1, "Ranga", 10000, 1),
Employee(2, "Vinod", 1000, 1),
Employee(3, "Nishanth", 500000, 2),
Employee(4, "Manoj", 25000, 1),
Employee(5, "Yashu", 1600, 1),
Employee(6, "Raja", 50000, 2)
);
val employeeDF = spark.createDataFrame(employeeList);
employeeDF.coalesce(1).write.format("com.databricks.spark.avro").mode("overwrite").save("employees.avro");
spark.close();
}
}
SparkAvroReadExample.scala
import org.apache.spark.SparkConf;
import org.apache.spark.sql.SparkSession;
object SparkAvroReadExample {
def main(args: Array[String]): Unit = {
val conf = new SparkConf().setIfMissing("spark.master", "local[*]").setAppName("Spark Avro Read Examples")
val spark = SparkSession.builder().config(conf).getOrCreate();
val employeeDF = spark.read.format("com.databricks.spark.avro").load("employees.avro");
employeeDF.printSchema();
employeeDF.foreach(employee => {println(employee);});
spark.close();
}
}
Github link
https://github.com/rangareddy/ranga-spark-poc/tree/master/spark-2.3/SparkAvro
Thats all folks!!
I have next Cassandra table structure:
CREATE TABLE ringostat.hits (
hitId uuid,
clientId VARCHAR,
session MAP<VARCHAR, TEXT>,
traffic MAP<VARCHAR, TEXT>,
PRIMARY KEY (hitId, clientId)
);
INSERT INTO ringostat.hits (hitId, clientId, session, traffic)
VALUES('550e8400-e29b-41d4-a716-446655440000'. 'clientId', {'id': '1', 'number': '1', 'startTime': '1460023732', 'endTime': '1460023762'}, {'referralPath': '/example_path_for_example', 'campaign': '(not set)', 'source': 'www.google.com', 'medium': 'referal', 'keyword': '(not set)', 'adContent': '(not set)', 'campaignId': '', 'gclid': '', 'yclid': ''});
INSERT INTO ringostat.hits (hitId, clientId, session, traffic)
VALUES('650e8400-e29b-41d4-a716-446655440000'. 'clientId', {'id': '1', 'number': '1', 'startTime': '1460023732', 'endTime': '1460023762'}, {'referralPath': '/example_path_for_example', 'campaign': '(not set)', 'source': 'www.google.com', 'medium': 'cpc', 'keyword': '(not set)', 'adContent': '(not set)', 'campaignId': '', 'gclid': '', 'yclid': ''});
INSERT INTO ringostat.hits (hitId, clientId, session, traffic)
VALUES('750e8400-e29b-41d4-a716-446655440000'. 'clientId', {'id': '1', 'number': '1', 'startTime': '1460023732', 'endTime': '1460023762'}, {'referralPath': '/example_path_for_example', 'campaign': '(not set)', 'source': 'www.google.com', 'medium': 'referal', 'keyword': '(not set)', 'adContent': '(not set)', 'campaignId': '', 'gclid': '', 'yclid': ''});
I want to select all rows where source='www.google.com' AND medium='referal'.
SELECT * FROM hits WHERE traffic['source'] = 'www.google.com' AND traffic['medium'] = 'referal' ALLOW FILTERING;
Without add ALLOW FILTERING I got error: No supported secondary index found for the non primary key columns restrictions.
That's why I see two options:
Create index on traffic column.
Create materialized view.
Create another table and set INDEX for traffic column.
Which is the best option ? Also, I have many fields with MAP type on which I will need to filter. What issues can be if on every field I will add INDEX ?
Thank You.
From When to use an index.
Do not use an index in these situations:
On high-cardinality columns because you then query a huge volume of records for a small number of results. [...] Conversely, creating an index on an extremely low-cardinality column, such as a boolean column, does not make sense.
In tables that use a counter column
On a frequently updated or deleted column.
To look for a row in a large partition unless narrowly queried.
If your planned usage meets one or more of these criteria, it is probably better to use a materialized view.