Reading from DB2 tables whose column names contain special characters into Spark - apache-spark

I need to read data from a DB2 table into a Spark dataframe.
However, the DB2 table, named 'TAB#15', has two columns whose names contain special characters, such as MYCRED# and MYCRED$.
My pyspark code looks like this:
query = '''select count(1) as cnt from {table} as T'''.format(table=table)
my_val = spark.read.jdbc(url, table=query, properties=properties).collect()
My spark-submit, however, throws an error that looks like this:
"ERROR: u"\nextraneous input '#' expecting... "
My questions are:
Is it possible to read data into a Spark dataframe, from a DB2 table whose table name and column names have special characters like '#' and '$'?
If there are any code samples or similar questions that illustrate reading DB2 data from columns with special characters in their names, please point me to them.

Try to use something like
table = '"MYDB2Specifier.TAB#15"'
Identifiers are enclosed in double quotes. If you leave them out, everything is folded to uppercase. If the string has special characters like a $, you might need to escape the character.
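As a rough end-to-end sketch of that advice in PySpark (the schema name MYSCHEMA, the DB2 JDBC driver class, and the url/user/password variables below are assumptions, not taken from the question):
# Sketch only: double quotes keep '#' and '$' intact on the DB2 side,
# and bracket access avoids re-parsing the names on the Spark side.
props = {"user": user, "password": password, "driver": "com.ibm.db2.jcc.DB2Driver"}
df = spark.read.jdbc(url, table='"MYSCHEMA"."TAB#15"', properties=props)
df.select(df["MYCRED#"], df["MYCRED$"]).show()
print(df.count())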

Related

How to fix the SQL query in databricks if column name has bracket in it

I have a file with data like this, which I have converted into a Databricks table.
Select * from myTable
Output:
Product[key] Product[name]
123 Mobile
345 television
456 laptop
I want to query my table for laptop data.
I am using below query
Select * from myTable where Product[name]='laptop'
I am getting below error in databricks:
AnalysisException: cannot resolve 'Product' given input columns:
[spark_catalog.my_db.myTable.Product[key], spark_catalog.my_db.myTable.Product[name]]
When certain characters appear in the column names of a table in SQL, you get a parse exception. These characters include brackets, dots (.), hyphens (-), etc. So, when such characters appear in column names, we need an escape character so that they are parsed simply as part of the column name.
For SQL in Databricks, this character is the backtick (`). Enclosing your column name in backticks ensures that it is parsed correctly as-is, even when it includes characters like '[]' (as in this case).
Since you converted file data into a Databricks table, you never hit the main problem, which is parsing the column name; if you manually create a table with that schema in Databricks, you run into the same parsing problem.
Once you use backticks in the following way, using the column name is no longer a problem:
create table mytable(`Product[key]` integer, `Product[name]` varchar(20))
select * from mytable where `Product[name]`='laptop'
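For completeness, the same escaping can be used from PySpark; a small sketch assuming the table above exists:
# Backticks inside spark.sql, or bracket access on the DataFrame, both keep
# the bracketed column name from being re-parsed.
spark.sql("select * from mytable where `Product[name]` = 'laptop'").show()
df = spark.table("mytable")
df.filter(df["Product[name]"] == "laptop").show()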

Snowflake ODBC to Excel, errors on field names containing spaces

We have a highly complex set of tables, with nested views that eventually feed a series of dashboards on a Tableau server. The base view uses "as" clauses on some data fields to create fields with spaces in the field name (i.e. somefieldname as "Some Field Name"). Later views use the * wildcard to retrieve all values. Tableau is able to handle it.
The problem is now users want to access those final views in Excel.
We set up an ODBC connection on their workstations so they can pull the data from one of the final views. However, the fields that contain blanks in their names show as errors and come through blank in the resulting worksheet. I'm trying to build a view on top of that final view and use "as" clauses to remove the spaces from the field names, but I haven't been able to find the proper SQL syntax for the source field. I've tried brackets, but that didn't work.
Would we be better off trying Power BI? Our data management people are just getting started with it; I haven't seen it yet but will be tomorrow.
Thanks in advance for any tips you can provide!
Lou
Creating a view on top of your final view with renamed columns is probably your easiest solution. The SQL syntax for selecting from a column that was created with spaces (more generally, a column created with double quotes around its name) is to put the column name in double quotes (") when you select from it. Here is an example:
-- Create a sample table. The first column contains spaces and capitals
create or replace table test_table
(
"Column with Spaces" varchar, -- A column created with quotes around it means that you can put anything into the field name. Including spaces & funky characters
col_without_spaces varchar
);
-- Insert some sample data
insert overwrite into test_table
values ('row1 col1', 'row1 col2'),
('row2 col1', 'row2 col2');
-- Create a view that renames columns
create or replace view test_view as
(
select
"Column with Spaces" as col_1, -- But now you have to select it like this since the spaces and capital letters have been treated literally in the create table statement
col_without_spaces as col_2
from test_table
);
-- Select from the view
select * from test_view;
Produces:
+---------+---------+
|COL_1    |COL_2    |
+---------+---------+
|row1 col1|row1 col2|
|row2 col1|row2 col2|
+---------+---------+

Keeping Special Characters in Spark Table Column Name

Is there any way to keep special characters for a column in a spark 3.0 table?
I need to do something like
CREATE TABLE schema.table
AS
SELECT id=abc
FROM tbl1
I was reading that in Hadoop you would put backticks around the column name, but this does not work in Spark.
If there is a way to do this in PySpark that would work as well
It turns out the parquet and delta formats do not accept special characters under any circumstance. You must use ROW FORMAT DELIMITED:
spark.sql("""CREATE TABLE schema.test
ROW FORMAT DELIMITED
SELECT 1 AS `brand=one` """)
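Querying the column back then follows the same backtick rule, e.g. (using the table name from the snippet above):
spark.sql("select `brand=one` from schema.test").show()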

When and why are Google Cloud Spanner table and column names case-sensitive?

Spanner documentation says:
Table and column names:
Can be between 1-128 characters long. Must start with an uppercase or lowercase letter.
Can contain uppercase and lowercase letters, numbers, and underscores, but not hyphens.
Are case-insensitive. For example, you cannot create tables named mytable and MyTable in the same database, or column names mycolumn and MyColumn in the same table.
https://cloud.google.com/spanner/docs/data-definition-language#table_statements
Given that, I have no idea what this means:
Table names are usually case insensitive, but may be case sensitive
when querying a database that uses case sensitive table names.
https://cloud.google.com/spanner/docs/lexical#case-sensitivity
In fact it seems that table names are case-sensitive, for example:
Queries fail if we don't match the case shown in the UI.
This seems to be an error in the documentation. Table names are case-insensitive in Cloud Spanner. I'll follow up with the docs team.
Edit: Updated docs https://cloud.google.com/spanner/docs/data-definition-language#naming_conventions
I'll add a couple of examples so we can see the difference.
Table names are case-sensitive. In this example it does not matter, as there is only one table:
Example 1:
SELECT *
FROM Roster
WHERE LastName = @myparam
returns all rows where LastName is equal to the value of query parameter myparam.
But for Example 2, where we compare two tables or make other kinds of queries using multiple tables:
SELECT id, name FROM Table1
EXCEPT
SELECT id, name FROM Table2
It will give you everything in Table1 but not in Table2.

Loading PIG output files into Hive table with some blank cells

I have successfully loaded a 250000-record CSV file into HDFS and performed some ETL functions on it, such as removing any characters other than 0-9, a-z and A-Z from the strings, so that the data is nice and clean.
I've saved the output of this ETL to the HDFS for loading into Hive. While in Hive I created the schema for the table and set the appropriate data types for each column.
create external table pigOutputHive (
id string,
Score int,
ViewCount int,
OwnerUserId string,
Body string,
Rank int
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
location '/user/admin/PigOutputETL';
When I run a simple query on the data such as:
SELECT * FROM pigoutputhive LIMIT 100000;
The data looks as it should, and when I download it to my local machine and view it in Excel as a CSV, it also looks good.
When I try and run the following query on the same table I get every field being returned as an integer even for the string columns. See the screenshot below.
Can anyone see where I am going wrong? Of the original 250000 rows there are some blanks in particular fields, such as OwnerUserId. Do I need to tell Pig or Hive how to handle these?
