SparkSQL Column Query not showing column contents? - apache-spark

I have created a persistent table via df.saveAsTable.
When I run the following query
spark.sql("""SELECT * FROM mytable """).show()
I get a view of the DataFrame with all of its columns and all of the data.
However when I run
spark.sql("""SELECT 'NameDisplay' FROM mytable """).show()
I receive results that look like this
| NameDisplay|
|--|
| NameDisplay |
| NameDisplay |
| NameDisplay |
| NameDisplay |
| NameDisplay |
| NameDisplay |
NameDisplay is definitely one of the columns in the table as it's shown when I run select * - how come this is not shown in the second query?

The issue was using single quotes on the column name. It needs to be escaped with backticks instead: `NameDisplay`

Selecting 'NameDisplay' in SQL selects the literal text "NameDisplay", so the results you got are in fact valid.
To select values of the "NameDisplay" column, then you must issue:
"SELECT NameDisplay FROM mytable "
Or, if you need to quote it (for example if the column was created that way, contains spaces, or is case-sensitive):
"""SELECT `NameDisplay` FROM mytable"""
This is SQL syntax, nothing specific to Spark.
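For completeness, here is a minimal PySpark sketch of the difference (the table and column names come from the question; everything else is illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Single quotes make a string literal: one constant value repeated per row.
spark.sql("SELECT 'NameDisplay' FROM mytable").show()

# A bare or backtick-quoted identifier references the actual column.
spark.sql("SELECT NameDisplay FROM mytable").show()
spark.sql("SELECT `NameDisplay` FROM mytable").show()

# The DataFrame API sidesteps the quoting question entirely.
spark.table("mytable").select("NameDisplay").show()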

BigQuery - Delete rows from table partitioned by date

I have a dataset.table partitioned by date (100 partitions) like this:
table_name_(100), which means: table_name_20200101, table_name_20200102, table_name_20200103, ...
Example of table_name_20200101:
| id  | col_1 | col_2 | col_3 |
|-----|-------|-------|-------|
| xxx | 2     | 6     | 10    |
| yyy | 1     | 60    | 29    |
| zzz | 12    | 61    | 78    |
| aaa | 18    | 56    | 80    |
I would like to delete the rows with id = 'yyy' in all the (partitioned) tables:
DELETE FROM `project_id.dataset_id.table_name_*`
WHERE id = 'yyy'
I got this error :
Illegal operation (write) on meta-table
project_id:dataset_id.table_name_*
Is there a way to delete the rows with id = 'yyy' in all the (partitioned) tables?
Thank you
Okay, there are various things to call out here to ensure we're using consistent terminology.
You're talking about sharded tables, not partitioned. In a partitioned table, the data within the table is organized based on the partitioning specification. Here, you just have a series of tables named using a common prefix and a suffix based on date.
The use of the table_prefix* syntax is called a wildcard table, and DML is explicitly not allowed via wildcard tables: https://cloud.google.com/bigquery/docs/querying-wildcard-tables
The table_name_(100) is an aspect of how the BigQuery UI collapses series of like-named tables to save space in the navigation panes. It's not how the service itself references tables at all.
The way you can accomplish this is to leverage other aspects of BigQuery: The INFORMATION_SCHEMA tables and scripting functionality.
Information about what tables are in a dataset is available via the TABLES view: https://cloud.google.com/bigquery/docs/information-schema-tables
Information about scripting can be found here: https://cloud.google.com/bigquery/docs/reference/standard-sql/scripting
Now, here's an example that combines these concepts:
DECLARE myTables ARRAY<STRING>;
DECLARE X INT64 DEFAULT 0;
DECLARE queryStr STRING;
# First, we query INFORMATION_SCHEMA to generate an array of the tables we want to process.
# This INFORMATION_SCHEMA query currently has a LIMIT clause so that if you get it wrong,
# you won't bork all the tables in the dataset in one go.
SET myTables = (
  SELECT
    ARRAY_AGG(t)
  FROM (
    SELECT
      TABLE_NAME as t
    FROM `my-project-id`.my_dataset.INFORMATION_SCHEMA.TABLES
    WHERE
      TABLE_TYPE = 'BASE TABLE' AND
      STARTS_WITH(TABLE_NAME, 'table_name_')
    ORDER BY TABLE_NAME
    LIMIT 2
  )
);
# Now, we process that array of tables using scripting's loop construct,
# one at a time.
LOOP
  IF X >= ARRAY_LENGTH(myTables) THEN
    LEAVE;
  END IF;
  # DANGER WILL ROBINSON: This mutates tables!!!
  #
  # The next line constructs the SQL statement we want to run for each table.
  #
  # In this example, we're constructing the same DML DELETE
  # statement to run on each table. For safety's sake, you may want to start with
  # something like a SELECT query to validate your assumptions and project the
  # myTables values to see what you're getting.
  SET queryStr = "DELETE FROM `my-project-id`.my_dataset." || myTables[SAFE_OFFSET(X)] || " WHERE id = 'yyy'";
  # Now, run the generated SQL via EXECUTE IMMEDIATE.
  EXECUTE IMMEDIATE queryStr;
  SET X = X + 1;
END LOOP;
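If you'd rather drive this from client code instead of scripting, a minimal Python sketch with the google-cloud-bigquery client library does the same thing (the project, dataset, and table prefix are the same placeholders as above):

from google.cloud import bigquery

client = bigquery.Client(project="my-project-id")

# Enumerate the sharded tables by prefix, mirroring the INFORMATION_SCHEMA query.
for table in client.list_tables("my_dataset"):
    if not table.table_id.startswith("table_name_"):
        continue
    # DANGER: this mutates tables. Consider a dry run with a SELECT first.
    sql = (
        f"DELETE FROM `my-project-id.my_dataset.{table.table_id}` "
        "WHERE id = 'yyy'"
    )
    client.query(sql).result()  # result() blocks until the DML job completes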

SparkSQL/Hive: equivalent of MySQL's `information_schema.table.{data_length, table_rows}`?

In MySQL, we can query the table information_schema.tables and obtain useful information such as data_length or table_rows
select
    data_length,
    table_rows
from information_schema.tables
where table_schema = 'some_db'
  and table_name = 'some_table';
+-------------+------------+
| data_length | table_rows |
+-------------+------------+
| 8368 | 198 |
+-------------+------------+
1 row in set (0.01 sec)
Is there an equivalent mechanism for SparkSQL/Hive?
I am okay with using SparkSQL or a programmatic API like HiveMetaStoreClient (the Java API org.apache.hadoop.hive.metastore.HiveMetaStoreClient). For the latter I read the API doc (here) and could not find any method related to table row counts or sizes.
There is no single command for meta-information. Rather, there is a set of commands you may use:
Describe Table/View/Column
desc [formatted|extended] schema_name.table_name;
show table extended like part_table;
SHOW TBLPROPERTIES tblname("foo");
Display Column Statistics (Hive 0.14.0 and later)
DESCRIBE FORMATTED [db_name.]table_name column_name;
DESCRIBE FORMATTED [db_name.]table_name column_name PARTITION (partition_spec);
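In Spark specifically, the closest equivalent to MySQL's data_length/table_rows is to compute table statistics with ANALYZE TABLE and read them back from DESCRIBE EXTENDED. A minimal PySpark sketch, assuming Spark 2.x+ and the database/table names from the question:

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Compute and persist size and row-count statistics in the metastore.
spark.sql("ANALYZE TABLE some_db.some_table COMPUTE STATISTICS")

# DESCRIBE EXTENDED now includes a 'Statistics' row in its output.
for row in spark.sql("DESCRIBE EXTENDED some_db.some_table").collect():
    if row.col_name == "Statistics":
        print(row.data_type)  # e.g. '8368 bytes, 198 rows'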

Add Column in Apache Cassandra

How can I check from node.js that a column does not exist in Apache Cassandra?
I need to add a column only if it does not exist.
I have read that I must run a select first, but if I select a column that does not exist, it returns an error.
Note that if you're on Cassandra 3.x and up, you'll want to query from the columns table on the system_schema keyspace:
aploetz@cqlsh:system_schema> SELECT * FROM system_schema.columns
WHERE keyspace_name='stackoverflow'
AND table_name='vehicle_information'
AND column_name='name';
keyspace_name | table_name | column_name | clustering_order | column_name_bytes | kind | position | type
---------------+---------------------+-------------+------------------+-------------------+---------+----------+------
stackoverflow | vehicle_information | name | none | 0x6e616d65 | regular | -1 | text
(1 rows)
You can check a column's existence using a select query on the system.schema_columns table (Cassandra 2.x and earlier).
Suppose you have the table test_table in keyspace test, and you want to check whether the column test_column exists or not.
Use the below query :
SELECT * FROM system.schema_columns WHERE keyspace_name = 'test' AND columnfamily_name = 'test_table' AND column_name = 'test_column';
If the above query returns a result, the column exists; otherwise it does not.
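The question asks about node.js, but the check-then-alter logic is driver-agnostic; here is a minimal sketch with the Python driver (assuming Cassandra 3.x, a local node, and the keyspace/table/column names from above; it ports directly to the DataStax node.js driver):

from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect()

# On Cassandra 3.x+; query system.schema_columns instead on 2.x and earlier.
row = session.execute(
    "SELECT column_name FROM system_schema.columns "
    "WHERE keyspace_name = %s AND table_name = %s AND column_name = %s",
    ("test", "test_table", "test_column"),
).one()

# Add the column only if the lookup came back empty.
if row is None:
    session.execute("ALTER TABLE test.test_table ADD test_column text")

Note there is still a race if two clients do this concurrently; schema changes are best made from a single migration path.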

Cassandra CQL searching for element in list

I have a table that has a column of list type (tags):
CREATE TABLE "Videos" (
video_id UUID,
title VARCHAR,
tags LIST<VARCHAR>,
PRIMARY KEY (video_id, upload_timestamp)
) WITH CLUSTERING ORDER BY (upload_timestamp DESC);
I have plenty of rows containing various values in the tags column, e.g. ["outdoor","funny cats","funny mice"].
I want to perform a SELECT query that will return all rows that contain "funny cats" in the tags column. How can I do that?
To directly answer your question, yes there is a way to accomplish this. As of Cassandra 2.1 you can create a secondary index on a collection. First, I'll re-create your column family definition (while adding a definition for upload_timestamp timeuuid) and put some values in it.
aploetz@cqlsh:stackoverflow> SELECT * FROM videos;
video_id | upload_timestamp | tags | title
--------------------------------------+--------------------------------------+-----------------------------------------------+---------------------------
2977b806-df76-4dd7-a57e-11d361e72ce1 | fc011080-64f9-11e4-a819-21b264d4c94d | ['sci-fi', 'action', 'adventure'] | Star Wars
ab696e1f-78c0-45e6-893f-430e88db7f46 | 8db7c4b0-64fa-11e4-a819-21b264d4c94d | ['documentary'] | The Witches of Whitewater
15e6bc0d-6195-4d8b-ad25-771966c780c8 | 1680d120-64fa-11e4-a819-21b264d4c94d | ['dark comedy', 'action', 'language warning'] | Pulp Fiction
(3 rows)
Next, I'll create a secondary index on the tags column:
aploetz@cqlsh:stackoverflow> CREATE INDEX ON videos (tags);
Now, if I want to query the videos that contain the tag "action," I can accomplish this with the CONTAINS keyword:
aploetz@cqlsh:stackoverflow> SELECT * FROM videos WHERE tags CONTAINS 'action';
video_id | upload_timestamp | tags | title
--------------------------------------+--------------------------------------+-----------------------------------------------+--------------
2977b806-df76-4dd7-a57e-11d361e72ce1 | fc011080-64f9-11e4-a819-21b264d4c94d | ['sci-fi', 'action', 'adventure'] | Star Wars
15e6bc0d-6195-4d8b-ad25-771966c780c8 | 1680d120-64fa-11e4-a819-21b264d4c94d | ['dark comedy', 'action', 'language warning'] | Pulp Fiction
(2 rows)
With this all being said, I should pass along a couple of warnings:
Secondary indexes do not perform well at scale. They exist to provide convenience, not performance. If you expect to query by tag often, the right way to solve this is to create a videosbytag query table, with the same data but keyed like this: PRIMARY KEY (tag,video_id) (see the sketch after this list).
You don't need the double quotes around your table name. In fact, having it in quotes may cause you problems (OK, maybe minor irritations) down the road.
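For reference, a minimal sketch of that query-table approach with the Python driver (the videos_by_tag name and column set are illustrative, not from the original schema):

import uuid
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("stackoverflow")

# One row per (tag, video) pair; the tag is the partition key,
# so "find videos by tag" becomes a single-partition read.
session.execute("""
    CREATE TABLE IF NOT EXISTS videos_by_tag (
        tag text,
        video_id uuid,
        title text,
        PRIMARY KEY (tag, video_id)
    )
""")

# Denormalize on write: insert the video once per tag.
insert = session.prepare(
    "INSERT INTO videos_by_tag (tag, video_id, title) VALUES (?, ?, ?)")
video_id = uuid.uuid4()
for tag in ["sci-fi", "action", "adventure"]:
    session.execute(insert, (tag, video_id, "Star Wars"))

# Now the tag query needs no secondary index:
rows = session.execute("SELECT * FROM videos_by_tag WHERE tag = 'action'")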

adding set of columns in row in cassandra

I am a newbie with Cassandra. I am not sure about adding a set of columns to a row many times, e.g., I want to add call-related information columns (columns like timestamp_calling_no, timestamp_tower_id, timestamp_start_time, timestamp_end_time, timestamp_duration, timestamp_call_type, etc.) to a row whenever the same mobile number makes a call, using hector/astyanax/java/CQL.
Please give your suggestions. Thanks in advance.
Adding columns to a row is an inexpensive operation. One more thing: Cassandra stores column names sorted, so using a timestamp as a column name solves the problem of slicing. Cassandra can have 2 billion columns in a row in the CDR column family, so we can easily keep adding columns.
If you were trying to run a query which required Cassandra to scan all rows, then yes, it would perform poorly.
It's good that you recognise that there are a number of APIs available. I would recommend using CQL3, mostly because the Thrift APIs are now kept only for backwards compatibility, and because CQL can be used with most languages while Astyanax and Hector are Java-specific.
I'd use a compound key (under "Clustering, compound keys, and more").
//An example keyspace using CQL2
cqlsh>
CREATE KEYSPACE calls WITH strategy_class = 'SimpleStrategy'
AND strategy_options:replication_factor = 1;
//And next create a CQL3 table with a compound key
//Compound key is formed from the number and call's start time
cqlsh>
CREATE TABLE calls.calldata (
number text,
timestamp_start_time timestamp,
timestamp_end_time timestamp,
PRIMARY KEY (number, timestamp_start_time)
) WITH COMPACT STORAGE
The above schema allows you to insert rows with the same number as a key multiple times: because the call's start time is part of the primary key, each combination creates a unique key.
Next, insert some test data using CQL3 (for the purposes of the example, of course):
cqlsh> //This example data below uses 2 different numbers
insert into calls.calldata (number, timestamp_start_time, timestamp_end_time)
values ('+441234567890', 1335361733545850, 1335361773545850);
insert into calls.calldata (number, timestamp_start_time, timestamp_end_time)
values ('+440987654321', 1335361734678700, 1335361737678700);
insert into calls.calldata (number, timestamp_start_time, timestamp_end_time)
values ('+441234567890', 1335361738208700, 1335361738900032);
insert into calls.calldata (number, timestamp_start_time, timestamp_end_time)
values ('+441234567890', 1335361740100277, 1335361740131251);
insert into calls.calldata (number, timestamp_start_time, timestamp_end_time)
values ('+440987654321', 1335361740176666, 1335361740213000);
And now we can retrieve all the data (again using CQL3). (The far-future years like 44285 in the output below are an artifact of the test data: the inserts pass microsecond values where CQL timestamps expect milliseconds since the epoch.)
cqlsh> SELECT * FROM calls.calldata;
number | timestamp_start_time | timestamp_end_time
---------------+---------------------------+---------------------------
+440987654321 | 44285-12-05 15:11:18+0000 | 44285-12-05 16:01:18+0000
+440987654321 | 44285-12-05 16:42:56+0000 | 44285-12-05 16:43:33+0000
+441234567890 | 44285-12-05 14:52:25+0000 | 44285-12-06 01:59:05+0000
+441234567890 | 44285-12-05 16:10:08+0000 | 44285-12-05 16:21:40+0000
+441234567890 | 44285-12-05 16:41:40+0000 | 44285-12-05 16:42:11+0000
Or part of the data. Because a compound key is used, you can retrieve all the rows for a specific number using CQL3:
cqlsh> SELECT * FROM calls.calldata WHERE number='+441234567890';
number | timestamp_start_time | timestamp_end_time
---------------+---------------------------+---------------------------
+441234567890 | 44285-12-05 14:52:25+0000 | 44285-12-06 01:59:05+0000
+441234567890 | 44285-12-05 16:10:08+0000 | 44285-12-05 16:21:40+0000
+441234567890 | 44285-12-05 16:41:40+0000 | 44285-12-05 16:42:11+0000
And if you want to be really specific, you can retrieve a specific number where the call started at a specific time (again thanks to the compound key)
cqlsh> SELECT * FROM calls.calldata WHERE number='+441234567890'
and timestamp_start_time=1335361733545850;
number | timestamp_start_time | timestamp_end_time
---------------+---------------------------+---------------------------
+441234567890 | 44285-12-05 14:52:25+0000 | 44285-12-06 01:59:05+0000
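To illustrate the point above that CQL works from most languages, here is the same single-partition query from application code with the Python driver (a sketch; the cluster address is assumed to be local):

from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect()

# Same compound-key query as the cqlsh example: all calls for one number.
rows = session.execute(
    "SELECT * FROM calls.calldata WHERE number = %s",
    ("+441234567890",),
)
for row in rows:
    print(row.number, row.timestamp_start_time, row.timestamp_end_time)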
In frameworks like PlayOrm, there are multiple ways to do that. For example, you can use the @OneToMany or @NoSqlEmbedded pattern. For more details, visit http://buffalosw.com/wiki/Patterns-Page/
