I have two tables, finance.movies and finance.dummytable_3.
Both have been created using Spark SQL, and their metadata is the same.
Both have two files.
The total file size of finance.movies is 312363 bytes, while that of finance.dummytable_3 is 1209 bytes.
But when I execute a simple "select *" on both tables, finance.dummytable_3, whose total file size is smaller, executes 2 jobs, while finance.movies executes only one job. The describe output follows.
hive> describe formatted finance.movies;
OK
# col_name data_type comment
movieid int
title_movie string
genres string
# Detailed Table Information
Database: finance
Owner: root
CreateTime: Thu Apr 16 02:18:51 EDT 2020
LastAccessTime: UNKNOWN
Protect Mode: None
Retention: 0
Location: hdfs://test.com:8020/GenNext_Finance_DL/Finance.db/movies
Table Type: MANAGED_TABLE
Table Parameters:
numFiles 2
spark.sql.create.version 2.3.0.2.6.5.0-292
spark.sql.sources.provider parquet
spark.sql.sources.schema.numParts 1
spark.sql.sources.schema.part.0 {"type":"struct","fields":[{"name":"movieid","type":"integer","nullable":true,"metadata":{}},{"name":"title_movie","type":"string","nullable":true,"metadata":{}},{"name":"genres","type":"string","nullable":true,"metadata":{}}]}
totalSize 312363
transient_lastDdlTime 1587017931
# Storage Information
SerDe Library: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
InputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
path hdfs://test.com:8020/GenNext_Finance_DL/Finance.db/movies
serialization.format 1
Time taken: 3.877 seconds, Fetched: 35 row(s)
hive> describe formatted finance.dummytable_3;
OK
# col_name data_type comment
colname1 string
colname2 string
colname3 string
# Detailed Table Information
Database: finance
Owner: root
CreateTime: Wed Apr 15 02:40:18 EDT 2020
LastAccessTime: UNKNOWN
Protect Mode: None
Retention: 0
Location: hdfs://test.com:8020/GenNext_Finance_DL/Finance.db/dummytable_3
Table Type: MANAGED_TABLE
Table Parameters:
numFiles 2
spark.sql.create.version 2.3.0.2.6.5.0-292
spark.sql.sources.provider parquet
spark.sql.sources.schema.numParts 1
spark.sql.sources.schema.part.0 {"type":"struct","fields":[{"name":"colName1","type":"string","nullable":true,"metadata":{}},{"name":"colName2","type":"string","nullable":true,"metadata":{}},{"name":"colName3","type":"string","nullable":true,"metadata":{}}]}
totalSize 1209
transient_lastDdlTime 1586932818
# Storage Information
SerDe Library: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
InputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
path hdfs://test.com:8020/GenNext_Finance_DL/Finance.db/dummytable_3
serialization.format 1
Time taken: 0.373 seconds, Fetched: 35 row(s)
When I execute the following query, the number of jobs executed is 2:
spark-sql --master yarn --name spark_load_test -e "select * from finance.dummytable_3 limit 2;"
But when I execute the following query, the number of jobs executed is just one, despite both tables having two files and an identical structure:
spark-sql --master yarn --name spark_load_test -e "select * from finance.movies limit 2;"
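For reference, here is a diagnostic I could run (a rough sketch, assuming a PySpark shell against the same metastore) to compare the per-file row counts and the physical plans. My understanding is that Spark collects a LIMIT incrementally, scanning one partition first and launching another job only if that partition did not return enough rows, so a first file with fewer than 2 rows could account for the extra job:

# Hypothetical diagnostic: compare per-file row counts and the LIMIT plans of both tables.
from pyspark.sql.functions import input_file_name

for table in ["finance.movies", "finance.dummytable_3"]:
    df = spark.table(table)
    # Rows contributed by each underlying parquet file
    df.groupBy(input_file_name()).count().show(truncate=False)
    # Physical plan of the LIMIT query (look for CollectLimit)
    spark.sql("SELECT * FROM {} LIMIT 2".format(table)).explain()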
I'm getting an invalid timestamp when reading Elasticsearch records using Spark with the elasticsearch-hadoop library. I'm using the following Spark code to read the records:
val sc = spark.sqlContext

val elasticFields = Seq(
  "start_time",
  "action",
  "category",
  "attack_category"
)

sc.sql(
  "CREATE TEMPORARY TABLE myIndex " +
  "USING org.elasticsearch.spark.sql " +
  "OPTIONS (resource 'aggattack-2021.01')")

val all = sc.sql(
  s"""
     |SELECT ${elasticFields.mkString(",")}
     |FROM myIndex
     |""".stripMargin)

all.show(2)
Which leads to the following result:
+-----------------------+------+---------+---------------+
|start_time |action|category |attack_category|
+-----------------------+------+---------+---------------+
|1970-01-19 16:04:27.228|drop |udp-flood|DoS |
|1970-01-19 16:04:24.027|drop |others |DoS |
+-----------------------+------+---------+---------------+
But I'm expecting a timestamp with the current year, e.g. 2021-01-19 16:04:27.228. In Elasticsearch, the start_time field is stored as Unix time in millis -> "start_time": 1611314773.641
The problem was with the data in Elasticsearch. The start_time field was mapped as epoch_second and contained epoch seconds with three decimal places (e.g. 1611583978.684). Everything works fine after we converted the epoch time to millis without any decimal places.
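For reference, a minimal sketch of that conversion (assuming the raw values are floats of epoch seconds; the field would then be mapped as epoch_millis):

# Hypothetical conversion: epoch seconds with a fractional part -> integer epoch millis
def to_epoch_millis(epoch_seconds):
    return int(round(epoch_seconds * 1000))

print(to_epoch_millis(1611583978.684))  # 1611583978684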
I want to insert data from S3 parquet files into Redshift.
The parquet files come from a process that reads JSON files, flattens them out, and stores them as parquet. To do this we use pandas dataframes.
To do so, I tried two different approaches. The first one:
COPY schema.table
FROM 's3://parquet/provider/A/2020/11/10/11/'
IAM_ROLE 'arn:aws:iam::XXXX'
FORMAT AS PARQUET;
It returned:
Invalid operation: Spectrum Scan Error
error: Spectrum Scan Error
code: 15001
context: Unmatched number of columns between table and file. Table columns: 54, Data columns: 41
I understand the error but I don't have an easy option to fix it.
If we have to do a reload from 2 months ago, the file will only have, for example, 40 columns, because at that time we only needed that data, but the table has since grown to 50 columns.
So we need something automatic, or at least a way to specify the columns.
Then I tried another option, which is to do a SELECT with AWS Redshift Spectrum. We know how many columns the table has from the system tables, and we know the structure of the file by loading it again into a pandas dataframe. Then I can combine both to get an identical structure and do the insert.
It works fine but it is slow.
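A rough sketch of that column-matching step (the file path, schema/table names, and the svv_columns lookup here are illustrative placeholders):

import pandas as pd  # assumes pandas with a parquet engine (pyarrow) installed

# Hypothetical sketch: intersect the parquet file's columns with the target table's columns.
parquet_cols = list(pd.read_parquet("s3://parquet/provider/A/2020/11/10/11/").columns)

# Columns of the target table, e.g. the result of:
#   SELECT column_name FROM svv_columns
#   WHERE table_schema = 'schema' AND table_name = 'table'
#   ORDER BY ordinal_position;
table_cols = ["col_1", "col_2", "col_3"]  # placeholder values

common = [c for c in table_cols if c in parquet_cols]
insert_sql = (
    "INSERT INTO schema.table ({cols}) "
    "SELECT {cols} FROM spectrum_schema.spectrum_table "
    "WHERE partition_0 = 'A' AND partition_1 = '2020' AND partition_2 = '11' "
    "AND partition_3 = '10' AND partition_4 = '11'"
).format(cols=", ".join(common))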
The select looks like:
SELECT fields
FROM schema.table
WHERE partition_0 = 'A'
AND partition_1 = '2020'
AND partition_2 = '11'
AND partition_3 = '10'
AND partition_4 = '11';
The partitions are already added as I checked using:
select *
from SVV_EXTERNAL_PARTITIONS
where tablename = 'table'
and schemaname = 'schema'
and values = '["A","2020","11","10","11"]'
limit 1;
I have around 170 files per hour, in both JSON and parquet. The process lists all files in the S3 JSON path, processes them, and stores them in the S3 parquet path.
I don't know how to improve the execution time, as the INSERT from parquet takes 2 minutes for each partition_0 value. I tried the SELECT alone to ensure it's not an INSERT issue, and it takes 1:50 minutes. So the issue is reading data from S3.
If I try to select a non-existent value for partition_0, it again takes around 2 minutes, so there is some kind of problem accessing the data. I don't know if the partition_0 naming and the others are considered Hive partitioning format.
Edit:
AWS Glue Crawler table specification
Edit: Add SVL_S3QUERY_SUMMARY results
step:1
starttime: 2020-12-13 07:13:16.267437
endtime: 2020-12-13 07:13:19.644975
elapsed: 3377538
aborted: 0
external_table_name: S3 Scan schema_table
file_format: Parquet
is_partitioned: t
is_rrscan: f
is_nested: f
s3_scanned_rows: 1132
s3_scanned_bytes: 4131968
s3query_returned_rows: 1132
s3query_returned_bytes: 346923
files: 169
files_max: 34
files_avg: 28
splits: 169
splits_max: 34
splits_avg: 28
total_split_size: 3181587
max_split_size: 30811
avg_split_size: 18825
total_retries: 0
max_retries: 0
max_request_duration: 360496
avg_request_duration: 172371
max_request_parallelism: 10
avg_request_parallelism: 8.4
total_slowdown_count: 0
max_slowdown_count: 0
Add query checks
Query: 37005074 (SELECT in localhost using pycharm)
Query: 37005081 (INSERT in AIRFLOW AWS ECS service)
STL_QUERY shows that both queries take around 2 min
select * from STL_QUERY where query=37005081 OR query=37005074 order by query asc;
Query: 37005074 2020-12-14 07:44:57.164336,2020-12-14 07:46:36.094645,0,0,24
Query: 37005081 2020-12-14 07:45:04.551428,2020-12-14 07:46:44.834257,0,0,3
STL_WLM_QUERY shows no queue time; it is all exec time
select * from STL_WLM_QUERY where query=37005081 OR query=37005074;
Query: 37005074 Queue time 0 Exec time: 98924036 est_peak_mem:0
Query: 37005081 Queue time 0 Exec time: 100279214 est_peak_mem:2097152
SVL_S3QUERY_SUMMARY shows that the query takes 3-4 seconds in S3
select * from SVL_S3QUERY_SUMMARY where query=37005081 OR query=37005074 order by endtime desc;
Query: 37005074 2020-12-14 07:46:33.179352,2020-12-14 07:46:36.091295
Query: 37005081 2020-12-14 07:46:41.869487,2020-12-14 07:46:44.807106
stl_return, comparing the min start to the max end for each query: 3-4 seconds, as SVL_S3QUERY_SUMMARY says
select * from stl_return where query=37005081 OR query=37005074 order by query asc;
Query:37005074 2020-12-14 07:46:33.175320 2020-12-14 07:46:36.091295
Query:37005081 2020-12-14 07:46:44.817680 2020-12-14 07:46:44.832649
I don't understand why SVL_S3QUERY_SUMMARY shows just 3-4 seconds to run the query in Spectrum, while STL_WLM_QUERY says the execution time is around 2 minutes, which is what I see in my localhost and production environments... Nor do I know how to improve it, because stl_return shows that the query returns little data.
EXPLAIN
XN Partition Loop (cost=0.00..400000022.50 rows=10000000000 width=19608)
-> XN Seq Scan PartitionInfo of parquet.table (cost=0.00..22.50 rows=1 width=0)
Filter: (((partition_0)::text = 'A'::text) AND ((partition_1)::text = '2020'::text) AND ((partition_2)::text = '12'::text) AND ((partition_3)::text = '10'::text) AND ((partition_4)::text = '12'::text))
-> XN S3 Query Scan parquet (cost=0.00..200000000.00 rows=10000000000 width=19608)
" -> S3 Seq Scan parquet.table location:""s3://parquet"" format:PARQUET (cost=0.00..100000000.00 rows=10000000000 width=19608)"
svl_query_report
select * from svl_query_report where query=37005074 order by segment, step, elapsed_time, rows;
Just like in your other question, you need to change the key paths on your objects. It is not enough to just have "A" in the key path - it needs to be "partition_0=A". This is how Spectrum knows whether the object is in the partition or not.
Also, you need to make sure that your objects are of a reasonable size, or it will be slow if you need to scan many of them. It takes time to open each object, and if you have many small objects, the time to open them can be longer than the time to scan them. This is only an issue if you need to scan very many files.
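As an illustration of the key-path point (the bucket and partition values are taken from the question; the provider prefix is assumed), a small sketch that builds the Hive-style prefix Spectrum expects:

# Hypothetical sketch: Hive-style key layout that lets Spectrum map objects to partitions.
partitions = {"partition_0": "A", "partition_1": "2020", "partition_2": "11",
              "partition_3": "10", "partition_4": "11"}
prefix = "s3://parquet/provider/" + "/".join("{}={}".format(k, v) for k, v in partitions.items()) + "/"
print(prefix)
# s3://parquet/provider/partition_0=A/partition_1=2020/partition_2=11/partition_3=10/partition_4=11/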
I have a CSV file with the contents below, which has a header in the 1st line.
id,name
1234,Rodney
8984,catherine
Now I was able to create a table in Hive to skip the header and read the data appropriately.
Table in Hive
CREATE EXTERNAL TABLE table_id(
`tmp_id` string,
`tmp_name` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'field.delim'=',',
'serialization.format'=',')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://some-testing/test/data/'
tblproperties ("skip.header.line.count"="1");
Results in Hive
select * from table_id;
OK
1234 Rodney
8984 catherine
Time taken: 1.219 seconds, Fetched: 2 row(s)
But when I use the same table in pyspark (I ran the same query), I see even the header from the file in the pyspark results, as below.
>>> spark.sql("select * from table_id").show(10,False)
+------+---------+
|tmp_id|tmp_name |
+------+---------+
|id |name |
|1234 |Rodney |
|8984 |catherine|
+------+---------+
Now, how can I stop these from showing up in the results in pyspark?
I'm aware that we can read the CSV file and add .option("header", True) to achieve this, but I want to know if there's a way to do something similar in pyspark while querying tables.
Can someone suggest a way? Thanks 🙏 in advance!
You can use the two sets of properties below, serde properties and table properties. You will then be able to access the table from both Hive and Spark, skipping the header in both environments.
CREATE EXTERNAL TABLE `student_test_score_1`(
student string,
age string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'delimiter'=',',
'field.delim'=',',
'header'='true',
'skip.header.line.count'='1',
'path'='hdfs:<path>')
LOCATION
'hdfs:<path>'
TBLPROPERTIES (
'spark.sql.sources.provider'='CSV')
This is a known issue, SPARK-11374, and it was closed as won't fix.
In the query you can add a WHERE clause to select all records except 'id' and 'name'.
spark.sql("select * from table_id where tmp_id <> 'id' and tmp_name <> 'name'").show(10,False)
#or
spark.sql("select * from table_id where tmp_id != 'id' and tmp_name != 'name'").show(10,False)
Another way would be reading the files from HDFS with .option("header", "true").
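A minimal sketch of that alternative (using the S3 location from the DDL above; header and schema-inference options are the standard DataFrameReader ones):

# Hypothetical sketch: read the CSV directly so Spark drops the header row itself.
df = (spark.read
      .option("header", True)
      .option("inferSchema", True)
      .csv("s3://some-testing/test/data/"))
df.show(10, False)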
I have a collection of slow query logs from RDS, and I put them together into a single file. I'm trying to run it through pt-query-digest, following the instructions here, but it reads the whole file as a single query.
Command:
pt-query-digest --group-by fingerprint --order-by Query_time:sum collider-slow-query.log > slow-query-analyze.txt
Output, showing that it only analyzed one query:
# Overall: 1 total, 1 unique, 0 QPS, 0x concurrency ______________________
Here's just a short snippet containing 2 queries from the file being analyzed to demonstrate there are many queries:
2019-05-03T20:44:21.828Z # Time: 2019-05-03T20:44:21.828954Z
# User#Host: username[username] # [ipaddress] Id: 19
# Query_time: 17.443164 Lock_time: 0.000145 Rows_sent: 5 Rows_examined: 121380
SET timestamp=1556916261;
SELECT wp_posts.ID FROM wp_posts LEFT JOIN wp_term_relationships ON (wp_posts.ID = wp_term_relationships.object_id) WHERE 1=1 AND wp_posts.ID NOT IN (752921) AND (
wp_term_relationships.term_taxonomy_id IN (40)
) AND wp_posts.post_type = 'post' AND ((wp_posts.post_status = 'publish')) GROUP BY wp_posts.ID ORDER BY wp_posts.post_date DESC LIMIT 0, 5;
2019-05-03T20:44:53.597Z # Time: 2019-05-03T20:44:53.597137Z
# User#Host: username[username] # [ipaddress] Id: 77
# Query_time: 35.757909 Lock_time: 0.000054 Rows_sent: 2 Rows_examined: 199008
SET timestamp=1556916293;
SELECT post_id, meta_value FROM wp_postmeta
WHERE meta_key = '_wp_attached_file'
AND meta_value IN ( 'family-guy-vestigial-peter-slice.jpg','2015/08/bobs-burgers-image.jpg','2015/08/bobs-burgers-image.jpg' );
Why isn't it reading all the queries? Is there a problem with my concatenation?
I had the same problem. It turns out that an additional timestamp is stored in the log (through this variable: https://dev.mysql.com/doc/refman/8.0/en/server-system-variables.html#sysvar_log_timestamps).
This makes pt-query-digest not see the individual queries. It should be easily fixed by either turning off the timestamp variable or removing the timestamp:
2019-10-28T11:17:18.412 # Time: 2019-10-28T11:17:18.412214
# User#Host: foo[foo] # [192.168.8.175] Id: 467836
# Query_time: 5.839596 Lock_time: 0.000029 Rows_sent: 1 Rows_examined: 0
use foo;
SET timestamp=1572261432;
SELECT COUNT(*) AS `count` FROM `foo`.`invoices` AS `Invoice` WHERE 1 = 1;
By removing the first timestamp (the 2019-10-28T11:17:18.412 part), it works again.
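For example, a small cleanup sketch (the input filename is the one from the question; the output filename is made up) that strips the leading ISO timestamp in front of each "# Time:" marker:

import re

# Hypothetical cleanup: drop the extra leading timestamp so pt-query-digest
# sees the "# Time:" lines it expects.
pattern = re.compile(r"^\d{4}-\d{2}-\d{2}T\S+\s+(?=# Time:)", re.MULTILINE)

with open("collider-slow-query.log") as src:
    cleaned = pattern.sub("", src.read())

with open("collider-slow-query-clean.log", "w") as dst:
    dst.write(cleaned)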
Configuration:
3-node cluster
Node 1 --> 172.27.21.16 (seed node)
Node 2 --> 172.27.21.18
Node 3 --> 172.27.21.19
cassandra.yaml parameters for all the nodes:
1) seeds: "172.27.21.16"
2) write_request_timeout_in_ms: 5000
3) listen_address: 172.27.21.1(6,8,9)
4) rpc_address: 172.27.21.1(6,8,9)
This is my code.yaml file:
keyspace: prutorStress3node
keyspace_definition: |
  CREATE KEYSPACE prutorStress3node WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 2};

table: code
table_definition: |
  CREATE TABLE code (
    id timeuuid,
    assignment_id bigint,
    user_id text,
    contents text,
    save_type smallint,
    save_time timeuuid,
    PRIMARY KEY(assignment_id, id)
  ) WITH CLUSTERING ORDER BY (id DESC)

columnspec:
  - name: assignment_id
    population: uniform(1..100000)   # Assignment IDs from the range 1-100000
  - name: id
    cluster: fixed(100)              # Each assignment_id will have at most 100 ids (maximum 100 codes allowed per student)
  - name: user_id
    size: fixed(50)                  # user_id comes from the account table varchar(50)
  - name: contents
    size: uniform(1..2000)
  - name: save_type
    size: fixed(1)                   # Generated values between 1 and 4
    population: uniform(1..4)
  - name: save_time

insert:
  partitions: fixed(1)               # The number of partitions to update per batch
  select: fixed(1)/100               # The ratio of rows each partition should insert as a proportion of the total possible rows for the partition (as defined by the clustering distribution columns)
  batchtype: UNLOGGED                # The type of CQL batch to use: either LOGGED/UNLOGGED

queries:
  query1:
    cql: select id,contents,save_type,save_time from code where assignment_id = ?
    fields: samerow
  query2:
    cql: select id,contents,save_type,save_time from code where assignment_id = ? LIMIT 10
    fields: samerow
  query3:
    cql: SELECT contents FROM code WHERE id = ?   # Create an index for this query
    fields: samerow
Following is my script, which I am running from 172.27.21.17 (a node on the same network but in a different cluster):
#!/bin/bash
echo '***********************************'
echo Deleting old data from prutorStress3node
echo '***********************************'
cqlsh -e "DROP KEYSPACE prutorStress3node" 172.27.21.16
sleep 120
echo '***********************************'
echo Creating fresh data
echo '***********************************'
/bin/cassandra-stress user profile=./code.yaml n=1000000 no-warmup cl=quorum ops\(insert=1\) \
-rate threads=25 -node 172.27.21.16
sleep 120
Issue:
The issue is that when I run the script just after deleting the previous data and bringing up the nodes, it works fine, but on every subsequent run I hit the following error:
Unable to find prutorStress3node.code
at org.apache.cassandra.stress.StressProfile.maybeLoadSchemaInfo(StressProfile.java:306)
at org.apache.cassandra.stress.StressProfile.maybeCreateSchema(StressProfile.java:273)
at org.apache.cassandra.stress.StressProfile.newGenerator(StressProfile.java:676)
at org.apache.cassandra.stress.StressProfile.printSettings(StressProfile.java:129)
at org.apache.cassandra.stress.settings.StressSettings.printSettings(StressSettings.java:383)
at org.apache.cassandra.stress.Stress.run(Stress.java:95)
at org.apache.cassandra.stress.Stress.main(Stress.java:62)
In the file StressProfile.java, I saw that the table metadata is being populated as NULL. I tried to make sense of the stack trace but was not able to make anything of it. Please give me some direction as to what might have gone wrong.
Giving the keyspace and table name in lowercase solved the issue for me. Previously I was using mixed case and I was getting the same error.
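Presumably this is because unquoted identifiers in CQL are folded to lowercase, so the keyspace ends up stored as prutorstress3node while the profile asks for prutorStress3node. A sketch of the corresponding change in code.yaml (only the case changes):

keyspace: prutorstress3node
keyspace_definition: |
  CREATE KEYSPACE prutorstress3node WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 2};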