My requirement is that I would like to stop the poller after a fixed interval of time, say 9 hours. For now I am trying to stop the poller after 1 minute. Following is my code:
<int-task:scheduler id="scheduler" pool-size="10"/>
<int-task:scheduled-tasks scheduler="scheduler">
<int-task:scheduled ref="incomingFiles.adapter" method="stop" fixed-delay="#{10 * 1000}"/>
</int-task:scheduled-tasks>
But what I observe is that when I start my program, I immediately get the following messages in the console on startup:
> INFO: Starting beans in phase 0
> May 28, 2014 10:27:55 AM org.springframework.integration.endpoint.AbstractEndpoint start INFO: started incomingFiles.adapter
> May 28, 2014 10:27:55 AM org.springframework.integration.endpoint.EventDrivenConsumer logComponentSubscriptionEvent INFO: Adding {service-activator} as a subscriber to the 'incomingFiles' channel
> May 28, 2014 10:27:55 AM org.springframework.integration.channel.AbstractSubscribableChannel adjustCounterIfNecessary INFO: Channel 'org.springframework.context.support.ClassPathXmlApplicationContext#f4d5bc9.incomingFiles' has 1 subscriber(s).
> May 28, 2014 10:27:55 AM org.springframework.integration.endpoint.AbstractEndpoint start INFO: started org.springframework.integration.config.ConsumerEndpointFactoryBean#0
> May 28, 2014 10:27:55 AM org.springframework.integration.endpoint.EventDrivenConsumer logComponentSubscriptionEvent INFO: Adding {router} as a subscriber to the 'contextStartedEventChannelChannel' channel
> May 28, 2014 10:27:55 AM org.springframework.integration.channel.AbstractSubscribableChannel adjustCounterIfNecessary INFO: Channel 'org.springframework.context.support.ClassPathXmlApplicationContext#f4d5bc9.contextStartedEventChannelChannel' has 1 subscriber(s).
> May 28, 2014 10:27:55 AM org.springframework.integration.endpoint.AbstractEndpoint start INFO: started org.springframework.integration.config.ConsumerEndpointFactoryBean#1
> May 28, 2014 10:27:55 AM org.springframework.integration.endpoint.EventDrivenConsumer logComponentSubscriptionEvent INFO: Adding {outbound-channel-adapter:trueChannel.adapter} as a subscriber to the 'trueChannel' channel
> May 28, 2014 10:27:55 AM org.springframework.integration.channel.AbstractSubscribableChannel adjustCounterIfNecessary INFO: Channel 'org.springframework.context.support.ClassPathXmlApplicationContext#f4d5bc9.trueChannel' has 1 subscriber(s).
> May 28, 2014 10:27:55 AM org.springframework.integration.endpoint.AbstractEndpoint start INFO: started trueChannel.adapter
> May 28, 2014 10:27:55 AM org.springframework.integration.endpoint.EventDrivenConsumer logComponentSubscriptionEvent INFO: Adding {outbound-channel-adapter:falseChannel.adapter} as a subscriber to the 'falseChannel' channel
> May 28, 2014 10:27:55 AM org.springframework.integration.channel.AbstractSubscribableChannel adjustCounterIfNecessary INFO: Channel 'org.springframework.context.support.ClassPathXmlApplicationContext#f4d5bc9.falseChannel' has 1 subscriber(s).
> May 28, 2014 10:27:55 AM org.springframework.integration.endpoint.AbstractEndpoint start INFO: started falseChannel.adapter
> May 28, 2014 10:27:55 AM org.springframework.integration.endpoint.EventDrivenConsumer logComponentSubscriptionEvent INFO: Adding {logging-channel-adapter:_org.springframework.integration.errorLogger} as a subscriber to the 'errorChannel' channel
> May 28, 2014 10:27:55 AM org.springframework.integration.channel.AbstractSubscribableChannel adjustCounterIfNecessary INFO: Channel 'org.springframework.context.support.ClassPathXmlApplicationContext#f4d5bc9.errorChannel' has 1 subscriber(s).
> May 28, 2014 10:27:55 AM org.springframework.integration.endpoint.AbstractEndpoint start INFO: started _org.springframework.integration.errorLogger
> May 28, 2014 10:27:55 AM org.springframework.integration.file.FileReadingMessageSource receive INFO: Created message: [[Payload File content=C:\TEMP\incomingFile\ETD.CONFIRM.60326.140519.T0613170][Headers={id=b003893a-e013-57c8-0c96-55db627ec643, timestamp=1401287275402}]]
> May 28, 2014 10:27:55 AM org.springframework.integration.endpoint.AbstractEndpoint stop INFO: stopped incomingFiles.adapter
Somewhere near the start of the startup logs we get:
May 28, 2014 10:27:55 AM org.springframework.integration.endpoint.AbstractEndpoint start INFO: started incomingFiles.adapter
Somewhere near the end of the startup logs we get:
May 28, 2014 10:27:55 AM org.springframework.integration.endpoint.AbstractEndpoint stop INFO: stopped incomingFiles.adapter
Why has the incomingFiles.adapter been stopped immediately when our fixed-delay="#{10 * 1000}" is 10 seconds? The start and stop timestamps are exactly the same, with no delay at all, whereas ideally the poller should stop after 10 seconds, not immediately. Also, there are 4 files in the directory and it is picking up only one.
Please suggest what's wrong.
Well, I see. The
<int-task:scheduled ref="incomingFiles.adapter" method="stop"
fixed-delay="#{10 * 1000}"/>
produces a PeriodicTrigger whose result (nextExecutionTime) depends on triggerContext.lastScheduledExecutionTime(), and if that is null (your case) it invokes the underlying method immediately.
Let's try this!
<task:scheduled ref="incomingFiles.adapter" method="stop"
fixed-delay="#{10 * 1000}" initial-delay="#{10 * 1000}"/>
I mean: use the same value for initial-delay, to postpone the first stop task by the desired time.
Why does PyCharm suddenly want to start a test?
My script is named 1_selection_sort.py and I'm trying to call the function test_selection_sort, running it with <current file> (added in 2022.2.2, I assume).
I'm pretty sure this worked on 24/10/2022 (version 2022.2.2 and maybe 2022.2.3, but in 2022.2.4 it's no longer working).
Could someone please tell me when and why this was changed? Or did I maybe do something wrong during installation?
My file is NOT named according to this naming scheme (https://docs.pytest.org/en/7.1.x/explanation/goodpractices.html#conventions-for-python-test-discovery):
In those directories, search for test_*.py or *_test.py files, imported by their test package name.
"""
Write a function selection_sort that sorts a list in descending order using selection sort.
"""
def selection_sort(lijst):
    for i in range(len(lijst)):
        for j, number in enumerate(lijst):
            if number < lijst[i]:
                lijst[j] = lijst[i]
                lijst[i] = number
    return lijst

def test_selection_sort(lijst, check):
    print(lijst)
    result = selection_sort(lijst)
    print(result)
    print(check)
    assert result == check
print("Begin controle selection_sort")
test_selection_sort([1, 3, 45, 32, 65, 34], [65, 45, 34, 32, 3, 1])
test_selection_sort([1], [1])
test_selection_sort([54, 29, 12, 92, 2, 100], [100, 92, 54, 29, 12, 2])
test_selection_sort([], [])
print("Controle selection_sort succesvol")
Output:
"C:\Program Files\Anaconda3\python.exe" "C:/Users/r0944584/AppData/Local/JetBrains/PyCharm Community Edition 2022.2.4/plugins/python-ce/helpers/pycharm/_jb_pytest_runner.py" --path "C:\Users\r0944584\Downloads\skeletons(4)\skeletons\1_selection_sort.py"
Testing started at 14:13 ...
Launching pytest with arguments C:\Users\r0944584\Downloads\skeletons(4)\skeletons\1_selection_sort.py --no-header --no-summary -q in C:\Users\r0944584\Downloads\skeletons(4)\skeletons
============================= test session starts =============================
collecting ... collected 1 item
1_selection_sort.py::test_selection_sort ERROR [100%]
test setup failed
file C:\Users\r0944584\Downloads\skeletons(4)\skeletons\1_selection_sort.py, line 15
def test_selection_sort(lijst, check):
E fixture 'lijst' not found
> available fixtures: anyio_backend, anyio_backend_name, anyio_backend_options, cache, capfd, capfdbinary, caplog, capsys, capsysbinary, doctest_namespace, monkeypatch, pytestconfig, record_property, record_testsuite_property, record_xml_attribute, recwarn, tmp_path, tmp_path_factory, tmpdir, tmpdir_factory
> use 'pytest --fixtures [testpath]' for help on them.
C:\Users\r0944584\Downloads\skeletons(4)\skeletons\1_selection_sort.py:15
========================= 1 warning, 1 error in 0.01s =========================
Process finished with exit code 1
The solution I found was to disable Pytest, following this answer: https://stackoverflow.com/a/59203776/13454049
Disable Pytest for your project
Open the Settings/Preferences | Tools | Python Integrated Tools settings dialog as described in Choosing Your Testing Framework.
In the Default test runner field select Unittests.
Click OK to save the settings.
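If you would rather keep pytest as the runner instead of disabling it, another option (a sketch, not part of the original answer) is to make the test self-contained so that pytest does not look for fixtures named lijst and check, for example by applying pytest.mark.parametrize to the existing function in the same file:

import pytest

@pytest.mark.parametrize("lijst, check", [
    ([1, 3, 45, 32, 65, 34], [65, 45, 34, 32, 3, 1]),
    ([1], [1]),
    ([54, 29, 12, 92, 2, 100], [100, 92, 54, 29, 12, 2]),
    ([], []),
])
def test_selection_sort(lijst, check):
    # pytest supplies each (lijst, check) pair itself, so no fixtures named
    # 'lijst' or 'check' are looked up and the collection error goes away.
    assert selection_sort(lijst) == check

With this shape the file works both when run as a plain script (if you keep the top-level calls) and when collected by pytest.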
I tried to upgrade from Spark 2.4.1 to Spark 3.1.2 (PySpark on JupyterHub, Python 3.7, Debian Linux) together with Kafka. Therefore, I also updated Kafka from 2.4.1 to 2.8, but that does not seem to be the problem. I checked the dependencies listed at https://spark.apache.org/docs/latest/ and they seem fine so far.
For Spark 2.4.1 I used these additional jars in the Spark jars directory:
slf4j-api-1.7.26.jar
unused-1.0.0.jar
lz4-java-1.6.0.jar
kafka-clients-2.3.0.jar
spark-streaming-kafka-0-10_2.11-2.4.3.jar
spark-sql-kafka-0-10_2.11-2.4.3.jar
For Spark 3.1.2 I updated these jars and added some more; other files that were already there, like unused, I kept:
spark-sql-kafka-0-10_2.12-3.1.2.jar
spark-streaming-kafka-0-10_2.12-3.1.2.jar
spark-streaming-kafka-0-10-assembly_2.12-3.1.2.jar
spark-token-provider-kafka-0-10_2.12-3.1.2.jar
kafka-clients-2.8.0.jar
I stripped my PySpark code down to the following, which works with Spark 2.4.1 but not with Spark 3.1.2:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
import pyspark.sql.types as T
from pyspark.sql.utils import AnalysisException
import datetime
# configuration of target db
db_target_url = "jdbc:mysql://localhost/test"
db_target_properties = {"user": "john", "password": "doe"}
# create spark session
spark = SparkSession.builder.appName("live1").getOrCreate()
spark.conf.set('spark.sql.caseSensitive', True)
# create schema for the json iba data
schema_tww_vs = T.StructType([T.StructField("[22:8]", T.DoubleType()),
                              T.StructField("[1:3]", T.DoubleType()),
                              T.StructField("Timestamp", T.StringType())])

# create dataframe representing the stream and take the json data into a usable df structure
d = spark.readStream \
    .format("kafka").option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "test_so") \
    .load() \
    .selectExpr("timestamp", "cast (value as string) as json") \
    .select("timestamp", F.from_json("json", schema_tww_vs).alias("struct")) \
    .selectExpr("timestamp", "struct.*")
# add timestamp of this spark processing
d = d.withColumn("time_spark", F.current_timestamp())
d1 = d.withColumnRenamed('[1:3]', 'signal1') \
    .withColumnRenamed('[22:8]', 'ident_orig') \
    .withColumnRenamed('timestamp', 'time_kafka') \
    .withColumnRenamed('Timestamp', 'time_source')
d1 = d1.withColumn("ident", F.round(d1["ident_orig"]).cast('integer'))
d4 = d1.where("signal1 > 3000")
d4a = d4.withWatermark("time_kafka", "1 second") \
    .groupby('ident', F.window('time_kafka', "5 second")) \
    .agg(
        F.count("*").alias("count"),
        F.min("time_kafka").alias("time_start"),
        F.round(F.avg("signal1"), 1).alias('signal1_avg'),
    )

# Remove the column "window" since this struct (with start and stop time) cannot be written to the db
d4a = d4a.drop('window')
d8a = d4a.select('time_start', 'ident', 'count', 'signal1_avg')

# write the dataframe into the database using the streaming mode
def write_into_sink(df, epoch_id):
    df.write.jdbc(table="test_so", mode="append", url=db_target_url, properties=db_target_properties)
    pass

query_write_sink = d8a.writeStream \
    .foreachBatch(write_into_sink) \
    .trigger(processingTime="1 seconds") \
    .start()
Some of the errors are:
java.lang.NoClassDefFoundError: org/apache/commons/pool2/impl/GenericKeyedObjectPoolConfig
Jul 22 15:41:22 run [847]: #011at org.apache.spark.sql.kafka010.consumer.KafkaDataConsumer$.<init>(KafkaDataConsumer.scala:623)
Jul 22 15:41:22 run [847]: #011at org.apache.spark.sql.kafka010.consumer.KafkaDataConsumer$.<clinit>(KafkaDataConsumer.scala)
…
jupyterhub-start.sh[847]: 21/07/22 15:41:22 ERROR TaskSetManager: Task 0 in stage 2.0 failed 1 times; aborting job
Jul 22 15:41:22 run [847]: 21/07/22 15:41:22 ERROR MicroBatchExecution: Query [id = 5d2a70aa-1463-48f3-a4a6-995ceef22891, runId = d1f856b5-eb0c-4635-b78a-d55e7ce81f2b] terminated with error
Jul 22 15:41:22 run [847]: py4j.Py4JException: An exception was raised by the Python Proxy. Return Message: Traceback (most recent call last):
Jul 22 15:41:22 run [847]: File "/opt/anaconda/envs/env1/lib/python3.7/site-packages/py4j/java_gateway.py", line 2451, in _call_proxy
Jul 22 15:41:22 run [847]: return_value = getattr(self.pool[obj_id], method)(*params)
Jul 22 15:41:22 run [847]: File "/opt/spark/python/pyspark/sql/utils.py", line 196, in call
Jul 22 15:41:22 run [847]: raise e
Jul 22 15:41:22 run [847]: File "/opt/spark/python/pyspark/sql/utils.py", line 193, in call
Jul 22 15:41:22 run [847]: self.func(DataFrame(jdf, self.sql_ctx), batch_id)
Jul 22 15:41:22 run [847]: File "<ipython-input-10-d40564c31f71>", line 3, in write_into_sink
Jul 22 15:41:22 run [847]: df.write.jdbc(table="test_so", mode="append", url=db_target_url, properties=db_target_properties)
Jul 22 15:41:22 run [847]: File "/opt/spark/python/pyspark/sql/readwriter.py", line 1445, in jdbc
Jul 22 15:41:22 run [847]: self.mode(mode)._jwrite.jdbc(url, table, jprop)
Jul 22 15:41:22 run [847]: File "/opt/anaconda/envs/env1/lib/python3.7/site-packages/py4j/java_gateway.py", line 1310, in __call__
Jul 22 15:41:22 run [847]: answer, self.gateway_client, self.target_id, self.name)
Jul 22 15:41:22 run [847]: File "/opt/spark/python/pyspark/sql/utils.py", line 111, in deco
Jul 22 15:41:22 run [847]: return f(*a, **kw)
Jul 22 15:41:22 run [847]: File "/opt/anaconda/envs/env1/lib/python3.7/site-packages/py4j/protocol.py", line 328, in get_return_value
Jul 22 15:41:22 run [847]: format(target_id, ".", name), value)
Jul 22 15:41:22 run [847]: py4j.protocol.Py4JJavaError: An error occurred while calling o101.jdbc.
Jul 22 15:41:22 run [847]: : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2.0 (TID 200) (master executor driver): java.lang.NoClassDefFoundError: org/apache/commons/pool2/impl/GenericKeyedObjectPoolConfig
Do you have any ideas about what causes this error?
As devesh said, one jar file was missing:
commons-pool2-2.8.0.jar, which can be downloaded from https://mvnrepository.com/artifact/org.apache.commons/commons-pool2/2.8.0
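As an alternative to hand-copying individual jars (a sketch, not part of the original answer, assuming the machine can reach Maven Central), you can let Spark resolve the Kafka integration and its transitive dependencies, commons-pool2 included, via spark.jars.packages:

from pyspark.sql import SparkSession

# Coordinates assume Spark 3.1.2 built against Scala 2.12; adjust to your build.
spark = (
    SparkSession.builder
    .appName("live1")
    .config("spark.jars.packages",
            "org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2")
    .getOrCreate()
)

This keeps the jar versions consistent with each other, which is easy to get wrong when copying files by hand.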
After creating an SKLearn() instance and using HyperparameterTuner with a few hyperparameter ranges, I get the best estimator. When I try to deploy() the estimator, it gives an error in the log. Exactly the same error happens when I create a transformer and call transform() on it. It doesn't deploy and doesn't transform. What could be the problem, and at the very least, how could I narrow it down?
I have no idea how to even begin to figure this out. Googling didn't help. Nothing comes up.
Creating SKLearn instance:
sklearn = SKLearn(
    entry_point=script_path,
    train_instance_type="ml.c4.xlarge",
    role=role,
    sagemaker_session=session,
    hyperparameters={'model': 'rfc'})
Putting tuner to work:
tuner = HyperparameterTuner(estimator=sklearn,
                            objective_metric_name=objective_metric_name,
                            objective_type='Minimize',
                            metric_definitions=metric_definitions,
                            hyperparameter_ranges=hyperparameters,
                            max_jobs=3,  # 9,
                            max_parallel_jobs=4)
tuner.fit({'train': s3_input_train})
tuner.wait()
best_training_job = tuner.best_training_job()
the_best_estimator = sagemaker.estimator.Estimator.attach(best_training_job)
This gives a valid best training job. Everything seems great.
Here is where the problem manifests:
predictor = the_best_estimator.deploy(initial_instance_count=1, instance_type="ml.m4.xlarge")
or the following (which triggers exactly the same problem):
rfc_transformer = the_best_estimator.transformer(1, instance_type="ml.m4.xlarge")
rfc_transformer.transform(test_location)
rfc_transformer.wait()
Here is the log with the error message (it repeats the same error many times while trying to deploy or transform; below is the beginning of the log):
................[2019-09-22 09:17:48 +0000] [17] [INFO] Starting gunicorn 19.9.0
[2019-09-22 09:17:48 +0000] [17] [INFO] Listening at: unix:/tmp/gunicorn.sock (17)
[2019-09-22 09:17:48 +0000] [17] [INFO] Using worker: gevent
[2019-09-22 09:17:48 +0000] [24] [INFO] Booting worker with pid: 24
[2019-09-22 09:17:48 +0000] [25] [INFO] Booting worker with pid: 25
[2019-09-22 09:17:48 +0000] [26] [INFO] Booting worker with pid: 26
[2019-09-22 09:17:48 +0000] [30] [INFO] Booting worker with pid: 30
2019-09-22 09:18:15,061 INFO - sagemaker-containers - No GPUs detected (normal if no gpus installed)
2019-09-22 09:18:15,062 INFO - sagemaker_sklearn_container.serving - Encountered an unexpected error.
[2019-09-22 09:18:15 +0000] [24] [ERROR] Error handling request /ping
Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/gunicorn/workers/base_async.py", line 56, in handle self.handle_request(listener_name, req, client, addr)
File "/usr/local/lib/python3.5/dist-packages/gunicorn/workers/ggevent.py", line 160, in handle_request addr)
File "/usr/local/lib/python3.5/dist-packages/gunicorn/workers/base_async.py", line 107, in handle_request respiter = self.wsgi(environ, resp.start_response)
File "/usr/local/lib/python3.5/dist-packages/sagemaker_sklearn_container/serving.py", line 119, in main user_module_transformer = import_module(serving_env.module_name, serving_env.module_dir)
File "/usr/local/lib/python3.5/dist-packages/sagemaker_sklearn_container/serving.py", line 97, in import_module user_module = importlib.import_module(module_name)
File "/usr/lib/python3.5/importlib/init.py", line 117, in import_module if name.startswith('.'):
AttributeError: 'NoneType' object has no attribute 'startswith'
169.254.255.130 - - [22/Sep/2019:09:18:15 +0000] "GET /ping HTTP/1.1" 500 141 "-" "Go-http-client/1.1"
2019-09-22 09:18:15,178 INFO - sagemaker-containers - No GPUs detected (normal if no gpus installed)
2019-09-22 09:18:15,179 INFO - sagemaker_sklearn_container.serving - Encountered an unexpected error.
[2019-09-22 09:18:15 +0000] [30] [ERROR] Error handling request /ping
Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/gunicorn/workers/base_async.py", line 56, in handle self.handle_request(listener_name, req, client, addr)
File "/usr/local/lib/python3.5/dist-packages/gunicorn/workers/ggevent.py", line 160, in handle_request addr)
File "/usr/local/lib/python3.5/dist-packages/gunicorn/workers/base_async.py", line 107, in handle_request respiter = self.wsgi(environ, resp.start_response)
File "/usr/local/lib/python3.5/dist-packages/sagemaker_sklearn_container/serving.py", line 119, in main user_module_transformer = import_module(serving_env.module_name, serving_env.module_dir)
File "/usr/local/lib/python3.5/dist-packages/sagemaker_sklearn_container/serving.py", line 97, in import_module user_module = importlib.import_module(module_name)
File "/usr/lib/python3.5/importlib/init.py", line 117, in import_module if name.startswith('.'):
Double check you have the necessary environment variables set. I ran into this issue when I didn't set the environment variables SAGEMAKER_DEFAULT_INVOCATIONS_ACCEPT, SAGEMAKER_PROGRAM, and SAGEMAKER_SUBMIT_DIRECTORY. Check a working base model to see what environment variables need to be set.
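For illustration only, here is a hedged sketch (not from the original answer) of re-creating a deployable model from the best training job's artifacts while setting those environment variables explicitly; script_path, the framework_version, and the env values are assumptions you would replace with your own:

from sagemaker.sklearn.model import SKLearnModel

# Sketch: rebuild a servable model from the tuned estimator's artifacts and
# pass the serving environment variables the container complained about.
model = SKLearnModel(
    model_data=the_best_estimator.model_data,   # S3 location of model.tar.gz
    role=role,
    entry_point=script_path,                    # same script used for training/inference
    framework_version="0.20.0",                 # assumed container version
    env={
        "SAGEMAKER_PROGRAM": "script.py",       # placeholder: your entry-point file name
        "SAGEMAKER_SUBMIT_DIRECTORY": "s3://your-bucket/prefix/sourcedir.tar.gz",  # placeholder
        "SAGEMAKER_DEFAULT_INVOCATIONS_ACCEPT": "text/csv",
    },
)
predictor = model.deploy(initial_instance_count=1, instance_type="ml.m4.xlarge")

Comparing these values against a working base model's endpoint configuration is the quickest way to see which ones your container actually needs.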
I am a beginner trying to scrape Bitcoin price history. Everything works fine until I try to append the data to a list: nothing ends up being appended.
import requests
from bs4 import BeautifulSoup
import pandas as pd
from datetime import datetime
url = 'https://coinmarketcap.com/currencies/bitcoin/historical-data/?start=20130428&end=20190821'
page = requests.get(url).content
soup = BeautifulSoup(page, 'html.parser')
priceDiv = soup.find('div', attrs={'class':'table-responsive'})
rows = priceDiv.find_all('tr')
data = []
i = 0
for row in rows:
    temp = []
    tds = row.findChildren()
    for td in tds:
        temp.append(td.text)
    if i > 0:
        temp[0] = temp[0].replace(',', '')
        temp[6] = temp[6].replace(',', '')
        if temp[5] == '-':
            temp[5] = 0
        else:
            temp[5] = temp[5].replace(',', '')
        data.append({'date': datetime.strptime(temp[0], '%b %d %Y'),
                     'open': float(temp[1]),
                     'high': float(temp[2]),
                     'low': float(temp[3]),
                     'close': float(temp[4]),
                     'volume': float(temp[5]),
                     'market_cap': float(temp[6])})
        i += 1
df = pd.DataFrame(data)
If I try to print df or data it is just empty.
As noted above, you need to increment i outside of the if i > 0: check; as written, i never gets past 0, so nothing is ever appended. A minimal fix is sketched below.
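Here is a minimal sketch of the fixed loop, reusing the setup from the question (rows, data, i, and the datetime import); only the placement of i += 1 changes:

for row in rows:
    temp = []
    for td in row.findChildren():
        temp.append(td.text)
    if i > 0:  # skip the header row
        temp[0] = temp[0].replace(',', '')
        temp[6] = temp[6].replace(',', '')
        temp[5] = 0 if temp[5] == '-' else temp[5].replace(',', '')
        data.append({'date': datetime.strptime(temp[0], '%b %d %Y'),
                     'open': float(temp[1]),
                     'high': float(temp[2]),
                     'low': float(temp[3]),
                     'close': float(temp[4]),
                     'volume': float(temp[5]),
                     'market_cap': float(temp[6])})
    i += 1  # moved out of the if-block so it advances on every row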
Secondly, have you considered using pandas .read_html()? That will do the hard work for you.
Code:
import pandas as pd
url = 'https://coinmarketcap.com/currencies/bitcoin/historical-data/?start=20130428&end=20190821'
dfs = pd.read_html(url)
df = dfs[0]
Output:
print (df)
Date Open* ... Volume Market Cap
0 Aug 20, 2019 10916.35 ... 15053082175 192530283565
1 Aug 19, 2019 10350.28 ... 16038264603 195243306008
2 Aug 18, 2019 10233.01 ... 12999813869 185022920955
3 Aug 17, 2019 10358.72 ... 13778035685 182966857173
4 Aug 16, 2019 10319.42 ... 20228207096 185500055339
5 Aug 15, 2019 10038.42 ... 22899115082 184357666577
6 Aug 14, 2019 10889.49 ... 19990838300 179692803424
7 Aug 13, 2019 11385.05 ... 16681503537 194762696644
8 Aug 12, 2019 11528.19 ... 13647198229 203441494985
9 Aug 11, 2019 11349.74 ... 15774371518 205941632235
10 Aug 10, 2019 11861.56 ... 18125355447 202890020455
11 Aug 09, 2019 11953.47 ... 18339989960 211961319133
12 Aug 08, 2019 11954.04 ... 19481591730 213788089212
13 Aug 07, 2019 11476.19 ... 22194988641 213330426789
14 Aug 06, 2019 11811.55 ... 23635107660 205023347814
15 Aug 05, 2019 10960.74 ... 23875988832 210848822060
16 Aug 04, 2019 10821.63 ... 16530894787 195907875403
17 Aug 03, 2019 10519.28 ... 15352685061 193233960601
18 Aug 02, 2019 10402.04 ... 17489094082 187791090996
19 Aug 01, 2019 10077.44 ... 17165337858 185653203391
20 Jul 31, 2019 9604.05 ... 16631520648 180028959603
21 Jul 30, 2019 9522.33 ... 13829811132 171472452506
22 Jul 29, 2019 9548.18 ... 13791445323 169880343827
23 Jul 28, 2019 9491.63 ... 13738687093 170461958074
24 Jul 27, 2019 9871.16 ... 16817809536 169099540423
25 Jul 26, 2019 9913.13 ... 14495714483 176085968354
26 Jul 25, 2019 9809.10 ... 15821952090 176806451137
27 Jul 24, 2019 9887.73 ... 17398734322 175005760794
28 Jul 23, 2019 10346.75 ... 17851916995 176572890702
29 Jul 22, 2019 10596.95 ... 16334414913 184443440748
... ... ... ... ...
2276 May 27, 2013 133.50 ... - 1454029510
2277 May 26, 2013 131.99 ... - 1495293015
2278 May 25, 2013 133.10 ... - 1477958233
2279 May 24, 2013 126.30 ... - 1491070770
2280 May 23, 2013 123.80 ... - 1417769833
2281 May 22, 2013 122.89 ... - 1385778993
2282 May 21, 2013 122.02 ... - 1374013440
2283 May 20, 2013 122.50 ... - 1363709900
2284 May 19, 2013 123.21 ... - 1363204703
2285 May 18, 2013 123.50 ... - 1379574546
2286 May 17, 2013 118.21 ... - 1373723882
2287 May 16, 2013 114.22 ... - 1325726787
2288 May 15, 2013 111.40 ... - 1274623813
2289 May 14, 2013 117.98 ... - 1243874488
2290 May 13, 2013 114.82 ... - 1315710011
2291 May 12, 2013 115.64 ... - 1281982625
2292 May 11, 2013 117.70 ... - 1284207489
2293 May 10, 2013 112.80 ... - 1305479080
2294 May 09, 2013 113.20 ... - 1254535382
2295 May 08, 2013 109.60 ... - 1264049202
2296 May 07, 2013 112.25 ... - 1240593600
2297 May 06, 2013 115.98 ... - 1249023060
2298 May 05, 2013 112.90 ... - 1288693176
2299 May 04, 2013 98.10 ... - 1250316563
2300 May 03, 2013 106.25 ... - 1085995169
2301 May 02, 2013 116.38 ... - 1168517495
2302 May 01, 2013 139.00 ... - 1298954594
2303 Apr 30, 2013 144.00 ... - 1542813125
2304 Apr 29, 2013 134.44 ... - 1603768865
2305 Apr 28, 2013 135.30 ... - 1488566728
[2306 rows x 7 columns]
I have set compression settings in SparkConf as follows:
sparkConf.set("spark.sql.parquet.compression.codec", "SNAPPY")
and also on the SparkSession builder, as follows:
val spark = SparkSession
  .builder()
  .config(sparkConf)
  .config("spark.sql.parquet.compression.codec", "GZIP") // SNAPPY
  .config("spark.io.compression.codec", "org.apache.spark.io.LZ4CompressionCodec")
However, I see following in executor stdout:
Mar 16, 2019 10:34:16 AM INFO: parquet.hadoop.codec.CodecConfig: Compression set to false
Mar 16, 2019 10:34:16 AM INFO: parquet.hadoop.codec.CodecConfig: Compression: UNCOMPRESSED
Mar 16, 2019 10:34:16 AM INFO: parquet.hadoop.ParquetOutputFormat: Parquet block size to 134217728
Mar 16, 2019 10:34:16 AM INFO: parquet.hadoop.ParquetOutputFormat: Parquet page size to 1048576
Mar 16, 2019 10:34:16 AM INFO: parquet.hadoop.ParquetOutputFormat: Parquet dictionary page size to 1048576
Mar 16, 2019 10:34:16 AM INFO: parquet.hadoop.ParquetOutputFormat: Dictionary is on
Mar 16, 2019 10:34:16 AM INFO: parquet.hadoop.ParquetOutputFormat: Validation is off
Mar 16, 2019 10:34:16 AM INFO: parquet.hadoop.ParquetOutputFormat: Writer version is: PARQUET_1_0
Mar 16, 2019 10:34:17 AM INFO: parquet.hadoop.InternalParquetRecordWriter: Flushing mem columnStore to file. allocated memory: 0
The concerning part of the output is:
Mar 16, 2019 10:34:16 AM INFO: parquet.hadoop.codec.CodecConfig: Compression set to false
Mar 16, 2019 10:34:16 AM INFO: parquet.hadoop.codec.CodecConfig: Compression: UNCOMPRESSED
Does this mean Spark is writing uncompressed data to Parquet? If not, how do I verify? Is there a way to view Parquet metadata?
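On the last point (viewing Parquet metadata), one way to check is a sketch like the following, assuming pyarrow is installed and one of the output part files has been copied somewhere readable; the path below is a placeholder. The Parquet footer records the codec per column chunk:

import pyarrow.parquet as pq

# Placeholder path to one of the written part files.
pf = pq.ParquetFile("/path/to/output/part-00000.parquet")
meta = pf.metadata
print(meta)  # row groups, schema, created_by, etc.

# The codec is recorded per column chunk; inspect the first row group.
rg = meta.row_group(0)
for i in range(rg.num_columns):
    col = rg.column(i)
    print(col.path_in_schema, col.compression)  # e.g. SNAPPY, GZIP, UNCOMPRESSED

The parquet-tools meta command shows the same information from the command line, and the codec can also be set per write with df.write.option("compression", "snappy").parquet(path) if the session-level setting is not being picked up.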