Spark crashing with any Python UDF, but not if I execute locally - apache-spark

I'm really confused here folks, and I hope someone can help:
I have a Delta table that stores syslog messages. When I use any Python UDF with it, it crashes, whereas if I collect the results, create a local DataFrame, and run the same UDF on that local DataFrame, it does not. The exception is uninformative, and I'm wondering how I can proceed with debugging.
Here is the dataframe:
rn_df = (
    spark
    .read
    .format("delta")
    .load(_get_delta_dbfs_location("rn"))
    .filter(F.col("parse_status") == "success")
    .drop("boot_number")
)
Here is the schema:
df:pyspark.sql.dataframe.DataFrame
APP_NAME:string
DEVICEID:string
FACILITY:string
HOSTNAME:string
MESSAGE:string
MSGID:string
PRI:string
PROCID:string
SEVERITY:string
STRUCTURED_DATA:string
DEVICE_TIMESTAMP:timestamp
SYSLOG_MSG:string
parse_status:string
account:string
ingestion_timestamp:timestamp
ingestion_date:date
Here is the UDF and query I'm running:
testing_udf = f.udf(lambda x: str(x) + "asdf", t.StringType())
rn_df.limit(10).withColumn("udf_test", testing_udf(f.col("MESSAGE"))).display()
Here is the exception:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 106.0 failed 4 times, most recent failure: Lost task 2.3 in stage 106.0 (TID 17251) (10.62.239.214 executor 68): ExecutorLostFailure (executor 68 exited caused by one of the running tasks) Reason: Command exited with code 134
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:3312)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:3244)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:3235)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:3235)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1425)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1425)
at scala.Option.foreach(Option.scala:407)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1425)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:3524)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:3462)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:3450)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:51)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$runJob$1(DAGScheduler.scala:1170)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at com.databricks.spark.util.FrameProfiler$.record(FrameProfiler.scala:80)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:1158)
at org.apache.spark.SparkContext.runJobInternal(SparkContext.scala:2716)
at org.apache.spark.sql.execution.collect.Collector.$anonfun$runSparkJobs$1(Collector.scala:348)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at com.databricks.spark.util.FrameProfiler$.record(FrameProfiler.scala:80)
at org.apache.spark.sql.execution.collect.Collector.runSparkJobs(Collector.scala:292)
at org.apache.spark.sql.execution.collect.Collector.collect(Collector.scala:376)
at org.apache.spark.sql.execution.collect.Collector$.collect(Collector.scala:126)
at org.apache.spark.sql.execution.collect.Collector$.collect(Collector.scala:133)
at org.apache.spark.sql.execution.qrc.InternalRowFormat$.collect(cachedSparkResults.scala:120)
at org.apache.spark.sql.execution.qrc.InternalRowFormat$.collect(cachedSparkResults.scala:108)
at org.apache.spark.sql.execution.qrc.InternalRowFormat$.collect(cachedSparkResults.scala:90)
at org.apache.spark.sql.execution.qrc.ResultCacheManager.$anonfun$computeResult$1(ResultCacheManager.scala:528)
at com.databricks.spark.util.FrameProfiler$.record(FrameProfiler.scala:80)
at org.apache.spark.sql.execution.qrc.ResultCacheManager.collectResult$1(ResultCacheManager.scala:520)
at org.apache.spark.sql.execution.qrc.ResultCacheManager.computeResult(ResultCacheManager.scala:540)
at org.apache.spark.sql.execution.qrc.ResultCacheManager.$anonfun$getOrComputeResultInternal$1(ResultCacheManager.scala:395)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.execution.qrc.ResultCacheManager.getOrComputeResultInternal(ResultCacheManager.scala:388)
at org.apache.spark.sql.execution.qrc.ResultCacheManager.getOrComputeResult(ResultCacheManager.scala:286)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeCollectResult$1(SparkPlan.scala:438)
at com.databricks.spark.util.FrameProfiler$.record(FrameProfiler.scala:80)
at org.apache.spark.sql.execution.SparkPlan.executeCollectResult(SparkPlan.scala:435)
at org.apache.spark.sql.Dataset.collectResult(Dataset.scala:3473)
at org.apache.spark.sql.Dataset.$anonfun$collectResult$1(Dataset.scala:3464)
at org.apache.spark.sql.Dataset.$anonfun$withAction$3(Dataset.scala:4346)
at org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:794)
at org.apache.spark.sql.Dataset.$anonfun$withAction$2(Dataset.scala:4344)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$8(SQLExecution.scala:245)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:414)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$1(SQLExecution.scala:190)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:1003)
at org.apache.spark.sql.execution.SQLExecution$.withCustomExecutionEnv(SQLExecution.scala:144)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:364)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:4344)
at org.apache.spark.sql.Dataset.collectResult(Dataset.scala:3463)
at com.databricks.backend.daemon.driver.OutputAggregator$.withOutputAggregation0(OutputAggregator.scala:267)
at com.databricks.backend.daemon.driver.OutputAggregator$.withOutputAggregation(OutputAggregator.scala:101)
at com.databricks.backend.daemon.driver.PythonDriverLocalBase.generateTableResult(PythonDriverLocalBase.scala:763)
at com.databricks.backend.daemon.driver.JupyterDriverLocal.computeListResultsItem(JupyterDriverLocal.scala:1334)
at com.databricks.backend.daemon.driver.JupyterDriverLocal$JupyterEntryPoint.addCustomDisplayData(JupyterDriverLocal.scala:491)
at sun.reflect.GeneratedMethodAccessor685.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
at py4j.Gateway.invoke(Gateway.java:306)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:195)
at py4j.ClientServerConnection.run(ClientServerConnection.java:115)
at java.lang.Thread.run(Thread.java:750)
Here is an example of running successfully on a local dataframe:
local_rows = rn_df.limit(10).collect()
local_dtf = spark.createDataFrame(local_rows)
local_dtf.withColumn("udf_test", testing_udf(f.col("MESSAGE"))).show()
Here is the output:
+------------------+---------------+--------+---------------+--------------------+-----+-------------+------+--------+---------------+--------------------+--------------------+------------+-------+--------------------+--------------+--------------------+
| APP_NAME| DEVICEID|FACILITY| HOSTNAME| MESSAGE|MSGID| PRI|PROCID|SEVERITY|STRUCTURED_DATA| DEVICE_TIMESTAMP| SYSLOG_MSG|parse_status|account| ingestion_timestamp|ingestion_date| udf_test|
+------------------+---------------+--------+---------------+--------------------+-----+-------------+------+--------+---------------+--------------------+--------------------+------------+-------+--------------------+--------------+--------------------+
| SYS|S150F2224016872| user|S150F2224016872|/extra/logs/syslo...| -| user.info| -| info| -|2023-01-16 23:45:...|user.info 2023-01...| success| prod|2023-01-17 03:29:...| 2023-01-17|/extra/logs/syslo...|
| SYS|S150F2224016872| user|S150F2224016872|exec_cmd( curl -s...| -| user.err| -| err| -|2023-01-16 23:46:...|user.err 2023-01-...| success| prod|2023-01-17 03:29:...| 2023-01-17|exec_cmd( curl -s...|
| SYS|S150F2224016872| user|S150F2224016872|exec_cmd( nice -n...| -| user.err| -| err| -|2023-01-16 23:46:...|user.err 2023-01-...| success| prod|2023-01-17 03:29:...| 2023-01-17|exec_cmd( nice -n...|
|connection_manager|S150F2224016872| local0|S150F2224016872|1582.957542:conne...| -|local0.notice| 1852| notice| -|2023-01-16 23:46:...|local0.notice 202...| success| prod|2023-01-17 03:29:...| 2023-01-17|1582.957542:conne...|
| crond|S150F2224016872| cron|S150F2224016872|USER root pid 113...| -| cron.info| 783| info| -|2023-01-17 00:00:...|cron.info 2023-01...| success| prod|2023-01-17 03:29:...| 2023-01-17|USER root pid 113...|
| logrotate|S150F2223731479| user|S150F2223731479|Rotated /var/log/...| -| user.info| -| info| -|2023-01-17 00:15:...|user.info 2023-01...| success| prod|2023-01-17 03:29:...| 2023-01-17|Rotated /var/log/...|
| logrotate|S150F2223731479| user|S150F2223731479|Creating symlink ...| -| user.info| -| info| -|2023-01-17 00:15:...|user.info 2023-01...| success| prod|2023-01-17 03:29:...| 2023-01-17|Creating symlink ...|
| SYS|S150F2223731479| user|S150F2223731479|Initiating log pu...| -| user.info| -| info| -|2023-01-17 00:15:...|user.info 2023-01...| success| prod|2023-01-17 03:29:...| 2023-01-17|Initiating log pu...|
| SYS|S150F2223731479| user|S150F2223731479|Fetching list of ...| -| user.info| -| info| -|2023-01-17 00:15:...|user.info 2023-01...| success| prod|2023-01-17 03:29:...| 2023-01-17|Fetching list of ...|
| SYS|S150F2223731479| user|S150F2223731479|/extra/logs/syslo...| -| user.info| -| info| -|2023-01-17 00:15:...|user.info 2023-01...| success| prod|2023-01-17 03:29:...| 2023-01-17|/extra/logs/syslo...|
+------------------+---------------+--------+---------------+--------------------+-----+-------------+------+--------+---------------+--------------------+--------------------+------------+-------+--------------------+--------------+--------------------+
Update 2023-02-16
I "fixed" the issue by disabling Photon. I am working with DataBricks support on a permanent solution. I will do a writeup when I have that.

Related

Bitbucket Pipelines export into variable using jq and xq causes error

When I run a pipeline in Bitbucket, I want to export into a variable using:
export APEX_CLASSES=$(xq . < package/package.xml | jq '.Package.types | [.] | flatten | map(select(.name=="ApexClass")) | .[] | .members | [.] | flatten | map(select(. | index("*") | not)) | unique | join(",")' -r)
but I get this error in the pipeline:
parse error: Invalid numeric literal at line 1, column 5
I tried to identify the error, but I always get the same one :(
When I add an escape \ before the quotes, I get this error:
jq: error: syntax error, unexpected INVALID_CHARACTER (Unix shell quoting issues?) at <top-level>, line 1:
.Package.types | [.] | flatten | map(select(.name==\"ApexClass\")) | .[] | .members | [.] | flatten | map(select(. | index(\"*\") | not)) | unique | join(\",\")
jq: 1 compile error
This is package.xml
<?xml version="1.0" encoding="UTF-8"?>
<Package xmlns="http://soap.sforce.com/2006/04/metadata">
    <types>
        <members>AccountHelper</members>
        <members>BoatHelper</members>
        <members>CaseHelper</members>
        <name>ApexClass</name>
    </types>
    <version>57.0</version>
</Package>
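For reference, the filter is meant to yield the comma-separated ApexClass members from that file, i.e. AccountHelper,BoatHelper,CaseHelper, and jq's "Invalid numeric literal" message usually means its input was not JSON at all, so it is worth checking what xq actually emits inside the pipeline image. As a sanity check, or as a fallback that sidesteps the shell-quoting problems, here is a hedged sketch doing the same extraction with Python's standard library instead of xq/jq (the package/package.xml path is the one from the question; the script name is hypothetical):

import xml.etree.ElementTree as ET

# extract_apex_classes.py (hypothetical): print non-wildcard ApexClass members
# from package.xml as a comma-separated list, mirroring the jq filter's intent.
NS = {"sf": "http://soap.sforce.com/2006/04/metadata"}
root = ET.parse("package/package.xml").getroot()

members = sorted({
    m.text
    for t in root.findall("sf:types", NS)
    if t.findtext("sf:name", namespaces=NS) == "ApexClass"
    for m in t.findall("sf:members", NS)
    if m.text and "*" not in m.text
})
print(",".join(members))  # expected: AccountHelper,BoatHelper,CaseHelper

In the pipeline step that would be something like export APEX_CLASSES=$(python3 extract_apex_classes.py), assuming Python is available in the build image.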

Can’t enable encryption in YugabyteDB cluster using yugabyted cli

[Question posted by a user on YugabyteDB Community Slack]
I'm running a YB cluster of 3 logical nodes on 1 VM and am trying to bring it up with SSL mode enabled. Below are the commands I use to start the cluster with SSL mode on:
./bin/yugabyted start --config /data/ybd1/config_1.config
./bin/yugabyted start --base_dir=/data/ybd2 --listen=127.0.0.2 --join=192.168.56.12
./bin/yugabyted start --base_dir=/data/ybd3 --listen=127.0.0.3 --join=192.168.56.12
my config file:
{
    "base_dir": "/data/ybd1",
    "listen": "192.168.56.12",
    "certs_dir": "/root/192.168.56.12/",
    "allow_insecure_connections": "false",
    "use_node_to_node_encryption": "true",
    "use_client_to_server_encryption": "true"
}
I am able to connect using:
bin/ysqlsh -h 127.0.0.3 -U yugabyte -d yugabyte
ysqlsh (11.2-YB-2.11.1.0-b0)
Type "help" for help.
yugabyte=# \l
List of databases
Name | Owner | Encoding | Collate | Ctype | Access privileges
-----------------+----------+----------+---------+-------------+-----------------------
postgres | postgres | UTF8 | C | en_US.UTF-8 |
system_platform | postgres | UTF8 | C | en_US.UTF-8 |
template0 | postgres | UTF8 | C | en_US.UTF-8 | =c/postgres +
| | | | | postgres=CTc/postgres
template1 | postgres | UTF8 | C | en_US.UTF-8 | =c/postgres +
| | | | | postgres=CTc/postgres
yugabyte | postgres | UTF8 | C | en_US.UTF-8 |
But when I try to connect to my YB cluster from a psql client, I get the errors below:
psql -h 192.168.56.12 -p 5433
psql: error: connection to server at "192.168.56.12", port 5433 failed: FATAL: Timed out: OpenTable RPC (request call id 2) to 192.168.56.12:9100 timed out after 120.000s
postgres#acff2570dfbc:~$
And in the yb-tserver logs I am getting the errors below:
I0228 05:00:21.248733 21631 async_initializer.cc:90] Successfully built ybclient
2022-02-28 05:02:21.248 UTC [21624] FATAL: Timed out: OpenTable RPC (request call id 2) to 192.168.56.12:9100 timed out after 120.000s
I0228 05:02:21.251086 21627 poller.cc:66] Poll stopped: Service unavailable (yb/rpc/scheduler.cc:80): Scheduler is shutting down (system error 108)
2022-02-28 05:54:20.987 UTC [23729] LOG: invalid length of startup packet
Any help in this regard is really appreciated.
You're setting your config incorrectly for the yugabyted tool. You want to pass --master_flags and --tserver_flags, as explained in the docs: https://docs.yugabyte.com/latest/reference/configuration/yugabyted/#flags.
An example:
bin/yugabyted start --base_dir=/data/ybd1 --listen=192.168.56.12 --tserver_flags=use_client_to_server_encryption=true,ysql_enable_auth=true,use_cassandra_authentication=true,certs_for_client_dir=/root/192.168.56.12/
Sending the parameters this way should work on your cluster.

Spark Structured Streaming application reading from Kafka returns only null values

I plan to extract the data from Kafka using Spark Structured Streaming, but I get only null data.
# -*- coding: utf-8 -*-
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_csv, from_json
from pyspark.sql.types import StringType, StructType

if __name__ == '__main__':
    spark = SparkSession \
        .builder \
        .appName("pyspark_structured_streaming_kafka") \
        .getOrCreate()

    df_raw = spark.read \
        .format("kafka") \
        .option("kafka.bootstrap.servers", "52.81.249.81:9092") \
        .option("subscribe", "product") \
        .option("kafka.ssl.endpoint.identification.algorithm", "") \
        .option("kafka.isolation.level", "read_committed") \
        .load()

    df_raw.printSchema()

    product_schema = StructType() \
        .add("product_name", StringType()) \
        .add("product_factory", StringType()) \
        .add("yield_num", StringType()) \
        .add("yield_time", StringType())

    df_1 = df_raw.selectExpr("CAST(value AS STRING)") \
        .select(from_json("value", product_schema).alias("data")) \
        .select("data.*") \
        .write \
        .format("console") \
        .save()
My test data is the following
{
"product_name": "X Laptop",
"product_factory": "B-3231",
"yield_num": 899,
"yield_time": "20210201 22:00:01"
}
But the result is not what I expected:
./spark-submit ~/Documents/3-Playground/kbatch.py
+------------+---------------+---------+----------+
|product_name|product_factory|yield_num|yield_time|
+------------+---------------+---------+----------+
| null| null| null| null|
| null| null| null| null|
The test data was published by the command:
./kafka-producer-perf-test.sh --topic product --num-records 90000000 --throughput 5 --producer.config ../config/producer.properties --payload-file ~/Downloads/product.json
If I cut away some code, like this:
df_1 = df_raw.selectExpr("CAST(value AS STRING)") \
    .writeStream \
    .format("console") \
    .outputMode("append") \
    .option("checkpointLocation", "file:///Users/picomy/Kafka-Output/checkpoint") \
    .start() \
    .awaitTermination()
The result is the following
Batch: 3130
-------------------------------------------
+--------------------+
| value|
+--------------------+
| "yield_time":...|
| "product_name...|
| "yield_num": ...|
| "product_fact...|
| "yield_num": ...|
| "yield_num": ...|
| "product_fact...|
| "product_fact...|
| "product_name...|
| "product_fact...|
| "product_name...|
| }|
| "yield_time":...|
| "product_name...|
| }|
| "product_fact...|
| "yield_num": ...|
| "product_fact...|
| "yield_time":...|
| "product_name...|
+--------------------+
I don't know the root cause of the problem.
There are a few things causing your code not to work correctly:
Wrong schema (the field yield_num is an integer/long)
Using writeStream instead of just write (if you want streaming)
Start and awaitTermination of the streaming query
The data in your json file should be stored in one line only
You can replace parts of your code with the following snippet:
from pyspark.sql.types import StringType, StructType, LongType

product_schema = StructType() \
    .add("product_name", StringType()) \
    .add("product_factory", StringType()) \
    .add("yield_num", LongType()) \
    .add("yield_time", StringType())

df_1 = df_raw.selectExpr("CAST(value AS STRING)") \
    .select(from_json("value", product_schema).alias("data")) \
    .select("data.*") \
    .writeStream \
    .format("console") \
    .start() \
    .awaitTermination()
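On the last point: kafka-producer-perf-test.sh reads newline-delimited payloads from --payload-file, so the pretty-printed test record above is sent as several fragments rather than one JSON document per message. A small sketch that rewrites the test record as a single line (using the product.json name from the question):

import json

# The payload file should contain one complete JSON object per line;
# multi-line, pretty-printed records arrive at Kafka as fragments.
record = {
    "product_name": "X Laptop",
    "product_factory": "B-3231",
    "yield_num": 899,
    "yield_time": "20210201 22:00:01",
}
with open("product.json", "w") as f:
    f.write(json.dumps(record) + "\n")  # a single line, no indentation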

How to generate single output json object from multiple log lines in logstash filter?

I am new to Logstash and the Grok filter. I want to parse logs like these -
2018-01-11 17:17:16,071 | DEBUG | [Thread-2] | com.example.monitor.MonitorHelper:cpuMonitoring(307) | CommittedVirtualMemorySize :: 401186816
2018-01-11 17:17:16,071 | DEBUG | [Thread-2] | com.example.monitor.MonitorHelper:cpuMonitoring(307) | FreePhysicalMemorySize :: 1751130112
2018-01-11 17:17:16,072 | DEBUG | [Thread-2] | com.example.monitor.MonitorHelper:cpuMonitoring(307) | FreeSwapSpaceSize :: 4294967295
2018-01-11 17:17:16,694 | DEBUG | [Thread-2] | com.example.monitor.MonitorHelper:cpuMonitoring(307) | ProcessCpuLoad :: -1.0
2018-01-11 17:17:16,694 | DEBUG | [Thread-2] | com.example.monitor.MonitorHelper:cpuMonitoring(307) | ProcessCpuTime :: 47471104300
2018-01-11 17:17:16,698 | DEBUG | [Thread-2] | com.example.monitor.MonitorHelper:cpuMonitoring(307) | SystemCpuLoad :: 1.0
2018-01-11 17:17:16,698 | DEBUG | [Thread-2] | com.example.monitor.MonitorHelper:cpuMonitoring(307) | TotalPhysicalMemorySize :: 4285849600
2018-01-11 17:17:16,698 | DEBUG | [Thread-2] | com.example.monitor.MonitorHelper:cpuMonitoring(307) | TotalSwapSpaceSize :: 4294967295
to a JSON Object like this -
{
"timestamp": "2018-01-11 17:17:16,071",
"log_level": "DEBUG",
"thread_name": "Thread-2",
"class": "com.example.monitor.MonitorHelper",
"method": "cpuMonitoring",
"line_number": "307",
"CommittedVirtualMemorySize": "401186816",
"FreePhysicalMemorySize": "1751130112",
"FreeSwapSpaceSize": "4294967295",
"ProcessCpuLoad": "-1.0",
"ProcessCpuTime": "47471104300",
"SystemCpuLoad": "1.0",
"TotalPhysicalMemorySize": "4285849600",
"TotalSwapSpaceSize": "4294967295"
}
As of now my grok pattern is -
%{TIMESTAMP_ISO8601:timestamp} \| %{LOGLEVEL:log_level} \| \[(?<thread_name>\b[\w-]+\b)\] \| %{JAVAFILE:class}:%{JAVAMETHOD:method}\(%{NUMBER:line_number}\) \| %{GREEDYDATA:log_message}
which produces a separate output event for each input log line instead of a single merged object. Each JSON object looks like this-
{
"timestamp": "2018-01-11 17:17:16,071",
"log_level": "DEBUG",
"thread_name": "Thread-2",
"class": "com.example.monitor.MonitorHelper",
"method": "cpuMonitoring",
"line_number": "307",
"log_message": "CommittedVirtualMemorySize :: 401186816 "
}
Can you please help me with what I need to look for in order to achieve this?
The first recommendation is to change the original log output into a single line.
If you can't, and you're using Filebeat to ship the file, use Filebeat's multiline config to merge the lines before sending them to Logstash.
If you're not using Filebeat, you can try the multiline codec in Logstash.

What is the difference between cmd and IDLE when using tqdm?

Recently I wanted to add a simple progress bar to my script. I use tqdm for that, but what puzzles me is that the output is different when I am in IDLE and when I am in cmd.
For example, this:
from tqdm import tqdm
import time

def test():
    for i in tqdm(range(100)):
        time.sleep(0.1)
gives the expected output in cmd:
30%|███ | 30/100 [00:03<00:07, 9.14it/s]
but in IDLE the output is like this:
0%| | 0/100 [00:00<?, ?it/s]
1%|1 | 1/100 [00:00<00:10, 9.14it/s]
2%|2 | 2/100 [00:00<00:11, 8.77it/s]
3%|3 | 3/100 [00:00<00:11, 8.52it/s]
4%|4 | 4/100 [00:00<00:11, 8.36it/s]
5%|5 | 5/100 [00:00<00:11, 8.25it/s]
6%|6 | 6/100 [00:00<00:11, 8.17it/s]
7%|7 | 7/100 [00:00<00:11, 8.12it/s]
8%|8 | 8/100 [00:00<00:11, 8.08it/s]
9%|9 | 9/100 [00:01<00:11, 8.06it/s]
10%|# | 10/100 [00:01<00:11, 8.04it/s]
11%|#1 | 11/100 [00:01<00:11, 8.03it/s]
12%|#2 | 12/100 [00:01<00:10, 8.02it/s]
13%|#3 | 13/100 [00:01<00:10, 8.01it/s]
14%|#4 | 14/100 [00:01<00:10, 8.01it/s]
15%|#5 | 15/100 [00:01<00:10, 8.01it/s]
16%|#6 | 16/100 [00:01<00:10, 8.00it/s]
17%|#7 | 17/100 [00:02<00:10, 8.00it/s]
18%|#8 | 18/100 [00:02<00:10, 8.00it/s]
19%|#9 | 19/100 [00:02<00:10, 8.00it/s]
20%|## | 20/100 [00:02<00:09, 8.00it/s]
21%|##1 | 21/100 [00:02<00:09, 8.00it/s]
22%|##2 | 22/100 [00:02<00:09, 8.00it/s]
23%|##3 | 23/100 [00:02<00:09, 8.00it/s]
24%|##4 | 24/100 [00:02<00:09, 8.00it/s]
25%|##5 | 25/100 [00:03<00:09, 8.00it/s]
26%|##6 | 26/100 [00:03<00:09, 8.00it/s]
27%|##7 | 27/100 [00:03<00:09, 8.09it/s]
28%|##8 | 28/100 [00:03<00:09, 7.77it/s]
29%|##9 | 29/100 [00:03<00:09, 7.84it/s]
30%|### | 30/100 [00:03<00:08, 7.89it/s]
31%|###1 | 31/100 [00:03<00:08, 7.92it/s]
32%|###2 | 32/100 [00:03<00:08, 7.94it/s]
33%|###3 | 33/100 [00:04<00:08, 7.96it/s]
34%|###4 | 34/100 [00:04<00:08, 7.97it/s]
35%|###5 | 35/100 [00:04<00:08, 7.98it/s]
36%|###6 | 36/100 [00:04<00:08, 7.99it/s]
37%|###7 | 37/100 [00:04<00:07, 7.99it/s]
38%|###8 | 38/100 [00:04<00:07, 7.99it/s]
39%|###9 | 39/100 [00:04<00:07, 8.00it/s]
40%|#### | 40/100 [00:04<00:07, 8.00it/s]
41%|####1 | 41/100 [00:05<00:07, 8.00it/s]
I also get the same result if I make my own progress bar, like this:
import sys

def progress_bar_cmd(count, total, suffix="", *, bar_len=60, file=sys.stdout):
    filled_len = round(bar_len * count / total)
    percents = round(100 * count / total, 2)
    bar = "#" * filled_len + "-" * (bar_len - filled_len)
    file.write("[%s] %s%s ...%s\r" % (bar, percents, "%", suffix))
    file.flush()

for i in range(101):
    time.sleep(1)
    progress_bar_cmd(i, 100, "range 100")
Why is that?
And is there a way to fix it?
Limiting ourselves to ASCII characters, the program output of your second code is the same in both cases: a stream of ASCII bytes representing ASCII chars. The language definition does not and cannot specify what an output device or display program will do with the bytes, in particular with control characters such as '\r'.
The Windows Command Prompt console at least sometimes interprets '\r' as 'return the cursor to the beginning of the current line without erasing anything'.
In a Win10 console:
>>> import sys; out=sys.stdout
>>> out.write('abc\rdef')
def7
However, when I run your second code, with the missing time import added, I do not see the overwrite behavior, but see the same continued line output as with IDLE.
C:\Users\Terry>python f:/python/mypy/tem.py
[------------------------------------------------------------] 0.0% ...range 100[#-----------------------------------------------------------] ...
On the third hand, if I shorten the write to file.write("[%s]\r" % bar), then I do see one output line overwritten over and over.
The tk Text widget used by IDLE only interprets \t and \n, but not other control characters. To some of us, this seems appropriate for a development environment, where erasing characters is less appropriate than in a production environment.
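As for a way to fix it: IDLE's pseudo-stdout is not a terminal (sys.stdout.isatty() returns False there, while a real console reports True), so one hedged workaround is to use the '\r'-overwriting style only when stdout is a TTY and fall back to occasional plain lines otherwise. A minimal sketch:

import sys
import time

def report_progress(count, total):
    if sys.stdout.isatty():
        # Real console (e.g. cmd): '\r' moves the cursor back so the bar
        # overwrites itself on a single line.
        sys.stdout.write("[%-20s] %d%%\r" % ("#" * (20 * count // total),
                                             100 * count // total))
        sys.stdout.flush()
    elif count % 10 == 0:
        # IDLE or a redirected stream: isatty() is False, so print plain lines.
        print("%d/%d done" % (count, total))

for i in range(101):
    time.sleep(0.1)
    report_progress(i, 100)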
