Synchronization problem while executing Simulink FMU in ROS 2-Gazebo (TF_OLD_DATA warning) - fmi

I'm working in a co-simulation project between Simulink and Gazebo. The aim is to move a robot model in Gazebo with the trajectory coordinates computed from Simulink. I'm using MATLAB R2022a, ROS 2 Dashing and Gazebo 9.9.0 in a computer running Ubuntu 18.04.
The problem is that when launching the FMU with the fmi_adapter, I'm obtaining the following. It is tagged as [INFO], but actually messing up all my project.
[fmi_adapter_node-1] [INFO] [fmi_adapter_node]: Simulation time 1652274762.959713 is greater than timer's time 1652274762.901340. Is your step size to large?
Note the timer's time is higher than the simulation time. Even if I try to change the step size with the optional argument of the fmi_adapter_node, the same log appears with small differences in the times. I'm using the next commands:
ros2 launch fmi_adapter fmi_adapter_node.launch.py fmu_path:=FMI/Trajectory/RobotMARA_SimulinkFMU_v2.fmu # default step size: 0.2
ros2 launch fmi_adapter fmi_adapter_node.launch.py fmu_path:=FMI/Trajectory/RobotMARA_SimulinkFMU_v2.fmu _step_size:=0.001
As you would expect, the outputs of the FMU are the xyz coordinates of the robot trajectory in each time step. Since the fmi_adapter_node creates topics for both inputs and outputs, I'm reading the output xyz values by means of 3 subscribers with the next code. Then, those coordinates are being used to program the robot trajectories with the MoveIt-Python API.
When I run the previous Python code, I'm obtaining the following warning once and again and the robot manipulator actually doesn't move.
[ WARN] [1652274804.119514250]: TF_OLD_DATA ignoring data from the past for frame motor6_link at time 870.266 according to authority unknown_publisher
Possible reasons are listed at http://wiki.ros.org/tf/Errors%20explained
The previous warning is explained here, but I'm not able to fix it. I've tried clicking Reset in RViz, but nothing changes. I've also tried the following without success:
ros2 param set /fmi_adapter_node use_sim_time true # it just sets the timer's time to 0
It seems that the clock is taking negative values, so there is a synchronization problem.
Any help is welcome.

The warning message by the FMIAdapterNode is emitted if the timer's period is only slightly greater than the simulation step-size and if the timer is preempted by other processes or threads.
I created an issue at https://github.com/boschresearch/fmi_adapter/issues/9 which explains this in more detail and lists two possible fixes. It would be great if you could contribute to this discussion.
I assume that the TF_OLD_DATA error is not related to the fmi_adapter. Looking at the code snippet at ROS Answers, I wondered whether x,y,z values are re-published at all given that the lines
pose.position.x = listener_x.value
pose.position.y = listener_y.value
pose.position.z = listener_z.value
are not inside a callback and executed even before rospy.spin(), but maybe that's just truncated.

Related

Calculation dynamic delay in AnyLogic

Good day!
Please, help me understand how the Delay block works in AnyLogic. Suppose we deal with a multichannel transmission network.
The model has 2 sources. Suppose these sources generate packets every 1 sec. Packets from different sources have different priorities and need different quantities of resources to be served (it is set up with Priority and Resource_quantity parameters respectively). The Priority_queue in the model is priority-based. The proposed model put the packets into the channels in accordance with Resource availability in the channel. Firstly, it tries to put the packet to the first channel. If there are no available resources it puts the packet into the second channel. If there are no resources in both channels it waits (it is realized with Hold block).
I noticed that if I set delays in blocks delay1 and delay2 with static parameters (for ex. 2 sec) the model works ok. But then I try to calculate it before these blocks the model doesn't take into consideration it at all. And in this case, the model works without any delays.
What did I do wrong here?
I will appreciate any help.
The delay is calculated in Exit block and is written into the variable delay of the agent. I tried to add traceln(agent.delay) as #Jaco-Ben suggested right after calculation of the delay and it showed zero. In this case it doesn't also seize resources :(
Thank #Jaco-Ben for the useful comments.
The delay is zero because
the result of division in Java depends on the types of the operands.
If both operands are integer, the result will be integer as well. To
let Java perform real division (and get a real number as a result) at
least one of the operands must be of real type.
So it was my problem.
To solve it I assigned double to one of the operands :
agent.delay = (double)agent.Resource_quantity/ChannelResources1.idle();
However, it is strange why it shows correct values in the database.

Extracting Meaningful Error Message from 'RuntimeError: CUDA error: device-side assert triggered' on Google Colab in Pytorch

I am experiencing the following error while training a generative network via Pytorch 1.9.0+cu102:
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
While using a Google Colaboratory GPU session. This segment was triggered on either one of these two lines:
running_loss += loss.item()
or
target = target.to(device)
It produces the error on the first line when I am first running the notebook, and the second line each subsequent time I try to run the block. The first error occurs after training for 3 batches. The second error happens on the first batch. I can confirm that the device is cuda0, that device is available, and target is a pytorch tensor. Naturally, I tried to take the advice of the error and run:
!CUDA_LAUNCH_BLOCKING=1
and
os.system('CUDA_LAUNCH_BLOCKING=1')
However, neither of these lines changes the error message. According to a different post, this is because colab is running these lines in a subshell. The error does not occur when running on CPU, and I do not have access to a GPU device besides the GPU on Colab. While this question has been asked in many different forms, no answers are particularly helpful to me because they either recommend passing the aforementioned line, are about a situation fundamentally different from my own (such as training a classifier with an inappropriate number of classes), or recommend a solution which I have already tried, such as resetting the runtime or switching to CPU.
I am hoping to gain insight into the following questions:
Is there a way for me to get a more specific error message? Efforts to set the launch blocking variable have been unsuccessful.
How could it be that I am getting this error on two seemingly very different lines? How could it be that my network trains for 3 batches (it is always 3), but fails on the fourth?
Does this situation remind anyone of an error that they have encountered previously, and have a possible route for ameliorating it given the limited information I can extract?
I was successfully able to get more information about the error by executing:
os.environ['CUDA_LAUNCH_BLOCKING'] = "1"
BEFORE importing torch. This allowed me to get a more detailed traceback and ultimately diagnose the problem as an inappropriate loss function.
This can be mainly due to 2 reasons:
Inconsistency in the number of classes
Wrong input for the loss function
If it's the first one, then see you should get the same error when you change the runtime back to CPU.
In my case, it was the second one. I had used BCE loss, and its input should be between 0 and 1. If it's any other value, this error might appear. So I fixed this by using:
criterion=nn.BCEWithLogitsLoss()
instead of:
criterion=nn.BCELoss()
Oh yeah, and I also used:
CUDA_LAUNCH_BLOCKING = "1"
at the beginning of the code.

Communication frequency vs Simulation Time for FMU

Lets say we have a FMU which is getting inputs from Python and simulating at an interval of 0.001s. Does the FMI/FMU standard allow us to run the FMU multiple times for a same input (so Python provides the input at 0.01s interval and the FMU simulates that 10 times at each step)? Would that be faster since we have reduced the communication interface by 1/10th ?
(For CS FMUs:) Updating the inputs only every 10th step can be seen as a special co-simualtion algorithm and is ok. Input variables keep their values until they they are newly set.
This will only lead to a benefit in simulation speed, if the the internal calculation time (of a doStep) is small compared to the communication runtime.

Custom player using NDK/C++/MediaCodec - starvation/buffering in decoder

I have a very interesting problem.
I am running custom movie player based on NDK/C++/CMake toolchain that opens streaming URL (mp4, H.264 & stereo audio). In order to restart from given position, player opens stream, buffers frames to some length and then seeks to new position and start decoding and playing. This works fine all the times except if we power-cycle the device and follow the same steps.
This was reproduced on few version of the software (plugin build against android-22..26) and hardware (LG G6, G5 and LeEco). This issue does not happen if you keep app open for 10 mins.
I am looking for possible areas of concern. I have played with decode logic (it is based on the approach described as synchronous processing using buffers).
Edit - More Information (4/23)
I modified player to pick a stream and then played only video instead of video+audio. This resulted in constant starvation resulting in buffering. This appears to have changed across android version (no fix data here). I do believe that I am running into decoder starvation. Previously, I had set timeouts of 0 for both AMediaCodec_dequeueInputBuffer and AMediaCodec_dequeueOutputBuffer, which I changed on input side to 1000 and 10000 but does not make much difference.
My player is based on NDK/C++ interface to MediaCodec, CMake build passes -DANDROID_ABI="armeabi-v7a with NEON" and -DANDROID_NATIVE_API_LEVEL="android-22" \ and C++_static.
Anyone can share what timeouts they have used and found success with it or anything that would help avoid starvation or resulting buffering?
This is solved for now. Starvation was not caused from decoding perspective but images were consumed in faster pace as clock value returned were not in sync. I was using clock_gettime method with CLOCK_MONOTONIC clock id, which is recommended way but it was always faster for first 5-10 mins of restarting device. This device only had Wi-Fi connection. Changing clock id to CLOCK_REALTIME ensures correct presentation of images and no starvation.

How do I know the last sched time of a process

I current run into an issue that a process seems stuck somehow, it just doesn't gets scheduled, the status is always 'S'. I have monitored sched_switch_task trace by debugfs for a while, didn't see the process get scheduled. So I would like to know when is that last time scheduled of this process by kernel?
Thanks a lot.
It might be possible using the info in /proc/pid#/sched file.
In there you can find these parameters (depending on the OS version, mine is opensuse 3.16.7-21-desktop):
se.exec_start : 593336938.868448
...
se.statistics.wait_start : 0.000000
se.statistics.sleep_start : 593336938.868448
se.statistics.block_start : 0.000000
The values represent timestamps relative to the system boot time, but in a unit which may depend on your system (in my example the unit is 0.5 msec, for a total value of ~6 days 20 hours and change).
In the last 3 parameters listed above at most one appears to be non-zero at any time and it I suspect that the respective non-zero value represents the time when it last entered the corresponding state (with the process actively running when all are zero).
So if your process is indeed stuck the non-zero value would have recorded when it got stuck.
Note: this is mostly based on observations and assumptions - I didn't find these parameters documented anywhere, so take them with a grain of salt.
Plenty of other scheduling info in that file, but mostly stats and without documentation difficult to use.

Resources