How do I know the last sched time of a process - linux

I have run into an issue where a process seems to be stuck: it just doesn't get scheduled, and its status is always 'S'. I monitored the sched_switch_task trace via debugfs for a while and didn't see the process get scheduled. So I would like to know: when was this process last scheduled by the kernel?
Thanks a lot.

It might be possible using the info in /proc/pid#/sched file.
In there you can find these parameters (depending on the OS version, mine is opensuse 3.16.7-21-desktop):
se.exec_start : 593336938.868448
...
se.statistics.wait_start : 0.000000
se.statistics.sleep_start : 593336938.868448
se.statistics.block_start : 0.000000
The values are timestamps relative to the system boot time, in a unit that may depend on your system (in my example they read as milliseconds, for a total value of ~6 days 20 hours and change).
Of the last 3 parameters listed above, at most one appears to be non-zero at any time, and I suspect that the non-zero value represents the time when the process last entered the corresponding state (with the process actively running when all are zero).
So if your process is indeed stuck the non-zero value would have recorded when it got stuck.
Note: this is mostly based on observations and assumptions - I didn't find these parameters documented anywhere, so take them with a grain of salt.
There is plenty of other scheduling info in that file, but it is mostly statistics and, without documentation, difficult to use.
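A minimal sketch of reading those fields, assuming the file uses the `name : value` layout shown above. The helper names `parse_sched` and `last_state_change` are hypothetical, and the interpretation of the `*_start` fields is the same unconfirmed guess as in the answer.

```python
from pathlib import Path

# Field names as they appear in the excerpt above; other kernel versions may differ.
START_FIELDS = (
    "se.statistics.wait_start",
    "se.statistics.sleep_start",
    "se.statistics.block_start",
)


def parse_sched(text):
    """Turn 'name : value' lines into a dict, skipping non-numeric lines."""
    fields = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if not sep:
            continue  # separator or blank line
        try:
            fields[name.strip()] = float(value)
        except ValueError:
            continue  # header line such as "name (pid, #threads: 1)"
    return fields


def last_state_change(fields):
    """Return (field, timestamp) for the first non-zero *_start field, else None."""
    for key in START_FIELDS:
        if fields.get(key, 0.0) != 0.0:
            return key, fields[key]
    return None


def read_sched(pid):
    """Read and parse /proc/<pid>/sched (Linux only)."""
    return parse_sched(Path(f"/proc/{pid}/sched").read_text())


SAMPLE = """\
se.exec_start : 593336938.868448
se.statistics.wait_start : 0.000000
se.statistics.sleep_start : 593336938.868448
se.statistics.block_start : 0.000000
"""

result = last_state_change(parse_sched(SAMPLE))
```

On a live system, `last_state_change(read_sched(pid))` would give the same kind of result for a real process.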

Related

Synchronization problem while executing Simulink FMU in ROS 2-Gazebo (TF_OLD_DATA warning)

I'm working in a co-simulation project between Simulink and Gazebo. The aim is to move a robot model in Gazebo with the trajectory coordinates computed from Simulink. I'm using MATLAB R2022a, ROS 2 Dashing and Gazebo 9.9.0 in a computer running Ubuntu 18.04.
The problem is that when launching the FMU with the fmi_adapter, I get the following message. It is tagged as [INFO], but it is actually messing up my whole project.
[fmi_adapter_node-1] [INFO] [fmi_adapter_node]: Simulation time 1652274762.959713 is greater than timer's time 1652274762.901340. Is your step size to large?
Note that the simulation time is higher than the timer's time. Even if I try to change the step size with the optional argument of the fmi_adapter_node, the same log appears with small differences in the times. I'm using the following commands:
ros2 launch fmi_adapter fmi_adapter_node.launch.py fmu_path:=FMI/Trajectory/RobotMARA_SimulinkFMU_v2.fmu # default step size: 0.2
ros2 launch fmi_adapter fmi_adapter_node.launch.py fmu_path:=FMI/Trajectory/RobotMARA_SimulinkFMU_v2.fmu _step_size:=0.001
As you would expect, the outputs of the FMU are the xyz coordinates of the robot trajectory at each time step. Since the fmi_adapter_node creates topics for both inputs and outputs, I'm reading the output xyz values by means of 3 subscribers with the following code. Those coordinates are then used to program the robot trajectories with the MoveIt-Python API.
When I run the previous Python code, I get the following warning again and again, and the robot manipulator doesn't actually move.
[ WARN] [1652274804.119514250]: TF_OLD_DATA ignoring data from the past for frame motor6_link at time 870.266 according to authority unknown_publisher
Possible reasons are listed at http://wiki.ros.org/tf/Errors%20explained
The previous warning is explained here, but I'm not able to fix it. I've tried clicking Reset in RViz, but nothing changes. I've also tried the following without success:
ros2 param set /fmi_adapter_node use_sim_time true # it just sets the timer's time to 0
It seems that the clock is taking negative values, so there is a synchronization problem.
Any help is welcome.
The warning message by the FMIAdapterNode is emitted if the timer's period is only slightly greater than the simulation step-size and if the timer is preempted by other processes or threads.
I created an issue at https://github.com/boschresearch/fmi_adapter/issues/9 which explains this in more detail and lists two possible fixes. It would be great if you could contribute to this discussion.
I assume that the TF_OLD_DATA error is not related to the fmi_adapter. Looking at the code snippet at ROS Answers, I wondered whether x,y,z values are re-published at all given that the lines
pose.position.x = listener_x.value
pose.position.y = listener_y.value
pose.position.z = listener_z.value
are not inside a callback and executed even before rospy.spin(), but maybe that's just truncated.

Can hard restart of machine hosting PostgreSQL change PostgreSQL sequence?

I have this case on my desk at work.
The customer restarted (at least 2 times) the machine on which PostgreSQL was running. After that, the nextval of a sequence on one column changed.
The last value before the restart was 582. After the restart it should have returned 583, but instead it returned 615.
I have checked all possible logs, from the Linux system logs through the PostgreSQL logs to our app logs, and there is no sign of anything calling nextval on this row.
So I tried the crazy idea, and translated the numbers into bits..
583 in bits: 0010 0100 0111
615 in bits: 0010 0110 0111
There is only one bit difference. So, is it possible that one bit got messed with from the hard restart???
There really are not many ways this nextval could have been called 33 times in that period. The time difference between the call that returned 582 and the one that returned 615 is only an hour or so, and in that time the PC was restarted twice. Yes, that is a long time in programming, but there is no sign of nextval being called during that time.
Edit #1:
I have checked cache_value; it is 1 (as it probably should be). increment_by is 1 too, so there shouldn't be any allocated values. The code that calls nextval is connected to a hardware switch (cashbox sensor): when activated, it reads the cashbox state and inserts it into the table whose id comes from this sequence. Some pretty heavy selects are done before the new row is inserted, so if it had been called, there would be some footprints in either our app logs or the PostgreSQL logs.
The reason I care is that this sequence is used for IDs in the cashbox changes log, so a gap in the IDs doesn't look good, even though we can prove that nothing was done between those 2 IDs.
Please refer to this Postgres documentation for details on how sequences are generated in Postgres.
To quote the documentation:
So, any numbers allocated but not used within a session will be lost
when that session ends, resulting in "holes" in the sequence.
What is likely happening is one of two things: either the sequence generator had some cached sequence values, which would be lost during the hard restart, or one or more transactions in progress had retrieved a sequence value but were interrupted before they were able to commit.
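The second case can be illustrated with a toy model. This is not PostgreSQL code: `ToySequence` and `insert_row` are hypothetical names, and the only assumption is the one quoted from the documentation, that a value handed out by the sequence is never given back if it ends up unused.

```python
class ToySequence:
    """In-memory stand-in for a sequence: values are never handed back."""

    def __init__(self, start=1):
        self._next = start

    def nextval(self):
        value = self._next
        self._next += 1
        return value


def insert_row(table, seq, payload, fail=False):
    """Take an id from the sequence, then 'commit' the row unless interrupted."""
    row_id = seq.nextval()  # the id is consumed here, before the commit
    if fail:
        raise RuntimeError("transaction interrupted before commit")
    table[row_id] = payload
    return row_id


seq = ToySequence(start=582)
table = {}

insert_row(table, seq, "cashbox opened")
try:
    insert_row(table, seq, "never committed", fail=True)
except RuntimeError:
    pass  # the row is gone, but the id it consumed is not reused
insert_row(table, seq, "cashbox closed")

ids = sorted(table)
```

Here `ids` ends up as `[582, 584]`: the interrupted insert left no row behind, yet its id is missing from the table.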
The documentation on sequence generation may help you identify how to set the size of your sequence cache, and what value to "restart" with.
In addition:
In your situation, it sounds as though you may have a business rule requiring the primary keys in your table to be both sequential, and gapless (no gaps between sequences). A. Elein Mustain suggests a non-trivial solution to generating so called "gapless sequences".

How to work with the COUNTER in Nagios or RRD?

I have the following problem:
I want to keep statistics on data that should be constantly increasing, for example the number of visits to a link. After some time these visit counts are reset and start again from the beginning. To keep a continuous increase, I want to store the statistics somewhere, and for this purpose I use a site that does this. It offers the types COUNTER, GAUGE, AVERAGE, and so on. I want to use COUNTER. The system is built on Nagios.
My question is how to use this COUNTER. I guess it is the same as the one in RRD, but I ran into some strange things when creating such a COUNTER.
I submit the values '1' and then '2' and expect the chart to show 3, but when I do this it doesn't work. After a restart, for example, I submit 1 again and expect the total to become 4.
If anyone has dealt with this, please tell me briefly how this COUNTER works.
I saw that COUNTER is used for traffic on routers and so on, but I want to use it for a regular graph that just increases.
The RRD data type COUNTER converts the input data into a rate by taking the difference between this sample and the last sample and dividing by the time interval (note that data normalisation also takes place, and this depends on the Interval setting of the RRD).
Thus, updating with a constantly increasing count will result in a rate-of-change value to be graphed.
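The conversion described above can be sketched in a few lines. This is a simplified illustration, not RRDtool's implementation: it ignores data normalisation and counter wraparound, and the function name `counter_rates` and the sample values are made up.

```python
def counter_rates(samples):
    """Convert (timestamp_seconds, counter_value) samples into per-second rates."""
    rates = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        rates.append((v1 - v0) / (t1 - t0))  # difference divided by elapsed time
    return rates


# A counter that keeps growing, sampled every 300 seconds.
samples = [(0, 100), (300, 400), (600, 1000)]
rates = counter_rates(samples)
```

The growing counter values 100, 400, 1000 come out as the rates 1.0 and 2.0 per second, which is what ends up on the graph.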
If you want to see your graph actually constantly increasing, i.e. showing the actual count of packets transferred (for example) rather than the rate of transfer, you would need to use type GAUGE, which assumes any rate conversion has already been done.
If you want to submit the rate values (e.g. 2 in the last minute) but display the overall constantly increasing total (in other words, the inverse of how the COUNTER data type works), then you would need to store the values as GAUGE and use a CDEF in your RRDgraph command of the form CDEF:x=y,PREV,+ to obtain the ongoing total. Of course you would only have this relative to the start of the graph time window; maybe a separate call would let you determine what base value to use.
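The running total that such a CDEF is meant to produce amounts to a cumulative sum over the stored values. A small sketch of that idea in Python, where `running_total` and its `base` parameter (the value carried in from before the graph window) are hypothetical names:

```python
from itertools import accumulate


def running_total(per_interval, base=0):
    """Cumulative sum of per-interval values, offset by a known base value."""
    return [base + total for total in accumulate(per_interval)]


visits = [2, 0, 5, 1]  # made-up values stored per interval
totals = running_total(visits)
totals_with_base = running_total(visits, base=10)
```

Without a base the totals start from zero at the left edge of the window, which is the limitation mentioned above.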
As you use Nagios, you may like to investigate Nagios add-ons such as pnp4nagios which will handle much of the graphing for you.

Explain the difference in how MRTG measures incoming data

Everyone knows that MRTG needs at least one value to be passed to its input.
In its per-target options, MRTG has 'gauge', 'absolute' and a default (no option) behavior for what to do with incoming data, or how to count it.
Let's look at an elementary yet popular example:
We pass cumulative data from network interface statistics: how many packets were received by the interface.
We take it from '/proc/net/dev' or look at the 'ifconfig' output for a certain network interface. The number of received bytes increases every time. It's cumulative.
So, as I imagine it, there could be two types of possible statistics:
1. How fast this value changes over the time interval. In other words, activity.
2. A simple, as-is growing graph that just draws every new value every minute (or any other time interval).
The first graph will be jumpy (activity). The second will just keep growing.
I have read rrdtool's and MRTG's docs twice and can't understand which of the options mentioned above counts what.
I suppose (I am not sure) that 'gauge' draws values as is, without any differentiation (good for measuring how much memory or CPU is used every 5 minutes), and that the default or 'absolute' behavior tries to calculate the speed between adjacent measurements. But what's the difference between the last two?
Can you explain in a simple manner which behavior corresponds to which of the three options?
Thanks in advance.
MRTG assumes that everything is being measured as a rate (even if it isn't a rate).
Type 'gauge' assumes that you have already calculated the rate; thus, the provided value is stored as-is (after Data Normalisation). This is appropriate for things like CPU usage.
Type 'absolute' assumes the value passed is the count since the last update. Thus, the value is divided by the number of seconds since the last update to get a rate in thingies per second. This is rarely used, and only for certain unusual data sources that reset their value on being read - eg, a script that counts the number of lines in a log file, then truncates the log file.
Type 'counter' (the default) assumes the value passed is a constantly growing count, possibly one that wraps around at 32 or 64 bits. The difference between the value and its previous value is divided by the number of seconds since the last update to get a rate in thingies per second. If it sees the value decrease, it will assume a counter wraparound at 32 or 64 bits. This is appropriate for something like network traffic counters, which is why it is the default behaviour (MRTG was originally written for network traffic graphs).
Type 'derive' is like 'counter', but will allow the counter to decrease (resulting in a negative rate). This is not possible directly in MRTG but you can manually create the necessary RRD if you want.
All types subsequently perform Data Normalisation to adjust the timestamp to a multiple of the Interval. This will be more noticeable for Gauge types where the value is small than for counter types where the value is large.
For more information on this, see Alex van der Bogaerdt's excellent tutorial.
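The four types described above can be compared side by side in a short sketch. This is an illustration of the descriptions, not MRTG or RRDtool code: it skips data normalisation, `to_rate` is a hypothetical name, and the wrap width is passed in explicitly (the 32-bit value in the usage lines is just an illustrative choice).

```python
def to_rate(kind, value, prev_value, seconds, wrap_bits=None):
    """Turn one raw sample into a per-second rate according to the data type."""
    if kind == "gauge":
        return float(value)  # already a rate, stored as-is
    if kind == "absolute":
        return value / seconds  # count since last update, reset on read
    delta = value - prev_value
    if kind == "counter":
        if delta < 0:
            if wrap_bits is None:
                raise ValueError("counter decreased and no wrap width given")
            delta += 2 ** wrap_bits  # treat the decrease as a wraparound
        return delta / seconds
    if kind == "derive":
        return delta / seconds  # a decrease yields a negative rate
    raise ValueError(f"unknown type: {kind}")


gauge = to_rate("gauge", 42, None, 300)
absolute = to_rate("absolute", 600, None, 300)
counter = to_rate("counter", 1600, 1000, 300)
wrapped = to_rate("counter", 200, 2 ** 32 - 100, 300, wrap_bits=32)
derive = to_rate("derive", 1000, 1600, 300)
```

The only difference between the last two types is how a decreasing value is handled: 'counter' reads it as a wraparound, 'derive' reports a negative rate.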

Importance of do_fast_gettimeoffset( ) in linux

I was reading the book "Understanding the Linux Kernel", and it says that the "number of microseconds is calculated by do_fast_gettimeoffset( )". It also says that this is "to count the number of microseconds that have elapsed within the current second."
I couldn't understand what the author means by the last sentence. Could anyone explain more on that?
If you want to understand the linux kernel, you should be aware that that book has been outdated for a long time and that do_fast_gettimeoffset no longer exists.
do_get_fast_time returns the number of seconds, and is always fast.
do_gettimeoffset returns the number of microseconds since the start of the second, and might be slow.
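The split between the two parts can be shown with plain arithmetic. This is not kernel code, only an illustration of "whole seconds" plus "microseconds elapsed within the current second"; `split_time` is a hypothetical helper.

```python
def split_time(total_microseconds):
    """Split a microsecond timestamp into (whole seconds, microseconds into that second)."""
    seconds, microseconds = divmod(total_microseconds, 1_000_000)
    return seconds, microseconds


# 1652274762.959713 seconds, expressed in microseconds.
seconds, microseconds = split_time(1_652_274_762_959_713)
```

Here `seconds` is 1652274762 and `microseconds` is 959713, so the second value always stays in the range 0 to 999999. For the current time, `split_time(time.time_ns() // 1000)` gives the same kind of pair.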

Resources