I've been trying to run Openai's universe-starter-agent example found here, However, using an m4.16xlarge instance on AWS with 32 workers, the agent's training result doesn't improve after 0.6 hours (over 30 minutes) while it is stated that "the agent is able to solve the same environment in 10 minutes" on the GitHub page.
The progress was monitored through TensorBoard. Please notice the example given in the GitHub was shown for the case of 16 workers, and it converges to an episode reward of 21 within 30 minutes, while for this case, with doubled number of workers and same amount of training time, the reward doesn't improve. I also took a look at the log and it seems there's no compiling error. The command I used to run the script is:
python train.py --num-workers 32 --env-id PongDeterministic-v3 --log-dir /tmp/pong
The only thing that I find a little dubious is when running the script, the following error was shown, but didn't abort the run. The error reads: "failed to connect to server"
Has anyone else run the starter agent, and/or run into similar issue? If so, how did you solve it?
Thanks!
Problem solved - downgrade tensorflow from 1.0.0 to 0.11.0 and trained as expected!
Related
I'm working in a co-simulation project between Simulink and Gazebo. The aim is to move a robot model in Gazebo with the trajectory coordinates computed from Simulink. I'm using MATLAB R2022a, ROS 2 Dashing and Gazebo 9.9.0 in a computer running Ubuntu 18.04.
The problem is that when launching the FMU with the fmi_adapter, I'm obtaining the following. It is tagged as [INFO], but actually messing up all my project.
[fmi_adapter_node-1] [INFO] [fmi_adapter_node]: Simulation time 1652274762.959713 is greater than timer's time 1652274762.901340. Is your step size to large?
Note the timer's time is higher than the simulation time. Even if I try to change the step size with the optional argument of the fmi_adapter_node, the same log appears with small differences in the times. I'm using the next commands:
ros2 launch fmi_adapter fmi_adapter_node.launch.py fmu_path:=FMI/Trajectory/RobotMARA_SimulinkFMU_v2.fmu # default step size: 0.2
ros2 launch fmi_adapter fmi_adapter_node.launch.py fmu_path:=FMI/Trajectory/RobotMARA_SimulinkFMU_v2.fmu _step_size:=0.001
As you would expect, the outputs of the FMU are the xyz coordinates of the robot trajectory in each time step. Since the fmi_adapter_node creates topics for both inputs and outputs, I'm reading the output xyz values by means of 3 subscribers with the next code. Then, those coordinates are being used to program the robot trajectories with the MoveIt-Python API.
When I run the previous Python code, I'm obtaining the following warning once and again and the robot manipulator actually doesn't move.
[ WARN] [1652274804.119514250]: TF_OLD_DATA ignoring data from the past for frame motor6_link at time 870.266 according to authority unknown_publisher
Possible reasons are listed at http://wiki.ros.org/tf/Errors%20explained
The previous warning is explained here, but I'm not able to fix it. I've tried clicking Reset in RViz, but nothing changes. I've also tried the following without success:
ros2 param set /fmi_adapter_node use_sim_time true # it just sets the timer's time to 0
It seems that the clock is taking negative values, so there is a synchronization problem.
Any help is welcome.
The warning message by the FMIAdapterNode is emitted if the timer's period is only slightly greater than the simulation step-size and if the timer is preempted by other processes or threads.
I created an issue at https://github.com/boschresearch/fmi_adapter/issues/9 which explains this in more detail and lists two possible fixes. It would be great if you could contribute to this discussion.
I assume that the TF_OLD_DATA error is not related to the fmi_adapter. Looking at the code snippet at ROS Answers, I wondered whether x,y,z values are re-published at all given that the lines
pose.position.x = listener_x.value
pose.position.y = listener_y.value
pose.position.z = listener_z.value
are not inside a callback and executed even before rospy.spin(), but maybe that's just truncated.
I wonder but could not found any information why this appears all the time if I try to tune hyperparameter from sklearn with TuneSearchCV:
Note that the important part is the Log sync warning and as a result that the logging in combination with tensorflow and search_optimization such as optuna does not work:
Backend is sklearn
Concatenating h5 datasets of the following files:
('output/example_train_1.h5', 'output/example_train_2.h5')
based on the following keys:
('x', 'y')
Concatenation successful, resulting shapes for the given dsets:
Key: x, shape: (20000, 25)
Key: y, shape: (20000,)
Log sync requires rsync to be installed.
Process finished with exit code 0
The tuning processes seem to be working, as long as I do not use search-optimization such as optional.
I use it within a docker container. I got through the ray-documentation, but I could find the source where I think the error drops. However, I could not find any settings or additional options on how to prevent it.
Furthermore, it seems that rsync is just necessary if I use a cluster. But actually, I don't do that right now.
The warning (Log sync requires rsync to be installed.) does not stop the script from executing. If rsync is not installed, it will just not synchronize logs between nodes, which seems to be unnecessary in your case anyway. You shouldn't run into any problem there.
It's hard to say what the problem here is, as we're missing crucial information: Which version of Ray are you running, which version of tune-sklearn, and how does your training script look like?
If you're running into problems and you suspect it is a bug, please consider opening an issue in the tune-sklearn repository, and make sure to include the above information and preferably a minimal reproducible script so the maintainers can look into this.
Recently I was trying to test my model which i already trained. Initially I was using Google colab notebook to write code because of it's interactive features, once I was done writing code and I was getting satisfactory results, it took around 2.5 hr to give final output. After that what I wanted was to transfer the notebook code to .py script, I did that with little bit of modification, saved it in gdrive, and then used command !python test.py. now it took me more than 4.5 hr to get the final output, can any one explain why does colab take so much time when trying to run the python script from gdrive while compared to the same code as used in notebook
I would add time calculation to every step I doubt that takes time and see which step in your whole program takes the time.
a1 = time.time()
//your code step
print(time.time() - a1)
This will give you the time for each step and you can see which one is taking a long time.
Operations to check.
1. object creation in loops
2. read/write operation to Gdrive
Once you find the problem-causing piece of code, you may change it.
Well it can be because of the fact that colab is retrieving the data from gdrive and then might be again writing in gdrive which will of ofcourse take time i guess
After having created a TensorFlow 1.4 model for Python 3, I have now found that Google Cloud ML Engine currently only has support for Python 2.7.
Back-porting my Python 3 code at first seemed simple enough: Some scripts still work as expected when I replace their shebang #!/usr/bin/env python3 with #!/usr/bin/env python. python -V reports 2.7.10 in my (macOS) environment.
Yet one script does not react so gracefully. When I run it now, it produces a Segmentation fault: 11 without any previous warnings or other diagnostic output.
How can I find out about the root cause, so that I know what else to change to make also that script palatable to Python 2?
UPDATE The segmentation fault apparently occurs during a call to session.run(get_next), where get_next is obtained from a tf.data.Iterator as follows:
iterator = dataset.make_initializable_iterator()
get_next = iterator.get_next()
There are two issues here: one is about Python 3 support and the other is about the segfault.
Python 3 Support
CloudML Engine now supports Python 3, via the 'pythonVersion' field when submitting jobs (see the API reference docs).
If you are using gcloud you will need to create a config file like this (let's name it config.yaml):
trainingInput:
pythonVersion: "3.5"
When you submit your job, point gcloud to that file, e.g.
gcloud ml-engine jobs submit training --config=config.yaml ...
Segfault
This may be caused by running out of memory. Please check the memory usage in the console for that job. That said, if the job dies abruptly, memory usage at the time of failure may not be accurately reflected for that job.
This question is basically a duplicate of this one, except that the accepted answer on that question was, "it's not actually slower, you just weren't running the timing command correctly."
In my case, it actually is slower! :)
I'm on Windows 10. Here's the output from PowerShell's Measure-Command (the TotalMilliseconds line represents wall-clock time):
PS> Measure-Command {npm --version}
Days : 0
Hours : 0
Minutes : 0
Seconds : 1
Milliseconds : 481
Ticks : 14815261
TotalDays : 1.71472928240741E-05
TotalHours : 0.000411535027777778
TotalMinutes : 0.0246921016666667
TotalSeconds : 1.4815261
TotalMilliseconds : 1481.5261
A few other numbers, for comparison:
'{.\node_modules.bin\mocha}': 1300ms
'npm run test' (just runs mocha): 3300ms
npm help: 1900ms.
the node interpreter itself is ok: node -e 0: 180ms
It's not just npm that's slow... mocha reports that my tests only take 42ms, but as you can see above, it takes 1300ms for mocha to run those 42ms of tests!
I've had the same trouble. Do you have Symantec Endpoint Protection? Try disabling Application and Device Control in Change Settings > Client Management > General > Enable Application and Device Control.
(You could disable SEP altogether; for me the command is: "%ProgramFiles(x86)%\Symantec\Symantec Endpoint Protection\smc.exe" -stop.)
If you have some other anti-virus, there's likely a way to disable it as well. Note that closing the app in the Notification area might not stop the virus protection. The problem is likely with any kind of realtime protection that scans a process as it starts. Since node and git are frequently-invoked short-running processes, this delay is much more noticeable.
In Powershell, I like to measure the performance of git status, both before and after that change: Measure-Command { git status }
I ran into this problem long ago, I think it was an extension that I had. I use Visual Studio Code, and when it has no extensions and running bash:
//GIT Bash Configuration
"terminal.integrated.shell.windows": "C:\\Program Files\\Git\\bin\\bash.exe",
it actually flies, I use both OS, so I can tell the difference. Try using different tools and disabling some.
And if that still doesn't work, check your antivirus, maybe it's slowing down the process?
Been googling this all day, with no luck. Decided to uninstall Java to see what would happen and bingo, solved my problem. I know this is an old thread, but I found myself coming back to it so many times to see if I missed anything.
off topic:
Got to figure out how to get Java working now 🤦
Didn't know about Measure-Command, so I'll be using that in the future!
I had this problem. When I tried to run an application of my job in my home, I realized that in my job's laptop the app starts on 2 minutes but in my personal notebook it tooked 5 minutes or more.
After trying some possible solutions, finally I found the problem was that I installed Git Bash in my D drive partition which is a HDD. When I re-installed in C drive whichs is a SSD then the app started faster. However, I also moved Node.js to C drive to prevent another issues.