How to make Colab-dependent Python code self-contained? - linux

Sorry if this question isn't in the right place, but I wouldn't call myself a developer.
Overview: our organization runs a piece of critical Python code on Google Colab, with the notebook stored on Google Drive. This code processes sensitive data, and for privacy reasons we'd like to move it out of Google. There's also a performance reason: the RAM requirements are sometimes higher than the free plan allows, which halts code execution.
The setup: the code takes one very large PDF file as input (in the same Google Drive directory), processes it, then spits out tens of usable XLSX files containing the relevant info, each run's output going into a different folder since the code is expected to be run once every 4-5 weeks. The few parameters there are get set in a simple text file.
The problem: no one in the organization currently has the skills to edit the code or maintain a compatible Python environment.
The ideal end product would be a self-contained executable with no external dependencies, able to run in any user directory. I know this is a rather inefficient use of disk space, but it's the only way to ensure proper execution every time. Primary target platforms would be Mac OS X 10.14 to current, or a Debian-based Linux distribution, more precisely Linux Mint. Windows compatibility isn't a priority.
What would be the proper way to make this code independent?
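One common route (not the only one) is to bundle the script, the Python interpreter and all required libraries into a single binary with PyInstaller. A minimal sketch, assuming the notebook has been saved as an ordinary script named process_pdf.py (a hypothetical name) and that any Colab-specific calls such as google.colab.drive.mount have already been replaced with local file paths:

python3 -m venv buildenv               # throwaway build environment
source buildenv/bin/activate
pip install pyinstaller                # plus whatever the script itself imports
pyinstaller --onefile process_pdf.py   # produces dist/process_pdf

PyInstaller does not cross-compile, so the binary would have to be built once on macOS (ideally on the oldest version you want to support, i.e. 10.14) and once on Linux Mint; the resulting file in dist/ can then be copied into any user directory and run there, reading its parameter text file and the input PDF from that directory.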

Related

Signal 11 in a Linux shell at a remote site. How can I troubleshoot?

I'm a bio major who has only recently started doing serious coding for research. To support research, our campus has an on-campus supercomputer for researchers to use. I work on this supercomputer remotely, accessing it through a Linux shell to submit jobs. I'm writing a job submission script to align a lot of genomes using a program installed on the machine called Mauve. I've run a Mauve job fine before and have adapted the script from that job to fit this one. Only this time I keep getting this error:
Storing raw sequence at
/scratch/addiseg/Elizabethkingia_clonalframe/rawseq16360.000
Sequence loaded successfully.
GCA_000689515.1_E27107v1_PRJEB5243_genomic.fna 4032057 base pairs.
Storing raw sequence at
/scratch/addiseg/Elizabethkingia_clonalframe/rawseq16360.001
Sequence loaded successfully.
e.anophelisGCA_000496055.1_NUH11_genomic.fna 4091484 base pairs.
Caught signal 11
Cleaning up and exiting!
Temporary files deleted.
So I've got no idea how to troubleshoot this. I'm so sorry if this is super basic and wasting your time, but I don't know how to troubleshoot this at a remote site. All the possible solutions I've seen so far require access to the hardware or software, neither of which I control.
My current submission script is this:
module load mauve
progressiveMauve --output=8elizabethkingia-alignment.etc.xmfa --output-guide-tree=8.elizabethkingia-alignment.etc.tree --backbone-output=8.elizabethkingia-alignment.etc.backbone --island-gap-size=100 e.anophelisGCA_000331815.1_ASM33181v1_genomicR26.fna GCA_000689515.1_E27107v1_PRJEB5243_genomic.fna e.anophelisGCA_000496055.1_NUH11_genomic.fna GCA_001596175.1_ASM159617v1_genomicsrr3240400.fna e.meningoseptica502GCA_000447375.1_C874_spades_genomic.fna e.meningoGCA_000367325.1_ASM36732v1_genomicatcc13253.fna e.anophelisGCA_001050935.1_ASM105093v1_genomicPW2809.fna e.anophelisGCA_000495935.1_NUHP1_genomic.fna
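One hedged starting point for a signal 11 (segmentation fault) on a machine you do not administer is to enable core dumps, if the cluster allows them, and inspect the crash with gdb, if it is installed on the node (the commands below are examples, not site-specific instructions):

ulimit -c unlimited                 # allow a core dump for this shell / job
module load mauve
progressiveMauve ...                # the same command as in the script above
# if it crashes again, look for a file named core (or core.<pid>) in the
# working directory and ask gdb where the crash happened:
gdb "$(which progressiveMauve)" core
# then type: bt    (prints the stack trace)

If no core file appears, the cluster's core_pattern may redirect or disable dumps, in which case the site's support staff are the right people to ask.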

What are all the kinds of instructions that are contained in an executable file? [closed]

Closed 8 years ago. This question needs details or clarity and is not currently accepting answers.
I know that an exe file contains pure CPU instructions plus extra pieces of data. So if I run a simple hello world console app or a 32-bit GUI app (an exe file), the OS loads the instructions from the exe file into memory to be processed by the CPU. If I run that app, it should just follow the instructions as they are, that is, display hello world only (a completely blank screen with only the words 'hello world'). But that is not what happens. It is somehow controlled by the OS to display in a windowed command prompt environment. So what is actually happening there?
Edit: to be precise, I want to know what instructions an exe file (a simple 16-bit DOS app on Windows) contains, considering the confusion I described above.
If you compile a program on a particular OS, then it is that OS you should run it on. For example, if you compile a command-line program for Windows, it will work on Windows, but it will not work on Ubuntu. You said yourself that the content of the executable file differs from OS to OS.
What you can do is use an emulator for one system on another system. For example, if you are on Linux, you can use Wine (strictly speaking a compatibility layer rather than an emulator) to run Windows programs. In this way you get a Windows environment for your program even though it is running on Linux.
Of course, the CPU should follow the instructions as they are. If you, for example, want to print a hello world line, then your program contains code for that. But not just for that: it also contains code for other things that are OS-dependent, and this is where the problem is. For example, your Windows-compiled program might use some Windows API to perform the printing, and this API is not to be found on Linux. It is then the called API, and not your program, that directly performs the output.
As Jongware mentioned in the comment below, what you could do is cross-compile your code. In that case you can compile your code on Windows so that it runs on Linux, but only if you compile it with the libraries needed for that specific Linux. In that case, however, you will not be able to run your code on Windows.
By default, software is compiled for the same type of machine that you are using. However, you can also install a cross-compiler to compile for some other type of machine.
When you develop a desktop or server application, the development platform (the machine that runs your compiler) and the target platform (the machine that runs your application) are almost always the same. By "platform" I mean the combination of CPU architecture and operating system. The process of building executable binaries on one machine and running them on another machine, when the CPU architecture or the operating system differ, is called "cross compilation". A special compiler is needed for cross compilation, called a "cross compiler" and sometimes just a "toolchain".
Have a look here and here.
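As a small illustration of the idea, here is a hypothetical cross-compile of a plain C hello world from a Linux host to 64-bit Windows, assuming the mingw-w64 cross toolchain is installed (package names vary by distribution):

x86_64-w64-mingw32-gcc -o hello.exe hello.c   # build a Windows binary on Linux
file hello.exe   # reports a PE32+ executable for MS Windows, not a Linux ELF

The resulting hello.exe will not run natively on the Linux build machine, but it will run on Windows (or under Wine), which is exactly the cross-compilation situation described above.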
So if I run that app, it should only follow instruction as it is, that is to display hello world only (in a complete blank screen with only the word 'hello world').
Why "complete blank screen"? It completely depends on how your application is designed:
In a console application, it will output the string to stdout, which is then displayed inside the console you started the program from. If you started it directly via a shortcut or by double-clicking the .exe file, a new console window will open.
In a windowed application, it will output the string wherever you defined it to be output.
Alas, your question is far too vague for me to get more concrete here.
But it is not happening so. It is somewhat controlled by OS to display in a windowed environment of command prompt.
What does that mean?
If it means what I think, i.e. you indeed get a black window with the string output, well, that is exactly how it is supposed to work if it is a console application.

Can you load a tree structure in memory with Linux shell?

I want to create an application with a Linux shell script like this — but can it be done?
This application will create a tree containing data. The tree should be loaded in memory, and the tree (loaded in memory) should be readable from any other external Linux script.
Is it possible to do it with a Linux shell?
If yes, how can you do it?
And are there any simple examples for that?
There are a large number of misconceptions on display in the question.
Each process normally has its own memory; there's no trivial way to load 'the tree' into one process's memory and make it available to all other processes. You might devise a system of related programs that know about a shared memory segment (somehow — there's a problem right there) that contains the tree, but that's about it. They'd be special programs, not general shell scripts. That doesn't meet your 'any other external Linux script' requirement.
What you're seeking is simply not available in the Linux shell infrastructure. That answers your first question; the other two are moot given the answer to the first.
There is a related discussion here. They use the shared memory device /dev/shm and, ostensibly, it works for multiple users. At least, it's worth a try:
http://www.linuxquestions.org/questions/linux-newbie-8/bash-is-it-possible-to-write-to-memory-rather-than-a-file-671891/
Edit: just tried it with two users on Ubuntu - looks like a normal directory and REALLY WORKS with the right chmod.
See also:
http://www.cyberciti.biz/tips/what-is-devshm-and-its-practical-usage.html
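A tiny illustration of the /dev/shm approach (the file name and the one-path-per-line encoding are arbitrary choices); /dev/shm is a tmpfs, so the data lives in RAM, and any script with read permission can use it:

# writer script
printf 'root\nroot/branch1\nroot/branch1/leafA\n' > /dev/shm/mytree
chmod 644 /dev/shm/mytree        # let other users / scripts read it

# reader script, possibly run by another user or from another terminal
cat /dev/shm/mytree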
I don't think there is a way to do this if you want to keep all of the following requirements:
Building this as a shell script
In-memory
Usable across terminals / from external scripts
You would have to give up at least one requirement:
Give up the shell script requirement - build this in C to run as a Linux process. I only understand this well enough to say that it would be non-trivial.
Give up the in-memory requirement - you can serialize the tree and keep the data in a temp file. This works as long as the file is small and the performance bottleneck isn't access to the tree. The good news is that you can use the data across terminals and from external scripts.
Give up the usability-from-external-scripts requirement - you can technically build a script and source it to add many (read: a mess of) variables representing the tree into your current shell session.
None of these alternatives are great, but if you had to go with one, number 2 is probably the least problematic.
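For example, a minimal sketch of alternative 2, with the tree serialized as one path per line in a temp file (the file name and encoding are just illustrative choices):

# writer: encode the tree as paths in a plain file
cat > /tmp/mytree.txt <<'EOF'
root
root/branch1
root/branch1/leafA
root/branch2
EOF

# any other script: list the children of root/branch1 with standard tools
grep '^root/branch1/' /tmp/mytree.txt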

Implementing an update/upgrade system for embedded Linux devices

I have an application that runs on an embedded Linux device and every now and then changes are made to the software and occasionally also to the root file system or even the installed kernel.
In the current update system the contents of the old application directory are simply deleted and the new files are copied over it. When changes to the root file system have been made the new files are delivered as part of the update and simply copied over the old ones.
Now, there are several problems with the current approach and I am looking for ways to improve the situation:
The root file system of the target that is used to create file system images is not versioned (I don't think we even have the original rootfs).
The rootfs files that go into the update are manually selected (instead of a diff)
The update continually grows and that becomes a pain. There is now a split between update and upgrade, where the upgrade contains the larger rootfs changes.
I have the impression that the consistency checks in an update are rather fragile, if implemented at all.
Requirements are:
The application update package should not be too large, and it must also be able to change the root file system in case modifications have been made.
An upgrade can be much larger and only contains the stuff that goes into the root file system (like new libraries, kernel, etc.). An update can require an upgrade to have been installed.
Could the upgrade contain the whole root file system and simply do a dd on the flash drive of the target?
Creating the update/upgrade packages should be as automatic as possible.
I absolutely need some way to do versioning of the root file system. This has to be done in a way that lets me compute some sort of diff from it, which can then be used to update the rootfs of the target device.
I already looked into Subversion since we use that for our source code but that is inappropriate for the Linux root file system (file permissions, special files, etc.).
I've now created some shell scripts that can give me something similar to an svn diff, but I would really like to know whether a working and tested solution for this already exists.
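For illustration, a rough sketch of the kind of diff listing I have in mind (the rsync options are just one possibility, not my actual scripts):

# dry run: list what would have to change to turn rootfs-v1 into rootfs-v2,
# including permissions, hard links, ACLs and extended attributes
rsync -aHAXn --delete --itemize-changes rootfs-v2/ rootfs-v1/ > v1-to-v2.changes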
Using such diffs, I guess an upgrade would then simply become a package that contains incremental updates based on a known root file system state.
What are your thoughts and ideas on this? How would you implement such a system? I prefer a simple solution that can be implemented in not too much time.
I believe you are looking at the problem the wrong way - any update which is non-atomic (e.g. dd'ing a file system image, or replacing files in a directory) is broken by design: if the power goes off in the middle of an update the system is a brick, and for embedded systems, power can go off in the middle of an upgrade.
I have written a white paper on how to correctly do upgrade/update on embedded Linux systems [1]. It was presented at OLS. You can find the paper here: https://www.kernel.org/doc/ols/2005/ols2005v1-pages-21-36.pdf
[1] Ben-Yossef, Gilad. "Building Murphy-compatible embedded Linux systems." Linux Symposium. 2005.
I absolutely agree that an update must be atomic - I recently started an open source project with the goal of providing a safe and flexible way to do software management, with both local and remote updates. I know my answer comes very late, but maybe it can help you on future projects.
You can find sources for "swupdate" (the name of the project) at github.com/sbabic/swupdate.
Stefano
Currently there are quite a few open source embedded Linux update tools around, each with a different focus.
Another one worth mentioning is RAUC, which focuses on safe and atomic installation of signed update bundles on your target, while being very flexible in how you adapt it to your application and environment. The sources are on GitHub: https://github.com/rauc/rauc
In general, you can find a good overview and comparison of current update solutions on the Yocto Project wiki page about system updates:
https://wiki.yoctoproject.org/wiki/System_Update
Atomicity is critical for embedded devices; one of the reasons highlighted here is power loss, but there can be others, such as hardware or network issues.
Atomicity is perhaps a bit misunderstood; this is a definition I use in the context of updaters:
An update is always either completed fully, or not at all
No software component besides the updater ever sees a half installed update
Full image update with a dual A/B partition layout is the simplest and most proven way to achieve this.
For embedded Linux there are several software components you might want to update, and different designs to choose from; there is a newer paper on this available here: https://mender.io/resources/Software%20Updates.pdf
File moved to: https://mender.io/resources/guides-and-whitepapers/_resources/Software%2520Updates.pdf
If you are working with the Yocto Project you might be interested in Mender.io - the open source project I am working on. It consists of a client and a server, and the goal is to make it much faster and easier to integrate an updater into an existing environment, without needing to redesign too much or spend time on custom/homegrown coding. It also allows you to manage updates centrally with the server.
You can journal an update and divide your update flash into two slots. A power failure always returns you to the currently executing slot. The last step is to modify the journal value. It is non-atomic, yet there is no way to brick the device, even if it fails at the moment of writing the journal flags. There is no such thing as an atomic update. Ever. I've never seen one in my life. iPhone, Android, my network switch -- none of them are atomic. If you don't have enough room to do that kind of design, then fix the design.
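As a rough illustration of such a dual-slot scheme (the device names, image file names and bootloader variable below are hypothetical; fw_setenv is the U-Boot tool often used to flip the slot flag):

sha256sum -c rootfs-v2.img.gz.sha256        # verify the downloaded image first
INACTIVE=/dev/mmcblk0p3                     # the slot that is NOT currently running
gunzip -c rootfs-v2.img.gz | dd of="$INACTIVE" bs=1M conv=fsync
fw_setenv bootslot B                        # switch slots only after the write succeeded
reboot

Until fw_setenv runs, the bootloader keeps booting the old, untouched slot, so a power failure anywhere before that point leaves the device in its previous working state.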

Stripping down a kernel in Linux?

I recently read a post (admittedly a few years old) giving advice for a fast number-crunching program:
"Use something like Gentoo Linux with 64 bit processors as you can compile it natively as you install. This will allow you to get the maximum punch out of the machine as you can strip the kernel right down to only what you need."
Can anyone elaborate on what they mean by stripping down the kernel? Also, as this post is about 6 years old, which current version of Linux would be best for this (to aid my Google searches)?
There is some truth in the statement, as well as something somewhat nonsensical.
You do not spend resources on processes you are not running. So as a first step I would try to minimise the number of processes running. For that we quite enjoy the Ubuntu server ISO images at work -- if you install from one of those, log in and run ps or pstree, you see a thing of beauty: six or seven processes, nothing more. That is good.
That the kernel is big (in terms of source size or installation) does not matter per se. Much of this size stems from drivers you may not be using anyway. And the same rule applies again: what you do not run does not compete for resources.
So think about a headless server, stripped down -- rather than your average desktop installation with more than a screenful of processes trying to make the life of a desktop user easier.
You can create a custom linux kernel for any distribution.
Start by going to kernel.org and downloading the latest source. Then choose your configuration interface (these days you have the choice of plain console text via 'config', ncurses-style 'menuconfig', KDE-style 'xconfig' and GNOME-style 'gconfig') and run make menuconfig (or config/xconfig/gconfig). After choosing all the options, type make to build your kernel, then make modules to compile all the selected modules for this kernel. Then make install will copy the files to your /boot directory, and make modules_install copies the modules. Next, go to /boot and use mkinitrd to create the RAM disk needed to boot properly, if needed. Finally, add the kernel to your GRUB menu.lst by copying the latest entry and adding a similar one pointing to the new kernel version.
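Condensed into commands, that flow looks roughly like this (the kernel version is only an example; on newer distributions make install typically also generates the initramfs and updates the GRUB configuration, while on older setups you create the RAM disk with mkinitrd and edit menu.lst by hand as described above):

cd linux-6.6                # unpacked kernel source from kernel.org
make menuconfig             # or: make config / xconfig / gconfig
make -j"$(nproc)"           # build the kernel image and the selected modules
sudo make modules_install   # install modules under /lib/modules/<version>
sudo make install           # copy the kernel image to /boot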
Of course, that's a basic overview and you should probably search for 'linux kernel compile' to find more detailed info. Selecting the necessary kernel modules and options takes a bit of experience - if you choose the wrong options, the kernel might not be bootable and you'll have to start over, which is a pain because selecting the options and compiling the kernel can take 15-30 minutes.
Ultimately, it isn't going to make a large difference to compile a stripped-down custom kernel unless your given task is very, very performance sensitive. It makes sense to remove things you're never going to use from the kernel, though, like say ISDN support.
I'd have to say this question is more suited to SuperUser.com, by the way, as it's not quite about programming.
