Python subprocess performance for multiple pipelined commands - python-3.x

I was writing Python code using the subprocess module and got stuck in a situation where I need to use pipes to pass the result of one command to another to obtain the specific data I need.
However, this also can be achieved through pure Python code.
Ex)
from subprocess import Popen, PIPE
cmd_result = Popen('ls -l ./ | awk -F " " \'{if ($5 > 10000) print $0}\' | grep $USER',
                   shell=True, stdout=PIPE, text=True).communicate()[0].split('\n')
Or
cmd_result = Popen('ls -l ./', shell=True, stdout=PIPE, text=True).communicate()[0].split('\n')
result_lst = []
for result in cmd_result:
    result_items = result.split()
    if int(result_items[4]) > 10000 and result_items[2] == "user_name":
        result_lst.append(result)
I am wondering which method is more efficient.
I found that the pure Python version is slower than the one with pipelines, but I'm not sure whether that means using pipes is more efficient in general.
Thank you in advance.

The best solution by far is to avoid using a subprocess at all.
import os
myuid = os.getuid()
for file in os.scandir("."):
    st = file.stat()
    if st.st_size > 10000 and st.st_uid == myuid:
        print(file.name)
In general, if you want to run and capture the output of a command, the simplest by far is subprocess.check_output; but really, don't parse ls output, and, of course, try to avoid superfluous subprocesses like useless greps if efficiency is important.
files = subprocess.check_output(
    """ls -l . | awk -v me="$USER" '$5 > 10000 && $3 == me { print $9 }'""",
    text=True, shell=True)
This has several other problems; $3 (the owner) could contain spaces (it does, on my system) and $9 could contain just the beginning of the file name if the name contains spaces.
If you need to run a process which could produce a lot of output concurrently and fetch its output as it arrives, not when the process has finished, the Stack Overflow subprocess tag info page has a couple of links to questions about how to do that; I am guessing it is not worth the effort for this simple task you are asking about, though it could be useful for more complex ones.
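For reference, the streaming pattern looks roughly like this; the ls command here is just a stand-in for any long-running process:

```python
import subprocess

# Read the child's output line by line as it is produced,
# instead of waiting for communicate() to return it all at once.
with subprocess.Popen(["ls", "-l", "."], stdout=subprocess.PIPE, text=True) as proc:
    for line in proc.stdout:
        print(line.rstrip("\n"))  # handle each line immediately
```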

Related

A clean way of combining r and f strings, with multiple escapes?

I have no idea if my Python is on the right track here, but with this example, I need to combine both r and f-strings whilst escaping a handful of characters in a particularly ugly command.
import subprocess
cust = input("Cust: ")
subprocess.run(r"""pd service:list -j | grep -i {cust} | awk '/name/ {print $0} /description/ {print $0} /summary/ {print $0 "\n"}' >> services_detailed.txt""".format(cust=cust), shell=True)
I have tried a few different methods, but this is the closest I have come to getting one to work.
A single r-string works absolutely fine (if I don't need to take user input), but when I need to take user input it becomes a problem.
Python syntax looks fine in VSCode but when running it spits out:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
KeyError: 'print $0'
This doesn't really send me anywhere and I am not sure if I should be continuing with finding a way to get this string to work or going back to the drawing board.
I figured I would ask the question in case anyone else falls down the 'rf-string' rabbit hole.
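For what it's worth, the usual way out of this trap is to double every brace that is meant literally, so str.format() leaves it alone and only {cust} is treated as a substitution field; a minimal sketch with a made-up command:

```python
cust = "acme"
# {{ and }} are literal braces to str.format(); {cust} is substituted.
cmd = r"""grep -i {cust} | awk '/name/ {{print $0}}'""".format(cust=cust)
print(cmd)  # grep -i acme | awk '/name/ {print $0}'
```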

Writing to the same file with awk and truncate

My system is Arch Linux and my window manager is DWM. I use dash as my shell interpreter.
I have written this extension shell script for my timer.
xev -root |
awk -F'[ )]+' '/^KeyPress/ { a[NR+2] }
NR in a {
    if ($8 == "Return") {
        exit 0;
    } else if ($8 == "BackSpace") {
        system("truncate -s-1 timer.txt");
    } else if (length($8) == 1) {
        printf "%s", $8;
        fflush(stdout);
    }
    system("pkill -RTMIN+3 dwmblocks");
}' | tee timer.txt
The timer itself sits in dwmblocks status bar. I want to name my timers first and then let it start. But I don't think that's that important.
The purpose of this script: I want to input characters into the root window of DWM and have them appear in my status bar instantly. xev produces the key-press information, then awk takes that information, extracts the exact key from everything xev outputs, and checks it. If the key is "Return", awk exits (job done). If the key is "BackSpace", awk calls truncate via system(). If it's a regular character key, awk outputs it to timer.txt through tee (I could use "> timer.txt" too, I think, but I want to see the output in my terminal for debugging).
After every relevant keypress (single character) I fflush stdout. After all of that I finally call pkill so that dwmblocks knows that it should update. (dwmblocks issues cat operation on the file)
Okay, "Return" and character input work fine. But there's a problem with "BackSpace". I've read about it a bit (I'd say I'm still a Unix newbie even though I've been using Linux for two years now) and found out that writing to the same file from different processes is bad news. Still, could it be done somehow? The fact is that truncate only writes to the file when awk doesn't, so maybe it wouldn't be that big of a deal?
This exact script worked earlier yesterday but now it doesn't. At first, I tried using sed instead of truncate and truncate seemed to let me delete characters from timer.txt but now truncate seems to not work anymore too. Well, it kinda works. I can input my characters and then I can delete them. BUT. After pressing Backspace I can not enter any more characters. If I try to enter a character Backspace stops working too.
So yeah. I'd have several questions. First - what the hell is the problem? As I've said, it used to work and now it doesn't. Am I wandering into undefined behavior in this script?
Second - could this be done - meaning - could I somehow write and delete from the same file. Maybe with some other tool, not awk?
Thanks in advance.
This probably isn't an answer but it's too much to go in a comment. I don't know the details of most of the tools you mention, nor do I really understand what it is you're trying to do but:
A shell is a tool to manipulate files and processes and schedule calls to other tools. Awk is a tool to manipulate text. You're trying to use awk like a shell - you have it sequencing calls to truncate and pkill and calling system to spawn a subshell each time you want to execute either of them. What you should be doing, for example, is just:
shell { truncate }
but what you're actually doing is:
shell { awk { system { shell { truncate } } } }
Can you take that role away from awk and give it back to your shell? It should make your overall script simpler, conceptually at least, and probably more robust.
Maybe try something like this (untested):
#!/usr/bin/env bash
while IFS= read -r str; do
    case $str in
        Return ) exit 0 ;;
        BackSpace ) truncate -s-1 timer.txt ;;
        ? ) printf "%s" "$str" | tee -a timer.txt ;;
    esac
    pkill -RTMIN+3 dwmblocks
done < <(
    xev -root |
    awk -F'[ )]+' '/^KeyPress/{a[NR+2]} NR in a{print $8; fflush()}'
)
I moved the write to timer.txt inside the loop to make sure tee isn't trying to write to the file while you're truncating it; that may not be necessary.

How to execute svn command along with grep on windows?

Trying to execute svn command on windows machine and capture the output for the same.
Code:
import subprocess
cmd = "svn log -l1 https://repo/path/trunk | grep ^r | awk '{print $3}'"
p = subprocess.Popen(cmd, stdout=subprocess.PIPE, shell=True)
Running it fails with:
'grep' is not recognized as an internal or external command,
operable program or batch file.
I do understand that 'grep' is not a Windows utility.
Is it only limited to execute on Linux?
Can we execute the same on Windows?
Is my code right?
For Windows, your command will look something like the following:
svn log -l1 https://repo/path/trunk | find "string_to_find"
You need to use the find utility in windows to get the same effect as grep.
svn --version | find "ra"
* ra_svn : Module for accessing a repository using the svn network protocol.
* ra_local : Module for accessing a repository on local disk.
* ra_serf : Module for accessing a repository via WebDAV protocol using serf.
Use svn log --search FOO instead of grep-ing the command's output.
grep and awk are certainly available for Windows as well, but there is really no need to install them -- the code is easy to replace with native Python.
import subprocess
p = subprocess.run(["svn", "log", "-l1", "https://repo/path/trunk"],
                   capture_output=True, text=True)
for line in p.stdout.splitlines():
    # grep ^r
    if line.startswith('r'):
        # awk '{ print $3 }'
        print(line.split()[2])
Because we don't need a pipeline, and just run a single static command, we can avoid shell=True.
Because we don't want to do the necessary plumbing (which you forgot anyway) for Popen(), we prefer subprocess.run(). With capture_output=True we conveniently get its output in the resulting object's stdout attribute; because we expect text output, we pass text=True (in older Python versions you might need to switch to the old, slightly misleading synonym universal_newlines=True).
I guess the intent is to search for the committer in each revision's output, but this will incorrectly grab the third token on any line which starts with an r (so if you have a commit message like "refactored to use Python native code" the code will extract use from that). A better approach altogether is to request machine-readable output from svn and parse that (but it's unfortunately rather clunky XML, so there's another not entirely trivial rabbit hole for you). Perhaps as middle ground implement a more specific pattern for finding those lines -- maybe look for a specific number of fields, and static strings where you know where to expect them.
if line.startswith('r'):
    fields = line.split()
    if len(fields) == 14 and fields[1] == '|' and fields[3] == '|':
        print(fields[2])
You could also craft a regular expression to look for a date stamp in the third |-separated field, and the number of changed lines in the fourth.
For the record, a complete commit message from Subversion looks like
------------------------------------------------------------------------
r16110 | tripleee | 2020-10-09 10:41:13 +0300 (Fri, 09 Oct 2020) | 4 lines
refactored to use native Python instead of grep + awk
(which is a useless use of grep anyway; see http://www.iki.fi/era/unix/award.html#grep)
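Building on that sample, a regular-expression version might look like this (the exact pattern is my guess, tuned to the header line above):

```python
import re

# Matches a Subversion log header line such as:
# r16110 | tripleee | 2020-10-09 10:41:13 +0300 (Fri, 09 Oct 2020) | 4 lines
header = re.compile(
    r"^r\d+ \| (?P<author>[^|]+?) \| "
    r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} [+-]\d{4} \([^)]*\) \| \d+ lines?$"
)
line = "r16110 | tripleee | 2020-10-09 10:41:13 +0300 (Fri, 09 Oct 2020) | 4 lines"
m = header.match(line)
if m:
    print(m.group("author"))  # tripleee
```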

Linux - Redirection of a shell script into a text file

I'm new to Linux, and have been trying to solve an assignment but to no avail.
I have a shell script which prints out lines of a text file in a certain manner (a line within every few seconds):
python << END
import time, random
a = open('/home/ch/pshety/course/fielding_history.txt', 'r')
flag = False
for i in range(1000):
    b = a.readline()
    if i == 402 or flag:
        print(a.readline())
        flag = True
    time.sleep(2)
END
sh th.sh
If I run it without trying to redirect it anywhere, I get the output on the terminal. However, when I tried to redirect it into a new text file, it doesn't do anything - the text remains empty:
sh th.sh > debug.txt
I've tried looking for answers, I've stumbled upon a lot of suggestions including tee but nothing helps - the file remains empty.
What am I doing wrong?
Try this:
import time, random
a = open('/home/ch/pshety/course/fielding_history.txt', 'r')
for i in range(1000):
    b = a.readline()
    if i >= 402:
        print(b, flush=True)
    time.sleep(2)
Your Python script likely needs to flush the contents of the output buffer before you can see it.
Note: aside from the sleep() call, Unix provides other ways of accomplishing this. I would take a look at man tail and read about the -f and -n switches.
Edit: didn't realize that tail has a switch (-s) to sleep as well!
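The tail -f idea translates to pure Python fairly directly; a rough sketch (path and interval are placeholders):

```python
import time

def follow(path, interval=2.0):
    """Yield lines as they are appended to path, like `tail -f -s 2`."""
    with open(path, "r") as f:
        f.seek(0, 2)                  # start at the current end of file
        while True:
            line = f.readline()
            if line:
                yield line
            else:
                time.sleep(interval)  # no new data yet; wait and retry
```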

Why is using tail to copy a file so much slower than cp, and using awk twice as fast?

I'm trying to strip out the header line of a large csv file. But the first methods I tried (using tail and awk) work so slowly compared to copying the entire file!
So, just for fun, let's try a few silly but potentially didactically interesting methods for copying files.
Using cp:
$ time cp my_big_file.csv copy_of_my_big_file.csv
real 0m2.208s
user 0m0.002s
sys 0m2.171s
Using tail:
$ time tail -n+1 my_big_file.csv > copy_of_my_big_file.csv
real 0m44.506s
user 0m37.521s
sys 0m3.107s
Using awk:
$ time awk '{if (NR!=0) {print}}' my_big_file.csv > copy_of_my_big_file.csv
real 0m24.951s
user 0m20.336s
sys 0m2.869s
What accounts for such large discrepancies between using tail vs cp vs awk?
cp copies the file block by block without looking at the contents; most of the work happens at the kernel level, which is why its user time is nearly zero.
tail and awk read the input line by line and recreate the file a line at a time. The filesystem still buffers reads and writes, but the data has to cross the kernel/user-space boundary repeatedly, and splitting and re-assembling lines costs CPU time in user space, as the large user times in your measurements show.