Is there a way to extract only specific lines from a text file using Python? - python-3.x

I have a big text file with around 200K lines/records.
I need to extract only the lines that start with CLM. For example, if the file has 100K lines that start with CLM, I should print all 100K of those lines.
Can anyone help me achieve this with a Python script?

There can be multiple ways to achieve this.
You can simply iterate through the lines and match a pattern using the re library.
Solution 1
# Note: for a fixed prefix like CLM, str.startswith is usually faster
# than a regex; re is worth it if the pattern may grow more complex.
import re

pattern = re.compile(r"^CLM")
with open("sample.txt") as f:
    for line in f:
        if pattern.match(line):
            print(line, end="")
If you want, you can also run a shell command from inside the Python script.
Solution 2
There are two popular modules for this: os and subprocess.
The os.system approach is effectively superseded; I would recommend the subprocess module, as below.
Below is the code to print the output on the console:
import subprocess

process = subprocess.Popen(['grep', '-i', '^CLM', 'sample.txt'],
                           stdout=subprocess.PIPE,
                           stderr=subprocess.PIPE,
                           universal_newlines=True)
stdout, stderr = process.communicate()
print(stdout)
In the above, we pass the argument universal_newlines=True so that the output (stdout) comes back as str; without it, it would be of type bytes.
In the grep command above I passed the -i flag to ignore case; if you only want to match CLM and not clm, remove it.
I have used grep to illustrate the use case; you can also use awk, sed, or any other command your requirements call for.
As an add-on, if you want to save the output to a file, say output.txt, you can do it as below:
import subprocess

with open('output.txt', 'w') as f:
    process = subprocess.Popen(['grep', '-i', '^CLM', 'file.txt'], stdout=f)
    process.wait()  # make sure grep has finished before moving on
If your file is extremely large, you can also poll and check the subprocess's execution status. Refer to the link below for more details on that:
Python-Shell-Commands
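For illustration, a minimal polling sketch (the filename and pattern are carried over from the examples above; the half-second interval is an arbitrary choice):
import subprocess
import time

# Redirect stdout to a file so a full pipe buffer cannot block the child.
with open('output.txt', 'w') as f:
    process = subprocess.Popen(['grep', '-i', '^CLM', 'sample.txt'], stdout=f)
    while process.poll() is None:   # poll() returns None while still running
        time.sleep(0.5)             # do other useful work here instead
print('grep exited with status', process.returncode)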

Try:
with open('file.txt') as f:
    for line in f:
        if line.startswith('CLM'):
            print(line.rstrip())
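And if you want to save the matching lines rather than print them (the output filename here is just an assumption):
with open('file.txt') as src, open('clm_lines.txt', 'w') as dst:
    for line in src:
        if line.startswith('CLM'):
            dst.write(line)  # the line already ends with a newline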

Related

Python script does not print output as expected

I have a very simple (test) piece of code which I run either from a Linux shell or in interactive mode, and I get two different behaviours whose cause I cannot figure out.
I have a file generated by a previous Popen call, where each line is a file path. This is the code used to generate the file:
with open('find.txt','w') as f:
    find = subprocess.Popen(["find",".","-name","myfile.out"],stdout=f)
(Incidentally, I was originally trying to build a PIPE, namely feeding the output of this command into a grep command, and since I wasn't successful in any way, I decided to break the problem down and just read the file paths from a file and process them one by one. So maybe there is a common issue blocking me somewhere in this procedure.)
Since in this second step I wasn't even able to open and process the files at the paths contained in each line of find.txt, I just tried to print the file's lines, because they are certainly in there:
with open('find.txt','r') as g:
    for l in g.readlines():
        print(l)
Now, the interesting part:
if I paste the lines above into a Python shell, everything works fine and I get the expected output;
if, on the other hand, I run python test.py, where test.py is the file containing the lines above, no output appears on the shell's stdout.
I've tried sys.stdout.flush() to no avail. I've also inserted some dummy print() statements along the way: everything gets printed but what's after the g.readlines() statement.
Here's the full script I'm trying to make work (a pre-precursor of what I'm actually after, tbh).
#!/usr/bin/env python3
import subprocess
import sys

with open('find.txt','w') as f:
    find = subprocess.Popen(["find",".","-name","myfile.out"],stdout=f)

print('hello')
with open('find.txt','r') as g:
    print('hello?')
    for l in g.readlines():
        print('help me!')
        print(l)
sys.stdout.flush()
output being:
{ancis:>106> python test.py
hello
hello?
{ancis:>106>
EDIT
I've quickly tried the very same lines (but without the call to find, which isn't available) with my Python installation on Windows: it works as expected.
Based on that, I've tried to run the simpler code below:
import sys

print('hello')
with open('find.txt','r') as g:
    print('hello?')
    for l in g.readlines():
        print('help me!')
        print(l)
sys.stdout.flush()
as a script, on Linux. This also works without problems.
This should mean that somehow I'm messing things up with the call to Popen... But what?
This is a race condition.
Your call to
find = subprocess.Popen(["find",".","-name","myfile.out"],stdout=f)
is opening another process and running your find command, which takes a bit of time to fully execute.
Python then continues on and reaches the file-reading portion before the command has finished and the file has been fully written.
Want to test it out?
Add a time.sleep(1) just before the opening of the file.
Full test script:
#!/usr/bin/env python3
import subprocess
import time

with open('find.txt','w') as f:
    find = subprocess.Popen(["find",".","-name","myfile.out"],stdout=f)
time.sleep(1)
with open('find.txt','r') as g:
    for l in g:
        print(l)
To block until the process is complete you can use find.communicate().
With this you can also optionally set a timeout if that's something that you want.
#!/usr/bin/env python3
import subprocess

with open('find.txt','w') as f:
    find = subprocess.Popen(["find",".","-name","myfile.out"],stdout=f)
    find.communicate()
with open('find.txt','r') as g:
    for l in g:
        print(l)
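If you want the optional timeout as well, here is a minimal sketch following the subprocess docs (the 30-second limit is an arbitrary choice):
#!/usr/bin/env python3
import subprocess

with open('find.txt', 'w') as f:
    find = subprocess.Popen(["find", ".", "-name", "myfile.out"], stdout=f)
    try:
        find.communicate(timeout=30)  # raises TimeoutExpired if find takes too long
    except subprocess.TimeoutExpired:
        find.kill()         # cleanup recommended by the subprocess docs
        find.communicate()  # reap the killed process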
Source:
https://docs.python.org/3/library/subprocess.html#subprocess.Popen.communicate

Linux - Redirection of a shell script into a text file

I'm new to Linux and have been trying to solve an assignment, but to no avail.
I have a shell script which prints out the lines of a text file in a certain manner (a line every few seconds):
python << END
import time, random

a = open('/home/ch/pshety/course/fielding_history.txt', 'r')
flag = False
for i in range(1000):
    b = a.readline()
    if i == 402 or flag:
        print(a.readline())
        flag = True
        time.sleep(2)
END
sh th.sh
If I run it without redirecting it anywhere, I get the output on the terminal. However, when I try to redirect it into a new text file, it doesn't do anything - the file remains empty:
sh th.sh > debug.txt
I've tried looking for answers and have stumbled upon a lot of suggestions, including tee, but nothing helps - the file remains empty.
What am I doing wrong?
Try this:
import time, random

a = open('/home/ch/pshety/course/fielding_history.txt', 'r')
for i in range(1000):
    b = a.readline()
    if i >= 402:
        print(b, flush=True)
        time.sleep(2)
Your Python script likely needs to flush the output buffer before you can see anything: when stdout is redirected to a file it is block-buffered rather than line-buffered, so output sits in the buffer until it fills or the program exits.
Note: aside from the sleep() call, Unix provides other ways of accomplishing this. I would take a look at man tail and read about the -f and -n switches.
Edit: I didn't realize that tail has a switch (-s) to set the sleep interval as well!
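For instance, a rough tail-based equivalent driven from Python; this assumes GNU tail, where -n +403 starts output at line 403 and -s 2 sets the polling interval used with -f:
import subprocess

# Follow the file from line 403 onward, checking for new lines every 2 seconds.
subprocess.call(["tail", "-n", "+403", "-s", "2", "-f",
                 "/home/ch/pshety/course/fielding_history.txt"])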

How can I have subprocess or Popen show the command line that it has just run?

Currently, I have a command that looks something like the following:
my_command = Popen([activate_this_python_virtualenv_file,
                    "-m", "my_command", "-l",
                    directory_where_ini_file_for_my_command_is + "/" + my_ini_file_name],
                   stderr=subprocess.STDOUT, stdout=subprocess.PIPE, shell=False,
                   universal_newlines=False, cwd=directory_where_my_module_is)
I have figured out how to access and process the output, deal with subprocess.PIPE, and make subprocess do a few other neat tricks.
However, it seems odd to me that the standard Python documentation for subprocess doesn't mention a way to get the actual command line as subprocess.Popen assembles it from the arguments to the Popen constructor.
For example, something like my_command.get_args()?
Is it just that reconstructing the command line is assumed to be easy enough?
I can put the arguments together on my own, without asking subprocess for the command it ran, but if there's a better way, I'd like to know it.
The Popen.args attribute was added in Python 3.3. According to the docs:
The following attributes are also available:
Popen.args - The args argument as it was passed to Popen: a sequence of program arguments or else a single string.
New in version 3.3.
So sample code would be:
import subprocess

my_args_list = ['echo', 'hello']  # your list of arguments
p = subprocess.Popen(my_args_list)
p.wait()
assert p.args == my_args_list
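If you also want to print the command as one shell-ready string, shlex.join (available since Python 3.8) quotes each argument safely:
import shlex
import subprocess

p = subprocess.Popen(["echo", "hello world"])
p.wait()
print(shlex.join(p.args))  # prints: echo 'hello world'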

subprocess.call() problems using the '>'

I'm having trouble with the call function.
I'm trying to redirect the output of a program to a text file by using '>'.
This is what I've tried:
import subprocess
subprocess.call(["python3", "test.py", ">", "file.txt"])
but it still displays the output on the command prompt and not in the txt file.
There are two approaches to solving this.
Have python handle the redirection:
with open('file.txt', 'w') as f:
    subprocess.call(["python3", "test.py"], stdout=f)
Have the shell handle redirection:
subprocess.call(["python3 test.py >file.txt"], shell=True)
Generally, the first is to be preferred because it avoids the vagaries of the shell.
Lastly, you should look into the possibility of running test.py as an imported module rather than calling it via subprocess. Python is designed so that it is easy to write scripts whose functionality is available both at the command line (python3 test.py) and as a module (import test).
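A common layout for that looks like the sketch below; the main() name is only a convention, and whether test.py actually defines one is an assumption:
# test.py -- usable both as a script and as a module
def main():
    print("doing the real work")

if __name__ == "__main__":
    main()  # runs only when invoked as: python3 test.py

# In another script, the same functionality without a subprocess:
# import test
# test.main()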

In Python, list certain type of file in a directory on Linux

In my directory there are files of a kind that end in .log.
Ordinarily, I use the ls .*log command to list all of them.
However, I want to use Python code to handle this. Here are two approaches I've tried.
First:
import subprocess
ls_al = subprocess.check_output(['ls','.*log'])
but it returns ls: .*log: No such file or directory
Second:
import subprocess
ls_al = subprocess.Popen(['ls','.*log'], stdout=subprocess.PIPE)
ls = ls_al.stdout.read().strip()
but neither of those worked.
Can anyone help with this?
Globbing patterns are expanded by the shell, but you are running the command directly. You'd have to run the command through the shell:
ls_al = subprocess.check_output('ls *.log', shell=True)
where you pass in the full command line to the shell as a string (and use the correct glob syntax).
Demo (using *.py):
>>> subprocess.check_output(['ls', '*.py'])
ls: *.py: No such file or directory
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/mj/Development/Library/buildout.python/parts/opt/lib/python2.7/subprocess.py", line 575, in check_output
raise CalledProcessError(retcode, cmd, output=output)
subprocess.CalledProcessError: Command '['ls', '*.py']' returned non-zero exit status 1
>>> subprocess.check_output('ls *.py', shell=True)
'calc.py\ndAll.py\nexample.py\ninplace.py\nmyTests.py\ntest.py\n'
Note that the idiomatic way in Python is to use os.listdir() with manual filtering, to filter with the fnmatch module, or to use the glob module to list and filter in one step:
>>> import glob
>>> glob.glob('*.py')
['calc.py', 'dAll.py', 'example.py', 'inplace.py', 'myTests.py', 'test.py']
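For completeness, the fnmatch variant mentioned above looks like this (filtering the current directory is an assumption):
>>> import fnmatch, os
>>> fnmatch.filter(os.listdir('.'), '*.py')
['calc.py', 'dAll.py', 'example.py', 'inplace.py', 'myTests.py', 'test.py']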
.*log looks like a regular expression, not a globbing pattern. Did you mean *.log? (You'd need the shell=True argument to have the shell do glob expansion.)
BTW, glob.glob('*.log') is the preferable way if you want a list of file paths.
Rather than running an external command, you could use Python's os module to get the files in the directory, and then use the re module to filter for your log files. I think this would be a more Pythonic approach, and it should work across platforms without modification. Note that the code below assumes your log files all end with '.log'; if you need something else, you'll need to tinker with the regex.
import os
import re
import sys

the_dir = sys.argv[1]
all_files = os.listdir(the_dir)
log_files = []
log_pattern = re.compile(r'.*\.log$')  # raw string; anchored so '.log' must be the suffix
for fn in all_files:
    if re.match(log_pattern, fn):
        log_files.append(fn)
print(log_files)
Why not use glob?
$ ls
abc.txt bar.log def.txt foo.log ghi.txt zoo.log
$ python
>>> import glob
>>> for logfile in glob.glob('*.log'):
...     print(logfile)
...
bar.log
foo.log
zoo.log
>>>
