How to pass string via STDIN into terminal command being executed within python script? - python-3.x

I need to generate a Postgres schema from a dataframe. I found the csvkit library to come closest to matching the datatypes. I can run csvkit and generate a Postgres schema for a csv on my desktop from the terminal with this command from the docs:
csvsql -i postgresql myFile.csv
csvkit docs - https://csvkit.readthedocs.io/en/stable/scripts/csvsql.html
And I can run the terminal command in my script via this code:
import os
a=os.popen("csvsql -i postgresql Desktop/myFile.csv").read()
However, I have a dataframe that I have converted to a csv string, and I need to generate the schema from that string, like so:
csvstr = df.to_csv()
In the docs it says that under positional arguments:
The CSV file(s) to operate on. If omitted, will accept
input on STDIN
How do I pass my variable csvstr into the line of code a=os.popen("csvsql -i postgresql csvstr").read()?
I tried the line of code below but got the error OSError: [Errno 7] Argument list too long: '/bin/sh':
a=os.popen("csvsql -i postgresql {}".format(csvstr)).read()
Thank you in advance

You can't pass such a big string via the command line! You have to save the data to a file and pass its path to csvsql.
csvstr = df.to_csv()
# csvstr is already CSV-formatted text, so write it out as-is
with open('my_cool_df.csv', 'w', newline='') as csvfile:
    csvfile.write(csvstr)
And later:
a = os.popen("csvsql -i postgresql my_cool_df.csv").read()
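Alternatively, since the csvsql docs say it reads from STDIN when the file argument is omitted, you can feed the string to the command through a pipe instead of writing a temporary file. A minimal sketch using subprocess.run (it assumes df already exists; the paths and names are placeholders):
import subprocess

csvstr = df.to_csv()

# Send the CSV text to csvsql on STDIN and capture the generated schema from stdout.
result = subprocess.run(
    ["csvsql", "-i", "postgresql"],
    input=csvstr,          # delivered to the process's STDIN
    capture_output=True,
    text=True,
    check=True,
)
print(result.stdout)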

Related

Is there a way to extract only specific lines from a text file using python

I have a big text file that has around 200K lines/records.
I need to extract only the lines which start with CLM. For example, if the file has 100K lines that start with CLM, I should print all 100K of those lines alone.
Can anyone help me achieve this using a Python script?
There are multiple ways to achieve this.
You can simply iterate through the lines and search for a pattern using the re library.
Solution 1
# Note: compile the pattern once and reuse it for every line
import re

pattern = re.compile(r"^CLM")
for line in open("sample.txt"):
    if pattern.match(line):
        print(line, end="")
If you want, you can also run the bash command inside the Python script.
Solution 2
There are two popular modules to use: os and subprocess.
The older os.popen/os.system approach is discouraged in favour of subprocess, so I would recommend using the subprocess module as below.
Below is the code to print the output on the console:
import subprocess

process = subprocess.Popen(['grep', '-i', '^CLM', 'sample.txt'],
                           stdout=subprocess.PIPE,
                           stderr=subprocess.PIPE,
                           universal_newlines=True)
stdout, stderr = process.communicate()
print(stdout)
In the above, we pass universal_newlines=True so that the output (stdout) is returned as text rather than bytes.
In the grep command I have passed the -i argument to ignore case; if you only want to match CLM and not clm, remove it.
I have used the grep command to illustrate the use case; you can also use awk, sed, or any other command as per your requirement.
As an add-on, if you want to save the output to some file, let's say output.txt, you can achieve this as below:
import subprocess

with open('output.txt', 'w') as f:
    process = subprocess.Popen(['grep', '-i', '^CLM', 'file.txt'], stdout=f)
If your file is extremely large, you can also poll the subprocess and check its execution status; refer to the link below for more details, and see the sketch after it.
Python-Shell-Commands
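As a rough illustration of that polling idea (a minimal sketch, not part of the original answer; it assumes the same grep invocation as above and an arbitrary sleep interval):
import subprocess
import time

# Write matches straight to a file so the pipe cannot fill up while we poll.
with open('output.txt', 'w') as f:
    process = subprocess.Popen(['grep', '-i', '^CLM', 'sample.txt'], stdout=f)

    # poll() returns None while the process is still running,
    # and its exit code once it has finished.
    while process.poll() is None:
        time.sleep(0.5)

print('grep finished with exit code', process.returncode)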
Try:
with open('file.txt') as f:
    for line in f:
        if line.startswith('CLM'):
            print(line.rstrip())

Execute a subprocess that takes an input file and write the output to a file

I am using a third-party C++ program to generate intermediate results for the python program that I am working on. The terminal command that I use looks like follows, and it works fine.
./ukb/src/ukb_wsd --ppr_w2w -K ukb/scripts/wn30g.bin -D ukb/scripts/wn30_dict.txt ../data/glass_ukb_input2.txt > ../data/glass_ukb_output2w2.txt
If I break it down into smaller pieces:
./ukb/src/ukb_wsd - executable program
--ppr_w2w - one of the options/switches
-K ukb/scripts/wn30g.bin - parameter K indicates that the next item is a file (network file)
-D ukb/scripts/wn30_dict.txt - parameter D indicates that the next item is a file (dictionary file)
../data/glass_ukb_input2.txt - input file
> - shell command to write the output to a file
../data/glass_ukb_output2w2.txt - output file
The above works fine for one instance. I am trying to do this for around 70000 items (input files), so I found a way using the subprocess module in Python. The body of the Python function that I created looks like this:
with open('../data/glass_ukb_input2.txt', 'r') as input, open('../data/glass_ukb_output2w2w_subproc.txt', 'w') as output:
    subprocess.run(['./ukb/src/ukb_wsd', '--ppr_w2w', '-K', 'ukb/scripts/wn30g.bin', '-D', 'ukb/scripts/wn30_dict.txt'],
                   stdin=input,
                   stdout=output)
When I execute the function, it gives an error as follows (this error is no longer there; see the EDIT below):
...
STDOUT = subprocess.STDOUT
AttributeError: module 'subprocess' has no attribute 'STDOUT'
Can anyone shed some light on solving this problem?
EDIT
The error was due to a file named subprocess.py in the source dir, which masked Python's subprocess module. Once it was removed, the error went away.
But the program could not identify the input file given via stdin. I am thinking it has to do with having 3 input files. Is there a way to provide more than one input file?
EDIT 2
This problem is now solved with the current approach:
subprocess.run('./ukb/src/ukb_wsd --ppr_w2w -K ukb/scripts/wn30g.bin -D ukb/scripts/wn30_dict.txt ../data/glass_ukb_input2.txt > ../data/glass_ukb_output2w2w_subproc.txt',shell=True)
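For completeness, the same call can be made without shell=True by passing the input file as a positional argument (the way the program expects it) and redirecting stdout to a Python file handle. A minimal sketch, assuming the same paths as above:
import subprocess

# The input file goes on the command line exactly as in the original terminal command;
# only the "> output" shell redirection is replaced by a Python file handle.
with open('../data/glass_ukb_output2w2w_subproc.txt', 'w') as output:
    subprocess.run(['./ukb/src/ukb_wsd', '--ppr_w2w',
                    '-K', 'ukb/scripts/wn30g.bin',
                    '-D', 'ukb/scripts/wn30_dict.txt',
                    '../data/glass_ukb_input2.txt'],
                   stdout=output,
                   check=True)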

Converting pcapng file into csv file

I'm using the Ubuntu terminal and I'm running:
tshark -r file.pcapng -T fields -e 6lowpan.src -e frame.proto >file.csv
I want to convert a .pcapng file into a .csv file. But I'm not able to retrieve the 6LoWPAN source address using 6lowpan.src, and I also can't get the protocol info; the csv file I get is empty, without any output. I also want the output data in text format.
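For reference, the same tshark invocation can also be driven from a Python script. A minimal sketch using subprocess (not from the original post; it reuses the exact fields from the command above and replaces the shell redirection with a file handle):
import subprocess

# Run tshark and write the selected fields to file.csv,
# mirroring the "> file.csv" redirection in the shell command.
with open("file.csv", "w") as out:
    subprocess.run(
        ["tshark", "-r", "file.pcapng",
         "-T", "fields",
         "-e", "6lowpan.src",
         "-e", "frame.proto"],
        stdout=out,
        check=True,
    )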

How to call and pipe multiple postgres commands from python

In order to copy a file-like object to a postgres database, I take the following steps:
~$ sudo psql -U postgres
password for root:
password for user postgres:
postgres=# \c migration_v0
You are now connected to database "migration_v0" as user "postgres".
migration_v0=# cat file.csv | \copy table1 from stdin csv
I want to take the exact same steps, but from within Python and want to pass a StringIO buffer instead of a literal file. My first attempt consisted of the following steps:
# test.py
import subprocess

fmt = r"\copy table1 FROM stdin csv"
sql = fmt.format(string_io)
psql = ['psql', '-U', 'postgres', '-c', sql]
output = subprocess.check_output(psql)
print(output)
The command is executed (a prompt pops up to type the password for the user postgres) but I get the following error:
ERROR: relation "table1" does not exist
This happens because I am currently trying to execute \copy on the default database postgres instead of migration_v0. Thus, I want to include both commands in the subprocess call (\c migration_v0 and \copy ...), but I don't know how to do this, since psql's -c flag takes only a single command.
I looked up a workaround and came across this command-line example:
\c migration_v0 \\ \copy ... | psql -U postgres
but I have no idea how to port this to Python code.
Any suggestions on how I can pull this off?
Edit 1
I realized the -d flag also enables selecting the database, so now I don't need to run multiple commands. My code now looks like this:
p = subprocess.Popen([
    'psql', '-U', 'postgres',
    '-d', 'migration_v0',
    '-c', '\copy table1 FROM stdin csv'],
    shell=False,
    stdin=string_io)
but I get the following error:
io.UnsupportedOperation: fileno
Apparently StringIO doesn't implement fileno. At this point I'm wondering if it's even possible to achieve what I want through a subprocess call.
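One way around the fileno limitation (a minimal sketch, not from the original post, assuming the psql invocation from Edit 1) is to give the child process a real pipe and write the buffer's contents into it with communicate():
import subprocess
from io import StringIO

string_io = StringIO("1,foo\n2,bar\n")  # placeholder CSV data

p = subprocess.Popen(
    ['psql', '-U', 'postgres',
     '-d', 'migration_v0',
     '-c', r'\copy table1 FROM stdin csv'],
    stdin=subprocess.PIPE,
    universal_newlines=True)

# communicate() writes the string to psql's stdin and waits for the process,
# so \copy reads the CSV rows exactly as if they were piped in from the shell.
p.communicate(string_io.getvalue())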

Elasticsearch-py bulk helper equivalent of curl with file

I am looking to replicate the following command using the elasticsearch python client (and without using subprocess):
curl -s -XPOST "localhost:9200/index_name/_bulk" --data-binary @file
I have attempted to use the bulk helper without any luck:
es = Elasticsearch()

with open("file") as fp:
    bulk(
        client=es,
        index="index_name",
        actions=fp
    )
This results in type is missing errors.
The file, which is processed just fine when using curl, looks a bit like this:
{"index":{"_type":"someType","_id":"123"}}
{"field1":"data","field2":"data",...}
{"index":{"_type":"someType","_id":"456"}}
{"field1":"data","field2":"data",...}
...
Please note, I'd rather not change the contents of the file since I have around 21000 files with the same format.
The actions parameter expects an iterable of actions rather than a raw file handle, so you need to iterate over the lines of your file yourself, like this:
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

es = Elasticsearch()

def readbulk():
    for line in open("file"):
        yield line

bulk(
    client=es,
    index="index_name",
    actions=readbulk()
)
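If passing the raw lines through still produces the type is missing error, another option (a hypothetical sketch, not from the original answer, assuming the two-line action/source format shown in the question) is to pair each metadata line with the following source line and hand the bulk helper explicit action dicts:
import json
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

es = Elasticsearch()

def read_actions(path):
    # Each action in the file is a metadata line followed by a source line;
    # turn every pair into the dict form the bulk helper understands.
    with open(path) as fp:
        for meta_line in fp:
            meta = json.loads(meta_line)["index"]
            source = json.loads(next(fp))
            yield {
                "_index": "index_name",
                "_type": meta.get("_type"),
                "_id": meta.get("_id"),
                "_source": source,
            }

bulk(client=es, actions=read_actions("file"))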
