Does Beam support custom delimiters when reading from a text file - python-3.x

I have a file in LDIF format, in which entries are delimited by an empty line:
dn: uid=12345,ab=users,xy=random
phone: 111
address: someaddress
email: true
username:abc
password:abc

dn: uid=12345,ab=users,xy=random
objectClass: inetOrgPerson
objectClass: top
phone: 111
address: someaddress
email: true
username:abcd
password:abcd
I want to write something like
data = (p
    | 'Read File From GCS' >> beam.io.textio.ReadFromText('gs://my-ldif.ldiff', delimiter='\r\n')
)
But it looks like there is no option to specify a delimiter in Python. The official docs say the following, but do not mention how to set a delimiter:
Parses a text file as newline-delimited elements, by default assuming UTF-8 encoding. Supports newline delimiters \n and \r\n.
I see this is available in Java. Can anyone say whether Python supports a custom delimiter or not?
PAssert.that(p.apply(TextIO.read().from(filename).withDelimiter(new byte[] {'|', '*'})))
    .containsInAnyOrder(
        "To be, or not to be: that |is the question: To be, or not to be: "
            + "that *is the question: Whether 'tis nobler in the mind to suffer ",
        "The slings and arrows of outrageous fortune,|");
p.run();

You are correct: it is not yet possible in Python.
I found this open feature request ticket: https://issues.apache.org/jira/browse/BEAM-12730. This would be a great starter task for someone interested in contributing to Beam!
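Until that ticket is resolved, one workaround is to read the file contents yourself and split them into records. Below is a minimal plain-Python sketch of just the splitting step, assuming entries are separated by one or more blank lines. The function name `split_ldif_entries` is made up for illustration and is not part of Beam.

```python
def split_ldif_entries(text):
    """Split LDIF-style text into entries separated by blank lines."""
    entries = []
    current = []
    for line in text.splitlines():  # handles both \n and \r\n
        if line.strip():
            current.append(line)
        elif current:
            # A blank line closes the entry collected so far.
            entries.append(current)
            current = []
    if current:
        # The last entry may not be followed by a blank line.
        entries.append(current)
    return entries


sample = "dn: uid=1,ab=users\nphone: 111\n\ndn: uid=2,ab=users\r\nphone: 222\n"
for entry in split_ldif_entries(sample):
    print(entry)
```

Inside a pipeline, this logic would have to run on whole-file contents rather than on the individual lines that ReadFromText emits.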

Related

How to delete last n characters of .txt file without having to re-write all the other characters [duplicate]

After looking all over the Internet, I've come to this.
Let's say I have already made a text file that reads:
Hello World
Well, I want to remove the very last character (in this case d) from this text file.
So now the text file should look like this: Hello Worl
But I have no idea how to do this.
All I want, more or less, is a single backspace function for text files on my HDD.
This needs to work on Linux as that's what I'm using.
Use fileobject.seek() to seek 1 position from the end, then use file.truncate() to remove the remainder of the file:
import os
with open(filename, 'rb+') as filehandle:
    filehandle.seek(-1, os.SEEK_END)
    filehandle.truncate()
This works fine for single-byte encodings. If you have a multi-byte encoding (such as UTF-16 or UTF-32) you need to seek back enough bytes from the end to account for a single codepoint.
For variable-byte encodings, it depends on the codec if you can use this technique at all. For UTF-8, you need to find the first byte (from the end) where bytevalue & 0xC0 != 0x80 is true, and truncate from that point on. That ensures you don't truncate in the middle of a multi-byte UTF-8 codepoint:
import os
with open(filename, 'rb+') as filehandle:
    # move to end, then scan backward until a non-continuation byte is found
    filehandle.seek(-1, os.SEEK_END)
    # read(1) returns a bytes object in Python 3; [0] gives the integer value
    while filehandle.read(1)[0] & 0xC0 == 0x80:
        # we just read 1 byte, which moved the file position forward,
        # skip back 2 bytes to move to the byte before the current.
        filehandle.seek(-2, os.SEEK_CUR)
    # last read byte is our truncation point, move back to it.
    filehandle.seek(-1, os.SEEK_CUR)
    filehandle.truncate()
Note that UTF-8 is a superset of ASCII, so the above works for ASCII-encoded files too.
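The continuation-byte test can be tried on an in-memory bytes object before touching a real file. This is only a sketch: the helper name `last_char_start` is hypothetical, and it assumes the input is valid, non-empty UTF-8.

```python
def last_char_start(data):
    """Return the index where the last UTF-8 character in `data` begins."""
    if not data:
        raise ValueError("data must not be empty")
    pos = len(data) - 1
    # Continuation bytes look like 0b10xxxxxx; step back over them.
    while pos > 0 and data[pos] & 0xC0 == 0x80:
        pos -= 1
    return pos


encoded = "Hello Wörld€".encode("utf-8")  # the euro sign takes three bytes
cut = last_char_start(encoded)
print(encoded[:cut].decode("utf-8"))
```

Slicing at `cut` drops all three bytes of the final character, so the remainder still decodes cleanly.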
Martijn's accepted answer is simple and mostly works, but it does not account for text files with:
UTF-8 encoding containing non-English characters (UTF-8 is the default encoding for text files in Python 3)
one newline character at the end of the file (which is the default in Linux editors like vim or gedit)
If the text file contains non-English characters, neither of the answers provided so far would work.
What follows is an example that solves both problems and also allows removing more than one character from the end of the file:
import os
def truncate_utf8_chars(filename, count, ignore_newlines=True):
    """
    Truncates last `count` characters of a text file encoded in UTF-8.
    :param filename: The path to the text file to read
    :param count: Number of UTF-8 characters to remove from the end of the file
    :param ignore_newlines: Set to true, if the newline character at the end of the file should be ignored
    """
    with open(filename, 'rb+') as f:
        last_char = None
        size = os.fstat(f.fileno()).st_size
        offset = 1
        chars = 0
        while offset <= size:
            f.seek(-offset, os.SEEK_END)
            b = ord(f.read(1))
            if ignore_newlines:
                if b == 0x0D or b == 0x0A:
                    offset += 1
                    continue
            if b & 0b10000000 == 0 or b & 0b11000000 == 0b11000000:
                # This is the first byte of a UTF8 character
                chars += 1
                if chars == count:
                    # When `count` number of characters have been found, move current position back
                    # with one byte (to include the byte just checked) and truncate the file
                    f.seek(-1, os.SEEK_CUR)
                    f.truncate()
                    return
            offset += 1
How it works:
Reads only the last few bytes of a UTF-8 encoded text file in binary mode
Iterates the bytes backwards, looking for the start of a UTF-8 character
Once `count` characters (not counting newlines) have been found, truncates the file at the start of the last character found
Sample text file - bg.txt:
Здравей свят
How to use:
filename = 'bg.txt'
print('Before truncate:', open(filename).read())
truncate_utf8_chars(filename, 1)
print('After truncate:', open(filename).read())
Outputs:
Before truncate: Здравей свят
After truncate: Здравей свя
This works with both UTF-8 and ASCII encoded files.
If you are working in text mode rather than binary mode (for example with 'r+'; note that opening with 'w' empties the file), I can suggest the following, with the cursor at the end of the file.
f.seek(f.tell() - 1, os.SEEK_SET)
f.truncate()
In the code above, f.seek() is given a position computed from f.tell() because, without 'b' access, you cannot seek backwards relative to the end. That sets the cursor to the start of the last character, and f.truncate() then deletes it; writing an empty string would not remove anything.
with open(urfile, 'rb+') as f:
    f.seek(0,2) # end of file
    size=f.tell() # the size...
    f.truncate(size-1) # truncate at that size - how ever many characters
Be sure to use binary mode on Windows, since Unix line endings may otherwise return an illegal or incorrect character count.
with open('file.txt', 'r+') as f: # 'r+' keeps the contents; 'w' would empty the file on open
    f.seek(0, 2) # seek to end of file; f.seek(0, os.SEEK_END) is legal
    f.seek(f.tell() - 2, 0) # seek to the second last char of file; f.seek(f.tell()-2, os.SEEK_SET) is legal
    f.truncate()
How far to step back depends on what the last character of the file is: it could be a newline (\n) or anything else.
This may not be optimal, but if the above approaches don't work out, you could do:
with open('myfile.txt', 'r') as file:
    data = file.read()[:-1]
with open('myfile.txt', 'w') as file:
    file.write(data)
The code first opens the file, and then copies its content (with the exception of the last character) to the string data. Afterwards, the file is truncated to zero length (i.e. emptied), and the content of data is saved to the file, with the same name.
This is basically the same as vins ms's answer, except that it doesn't use the os package and that it uses the safer 'with open' syntax. This may not be recommended if the text file is huge. (I wrote this since none of the above approaches worked out too well for me in Python 3.8.)
Here is a dirty way (erase and recreate).
I don't advise using this, but it is possible to do it like this:
import os
x = open("file").read()
os.remove("file")
open("file", "w").write(x[:-1])
On a Linux system (or Cygwin under Windows), you can use the standard truncate command. You can reduce or increase the size of your file with this command.
In order to reduce a file by 1G the command would be truncate -s -1G filename. In the following example I reduce a file called update.iso by 1G.
Note that this operation took less than five seconds.
chris@SR-ENG-P18 /cygdrive/c/Projects
$ stat update.iso
File: update.iso
Size: 30802968576 Blocks: 30081024 IO Block: 65536 regular file
Device: ee6ddbceh/4000177102d Inode: 19421773395035112 Links: 1
Access: (0664/-rw-rw-r--) Uid: (1052727/ chris) Gid: (1049089/Domain Users)
Access: 2020-06-12 07:39:00.572940600 -0400
Modify: 2020-06-12 07:39:00.572940600 -0400
Change: 2020-06-12 07:39:00.572940600 -0400
Birth: 2020-06-11 13:31:21.170568000 -0400
chris@SR-ENG-P18 /cygdrive/c/Projects
$ truncate -s -1G update.iso
chris@SR-ENG-P18 /cygdrive/c/Projects
$ stat update.iso
File: update.iso
Size: 29729226752 Blocks: 29032448 IO Block: 65536 regular file
Device: ee6ddbceh/4000177102d Inode: 19421773395035112 Links: 1
Access: (0664/-rw-rw-r--) Uid: (1052727/ chris) Gid: (1049089/Domain Users)
Access: 2020-06-12 07:42:38.335782800 -0400
Modify: 2020-06-12 07:42:38.335782800 -0400
Change: 2020-06-12 07:42:38.335782800 -0400
Birth: 2020-06-11 13:31:21.170568000 -0400
The stat command tells you lots of info about a file including its size.
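The same shrink-by-N-bytes operation can be sketched in Python with os.truncate, which resizes the file in place without rewriting it. The helper name `shrink_file` and the temporary demo file are assumptions made for illustration only.

```python
import os
import tempfile


def shrink_file(path, nbytes):
    """Remove the last `nbytes` bytes of the file at `path` in place."""
    size = os.path.getsize(path)
    # Never truncate to a negative length.
    os.truncate(path, max(size - nbytes, 0))


# Demonstration on a throwaway file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"Hello World")
demo_path = tmp.name

shrink_file(demo_path, 1)
with open(demo_path, "rb") as f:
    result = f.read()
print(result)
os.remove(demo_path)
```

Like the truncate command, this counts bytes and not characters, so it is only safe for single-byte encodings unless the cut point is chosen carefully.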

How to use python to split lines of a txt file into a list with variables

Here are two example lines from the sample.txt file:
2021-06-12 16:40:49,225 INFO:URL: http://localhost:8000/page
2021-06-14 16:56:46,488 INFO:URL: http://localhost:8000/gpage
Result for each line:
['2021-06-14','16:56:46','488','INFO','URL','http://localhost:8000/gpage']
How can we get this result without using regular expressions?
First, replace the text "INFO:URL:" with "INFO URL ", then replace "," with a space and split the text on whitespace:
a="2021-06-14 16:56:46,488 INFO:URL: http://localhost:8000/gpage"
aa=a.replace("INFO:URL:","INFO URL ").replace(","," ").split()
print(aa)
It gives:
['2021-06-14', '16:56:46', '488', 'INFO', 'URL', 'http://localhost:8000/gpage']
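One caveat with the replace-based approach is that it would also split on any comma inside the URL. A slightly stricter sketch, assuming every line has exactly four whitespace-separated fields in the layout shown in the question (the function name `parse_log_line` is hypothetical):

```python
def parse_log_line(line):
    """Split one log line into [date, time, millis, level, tag, url]."""
    date, clock, label, url = line.split()
    time_part, millis = clock.split(",")
    # "INFO:URL:" -> ["INFO", "URL"]
    level, tag = label.rstrip(":").split(":")
    return [date, time_part, millis, level, tag, url]


print(parse_log_line("2021-06-14 16:56:46,488 INFO:URL: http://localhost:8000/gpage"))
```

A line that does not match this layout raises ValueError on unpacking, which makes malformed input visible instead of silently producing a list of the wrong length.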

lists and reading from file comma separated elements

Here is my code
#course registration
list_courses=[]
for line in open("courses.txt",'r').readlines():
    list_courses.append(line.strip())
print ("Gathering course information from file: \n",list_courses)
close("courses.txt")
list_student=[]
for line in open("students.txt",'r').readlines():
    list_student.append(line.strip())
print("Here is student info: \n",list_student)
close("students.txt")
This gives me errors when I try to close the files. How do I close them? I am basically reading the contents of each file and storing them in a list. Later on I want to close the open files, and that is where I get the error.
I edited the code as per suggestions below.
The new code is
list_courses=[]
with open("courses.txt",'r') as myfile1:
    list_courses=myfile1.readlines()
list_courses=[x.strip() for x in list_courses]
print ("Gathering course information from file: \n",list_courses)
list_student=[]
with open("students.txt",'r') as myfile1:
    list_student=myfile1.readlines()
list_student=[x.strip() for x in list_student]
print("Here is student info: \n",list_student)
The information in courses.txt is
cs101,C programming
cs102,Digital logic and design
cs103,Electrical engineering
cs231,IT networks
cs232,IT Workshop
cs233,IT programming
cs301,Compilers and automata
cs302,Operating Systems
cs303,Networks
cs401,Game Theory
cs402,Systems Programming
cs403,Automata
ec101,Digitization
ec102,Analog cicuit design
ec103,IP Telephony
ec201,Wireless Network
ec202,Microwave engineering
ec203,Antenna
ec301,Maths2
ec302,Theory of Circuits
ec303,PCB design
ec401,PLC programming
ec402,Scada
ec403,VLSI
When I run the code I get output
Gathering course information from file:
['cs101,C programming', 'cs102,Digital logic and design', 'cs103,Electrical engineering', 'cs231,IT networks', 'cs232,IT Workshop', 'cs233,IT programming', 'cs301,Compilers and automata', 'cs302,Operating Systems', 'cs303,Networks', 'cs401,Game Theory', 'cs402,Systems Programming', 'cs403,Automata', 'ec101,Digitization', 'ec102,Analog cicuit design', 'ec103,IP Telephony', 'ec201,Wireless Network', 'ec202,Microwave engineering', 'ec203,Antenna', 'ec301,Maths2', 'ec302,Theory of Circuits', 'ec303,PCB design', 'ec401,PLC programming', 'ec402,Scada', 'ec403,VLSI']
Instead, what I want is for cs101 from the first line of courses.txt to go into list_courses[0] and for list_courses[1] to hold C programming, i.e.
list_courses[0]=cs101
list_courses[1]=C programming
The methods I tried read each line and store the whole line as a single list element, but a comma separates two values on each line of courses.txt, and the comma-separated values should be separate list elements.
This will work for you:
students=[]
subject=[]
with open("courses.txt","r") as f:  # the sample data and the output below come from courses.txt
    for line in f.readlines():
        eachl=line.split(",")
        students.append(eachl[0])
        subject.append(eachl[1][:-1])  # [:-1] drops the trailing newline
You will get two lists: one (named students here) containing the course codes, and another one with the subjects.
students list will look like:
['cs101', 'cs102', 'cs103', 'cs231', 'cs232', 'cs233', 'cs301', 'cs302', 'cs303', 'cs401', 'cs402', 'cs403', 'ec101', 'ec102', 'ec103', 'ec201', 'ec202', 'ec203', 'ec301', 'ec302', 'ec303', 'ec401', 'ec402', 'ec403']
subjects list will look like:
['C programming', 'Digital logic and design', 'Electrical engineering', 'IT networks', 'IT Workshop', 'IT programming', 'Compilers and automata', 'Operating Systems', 'Networks', 'Game Theory', 'Systems Programming', 'Automata', 'Digitization', 'Analog cicuit design', 'IP Telephony', 'Wireless Network', 'Microwave engineering', 'Antenna', 'Maths2', 'Theory of Circuits', 'PCB design', 'PLC programming', 'Scada', 'VLSI']
Why don't you try
students = []
courses = []
with open("courses.txt", "r") as f:
    for line in f.readlines():
        a, b = line.split(",")
        students.append(a)
        courses.append(b[:-1])
# no f.close() needed: the with block closes the file
This will produce two lists, students and courses.
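As an alternative to splitting by hand, the standard csv module handles the comma separation and the trailing newline. This is a sketch only: an in-memory io.StringIO stands in for the real courses.txt, and the list names `codes`, `titles` and `flat` are chosen for illustration.

```python
import csv
import io

# Stand-in for open("courses.txt", newline=""); two rows of the sample data.
sample = io.StringIO("cs101,C programming\ncs102,Digital logic and design\n")

codes = []
titles = []
flat = []
for row in csv.reader(sample):
    if not row:
        continue  # skip blank lines
    code, title = row
    codes.append(code)
    titles.append(title)
    flat.extend(row)  # code and title as consecutive elements

print(codes)
print(titles)
print(flat)
```

The `flat` list matches the layout asked for in the question, with cs101 at index 0 and C programming at index 1.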

Bash How To Select Multiple Lines from ldif-type File based on dn

I am wondering how best to parse an ldif file (and ldif-like files) so that I can import each DN entry and its associated attributes into variables, without crossing over into other DNs and their attributes, as everything is in a single file.
How can this be done?
Thanks for helping a noob out.
Edit:
An example ldif-like file looks something like this:
dn: cn=admins,cn=groups,cn=accounts,dc=mydom,dc=com
Group name: admins
Description: Account administrators group
GID: 721800000
Member users: admin, user2, user1
ipauniqueid: 2dafa3a2-b903-11e2-8a28-525400a60ac3
objectclass: top, groupofnames, posixgroup, ipausergroup, ipaobject, nestedGroup
dn: cn=editors,cn=groups,cn=accounts,dc=mydom,dc=com
Group name: editors
Description: Limited admins who can edit other users
GID: 721800002
Member users: user1
ipauniqueid: 2dc4446a-b903-11e2-a2fa-525400a60ac3
objectclass: top, groupofnames, posixgroup, ipausergroup, ipaobject, nestedGroup
dn: cn=employees,cn=groups,cn=accounts,dc=mydom,dc=com
Group name: employees
Description: Default group for all Qrios employees
GID: 721800006
Member users: user2, user3
ipauniqueid: 134ae6e0-b910-11e2-a7f3-525400a60ac3
objectclass: top, groupofnames, nestedgroup, ipausergroup, ipaobject, posixgroup
I would like to be able to select sections of the file, based on the first keyword (dn), and import the values of the lines into variables so I can make use of them, and then move to the next section.
sina, I am working with the LDIF format quite a lot and bash just does not cut it.
I would strongly advise you to start using perl or python with their respective LDAP modules:
http://ldap.perl.org/
http://www.python-ldap.org/
Just a small example of perl with its LDAP module:
use Net::LDAP::LDIF;
# Read in the LDIF file specified in $input_file
my $ldif = Net::LDAP::LDIF->new($input_file, "r", onerror => 'warn', change => 1);
#
# Process the LDIF input file
#
while (my $entry = $ldif->read())
{
    # Get the Member attribute
    my @mbr = $entry->get_value('Member');
    foreach my $value (@mbr)
    {
        # Here you have a 'Member' value in $value, do what you want
    }
}
As you can see, this makes things quite simple. Also, these modules take into account all the different conventions within LDIF like abbreviated lines, changetypes, and so on.
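For the LDIF-like sample in the question, which is not strict LDIF, the grouping step itself can be sketched in plain Python. This is an illustration only, not a replacement for the LDAP modules: the function name `group_by_dn` is hypothetical, and the sketch ignores folded lines, base64 values and repeated attributes (a repeated key overwrites the earlier value).

```python
def group_by_dn(lines):
    """Group "key: value" lines into one dict per entry, starting at each dn."""
    entries = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        # Split on the first colon only, so values may contain colons.
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key.lower() == "dn":
            # A new dn line starts a new entry.
            entries.append({"dn": value})
        elif entries:
            entries[-1][key] = value
    return entries


sample = [
    "dn: cn=admins,cn=groups,cn=accounts,dc=mydom,dc=com",
    "Group name: admins",
    "GID: 721800000",
    "dn: cn=editors,cn=groups,cn=accounts,dc=mydom,dc=com",
    "Group name: editors",
    "GID: 721800002",
]
for entry in group_by_dn(sample):
    print(entry["dn"], entry["Group name"], entry["GID"])
```

Each dict holds the attributes of exactly one DN, so values never cross over into a neighbouring entry.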

Search in directory of files based on keywords from another file

Perl Newbie here and looking for some help.
I have a directory of files and a "keywords" file which has the attributes to search for and the attribute type.
For example:
Keywords.txt
Attribute1 boolean
Attribute2 boolean
Attribute3 search_and_extract
Attribute4 chunk
For each file in the directory, I have to:
lookup the keywords.txt
search based on Attribute type
something like the below.
IF attribute_type = boolean THEN
    search for attribute;
    set found = Y if attribute found;
ELSIF attribute_type = search_and_extract THEN
    extract string where attribute is Found
ELSIF attribute_type = chunk THEN
    extract the complete chunk of paragraph where attribute is found.
This is what I have so far and I'm sure there is a more efficient way to do this.
I'm hoping someone can guide me in the right direction to do the above.
Thanks & regards,
SiMa
# Reads attributes from config file
# First set boolean attributes. IF keyword is found in text,
# variable flag is set to Y else N
# End Code: For each text file in directory loop.
# Run the below for each document.
use strict;
use warnings;
# open Doc
open(DOC_FILE,'Final_CLP.txt');
while(<DOC_FILE>) {
    chomp;
    # open the file
    open(FILE,'attribute_config.txt');
    while (<FILE>) {
        chomp;
        # declared with "my" so the script compiles under "use strict"
        my ($attribute,$attribute_type) = split("\t");
        my $is_boolean = ($attribute_type eq "boolean") ? "Y" : "N";
        # For each boolean attribute, check if the keyword exists
        # in the file and return Y or N
        if ($is_boolean eq "Y") {
            print "Yes\n";
            # search for keyword in doc and assign values
        }
        print "Attribute: $attribute\n";
        print "Attribute_Type: $attribute_type\n";
        print "is_boolean: $is_boolean\n";
        print "-----------\n";
    }
    close(FILE);
}
close(DOC_FILE);
exit;
It is a good idea to start your specs/question with a story ("I have a ..."). But
such a story - whether true or made up, because you can't disclose the truth -
should give
a vivid picture of the situation/problem/task
the reason(s) why all the work must be done
definitions for uncommon(ly used) terms
So I'd start with: I'm working in a prison and have to scan the emails
of the inmates for
names (like "Al Capone") mentioned anywhere in the text; the director
wants to read those mails in toto
order lines (like "weapon: AK 4711 quantity: 14"); the ordnance
officer wants those info to calculate the amount of ammunition and
rack space needed
paragraphs containing 'family'-keywords like "wife", "child", ...;
the parson wants to prepare her sermons efficiently
Taken by itself, each of the terms "keyword" (~running text) and
"attribute" (~structured text) may be 'clear', but if both are applied
to "the X I have to search for", things get mushy. Instead of general ("chunk")
and technical ("string") terms, you should use 'real-world' (line) and
specific (paragraph) words. Samples of your input:
From: Robin Hood
To: Scarface
Hi Scarface,
tell Al Capone to send a car to the prison gate on sunday.
For the riot we need:
weapon: AK 4711 quantity: 14
knife: Bowie quantity: 8
Tell my wife in Folsom to send some money to my son in
Alcatraz.
Regards
Robin
and your expected output:
--- Robin.txt ----
keywords:
Al Capone: Yes
Billy the Kid: No
Scarface: Yes
order lines:
knife:
knife: Bowie quantity: 8
machine gun:
stinger rocket:
weapon:
weapon: AK 4711 quantity: 14
social relations paragaphs:
Tell my wife in Folsom to send some money to my son in
Alcatraz.
Pseudo code should begin at the top level. If you start with
for each file in folder
    load search list
    process current file('s content) using search list
it's obvious that
load search list
for each file in folder
    process current file using search list
would be much better.
Based on this story, examples, and top level plan, I would try to come
up with proof of concept code for a simplified version of the "process
current file('s content) using search list" task:
given file/text to search in and list of keywords/attributes
print file name
print "keywords:"
for each boolean item
    print boolean item text
    if found anywhere in whole text
        print "Yes"
    else
        print "No"
print "order line:"
for each line item
    print line item text
    if found anywhere in whole text
        print whole line
print "social relations paragaphs:"
for each paragraph
    for each social relation item
        if found
            print paragraph
            no need to check for other items
first implementation attempt:
use Modern::Perl;
#use English qw(-no_match_vars);
use English;
exit step_00();
sub step_00 {
    # given file/text to search in
    # (paragraphs are separated by blank lines; the split below relies on that)
    my $whole_text = <<"EOT";
From: Robin Hood
To: Scarface
Hi Scarface,

tell Al Capone to send a car to the prison gate on sunday.

For the riot we need:
weapon: AK 4711 quantity: 14
knife: Bowie quantity: 8

Tell my wife in Folsom to send some money to my son in
Alcatraz.

Regards
Robin
EOT
    # print file name
    say "--- Robin.txt ---";
    # print "keywords:"
    say "keywords:";
    # for each boolean item
    for my $bi ("Al Capone", "Billy the Kid", "Scarface") {
        # print boolean item text
        printf " %s: ", $bi;
        # if found anywhere in whole text
        if ($whole_text =~ /$bi/) {
            # print "Yes"
            say "Yes";
        # else
        } else {
            # print "No"
            say "No";
        }
    }
    # print "order line:"
    say "order lines:";
    # for each line item
    for my $li ("knife", "machine gun", "stinger rocket", "weapon") {
        # print line item text
        # if found anywhere in whole text
        if ($whole_text =~ /^$li.*$/m) {
            # print whole line
            say " ", $MATCH;
        }
    }
    # print "social relations paragaphs:"
    say "social relations paragaphs:";
    # for each paragraph
    for my $para (split /\n\n/, $whole_text) {
        # for each social relation item
        for my $sr ("wife", "son", "husband") {
            # if found
            if ($para =~ /$sr/) {
            ## if ($para =~ /\b$sr\b/) {
                # print paragraph
                say $para;
                # no need to check for other items
                last;
            }
        }
    }
    return 0;
}
output:
perl 16953439.pl
--- Robin.txt ---
keywords:
Al Capone: Yes
Billy the Kid: No
Scarface: Yes
order lines:
knife: Bowie quantity: 8
weapon: AK 4711 quantity: 14
social relations paragaphs:
tell Al Capone to send a car to the prison gate on sunday.
Tell my wife in Folsom to send some money to my son in
Alcatraz.
Such (premature) code helps you to
clarify your specs (Should not-found keywords go into the output?
Is your search list really flat or should it be structured/grouped?)
check your assumptions about how to do things (Should the order line
search be done on the array of lines of the whole text?)
identify topics for further research/rtfm (eg. regex (prison!))
plan your next steps (folder loop, read input file)
(in addition, people in the know will point out all my bad practices,
so you can avoid them from the start)
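The "(prison!)" remark refers to a false positive in the output above: the sentence about the prison gate was printed as a social relations paragraph because "prison" contains "son". A small Python sketch of the word-boundary fix that the commented-out \b line in the Perl code hints at:

```python
import re

paragraph = "tell Al Capone to send a car to the prison gate on sunday."

# Plain substring search: "son" is found inside "prison".
substring_hit = bool(re.search("son", paragraph))

# With word boundaries only the standalone word matches.
boundary_hit = bool(re.search(r"\bson\b", paragraph))
real_hit = bool(re.search(r"\bson\b", "send some money to my son in Alcatraz."))

print(substring_hit, boundary_hit, real_hit)
```

If the keywords come from user input, passing them through re.escape before building the pattern avoids surprises with regex metacharacters.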
Good luck!

Resources