No output using join -- files are sorted, fields match - linux

I want to join these two files by the long hash strings, but when I execute the command below it emits no output at all. Both files are sorted on the field being used as the join key.
sort.txt
bondsba01:06997f04a7db92466a2baa6ebc8b872d
mccovwi01:07563a3fe3bbe7e3ba84431ad9d055af
thomafr04:07563a3fe3bbe7e3ba84431ad9d055af
willite01:07563a3fe3bbe7e3ba84431ad9d055af
bankser01:10a7cdd970fe135cf4f7bb55c0e3b59f
matheed01:10a7cdd970fe135cf4f7bb55c0e3b59f
ramirma02:15de21c670ae7c3f6f3f1f37029303c9
ortizda01:285e19f20beded7d215102b49d5c09a0
robinfr02:605ff764c617d3cd28dbbdd72be8f9a2
mantlmi01:65658fde58ab3c2b6e5132a39fae7cb9
mayswi01:68264bdb65b97eeae6788aa3348e553c
rodrial01:7f5d04d189dfb634e6a85bb9d9adf21e
palmera01:8b16ebc056e613024c057be590b542eb
schmimi01:8d34201a5b85900908db6cae92723617
jacksre01:8eefcfdf5990e441f0fb6f3fad709e21
mcgwima01:9ad6aaed513b73148b7d49f70afcfb32
griffke02:9cc138f8dc04cbf16240daa92d8d50e2
ottme01:a760880003e7ddedfef56acb3b09697f
pujolal01:a9a6653e48976138166de32772b1bf40
murraed02:b337e84de8752b27eda3a12363109e80
foxxji01:c399862d3b9d6b76c8436e924a68c45b
aaronha01:ccb0989662211f61edae2e26d58ea92f
ruthba01:d14220ee66aeec73c49038385428ec4c
sosasa01:d7a728a67d909e714c0774e22cb806f2
sheffga01:e2230b853516e7b05d79744fbd4c9c13
killeha01:e5f6ad6ce374177eef023bf5d0c018b6
thomeji01:f76a89f0cb91bc419542ce9fa43902dc
cracked.txt
06997f04a7db92466a2baa6ebc8b872d:762
07563a3fe3bbe7e3ba84431ad9d055af:521
10a7cdd970fe135cf4f7bb55c0e3b59f:512
15de21c670ae7c3f6f3f1f37029303c9:555
285e19f20beded7d215102b49d5c09a0:503
605ff764c617d3cd28dbbdd72be8f9a2:586
65658fde58ab3c2b6e5132a39fae7cb9:536
68264bdb65b97eeae6788aa3348e553c:660
7f5d04d189dfb634e6a85bb9d9adf21e:687
8b16ebc056e613024c057be590b542eb:569
8d34201a5b85900908db6cae92723617:548
8eefcfdf5990e441f0fb6f3fad709e21:563
9ad6aaed513b73148b7d49f70afcfb32:583
9cc138f8dc04cbf16240daa92d8d50e2:630
a760880003e7ddedfef56acb3b09697f:511
a9a6653e48976138166de32772b1bf40:560
b337e84de8752b27eda3a12363109e80:504
c399862d3b9d6b76c8436e924a68c45b:534
ccb0989662211f61edae2e26d58ea92f:755
d14220ee66aeec73c49038385428ec4c:714
d7a728a67d909e714c0774e22cb806f2:609
e2230b853516e7b05d79744fbd4c9c13:509
e5f6ad6ce374177eef023bf5d0c018b6:573
f76a89f0cb91bc419542ce9fa43902dc:612
Code
join -t ':' -1 2 -2 1 sort.txt cracked.txt

You need to ensure that both input files are using UNIX newlines.
DOS text files have two character newlines (carriage return, linefeed). UNIX text files have only a linefeed.
Thus, when a DOS text file is read on UNIX, every line appears to have an extra character on the end (a CR, aka $'\r'). When printed, a carriage return merely sends the cursor back to the beginning of the current line rather than producing a visible character, so its presence isn't always obvious.
So, when you read the first field of cracked.txt, the hashes are exactly what they appear to be -- but when you read the last field of sort.txt, each hash has an invisible carriage return on the end. Thus the keys never match, and you get no output.
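One way to check for and strip the stray carriage returns before joining (a sketch; `dos2unix` would perform the same cleanup):

```shell
# Show line endings: with GNU cat, DOS files display ^M$ at each line end
cat -A sort.txt | head -n 3

# Strip the carriage returns, then join the cleaned copies
tr -d '\r' < sort.txt > sort.unix.txt
tr -d '\r' < cracked.txt > cracked.unix.txt
join -t ':' -1 2 -2 1 sort.unix.txt cracked.unix.txt
```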

Related

How to read a text file and insert the data into next line on getting \n character

I have a text file where the data is comma-delimited with a literal \n character in between. I would like to break the data onto a new line at each \n character.
text file sample:
'what,is,your,name\n','my,name,is,david.hough\n','i,am,a,software,prof\n','what,is,your,name\n','my,name,is,eric.knot\n','i,am,a,software,prof\n','what,is,your,name\n','my,name,is,fisher.cold\n','i,am,a,software,prof\n',..
expected:
I need the output in the below form.
'what,is,your,name',
'my,name,is,david.hough',
'i,am,a,software,prof',
Tried:
file1 = open("test.text", "r")
Lines = file1.readlines()
for line in Lines:
    print(line)
result:
'what,is,your,name\n','my,name,is,david.hough\n','i,am,a,software,prof\n','what,is,your,name\n','my,name,is,eric.knot\n','i,am,a,software,prof\n','what,is,your,name\n','my,name,is,fisher.cold\n','i,am,a,software,prof\n',..
Well, my comment does exactly what you asked: break lines at \n. Your data is structured quite oddly, but if you want that exact expected result you can use a regex:
import re
file1 = open("test.text", "r")
Lines = re.findall(r'\'.*?\',', file1.read().replace("\\n", ""))
for line in Lines:
    print(line)
Well, you don't need to push data to the next line manually; the \n does that work when you run the code.
I guess the problem is that you used quotes around every fragment. Try a single pair of quotes and use \n after each sentence, without whitespace:
'what,is,your,name\nmy,name,is,david.hough\ni,am,a,software,prof'
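To illustrate the point above: a string containing real `\n` escapes already prints across multiple lines, while the question's file stores the two literal characters backslash and `n`, which have no line-breaking effect:

```python
# Real newline escapes: Python turns each \n into a line break
s = 'what,is,your,name\nmy,name,is,david.hough\ni,am,a,software,prof'
print(s)

# Literal backslash-n, as stored in the question's file, stays on one line
raw = 'what,is,your,name\\nmy,name,is,david.hough'
print(raw)
```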

Splitting File contents by regex

As the title stated, I need to split a file by regex in Python.
The lay out of the .txt file is as follows
[text1]
file contents I need
[some text2]
more file contents I need
[more text 3]
last bit of file contents I need
I originally tried splitting the files like so:
re.split('\[[A-Za-z]+\]\n', data)
The problem with doing it this way was that it wouldn't capture the blocks that had spaces in between the text within the brackets.
I then tried using a wild card character: re.split('\[(.*?)\]\n', data)
The problem I ran into with this was that it split the file contents as well. What's the best way to get the following result:
['file contents I need','more file contents I need','last bit of file contents I need']?
Thanks in advance.
Instead of using re.split, you could use a capturing group with re.findall which will return the group 1 values.
In the group, match all the lines that do not start with the [.....] pattern
^\[[^][]*]\r?\n\s*(.*(?:\r?\n(?!\[[^][]*]).*)*)
In parts
^ Start of line
\[[^][]*] Match [...] containing any chars other than square brackets
\r?\n\s* Match a newline and optional whitespace chars
( Capture group 1
.* Match any char except a newline 0+ times
(?: Non capture group
\r?\n(?!\[[^][]*]).* Match the line if it does not start with the [...] pattern, using a negative lookahead (?!
)* Close group and repeat 0+ times to get all the lines
) Close group
Example code
import re

regex = r"^\[[^][]*]\r?\n\s*(.*(?:\r?\n(?!\[[^][]*]).*)*)"
data = ("[text1]\n\n"
        "file contents I need\n\n"
        "[some text2]\n\n"
        "more file contents I need\n\n"
        "[more text 3]\n\n"
        "last bit of file contents I need\n"
        "last bit of file contents I need")
matches = re.findall(regex, data, re.MULTILINE)
print(matches)
Output
['file contents I need\n', 'more file contents I need\n', 'last bit of file contents I need\nlast bit of file contents I need']
Given:
txt='''\
[text1]
file contents I need
[some text2]
more file contents I need
multi line at that
[more text 3]
last bit of file contents I need'''
(Which could be from a file...)
You can do:
>>> [e.strip() for e in re.findall(r'(?<=\])([\s\S]*?)(?=\[|\s*\Z)', txt)]
['file contents I need', 'more file contents I need\nmulti line at that', 'last bit of file contents I need']
You can also use re.finditer to locate each block of interest:
import re

with open(ur_file) as f:
    for i, block in enumerate(re.finditer(
            r'^\s*\[[^]]*\]([\s\S]*?)(?=^\s*\[[^]]*\]|\Z)', f.read(), flags=re.M)):
        print(i, block.group(1))
The individual blocks' leading and trailing whitespace can be dealt with as desired...

Different behaviour shown when running the same code for a file and for a list

I have observed this unusual behaviour when I do string slicing on the words in a file versus the words in a list. The two results are quite different.
For example I have a file 'words.txt' which contains the following content
POPE
POPS
ROPE
POKE
COPE
PAPE
NOPE
POLE
When I write the piece of code below, I expect to get a list of words with the last letter omitted.
with open("words.txt", "r") as fo:
    for l in fo:
        print(l[:-1])
But instead I get the result below. No string slicing seems to take place and the words look the same as before.
POPE
POPS
ROPE
POKE
COPE
PAPE
NOPE
POLE
But if I write the code below, I get what I want:
lis = ["POPE", "POPS", "ROPE", "POKE", "COPE", "PAPE", "NOPE", "POLE"]
for i in lis:
    print(i[:-1])
I am able to delete the last letter of each of the words as expected.
POP
POP
ROP
POK
COP
PAP
NOP
POL
So why do I see two different results for the same operation [:-1]?
Lines end with \n in files, whereas list items carry no line endings.
Your actual file contents are as follows
POPE\n
POPS\n
ROPE\n
POKE\n
COPE\n
PAPE\n
NOPE\n
POLE\n
hence print(l[:-1]) is actually trimming the line ending, i.e. the \n.
To verify this, declare an empty list before the loop, append each line to that list, and print it. You will find that the lines contain a \n at the end:
stuff = []
with open("words.txt", "r") as fo:
    for line in fo:
        stuff.append(line)
print(stuff)
this will print ['POPE\n', 'POPS\n', 'ROPE\n', 'POKE\n', ...]
If I am not wrong, you want to carry out the slicing operation on the file contents. I think you should look into the strip() method.

How find text in file and get lines up and down according to pattern

How can I find the particular text '12345' in a file and get all lines above and below it, up to the nearest 'Received notification:' lines, using Linux console commands and without hardcoding the number of lines up and down?
Received notification:
Random text
Random text
...
12345
random text
...
Random text
Received notification:
You can use the following approach:
$ awk '/str1/ {p=1}; p; /str2/ {p=0}' file
When awk finds str1, it sets the variable p=1. Lines are printed only while p==1; this is what the bare p condition accomplishes. When it is true, awk performs its default action, print $0; otherwise it does not.
When awk finds str2, it sets p=0. Because this check comes after the p condition, the first line containing str2 is still printed.
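Applied to the sample above, with 'Received notification:' as str1 and '12345' as str2 (the file name is assumed):

```shell
# Print from the first 'Received notification:' line through the '12345' line
awk '/Received notification:/ {p=1}; p; /12345/ {p=0}' file
```

Note this covers only the block before '12345'; to also grab the lines after it, up to the next 'Received notification:', run a second pass with the two patterns swapped.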

Read.table to skip lines with errors

I have a large tab-separated .csv file, which has a strict structure with colClasses = c("integer", "integer", "numeric"). For some reason, a number of irrelevant junk character lines break the pattern, which is why I get
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
scan() expected 'an integer', got 'ExecutiveProducers'
How can I ask read.table to continue and just skip these lines? The file is large, so it's troublesome to perform the task by hand.
If that's impossible, should I use scan plus a for-loop?
For now I just read everything as characters, then delete the irrelevant rows and convert the columns back to numeric, which I think is not very memory-efficient.
If your file fits into memory, you could first read the file, remove unwanted lines and then read those using read.csv:
lines <- readLines("yourfile")
# remove unwanted lines: select only lines that contain no
# alphabetic characters; assuming you have column titles in the
# first line, you want to add those back again; hence the c(1, sel)
sel <- grep("[[:alpha:]]", lines, invert = TRUE)
lines <- lines[c(1, sel)]
# read data from the selected lines; the file is tab-separated
con <- textConnection(lines)
data <- read.csv(file = con, sep = "\t")  # plus other arguments as normal
If the junk strings are always the same, or always contain the same word, you can declare them as NA values using
read.csv(..., na.strings = "ExecutiveProducers")
and then delete all of those rows afterwards with
na.omit(dataframe)
