read.table: how to skip lines with errors

I have a large .csv file separated by tabs, which has a strict structure with colClasses = c("integer", "integer", "numeric"). For some reason there are a number of irrelevant character lines that break the pattern, which is why I get
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
scan() expected 'an integer', got 'ExecutiveProducers'
How can I ask read.table to continue and just skip these lines? The file is large, so it's troublesome to do this by hand.
If that's impossible, should I use scan plus a for-loop?
For now I just read everything as character, delete the irrelevant rows, and convert the columns back to numeric, which I think is not very memory-efficient.

If your file fits into memory, you can first read all the lines, remove the unwanted ones, and then parse the remainder with read.csv:
lines <- readLines("yourfile")
# remove unwanted lines: select only lines that do not contain
# alphabetic characters; assuming you have column titles in the
# first line, you want to add those back again; hence the c(1, sel)
sel <- grep("[[:alpha:]]", lines, invert=TRUE)
lines <- lines[c(1,sel)]
# read data from selected lines
con <- textConnection(lines)
data <- read.csv(file=con, [other arguments as normal])

If the offending character strings are always the same, or always contain the same word, you can declare them as NA values while reading, e.g.
read.csv(..., na.strings="ExecutiveProducers")
and then delete all of the resulting rows afterwards with
na.omit(dataframe)
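For comparison, the same filter-then-parse idea ports to Python; this is a rough sketch, and the file name, separator, and pandas usage are assumptions, not part of the original question:
import io
import re

import pandas as pd

with open("yourfile") as fh:
    header, *body = fh.readlines()

# keep the header plus every data line with no alphabetic characters
clean = [ln for ln in body if not re.search(r"[A-Za-z]", ln)]
data = pd.read_csv(io.StringIO(header + "".join(clean)), sep="\t")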


set function with file - python3

I have a text file with the content given below:
Credit
Debit
21/12/2017
09:10:00
I have written Python code to convert the text into a set and discard the \n characters:
with open('text_file_name', 'r') as file1:
    same = set(file1)
print(same)
print(same.discard('\n'))
For the first print statement, print(same), I get the correct result:
{'Credit\n','Debit\n','21/12/2017\n','09:10:00\n'}
But for the second print statement, print(same.discard('\n')), I get None.
Can anybody help me figure out why I am getting None? I am using same.discard('\n') to discard the \n characters in the set.
Note:
I am trying to understand the discard function with respect to set.
The discard method only removes a single element from the set; since your set doesn't contain an element that is just '\n', there is nothing for it to discard. What you are looking for is a map that strips the \n from each element, like so:
set(map(lambda x: x.rstrip('\n'), same))
which will return {'Credit', 'Debit', '09:10:00', '21/12/2017'} as the set. This works by using the map builtin, which applies its first argument to each element of the set. The first argument in our map usage is lambda x: x.rstrip('\n'), which simply removes any trailing occurrences of \n from the right-hand side of each string.
discard removes the given element from the set only if it is present in it.
In addition, the method returns None because it modifies the set it was called on in place; that is why your second print statement prints None.
with open('text_file_name', 'r') as file1:
    same = set(file1)
print(same)
same = {elem[:-1] for elem in same if elem.endswith('\n')}
print(same)
There are 4 elements in the set, and none of them is a bare newline.
It would be more usual to use a list in this case, as a list preserves order while a set is not guaranteed to, and a set also discards duplicate lines. Perhaps you have your reasons.
You seem to be looking for rstrip('\n'). Consider processing the file in this way:
s = set()
with open('text_file_name') as file1:
    for line in file1:
        s.add(line.rstrip('\n'))
s.discard('Credit')
print(s)  # This displays 3 elements, without trailing newlines.
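And if line order matters, the list-based variant mentioned above is a one-line change; a minimal sketch:
lines = []
with open('text_file_name') as file1:
    for line in file1:
        lines.append(line.rstrip('\n'))
print(lines)  # preserves the file's order and keeps duplicates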

Different behaviour shown when running the same code for a file and for a list

I have observed some unusual behaviour when I try to do string slicing on the words in a file and the words in a list. The two results are quite different.
For example I have a file 'words.txt' which contains the following content
POPE
POPS
ROPE
POKE
COPE
PAPE
NOPE
POLE
When I write the below piece of code, I expect to get a list of words with last letter omitted.
with open("words.txt", "r") as fo:
    for l in fo:
        print(l[:-1])
But instead I get the result below. No string slicing appears to take place and the words look the same as before.
POPE
POPS
ROPE
POKE
COPE
PAPE
NOPE
POLE
But if I write the code below, I get what I want:
lis = ["POPE", "POPS", "ROPE", "POKE", "COPE", "PAPE", "NOPE", "POLE"]
for i in lis:
    print(i[:-1])
I am able to delete the last letter of each of the words as expected.
POP
POP
ROP
POK
COP
PAP
NOP
POL
So why do I see two different results for the same operation, [:-1]?
Lines read from a file end with \n, whereas the strings in your list have no line endings.
Your actual file contents are as follows
POPE\n
POPS\n
ROPE\n
POKE\n
COPE\n
PAPE\n
NOPE\n
POLE\n
hence print(l[:-1]) is actually trimming only the line ending, i.e. the \n, which is why the words look unchanged.
To verify this, declare an empty list before the loop, add each line to that list, and print it. You will find that the lines contain a \n at the end of every line:
stuff = []
with open("words.txt", "r") as fo:
    for line in fo:
        stuff.append(line)
print(stuff)
This will print ['POPE\n', 'POPS\n', 'ROPE\n', 'POKE\n', ...].
If I am not wrong, you want to carry out the slicing operation on the file contents after removing the newline. I think you should look into the strip() method.
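For example, a minimal sketch of the loop with the newline stripped before slicing:
with open("words.txt", "r") as fo:
    for l in fo:
        print(l.rstrip('\n')[:-1])  # drop the newline first, then the last letter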

No output using join -- files are sorted, fields match

I want to join these two files on the long hash strings, but when I execute the command it emits no output at all. Both files are sorted by the field being used as the join key.
sort.txt
bondsba01:06997f04a7db92466a2baa6ebc8b872d
mccovwi01:07563a3fe3bbe7e3ba84431ad9d055af
thomafr04:07563a3fe3bbe7e3ba84431ad9d055af
willite01:07563a3fe3bbe7e3ba84431ad9d055af
bankser01:10a7cdd970fe135cf4f7bb55c0e3b59f
matheed01:10a7cdd970fe135cf4f7bb55c0e3b59f
ramirma02:15de21c670ae7c3f6f3f1f37029303c9
ortizda01:285e19f20beded7d215102b49d5c09a0
robinfr02:605ff764c617d3cd28dbbdd72be8f9a2
mantlmi01:65658fde58ab3c2b6e5132a39fae7cb9
mayswi01:68264bdb65b97eeae6788aa3348e553c
rodrial01:7f5d04d189dfb634e6a85bb9d9adf21e
palmera01:8b16ebc056e613024c057be590b542eb
schmimi01:8d34201a5b85900908db6cae92723617
jacksre01:8eefcfdf5990e441f0fb6f3fad709e21
mcgwima01:9ad6aaed513b73148b7d49f70afcfb32
griffke02:9cc138f8dc04cbf16240daa92d8d50e2
ottme01:a760880003e7ddedfef56acb3b09697f
pujolal01:a9a6653e48976138166de32772b1bf40
murraed02:b337e84de8752b27eda3a12363109e80
foxxji01:c399862d3b9d6b76c8436e924a68c45b
aaronha01:ccb0989662211f61edae2e26d58ea92f
ruthba01:d14220ee66aeec73c49038385428ec4c
sosasa01:d7a728a67d909e714c0774e22cb806f2
sheffga01:e2230b853516e7b05d79744fbd4c9c13
killeha01:e5f6ad6ce374177eef023bf5d0c018b6
thomeji01:f76a89f0cb91bc419542ce9fa43902dc
cracked.txt
06997f04a7db92466a2baa6ebc8b872d:762
07563a3fe3bbe7e3ba84431ad9d055af:521
10a7cdd970fe135cf4f7bb55c0e3b59f:512
15de21c670ae7c3f6f3f1f37029303c9:555
285e19f20beded7d215102b49d5c09a0:503
605ff764c617d3cd28dbbdd72be8f9a2:586
65658fde58ab3c2b6e5132a39fae7cb9:536
68264bdb65b97eeae6788aa3348e553c:660
7f5d04d189dfb634e6a85bb9d9adf21e:687
8b16ebc056e613024c057be590b542eb:569
8d34201a5b85900908db6cae92723617:548
8eefcfdf5990e441f0fb6f3fad709e21:563
9ad6aaed513b73148b7d49f70afcfb32:583
9cc138f8dc04cbf16240daa92d8d50e2:630
a760880003e7ddedfef56acb3b09697f:511
a9a6653e48976138166de32772b1bf40:560
b337e84de8752b27eda3a12363109e80:504
c399862d3b9d6b76c8436e924a68c45b:534
ccb0989662211f61edae2e26d58ea92f:755
d14220ee66aeec73c49038385428ec4c:714
d7a728a67d909e714c0774e22cb806f2:609
e2230b853516e7b05d79744fbd4c9c13:509
e5f6ad6ce374177eef023bf5d0c018b6:573
f76a89f0cb91bc419542ce9fa43902dc:612
Code
join -t ':' -1 2 -2 1 sort.txt cracked.txt
You need to ensure that both input files are using UNIX newlines.
DOS text files have two character newlines (carriage return, linefeed). UNIX text files have only a linefeed.
Thus, when reading a DOS text file on UNIX, every line appears to have an extra character on the end (a CR, aka $'\r'). When printed, a carriage return merely sends the cursor back to the beginning of the current line rather than producing a visible glyph, so its presence isn't always obvious.
So, when you read the first field of cracked.txt, the hashes are exactly what they appear to be -- but when you read the last field of sort.txt, each one has an invisible carriage return on the end. The keys therefore never match, and you get no output.
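The usual fix is to convert the offending file with a tool such as dos2unix. If that isn't available, a minimal Python sketch of the same conversion (the file names here are just examples):
with open("sort.txt", "rb") as src, open("sort-unix.txt", "wb") as dst:
    dst.write(src.read().replace(b"\r\n", b"\n"))
After converting, rerun the join command against the cleaned file.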

Julia: comparing strings with special characters

I need to read a text file which contains csv data with headers separating individual blocks of data. The headers always start with the dollar sign $. So my text file looks like:
$Header1
2
1,2,3,4
2,4,5,8
$Header2
2
1,1,0,19,9,8
2,1,0,18,8,7
What I want to do is: when the program reaches $Header2, read all the lines following it until it reaches, say, $Header3 or the end of the file. I think I can use cmp in Julia for this. I tried with a small file that contains the following text:
# file julia.txt
Julia
$Julia
and my code reads:
# test.jl
fname = "julia.txt"
# set some string values
str1 ="Julia";
str2 ="\$Julia";
# print the strings and check the length
println(length(str1),",",str1);
println(length(str2),",",str2);
# now read the text file to check if you are able to find the strings
# str1 and str2 above
println("Reading file...");
for ln in eachline(fname)
    println(length(ln), ",", ln);
    if (cmp(str1, ln) == 0)
        println("Julia match")
    end
    if (cmp(str2, ln) == 0)
        println("\$Julia match")
    end
end
what I get as output from the above code is:
5,Julia
6,$Julia
Reading file...
6,Julia
7,$Julia
I don't understand why I get a character length of 6 for the string Julia and 7 for the string $Julia when they are read from the file. I checked the text file with whitespace characters displayed and there are none. What am I doing wrong?
The issue is that the strings returned by eachline contain a newline character at the end.
You can use chomp to remove it:
julia> first(eachline("julia.txt"))
"Julia\n"
julia> chomp(first(eachline("julia.txt")))
"Julia"
Also, you can simply use == instead of cmp to test whether two strings are equal. Both use a ccall to memcmp but == only does that for strings of equal length and is thus probably faster.
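Once the trailing newlines are chomped, the header-scanning loop the question describes becomes straightforward. A minimal sketch of the idea, shown in Python for brevity (the file name is hypothetical):
blocks = {}
current = None
with open("data.txt") as fh:
    for line in fh:
        line = line.rstrip("\n")      # the equivalent of chomp
        if line.startswith("$"):      # a $Header line starts a new block
            current = line
            blocks[current] = []
        elif current is not None:
            blocks[current].append(line)
print(blocks["$Header2"])  # every line between $Header2 and the next header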

str.format places last variable first in print

The purpose of this script is to parse a text file (sys.argv[1]), extract certain strings, and print them in columns. I start by printing the header. Then I open the file, and scan through it, line by line. I make sure that the line has a specific start or contains a specific string, then I use regex to extract the specific value.
The matching and extraction work fine.
My final print statement doesn't work properly.
import re
import sys

print("{}\t{}\t{}\t{}\t{}".format("#query", "target", "e-value",
                                  "identity(%)", "score"))
with open(sys.argv[1], 'r') as blastR:
    for line in blastR:
        if line.startswith("Query="):
            queryIDMatch = re.match('Query= (([^ ])+)', line)
            queryID = queryIDMatch.group(1)
            queryID.rstrip
        if line[0] == '>':
            targetMatch = re.match('> (([^ ])+)', line)
            target = targetMatch.group(1)
            target.rstrip
        if "Score = " in line:
            eValue = re.search(r'Expect = (([^ ])+)', line)
            trueEvalue = eValue.group(1)
            trueEvalue = trueEvalue[:-1]
            trueEvalue.rstrip()
            print('{0}\t{1}\t{2}'.format(queryID, target, trueEvalue), end='')
The problem occurs when I try to print the columns. When I print the first 2 columns, it works as expected (except that it's still printing newlines):
#query target e-value identity(%) score
YAL002W Paxin1_129011
YAL003W Paxin1_167503
YAL005C Paxin1_162475
YAL005C Paxin1_167442
The 3rd column is a number in scientific notation like 2e-34
But when I add the 3rd column, eValue, it breaks down:
#query target e-value identity(%) score
YAL002W Paxin1_129011
4e-43YAL003W Paxin1_167503
1e-55YAL005C Paxin1_162475
0.0YAL005C Paxin1_167442
0.0YAL005C Paxin1_73182
I have removed all newlines, as far as I know, using the rstrip() method.
At least three problems:
1) queryID.rstrip and target.rstrip are missing the call parentheses (), so the method is never actually called.
2) Something like trueEvalue.rstrip() doesn't mutate the string; you would need
trueEvalue = trueEvalue.rstrip()
if you want to keep the change.
3) This might be a problem, but without seeing your data I can't be 100% sure. The r in rstrip stands for "right". If trueEvalue is 4e-43\n, then trueEvalue.rstrip() would indeed be free of newlines. But the problem is that your values seem to be something like \n4e-43. If you simply use .strip(), newlines will be removed from both sides.
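To illustrate point 2: Python strings are immutable, so the stripping methods return a new string rather than modifying the original. A quick sketch:
s = "4e-43\n"
s.rstrip()      # returns '4e-43', but the result is discarded
print(repr(s))  # '4e-43\n' -- s is unchanged
s = s.strip()   # rebind the name to keep the cleaned value
print(repr(s))  # '4e-43'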
