I have a string:
line="123 123 test testing"
and I want to go through the pieces separated by space and check to see if each piece (123, 123, test, testing) matches a pre-determined sequence.
I know how to tokenize a string:
for token in string.gmatch(line, '%w+') do
    print(token)
end
I am not sure, however, how to iterate through each piece one by one and compare it to my variable, local var.
Basically I want to get each piece of the string and compare it to the contents of the variable var.
Here is some pseudo-code:
Read file line by line {
    Split every line by space (each line is in this format: 123 456 string string)
    Var1 = "123"
    Var2 = "456"
    If first token of the line == Var1 then
        If second token of the line == Var2 then
            Print the line
        End
    End
}
for token in line:gmatch('%w+') do   -- walk the pieces one by one
    if token == var then             -- compare each piece to var
        print(token)
    end
end
Without more information this is all I can think of.
The question:
Write a function file_size(filename) that returns a count of the number of characters in the file whose name is given as a parameter. You may assume that when being tested in this CodeRunner question your function will never be called with a non-existent filename.
For example, if data.txt is a file containing just the following line: Hi there!
A call to file_size('data.txt') should return the value 10. This includes the newline character that will be added to the line when you're creating the file (be sure to hit the 'Enter' key at the end of each line).
What I have tried:
def file_size(data):
    """Count the number of characters in a file"""
    infile = open('data.txt')
    data = infile.read()
    infile.close()
    return len(data)

print(file_size('data.txt'))
# data.txt contains 'Hi there!' followed by a newline character.
I get the correct answer for this file, but I fail a test that uses a larger/longer file which should have a character count of 81; I still get 10. I am trying to get the code to count the correct size of any file.
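For what it's worth, the bug is visible in the snippet itself: the function never uses its parameter and always opens the hard-coded 'data.txt', so every call returns the size of that one file. A minimal sketch of the fix:

```python
def file_size(filename):
    """Count the number of characters in the file whose name is given."""
    # open the file that was actually passed in, not a hard-coded name
    with open(filename) as infile:
        return len(infile.read())
```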
I'm extracting data in a loop from a text file between two strings with Python 3.6. I've got multiple strings of which I would like to extract data between those strings, see code below:
for i in range(0, len(strings1)):
    with open('infile.txt', 'r') as infile, open('outfile.txt', 'w') as outfile:
        copy = False
        for line in infile:
            if line == strings1[i]:
                copy = True
            elif line == strings2[i]:
                copy = False
            elif copy:
                outfile.write(line)
                continue
To decrease the processing time of the loop, I would like to modify my code so that after it has extracted the data between two strings, say strings1[1] and strings2[1], it remembers the line index of strings2[1] and starts the next iteration of the loop at that index. That way it doesn't have to read the whole file on every iteration. The string lists are built such that a previous string will never occur after the current string, so this modification won't break the loop.
Does anyone know how to do this?
===========================================================================
EDIT:
I've got a file in a format such as:
the first line
bla bla bla
FIRST some string 1
10 10
15 20
5 2.5
SECOND some string 2
bla bla bla
bla bla bla
FIRST some string 3
10 10
15 20
5 2.5
SECOND some string 4
The file goes on like this for many lines.
I want to extract the data between 'FIRST some string 1' and 'SECOND some string 2', and plot this data. When that is done, I want to do the same for the data between 'FIRST some string 3' and 'SECOND some string 4' (thus also plot the data). All the 'FIRST some string ..' are stored in strings1 list and all the 'SECOND some string ..' are stored in strings2 list.
To decrease the computational time, I would like to modify the code so that after the first iteration it knows it can start from the line 'SECOND some string 2' rather than from 'the first line', AND so that during the first iteration it knows it can stop as soon as it has found 'SECOND some string 2'.
Does anyone know how to do this? Please let me know if something is unclear.
The key issue is that you're reopening your files inside the for loop, so of course it rereads the file from the beginning on every iteration. I wouldn't open the files in a for loop; that's horribly inefficient. You can load the file into memory first and then loop through strings1.
There are some other issues, namely here:
copy = False
for line in infile:
    if line == strings1[i]:
        copy = True
    elif line == strings2[i]:
        copy = False
    elif copy:
        outfile.write(line)
        continue
The elif copy: branch never executes until a line equal to strings1[i] has been seen, because that is the only place copy becomes True; it is switched back off when a line equals strings2[i]. So the code writes exactly the lines strictly between the two markers, skipping the marker lines themselves. Unless this is precisely what you're trying to achieve, the logic doesn't work.
Without a full context it's hard to understand what exactly you're looking for.
But maybe what you want to do instead is simply this:
with open('infile.txt', 'r') as infile, open('outfile.txt', 'w') as outfile:
    for line in infile.readlines():
        if line.rstrip('\n') in strings1:
            outfile.write(line)
What this code is doing:
1.) Open both files.
2.) Iterate through the lines of the infile.
3.) Check whether the line, with its trailing newline character stripped, is in the list strings1. This assumes the items in strings1 have no trailing newline characters; if each item already ends in \n, then don't rstrip the line.
4.) If the line occurs in strings1, write it to outfile.
This looks to be the gist of what you're attempting.
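The resume-from-the-last-marker behaviour the question asks for can be sketched separately. extract_blocks below is an illustrative name; it assumes the file has been read once (e.g. lines = infile.read().splitlines()) and that the marker pairs appear in the file in the same order as in strings1/strings2:

```python
def extract_blocks(lines, starts, stops):
    """Collect the lines between each starts[i]/stops[i] pair,
    resuming the scan where the previous block ended."""
    blocks = []
    pos = 0                       # remembered line index across marker pairs
    for start, stop in zip(starts, stops):
        block = []
        copying = False
        while pos < len(lines):
            line = lines[pos]
            pos += 1
            if line == start:
                copying = True    # start of the current block
            elif line == stop:
                break             # end of block; keep pos for the next pair
            elif copying:
                block.append(line)
        blocks.append(block)
    return blocks
```

Keeping pos across marker pairs is what avoids rescanning the file from the top on every iteration.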
How can I find a particular text '12345' in a file and get all the lines above and below it, up to the nearest 'Received notification:' lines, using Linux console commands, without hardcoding the number of lines up and down?
Received notification:
Random text
Random text
...
12345
random text
...
Random text
Received notification:
You can use the following approach:
$ awk '/str1/ {p=1}; p; /str2/ {p=0}' file
When it finds str1, it sets the variable p=1. Lines are then printed while p==1; this is accomplished by the bare p condition: when it is true, awk performs the default action, print $0, and otherwise it does not.
When it finds str2, it sets p=0. Because that check comes after the p condition, the line on which str2 first appears is also printed.
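For readers more comfortable in a script, the same flag trick can be transcribed into Python (purely illustrative; print_between is a made-up helper, not an existing tool):

```python
import re

def print_between(lines, pat1, pat2):
    """Emulate awk '/str1/ {p=1}; p; /str2/ {p=0}': return the lines from
    the first match of pat1 through the first subsequent match of pat2."""
    out = []
    p = False
    for line in lines:
        if re.search(pat1, line):
            p = True              # /str1/ {p=1}
        if p:
            out.append(line)      # the bare `p` condition: print $0
        if re.search(pat2, line):
            p = False             # /str2/ {p=0}, checked after printing
    return out
```

The order of the three checks mirrors the order of the three awk rules, which is why the str1 and str2 marker lines themselves are included in the output.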
I need to read a text file which contains csv data with headers separating individual blocks of data. The headers always start with the dollar sign $. So my text file looks like:
$Header1
2
1,2,3,4
2,4,5,8
$Header2
2
1,1,0,19,9,8
2,1,0,18,8,7
What I want to do is: when the program reaches $Header2, read all the following lines until it reaches, say, $Header3 or the end of the file. I think I can use cmp in Julia for this. I tried with a small file that contains the following text:
# file julia.txt
Julia
$Julia
and my code reads:
# test.jl
fname = "julia.txt"

# set some string values
str1 = "Julia";
str2 = "\$Julia";

# print the strings and check the length
println(length(str1), ",", str1);
println(length(str2), ",", str2);

# now read the text file to check if you are able to find the strings
# str1 and str2 above
println("Reading file...");
for ln in eachline(fname)
    println(length(ln), ",", ln);
    if cmp(str1, ln) == 0
        println("Julia match")
    end
    if cmp(str2, ln) == 0
        println("\$Julia match")
    end
end
What I get as output from the above code is:
5,Julia
6,$Julia
Reading file...
6,Julia
7,$Julia
I don't understand why I get a character length of 6 for the string Julia and 7 for the string $Julia when they are read from the file. I checked the text file with whitespace characters displayed and there are none. What am I doing wrong?
The issue is that the strings returned by eachline contain a newline character at the end.
You can use chomp to remove it:
julia> first(eachline("julia.txt"))
"Julia\n"
julia> chomp(first(eachline("julia.txt")))
"Julia"
Also, you can simply use == instead of cmp to test whether two strings are equal. Both use a ccall to memcmp but == only does that for strings of equal length and is thus probably faster.
I have a large tab-separated .csv file with a strict structure, colClasses = c("integer", "integer", "numeric"). For some reason there are a number of irrelevant junk character lines that break the pattern, which is why I get
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
scan() expected 'an integer', got 'ExecutiveProducers'
How can I ask read.table to continue and just skip these lines? The file is large, so it's troublesome to do this by hand.
If that's impossible, should I use scan plus a for-loop?
Currently I just read everything as characters, then delete the irrelevant rows and convert the columns back to numeric, which I think is not very memory-efficient.
If your file fits into memory, you could first read the file, remove unwanted lines and then read those using read.csv:
lines <- readLines("yourfile")
# remove unwanted lines: select only lines that do not contain
# alphabetic characters; assuming you have column titles in the
# first line, you want to add those back again; hence the c(1, sel)
sel <- grep("[[:alpha:]]", lines, invert=TRUE)
lines <- lines[c(1,sel)]
# read data from selected lines
con <- textConnection(lines)
data <- read.csv(file=con, [other arguments as normal])
If the offending fields are always the same, or always contain the same word (for example the 'ExecutiveProducers' from the error message), you can declare them as NA values using
read.csv(..., na.strings = "ExecutiveProducers")
and then drop those rows afterwards with
na.omit(dataframe)