I need to read a text file which contains CSV data, with headers separating individual blocks of data. The headers always start with a dollar sign $. So my text file looks like:
$Header1
2
1,2,3,4
2,4,5,8
$Header2
2
1,1,0,19,9,8
2,1,0,18,8,7
What I want to do is: once the program reaches $Header2, read all the lines that follow it until it reaches, say, $Header3 or the end of the file. I think I can use cmp in Julia for this. I tried with a small file that contains the following text:
# file julia.txt
Julia
$Julia
and my code reads:
# test.jl
fname = "julia.txt"
# set some string values
str1 ="Julia";
str2 ="\$Julia";
# print the strings and check the length
println(length(str1),",",str1);
println(length(str2),",",str2);
# now read the text file to check if you are able to find the strings
# str1 and str2 above
println ("Reading file...");
for ln in eachline(fname)
println(length(ln),",",ln);
if (cmp(str1,ln)==0)
println("Julia match")
end
if (cmp(str2,ln)==0)
println("\$Julia match")
end
end
What I get as output from the above code is:
5,Julia
6,$Julia
Reading file...
6,Julia
7,$Julia
I don't understand why I get a character length of 6 for the string Julia and 7 for the string $Julia when they are read from the file. I checked the text file with whitespace characters made visible and there are none. What am I doing wrong?
The issue is that the strings returned by eachline contain a newline character at the end, so "Julia" is read back as "Julia\n" (length 6) and "$Julia" as "$Julia\n" (length 7).
You can use chomp to remove it:
julia> first(eachline("julia.txt"))
"Julia\n"
julia> chomp(first(eachline("julia.txt")))
"Julia"
Also, you can simply use == instead of cmp to test whether two strings are equal. Both use a ccall to memcmp, but == only does that for strings of equal length and is thus probably faster. (Note that in current Julia versions eachline drops the trailing newline by default; pass keep=true to retain it.)
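With the newline issue fixed, the original goal (collect the lines that follow $Header2 until the next $-header or the end of the file) is a simple loop. Here is a minimal sketch of that pattern, written in Python for illustration since the structure is language-agnostic (the file name data.txt is a placeholder; in Julia the same loop uses eachline, chomp, and startswith):

def read_block(fname, header):
    """Collect the lines between `header` and the next $-header (or EOF)."""
    block, inside = [], False
    with open(fname) as f:
        for line in f:
            line = line.rstrip("\n")       # same role as chomp in Julia
            if line.startswith("$"):
                inside = (line == header)  # enter on our header, leave on any other
            elif inside:
                block.append(line)
    return block

print(read_block("data.txt", "$Header2"))  # ['2', '1,1,0,19,9,8', '2,1,0,18,8,7']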
I am making a small project in Python that lets you make notes and then read them using specific arguments. I attempted to make an if statement to check whether the string has a comma in it; if it does, my Python file should find the comma, take the characters on either side of it, and turn them into integers so it can read out the notes the user created in a specific user-defined range.
If that didn't make sense, then basically all I am saying is that I want to find out what line/bit of code is causing this to not work and return nothing even though notes.txt has content.
Here is what I have in my python file:
if "," not in no_cs: # no_cs is the string I am searching through
user_out = int(no_cs[6:len(no_cs) - 1])
notes = open("notes.txt", "r") # notes.txt is the file that stores all the notes the user makes
notes_lines = notes.read().split("\n") # this is suppose to split all the notes into a list
try:
print(notes_lines[user_out])
except IndexError:
print("That line does not exist.")
notes.close()
elif "," in no_cs:
user_out_1 = int(no_cs.find(',') - 1)
user_out_2 = int(no_cs.find(',') + 1)
notes = open("notes.txt", "r")
notes_lines = notes.read().split("\n")
print(notes_lines[user_out_1:user_out_2]) # this is SUPPOSE to list all notes in a specific range but doesn't
notes.close()
Now here is the notes.txt file:
note
note1
note2
note3
and lastly, here is what I get in the console when I attempt to run the program and type notes(0,2):
>>> notes(0,2)
jeffv : notes(0,2)
[]
A great way to do this is to use Python's .partition() method. It splits a string at the first occurrence of a separator and returns a tuple of three parts: [0] everything before the separator, [1] the separator itself, and [2] everything after the separator:
# The whole string we wish to search.. Let's use a
# Monty Python quote since we are using Python :)
whole_string = "We interrupt this program to annoy you and make things\
generally more irritating."
# Here is the first word we wish to split from the entire string
first_split = 'program'
# now we use partition to pick what comes after the first split word
substring_split = whole_string.partition(first_split)[2]
# now we use python to give us the first character after that first split word
first_character = substring_split[0]
# since the above is a space, let's also show the second character so
# that it is less confusing :)
second_character = substring_split[1]
# Output
print("Here is the whole string we wish to split: " + whole_string)
print("Here is the first split word we want to find: " + first_split)
print("Now here is the first word that occurred after our split word: " + substring_split)
print("The first character after the substring split is: " + first_character)
print("The second character after the substring split is: " + second_character)
output
Here is the whole string we wish to split: We interrupt this program to annoy you and make things generally more irritating.
Here is the first split word we want to find: program
Now here is the first word that occurred after our split word: to annoy you and make things generally more irritating.
The first character after the substring split is:
The second character after the substring split is: t
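Applied to the original notes problem, the same idea extracts the two range numbers around the comma in no_cs. A sketch, assuming no_cs looks like "notes(0,2)" as in the console example, and that both line numbers are single digits (as the original code also assumes):

no_cs = "notes(0,2)"

# partition at the comma, then read the digit on each side of it
before, _, after = no_cs.partition(",")
start = int(before[-1])  # character just before the comma -> 0
stop = int(after[0])     # character just after the comma  -> 2

with open("notes.txt") as notes:
    notes_lines = notes.read().split("\n")

# a slice's end is exclusive, so add 1 to include the second line
print(notes_lines[start:stop + 1])  # ['note', 'note1', 'note2']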
I have a .tgz file that was formatted as shellcode; it looks like this (hex):
"\x1F\x8B\x08\x00\x44\x7A\x91\x4F\x00\x03\xED\x59\xED\x72.."
It was generated this way (python3):
import os

def main():
    dump_src = "MyPlugin.tgz"
    fc = ""
    try:
        with open(dump_src, 'rb') as fd:
            fcr = fd.read()
            for byte in bytearray(fcr):
                fc += "\\x{:02x}".format(byte)
    except:
        fcr = dump_src
        for byte in bytearray(fcr):
            fc += "\\x{:02x}".format(byte)
    print(fc)

    # failed attempt:
    fcback = bytes(int(fc[i+2:i+4], 16) for i in range(0, len(fc), 4))
    print(fcback)

if __name__ == "__main__":
    main()
How can I convert this back to the original tgz archive?
Edit: failed attempt in the last section outputs this:
b'\x8b\x00\x10]\x03\x93o0\x85%\xe2!\xa4H\xf1Fi\xa7\x15\xf61&\x13N\xd9[\xfag\x11V\x97\xd3\xfb%\xf7\xe3\\\xae\xc2\xff\xa4>\xaf\x11\xcc\x93\xf1\x0c\x93\xa4\x1b\xefxj\xc3?\xf9\xc1\xe8\xd1\xd9\x01\x97qB"\x1a\x08\x9cO\x7f\xe9\x19\xe3\x9c\x05\xf2\x04a\xaa\x00A,\x15"RN-\xb6\x18K\x85\xa1\x11\x83\xac/\xffR\x8a\xa19\xde\x10\x0b\x08\x85\x93\xfc]\x8a^\xd2-T\x92\x9a\xcc-W\xc7|\xba\x9c\xb3\xa6V0V H1\x98\xde\x03#\x14\'\n 1Y\xf7R\x14\xe2#\xbe*:\xe0\xc8\xbb\xc9\x0bo\x8bm\xed.\xfd\xae\xef\x9fT&\xa1\xf4\xcf\xa7F\xf4\xef\xbb"8"\xb5\xab,\x9c\xbb\xfc3\x8b\xf5\x88\xf4A\x0ek%5eO\xf4:f\x0b\xd6\x1bi\xb6\xf3\xbf\xf7\xf9\xad\xb5[\xdba7\xb8\xf9\xcd\xba\xdd,;c\x0b\xaaT"\xd4\x96\x17\xda\x07\x87& \xceH\xd6\xbf\xd2\xeb\xb4\xaf\xbd\xc2\xee\xfc\'3zU\x17>\xde\x06u\xe3G\x7f\x1e\xf3\xdf\xb6\x04\x10A\x04\x10A\x04\x10A\x04\x10A\xff\x9f\xab\xe8(\x00'
And when I output it to a file (e.g. via python3 main.py > MyFile.tgz) the file is corrupted.
Since you know the format of the data (each byte is encoded as a string of 4 characters in the format "\xAB") it's easy to revert the conversion and get the original bytes again. It'll only take one line of Python code:
data = bytes(int(fc[i+2:i+4], 16) for i in range(0, len(fc), 4))
This uses:
range(start, stop, step) with step 4 to iterate in groups of 4 characters through your string
slicing to get each group of 2 hexadecimal digits
int(x, base) to convert the hexadecimal string to an integer
a generator expression to immediately pass the converted elements to:
bytes() to create a bytes object with the data
The variable data is now of type bytes and you could directly write it to a file (to decompress with an external zip program), or pass it to zlib.decompress() (to further process it in Python).
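As a quick check, here is the one-liner applied to the first eight bytes quoted in the question (written as a raw string so the backslashes stay literal; int(x, 16) accepts upper- and lowercase hex digits alike):

fc = r"\x1F\x8B\x08\x00\x44\x7A\x91\x4F"
data = bytes(int(fc[i+2:i+4], 16) for i in range(0, len(fc), 4))
print(data)  # b'\x1f\x8b\x08\x00Dz\x91O' (the gzip magic number is back)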
UPDATE (follow-up on the comments and updated question):
Firstly, I have tested the above code and it does result in the same bytes as the input. Are you really sure that the example output in your question is the actual result of the code in your question? Please try to be careful when copying code and/or output. A few remarks:
Your code is not properly formatted, so I cannot run it without making modifications. And when I have made modifications to the code, I might run different code than you do, yielding different results. So next time please copy-paste your exact (working, tested) code without modifications.
The format string in your code uses lowercase hexadecimal format, and your first example output uses uppercase. So that output cannot be from this code.
I don't have access to your file "MyPlugin.tgz", but when I test your code with another .tgz file (after fixing the IndentationErrors), my output is correct. It starts with \x1f\x8b as expected (this is the magic number in the gzip header). I can't explain why your output is different...
Secondly, it seems like you don't fully understand how bytes and string representations work. When you write print(fcback), a string representation of the Python object fcback (in this case a bytes object) is printed. The string representation of a bytes object is not the same as the binary data! When printing a bytes object, each byte that corresponds to a printable ASCII character is replaced by that character, other bytes are escaped (similar to the formatted string that your code generates). Also, it starts with b' and ends with '.
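A quick interactive session makes the difference visible:

>>> data = b'\x1f\x8b\x03'
>>> print(data)  # prints the string representation of the object
b'\x1f\x8b\x03'
>>> len(data)    # yet the object holds only 3 actual bytes
3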
You cannot print binary data to your terminal and then pipe the output to a file. This will result in a different file. The correct way to write the data to a file is using file.write(data) in your Python code.
Here's a fully working example:
def binary_to_text(data):
"""Convert a bytes object to a formatted text string."""
text = ""
for byte in data:
text += "\\x{:02x}".format(byte)
return text
def text_to_binary(text):
"""Convert a formatted text string to a bytes object."""
return bytes(int(text[i+2:i+4], 16) for i in range(0, len(text), 4))
def main():
# Read the binary data from input file:
with open('MyPlugin.tgz', 'rb') as input_file:
input_data = input_file.read()
# Convert binary to text (based on your original code):
text = binary_to_text(input_data)
print(text[0:100])
# Convert the text back to binary:
output_data = text_to_binary(text)
print(output_data[0:100])
# Write the binary data back to a file:
with open('MyPlugin-restored.tgz', 'wb') as output_file:
output_file.write(output_data)
if __name__ == '__main__':
main()
Note that I only print the first 100 elements to keep the output short. Also notice that the second print-statement prints a much longer text. This is because the first print gets 100 characters (which are printed "as is"), while the second print gets 100 bytes (of which most bytes are escaped, causing the output to be longer).
I have a string:
line="123 123 test testing"
and I want to go through the pieces separated by space and check to see if each piece (123, 123, test, testing) matches a pre-determined sequence.
I know how to tokenize a string:
for token in string.gmatch(line, '%w+') do
    print(token)
end
I am not sure, however, how to iterate through each piece one by one and compare it to my variable, local var.
Basically I want to be able to get each piece of the string and compare it to the variable var's contents.
Here is some pseudo-code:
Read file line by line {
    Split every line by space (each line is in this format: 123 456 string string)
    Var1 = "123"
    Var2 = "456"
    If first token of the line = Var1 then
        If second token of the line = Var2 then
            Print the line
        End
    End
}
-- %f is Lua's frontier pattern: it matches var only as a whole token,
-- including at the very start or end of the line
for v in line:gmatch('%f[%w](' .. var .. ')%f[%W]') do
    print(v)
end
Without more information this is all I can think of.
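To make the pseudo-code above concrete, here is the same logic sketched in Python (shown in Python only for illustration; the file name data.txt is a placeholder, and in Lua the outer loop would be io.lines with the gmatch tokenizer shown earlier):

var1, var2 = "123", "456"

with open("data.txt") as f:
    for line in f:
        tokens = line.split()  # split every line by whitespace
        if len(tokens) >= 2 and tokens[0] == var1 and tokens[1] == var2:
            print(line, end="")  # print the matching line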
I have a text file TF including a set of the following kind of strings:
"linStru.twoZoneBuildingStructure.north.airLeakage.senTem.T",
"linStru.twoZoneBuildingStructure.north.vol.Xi[1]",
"linStru.twoZoneBuildingStructure.south.airLeakage.senTem.T",
"linStru.twoZoneBuildingStructure.south.vol.Xi[1]", "
"linStru.twoZoneBuildingStructure.north_ext.layMul.nMat[1].monoLayer1Nf.T[1]",
"linStru.twoZoneBuildingStructure.north_ext.layMul.nMat[1].monoLayer2Nf.T[2]",
Given a line L, and reading from the end, let the substring s denote the portion of the string between the closing ", and the first . encountered.
To make it clearer: for L=1, s=T; for L=2, s=Xi[1]; for L=5, s=T[1]; etc.
Given a text file TF in the above format, I want to write a MATLAB function which takes TF and replaces the corresponding s on each line with der(s).
For example, the function should change the above strings as follows:
"linStru.twoZoneBuildingStructure.north.airLeakage.senTem.der(T)",
"linStru.twoZoneBuildingStructure.north.vol.der(Xi[1])",
"linStru.twoZoneBuildingStructure.south.airLeakage.senTem.der(T)",
"linStru.twoZoneBuildingStructure.south.vol.der(Xi[1])", "
"linStru.twoZoneBuildingStructure.north_ext.layMul.nMat[1].monoLayer1Nf.der(T[1])",
"linStru.twoZoneBuildingStructure.north_ext.layMul.nMat[1].monoLayer2Nf.der(T[2])",
How can such a function be written?
Something like
regexprep(TF, '\.([^.]+)",$', '.der($1)",', 'dotexceptnewline', 'lineanchors')
It finds the longest sequence of non-dot characters that appears between a dot before it and the quote-comma at the end of the line after it, and encloses that sequence inside der( ).
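If you want a quick cross-check of the pattern outside MATLAB, the same substitution can be sketched with Python's re module (the two sample lines are taken from the question):

import re

text = '''"linStru.twoZoneBuildingStructure.north.airLeakage.senTem.T",
"linStru.twoZoneBuildingStructure.north.vol.Xi[1]",'''

# wrap everything after the last dot of each line in der(...)
print(re.sub(r'\.([^.]+)",$', r'.der(\1)",', text, flags=re.MULTILINE))
# "linStru.twoZoneBuildingStructure.north.airLeakage.senTem.der(T)",
# "linStru.twoZoneBuildingStructure.north.vol.der(Xi[1])",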
I see there is a small " typo on the fourth line of your text file. I'm going to remove this to make things simpler.
As such, the simplest way I can see to do this is to iterate through all of your strings, remove the surrounding quotes, then find the point in each string where the last . occurs. Extract the substring after that point, then manually wrap it in der(). Assuming those strings are in a text file called functions.txt, you would read them in as individual strings using textread. As such:
names = textread('functions.txt', '%s');
names should now be a cell array of names where each element is a string encapsulated in double quotes. Use findstr to find where each . is located, then take the last of those locations. Extract the substring after it, then rebuild the string with der() inserted. In other words:
out_strings = cell(1, numel(names)); %// To store output strings
for idx = 1 : numel(names)
    %// Extract actual string without quotes and comma
    name_str = names{idx}(2:end-2);

    %// Find the last dot
    dot_locs = findstr(name_str, '.');

    %// Last dot location
    last_dot_loc = dot_locs(end);

    %// Extract substring after dot
    last_string = name_str(last_dot_loc+1:end);

    %// Create new string
    out_strings{idx} = ['"' name_str(1:last_dot_loc) 'der(' last_string ')",'];
end
This is the output I get:
celldisp(out_strings)
out_strings{1} =
"linStru.twoZoneBuildingStructure.north.airLeakage.senTem.der(T)",
out_strings{2} =
"linStru.twoZoneBuildingStructure.north.vol.der(Xi[1])",
out_strings{3} =
"linStru.twoZoneBuildingStructure.south.airLeakage.senTem.der(T)",
out_strings{4} =
"linStru.twoZoneBuildingStructure.south.vol.der(Xi[1])",
out_strings{5} =
"linStru.twoZoneBuildingStructure.north_ext.layMul.nMat[1].monoLayer1Nf.der(T[1])",
out_strings{6} =
"linStru.twoZoneBuildingStructure.north_ext.layMul.nMat[1].monoLayer2Nf.der(T[2])",
The last thing you want to do is write each line of text to your text file. You can use fopen to open a file for writing; fopen returns a file ID associated with the file you want to write to. You then use fprintf with this file ID to print each string followed by a newline, and finally close the file with fclose using the same file ID. As such, if we wanted to output a text file called functions_new.txt, we would do:
%// Open up the file and get ID
fid = fopen('functions_new.txt', 'w');

%// For each string we have...
for idx = 1 : numel(out_strings)
    %// Write the string to file and make a new line
    fprintf(fid, '%s\n', out_strings{idx});
end

%// Close the file
fclose(fid);
Another way to do it with regexprep:
str_out = regexprep(str_in, '\.([^\.]+)"$','\.der($1)"');
Example: for
str_in = {'"linStru.twoZoneBuildingStructure.north.airLeakage.senTem.T"'
'"linStru.twoZoneBuildingStructure.north.vol.Xi[1]"'};
this gives
str_out =
'"linStru.twoZoneBuildingStructure.north.airLeakage.senTem.der(T)"'
'"linStru.twoZoneBuildingStructure.north.vol.der(Xi[1])"'
I have a large .csv file separated by tabs, which has a strict structure with colClasses = c("integer", "integer", "numeric"). For some reason, there are a number of irrelevant junk character lines that break the pattern, which is why I get
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
scan() expected 'an integer', got 'ExecutiveProducers'
How can I ask read.table to continue and just skip these lines? The file is large, so it's troublesome to perform the task by hand.
If that's impossible, should I use scan plus a for-loop?
Right now I just read everything as characters, then delete the irrelevant rows and convert the columns back to numeric, which I think is not very memory-efficient.
If your file fits into memory, you could first read the file, remove the unwanted lines, and then parse the remaining lines using read.csv:
lines <- readLines("yourfile")
# remove unwanted lines: select only lines that do not contain
# characters; assuming you have column titles in the first line,
# you want to add those back again; hence the c(1, sel)
sel <- grep("[[:alpha:]]", lines, invert=TRUE)
lines <- lines[c(1,sel)]
# read data from selected lines
con <- textConnection(lines)
data <- read.csv(file=con, [other arguments as normal])
If the junk strings are always the same, or always contain the same word, you can define them as NA values using
read.csv(..., na.strings = "ExecutiveProducers")
and then delete all of those rows afterwards with
na.omit(dataframe)