get a chunk from a very large file without reading previous lines

I have a very large file, like:
c,c,c,c
v,v,v,v
v,v,v,v
.
.
.
v,v,v,v # line number 10000000
.
.
.
I need to read line 10000000. How can I get it without iterating over the previous lines?
Is there anything like a pointer/address to access it directly?
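With ordinary variable-length lines there is no stored byte address for line 10000000, so without fixed-width records or a pre-built offset index a tool still has to scan the preceding newlines; the usual shortcut is simply to stop as soon as the target line has been read. A minimal sketch with plain sed (the file name is assumed):
# print line 10000000 and quit right afterwards; the earlier lines are
# still scanned for newlines, but never kept in memory
sed -n '10000000{p;q}' file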

How to concatenate files split by round robin?

I split a file using split -n r/12 file; now how do I concatenate these 12 files? I've tried cat <files> and paste <files>, but according to diff the result was different from the original.
How do I concatenate these 12 files so that cmp/diff will show no differences? Any special arguments for paste/cat to use?
Is round-robin splitting an absolute requirement? If not, you might just split into sections:
$ split --number=12 file
This creates 12 files:
$ ls x*
xaa xab xac xad xae xaf xag xah xai xaj xak xal
Now you can concatenate them without any difference:
$ cat x* > file.new
$ diff file file.new
But if there is no way around the round-robin requirement, I would create a bash script - not pretty. Just providing pseudocode here; a rough runnable sketch follows it.
Something like:
Create working directory
Copy all x* files into working directory
Change to working directory
Touch new concatenated file
While all x* files are not empty
    Iterate over files in alpha order
        Remove the first line in the file
        Append the line to the new concatenated file
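A rough bash rendering of that pseudocode, assuming the twelve pieces are named xaa through xal in the current directory and GNU sed is available; the work directory and the name file.new are just placeholders, and the loop is slow for big files because every piece gets rewritten once per line:
#!/bin/bash
# work on copies so the original pieces stay untouched
mkdir -p work && cp x?? work/ && cd work || exit 1
: > file.new                            # start with an empty output file
# xaa always holds at least as many lines as any other piece,
# so it is the last one to become empty
while [ -s xaa ]; do
    for f in x??; do
        if [ -s "$f" ]; then
            head -n 1 "$f" >> file.new  # append this piece's first line
            sed -i '1d' "$f"            # and drop it from the piece (GNU sed)
        fi
    done
done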

grab a number after file extension

I have gigabytes of files with the following naming convention:
H__Flights_SCP_Log_Analysis_Log_Store_Extracted_File_Store_Aircraft_023_Logs_06Apr2021_164418_dtd_slotb_MDN_Gateway_Logs_audit_audit.log.1
The number at the end (1) is important and I need to grab it. Currently my code looks like this:
#!/bin/bash
#this grabs all the log files in the folder that will be converted and places it in a variable
files=$(find ./ -name "*.log*")
#I iterate through each file in files to convert each one
for file in $files
do
#this grabs the file name except the file extension and places it in a variable
name=${file%.*}
#this converts the file and places it in a file with the same name plus a csv extension
ausearch -if "$file" --format csv >> "$name.csv"
done
This works fine to convert the logs and name them except that it does not grab the number at the end of the file extension. How could I grab that?
Continuing with the OP's current use of parameter substitution ...
$ file='H__Flights_SCP_Log_Analysis_Log_Store_Extracted_File_Store_Aircraft_023_Logs_06Apr2021_164418_dtd_slotb_MDN_Gateway_Logs_audit_audit.log.1'
$ mynum="${file##*.}"
$ echo "${mynum}"
1
# or
$ mynum="${file//*./}"
$ echo "${mynum}"
1
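Folded back into the loop from the question, it could look roughly like this; the output name "${name}_${num}.csv" is only a hypothetical choice showing where the captured number could go:
#!/bin/bash
# grab all the log files, as in the question
files=$(find ./ -name "*.log*")
for file in $files
do
    num="${file##*.}"        # the trailing number after the last dot, e.g. 1
    name="${file%.*}"        # the path without that trailing .1
    ausearch -if "$file" --format csv >> "${name}_${num}.csv"
done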

Splitting large tar file into multiple tar files

I have a tar file which is 3.1 TB (terabytes).
File name - Testfile.tar
I would like to split this tar file into 2 parts - Testfile1.tar and Testfile2.tar
I tried the following so far
split -b 1T Testfile.tar "Testfile.tar"
What I get is Testfile.taraa (what is "aa"?).
And I just stopped my command. I also noticed that the output Testfile.taraa doesn't seem to be a tar file when I do ls in the directory; it looks like a text file. Maybe once the full split is completed it will look like a tar file?
The behavior of split is correct; from the man page online: http://man7.org/linux/man-pages/man1/split.1.html
Output pieces of FILE to PREFIXaa, PREFIXab, ...
Don't stop the command; let it run, and then you can use cat to concatenate (join) the pieces back together again.
Examples can be seen here: https://unix.stackexchange.com/questions/24630/whats-the-best-way-to-join-files-again-after-splitting-them
split -b 100m myImage.iso
# later
cat x* > myImage.iso
UPDATE
Just as clarification, since I believe you have not understood the approach: you split a big file like this to transport it, for example; the pieces are not usable on their own. To use the archive again you need to concatenate (join) the pieces back together. If you want usable parts, then you need to decompress the file, split it into parts, and compress each part separately. With split you basically split the binary file, and I don't think you can use those pieces by themselves.
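For the command from the question, the whole round trip could look roughly like this; the trailing dot on the prefix and the name Testfile.joined.tar are only assumptions to keep the piece names readable:
# cut the archive into 1 TiB pieces named Testfile.tar.aa, Testfile.tar.ab, ...
split -b 1T Testfile.tar "Testfile.tar."
# later (after transport, for example), glue the pieces back together and verify
cat Testfile.tar.?? > Testfile.joined.tar
cmp Testfile.tar Testfile.joined.tar && echo "identical"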
You are creating the tar archive first and splitting it afterwards.
If you want each part to be a tar file, you should use split first on the original file, and then run tar on each part.
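A small sketch of that idea, assuming the un-tarred source is a single file called original_file (a made-up name) and GNU split is available:
# cut the original (un-tarred) file into two pieces ...
split -n 2 original_file piece_
# ... and archive each piece separately, so both parts are valid tar files
tar -cf Testfile1.tar piece_aa
tar -cf Testfile2.tar piece_ab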

Append and read last line of file

Is there a way to append to a file and read it with the same "open" file command in Python 3.7? I mean I do not want to have two open statements, one for open("//", "a") and one for open("//", "r"). What I am trying to achieve is to run a script which appends the output to the file and then reads the last line of the file. "a+" does not help; it gives an "index out of range" error for readlines()[-1].
Thanks in advance.
Opening the file in a+ makes the file pointer point to the end of the file, which makes it hard to read the last line. You can instead open the file in r+ mode, iterate over the file object until you obtain the last line, and then append the additional output to the file:
with open('file', 'r+') as file:
    for line in file:
        pass
    file.write(output)
# variable line now holds the last line
# and the file now has the content of output at the end

qsub: Specifying non-consecutive datasets with the -t option

I'm submitting jobs to a Sun Grid Engine using the qsub command. The -t option to qsub enables me to specify the datasets upon which I want to call my script -- e.g.,
$ qsub . . . -t 101-103 my_script.sh
My question is, is it possible to specify non-consecutive datasets with the -t option? For example, say I wanted to run the script on 101 and 103, but not 102. How would I accomplish that?
And, more generally, how would I select arbitrarily numbered datasets?
I would like an answer that works in practice for a large number of datasets -- far beyond the two used in this toy example.
Not sure about that, but quoting from qsub's man page, in the paragraph where -t is explained:
. . .
The task id range specified in the option argument may be a single
number, a simple range of the form n-m or a range with a step size.
Hence, the task id range specified by 2-10:2 would result in the
task id indexes 2, 4, 6, 8, and 10, for a total of 5 identical tasks,
. . .
So, maybe:
$ qsub . . . -t 101-103:2 my_script.sh
would do.
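For reference, Grid Engine hands each task its index in the SGE_TASK_ID environment variable, so with -t 101-103:2 the script runs once with SGE_TASK_ID=101 and once with SGE_TASK_ID=103. A minimal sketch of what my_script.sh might do with it; the dataset_<id> naming is only an assumption:
#!/bin/bash
# each array task sees its own index in SGE_TASK_ID (here: 101 or 103)
dataset="dataset_${SGE_TASK_ID}"   # hypothetical naming scheme
echo "processing ${dataset}"
# ... real processing of ${dataset} would go here ...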
If the goal is to run regularly spaced datasets -- e.g., 1, 3, 5, . . . or 10, 15, 20, . . . -- then #chrk's answer is the one to use.
For arbitrarily numbered datasets, using -t is not possible. The same functionality can be attained, however, using the submit command (with the -f option) instead of qsub.
$ submit . . . -s my_script.sh -f my_datasets.txt
The file my_datasets.txt contains one dataset per line, as in
101
103
I'm not sure how specific this solution is to the particular configuration of my computing environment.

Resources