I have a project that involves extracting data from a database into a text file and then ingesting it into Hadoop. I want to create a shell script that NiFi can run automatically to check whether a text file has been extracted and then ingest it, but I need to make sure that all of the data has been extracted before ingesting it. That means I would need to check that the text file has an EOF. How do I do that?
I don't have any code yet, as I have very little experience writing shell scripts.
While creating the file, use a different name. Rename it to the expected name once the extraction is done. Then, the other process can start its work once the file exists.
EOF is not something that actually gets put in the text file - in fact, there isn't really any EOF value. EOF or end-of-file is a condition that occurs when you try to consume input from a source that has none to give.
There is no general marker you can look for in your text files that will tell you whether they are complete. You'll need to make your script indicate in some other way when a given chunk of data has been fully extracted. There are many possibilities: you could change the name of the file as choroba suggested, you could create a lock file and remove it once the data extraction is done, or you could have your extraction program write a distinctive sequence of bytes to the end of the file, and so on.
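The rename approach can be sketched as follows. This is only an illustration in Python rather than shell; the function names, the ".part" suffix and the row format are assumptions made for the example, not part of the original setup. The point it shows is that the final name only appears once the writer has finished, so the reader can simply test for the file's existence.

```python
import os


def extract_to_file(final_path, rows):
    """Write rows under a temporary name, then rename to the expected name."""
    tmp_path = final_path + ".part"  # hypothetical suffix for in-progress files
    with open(tmp_path, "w") as fh:
        for row in rows:
            fh.write(row + "\n")
        fh.flush()
        os.fsync(fh.fileno())  # push the data to disk before publishing the file
    # On the same filesystem the rename is atomic, so a reader sees either
    # no file at all or the complete file.
    os.replace(tmp_path, final_path)


def ready_for_ingest(final_path):
    """The ingesting side only has to check that the final name exists."""
    return os.path.exists(final_path)
```

The ingesting process would poll ready_for_ingest() (or the shell equivalent, a test for the file's existence) and ignore anything that still carries the temporary suffix.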
Related
Long story short, after a crash course in Python/BeautifulSoup, I managed to create a script to take an input text file that contains a list of URLs (1 on each line), scrape the URL, and write output to a database. There are some cases where I want an error to exit the script (including some trapped errors as well as unexpected), but as the list of URLs to scrape is pretty large, it would be handy if I could edit the input text file (or create a copy and edit that) to remove each URL as it is successfully processed. The idea being that if the script exits (by trap or crash), I'd have a list of the URLs left to be processed. Is something like this possible? I can find code samples to edit the text file, but I'm getting stuck at how to take out the row just processed.
I finally came across a post that achieves this, though I'm not sure it's the most efficient way, since it reads the entire file and rewrites it each time. That may be the best that can be done in Python. In my case, the file is in the 1200-line range, so it easily fits into memory.
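A minimal sketch of that read-everything-and-rewrite approach might look like this. The names process_urls and handle are hypothetical, and handle stands in for the scraping code, which is not shown in the question. If handle raises, the list file still contains the failing URL and everything after it.

```python
import os


def process_urls(list_path, handle):
    """Process URLs one by one, removing each from the file once it succeeds."""
    with open(list_path) as fh:
        pending = [line.strip() for line in fh if line.strip()]

    while pending:
        handle(pending[0])  # may raise; the file is left untouched in that case
        pending.pop(0)
        # Rewrite the remaining URLs to a temporary file and swap it in, so a
        # crash in the middle of the write cannot truncate the list.
        tmp_path = list_path + ".tmp"
        with open(tmp_path, "w") as fh:
            fh.writelines(url + "\n" for url in pending)
        os.replace(tmp_path, list_path)
```

Rewriting the file once per URL costs time proportional to the list length, which should be negligible for a list of around 1200 lines.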
I have a series of applications on Linux systems whose logs I need to basically constantly 'stream' out, or even just 'tail' out, but the challenge is that the filenames are constantly rolling and changing.
They are all date encoded (with the dates in different formats), and each then has a different increment format.
Most of them start with one and increase, but one doesn't have an extension and only adds one after the first file, and another increments a number but, once it hits 99, increments an alpha character and resets the number to 01 before counting up again, because it rolls so quickly.
I just have the OS level shell scripting, OS command line utilities, and perl available to me to handle this situation for another application to pickup and read these logs.
The new files are always created right when it starts writing to the new file and groups of different logs (some I am reading some I am not) are being written to the same directory so I cannot just pickup anything hitting the directory.
If I simply 'tail -n 1000000 -f |' them today, this works fine for the reader application I am using until the file changes. I cannot set up file lists or ranges within the reader application, but I can pre-process the logs so they appear as a continuous stream to the reader, rather than having the reader invoke commands to read them directly. A simple Perl log reader also works fine for a static filename but not for dynamic ones. It is critical that I don't re-process any log lines and only capture new lines being written to the logs.
I admit I am not any form of Perl guru, and the best answers or clues I've been able to find so far involve Perl's glob function, but the examples I've found basically reprocess all of the files on each run and then seem to stop.
Example file names I am dealing with across the multiple apps I am trying to handle:
appA_YYMMDD.log
appA_YYMMDD_0001.log
appA_YYMMDD_0002.log
WS01APPB_YYMMDD.log
WS02APPB_YYMMDD.log
WS03AppB_YYMMDD.log
APPCMMDD_A01.log
APPCMMDD_B01.log
YYYYMMDD_001_APPD.log
As denoted above the files do not have the same inode and simply monitoring the directory for change is not possible as a lot of things are written there. On the dev system it has more than 50 logs being written to the directory and thousands of files and I am only trying to retrieve 5. I am seeing if multitail can be made available to try that suggestion but it is not currently available and installing any additional RPMs in the environment is generally a multi-month battle.
ls -i
24792 APPA_180901.log
24805 APPA__180902.log
17011 APPA__180903.log
17072 APPA__180904.log
24644 APPA__180905.log
17081 APPA__180906.log
17115 APPA__180907.log
So really the root of what I am trying to do is simply a continuous stream regardless if the file name changes and not have to run the extract command repeatedly nor have big breaks in the data feed while some script figures out that the file being logged to has changed. I don't need to parse the contents (my other app does that).. Is there an easy way of handling this changing file name?
How about monitoring the log directory for changes with Linux inotify, e.g. Linux::Inotify2? Then you could detect when new log files are created, stop reading from the old log file and start reading from the new log file.
Try tailswitch. I created this script to tail log files that are rotated daily and have YYYY-MM-DD on their names. To use this script, you just say:
% tailswitch '*.log'
The quoting prevents the shell from expanding the glob pattern itself. The script re-evaluates the glob pattern from time to time and switches to a newer file based on its name.
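The idea of periodically re-globbing and switching files can be sketched in Python as below. This is not how tailswitch itself is implemented; it is only an illustration, and it assumes that sorting the matching names lexically puts the newest file last, which holds for names like appA_YYMMDD.log but may not hold for every naming scheme in the question. The function name poll_once and the state dictionary are hypothetical.

```python
import glob
import os


def poll_once(pattern, state):
    """Run one polling pass and return the complete new lines found.

    `state` is a dict holding the current "path" and the byte "offset"
    already consumed, so no line is returned twice.
    """
    lines = []
    files = sorted(glob.glob(pattern))
    if not files:
        return lines

    if state.get("path") is None:
        # First pass: start at the end of the newest file so that
        # existing lines are not re-processed.
        state["path"] = files[-1]
        state["offset"] = os.path.getsize(files[-1])
        return lines

    # Finish the current file, then move on to any file that sorts after it.
    for path in (f for f in files if f >= state["path"]):
        if path != state["path"]:
            state["path"], state["offset"] = path, 0
        with open(path, "rb") as fh:
            fh.seek(state["offset"])
            chunk = fh.read()
        end = chunk.rfind(b"\n") + 1  # consume only complete lines
        state["offset"] += end
        lines.extend(chunk[:end].decode(errors="replace").splitlines())
    return lines
```

A wrapper would call poll_once() in a loop with a short sleep and print each returned line to stdout, so the reader application sees one continuous stream.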
I have to create two boilerplate files, both of which always have the same content, with the EXCEPTION of a single word. I'm thinking of creating a command or something that I can run in the Linux terminal (Ubuntu), along with an argument that represents the one word which can vary in the files created. Perhaps a batch file will accomplish this, but I don't know what it will look like.
I will be able to run this command every time I create these boilerplate files, instead of pasting the boilerplate and changing the one word in the file that has to be changed.
These file paths relative to my current working directory are:
registration.php
etc/module.xml
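A simple Python script that reads in the file as a string and replaces the occurrence would probably be the quickest. Something like:
# Read the whole file, swap the word, and write the result back in place.
with open('somefile.txt', 'r+') as inputFile:
    txt = inputFile.read().replace('someword', 'replacementword')
    inputFile.seek(0)
    inputFile.write(txt)
    # Drop leftover bytes if the new text is shorter than the old one.
    inputFile.truncate()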
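Applied to the two files from the question, the same idea could generate both boilerplate files from templates in one call. This is a rough sketch: the template bodies and the __WORD__ marker are placeholders invented for the example, since the real boilerplate is not shown in the question; only the two relative paths come from it.

```python
import os

PLACEHOLDER = "__WORD__"  # hypothetical marker for the one word that varies

# Placeholder template bodies; paste the real boilerplate here.
TEMPLATES = {
    "registration.php": "<?php\n// boilerplate for __WORD__\n",
    os.path.join("etc", "module.xml"): "<!-- boilerplate for __WORD__ -->\n",
}


def write_boilerplate(word, base_dir="."):
    """Create both boilerplate files under base_dir with `word` filled in."""
    created = []
    for rel_path, template in TEMPLATES.items():
        path = os.path.join(base_dir, rel_path)
        parent = os.path.dirname(path)
        if parent:
            os.makedirs(parent, exist_ok=True)  # creates etc/ when needed
        with open(path, "w") as fh:
            fh.write(template.replace(PLACEHOLDER, word))
        created.append(path)
    return created
```

Calling write_boilerplate("MyWord") from the working directory would then create registration.php and etc/module.xml with the word substituted.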
Sorry if this belongs on serverfault
I'm wondering what the proper way is to use an SVG(xml) string as standard input
for a "convert msvg:- jpeg:- 2>&1" command (using linux)
Currently I'm just saving a temp file to use as input,
but the data originates from an API in my case, so feeding
the string directly to the command would obviously be most efficient.
I appreciate everyone's help. Thanks!
This should work:
convert - output.jpg
Example:
convert logo: logo.svg
cat logo.svg | convert - logo.jpg
Explanation:
The example's first line creates an SVG file and writes it to disk. This is only a preparatory step so that we can run the second line.
The second line is a pipeline of two commands: cat streams the bytes of the file to stdout (standard output).
This first command serves only as preparation for the next command in the pipeline, so that this next command has something to read in.
This next command is convert.
The - character is a way to tell convert to read its input data not from disk, but from stdin (standard input).
So convert reads its input data from its stdin and writes its JPEG output to the file logo.jpg.
So my first command/line is similar to your step described as 'currently I'm just saving a temp file to use as input'.
My second command/line does not use your API (I don't have access to it, do I?), but it demonstrates a different method to 'feeding a string directly to the command'.
So the most important lesson is this: wherever convert would usually read input from a file and you would write the file's name on the command line, you can replace the filename with - to tell convert to read from stdin. (But you need to make sure that there is actually something offered on convert's standard input which it can digest...)
Sorry, I can't explain better than this...
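If the SVG string already lives inside a program, it can be handed to the command's stdin without a temp file or cat. Here is a minimal sketch in Python; the helper name pipe_through is hypothetical, and the program that receives the API response is assumed to be able to spawn processes.

```python
import subprocess


def pipe_through(command, text):
    """Run `command`, feed `text` to its stdin, and return its stdout bytes."""
    result = subprocess.run(
        command,
        input=text.encode("utf-8"),  # the string goes straight to stdin
        capture_output=True,         # stdout and stderr are captured separately
        check=True,                  # raise CalledProcessError on failure
    )
    return result.stdout
```

With the command from the question, the call would be something like jpeg_bytes = pipe_through(["convert", "msvg:-", "jpeg:-"], svg_string), where svg_string holds the data returned by the API.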
I'm making a file monitor for a folder where I download subtitles. So far, it works like this:
Look for new .rar files in the folder.
If found, extract the subtitles and delete the .rar file
If a single .srt file was extracted, save the file name to a variable.
Now, I'm clueless about how to achieve the next (and final) part of the script:
I want to find a pattern based on the way subtitles are named.
Let's say, the subtitles file can be something like this:
SomeShow.1x03.stuff.srt
some_show s01e03-stuff.srt
some show 1-03 stuff.srt
etc.
I want to get something like: SomeShow 1 3 and based on that, start the video with the name that matches that pattern, which I guess would be a matter of reversing the process that was used to get the Show, season and episode based on the name of the .srt file.
Is this possible at all? It'd be really simple stuff in most languages, but I really need this to be a .bat and I'm clueless about how to approach this... so far all I've managed to do is to remove the extension from the variable.
Thanks in advance.
Batch files are Turing complete: you can do anything in them, but it is usually not wise to go to extremes. You might be able to package sed, grep or your own binary alongside your .bat file as a good compromise between batchiness and function. If you can assume a suitable operating system, you will have PowerShell installed and can go that route.
You should recognize that the task is not exactly defined, and that the "solution" may need some tweaking and may never be robust enough.
For this reason, the richer the language you pick, the further you will get.
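To illustrate how much a richer language helps, here is a rough Python sketch that pulls the show, season and episode out of the three example names from the question. The regular expression is an assumption built from those examples only and will likely need tweaking for other naming styles; the function name parse_subtitle_name is hypothetical.

```python
import re

# Show name, a separator, then either "s01e03" or "1x03" / "1-03".
PATTERN = re.compile(
    r"^(?P<show>.+?)[ ._-]+"
    r"(?:s(?P<s1>\d{1,2})e(?P<e1>\d{1,2})"
    r"|(?P<s2>\d{1,2})[x-](?P<e2>\d{1,2}))",
    re.IGNORECASE,
)


def parse_subtitle_name(filename):
    """Return (show_key, season, episode), or None if the name does not match."""
    match = PATTERN.match(filename)
    if match is None:
        return None
    season = int(match.group("s1") or match.group("s2"))
    episode = int(match.group("e1") or match.group("e2"))
    # Normalise the show name so "SomeShow", "some_show" and "some show"
    # all compare equal.
    show_key = re.sub(r"[^a-z0-9]", "", match.group("show").lower())
    return show_key, season, episode
```

Running the same function over the video file names and comparing the returned tuples would give the matching video for a given .srt file.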