I've been using Logstash for quite some time. I tried using a custom delimiter in the file input plugin while reading a static file, and I see the plugin reads 32KB of data and passes it to the tokenizer to split on the delimiter:
data = watched_file.file_read(32768)
changed = true
watched_file.buffer_extract(data).each do |line|
  listener.accept(line)
  @sincedb[watched_file.inode] += (line.bytesize + @delimiter_byte_size)
end
What happens when the chunk does not end on a delimiter (i.e. the last bytes are part of a line)? My regex fails on the partial line and skips it, so I lose an event in this case. I have seen this with a custom delimiter, but it could presumably happen with the \n delimiter as well.
Please enlighten me.
Maybe this link will help. Basically, there's a known issue with that modifier.
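For intuition, here's a minimal sketch, in Python, of how a delimiter tokenizer over fixed-size reads typically avoids that problem: the trailing partial token stays in the buffer and is completed by the next chunk instead of being emitted. This illustrates the general technique only, not Logstash's actual buffer_extract implementation.
class BufferedTokenizer:
    def __init__(self, delimiter="\n"):
        self.delimiter = delimiter
        self.buffer = ""
    def extract(self, data):
        # Append the new chunk, split on the delimiter, and keep the last
        # piece buffered: it is either "" or an incomplete token.
        self.buffer += data
        tokens = self.buffer.split(self.delimiter)
        self.buffer = tokens.pop()
        return tokens
    def flush(self):
        # At EOF, whatever remains is the final (possibly partial) token.
        leftover, self.buffer = self.buffer, ""
        return leftover
t = BufferedTokenizer()
print(t.extract("one\ntwo\nthr"))  # ['one', 'two'] -- 'thr' stays buffered
print(t.extract("ee\n"))           # ['three']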
I have a series of log files in text file format.
The document format is this:
[2021-12-11T10:21:30.370Z] Branch indexing
[2021-12-11T10:21:30.374Z] Starting the program with default pipeID
[2021-12-11T10:21:30.374Z] Running with durable level: max_survivbility will make this program crash if left running for 20 minutes
[2021-12-11T10:21:30.374Z] Starting the program with default pipeID
Each line in the document starts with: [2021-12-11T10:21:30.370Z]
I want to remove the first set of characters that represent date and timestamp and have a result something like this:
Branch indexing
Starting the program with default pipeID
Running with durable level: max_survivbility will make this program crash if left running for 20 minutes
Starting the program with default pipeID
Can anyone please explain how I can do this?
I tried to use this method but it doesn't work since I have '[]' in the date stamp.
import re
text = "[2021-12-11T10:21:30.370Z] Branch indexing"
re.sub("[.*?]", "", text)
This doesn't work for me.
If I try the same method on a string like text = "<2021-12-11T10:21:30.370Z> Branch indexing":
import re
text = "<2021-12-11T10:21:30.370Z> Branch indexing"
re.sub("<.*?>", "", text)
It removes <2021-12-11T10:21:30.370Z>. Why does this not work with [2021-12-11T10:21:30.370Z]?
I need help removing every instance of this format "[2021-12-11T10:21:30.370Z]" in all the log files.
Thank you so much.
I'd rather go with a simple solution for this case. Split the string where the ] ends, then strip the second element of the resulting list to remove the leading spaces, and print it. Hope this helps!
import re
text = "[2021-12-11T10:21:30.370Z] Branch indexing"
print(re.split("]", text)[1].strip())
Your current pattern fails because square brackets are regex metacharacters: [.*?] is a character class that matches a single '.', '*' or '?' character, not a bracketed timestamp (which is why <.*?> works, since < and > are ordinary characters). Escape the brackets, run the regex in multiline mode, and keep the timestamp pattern generic:
text = re.sub(r'^\[.*?\]\s+', '', text, flags=re.M)
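Put together, a minimal runnable version over the sample lines from the question (the log variable here is just for illustration):
import re
log = """[2021-12-11T10:21:30.370Z] Branch indexing
[2021-12-11T10:21:30.374Z] Starting the program with default pipeID"""
# ^\[.*?\]\s+ anchors at each line start (re.M), matches the bracketed
# timestamp non-greedily, and consumes the spaces after it.
print(re.sub(r'^\[.*?\]\s+', '', log, flags=re.M))
# Branch indexing
# Starting the program with default pipeID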
I'm a bit new to Azure Data Factory, so apologies if I'm missing anything obvious. I've done several searches and I can't find anything that quite fits.
So the situation is that we have an existing pipeline that will take the path to a csv file and pass this in as a delimited data set. As a sink it is using a parquet data set. This is a generic process that we can pass any delimited file into and it will output it as parquet.
This has been working well but now we have started receiving files with spaces and special characters in the header which causes the output to parquet to fail. Unfortunately we don't have control over the format of the files we receive so I can't handle this at source.
What I would like to do is, on ingestion of the file, replace any spaces and other special characters in the header with an underscore. If I were doing this on-premises I could quickly create a PowerShell script to do it. I had thought about creating a custom task in ADF to call a PowerShell script to do this in the blob storage, but that seems more complicated than it should be. Is there something else I can do to get this process working while keeping it generic?
As @Joel Cochran mentioned, you can use the expression below in a Select transformation to replace spaces and special characters in the header. Note that [^a-zA-Z] replaces every character that is not a letter, including digits; use [^a-zA-Z0-9] if you want to keep digits.
regexReplace($$,'[^a-zA-Z]','_')
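As a quick sanity check of what the pattern itself does (outside ADF), here is the same substitution in Python, on a made-up header string:
import re
# '[^a-zA-Z]' replaces digits too; '[^a-zA-Z0-9]' keeps them.
print(re.sub(r'[^a-zA-Z]', '_', "Q1 Total"))     # Q__Total  (digit lost)
print(re.sub(r'[^a-zA-Z0-9]', '_', "Q1 Total"))  # Q1_Total  (digit kept)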
In the Select transformation, remove the auto mappings and add a new rule-based mapping that uses this expression.
You can't change the output filename directly in the Copy activity, assuming that's the activity you are using.
The workaround is to use a parameter for the output filename that you can clean up.
You can use the Get Metadata activity to get all filenames from the source csv files.
Then loop over these files with a foreach activity.
Within the foreach activity you can set the output filename to the cleaned value.
The function could look like this:
@replace(item().name, ' ', '_')
More information on the replace function
So far I have been parsing the NotesCalendarEntry iCalendar data manually and overwriting certain properties, and it worked fine. Today I stumbled upon a problem: when the appointment has a long summary, the summary gets split across multiple lines, and my parsing goes wrong; it replaces only the part up to the first line break, and the old remainder is still there.
Here's how I do this "parsing":
NotesCalendarEntry calEntry = cal.getEntryByUNID(apptuid);
String iCalE = calEntry.read();
StringBuilder sb = new StringBuilder(iCalE);
int StartIndex = iCalE.indexOf("BEGIN:VEVENT"); // care only about VEVENT
int tmpIndex = sb.indexOf("SUMMARY:") + 8;      // first char after "SUMMARY:"
int LineBreakIndex = sb.indexOf(Character.toString('\n'), tmpIndex);
if (sb.charAt(LineBreakIndex - 1) == '\r')      // take \r\n into account if it exists
    LineBreakIndex--;
sb.delete(tmpIndex, LineBreakIndex); // delete old content
sb.insert(tmpIndex, subject);        // put my new content
It works when line breaks are where they are supposed to be, but somehow, with a long summary name, line breaks are put inside the summary (not literal \r\n characters, but real line breaks).
I split the iCalE string on \r\n and got this (only a part, obviously):
SEQUENCE:6
ATTENDEE;ROLE=CHAIR;PARTSTAT=ACCEPTED;CN="test/Test";RSVP=FALSE:
 mailto:test@test.test
ATTENDEE;CUTYPE=ROOM;ROLE=REQ-PARTICIPANT;PARTSTAT=ACCEPTED
 ;CN="Room 2/Test";RSVP=TRUE:mailto:room2@test.test
CLASS:PUBLIC
DESCRIPTION:Test description\n
SUMMARY:Very long name asdjkasjdklsjlasdjlasjljraoisjroiasjroiasjoriasoiruasoiruoai Mee
 ting long name
LOCATION:Room 2/Test
ORGANIZER;CN="test/Test":mailto:test@test.test
Each line is one array element from iCalE.split("\\r\\n"). As you can see, the SUMMARY field got split into two lines, and a space was added after the line break.
Now I have no idea how to parse this correctly. I thought about finding the index of the next : instead of the next line break, and then finding the first line break before that : character, but that wouldn't work if the summary itself contained a : after the injected line break, and it also wouldn't work for fields like ORGANIZER;CN=, which use ; rather than :.
I tried importing an external ical4j jar into my XPage to overcome this problem, and while everything is recognized in Domino Designer, it resulted in lots of NoClassDefFound exceptions when I tried to reach my XPage service, despite the jars being in the build path and all.
java.lang.NoClassDefFoundError: net.fortuna.ical4j.data.CalendarBuilder
How can I safely parse this manually, or how can I properly import the ical4j jar into my XPage? I only want to modify three fields: DTSTART, DTEND and SUMMARY. With the dates I have had no problems so far. Fields like DESCRIPTION use a literal \n string to mark new lines; it should be the same in the other fields...
Update
So I have read more about iCalendar, and it turns out there is a standard mechanism for this called line folding: a long line is broken with a CRLF immediately followed by a space. I made a while loop that keeps searching until it finds the last line break not followed by a space, and it works great so far. I will use this unless there's a better solution (ical4j is one, but I can't get it working with Domino).
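For reference, unfolding per RFC 5545 is just removing every CRLF that is immediately followed by a space or tab. A minimal sketch of the idea, written in Python here for brevity; in Java the same thing is a single iCalE.replaceAll("\r\n[ \t]", ""):
import re
ics = 'SUMMARY:Very long meeting name that got folded onto a seco\r\n nd line\r\nLOCATION:Room 2/Test\r\n'
# RFC 5545 line folding: a CRLF followed by a space or tab continues
# the previous line, so unfolding simply deletes those characters.
print(re.sub(r'\r\n[ \t]', '', ics))
# SUMMARY:Very long meeting name that got folded onto a second line
# LOCATION:Room 2/Test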
I'm trying to use Node.js to get a specific line of a 48-million-line file for a binary search, but I don't want to read the entire file into memory. Is there some function that will let me read, say, line 30 million? I'm looking for something like Python's linecache module.
Update on how this is different: I would like to avoid reading the entire file into memory. The question this was marked as a duplicate of reads the entire file into memory.
You should use the readline module from Node's standard library. I deal with files of 30-40 million rows in my project, and this works great.
If you want to do that in a less verbose manner and don't mind using a third-party dependency, use the nthline package:
const nthline = require('nthline')
, filePath = '/path/to/100-million-rows-file'
, rowNumber = 42
nthline(rowNumber, filePath)
.then(line => console.log(line))
According to the documentation, you can use fs.createReadStream(path[, options]), where:
options can include start and end values to read a range of bytes from the file instead of the entire file.
Unfortunately, you have to approximate the desired position/line, as there seems to be no seek-like function in Node.js.
EDIT
The above solution works well for lines of fixed length.
A newline character is nothing more than a character like any other, so searching for the n-th newline is like searching for the n-th occurrence of the letter a: there is no way to jump to it without scanning.
Because of that, if your lines have variable length, the only viable approach is to read them one at a time, in a stream, and discard the ones you are not interested in, as sketched below.
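The stream-and-discard approach described above looks like this, sketched in Python since the question mentions linecache; the same pattern works with Node's readline interface:
def nth_line(path, n):
    # Stream the file line by line; only one line is in memory at a time.
    with open(path) as f:
        for i, line in enumerate(f):
            if i == n:
                return line.rstrip("\n")
    raise IndexError("file has fewer than %d lines" % (n + 1))
# nth_line("/path/to/100-million-rows-file", 30_000_000)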
I am generating a CSV file.
I would like to set the delimiter dynamically, that is, read the list-separator configured on the PC and then use it in my CSV.
Is that possible?
No, it's not. You want your CSV to open in Excel, correct? Then use [TAB] as the delimiter and arrange for the file to be saved as ".xls" rather than ".csv". The worst case is a warning message about the file format when the user opens it, but the data will be visible in the cells for sure.
Another approach for excel files is to use Excel(X)ML; it's very simple for raw data (http://office.microsoft.com/en-us/excel-help/overview-of-xml-in-excel-HA010206396.aspx)
I did generate a CSV using ',' as the delimiter, but on my client's side it was not working. I replaced the delimiter with ';' and it works on the client's side.
That means the client expected ; instead of ,. You cannot know which delimiter the client expects unless they tell you. There are countless programs that consume the CSV format, and there's no universal way to figure out which delimiter they expect. Usually it works the other way around: you create the CSV and tell the client which delimiter you used. There's no technical protocol for this, though; CSV is too informal to have any such universal specification.
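If you control the CSV generation, the delimiter is just a parameter you agree on with the consumer. A minimal Python sketch; the ';' here is only an example of such an agreement, not a recommendation:
import csv
rows = [["id", "name"], [1, "Alice"], [2, "Bob"]]
# newline="" stops the csv module from doubling line endings on Windows.
with open("out.csv", "w", newline="") as f:
    csv.writer(f, delimiter=";").writerows(rows)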