One of the data validation steps that we perform is related to 'special characters' in output files. These are text files, pipe delimited. Today, we open the file in UltraEdit and then do a Ctrl+F. These output files range in size, with the largest being over 54GB. Looking for a more efficient (aka automated) approach to this step. Any suggestions?
Using Java, for a 15 GB file it took about 30 seconds (with java.nio.file.Files and java.nio.file.Paths imported):

long found = Files.lines(Paths.get("dummy.txt"))
                  .filter(s -> s.contains("test"))
                  .count();
System.out.println(found);
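If Java isn't available, the same streaming idea works in a short Python script: reading line by line keeps memory constant, so even a 54 GB file is fine. A minimal sketch, where the set of "special characters" and the file path are placeholder assumptions you would replace with your own:

```python
# Scan a pipe-delimited text file line by line and report which lines
# contain any of the flagged characters. The character set below is a
# hypothetical example; substitute your own validation list.
SPECIAL_CHARS = set("~^`")

def find_special_chars(path):
    hits = []
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        for lineno, line in enumerate(f, start=1):
            bad = SPECIAL_CHARS.intersection(line)
            if bad:
                hits.append((lineno, sorted(bad)))
    return hits
```

Because the file is streamed rather than loaded whole, runtime is dominated by disk throughput, which is roughly what the Java version above achieves.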
So I am trying to make an offline dictionary, and as a source for the words I am using a .txt file. I have some questions related to that. How can I find a specific word in my text file and save it in a variable? Also, does the length of my file matter, and will it affect the speed? Here is just a part of my .txt file:
Abendhimmel m вечерно небе.|-|
Abendkasse f Theat вечерна каса.|-|
Abendkleid n вечерна рокля.|-|
Abendland n o.Pl. geh Западът.|-|
What I want is to save the word, for example Abendkasse, and everything else up to the symbol |-| in one variable. Thanks for your help!
I recommend you look at Python's standard library functions on open file objects, readlines() and read(). I don't know how large your file is, but you can usually just read the entire thing into RAM (with read or readlines) and then search through the string you get. Searching can be done with a regex or with a simple loop.
The length of your file will sort of matter, in that opening larger files takes slightly longer. Though usually this is still pretty fast, even for large text files. In fact, in many cases it will be faster to read the entire file first, because once it is in RAM, all operations on it will be much faster.
an example:

with open("yourlargetextfile.txt") as f:
    contents = f.readlines()

for line in contents:
    # split every line into parts on the |-| separator
    parts = line.split("|-|")
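To answer the lookup question directly, here is a minimal sketch in the same spirit, assuming the format shown above (one entry per line, terminated by |-|); the file name is a placeholder:

```python
# Look up a headword in a dictionary file where each line is one entry
# terminated by "|-|", e.g. "Abendkasse f Theat вечерна каса.|-|".
# Returns the whole entry (everything before |-|) in one variable.
def lookup(path, word):
    with open(path, encoding="utf-8") as f:
        for line in f:
            entry = line.split("|-|")[0].strip()
            if entry == word or entry.startswith(word + " "):
                return entry
    return None
```

Scanning line by line like this also avoids holding the whole file in RAM, so the dictionary can grow without changing the code.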
In PyCharm, when searching for files that contain a given text, e.g. by hitting Ctrl+Shift+F, we have the File mask box as in the snapshot below to filter by file name pattern.
I want a NOT filter here, e.g. search only in files that do not match test*.py. How can we achieve this?
Note
Although this question is old, I'll leave the answer here just in case someone else reaches the question.
Solution
The way to exclude results in File mask is by adding a ! before the mask, for example: !*test*.py
However, this might generate an unwanted situation, because it can bring results from configuration files, or temporary files, or any file we don't want. For this, the solution is to have multiple masks at once, and this can be achieved by separating masks with , (comma).
Example
If we want all .py files containing the word def, excluding any file whose name contains test and .py files whose name contains models, we would use the File mask: !*test*, !*models*.py, *.py
Hope this helps!
Hello, friends! Please help me with my issue. I have an application which processes data and generates output files (different formats, but mostly images). In every generated file that application puts its watermark: a string that looks like "03-24-5532 [some Cyrillic text]".
And every time I use that application, I need to edit each file in Photoshop to replace the watermark string with the required one, which takes a lot of time.
Is it possible to search for that substring in the application's binary data files (using a hex editor or something else) and replace it? What is the better way to solve this problem?
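The hex-editor search/replace can be automated with a small script. One important caveat with binaries: changing the byte length of the data usually corrupts the file, so the replacement should be padded to exactly the original length. A minimal Python sketch under that assumption (file path and encoding are placeholders):

```python
# Replace a watermark string inside a binary file without changing its size.
# Works only if the new string is no longer than the old one; the remainder
# is padded with spaces so all byte offsets after it stay intact.
def replace_watermark(path, old, new, encoding="utf-8"):
    old_b = old.encode(encoding)
    new_b = new.encode(encoding)
    if len(new_b) > len(old_b):
        raise ValueError("replacement must not be longer than the original")
    new_b = new_b.ljust(len(old_b), b" ")  # keep the byte length identical
    with open(path, "rb") as f:
        data = f.read()
    with open(path, "wb") as f:
        f.write(data.replace(old_b, new_b))
```

Whether this actually works depends on how the application stores the string (plain bytes vs. compressed or checksummed data), so test on a copy of the file first.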
I have a transformation on Pentaho Data Integration where the first thing I do is I use the "CSV Input" to map my flat file.
I've never had a problem with it on Windows, but now I'm changing the server that Spoon runs on to a Linux server, and I'm having problems with special characters.
The first thing I noticed was that my tables were being updated because the system understood the names as different strings from the ones in my database.
Checking into the problem, I also noticed that if I go to my "CSV Input" -> Preview, it shows the preview of my data with the problem above: special characters are not showing.
Where it should be:
Diretoria de Suporte à Decisão e Aplicação
I used a command to check my file's charset/encoding, and it showed:
$ file -bi foo.csv
text/plain; charset=iso-8859-1
If I open foo.csv on vi, it understands the special characters.
Any idea on what could be the problem or what should I try?
I don't have any data files with this encoding, so you'll have to do some experimenting, but there are some steps designed to deal with these issues.
First, the CSV Input step has a field that allows you to select the encoding of the source file. The Text File Input step has both a "Format" (meaning line terminator) and "Encoding" selector under the "Content" tab.
In Transforms, you have the Change file encoding step under the Utility tab. This step is designed to copy many files while changing their encoding; that's why it's in a transform.
In Jobs, there's the Convert file between Windows and Unix step under the File Management tab, but this appears to only deal with line terminators.
Either way, it appears that if the CSV/Text file input steps don't suit your needs, you'll have to copy the file to a new encoding before reading it in. It will probably be easiest to try handling it with the file input steps first.
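If you do end up re-encoding the file before PDI reads it, a short script can do that outside of Spoon as well. A minimal Python sketch, streaming in chunks so large CSVs fit in constant memory (file names are placeholders):

```python
# Convert a Latin-1 (iso-8859-1) file to UTF-8 by streaming it in 1 MB
# chunks rather than loading the whole file. File names are placeholders.
def convert_to_utf8(src, dst, src_encoding="iso-8859-1"):
    with open(src, "r", encoding=src_encoding) as fin, \
         open(dst, "w", encoding="utf-8") as fout:
        for chunk in iter(lambda: fin.read(1 << 20), ""):
            fout.write(chunk)
```

On Linux the same conversion can be done with iconv; the point is only that the file's encoding must match whatever the CSV Input step is told to expect.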
I have an excel sheet with the following columns for a stock chart:
Open
High
Low
Close
Day Average
How do I use Fortran to pull only the "Day Average" from the Excel file?
I am new to Fortran and haven't been able to find anything that could help, except for the link below, but it's pretty difficult for me to grasp since I am looking for something different from what the link shows:
http://en.wikibooks.org/wiki/Fortran/Fortran_simple_input_and_output
No, contrary to the other answers, CSV is not the easiest file to read. Go to File/Save as/Other Formats and save it as Formatted text (space delimited). Depending on your locale, you will have either a comma or a full stop as the decimal point, so you'll have to either use an external editor to do a simple search/replace or write a Fortran subroutine that goes character by character and replaces every comma with a full stop.
After that it's easy: no ;'s to parse, so you just

program FreeFormat
    implicit none
    ! avoid naming arrays "open"/"close" - that shadows the I/O statements
    real(4), dimension(5) :: opn, high, low, cls, dayaverage
    real(4) :: average
    integer :: i
    open(unit=1, file='filename.prn', status='old')
    do i = 1, 5
        read(1, *) opn(i), high(i), low(i), cls(i), dayaverage(i)
    enddo
    close(1)
    average = sum(dayaverage) / 5
    write(*, '("Average is ", f5.2)') average
end program FreeFormat
You get the point ...
Here are a few links to get you started (Excel/Fortran DLL related) ...
Trouble with file location in excel/fortran dll connection
Fortran DLL & MS Excel
The native binary format of an Excel file will be very difficult to parse. Export the file to text or CSV, as already suggested. CSV will probably be easiest. You probably want to use "list directed IO", which has the source form:
read (unit-number, *) input-list
Fortran list-directed IO reads into the variables in input-list in a very flexible manner. The items in the file should be separated by delimiters such as spaces or commas. For your case, read five variables corresponding to the five columns, in order to reach the 5th one that you want. Fortran IO is record-oriented, so you do one read per line.
You'll have to read and parse the Excel file in Fortran to get the values you want. If you are new to the language, this might be very hard to do. Maybe it's easier to save the Excel sheet in CSV format and parse that? Good luck!
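If you do go the CSV route, another option is to pull the column out before Fortran ever sees the file, leaving Fortran a plain list of numbers to read. A minimal Python sketch, assuming the CSV has a header row containing a "Day Average" column (file name is a placeholder):

```python
import csv

# Extract only the "Day Average" column from a CSV export of the sheet.
# Assumes the first row is a header that names the columns.
def day_averages(path):
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        return [float(row["Day Average"]) for row in reader]
```

The resulting numbers can be written one per line to a scratch file, which the FreeFormat-style program above can then read with a single variable per record.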