i have to process 300+ HTML files, extract a string from each one and place it in a separate text file for import downstream. upside: the string format is identical in each file and is +/- two lines from the same position as well.
i thought maybe using Python, but then i thought Perl might be a better way since this kinda plays to its backyard.
sadly, i have no access to UNIX/LINUX or i'd just grep it...
this is such an odd client request that i'm a bit goggle-eyed ATM.
so: what is the best way to extract a target string from a BATCH of files?
WR!
If you give us more details (e.g. the path and names of the files, the string you want to extract, etc.), perhaps I can write a Windows Batch .BAT file to achieve this task...
EDIT
To write a Batch file that runs successfully I need a couple of additional details, so I made some assumptions. You may help me fix them. This is my method:
Look for a line that contains ">Text link<". I assume there is just one; this can be fixed.
Read the next line. I assumed that each td is on its own line; this can be fixed.
In this line, remove the text from the beginning of the line up to the value string.
Replace quotes with $ (the next step cannot process quotes).
Get the text between the $ characters; this is the result.
The for /F skip... command may read the wrong line if the file contains empty lines; this can be fixed.
@echo off
setlocal DisableDelayedExpansion
findstr /n ">Text link<" thefile.htm > linefound.tmp
for /F "delims=:" %%a in (linefound.tmp) do set lineNo=%%a
for /F "skip=%lineNo% delims=" %%a in (thefile.htm) do (
set "theLine=%%a"
goto continue
)
:continue
setlocal EnableDelayedExpansion
set theLine=!theLine:*value=!
set theLine=!theLine:"=$!
for /F "tokens=2 delims=$" %%a in ("!theLine!") do set URL=%%a
echo Result: %URL%
EDIT no. 2
You are confusing me. Did the first code work or not? The second example you posted in the comments does not seem to be related to the first one (is the data within the second <td> or after [url=http://?). Is it the same problem or a different one? Please, don't assume I know about the HTML file format (I don't). I DO know about Batch files, but I can't guess what to do if I don't have complete details...
The following Batch file shows everything between square brackets that appears IN THE SAME LINE as the [url=http:// string, in the file given in the first parameter:
@echo off
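rem /L and /C: make findstr treat the search string literally, so "[" is not read as a regex character class
for /F "tokens=2 delims=[]" %%a in ('findstr /n /l /c:"[url=http://" %1') do echo %%a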
As you're already familiar with Grep, why not use a Windows port, such as the Grep in GnuWin32?
Another great way to get a ton of *nix functionality in Windows is Cygwin http://www.cygwin.com
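Since the question mentions Python, here is a minimal sketch of the same method as the Batch answer above (find the marker line, read the next line, take the quoted text after value), applied to a whole folder. The ">Text link<" marker and the value="..." layout are that answer's assumptions, not confirmed details of the real files; the folder name, the output file name, and the function names are hypothetical.

```python
import re
from pathlib import Path

MARKER = ">Text link<"  # assumed marker, taken from the Batch answer
VALUE_RE = re.compile(r'value\s*=\s*"([^"]*)"')  # assumed layout: value="..."


def extract_value(lines, marker=MARKER):
    """Return the quoted value on the line after the first marker line, or None."""
    for i, line in enumerate(lines[:-1]):
        if marker in line:
            match = VALUE_RE.search(lines[i + 1])
            return match.group(1) if match else None
    return None


def extract_from_folder(folder, out_path):
    """Write 'file name<TAB>extracted string' for every HTML file in folder."""
    with open(out_path, "w", encoding="utf-8") as out:
        for path in sorted(Path(folder).glob("*.htm*")):
            text = path.read_text(encoding="utf-8", errors="replace")
            value = extract_value(text.splitlines())
            if value is not None:
                out.write(f"{path.name}\t{value}\n")


# Hypothetical usage:
# extract_from_folder("html_files", "extracted.txt")
```

Files where the marker or the value is not found are simply skipped, so it is worth comparing the number of output lines with the number of input files.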
Related
I am trying to get this batch file to work but am not sure exactly what to do from here...
What I need is to have a batch file add extensions to multiple text strings, half of which contain special characters, then output them to a new txt file.
With this batch the way it is now, it will add the extension to the strings in the text file, and also output the new txt file, but it will pass over the ones that have special characters.
Here is what I have:
@echo off
setlocal EnableDelayedExpansion
set addtext=.jpg
for /f "delims=*" %%a in (list.txt) do (echo/|set /p =%%a%addtext% & echo\)
>>new_list.txt
Any Help is Over Appreciated!
THANKS...
Would
for /f "delims=*" %%a in (list.txt) do (echo(%%a%addtext%)>>new_list.txt
Do what you want?
Without examples of what's in your input file and what you require in the output, we have to analyse the code for what's in your head - and remember, what's in your code doesn't do what's in your head. If it did, you wouldn't be asking your question.
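For comparison only, and assuming Python happens to be available on the machine: appending an extension to every line needs no escaping there, so characters such as &, % or ! pass through untouched. This is a sketch, not a replacement for the Batch answer; the function name is hypothetical and the file names are the ones from the question.

```python
def add_extension(lines, ext=".jpg"):
    """Return each non-blank line with ext appended; special characters are kept as-is."""
    return [line.rstrip("\r\n") + ext for line in lines if line.strip()]


# Hypothetical usage with the file names from the question:
# with open("list.txt", encoding="utf-8") as src:
#     new_lines = add_extension(src)
# with open("new_list.txt", "a", encoding="utf-8") as dst:
#     dst.write("\n".join(new_lines) + "\n")
```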
I need a method to split files into multiple (or even half) based on KB not on number of lines.
I am a Senior EDI Analyst and wrapped data tends to show up as one single long line. Every "solution" I find splits based on number of lines. I need something that will split based on size.
The end-goal is to "Unwrap" this data, meaning each segment will be on its own line. To do this I need to change the delimiters (as there are "special characters" as delimiters).
I do have a solution for that (see below), but for some reason this will not work on files larger than 10 KB. If you know anything about EDI, that's not very big.
I need to find a solution to split files into smaller files of about 5KB each (then I can use the string replacement and re-combine them myself).
Does anyone have an idea of how I might accomplish this with one, huge line?
(Sorry I have to remove the code I placed here only as AN EXAMPLE because someone flagged this as a duplicate WITHOUT READING IT. Please read above and advise.)
The reason you cannot process files larger than 10 KB is that batch variables (and command lines) are limited to ~8191 bytes.
You are attacking the problem in an inefficient way. Rather than look for a way to split a file into chunks so that you can use your slow batch "solution", you should be looking for a tool that allows you to work with the large files directly, without resorting to splitting, processing, and re-assembly.
As others have stated, PowerShell, JavaScript, and VBS are all good scripting languages that can solve your problem, and they are native to Windows.
If your files are all less than 1 gigabyte in length, then I suggest you try JREPL.BAT - a regex text processing utility. It is pure script (hybrid batch/JScript) that runs natively on any Windows machine from XP onward - no 3rd party exe file required. Full documentation is available from the command line via jrepl /?, or jrepl /?? for paged help.
To Unwrap a file, translating | into *\r\n (\r is carriage return, and \n a newline):
jrepl "|" "*\r\n" /l /m /x /f "wrappedFileName" /o "unwrappedFileName"
To wrap a file (reverse the process)
jrepl "*\r\n" "|" /l /m /x /f "unwrappedFileName" /o "wrappedFileName"
If you put either command within a batch script, then you must use call jrepl instead of jrepl. This is because JREPL is also a batch script, so control will not return to your script unless you use CALL.
Although your description is extensive, several points are not clear, and there are too many unrelated details that distract from the core of the problem. If each segment in the line is separated by a | delimiter (you did not explain this point, but it is assumed from the example code) and you want to split the file based on a certain KB size (you did not specify how many KB), then a segment may be split across two different files. Also, I don't understand how replacing the | delimiters with asterisks helps to solve the problem. After reading this question several times, I assumed that the problem is this:
"Split a file that contains just one very long line (with not a single CR+LF pair) into segments delimited by the | character, so each segment will be on its own line".
The Batch file below is a solution for this problem:
@echo off
setlocal EnableDelayedExpansion
call :ProcessFile < input.txt > output.txt
goto :EOF
:ProcessFile
set "previous="
:nextChunk
rem Read the next 1023-bytes chunk
set /P "chunk="
if errorlevel 1 goto endOfFile
rem Break segment if previous one ends at a chunk limit
if "!chunk:~0,1!" equ "|" if defined previous (
echo !previous!
set "previous="
)
rem Extract each segment from the chunk and place it on its own line
set "last="
for /F "delims=" %%a in (^"!chunk:^|^=^
% This line separate segments by the given delimiter %
!^") do (
if defined last echo !last!
set "last=!previous!%%a"
set "previous="
)
set "previous=!last!"
goto nextChunk
:endOfFile
rem Show the last segment
if defined previous echo !previous!
exit /B
EDIT: JScript solution added
As others have mentioned, you may also use a solution based on JScript, which is a standard programming language preinstalled in all Windows versions from XP on. This way the solution is really simple, because you just need to insert the following two lines in your Batch file:
echo WScript.Stdout.Write(WScript.Stdin.ReadAll().replace(/\^|/g,"\r\n")) > replace.js
cscript //nologo replace.js < input.txt > output.txt
This is a very simple, but powerful method that you may use in other similar replace operations; just read the corresponding documentation.
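The same replace-the-delimiter idea can be sketched in Python as well, if it is available. Unlike ReadAll(), this version reads the input in fixed-size chunks, so the single huge line never has to fit in memory at once. It assumes, as the answer above does, that | is the segment delimiter; the function name and chunk size are hypothetical choices.

```python
def unwrap_stream(fin, fout, delimiter=b"|", chunk_size=64 * 1024):
    """Copy fin to fout, replacing every delimiter byte with CR+LF.

    Both arguments are binary file objects. The delimiter must be a single
    byte, so a chunk boundary can never fall in the middle of it.
    """
    if len(delimiter) != 1:
        raise ValueError("delimiter must be exactly one byte")
    for chunk in iter(lambda: fin.read(chunk_size), b""):
        fout.write(chunk.replace(delimiter, b"\r\n"))


# Hypothetical usage with the file names from the Batch example:
# with open("input.txt", "rb") as fin, open("output.txt", "wb") as fout:
#     unwrap_stream(fin, fout)
```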
Split file into 5kB chunks:
set file="x.edb"
set max=5000
REM Findstr line limit 8k
REM Workaround: wrap in an archive to generate CRLF pairs for chunks > 8kB
for %i in (%file%) do (
set /a num=%~zi/%max% >nul &REM No. of chunks
set /a last=%~zi%%max% >nul &REM size of last chunk
if %last%==0 set /a num=num-1 &REM ove zero byte chunk
set size=%~zi
)
ren %file% %file%.0
for /l %i in (1 1 %num%) do (
set /a s1=%i*%max% >nul
set /a s2="(%i+1)*%max%" >nul
set /a prev=%i-1 >nul
echo Writing %file%.%i
type %file%.!prev! | (
(for /l %j in (1 1 %max%) do pause)>nul& findstr "^"> %file%.%i)
FSUTIL file seteof %file%.!prev! %max% >nul
)
if not %last%==0 FSUTIL file seteof %file%.%num% %last% >nul
echo Done.
Tested on Win 10
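For readers who can use Python, splitting purely by size can be sketched as below. It is not a translation of the commands above: it leaves the original file untouched and writes numbered copies next to it. The 5000-byte limit mirrors max above; the function name is hypothetical. As noted in another answer, a split by size can cut a segment in two, so the pieces must be re-joined in order before any per-segment processing.

```python
from pathlib import Path


def split_file(path, max_bytes=5000):
    """Split path into path.0, path.1, ... of at most max_bytes each.

    Returns the list of part paths in order; the original file is not modified.
    """
    src = Path(path)
    parts = []
    with src.open("rb") as fin:
        index = 0
        while True:
            chunk = fin.read(max_bytes)
            if not chunk:
                break
            part = src.with_name(f"{src.name}.{index}")
            part.write_bytes(chunk)
            parts.append(part)
            index += 1
    return parts


# Hypothetical usage with the file name from the example above:
# split_file("x.edb", max_bytes=5000)
```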
I have been searching for hours, however most of the results give examples that deal with directories. I need to read a text file to achieve this and extract after the last \ and output into a new file
Below is what my file contains in a text file
HKEY_LOCAL_MACHINE\SOFTWARE\Policies\Microsoft\Windows\CurrentVersion\Internet Settings\ZoneMap\Domains\yahoo.com
HKEY_LOCAL_MACHINE\SOFTWARE\Policies\Microsoft\Windows\CurrentVersion\Internet Settings\ZoneMap\Domains\NYU.edu
HKEY_LOCAL_MACHINE\SOFTWARE\Policies\Microsoft\Windows\CurrentVersion\Internet Settings\ZoneMap\Domains\openssl.org
I need to extract all the different strings after the last \ in the string. How do I go about doing this?
I tried this one, but it is not working properly: "Extracting string after last instance of delimiter in a Batch file".
This is on Windows 7 x64 SP1 and I can't install other software to achieve this either.
This should work:
@ECHO OFF
SETLOCAL EnableDelayedExpansion
SET "inFile=C:\SomePath\someFile.txt"
SET "outFile=C:\SomePath\anotherFile.txt"
REM Create or empty the output file
TYPE NUL>"%outFile%"
REM usebackq makes FOR /F read the quoted name as a file, not as a string
FOR /F "usebackq tokens=*" %%L IN ("%inFile%") DO (
ECHO %%~nxL>>"%outFile%"
)
You'll have to replace inFile and outFile with proper paths. The quotes around %inFile% and %outFile%, together with usebackq, allow the paths to contain spaces.
%outFile% will always be overwritten, so make sure you don't lose your data!
I was given this code by Aacini (thanks!) but I don't know how to set which text file to search for the data.
@echo off
setlocal EnableDelayedExpansion
set "line=Username:Desired Information</br><br>"
set "string1=Username:"
set "string2=</br><br>"
rem Remove from beginning until string1
set "line=!line:*%string1%=!"
rem Change the string2 by a one character delimiter
set "line=!line:%string2%=|!"
rem Get the desired information
for /F "delims=|" %%a in ("%line%") do set "result=%%a"
echo Result: "%result%"
How would I do this? I'm sure it's just a set textfile=inbox.txt and another line of code to make it use the %TEXTFILE% variable, but I just don't know how.
You didn't get the right answer because you didn't post the right question. Neither in your original question nor in this one did you ask for something like "How to read a file and get the text between two strings". Also, you should specify the details of the file and post a section that contains the desired information; otherwise a Batch solution may fail because of a large number of factors.
Anyway, I created this simple data file just for testing:
This is a sample file
with unknown format
Previous info. Username:Desired Information</br><br>
End of file
And this is the solution:
@echo off
setlocal EnableDelayedExpansion
set "string1=Username:"
set "string2=</br><br>"
for /F "delims=" %%a in ('findstr "%string1%" file.txt') do (
rem Get the next line
set "line=%%a"
rem Remove from beginning until string1
set "line=!line:*%string1%=!"
rem Change the string2 by a one character delimiter
set "line=!line:%string2%=|!"
rem Get the desired information
for /F "delims=|" %%a in ("!line!") do set "result=%%a"
echo Result: "!result!"
)
Output:
Result: "Desired Information"
However, this solution is prone to fail for multiple reasons, for example:
Could the file contain exclamation marks?
Could the text that contains the desired information be split across several lines?
Could that text contain the chosen "|" separator?
Do you want all instances of the desired information? Or just the first one? Or just the last one?
Etc...
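If a non-Batch tool is acceptable, several of these caveats go away. The following Python sketch searches the whole file content at once, so exclamation marks, the | character, and information split across lines are not special cases, and it returns every instance so the caller can pick the first or the last. The function name is hypothetical; file.txt and the two delimiter strings are the ones used above.

```python
def between(text, start, end):
    """Return every substring of text found between start and end, in order."""
    results = []
    pos = 0
    while True:
        i = text.find(start, pos)
        if i == -1:
            break
        i += len(start)
        j = text.find(end, i)
        if j == -1:
            break
        results.append(text[i:j])
        pos = j + len(end)
    return results


# Hypothetical usage with the test file above:
# with open("file.txt", encoding="utf-8") as f:
#     print(between(f.read(), "Username:", "</br><br>"))
```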
That's because the code is parsing a variable that is set inside the script. To read lines from a file:
for /f "usebackq tokens=1 delims=" %%A in ("c:\somefolder\somefile.txt") do (
... put your other for loop here
)
See for /? for help. See set /? and call /? (and also for /? again) to see syntax details.
I've been trying to concatenate strings with the lines of a text file, but something is wrong with my code, and I believe it is the arguments I am using in the FOR loop. If anyone can help me, I'd much appreciate it.
My code is:
@echo off
set "input=C:\Users\123\Desktop\List.txt"
for /f "usebackq tokens=*" %%F in ("%input%") do (
set "str1=C:\some directory\"
set "str2=%%~F"
set "str3=.pdf"
set "str4=%str1%%str2%%str3%"
echo.%str4%
)
and the text file is something like:
121122
122233
123344
124455
But I only get a wrong answer, and I have to run it about 3 times to get any result at all: the first two runs print only blanks, and the third prints the last line of the text file repeated n times, where n is the number of lines in the text file.
Result:
C:\Users\123\Desktop>concatenate.bat
C:\Users\123\Desktop>concatenate.bat
C:\Users\123\Desktop>concatenate.bat
C:\some directory\124455.pdf
C:\some directory\124455.pdf
C:\some directory\124455.pdf
C:\some directory\124455.pdf
C:\some directory\124455.pdf
C:\Users\123\Desktop>
So, if any one has a clue on what is wrong please let me know.
Regards
-Victor-
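You'll need to enable delayed expansion. It's required because, within a FOR command block, you refer to variables that were modified inside that same block; %var% is expanded once, when the whole block is parsed, while !var! is expanded each time the line runs.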
@echo off
setlocal enabledelayedexpansion
set "input=C:\Users\123\Desktop\List.txt"
for /f "usebackq tokens=*" %%F in ("%input%") do (
set "str1=C:\some directory\"
set "str2=%%~F"
set "str3=.pdf"
set "str4=!str1!!str2!!str3!"
echo.!str4!
)