I have a large text file, over 1 GB, containing data line by line. This is text file A.txt.
I then have a second file, text file B.txt, that contains 30,000 unique words. I want to extract from text file A every line in which one of those words is found, along with the rest of that line.
An example of this is:
--Text File A--
dog in house
cat at school
kid in playground
tom at oaks
so much stuff
inhouse cool stuff
--Text File B--
house
oaks
--Result File Output--
dog in house
tom at oaks
inhouse cool stuff
How would I go about doing this in the fastest way possible? Is there any software on the market for purchase that specializes in this type of task?
I don't know any programming languages whatsoever, so if a solution requires writing code, I would need beginner-level instructions on how to carry it out.
I've searched Google for hours hoping to find a solution but have come up with nothing meaningful.
Thanks in advance.
Using Java MapReduce, you can do it as below:
Load File A into HDFS.
Pass it line by line as input to the Mapper.
Share File B via the distributed cache so that it is accessible to every Mapper in its entirety, without being divided into chunks.
In the Mapper, check the input line received (from File A) for any of the words present in File B (shared via the distributed cache).
If no word is found, skip that line.
If a word is found, output the line to the Reducer.
From the Reducer, write the line to the output file.
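In code, the Mapper step could look roughly like this. This is a minimal sketch, assuming Hadoop's newer mapreduce API and that File B was attached to the distributed cache under the symlink name B.txt; the class name is illustrative:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordFilterMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

    private final Set<String> words = new HashSet<>();

    @Override
    protected void setup(Context context) throws IOException {
        // File B was shared via the distributed cache, so a local copy
        // (symlinked as "B.txt") is visible in the task's working directory.
        try (BufferedReader reader = new BufferedReader(new FileReader("B.txt"))) {
            String word;
            while ((word = reader.readLine()) != null) {
                words.add(word.trim());
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        // Substring match, so "house" also hits "inhouse cool stuff",
        // as in the example output above.
        for (String word : words) {
            if (line.contains(word)) {
                context.write(new Text(line), NullWritable.get());
                return; // emit each matching line only once
            }
        }
    }
}

A pass-through Reducer (or a map-only job with zero reduce tasks) then writes the surviving lines to the output file.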
I have a folder with notes on different topics, say ~/notes/topica.txt, ~/notes/topicb.txt, etc. I have been using them on different computers for some time now, so their contents have diverged.
So now I have something like:
#computerA:
cat ~/notes/topica.txt
Lines added in Computer A
----------
Some common text
#computerB:
cat ~/notes/topica.txt
Lines added in Computer B
----------
Some common text
I want to be able to merge them in a way that gives me:
cat ~/notes/topica.txt
Lines added in Computer A
Lines added in Computer B
----------
Some common text
I have tried tools like diff, but they really don't achieve what I want.
I also need to do this for an entire folder with a bunch of files. (I'm alright with writing a bash script.)
Thanks for your help.
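For what it's worth, here is a rough bash sketch of one way to get that result, assuming the other computer's copies have been brought over into a hypothetical ~/notes-from-b/ directory and that every file uses the literal ---------- line as its separator:

for f in ~/notes/*.txt; do
  b=~/notes-from-b/"$(basename "$f")"    # the other computer's copy (assumed path)
  {
    sed '/^----------$/,$d' "$f"         # lines unique to this copy (above the separator)
    sed '/^----------$/,$d' "$b"         # lines unique to the other copy
    sed -n '/^----------$/,$p' "$f"      # the separator plus the common tail
  } > "$f.merged"
done

This treats everything below the first separator as the common part; if the common text itself has diverged between the two machines, you would still need diff or a three-way merge tool for that portion.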
So I am trying to make an offline dictionary, and as a source for the words I am using a .txt file. I have some questions related to that. How can I find a specific word in my text file and save it in a variable? Also, does the length of my file matter, and will it affect the speed? Here is just a part of my .txt file:
Abendhimmel m вечерно небе.|-|
Abendkasse f Theat вечерна каса.|-|
Abendkleid n вечерна рокля.|-|
Abendland n o.Pl. geh Западът.|-|
What I want is to save the word, for example Abendkasse, and everything else up to this symbol |-| in one variable. Thanks for your help!
I recommend looking at Python's standard library functions (on open files) called readlines() and read(). I don't know how large your file is, but you can usually just read the entire thing into RAM (with read or readlines) and then search through the string you get. Searching can be done with a regex or just a simple loop.
The length of your file matters only in that opening a larger file takes slightly longer, and even for large text files this is usually still fast. In fact, it will often be faster overall to read the entire file first, because once it is in RAM, all operations on it are much faster.
An example:

with open("yourlargetextfile.txt") as f:
    contents = f.readlines()

for line in contents:
    # split every line into parts on the |-| separator
    parts = line.split("|-|")
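To then get one specific word's entry into a variable, a small extension of that loop could look like this (a sketch, assuming the headword is always the first token of its line, as in your sample):

entries = {}
for line in contents:
    entry = line.split("|-|")[0].strip()  # everything up to the |-| marker
    if entry:
        headword = entry.split()[0]       # e.g. "Abendkasse"
        entries[headword] = entry

definition = entries.get("Abendkasse")    # the whole entry, in one variable

Building the dictionary once also answers the speed concern: after this single pass over the file, each lookup is effectively instant regardless of file length.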
I am writing a REXX program which will update a PS dataset. I can edit a particular line using my REXX code, but I would like code to insert a particular string after a particular line.
For example: my PS dataset has 100 lines. I want to insert the text "ABCDE" after the 44th line (as the new 45th line), which will increase the total number of lines in the file to 101. The remaining lines should remain unchanged. Is this possible using REXX?
Independent of REXX, you effectively need to read the old dataset, write it out to a new file, add your new record (string) to the output at the right point, and then write the rest. There is no way to insert a record into a Physical Sequential (PS) dataset in place. At the end, you would delete the old file and rename the newly created file to the old name.
Another option would be to use a generation data group (GDG), reading the current generation (0) and creating the new one (+1) as the output. This way you are still referring to the same dataset name for others to reference.
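As a rough sketch in TSO/E REXX, assuming the old and new datasets are already allocated to hypothetical DD names OLDDD and NEWDD (e.g. via TSO ALLOC), the copy-and-insert could look like:

/* REXX - copy OLDDD to NEWDD, inserting a record after line 44 */
"EXECIO * DISKR OLDDD (STEM in. FINIS"     /* read the whole dataset  */
j = 0
DO i = 1 TO in.0
  j = j + 1; out.j = in.i
  IF i = 44 THEN DO                        /* insert after record 44  */
    j = j + 1; out.j = 'ABCDE'
  END
END
"EXECIO" j "DISKW NEWDD (STEM out. FINIS"  /* write the longer copy   */

The delete-and-rename (or GDG roll) then happens outside the program, exactly as described above.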
What @Hogstrom suggests is a good solution to the problem you describe. In the interest of completeness, here is a solution that may be necessary under extreme circumstances.
Create an edit macro...
/*REXX*/
ADDRESS ISREDIT 'MACRO NOPROCESS'
aLine = 'ABCDE'                                    /* text to insert        */
ADDRESS ISREDIT 'LINE_AFTER 44 = DATALINE (ALINE)' /* insert after line 44  */
ADDRESS ISREDIT 'END'                              /* save changes and exit */
...and run ISPF edit in batch, executing this macro.
The JCL to run ISPF in batch is shop-specific, but many shops have created a cataloged procedure to do so.
If you are willing to copy your dataset to the z/OS Unix file system, you could also use sed or awk to make your changes.
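For example, a portable sed version of the same insertion (file names here are placeholders) could be:

sed '44a\
ABCDE' oldfile > newfile

The a\ command appends the given text after each addressed line, here line 44.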
I'm not recommending any of this; I'm just pointing out that it can be done if @Hogstrom's solution won't work for you for some reason.
I have four 100 GB CSV files where two fields need to be concatenated. Luckily, the two fields are next to each other.
My thought is to remove the 41st occurrence of "," from each line; then my two fields will be properly united and ready to be uploaded to the analytical tool that I use.
The development machine is a Windows 10 machine with 4 x 3.6 GHz cores and 64 GB RAM, and I push the file to a server running CentOS 7 with 40 x 2.4 GHz cores and 512 GB RAM. I have sudo access on the server and can technically change the file there if someone has a solution that depends on Linux tools. The idea is to accomplish the task in the fastest/easiest way possible. I have to repeat this task monthly and would be ecstatic to automate it.
My original way of accomplishing this was to load the CSV into MySQL, concatenate the fields, and remove the old fields, then export the table as a CSV again and push it to the server. This takes two days and is laborious.
Right now I'm torn between learning to use sed and using something I'm more familiar with, like Node.js, to stream the files line by line into a new file and then push those to the server.
If you recommend using sed, I've read here and here but don't know how to remove the nth occurrence from each line.
Edit: Cyrus asked for a sample input/output.
Input file formatted thusly:
"field1","field2",".........","field41","field42","......
Output file formatted like so:
"field1","field2",".........","field41field42","......
If you want to remove the 41st occurrence of "," (the quote-comma-quote sequence between two fields), you can try:
sed -i 's/","//41' file
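The trailing 41 is sed's flag for "replace only the Nth match on each line". You can see it at work on a short sample line, using 2 in place of 41:

$ echo '"a","b","c","d"' | sed 's/","//2'
"a","bc","d"

Note that -i edits the file in place (GNU sed writes a temporary copy and renames it); on 100 GB files you may prefer sed 's/","//41' input.csv > output.csv so the original is kept untouched.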
I have a program that searches, using ReadAllText, through text from a file. The file is only 2.2 MB. It reads about 80% of the file and does not give an error; it just does not find my search terms in the remaining 20% of the text. What can I do? Using C#. Thanks.
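Hard to say from this alone, but a common culprit with ReadAllText is an encoding mismatch rather than actual truncation. As a first diagnostic, a sketch along these lines (the path is a placeholder) compares what ReadAllText sees against the file's size on disk and forces an explicit encoding:

using System;
using System.IO;
using System.Text;

class ReadCheck
{
    static void Main()
    {
        string path = "input.txt";  // placeholder path
        string text = File.ReadAllText(path, Encoding.UTF8);
        long bytes = new FileInfo(path).Length;
        // If the character count is far below the byte count (beyond what
        // multi-byte characters would explain), try another encoding,
        // e.g. Encoding.Unicode or Encoding.Default.
        Console.WriteLine($"chars read: {text.Length}, bytes on disk: {bytes}");
    }
}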