I am trying to convert a txt file to CSV, but it doesn't work.
Original text:
استقالة #رئيس_القضاء #السودان OBJ
أهنئ الدكتور أحمد جمال الدين، مناسبة صدور أولى روايته POS
يستقوى بامريكا مرةاخرى و يرسل عصام العريان الي واشنطن شئ NEG
#انتخبوا_العرص #انتخبوا_البرص #مرسى_رئيسى #_ #__ö NEUTRAL
Expected result:
text value
استقالة #رئيس_القضاء #السودان OBJ
أهنئ الدكتور أحمد جمال الدين، مناسبة صدور أولى روايته POS
يستقوى بامريكا مرةاخرى و يرسل عصام العريان الي واشنطن شئ NEG
#انتخبوا_العرص #انتخبوا_البرص #مرسى_رئيسى #_ #__ö NEUTRAL
I have tried this, but it doesn't handle the space and comma constraints (the text itself contains spaces, and commas would need quoting):
awk 'BEGIN{print "text,value"}{print $1","$2}' ifile.txt
I have also tried this with Python, but the result does not contain all of the columns:
import pandas as pd
df = pd.read_fwf('log.txt')
df.to_csv('log.csv')
Your request is unclear: how do you want to format the last field?
I created a script that aligns the last field at column 60.
script.awk
BEGIN {printf("text%61s\n","value")} # formatted printing heading line
{
lastField = $NF; # store current last field into var
$NF = ""; # remove last field from line
alignLen = 60 - length() + length(lastField); # compute last field alignment
alignFormat = "%s%"alignLen"s\n"; # create printf format for computed alignment
printf(alignFormat, $0, lastField); # format print current line and last field
}
run script.awk
awk -f script.awk ifile.txt
output
text value
استقالة #رئيس_القضاء #السودان OBJ
أهنئ الدكتور أحمد جمال الدين، مناسبة صدور أولى روايته POS
يستقوى بامريكا مرةاخرى و يرسل عصام العريان الي واشنطن شئ NEG
#انتخبوا_العرص #انتخبوا_البرص #مرسى_رئيسى #_ #__ö NEUTRAL
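If the goal is an actual comma-separated CSV (as the expected header text,value suggests), here is a minimal Python sketch, assuming the label is always the last whitespace-separated token on the line (labels_to_csv is a hypothetical helper name, not a library function):

```python
import csv
import io

def labels_to_csv(lines):
    """Split each line into text and a trailing label; return CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(['text', 'value'])
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # The label is the last whitespace-separated token; everything
        # before it (which may contain spaces and commas) is the text.
        text, value = line.rsplit(None, 1)
        writer.writerow([text, value])  # csv.writer quotes embedded commas
    return buf.getvalue()
```

csv.writer takes care of quoting when the text itself contains a comma, which is the part a plain `print $1","$2` in awk cannot do.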
I have a large text file looking like:
....
sdsdsd
..........
asdfhjgjksdfk dfkaskk sdkfk skddkf skdf sdk ssaaa akskdf sdksdfsdf ksdf sd kkkkallwow.
sdsdllla lsldlsd lsldlalllLlsdd asdd. sdlsllall asdsdlallOEFOOASllsdl lsdlla.
slldlllasdlsd.ss;sdsdasdas.
......
ddss
................
asdfhjgjksdfk ddjafjijjjj.dfsdfsdfsdfsi dfodoof ooosdfow oaosofoodf aosolflldlfl , dskdkkfkdsa asddf;akkdfkdkk . sdlsllall asdsdlallOEFOOASllsdl lsdlla.
slldlllasdlsd.ss;sdsdasdas.
.....
xxxx
.......
asdfghjkl
I want to split the text file into multiple small text files, saved as .txt on my system, splitting on occurrences of ..... [multiple period markers], saved like
group1_sdsdsd.txt
....
sdsdsd
..........
asdfhjgjksdfk dfkaskk sdkfk skddkf skdf sdk ssaaa akskdf sdksdfsdf ksdf sd kkkkallwow.
sdsdllla lsldlsd lsldlalllLlsdd asdd. sdlsllall asdsdlallOEFOOASllsdl lsdlla.
slldlllasdlsd.ss;sdsdasdas.
group1_ddss.txt
ddss
................
asdfhjgjksdfk ddjafjijjjj.dfsdfsdfsdfsi dfodoof ooosdfow oaosofoodf aosolflldlfl , dskdkkfkdsa asddf;akkdfkdkk . sdlsllall asdsdlallOEFOOASllsdl lsdlla.
slldlllasdlsd.ss;sdsdasdas.
and
group1_xxxx.txt
.....
xxxx
.......
asdfghjkl
I have figured out that something of the sort can be done using a regex like the following:
txt = re.sub(r'(([^\w\s])\2+)', r' ', txt).strip()  # collapse non-word characters repeated 2+ times
but I am not able to figure it out completely.
The saved text files should be named group1_sdsdsd.txt, group1_ddss.txt and group1_xxxx.txt [group1 being the identifier for the specific big text file, as I have multiple bigger text files and need to do the same on all of them to know which big text file I am splitting].
If the runs of multiple dots that delimit the parts are always on their own lines, you might use a pattern like:
^\.{3,}\n(\S+)\n\.{3,}(?:\n(?!\.{3,}\n\S+\n\.{3,}).*)*
Explanation
^ Start of line (with re.MULTILINE)
\.{3,}\n Match 3 or more dots and a newline
(\S+)\n Capture 1+ non whitespace chars in group 1 for the filename and match a newline
\.{3,} Match 3 or more dots
(?: Non capture group to repeat as a whole part
\n Match a newline
(?!\.{3,}\n\S+\n\.{3,}) Negative lookahead, assert that from the current position we are not looking at a pattern that matches the dots with a filename in between
.* Match the whole line
)* Close the non capture group and optionally repeat it
Then you can use re.finditer to loop the matches, and use the group 1 value as part of the filename.
See a regex demo and a Python demo with the separate parts.
Example code
import re
pattern = r"^\.{3,}\n(\S+)\n\.{3,}(?:\n(?!\.{3,}\n\S+\n\.{3,}).*)*"
s = ("....your data here")
matches = re.finditer(pattern, s, re.MULTILINE)
your_path = "/your/path/"
for match in matches:
    with open(your_path + "group1_{}.txt".format(match.group(1)), 'w') as f:
        f.write(match.group())
I have written a very simple Python programme, called wc.py, which mimics the behaviour of "bash wc" by counting the number of words, lines and bytes in a file. My programme is as follows:
import sys

path = sys.argv[1]
file = open(path)

w = 0
l = 0
b = 0
for currentLine in file:
    wordsInLine = currentLine.strip().split(' ')
    wordsInLine = [word for word in wordsInLine if word != '']
    w += len(wordsInLine)
    b += len(currentLine.encode('utf-8'))
    l += 1

# output
print(str(l) + ' ' + str(w) + ' ' + str(b))
To execute my programme, run the following command:
python3 wc.py [a file to read the data from]
As the result it prints:
[the number of lines in the file] [the number of words in the file] [the number of bytes in the file]
The files I used to test my code are as follows:
file.txt which contains the following data:
1
2
3
4
Executing "wc file.txt" returns
4 4 8
Executing "python3 wc.py file.txt" returns 4 4 8
Download "Annual enterprise survey: 2020 financial year (provisional) – CSV" from the CSV file download page.
Executing "wc [fileName].csv" returns
37081 500273 5881081
Executing "python3 wc.py [fileName].csv" returns
37081 500273 5844000
and a [something].pdf file
Executing "wc [something].pdf" works.
Executing "python3 code.py [something].pdf" throws the following errors:
Traceback (most recent call last):
File "code.py", line 10, in <module>
for currentLine in file:
File "/usr/lib/python3.8/codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbf in position 10: invalid start byte
As you can see, the output of python3 code.py [something].pdf and python3 code.py [fileName].csv is not the same as what wc returns. Could you help me find the reason for this erroneous behaviour in my code?
Regarding the CSV file, if you look at the difference between your result and that of wc:
5881081 - 5844000 = 37081 which is exactly the number of lines.
That is, every line has one additional character in the original file. That character is the carriage return \r which got lost in Python because you iterate over lines and don't specify the linebreaks. If you want a byte-correct result, you have to first identify the type of linebreaks used in the file (and watch out for inconsistencies throughout the document).
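A quick way to see the effect, as a self-contained sketch with a tiny sample file rather than the survey CSV:

```python
import os
import tempfile

# Write a small file with Windows (CRLF) line endings.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, 'wb') as f:
    f.write(b'1\r\n2\r\n3\r\n4\r\n')

# Text mode with universal newlines: each '\r\n' arrives as a single '\n',
# so one byte per line is lost from the count.
with open(path) as f:
    text_count = sum(len(line.encode('utf-8')) for line in f)

# Binary mode counts the real bytes on disk, matching `wc -c`.
with open(path, 'rb') as f:
    byte_count = len(f.read())

os.remove(path)
print(text_count, byte_count)  # 8 12
```

Opening with open(path, newline='') disables the translation and keeps the \r characters, if you want a byte-correct count while still iterating over lines.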
My data is like below - stored in a .OUT file:
{ID=ISIN Name=yes PROGRAM=abc START_of_FIELDS CODE END-OF-FIELDS TIMESTARTED=Mon Nov 30 20:45:56
START-OF-DATA
CODE|ERR CODE|NUM|EXCH_CODE|
912828U rp|0|1|BERLIN|
1392917 rp|0|1|IND|
3CB0248 rp|0|1|BRAZIL|
END-OF-DATA***}
I need to extract the lines between START-OF-DATA and END-OF-DATA from above .OUT file using Python and load it in CSV file.
CODE|ERR CODE|NUM|EXCH_CODE|
912828U rp|0|1|BERLIN|
1392917 rp|0|1|IND|
3CB0248 rp|0|1|FRANKFURT|
You can use a regex with a non-greedy quantifier to get the entries between the two strings.
import re
import pandas as pd

with open('file.txt', 'r') as file:
    data = file.read()

pattern = re.compile(r'(?:START-OF-DATA(.*?)END-OF-DATA)', re.MULTILINE | re.IGNORECASE | re.DOTALL)
g = re.findall(pattern, data)
O/P
[' \nCODE|ERR CODE|NUM|EXCH_CODE|\n912828U rp|0|1|BERLIN|\n1392917 rp|0|1|IND| \n3CB0248 rp|0|1|BRAZIL| \n']
#remove whitespaces and split by new line and remove empty entries of list
t = g[0].replace(" ","").split("\n")
new = list(filter(None, t))
O/P
['CODE|ERRCODE|NUM|EXCH_CODE|', '912828Urp|0|1|BERLIN|', '1392917rp|0|1|IND|', '3CB0248rp|0|1|BRAZIL|']
#create dataframe with pipe-delimited values
df = pd.DataFrame([i.split('|') for i in new])
O/P
0 1 2 3
0 CODE ERRCODE NUM EXCH_CODE
1 912828Urp 0 1 BERLIN
2 1392917rp 0 1 IND
3 3CB0248rp 0 1 BRAZIL
#create csv from df
df.to_csv('file.csv')
The regex pattern defined here captures everything between "START-OF-DATA" and "END-OF-DATA" whenever such a block is found, and leaves you its output.
I have data as below:
83997000|17561815|20370101000000 83997000|3585618|20370101000000
83941746|13898890|20361231230000 83940169|13842974|20171124205011
83999444|3585618|20370101000000 83943970|10560874|20370101000000
83942000|13898890|20371232230000 83999333|3585618|20350101120000
Now, what I want to achieve is as below:
If column 2 is 17561815, print 22220 to replace 17561815.
If column 2 is 3585618, print 23330 to replace 3585618.
If column 2 is 13898890, print 24440 to replace 13898890.
If column 2 is 13842974, print 25550 to replace 13842974.
If column 2 is 3585618, print 26660 to replace 3585618.
If column 2 is 10560874, print 27770 to replace 10560874.
Output to be like this:
83997000|22220|20370101000000 83997000|23330|20370101000000
83941746|24440|20361231230000 83940169|25550|20171124205011
83999444|26660|20370101000000 83943970|27770|20370101000000
83942000|24440|20371232230000 83999333|26660|20350101120000
awk solution:
awk 'BEGIN{
FS=OFS="|";
a["17561815"]=22220; a["13898890"]=24440;
a["3585618"]=26660; a["13842974"]=25550;
a["10560874"]=27770
}
$2 in a{ $2=a[$2] }
$4 in a{ $4=a[$4] }1' file
The output:
83997000|22220|20370101000000 83997000|26660|20370101000000
83941746|24440|20361231230000 83940169|25550|20171124205011
83999444|26660|20370101000000 83943970|27770|20370101000000
83942000|24440|20371232230000 83999333|26660|20350101120000
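The same lookup-table idea as a Python sketch: split each whitespace-separated record on | and replace its second field (like the awk answer, this maps 3585618 to a single value, 26660):

```python
# Replacement table for the second pipe-separated field of each record.
mapping = {
    '17561815': '22220',
    '13898890': '24440',
    '3585618': '26660',
    '13842974': '25550',
    '10560874': '27770',
}

def replace_ids(line):
    """Rewrite field 2 of every pipe-separated record on the line."""
    records = []
    for record in line.split():  # the two records per line are space-separated
        fields = record.split('|')
        if len(fields) > 1 and fields[1] in mapping:
            fields[1] = mapping[fields[1]]
        records.append('|'.join(fields))
    return ' '.join(records)
```

Note that ' '.join collapses any run of spaces between the two records into a single space, which matches the sample data.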
I have a file where the first couple of rows start with a # mark, followed by the classical netlist, which can also contain rows beginning with #. I need to insert one row with the text protect between the block of initial # rows and the first row of the classical netlist. At the end of the file I need to insert a row with the word unprotect. It would be good to save this modified text to a new file with a specific name, because the original file is protected.
Sample file:
// Generated for: spectre
// Design library name: Kovi
// Design cell name: T_Line
// Design view name: schematic
simulator lang=spectre
global 0
parameters frequency=3.8G Zo=250
// Library name: Kovi
// Cell name: T_Line
// View name: schematic
T8 (7 0 6 0) tline z0=Zo f=3.8G nl=0.5 vel=1
T7 (net034 0 net062 0) tline z0=Zo f=3.8G nl=0.5 vel=1
T5 (net021 0 4 0) tline z0=Zo f=3.8G nl=0.5 vel=1
T4 (net019 0 2 0) tline z0=Zo f=3.8G nl=0.5 vel=1
How about sed
sed -e '/^#/,/^#/!iprotect'$'\n''$aunprotect'$'\n' input_file > new_file
Inserts 'protect' on a line by itself after the first block of commented lines, then adds 'unprotect' at the end.
Note: Because I use $'\n' in place of literal newline bash is assumed as the shell.
Since you awk'd the post
awk 'BEGIN{ protected=""} { if($0 !~ /#/ && !protected){ protected="1"; print "protect";} print $0}END{print "unprotect";}' input_file > output_file
As soon as a row is detected that does not contain #, it outputs a line with protect. At the end it outputs a line with unprotect.
Test file
#
#
#
#Preceded by a tab
begin protect
#
before unprotect
Result
#
#
#
#Preceded by tab
protect
begin protect
#
before unprotect
unprotect
Edit:
Removed the [:space:]* as it seems that is already handled by default.
Support //
If you wanted to support both # and // in the same script, the regex portion would change to /#|\//. The special character / has to be escaped by using \.
This would check for at least one /.
Adding a quantifier {2} will match // exactly: /#|\/{2}/
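For completeness, the same logic as a Python sketch, treating lines whose first non-whitespace characters are # or // as comments (add_protect is a hypothetical helper name):

```python
def add_protect(lines, comment_prefixes=('#', '//')):
    """Insert 'protect' before the first non-comment line; append 'unprotect'."""
    out = []
    inserted = False
    for line in lines:
        stripped = line.lstrip()
        # The first non-blank line that is not a comment marks the start
        # of the netlist proper; insert 'protect' just before it.
        if not inserted and stripped and not stripped.startswith(comment_prefixes):
            out.append('protect')
            inserted = True
        out.append(line)
    out.append('unprotect')  # always close the file with 'unprotect'
    return out
```

Writing the result to a new file keeps the protected original untouched, as the question asks.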