Splitting a string involving parentheses, whilst retaining the parentheses - python-3.x

I have a column named Roles in a df that has values like the below:
ABCD (Actor), XYZ (Actor, Director), PQR (Producer, Writer)
I want to split this string so that I get a value containing a list of each individual entry.
So the output should be:
[[ABCD (Actor)], [XYZ (Actor, Director)], [PQR (Producer, Writer)]]
I am trying to use the below; however, the ) gets cut at each split point and I end up with strings missing the ) in the output
df['Role_Split'] = df['Roles'].str.split(r"\), ")
results in
['ABCD (Actor', 'XYZ (Actor, Director', 'PQR (Producer, Writer)']
Further, my plan was to create new columns, each for Actor, Director, Producer, etc
And populate those columns if the list element contains the string "Actor" or "Director" or "Producer" etc
Can you advise if there is an easier way to do this?
So the final output (alongside some more columns) would be:
Roles: ABCD (Actor), XYZ (Actor, Director), PQR (Producer, Writer)
Role_Split: [[ABCD (Actor)], [XYZ (Actor, Director)], [PQR (Producer, Writer)]]
Actor: ABCD, XYZ
Other Roles: XYZ, PQR

Looks like you can use str.findall
Ex:
df = pd.DataFrame({"Roles":['ABCD (Actor), XYZ (Actor, Director), PQR (Producer, Writer)']})
df['Role_Split'] = df['Roles'].str.findall(r"(\w+ \(.*?\))")
print(df['Role_Split'][0])  # print(df['Role_Split']) shows the full Series
Output:
['ABCD (Actor)', 'XYZ (Actor, Director)', 'PQR (Producer, Writer)']
Edit as per comment
df = pd.DataFrame({"Roles":['ABCD Walters Sr (Actor), XYZ PQR AB (Lead Role, Producer, Director)']})
df['Role_Split'] = df['Roles'].str.findall(r"([\w\s]+ \(.*?\))")
print(df['Role_Split'][0])
# ->['ABCD Walters Sr (Actor)', ' XYZ PQR AB (Lead Role, Producer, Director)']
Use str.extractall with named regex groups
Ex:
df2 = df['Roles'].str.extractall(r"(?P<Actor>[\w\s]+) (?P<roles>\(.*?\))")
print(df2)
Output:
                   Actor                                 roles
  match
0 0      ABCD Walters Sr                               (Actor)
  1           XYZ PQR AB  (Lead Role, Producer, Director)

Using a regular expression with a look-behind for a bracket:
(?<=\)),\s+
It splits on a comma and any following whitespace (\s+ requires at least one whitespace character) where that comma comes immediately after a closing bracket.
s = 'ABCD (Actor), XYZ (Actor, Director), PQR (Producer, Writer)'
pd.Series([s]).str.split(r'(?<=\)),\s+', expand=True)
Yields output like so:
Out[5]:
              0                      1                        2
0  ABCD (Actor)  XYZ (Actor, Director)  PQR (Producer, Writer)
Note that pd.Series.str.split takes a regular expression directly. If you want the pieces in separate columns, pass expand=True to your s.str.split(pattern) call; otherwise, if you want a list in each cell of the Series, don't pass that parameter.
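To build the per-role columns the question goes on to ask about, one approach (a minimal sketch; the names_with helper is hypothetical, and only Actor and Producer columns are shown) is to filter the findall list and keep the name in front of each parenthesis:

```python
import pandas as pd

df = pd.DataFrame({"Roles": ["ABCD (Actor), XYZ (Actor, Director), PQR (Producer, Writer)"]})
df["Role_Split"] = df["Roles"].str.findall(r"([\w\s]+ \(.*?\))")

def names_with(role):
    # Hypothetical helper: keep the name (text before the parenthesis)
    # of each split entry whose role list mentions `role`
    return df["Role_Split"].apply(
        lambda entries: ", ".join(
            e.split(" (")[0].strip() for e in entries if role in e
        )
    )

df["Actor"] = names_with("Actor")      # -> "ABCD, XYZ"
df["Producer"] = names_with("Producer")  # -> "PQR"
print(df[["Actor", "Producer"]])
```

The same pattern extends to Director, Writer, and any other role names you care about.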

Related

How to Print All line between matching first occurrence of word?

input.txt
ABC
CDE
EFG
XYZ
ABC
PQR
EFG
From the above file I want to print the lines between 'ABC' and the first occurrence of 'EFG'.
Expected output :
ABC
CDE
EFG
ABC
PQR
EFG
How can I print lines from one word to the first occurrence of a second word?
Could you please try the following.
awk '/ABC/{flag=1} flag && !count;/EFG/{count++}' Input_file
EDIT: In case you want to print all occurrences of lines coming between ABC and EFG, and leave the others, then try the following.
awk '/ABC/{found=1} found;/EFG/{found=""}' Input_file
$ awk '/ABC/,/EFG/' file
Output:
ABC
CDE
EFG
ABC
PQR
EFG
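The range form can be reproduced end to end (a sketch; the sample input is recreated with a here-document):

```shell
# Recreate the sample input
cat > input.txt <<'EOF'
ABC
CDE
EFG
XYZ
ABC
PQR
EFG
EOF

# Print every ABC..EFG block, inclusive
awk '/ABC/,/EFG/' input.txt
```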
This might work for you (GNU sed):
sed -n '/ABC/{:a;N;/EFG/!ba;p}' file
Turn off implicit printing by using the -n option.
Gather up lines between ABC and EFG and then print them. Repeat.
If you want to only print between the first occurrence of ABC to EFG, use:
sed -n '/ABC/{:a;N;/EFG/!ba;p;q}' file
To print the second through fourth occurrences, use:
sed -En '/ABC/{:a;N;/EFG/!ba;x;s/^/x/;/^x{2,4}$/{x;p;x};x;}' file

Display indentation of long lines in following lines in vim

When I have a long line indented in vim, it wraps at the edge of the window automatically (just visually). I'd like the continuation lines to show the indentation as well. Is that possible?
This visually indents lines that have been wrapped:
:set wrap
:let &showbreak=' '
Note that the indent width is fixed; it doesn't try to match the indent of the previous line.
Before
abc abc abc abc abc abc abc abc abc abc abc abc abc abc abc abc abc abc abc abc abc abc abc abc abc abc abc abc abc abc abc abc abc abc
After
abc abc abc abc abc abc abc abc abc abc abc abc abc abc abc abc
abc abc abc abc abc abc abc abc abc abc abc abc
abc abc abc abc abc abc

Compare one field, Remove duplicate if value of another field is greater

Trying to do this at the linux command line. I want to combine two files and compare values based on ID, but only keep the ID that has the newer/greater value for Date (edit: equal to or greater than). Because the ID 456604 is in both files, I want to keep only the one from File 2 with the newer date: "20111015 456604 tgf"
File 1
Date ID Note
20101009 456604 abc
20101009 444444 abc
20101009 555555 abc
20101009 666666 xyz
File 2
Date ID Note
20111015 111111 abc
20111015 222222 abc
20111015 333333 xyz
20111015 456604 tgf
And then the output should have both files combined, but keep only the second ID value, with the newer date. The order the rows are in does not matter; this is just an example of the output for the concept.
Output
Date ID Note
20101009 444444 abc
20101009 555555 abc
20101009 666666 xyz
20111015 111111 abc
20111015 222222 abc
20111015 333333 xyz
20111015 456604 tgf
$ cat file1.txt file2.txt | sort -ru | awk '!($2 in seen) { print; seen[$2] }'
Date ID Note
20111015 456604 tgf
20111015 333333 xyz
20111015 222222 abc
20111015 111111 abc
20101009 666666 xyz
20101009 555555 abc
20101009 444444 abc
Sort the combined files by descending date and only print a line the first time you see an ID.
EDIT
More compact edition, thanks to Steve:
cat file1.txt file2.txt | sort -ru | awk '!seen[$2]++'
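The one-liner can be checked end to end (a sketch; both sample files are recreated first):

```shell
cat > file1.txt <<'EOF'
Date ID Note
20101009 456604 abc
20101009 444444 abc
20101009 555555 abc
20101009 666666 xyz
EOF

cat > file2.txt <<'EOF'
Date ID Note
20111015 111111 abc
20111015 222222 abc
20111015 333333 xyz
20111015 456604 tgf
EOF

# Newest line per ID wins: sort descending, keep the first sighting of each ID
cat file1.txt file2.txt | sort -ru | awk '!seen[$2]++'
```

The header survives because "Date ID Note" sorts ahead of the data lines and its ID field ("ID") is only seen once.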
You didn't specify how you'd like to handle the case where the dates are also duplicated, or even if this case could exist. Therefore, I have assumed that by 'greater', you really mean 'greater than or equal to' (it also makes handling the header a tiny bit easier). If that's not the case, please edit your question.
awk code:
awk 'FNR==NR {
a[$2]=$1
b[$2]=$0
next
}
a[$2] >= $1 {
print b[$2]
delete b[$2]
next
}
1
END {
for (i in b) {
print b[i]
}
}' file2 file1
Explanation:
Basically, we use an associative array, called a, to store the 'ID' and 'Date' as key and value, respectively. We also store the contents of file2 in memory using another associative array called b. When file1 is read, we test whether column two exists in our array a and whether that key's value is greater than or equal to column one. If it is, we print the corresponding line from array b, delete it from the array, and move on to the next line/record of input. The 1 on its lonesome always returns true, thereby printing records where the previous (two) conditions are not met; this has the effect of printing any unmatched records from file1. Finally, we print what's left in array b.
Results:
Date ID Note
20111015 456604 tgf
20101009 444444 abc
20101009 555555 abc
20101009 666666 xyz
20111015 222222 abc
20111015 111111 abc
20111015 333333 xyz
Another awk way
awk 'NR==1;FNR>1{a[$2]=(a[$2]<$1&&b[$2]=$3)?$1:a[$2]}
END{for(i in a)print a[i],i,b[i]}' file file2
Compares the value stored in the array to the current record's date to determine which is higher, and also stores the third field if the current record is higher.
Then prints out the stored date, the key (field 2) and the value stored for field 3.
Or shorter
awk 'NR==1;FNR>1{(a[$2]<$1&&b[$2]=$0)&&a[$2]=$1}END{for(i in b)print b[i]}' file file2

How to delete the matching pattern from given occurrence

I'm trying to delete matching patterns, starting from the second occurrence, using sed or awk. The input file contains the information below:
abc
def
abc
ghi
jkl
abc
xyz
abc
I want to delete the pattern abc from the second instance onward. The output should be as below:
abc
def
ghi
jkl
xyz
Neat sed solution:
sed '/abc/{2,$d}' test.txt
(This relies on the first abc being on line 1. A more general GNU sed form is sed '0,/abc/!{/abc/d}' test.txt, which deletes abc lines only after the first occurrence, wherever it is.)
abc
def
ghi
jkl
xyz
$ awk '$0=="abc"{c[$0]++} c[$0]<2; ' file
abc
def
ghi
jkl
xyz
Just change the 2 to 3 (or whatever number) to keep the first N occurrences instead of just the first one.
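Run against the sample input, this keeps only the first occurrence (a sketch; the input file is recreated first):

```shell
cat > file.txt <<'EOF'
abc
def
abc
ghi
jkl
abc
xyz
abc
EOF

# Count exact "abc" lines; print a line only while fewer than 2 have been seen
awk '$0=="abc"{c[$0]++} c[$0]<2;' file.txt
```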
One way using awk:
$ awk 'f&&$0==p{next}$0==p{f=1}1' p="abc" file
abc
def
ghi
jkl
xyz
Just set p to the pattern that you want only the first instance of printed.
Taken from: unix.com
Using awk '!x[$0]++' will remove duplicate lines. x is an associative array whose values start at 0, indexed by the whole line $0. The first time a line is seen, x[$0] is 0; since ++ here is a suffix increment, 0 is returned before being incremented, so !x[$0] is true and the line is printed by the default action. If $0 appears again, x[$0] is non-zero, !x[$0] is false, and the line is not printed.
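As a quick sketch, the duplicate-removal idiom behaves like this on the sample input, where every repeated line (not just abc) is dropped after its first appearance:

```shell
cat > dupes.txt <<'EOF'
abc
def
abc
ghi
jkl
abc
xyz
abc
EOF

# Print each distinct line only the first time it appears
awk '!x[$0]++' dupes.txt
```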

Multiline trimming

I have a html file that I want to trim. I want to remove a section from the beginning all the way to a given string, and from another string to the end. How do I do that, preferably using sed?
With GNU sed:
sed '/mark1/,/mark2/d;/mark3/,$d'
this input:
abc
def
mark1
ghi
jkl
mno
mark2
pqr
stu
mark3
vwx
yz
becomes:
abc
def
pqr
stu
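The example above can be reproduced directly (a sketch; the sample text is written to a file first):

```shell
cat > page.txt <<'EOF'
abc
def
mark1
ghi
jkl
mno
mark2
pqr
stu
mark3
vwx
yz
EOF

# Delete mark1..mark2 inclusive, then mark3 through end of file
sed '/mark1/,/mark2/d;/mark3/,$d' page.txt
```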
You can use awk:
$ cat file
mark1 dsf
abc
def
before mark2 after
blah mark1
ghi
jkl
mno
wirds mark2 here
pqr
stu
mark3
vwx
yz
$ awk -vRS="mark2" '/mark1/{gsub("mark1.*","")}/mark3/{ gsub("mark3.*","");print;f=1 } !f ' file
after
blah
here
pqr
stu

Resources