I have a column named Roles in a DataFrame, with values like the one below:
ABCD (Actor), XYZ (Actor, Director), PQR (Producer, Writer)
I want to split this string so that I get a list with one entry per individual.
So the output should be:
[[ABCD (Actor)], [XYZ (Actor, Director)], [PQR (Producer, Writer)]]
I am trying to use the code below, but the ) gets cut off and I end up with strings missing the ) in the output:
df['Role_Split'] = df['Roles'].str.split("\), ")
results in
['ABCD (Actor, XYZ (Actor, Director, PQR (Producer, Writer)']
Further, my plan was to create new columns, one each for Actor, Director, Producer, etc.,
and populate those columns if the list element contains the string "Actor" or "Director" or "Producer", etc.
Can you advise if there is an easier way to do this?
So the final output would have these columns (next to some more columns):
Roles: ABCD (Actor), XYZ (Actor, Director), PQR (Producer, Writer)
Role_Split: [[ABCD (Actor)], [XYZ (Actor, Director)], [PQR (Producer, Writer)]]
Actor: ABCD, XYZ
Other Roles: XYZ, PQR
Looks like you can use str.findall
Ex:
import pandas as pd
df = pd.DataFrame({"Roles":['ABCD (Actor), XYZ (Actor, Director), PQR (Producer, Writer)']})
df['Role_Split'] = df['Roles'].str.findall(r"(\w+ \(.*?\))")
print(df['Role_Split'][0])
Output:
['ABCD (Actor)', 'XYZ (Actor, Director)', 'PQR (Producer, Writer)']
Edit as per comment
df = pd.DataFrame({"Roles":['ABCD Walters Sr (Actor), XYZ PQR AB (Lead Role, Producer, Director)']})
df['Role_Split'] = df['Roles'].str.findall(r"([\w\s]+ \(.*?\))")
print(df['Role_Split'][0])
# ->['ABCD Walters Sr (Actor)', ' XYZ PQR AB (Lead Role, Producer, Director)']
Use str.extractall with named regex groups
Ex:
df2 = df['Roles'].str.extractall(r"(?P<Actor>[\w\s]+) (?P<roles>\(.*?\))")
print(df2)
Output:
                   Actor                            roles
  match
0 0      ABCD Walters Sr                          (Actor)
  1           XYZ PQR AB  (Lead Role, Producer, Director)
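The last step in the question, filling the Actor and Other Roles columns, is not covered above. Here is a minimal sketch in plain Python, assuming a person counts as an Actor when "Actor" is one of their roles and goes under Other Roles when they hold any role besides "Actor". The names PAIR and split_roles are made up for this example; it could be applied to the column with something like df['Roles'].apply(split_roles).

```python
import re

# Hypothetical helper pattern: a name (no commas or parentheses),
# followed by a parenthesised, comma-separated role list.
PAIR = re.compile(r"\s*([^,()]+?)\s*\(([^)]*)\)")

def split_roles(text):
    """Return (actors, others) as comma-joined strings of names."""
    actors, others = [], []
    for name, roles in PAIR.findall(text):
        role_list = [r.strip() for r in roles.split(",")]
        if "Actor" in role_list:
            actors.append(name)
        # Assumption: any role other than "Actor" counts as an "other role".
        if any(r != "Actor" for r in role_list):
            others.append(name)
    return ", ".join(actors), ", ".join(others)

actors, others = split_roles(
    "ABCD (Actor), XYZ (Actor, Director), PQR (Producer, Writer)"
)
print(actors)  # ABCD, XYZ
print(others)  # XYZ, PQR
```

Someone with both kinds of role, like XYZ, ends up in both columns, which matches the expected output in the question.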
Using a regular expression with a look-behind for a bracket:
(?<=\)),\s+
It splits on a comma and the whitespace after it (at least one whitespace character is required), but only where that comma comes right after a closing parenthesis.
import pandas as pd
s = 'ABCD (Actor), XYZ (Actor, Director), PQR (Producer, Writer)'
pd.Series([s]).str.split(r'(?<=\)),\s+', expand=True)
Yields output like so:
Out[5]:
              0                      1                       2
0  ABCD (Actor)  XYZ (Actor, Director)  PQR (Producer, Writer)
Note that pd.Series.str.split takes a regular expression directly. If you want the pieces in separate columns, pass expand=True to your s.str.split(pattern) call; if you want a list in each Series cell instead, leave that parameter out.
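To see what the look-behind does on its own, here is a small sketch with the standard library's re.split on a single string. It uses the same pattern and gives the list form, i.e. what each cell would hold without expand=True.

```python
import re

s = 'ABCD (Actor), XYZ (Actor, Director), PQR (Producer, Writer)'

# Split only on ", " that directly follows a ")"; the look-behind
# consumes nothing, so the ")" stays attached to the left-hand piece.
parts = re.split(r'(?<=\)),\s+', s)
print(parts)
# ['ABCD (Actor)', 'XYZ (Actor, Director)', 'PQR (Producer, Writer)']
```

The commas inside the parentheses are left alone because they are not preceded by a closing parenthesis.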
I have two data files, and I want to merge data1 with the matching rows of data2 based on the ID column. My data1 looks like:
ID address phone
123 .... .....
456 .... .....
789 .... .....
101 .... .....
and data2 looks like:
ID City Zipcode if_travel
123 .... .... ....
456 .... .... ....
I hope to get data like:
ID address phone City Zipcode if_travel
123 .... ..... .... .... ....
456 .... ..... .... .... ....
789 .... ..... NA NA NA
101 .... ..... NA NA NA
The process looks like a left join in Python, but is there any way to do the same thing with bash commands? Thanks!
There are lots of ways to do this; here's one using join (note that the result comes out sorted by ID):
join -o '0,1.2,1.3,2.2,2.3,2.4' -a1 -e 'NA' <(sort file1.txt) <(sort file2.txt) | awk '{printf "%-7s %-7s %-7s %-7s %-7s %7s\n",$1,$2,$3,$4,$5,$6}' | sort -nk1
input :
file 1:
ID address phone
123 jordan 123
456 usa 144
789 bla 606
101 bla 1616
file 2 :
ID City Zipcode if_travel
123 amman 2222 yes
456 zarqa 3030 no
output :
ID address phone City Zipcode if_travel
101 bla 1616 NA NA NA
123 jordan 123 amman 2222 yes
456 usa 144 zarqa 3030 no
789 bla 606 NA NA NA
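For comparison, since the question likens this to a left join in Python, here is a minimal plain-Python sketch of the same idea. It assumes whitespace-separated files already read into lists of fields, with the ID in the first column; left_join and fill are names made up for this example.

```python
def left_join(left_rows, right_rows, fill="NA"):
    """Left-join two tables (lists of field lists, header first) on column 0."""
    header_l, *body_l = left_rows
    header_r, *body_r = right_rows
    # Index the right-hand table by ID for constant-time lookups.
    lookup = {row[0]: row[1:] for row in body_r}
    width = len(header_r) - 1
    out = [header_l + header_r[1:]]
    for row in body_l:
        # Rows without a match get the fill value in every right-hand column.
        out.append(row + lookup.get(row[0], [fill] * width))
    return out

file1 = [line.split() for line in
         ["ID address phone", "123 jordan 123", "456 usa 144",
          "789 bla 606", "101 bla 1616"]]
file2 = [line.split() for line in
         ["ID City Zipcode if_travel", "123 amman 2222 yes", "456 zarqa 3030 no"]]

for row in left_join(file1, file2):
    print(" ".join(row))
```

Unlike the join pipeline, this keeps the rows in the order of the first file, as in the desired output shown in the question.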
input.txt
ABC
CDE
EFG
XYZ
ABC
PQR
EFG
From the above file I want to print the lines between 'ABC' and the first occurrence of 'EFG' after it.
Expected output :
ABC
CDE
EFG
ABC
PQR
EFG
How can I print lines from one word to the first occurrence of a second word?
EDIT: In case you want to print every block of lines from ABC to EFG and leave out the others, then try the following.
awk '/ABC/{found=1} found;/EFG/{found=""}' Input_file
Could you please try the following.
awk '/ABC/{flag=1} flag && !count;/EFG/{count++}' Input_file
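The flag idea used in both awk one-liners can be sketched in Python as below. This is an illustration of the same state machine rather than a literal translation; lines_between and first_only are names made up for this example.

```python
def lines_between(lines, start="ABC", end="EFG", first_only=False):
    """Collect lines from a start match up to the next end match."""
    out = []
    inside = False
    for line in lines:
        if start in line:
            inside = True          # like /ABC/{found=1}
        if inside:
            out.append(line)       # like the bare `found` pattern
            if end in line:
                inside = False     # like /EFG/{found=""}
                if first_only:
                    break          # stop after the first complete block
    return out

data = ["ABC", "CDE", "EFG", "XYZ", "ABC", "PQR", "EFG"]
print(lines_between(data))
# ['ABC', 'CDE', 'EFG', 'ABC', 'PQR', 'EFG']
print(lines_between(data, first_only=True))
# ['ABC', 'CDE', 'EFG']
```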
$ awk '/ABC/,/EFG/' file
Output:
ABC
CDE
EFG
ABC
PQR
EFG
This might work for you (GNU sed):
sed -n '/ABC/{:a;N;/EFG/!ba;p}' file
Turn off implicit printing by using the -n option.
Gather up lines between ABC and EFG and then print them. Repeat.
If you want to only print between the first occurrence of ABC to EFG, use:
sed -n '/ABC/{:a;N;/EFG/!ba;p;q}' file
To print the second through fourth occurrences, use:
sed -En '/ABC/{:a;N;/EFG/!ba;x;s/^/x/;/^x{2,4}$/{x;p;x};x;}' file
If the file is like this:
ram_file
abc
123
end_file
tony_file
xyz
456
end_file
bravo_file
uvw
789
end_file
Now I want to access the text between ram_file and end_file, tony_file and end_file, and bravo_file and end_file, all at once. I tried the sed command, but I don't know how to specify *_file in it.
Thanks in advance
This awk should do the job for you.
This solution treats end_file as the end of a block, and every other xxxx_file line as the start of a block.
It will not print text that sits between blocks, if there is any, such as the line "do not print this" in my example.
awk '/end_file/{f=0} f; /_file/ && !/end_file/ {f=1}' file
abc
123
xyz
456
uvw
789
cat file
ram_file
abc
123
end_file
do not print this
tony_file
xyz
456
end_file
nor this data
bravo_file
uvw
789
end_file
If you would like some formatting, it can be done easily with awk:
awk -F_ '/end_file/{printf (f?RS:"");f=0} f; /file/ && !/end_file/ {f=1;print "-Block-"++c"--> "$1}' file
-Block-1--> ram
abc
123
-Block-2--> tony
xyz
456
-Block-3--> bravo
uvw
789
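If the blocks need to be handled separately rather than just printed, the same start/end logic can be sketched in Python. This assumes each start line ends in _file and each block ends with end_file, as in the sample; read_blocks is a name made up for this example.

```python
def read_blocks(lines):
    """Group the lines between each xxx_file / end_file pair by block name."""
    blocks = {}
    current = None
    for line in lines:
        line = line.strip()
        if line == "end_file":
            current = None                       # block closed
        elif line.endswith("_file"):
            current = line[: -len("_file")]      # e.g. "ram_file" -> "ram"
            blocks[current] = []
        elif current is not None:
            blocks[current].append(line)         # only keep lines inside a block
    return blocks

sample = """ram_file
abc
123
end_file
do not print this
tony_file
xyz
456
end_file
nor this data
bravo_file
uvw
789
end_file""".splitlines()

print(read_blocks(sample))
# {'ram': ['abc', '123'], 'tony': ['xyz', '456'], 'bravo': ['uvw', '789']}
```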
I have an HTML file that I want to trim. I want to remove a section from the beginning all the way to a given string, and from another string to the end. How do I do that, preferably using sed?
With GNU sed:
sed '/mark1/,/mark2/d;/mark3/,$d'
this
abc
def
mark1
ghi
jkl
mno
mark2
pqr
stu
mark3
vwx
yz
becomes
abc
def
pqr
stu
You can use awk:
$ cat file
mark1 dsf
abc
def
before mark2 after
blah mark1
ghi
jkl
mno
wirds mark2 here
pqr
stu
mark3
vwx
yz
$ awk -vRS="mark2" '/mark1/{gsub("mark1.*","")}/mark3/{ gsub("mark3.*","");print;f=1 } !f ' file
after
blah
here
pqr
stu
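For reference, the trim as the question words it (drop everything up to one string, and everything from another string onward) can be sketched in Python with str.partition. This assumes the markers themselves should be dropped too, and that a missing marker means nothing is cut on that side; trim_between is a name made up for this example.

```python
def trim_between(text, start_marker, end_marker):
    """Keep only the text after start_marker and before end_marker."""
    _, found_start, rest = text.partition(start_marker)
    if not found_start:
        rest = text                  # start marker absent: keep from the beginning
    kept, _, _ = rest.partition(end_marker)
    return kept                      # end marker absent: partition returns all of rest

page = "<html><head>x</head><body>keep me</body></html>"
print(trim_between(page, "<body>", "</body>"))
# keep me
```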