Reading from a file: how to append strings to a list until a specific marker in Python - python-3.x

I have an input text file:
This file contains information of students who are joining picnic:
### bob alice rhea john
mary alex roma
peter &##&
Now I would like to start appending the students' names to a list once the ### marker is found, and stop appending when the &##& marker is found. The output should be (using Python):
bob alice rhea john mary alex roma peter

You can use this method:
txt = """This file contains information of students who are joining picnic:
### bob alice rhea john
mary alex roma
peter &##&"""
or:
txt = open("./file.txt").read()
ls = txt.split("###")[1].split("&##&")[0].split()
print(ls)
This prints:
['bob', 'alice', 'rhea', 'john', 'mary', 'alex', 'roma', 'peter']
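If you prefer to process the file line by line instead of slicing the whole text, here is a minimal sketch that toggles collection on the two markers (it assumes the markers appear as separate whitespace-delimited tokens, as in the sample file):
names = []
collecting = False
with open("./file.txt") as f:
    for line in f:
        for word in line.split():
            if word == "###":
                collecting = True    # start collecting after the opening marker
            elif word == "&##&":
                collecting = False   # stop at the closing marker
            elif collecting:
                names.append(word)
print(names)
# ['bob', 'alice', 'rhea', 'john', 'mary', 'alex', 'roma', 'peter']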

Using re.findall with re.sub (note the import):
import re

inp = """This file contains information of students who are joining picnic:
### bob alice rhea john
mary alex roma
peter &##&"""
output = re.sub(r'\s+', ' ', re.findall(r'###\s+(.*?)\s+&##&', inp, flags=re.DOTALL)[0])
print(output) # bob alice rhea john mary alex roma peter
If you want a list, then use:
output = re.split(r'\s+', re.findall(r'###\s+(.*?)\s+&##&', inp, flags=re.DOTALL)[0])
print(output)
This prints:
['bob', 'alice', 'rhea', 'john', 'mary', 'alex', 'roma', 'peter']

Related

Regular Expressions Python replacing couple names

I would like to find and replace expressions like "John and Jane Doe" with "John Doe and Jane Doe"
For a sample expression:
regextest = 'Heather Robinson, Jane and John Smith, Kiwan and Nichols Brady John, Jimmy Nichols, Melanie Carbone, and Nancy Brown'
I can find the expression and replace it with a fixed string but I am not able to replace it with a modification of the original text.
re.sub(r'[a-zA-Z]+\s*and\s*[a-zA-Z]+.[^,]*', "kittens", regextest)
Output: 'Heather Robinson, kittens, kittens, Jimmy Nichols, Melanie Carbone, and Nancy Brown'
I think that instead of a string ("kittens") we can pass a function that makes that change, but I am unable to write that function. I am getting errors with the code below.
def re_couple_name_and(m):
    return f'*{m.group(0).split()[0]+m.group(0).split()[-1:]+ m.group(0).split()[1:]}'

re.sub(r'[a-zA-Z]+\s*and\s*[a-zA-Z]+.[^,]*', re_couple_name_and, regextest)
IIUC, one way using capture groups:
def re_couple_name_and(m):
    family_name = m.group(3).split(" ", 1)[1]
    return "%s %s" % (m.group(1), family_name) + m.group(2) + m.group(3)

re.sub(r'([a-zA-Z]+)(\s*and\s*)([a-zA-Z]+.[^,]*)', re_couple_name_and, regextest)
Output:
'Heather Robinson, Jane Smith and John Smith, Kiwan Brady John and Nichols Brady John, Jimmy Nichols, Melanie Carbone, and Nancy Brown'
You can use the regex below to capture the items to be interchanged, then use re.sub() to construct the new string.
(\w+)( +and +)(\w+)( +[^,]*)
Example
import re
text="Heather Robinson, Jane and John Smith, Kiwan and Nichols Brady John, Jimmy Nichols, Melanie Carbone, and Nancy Brown"
print(re.sub(r"(\w+)( +and +)(\w+)( +[^,]*)",r"\1\4\2\3\4",text))
Output
Heather Robinson, Jane Smith and John Smith, Kiwan Brady John and Nichols Brady John, Jimmy Nichols, Melanie Carbone, and Nancy Brown

Selecting the first and the last word from a 3-word string in PL/SQL

For example I have names like these:
John Lucas Smith
Kevin Thomas Bacon
I need to do it with regexp_substr, or replace, or something like that.
What I want to get is:
John Smith
Kevin Bacon
Thank you!
Something like this?
SQL> with test (col) as
2 (select 'John Lucas Smith' from dual union
3 select 'Kevin Thomas Bacon' from dual union
4 select 'Little Foot' from dual
5 )
6 select regexp_substr(col, '^\w+') ||' '||
7 regexp_substr(col, '\w+$') first_and_last
8 from test;
FIRST_AND_LAST
-------------------------------------
John Smith
Kevin Bacon
Little Foot
SQL>
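If you ever need the same first-and-last-word logic outside the database, here is a minimal Python sketch of the same idea:
def first_and_last(name):
    words = name.split()
    return f"{words[0]} {words[-1]}"  # also works unchanged for two-word names

print(first_and_last("John Lucas Smith"))    # John Smith
print(first_and_last("Kevin Thomas Bacon"))  # Kevin Bacon
print(first_and_last("Little Foot"))         # Little Foot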

Using Spark to merge the content of two or more files and manipulate the content

Can I use Spark to do the following?
I have three files to merge and change the contents:
The first file, called column_header.tsv, has this content:
first_name last_name address zip_code browser_type
The second file, called data_file.tsv, has this content:
John Doe 111 New Drive, Ca 11111 34
Mary Doe 133 Creator Blvd, NY 44499 40
Mike Coder 13 Jumping Street UT 66499 28
The third file, called browser_type.tsv, has this content:
34 Chrome
40 Safari
28 FireFox
The final_output.tsv file, after Spark processes the above, should have this content:
first_name last_name address zip_code browser_type
John Doe 111 New Drive, Ca 11111 Chrome
Mary Doe 133 Creator Blvd, NY 44499 Safari
Mike Coder 13 Jumping Street UT 66499 FireFox
Is this doable using Spark? I will also consider sed or awk if it is possible to use those tools. I know the above is possible with Python, but I would prefer using Spark to do the data manipulation and changes. Any suggestions? Thanks in advance.
Here it is in awk, just in case. Notice the file order:
$ awk 'NR==FNR{ a[$1]=$2;next }{ $NF=($NF in a?a[$NF]:$NF) }1' file3 file1 file2
Output:
first_name last_name address zip_code browser_type
John Doe 111 New Drive, Ca 11111 Chrome
Mary Doe 133 Creator Blvd, NY 44499 Safari
Mike Coder 13 Jumping Street UT 66499 FireFox
Explained:
NR==FNR {                             # process the browser_type file
    a[$1]=$2                          # map browser code ($1) to browser name ($2)
    next }                            # skip to the next record
{                                     # process the other files
    $NF=( $NF in a ? a[$NF] : $NF) }  # replace the last field with the browser name from a
1                                     # implicit print
It is possible. Read the header:
with open("column_header.tsv") as fr:
    columns = fr.readline().split()
Read data_file.tsv:
users = spark.read.option("delimiter", "\t").csv("data_file.tsv").toDF(*columns)
Read browser_type.tsv:
browsers = spark.read.option("delimiter", "\t").csv("browser_type.tsv") \
    .toDF("browser_type", "browser_name")
Join:
users.join(browsers, "browser_type", "left").write.csv(path)
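Putting it together, a minimal end-to-end sketch, assuming a local SparkSession, tab-delimited files throughout, and a placeholder output path:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("merge_tsv").getOrCreate()

# header file -> column names
with open("column_header.tsv") as fr:
    columns = fr.readline().split()

users = spark.read.option("delimiter", "\t").csv("data_file.tsv").toDF(*columns)
browsers = spark.read.option("delimiter", "\t").csv("browser_type.tsv") \
    .toDF("browser_type", "browser_name")

# swap the numeric code for the browser name, keeping the original column order
result = users.join(browsers, "browser_type", "left") \
    .drop("browser_type") \
    .withColumnRenamed("browser_name", "browser_type") \
    .select(*columns)

result.write.option("delimiter", "\t").option("header", True).csv("final_output.tsv")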

Use sed to return only the last line that contains a specific string

All help would be appreciated as I have tried much googling and have drawn a blank :)
I am new to sed and do not know what the command I need might be.
I have a file containing numerous lines, e.g.
John Smith Aweqwewq321
Mike Smith A2345613213
Jim Smith Ad432143432
Jane Smith A432434324
John Smith Bweqwewq321
Mike Smith B2345613213
Jim Smith Bd432143432
Jane Smith B432434324
John Smith Cweqwewq321
Mike Smith C2345613213
Jim Smith Cd432143432
Jane Smith C432434324
The file is plain text and is not formatted (i.e. not CSV or anything like that).
I want to search for a list of specific strings, e.g. John Smith, Mike Smith, Jim Smith, and return only the last line in the file for each string found (all other matching lines would be deleted).
(I do not necessarily need every unique entry, i.e. Jane Smith may or may not be needed.)
It is important that the original order of the lines found is maintained in the output.
So the result would be :
John Smith Cweqwewq321
Mike Smith C2345613213
Jim Smith Cd432143432
There are about 100 specific search strings.
Thank you :)
Assuming sample.txt contains the data you provided:
$ cat sample.txt
John Smith Aweqwewq321
Mike Smith A2345613213
Jim Smith Ad432143432
Jane Smith A432434324
John Smith Bweqwewq321
Mike Smith B2345613213
Jim Smith Bd432143432
Jane Smith B432434324
John Smith Cweqwewq321
Mike Smith C2345613213
Jim Smith Cd432143432
Jane Smith C432434324
For this sample data, the following script works just fine:
$ cut -f1,2 -d' ' sample.txt | sort | uniq | while read s; do tac sample.txt | grep -m1 -n -e "$s" ; done | sort -n -r -t':' | cut -f2 -d':'
John Smith Cweqwewq321
Mike Smith C2345613213
Jim Smith Cd432143432
Jane Smith C432434324
Here is the breakdown of the script:
First, generate all the unique strings (first name and last name, in this case).
Next, find the last occurrence of each of these strings. We do this by reversing the file with tac and taking the first occurrence, printing the line number along with the match.
Finally, sort the output in reverse line-number order (which restores the original file order), and then strip the line numbers (we don't need them).
You didn't specify the format of the given list; I assume it is comma-separated, the same as you wrote in the question: e.g. John Smith, Mike Smith, Jim Smith.
From your description, you want the whole line for each string found, not only columns 1 and 2.
Given those two points, I have:
awk -v list="John Smith, Mike Smith, Jim Smith" 'BEGIN{split(list,p,",\\s*")}
{
    for(i=1;i<=length(p);i++){
        if($0~p[i]){
            a[p[i]]=$0
            break
        }
    }
}
END{for(x in a)print a[x]}' file
You can fill the list with your strings, separated by commas. This outputs the following with your test data:
John Smith Cweqwewq321
Mike Smith C2345613213
Jim Smith Cd432143432
Reverse the file first, e.g. something like this:
$ sed -n '/Mike/{p;q}' <(tac input.txt)
Mike Smith C2345613213
sed -n -e 's/.*/&³/
H
$ {x
s/\n/²/g
t a
:a
s/²\([a-zA-Z]* [a-zA-Z]* \)[^³]*³\(\(.*\)²\1\)/\2/
t a
s/²//g;s/³$//;s/³/\
/g
p
}' YourFile
This works with any name (names cannot contain ² or ³). For compound names, change the pattern to [-a-zA-Z].
In your list there is also Jane Smith, which appears at least once.
For a specific list of names, use grep -f beforehand; it's faster and easier to maintain, and the sed code doesn't change.
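For comparison, here is the same idea as a minimal Python sketch: remember the last matching line and its position for each search string, then print the survivors in their original order (names.txt holding one search string per line is an assumption):
with open("names.txt") as f:
    searches = [line.strip() for line in f if line.strip()]

last = {}  # search string -> (line number, full line)
with open("input.txt") as f:
    for n, line in enumerate(f):
        for s in searches:
            if s in line:
                last[s] = (n, line.rstrip("\n"))  # later matches overwrite earlier ones

# emit the surviving lines in original file order
for n, line in sorted(last.values()):
    print(line)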

Remove string with brackets in R

I have a data.frame that looks like this:
name
Lily(1+2)
John(good+1)
Tom()
Jim
Alice(*+#)
.....
I want to remove all brackets and everything inside them in R. What should I do?
I would prefer my data.frame to look like this:
name
Lily
John
Tom
Jim
Alice
....
Thanks!
# read your sample data:
d <- read.table(text=readClipboard(), header=TRUE, comment='`')
# remove strings in parentheses
transform(d, name=gsub('\\(.*\\)', '', name))
# name
# 1 Lily
# 2 John
# 3 Tom
# 4 Jim
# 5 Alice
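If you ever need the same cleanup outside R, here is a minimal Python sketch of the same substitution (using a non-greedy pattern so each bracketed group is removed separately):
import re

names = ["Lily(1+2)", "John(good+1)", "Tom()", "Jim", "Alice(*+#)"]
cleaned = [re.sub(r'\(.*?\)', '', n) for n in names]
print(cleaned)  # ['Lily', 'John', 'Tom', 'Jim', 'Alice']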
