I have a data.frame that looks like this:
name
Lily(1+2)
John(good+1)
Tom()
Jim
Alice(*+#)
.....
I want to remove all the brackets and everything inside them in R. What should I do?
I would prefer my data.frame to look like this:
name
Lily
John
Tom
Jim
Alice
....
Thanks!
# read your sample data (readClipboard() is Windows-only; copy the column to the clipboard first).
# The comment character is changed from the default '#' so that "Alice(*+#)" is read intact:
d <- read.table(text=readClipboard(), header=TRUE, comment.char='`')
# remove the parentheses and everything inside them:
transform(d, name=gsub('\\(.*\\)', '', name))
# name
# 1 Lily
# 2 John
# 3 Tom
# 4 Jim
# 5 Alice
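For comparison, the same regex idea carries over to Python/pandas if you ever need it there; a minimal sketch, with the sample data typed in by hand:
import pandas as pd

d = pd.DataFrame({"name": ["Lily(1+2)", "John(good+1)", "Tom()", "Jim", "Alice(*+#)"]})
# drop a parenthesized span, including the parentheses themselves
d["name"] = d["name"].str.replace(r"\(.*\)", "", regex=True)
print(d)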
I have an input text file:
This file contains information of students who are joining picnic:
### bob alice rhea john
mary alex roma
peter &##&
Now I would like to start appending student names to a list once the ### marker is found, and stop appending once the &##& marker is found. The output should be (using Python):
bob alice rhea john mary alex roma peter
You can use this method:
txt = """This file contains information of students who are joining picnic:
### bob alice rhea john
mary alex roma
peter &##&"""
or read it from the file:
txt = open("./file.txt").read()
# keep the text between the two markers, then split it on whitespace
ls = txt.split("###")[1].split("&##&")[0].split()
print(ls)
This prints:
['bob', 'alice', 'rhea', 'john', 'mary', 'alex', 'roma', 'peter']
Using re.findall with re.sub:
import re

inp = """This file contains information of students who are joining picnic:
### bob alice rhea john
mary alex roma
peter &##&"""
output = re.sub(r'\s+', ' ', re.findall(r'###\s+(.*?)\s+&##&', inp, flags=re.DOTALL)[0])
print(output) # bob alice rhea john mary alex roma peter
If you want a list, then use:
output = re.split(r'\s+', re.findall(r'###\s+(.*?)\s+&##&', inp, flags=re.DOTALL)[0])
print(output)
This prints:
['bob', 'alice', 'rhea', 'john', 'mary', 'alex', 'roma', 'peter']
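If the file is large, you can also apply the marker logic line by line, as the question describes, without reading the whole file into memory. A sketch, assuming each marker appears exactly once and may share a line with names, as in the sample:
collecting = False
names = []
with open("./file.txt") as fh:
    for line in fh:
        for word in line.split():
            if word == "###":
                collecting = True       # start collecting after the opening marker
            elif word == "&##&":
                collecting = False      # stop at the closing marker
            elif collecting:
                names.append(word)
print(names)  # ['bob', 'alice', 'rhea', 'john', 'mary', 'alex', 'roma', 'peter']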
I want to search for a pattern and add a sequential number to the end of each matching line.
For example, I have the following text file:
Male
John #
Tom #
Jack #
Female
Mary #
Jenny #
I want to search for the pattern "#" and append a sequential number to the end of each such line:
Male
John #0
Tom #1
Jack #2
Female
Mary #3
Jenny #4
You could substitute with ascending numbers, as explained in this tip from the Vim Wiki (the counter starts at 0 so the result matches your expected output):
:let i=0 | g/#$/s//\='#'.i/ | let i=i+1
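Outside Vim, the same renumbering takes only a few lines of script; a Python sketch, assuming the file is called names.txt:
counter = 0
with open("names.txt") as fh:
    for line in fh:
        line = line.rstrip("\n")
        if line.endswith("#"):
            line += str(counter)  # append the running number right after the '#'
            counter += 1
        print(line)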
For example, I have names like these:
John Lucas Smith
Kevin Thomas Bacon
I need to do it with regexp_substr, replace, or something like that.
What I want to get is:
John Smith
Kevin Bacon
Thank you!
Something like this?
SQL> with test (col) as
2 (select 'John Lucas Smith' from dual union
3 select 'Kevin Thomas Bacon' from dual union
4 select 'Little Foot' from dual
5 )
6 select regexp_substr(col, '^\w+') ||' '||
7 regexp_substr(col, '\w+$') first_and_last
8 from test;
FIRST_AND_LAST
-------------------------------------
John Smith
Kevin Bacon
Little Foot
SQL>
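Outside the database, the same first-word-plus-last-word idea is a couple of lines per name; a Python sketch for comparison, with the sample names hard-coded:
for name in ["John Lucas Smith", "Kevin Thomas Bacon", "Little Foot"]:
    parts = name.split()
    print(parts[0], parts[-1])  # first and last word; two-word names pass through unchanged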
Can I use Spark to do the following?
I have three files to merge and change the contents:
The first file, column_header.tsv, has this content:
first_name last_name address zip_code browser_type
The second file, data_file.tsv, has this content:
John Doe 111 New Drive, Ca 11111 34
Mary Doe 133 Creator Blvd, NY 44499 40
Mike Coder 13 Jumping Street UT 66499 28
The third file, browser_type.tsv, has this content:
34 Chrome
40 Safari
28 FireFox
After Spark processes the above, the final_output.tsv file should have these contents:
first_name last_name address zip_code browser_type
John Doe 111 New Drive, Ca 11111 Chrome
Mary Doe 133 Creator Blvd, NY 44499 Safari
Mike Coder 13 Jumping Street UT 66499 FireFox
Is this doable using Spark? I will also consider sed or awk if it is possible with those tools. I know the above is possible with plain Python, but I would prefer using Spark for the data manipulation. Any suggestions? Thanks in advance.
Here it is in awk, just in case. Notice the file order:
$ awk 'NR==FNR{ a[$1]=$2;next }{ $NF=($NF in a?a[$NF]:$NF) }1' file3 file1 file2
Output:
first_name last_name address zip_code browser_type
John Doe 111 New Drive, Ca 11111 Chrome
Mary Doe 133 Creator Blvd, NY 44499 Safari
Mike Coder 13 Jumping Street UT 66499 FireFox
Explained:
NR==FNR { # process browser_type file
a[$1]=$2 # map each browser code to its browser name
next } # skip to the next record
{ # process the other files
$NF=( $NF in a ? a[$NF] : $NF) } # replace last field with browser from a
1 # implicit print
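The awk program above is just a lookup-table join: build a map from the small file, then rewrite the last field of every other line. The same idea in plain Python, as a sketch that assumes the data rows are tab-separated (as the .tsv extension suggests):
# build the code -> browser-name map from the small file
browsers = {}
with open("browser_type.tsv") as fh:
    for line in fh:
        code, name = line.split()
        browsers[code] = name

# print the header, then the data rows with the last field replaced
with open("column_header.tsv") as fh:
    print(fh.readline().rstrip("\n"))
with open("data_file.tsv") as fh:
    for line in fh:
        fields = line.rstrip("\n").split("\t")
        fields[-1] = browsers.get(fields[-1], fields[-1])
        print("\t".join(fields))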
It is possible. Read the header:
with open("column_header.tsv") as fr:
columns = fr.readline().split()
Read data_file.tsv:
users = spark.read.option("delimiter", "\t").csv("data_file.tsv").toDF(*columns)
Read browser_type.tsv (note the filename, and the tab delimiter here too):
browsers = spark.read.option("delimiter", "\t").csv("browser_type.tsv") \
    .toDF("browser_type", "browser_name")
Join:
users.join(browsers, "browser_type", "left").write.csv(path)
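To match final_output.tsv exactly, you would also drop the numeric code, rename the browser-name column back, and write with a tab delimiter and a header; a sketch (the output path final_output is just a placeholder):
result = (users.join(browsers, "browser_type", "left")
               .drop("browser_type")
               .withColumnRenamed("browser_name", "browser_type"))
result.write.option("delimiter", "\t").option("header", True).csv("final_output")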
All help would be appreciated, as I have done a lot of googling and drawn a blank :)
I am new to sed and do not know what the command I need might be.
I have a file containing numerous lines, e.g.:
John Smith Aweqwewq321
Mike Smith A2345613213
Jim Smith Ad432143432
Jane Smith A432434324
John Smith Bweqwewq321
Mike Smith B2345613213
Jim Smith Bd432143432
Jane Smith B432434324
John Smith Cweqwewq321
Mike Smith C2345613213
Jim Smith Cd432143432
Jane Smith C432434324
The file is plain text and is not formatted (i.e. not CSV or anything like that).
I want to search for a list of specific strings, e.g. John Smith, Mike Smith, Jim Smith, and return only the last matching line in the file for each string (all other matching lines would be deleted).
(I do not necessarily need every unique entry, i.e. Jane Smith may or may not be needed.)
It is important that the original order of the lines found is maintained in the output.
So the result would be :
John Smith Cweqwewq321
Mike Smith C2345613213
Jim Smith Cd432143432
There are about 100 specific search strings.
Thank you :)
Assuming sample.txt contains the data you provided:
$ cat sample.txt
John Smith Aweqwewq321
Mike Smith A2345613213
Jim Smith Ad432143432
Jane Smith A432434324
John Smith Bweqwewq321
Mike Smith B2345613213
Jim Smith Bd432143432
Jane Smith B432434324
John Smith Cweqwewq321
Mike Smith C2345613213
Jim Smith Cd432143432
Jane Smith C432434324
For this sample data, the following script works just fine:
$ cut -f1,2 -d' ' sample.txt | sort | uniq | while read s; do tac sample.txt | grep -m1 -n -e "$s" ; done | sort -n -r -t':' | cut -f2 -d':'
John Smith Cweqwewq321
Mike Smith C2345613213
Jim Smith Cd432143432
Jane Smith C432434324
Here is the breakdown of the script:
First, generate all the unique strings (first name and last name in this case).
Then find the last occurrence of each of these strings: reverse the file with tac and take the first match, printing the line number along with the matching line.
Finally, sort the output in descending line-number order, which restores the original file order, and strip the line numbers (we don't need them any more).
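The same keep-the-last-occurrence-per-name logic, with the original order preserved, is also short in plain Python; a sketch, assuming the key is the first two whitespace-separated fields:
last = {}
with open("sample.txt") as fh:
    for pos, line in enumerate(fh):
        key = " ".join(line.split()[:2])      # e.g. "John Smith"
        last[key] = (pos, line.rstrip("\n"))  # later occurrences overwrite earlier ones

# sort by original position so the output keeps the file order
for pos, line in sorted(last.values()):
    print(line)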
You didn't say what format the given list is in; I assume it is comma-separated, as you wrote in the question: e.g. John Smith, Mike Smith, Jim Smith.
From your description, you want the whole line for each string found, not only columns 1 and 2.
Given those two points, I have:
awk -v list="John Smith, Mike Smith, Jim Smith" 'BEGIN{split(list,p,",\\s*")}
{for(i=1;i<=length(p);i++){
    if($0~p[i]){
      a[p[i]]=$0   # later matches overwrite earlier ones
      break
    }
  }
}END{for(i=1;i<=length(p);i++)if(p[i] in a)print a[p[i]]}' file
You can fill list with your own strings, separated by commas (the END loop walks the list in order, so the output order is deterministic). Output with your test data:
John Smith Cweqwewq321
Mike Smith C2345613213
Jim Smith Cd432143432
Reverse the file first, e.g. something like this:
$ sed -n '/Mike/{p;q}' <(tac input.txt)
Mike Smith C2345613213
sed -n -e 's/.*/&³/
H
$ {x
s/\n/²/g
t a
:a
s/²\([a-zA-Z]* [a-zA-Z]* \)[^³]*³\(\(.*\)²\1\)/\2/
t a
s/²//g;s/³$//;s/³/\
/g
p
}' YourFile
This works with any name (as long as it contains neither ² nor ³). For compound names, change the pattern to [-a-zA-Z].
Note that Jane Smith also appears in your file at least once.
To restrict the result to your specific list of names, filter with grep -f first; it's faster and easier to maintain than changing the sed code.