finding similar values of variable in pyspark dataframe

finding similar values of variable in pyspark dataframe - apache-spark

I can find similar strings with the following command, but instead of using a string, I need to use a variable to find similar values between two columns.
df.filter(df.column_name.like("%Spark%")).show()

pattern = "Spark"
df.filter(df.column_name.like("%" + pattern + "%")).show()
A perhaps better way is to use rlike with a regex pattern to take care of word boundaries:
pattern = "Spark"
df.filter(df.column_name.rlike("\\b" + pattern + "\\b")).show()

Related

Need guidance with Regular Expression in Python

I need help with one of my current tasks wherein i am trying to pick only the table names from the query via Python
So basically lets say a query looks like this
Create table a.dummy_table1
as
select a.dummycolumn1,a.dummycolumn2,a.dummycolumn3 from dual
Now i am passing this query into Python using STRINGIO and then reading only the strings where it starts with "a" and has "_" in it like below
table_list = set(re.findall(r'\ba\.\w+', str(data)))
Here data is the dataframe in which i have parsed the query using StringIO
now in table_list i am getting the below output
a.dummy_table1
a.dummycolumn1
a.dummycolumn2
whereas the Expected output should have been like
a.dummy_table1
<Let me know how we can get this done , have tried the above regular expression but that is not working properly>
Any help on same would be highly appreciated

Your current regex string r"\ba.\w+" simply matches any string which:
Begins with "a" (the "\ba" part)
Followed by a period (the "." part)
Followed by 1 or more alphanumeric characters (the "\w+" part).
If I've understood your problem correctly, you are looking to extract from str(data) any string fragments which match this pattern instead:
Begins with "a"
Followed by a period
Followed by 1 or more alphanumeric characters
Followed by an underscore
Followed by 1 or more alphanumeric characters
Thus, the regular expression should have "_\w+" added to the end to match criteria 4 and 5:
table_list = set(re.findall(r"\ba\.\w+_\w+", str(data)))

What is the difference between these Peewee query filter forms

In the description of filtering records in Peewee, there are examples of two alternative syntaxes: using commas to separate multiple conditions, such as the following example,
Tweet.select().where(Tweet.user == user, Tweet.is_published == True)
and using bitwise operators. I cannot figure out (and cannot find a description of) the difference between using the comma syntax and using bitwise operators. What does the comma syntax actually do? From the (single) documented example of using a comma, it seems like it might be equivalent to using &, as in
Tweet.select().where( (Tweet.user == user) & (Tweet.is_published == True) )
Is that the case?

Yes, they are equivalent, as per the code:
def where(self, *expressions):
if self._where is not None:
expressions = (self._where,) + expressions
self._where = reduce(operator.and_, expressions)

Replacing a certain part of string with a pre-specified Value

I am fairly new to Puppet and Ruby. Most likely this question has been asked before but I am not able to find any relevant information.
In my puppet code I will have a string variable retrieved from the fact hostname.
$n="$facts['hostname'].ex-ample.com"
I am expecting to get the values like these
DEV-123456-02B.ex-ample.com,
SCC-123456-02A.ex-ample.com,
DEV-123456-03B.ex-ample.com,
SCC-999999-04A.ex-ample.com
I want to perform the following action. Change the string to lowercase and then replace the
-02, -03 or -04 to -01.
So my output would be like
dev-123456-01b.ex-ample.com,
scc-123456-01a.ex-ample.com,
dev-123456-01b.ex-ample.com,
scc-999999-01a.ex-ample.com
I figured I would need to use .downcase on $n to make everything lowercase. But I am not sure how to replace the digits. I was thinking of .gsub or split but not sure how. I would prefer to make this happen in a oneline code.

If you really want a one-liner, you could run this against each string:
str
.downcase
.split('-')
.map
.with_index { |substr, i| i == 2 ? substr.gsub(/0[0-9]/, '01') : substr }
.join('-')
Without knowing what format your input list is taking, I'm not sure how to advise on how to iterate through it, but maybe you have that covered already. Hope it helps.

Note that Puppet and Ruby are entirely different languages and the other answers are for Ruby and won't work in Puppet.
What you need is:
$h = downcase(regsubst($facts['hostname'], '..(.)$', '01\1'))
$n = "${h}.ex-ample.com"
notice($n)
Note:
The downcase and regsubst functions come from stdlib.
I do a regex search and replace using the regsubst function and replace ..(.)$ - 2 characters followed by another one that I capture at the end of the string and replace that with 01 and the captured string.
All of that is then downcased.

If the -01--04 part is always on the same string index you could use that to replace the content.
original = 'DEV-123456-02B.ex-ample.com'
# 11 -^
string = original.downcase # creates a new downcased string
string[11, 2] = '01' # replace from index 11, 2 characters
string #=> "dev-123456-01b.ex-ample.com"

rstrip() has no effect on string

Trying to use rstrip() at its most basic level, but it does not seem to have any effect at all.
For example:
string1='text&moretext'
string2=string1.rstrip('&')
print(string2)
Desired Result:
text
Actual Result:
text&moretext
Using Python 3, PyScripter
What am I missing?

someString.rstrip(c) removes all occurences of c at the end of the string. Thus, for example
'text&&&&'.rstrip('&') = 'text'
Perhaps you want
'&'.join(string1.split('&')[:-1])
This splits the string on the delimiter "&" into a list of strings, removes the last one, and joins them again, using the delimiter "&". Thus, for example
'&'.join('Hello&World'.split('&')[:-1]) = 'Hello'
'&'.join('Hello&Python&World'.split('&')[:-1]) = 'Hello&Python'

Lua: Search a specific string

Hi all tried all the string pattrens and library arguments but still stuck.
i want to get the name of the director from the following string i have tried the string.matcH but it matches the from the first character it finD from the string
the string is...
fixstrdirector = {id:39254,cast:[{id:15250,name:Hope Davis,character:Aunt Debra,order:5,cast_id:10,profile_path:/aIHF11Ss8P0A8JUfiWf8OHPVhOs.jpg},{id:53650,name:Anthony Mackie,character:Finn,order:3,cast_id:11,profile_path:/5VGGJ0Co8SC94iiedWb2o3C36T.jpg},{id:19034,name:Evangeline Lilly,character:Bailey Tallet,order:2,cast_id:12,profile_path:/oAOpJKgKEdW49jXrjvUcPcEQJb3.jpg},{id:6968,name:Hugh Jackman,character:Charlie Kenton,order:0,cast_id:13,profile_path:/wnl7esRbP3paALKn4bCr0k8qaFu.jpg},{id:79072,name:Kevin Durand,character:Ricky,order:4,cast_id:14,profile_path:/c95tTUjx5T0D0ROqTcINojpH6nB.jpg},{id:234479,name:Dakota Goyo,character:Max Kenton,order:1,cast_id:15,profile_path:/7PU6n4fhDuFwuwcYVyRNVEZE7ct.jpg},{id:8986,name:James Rebhorn,character:Marvin,order:6,cast_id:16,profile_path:/ezETMv0YM0Rg6YhKpu4vHuIY37D.jpg},{id:930729,name:Marco Ruggeri,character:Cliff,order:7,cast_id:17,profile_path:/1Ox63ukTd2yfOf1LVJOMXwmeQjO.jpg},{id:19860,name:Karl Yune,character:Tak Mashido,order:8,cast_id:18,profile_path:/qK315vPObCNdywdRN66971FtFez.jpg},{id:111206,name:Olga Fonda,character:Farra Lemkova,order:9,cast_id:19,profile_path:/j1qabOHf3Pf82f1lFpUmdF5XvSp.jpg},{id:53176,name:John Gatins,character:Kingpin,order:10,cast_id:41,profile_path:/A2MqnSKVzOuBf8MVfNyve2h2LxJ.jpg},{id:1126350,name:Sophie Levy,character:Big Sister,order:11,cast_id:42,profile_path:null},{id:1126351,name:Tess Levy,character:Little Sister,order:12,cast_id:43,profile_path:null},{id:1126352,name:Charlie Levy,character:Littlest Sister,order:13,cast_id:44,profile_path:null},{id:187983,name:Gregory Sims,character:Bill Panner,order:14,cast_id:45,profile_path:null}],crew:[{id:58726,name:Leslie Bohem,department:Writing,job:Screenplay,profile_path:null},{id:53176,name:John Gatins,department:Writing,job:Screenplay,profile_path:/A2MqnSKVzOuBf8MVfNyve2h2LxJ.jpg},{id:17825,name:Shawn Levy,department:Directing,job:Director,profile_path:/7f2f8EXdlWsPYN0HPGcIlG21xU.jpg},{id:12415,name:Richard Matheson,department:Writing,job:Story,profile_path:null},{id:57113,name:Dan Gilroy,department:Writing,job:Story,profile_path:null},{id:25210,name:Jeremy Leven,department:Writing,job:Story,profile_path:null},{id:17825,name:Shawn Levy,department:Production,job:Producer,profile_path:/7f2f8EXdlWsPYN0HPGcIlG21xU.jpg},{id:34970,name:Susan Montford,department:Production,job:Producer,profile_path:/1XJt51Y9ciPhkHrAYE0j6Jsmgji.jpg},{id:3183,name:Don Murphy,department:Production,job:Producer,profile_path:null},{id:34967,name:Rick Benattar,department:Production,job:Producer,profile_path:null},{id:1126348,name:Eric Hedayat,department:Production,job:Producer,profile_path:null},{id:186721,name:Ron Ames,department:Production,job:Producer,profile_path:null},{id:10956,name:Josh McLaglen,department:Production,job:Executive Producer,profile_path:null},{id:57634,name:Mary McLaglen,department:Production,job:Executive Producer,profile_path:null},{id:23779,name:Jack Rapke,department:Production,job:Executive Producer,profile_path:null},{id:488,name:Steven Spielberg,department:Production,job:Executive Producer,profile_path:/cuIYdFbEe89PHpoiOS9tmo84ED2.jpg},{id:30,name:Steve Starkey,department:Production,job:Executive Producer,profile_path:null},{id:24,name:Robert Zemeckis,department:Production,job:Executive Producer,profile_path:/isCuZ9PWIOyXzdf3ihodXzjIumL.jpg},{id:531,name:Danny Elfman,department:Sound,job:Original Music Composer,profile_path:/pWacZpYPos8io22nEiim7d3wp2j.jpg},{id:18265,name:Mauro Fiore,department:Crew,job:Cinematography,profile_path:null},{id:54271,name:Dean Zimmerman,department:Editing,job:Editor,profile_path:null},{id:25365,name:Richard Hicks,department:Production,job:Casting,profile_path:null},{id:5490,name:David Rubin,department:Production,job:Casting,profile_path:null},{id:52088,name:Tom Meyer,department:Art,job:Production Design,profile_path:null}]}
i have tried string.match(fixstrdirector,"name:(.+),department:Directing")
but it gives me the from the first occurace it find the name to the end of thr string
output:
Hope Davis,character:Aunt Debra,order:5,cast_id:10,profile_path:/aIHF11Ss8P0A8JUfiWf8OHPVhOs.jpg},{id:53650,name:Anthony Mackie,character:Finn,order:3,cast_id:11,profile_path:/5VGGJ0Co8SC94iiedWb2o3C36T.jpg},{id:19034,name:Evangeline Lilly,character:Bailey Tallet,order:2,cast_id:12,profile_path:/oAOpJKgKEdW49jXrjvUcPcEQJb3.jpg},{id:6968,name:Hugh Jackman,character:Charlie Kenton,order:0,cast_id:13,profile_path:/wnl7esRbP3paALKn4bCr0k8qaFu.jpg},{id:79072,name:Kevin Durand,character:Ricky,order:4,cast_id:14,profile_path:/c95tTUjx5T0D0ROqTcINojpH6nB.jpg},{id:234479,name:Dakota Goyo,character:Max Kenton,order:1,cast_id:15,profile_path:/7PU6n4fhDuFwuwcYVyRNVEZE7ct.jpg},{id:8986,name:James Rebhorn,character:Marvin,order:6,cast_id:16,profile_path:/ezETMv0YM0Rg6YhKpu4vHuIY37D.jpg},{id:930729,name:Marco Ruggeri,character:Cliff,order:7,cast_id:17,profile_path:/1Ox63ukTd2yfOf1LVJOMXwmeQjO.jpg},{id:19860,name:Karl Yune,character:Tak Mashido,order:8,cast_id:18,profile_path:/qK315vPObCNdywdRN66971FtFez.jpg},{id:111206,name:Olga Fonda,character:Farra Lemkova,order:9,cast_id:19,profile_path:/j1qabOHf3Pf82f1lFpUmdF5XvSp.jpg},{id:53176,name:John Gatins,character:Kingpin,order:10,cast_id:41,profile_path:/A2MqnSKVzOuBf8MVfNyve2h2LxJ.jpg},{id:1126350,name:Sophie Levy,character:Big Sister,order:11,cast_id:42,profile_path:null},{id:1126351,name:Tess Levy,character:Little Sister,order:12,cast_id:43,profile_path:null},{id:1126352,name:Charlie Levy,character:Littlest Sister,order:13,cast_id:44,profile_path:null},{id:187983,name:Gregory Sims,character:Bill Panner,order:14,cast_id:45,profile_path:null}],crew:[{id:58726,name:Leslie Bohem,department:Writing,job:Screenplay,profile_path:null},{id:53176,name:John Gatins,department:Writing,job:Screenplay,profile_path:/A2MqnSKVzOuBf8MVfNyve2h2LxJ.jpg},{id:17825,name:Shawn Levy

You're searching from the first occurrence of "name:" until the "department:Directing" with everything in between.
Instead, you need to restrict what can be between the two strings. Here for example I'm saying that the characters that make up the name can only be alphanumeric or a space:
string.match(fixstrdirector,"name:([%w ]+),department:Directing")
Alternatively, given that there's a comma separating the parameters, a better approach would be to search for "name:" followed by any characters other than a comma, followed by "department:Directing":
string.match(fixstrdirector,"name:([^,]+),department:Directing")
Of course that wouldn't work if the name had a comma it in!

Lua patterns provides - modifier for tasks as you have above. As stated on PiL - Section 20.2:
The + modifier matches one or more characters of the original class.
It will always get the longest sequence that matches the pattern.
Like *, the modifier - also matches zero or more occurrences of
characters of the original class. However, instead of matching the
longest sequence, it matches the shortest one.
Next, when you are using . to match, it'll find any and all characters satisfying the pattern. Therefore, you'll get the result from first occurence of name until the ,department:Directing is found. Since you know that it is a JSON data, you can try to match for [^,]; that is, non-comma characters.
So, for your case try:
local tAllNames = {}
for sName in fixstrdirector:gmatch( "name:([^,]-),department:Directing" ) do
tAllNames[ #tAllNames + 1 ] = sName
end
and all your required names will be stored in the table tAllNames. An example of the above can be seen at codepad.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

finding similar values of variable in pyspark dataframe - apache-spark

I can find similar strings with the following command, but instead of using a string, I need to use a variable to find similar values between two columns. df.filter(df.column_name.like("%Spark%")).show()

pattern = "Spark" df.filter(df.column_name.like("%" + pattern + "%")).show() A perhaps better way is to use rlike with a regex pattern to take care of word boundaries: pattern = "Spark" df.filter(df.column_name.rlike("\\b" + pattern + "\\b")).show()

Related

Need guidance with Regular Expression in Python

What is the difference between these Peewee query filter forms

Replacing a certain part of string with a pre-specified Value

rstrip() has no effect on string

Lua: Search a specific string

Categories

Resources