How to Pairwise Iterate Columns to Find Similarities
All the elements from all the columns of one DataFrame need to be compared with all the elements from all the columns of another DataFrame.
E.g., df1 has two fields, Name and Age:
Name, Age
"Ajay Malhotra", 28
"Sujata Krishanan", 27
"Madhav Shankar", 33
df2 has three fields: UserId, EmpId, and Email:
"UserId", "EmpId", "Email"
--------------------------------------
"Ajay.Malhotra", 100, "a.malt#nothing.com"
"Madhav.Shankar" , 101, "m.shankar"
"Sujata.Kris" , 1001,"Kris.Suja#nothing.com"
Some method gives a match value; for the example it can simply hard-code 0.73:
def chekIfSame(leftString: String, rightString: String): Double = {
  // Some logic that gives a match value
  0.73
}
How do we take each column from df1 and each column from df2 and pass them to chekIfSame?
The output could be a Cartesian product like this:
Name, UserId, MatchValue
--------------------------------------
"Sujata Krishanan", "Sujata.Kris", 0.85
"Ajay Malhotra", "Ajay.Malhotra", 0.98
"Ajay Malhotra", "Sujata.Kris", 0.07
Nested forEach loop over two DataFrames
We won't be able to nest-loop over the two DataFrames.
But we can join them and pass the columns to a function:
val joined = leftDf.crossJoin(rightDf)  // Cartesian product of the two DataFrames
val joinedWithScore = joined.withColumn("similarScore", checkSimilarity(joined(ltColName), joined(rtColName)))
For this, the chekIfSame logic needs to be wrapped as a UDF (checkSimilarity below) before the above operation:
val checkSimilarity = udf((left: String, right: String) => {
  // Logic, or hard-code 0.73
  0.73
})
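For reference, here is a minimal PySpark sketch of the same cross-join-plus-UDF idea; the column names and the hard-coded 0.73 come from the example above, and the lambda is only a stand-in for real matching logic:

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.getOrCreate()
df1 = spark.createDataFrame([("Ajay Malhotra", 28), ("Sujata Krishanan", 27)], ["Name", "Age"])
df2 = spark.createDataFrame([("Ajay.Malhotra", 100), ("Sujata.Kris", 1001)], ["UserId", "EmpId"])

# Placeholder similarity function; swap in real matching logic here.
check_if_same = F.udf(lambda left, right: 0.73, DoubleType())

result = (df1.crossJoin(df2)
             .select("Name", "UserId",
                     check_if_same(F.col("Name"), F.col("UserId")).alias("MatchValue")))
result.show()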
Related
I have this (simplified) Spark dataset with these columns:
"col1", "col2", "col3", "col4"
And I would like to add a new column: "result".
The value of "result" is the return value of a function that takes all the other columns ("col1", "col2", ...) values as parameters.
map/foreach can't change the iterated row, and UDF functions don't take a whole row as a parameter, so I will have to collect all the column names as input, and I will also have to specify each column type in the UDF registration part.
Notes:
The dataset doesn't have a lot of rows, so I don't mind having a low-performance solution.
The dataset does have a lot of columns with different types, so specifying all the columns in the UDF registration part doesn't seem like the most elegant solution.
The project is written in Java, so I'm using the Java API to interact with Spark.
How can I achieve that behavior?
You actually could add a new column with a map.
df.map { row =>
  val col1 = row.getAs[String]("col1")
  val col2 = row.getAs[String]("col2")
  // etc., extract all your columns
  val newColumn = col1 + col2
  // do what you need to do to obtain the value for the new column
  (col1, col2, ..., newColumn)
}.toDF("col1", "col2", ..., "new")
In terms of the Java API, this will be much the same, with some adjustments:
data.map((MapFunction<Row, Tuple3<String, String, String>>) row -> {
    String col1 = row.getAs("col1");
    String col2 = row.getAs("col2");
    // whatever you need
    String newColumns = col1 + col2;
    return new Tuple3<>(col1, col2, newColumns);
}, Encoders.tuple(Encoders.STRING(), Encoders.STRING(), Encoders.STRING()))
    .toDF("col1", "col2", ..., "new")
Alternatively, you could collect all your columns into an array and then process that array in your UDF.
val transformer = udf { arr: Seq[Any] =>
  // do your stuff, but beware of the types
}
data.withColumn("array", array($"col1", $"col2", ..., $"colN"))
.select($"col1", $"col2",..., transformer($"array") as "newCol")
I've found a solution for my question:
String[] allColumnsAsStrings = dataset.columns();
final Column[] allColumns = Arrays.stream(allColumnsAsStrings)
        .map(functions::col)
        .toArray(Column[]::new);

UserDefinedFunction addColumnUdf = udf((Row row) -> {
    double score = 0.0;
    // Calculate stuff based on the row values
    // ...
    return score;
}, DataTypes.DoubleType);

dataset = dataset.withColumn("score", addColumnUdf.apply(functions.struct(allColumns)));
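The same struct trick also works from PySpark; a minimal sketch with made-up columns and a made-up score calculation:

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a", 2.5)], ["col1", "col2", "col3"])

# Pack every column into a struct; the UDF receives it as a Row and can read fields by name.
score_udf = F.udf(lambda row: row["col1"] + row["col3"], DoubleType())
df = df.withColumn("score", score_udf(F.struct(*df.columns)))
df.show()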
Imagine I have a huge dataset which I partitionBy('id'). Assume that id is unique to a person, so there could be n rows per id, and the goal is to reduce them to one.
Basically, aggregating to make id distinct.
w = Window.partitionBy('id').rowsBetween(-sys.maxsize, sys.maxsize)
test1 = {
    key: F.first(key, True).over(w).alias(key)
    for key in some_dict.keys()
    if some_dict[key] == 'test1'
}
test2 = {
    key: F.last(key, True).over(w).alias(key)
    for key in some_dict.keys()
    if some_dict[key] == 'test2'
}
Assume that I have some_dict with values either as test1 or test2 and based on the value, I either take the first or last as shown above.
How do I actually call aggregate and reduce this?
cols = {**test1, **test2}
cols = list(cols.values())
df.select(*cols).groupBy('id').agg(*cols)  # Doesn't work
The above clearly doesn't work. Any ideas?
The goal here is: I have 5 unique IDs and 25 rows, with 5 rows per ID. I want to reduce the 25 rows to 5.
Let's assume your DataFrame is named df and contains duplicates; use the method below:
from pyspark.sql.functions import row_number
from pyspark.sql.window import Window

window = Window.partitionBy(df['id']).orderBy(df['id'])
final = df.withColumn("row_id", row_number().over(window)).filter("row_id = 1")
final.show(10,False)
Change the orderBy condition if there are specific criteria, so that the desired record ends up at the top of each partition.
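Alternatively, staying closer to the original first/last idea, the window functions could be swapped for their aggregate counterparts; a sketch assuming df and some_dict are as described in the question:

from pyspark.sql import functions as F

test1 = [F.first(k, ignorenulls=True).alias(k) for k, v in some_dict.items() if v == 'test1']
test2 = [F.last(k, ignorenulls=True).alias(k) for k, v in some_dict.items() if v == 'test2']
reduced = df.groupBy('id').agg(*(test1 + test2))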
I have two DataFrames, df1 and df2. df2 consists of "tagname" and "value" columns. The dictionary "bucket_dict" holds the data from df2:
bucket_dict = dict(zip(df2.tagname,df2.value))
df1 has millions of rows and 3 columns: "apptag", "comments", and "Type". I want to match between these two DataFrames: if a dictionary key from bucket_dict is contained in df1["apptag"], then set df1["comments"] to that dictionary key and df1["Type"] to the corresponding bucket_dict value. I used the code below:
for each_tag in bucket_dict:
    df1.loc[df1["apptag"].str.match(each_tag, case=False, na=False), "comments"] = each_tag
    df1.loc[df1["apptag"].str.match(each_tag, case=False, na=False), "Type"] = bucket_dict[each_tag]
Is there any efficient way to do this, since it's taking a long time?
Bucketing df from which the dictionary has been created:
bucketing_df = pd.DataFrame([["pen", "study"], ["pencil", "study"], ["ersr","study"],["rice","grocery"],["wht","grocery"]], columns=['tagname', 'value'])
Other DataFrame:
output_df = pd.DataFrame([["test123-pen", "pen", " "], ["test234-pencil", "pencil", " "], ["test234-rice", "rice", " "]], columns=['apptag', 'comments', 'type'])
Required output:
You can do this by calling apply on your apptag column, along with a loc on your bucketing_df, in this manner:
def find_type(a):
    try:
        return (bucketing_df.loc[[x in a for x in bucketing_df['tagname']]])['value'].values[0]
    except:
        return ""

def find_comments(a):
    try:
        return (bucketing_df.loc[[x in a for x in bucketing_df['tagname']]])['tagname'].values[0]
    except:
        return ""
output_df['type'] = output_df['apptag'].apply(lambda a: find_type(a))
output_df['comments'] = output_df['apptag'].apply(lambda a:find_comments(a))
Here I had to make them separate functions so they could handle cases where no tagname exists in the apptag.
This gives you the following output_df:
apptag comments type
0 test123-pen pen study
1 test234-pencil pencil study
2 test234-rice rice grocery
All this code uses is the existing bucketing_df and output_df you provided at the end of your question.
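If it still needs to be faster, a more vectorized sketch (assuming the tagnames are plain substrings, hence the re.escape) is to build a single regex from all tagnames, extract the match, and map it to its value:

import re

pattern = "(" + "|".join(map(re.escape, bucketing_df['tagname'])) + ")"
# First matching tagname becomes the comment; its value becomes the type.
output_df['comments'] = output_df['apptag'].str.extract(pattern, expand=False).fillna("")
mapping = dict(zip(bucketing_df['tagname'], bucketing_df['value']))
output_df['type'] = output_df['comments'].map(mapping).fillna("")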
I am looking to automatically generate the following string in Python 2.7 using a loop based on the number of columns in a Pandas DataFrame:
INSERT INTO table_name (firstname, lastname) VALUES (534737, 100.115)
This assumes that the DataFrame has 2 columns.
Here is what I have:
# Generate test numbers for table:
df = pd.DataFrame(np.random.rand(5,2), columns=['firstname','lastname'])
# Create list of tuples from numbers in each row of DataFrame:
list_of_tuples = [tuple(x) for x in df.values]
Now, I create the string:
Manually - this works:
add_SQL = "INSERT INTO table_name (firstname, lastname) VALUES %s" % (list_of_tuples[4],)
In this example, I only used 2 column names - 'firstname' and 'lastname'. But I must do this with a loop since I have 156 column names - I cannot do this manually.
What I need:
I need to automatically generate the placeholder %s the same number of times as the number of columns in the Pandas DataFrame. Here, the DataFrame has 2 columns, so I need an automatic way to generate %s twice. Then I need to create a tuple with 2 entries, without the ''.
My attempt:
sss = ['%s' for x in range(0,len(list(df)))]
add_SQL = "INSERT INTO table_name (" + sss + ") VALUES %s" % (len(df), list_of_tuples[4])
But this is not working.
Is there a way for me to generate this string automatically?
Here is what I came up with - it is based on dwanderson's approach in the 2nd comment of the original post (question):
table_name = name_a #name of table
# Loop through all columns of dataframe and generate one string per column:
cols_n = df.columns.tolist()
placeholder = ",".join(["%s"]*df.shape[1]) #df.shape[1] gives # of columns
column_names = ",".join(cols_n)
insrt = "INSERT INTO %s " % table_name
for qrt in range(0, df.shape[0]):
    add_SQL_a_1 = insrt + "(" + column_names + ") VALUES (" + placeholder + ")"  # part 1/2
    add_SQL_a_2 = add_SQL_a_1 % list_of_tuples[qrt]  # part 2/2
This way, the final string is in part 2/2.
For some reason, it would not let me do this all in one line and I can't figure out why.
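The one-liner most likely fails because % binds more tightly than +, so it tries to format only the last string fragment; wrapping the whole concatenation in parentheses before applying % should work:

add_SQL = (insrt + "(" + column_names + ") VALUES (" + placeholder + ")") % list_of_tuples[qrt]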
Is there any way to range query rows with a composite row key when using random partitioning?
I'm working with column families created via CQL v3 like this:
CREATE TABLE products (
    rowkey CompositeType(UTF8Type,UTF8Type,UTF8Type,UTF8Type) PRIMARY KEY,
    prod_id varchar,
    class_id varchar,
    date varchar
);
The data in the table looks like this:
RowKey: 6:3:2:19
=> (column=class_id, value=254, timestamp=1346800102625002)
=> (column=date, value=2034, timestamp=1346800102625000)
=> (column=prod_id, value=1922, timestamp=1346800102625001)
-------------------
RowKey: 0:14:1:16
=> (column=class_id, value=144, timestamp=1346797896819002)
=> (column=date, value=234, timestamp=1346797896819000)
=> (column=prod_id, value=4322, timestamp=1346797896819001)
-------------------
I'm trying to find a way to range query over these composite row keys, analogous to how we slice query over composite columns. The following approach sometimes actually succeeds in returning something useful, depending on the start and stop key I choose:
Composite startKey = new Composite();
startKey.addComponent(0, "3", Composite.ComponentEquality.EQUAL);
startKey.addComponent(1, "3", Composite.ComponentEquality.EQUAL);
startKey.addComponent(2, "3", Composite.ComponentEquality.EQUAL);
startKey.addComponent(3, "3", Composite.ComponentEquality.EQUAL);
Composite stopKey = new Composite();
stopKey.addComponent(0, "6", Composite.ComponentEquality.EQUAL);
stopKey.addComponent(1, "6", Composite.ComponentEquality.EQUAL);
stopKey.addComponent(2, "6", Composite.ComponentEquality.EQUAL);
stopKey.addComponent(3, "6" , Composite.ComponentEquality.GREATER_THAN_EQUAL);
RangeSlicesQuery<Composite, String, String> rangeSlicesQuery =
HFactory.createRangeSlicesQuery(keyspace, CompositeSerializer.get(), StringSerializer.get(), StringSerializer.get());
rangeSlicesQuery.setColumnFamily(columnFamilyName);
rangeSlicesQuery.setKeys(startKey,stopKey);
rangeSlicesQuery.setRange("", "", false, 3);
Most of the time the database returns this:
InvalidRequestException(why:start key's md5 sorts after end key's md5.
this is not allowed; you probably should not specify end key at all,
under RandomPartitioner)
Does somebody have an idea whether something like this can be achieved WITHOUT using the order-preserving partitioner? Do I have to build a custom row key index for this use case?
Thanks a lot!
Additional information:
What I'm trying to do is store sales transaction data in a table which uses both composite row keys to encode date/time/place and composite columns to store information about the sold items.
The set of items per transaction varies in size and includes information about size, color and quantity of every item:
{ ... items :
[ { item_id : 43523 , size : 050 , color : 123 , qty : 1 } ,
{ item_id : 64233 , size : 048 , color : 834 , qty : 1 } ,
{ item_id : 23984 , size : 000 , color : 341 , qty : 3 } ,
… ] }
There’s also information about where and when the transaction happened including a unique transaction id:
{ trx_id : 23324827346, store_id : 8934 , date : 20110303 , time : 0947 , …
My initial approach was putting every item in a separate row and let the application group items back together by transaction id. That’s working fine. But now I’m trying to leverage the structuring capabilities of composite columns to persist the nested item data within a representation (per item) like this:
item_id:’size’ = <value> ; item_id:’color’ = <value> ; item_id:’qty’ = <value> ; …
43523:size = 050 ; 43523:color = 123 ; 43523:qty = 1 ; …
The rest of the data would be encoded in a composite row key like this:
date : time : store_id : trx_id
20110303 : 0947 : 001 : 23324827346
I need to be able to run queries like: All items which were sold between the dates 20110301 and 20110310 between times 1200 and 1400 in stores 25 - 50. What I achieved so far with composite columns was using one wide row per store and putting all the rest of the data into 3 different composite columns per item:
date:time:<type>:prod_id:transaction_id = <value> ; …
20110303:0947:size:43523:23324827346 = 050 ;
20110303:0947:color:43523:23324827346 = 123 ;
20110303:0947:qty:43523:23324827346 = 1 ;
It’s working, but it doesn’t really look highly efficient.
Is there any other alternative?
You're creating one row per partition, so it should be clear that RandomPartitioner will not give you ordered range queries.
You can do ordered ranges within a partition, which is very common, e.g. http://rubyscale.com/blog/2011/03/06/basic-time-series-with-cassandra/ and http://www.datastax.com/dev/blog/advanced-time-series-with-cassandra
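As a rough sketch of that pattern (the Python driver is used purely for illustration; the table and column names are assumptions based on the data above), store_id could be the partition key with date/time as clustering columns, so a range query on date stays inside one ordered partition per store (one query per store when querying a range of stores):

from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("sales")

# One partition per store; rows inside it are ordered by (date, time, trx_id, item_id).
session.execute("""
    CREATE TABLE IF NOT EXISTS transactions (
        store_id int,
        date text,
        time text,
        trx_id bigint,
        item_id int,
        size text,
        color text,
        qty int,
        PRIMARY KEY (store_id, date, time, trx_id, item_id)
    )
""")

# Ordered range query within the partition for store 25.
rows = session.execute(
    "SELECT * FROM transactions WHERE store_id = %s AND date >= %s AND date <= %s",
    (25, "20110301", "20110310"))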