I just did:
len(my_df.drop_duplicates())
Is there not a more elegant way to do this?
In R you can do:
nrow(distinct(my_df))
which to me is very readable. drop_duplicates() feels worrying because, as a new Python user, I get lost over which operations happen in place and which ones you need to store/overwrite a copy of for the change to persist.
The fact that searching on Google didn't give me a clear one-click answer for what I'd have thought was a simple function worried me a bit...
Thanks!
In pandas you can do it another way, with groupby or with duplicated plus sum:
df.groupby(list(df)).ngroups
(~df.duplicated()).sum()
Also, as an R and Python user, I know it is hard to switch from R to pandas, but the most common way is drop_duplicates.
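For instance, a quick sanity check on a tiny made-up frame (the data here is purely illustrative); all three expressions return the same distinct-row count:

import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2], "b": ["x", "x", "y"]})

print(len(df.drop_duplicates()))     # 2
print(df.groupby(list(df)).ngroups)  # 2 distinct groups = 2 distinct rows
print((~df.duplicated()).sum())      # 2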
len(pd.unique(my_df["some_column"]))
You are looking for unique, I guess. Note that pd.unique expects one-dimensional input, so this counts the distinct values of a single column rather than distinct rows.
I am using the code below to compare 2 columns in a data frame. I don't want to do it in pandas. Can someone help with how to compare them using Spark data frames?
from pyspark.sql.functions import col
from numpy.testing import assert_array_almost_equal

df1 = context.spark.read.option("header", True).csv("./test/input/test/Book1.csv")
df1 = df1.withColumn("Curated", dataclean.clean_email(col("email")))
df1.show()
# assert_array_almost_equal works on numpy/pandas data, not on Spark DataFrame columns:
assert_array_almost_equal(df1['expected'], df1['Curated'], verbose=True)
You can either do it through:
the pyspark-test library, which is inspired by the pandas testing module and built for Spark, as in this documentation, or
exceptAll, as in this documentation. Once used, you then have to check whether the count is greater than zero; if it is, the tables are not the same.
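For example, a rough sketch of the exceptAll route applied to the two columns from the question (the column names expected and Curated come from the question; aliasing both to a common name is my assumption):

from pyspark.sql.functions import col

diff = (
    df1.select(col("expected").alias("value"))
       .exceptAll(df1.select(col("Curated").alias("value")))
)
assert diff.count() == 0, "the two columns differ"

Note that exceptAll compares the two columns as multisets of values, not row by row.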
Good luck!
One efficient way would be to try to identify the first difference as soon as possible. One way to achieve that is via left-anti joins:
assert df1.join(df1, df1['expected'] == df1['Curated'], "leftanti").first() is None
I am new to pyspark.
I need to analyze big CSV files.
At the beginning of the analysis I had to sort the data by ID and TIME.
I tried to use Dask to do it, but I realized that it gives wrong answers and also often gets stuck in the middle. So Dask is not good at sorting values, as mentioned in the link below, apparently because it works in a parallel way.
https://docs.dask.org/en/stable/generated/dask.dataframe.DataFrame.sort_values.html
My question is: how does pyspark handle this issue?
Does it give good and reliable results?
If the answer is yes, I would like to know how Spark sorts data in a parallel way and why it is difficult for Dask.
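For reference, a global sort in PySpark looks something like the minimal sketch below (the column names ID and TIME come from the question; the file path and everything else is assumed):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# placeholder path; the real input is the big CSV files mentioned above
df = spark.read.option("header", True).csv("big_file.csv")

# orderBy performs a global sort: Spark samples the data, range-partitions it by the
# sort keys, and sorts each partition, so the output is ordered by ID, then TIME.
df_sorted = df.orderBy("ID", "TIME")
df_sorted.show()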
In Spark, is there a way of adding a column to a DataFrame by means of a join, but in a way that guarantees that the left hand side remains completely unchanged?
This is what I have looked at so far:
leftOuterJoin... but that risks duplicating rows, so one would have to be super-careful to make sure that there are no duplicate keys on the right. Not exactly robust or performant, if the only way to guarantee safety is to dedupe before the join (see the sketch after this question).
There is a data structure that seems to guarantee no duplicate keys: PairRDD. That has a nice method for looking up a key in the key-value table: YYY.lookup("key"). Thus one might expect to be able to do .withColumn("newcolumn", udf((key:String) => YYY.lookup(key)).apply(keyColumn)), but it seems that UDFs cannot do this, because they apparently cannot access the sqlContext, which is needed for the lookup. If there were a way of using withColumn I would be extremely happy, because it has the right semantics.
Many thanks in advance!
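To make the dedupe-then-join idea from the first option above concrete, here is a rough sketch (written in PySpark for brevity; the DataFrame names left_df/right_df and the column name key are illustrative, not from the question):

# Deduplicate the right side on the join key first, so the left row count cannot grow.
right_dedup = right_df.dropDuplicates(["key"])

result = left_df.join(right_dedup, on="key", how="left")

# Sanity check: the left-hand side row count is preserved.
assert result.count() == left_df.count()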
I have data keyed by Data.Time.Calendar.Day and need to look it up efficiently. Some dates are missing; when I try to look up a missing key, I want to get the data attached to the closest existing key, somewhat like std::map::lower_bound.
Any suggestions for existing libraries that can do this? I searched around for a while and only found maps supporting exact key lookups.
Thanks.
Did you check Data.Map.Lazy? In particular, I guess you could use the functions lookupLE and lookupGT, or similar. The complexity of these functions is O(log n), and similar functions exist in Data.Map.Strict.
A suitable combination of Data.Map's splitLookup and findMin/findMax will do the trick.
OK, I have tried searching the web for different modules to use, but it didn't make me any wiser. There were so many different alternatives, and I couldn't find any good discussion on which was the best.
I need a hashmap with a 5-digit decimal number as key and arrays as values.
I also need to iterate efficiently through its keys a couple of times per second.
Anyone have a good recommendation of what I should use?
Thank you!
//Svennisen
I realised I do not need a module at all.
In JavaScript, an array indexed with a string key behaves like a map with key/value pairs instead of index/value pairs.
So I just convert my number to a string and use an ordinary array.