What is the difference between:
my_df = my_df.select(col('age').alias('age2'))
and
my_df = my_df.select(col('age').withColumnRenamed('age', 'age2'))
The second expression is not going to work, you need to call withColumnRenamed() on your dataframe.
I assume you mean:
my_df = my_df.withColumnRenamed('age', 'age2')
And to answer your question, there is no difference.
Related
I am dealing with a scenario in which I need to write a case sensitive join condition. For that, I found there is a spark config property spark.sql.caseSensitive that can be altered. However, there is no impact on the final result set if I set this property to True or False.
In both ways, I am not getting results for language=java from the below sample PySpark code. Can anyone please help with how to handle this scenario?
spark.conf.set("spark.sql.caseSensitive", False)
columns1 = ["language","users_count"]
data1 = [("Java", "20000"), ("Python", "100000"), ("Scala", "3000")]
columns2 = ["language","note"]
data2 = [("java", "JVM based"), ("Python", "Indentation is imp"), ("Scala", "Derived from Java")]
df1 = spark.createDataFrame(data1, columns1)
df2 = spark.createDataFrame(data2, columns2)
#df1.createOrReplaceTempView("df1")
#df2.createOrReplaceTempView("df2")
df = df1.join(df2, on="language", how="inner")
display(df)
My understanding of spark.sql.caseSensitive is that it affects SQL, not the data.
As for your join itself, if you do not want to lowercase or uppercase your data, which I can understand why, you can create a key column, which is the lowercase version of the value you want to join on. If you are having more complex situation, your key column could even become a md5() of one/more columns. Make sure everything stays lowercase/uppercase though to make the comparison works.
I want to map through the rows of df1 and compare those with the values of df2 , by month and day, across every year in df2,leaving only the values in df1 which are larger than those in df2, to add into a new column, 'New'. df1 and df2 are of the same size, and are indexed by 'Month' and 'Day'. what would be the best way to do this?
df1=pd.DataFrame({'Date':['2015-01-01','2015-01-02','2015-01-03','2015-01-``04','2005-01-05'],'Values':[-5.6,-5.6,0,3.9,9.4]})
df1.Date=pd.to_datetime(df1.Date)
df1['Day']=pd.DatetimeIndex(df1['Date']).day
df1['Month']=pd.DatetimeIndex(df1['Date']).month
df1.set_index(['Month','Day'],inplace=True)
df1
df2 = pd.DataFrame({'Date':['2005-01-01','2005-01-02','2005-01-03','2005-01-``04','2005-01-05'],'Values':[-13.3,-12.2,6.7,8.8,15.5]})
df2.Date=pd.to_datetime(df1.Date)
df2['Day']=pd.DatetimeIndex(df2['Date']).day
df2['Month']=pd.DatetimeIndex(df2['Date']).month
df2.set_index(['Month','Day'],inplace=True)
df2
df1 and df2
df2['New']=df2[df2['Values']<df1['Values']]
gives
ValueError: Can only compare identically-labeled Series objects
I have also tried
df2['New']=df2[df2['Values'].apply(lambda x: x < df1['Values'].values)]
The best way to handle your problem is by using numpy as a tool. Numpy has an attribute called "where"that helps a lot in cases like this.
This is how the sentence works:
df1['new column that will contain the comparison results'] = np.where(condition,'value if true','value if false').
First import the library:
import numpy as np
Using the condition provided by you:
df2['New'] = np.where(df2['Values'] > df1['Values'], df2['Values'],'')
So, I think that solves your problem... You can change the value passed to the False condition to every thin you want, this is only an example.
Tell us if it worked!
Let´s try two possible solutions:
The first solution is to sort the index first.
df1.sort_index(inplace=True)
df2.sort_index(inplace=True)
Perform a simple test to see if it works!
df1 == df2
it is possible to raise some kind of error, so if that happens, try this correction instead:
df1.sort_index(inplace=True, axis=1)
df2.sort_index(inplace=True, axis=1)
The second solution is to drop the indexes and reset it:
df1.sort_index(inplace=True)
df2.sort_index(inplace=True)
Perform a simple test to see if it works!
df1 == df2
See if it works and tell us the result.
I'm trying to use SQLContext.subtract() in Spark 1.6.1 to remove rows from a dataframe based on a column from another dataframe. Let's use an example:
from pyspark.sql import Row
df1 = sqlContext.createDataFrame([
Row(name='Alice', age=2),
Row(name='Bob', age=1),
]).alias('df1')
df2 = sqlContext.createDataFrame([
Row(name='Bob'),
])
df1_with_df2 = df1.join(df2, 'name').select('df1.*')
df1_without_df2 = df1.subtract(df1_with_df2)
Since I want all rows from df1 which don't include name='Bob' I expect Row(age=2, name='Alice'). But I also retrieve Bob:
print(df1_without_df2.collect())
# [Row(age='1', name='Bob'), Row(age='2', name='Alice')]
After various experiments to get down to this MCVE, I found out that the issue is with the age key. If I omit it:
df1_noage = sqlContext.createDataFrame([
Row(name='Alice'),
Row(name='Bob'),
]).alias('df1_noage')
df1_noage_with_df2 = df1_noage.join(df2, 'name').select('df1_noage.*')
df1_noage_without_df2 = df1_noage.subtract(df1_noage_with_df2)
print(df1_noage_without_df2.collect())
# [Row(name='Alice')]
Then I only get Alice as expected. The weirdest observation I made is that it's possible to add keys, as long as they're after (in the lexicographical order sense) the key I use in the join:
df1_zage = sqlContext.createDataFrame([
Row(zage=2, name='Alice'),
Row(zage=1, name='Bob'),
]).alias('df1_zage')
df1_zage_with_df2 = df1_zage.join(df2, 'name').select('df1_zage.*')
df1_zage_without_df2 = df1_zage.subtract(df1_zage_with_df2)
print(df1_zage_without_df2.collect())
# [Row(name='Alice', zage=2)]
I correctly get Alice (with her zage)! In my real examples, I'm interested in all columns, not only the ones that are after name.
Well there are some bugs here (the first issue looks like related to to the same problem as SPARK-6231) and JIRA looks like a good idea, but SUBTRACT / EXCEPT is no the right choice for partial matches.
Instead, as of Spark 2.0, you can use anti-join:
df1.join(df1_with_df2, ["name"], "leftanti").show()
In 1.6 you can do pretty much the same thing with standard outer join:
import pyspark.sql.functions as F
ref = df1_with_df2.select("name").alias("ref")
(df1
.join(ref, ref.name == df1.name, "leftouter")
.filter(F.isnull("ref.name"))
.drop(F.col("ref.name")))
I am trying to get the difference between each element after reading multiple csv files. Each csv file has 13 rows and 128 columns. I am trying to get the column-wise difference
I read the files using
data = [pd.read_csv(f, index_col=None, header=None) for f in _temp]
I get a list of all samples.
According to this I have to use .diff() to get the difference. Which goes something like this
data.diff()
This works but instead of getting the difference between each row in the same sample, I get the difference between each row of one sample to another sample.
Is there a way to separate this and let the difference happen within each sample?
Edit
Ok I am able to get the difference between the data elements by doing this
_local = pd.DataFrame(data)
_list = []
_a = _local.index
for _aa in _a:
_list.append(_local[0][_aa].diff())
flow = pd.DataFrame(_list, index=_a)
I am creating too many DataFrames, is there a better way to do this?
Here is a relatively efficient way to read you dataframes one at a time and calculate their differences which are stored in a list df_diff.
df_diff = []
df_old = pd.read_csv(_temp[0], index_col=None)
for f in _temp[1:]:
df = pd.read_csv(f, index_col=None)
df_diff.append(df_old - df)
df_old = df
Since your code work you should real post on https://codereview.stackexchange.com/
(PS. The leading "_" is not really pythonic. pls avoid. It makes your code harder to read. )
_local = pd.DataFrame(data)
_list = [ _local[0][_aa].diff() for _aa in _local.index ]
flow = pd.DataFrame(_list, index=_local.index )
I can't find a way to take just a part on rdd. take seems promising but it returns a list instead of rdd. I of course can then convert it to an rdd, but this seems wasteful and ugly.
my_rdd = sc.textFile("my_file.csv")
part_of_my_rdd = sc.parallelize(my_rdd.take(10000))
I there a better way to do this?
Yes, indeed there is a better way. You can use the sample method from RDDs, it states:
sample(withReplacement, fraction, seed=None)
Return a sampled subset of this RDD.
quantity = 10000
my_rdd = sc.textFile("my_file.csv")
part_of_my_rdd = my_rdd.sample(False, quantity / my_rdd.count())
#Akavall , it is a good idea. but the format have some change.
my_rdd = sc.textFile("my_file.csv")
part_of_my_rdd = sc.parallelize(my_rdd.take(10000)).map(x=>x.slice(1, x.length-1))
remove the brackets is OK!