How to use dictionary values in dynamic spark sql query - python-3.x

I am new to Python; please help me with the problem below.
I have a dictionary as below:
city = {"AP":"VIZAG","TELANGANA":"HYDERABAD"}
and I also have a list which I need to loop over for all the state tables:
states = ['AP', 'HYDERABAD']
for st in states:
    df = spark.sql(f"""select * from {st} where city = {city}["{st}"]""")
In the above df I am trying to filter on city based on the dictionary value for each state, but I am not able to make it work.

New answer
By combining two filter conditions you can do the expected filtering.
import pyspark.sql.functions as F

selected_city = 'AP'
df = df.filter(
    (F.col('city') == selected_city)
    & (F.col('state') == cities[selected_city])
)
Old answer
It is a simple change: You can use isin to filter a column based on a list [Docs].
cities = list(city.keys())
df = df.filter(F.col('city').isin(cities))
If you want to construct more complex conditions based on a dictionary see this question.
[Edit] Updated the answer based on the OP's comment. I will leave the old one in for completeness.
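For the dynamic SQL in the original question, here is a minimal sketch, not part of either answer above, that builds the query string directly from the dictionary (it assumes a live spark session and that tables named AP and TELANGANA exist; the city value is quoted so it becomes a SQL string literal):
city = {"AP": "VIZAG", "TELANGANA": "HYDERABAD"}
dfs = {}
for st, ct in city.items():
    # st is the state table name, ct is the city value to filter on
    dfs[st] = spark.sql(f"select * from {st} where city = '{ct}'")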

Related

aggregation and indexing based on specific column in pandas

I have a CSV file of world-happiness data by country. In that data file, different scores related to happiness are calculated based on some specific criteria. I want to find the worst and best countries for each of these criteria (characteristics). My solution is given below:
import pandas as pd

happiness_df = pd.read_csv('Datasets/happiness_2017.csv')
happiness_data_by_country = {}
for column in happiness_df.describe().columns:
    if column != 'Rank':
        max_val = happiness_df.describe().loc['max', column]
        min_val = happiness_df.describe().loc['min', column]
        country_with_max = happiness_df.loc[happiness_df[column] == max_val, 'Country'].values[0]
        country_with_min = happiness_df.loc[happiness_df[column] == min_val, 'Country'].values[0]
        happiness_data_by_country[column] = {
            "worst": country_with_min,
            "best": country_with_max
        }
Is there any better way of doing this in pandas?
Yes, for the maximum value you can try:
df.loc[df['HappinessScore'].idxmax()]
And for the minimum: df.loc[df['HappinessScore'].idxmin()].
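For example, a rough sketch of building the same best/worst mapping with idxmax/idxmin in one pass (this assumes the numeric score columns and the 'Country' column from the question):
summary = {}
for col in happiness_df.select_dtypes('number').columns:
    if col == 'Rank':
        continue
    # idxmax/idxmin return the row labels of the extreme values directly
    summary[col] = {
        'best': happiness_df.loc[happiness_df[col].idxmax(), 'Country'],
        'worst': happiness_df.loc[happiness_df[col].idxmin(), 'Country'],
    }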

How can I grab the columns of many tables efficiently in Spark?

I want to find all columns in some Hive tables that meet a certain criteria. However, the code I've written to do this is very slow, since Spark isn't a particularly big fan of looping:
matches = {}
for table in table_list:
    matching_cols = [c for c in spark.read.table(table).columns if substring in c]
    if matching_cols:
        matches[table] = matching_cols
I want something like:
matches = {'table1': ['column1', 'column2'], 'table2': ['column2']}
How can I more efficiently achieve the same result?
A colleague just figured it out. This is the revised solution:
from pyspark.sql.functions import col

matches = {}
for table in table_list:
    matching_cols = spark.sql("describe {}".format(table)) \
        .where(col('col_name').rlike(substring)) \
        .collect()
    if matching_cols:
        matches[table] = [c.col_name for c in matching_cols]
The key difference here is that Spark seems to have been caching partition information in my prior example, which is why it got more and more bogged down with each loop iteration. Querying the metadata to scrape the column names, rather than reading the table itself, bypasses that issue.
If the table's fields have comments, the code above will run into issues because of the extra info (the comment); as a side note, HBase-linked tables will be an issue too.
Example:
create TABLE deck_test (
    COLOR string COMMENT 'COLOR Address',
    SUIT string COMMENT '4 type Suits',
    PIP string)
ROW FORMAT DELIMITED FIELDS TERMINATED by '|'
STORED AS TEXTFILE;

describe deck_test;
color   string   COLOR Address
suit    string   4 type Suits
pip     string
To handle the comments issue, a small change may help:
matches = {}
for table in table_list:
    matching_cols = spark.sql("show columns in {}".format(table)) \
        .where(col('result').rlike(substring)) \
        .collect()
    if matching_cols:
        matches[table] = [c.col_name for c in matching_cols]
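Another option, not mentioned in the answers above, is the catalog API, which reads column metadata from the metastore without touching the tables at all. A minimal sketch, assuming table_list and substring are defined as in the question:
matches = {}
for table in table_list:
    # spark.catalog.listColumns only queries the metastore, not the data
    cols = [c.name for c in spark.catalog.listColumns(table) if substring in c.name]
    if cols:
        matches[table] = cols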

GroupBy of a dataframe by comparison against a set of items

So I have a dataframe of movies with about 10K rows. It has a column that captures each movie's genre as a comma-separated string. Since a movie can be classified under multiple genres, I needed to create a set of genres containing all possible genres across the 10K rows. I went about it as follows:
simplist = []
for i in df.genres.values:
    vlist = i.split(', ')
    for item in vlist:
        simplist.append(item)
gset = set(simplist)
This gets me a set of 24 genres from the 27K items in simplist, which is awesome. But here's the pinch:
I want to group by genre, comparing genres against the set, and then do aggregation and other operations, AND
I want the output to be 24 distinct groups, such that if a movie has more than one of the genres in the set it shows up in each of those groups (this removes sorting or tagging bias from the data-gathering phase).
Is groupby even the right way to go about this?
Thanks for your input/thoughts/options/approach in advance.
OK, so I made some headway but am still unable to put the puzzle pieces together.
I started off by making a list and a set (I don't know which I will end up using) of unique values:
simplist = []
for i in df.genres.values:
    vlist = i.split(', ')
    for item in vlist:
        simplist.append(item)
gset = set(simplist)
g_list = list(gset)
Then, separately, I use df.pivot_table to structure the analysis:
table7 = df.pivot_table(index=['release_year'], values=['runtime'],
                        aggfunc={'runtime': [np.median], 'popularity': [np.mean]},
                        fill_value=0, dropna=True)
But here's the thing: it would be awesome if I could index by g_list or check 'genres' against the gset of 24 distinct items, but df.pivot_table does not support that. Leaving the index at genres creates ~2000 rows and is not meaningful.
Got it!! I want to thank a bunch of offline folks and Pythonistas who pointed me in the right direction. It turns out I'd been spinning my wheels with sets and lists when a single pandas command (well, three to be precise) does the trick!!
df2 = pd.DataFrame(df.genres.str.split(', ').tolist(),
                   index=[df.col1, df.col2, df.coln]).stack()
df2 = df2.reset_index()[[0, 'col1', 'col2', 'coln']]
df2.columns = ['Genre', 'col1', 'col2', 'coln']
This creates a second dataframe (df2) that keeps the key analysis columns from the original dataframe, with each row duplicated/attributed to each of its genres. You see the true value of this when you turn around and do something like:
revenue_table = df2.pivot_table(index=['Release Year', 'Genre'], values=['Profit'],
                                aggfunc={'Profit': np.sum}, fill_value=0, dropna=True)
or anything to similar effect.
Closing this but would appreciate any notes on more efficient ways to do this.
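As a side note not from the original thread: on pandas 0.25+ the same reshaping can be done with explode, which avoids the manual stack/reset_index dance. A rough sketch, reusing the genres, release_year and Profit columns mentioned above:
# Split the comma-separated genres, then give each movie one row per genre.
df2 = df.assign(Genre=df['genres'].str.split(', ')).explode('Genre')
revenue_by_genre = df2.groupby(['release_year', 'Genre'])['Profit'].sum()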

Creating a pandas column conditional to another columns values based on a dictionary

Good day,
I'm currently trying to tag a lot of job ads based on the job title, using Python 3.x and pandas. As every company uses different terminology for its jobs, I want to cluster them in a sensible way.
Currently I have a dataframe containing 40,000+ job ads and use the following code to tag the jobs in a new pandas column:
dictionary = {
    'c.*dev': 'c developer',
    'web.*des': 'web designer',
    'onl.*mark': 'online marketer',
    ...
}

for key in dictionary:
    df.loc[(df['Job'].str.contains(key) == True), ['Clustered Jobs']] = dictionary[key]
As the dictionary and the database are growing constantly, I wanted to ask whether there is a more elegant and efficient way to do this.
Thanks for your help in advance.
Your solution can be simplified a bit:
for key in dictionary:
    df.loc[df['Job'].str.contains(key), 'Clustered Jobs'] = dictionary[key]
Or:
for k, v in dictionary.items():
    df.loc[df['Job'].str.contains(k), 'Clustered Jobs'] = v
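As a vectorized alternative that is not part of the answer above, numpy's select can assign all the tags in one pass; note that it keeps the first matching pattern, whereas the loop keeps the last. A rough sketch using the same dictionary and 'Job' column:
import numpy as np

conditions = [df['Job'].str.contains(pat, na=False) for pat in dictionary]
labels = list(dictionary.values())
# Rows that match no pattern are left as None.
df['Clustered Jobs'] = np.select(conditions, labels, default=None)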

How to loop through each row of dataFrame in pyspark

E.g.:
sqlContext = SQLContext(sc)
sample=sqlContext.sql("select Name ,age ,city from user")
sample.show()
The above statement prints the entire table to the terminal. But I want to access each row in that table using for or while to perform further calculations.
You simply cannot. DataFrames, like other distributed data structures, are not iterable and can only be accessed using dedicated higher-order functions and/or SQL methods.
You can of course collect:
for row in df.rdd.collect():
    do_something(row)
or convert to a local iterator:
for row in df.rdd.toLocalIterator():
    do_something(row)
and iterate locally as shown above, but it defeats the whole purpose of using Spark.
To "loop" and take advantage of Spark's parallel computation framework, you could define a custom function and use map.
def customFunction(row):
    return (row.name, row.age, row.city)

sample2 = sample.rdd.map(customFunction)
or
sample2 = sample.rdd.map(lambda x: (x.name, x.age, x.city))
The custom function would then be applied to every row of the dataframe. Note that sample2 will be an RDD, not a dataframe.
map may be needed if you are going to perform more complex computations. If you just need to add a simple derived column, you can use withColumn, which returns a dataframe.
sample3 = sample.withColumn('age2', sample.age + 2)
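Another option, not covered in the answer above, is DataFrame.foreach, which applies a function to each row on the executors; it returns nothing, so it only makes sense for side effects such as writing to an external system. A rough sketch against the sample DataFrame:
def handle_row(row):
    # Hypothetical per-row side effect; print output lands in the executor logs.
    print(row.name, row.age, row.city)

sample.foreach(handle_row)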
Using a list comprehension in Python, you can collect an entire column of values into a list using just two lines:
df = sqlContext.sql("show tables in default")
tableList = [x["tableName"] for x in df.rdd.collect()]
In the above example we return a list of the tables in the database 'default', but the same can be adapted by replacing the query used in sql().
Or, more concisely:
tableList = [x["tableName"] for x in sqlContext.sql("show tables in default").rdd.collect()]
And for your example of three columns, we can create a list of dictionaries, and then iterate through them in a for loop.
sql_text = "select name, age, city from user"
tupleList = [{name:x["name"], age:x["age"], city:x["city"]}
for x in sqlContext.sql(sql_text).rdd.collect()]
for row in tupleList:
print("{} is a {} year old from {}".format(
row["name"],
row["age"],
row["city"]))
It might not be the best practice, but you can simply target a specific column using collect(), export it as a list of Rows, and loop through the list.
Assume this is your df:
+----------+----------+-------------------+-----------+-----------+------------------+
| Date| New_Date| New_Timestamp|date_sub_10|date_add_10|time_diff_from_now|
+----------+----------+-------------------+-----------+-----------+------------------+
|2020-09-23|2020-09-23|2020-09-23 00:00:00| 2020-09-13| 2020-10-03| 51148 |
|2020-09-24|2020-09-24|2020-09-24 00:00:00| 2020-09-14| 2020-10-04| -35252 |
|2020-01-25|2020-01-25|2020-01-25 00:00:00| 2020-01-15| 2020-02-04| 20963548 |
|2020-01-11|2020-01-11|2020-01-11 00:00:00| 2020-01-01| 2020-01-21| 22173148 |
+----------+----------+-------------------+-----------+-----------+------------------+
To loop through the rows in the Date column:
rows = df.select('Date').collect()
final_list = []
for i in rows:
    final_list.append(i[0])
print(final_list)
Give this a try:
result = spark.createDataFrame([('SpeciesId', 'int'), ('SpeciesName', 'string')],
                               ["col_name", "data_type"])
for f in result.collect():
    print(f.col_name)
If you want to do something to each row in a DataFrame object, use map on its underlying RDD. This will allow you to perform further calculations on each row. It's the equivalent of looping across the entire dataset from 0 to len(dataset) - 1.
Note that this will return a PipelinedRDD, not a DataFrame.
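For example, a minimal sketch (assuming the sample DataFrame from the question) that maps each row and then converts the result back into a DataFrame:
# Map on the underlying RDD, then turn the tuples back into a DataFrame.
sample2 = sample.rdd.map(lambda x: (x.name, x.age + 1, x.city))
sample2_df = sample2.toDF(["name", "age_plus_one", "city"])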
In the answer above,
tupleList = [{name: x["name"], age: x["age"], city: x["city"]}
should be
tupleList = [{'name': x["name"], 'age': x["age"], 'city': x["city"]}
since name, age, and city are not variables but simply keys of the dictionary.
