I am trying to find a way to name the columns of a DataFrame using strings that come from Excel or from scraping the web.
So how do I transform the string "colname" into the column name colname below?
df = DataFrame(colname = [1, 2])
I tried
df = DataFrame(symbol("colname") = [1, 2])
or
df = DataFrame([1, 2], [symbol("colname")])
and many other combinations, but no success.
I see questions related to deleting columns based on string column names but no question/answer for naming columns in the first place.
Maybe you can try something like this, in two steps, using the names! function.
using DataFrames
newname = ["colname1", "colname2"]
df = DataFrame(v1 = [1, 2], v2 = [3, 4])
names!(df.colindex, map(parse, newname))
df
# 2x2 DataFrames.DataFrame
# | Row | colname1 | colname2 |
# |-----|----------|----------|
# | 1   | 1        | 3        |
# | 2   | 2        | 4        |
Here are the versions of Julia and DataFrames.jl I used:
versioninfo()
# Julia Version 0.4.0-dev+6991
# Commit 811a977 (2015-08-26 04:02 UTC)
# Platform Info:
# System: Linux (x86_64-unknown-linux-gnu)
# CPU: Intel(R) Core(TM) i7-3520M CPU @ 2.90GHz
# WORD_SIZE: 64
# BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Sandybridge)
# LAPACK: libopenblas
# LIBM: libopenlibm
# LLVM: libLLVM-svn
Pkg.installed("DataFrames")
# v"0.6.9"
My table loaded in PySpark has a column "Date" with the following type of data:
Date      Open   High   Low    Close  Volume    Adj Close
1/3/2012  59.97  61.06  59.87  60.33  12668800  52.61923
1/4/2012  60.21  60.35  59.47  59.71  9593300   52.07848
1/5/2012  59.35  59.62  58.37  59.42  12768200  51.82554
1/6/2012  59.42  59.45  58.87  59     8069400   51.45922
How do I calculate the difference, in days, between the max and the min of the column (so in the example above, I need the difference in days between 1/6/2012 and 1/3/2012)?
Test data:
from pyspark.sql import functions as F
df = spark.createDataFrame([('2012-01-03',),('2013-02-03',),('2011-11-29',)], ['Date']).select(F.col('Date').cast('date'))
df.show()
# +----------+
# | Date|
# +----------+
# |2012-01-03|
# |2013-02-03|
# |2011-11-29|
# +----------+
This will create a new dataframe containing the difference in days:
df_diff = df.groupBy().agg(F.datediff(F.max('Date'), F.min('Date')).alias('diff'))
df_diff.show()
# +----+
# |diff|
# +----+
# | 432|
# +----+
If you need the difference in a variable:
v = df_diff.head().diff
print(v)
# 432
And this will add a new column to your existing df:
df = df.withColumn('diff', F.expr('datediff(max(Date) over(), min(Date) over())'))
df.show()
# +----------+----+
# | Date|diff|
# +----------+----+
# |2012-01-03| 432|
# |2013-02-03| 432|
# |2011-11-29| 432|
# +----------+----+
Supposing your dataframe df has only the column Date in date format, you can do the following:
from pyspark.sql import Window, functions as F

w = Window.partitionBy()  # one window spanning the whole dataframe
(df.withColumn('Max_Date', F.max(F.col('Date')).over(w))
   .withColumn('Min_Date', F.min(F.col('Date')).over(w))
   .withColumn('Diff_days', F.datediff(F.col('Max_Date'), F.col('Min_Date')))
   .drop('Date').dropDuplicates())
At this link you can find more examples of SQL functions for PySpark:
https://sparkbyexamples.com/spark/spark-sql-functions/
I have a CSV file whose content is shown below, and I want to calculate the cosine similarity between each ID and the remaining IDs in the CSV file.
I have loaded it into a pandas dataframe as follows:
old_df['Vector'] = old_df.apply(lambda row:
    np.array(np.matrix(row.Vector)).ravel(), axis=1)
l = []
for a in old_df['Vector']:
    l.append(a)
A = np.array(l)
similarities = cosine_similarity(A)
The output looks fine. However, I do not know how to find which GUID (or ID) is similar to which other GUID (or ID), and I only want to get the top k pairs with the highest similarity scores.
Could you please help me solve this issue?
Thank you.
|Index | GUID | Vector |
|-------|-------|---------------------------------------|
|36099 | b770 |[-0.04870541 -0.02133574 0.03180726] |
|36098 | 808f |[ 0.0732905 -0.05331331 0.06378368] |
|36097 | b111 |[ 0.01994788 0.00417582 -0.09615131] |
|36096 | b6b5 |[0.025697 -0.08277534 -0.0124591] |
|36083 | 9b07 |[ 0.025697 -0.08277534 -0.0124591] |
|36082 | b9ed |[-0.00952298 0.06188576 -0.02636449] |
|36081 | a5b6 |[0.00432161 0.02264584 -0.0341924] |
|36080 | 9891 |[ 0.08732156 0.00649456 -0.02014138] |
|36079 | ba40 |[0.05407356 -0.09085554 -0.07671648] |
|36078 | 9dff |[-0.09859556 0.04498474 -0.01839088] |
|36077 | a423 |[-0.06124249 0.06774347 -0.05234318] |
|36076 | 81c4 |[0.07278682 -0.10460124 -0.06572364] |
|36075 | 9f88 |[0.09830415 0.05489364 -0.03916228] |
|36074 | adb8 |[0.03149953 -0.00486591 0.01380711] |
|36073 | 9765 |[0.00673934 0.0513557 -0.09584251] |
|36072 | aff4 |[-0.00097896 0.0022945 0.01643319] |
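A side note on the parsing step in the question: if each Vector cell really is a bracketed, space-separated string like the rows above (an assumption based on the sample), it can be converted without going through np.matrix. A minimal sketch:
import numpy as np
# Assumes each cell looks like "[-0.04870541 -0.02133574 0.03180726]"
def parse_vector(s):
    return np.array(s.strip('[]').split(), dtype=float)
old_df['Vector'] = old_df['Vector'].apply(parse_vector)
A = np.stack(old_df['Vector'].tolist())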
Example code to get the top k cosine similarities and their corresponding GUIDs and row IDs:
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
data = {"GUID": ["b770", "808f", "b111"], "Vector": [[-0.1, -0.2, 0.3], [0.1, -0.2, -0.3], [-0.1, 0.2, -0.3]]}
df = pd.DataFrame(data)
print("Data: \n{}\n".format(df))
vectors = []
for v in df['Vector']:
    vectors.append(v)
vectors_num = len(vectors)
A = np.array(vectors)
# Get similarities matrix
similarities = cosine_similarity(A)
# Mask the diagonal and lower triangle so each pair is counted only once
similarities[np.tril_indices(vectors_num)] = -2
print("Similarities: \n{}\n".format(similarities))
k = 2
if k > vectors_num:
    k = vectors_num
# Get top k similarities and pair GUID in ascending order
top_k_indexes = np.unravel_index(np.argsort(similarities.ravel())[-k:], similarities.shape)
top_k_similarities = similarities[top_k_indexes]
top_k_pair_GUID = []
for i, j in zip(*top_k_indexes):
    pair_GUID = (df.iloc[i]["GUID"], df.iloc[j]["GUID"])
    top_k_pair_GUID.append(pair_GUID)
print("top_k_indexes: \n{}\ntop_k_pair_GUID: \n{}\ntop_k_similarities: \n{}".format(top_k_indexes, top_k_pair_GUID, top_k_similarities))
Outputs:
Data:
GUID Vector
0 b770 [-0.1, -0.2, 0.3]
1 808f [0.1, -0.2, -0.3]
2 b111 [-0.1, 0.2, -0.3]
Similarities:
[[-2. -0.42857143 -0.85714286]
[-2. -2. 0.28571429]
[-2. -2. -2. ]]
top_k_indexes:
(array([0, 1], dtype=int64), array([1, 2], dtype=int64))
top_k_pair_GUID:
[('b770', '808f'), ('808f', 'b111')]
top_k_similarities:
[-0.42857143 0.28571429]
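If the original row IDs (the Index column from the CSV) are needed as well, here is a minimal sketch, assuming df also carries an "Index" column next to "GUID" (the toy data above does not include one):
# Hypothetical extension: collect row IDs, GUIDs and scores for each top-k pair
top_k_pairs = []
for i, j in zip(*top_k_indexes):
    top_k_pairs.append({
        "ids": (df.iloc[i]["Index"], df.iloc[j]["Index"]),
        "guids": (df.iloc[i]["GUID"], df.iloc[j]["GUID"]),
        "similarity": similarities[i, j],
    })
print(top_k_pairs)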
So, I am making a movie search CLI app for educational reasons.
I am using tabulate to get a pretty table:
>>> python movies.py
Please enter a movie name: sonic
+--------+--------------------+--------+
| 1 | Sonic the Hedgehog | 2020 |
| 0 | Oasis: Supersonic | 2016 |
|--------+--------------------+--------|
| SlNo | Title | Year |
+--------+--------------------+--------+
Here is the code I used:
# This is a wrapper for my API that scrapes Yify (yts.mx)
# Import the API
from YifyAPI.yify import search_yify as search
# Import the table
from tabulate import tabulate
# Import other utilities
import click, json
get_results = search(input('Please enter a movie name: '))
count = -1
table = []
for a in get_results:
    count += 1
    entry = [count, a['title'], a['year']]
    table.append(entry)
headers = ['SlNo', "Title", "Year"]
table = tabulate(table, headers, tablefmt='psql')
table = '\n'.join(table.split('\n')[::-1])
click.echo(table)
As you can see in the above code, I have a two-dimensional list, and I want to sort the movies by the 3rd item in each sub-list, which is the year the movie was released. Is there an easy and Pythonic way to do this? It is possible to convert the year to an integer if needed.
If possible, I want to sort from the most recent movie to the oldest, i.e. in descending order.
Well, I dug a little through Stack Overflow questions and found the most suitable answer for me:
https://stackoverflow.com/a/7588949/13077523
From the Python documentation wiki, I think you can do:
a = [[1, 2, 3], [4, 5, 6], [0, 0, 1]]
a = sorted(a, key=lambda a_entry: a_entry[1])
print(a)
The output is:
[[0, 0, 1], [1, 2, 3], [4, 5, 6]]
But what I did instead is:
get_results = search(input('Please enter a movie name: '))
get_results = sorted(get_results, key=lambda a_entry: a_entry['year'])
count = -1
table = []
for a in get_results:
    count += 1
    entry = [count, a['title'], a['year']]
    table.append(entry)
headers = ['SlNo', "Title", "Year"]
table = tabulate(table, headers, tablefmt='psql')
table = '\n'.join(table.split('\n')[::-1])
click.echo(table)
Which worked perfectly for me!
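Since the goal was newest-to-oldest, a slightly more direct variant (a sketch, assuming every 'year' value can be cast to int) is to sort in descending order up front:
# Hypothetical variant: sort descending by year before building the table
get_results = sorted(get_results, key=lambda entry: int(entry['year']), reverse=True)
With this, the list is already newest-first before tabulate renders it.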
I have a table of records as shown below.
Id Indicator Date
1 R 2018-01-20
1 R 2018-10-21
1 P 2019-01-22
2 R 2018-02-28
2 P 2018-05-22
2 P 2019-03-05
I need to pick the Ids that had more than two R indicators in the last one year and derive a new column called Marked_Flag with value Y, otherwise N. So the expected output should look like below:
Id Marked_Flag
1 Y
2 N
So what I have done so far: I took the records into one dataset and then built another dataset from that. The code looks like below.
Dataset<Row> getIndicators = spark.sql("select id, count(indicator) as indi_count from source where indicator = 'R' group by id");
getIndicators.createOrReplaceTempView("getIndicators"); // register so the second query can reference it
Dataset<Row> getFlag = spark.sql("select id, case when indi_count > 1 then 'Y' else 'N' end as Marked_Flag from getIndicators");
But my lead wants this to be done using a single dataset and Spark transformations. I am pretty new to Spark; any guidance or code snippet in this regard would be very helpful.
Try out the following. Note that I am using a PySpark DataFrame here:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([
[1, "R", "2018-01-20"],
[1, "R", "2018-10-21"],
[1, "P", "2019-01-22"],
[2, "R", "2018-02-28"],
[2, "P", "2018-05-22"],
[2, "P", "2019-03-05"]], ["Id", "Indicator","Date"])
gr = df.filter(F.col("Indicator")=="R").groupBy("Id").agg(F.count("Indicator"))
gr = gr.withColumn("Marked_Flag", F.when(F.col("count(Indicator)") > 1, "Y").otherwise('N')).drop("count(Indicator)")
gr.show()
# +---+-----------+
# | Id|Marked_Flag|
# +---+-----------+
# | 1| Y|
# | 2| N|
# +---+-----------+
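For reference, the same flag can also be derived in a single Spark SQL statement over the source data (a sketch, assuming the input is registered as a temp view named source and mirroring the > 1 threshold used above):
# Hypothetical single-query equivalent, reusing the df defined above
df.createOrReplaceTempView("source")
flags = spark.sql("""
    select Id,
           case when sum(case when Indicator = 'R' then 1 else 0 end) > 1
                then 'Y' else 'N'
           end as Marked_Flag
    from source
    group by Id
""")
flags.show()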
I have been building my application in Python, but for some reason I need to move it to a distributed environment, so I'm trying to build the application
using Spark, but I am unable to come up with code as fast as shift in Pandas.
mask = (df['name_x'].shift(0) == df['name_y'].shift(0)) & \
       (df['age_x'].shift(0) == df['age_y'].shift(0))
df = df[~mask]
Where
mask.tolist()
gives
[True, False, True, False]
The final result df will contain only two rows (2nd and 4th).
Basically, I am trying to remove rows where the [name_x, age_x] columns duplicate the [name_y, age_y] columns.
The above code operates on a Pandas dataframe. What would be the closest PySpark code that is just as efficient, without importing Pandas?
I have looked at Window functions in Spark but am not sure whether they apply here.
shift plays no role in your code. This
import pandas as pd
df = pd.DataFrame({
"name_x" : ["ABC", "CDF", "DEW", "ABC"],
"age_x": [20, 20, 22, 21],
"name_y" : ["ABC", "CDF", "DEW", "ABC"],
"age_y" : [20, 21, 22, 19],
})
mask1 = (df['name_x'].shift(0) == df['name_y'].shift(0)) & \
(df['age_x'].shift(0) == df['age_y'].shift(0))
df[~mask1]
# name_x age_x name_y age_y
# 1 CDF 20 CDF 21
# 3 ABC 21 ABC 19
is just equivalent to
mask2 = (df['name_x'] == df['name_y']) & (df['age_x'] == df['age_y'])
df[~mask2]
# name_x age_x name_y age_y
# 1 CDF 20 CDF 21
# 3 ABC 21 ABC 19
Therefore all you need is filter:
sdf = spark.createDataFrame(df)
smask = ~((sdf["name_x"] == sdf["name_y"]) & (sdf["age_x"] == sdf["age_y"]))
sdf.filter(smask).show()
# +------+-----+------+-----+
# |name_x|age_x|name_y|age_y|
# +------+-----+------+-----+
# | CDF| 20| CDF| 21|
# | ABC| 21| ABC| 19|
# +------+-----+------+-----+
which, by De Morgan's laws, can be simplified to
(sdf["name_x"] != sdf["name_y"]) | (sdf["age_x"] != sdf["age_y"])
In general, shift can be expressed with Window functions.
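For example, pandas' shift(1) maps to lag over a Window, which needs an explicit ordering column; a minimal sketch, assuming a hypothetical "id" column defines the row order (it is not in the example data):
from pyspark.sql import Window, functions as F
# Reproduce pandas shift(1) on name_x, ordering rows by the assumed "id" column
w = Window.orderBy("id")
sdf_shifted = sdf.withColumn("name_x_prev", F.lag("name_x", 1).over(w))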