Dataframe column with two different names - apache-spark

I want to know whether a Spark data frame can have two different names for the same column.
I know that by using "withColumn" I can add a new column, but I do not want to add a new column to the data frame; I just want an alias for an existing column in the data frame.
For example, suppose there is a data frame with 3 columns "Col1, Col2, Col3".
Can anyone please let me know whether I can give Col3 an alias so that I can retrieve the data of 'Col3' under the name "Col4" as well?

EDIT: possible duplicate: Usage of spark DataFrame "as" method
It looks like there are several ways to do this, depending on the Spark library and client library you're using.
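For illustration, here is a minimal PySpark sketch using the column names from the question. A column itself has only one name, so the alias exists only in the output of the query that defines it; withColumnRenamed is the alternative if an outright rename is acceptable.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
spark = SparkSession.builder.getOrCreate()
# Toy data frame with the three columns from the question
df = spark.createDataFrame([(1, "a", 10), (2, "b", 20)], ["Col1", "Col2", "Col3"])
# Option 1: expose Col3 under the extra name Col4 in a query result;
# df itself is unchanged, the alias only lives in the selected output.
df.select("*", F.col("Col3").alias("Col4")).show()
# Option 2: rename the column if a new name (rather than a second name) is enough.
df.withColumnRenamed("Col3", "Col4").show()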

Related

How to fix NaN values in multiple concatenated tables in Jupyter Notebook

I am using a Jupyter Notebook. I have concatenated multiple tables. When I run the head() command I am not able to see the values in the age and gender columns of the table; instead it shows me NaN values against each user_id.
The following image_1 shows the output when I concatenated the two different tables.
How can I sort out this issue, or can you suggest another way to concatenate the tables so that I can see all of the table values?
Or do I need to access the tables separately and apply operations on the different tables?
I am expecting to get the values in the age and gender columns rather than NaN values.
When I use these tables separately, they show correct results, but I have a big data problem, so I need to concatenate the tables to access each of the feature columns. In the end, I can apply operations on the concatenated table's features.
I have been testing your problem with the two CSVs from your GitHub.
First of all, I imported the pandas library and loaded the two tables as 'df1' and 'df2'.
import pandas as pd
df1 = pd.read_csv('https://raw.githubusercontent.com/HaseebTariq7/Data_analysis_dataset/main/data_1.csv')
df2 = pd.read_csv('https://raw.githubusercontent.com/HaseebTariq7/Data_analysis_dataset/main/data_2.csv')
Then, using the pandas library, you can merge both dataframes on the connecting column, in this case 'user_id'.
final_df= pd.merge(df1, df2,on='user_id')
Finally, we have 'final_df' with all the information from both tables and without NaNs.
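To see why the original concatenation produced NaNs while the merge does not, here is a small self-contained sketch with made-up toy data (the age/gender columns are just stand-ins, not the actual contents of the linked CSVs):
import pandas as pd
# Toy stand-ins for the two tables, sharing only the 'user_id' key
df1 = pd.DataFrame({"user_id": [1, 2, 3], "age": [25, 31, 47]})
df2 = pd.DataFrame({"user_id": [1, 2, 3], "gender": ["F", "M", "F"]})
# Row-wise concat just stacks the frames: rows coming from df1 have no
# 'gender' and rows coming from df2 have no 'age', so those cells are NaN.
print(pd.concat([df1, df2]))
# Merging on 'user_id' aligns the rows instead, so every column is filled.
print(pd.merge(df1, df2, on='user_id'))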

How does table data get loaded into a dataframe in Databricks? Row by row or in bulk?

I am new to Databricks notebooks and dataframes. I have a requirement to load a few columns (out of many) from a table of around 14 million records into a dataframe. Once the table is loaded, I need to create a new column based on values present in two of the columns.
I want to write the logic for the new column along with the select command while loading the table into the dataframe.
Ex:
df = (spark.read.table(tableName)
      .select(columnsList)
      .withColumn('newColumnName', 'logic'))
Will it have any performance impact? Is it better to first load the few columns of the table into the df and then perform the column manipulation on the loaded df?
Does the table data get loaded all at once or row by row into the df? If row by row, then by including the column manipulation logic while reading the table, am I causing any performance degradation?
Thanks in advance!!
This really depends on the underlying format of the table - whether it's backed by Parquet or Delta, or it's an interface to an actual database, etc. In general, Spark tries to read only the necessary data, and if, for example, Parquet (or Delta) is used, then this is easier because it's a column-oriented file format, so the data for each column is stored together.
Regarding the question on reading - Spark is lazy by default, so even if you put df = spark.read.table(....) into a separate variable, then add .select, and then add .withColumn, it won't do anything until you call some action, for example .count, or write your results. Until that time, Spark will just check that the table exists, that your operations are correct, etc. You can always call .explain on the resulting dataframe to see how Spark will perform the operations.
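As a small illustration of that lazy behaviour, here is a sketch with placeholder table and column names (assumed to exist, with col_a and col_b numeric); the chained read/select/withColumn only builds a plan, and no data is read until an action runs:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
spark = SparkSession.builder.getOrCreate()
# Placeholder names: assumes a table with numeric columns col_a and col_b exists
df = (spark.read.table("my_schema.my_table")
      .select("col_a", "col_b")
      .withColumn("new_col", F.col("col_a") + F.col("col_b")))
# No data has been read yet; this only prints the planned operations.
df.explain()
# An action such as count() (or writing the result) triggers the actual read.
print(df.count())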
P.S. I recommend grabbing the free copy of Learning Spark, 2nd edition, that is provided by Databricks - it will give you a foundation for developing code for Spark/Databricks.

Python3 Pandas dataframes: besides column names, are there also column labels?

Many database management systems, such as Oracle or SQL Server, and even statistical software like SAS, allow having field labels in addition to field names.
E.g., in DBMS one may have a table called "Table1" with, among other fields, two fields called "income_A" and "income_B".
Now, in the DBMS logic, "income_A" and "income_B" are the field names.
Besides a name, those two fields can also have plain-English labels associated with them, which clarify the actual meaning of those two fields, such as "A - Income of households with dependent children where both parents work and have a post-degree level of education" and "B - Income of empty-nester households where only one person works".
Is there anything like that in Python3 Pandas dataframes?
I mean, I know I can give a dataframe column a "label" (which is, seen from the above DBMS perspective, more like a "name", in the sense that you can use it to refer to the column itself).
But can I also associate a longer description with the column, something that I can choose to display instead of the column "label" in print-outs and reports, or that I can save into dataframe exports, e.g., in MS Excel format? Or do I have to do it all with data dictionaries instead?
It does not seem that there is a way to store such meta info other than in the column name itself. But the column name can be quite verbose; I tested up to 100 characters. Make sure to pass it as a collection.
Such a long name could be annoying to use for indexing in the code. You could use loc/iloc, or assign the name to a string variable for use in indexing.
In [10]: pd.DataFrame([1, 2, 3, 4], columns=['how long can this be i want to know please tell me'])
Out[10]:
   how long can this be i want to know please tell me
0                                                    1
1                                                    2
2                                                    3
3                                                    4
This page shows that the columns don't really have any attributes to play with other than the labels.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.columns.html
There is some more info you can get about a dataframe:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.info.html
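As a small sketch of the suggestion above (the label text is adapted from the question's example), the verbose description can live in the column name while a string variable keeps indexing convenient:
import pandas as pd
# Verbose description used directly as the column name, held in a variable
income_a = ("A - Income of households with dependent children "
            "where both parents work")
df = pd.DataFrame({income_a: [100, 200, 300]})
# Index via the saved string, or positionally with iloc
print(df[income_a].mean())
print(df.iloc[:, 0].sum())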

Apache Spark search inside a dataset

I am new to Spark and trying to achieve the following. I am not able to find the best way to do this. Please advise.
I am using Spark 2.0 and its Dataset API, Cassandra 3.7, and the Cassandra Java connector.
I have one column family with a partition key and 2 clustering keys. For example:
myksyspace.myTable (some_id, col1, col2, col3, col4, col5, PRIMARY KEY (some_id, col1, col2))
I can get the data of myksyspace.myTable into a dataset, myTableDataset. The data has a large number of rows (maybe 200,000).
Every hour I get updated data from another source, which may contain some new data that is not in my database, and I want to save it to the database.
The data I receive from the other source is updated but has no value for “col2”. I get the rest of the data in a list, “dataListWithSomeNewData”, in my Java code.
Now I want to match the data in the list with the data in myTableDataset, copy col2 from the dataset into the list “dataListWithSomeNewData”, and then generate a new dataset and save it back to the database. This way my existing data will be updated. I want the new data to be inserted with a newly generated unique value of col2 for each list item. How do I achieve this?
I want to avoid collectAsList() on the dataset so as not to run out of memory, as I may load a large amount of data into memory. With collectAsList(), the code works with a small amount of data.
Any suggestions or ideas on this? How can I achieve this?
Thank you in advance
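No answer is recorded for this one, but as a loose sketch of the matching step the question describes (written in PySpark rather than the Java Dataset API, with made-up values, purely for illustration): turning the incoming batch into a dataset and joining it with myTableDataset on the key columns copies col2 across without collectAsList().
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
# Stand-ins: in real code myTableDataset would be loaded from Cassandra via the connector
myTableDataset = spark.createDataFrame(
    [("id1", "a", "c2-1", 1), ("id2", "b", "c2-2", 2)],
    ["some_id", "col1", "col2", "col3"])
incoming = spark.createDataFrame(
    [("id1", "a", 10), ("id3", "x", 30)],
    ["some_id", "col1", "col3"])
# A left join on the keys copies col2 where a match exists; rows with no match
# keep col2 = null and would still need a newly generated unique value.
enriched = incoming.join(myTableDataset.select("some_id", "col1", "col2"),
                         on=["some_id", "col1"], how="left")
enriched.show()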

Pyspark: filter DataFrame where column value equals some value in a list of Row objects

I have a list of pyspark.sql.Row objects as follows:
[Row(artist=1255340), Row(artist=942), Row(artist=378), Row(artist=1180), Row(artist=813)]
From a DataFrame having schema (id, name), I want to filter out rows where id equals some artist in the given list of Rows. What is the correct way to go about it?
To clarify further, I want to do something like: select row from dataframe where row.id is in list_of_row_objects
The main question is how big list_of_row_objects is. If it is small, then the approach in the link provided by @Karthik Ravindra works fine.
If it is big, then you can instead use a dataframe_of_row_objects: do an inner join between your dataframe and dataframe_of_row_objects, matching the artist column in dataframe_of_row_objects against the id column in your original dataframe. This basically removes any id not in dataframe_of_row_objects.
Of course, using a join is slower, but it is more flexible. For lists which are not small but are still small enough to fit into memory, you can use the broadcast hint to still get good performance.
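A minimal PySpark sketch of that join-based filter, assuming a toy (id, name) dataframe and the list of Row objects from the question (the broadcast hint is optional and only helps when the list fits comfortably in memory):
from pyspark.sql import SparkSession, Row
from pyspark.sql.functions import broadcast
spark = SparkSession.builder.getOrCreate()
# Toy dataframe with the (id, name) schema from the question
df = spark.createDataFrame([(942, "a"), (7, "b"), (1180, "c")], ["id", "name"])
# Turn the list of Row objects into a dataframe so it can be joined
list_of_row_objects = [Row(artist=1255340), Row(artist=942),
                       Row(artist=378), Row(artist=1180), Row(artist=813)]
dataframe_of_row_objects = spark.createDataFrame(list_of_row_objects)
# The inner join on id == artist keeps only rows whose id appears in the list;
# broadcast() hints that the small side should be shipped to every executor.
filtered = df.join(broadcast(dataframe_of_row_objects),
                   df.id == dataframe_of_row_objects.artist,
                   "inner").drop("artist")
filtered.show()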
