Getting data in database format from an Excel file [Python] - python-3.x

I have a table in Excel with this structure:

Country | County | 20/01/2020 | 21/01/2020
Country | County | Value 1    | Value 2
I would like to convert it into the following format so that I can insert it into a table in my database:

Country | County | Date       | Values
Country | County | 20/01/2020 | Value 1
Country | County | 21/01/2020 | Value 2
Is there a quick way to do this, or should I iterate over each row and build a dataframe from there? The Excel file has millions of entries.

pd.melt is what you're looking for:
pd.melt(df, id_vars=['Country', 'County'], var_name='Date')
Supply the dataframe as the first argument; id_vars tells the function which columns to keep as identifiers for each row. The remaining columns are unpivoted into a new column in the melted dataframe, and var_name sets the name of that new column.
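For instance, a minimal sketch built from the table above (column names taken from the question; value_name names the value column, which otherwise defaults to 'value'):
import pandas as pd

# Wide table: one column per date.
df = pd.DataFrame({
    'Country': ['Country'],
    'County': ['County'],
    '20/01/2020': ['Value 1'],
    '21/01/2020': ['Value 2'],
})

# Unpivot the date columns into (Date, Values) rows.
long_df = pd.melt(df, id_vars=['Country', 'County'], var_name='Date', value_name='Values')
print(long_df)
# Each original row becomes one row per date column, ready to insert into the database.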

Related

How to compute unique values for a group of rows and create a column for all records using that value?

I have a dask dataframe with only a 'Name' and a 'Value' column, similar to the table below.
How do I compute the 'Average' column? I tried groupby in dask, but that just gives me a dataframe of 2 records containing the averages of A and B.
You can just left join your original table with the aggregated one on Name. From https://docs.dask.org/en/latest/dataframe-joins.html:
# The aggregated frame is small, so collapse it into a single partition
small = small.repartition(npartitions=1)
# and dask will broadcast it against every partition of the big frame.
result = big.merge(small)
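Putting it together, a sketch with toy data (the 'Name'/'Value' columns come from the question; 'Average' is the new column):
import dask.dataframe as dd
import pandas as pd

# Toy frame standing in for the real data.
pdf = pd.DataFrame({'Name': ['A', 'A', 'B', 'B'], 'Value': [1.0, 2.0, 3.0, 4.0]})
big = dd.from_pandas(pdf, npartitions=2)

# Per-name averages: a tiny two-row dataframe.
small = big.groupby('Name')['Value'].mean().reset_index()
small = small.rename(columns={'Value': 'Average'})

# Collapse to one partition so the join becomes a cheap broadcast.
small = small.repartition(npartitions=1)
result = big.merge(small, on='Name', how='left')
print(result.compute())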

My dataframe contains '/' in a column name. How can I plot certain values from that column?

I need to plot the state called 'Kerala' from the column 'state/unionterritory', together with 'confirmed', to create a lineplot.
So far I have written:
sns.lineplot(x=my_data['state/unionterritory'], y=my_data['confirmed'])
https://www.kaggle.com/essentialguy/exercise-final-project
This is the dataframe.head(); see the column name.
I assume you want to get a column state/unionterritory from your DataFrame and filter it so it contains only Kerala state:
my_data_kerala = my_data[my_data['state/unionterritory'] == 'Kerala']['state/unionterritory']
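To actually draw the line plot, filter the rows down to Kerala first and then pass the 'confirmed' column to seaborn. A minimal sketch, assuming the dataset also has a 'date' column to use as the x-axis (check your file; only 'state/unionterritory' and 'confirmed' come from the question):
import seaborn as sns

# Keep only the rows where the state is Kerala.
kerala = my_data[my_data['state/unionterritory'] == 'Kerala']

# 'date' is an assumed column name for the x-axis.
sns.lineplot(x=kerala['date'], y=kerala['confirmed'])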

How to extract certain data from multiple identical indexes, create a new column, and save it on the first row of the index

I have a dataset of movie name, date, and accumulated revenue.
As you can see, there are multiple rows for the same movie, and the revenue is accumulated.
I want to extract the last 'Revenue accumulated' value for each movie, make a new column, and add that extracted value to the first row of that movie. I know how to make a new column, but I don't know how to extract the data and add it to the FIRST ROW of a certain movie. I'm using Python and pandas.
df['Date'] = pd.to_datetime(df['Date'])
df = df.sort_values('Date')
df.groupby('Movie name')['Revenue accumulated'].last()
I now know how to collect the last accumulated revenue of a certain movie using groupby, but I still don't know how to insert that information into the first row of each movie in a new column.
Use:
df['Date'] = pd.to_datetime(df['Date'])
df = df.sort_values('Date')
df.groupby('Movie name')['Revenue accumulated'].last()
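That groupby gives the values; to place each one on the first row of its movie, one option is to broadcast it with transform and then blank it out everywhere except the first occurrence. A sketch reusing the column names from the snippet above ('Final revenue' is just a name chosen here):
import numpy as np
import pandas as pd

df['Date'] = pd.to_datetime(df['Date'])
df = df.sort_values('Date')

# Broadcast each movie's last accumulated revenue to all of its rows ...
df['Final revenue'] = df.groupby('Movie name')['Revenue accumulated'].transform('last')

# ... then keep it only on the first (earliest) row of each movie.
df.loc[df.duplicated('Movie name'), 'Final revenue'] = np.nan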

How to read a particular row from a groupby dataframe?

I am preprocessing my data about car sales, where many of the used cars have a price of 0. I want to replace the 0 values with the mean price of similar cars.
Here I have found the mean values for each car with a groupby:
df2 = df1.groupby(['car', 'body', 'year', 'engV'])['price'].mean()
This is my dataframe extracted from the actual data, where the price is zero:
rep_price = df[df['price'] == 0]
I want to assign the mean price value from df2['Land Rover'] to rep_price['Land Rover'], which is 0.
Since I'm not sure about the columns you have in those dataframes, I'm going to take a guess to give you a head start. Let's say 'Land Rover' is a value in a column called 'Car_Type' in df1, and you grouped your data like this:
df2 = df1.groupby(['Car_Type'])['price'].mean()
In that case, something like this should cover your need:
df1.loc[df1['Car_Type']=='Land Rover','price'] = df2['Land Rover']
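If you want to fill every zero price in one pass (not just Land Rover), you can compute the means from the non-zero rows only, so the zeros don't drag the averages down, and map them back. A sketch under the same 'Car_Type' assumption:
# Per-type means computed from rows that have a real price.
means = df1.loc[df1['price'] > 0].groupby('Car_Type')['price'].mean()

# Replace each zero price with the mean for its car type.
zero = df1['price'] == 0
df1.loc[zero, 'price'] = df1.loc[zero, 'Car_Type'].map(means)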

PySpark: Update column values for a given number of rows of a DataFrame

I have a DataFrame with 10 rows and 2 columns: an ID column with random identifier values and a VAL column filled with None.
from pyspark.sql import Row

vals = [
    Row(ID=1, VAL=None),
    Row(ID=2, VAL=None),
    Row(ID=3, VAL=None),
    Row(ID=4, VAL=None),
    Row(ID=5, VAL=None),
    Row(ID=6, VAL=None),
    Row(ID=7, VAL=None),
    Row(ID=8, VAL=None),
    Row(ID=9, VAL=None),
    Row(ID=10, VAL=None)
]
# Give VAL an explicit type; Spark cannot infer one from all-null values.
df = spark.createDataFrame(vals, 'ID: int, VAL: string')
Now let's say I want to update the VAL column for 3 rows with the value "lets", 3 rows with the value "bucket", and 4 rows with the value "this".
Is there a straightforward way of doing this in PySpark?
Note: the ID values are not necessarily consecutive, and the bucket distribution is not necessarily even.
I'll try to explain the idea with some pseudo-code that you can map to your solution.
Using a window function over one partition, we can generate a sequential row_number() for each row in the dataframe and store it in, say, a column row_num.
Next, your "rules" can be represented as another little dataframe: [min_row_num, max_row_num, label].
All you need is to join those two datasets on the row number, adding the new column:
df1.join(
    df2,
    on=col('df1.row_num').between(col('min_row_num'), col('max_row_num'))
).select('df1.*', 'df2.label')
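A runnable version of that idea. The rule boundaries below are just the 3/3/4 split from the question, and ordering the window by ID is an assumption made to keep the numbering deterministic:
from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import col, row_number

spark = SparkSession.builder.getOrCreate()

# Number the rows; a window with no partitioning pulls everything onto one partition.
w = Window.orderBy('ID')
df1 = df.withColumn('row_num', row_number().over(w)).alias('df1')

# The "rules" as a little dataframe: [min_row_num, max_row_num, label].
rules = spark.createDataFrame(
    [(1, 3, 'lets'), (4, 6, 'bucket'), (7, 10, 'this')],
    ['min_row_num', 'max_row_num', 'label'],
).alias('df2')

# Range join on the row number and pick up the label as the new VAL.
result = df1.join(
    rules,
    on=col('df1.row_num').between(col('df2.min_row_num'), col('df2.max_row_num')),
).select(col('df1.ID'), col('df2.label').alias('VAL'))

result.show()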
