Getting data in database format from an Excel file [Python] - python-3.x

I have a table in Excel with this structure:

Country | County | 20/01/2020 | 21/01/2020
Country | County | Value 1    | Value 2
I would like to convert it into the following format so that I can insert it into a table in my database:

Country | County | Date       | Values
Country | County | 20/01/2020 | Value 1
Country | County | 21/01/2020 | Value 2
Is there a quick way to do this, or should I iterate over each row and build a dataframe from there? The Excel file has millions of entries.

pd.melt is what you're looking for:
pd.melt(df, id_vars=['Country', 'County'], var_name='Date')
Supply the dataframe as the first argument; id_vars tells the function which columns to keep as identifiers for each row. The remaining columns are unpivoted into a new column in the melted dataframe, and var_name sets the name of that new column.
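For instance, a minimal sketch built from the table above (column names taken from the question; value_name names the value column, which otherwise defaults to 'value'):
import pandas as pd

# Wide table: one column per date.
df = pd.DataFrame({
    'Country': ['Country'],
    'County': ['County'],
    '20/01/2020': ['Value 1'],
    '21/01/2020': ['Value 2'],
})

# Unpivot the date columns into (Date, Values) rows.
long_df = pd.melt(df, id_vars=['Country', 'County'], var_name='Date', value_name='Values')
print(long_df)
# Each original row becomes one row per date column, ready to insert into the database.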

Related

How to compute unique values for a group of rows and create a column for all records using that value?

I have a dask dataframe with only a 'Name' and a 'Value' column, similar to the table below.
How do I compute the 'Average' column? I tried groupby in dask, but that just gives me a dataframe of 2 records containing the averages of A and B.
You can just left join your original table with the aggregated one on Name. From https://docs.dask.org/en/latest/dataframe-joins.html:
# The aggregated frame is small, so collapse it into a single partition
small = small.repartition(npartitions=1)
# and dask will broadcast it against every partition of the big frame.
result = big.merge(small)
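Putting it together, a sketch with toy data (the 'Name'/'Value' columns come from the question; 'Average' is the new column):
import dask.dataframe as dd
import pandas as pd

# Toy frame standing in for the real data.
pdf = pd.DataFrame({'Name': ['A', 'A', 'B', 'B'], 'Value': [1.0, 2.0, 3.0, 4.0]})
big = dd.from_pandas(pdf, npartitions=2)

# Per-name averages: a tiny two-row dataframe.
small = big.groupby('Name')['Value'].mean().reset_index()
small = small.rename(columns={'Value': 'Average'})

# Collapse to one partition so the join becomes a cheap broadcast.
small = small.repartition(npartitions=1)
result = big.merge(small, on='Name', how='left')
print(result.compute())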

My dataframe contains '/' in a column name. How can I plot certain values from that column?

I need to plot the state called 'Kerala' from the column 'state/unionterritory', together with 'confirmed', to create a lineplot.
So far I have written:
sns.lineplot(x=my_data['state/unionterritory'], y=my_data['confirmed'])
https://www.kaggle.com/essentialguy/exercise-final-project
This is the dataframe.head(); see the column name.
I assume you want to get a column state/unionterritory from your DataFrame and filter it so it contains only Kerala state:
my_data_kerala = my_data[my_data['state/unionterritory'] == 'Kerala']['state/unionterritory']
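To actually draw the line plot, filter the rows down to Kerala first and then pass the 'confirmed' column to seaborn. A minimal sketch, assuming the dataset also has a 'date' column to use as the x-axis (check your file; only 'state/unionterritory' and 'confirmed' come from the question):
import seaborn as sns

# Keep only the rows where the state is Kerala.
kerala = my_data[my_data['state/unionterritory'] == 'Kerala']

# 'date' is an assumed column name for the x-axis.
sns.lineplot(x=kerala['date'], y=kerala['confirmed'])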

How to extract certain data from multiple identical indexes, create a new column, and save it on the first row of the index

I have a dataset of movie name, date, and accumulated revenue.
As you can see, there are multiple rows for the same movie, and the revenue is accumulated.
I want to extract the last 'Revenue accumulated' value for each movie, make a new column, and add that extracted value to the first row of that movie. I know how to make a new column, but I don't know how to extract the data and add it to the FIRST ROW of a certain movie. I'm using Python and pandas.
df['Date'] = pd.to_datetime(df['Date'])
df = df.sort_values('Date')
df.groupby('Movie name')['Revenue accumulated'].last()
I now know how to collect the last accumulated revenue of a certain movie using groupby, but I still don't know how to insert that information into the first row of each movie in a new column.
Use:
df['Date'] = pd.to_datetime(df['Date'])
df = df.sort_values('Date')
df.groupby('Movie name')['Revenue accumulated'].last()
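That groupby gives the values; to place each one on the first row of its movie, one option is to broadcast it with transform and then blank it out everywhere except the first occurrence. A sketch reusing the column names from the snippet above ('Final revenue' is just a name chosen here):
import numpy as np
import pandas as pd

df['Date'] = pd.to_datetime(df['Date'])
df = df.sort_values('Date')

# Broadcast each movie's last accumulated revenue to all of its rows ...
df['Final revenue'] = df.groupby('Movie name')['Revenue accumulated'].transform('last')

# ... then keep it only on the first (earliest) row of each movie.
df.loc[df.duplicated('Movie name'), 'Final revenue'] = np.nan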

How to read a particular row from a groupby dataframe?

I am preprocessing my data about car sales, where many of the used cars have a price of 0. I want to replace the 0 values with the mean price of similar cars.
Here I have found the mean values for each car with a groupby:
df2 = df1.groupby(['car', 'body', 'year', 'engV'])['price'].mean()
This is my dataframe extracted from the actual data, where the price is zero:
rep_price = df[df['price'] == 0]
I want to assign the mean price value from df2['Land Rover'] to rep_price['Land Rover'], which is 0.
Since I'm not sure about the columns you have in those dataframes, I'm going to take a guess to give you a head start. Let's say 'Land Rover' is a value in a column called 'Car_Type' in df1, and you grouped your data like this:
df2 = df1.groupby(['Car_Type'])['price'].mean()
In that case, something like this should cover your need:
df1.loc[df1['Car_Type']=='Land Rover','price'] = df2['Land Rover']
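If you want to fill every zero price in one pass (not just Land Rover), you can compute the means from the non-zero rows only, so the zeros don't drag the averages down, and map them back. A sketch under the same 'Car_Type' assumption:
# Per-type means computed from rows that have a real price.
means = df1.loc[df1['price'] > 0].groupby('Car_Type')['price'].mean()

# Replace each zero price with the mean for its car type.
zero = df1['price'] == 0
df1.loc[zero, 'price'] = df1.loc[zero, 'Car_Type'].map(means)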

PySpark: Update column values for a given number of rows of a DataFrame

I have a DataFrame with 10 rows and 2 columns: an ID column with random identifier values and a VAL column filled with None.
from pyspark.sql import Row

vals = [
    Row(ID=1, VAL=None),
    Row(ID=2, VAL=None),
    Row(ID=3, VAL=None),
    Row(ID=4, VAL=None),
    Row(ID=5, VAL=None),
    Row(ID=6, VAL=None),
    Row(ID=7, VAL=None),
    Row(ID=8, VAL=None),
    Row(ID=9, VAL=None),
    Row(ID=10, VAL=None)
]
# Give VAL an explicit type; Spark cannot infer one from all-null values.
df = spark.createDataFrame(vals, 'ID: int, VAL: string')
Now let's say I want to update the VAL column for 3 rows with the value "lets", 3 rows with the value "bucket", and 4 rows with the value "this".
Is there a straightforward way of doing this in PySpark?
Note: the ID values are not necessarily consecutive, and the bucket distribution is not necessarily even.
I'll try to explain the idea with some pseudo-code that you can map to your solution.
Using a window function over one partition, we can generate a sequential row_number() for each row in the dataframe and store it in, say, a column row_num.
Next, your "rules" can be represented as another little dataframe: [min_row_num, max_row_num, label].
All you need is to join those two datasets on the row number, adding the new column:
df1.join(
    df2,
    on=col('df1.row_num').between(col('min_row_num'), col('max_row_num'))
).select('df1.*', 'df2.label')
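A runnable version of that idea. The rule boundaries below are just the 3/3/4 split from the question, and ordering the window by ID is an assumption made to keep the numbering deterministic:
from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import col, row_number

spark = SparkSession.builder.getOrCreate()

# Number the rows; a window with no partitioning pulls everything onto one partition.
w = Window.orderBy('ID')
df1 = df.withColumn('row_num', row_number().over(w)).alias('df1')

# The "rules" as a little dataframe: [min_row_num, max_row_num, label].
rules = spark.createDataFrame(
    [(1, 3, 'lets'), (4, 6, 'bucket'), (7, 10, 'this')],
    ['min_row_num', 'max_row_num', 'label'],
).alias('df2')

# Range join on the row number and pick up the label as the new VAL.
result = df1.join(
    rules,
    on=col('df1.row_num').between(col('df2.min_row_num'), col('df2.max_row_num')),
).select(col('df1.ID'), col('df2.label').alias('VAL'))

result.show()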
