I have another problem with csv. I am using pandas to remove duplicates from a csv file. After doing so I noticed that all the data has been put into one column (before processing, the data was spread across 9 columns). How can I avoid that?
Here is the data sample:
39,43,197,311,112,88,47,36,Label_1
Here is the function:
import pandas as pd
def clear_duplicates():
    df = pd.read_csv("own_test.csv", sep="\n")
    df.drop_duplicates(subset=None, inplace=True)
    df.to_csv("own_test.csv", index=False)
Remove sep, because the default separator in read_csv is already ,. Passing sep="\n" makes pandas treat each whole line as a single field, which is why everything ends up in one column:
def clear_duplicates():
    df = pd.read_csv("own_test.csv")
    df.drop_duplicates(inplace=True)
    df.to_csv("own_test.csv", index=False)
Maybe not so nice, but works too:
pd.read_csv("own_test.csv").drop_duplicates().to_csv("own_test.csv", index=False)
I need your help to sort a table based on a specific column in ascending and descending order.
For example I have this table saved in a file called "test.txt"
which contains the following:
Name,BirthYear,Job
Ahmed,1990,Engineer
Salim,1997,Teacher
Nasser,1993,Accountant
I loaded it into a table with prettytable:
from prettytable import from_csv
import csv, operator
with open('test.txt', 'r') as open_file:
    table_file = from_csv(open_file)
print(table_file)
the picture is attached as 1.png
Then I sorted it based on BirthYear column:
print(table_file.get_string(sortby='BirthYear', sort_key=lambda row: row[0]))
the picture is attached as 2.png
How to sort the same but in descending order? (based on any column, not only BirthYear column).
If you have different ways, it would be great to learn them.
Thank you in advance
Use sorted() if you want to sort the data in descending order: just add reverse=True to the sorted() call. Note that this sorts by only one column at a time.
from prettytable import from_csv
import csv, operator
with open('test.txt', 'r') as open_file:
    table_file = from_csv(open_file)
print(table_file)
# read the raw rows again so sorted() has a plain list to work on
with open('test.txt', 'r') as open_file:
    reader = csv.reader(open_file)
    header = next(reader)  # skip the header row
    data = list(reader)
srt = sorted(data, key=operator.itemgetter(1), reverse=True)  # index 1 = BirthYear; change the index to sort on another column
print(srt)
Or you can use pandas, which can sort by any column name directly:
import pandas as pd
df = pd.read_csv('test.txt')
srt = df.sort_values("BirthYear", ascending=False)
print(srt)
I hope that helps you.
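For completeness, and depending on the prettytable version you have installed, prettytable itself can also sort in descending order via the reversesort option of get_string, which avoids re-reading the file:
print(table_file.get_string(sortby='BirthYear', reversesort=True))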
I have just started learning a few things in Python, and I am stuck.
import yfinance as yf
import pandas as pd
import yahoo_fin.stock_info as si
ticker = ['20MICRONS.NS', '21STCENMGM.NS', '3IINFOTECH.NS', '3MINDIA.NS', '3PLAND.NS']
for i in ticker:
    try:
        quote = si.get_quote_table(i)
        df = pd.DataFrame.from_dict(quote.items())
        df = df.append(quote.items(), ignore_index=True)
    except (ValueError, IndexError, TypeError):
        continue
print(df)
Just for example: ticker has more than 4 entries; by the time the loop exits, the data for each ticker should have been appended to the dataframe.
But for some reason the dataframe is not appending these values.
Thanks in advance
You defined df inside the loop, which means a new dataframe is initialised in df at each iteration. Define the dataframe before the loop and append to it inside the loop.
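A minimal sketch of that fix (assuming, as in the question, that si.get_quote_table returns a dict of quote fields; collecting the pieces in a list and concatenating once at the end also avoids repeated appends):
import pandas as pd
import yahoo_fin.stock_info as si
ticker = ['20MICRONS.NS', '21STCENMGM.NS', '3IINFOTECH.NS', '3MINDIA.NS', '3PLAND.NS']
frames = []  # one small dataframe per ticker
for i in ticker:
    try:
        quote = si.get_quote_table(i)  # dict of quote fields
        part = pd.DataFrame(list(quote.items()), columns=['field', 'value'])
        part['ticker'] = i
        frames.append(part)
    except (ValueError, IndexError, TypeError):
        continue
df = pd.concat(frames, ignore_index=True)  # all tickers in one dataframe
print(df)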
I'm trying to separate columns by slicing them because I need to assign a dtype to each one. So I grouped them by dtype, assigned the respective dtype, and now I want to join or concat them back so that the result has the same column order as the main dataframe. I should add that it is not possible to do it by column name, because the names may change.
Example:
import pandas as pd
file = pd.read_csv(f, encoding='utf8') #It has 11 columns
intg = file.iloc[:,[0,2,4,6,8,9,11]].astype("Int64")
obj = file.iloc[:,[1,3,5,7,10]].astype(str)
After doing this I need to join them with the same order as the main file, that is from 0 to 11.
To join these two chunks you can use DataFrame.join, then reindex based on your original dataframe's columns. It should look something like this:
import pandas as pd
file = pd.read_csv(f, encoding='utf8') #It has 11 columns
intg = file.iloc[:,[0,2,4,6,8,9,11]].astype("Int64")
obj = file.iloc[:,[1,3,5,7,10]].astype(str)
out = intg.join(obj).reindex(file.columns, axis="columns")
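An equivalent alternative (just a sketch, not from the original answer) is pd.concat along the column axis, followed by the same reindex:
out = pd.concat([intg, obj], axis=1).reindex(file.columns, axis="columns")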
I have a 1650x40 dataframe that is a matrix of people who worked on projects each day. It looks like this:
import pandas as pd
df = pd.DataFrame([['bob','11/1/19','X','','',''], ['pete','11/1/19','X','','',''],
['wendy','11/1/19','','','X',''], ['sam','11/1/19','','','',''],
['cara','11/1/19','','','X','']],
columns=['person', 'date', 'project1','project2','project3','project4'])
I am trying to sanity check the data by:
listing any columns that do not have an X in them (in this case
'project2' and 'project4')
listing any rows that do not have an X in them (in this case
'sam')
Desired outcome:
Something like df.show_empty(columns) returns ['project2','project4'] and df.show_empty(rows) returns ['sam']
Obviously this method would need some way to tell it that the first two columns are not expected to be empty and should be ignored.
My desired outcome above would return lists of column headings (or row indexes) so that I could go back and check my data and application to find out why there's no entry in the relevant cell (I am guessing there's a good chance that more than one row or column is affected). This seems like it should be trivial but I'm really stuck with figuring this out.
Thanks for any help offered!
For me, it is easier to use apply to accomplish this task. The working code is shown below:
import pandas as pd
df = pd.DataFrame([['bob','11/1/19','X','','',''], ['pete','11/1/19','X','','',''],
['wendy','11/1/19','','','X',''], ['sam','11/1/19','','','',''],
['cara','11/1/19','','','X','']],
columns=['person', 'date', 'project1','project2','project3','project4'])
import numpy as np
df = df.replace('', np.nan)
colmns = df.apply(lambda x: x.count()==0, axis=0)
df[colmns.index[colmns]]
df[df.apply(lambda x: x[2:].count()==0, axis=1)]
df = df.replace('', np.nan) replaces the '' values with NaN, so that we can use the count() function.
colmns = df.apply(lambda x: x.count()==0, axis=0): this will find the columns that are all NaN.
df[df.apply(lambda x: x[2:].count()==0, axis=1)]: this will ignore the first two columns.
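An alternative sketch that avoids the NaN replacement step: compare the project columns against 'X' directly and keep the columns and rows where nothing matches (the slicing mirrors the example dataframe above, where the first two columns are person and date):
proj = df.iloc[:, 2:]  # skip the 'person' and 'date' columns
empty_cols = proj.columns[(proj != 'X').all(axis=0)].tolist()      # ['project2', 'project4']
empty_rows = df.loc[(proj != 'X').all(axis=1), 'person'].tolist()  # ['sam']
print(empty_cols, empty_rows)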
I am trying to read the column names from the dataframe and append them to the geojson properties dynamically. It worked by hard-coding the column names, but I want to do it without hard-coding them.
Can anyone help me figure out how to get those values (without iterating over the geojson rows)?
import pandas as pd
import geojson
def data2geojson(df):
    #s="name=X[0],description=X[1],LAT-x[2]"
    features = []
    insert_features = lambda X: features.append(
        geojson.Feature(geometry=geojson.Point((float(X["LONG"]), float(X["LAT"]))),
                        properties=dict(name=X[0], description=X[1])))
    df.apply(insert_features, axis=1)
    #with open('/dbfs/FileStore/tables/geojson11.geojson', 'w', encoding='utf8') as fp:
    #    geojson.dump(geojson.FeatureCollection(features), fp, sort_keys=True, ensure_ascii=False)
    print(features)
df=spark.sql("select * from geojson1")
df=df.toPandas()
data2geojson(df)
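In case it helps, here is a minimal sketch of one possible approach (my own assumption, not code from the question): build the properties dict from df.columns, excluding the coordinate columns, so nothing is hard-coded:
import geojson
import pandas as pd
def data2geojson(df):
    features = []
    prop_cols = [c for c in df.columns if c not in ("LONG", "LAT")]  # every column except the coordinates
    def insert_features(X):
        features.append(
            geojson.Feature(
                geometry=geojson.Point((float(X["LONG"]), float(X["LAT"]))),
                properties={c: X[c] for c in prop_cols}))
    df.apply(insert_features, axis=1)
    print(features)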