I am trying to read the column names from a dataframe and append them to the GeoJSON properties dynamically. It works when I hard-code the column names, but I want to avoid hard-coding them.
Can anyone help me get those values (without building the GeoJSON by iterating over the rows)?
import pandas as pd
import geojson

def data2geojson(df):
    #s="name=X[0],description=X[1],LAT-x[2]"
    features = []
    insert_features = lambda X: features.append(
        geojson.Feature(geometry=geojson.Point((float(X["LONG"]), float(X["LAT"]))),
                        properties=dict(name=X[0], description=X[1])))
    df.apply(insert_features, axis=1)
    #with open('/dbfs/FileStore/tables/geojson11.geojson', 'w', encoding='utf8') as fp:
    #    geojson.dump(geojson.FeatureCollection(features), fp, sort_keys=True, ensure_ascii=False)
    print(features)

df = spark.sql("select * from geojson1")
df = df.toPandas()
data2geojson(df)
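One possible approach (a sketch, not from the original post): build the properties dict from df.columns with a dict comprehension, skipping the coordinate columns, so nothing is hard-coded. The helper name data2geojson_dynamic and the default column names are assumptions.

import geojson

def data2geojson_dynamic(df, lon_col="LONG", lat_col="LAT"):
    # every column except the coordinate columns becomes a property
    prop_cols = [c for c in df.columns if c not in (lon_col, lat_col)]
    features = []

    def insert_features(row):
        features.append(
            geojson.Feature(
                geometry=geojson.Point((float(row[lon_col]), float(row[lat_col]))),
                properties={c: row[c] for c in prop_cols}))

    df.apply(insert_features, axis=1)
    return geojson.FeatureCollection(features)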
I need your help to sort a table based on a specific column in ascending and descending order.
For example I have this table saved in a file called "test.txt"
which contains the following:
Name,BirthYear,Job
Ahmed,1990,Engineer
Salim,1997,Teacher
Nasser,1993,Accountant
I loaded it into a table with prettytable:
from prettytable import from_csv
import csv, operator

with open('test.txt', 'r') as open_file:
    table_file = from_csv(open_file)
print(table_file)
The output is attached as 1.png.
Then I sorted it based on BirthYear column:
print(table_file.get_string(sortby='BirthYear', sort_key=lambda row: row[0]))
The sorted output is attached as 2.png.
How can I sort the same table, but in descending order (based on any column, not only the BirthYear column)?
If you have different ways, it would be great to learn them.
Thank you in advance.
Use sorted(); if you want to sort the data in descending order, just add reverse=True to the sorted() call. Note that this sorts by only one column at a time.
from prettytable import from_csv
import csv, operator

with open('test.txt', 'r') as open_file:
    table_file = from_csv(open_file)
print(table_file)

# read the raw rows again so sorted() can work on them
with open('test.txt', 'r') as open_file:
    reader = csv.reader(open_file)
    next(reader)                 # skip the header line
    data = list(reader)
srt = sorted(data, key=operator.itemgetter(2), reverse=True)   # index 2 = Job column, use 1 for BirthYear
print(srt)
Or you can use pandas:
import pandas as pd

df = pd.read_csv('test.txt')
srt = df.sort_values(["BirthYear"], axis=0, ascending=False)
print(srt)
I hope that helps you.
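Another option, assuming the installed prettytable version supports it, is the reversesort flag of get_string, which keeps everything inside PrettyTable:

print(table_file.get_string(sortby='BirthYear', reversesort=True))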
I'm trying to separate columns by slicing them because I need to assign a dtype to each one. So I grouped them by dtype and assigned the respective dtype, and now I want to join or concat them back so that the result has the same column order as the main dataframe. I should add that doing it by column name is not possible, because the names may change.
Example:
import pandas as pd
file = pd.read_csv(f, encoding='utf8') #It has 11 columns
intg = file.iloc[:,[0,2,4,6,8,9,11]].astype("Int64")
obj = file.iloc[:,[1,3,5,7,10]].astype(str)
After doing this I need to join them back in the same order as the main file, that is, columns 0 to 11.
To join these 2 chunks you can use the join function, then reindex based on your original dataframe's columns. It should look something like this:
import pandas as pd

file = pd.read_csv(f, encoding='utf8')  # it has 11 columns
intg = file.iloc[:, [0, 2, 4, 6, 8, 9, 11]].astype("Int64")
obj = file.iloc[:, [1, 3, 5, 7, 10]].astype(str)
out = intg.join(obj).reindex(file.columns, axis="columns")
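An equivalent sketch (an alternative, not from the original answer) uses pd.concat along the columns, assuming intg and obj still share the same index:

out = pd.concat([intg, obj], axis=1).reindex(file.columns, axis="columns")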
I have 2 CSV files. One of them has sorted data and the other unsorted. Example data is shown below.
What I am trying to do is take the unsorted data and sort it according to the index numbers from the sorted data. For example, in the sorted data, index number "1" corresponds to "name001.a.a". So, since its index number is "1", I want "name001.a.a,0001" from the unsorted file to be first in the list. The number after the comma in the unsorted file is a 4-digit number that plays no role in the sorting but stays attached to the names.
One more example: index "2" is for "name002.a.a", so after sorting, the new file would have "name002.a.a,0002" as the second item in the list.
unsorted.csv:
name002.a.a,0002
name001.a.a,0001
name005.a.a,0025
hostnum.csv (sorted):
"1 name001.a.a"
"2 name002.a.a"
"3 name005.a.a"
I need help figuring out where my code is wrong and, if possible, help with completing it.
EDIT - CODE:
After changing the name csv_list to csv_file, I am receiving the following error.
from matplotlib import pyplot as plt
import numpy as np
import csv

csv_file = []
with open('hostnum.csv', 'r') as f:
    csvreader = csv.reader(f)
    for line in csvreader:
        csv_file.append(line)

us_csv_file = []
with open('unsorted.csv', 'r') as f:
    csvreader = csv.reader(f)
    for line in csvreader:
        us_csv_file.append(line)

us_csv_file.sort(key=lambda x: csv_file.index(x[1]))

plt.plot([int(item[1]) for item in us_csv_file], 'o-')
plt.xticks(np.arange(len(csvfile)), [item[0] for item in csvfile])
plt.show()
ERROR:
Traceback (most recent call last):
File "C:/..../TEST_ALL.py", line 16, in <module>
us_csv_file.sort(key=lambda x: csv_file.index(x[1]))
File "C:/..../TEST_ALL.py", line 16, in <lambda>
us_csv_file.sort(key=lambda x: csv_file.index(x[1]))
ValueError: '0002' is not in list
Well, you haven't defined csv_list in your code. Looking quickly through your code, I'd guess changing us_csv_file.sort(key=lambda x: csv_list.index(x[1])) to us_csv_file.sort(key=lambda x: csv_file.index(x[1])) (i.e. using the correct variable name, which is csv_file and not csv_list), might just solve the problem.
Here's a new attempt. This one extracts the numbers from the second column of hostnum.csv and puts them onto a separate list, which is then used to sort the items. When I run this code, I get ValueError: '025' is not in list, but I assume that's because you haven't given us the entire files and there is indeed no line containing name025.a.a in the snippet of hostnum.csv you gave us. I also added a [1:] to the sorting statement.
If this doesn't work, try removing that [1:] and changing csv_file_numbers.append(csv_file[-1][1][4:].split('.')[0]) to csv_file_numbers.append(csv_file[-1][1][4:].split('.')[0].zfill(4)). string.zfill(4) pads the start of a string with zeros so that its length is at least 4.
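For instance:

'25'.zfill(4)    # -> '0025'
'002'.zfill(4)   # -> '0002'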
Because the numbers in the two files are padded to different lengths, I also changed the sort key accordingly (the [1:] mentioned above). Here is the code:
from matplotlib import pyplot as plt
import numpy as np
import csv

csv_file = []
csv_file_numbers = []
##with open('hostnum.csv', 'r') as f:
##    csvreader = csv.reader(f, dialect="excel-tab")
##    for line in csvreader:
##        csv_file.append(line)
##        csv_file_numbers.append(line[-1][4:].split('.')[0])
with open('hostnum.csv', 'r') as f:
    sorted_raw = f.read()
for line in sorted_raw.splitlines():
    csv_file.append(line.split('\t'))
    csv_file_numbers.append(csv_file[-1][1][4:].split('.')[0])

us_csv_file = []
with open('unsorted.csv', 'r') as f:
    csvreader = csv.reader(f)
    for line in csvreader:
        us_csv_file.append(line)

us_csv_file.sort(key=lambda x: csv_file_numbers.index(x[1][1:]))

plt.plot([int(item[1]) for item in us_csv_file], 'o-')
plt.xticks(np.arange(len(csvfile)), [item[0] for item in csvfile])
plt.show()
This one worked on my computer:
from matplotlib import pyplot as plt
import numpy as np
import csv

csv_file = []
csv_file_dict = {}
##with open('hostnum.csv', 'r') as f:
##    csvreader = csv.reader(f, dialect="excel-tab")
##    for line in csvreader:
##        csv_file.append(line)
##        csv_file_numbers.append(line[-1][4:].split('.')[0])
with open('hostnum.csv', 'r') as f:
    sorted_raw = f.read()
for line in sorted_raw.splitlines():
    csv_file.append(line.split('\t'))
    # drop the trailing quote from the name and the leading quote from the index, then map name -> index
    csv_file_dict[csv_file[-1][-1][:-1]] = int(csv_file[-1][0][1:])

us_csv_file = []
with open('unsorted.csv', 'r') as f:
    csvreader = csv.reader(f)
    for line in csvreader:
        us_csv_file.append(line)

# sort the unsorted rows by the index that hostnum.csv assigns to each name
us_csv_file.sort(key=lambda x: csv_file_dict[x[0]])

plt.plot([int(item[1]) for item in us_csv_file], 'o-')
plt.xticks(np.arange(len(csv_file)), [item[0] for item in csv_file])
plt.show()
So now I created a dict which stores the index values as values and the names found in both files as keys. I also removed the quotation marks manually, because for some reason csv.reader didn't seem to handle them correctly; at least it didn't handle the tabs in the desired way. As I wrote in one of my comments, I don't know why for sure; I'd guess it's because the quotes are not closed within the cells in the file. Anyway, I decided to split each line manually with string.split('\t').
Also, you had missed the underscore in the variable name csv_file in a couple of places at the end, so I added it.
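For reference, here is a more compact sketch of the same idea, assuming the two files look exactly like the samples above: build a name-to-index mapping from hostnum.csv and use it as the sort key.

import csv

order = {}
with open('hostnum.csv', 'r', newline='') as f:
    for row in csv.reader(f):
        # each line is assumed to be a single quoted cell like "1 name001.a.a"
        idx, name = row[0].split(None, 1)
        order[name] = int(idx)

with open('unsorted.csv', 'r', newline='') as f:
    rows = list(csv.reader(f))

rows.sort(key=lambda r: order[r[0]])   # sort by the index assigned to each name
print(rows)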
I need to store the timestamps in a list for further operations and have written the following code:
import csv
from datetime import datetime
from collections import defaultdict

t = []
columns = defaultdict(list)
fmt = '%Y-%m-%d %H:%M:%S.%f'

with open('log.csv', 'r') as f:
    reader = csv.reader(f, delimiter=',')
    for row in reader:
        #t = row[1]
        for i in range(len(row)):
            columns[i].append(row[i])
        if (row):
            t = list(datetime.strptime(row[0], fmt))

columns = dict(columns)
print(columns)
for i in range(len(row)-1):
    print(t)
But I am getting the error:
Traceback (most recent call last):
File "parking.py", line 17, in <module>
t = list(datetime.strptime(row[0],fmt))
TypeError: 'datetime.datetime' object is not iterable
What can I do to store each timestamp in the column in a list?
Edit 1:
Here is the sample log file
2011-06-26 21:27:41.867801,KA03JI908,Bike,Entry
2011-06-26 21:27:42.863209,KA02JK1029,Car,Exit
2011-06-26 21:28:43.165316,KA05K987,Bike,Entry
If you have a csv file, then why not use pandas to get what you want? The code for your problem may look something like this:
import pandas as pd

df = pd.read_csv('log.csv', header=None)  # the sample log has no header row
timestamp = df[0]
If the first column of the csv holds the timestamps, you now have all the entries of that column in the Series called timestamp.
After this you can convert all the entries of this list into datetime objects using datetime.datetime.strptime().
Hope this is helpful.
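A minimal sketch of that conversion step (assuming the log file has no header row, as in the sample above):

import pandas as pd
from datetime import datetime

df = pd.read_csv('log.csv', header=None)
fmt = '%Y-%m-%d %H:%M:%S.%f'

# convert with strptime, as suggested above...
timestamps = [datetime.strptime(s, fmt) for s in df[0]]

# ...or let pandas parse the whole column at once
timestamps_series = pd.to_datetime(df[0], format=fmt)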
I can't comment for clarifications yet.
Would this code get you the timestamps in a list? If yes, give me a few lines of data from the csv file.
from datetime import datetime

timestamps = []
with open(csv_path, 'r') as readf_obj:  # csv_path is the path to your log file
    for line in readf_obj:
        timestamps.append(line.split(',')[0])

fmt = '%Y-%m-%d %H:%M:%S.%f'
datetimes_timestamps = [datetime.strptime(timestamp_, fmt) for timestamp_ in timestamps]
I have another problem with CSV. I am using pandas to remove duplicates from a csv file. After doing so I noticed that all the data ended up in one column (the preprocessed data was spread across 9 columns). How can I avoid that?
Here is the data sample:
39,43,197,311,112,88,47,36,Label_1
Here is the function:
import pandas as pd
def clear_duplicates():
df = pd.read_csv("own_test.csv", sep="\n")
df.drop_duplicates(subset=None, inplace=True)
df.to_csv("own_test.csv", index=False)
Remove sep, because the default separator in read_csv is already ,:
def clear_duplicates():
    df = pd.read_csv("own_test.csv")
    df.drop_duplicates(inplace=True)
    df.to_csv("own_test.csv", index=False)
Maybe not so nice, but works too:
pd.read_csv("own_test.csv").drop_duplicates().to_csv("own_test.csv", index=False)
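If own_test.csv has no header row (as the sample line above suggests), you may also need header=None so the first data row is not treated as column names; a small sketch, under that assumption:

import pandas as pd

df = pd.read_csv("own_test.csv", header=None)
df.drop_duplicates(inplace=True)
df.to_csv("own_test.csv", index=False, header=False)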