Python memory and time consumption issues - python-3.x

I have a huge file of around 80 MB from which I am generating another file that orders the data. But as the file is very large, the program I wrote takes a lot of time (around 1 hour). How can I reduce the time it takes?
Below is the content of the file.
Logging at 1/20/2019 12:00:00 AM
Test Manager | 1706
TestStandEngineWrapper | 1403
Logging at 1/20/2019 12:00:01 AM
Test Manager | 1706
TestStandEngineWrapper | 1403
There are thousands of entries like this, which I am trying to arrange in the format below.
Test Manager | 1706 | Logging at 1/20/2019 12:00:00 AM
Test Manager | 1706 | Logging at 1/20/2019 12:00:01 AM
TestStandEngineWrapper | 1403 | Logging at 1/20/2019 12:00:00 AM
TestStandEngineWrapper | 1403 | Logging at 1/20/2019 12:00:01 AM
import re
file = open("C:\\Users\\puru\\Desktop\\xyz.txt", "rt")
file1 = open("C:\\Users\\puru\\Desktop\\xyz1.txt", "wt")
file1.write("")
arr1 = file.readlines()
str1 = ""
str2 = ""
arr2 = []
arr3 = []
arr4 = []
#for j in iter(file.readline, ''):
for i, j in enumerate(arr1):
    if "Logging" in j:
        str1 = j
    elif "Logging" not in j:
        arr3.append(j.split("|")[0])
        str2 = j.rstrip() + " | " + str1
        arr2.append(str2)
        str2 = ""
for i in arr3:
    if i not in arr4:
        arr4.append(i)
for j in arr4:
    for k in arr2:
        if re.match(j, k):
            file1 = open("C:\\Users\\puru\\Desktop\\xyz1.txt", "at")
            file1.write(k)
            file1.close()
file.close()
Though I am getting the desired output, it takes so long that it is not really useful. Could you please suggest something to reduce the time?
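Not part of the original post, but a minimal sketch of one common way to speed this up: group the lines per component in a single pass with a dict and write each group once, instead of rescanning every collected line with re.match for each component and reopening the output file for each match. The paths and input format are assumed to be the same as in the question.

from collections import defaultdict

groups = defaultdict(list)
timestamp = ""

with open("C:\\Users\\puru\\Desktop\\xyz.txt", "rt") as src:
    for line in src:
        if "Logging" in line:
            # remember the most recent "Logging at ..." header
            timestamp = line.rstrip()
        elif line.strip():
            # group each data line under its component name
            key = line.split("|")[0].strip()
            groups[key].append(line.rstrip() + " | " + timestamp + "\n")

with open("C:\\Users\\puru\\Desktop\\xyz1.txt", "wt") as dst:
    for key in groups:            # dicts keep insertion order in Python 3.7+
        dst.writelines(groups[key])

This replaces the nested loop (roughly components × lines comparisons, plus repeated file opens) with one linear pass, which should bring the runtime for an 80 MB file down from an hour to seconds.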

Related

How do you use the Django timezone.override(utc) to create datetimes with tzinfo?

I can use timezone.now() to get a UTC timestamp in Django.
I can use timezone.make_aware(datetime(2012,1,1)) to add UTC as the TZ for the datetime.
The last one is cumbersome and a lot of extra typing.
Meanwhile I found these docs: https://docs.djangoproject.com/en/4.1/topics/i18n/timezones/
But activating the TZ doesn't seem to affect datetime creation (still no tzinfo included).
I also found these docs inside Python:
>>> from django.utils import timezone
>>> help(timezone)
Help on module django.utils.timezone in django.utils:

NAME
    django.utils.timezone - Timezone-related classes and functions.

CLASSES
    contextlib.ContextDecorator(builtins.object)
        override

    class override(contextlib.ContextDecorator)
     |  override(timezone)
     |
     |  Temporarily set the time zone for the current thread.
     |
     |  This is a context manager that uses django.utils.timezone.activate()
     |  to set the timezone on entry and restores the previously active timezone
     |  on exit.
     |
     |  The ``timezone`` argument must be an instance of a ``tzinfo`` subclass, a
     |  time zone name, or ``None``. If it is ``None``, Django enables the default
     |  time zone.
     |
     |  Method resolution order:
     |      override
     |      contextlib.ContextDecorator
     |      builtins.object
     |
     |  Methods defined here:
     |
     |  __enter__(self)
     |
     |  __exit__(self, exc_type, exc_value, traceback)
     |
     |  __init__(self, timezone)
     |      Initialize self. See help(type(self)) for accurate signature.
     |
     |  ----------------------------------------------------------------------
     |  Methods inherited from contextlib.ContextDecorator:
     |
     |  __call__(self, func)
     |      Call self as a function.
     |
     |  ----------------------------------------------------------------------
     |  Data descriptors inherited from contextlib.ContextDecorator:
I want to do something like:
from datetime import datetime
from django.utils import timezone
>>> with timezone.override(None):
... datetime.now()
...
datetime.datetime(2023, 1, 12, 16, 56, 3, 679319)  # WTF is the timezone info?
>>>
Am I stuck with writing a wrapper class to apply the tzinfo or is there a pythonic way to do this?
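Not part of the original question, but a minimal sketch of how this is commonly handled: timezone.override() only changes Django's notion of the current time zone; it never alters what the plain datetime.now() constructor returns, so aware datetimes have to be built with an explicit tzinfo or through Django's helpers. The zone name below is just an illustrative value.

from datetime import datetime, timezone as dt_timezone

from django.utils import timezone

# Aware "now" in UTC (assumes USE_TZ = True in settings):
now_utc = timezone.now()

# Aware "now" converted to the currently active Django time zone:
with timezone.override("Europe/Paris"):   # illustrative zone name
    now_paris = timezone.localtime()

# A specific wall-clock moment with UTC attached, without make_aware():
jan_first = datetime(2012, 1, 1, tzinfo=dt_timezone.utc)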

want to create a timestamp using GPS data with python3

Python 3 newbie here. I am trying to create a variable that I can use to make a GPS timestamp from an Adafruit GPS sensor, which I eventually want to store in a database. The MySQL database has a timestamp column that is filled in when data is inserted into a table, so I want to store the UTC time and date that comes from the GPS device as well.
It seems I have something wrong and cannot figure it out. The code is hanging on this:
def gpstsRead():
    gps_ts = '{}/{}/{} {:02}:{:02}:{:02}'.format(
        gps.timestamp_utc.tm_mon,
        gps.timestamp_utc.tm_mday,
        gps.timestamp_utc.tm_year,
        gps.timestamp_utc.tm_hour,
        gps.timestamp_utc.tm_min,
        gps.timestamp_utc.tm_sec,
    )
    return gps_ts
I am trying to put all of these into a timestamp-like format. The error is this:
Traceback (most recent call last):
  File "/home/pi/ek9/Sensors/GPS/gps-db-insert.py", line 57, in <module>
    gps_ts = gpstsRead()
  File "/home/pi/ek9/Sensors/GPS/gps-db-insert.py", line 20, in gpstsRead
    gps.timestamp_utc.tm_mon,
AttributeError: 'NoneType' object has no attribute 'tm_mon'
I have made sure I use spaces instead of tabs, as that has caused me grief in the past. Beyond that I really don't know. I have been putzing with this for hours to no avail. Any ideas? Thanks for any suggestions.
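The traceback shows that gps.timestamp_utc is None, which (as far as I know) is the case until the receiver has produced valid time data. A hypothetical guard, not from the original post, would be:

# Hypothetical guard, not from the original post: timestamp_utc can be None
# until the GPS has valid data, so check it before formatting.
if gps.timestamp_utc is None:
    print("Waiting for a valid GPS timestamp...")
else:
    gps_ts = '{}/{}/{} {:02}:{:02}:{:02}'.format(
        gps.timestamp_utc.tm_mon,
        gps.timestamp_utc.tm_mday,
        gps.timestamp_utc.tm_year,
        gps.timestamp_utc.tm_hour,
        gps.timestamp_utc.tm_min,
        gps.timestamp_utc.tm_sec,
    )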
Thanks all for the input. After reading these I decided to try it a little differently. Instead of wrapping it in a function with "def", I just decided to eliminate the "def" and create the variable directly.
Like this:
# define values
gps_ts = ('{}-{}-{} {:02}:{:02}:{:02}'.format(
    gps.timestamp_utc.tm_year,
    gps.timestamp_utc.tm_mon,
    gps.timestamp_utc.tm_mday,
    gps.timestamp_utc.tm_hour,
    gps.timestamp_utc.tm_min,
    gps.timestamp_utc.tm_sec,)
)
And that seemed to work. For anyone doing something similar, I will include the complete code. I also understand that, as in most languages, there is always more than one way to get the job done, some better than others, some not. I am still learning, so if anyone cares to point out how I could accomplish the same task differently or more efficiently, please feel free to give me the opportunity to learn. Thanks again!
#!/usr/bin/python3
import pymysql
import time
import board
from busio import I2C
import adafruit_gps

i2c = I2C(board.SCL, board.SDA)
gps = adafruit_gps.GPS_GtopI2C(i2c)  # Use I2C interface
gps.send_command(b"PMTK314,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0")
gps.send_command(b"PMTK220,1000")
last_print = time.monotonic()

# Open database connection
db = pymysql.connect("localhost", "database", "password", "table")
# prepare a cursor object using cursor() method
cursor = db.cursor()

while True:
    gps.update()
    current = time.monotonic()
    if current - last_print >= 1.0:  # update rate
        last_print = current
        if not gps.has_fix:
            # Try again if we don't have a fix yet.
            print("Waiting for a satellite fix...")
            continue
        # define values
        gps_ts = ('{}-{}-{} {:02}:{:02}:{:02}'.format(
            gps.timestamp_utc.tm_year,
            gps.timestamp_utc.tm_mon,
            gps.timestamp_utc.tm_mday,
            gps.timestamp_utc.tm_hour,
            gps.timestamp_utc.tm_min,
            gps.timestamp_utc.tm_sec,)
        )
        gps_lat = '{}'.format(gps.latitude)
        gps_long = '{}'.format(gps.longitude)
        gps_fix = '{}'.format(gps.fix_quality)
        gps_sat = '{}'.format(gps.satellites)
        gps_alt = '{}'.format(gps.altitude_m)
        gps_speed = '{}'.format(gps.speed_knots)
        gps_track = '{}'.format(gps.track_angle_deg)
        sql = "INSERT into ek9_gps(gps_timestamp_utc, latitude, \
            longitude, fix_quality, number_satellites, gps_altitude, \
            gps_speed, gps_track_angle) \
            values (%s,%s,%s,%s,%s,%s,%s,%s)"
        arg = (gps_ts, gps_lat, gps_long, gps_fix, gps_sat,
               gps_alt, gps_speed, gps_track)
        try:
            # Execute the SQL command
            cursor.execute(sql, arg)
            # Commit your changes in the database
            db.commit()
        except:
            print('There was an error on input into the database')
            # Rollback in case there is any error
            db.rollback()

# disconnect from server
cursor.close()
db.close()
And this is what MariaDB shows:
+----+---------------------+---------------------+----------+-----------+-------------+-------------------+--------------+-----------+-----------------+
| id | datetime | gps_timestamp_utc | latitude | longitude | fix_quality | number_satellites | gps_altitude | gps_speed | gps_track_angle |
+----+---------------------+---------------------+----------+-----------+-------------+-------------------+--------------+-----------+-----------------+
| 11 | 2020-12-30 14:14:42 | 2020-12-30 20:14:42 | xx.xxxx | -xx.xxxx | 1 | 10 | 232 | 0 | 350 |
| 10 | 2020-12-30 14:14:41 | 2020-12-30 20:14:41 | xx.xxxx | -xx.xxxx | 1 | 10 | 232 | 0 | 350 |
| 9 | 2020-12-30 14:14:39 | 2020-12-30 20:14:39 | xx.xxxx | -xx.xxxx | 1 | 10 | 232 | 0 | 350 |
| 8 | 2020-12-30 14:14:38 | 2020-12-30 20:14:38 | xx.xxxx | -xx.xxxx | 1 | 10 | 232 | 0 | 350 |
Success!!! Thanks again!
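Not from the original thread, but since the post invites suggestions: a small hedged sketch of the database part using keyword arguments to pymysql.connect() (the positional form is easy to misread) and a narrower except clause. The credentials and the trimmed column list are placeholders for illustration only.

import pymysql

# Placeholder credentials; keyword arguments make the connect() call explicit.
db = pymysql.connect(host="localhost", user="pi",
                     password="secret", database="ek9")
cursor = db.cursor()

sql = ("INSERT INTO ek9_gps (gps_timestamp_utc, latitude, longitude) "
       "VALUES (%s, %s, %s)")
try:
    cursor.execute(sql, ("2020-12-30 20:14:42", "42.0", "-71.0"))
    db.commit()
except pymysql.MySQLError as exc:
    # Catch only database errors instead of a bare except, and show the cause.
    print("Database insert failed:", exc)
    db.rollback()
finally:
    cursor.close()
    db.close()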

How to convert a normal DataFrame into a MultiIndexed one based on a certain condition

After a long while I visited SO's pandas section and found a question that was not nicely framed, so I thought I would put it here in a more explicit way, as I am in a similar kind of situation myself :-)
Below is the data frame construct:
>>> df
            measure      Pend Job       Run Job       Time
cls
ABC  [inter, batch]     [101, 93]   [302, 1327]  [56, 131]
DEF  [inter, batch]  [24279, 421]  [4935, 5452]  [75, 300]
The desired output would be as follows. I tried working on it but did not get any solution, so I have sketched below what I would like to achieve.
----------------------------------------------------------------------------------
| |Pend Job | Run Job | Time |
cls | measure |-----------------------------------------------------------
| |inter | batch| |inter | batch| |inter | batch |
----|-----------------|------|------|-------|------|------|-----|------|----------
ABC |inter, batch |101 |93 | |302 |1327 | |56 |131 |
----|-----------------|-------------|-------|------|------|-----|------|---------|
DEF |inter, batch |24279 |421 | |4935 |5452 | |75 |300 |
----------------------------------------------------------------------------------
That is, I want my DataFrame converted into a MultiIndex DataFrame where Pend Job, Run Job, and Time sit on the top level, as shown above.
Edit:
cls is not in the columns
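For reference (my reconstruction, not part of the question), the sample frame can be rebuilt like this, assuming every cell holds a Python list and cls is the index:

import pandas as pd

df = pd.DataFrame(
    {
        'measure': [['inter', 'batch'], ['inter', 'batch']],
        'Pend Job': [[101, 93], [24279, 421]],
        'Run Job': [[302, 1327], [4935, 5452]],
        'Time': [[56, 131], [75, 300]],
    },
    index=pd.Index(['ABC', 'DEF'], name='cls'),
)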
This is my approach; you can modify it to your needs:
import numpy as np

s = (df.drop('measure', axis=1)       # remove the measure column
       .set_index(df['measure'].apply(', '.join),
                  append=True)        # make `measure` the second index level
       .stack().explode().to_frame()  # one row per value in each list
     )

# assign an `inter` / `batch` label to each new cell
new_lvl = np.array(['inter', 'batch'])[s.groupby(level=(0, 1, 2)).cumcount()]
# or
# new_lvl = np.tile(['inter', 'batch'], len(s) // 2)

(s.set_index(new_lvl, append=True)[0]
   .unstack(level=(-2, -1))
   .reset_index()
)
Output:
   cls       measure Pend Job
                        inter batch
0  ABC  inter, batch     101    93
1  DEF  inter, batch   24279   421

How to convert rows into a list of dictionaries in pyspark?

I have a DataFrame (df) in PySpark, created by reading from a Hive table:
df=spark.sql('select * from <table_name>')
+++++++++++++++++++++++++++++++++++++++++++
| Name | URL visited |
+++++++++++++++++++++++++++++++++++++++++++
| person1 | [google,msn,yahoo] |
| person2 | [fb.com,airbnb,wired.com] |
| person3 | [fb.com,google.com] |
+++++++++++++++++++++++++++++++++++++++++++
When I tried the following, I got an error:
df_dict = dict(zip(df['name'],df['url']))
"TypeError: zip argument #1 must support iteration."
type(df.name) is 'pyspark.sql.column.Column'
How do I create a dictionary like the following, which can be iterated over later on?
{'person1':'google','msn','yahoo'}
{'person2':'fb.com','airbnb','wired.com'}
{'person3':'fb.com','google.com'}
Appreciate your thoughts and help.
I think you can try row.asDict(); this code runs directly on the executors, and you don't have to collect the data on the driver.
Something like:
df.rdd.map(lambda row: row.asDict())
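For example (an illustrative usage, not from the original answer), this keeps the work on the executors and only brings a small sample back to the driver:

# Build an RDD of plain Python dicts, one per row, then inspect a couple.
dict_rdd = df.rdd.map(lambda row: row.asDict())
dict_rdd.take(2)
# [{'Name': 'person1', 'URL visited': ['google', 'msn', 'yahoo']},
#  {'Name': 'person2', 'URL visited': ['fb.com', 'airbnb', 'wired.com']}]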
How about using the PySpark Row.asDict() method? This is part of the DataFrame API (which I understand is the "recommended" API at the time of writing) and would not require you to use the RDD API at all.
df_list_of_dict = [row.asDict() for row in df.collect()]
type(df_list_of_dict), type(df_list_of_dict[0])
#(<class 'list'>, <class 'dict'>)
df_list_of_dict
#[{'person1': ['google','msn','yahoo']},
# {'person2': ['fb.com','airbnb','wired.com']},
# {'person3': ['fb.com','google.com']}]
If you wanted your results in a python dictionary, you could use collect()1 to bring the data into local memory and then massage the output as desired.
First collect the data:
df_dict = df.collect()
#[Row(Name=u'person1', URL visited=[u'google', u'msn', u'yahoo']),
# Row(Name=u'person2', URL visited=[u'fb.com', u'airbnb', u'wired.com']),
# Row(Name=u'person3', URL visited=[u'fb.com', u'google.com'])]
This returns a list of pyspark.sql.Row objects. You can easily convert this to a list of dicts:
df_dict = [{r['Name']: r['URL visited']} for r in df_dict]
#[{u'person1': [u'google', u'msn', u'yahoo']},
# {u'person2': [u'fb.com', u'airbnb', u'wired.com']},
# {u'person3': [u'fb.com', u'google.com']}]
1 Be advised that for large data sets, this operation can be slow and potentially fail with an Out of Memory error. You should consider if this is what you really want to do first as you will lose the parallelization benefits of spark by bringing the data into local memory.
Given:
+++++++++++++++++++++++++++++++++++++++++++
| Name | URL visited |
+++++++++++++++++++++++++++++++++++++++++++
| person1 | [google,msn,yahoo] |
| person2 | [fb.com,airbnb,wired.com] |
| person3 | [fb.com,google.com] |
+++++++++++++++++++++++++++++++++++++++++++
This should work:
df_dict = df \
    .rdd \
    .map(lambda row: {row[0]: row[1]}) \
    .collect()
df_dict
#[{'person1': ['google','msn','yahoo']},
# {'person2': ['fb.com','airbnb','wired.com']},
# {'person3': ['fb.com','google.com']}]
This way you only collect after processing.
Please let me know if that works for you :)

Spark: treeAggregate at IDF is taking ages

I am using Spark 1.6.1, and I have a DataFrame as follow:
+-----------+---------------------+-------------------------------+
|ID |dateTime |title |
+-----------+---------------------+-------------------------------+
|809907895 |2017-01-21 23:00:01.0| |
|1889481973 |2017-01-21 23:00:06.0|Man charged with murder of ... |
|979847722 |2017-01-21 23:00:09.0|Munster cruise to home Cham... |
|18894819734|2017-01-21 23:00:11.0|Man charged with murder of ... |
|17508023137|2017-01-21 23:00:15.0|Emily Ratajkowski hits the ... |
|10321187627|2017-01-21 23:00:17.0|Gardai urge public to remai... |
|979847722 |2017-01-21 23:00:19.0|Sport |
|19338946129|2017-01-21 23:00:33.0| |
|979847722 |2017-01-21 23:00:35.0|Rassie Erasmus reveals the ... |
|1836742863 |2017-01-21 23:00:49.0|NAMA sold flats which could... |
+-----------+---------------------+-------------------------------+
I am doing the following operation:
val aggDF = df.groupBy($"ID")
.agg(concat_ws(" ", collect_list($"title")) as "titlesText")
Then, on the aggDF DataFrame, I fit a pipeline that extracts TF-IDF features from the titlesText column (by applying Tokenizer, StopWordsRemover, HashingTF, then IDF).
When I call pipeline.fit(aggDF), the code reaches a stage treeAggregate at IDF.scala:54 (I can see that on the UI) and then gets stuck there, without any progress and without any error. I wait a very long time and there is no progress and no helpful information on the UI.
Here is an example of what I see in the UI (nothing changes for a very long time):
What are the possible reasons of that?
How to track and debug such problems?
Is there any other way to extract the same feature?
Did you specify a maximum number of features in your HashingTF?
The amount of data the IDF has to deal with is proportional to the number of features produced by HashingTF, and for very large feature spaces it will most likely have to spill to disk, which wastes time.
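As an illustration (a PySpark sketch under my own assumptions, while the question itself uses the Scala API on Spark 1.6), capping numFeatures keeps the vectors the IDF stage has to aggregate small; the intermediate column names and the 1 << 15 value below are invented for the example:

from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, StopWordsRemover, HashingTF, IDF

tokenizer = Tokenizer(inputCol="titlesText", outputCol="tokens")
remover = StopWordsRemover(inputCol="tokens", outputCol="filtered")
hashing_tf = HashingTF(inputCol="filtered", outputCol="rawFeatures",
                       numFeatures=1 << 15)   # explicit cap; tune for your data
idf = IDF(inputCol="rawFeatures", outputCol="features")

pipeline = Pipeline(stages=[tokenizer, remover, hashing_tf, idf])
model = pipeline.fit(aggDF)   # aggDF as built in the question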
