How to use the dir() function to see inside the scrapy module - python-3.x

From the documentation:
Without arguments, return the list of names in the current local scope. With an argument, attempt to return a list of valid attributes for that object.
So I tried to see inside the scrapy module.
scrapy is a module, right, or am I wrong?
>>> dir(scrapy)
NameError: name 'scrapy' is not defined
I'm a complete newbie in Python and just trying to understand how this works.
How can I see inside modules like in the documentation examples?
>>> dir(sys)
['__displayhook__', '__doc__', '__excepthook__', '__loader__', '__name__',
'__package__', '__stderr__', '__stdin__', '__stdout__',
'_clear_type_cache', '_current_frames', '_debugmallocstats', '_getframe',
'_home', '_mercurial', '_xoptions', 'abiflags', 'api_version', 'argv',
'base_exec_prefix', 'base_prefix', 'builtin_module_names', 'byteorder',
'call_tracing', 'callstats', 'copyright', 'displayhook',
'dont_write_bytecode', 'exc_info', 'excepthook', 'exec_prefix',
'executable', 'exit', 'flags', 'float_info', 'float_repr_style',
'getcheckinterval', 'getdefaultencoding', 'getdlopenflags',
'getfilesystemencoding', 'getobjects', 'getprofile', 'getrecursionlimit',
'getrefcount', 'getsizeof', 'getswitchinterval', 'gettotalrefcount',
'gettrace', 'hash_info', 'hexversion', 'implementation', 'int_info',
'intern', 'maxsize', 'maxunicode', 'meta_path', 'modules', 'path',
'path_hooks', 'path_importer_cache', 'platform', 'prefix', 'ps1',
'setcheckinterval', 'setdlopenflags', 'setprofile', 'setrecursionlimit',
'setswitchinterval', 'settrace', 'stderr', 'stdin', 'stdout',
'thread_info', 'version', 'version_info', 'warnoptions']

Try this from your Python interpreter:
In [1]: import scrapy
In [2]: dir(scrapy)
Out[2]:
['Field',
'FormRequest',
'Item',
'Request',
'Selector',
'Spider',
'__all__',
'__builtins__',
'__cached__',
'__doc__',
'__file__',
'__loader__',
'__name__',
'__package__',
'__path__',
'__spec__',
'__version__',
'_txv',
'exceptions',
'http',
'item',
'link',
'selector',
'signals',
'spiders',
'twisted_version',
'utils',
'version_info']
This worked for me in both Python 2 and 3. I have also confirmed that it works in both IPython and the standard interpreter. If it does not work for you even with the import, your environment may have gotten messed up in some way, and we can troubleshoot further.
scrapy is a module, right, or am I wrong?
In this case scrapy is a module, and import scrapy is the syntax for making that module available in whatever context you are invoking the import from. This section of the Python tutorial has information on modules and importing them.
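As an aside, once the import has run you can filter the dunder names out of the dir() output to see just the public attributes; a quick sketch using only the names already listed above:
In [3]: [name for name in dir(scrapy) if not name.startswith('_')]
Out[3]: ['Field', 'FormRequest', 'Item', 'Request', 'Selector', 'Spider', 'exceptions', 'http', 'item', 'link', 'selector', 'signals', 'spiders', 'twisted_version', 'utils', 'version_info']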

Related

split username & password from URL in 3.8+ (splituser is deprecated, no alternative)

I'm trying to filter the user-password out of a URL.
(I could have split it manually at the last '@' sign, but I'd rather use a parser.)
Python gives a deprecation warning, but urlparse() doesn't seem to handle the user/password.
Should I just trust the last '@' sign, or is there a new version of splituser?
Python 3.8.2 (default, Jul 16 2020, 14:00:26)
[GCC 9.3.0] on linux
>>> url="http://usr:pswd#www.site.com/path&var=val"
>>> import urllib.parse
>>> urllib.parse.splituser(url)
<stdin>:1: DeprecationWarning: urllib.parse.splituser() is deprecated as of 3.8, use urllib.parse.urlparse() instead
('http://usr:pswd', 'www.site.com/path&var=val')
>>> urllib.parse.urlparse(url)
ParseResult(scheme='http', netloc='usr:pswd#www.site.com', path='/path&var=val', params='', query='', fragment='')
#neigher with allow_fragments:
>>> urllib.parse.urlparse(url,allow_fragments=True)
ParseResult(scheme='http', netloc='us:passw#ktovet.com', path='/all', params='', query='var=val', fragment='')
(Edit: the repr() output is partial & misleading; see my answer.)
It's all there, clear and accessible.
What went wrong: the repr() here is misleading, showing only a few of the properties/values (why is a separate question).
The full result is available with explicit attribute access:
>>> url = 'http://usr:pswd@www.sharat.uk:8082/nativ/page?vari=valu'
>>> p = urllib.parse.urlparse(url)
>>> p.port
8082
>>> p.hostname
'www.sharat.uk'
>>> p.password
'pswd'
>>> p.username
'usr'
>>> p.path
'/nativ/page'
>>> p.query
'vari=valu'
>>> p.scheme
'http'
Or as a one-liner (I just needed the domain):
>>> urllib.parse.urlparse('http://usr:pswd@www.sharat.uk:8082/nativ/page?vari=valu').hostname
'www.sharat.uk'
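If the end goal is to drop the credentials from the URL entirely, the parsed pieces can be reassembled without the userinfo part. A minimal sketch using only the standard library (ParseResult is a namedtuple subclass, so _replace is available):
>>> from urllib.parse import urlparse, urlunparse
>>> p = urlparse('http://usr:pswd@www.sharat.uk:8082/nativ/page?vari=valu')
>>> netloc = p.hostname + (':%d' % p.port if p.port else '')  # hostname and port, without the userinfo
>>> urlunparse(p._replace(netloc=netloc))
'http://www.sharat.uk:8082/nativ/page?vari=valu'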
Looking at the source code for splituser, it looks like it simply uses str.rpartition:
def splituser(host):
    warnings.warn("urllib.parse.splituser() is deprecated as of 3.8, "
                  "use urllib.parse.urlparse() instead",
                  DeprecationWarning, stacklevel=2)
    return _splituser(host)

def _splituser(host):
    """splituser('user[:passwd]@host[:port]') --> 'user[:passwd]', 'host[:port]'."""
    user, delim, host = host.rpartition('@')
    return (user if delim else None), host
which, yes, relies on the last occurrence of '@'.
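A quick interpreter check of that rpartition call, with the example URL from above:
>>> 'http://usr:pswd@www.site.com/path&var=val'.rpartition('@')
('http://usr:pswd', '@', 'www.site.com/path&var=val')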
EDIT: urlparse still has all these fields, see Berry's answer

Scraping a dynamic table using Selenium in Python3

I am trying to scrape the symbols from this page, https://www.barchart.com/stocks/indices/sp/sp400?page=all
When I look at the source in the Firefox browser (using Ctrl-U), none of the symbols turns up. Thinking maybe Selenium might be able to obtain the dynamic table, I ran the following code.
sp400_url= "https://www.barchart.com/stocks/indices/sp/sp400?page=all"
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Firefox()
driver.get(sp400_url)
html = driver.page_source
soup = BeautifulSoup(html)
print(soup)
The print command doesn't show any of the symbols we see on the page. Is there a way to scrape the symbols from this page?
Edited to clarify: I am interested in just the symbols and not the prices. So the list should read: AAN, AAXN, ACC, ACHC, ...
You can easily feed this into pandas' .read_html() to get the table and turn the Symbol column into a list. Note: I used chromedriver instead of Firefox.
import pandas as pd
from selenium import webdriver

sp400_url = "https://www.barchart.com/stocks/indices/sp/sp400?page=all"
driver = webdriver.Chrome('C:/chromedriver_win32/chromedriver.exe')
driver.get(sp400_url)
html = driver.page_source
df = pd.read_html(html)[-1]  # read_html returns a list of DataFrames; the components table is the last one
driver.close()
symbolsList = list(df['Symbol'])
Output:
print(symbolsList)
['AAN', 'AAXN', 'ACC', 'ACHC', 'ACIW', 'ACM', 'ADNT', 'ADS', 'AEO', 'AFG', 'AGCO', 'ALE', 'AM', 'AMCX', 'AMED', 'AMG', 'AN', 'ARW', 'ARWR', 'ASB', 'ASGN', 'ASH', 'ATGE', 'ATI', 'ATR', 'AVNS', 'AVNT', 'AVT', 'AYI', 'BC', 'BCO', 'BDC', 'BHF', 'BJ', 'BKH', 'BLD', 'BLKB', 'BOH', 'BRO', 'BRX', 'BXS', 'BYD', 'CABO', 'CACI', 'CAR', 'CASY', 'CATY', 'CBRL', 'CBSH', 'CBT', 'CC', 'CCMP', 'CDAY', 'CDK', 'CFR', 'CFX', 'CGNX', 'CHDN', 'CHE', 'CHH', 'CHX', 'CIEN', 'CIT', 'CLGX', 'CLH', 'CLI', 'CMC', 'CMD', 'CMP', 'CNK', 'CNO', 'CNX', 'COHR', 'COLM', 'CONE', 'COR', 'CPT', 'CR', 'CREE', 'CRI', 'CRL', 'CRS', 'CRUS', 'CSL', 'CTLT', 'CUZ', 'CVLT', 'CW', 'CXW', 'CZR', 'DAN', 'DAR', 'DCI', 'DECK', 'DEI', 'DKS', 'DLPH', 'DLX', 'DNKN', 'DOC', 'DY', 'EBS', 'EGP', 'EHC', 'EME', 'ENPH', 'ENR', 'ENS', 'EPC', 'EPR', 'EQT', 'ESNT', 'ETRN', 'ETSY', 'EV', 'EVR', 'EWBC', 'EXEL', 'EXP', 'FAF', 'FCFS', 'FCN', 'FDS', 'FFIN', 'FHI', 'FHN', 'FICO', 'FIVE', 'FL', 'FLO', 'FLR', 'FNB', 'FR', 'FSLR', 'FULT', 'GATX', 'GBCI', 'GEF', 'GEO', 'GGG', 'GHC', 'GMED', 'GNRC', 'GNTX', 'GNW', 'GO', 'GRUB', 'GT', 'HAE', 'HAIN', 'HCSG', 'HE', 'HELE', 'HIW', 'HNI', 'HOG', 'HOMB', 'HPP', 'HQY', 'HR', 'HRC', 'HUBB', 'HWC', 'HXL', 'IART', 'IBKR', 'IBOC', 'ICUI', 'IDA', 'IDCC', 'IIVI', 'INGR', 'INT', 'ITT', 'JACK', 'JBGS', 'JBL', 'JBLU', 'JCOM', 'JEF', 'JHG', 'JLL', 'JW.A', 'JWN', 'KAR', 'KBH', 'KBR', 'KEX', 'KMPR', 'KMT', 'KNX', 'KRC', 'LAMR', 'LANC', 'LEA', 'LECO', 'LFUS', 'LGND', 'LHCG', 'LII', 'LITE', 'LIVN', 'LOGM', 'LOPE', 'LPX', 'LSI', 'LSTR', 'MAC', 'MAN', 'MANH', 'MASI', 'MAT', 'MCY', 'MD', 'MDU', 'MIDD', 'MKSI', 'MLHR', 'MMS', 'MOH', 'MPW', 'MPWR', 'MRCY', 'MSA', 'MSM', 'MTX', 'MTZ', 'MUR', 'MUSA', 'NATI', 'NAVI', 'NCR', 'NDSN', 'NEU', 'NFG', 'NGVT', 'NJR', 'NKTR', 'NNN', 'NSP', 'NTCT', 'NUS', 'NUVA', 'NVT', 'NWE', 'NYCB', 'NYT', 'OC', 'OFC', 'OGE', 'OGS', 'OHI', 'OI', 'OLED', 'OLLI', 'OLN', 'ORI', 'OSK', 'OZK', 'PACW', 'PB', 'PBF', 'PBH', 'PCH', 'PCTY', 'PDCO', 'PEB', 'PEN', 'PENN', 'PII', 'PK', 'PNFP', 'PNM', 'POOL', 'POST', 'PPC', 'PRAH', 'PRI', 'PRSP', 'PSB', 'PTC', 'PZZA', 'QDEL', 'QLYS', 'R', 'RAMP', 'RBC', 'RGA', 'RGEN', 'RGLD', 'RH', 'RIG', 'RLI', 'RNR', 'RPM', 'RS', 'RYN', 'SABR', 'SAFM', 'SAIC', 'SAM', 'SBH', 'SBNY', 'SBRA', 'SCI', 'SEDG', 'SEIC', 'SF', 'SFM', 'SGMS', 'SIGI', 'SIX', 'SKX', 'SLAB', 'SLGN', 'SLM', 'SMG', 'SMTC', 'SNV', 'SNX', 'SON', 'SR', 'SRC', 'SRCL', 'STL', 'STLD', 'STOR', 'STRA', 'SVC', 'SWX', 'SXT', 'SYNA', 'SYNH', 'TCBI', 'TCF', 'TCO', 'TDC', 'TDS', 'TECH', 'TER', 'TEX', 'TGNA', 'THC', 'THG', 'THO', 'THS', 'TKR', 'TMHC', 'TOL', 'TPH', 'TPX', 'TR', 'TREE', 'TREX', 'TRIP', 'TRMB', 'TRMK', 'TRN', 'TTC', 'TTEK', 'TXRH', 'UBSI', 'UE', 'UFS', 'UGI', 'UMBF', 'UMPQ', 'UNVR', 'URBN', 'UTHR', 'VAC', 'VC', 'VLY', 'VMI', 'VSAT', 'VSH', 'VVV', 'WAFD', 'WBS', 'WEN', 'WERN', 'WEX', 'WH', 'WOR', 'WPX', 'WRI', 'WSM', 'WSO', 'WTFC', 'WTRG', 'WW', 'WWD', 'WWE', 'WYND', 'X', 'XEC', 'XPO', 'Y', 'YELP', 'Symbol']
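Note the stray 'Symbol' string at the very end of that list: read_html picked up a repeated header row as data. A one-line filter drops it:
symbolsList = [s for s in symbolsList if s != 'Symbol']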
If the elements are not present in the page source, try an explicit wait:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get(sp400_url)
wait = WebDriverWait(driver, 10)
symbols = wait.until(EC.presence_of_all_elements_located((By.XPATH, '//td[contains(@class, "symbol")]//a[starts-with(@href, "/stocks/quotes/")]')))
for symbol in symbols:
    print(symbol.text)
I am not sure why you want to scrape the complete page. If you need just the symbols, you can simply get all such elements and put them in a list.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox(executable_path=r'..\drivers\geckodriver.exe')
driver.get("https://www.barchart.com/stocks/indices/sp/sp400?page=all")
# Waiting for the table to load
WebDriverWait(driver, 30).until(EC.visibility_of_all_elements_located((By.XPATH, "//h4[contains(text(),'S&P 400 Components')]")))
symbols = driver.find_elements_by_xpath("//div[@class='bc-table-scrollable-inner']//a[@data-ng-bind='cell']")
symbolList = []
for symbol in symbols:
    symbolList.append(symbol.text)
print(len(symbolList))  # length of list
print(symbolList)  # content of list

Get the Text of AutoCAD AcDbText object with python

What is the best way, in Python, to get the text of an AcDbText object?
I am working with Python, win32com, and AutoCAD. I would like to be able to do the following, via a Python program:
Place objects into a selection set
Determine which are AcDbText objects
From those, extract the text and then delete.
I can do the first two things just fine. However, assuming textObj is the correct type of object, the following achieves half of what remains: t will contain the desired text as a str:
t = textObj.copy().fieldcode()
Problem 1: As the code implies, this creates a copy of the object, right there in the drawing, and does not seem to provide a way to identify it later for deletion.
Problem 2: The original object resists deletion from the selection set. If selection is the selection set, then no variation of selection.clear(), selection.delete(), or selection.erase() does anything. (I have checked the length of selection set before and after the fieldcode() invocation-- the number of objects remains the same.)
I am puzzled that there does not seem to be a way to prize the text out of the object without copying it. What am I missing?
Per a question in the comments on an answer, the output of pprint(dir(textObj)) is:
['AddRef',
'Application',
'ArrayPolar',
'ArrayRectangular',
'Copy',
'Database',
'Delete',
'Document',
'EntityName',
'EntityTransparency',
'EntityType',
'Erase',
'GetBoundingBox',
'GetExtensionDictionary',
'GetIDsOfNames',
'GetTypeInfo',
'GetTypeInfoCount',
'GetXData',
'Handle',
'HasExtensionDictionary',
'Highlight',
'Hyperlinks',
'IntersectWith',
'Invoke',
'Layer',
'Linetype',
'LinetypeScale',
'Lineweight',
'Material',
'Mirror',
'Mirror3D',
'Move',
'ObjectID',
'ObjectName',
'OwnerID',
'PlotStyleName',
'QueryInterface',
'Release',
'Rotate',
'Rotate3D',
'ScaleEntity',
'SetXData',
'TransformBy',
'TrueColor',
'Update',
'Visible',
'_AddRef',
'_GetIDsOfNames',
'_GetTypeInfo',
'_IAcadEntity__com_ArrayPolar',
'_IAcadEntity__com_ArrayRectangular',
'_IAcadEntity__com_Copy',
'_IAcadEntity__com_GetBoundingBox',
'_IAcadEntity__com_Highlight',
'_IAcadEntity__com_IntersectWith',
'_IAcadEntity__com_Mirror',
'_IAcadEntity__com_Mirror3D',
'_IAcadEntity__com_Move',
'_IAcadEntity__com_Rotate',
'_IAcadEntity__com_Rotate3D',
'_IAcadEntity__com_ScaleEntity',
'_IAcadEntity__com_TransformBy',
'_IAcadEntity__com_Update',
'_IAcadEntity__com__get_EntityName',
'_IAcadEntity__com__get_EntityTransparency',
'_IAcadEntity__com__get_EntityType',
'_IAcadEntity__com__get_Hyperlinks',
'_IAcadEntity__com__get_Layer',
'_IAcadEntity__com__get_Linetype',
'_IAcadEntity__com__get_LinetypeScale',
'_IAcadEntity__com__get_Lineweight',
'_IAcadEntity__com__get_Material',
'_IAcadEntity__com__get_PlotStyleName',
'_IAcadEntity__com__get_TrueColor',
'_IAcadEntity__com__get_Visible',
'_IAcadEntity__com__get_color',
'_IAcadEntity__com__set_EntityTransparency',
'_IAcadEntity__com__set_Layer',
'_IAcadEntity__com__set_Linetype',
'_IAcadEntity__com__set_LinetypeScale',
'_IAcadEntity__com__set_Lineweight',
'_IAcadEntity__com__set_Material',
'_IAcadEntity__com__set_PlotStyleName',
'_IAcadEntity__com__set_TrueColor',
'_IAcadEntity__com__set_Visible',
'_IAcadEntity__com__set_color',
'_IAcadObject__com_Delete',
'_IAcadObject__com_Erase',
'_IAcadObject__com_GetExtensionDictionary',
'_IAcadObject__com_GetXData',
'_IAcadObject__com_SetXData',
'_IAcadObject__com__get_Application',
'_IAcadObject__com__get_Database',
'_IAcadObject__com__get_Document',
'_IAcadObject__com__get_Handle',
'_IAcadObject__com__get_HasExtensionDictionary',
'_IAcadObject__com__get_ObjectID',
'_IAcadObject__com__get_ObjectName',
'_IAcadObject__com__get_OwnerID',
'_IDispatch__com_GetIDsOfNames',
'_IDispatch__com_GetTypeInfo',
'_IDispatch__com_GetTypeInfoCount',
'_IDispatch__com_Invoke',
'_IUnknown__com_AddRef',
'_IUnknown__com_QueryInterface',
'_IUnknown__com_Release',
'_Invoke',
'_QueryInterface',
'_Release',
'__bool__',
'__class__',
'__cmp__',
'__com_interface__',
'__ctypes_from_outparam__',
'__del__',
'__delattr__',
'__dict__',
'__dir__',
'__doc__',
'__eq__',
'__format__',
'__ge__',
'__getattr__',
'__getattribute__',
'__gt__',
'__hash__',
'__init__',
'__init_subclass__',
'__le__',
'__lt__',
'__map_case__',
'__module__',
'__ne__',
'__new__',
'__reduce__',
'__reduce_ex__',
'__repr__',
'__setattr__',
'__setstate__',
'__sizeof__',
'__str__',
'__subclasshook__',
'__weakref__',
'_b_base_',
'_b_needsfree_',
'_case_insensitive_',
'_compointer_base__get_value',
'_idlflags_',
'_iid_',
'_invoke',
'_methods_',
'_needs_com_addref_',
'_objects',
'_type_',
'color',
'from_param',
'value']
Assuming textObj is either a single-line text object (AcDbText) or multiline text object (AcDbMText), then you should be able to obtain the text content using the TextString property, e.g.:
t = textObj.TextString
Note that the methods clear() & delete(), when invoked on an ActiveX SelectionSet object, do not delete the objects it contains, but rather remove the objects from the SelectionSet and delete the SelectionSet object, respectively. The erase() method, however, should successfully erase all objects contained in the SelectionSet.
That said, to delete an object you would typically just invoke the delete() method on the object itself, e.g.:
textObj.Delete()
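Putting it together, here is a minimal sketch of the extract-then-delete flow, assuming selection is an existing SelectionSet and that single-line text entities report ObjectName == 'AcDbText' through the ActiveX interface:
texts = []
to_delete = []
for obj in selection:
    if obj.ObjectName == 'AcDbText':   # single-line text entity
        texts.append(obj.TextString)   # read the text directly; no Copy() needed
        to_delete.append(obj)
for obj in to_delete:   # delete after iterating, to avoid mutating the collection mid-walk
    obj.Delete()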

A question about using scapy.sniff to get the 'Ethernet Frame' arrival time from pcap files

Aim: Get the arrival time from the pcap files
Language: python3.7
Tools: Scapy.sniff
Above all, I want to get the arrival-time data in the .pcap. When I use Wireshark, I can see the data in the Ethernet frame, but when I use
scapy.sniff(offline='.pcap'), I just get the Ether, TCP, IP and other layers, so how can I get that data?
Thanks a lot!
>>> from scapy.all import *
>>> a = sniff(offline='***.pcap')
>>> a[0]
[out]:
<Ether dst=*:*:*:*:*:* src=*:*:*:*:*:* type=** |<IP version=4 ihl=5 tos=0x20 len=52 id=14144 flags=DF frag=0 ttl=109 proto=tcp chksum=0x5e3b src=*.*.*.* dst=*.*.*.* |<TCP sport=gcsp dport=http seq=1619409885 ack=1905830025 dataofs=8 reserved=0 flags=A window=65535 chksum=0xfdb5 urgptr=0 options=[('NOP', None), ('NOP', None), ('SAck', (1905831477, 1905831485))] |>>>
The packet time from the pcap is available in the time member:
print(a[0].time)
It's kept as a floating point value (the standard Python "timestamp" format). To get it in a more easily understandable form, you may want to use the datetime module:
>>> from datetime import datetime
>>> dt = datetime.fromtimestamp(a[0].time)
>>> print(dt)
2018-11-12 03:03:00.259780
The scapy documentation isn't great. It can be very instructive to use the interactive help facility. For example, in the interpreter:
$ python
>>> from scapy.all import *
>>> a = sniff(offline='mypcap.pcap')
>>> help(a[0])
This will show you all the methods and attributes of the object represented by a[0]. In your case, that is an instance of class Ether(scapy.packet.Packet).
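For example, to print the arrival time of every packet in the capture (the float() call is there because some scapy versions store time as a decimal type rather than a plain float):
>>> from datetime import datetime
>>> for pkt in a:
...     print(datetime.fromtimestamp(float(pkt.time)))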

Developing module and using it in Spyder

I'm trying to develop a python module, which I then want to use in Spyder.
Here is how the files in my module are organized:
testing_the_module.py
myModule
-> __init__.py
-> sql_querying.py #contains a function called sql()
testing_the_module.py contains:
import myModule
print(myModule.sql_querying.sql(query = "show tables")) # how this function works is not relevant
__init__.py contains
import myModule.sql_querying
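(The equivalent relative-import form, which avoids hard-coding the package name, would be from . import sql_querying.)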
When I use the command line, it works:
> python3 .\testing_the_module.py
[{
'query': 'show tables',
'result': ['table1', 'table2']
}]
It also works if I use the Python console:
> python3
Python 3.6.1 |Anaconda 4.4.0 (64-bit)| (default, May 11 2017, 13:25:24) [MSC v.1900 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import myModule
>>> print(myModule.sql_querying.sql(query = "show tables"))
[{
'query': 'show tables',
'result': ['table1', 'table2']
}]
However, when using Spyder, I can't get it to work. Here is what I get when I run (with F9) each of those lines:
import myModule
# no error message
print(myModule.sql_querying.sql(query = "show tables"))
AttributeError: module 'myModule' has no attribute 'sql_querying'
Any idea why, and how to make it work in Spyder?
Edit to answer a comment:
In [665]: sys.path
Out[665]:
['',
'C:\\ProgramData\\Anaconda3\\python36.zip',
'C:\\ProgramData\\Anaconda3\\DLLs',
'C:\\ProgramData\\Anaconda3\\lib',
'C:\\ProgramData\\Anaconda3',
'C:\\ProgramData\\Anaconda3\\lib\\site-packages',
'C:\\ProgramData\\Anaconda3\\lib\\site-packages\\Sphinx-1.5.6-py3.6.egg',
'C:\\ProgramData\\Anaconda3\\lib\\site-packages\\win32',
'C:\\ProgramData\\Anaconda3\\lib\\site-packages\\win32\\lib',
'C:\\ProgramData\\Anaconda3\\lib\\site-packages\\Pythonwin',
'C:\\ProgramData\\Anaconda3\\lib\\site-packages\\setuptools-27.2.0-py3.6.egg',
'C:\\ProgramData\\Anaconda3\\lib\\site-packages\\IPython\\extensions',
'C:\\Users\\fmalaussena\\.ipython']
