There are tonnes of music lyric sites out there. A while back, I was looking at some lyrics for a band I am into, and it made me think, "How does this site obtain all these lyrics, and how can I get my hands on something like this?" I could not find much back then, so I decided to write a program that parsed a site for the band's information and lyrics and placed the data in a database that I created.
But I am still wondering how these sites get their data. My way is not very efficient, it is very site-specific, and if the site changes its script structure, I have to change my parsing program. There must be a simpler way.
Anyone's thoughts are greatly appreciated!
I'd guess at either JSON or XML files. To 'get your hands on it', there are various ways and means of downloading data from a web site. wget is one means; not that I condone it, but it's hardly a secret.
Most of these websites get their lyrics from users. Musixmatch, for example, allows users to add lyrics that do not yet exist in its database. When a user submits lyrics, they are probably saved automatically into Musixmatch's database. There are tons of lyric websites that allow users to upload lyrics.
Another way websites get their data is through data mining, just like you said, writing a parser/scraper to go through someone else's website.
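To make the parser/scraper idea concrete, here is a minimal sketch using only Python's standard library. It assumes the lyrics sit inside a <div class="lyrics"> element, which is a made-up layout for illustration; a real site will use different markup, and that is exactly why this kind of parser breaks whenever the site changes.

```python
from html.parser import HTMLParser


class LyricsParser(HTMLParser):
    """Collects text inside <div class="lyrics"> (a hypothetical page layout)."""

    def __init__(self):
        super().__init__()
        self._depth = 0   # how many <div> levels deep we are inside the lyrics block
        self.lines = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self._depth:
            self._depth += 1            # nested <div> inside the lyrics block
        elif ("class", "lyrics") in attrs:
            self._depth = 1             # entering the lyrics block

    def handle_endtag(self, tag):
        if tag == "div" and self._depth:
            self._depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self._depth and text:
            self.lines.append(text)


def extract_lyrics(html):
    """Return the non-empty text fragments found in the lyrics block."""
    parser = LyricsParser()
    parser.feed(html)
    parser.close()
    return parser.lines


# Made-up sample page; fetching a real page is left out on purpose.
SAMPLE_PAGE = """
<html><body>
<h1>Some Song</h1>
<div class="lyrics">First line<br>Second line</div>
<div class="footer">Copyright</div>
</body></html>
"""

if __name__ == "__main__":
    print(extract_lyrics(SAMPLE_PAGE))
```

The extracted lines could then be written to whatever database you already created; only the class name and tag choice would need updating when the target site changes.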
Apologies in advance if this has already been asked or is posted in the wrong section; please help me solve this problem, or move my topic to the appropriate section. I have an Arabic chat site (a "wonder chat" style site); the chat runs on a script.
While searching for the most effective ways to ban annoying visitors from the chat, I found a file known as a browser fingerprint. Here is the link to the fingerprint file, which creates a distinctive fingerprint of the browser:
https://cdnjs.cloudflare.com/ajax/li...rprint2.min.js
How the file works:
The basic idea of the file is to create a distinctive fingerprint of the browser so that members can be told apart even if a member changes their name and IP address. The file can also fetch a lot of information through the member's browser, such as the browser version, the state and city, and the Internet provider the member uses. The only problem we have now is how to use the file to obtain the member's browser fingerprint and fetch the basic data from the browser (state, city, and Internet provider), so that we can store this data in the database and use it to protect the site and chat from spam and annoying members.
Thank you for your presence.
Site Link:
https://www.3a-chat.com/chat
Unfortunately the library URL in your question does not work, but I would recommend using this existing solution and extending it a bit. For example you may add:
pixel ratio: window.devicePixelRatio || 1
languages: (navigator.languages || []).join(',')
math precision: `${((Math.exp(10) + 1 / Math.exp(10)) / 2)}${Math.tan(-1e300)}`
I'll preface this by saying this is something that is new to me and is purely a learning exercise, so please excuse any naivety.
I've been looking through some articles on scraping and it seems that NodeJS, ExpressJS, Request and Cheerio would be my preferred method as a Front-End guy who is comfortable with JS/jQuery.
All the articles I've read so far focus on scraping data from a specific website in the absence of an API, whereas what I am looking to achieve to start with is a tool which takes any given URL and returns a true/false for a list of which common libraries are being used and which social networks are linked.
For example, a user enters a URL and the results return a "This website uses jQuery, MooTools, BackboneJS, AngularJS, etc" and "This website is linked with Facebook, Twitter, etc". Somewhat similar to Tregia: http://www.tregia.com/process?q=http://smashingmagazine.com.
Is my chosen setup (above) appropriate or limited to only scraping specific pages due to CSS selectors?
You should be able to scrape all pages, then find their script tags and read which tools they're using (although keep in mind they may have renamed the files, e.g. angularjs3.1.0.js -> foobar.js, to keep people from knowing their stack). You should also be able to get the specific text within the rest of the tags that you feel is relevant.
You should try and pay attention to every page's robots.txt as well.
Edit: You probably won't be able to scrape "members"/"login only" areas of sites though.
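The question's stack is Node-based, but the approach is easy to show in a few lines of standard-library Python. This is a rough sketch: the pattern lists are illustrative guesses rather than a reliable detection method, and renamed files will slip through, as noted above. The robots.txt check parses text you have already downloaded, so no network access is needed to try it.

```python
import re
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

# Illustrative patterns only; a real detector needs a much larger list.
LIBRARY_PATTERNS = {
    "jQuery": re.compile(r"jquery", re.I),
    "MooTools": re.compile(r"mootools", re.I),
    "BackboneJS": re.compile(r"backbone", re.I),
    "AngularJS": re.compile(r"angular", re.I),
}
SOCIAL_PATTERNS = {
    "Facebook": re.compile(r"facebook\.com", re.I),
    "Twitter": re.compile(r"twitter\.com", re.I),
}


class AssetCollector(HTMLParser):
    """Collects <script src> and <a href> values from a page."""

    def __init__(self):
        super().__init__()
        self.script_srcs = []
        self.link_hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "script" and attrs.get("src"):
            self.script_srcs.append(attrs["src"])
        elif tag == "a" and attrs.get("href"):
            self.link_hrefs.append(attrs["href"])


def detect(html):
    """Return (libraries, social) dicts mapping each name to True/False."""
    collector = AssetCollector()
    collector.feed(html)
    collector.close()
    libraries = {
        name: any(pattern.search(src) for src in collector.script_srcs)
        for name, pattern in LIBRARY_PATTERNS.items()
    }
    social = {
        name: any(pattern.search(href) for href in collector.link_hrefs)
        for name, pattern in SOCIAL_PATTERNS.items()
    }
    return libraries, social


def allowed_by_robots(robots_txt, url, user_agent="*"):
    """Check a URL against robots.txt content that was already fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Made-up sample inputs for demonstration.
SAMPLE_HTML = """
<html><head><script src="/js/jquery-1.11.0.min.js"></script></head>
<body><a href="https://twitter.com/example">Follow us</a></body></html>
"""
SAMPLE_ROBOTS = "User-agent: *\nDisallow: /private/\n"

if __name__ == "__main__":
    print(detect(SAMPLE_HTML))
    print(allowed_by_robots(SAMPLE_ROBOTS, "http://example.com/private/page"))
```

Because this only looks at tag attributes, it works on any URL and is not tied to site-specific CSS selectors.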
I'm trying to add a feature to my website that involves the typical "invite your friends" flow with help from a contact importer (CloudSponge). It's pretty popular and gets the job done, but I need something faster.
The problem with CloudSponge is that it requests all contacts in one call, which could mean a long wait for someone with a lot of contacts.
I looked at their REST calls and there doesn't seem to be a way to load contacts in pieces. Do any of these contact importing services allow you to pull in a few contacts at a time (let's say 50), so that we can show our user the first 50 contacts and then load the rest while updating the view? That way they don't have to wait forever for all the contacts to be pulled.
I've looked at other APIs like Context.IO but can't seem to find a solution to this one.
I built the CloudSponge API.
Early on, we decided to support imports across a variety of providers while exposing a simple and consistent interface. Pagination and rolling or real-time access to contacts were things that were excluded in order to do that. To provide end-user feedback on the progress of the import, we added the /events endpoint.
So far import speed hasn't been a major issue for a couple reasons:
In general, end users with an address book of 10000+ contacts are rare (although this may not be the case for certain niches).
End users who do have this many contacts in their address book usually understand that it will take a while to import.
Having said that, the speed is something that we can definitely improve upon. Here's a few ideas:
We can allow for returning only a subset of all contacts by default. For example, we currently return all contacts for Gmail, which is usually a much larger number of contacts than are actually stored in 'my contacts'.
We can implement parallel paginated imports on the server side. This will make our server process work harder and faster to download the user's contacts from, say, Gmail. This adds complexity on our side but keeps the API untouched.
We can implement your suggestion: add a rolling or real-time access to contacts in our API, either in an extended endpoint or a new version of our interface.
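For illustration only, the rolling access described in the last idea might look like this from the client side. This is a hypothetical sketch: `fetch_page(offset, limit)` is an assumed callable standing in for whatever a paginated endpoint would provide, and nothing here is part of the current CloudSponge API.

```python
from typing import Callable, Iterator, List


def iter_contact_batches(
    fetch_page: Callable[[int, int], List[dict]],
    batch_size: int = 50,
) -> Iterator[List[dict]]:
    """Yield contacts in batches so the UI can render the first page early.

    fetch_page(offset, limit) is a hypothetical function returning up to
    `limit` contacts starting at `offset`, or an empty list when done.
    """
    offset = 0
    while True:
        batch = fetch_page(offset, batch_size)
        if not batch:
            return
        yield batch
        if len(batch) < batch_size:
            return          # a short page means there is nothing left
        offset += len(batch)


# Stand-in data source for the demonstration: 120 fake contacts in memory.
FAKE_CONTACTS = [{"email": f"user{i}@example.com"} for i in range(120)]


def fake_fetch_page(offset, limit):
    return FAKE_CONTACTS[offset:offset + limit]


if __name__ == "__main__":
    for batch in iter_contact_batches(fake_fetch_page):
        print(f"render {len(batch)} contacts")
```

The caller can show the first batch as soon as it arrives and append later batches to the view as they come in.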
I'm happy to work with you on exploring these to improve our service. Send us an email: support#cloudsponge.com
Graeme
Is it possible to create an app-bound playlist?
It's possible to create a playlist for a user, but how will I know which one that is when they move away from my app?
Ideally, I would only need to be able to create/edit 1 playlist.
Edit: Have found this http://developer.spotify.com/technologies/apps/guidelines/integration/#appsthatcreateplaylisturi:s
But if anyone has great ideas, I'm still open!
As you've found out yourself, you can't create a playlist in a user's library that's somehow linked to your application using the Spotify Apps API.
I thought it'd be a good idea to also quote the relevant part of the Integration Guidelines that you've linked to:
If you want to generate and save the user’s personal playlists in the
app, you should not keep playlist information only saved within the
app. Playlist information should instead be handled by utilizing user
playlists, so that the user can access playlists as usual. They
shouldn’t have to go to the app to access a certain playlist that they
have created.
Suggestion:
I think there's several ways to do what you want to do though.
One way could be to let a user create a new playlist using your application and save it to the user's library, and at the same time save the playlist URI to your own back end. As you've noted, playlist URIs are obfuscated (e.g. they look like spotify:user:#:playlist:783BHaT7Xb8K5VyYstxsj3 instead of spotify:user:thelinmichael:playlist:783BHaT7Xb8K5VyYstxsj3; the username is replaced by # for the currently logged-in user, and by #xxx.. for other users). You could still save the last part of the URI, which I believe is unique for every playlist. Using a hashmap to map that part of the playlist URI to the properties you want to keep track of would let you do quick lookups of a user's playlists to see if they are associated with your app. You could iterate through the user's library to gather all obfuscated URIs, and send them to your back end in a single HTTP request. The response from your server could be the index of each library playlist that matched a playlist on your back end, along with the properties you've mapped to it. Again, this is just a suggestion and possibly not the best way forward, but I hope it gives you some ideas. :-)
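A rough sketch of the back-end lookup described above, written in Python for brevity. The stored properties and the second URI in the sample library are made up for illustration, and the assumption that the last URI segment is unique per playlist is the same unverified belief stated in the paragraph above.

```python
def playlist_id(uri):
    """Return the last colon-separated segment of a playlist URI."""
    return uri.rsplit(":", 1)[-1]


# Hashmap kept on the back end: playlist ID -> properties the app tracks.
# The properties here are hypothetical examples.
APP_PLAYLISTS = {
    "783BHaT7Xb8K5VyYstxsj3": {"created_by_app": True, "label": "my app playlist"},
}


def match_library(library_uris):
    """Map each library index to its stored properties, for known playlists only."""
    matches = {}
    for index, uri in enumerate(library_uris):
        properties = APP_PLAYLISTS.get(playlist_id(uri))
        if properties is not None:
            matches[index] = properties
    return matches


if __name__ == "__main__":
    # URIs as the client would send them in a single request (obfuscated user part).
    library = [
        "spotify:user:#:playlist:0000000000000000000000",  # made-up ID, not in the map
        "spotify:user:#:playlist:783BHaT7Xb8K5VyYstxsj3",
    ]
    print(match_library(library))
```

Because the lookup ignores the user part of the URI, it works the same for the obfuscated and the plain form.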
I am trying to build a service where anybody can send an image file from an email address/client and have it processed. Think of the service as a bit like Flickr, showing images that arrive via email in a dashboard.
From a usability standpoint this mechanic offers a great deal of advantage, but I want to understand the security consequences of such an approach. Some concerns are:
I need to validate all these files as images
People can probably send a file with an exploit/code that can likely
be a problem. But in my case I am mostly going to do a file open and
save and let the browser show the image
Am I taking the right approach here? Are there serious consequences that I should be aware of?
Things you should do and take into consideration.
Make sure your mail server is configured for virus scanning, keep it up to date. That'll be the first line of defense.
When the email comes in, attempt to process the image in a known rock solid library.
Be aware that many emails contain multiple images, some of which may have nothing at all to do with the one they are sending. For example, our company emails all include our logo at the bottom. I'm not exactly sure what the solution is here, but you'll want to take it into consideration.
Different email clients handle image attachments, well, differently. Sometimes it's as a normal attachment, sometimes it's embedded in the body. Even within the same client an image might be handled differently depending on whether they sent the email as plain text with attachments or as HTML mail.
People will test your system. They'll send .js files, they'll send images whose headers are jacked in order to overflow your image processing library...
Consider enforcing certain email restrictions such as SPF checks.
Be prepared to receive images that are absolutely huge. Today's cameras take very large photos and a lot of people don't know what crop or resize means. You might consider setting a cap of 15MB or larger per email coming into your server. Then, in combination with the image-processing step above, automatically resize images down to something a bit more acceptable.
Determine the mechanism you actually want to use to notify the user of any issues. Bear in mind that this mechanism is subject to abuse. For example, consider a spam message sent to your machine with reply-to headers going to a victim.
If you are using .NET, see this for a possible way to confirm a file is an image: How can I determine if a file is an image file in .NET?
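As a rough illustration of a few of the points above (multiple parts per email, a size cap, and not trusting the declared type), here is a standard-library Python sketch. The 15MB cap mirrors the figure mentioned above; the signature table covers only PNG, JPEG, and GIF and is an assumption about which formats you accept. A signature check is only a cheap first filter: a file that passes should still be decoded by a solid image library before it is stored or displayed.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage

MAX_EMAIL_BYTES = 15 * 1024 * 1024  # cap per incoming email

# Leading "magic bytes" of the formats this sketch accepts.
IMAGE_SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "png",
    b"\xff\xd8\xff": "jpeg",
    b"GIF87a": "gif",
    b"GIF89a": "gif",
}


def sniff_image_type(data):
    """Return the image type suggested by the leading bytes, or None."""
    for signature, kind in IMAGE_SIGNATURES.items():
        if data.startswith(signature):
            return kind
    return None


def candidate_images(raw_email):
    """Return (filename, kind, payload) for parts whose bytes look like images."""
    if len(raw_email) > MAX_EMAIL_BYTES:
        return []  # too large; reject the whole message
    message = message_from_bytes(raw_email, policy=policy.default)
    found = []
    for part in message.walk():
        if part.is_multipart():
            continue
        payload = part.get_payload(decode=True)
        if not payload:
            continue
        kind = sniff_image_type(payload)  # ignore the declared Content-Type
        if kind:
            found.append((part.get_filename(), kind, payload))
    return found


def build_demo_email():
    """Build a made-up email with one PNG-like part and one mislabeled script."""
    message = EmailMessage()
    message["Subject"] = "photo"
    message.set_content("See attached.")
    fake_png = b"\x89PNG\r\n\x1a\n" + b"\x00" * 16  # signature only, not a real image
    message.add_attachment(fake_png, maintype="image", subtype="png", filename="photo.png")
    message.add_attachment(b"alert(1)", maintype="image", subtype="png", filename="fake.png")
    return message.as_bytes()


if __name__ == "__main__":
    for filename, kind, payload in candidate_images(build_demo_email()):
        print(filename, kind, len(payload))
```

In the demo, the second attachment claims to be a PNG but contains script text, so it is dropped even though its declared type says image.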
I'm not saying this is 100% secure (can you ever be 100% secure?) but here is something that you can try:
Let's say that you have an alias on your Postfix (or whatever mail system) that redirects incoming emails to a PHP/bash/Python script for further processing.
The first thing I would do is use an image manipulation library (say ImageMagick) to convert all incoming files to .png or some other format, and only proceed further with your logic if the conversion is successful.
This way, if someone sends you any malicious attachments (PHP exploits, JARs, SWFs, anything) the conversion will fail, and hence the file will be disregarded by your system.
Edit: ImageMagick has the "identify" command which does exactly what you want.
Emails could be easily spoofed as well, which means I can send an email from an email address which doesn't belong to me.
This might help also: Secure way to upload image in PHP ...