Use Node.js and browser developer tools as an intelligent web scraper for secure websites

Some of you have probably already had the idea of combining data (maybe even realtime data) from different websites on one page that holds everything relevant, without the crap you don’t need. There are some techniques to fetch and redistribute static webpages, but when it comes to realtime data on an https location you can forget about it! And in fact it is a very good thing that you cannot simply embed another website in an iframe and read data out of it: the browser’s same-origin policy blocks exactly that.

But unfortunately there are situations that require some magic to make it happen anyway. The point is that you have to inject the JavaScript code into the page yourself, so you know that the data on the page can be stolen if you do stupid things and serve it on the web. I needed to scrape some realtime data from a website because the data could be viewed by anyone, but the layout was terrible and could not be adjusted to the viewer’s needs. ‘Blabla blabla… now tell us how!’ Relax, I will. I just wanted to explain that I did not ‘hack’ some website and steal its data to make money, and that this technique is as safe as you make it yourself and cannot be exploited by third parties!

In an ideal world it would be possible to include the realtime updated website in an iframe on our own webpage and extract the data of interest with JavaScript, probably with jQuery. Now, if we can’t include the data in our page, can we inject our own code into the page itself? Yes we can, by using the web developer tools (I only tested Google Chrome), but it will not be permanent and requires a manual action on each page refresh. Is that a problem? Yes.
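To make the console trick a bit more concrete, here is a minimal sketch of what you could paste into the developer tools console. The function name `collectData` and the selector `.price` are made up for illustration; the page you are looking at will need its own selector.

```javascript
// Hypothetical helper: return the trimmed text of every element that matches
// a CSS selector. `root` is normally `document`; it is a parameter only so the
// function can also be tried outside a browser.
function collectData(root, selector) {
  return Array.from(root.querySelectorAll(selector), (el) => el.textContent.trim());
}

// In the console of the page you want to read, you would run something like:
//   collectData(document, '.price');
```

Everything you paste this way is gone after a refresh, which is exactly the manual action mentioned above.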

‘Wait a minute, you’re explaining a crappy method… goodbye!’ Don’t leave me here, it’s just getting interesting if you think about it… nobody said that you should view the modified data in the same screen as the original data! ‘So you’re telling us to get the data we need in one screen, and display it in another? How are you going to make that work, smartass? Separate screens can’t communicate!’ Indeed, they can’t communicate directly with each other, but you can always give your package to the mailman, and he will deliver it wherever you want.

In our case the mailman works at Node.js, in the web socket department, and he is very fast if you just give him a few letters and not a whole truckload of heavy furniture! Simply inject some code via the web developer tools that collects your data in the original page, opens a web socket and sends the data to a Node.js server. The server caches the data and can serve it in realtime to all subscribers that visit the page served by Node.js. When the data is drawn to a canvas and you want to scrape that, you’ll need more magic to create and serve images, like a movie, but that is a rare case, so I’ll leave that explanation to those who have implemented it themselves.
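A rough sketch of that mailman could look like this. It is not the code I ran, just one way to do it under a few assumptions: the third-party `ws` package is installed (`npm install ws`), the injected script connects to the path `/publish`, viewers connect to any other path, and the browser and the site's security policy allow a connection to `ws://localhost:8080`. All names (`Relay`, `startServer`, `startSender`) are made up for illustration.

```javascript
// The relay caches the latest message and forwards it to every subscriber.
class Relay {
  constructor() {
    this.latest = null;           // cache: most recent message from the scraped page
    this.subscribers = new Set(); // functions that deliver a message to one viewer
  }

  // Register a viewer. A newcomer immediately receives the cached data, if any.
  subscribe(send) {
    this.subscribers.add(send);
    if (this.latest !== null) send(this.latest);
    return () => this.subscribers.delete(send); // call this to unsubscribe
  }

  // Called for every message that arrives from the injected script.
  publish(message) {
    this.latest = message;
    for (const send of this.subscribers) send(message);
  }
}

// Node.js side: wire the relay to web sockets using the `ws` package.
function startServer(port) {
  const { WebSocketServer } = require('ws');
  const relay = new Relay();
  const wss = new WebSocketServer({ port });
  wss.on('connection', (socket, request) => {
    if (request.url === '/publish') {
      // The injected script delivers its letters here.
      socket.on('message', (data) => relay.publish(data.toString()));
    } else {
      // Everyone else is a viewer waiting for mail.
      const unsubscribe = relay.subscribe((message) => socket.send(message));
      socket.on('close', unsubscribe);
    }
  });
  return wss;
}

// Browser side: paste into the developer tools console of the original page.
// `collect` is whatever function gathers your data from that page.
function startSender(collect, url = 'ws://localhost:8080/publish', intervalMs = 1000) {
  const socket = new WebSocket(url);
  socket.addEventListener('open', () => {
    setInterval(() => socket.send(JSON.stringify(collect())), intervalMs);
  });
  return socket;
}
```

You would call `startServer(8080)` once in Node.js and `startSender(...)` in the console after each page refresh. Sending only the few values you need every second keeps the letters small, so the mailman stays fast.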
