My company uses a web hosting company to host the company website. As the webmaster, I have access to a staging server and a production server, and I use a web page to replicate the site from staging to production.
The replication page requires me to authenticate the first time I access it and then recognizes me every subsequent time I load it; when I close the browser I have to authenticate again. When the page loads it shows the current state of replication (complete, running ...). I am trying to screen-scrape that replication status.
The reason I'm trying to scrape the status is that I have to run post-replication actions on the production site to hide features I'm adding that have not yet been approved for production.
Let me explain. When I get a request for a site change, I make the change and upload it to the staging server. Once the change is approved, it moves into production. The problem occurs when requests need to remain active in staging for review and can't go live in production. Normally this holds up replication, because replication simply duplicates whatever is in staging to production. To get around this, I added a features table to my DB so I can turn new features on and off. This works well, except that upon completion of replication I need to turn those features off in the DB.
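A minimal sketch of such a feature-toggle table and the post-replication "turn off" step (the table name, column names, and feature names here are invented for illustration; SQLite stands in for whatever database the site actually uses):

```python
import sqlite3

# In-memory DB for illustration; the real site would use its own database.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE features (
        name    TEXT PRIMARY KEY,
        enabled INTEGER NOT NULL DEFAULT 0
    )
""")
# While the change is under review, the feature is enabled on staging.
conn.execute("INSERT INTO features (name, enabled) VALUES ('new_menu', 1)")
conn.commit()

def disable_unapproved_features(conn, names):
    """Post-replication step: switch off features not yet approved for production."""
    conn.executemany(
        "UPDATE features SET enabled = 0 WHERE name = ?",
        [(n,) for n in names],
    )
    conn.commit()

disable_unapproved_features(conn, ["new_menu"])
row = conn.execute("SELECT enabled FROM features WHERE name = 'new_menu'").fetchone()
print(row[0])  # 0 -> the feature is hidden again in production
```

The site's page-rendering code would then check the `enabled` flag before showing each pending feature.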
At present, replication works like this:
1. Load replication page
2. Supply replication credentials
3. Press the button to begin replication
4. Refresh the page until the status reads: complete
5. Run the post-replication code to turn off not-yet-approved features.
What I would like:
1. Load my own page
2. Call the replication page in an iframe, which asks me to authenticate
3. Automatically refresh the page in the iframe at a set interval, scraping the status
4. If the status is “Complete”, run the post-replication actions; otherwise return to the previous step and refresh the page.
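The refresh-and-check loop in steps 3–4 can be sketched like this (the interval, attempt limit, and the stand-in status sequence are all assumptions; `fetch_status` would be the real scrape):

```python
import time

def poll_until_complete(fetch_status, interval_seconds=10, max_attempts=30):
    """Re-check the replication status until it reads 'Complete'.

    Returns True once the status is Complete, False if max_attempts is reached.
    """
    for _ in range(max_attempts):
        status = fetch_status()
        if status.strip().lower() == "complete":
            return True
        time.sleep(interval_seconds)
    return False

# Stand-in for the real scrape; these statuses are invented for the demo.
statuses = iter(["Running", "Running", "Complete"])

if poll_until_complete(lambda: next(statuses), interval_seconds=0):
    print("running post-replication actions")
```

In the browser-driven version described above, the same logic would live in the page's script, with the refresh interval driving each `fetch_status` call.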
So far I’ve got everything set except for the scraping. I’ve been using Fiddler and know the following:
1. The replication page uses NTLM authentication method
2. The replication page stores a cookie upon login containing siteserverid=
I’ve researched the issue and found a handy scraping script (http://www.codeproject.com/KB/asp/gethtml.aspx) that retrieves HTML source code but does not handle authenticated sites. I’ve yet to figure out how to add my authentication information to the request so I can scrape. Does anyone understand my situation well enough to assist me?
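In .NET the usual approach is to attach credentials to the request before sending it (e.g. setting `HttpWebRequest.Credentials` to a `NetworkCredential`, or to `CredentialCache.DefaultCredentials` to reuse the logged-in Windows identity), which handles the NTLM handshake for you. A Python sketch of the same idea, using the third-party `requests` and `requests-ntlm` packages for the NTLM part — the URL, credentials, and the `Status:` markup pattern are all assumptions to adjust to the page's actual HTML:

```python
import re

def extract_status(html):
    """Pull the replication status out of the page; the markup pattern is a guess."""
    match = re.search(r"Status:\s*([A-Za-z]+)", html)
    return match.group(1) if match else None

def fetch_status(url, user, password):
    """Fetch the page with NTLM credentials attached to the request.

    Requires: pip install requests requests-ntlm
    """
    import requests
    from requests_ntlm import HttpNtlmAuth

    # user is typically 'DOMAIN\\username' for NTLM.
    resp = requests.get(url, auth=HttpNtlmAuth(user, password))
    resp.raise_for_status()
    return extract_status(resp.text)

# Parsing works independently of authentication:
print(extract_status("<p>Status: Complete</p>"))  # Complete
```

Because NTLM authenticates the connection, a session (or the cookie noted above, such as the one containing siteserverid=) lets subsequent polls skip re-authenticating, which matches the "knows me every subsequent time" behavior seen in the browser.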
Source: http://www.nullskull.com/q/10211086/screen-scraping-an-authenticated-site.aspx