Stopping Data Theft via Screen Scraping
One of the ways that MLS data finds its way into the “wild” to be misused is via screen scraping – “scraping” for short – where a miscreant uses software to visit a website, perform searches, visit each web page, and turn the unstructured (or less structured) data designed for consumers to view into structured data in a spreadsheet or database, so that it can be re-used or misused. “What’s the big deal?” some people ask, “It’s not like we provide the seller’s phone number or anything.” That’s true, but it doesn’t take much programming skill to find an online reverse telephone directory and, with a few lines of code, add that information to the listing address in the miscreant’s database.
Let’s not argue any more about whether anti-scraping is needed – that was settled by legal folks: according to the VOW rules agreed to between NAR and the DOJ, VOWs are supposed to both have anti-scraping capabilities and provide monitoring for scraping. Of course, theoretically, if there were to be parity between VOW and IDX, IDX should also provide anti-scraping. How does one square that with the problematic IDX “indexing” rule? It makes my head hurt, and that’s not the mess I want to focus on in this post. I want to write about how one creates a reasonable barrier for scrapers and how an MLS would evaluate whether reasonable steps have been taken.
The first thing a scraping script needs to do is to grab a lot of web pages with listings on them. So, the first thing one does in the way of anti-scraping is to make that more difficult. One must eliminate web pages that look like “www.MyBadIDXSite.com/listingdetail.php?listingID=1234”. What’s wrong with that page? Someone writing a scraper knows that when they see a number in a URL like that, they just need to increment the variable to grab lots of listings – 1234, 1235, 1236, etc. – and the bigger numbers are the fresher listings! Once they reach the highest number and freshest listing, each day they only need to increment that number a bit more to grab the new listings, and since not many new listings come on the market each day, this activity will fly under the radar. In short, sequential integer (number) variables like the one illustrated above are, in general, bad, since they make it very easy for the scraper to grab the web pages they need. Along the same lines, RSS feeds that provide that same list of the latest web pages that need to be scraped are unhelpful in combating scraping.
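To make the point about guessable URLs concrete, here is a minimal sketch of one alternative: deriving an opaque token from the internal listing number with a keyed hash, and putting the token in the URL instead of the number. The names (SECRET_KEY, listing_token, build_lookup) are hypothetical, and this is only one of several ways to do it; a random identifier stored alongside each listing would serve just as well.

```python
import base64
import hashlib
import hmac

# Hypothetical secret; in practice this would be long, random, and kept out of source control.
SECRET_KEY = b"replace-with-a-long-random-secret"


def listing_token(listing_id: int) -> str:
    """Derive a URL-safe token that reveals nothing about neighboring listing numbers."""
    digest = hmac.new(SECRET_KEY, str(listing_id).encode(), hashlib.sha256).digest()
    # 12 bytes of the digest encode to exactly 16 URL-safe characters.
    return base64.urlsafe_b64encode(digest[:12]).decode()


def build_lookup(listing_ids):
    """Map tokens back to internal IDs; a real site would keep this in its database."""
    return {listing_token(listing_id): listing_id for listing_id in listing_ids}


lookup = build_lookup([1234, 1235, 1236])
for token, listing_id in lookup.items():
    print(f"/listing/{token}  (internal ID {listing_id})")
```

With URLs like these, adding one to the value in the address bar leads nowhere, and the token says nothing about which listings are newest. The scraper can still collect links from search results, so this only closes the easiest door.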
How else can we make it harder for the scraper to grab a lot of web pages? Your software can look for patterns of use that indicate that a program, rather than a human, is requesting the pages, and take action if a potential problem is detected. One way to do this is to use “rate limiting” – if there are 40 listing requests in a minute (or more, or less, depending on site design), is it really a person? One can test this by presenting the visitor with a CAPTCHA or some other puzzle that only a person will get right. Even then, a watchful programmatic eye is needed, because if all the scraper has to do is solve one CAPTCHA and then their program runs on unwatched, that’s not a very big barrier. But what if the scraper has written their software to do a “slow crawl” and not set off the previous mechanism? Then your software has to be looking for things like the total number of listings viewed by the scraper’s internet address or address range.
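The two checks described above can be sketched roughly as follows: a per-minute sliding window that triggers a CAPTCHA, and a running count of unique listings per address to catch a slow crawl. The class name, the thresholds, and the "captcha" / "review" return values are all assumptions for illustration; the right numbers depend on the site, and a real deployment would reset the unique-listing count each day and also aggregate by address range.

```python
import time
from collections import defaultdict, deque


class ScrapeGuard:
    """Hypothetical in-memory tracker of listing views per internet address."""

    def __init__(self, per_minute=40, max_unique=500):
        self.per_minute = per_minute      # burst threshold (assumed value)
        self.max_unique = max_unique      # slow-crawl threshold (assumed value)
        self.recent = defaultdict(deque)  # ip -> timestamps within the last minute
        self.seen = defaultdict(set)      # ip -> unique listing IDs viewed so far

    def check(self, ip, listing_id, now=None):
        now = time.time() if now is None else now
        window = self.recent[ip]
        window.append(now)
        # Drop requests that are more than 60 seconds old.
        while window and window[0] <= now - 60:
            window.popleft()
        self.seen[ip].add(listing_id)

        if len(window) > self.per_minute:
            return "captcha"   # too many requests in one minute
        if len(self.seen[ip]) > self.max_unique:
            return "review"    # slow crawl: too many unique listings overall
        return "ok"


guard = ScrapeGuard(per_minute=3, max_unique=5)
burst = [guard.check("10.0.0.1", n, now=float(n)) for n in range(4)]
print(burst)
```

Keeping the counters running after a CAPTCHA is solved matters here: the check fires again on the next burst rather than waving the visitor through for good.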
But let’s say that the scraper has managed to get by your earlier precautions and grabbed a whole lot of web pages with listings on them. Their next step is to turn the less-structured data on those web pages into a database full of structured data. There are lots of tricks to make this more difficult. One can replace key data on the page (address, price, etc.) with Flash, Java, or an image of the text. Impossible to get around? Nope, but a pretty good barrier. There are also clever ways to create your web pages so they look the same to your site visitors but the HTML code behind the web pages looks very different to scraping software – making it more difficult for the software to figure out which number is bedrooms, which is bathrooms, etc.
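As one illustration of the “looks the same, reads differently” idea, the sketch below gives each field a fresh random class name on every response, moves the labels into the stylesheet, and shuffles the order of the elements in the HTML while using the CSS order property to keep the visual order fixed. The function name and markup are hypothetical, labels are assumed to be trusted plain text, and the approach has real costs (screen readers and legitimate search engines see the same scrambled markup), so treat it as a sketch of the concept rather than a recommendation.

```python
import html
import random
import secrets


def render_listing(fields):
    """Render label/value pairs with per-response class names and shuffled source order."""
    rules, divs = [], []
    for position, (label, value) in enumerate(fields.items()):
        cls = "c" + secrets.token_hex(4)  # new class name on every response
        # The label and the visual position live in CSS, not in the HTML body.
        rules.append(f'.{cls}{{order:{position}}}.{cls}::before{{content:"{label}: "}}')
        divs.append(f'<div class="{cls}">{html.escape(str(value))}</div>')
    random.shuffle(divs)  # source order no longer matches display order
    style = "<style>.listing{display:flex;flex-direction:column}" + "".join(rules) + "</style>"
    return style + '<div class="listing">' + "".join(divs) + "</div>"


page = render_listing({"Price": "450000", "Beds": "3", "Baths": "2"})
print(page)
```

A determined scraper can still parse the stylesheet and reassemble the fields, which is the same caveat that applies to images of text: a barrier, not a wall.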
Are these mechanisms foolproof? Nope – you don’t have to tell me that each is imperfect on its own – but they present a reasonable barrier, especially if implemented correctly and in combination.
Monitoring – Reports and Alerts
If you have a report that shows you which users, internet addresses, or internet address ranges are viewing the most unique listings in a time period and which have triggered the “CAPTCHA” test the most, and you can then drill down to see what searches have been performed and what listings have been viewed, including date/time information, then you have decent monitoring in place. And if you get alerts when really suspicious activity occurs, you’re ahead of the curve.
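A report like that can start out very simply. The sketch below assumes a view log of (address, listing ID, timestamp) records, ranks addresses by unique listings viewed, and flags any that cross an alert threshold. The log format, the function names, and the threshold are assumptions; a production version would read from the site’s own logs, group by address range and user as well, and include the CAPTCHA counts.

```python
from collections import defaultdict


def top_viewers(view_log, limit=10):
    """Rank addresses by the number of unique listings they viewed."""
    unique = defaultdict(set)
    for ip, listing_id, _timestamp in view_log:
        unique[ip].add(listing_id)
    ranked = sorted(unique.items(), key=lambda item: len(item[1]), reverse=True)
    return [(ip, len(listing_ids)) for ip, listing_ids in ranked[:limit]]


def alerts(report, threshold):
    """Return the addresses whose unique-listing count meets the alert threshold."""
    return [ip for ip, count in report if count >= threshold]


# Hypothetical sample log: one heavy viewer and one ordinary visitor.
log = [("203.0.113.5", n, f"2024-01-01T00:{n:02d}") for n in range(30)]
log += [("198.51.100.7", 1, "2024-01-01T09:00"), ("198.51.100.7", 1, "2024-01-01T09:05")]

report = top_viewers(log)
print(report)
print(alerts(report, threshold=25))
```

Because the timestamps stay in the log, drilling down from a flagged address to exactly what it viewed and when is a matter of filtering the same records.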
When I’m auditing VOWs or otherwise testing for anti-scraping capabilities, the only way to test the efficacy of the anti-scraping and monitoring steps is to actually try to scrape the site. This means writing scraping scripts tailored to the way the site is constructed. I wish there were an easier way, but I haven’t found one. There have been lots of sites that thought they had implemented anti-scraping well – including several of the steps described above – but it turned out they hadn’t implemented those steps WELL, and were still easy to scrape.
One may not have to follow all the steps above to be effective – but if one has taken the right combination of steps, well implemented, it can create a substantial barrier to screen scraping.