Blocking spam bots
4 Sep 2011, 8:08AM
So what we're talking about here are automated bots that scan websites for forms and then post crap on them, e.g. advertising, links to other sites (to up their Google ranking) or just plain gibberish. It's quite a common problem and will hit most medium/large sites at some point.
I had previously thought that the solution to this was simple: add a CAPTCHA. CAPTCHAs are those little text images where you have to type in what you see, and often the text is highly distorted so that it can't be read by OCR (optical character recognition).
So some spam appeared on our website and I said "add a CAPTCHA". My workmate said he didn't want to inconvenience people filling in the form, and that we could add some heuristics to detect and filter spam input instead. At first I thought any rules you put in could be broken and you'd just end up continuously updating them, but it resulted in a good conversation, the summary of which I'll put here.
So, as a second option after the CAPTCHA, you can add some common sense rules (heuristics) to the parsing of the form. E.g. if the phone number field is in an American number format, it's a bot; if the first or last name fields contain URLs, it's a bot; if the description field has one word and a link, it's a bot. Not bad. It will filter out many bot attempts, but not all.
So we're building up some rules here that could be used to give a weighting to whether the form was filled out by a bot or not. The form could then be rejected if its weighting passes a certain threshold, and logged or emailed to a rejected box for manual inspection (if required).
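To make the weighting idea concrete, here's a minimal sketch in Python. The field names, the individual weights, the threshold and the phone number pattern are all assumptions of mine for illustration, not something we actually settled on, and the American phone rule only makes sense on a site that doesn't expect American visitors.

```python
import re

# Rough URL detector: anything starting with http://, https:// or www.
URL_RE = re.compile(r"(https?://|www\.)\S+", re.IGNORECASE)
# Assumed pattern for American-style numbers, e.g. (555) 123-4567 or 555-123-4567.
US_PHONE_RE = re.compile(r"^\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}$")

# Assumed cut-off: a form scoring this much or more is treated as spam.
REJECT_THRESHOLD = 5


def spam_score(form: dict) -> int:
    """Add up a weight for every heuristic rule the form trips."""
    score = 0
    # Rule 1: phone number in American format (weight is a guess).
    if US_PHONE_RE.match(form.get("phone", "").strip()):
        score += 2
    # Rule 2: a URL in a name field is a very strong signal.
    for field in ("first_name", "last_name"):
        if URL_RE.search(form.get(field, "")):
            score += 5
    # Rule 3: description is at most one word plus a link.
    description = form.get("description", "")
    words = [w for w in description.split() if not URL_RE.fullmatch(w)]
    if URL_RE.search(description) and len(words) <= 1:
        score += 4
    return score


def is_probably_bot(form: dict) -> bool:
    """Reject (and log for manual inspection) once the score hits the threshold."""
    return spam_score(form) >= REJECT_THRESHOLD
```

The nice thing about a score rather than a hard yes/no per rule is that a single weak signal (like the phone format) doesn't reject a real person on its own.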
So my workmate was telling me how bots fill out forms, e.g. for a selection of tickboxes (or radio buttons), they choose one at random. And I thought, why not have a list of 10 (or 100) tickboxes, choose one at random, hide all the rest with CSS (e.g. by positioning them outside the boundaries of a clipping div with overflow: hidden) and require that one to be ticked by the user? The user will only see the one visible tickbox and not the rest, while the bot (which won't render the form and will ignore CSS) will see the whole lot. You can then have a hidden field with the correct item stored in it (encode it somehow if you're thorough). When the user submits the form, compare the ticked box with the value in your hidden field; if it matches and nothing else is ticked, then you've got a human.
Two things make this workable. First, the checkbox can be something like "I agree to the terms and conditions" (YES/NO); most sign up forms have something like this anyway. Second, if the bot is randomly ticking boxes, the chances of it ticking only the correct one are quite low: about 1 in 2^10 for 10 boxes if it ticks each box or not at random, or 1 in 10 if it picks exactly one box.
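Here's a rough sketch of the server side of the tickbox idea in Python. It's only an illustration under my own assumptions: the names build_challenge and is_human, the "agree_" prefix and the secret key are all made up, and I've used an HMAC as the "encode it somehow" step so the hidden field doesn't give away which box is the real one.

```python
import hashlib
import hmac
import secrets

# Assumed server-side secret; in practice load it from configuration.
SECRET_KEY = b"change-me"


def _sign(value: str) -> str:
    """Keyed hash of a checkbox name, safe to expose in a hidden field."""
    return hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()


def build_challenge(n_boxes: int = 10) -> dict:
    """Generate n checkbox names, pick one to show, and sign the choice."""
    names = [f"agree_{secrets.token_hex(4)}" for _ in range(n_boxes)]
    visible = secrets.choice(names)  # the only box CSS leaves on screen
    return {"names": names, "visible": visible, "token": _sign(visible)}


def is_human(ticked: set, token: str) -> bool:
    """Pass only if exactly one box was ticked and it is the signed one."""
    if len(ticked) != 1:
        return False
    (name,) = ticked
    return hmac.compare_digest(_sign(name), token)
```

When rendering the form you'd output every name in "names" as a checkbox, hide all but "visible" with CSS, and put "token" in a hidden field. On submit, collect the names of the ticked boxes and pass them to is_human along with the token. Note this doesn't stop a bot replaying a form it captured earlier; that needs something time-based as well.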
OK, so after that conversation I had a look at what other people are doing. Here are a couple of very good links which contain better ideas than mine above...
But in short: randomise field names (it's hard to pass validation on an email field if it's called "jwikgks", let alone match the fields if you've cached your post data), add honey-pots (hidden fields that shouldn't be filled in, similar to my checkboxes above but needing fewer input fields), and include hashed data (e.g. timestamp and IP) so you know when and for whom the form was constructed.
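A minimal sketch of the honey-pot plus hashed timestamp/IP idea in Python might look like the following. This is my own illustration rather than the code from those articles: the field names "website" and "token", the secret key, and the minimum and maximum fill-in times are all assumptions.

```python
import hashlib
import hmac
import time

SECRET_KEY = b"change-me"  # assumed server-side secret
MIN_SECONDS = 3            # assumed: faster than this is not a human typing
MAX_SECONDS = 3600         # assumed: older than this is stale or cached post data


def _sig(ts: str, ip: str) -> str:
    """Keyed hash binding the token to a timestamp and a client IP."""
    return hmac.new(SECRET_KEY, f"{ts}|{ip}".encode(), hashlib.sha256).hexdigest()


def make_token(ip: str, now=None) -> str:
    """Value for a hidden field, generated when the form is rendered."""
    ts = str(int(time.time() if now is None else now))
    return f"{ts}:{_sig(ts, ip)}"


def check_submission(form: dict, ip: str, now=None) -> bool:
    """Return True only if the post looks like it came from a rendered form."""
    # Honey-pot: a field hidden with CSS that a human never fills in.
    if form.get("website", ""):
        return False
    ts, _, sig = form.get("token", "").partition(":")
    # Reject tokens that were tampered with or issued to another IP.
    if not hmac.compare_digest(sig.encode(), _sig(ts, ip).encode()):
        return False
    # The signature is valid, so ts is a timestamp we generated ourselves.
    age = (time.time() if now is None else now) - int(ts)
    return MIN_SECONDS <= age <= MAX_SECONDS
```

The "now" parameter is only there to make the functions easy to test; in normal use you'd leave it out and let them read the clock.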
The second article above is very comprehensive. It leads me to think this would be good to build into a .NET component somehow, so that the form building was automatic. More to think about.
Finally, it's worth pointing out that we're talking about automated attacks here. There's nothing you can do to stop a human filling in the form manually and posting crap links in it, and they can also crack the systems above if they can be bothered. What you're really doing, though, is making your site hard work compared to all the easy-to-spam sites out there, so they'll leave yours alone.