Brighton, TN
reply to JPedroT

Re: Block web crawlers?

Hmmm, I didn't quite get the humor part of it...

I realize what I am asking for is a stretch to say the least, but you just never know unless you ask!

Truth is, the term "web crawler" is somewhat new to me. Again, I consider myself fairly knowledgeable when it comes to PCs in general, networking, and such, but I didn't even know of the existence of these web sifting "bots" until rather recently.

My ship is watertight, so it doesn't really matter if my content is published on some rogue web crawler search engine; I don't think anyone is going to hack into my home network. I was asking mostly in case my ISP ever asks "why do you have an opening on ports 21 and 80?". At least I could tell them "it's for my own file access, but there are NO public links to it anywhere on the net". As it stands now, that last statement is NOT true because of these bots.


PS I'll try the "robots.txt" file, but I'm not sure what I am supposed to do with it...just drop it in the publicly accessible folder?



The funny bit is that I have given the same answer to 3 questions in the last two days. Talk about a hammer seeing only nails, but it would have been solved with those very expensive products.

The robots.txt file is stored in your webserver's root directory, and if the bots behave correctly they will read it to check whether they're allowed to index your website.

Look at the wiki link or just google "robots.txt example".
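
For what it's worth, here is a rough Python sketch of what a well-behaved bot does with that file. It uses the standard library's urllib.robotparser; the robots.txt content shown asks every crawler to stay out of the whole site, and the example.com URL and the is_allowed helper are just placeholders I made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# A robots.txt that asks every crawler to stay out of the entire site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a crawler that honors robots.txt may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # example.com is a placeholder host, not anyone's real server.
    print(is_allowed(ROBOTS_TXT, "Googlebot", "http://example.com/files/"))
```

Keep in mind this only describes bots that choose to ask; nothing in the file enforces anything.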

Sunnyvale, CA
It depends on which webcrawlers you are concerned about.

I tend to put them primarily into two categories (but I'm sure that there are other ways to slice and dice it):

1.) spiders to populate search engines (google and co)
2.) content crawlers harvesting specific information (a common target is email addresses to be sold to spammers)

The first category tends to be well behaved in general (only loading a few pages at a time, honoring robots.txt) and tends to clearly identify itself in the user agent string.

The second category is out there to make money without any concern about you and your site. They generally ignore robots.txt and often disguise their identity by pretending to be regular users (by providing the user agent string of common browsers). They are less concerned about the impact they have on your bandwidth and are likely to crawl the entire site in one go (they may still limit their download rate to minimize detection).

The first type of webcrawlers tend to be beneficial to the working of the Internet, and their behavior can be controlled with robots.txt and by user agent filtering in the webserver.
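
As a rough illustration of the user agent filtering idea (a sketch only, not how any particular webserver implements it), the check boils down to a substring match on the User-Agent header. The token list and the function name below are my own assumptions.

```python
# Hypothetical list of crawler tokens to turn away; adjust to taste.
BLOCKED_AGENT_TOKENS = ("googlebot", "bingbot", "baiduspider")


def is_blocked_agent(user_agent, blocked=BLOCKED_AGENT_TOKENS) -> bool:
    """Return True if the User-Agent header contains a blocked crawler token."""
    ua = (user_agent or "").lower()  # a missing header is treated as empty
    return any(token in ua for token in blocked)


if __name__ == "__main__":
    print(is_blocked_agent("Mozilla/5.0 (compatible; Googlebot/2.1)"))
    print(is_blocked_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64)"))
```

This only catches crawlers that identify themselves honestly; anything sending a regular browser string sails straight through.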

I'm not aware of a proper fix for the second type of webcrawlers, but there are a few things that can be done to reduce their impact on a site. If your site is truly for personal use only, the best solution is to use basic access controls for the entire site, which will keep all the crawlers (good and bad) out.
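
To show what basic access control amounts to, here is a minimal Python sketch of validating an HTTP Basic "Authorization" header. Most servers and NAS boxes do this for you once a login is required; the function name and the alice/secret credentials are made up for the example.

```python
import base64
import hmac


def check_basic_auth(header, username: str, password: str) -> bool:
    """Validate an 'Authorization: Basic ...' header value against one account."""
    if not header or not header.startswith("Basic "):
        return False
    try:
        decoded = base64.b64decode(header[len("Basic "):], validate=True).decode("utf-8")
    except ValueError:  # covers malformed base64 and undecodable bytes
        return False
    user, sep, pw = decoded.partition(":")
    if not sep:
        return False
    # Constant-time comparisons avoid leaking how much of a guess matched.
    user_ok = hmac.compare_digest(user.encode(), username.encode())
    pw_ok = hmac.compare_digest(pw.encode(), password.encode())
    return user_ok and pw_ok


if __name__ == "__main__":
    # Hypothetical credentials, for illustration only.
    header = "Basic " + base64.b64encode(b"alice:secret").decode()
    print(check_basic_auth(header, "alice", "secret"))
```

Basic auth only base64-encodes the credentials, it does not encrypt them, so it is best paired with HTTPS.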


Brighton, TN
Thanks everyone for the info. I'm sorry JP for making you repeat yourself. I did a search before posting, but I guess I used the wrong search terms, as it yielded ZERO results.

This really is not a big deal; it was just something I thought I'd ask. The USG line is so far above everything else I've touched in terms of capability, I almost expect to wake up one day and find it has breakfast ready.

My web "site" isn't really a site. All it is is a ZyXel NSA310 with web and ftp enabled on it. So I don't really have a site, just a network storage tank which also happens to be available via http and ftp. I'm pretty good with computers and networking, but honestly I couldn't create a webpage or website if my life depended on it; it's just something I've never done, nor cared to do. That is why I didn't even know about these "web crawlers".

Looks like I got two options: 1) Just leave the one folder available from anonymous http and live with it, or 2) close it off and require user login.

Since my "server" is a ZyXel NAS device, robots.txt is out as an option, since I can't gain access to the web root without "hacking" it, which I do not want to do. Only know enough about Linux to be dangerous, and knowing my luck, I'd brick the thing.


I hate Vogons
Burlington, ON
Until you mentioned you're limited to the NSA310, I was prepared with some options for you, but if you want to stick to the NAS capabilities, here are a few options to consider:
1) Close direct access to the NAS and use SSL or L2TP VPN only (this would remove public access)
2) Get a separate FTP server for public access on some cheap hardware, e.g. a Raspberry Pi ... total cost $50. Install vsftpd; for uploads, create one folder with chown functionality after upload, so nobody else can get anything anybody uploads.
For downloads you can make the folders not readable and use a direct link to a deeper structure, if you know what that is (you can share that link with trusted parties only).
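
A small Python sketch of the "not readable" folder trick, assuming a Linux box such as the Raspberry Pi above: mode 711 keeps the execute (traverse) bit for everyone else but drops read, so a known deep path can still be fetched while the folder itself cannot be listed. The helper name and the example path are made up.

```python
import os
import stat


def make_unlistable(path: str) -> int:
    """Set a directory to mode 711 and return the resulting permission bits.

    The owner keeps full access; group and others can traverse the directory
    to reach a known path, but cannot list its contents.
    """
    os.chmod(path, 0o711)
    return stat.S_IMODE(os.stat(path).st_mode)


if __name__ == "__main__":
    import tempfile

    # A throwaway directory stands in for a real FTP folder such as /srv/ftp/pub.
    with tempfile.TemporaryDirectory() as folder:
        print(oct(make_unlistable(folder)))
```

Root ignores these permission bits, so test the result as the unprivileged FTP user rather than as root.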

Alternatively, hack your way into the NAS (zyxel.nas-central.org/wiki/SSH_server) and install a separate FTP server on it.