
FirebirdTN

join:2012-12-13
Brighton, TN
kudos:1
Reviews:
·Comcast

Block web crawlers?

This probably isn't even possible, but thought I'd ask...

Is it possible to somehow block web crawlers with the ZyXel?

You know what they say...ignorance is bliss...Until I got my USG, I didn't realize just how much "traffic" was knocking on the door of my little home connection.

I run a web and FTP server at home for the sole purpose of accessing my files when I am off-site. At one time I was hosting some game patches, software updates and such, but I have since closed all that off. My current ISP's TOS prohibits running any kind of server (a grey area, I guess, since I'm still doing it, but only for my own access now). But I still leave one folder available via anonymous HTTP in case I want to host my own images on various forums and such.

When searching for my CNAME one day I was shocked to see all my past files being "hosted" on various sites. I'd like the contents of this folder not to be visible in a simple web search of my CNAME!

Is there a way to even block these web crawlers?

-Alan


JPedroT

join:2005-02-18
kudos:1

This is kind of funny, and not really an answer to your question, but it most likely would have solved your problem...

If you want to splash cash, get one of these

»www.proceranetworks.com/

Now for a second, admittedly less satisfying answer: have you tried robots.txt? Crawlers are supposed to respect it, but I'm not sure they all do.

»en.wikipedia.org/wiki/Robots_exc···standard
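For reference, a minimal robots.txt that asks every (well-behaved) crawler to stay out of the whole site is just two lines, dropped into the web root:

```text
User-agent: *
Disallow: /
```

A crawler that honors the standard will skip everything under the site; the rogue harvesters will simply ignore it, as noted.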
--
"Perl is executable line noise, Python is executable pseudo-code."


FirebirdTN

join:2012-12-13
Brighton, TN
kudos:1
Reviews:
·Comcast

Hmmm, I didn't quite get the humor part of it...

I realize what I am asking for is a stretch to say the least, but you just never know unless you ask!

Truth is, the term "web crawler" is somewhat new to me. I consider myself fairly knowledgeable when it comes to PCs in general, networking, and such, but I didn't even know of the existence of these web-sifting "bots" until rather recently.

My ship is watertight, so it doesn't really matter if my content is published on some rogue web crawler's search engine; I don't think anyone is going to hack into my home network. I was asking more in case my ISP ever asks, "Why do you have openings on ports 21 and 80?" At least I could tell them, "It's for my own file access, and there are NO public links to it anywhere on the net." As it stands now, that last statement is NOT true because of these bots.

-Alan

PS: I'll try the "robots.txt" file, but I'm not sure what I am supposed to do with it... just drop it in the publicly accessible folder?



Anav
Sarcastic Llama? Naw, Just Acerbic
Premium
join:2001-07-16
Dartmouth, NS
kudos:4
reply to FirebirdTN

Don't worry about JP; he's a bit eccentric, but he's brilliant, and more importantly (and I don't usually admit this) he can probably drink me under the table.


JPedroT

join:2005-02-18
kudos:1

reply to FirebirdTN

The funny bit is that I have given the same answer to three questions in the last two days. Talk about a hammer seeing only nails, but those very expensive products would have solved all of them.

The robots.txt file is stored in your web server's root directory, and if the bots behave correctly they will read it to check whether they're allowed to index your website.

Look at the wiki link or just Google "robots.txt example".



leibold
Premium,MVM
join:2002-07-09
Sunnyvale, CA
kudos:10
Reviews:
·SONIC.NET

It depends on which webcrawlers you are concerned about.

I tend to put them primarily into two categories (but I'm sure that there are other ways to slice and dice it):

1.) spiders to populate search engines (google and co)
2.) content crawlers harvesting specific information (a common target are email addresses to be sold to spammers)

The first category tends to be well behaved in general (loading only a few pages at a time, honoring robots.txt) and tends to clearly identify itself in the user agent string.

The second category is out there to make money without any concern about you and your site. They generally ignore robots.txt and often disguise their identity by pretending to be regular users (by providing the user agent string of common browsers). They are less concerned about the impact they have on your bandwidth and are likely to crawl the entire site in one go (they may still limit their download rate to minimize detection).

The first type of webcrawlers tends to be beneficial to the working of the Internet and their behavior can be controlled with robots.txt and by user agent filtering in the webserver.
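As an illustration of the user-agent filtering mentioned above (assuming an Apache 2.4 server; the directives are real Apache ones, but the bot names are just examples, and the NSA310's embedded server likely doesn't support this):

```apache
# .htaccess sketch: deny requests whose User-Agent matches known crawler names.
# Requires mod_setenvif and mod_authz_core; bot names here are examples only.
SetEnvIfNoCase User-Agent "Googlebot|bingbot|Baiduspider" bad_bot
<RequireAll>
    Require all granted
    Require not env bad_bot
</RequireAll>
```

This only stops crawlers that tell the truth about themselves; the second category below defeats it by faking a browser user agent.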

I'm not aware of a proper fix for the second type of webcrawlers but there are a few things that can be done to reduce their impact on a site. If your site is truly for personal use only, the best solution is to use basic access controls for the entire site which will keep all the crawlers (good and bad) out.
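For the "basic access controls" option, assuming an Apache-style server, the classic setup is HTTP Basic auth via an .htaccess file (the paths and the user name below are hypothetical):

```apache
# .htaccess in the directory to protect.
# The password file is created once with: htpasswd -c /etc/apache2/.htpasswd alan
AuthType Basic
AuthName "Private files"
AuthUserFile /etc/apache2/.htpasswd
Require valid-user
```

Every crawler, good or bad, gets a 401 instead of your files; the tradeoff is that you (and anyone you share with) must log in too.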
--
Got some spare cpu cycles ? Join Team Helix or Team Starfire!


FirebirdTN

join:2012-12-13
Brighton, TN
kudos:1
Reviews:
·Comcast

Thanks, everyone, for the info. I'm sorry, JP, for making you repeat yourself. I did a search before posting, but I guess I used the wrong search terms, as it yielded ZERO results.

This really is not a big deal; it was just something I thought I'd ask. The USG line is so far above everything else I've touched in terms of capability, I almost expect to wake up one day and find it has breakfast ready.

My web "site" isn't really a site. All it is is a ZyXel NSA310 with web and FTP enabled, so I don't really have a site, just a network storage tank that also happens to be available via HTTP and FTP. I'm pretty good with computers and networking, but honestly I couldn't create a web page or website if my life depended on it; it's just something I've never done, nor cared to do. That is why I didn't even know about these "web crawlers".

Looks like I've got two options: 1) just leave the one folder available via anonymous HTTP and live with it, or 2) close it off and require a user login.

Since my "server" is a ZyXel NAS device, robots.txt is out as an option, since I can't reach the web root without "hacking" the box, which I do not want to do. I only know enough about Linux to be dangerous, and knowing my luck, I'd brick the thing.

-Alan



Brano
I hate Vogons
Premium,MVM
join:2002-06-25
Burlington, ON
kudos:10
Reviews:
·TekSavvy DSL
·Bell Fibe

Until you mentioned you're limited to the NSA310 I had some other options prepared, but if you want to stick to the NAS's built-in capabilities, here are a few options to consider:
1) Close off direct access to the NAS and use SSL or L2TP VPN only (this removes public access entirely).
2) Get a separate FTP server for public access on some cheap hardware, e.g. a Raspberry Pi... total cost ~$50. Install vsftpd; for uploads, create one folder with chown-after-upload so nobody else can grab anything anybody uploads.
For downloads you can make the folders not readable and use a direct link to a deeper structure, if you know what that is (share that with trusted parties only).
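The "not readable, direct link to a deeper structure" trick can be sketched like this (the path names are made-up examples):

```shell
# The directory is traversable (execute bit) but not listable (no read bit),
# so only someone who knows the exact path can fetch a file.
mkdir -p public/8f3a9c/files            # hard-to-guess path component
echo "hello" > public/8f3a9c/files/readme.txt
chmod 711 public public/8f3a9c          # rwx for owner, --x (traverse only) for others
# Other users can't list 'public', but anyone given the full path/URL
# can still read the file directly:
cat public/8f3a9c/files/readme.txt      # → hello
```

The web-server equivalent is disabling directory indexes, so the deep URL works but browsing to the parent shows nothing.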

Alternatively, hack your way into the NAS »zyxel.nas-central.org/wiki/SSH_server and install a separate FTP server on it.


JPedroT

join:2005-02-18
kudos:1
reply to FirebirdTN

You create the robots.txt file with a text editor and upload it to your site, just like any other file.

It's not hard, and it will at least stop your stuff from being indexed by the normal search engines.