LinuxBasics.org

The community that helps people to run Linux

URL-Lists for Google sitemaps

Google Sitemaps needs a list of URLs to optimize crawling. Usually, this is no problem, since Google supplies a script you can run on your server to build that list.

But that script fails if the content of your site is stored not in HTML files but in TXT files, as DokuWiki does. So here is what I did to build that list of URLs:

Find them

We do all of this in the directory where DokuWiki stores the pages:

cd /home/hdocs/beta.linuxbasics.org/data
find ./ -iname "*.txt"

gives us

./wiki/syntax.txt
./wiki/dokuwiki.txt
./wiki/playground.txt
./start.txt
./tutorials/pre/start.txt
... 

which is almost the URL, except that the leading ‘./’ has to become ‘http://LinuxBasics.org/’ and the trailing ‘.txt’ has to go.

SED

The stream editor ‘sed’ can help us with those replacements. Perl’s s///-command is modelled on it, so if you know Perl, this will be familiar:

sed -e 's#^\./#http://LinuxBasics.org/# ; s/\.txt$//'

This uses ‘#’ as a delimiter instead of ‘/’. Why? Because it looks much better than the version with slashes, where every ‘/’ in the pattern and in the URL has to be escaped: “s/^\.\//http:\/\/LinuxBasics.org\//”
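
To try the substitution without touching the wiki, here is a minimal sketch that feeds two sample paths through sed. The function name to_url is made up for this example. The dots are written as ‘\.’ so that they match a literal dot only; an unescaped ‘.’ would match any character.

```shell
# Hypothetical helper (not part of DokuWiki or sed): reads page paths,
# one per line, on stdin and prints the matching URLs.
to_url() {
    # First expression: '#' is the delimiter, so the slashes in the URL
    # need no escaping. Second expression: strip a trailing ".txt".
    sed -e 's#^\./#http://LinuxBasics.org/#' -e 's/\.txt$//'
}

# Two sample paths as find would print them.
printf '%s\n' ./start.txt ./wiki/syntax.txt | to_url
```

This should print http://LinuxBasics.org/start and http://LinuxBasics.org/wiki/syntax, one per line.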

Putting it together

find ./ -iname "*.txt" | sed -e 's#^\./#http://LinuxBasics.org/# ; s/\.txt$//'

gives us what we want:

http://LinuxBasics.org/wiki/syntax
http://LinuxBasics.org/wiki/dokuwiki
http://LinuxBasics.org/wiki/playground
http://LinuxBasics.org/start
http://LinuxBasics.org/tutorials/pre/start
http://LinuxBasics.org/tutorials/pre/md5sum
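
To keep the list around, the pipeline can be wrapped in a small function that writes it to a file. This is only a sketch: the name build_url_list and its three arguments are assumptions made for this example, and it assumes the base URL contains no ‘#’, since that character is the sed delimiter.

```shell
# Hypothetical helper: build_url_list DATA_DIR BASE_URL OUT_FILE
# Writes one URL per line to OUT_FILE for every *.txt file below DATA_DIR.
build_url_list() {
    data_dir=$1
    base_url=$2
    out_file=$3
    # find runs in a subshell, so the caller's working directory stays put.
    (cd "$data_dir" && find . -type f -iname '*.txt') |
        # -iname also matches ".TXT", so strip the suffix in any case.
        sed -e "s#^\./#${base_url}/#" -e 's/\.[Tt][Xx][Tt]$//' |
        # Sorting keeps the list stable between runs.
        sort > "$out_file"
}
```

For this site the call would look like: build_url_list /home/hdocs/beta.linuxbasics.org/data http://LinuxBasics.org urllist.txt (the output file name urllist.txt is just an example).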

Copyright (c) by the authors.
Prior to editing, authors agreed to license their contributions by the terms of the GPL.
See our licensing page for details.


Linux® is a registered trademark of Linus Torvalds.


 
  tutorials/advanced/realworld/url-lists_for_google.txt · Last modified: 2008/07/20 19:08
