We had implemented a way of searching both the site and the list archives from the same search box at the top of each page.
While this was kind of neat, it severely slowed down the whole site in two ways:
The search itself was really slow, since it had to examine many more pages (each mail to the list became one page in the wiki). Searches also tended to return many results of poor quality.
The second function that was really slow was the ‘backlinks’. It finds all pages in the wiki that link to the current page. (It probably uses ‘search’ internally.)
Both functions were so slow that they seemed to be broken.
So we decided to drop the searchability of the mailing-list archives, and I removed it from the wiki.
The form for searching the mails can be found on the Mailing list page, in The List's Archives section.
Yours,
Stefan
This is how I made our Mailman/Pipermail archive searchable through the DokuWiki search box.
Pipermail, the Mailman archiver, generates HTML-pages for all posts. The following script takes these pages, strips away everything that is not needed and puts the result in a .txt file inside a DokuWiki namespace.
#! /bin/bash

# rm /home/htdocs/beta.linuxbasics.org/data/mailinglist/qnaarchive/*
# fails because of too many files
find /home/htdocs/beta.linuxbasics.org/data/mailinglist/qnaarchive/ -iname "*.txt" | xargs rm

# Copy all posts from pipermail to the target directory:
find /usr/local/mailman/archives/public/qna/ \
  -iname "[1-90]*.html" \
  -exec cp {} /home/htdocs/beta.linuxbasics.org/data/mailinglist/qnaarchive \;

# Remove <HEAD>, links, footer inserted by pipermail
for i in /home/htdocs/beta.linuxbasics.org/data/mailinglist/qnaarchive/*.html ; do
  echo "<html>" > /home/htdocs/beta.linuxbasics.org/data/mailinglist/qnaarchive/`basename $i .html`.txt
  sed -n -e '/<H1>/,/<\/I>/p' \
         -e '/<!--beginarticle-->/,/<!--endarticle-->/p' \
         $i \
    >> /home/htdocs/beta.linuxbasics.org/data/mailinglist/qnaarchive/`basename $i .html`.txt
  echo "</html>" >> /home/htdocs/beta.linuxbasics.org/data/mailinglist/qnaarchive/`basename $i .html`.txt
done;

# Remove HTML-files
find /home/htdocs/beta.linuxbasics.org/data/mailinglist/qnaarchive/ -iname "*.html" | xargs rm
The sed-command prints only the lines between <H1> and </I>, and those between <!--beginarticle--> and <!--endarticle-->. This happens to be the information about the post, and the post itself. Thanks to seder's grab bag for their excellent examples.
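To see what the two address ranges select, here is a minimal sketch that runs the same sed expressions on a small sample file. The sample is a simplified, made-up stand-in for a Pipermail page (real pages contain more markup), and the variable names `sample` and `extracted` are only for illustration.

```shell
#!/bin/bash
# Hypothetical, simplified stand-in for a Pipermail-generated page.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<HTML>
<HEAD><TITLE>[qna] Example post</TITLE></HEAD>
<BODY>
<H1>[qna] Example post</H1>
<B>Jane Doe</B>
<I>Mon Jun 20 22:17:00 CEST 2005</I>
<UL><LI>Previous message</LI></UL>
<!--beginarticle-->
<PRE>Hello list!</PRE>
<!--endarticle-->
<HR>
<a href="index.html">More information about the list</a>
</BODY>
</HTML>
EOF

# -n suppresses sed's default output; each "start,end" range with the
# p command prints only the lines from the start match to the end match.
extracted=$(sed -n -e '/<H1>/,/<\/I>/p' \
                   -e '/<!--beginarticle-->/,/<!--endarticle-->/p' \
                   "$sample")
printf '%s\n' "$extracted"
rm -f "$sample"
```

The header, the navigation list and the footer link fall outside both ranges, so only the post information and the post body remain.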
The <html></html> around the extracted lines is to tell DokuWiki to allow HTML-markup.
All TXT-files are wiped before the work is done, and all HTML-files are removed once they are no longer needed. That way, if a post is removed from the Pipermail archives, it will also vanish from the DokuWiki archive.
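The wipe-then-rebuild idea can be tried out safely on throwaway directories. The following is only a sketch: the two temporary directories and the file names are invented for the demonstration, and it uses `find ... -exec rm -f {} +` as one possible variant of the `find | xargs rm` step from the script.

```shell
#!/bin/bash
# Throwaway directories standing in for the Pipermail archive (src)
# and the DokuWiki namespace (dst). Both are assumptions for this demo.
src=$(mktemp -d)
dst=$(mktemp -d)

echo "post one" > "$src/000001.html"
echo "left over from a deleted post" > "$dst/000099.txt"

# Step 1: wipe every generated .txt file. find passes the names to rm
# in batches, which avoids the "too many files" failure of a plain rm *.
find "$dst" -name '*.txt' -exec rm -f {} +

# Step 2: regenerate from whatever the source archive holds right now.
for f in "$src"/*.html; do
  cp "$f" "$dst/$(basename "$f" .html).txt"
done

remaining=$(ls "$dst")
echo "$remaining"
rm -rf "$src" "$dst"
```

Because the target is emptied first, the stale `000099.txt` has no counterpart in the source and does not come back.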
A regular search for ‘network’ takes about 8 seconds, plus about the same time to display all results. ‘WiFi’, which has far fewer results, takes about the same time to search, but much less to display. The time to display may vary with your connection speed.
It is NOT ADVISABLE to search for terms that appear in many posts, like ‘Anita’. Also, if you click on ‘index’ within the qnaarchive-namespace, you can easily go for coffee (or a pizza)…
Of course: Don’t forget to have a cronjob refreshing the archive :)
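As a rough illustration of such a cronjob, the sketch below builds a crontab line that would run the conversion script once a night. The script path `/path/to/update-qnaarchive.sh` and the time of 03:15 are placeholders, not the values used on this site.

```shell
#!/bin/bash
# Crontab fields: minute hour day-of-month month day-of-week command.
# Path and schedule are hypothetical; adjust them to your installation.
entry='15 3 * * * /path/to/update-qnaarchive.sh'
echo "$entry"

# To install it for the current user, append it to the existing crontab:
#   ( crontab -l 2>/dev/null; echo "$entry" ) | crontab -
```

The install command is left commented out so that running the snippet only prints the line and does not touch your crontab.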
— Stefan Waidele jun. 2005/06/20 22:17
Copyright (c) by the authors.
Prior to editing, authors agreed to license their contributions by the terms of the GPL.
See our licensing page for details.
Linux® is a registered trademark of Linus Torvalds.