Andrew Gudgel

Writer | Translator | Poet

Downloads

After a Debian upgrade went horribly wrong and choked my laptop, I used a command-line-only version to restore the operating system. This led to a weekend of learning just how much "normal" computer usage I could get from a command-line-only system. I quickly discovered that, short of spreadsheets and high-end word processing, I could do just about everything I normally did on a day-to-day basis from the command line. In the end I did reinstall a graphical desktop because I needed to be able to use LibreOffice. But those couple of days inspired me to try putting an entire command-line-only system on a USB drive for further experimentation. Here are the results:

Creating a Command-Line-Only Debian Linux USB Drive

 

 

This is a collection of blog posts I wrote on how to be a digital renaissance man (or woman), which includes topics such as how to keep a commonplace book, why you should write poetry and learn a foreign language, and how to make an erasable writing tablet.

The Digital Renaissance Man Series

 

 

Several years back, I got interested in the concept of "thrift" in the Victorian sense. I read Samuel Smiles' book on the power of saving money, "Thrift," and did some further research, which led to a 17th-century pamphlet by the writer Henry Peacham called "The Worth of a Penny or a Caution to Keep Money." In it, Peacham describes ways of losing, saving, and making money, gives the history of the word "penny," and even lists what a penny would buy in the 1640s. I went to the Library of Congress and managed to get a photocopy of the pamphlet, which I transcribed. I had a vague notion of doing something with it "someday." And there it sat on my hard drive until 2016, when I posted it online.

One of the interesting things about Peacham's pamphlet is the use of the phrase, "penny wise, pound foolish." "The Worth of a Penny" remained popular well into the 18th century, making it quite possible that a young printer with an interest in thrift by the name of Benjamin Franklin read Peacham's work. Speaking of the poems he used in the 1747 Poor Richard's Almanac, Franklin said "I need not tell thee that many of them are not of my own making." Could the same be true of Franklin's now-famous maxim, which also appeared in that 1747 almanac?

The Worth of a Penny

 

 

When I'm not writing, I like to explore the Linux operating system and the programs associated with it. Inspired by this blog post at the Hundred Rabbits website, I decided to see if I could create a script for the sed (stream editor) program that turned Wikipedia's XML files into plain text for offline storage. This is the result of the weeks of tinkering that followed. While it is possible to do some HTML/XML and Wikitext parsing using sed, it's not the best or most efficient way, and this script is full of holes and flaws. But it might prove a useful base upon which someone else can build a better one.

Note: I have not tried this script on non-English versions of Wikipedia. Nor have I tried it on Wikibooks or Wiktionary. I don't know if/how it would handle those.
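To give a flavor of what sed-based Wikitext stripping involves, here is a minimal sketch. It is not the textwiki.sed script offered below; the function name strip_wikitext and the handful of rules are assumptions chosen for illustration, and they cover only bold/italic quotes, simple links, and angle-bracket tags.

```shell
#!/bin/bash
# Illustrative sketch only: NOT the author's textwiki.sed.
# strip_wikitext is a hypothetical helper that reads Wikitext on stdin
# and writes a roughly de-marked-up version to stdout.
strip_wikitext() {
    sed -E \
        -e "s/'''//g" \
        -e "s/''//g" \
        -e 's/\[\[[^]|]*\|([^]]*)\]\]/\1/g' \
        -e 's/\[\[([^]|]*)\]\]/\1/g' \
        -e 's/<[^>]*>//g'
    # Rules, in order:
    #   1. drop ''' (bold markers)
    #   2. drop ''  (italic markers)
    #   3. [[Target|label]] -> label
    #   4. [[Target]]       -> Target
    #   5. drop anything that looks like an <xml> or <html> tag
}
```

For example, piping the line '''Henry Peacham''' was an [[England|English]] [[poet]]. through strip_wikitext prints: Henry Peacham was an English poet. Nested templates, tables, and references need far more than this, which is part of why sed is not the ideal tool for the job.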

The first step in the text-file-only Wikipedia process is to decide whether you really want to do this: the decompressed XML files for the full copy of Wikipedia I downloaded were over 80 GB in size. Perhaps consider starting with the simplewiki version, which uses the 500 most common English words and which fits into one large file. You can download the raw Wikipedia files here. Once you have downloaded what you want, extract the file(s) from their archives, then create and run a short "for loop" bash script to iterate through them all. I used:

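#!/bin/bash
# Run textwiki.sed over every extracted dump file, writing <name>.txt beside it.
for file in *.xml*
do
    # Skip output left by an earlier run, since it also matches *.xml*
    [[ ${file} == *.txt ]] && continue
    sed -f textwiki.sed "${file}" > "${file}.txt"
done

The resulting text files were roughly 20 GB altogether. I use grep to search for the article I want. You can also compress them; compressed, the files were around 8 GB.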
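The search and compression steps could look something like the sketch below. The function names are hypothetical, and it assumes the .txt files from the loop above sit in the current directory. I have also assumed gzip as the compressor; any other would do.

```shell
#!/bin/bash
# Hypothetical helpers for working with the generated text files.
# All names here are illustrative assumptions.

# List the files that mention a phrase (case-insensitive, literal match).
wiki_files_with() {
    local phrase="$1"; shift
    grep -l -i -F -e "$phrase" -- "$@"
}

# Print matching lines plus the five lines that follow each match.
wiki_show() {
    local phrase="$1"; shift
    grep -i -F -A 5 -e "$phrase" -- "$@"
}

# Compress the text files in place (each file.txt becomes file.txt.gz).
wiki_compress() {
    gzip -9 -- "$@"
}

# Search compressed files without unpacking them to disk.
wiki_zsearch() {
    local phrase="$1"; shift
    gzip -dc -- "$@" | grep -i -F -e "$phrase"
}
```

Typical use would be wiki_files_with "Henry Peacham" *.txt to find the right file, then wiki_show "Henry Peacham" on that file to read the start of the article. After wiki_compress *.txt, the same phrase can still be found with wiki_zsearch "Henry Peacham" *.txt.gz, at the cost of decompressing on every search.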

Here's the sed script itself:

textwiki.sed

 

 


© 2021 Andrew Gudgel
contact (at sign) andrewgudgel.com
This page last updated on 20210108.