Wikipedia is a great source for machine learning, as it is a large, free corpus of text. Great as a source, but getting it actually ready for machine learning tasks takes a fair bit of preprocessing and data cleaning effort. One of the neat features Wikipedia offers is dumps of its data in Wikimedia format, so that people don't need to crawl the website. Crawling Wikipedia would be a resource strain for both parties and would require a good amount of effort to extract content from HTML. To actually get the data, I'd recommend using the official XML dumps that Wikipedia provides in Wikimedia format.

The recommended first step is to download the above-mentioned data from a mirror; look for the links with page data:

[ARC] enwiki-20180220-pages-meta-current26.xml-p38067203p39567203.bz2 2018-02-22 08:18  687M  
[ARC] enwiki-20180220-pages-meta-current26.xml-p39567203p41067203.bz2 2018-02-22 08:22  670M  
[ARC] enwiki-20180220-pages-meta-current26.xml-p41067203p42567203.bz2 2018-02-22 08:24  714M  
[ARC] enwiki-20180220-pages-meta-current26.xml-p42567203p42663461.bz2 2018-02-22 07:23   45M  
[ARC] enwiki-20180220-pages-meta-current27.xml-p42663462p44163462.bz2 2018-02-22 08:18  683M  
[ARC] enwiki-20180220-pages-meta-current27.xml-p44163462p45663462.bz2 2018-02-22 08:22  627M  
[ARC] enwiki-20180220-pages-meta-current27.xml-p45663462p47163462.bz2 2018-02-22 08:26  510M  
[ARC] enwiki-20180220-pages-meta-current27.xml-p47163462p48663462.bz2 2018-02-22 08:23  619M  
[ARC] enwiki-20180220-pages-meta-current27.xml-p48663462p50163462.bz2 2018-02-22 08:22  596M  

and download all of them.
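
As an illustration of the download step (the host below is the official dumps.wikimedia.org server, which is an assumption on my part; substitute the mirror and dump date you actually picked), each part can be fetched with wget:

#BASE is an assumption: swap in the mirror you chose
BASE=https://dumps.wikimedia.org/enwiki/20180220
wget -c "$BASE/enwiki-20180220-pages-meta-current26.xml-p38067203p39567203.bz2"
wget -c "$BASE/enwiki-20180220-pages-meta-current26.xml-p39567203p41067203.bz2"
#...and so on for the remaining parts in the listing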

Next, unpack the archives on your hard drive in a folder designated for this.
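
A minimal way to do that, assuming the .bz2 archives sit in the folder you designated (bunzip2 replaces each archive with the unpacked XML part):

#decompress every downloaded part in place
for f in enwiki-20180220-pages-meta-current*.bz2; do
    bunzip2 "$f"
done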

Install GNU parallel and ripgrep.

We're going to search all of this data through the command line; to do so we are going to use GNU parallel and ripgrep (a faster grep).
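
On a Debian/Ubuntu-style system (an assumption; use your own package manager or the projects' release pages otherwise) the installation is roughly:

#package names on Debian/Ubuntu; the ripgrep binary is called rg
sudo apt-get install parallel ripgrep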

We are now going to make an index of all articles for easier searching through this data, because if you run GNU parallel and ripgrep over the raw dumps directly, lookups are slow for articles that sit near the bottom of the big files.

First, make an index subfolder in the place where you unpacked all these files, and remove the .bz2 archives so that only the data remains.

mkdir index

Create a list of the data files in the current directory and put it in a file named cc.

ls -1 enwiki-20180220-pages-meta-current* > cc

Now let's make an index of titles: one index file per data file, where each line holds the line number and the title. There should be a one-to-one match between files in the index subfolder and the data files.

NUMJOBS=10
#this builds one index file per unpacked data file; each index line holds the line number and the raw <title> element
parallel -j$NUMJOBS -a cc "rg -n '<title>' {} > index/{}"
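
Each index file mirrors one data file; a quick peek confirms the shape of the entries (the file name is taken from the listing above, and the line in the comment is only illustrative):

#peek at one index file: every line is <line number in the data file>:<the raw title element>
head -n 3 "index/enwiki-20180220-pages-meta-current26.xml-p38067203p39567203"
#e.g.  1042:    <title>Some article</title>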

Now that the index has been produced in the index subfolder, you can use the following script to search for an article.

From the index directory do:

#store the search string in a variable
search='<title>Helsinki</title>'
#search the entire index that was just made for this title, to find the index file (and hence the data file) where this article lives
file=$(rg -FHl -j27 "$search" *)
#get the line number of this article from the index entry (strip the trailing colon)
linenum=$(rg --no-filename -F --no-line-number "$search" "$file" | awk '{print substr($1,1,length($1)-1)}')
#the index has one file per data file
#now search the data file from this line number until the end of the page
#note that the data files are expected to be in the parent directory of index/
sed -n "$linenum,\${p;/<\/page/q}" "../$file"

NOTE: This was done from a zsh shell. If you use a different one, you may need to adapt the script slightly (quoting rules differ between shells); the awk substr call above is 1-based and should behave the same across awk implementations.
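
To make repeated lookups easier, the same three steps can be wrapped in a small shell function. This is just a sketch: the name fetch_article is my own, and it assumes you call it from the index directory with the data files one level up.

#usage: fetch_article "Helsinki"
fetch_article() {
    local search="<title>$1</title>"
    local file linenum
    #which index file (and hence which data file) holds this title?
    file=$(rg -FHl "$search" *) || return 1
    #line number of the title inside the data file (strip the trailing colon)
    linenum=$(rg --no-filename -F --no-line-number "$search" "$file" | awk '{print substr($1,1,length($1)-1)}')
    #print from the title line until the closing </page> tag
    sed -n "$linenum,\${p;/<\/page/q}" "../$file"
}

fetch_article "Helsinki"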

Now you are left with the data cleaning part, which is easier. The script above should fetch the article from the data.
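
As a very rough starting point for that cleaning (only a sketch of my own, reusing the $linenum and $file variables from the script above; real wiki markup deserves a proper parser), you can keep just the wikitext body of the fetched page and strip the XML tags:

#keep only the <text>...</text> element of the fetched page, then crudely drop XML tags
#wiki markup (links, templates, references) will still remain and needs further cleaning
sed -n "$linenum,\${p;/<\/page/q}" "../$file" \
  | sed -n '/<text/,/<\/text>/p' \
  | sed -e 's/<[^>]*>//g'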

Onwards, noble machine learner, to the next data science adventure!