We Will Rock You

by gerbilByte

Hello peeps!  It's me again, your friendly neighborhood gerbil.  You may remember me from 2600 articles such as "Taking Your Work Home After Work" (31:4) and "My Voice Is My Key" (32:3).

I haven't written in a long, long time because I have been so, so busy.  So I thought I'd say 'hi' by submitting a little snippet of something very useful.

Let's talk about wordlists.  What is a wordlist?

Well, a wordlist, as it says on the tin, is a file which is made up of a shit-load of words.

The Kali operating system has a few wordlists which can be found in: /usr/share/wordlists

There is a massive file called "rockyou.txt".  It's huge!!!

This is a bit of a default file for people to use, as it contains absolutely millions of words!  Let's have a look:

$ wc -l /usr/share/wordlists/rockyou.txt
14344392 rockyou.txt

Here we can see that there are 14,344,392 lines in the rockyou.txt file.  But does this value reflect words?  Well, a word is a word.  But is each line in "rockyou" a single word?  Let's run a quick command to see whether any of these lines contain a space, i.e., whether they are "phrases" or "sentences" rather than single words:

$ grep ' ' /usr/share/wordlists/rockyou.txt | head
rock you
i love you
te amo
fuck you
te iubesc
love you
i love u
chris brown
rock on
john cena

John Cena?!?!  Ha!  We see that the top ten lines are not single words!  So how many of these lines are phrases?  Let's run another command:

$ grep -c ' ' /usr/share/wordlists/rockyou.txt
70622

Wow!  Now if I wanted to run a wordlist testing for single words, these would be a waste of time as they are not single words.  OK, the password cracking tool may strip these out, but that too would be extra unnecessary work.  You may argue that "they are phrases, keep them in."  Nah!  The chance of a target's passphrase matching one of only 70,622 phrases is next to nothing.  And anyway, we are interested in a word list rather than a phrase list.

Before I go further, the rockyou.txt file contains loads of crap:

$ awk 'BEGIN{len=0;}{if(length($0)>len){len=length($0);printf("%i : %s\n",len,$0);}}' /usr/share/wordlists/rockyou.txt
6 : 123456
9 : 123456789
10 : 1234567890
11 : christopher
13 : tequieromucho
16 : manchesterunited
17 : mychemicalromance
18 : 123456789123456789
39 : Lets you update your FunNotes and more!
40 : 1111111111111111111111111111111111111111
42 : RockYou account is required for Voicemail.
49 : /* {--friendster-layouts.com css code start--} */
awk: cmd. line:1: (FILENAME=rockyou.txt FNR=602044) warning: Invalid multibyte data detected. There may be a mismatch between your data and your locale.
59 : http://www.rockyou.com/fxtext/fxtext-create.php?partner=hi5
77 : vabfdvfdlvhjibfedblsfndilvbgilebvgdlsbgvhbesghklhyubvuwklfbrebgfyurerebgyureb
165 : lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll
222 : <table style="border-collapse:collapse;"><tr><td colspan="2"><embed src="http://apps.rockyou.com/photofx.swf" quality="high" scale="noscale" salign="lt" width="325" height="260" wmode="transparent" flashvars="imgpath=http%
255 : <object width="206" height="224"><param name="movie" value="http://www.vivelatino.com.mx/contador.swf"></param><param name="wmode" value="transparent"></param><embed src="http://www.vivelatino.com.mx/contador.swf" type="application/x-shockwave-flash" wmod
257 : <style type=\\'text/css\\'>body{ background: url(http://recursos.fotocajon.com/enchulatupagina/img003/zxddXgCBLcTi.jpg) white center no-repeat fixed; } table, .heading_profile, .heading_profile_left, table td, #p_container, #p_nav_primary, #top_header, #p_n
262 : <style type=\\'text/css\\'>.bg_content{background-image:url(http://img360.imageshack.us/img360/5198/escanear00532wq9.jpg);}.bg_content{background-repeat:repeat;}</STYLE><a href=\\'http://hi5.enchulatupagina.com\\' target=\\'_top\\'><img src=\\'http://hi5.enchula
266 : <div id=\\'24813\\'><a href=\\'http://www.revistate.com\\'><img src=\\'http://www.revistate.com/uploads/20080218/rq/rqwpcf28o1pyb10yfzen53kmuipsi0_PAPARAZZI.jpg\\' border=0 alt=\\'Hazte famoso en www.revistate.com\\'></a></div><div id=\\'72891\\'><a href=\\'http://w
285 : <div align=\\\\\\'center\\\\\\' style=\\\\\\'font:bold 11px Verdana; width:310px\\\\\\'><a style=\\\\\\'background-color:#eeeeee;display:block;width:310px;border:solid 2px black; padding:5px\\\\\\' href=\\\\\\'http://www.musik-live.net\\\\\\' target=\\\\\\'_blank\\\\\\'>Playing/Tangga

What I have done here is print each line that is longer than the longest line seen before it.  Just by looking at this output, we see that the record-setting lines with a character count greater than 18 are, in fact, crap.  They're not even phrases!  They are bits of websites - HTML!  Definitely not useful in searching for passwords!
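For anyone who finds the awk one-liner a bit dense, here is a minimal sketch of the same "longest line so far" idea in a more readable form.  The sample file and the variable names (sample, records, max) are made up for illustration and are not real rockyou data.  Running awk under LC_ALL=C makes it count bytes, which should also sidestep the multibyte warning seen above.

```shell
# Hypothetical sample standing in for rockyou.txt (not real data from the list).
sample=$(mktemp)
printf '%s\n' '123456' 'abc' 'christopher' 'short' \
    '<div align=center>junk</div>' > "$sample"

# Print a line only when it is longer than every line seen before it.
# "max" starts out empty, which awk treats as 0 in a numeric comparison.
# LC_ALL=C makes awk count bytes rather than multibyte characters.
records=$(LC_ALL=C awk 'length($0) > max { max = length($0); printf "%d : %s\n", max, $0 }' "$sample")

printf '%s\n' "$records"
rm -f "$sample"
```

Only the lines that set a new length record are printed, so "abc" and "short" are skipped.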

So we can strip these out.  Anything with a space - get rid of it.
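As a rough sketch of that stripping step: the snippet below counts the phrases, drops them, and then optionally drops overlong lines as well.  The length cap is my assumption based on the 18-character observation above; it is not part of the script at the end of this article.  The sample data and the variable names (sample, phrases, no_spaces, maxlen, capped) are hypothetical.

```shell
# Hypothetical sample standing in for rockyou.txt (not real data from the list).
sample=$(mktemp)
printf '%s\n' 'password' 'i love you' 'letmein' 'john cena' \
    '1111111111111111111111111111111111111111' > "$sample"

# Count the phrases first, so we know how much is about to go.
phrases=$(grep -c ' ' "$sample")

# Keep only the lines that contain no space at all.
no_spaces=$(grep -v ' ' "$sample")

# Optional extra step (an assumption, not in the original script):
# also drop anything longer than maxlen bytes.
maxlen=18
capped=$(printf '%s\n' "$no_spaces" | LC_ALL=C awk -v max="$maxlen" 'length($0) <= max')

printf 'phrases removed: %s\n' "$phrases"
printf '%s\n' "$capped"
rm -f "$sample"
```

Whether 18 is the right cutoff depends on the target; a longer cap keeps more genuine long passwords at the cost of more junk.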

And while we're at it, let's remove emails and websites.  Think about it, you are cracking a password hash on BumbleBee Security's webapp.  Is some random person's email address or a website address going to be a password?  Unless you are really lucky, no, no it isn't!  Not whatsoever!

Out of interest, how many lines contain emails and websites?

$ egrep -c '[a-zA-Z0-9_\-\.]+@[a-zA-Z0-9_\-\.]+\.[a-zA-Z]{2,5}' /usr/share/wordlists/rockyou.txt
27348
$ grep -c 'http[s]*://' /usr/share/wordlists/rockyou.txt
866

Wow!  Quite a lot!  Let's remove them too.
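Here is a small sketch of how that extraction and removal might look on a toy file.  Note one deliberate difference from the commands above: inside a bracket expression, POSIX treats a backslash as a literal character, so this sketch writes the character class as [A-Za-z0-9_.-] with the hyphen placed last.  That means the sketch may match slightly different lines than the patterns used in this article.  The sample data and variable names (sample, email_re, url_re, emails, websites, words) are hypothetical.

```shell
# Hypothetical sample standing in for rockyou.txt (not real data from the list).
sample=$(mktemp)
printf '%s\n' 'password' 'someone@example.com' 'http://www.example.com/page' \
    'dragon' 'first-last@mail.example.org' > "$sample"

# Assumed patterns: a loose email shape and an http:// or https:// prefix.
email_re='[A-Za-z0-9_.-]+@[A-Za-z0-9_.-]+\.[A-Za-z]{2,5}'
url_re='https?://'

# Extract the matches so they can be kept for later inspection...
emails=$(grep -E "$email_re" "$sample")
websites=$(grep -E "$url_re" "$sample")

# ...then keep only the lines that match neither pattern.
words=$(grep -Ev "$url_re" "$sample" | grep -Ev "$email_re")

printf '%s\n' "$words"
rm -f "$sample"
```

Quoting the patterns also stops the shell from treating the square brackets and asterisk as a filename glob.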

In conclusion, the "rockyou.txt" wordlist contains a load of crap that can be removed.  And other wordlists may contain crap such as blocks of "header texts," etc.  Due to this, I wrote a simple script - feel free to use it and send me kudos.

Many thanks for reading.

Gerbil.  [twitter: @gerbilByte]

File Running

$ ./wordlistcleanser.sh rockyou.txt wewillrockyou.txt
Cleaning rockyou.txt...
 Output file : wewillrockyou.txt
Removing phrases...
grep: rockyou.txt: binary file matches
Extracting then removing websites...
Extracting then removing emails...
Getting stats on wewillrockyou.txt, extracted emails and extracted websites...
Cleansing completed.
$ wc -l rockyou.txt wewillrockyou.txt
 14344392 rockyou.txt
 14246262 wewillrockyou.txt
 28590654 total
$ expr 14344392 - 14246262
98130

#!/bin/bash
#
# wordlistcleanser.sh       gerbil2018 [twitter: @gerbilByte]
#
# This file is used to clean 'rockyou.txt' from all the crap to leave just
# single words.
# It will also cleanse other wordlists too.
#
# Usage:
# wordlistcleanser.sh infile [outfile]
#
# WARNING: If an output file isn't specified, then the input will be
# overwritten (permissions allowing).
#
# Example:
# ./wordlistcleanser.sh /usr/share/wordlists/rockyou.txt ./wewillrockyou.txt
#

infile=$1
outfile=$2
version="1.0"
author="gerbil"

if [ $# -lt 1 ];
   then
   printf "\nwordlistcleanser v%s - %s 2018 \n\nThis is a simple script that will remove \'phrases\', emails and websites from wordlist files.\nEmails and websites will be stored as files under the current directory.\n\n" ${version} ${author}
   printf "Usage:\n\t%s infile.txt [outfile.txt]\n\nWARNING: If an output file isn't specified, then the input will be overwritten (permissions allowing).\n\nExample:\n\t./wordlistcleanser.sh ./rockyou.txt ./wewillrockyou.txt\n\nHave fun! :) \n-%s\n" $0 ${author}
   exit
fi

baseinfile=$(basename "${infile}")
baseinfile=${baseinfile%.*}
printf "Cleaning %s...\n " "${infile}"

#Check input file exists...
if ! [ -f "${infile}" ] ;
   then #inputfile doesn't exist (or isn't a regular file).
   printf " %s doesn't exist!\n" "${infile}"
   exit 1
fi

#Check if inputfile is to be overwritten or not...
if [ -z "${outfile}" ] ;
  then #no output file specified, therefore destruct mode! ;P
  outfile=${infile}
  printf " No output file specified, therefore output will be stored at %s\n" "${outfile}"
  # rm -f ${infile} # just to save space
else
   printf "Output file : %s\n" "${outfile}"
fi

#Removing phrases...
printf "Removing phrases...\n"
grep -v ' ' "${infile}" > /tmp/ry1.txt

#Extracting then removing websites...
printf "Extracting then removing websites...\n"
# Patterns are quoted so the shell can't expand them as filename globs.
grep 'http[s]*://' /tmp/ry1.txt > "./${baseinfile}_websites.txt"
grep -v 'http[s]*://' /tmp/ry1.txt > /tmp/ry2.txt
rm -f /tmp/ry1.txt # just to save space

#Extracting then removing emails...
printf "Extracting then removing emails...\n"
# grep -E is the non-deprecated spelling of egrep.
grep -E '[a-zA-Z0-9\-\.]+@[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,5}' /tmp/ry2.txt > "./${baseinfile}_emails.txt"
grep -Ev '[a-zA-Z0-9\-\.]+@[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,5}' /tmp/ry2.txt > "${outfile}"
rm -f /tmp/ry2.txt # just to save space

#Get stats on leftover file (length of each word and count of each, I know
#there are no words longer than 1000 characters)...
printf "Getting stats on %s, extracted emails and extracted websites...\n" "${outfile}"
printf "Emails extracted: %s\n" "$(wc -l "./${baseinfile}_emails.txt")" > "./${outfile%.*}_stats.txt"
printf "Websites extracted: %s\n" "$(wc -l "./${baseinfile}_websites.txt")" >> "./${outfile%.*}_stats.txt"
printf "\nStats on %s : \n\n" "${outfile}" >> "./${outfile%.*}_stats.txt"
# Tally how many lines there are of each length, then drop the lengths with a zero count.
awk 'BEGIN{charcounts[1000]=0;len=0;printf("word length : count\n------------:-----\n");}{charcounts[length($0)]++;}END{for(i=0;i<=1000;i++){printf("%11i : %i\n",i,charcounts[i]);}}' "${outfile}" | grep -v ': 0$' >> "./${outfile%.*}_stats.txt"

printf "Cleansing completed.\n\n"
