June 15th, 2009
Cleaning, sorting and deleting repeated words in word lists
Index
- Introduction
- Dictionary attack
- Description
- An example
- Download (I don’t want to know what it does! Just do it!)
- How to use it
1. Introduction
This article is about a program that cleans, sorts and removes repeated words from your wordlists. You can download it directly from the Download section, but I recommend reading the full article to understand why you might need it and how it works.
2. Dictionary attack
In cryptanalysis and computer security, a dictionary attack is a technique for defeating a cipher or authentication mechanism by trying to determine its decryption key or passphrase by searching likely possibilities.
A dictionary attack uses a brute-force technique of successively trying all the words in a pre-arranged list of values. In contrast with a normal brute-force attack, where a large proportion of the key space is searched systematically, a dictionary attack tries only those possibilities which are most likely to succeed, typically derived from a list of words in a dictionary. Generally, dictionary attacks succeed because many people tend to choose passwords which are short (7 characters or fewer), single words found in dictionaries, or simple, easily predicted variations on words, such as a word with a digit appended.
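As a rough illustration of the idea, the sketch below tries every word of a wordlist against a known SHA-256 hash. It assumes GNU `sha256sum` is available; the function name `dictionary_attack`, the sample words and the target hash are made up for this example, and real authentication schemes normally use salted, slower hashes.

```shell
# Hypothetical helper: print the word from a wordlist whose SHA-256 hash
# equals the target hash. Returns 1 if no word matches.
dictionary_attack() {
    target_hash=$1
    wordlist=$2
    while IFS= read -r word; do
        # printf '%s' hashes the word without a trailing newline.
        candidate=$(printf '%s' "$word" | sha256sum | cut -d ' ' -f 1)
        if [ "$candidate" = "$target_hash" ]; then
            printf '%s\n' "$word"
            return 0
        fi
    done < "$wordlist"
    return 1
}

# Usage: build a tiny wordlist and look for the hash of "qwerty".
list=$(mktemp)
printf '%s\n' 12345 asdfg qwerty aaaa > "$list"
target=$(printf '%s' qwerty | sha256sum | cut -d ' ' -f 1)
dictionary_attack "$target" "$list"   # prints: qwerty
rm -f "$list"
```

Every repeated word in the list costs one more hash computation that cannot succeed, which is why repetitions are worth removing.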
When we want to perform this kind of attack, we usually have several wordlists in separate files: common English words, cities, names, medical words and so on. Many words, however, appear in most of those files. For example, entries such as “12345 qwerty asdfg a aa aaa aaaa” turn up in almost every wordlist, because the people who built the lists each added the same common words. The more lists we combine, the more repeated words we accumulate. With a few small wordlist files, the time wasted on trying the same word more than once hardly matters, but with huge wordlists (10,000 words or more) full of repetitions, it pays to optimize them.
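To get a feel for how much repetition a set of wordlists carries, a quick count like the following can help. This is only a sketch: `count_redundant` and the file names are hypothetical, and it assumes every list already has one word per line.

```shell
# Count how many words in a set of wordlists are redundant, i.e. how many
# lines would disappear if every word were kept only once.
count_redundant() {
    total=$(cat "$@" | wc -l)
    unique=$(cat "$@" | sort -u | wc -l)
    echo $((total - unique))
}

# Usage with two small, made-up wordlists.
dir=$(mktemp -d)
printf '%s\n' 12345 qwerty london paris > "$dir/cities.txt"
printf '%s\n' 12345 qwerty alice bob qwerty > "$dir/names.txt"
count_redundant "$dir/cities.txt" "$dir/names.txt"   # prints: 3
rm -rf "$dir"
```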
3. Description
The proposed algorithm cleans these huge wordlists: it deletes all repeated words, cleans up spaces and new lines, and sorts the words. It performs the following tasks:
- Clean spaces and add a new line between words.
- For each wordlist file, sort all words.
- For each wordlist file, remove repeated words.
- Compare every file with the others to remove repeated words, saving the result in the second file of each comparison.
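The first three tasks map naturally onto standard Unix tools. The following is a minimal sketch of that idea, not the code of the actual script: `normalize_wordlist` is a hypothetical name, and the function rewrites the file in place through a temporary file.

```shell
# Hypothetical helper covering tasks 1-3 for a single wordlist file:
# split on whitespace, put one word per line, sort, and drop repeats.
normalize_wordlist() {
    file=$1
    tmp=$(mktemp)
    # tr -s turns spaces, tabs and newlines into single newlines;
    # grep -v '^$' drops the empty line left by leading whitespace;
    # sort -u sorts and removes repeated words in a single pass.
    tr -s '[:space:]' '\n' < "$file" | grep -v '^$' | sort -u > "$tmp"
    cat "$tmp" > "$file"
    rm -f "$tmp"
}

# Usage on a small, made-up file.
f=$(mktemp)
printf 'o p  m a b\nq q q q r s\n' > "$f"
normalize_wordlist "$f"
cat "$f"    # one word per line: a b m o p q r s
rm -f "$f"
```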
These tasks are optimized, and the program performs several of them at the same time. Even so, the hardest task is comparing every file with the others, although it is optimized so that no comparison is repeated.
Using this algorithm, we can reduce the time a dictionary attack takes by up to 40-50%.
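One possible way to write the file-by-file comparison is with `comm`, which compares two sorted files line by line. The sketch below is an assumption about how such a comparison could look, not the script's actual code; `remove_shared` and `dedupe_across` are hypothetical names, and every file is expected to be sorted already, with one word per line. Each pair is visited only once, so n files need n(n-1)/2 comparisons.

```shell
# Remove from the second file every word that also appears in the first.
# Both files must be sorted the same way (for example with sort -u).
remove_shared() {
    first=$1
    second=$2
    tmp=$(mktemp)
    # comm -13 suppresses lines unique to the first file (column 1) and
    # lines common to both (column 3), leaving lines only in the second.
    comm -13 "$first" "$second" > "$tmp"
    cat "$tmp" > "$second"
    rm -f "$tmp"
}

# Compare every file with each later file exactly once.
dedupe_across() {
    while [ "$#" -gt 1 ]; do
        current=$1
        shift
        for other in "$@"; do
            remove_shared "$current" "$other"
        done
    done
}

# Usage with three small, made-up wordlists.
dir=$(mktemp -d)
printf '%s\n' a b c > "$dir/one"
printf '%s\n' b c d > "$dir/two"
printf '%s\n' c d e > "$dir/three"
dedupe_across "$dir/one" "$dir/two" "$dir/three"
cat "$dir/two"     # prints: d
cat "$dir/three"   # prints: e
rm -rf "$dir"
```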
4. An example
We have four files on our computer, FileA, FileB, FileC and FileD (table 4.1), each with different words. Some words may be repeated within a file or across files. In total, we have 36 words, including repeated ones.
| FileA | FileB | FileC | FileD |
|---|---|---|---|
| g h e a b c d f | i j m n l k a f | o p m a b q q q q r s | s v j q s t u a w |
Table 4.1: Total words: 36.
Tasks 1, 2 and 3: for each file, clean spaces, add a new line between words, sort the words and remove repeated words (table 4.2).
| FileA | FileB | FileC | FileD |
|---|---|---|---|
| a b c d e f g h | a f i j k l m n | a b m o p q r s | a j q s t u v w |
Table 4.2: Total words: 32.
Task 4: Compare file by file to remove repeated words. In the example, six comparisons are performed: <fileA, fileB>, <fileA, fileC>, <fileA, fileD>, <fileB, fileC>, <fileB, fileD>, <fileC, fileD>. In each comparison, the first wordlist file is compared with the second one, and every word that appears in both is deleted from the second. The result of the first three comparisons (the file A comparisons) is shown in table 4.3.
| FileA | FileB | FileC | FileD |
|---|---|---|---|
| a b c d e f g h | i j k l m n | m o p q r s | j q s t u v w |
Table 4.3: Total words: 27.
Task 4: After comparing the rest of the files, <fileB, fileC>, <fileB, fileD> and <fileC, fileD>, the final state of the files is shown in table 4.4.
| FileA | FileB | FileC | FileD |
|---|---|---|---|
| a b c d e f g h | i j k l m n | o p q r s | t u v w |
Table 4.4: Total words: 23.
The number of words drops from 36 to 23, which is 36.1% fewer than in the original wordlist files.
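A quick way to double-check an example like this one: once all repetitions are gone, the number of words left must equal the number of distinct words in the original files. The sketch below recreates the four files of table 4.1 in a temporary folder (the lowercase file names are arbitrary) and counts both values.

```shell
# Recreate the four example files of table 4.1.
dir=$(mktemp -d)
printf '%s\n' 'g h e a b c d f'       > "$dir/fileA"
printf '%s\n' 'i j m n l k a f'       > "$dir/fileB"
printf '%s\n' 'o p m a b q q q q r s' > "$dir/fileC"
printf '%s\n' 's v j q s t u a w'     > "$dir/fileD"

# Words before cleaning, repetitions included.
total=$(cat "$dir"/file? | wc -w)
# Distinct words: one word per line, then sort and drop repeats.
distinct=$(cat "$dir"/file? | tr -s ' ' '\n' | sort -u | wc -l)

echo "before: $((total)), after: $((distinct))"
rm -rf "$dir"
```

If the cleaned files hold more or fewer words than the distinct count, a word was either kept twice or lost along the way.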
5. Download
You can download the shell script if you accept the GPL v3 license. If you accept it, YOU ARE RESPONSIBLE for any kind of damage, such as loss of data, deleted files, or any other problem in your hardware, software or system.
EXECUTE IT WITH CAUTION! Read the “How to use it” section first.
Download <Clean wordlists> version 0.0.1!!!
6. How to use it
After downloading it, make the script executable:
chmod +x cleanwl.sh
Then copy it into the folder that contains the wordlist files you want to clean and run the script:
[path_of_the_script]/cleanwl.sh
For example: ./cleanwl.sh
It will show you the statistics of the process. If you don’t want to see them, you can redirect the output to a file or to /dev/null. IMPORTANT: if you redirect the output to a file, put that file in another folder, not in the folder that contains the wordlist files!
For example: ./cleanwl.sh > ../stats.txt
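Since task 4 saves its results in the wordlist files themselves, it may be worth keeping a copy of the originals before the first run. A minimal sketch, assuming the wordlists live in a folder of their own; `backup_wordlists` and the folder names are hypothetical and not part of the script.

```shell
# Copy a wordlist folder to a sibling folder named <folder>.bak.
# Refuses to overwrite an existing backup.
backup_wordlists() {
    src=${1%/}
    dest="$src.bak"
    if [ -e "$dest" ]; then
        echo "backup already exists: $dest" >&2
        return 1
    fi
    cp -R "$src" "$dest"
}

# Usage (paths are examples):
#   backup_wordlists ~/wordlists && cd ~/wordlists && ./cleanwl.sh > ../stats.txt
```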
