About DuckDuckGo's Sources

by N1xis10t

I discovered something interesting while poking around in the inner workings of the DuckDuckGo search engine.

Not counting Instant Answers (the fancy first results that usually come from DuckDuckGo's many partner companies), every search result appears to be marked as a Microsoft Bing search result.

Let me explain exactly what I mean by that.

"s":"bingv7aa"

When you use DuckDuckGo, search results are delivered from links.duckduckgo.com to your web browser inside a JavaScript array, in which each result is represented as a JavaScript object (what Python would call a dictionary).

Every search result in the array contains several key/value pairs that hold useful information, including (except in the case of the Instant Answers) one key called s, whose value is always bingv7aa.

For context, the current iteration of Microsoft Bing's programmable search interface is called the "Bing Web Search API v7," and this is all I can imagine bingv7aa standing for.

I discovered this information by watching network traffic and sifting through code manually, but testing multiple searches in this fashion quickly becomes tiresome.

It is for this reason that I wrote (and have included at the end of this article) two scripts in Python 3: a library for retrieving search results from DuckDuckGo, and a script that uses said library and displays information about each search result.

They work seamlessly together as an extremely minimalistic text-based web browser, with which the user can browse DuckDuckGo and more efficiently obtain information.  Using these tools, I have tested many searches with many different search terms, and have yet to find a search result that wasn't marked with bingv7aa.

Now, this wouldn't actually be an issue if DuckDuckGo were more transparent and told their users exactly where the results come from, but the reality is quite different.

What DuckDuckGo Has To Say

The DuckDuckGo help files (located at help.duckduckgo.com) contain two pages that are relevant to this subject.  The first one is directly about their sources, and the second one is about the quantity of search results, but it has some interesting information about their sources as well.

The help page located at help.duckduckgo.com/duckduckgo-help-pages/results/sources first discusses Instant Answers and where they come from, and then it says:

"... We also maintain our own crawler (DuckDuckBot) and many indexes to support our results.  Of course, we have more traditional links and images in our search results too, which we largely source from Bing."

I can't imagine that the DuckDuckBot and their "many indexes" are used for Instant Answers, because as far as I can tell, Instant Answers stand on their own and aren't supported by a web crawler or indexes.  I also haven't found any search results that have been marked as coming from the DuckDuckBot or its indexes.

The second help page (located at help.duckduckgo.com/duckduckgo-help-pages/results/number-of-results) goes even further.  It says:

"We get results from a variety of sources (including our own).  Because of this unique way of generating results, we cannot easily determine the number of results for a particular search ahead of time.  That's why we do not display such a number in our search results."

Take a Look Manually

If you are interested in seeing the data and verifying my claims for yourself, I would encourage you to follow this handy step-by-step guide:

1.)  Begin by opening a web browser (I have only tested this in Firefox and Chrome) and pressing Ctrl+Shift+I to open the Developer Tools.  The panel will pop up from the bottom or side of the window, and you'll need to look at the top of it to find the Network tab.  Click on that.

2.)  Next, navigate to duckduckgo.com and initiate a search.  It doesn't matter what you search for, just pick something random like "cats in boxes".

3.)  When the search result page loads, take a look at the network traffic.  A bunch of stuff will show up in there, but you're only interested in one thing.  If you're using Firefox, look at the Domain column for a network request made to links.duckduckgo.com.  You'll probably need to scroll up to the top to see it.  If you're using Google Chrome, you need to find the document that has the type Script and a name that starts with something like d.js?q=cats%20in%20boxes&l=us-en&s=0&a=h_....  Make sure that it starts with d.js and not t.js.

4.)  Double-click on the document/network request that you found to open it in a new browser tab.
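
The request name from step 3 is an ordinary URL with a query string, so it can be picked apart with Python's standard library. The following is a minimal sketch; the URL is a shortened, hypothetical version of the one shown in the Network tab, and the real request carries additional parameters that are left out here.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical request URL, modelled on the name shown in the Network tab.
# The real one has more parameters, which are omitted here.
url = "https://links.duckduckgo.com/d.js?q=cats%20in%20boxes&l=us-en&s=0"

parts = urlsplit(url)
params = parse_qs(parts.query)

print(parts.netloc)      # host the results are fetched from
print(parts.path)        # should be /d.js, not /t.js
print(params["q"][0])    # the decoded search term
print(params["l"][0])    # the l parameter from the URL
```

Checking the path this way is a quick guard against picking the t.js request by mistake.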

At this point, you will be greeted with a massive JavaScript document.

It looks scary and hard to read, but nestled somewhere inside all that mess is the list of search results.  You can try to find information just by reading through the code, but I recommend using your web browser's Find in Page tool by pressing Ctrl+F.  If you're using Firefox, check the Highlight All box when the tool pops up.

There are a few things that you can search the page for to get advanced data on the results.

If you search for "a" (including the quotes) then all of the descriptions for the search results will be shown.

If you search for "e" you will see some timestamps, which are only present on some of the results.

"u" will give you the URLs of the search results.

"t" will give you the titles.

"i" will give you the domain names.

"da" seems to provide some sort of category/grouping scheme.

And of course, "s" appears to always have a value of bingv7aa.

There are also a few keys that I do not know the meaning of, such as "k" which always seems to be null, and "m" which always seems to be 0.
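
The key list above can be written down as a small lookup table. This is only a sketch: the meanings are the ones observed in this article, the names KEY_MEANINGS and describe are made up for illustration, and the sample result is invented rather than copied from real DuckDuckGo output.

```python
# Meanings as observed in the article; "k" and "m" are left out because
# their purpose is unknown.
KEY_MEANINGS = {
    "a": "description",
    "e": "timestamp (only on some results)",
    "u": "URL",
    "t": "title",
    "i": "domain name",
    "da": "category/grouping (unclear)",
    "s": "source marker",
}

def describe(result):
    """Return 'meaning: value' lines for the known keys present in one result."""
    return [f"{KEY_MEANINGS[k]}: {result[k]}" for k in KEY_MEANINGS if k in result]

# Made-up result for illustration only; not real DuckDuckGo output.
sample = {
    "a": "An example description.",
    "u": "https://example.com/",
    "t": "Example Title",
    "i": "example.com",
    "s": "bingv7aa",
    "k": None,
    "m": 0,
}

for line in describe(sample):
    print(line)
```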

The Python Scripts

As I mentioned previously, it would be very difficult to run many tests with the above method.  It is therefore of great benefit to have a computer program (or two) to help out.  

The first script that I wrote (ddg.py) is a Python library that can be used to make sequential requests to links.duckduckgo.com, and retrieve all available results for a given search.

Every time you use it to make a search, it first needs to run the query through DuckDuckGo's normal website in order to get something called a VQD.  I don't actually know what this is for (I presume it is some sort of unique session identifier), but links.duckduckgo.com won't return anything without it.
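
As a rough illustration of that step, the sketch below applies the same pattern that getVQD() in ddg.py uses to a made-up fragment of page source. The helper name extract_vqd, the sample fragment, and the number in it are all hypothetical; unlike getVQD(), this version returns None instead of raising an error when nothing matches.

```python
import re

# Same pattern as getVQD() in ddg.py: a digit, a dash, then more digits.
VQD_PATTERN = re.compile(r',vqd="([0-9]-[0-9]*)"')

def extract_vqd(page):
    """Return the VQD string found in the page source, or None if absent."""
    match = VQD_PATTERN.search(page)
    return match.group(1) if match else None

# Made-up fragment standing in for the search page's source; the number is fake.
sample_page = 'something("/d.js"),vqd="3-1234567890",other="x"'

print(extract_vqd(sample_page))
print(extract_vqd("no token here"))
```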

Once it has the VQD number, it can proceed to fetch the search results one page at a time, importing each page as a Python list.  These lists are then concatenated into a single list for use.
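
The fetch-and-concatenate step can be sketched without touching the network. The example below assumes, as ddg.py does, that each response wraps a JSON array in a load('d', ...) call and that the last element of the array is a pager entry whose "n" key (when present) points to the next page. The function name parse_page and both sample pages are invented for illustration.

```python
import json
import re

def parse_page(js_text):
    """Split one d.js-style response into (results, next_path)."""
    match = re.search(r"load\('d',(\[.*\}\])\);", js_text, re.DOTALL)
    if match is None:
        return [], None
    # ddg.py strips tab characters before parsing, so this sketch does too.
    items = json.loads(match.group(1).replace("\t", ""))
    # The last element is treated as the pager entry, not a search result.
    return items[:-1], items[-1].get("n")

# Made-up responses standing in for two consecutive pages.
page1 = """junk();load('d',[{"t":"One","s":"bingv7aa"},{"n":"d.js?q=x&s=30"}]);more();"""
page2 = """junk();load('d',[{"t":"Two","s":"bingv7aa"},{}]);more();"""

all_results = []
for text in (page1, page2):
    results, next_path = parse_page(text)
    all_results += results          # concatenate the per-page lists
    print(len(results), "result(s), next page:", next_path)

print([r["t"] for r in all_results])
```

Returning the next-page path alongside the results keeps the paging decision with the caller, which is where ddg.py asks the user for confirmation.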

The second script that I wrote (ddg_analysis.py) imports the first one, and after fetching all the results for a user-specified search term, it displays the value of key "s" for each search result, along with the URL and a snippet of the title.

Both scripts are user agents that have been designed for responsible non-robot use, and as such require confirmation from the user before loading each web page.

I have run many searches with these tools, with many variations in search terms.  I've tried common words, obscure words, and various phrases, and even with the additional efficiency afforded by the use of my scripts, I have never come across a result that wasn't labeled with bingv7aa.
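
A check like this one is easy to automate once the results are in a Python list. The sketch below counts the values of "s" and picks out any result that is missing the key or carries a different value. The function names and the sample data are hypothetical; with the scripts from this article, the list would come from ddg.fetchAll() instead.

```python
from collections import Counter

def tally_sources(results):
    """Count the values of the "s" key, using a placeholder when it is missing."""
    return Counter(r.get("s", "Not Available") for r in results)

def unexpected(results, marker="bingv7aa"):
    """Return the results whose "s" value is missing or differs from marker."""
    return [r for r in results if r.get("s") != marker]

# Made-up results for illustration; not real DuckDuckGo output.
sample = [
    {"t": "One", "u": "https://example.com/1", "s": "bingv7aa"},
    {"t": "Two", "u": "https://example.com/2", "s": "bingv7aa"},
    {"t": "Three", "u": "https://example.com/3"},
]

print(tally_sources(sample))
print(unexpected(sample))
```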

If you are interested in running the scripts for yourself, or adapting them to your own purposes, you will find them at the end of this article.  If you decide to run them, make sure that they are in separate files in the same directory.

Additionally, I recommend running them in IDLE with full-screen mode on.

This is one of those issues that I actually want to be wrong about, so I am closing out this article with a plea to my readers.

If you can think of anything other than "Bing Web Search API v7" that might be meant by the string bingv7aa, or if you can find some search results that are not marked with this identifier, please send a letter to 2600 Magazine about your findings.

I'm sure we all want to know.

ddg.py:

#
# *** ddg.py ***
#
# This is a Python library for fetching search results from DuckDuckGo.
# It gets search results directly from links.duckduckgo.com.

from urllib.request import urlopen, Request
import re, json

headers = {'User-Agent': 'ddg.py'}

def loadPage(url):
    #
    # WARNING: The following line of code is necessary to make this program
    # a user agent rather than a robot. The user decides when and if
    # they want to load more pages. You are strongly encouraged not
    # to remove or "comment out" the following line.
    #
    input("\n[???] ENTER to fetch web page, CTRL+C to cancel ")
    page = urlopen(Request(url, headers=headers)).read().decode("utf-8")
    return page

def getVQD(page):
    # Pull the VQD value (a digit, a dash, then more digits) out of the page source
    return re.search(',vqd="[0-9]-[0-9]*"', page)[0].replace(',vqd="', "").replace('"', "")

def fetchAll(search):
    resultsList = []

    searchTerm = search.replace(" ", "+")
    print("[DDG] Search term is: " + searchTerm)

    # Get the VQD of this search from the first human readable page
    print("[DDG] Fetching first human readable page...")
    currentUrl = "https://duckduckgo.com/?q=" + searchTerm +"&ia=web"
    currentPage = loadPage(currentUrl)
    print("[DDG] Extracting VQD number...")
    VQD = getVQD(currentPage)
    print("[DDG] VQD number is: " + VQD)

    # Use the VQD to access the links subdomain
    print("[DDG] Getting JSON format SERP from links.duckduckgo.com...")
    currentUrl = ("https://links.duckduckgo.com/d.js?q=" + searchTerm + "&s=0&vqd=" + VQD)

    resultsFromLastPage = ['']

    while True:
        currentPage = loadPage(currentUrl)

        # Extract the results in JSON format
        try:
            JSONresultsString = (re.search(r"load\(\'d\'\,.*}]\);", currentPage)[0].replace("load('d',", "")[0:-2])
        except TypeError:
            # re.search() returned None: no results block on this page
            break

        # Add the current page of JSON results to the results list
        resultsFromCurrentPage = json.loads(JSONresultsString.replace("\t", ""))
        if resultsFromCurrentPage[0:-1] == resultsFromLastPage[0:-1]:
            print("[DDG] Current page identical to last, assuming end reached")
            break
        resultsFromLastPage = resultsFromCurrentPage
        resultsList += resultsFromCurrentPage[0:-1]
        print("[DDG] Got " + str(len(resultsFromCurrentPage[0:-1])) + " results from current page")

        # Move to next page
        print("[DDG] Moving to next page...")
        try:
            currentUrl = ("https://links.duckduckgo.com/" + resultsFromCurrentPage[-1]['n'])
        except (KeyError, IndexError, TypeError):
            print("[DDG] End of results")
            break

    return resultsList

if __name__ == "__main__":
    # If running as main program, get search term from user and tell user how
    # to use the results object
    results = fetchAll(input("\n[???] Search term: "))
    print("""
[***] To look at the results, browse the list called 'results' using the below
[***] console. For example, try typing: results[0]['a']
[***] This will show you the description of the first result.""")

ddg_analysis.py:

#
# *** ddg_analysis.py ***
#
# This is a script that uses the ddg library to show the user detailed
# information about each search result.
#
import ddg

while True:
    # Fetch the search results for a user specified search term
    results = ddg.fetchAll(input("\n[???] Search term: "))

    print("")
    i = 1

    # Print out a list of data
    for result in results:
        title = result["t"][0:24]
        if len(title) < 24:
            title += (" " * (27 - len(title)))
        else:
            title += "..."
        # Fall back to a placeholder when a key is missing from the result
        source = result.get("s", "Not Available")
        timestamp = result.get("e", "****** Not Available ******")
        print(" " + str(i) + ":" + (" " * (5 - len(str(i)))) 
                + "'s': " + source + " " 
                + "Title: " + title + " "
                # Uncomment the following line to also print out the timestamp
                #+ "Timestamp: " + timestamp + " "
                + "URL: " + result["u"])
        i += 1
