CSci 4511 - Homework 6 helper instructions

How to download images

  1. Run a Google image search for the category you want (e.g. go to https://images.google.com/ and look up "cats"). You might want to open these instructions in one tab and your image search in another, and display them side by side. I did this in Chrome; the steps in Firefox might be slightly different.
  2. In the same tab or window as your Google search, open the JavaScript console, e.g.
    Tools => Developer Tools => Console in Chrome, or
    View => Developer => JavaScript Console in Firefox,
    or maybe Tools => JavaScript Console
    once you've opened it before (Chrome made that shortcut for me after I did it the first time).
  3. Scroll down the results page, many, many times. Google only loads more results as you scroll, and we want all the relevant images loaded into the page before we grab their URLs.
  4. Load jQuery into the console by pasting in the following code. Paste the code for this step and the next two steps one block at a time; if you try to copy / paste all of the code from the next 3 steps at once, the console will probably bark at you.

    // pull down jquery into the JavaScript console
    var script = document.createElement('script');
    script.src = "https://ajax.googleapis.com/ajax/libs/jquery/2.2.0/jquery.min.js";
    document.getElementsByTagName('head')[0].appendChild(script);

  5. Use a CSS selector to grab the URLs by pasting the following code into the console:

    // grab the URLs
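    // each .rg_meta element holds JSON metadata; its "ou" field is the original image URL
    var urls = $('.rg_di .rg_meta').map(function() { return JSON.parse($(this).text()).ou; });

  6. Lastly, write the URLs to a file (one URL per line) by pasting this code into the console.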

    // write the URLs to a file (one per line)
    var textToSave = urls.toArray().join('\n');
    var hiddenElement = document.createElement('a');
    // encodeURIComponent also escapes characters such as '#' that would otherwise cut the data URI short
    hiddenElement.href = 'data:text/plain;charset=utf-8,' + encodeURIComponent(textToSave);
    hiddenElement.target = '_blank';
    hiddenElement.download = 'urls.txt';
    hiddenElement.click();

  7. Now you should have a file called "urls.txt" in your Downloads folder (or wherever you download to).
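  8. Download the images with the ./downloadImages.py script. Pass -u the text file containing the URLs to download, and -o the output directory to store the images in (the directory must already exist). The example below assumes you renamed urls.txt to dog_urls.txt and created a directory called dogims. E.g.: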

    python3 ./downloadImages.py -u dog_urls.txt -o dogims

  9. Check all the downloaded images to be sure they make sense by opening them and looking at them yourself. Move images that don't make sense into another folder, e.g. "dog_noise" (we will use these later). E.g. downloading "dog" images will also pull in images of hotdogs; downloading "cat" images may also pull in images with cats and dogs, cats and humans, cats and bunnies, big cats (like leopards), hand-drawn images, etc. Noise is the enemy, but so is a tiny dataset. Use your judgement.
    For example, of the 681 'cats' URLs I collected, I kept only 547 images for training / testing / validation and set aside 61 as noise; the remaining 73 either failed to download or couldn't be read by my system.
  10. Save each category of image in its own folder; all the category folders should sit inside a single folder that you will specify when training your network. E.g. I trained a Cat vs Dog vs Cactus model, so my directory structure is

    datasets
    datasets/Cat
    datasets/Dog
    datasets/Cactus
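
If you want a quick count of how many images ended up in each category folder, something like the sketch below may help. It is not part of the homework code: the function name count_images_per_class and the list of file extensions are my own assumptions, so adjust them to whatever your downloads actually look like.

```python
from pathlib import Path

# Extensions counted as images here. This set is an assumption; adjust as needed.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".bmp"}


def count_images_per_class(dataset_root):
    """Return {category_folder_name: number_of_image_files} for each subfolder.

    Hypothetical helper, not part of the course scripts. Files directly in
    dataset_root and files with other extensions are ignored.
    """
    root = Path(dataset_root)
    counts = {}
    # Sort so the categories always come back in the same (alphabetical) order.
    for class_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        counts[class_dir.name] = sum(
            1
            for f in class_dir.iterdir()
            if f.is_file() and f.suffix.lower() in IMAGE_EXTENSIONS
        )
    return counts
```

Calling count_images_per_class("datasets") on the layout above would give you a dictionary with the keys Cactus, Cat, and Dog. A category with far fewer images than the others is worth a second look before training.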

You're ready! If you're happy with your images (e.g. your Cat folder doesn't include pictures of Cat pepper shakers), then you can continue on to training your network.
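
Before you start training, it can also be worth flagging files that are not really images (for example, a failed download saved under a .jpg name). The sketch below only looks at the first few bytes of each file, and the names sniff_image_type and find_suspect_files are hypothetical helpers of mine, not part of the homework code. It only recognizes JPEG, PNG, and GIF, it will not catch a truncated image, and it is no substitute for looking at the pictures yourself.

```python
from pathlib import Path

# Leading bytes ("magic numbers") of a few common image formats.
SIGNATURES = {
    "jpeg": (b"\xff\xd8\xff",),
    "png": (b"\x89PNG\r\n\x1a\n",),
    "gif": (b"GIF87a", b"GIF89a"),
}


def sniff_image_type(path):
    """Return 'jpeg', 'png', or 'gif' if the file starts with a known signature, else None."""
    with open(path, "rb") as f:
        header = f.read(8)  # the longest signature above is 8 bytes
    for kind, prefixes in SIGNATURES.items():
        if header.startswith(prefixes):
            return kind
    return None


def find_suspect_files(folder):
    """Return a sorted list of files in folder that do not look like a JPEG, PNG, or GIF."""
    return sorted(
        p
        for p in Path(folder).iterdir()
        if p.is_file() and sniff_image_type(p) is None
    )
```

For example, find_suspect_files("datasets/Cat") returns the paths you may want to open by hand and then delete or move to your noise folder.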

Copyright: © 2018 by the Regents of the University of Minnesota
Department of Computer Science and Engineering. All rights reserved.
Comments to: Marie Manner