postprocessing.repeat_detection_elimination package

This package contains tools for running MegaDetector’s repeat detection elimination (RDE) process, for quickly getting rid of false positives that are frequently detected as objects of interest. The RDE page on GitHub provides documentation about how to run that process.

Submodules

postprocessing.repeat_detection_elimination.find_repeat_detections module

find_repeat_detections.py

If you want to use this script, we recommend that you read the RDE user’s guide:

https://github.com/agentmorris/MegaDetector/tree/main/megadetector/postprocessing/repeat_detection_elimination

Really, don’t try to run this script without reading the user’s guide; you’ll think it’s more magical than it is.

This script looks through a sequence of detections in the API output json file, and finds candidates that might be “repeated false positives”, e.g. that random branch that the detector thinks is an animal/person/vehicle.

Typically after running this script, you would do a manual step to remove true positives, then run remove_repeat_detections to produce a final output file.

There’s no way that statement was self-explanatory; see the user’s guide.

This script is just a command-line driver for repeat_detections_core.py.

find_repeat_detections - CLI interface

find_repeat_detections [-h] [--outputFile OUTPUTFILE] [--imageBase IMAGEBASE]
                       [--outputBase OUTPUTBASE] [--confidenceMin CONFIDENCEMIN]
                       [--confidenceMax CONFIDENCEMAX] [--iouThreshold IOUTHRESHOLD]
                       [--occurrenceThreshold OCCURRENCETHRESHOLD]
                       [--minSuspiciousDetectionSize MINSUSPICIOUSDETECTIONSIZE]
                       [--maxSuspiciousDetectionSize MAXSUSPICIOUSDETECTIONSIZE]
                       [--maxImagesPerFolder MAXIMAGESPERFOLDER]
                       [--excludeClasses EXCLUDECLASSES [EXCLUDECLASSES ...]]
                       [--pass_detections_to_processes_method PASS_DETECTIONS_TO_PROCESSES_METHOD]
                       [--nWorkers NWORKERS] [--parallelizationUsesProcesses]
                       [--filterFileToLoad FILTERFILETOLOAD] [--omitFilteringFolder]
                       [--debugMaxDir DEBUGMAXDIR] [--debugMaxRenderDir DEBUGMAXRENDERDIR]
                       [--debugMaxRenderDetection DEBUGMAXRENDERDETECTION]
                       [--debugMaxRenderInstance DEBUGMAXRENDERINSTANCE]
                       [--forceSerialComparisons] [--forceSerialRendering]
                       [--maxOutputImageWidth MAXOUTPUTIMAGEWIDTH]
                       [--lineThickness LINETHICKNESS] [--boxExpansion BOXEXPANSION]
                       [--nDirLevelsFromLeaf NDIRLEVELSFROMLEAF] [--bRenderOtherDetections]
                       [--bRenderDetectionTiles]
                       [--detectionTilesPrimaryImageWidth DETECTIONTILESPRIMARYIMAGEWIDTH]
                       inputFile

find_repeat_detections positional arguments

inputFile - MD results .json file to process

find_repeat_detections options

-h, --help - show this help message and exit
--outputFile OUTPUTFILE - .json file to write filtered results to… do not use this if you are going to do manual review of the repeat detection images (which you should)
--imageBase IMAGEBASE - Image base dir
--outputBase OUTPUTBASE - filtering folder output dir
--confidenceMin CONFIDENCEMIN - Detection confidence threshold; don’t process anything below this
--confidenceMax CONFIDENCEMAX - Detection confidence threshold; don’t process anything above this
--iouThreshold IOUTHRESHOLD - Detections with IOUs greater than this are considered "the same detection"
--occurrenceThreshold OCCURRENCETHRESHOLD - More than this many near-identical detections in a group (e.g. a folder) is considered suspicious
--minSuspiciousDetectionSize MINSUSPICIOUSDETECTIONSIZE - Detections smaller than this fraction of image area are not considered suspicious
--maxSuspiciousDetectionSize MAXSUSPICIOUSDETECTIONSIZE - Detections larger than this fraction of image area are not considered suspicious
--maxImagesPerFolder MAXIMAGESPERFOLDER - Ignore folders with more than this many images in them
--excludeClasses EXCLUDECLASSES - List of integer classes we don’t want to treat as suspicious, separated by spaces.
--pass_detections_to_processes_method PASS_DETECTIONS_TO_PROCESSES_METHOD - Pass detections information to/from workers via "memory" (default) or "files"
--nWorkers NWORKERS - Level of parallelism for rendering and IOU computation
--parallelizationUsesProcesses - Parallelize with processes (defaults to threads)
--filterFileToLoad FILTERFILETOLOAD - Path to detectionIndex.json, which should be inside a folder of images that are manually verified to _not_ contain valid animals
--omitFilteringFolder - Skip creating the filtering folder of rendered detections
--debugMaxDir DEBUGMAXDIR - For debugging only, limit the number of directories we process
--debugMaxRenderDir DEBUGMAXRENDERDIR - For debugging only, limit the number of directories we render
--debugMaxRenderDetection DEBUGMAXRENDERDETECTION - For debugging only, limit the number of detections we process per folder
--debugMaxRenderInstance DEBUGMAXRENDERINSTANCE - For debugging only, limit the number of instances we process per detection
--forceSerialComparisons - Disable parallelization during the comparison stage
--forceSerialRendering - Disable parallelization during the rendering stage
--maxOutputImageWidth MAXOUTPUTIMAGEWIDTH - Maximum output size for thumbnail images
--lineThickness LINETHICKNESS - Line thickness for thumbnail images
--boxExpansion BOXEXPANSION - Box expansion for thumbnail images
--nDirLevelsFromLeaf NDIRLEVELSFROMLEAF - Number of levels from the leaf folders to use for repeat detection (0 == leaves)
--bRenderOtherDetections - Show non-target detections in light gray on each image
--bRenderDetectionTiles - Should we render a grid showing every instance (up to a limit) for each detection?
--detectionTilesPrimaryImageWidth DETECTIONTILESPRIMARYIMAGEWIDTH - The width of the main image when rendering images with detection tiles
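
As a rough illustration of how the options above fit together, the sketch below assembles (but does not run) an argument list for a first pass. Only the flag names come from the usage text above; the file paths, the threshold values, and the idea of launching the script via `python -m` with this module path are assumptions made for the example.

```python
import shlex
import sys


def build_find_repeat_detections_command(input_file, image_base, output_base,
                                         iou_threshold=0.9, occurrence_threshold=10):
    """Assemble (but do not run) a find_repeat_detections command line.

    The threshold values are illustrative, not the script's defaults.
    """
    # Assumption: the script can be launched as a module under this dotted path
    module = ('megadetector.postprocessing.repeat_detection_elimination.'
              'find_repeat_detections')
    return [
        sys.executable, '-m', module,
        input_file,
        '--imageBase', image_base,
        '--outputBase', output_base,
        '--iouThreshold', str(iou_threshold),
        '--occurrenceThreshold', str(occurrence_threshold),
    ]


# Hypothetical paths, for illustration only
cmd = build_find_repeat_detections_command('md_results.json', '/data/images', '/data/rde')
print(shlex.join(cmd))
```

Note that --outputFile is deliberately left out here, in line with the advice above to skip it when you plan to review the repeat detection images manually.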

postprocessing.repeat_detection_elimination.remove_repeat_detections module

remove_repeat_detections.py

Used after running find_repeat_detections, then manually filtering the results, to create a final filtered output file.

If you want to use this script, we recommend that you read the RDE user’s guide:

https://github.com/agentmorris/MegaDetector/tree/main/megadetector/postprocessing/repeat_detection_elimination

megadetector.postprocessing.repeat_detection_elimination.remove_repeat_detections.remove_repeat_detections(input_file, output_file, filtering_dir)[source]

Given an index file that was produced in a first pass through find_repeat_detections, and a folder of images (from which the user has deleted the images corresponding to detections they don’t want removed), remove the identified repeat detections from a set of MD results and write to a new file.

Parameters:

input_file (str) – .json file of MD results, from which we should remove repeat detections
output_file (str) – output .json file to which we should write MD results (with repeat detections removed)
filtering_dir (str) – the folder produced by find_repeat_detections, containing a detectionIndex.json file

remove_repeat_detections - CLI interface

remove_repeat_detections [-h] input_file output_file filtering_dir

remove_repeat_detections positional arguments

input_file - .json file containing the original, unfiltered API results
output_file - .json file to which you want to write the final, filtered API results
filtering_dir - directory where you looked at lots of images and decided which ones were really false positives

remove_repeat_detections options

-h, --help - show this help message and exit

postprocessing.repeat_detection_elimination.repeat_detections_core module

repeat_detections_core.py

Core utilities shared by find_repeat_detections and remove_repeat_detections.

Nothing in this file (in fact nothing in this subpackage) will make sense until you read the RDE user’s guide:

https://github.com/agentmorris/MegaDetector/tree/main/megadetector/postprocessing/repeat_detection_elimination

class megadetector.postprocessing.repeat_detection_elimination.repeat_detections_core.DetectionLocation(instance, detection, relative_dir, category, id=None)[source]

Bases: object

A unique-ish detection location, meaningful in the context of one directory. All detections within an IoU threshold of self.bbox will be stored in IndexedDetection objects.

bbox: bbox as x,y,w,h

category: category ID (not name) for this detection

clusterLabel: only used when doing cluster-based sorting

id: ID for this detection; this ID is only guaranteed to be unique within a directory

instances: list of IndexedDetections that match this detection

relativeDir: relative folder (i.e., camera name) in which this detectin was found

sampleImageDetections: list of detections on that canonical image that match this detection

sampleImageRelativeFileName: relative path to the canonical image representing this detection

to_api_detection()[source]

Converts this detection to a ‘detection’ dictionary, making the semi-arbitrary assumption that the first instance is representative of confidence.

Returns: dictionary in the format used to store detections in MD results
Return type: dict
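
Since DetectionLocation groups detections by IoU against its bbox, it may help to see what that comparison looks like for boxes stored as x,y,w,h. This is a minimal sketch, not the library's own implementation; the function name iou_xywh is made up for the example.

```python
def iou_xywh(box_a, box_b):
    """Intersection-over-union of two boxes given as [x, y, w, h]."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Width and height of the overlapping rectangle (zero if the boxes are disjoint)
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    intersection = inter_w * inter_h
    union = aw * ah + bw * bh - intersection
    return intersection / union if union > 0 else 0.0


# Two 2x2 boxes offset by 1 in x: they share a 1x2 strip out of a combined area of 6
print(iou_xywh([0, 0, 2, 2], [1, 0, 2, 2]))
```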

class megadetector.postprocessing.repeat_detection_elimination.repeat_detections_core.IndexedDetection(i_detection=-1, filename='', bbox=None, confidence=-1, category='unknown')[source]

Bases: object

A single detection event on a single image

bbox: [x_min, y_min, width_of_box, height_of_box]

category: category ID (not name) of this detection

confidence: confidence value of this detection

filename: path to the image corresponding to this detection

i_detection: index of this detection within all detections for this filename

class megadetector.postprocessing.repeat_detection_elimination.repeat_detections_core.RepeatDetectionOptions[source]

Bases: object

Options that control the behavior of repeat detection elimination

bFailOnRenderError: Determines whether bounding-box rendering errors (typically network errors) should be treated as failures

bParallelizeComparisons: Should we parallelize (across cameras) comparisons to find repeat detections?

bParallelizeRendering: Should we parallelize image rendering?

bPrintMissingImageWarnings: Should we print a warning if images referred to in the MD results file are missing?

bRenderDetectionTiles: Optionally show a grid that includes a sample image for the detection, plus the top N additional detections

bRenderOtherDetections: Optionally show other detections (i.e., detections other than the one the user is evaluating), typically in a light gray.

bWriteFilteringFolder: Should we write the folder of images used to manually review repeat detections?

boxExpansion: Box expansion (in pixels)

categoryAgnosticComparisons: If this is False (default), a detection from class A is not considered to be “the same” as a detection from class B, even if they’re at the same location.

confidenceMax: Don’t consider detections with confidence higher than this as suspicious

confidenceMin: Don’t consider detections with confidence lower than this as suspicious

customDirNameFunction

An optional function that takes a string (an image file name) and returns a string (the corresponding folder ID), typically used when multiple folders actually correspond to the same camera in a manufacturer-specific way (e.g. a/b/c/RECONYX100 and a/b/c/RECONYX101 may really be the same camera).

See ct_utils for a common replacement function that handles most common manufacturer folder names:

from megadetector.utils import ct_utils
self.customDirNameFunction = ct_utils.image_file_to_camera_folder
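
If you would rather write your own function, a minimal sketch might look like the following. The function name and the folder-naming rule (a trailing folder called RECONYX followed by digits, as in the example above) are hypothetical; real manufacturer folder names vary, which is why ct_utils is usually the better starting point.

```python
import posixpath
import re


def reconyx_folder_to_camera(image_file_name):
    """Hypothetical customDirNameFunction: map an image path to a camera folder ID."""
    # Normalize separators so the same logic works for Windows-style paths
    folder = posixpath.dirname(image_file_name.replace('\\', '/'))
    # Drop a trailing manufacturer folder such as "RECONYX100", so that
    # a/b/c/RECONYX100 and a/b/c/RECONYX101 both map to a/b/c
    parent, leaf = posixpath.split(folder)
    if re.fullmatch(r'RECONYX\d+', leaf):
        return parent
    return folder


print(reconyx_folder_to_camera('a/b/c/RECONYX100/img0001.jpg'))
print(reconyx_folder_to_camera('a/b/c/RECONYX101/img0002.jpg'))
```

A function like this would be assigned to customDirNameFunction on a RepeatDetectionOptions object before running the analysis.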

debugMaxDir: For debugging only, limit comparisons to a specific number of folders

debugMaxRenderDetection: For debugging only, limit comparisons to a specific number of detections

debugMaxRenderDir: For debugging only, limit rendering to a specific number of folders

debugMaxRenderInstance: For debugging only, limit comparisons to a specific number of instances

detectionTilesCroppedGridWidth

Width to use for the grid of detection instances.

Can be a width in pixels, or a number from 0 to 1 representing a fraction of the primary image width.

If you want to render the grid at exactly 1 pixel wide, I guess you’re out of luck.
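
The pixels-or-fraction convention can be sketched as follows. This is only an illustration of the rule described above, not the library's code; the function name and the rounding behavior are assumptions.

```python
def resolve_grid_width(grid_width, primary_image_width):
    """Interpret a grid width as either a fraction of the primary image or pixels."""
    # Values in (0, 1] are read as a fraction of the primary image width,
    # which is why a literal 1-pixel grid cannot be requested
    if 0 < grid_width <= 1:
        return int(round(grid_width * primary_image_width))
    # Anything larger is taken as a width in pixels
    return int(grid_width)


print(resolve_grid_width(0.5, 1000))
print(resolve_grid_width(250, 1000))
```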

detectionTilesMaxCrops: Maximum number of individual detection instances to include in the mosaic

detectionTilesPrimaryImageLocation: Location of the primary image within the mosaic (‘right’ or ‘left’)

detectionTilesPrimaryImageWidth

Width of the original image (within the larger output image) when bRenderDetectionTiles is True.

If this is None, we’ll render the original image in the detection tile image at its original width.

excludeClasses

A list of category IDs (ints) that we don’t want to consider as candidate repeat detections.

Typically used to say, e.g., “don’t bother analyzing people or vehicles for repeat detections”, which you could do by saying excludeClasses = [2,3].

excludeFolders: Exclude specific folders, mutually exclusive with [includeFolders]

filenameReplacements: Replace filename tokens after reading, useful when the directory structure has changed relative to the structure the detector saw.

filterFileToLoad: If this is not empty, we’ll load detections from a filter file rather than finding them from the detector output. This should be a .json file containing detections, generally this is the detectionIndex.json file in the filtering_* folder produced by find_repeat_detections().

filteredFileListToLoad

(optional) List of filenames remaining after deletion of identified repeated detections that are actually animals. This should be a flat text file, one relative filename per line.

This is a pretty esoteric code path and a candidate for removal.

The scenario where I see it being most useful is the very hypothetical one where we use an external tool for image handling that allows us to do something smarter and less destructive than deleting images to mark them as non-false-positives.

imageBase

Folder where images live; filenames in the MD results .json file should be relative to this folder.

imageBase can also be a SAS URL, in which case some error-checking is disabled.

includeFolders: Include only specific folders, mutually exclusive with [excludeFolders]

iouThreshold: What’s the IOU threshold for considering two boxes the same?

lineThickness: Line thickness (in pixels) for box rendering

maxImagesPerFolder: Ignore folders with more than this many images in them

maxOutputImageWidth

Image width for rendered images (it’s called “max” because we don’t resize smaller images).

Original size is preserved if this is None.

This does not include the tile image grid.
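
To make the "max" behavior concrete, here is a small sketch of the sizing rule: wider images are scaled down, narrower ones are left alone. The function name is made up, and preserving the aspect ratio is an assumption on our part.

```python
def rendered_size(width, height, max_output_image_width=None):
    """Return the (width, height) an image would be rendered at."""
    # None means "keep the original size"; smaller images are never enlarged
    if max_output_image_width is None or width <= max_output_image_width:
        return width, height
    # Assumption: height is scaled by the same factor as width
    scale = max_output_image_width / width
    return max_output_image_width, int(round(height * scale))


print(rendered_size(1600, 1200, 800))
print(rendered_size(640, 480, 800))
```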

maxSuspiciousDetectionSize: Ignore “suspicious” detections larger than some size; these are often animals taking up the whole image. This is expressed as a fraction of the image size.

minSuspiciousDetectionSize: Ignore “suspicious” detections smaller than some size

missingImageWarningType: If bPrintMissingImageWarnings is True, should we print a warning about missing images just once (‘once’) or every time (‘all’)?

nDirLevelsFromLeaf

How many folders up from the leaf nodes should we be going to aggregate images into cameras?

If this is zero, each leaf folder is treated as a camera.
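
One way to picture this setting is as repeated "go up one folder" steps starting from the folder that contains the image. The sketch below illustrates the idea only; the function name and the example paths are hypothetical.

```python
import posixpath


def camera_folder(relative_image_path, n_dir_levels_from_leaf=0):
    """Return the folder that would be treated as the camera for an image."""
    # Start at the leaf folder that directly contains the image
    folder = posixpath.dirname(relative_image_path.replace('\\', '/'))
    # Walk up one level per requested step
    for _ in range(n_dir_levels_from_leaf):
        folder = posixpath.dirname(folder)
    return folder


print(camera_folder('site1/cam3/100RECNX/img0001.jpg', 0))
print(camera_folder('site1/cam3/100RECNX/img0001.jpg', 1))
```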

nWorkers: Number of workers to use for parallel operations

occurrenceThreshold: How many occurrences of a single location (as defined by the IOU threshold) are required before we declare it suspicious?

otherDetectionsColors

If bRenderOtherDetections is True, what color should we use to render the (hopefully pretty subtle) non-target detections?

In theory I’d like these “other detection” rectangles to be partially transparent, but this is not straightforward, and the alpha is ignored here. But maybe if I leave it here and wish hard enough, someday it will work.

otherDetectionsColors = [‘dimgray’]

otherDetectionsLineWidth: Line width (in pixels) for other detections

otherDetectionsThreshold: Threshold to use for other detections

outputBase: Folder where we should write temporary output.

parallelizationUsesThreads

Should we use threads (True) or processes (False) for parallelization?

Not relevant if nWorkers <= 1, or if bParallelizeComparisons and bParallelizeRendering are both False.

pass_detections_to_processes_method

For very large sets of results, passing chunks of results to and from workers as parameters (‘memory’) can be memory-intensive, so we can serialize to intermediate files instead (‘file’).

The use of ‘file’ here is still experimental.

smartSort

Sort detections within a directory so nearby detections are adjacent in the list, for faster review.

Can be None, ‘xsort’, or ‘clustersort’

None sorts detections chronologically by first occurrence
‘xsort’ sorts detections from left to right
‘clustersort’ clusters detections and sorts by cluster

smartSortDistanceThreshold: Only relevant if smartSort == ‘clustersort’

class megadetector.postprocessing.repeat_detection_elimination.repeat_detections_core.RepeatDetectionResults[source]

Bases: object

The results of an entire repeat detection analysis

detectionResults: The data table (Pandas DataFrame), as loaded from the input json file via load_api_results(). Has columns [‘file’, ‘detections’,’failure’].

detectionResultsFiltered: The data table after modification

filename_to_row: dict mapping filenames to rows in the master table

filterFile: The location of the .json file written with information about the RDE review images (typically detectionIndex.json)

otherFields: The other fields in the input json file, loaded via load_api_results()

rows_by_directory: dict mapping folder names to whole rows from the data table

suspicious_detections: An array of length nDirs, where each element is a list of DetectionLocation objects for that directory that have been flagged as suspicious

megadetector.postprocessing.repeat_detection_elimination.repeat_detections_core.find_repeat_detections(input_filename, output_file_name=None, options=None)[source]

Find detections in a MD results file that occur repeatedly and are likely to be rocks/sticks.

Parameters:

input_filename (str) – the MD results .json file to analyze
output_file_name (str, optional) – the filename to which we should write results with repeat detections removed, typically set to None during the first part of the RDE process.
options (RepeatDetectionOptions, optional) – all the interesting options controlling this process; see RepeatDetectionOptions for details.

Returns:

results of the RDE process; see RepeatDetectionResults for details.

Return type:

RepeatDetectionResults
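
Putting the pieces on this page together, a two-pass run from Python might be sketched as below. The function and option names are the ones documented above; the wrapper function names, the threshold values, and the assumption that the filtering folder is the folder containing results.filterFile are ours. The manual review step in between is the part the user's guide covers, so please read it before relying on anything like this.

```python
import os


def run_first_pass(md_results_file, image_base, output_base):
    """First pass: find candidate repeat detections and write the review folder."""
    from megadetector.postprocessing.repeat_detection_elimination import (
        repeat_detections_core)

    options = repeat_detections_core.RepeatDetectionOptions()
    options.imageBase = image_base
    options.outputBase = output_base
    # Illustrative values, not recommendations
    options.confidenceMin = 0.1
    options.iouThreshold = 0.9
    options.occurrenceThreshold = 10

    # output_file_name stays None during the first part of the process
    results = repeat_detections_core.find_repeat_detections(
        md_results_file, output_file_name=None, options=options)
    return results.filterFile


def run_second_pass(md_results_file, filtered_results_file, filter_file):
    """Second pass: run only after manually reviewing the filtering folder."""
    from megadetector.postprocessing.repeat_detection_elimination import (
        remove_repeat_detections as rrd)

    # Assumption: the filtering folder is the one containing the filter file
    filtering_dir = os.path.dirname(filter_file)
    rrd.remove_repeat_detections(input_file=md_results_file,
                                 output_file=filtered_results_file,
                                 filtering_dir=filtering_dir)
```

The imports live inside the functions so that the sketch can be pasted into a script and adapted without importing the package until a pass is actually run.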