utils package
This package contains utility functions for string manipulation, filename manipulation, downloading files from URLs, and other tasks that come up in camera trap workflows but aren’t directly related to MegaDetector.
Submodules
utils.ct_utils module
ct_utils.py
Numeric/geometry/array utility functions.
- megadetector.utils.ct_utils.args_to_object(args, obj)[source]
Copies all fields from a Namespace (typically the output from parse_args) to an object. Skips fields starting with _. Does not check existence in the target object.
- Parameters:
args (argparse.Namespace) – the namespace to convert to an object
obj (object) – object whose attributes will be updated
- Returns:
the modified object (modified in place, but also returned)
- Return type:
object
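The copy behavior described above can be sketched in a few lines. This is an illustration under stated assumptions, not the library's implementation: copy_namespace_fields and the Options class are hypothetical names used only for this example.

```python
import argparse

class Options:
    """Hypothetical options class, for illustration only."""
    def __init__(self):
        self.threshold = 0.5

def copy_namespace_fields(args, obj):
    # Copy every field that does not start with "_" onto obj; no check
    # is made that the attribute already exists on the target.
    for name, value in vars(args).items():
        if not name.startswith('_'):
            setattr(obj, name, value)
    return obj

args = argparse.Namespace(threshold=0.2, verbose=True, _private=1)
options = copy_namespace_fields(args, Options())
print(options.threshold, options.verbose, hasattr(options, '_private'))
```

Because existence is not checked, a misspelled argument name silently creates a new attribute rather than raising an error.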
- megadetector.utils.ct_utils.compare_values_nan_equal(v0, v1)[source]
Utility function for comparing two values when we want to return True if both values are NaN.
- Parameters:
v0 (object) – the first value to compare
v1 (object) – the second value to compare
- Returns:
True if v0 == v1, or if both v0 and v1 are NaN
- Return type:
bool
- megadetector.utils.ct_utils.convert_xywh_to_xyxy(api_box)[source]
Converts an xywh bounding box (the MD output format) to an xyxy bounding box (the format produced by TF-based MD models).
- Parameters:
api_box (list) – bbox formatted as [x_min, y_min, width_of_box, height_of_box]
- Returns:
bbox formatted as [x_min, y_min, x_max, y_max]
- Return type:
list
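As a rough sketch of the conversion described above (xywh_to_xyxy_sketch is a hypothetical name, not the library function), the maximum corner is the minimum corner plus the box size:

```python
def xywh_to_xyxy_sketch(box):
    # box is [x_min, y_min, width, height]
    x_min, y_min, w, h = box
    return [x_min, y_min, x_min + w, y_min + h]

print(xywh_to_xyxy_sketch([0.25, 0.5, 0.5, 0.25]))  # [0.25, 0.5, 0.75, 0.75]
```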
- megadetector.utils.ct_utils.convert_yolo_to_xywh(yolo_box)[source]
Converts a YOLO format bounding box [x_center, y_center, w, h] to [x_min, y_min, width_of_box, height_of_box].
- Parameters:
yolo_box (list) – bounding box of format [x_center, y_center, width_of_box, height_of_box]
- Returns:
bbox with coordinates represented as [x_min, y_min, width_of_box, height_of_box]
- Return type:
list
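A minimal sketch of this conversion, assuming all four values share the same units (yolo_to_xywh_sketch is a hypothetical name): the top-left corner is the center shifted back by half the box size.

```python
def yolo_to_xywh_sketch(yolo_box):
    # yolo_box is [x_center, y_center, width, height]
    x_center, y_center, w, h = yolo_box
    return [x_center - w / 2.0, y_center - h / 2.0, w, h]

print(yolo_to_xywh_sketch([0.5, 0.5, 0.5, 0.25]))  # [0.25, 0.375, 0.5, 0.25]
```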
- megadetector.utils.ct_utils.dict_to_kvp_list(d, item_separator=' ', kv_separator='=', non_string_value_handling='error')[source]
Converts a string <–> string dict into a string containing a list of key-value pairs, i.e., converts {'a':'dog','b':'cat'} to 'a=dog b=cat'. If d is None, returns None. If d is empty, returns ''.
- Parameters:
d (dict) – the dictionary to convert, must contain only strings
item_separator (str, optional) – the delimiter between KV pairs
kv_separator (str, optional) – the separator between a key and its value
non_string_value_handling (str, optional) – what to do with non-string values, can be “omit”, “error”, or “convert”
- Returns:
the string representation of [d]
- Return type:
str
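The documented behavior for string-only dictionaries could look roughly like the following. This sketch uses a hypothetical function name and deliberately omits non_string_value_handling; it is not the library's code.

```python
def dict_to_kvp_string_sketch(d, item_separator=' ', kv_separator='='):
    if d is None:
        return None
    # Joining zero items yields the empty string, matching the documented
    # behavior for an empty dict
    return item_separator.join(f'{k}{kv_separator}{v}' for k, v in d.items())

print(dict_to_kvp_string_sketch({'a': 'dog', 'b': 'cat'}))  # a=dog b=cat
```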
- megadetector.utils.ct_utils.dict_to_object(d, obj)[source]
Copies all fields from a dict to an object. Skips fields starting with _. Does not check existence in the target object.
- Parameters:
d (dict) – the dict to convert to an object
obj (object) – object whose attributes will be updated
- Returns:
the modified object (modified in place, but also returned)
- Return type:
object
- megadetector.utils.ct_utils.environment_is_wsl()[source]
Determines whether we’re running in WSL.
- Returns:
True if we’re running in WSL
- megadetector.utils.ct_utils.get_iou(bb1, bb2)[source]
Calculates the intersection over union (IoU) of two bounding boxes.
Adapted from:
- Parameters:
bb1 (list) – [x_min, y_min, width_of_box, height_of_box]
bb2 (list) – [x_min, y_min, width_of_box, height_of_box]
- Returns:
intersection_over_union, a float in [0, 1]
- Return type:
float
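For reference, IoU over boxes in the [x_min, y_min, width, height] format can be sketched as follows. The function name is hypothetical, and returning 0.0 for non-overlapping boxes is an assumption of this sketch rather than confirmed library behavior.

```python
def iou_xywh_sketch(bb1, bb2):
    x1, y1, w1, h1 = bb1
    x2, y2, w2, h2 = bb2
    # Width and height of the overlap rectangle (non-positive if disjoint)
    inter_w = min(x1 + w1, x2 + w2) - max(x1, x2)
    inter_h = min(y1 + h1, y2 + h2) - max(y1, y2)
    if inter_w <= 0 or inter_h <= 0:
        return 0.0
    intersection = inter_w * inter_h
    union = w1 * h1 + w2 * h2 - intersection
    return intersection / union

# Two 2x2 boxes overlapping in a 1x1 square: IoU = 1 / (4 + 4 - 1)
print(iou_xywh_sketch([0, 0, 2, 2], [1, 1, 2, 2]))
```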
- megadetector.utils.ct_utils.get_max_conf(im)[source]
Given an image dict in the MD output format, computes the maximum detection confidence for any class. Returns 0.0 if there were no detections, if there was a failure, or if ‘detections’ isn’t present.
- Parameters:
im (dict) – image dictionary in the MD output format (with a ‘detections’ field)
- Returns:
the maximum detection confidence across all classes
- Return type:
float
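A minimal sketch of this lookup, assuming each detection dict stores its confidence under a 'conf' key (that key name and the function name are assumptions of this example):

```python
def max_conf_sketch(im):
    detections = im.get('detections')
    # Covers a missing field, None (e.g. a failed image), and an empty list
    if not detections:
        return 0.0
    return max(det['conf'] for det in detections)

im = {'file': 'image001.jpg', 'detections': [
    {'category': '1', 'conf': 0.9, 'bbox': [0.1, 0.1, 0.2, 0.2]},
    {'category': '2', 'conf': 0.4, 'bbox': [0.5, 0.5, 0.1, 0.1]}]}
print(max_conf_sketch(im))  # 0.9
```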
- megadetector.utils.ct_utils.image_file_to_camera_folder(image_fn)[source]
Removes common overflow folders (e.g. RECNX101, RECNX102) from paths, i.e. turn:
a\b\c\RECNX101\image001.jpg
…into:
a/b/c
Returns the same thing as os.path.dirname() (i.e., just the folder name) if no overflow folders are present.
Always converts backslashes to slashes.
- Parameters:
image_fn (str) – the image filename from which we should remove overflow folders
- Returns:
a version of [image_fn] from which camera overflow folders have been removed
- Return type:
str
- megadetector.utils.ct_utils.invert_dictionary(d, verify_unique=False)[source]
Creates a new dictionary that maps d.values() to d.keys()
- Parameters:
d (dict) – dictionary to invert
verify_unique (bool, optional) – error if values are not unique
- Returns:
inverted copy of [d]
- Return type:
dict
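The inversion can be sketched with a dict comprehension. The function name is hypothetical, and the exception type raised for duplicate values is an assumption; the source only says that non-unique values are an error.

```python
def invert_dictionary_sketch(d, verify_unique=False):
    if verify_unique and len(set(d.values())) != len(d):
        raise ValueError('Dictionary values are not unique')
    # Without verification, later keys silently win for duplicate values
    return {v: k for k, v in d.items()}

category_id_to_name = {'1': 'animal', '2': 'person', '3': 'vehicle'}
print(invert_dictionary_sketch(category_id_to_name))
```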
- megadetector.utils.ct_utils.is_empty(v, strip_strings=True)[source]
A common definition of “empty” used throughout the repo, particularly when loading data from .csv files. “empty” includes None, ‘’, and NaN.
- Parameters:
v (obj) – the object to evaluate for emptiness
strip_strings (bool, optional) – if v is a string, should whitespace be considered empty?
- Returns:
True if [v] is None, ‘’, or NaN, otherwise False
- Return type:
bool
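One way to express this definition of “empty” is sketched below (is_empty_sketch is a hypothetical name, and the check order is an assumption):

```python
import math

def is_empty_sketch(v, strip_strings=True):
    if v is None:
        return True
    if isinstance(v, float) and math.isnan(v):
        return True
    if isinstance(v, str):
        # Optionally treat whitespace-only strings as empty
        return (v.strip() if strip_strings else v) == ''
    return False

print(is_empty_sketch('   '), is_empty_sketch('   ', strip_strings=False))  # True False
```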
- megadetector.utils.ct_utils.is_function_name(s, calling_namespace)[source]
Determines whether [s] is a callable function in the global or local scope, or a built-in function.
- Parameters:
s (str) – the string to test for function-ness
calling_namespace (dict) – typically pass the output of locals()
- megadetector.utils.ct_utils.is_iterable(x)[source]
Uses duck typing to assess whether [x] is iterable (list, set, dict, etc.).
- Parameters:
x (object) – the object to test
- Returns:
True if [x] appears to be iterable, otherwise False
- Return type:
bool
- megadetector.utils.ct_utils.is_list_sorted(L, reverse=False)[source]
Returns True if the list L appears to be sorted, otherwise False.
Calling is_list_sorted(L,reverse=True) is the same as calling is_list_sorted(L[::-1],reverse=False).
- Parameters:
L (list) – list to evaluate
reverse (bool, optional) – whether to reverse the list before evaluating sort status
- Returns:
True if the list L appears to be sorted, otherwise False
- Return type:
bool
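A short sketch of a sortedness check with the same reverse semantics (is_sorted_sketch is a hypothetical name; treating ties as sorted is an assumption of this example):

```python
def is_sorted_sketch(values, reverse=False):
    pairs = list(zip(values, values[1:]))
    if reverse:
        # Descending order: each element is >= its successor
        return all(a >= b for a, b in pairs)
    return all(a <= b for a, b in pairs)

print(is_sorted_sketch([1, 2, 2, 3]), is_sorted_sketch([3, 2, 1], reverse=True))  # True True
```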
- megadetector.utils.ct_utils.is_running_in_gha()[source]
Determine whether we are running on a GitHub Actions runner.
- Returns:
True if we’re running in a GHA runner
- Return type:
bool
- megadetector.utils.ct_utils.is_sphinx_build()[source]
Determine whether we are running in the context of our Sphinx build.
- Returns:
True if we’re running a Sphinx build
- Return type:
bool
- megadetector.utils.ct_utils.isnan(v)[source]
Returns True if v is a nan-valued float, otherwise returns False.
- Parameters:
v (obj) – the object to evaluate for nan-ness
- Returns:
True if v is a nan-valued float, otherwise False
- Return type:
bool
- megadetector.utils.ct_utils.json_serialize_datetime(obj)[source]
Serializes datetime.datetime and datetime.date objects to ISO format.
- Parameters:
obj (object) – The object to serialize.
- Returns:
The ISO format string representation of the datetime object.
- Return type:
str
- Raises:
TypeError – If the object is not a datetime.datetime or datetime.date instance.
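A serializer with this contract is typically passed to json.dumps or json.dump via the default argument. The sketch below uses a stand-in function (serialize_datetime_sketch, a hypothetical name) so that it runs without the package installed:

```python
import datetime
import json

def serialize_datetime_sketch(obj):
    # datetime.datetime is a subclass of datetime.date; both have isoformat()
    if isinstance(obj, (datetime.datetime, datetime.date)):
        return obj.isoformat()
    raise TypeError(f'Object of type {type(obj)} is not JSON serializable')

record = {'file': 'image001.jpg', 'datetime': datetime.datetime(2024, 1, 2, 3, 4, 5)}
s = json.dumps(record, default=serialize_datetime_sketch)
print(s)
```

json only calls the default function for objects it cannot serialize natively, so strings and numbers in the record pass through untouched.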
- megadetector.utils.ct_utils.make_temp_folder(top_level_folder='megadetector', subfolder=None, append_guid=True)[source]
Creates a temporary folder within the system temp folder, by default in a subfolder called megadetector/some_guid. Used for testing without making too much of a mess.
- Parameters:
top_level_folder (str, optional) – the top-level folder to use within the system temp folder
subfolder (str, optional) – the subfolder within [top_level_folder]
append_guid (bool, optional) – append a guid to the subfolder
- Returns:
the new directory
- Return type:
str
- megadetector.utils.ct_utils.make_test_folder(subfolder=None)[source]
Wrapper around make_temp_folder that creates folders within megadetector/tests
- Parameters:
subfolder (str) – specific subfolder to create within the default megadetector temp folder.
- megadetector.utils.ct_utils.max_none(a, b)[source]
Returns the maximum of a and b. If both are None, returns None. If one is None, returns the other.
- Parameters:
a (numeric) – the first value to compare
b (numeric) – the second value to compare
- Returns:
the maximum of a and b, or None
- Return type:
numeric
- megadetector.utils.ct_utils.min_none(a, b)[source]
Returns the minimum of a and b. If both are None, returns None. If one is None, returns the other.
- Parameters:
a (numeric) – the first value to compare
b (numeric) – the second value to compare
- Returns:
the minimum of a and b, or None
- Return type:
numeric
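The None-tolerant comparison described for max_none and min_none can be sketched as follows (the _sketch names are hypothetical stand-ins, not the library functions):

```python
def max_none_sketch(a, b):
    if a is None:
        return b  # also returns None when both are None
    if b is None:
        return a
    return max(a, b)

def min_none_sketch(a, b):
    if a is None:
        return b
    if b is None:
        return a
    return min(a, b)

print(max_none_sketch(None, 3), min_none_sketch(2, None), max_none_sketch(None, None))  # 3 2 None
```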
- megadetector.utils.ct_utils.parse_bool_string(s, strict=False)[source]
Convert the strings “true” or “false” to boolean values. Case-insensitive, discards leading and trailing whitespace. If s is already a bool, returns s.
- Parameters:
s (str or bool) – the string to parse, or the bool to return
strict (bool, optional) – only allow “true” or “false”, otherwise handles “1”, “0”, “yes”, and “no”.
- Returns:
the parsed value
- Return type:
bool
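The parsing rules above might be sketched like this. The function name is hypothetical, and raising ValueError for unrecognized strings is an assumption; the source does not state what happens for invalid input.

```python
def parse_bool_sketch(s, strict=False):
    if isinstance(s, bool):
        return s
    s = s.strip().lower()
    # In strict mode only "true" and "false" are accepted
    true_strings = ['true'] if strict else ['true', '1', 'yes']
    false_strings = ['false'] if strict else ['false', '0', 'no']
    if s in true_strings:
        return True
    if s in false_strings:
        return False
    raise ValueError(f'Cannot parse bool from string: {s}')

print(parse_bool_sketch('  TRUE '), parse_bool_sketch('no'))  # True False
```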
- megadetector.utils.ct_utils.parse_kvp(s, kv_separator='=')[source]
Parse a key/value pair, separated by [kv_separator]. Errors if s is not a valid key/value pair string. Strips leading/trailing whitespace from the key and value.
- Parameters:
s (str) – the string to parse
kv_separator (str, optional) – the string separating keys from values.
- Returns:
a 2-tuple formatted as (key,value)
- Return type:
tuple
- megadetector.utils.ct_utils.parse_kvp_list(items, kv_separator='=', d=None)[source]
Parse a list of key-value pairs into a dictionary. If items is None or [], returns {}.
- Parameters:
items (list) – the list of KVPs to parse
kv_separator (str, optional) – the string separating keys from values.
d (dict, optional) – the initial dictionary, defaults to {}
- Returns:
a dict mapping keys to values
- Return type:
dict
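Taken together, parse_kvp and parse_kvp_list behave roughly like the sketch below. The _sketch names are hypothetical, and both the exception type and splitting on only the first separator are assumptions of this example.

```python
def parse_kvp_sketch(s, kv_separator='='):
    parts = s.split(kv_separator, 1)
    if len(parts) != 2:
        raise ValueError(f'Invalid key/value pair: {s}')
    return parts[0].strip(), parts[1].strip()

def parse_kvp_list_sketch(items, kv_separator='=', d=None):
    if d is None:
        d = {}
    # "items or []" handles both None and an empty list
    for item in items or []:
        key, value = parse_kvp_sketch(item, kv_separator)
        d[key] = value
    return d

print(parse_kvp_list_sketch(['a=dog', ' b = cat ']))  # {'a': 'dog', 'b': 'cat'}
```

This shape pairs naturally with an argparse option declared with nargs='*', such as the --detector_options KEY=VALUE argument listed in the md_tests CLI.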
- megadetector.utils.ct_utils.point_dist(p1, p2)[source]
Computes the distance between two points, represented as length-two tuples.
- Parameters:
p1 (list or tuple) – point, formatted as (x,y)
p2 (list or tuple) – point, formatted as (x,y)
- Returns:
the Euclidean distance between p1 and p2
- Return type:
float
- megadetector.utils.ct_utils.pretty_print_object(obj, b_print=True)[source]
Converts an arbitrary object to .json, optionally printing the .json representation.
- Parameters:
obj (object) – object to print
b_print (bool, optional) – whether to print the object
- Returns:
.json representation of [obj]
- Return type:
str
- megadetector.utils.ct_utils.rect_distance(r1, r2, format='x0y0x1y1')[source]
Computes the minimum distance between two axis-aligned rectangles, each represented as (x0,y0,x1,y1) by default.
Can also specify “format” as x0y0wh for MD-style bbox formatting (x0,y0,w,h).
- Parameters:
r1 (list or tuple) – rectangle, formatted as (x0,y0,x1,y1) or (x0,y0,w,h)
r2 (list or tuple) – rectangle, formatted as (x0,y0,x1,y1) or (x0,y0,w,h)
format (str, optional) – whether the boxes are formatted as ‘x0y0x1y1’ (default) or ‘x0y0wh’
- Returns:
the minimum distance between r1 and r2
- Return type:
float
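For the default (x0,y0,x1,y1) format, the minimum distance between two axis-aligned rectangles can be sketched by measuring the gap along each axis separately. This is an illustration with a hypothetical function name, not the library's code, and it does not handle the 'x0y0wh' format.

```python
import math

def rect_distance_sketch(r1, r2):
    ax0, ay0, ax1, ay1 = r1
    bx0, by0, bx1, by1 = r2
    # Gap along each axis; zero when the projections overlap
    dx = max(bx0 - ax1, ax0 - bx1, 0)
    dy = max(by0 - ay1, ay0 - by1, 0)
    return math.hypot(dx, dy)

# Gaps of 3 (x) and 4 (y) give a corner-to-corner distance of 5
print(rect_distance_sketch((0, 0, 1, 1), (4, 5, 6, 7)))  # 5.0
```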
- megadetector.utils.ct_utils.round_float(x, precision=3)[source]
Convenience wrapper for the native Python round()
- Parameters:
x (float) – number to round
precision (int, optional) – the number of significant digits to preserve, should be >= 1
- Returns:
rounded value
- Return type:
float
- megadetector.utils.ct_utils.round_float_array(xs, precision=3)[source]
Rounds each floating-point value in the array [xs] to a specific number of floating-point digits.
- Parameters:
xs (list) – list of floats to round
precision (int, optional) – the number of significant digits to preserve, should be >= 1
- Returns:
list of rounded floats
- Return type:
list
- megadetector.utils.ct_utils.round_floats_in_nested_dict(obj, decimal_places=5, allow_iterator_conversion=False)[source]
Recursively rounds all floating point values in a nested structure to the specified number of decimal places. Handles dictionaries, lists, tuples, sets, and other iterables. Modifies mutable objects in place by default.
- Parameters:
obj (obj) – The object to process (can be a dict, list, set, tuple, or primitive value)
decimal_places (int, optional) – Number of decimal places to round to
allow_iterator_conversion (bool, optional) – for iterator types, should we convert to lists? Otherwise we error.
- Returns:
The processed object (useful for recursive calls)
- megadetector.utils.ct_utils.run_all_module_tests()[source]
Run all tests in the ct_utils module. This is not invoked by pytest; this is just a convenience wrapper for debugging the tests.
- megadetector.utils.ct_utils.sets_overlap(set1, set2)[source]
Determines whether two sets overlap.
- Parameters:
set1 (set) – the first set to compare (converted to a set if it’s not already)
set2 (set) – the second set to compare (converted to a set if it’s not already)
- Returns:
True if any elements are shared between set1 and set2
- Return type:
bool
- megadetector.utils.ct_utils.sort_dictionary_by_key(d, reverse=False)[source]
Sorts the dictionary [d] by key.
- Parameters:
d (dict) – dictionary to sort
reverse (bool, optional) – whether to sort in reverse (descending) order
- Returns:
sorted copy of [d]
- Return type:
dict
- megadetector.utils.ct_utils.sort_dictionary_by_value(d, sort_values=None, reverse=False)[source]
Sorts the dictionary [d] by value. If sort_values is None, uses d.values(), otherwise uses the dictionary sort_values as the sorting criterion. Always returns a new standard dict, so if [d] is, for example, a defaultdict, the returned value is not.
- Parameters:
d (dict) – dictionary to sort
sort_values (dict, optional) – dictionary mapping keys in [d] to sort values (defaults to None, uses [d] itself for sorting)
reverse (bool, optional) – whether to sort in reverse (descending) order
- Returns:
sorted copy of [d]
- Return type:
dict
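A sketch of value-based sorting, including the optional external sort criterion (sort_by_value_sketch is a hypothetical name). It relies on standard Python dicts preserving insertion order.

```python
def sort_by_value_sketch(d, sort_values=None, reverse=False):
    if sort_values is None:
        sort_values = d
    ordered_keys = sorted(d, key=lambda k: sort_values[k], reverse=reverse)
    # Build a plain dict, regardless of the input dict's type
    return {k: d[k] for k in ordered_keys}

counts = {'deer': 3, 'fox': 1, 'bear': 2}
print(sort_by_value_sketch(counts, reverse=True))  # {'deer': 3, 'bear': 2, 'fox': 1}
```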
- megadetector.utils.ct_utils.sort_list_of_dicts_by_key(L, k, reverse=False, none_handling='smallest')[source]
Sorts the list of dictionaries [L] by the key [k].
- Parameters:
L (list) – list of dictionaries to sort
k (object, typically str) – the sort key
reverse (bool, optional) – whether to sort in reverse (descending) order
none_handling (str, optional) – how to handle None values. Options: “smallest” (default; treat None as smaller than all other values), “largest” (treat None as larger than all other values), or “error” (raise an error when None is compared with a non-None value)
- Returns:
sorted copy of [L]
- Return type:
list
- megadetector.utils.ct_utils.sort_results_for_image(im)[source]
Sort classification and detection results in descending order by confidence (in place).
- Parameters:
im (dict) – image dictionary in the MD output format (with a ‘detections’ field)
- megadetector.utils.ct_utils.split_list_into_fixed_size_chunks(L, n)[source]
Split the list or tuple L into chunks of size n (allowing at most one chunk with size less than n, i.e. len(L) does not have to be a multiple of n).
- Parameters:
L (list) – list to split into chunks
n (int) – preferred chunk size
- Returns:
list of chunks, where each chunk is a list of length n, except possibly the last chunk, which may be shorter
- Return type:
list
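Fixed-size chunking is commonly written with slicing, as in this sketch (fixed_size_chunks_sketch is a hypothetical name, not the library function):

```python
def fixed_size_chunks_sketch(items, n):
    # Step through the sequence n items at a time; the final slice may be short
    return [list(items[i:i + n]) for i in range(0, len(items), n)]

print(fixed_size_chunks_sketch([1, 2, 3, 4, 5, 6, 7], 3))  # [[1, 2, 3], [4, 5, 6], [7]]
```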
- megadetector.utils.ct_utils.split_list_into_n_chunks(L, n, chunk_strategy='greedy')[source]
Splits the list or tuple L into n equally-sized chunks (some chunks may be one element smaller than others, i.e. len(L) does not have to be a multiple of n).
chunk_strategy can be “greedy” (default, if there are k samples per chunk, the first k go into the first chunk) or “balanced” (alternate between chunks when pulling items from the list).
- Parameters:
L (list) – list to split into chunks
n (int) – number of chunks
chunk_strategy (str, optional) – “greedy” or “balanced”; see above
- Returns:
list of chunks, each of which is a list
- Return type:
list
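The difference between the two strategies is easiest to see side by side. In this sketch (hypothetical function name), “greedy” fills each chunk with consecutive items and “balanced” deals items out round-robin; placing the larger chunks first is an assumption of the example.

```python
def n_chunks_sketch(items, n, chunk_strategy='greedy'):
    if chunk_strategy == 'balanced':
        # Deal items out in turn, like dealing cards to n players
        return [list(items[i::n]) for i in range(n)]
    # Greedy: consecutive runs; the first (len % n) chunks get one extra item
    k, remainder = divmod(len(items), n)
    chunks, start = [], 0
    for i in range(n):
        size = k + (1 if i < remainder else 0)
        chunks.append(list(items[start:start + size]))
        start += size
    return chunks

items = [1, 2, 3, 4, 5, 6, 7]
print(n_chunks_sketch(items, 3))              # [[1, 2, 3], [4, 5], [6, 7]]
print(n_chunks_sketch(items, 3, 'balanced'))  # [[1, 4, 7], [2, 5], [3, 6]]
```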
- megadetector.utils.ct_utils.test_bounding_box_operations()[source]
Test bounding box conversion and IoU calculation.
- megadetector.utils.ct_utils.test_datetime_serialization()[source]
Test datetime serialization functions.
- megadetector.utils.ct_utils.test_detection_processing()[source]
Test functions related to processing detection results.
- megadetector.utils.ct_utils.test_dictionary_operations()[source]
Test dictionary manipulation and sorting functions.
- megadetector.utils.ct_utils.test_float_rounding_and_truncation()[source]
Test float rounding, truncation, and nested rounding functions.
- megadetector.utils.ct_utils.test_geometric_operations()[source]
Test geometric calculations like distances.
- megadetector.utils.ct_utils.test_list_operations()[source]
Test list sorting and chunking functions.
- megadetector.utils.ct_utils.test_object_conversion_and_presentation()[source]
Test functions that convert or present objects.
- megadetector.utils.ct_utils.test_string_parsing()[source]
Test string parsing utilities like KVP and boolean parsing.
- megadetector.utils.ct_utils.test_temp_folder_creation()[source]
Test temporary folder creation and cleanup.
- megadetector.utils.ct_utils.test_type_checking_and_validation()[source]
Test type checking and validation utility functions.
- megadetector.utils.ct_utils.to_bool(v)[source]
Convert an object to a bool with specific rules.
- Parameters:
v (object) – The object to convert
- Returns:
For strings: True if ‘true’ (case-insensitive), False if ‘false’, recursively applied if int-like
For int/bytes: False if 0, True otherwise
For bool: returns the bool as-is
For other types: None
- Return type:
bool or None
- megadetector.utils.ct_utils.truncate_float(x, precision=3)[source]
Truncates the fractional portion of a floating-point value to a specific number of floating-point digits.
For example:
truncate_float(0.0003214884) –> 0.000321
truncate_float(1.0003214884) –> 1.000321
This function is primarily used to achieve a certain float representation before exporting to JSON.
- Parameters:
x (float) – scalar to truncate
precision (int, optional) – the number of significant digits to preserve, should be >= 1
- Returns:
truncated version of [x]
- Return type:
float
- megadetector.utils.ct_utils.truncate_float_array(xs, precision=3)[source]
Truncates the fractional portion of each floating-point value in the array [xs] to a specific number of floating-point digits.
- Parameters:
xs (list) – list of floats to truncate
precision (int, optional) – the number of significant digits to preserve, should be >= 1
- Returns:
list of truncated floats
- Return type:
list
- megadetector.utils.ct_utils.write_json(path, content, indent=1, force_str=False, serialize_datetimes=False, ensure_ascii=True, encoding='utf-8')[source]
Standardized wrapper for json.dump().
- Parameters:
path (str) – filename to write to
content (object) – object to dump
indent (int, optional) – indentation depth passed to json.dump
force_str (bool, optional) – whether to force string conversion for non-serializable objects
serialize_datetimes (bool, optional) – whether to serialize datetime objects to ISO format
ensure_ascii (bool, optional) – whether to ensure ASCII characters in the output
encoding (str, optional) – string encoding to use
utils.directory_listing module
directory_listing.py
Script for creating Apache-style HTML directory listings for a local directory and all its subdirectories.
Also includes a preview of a jpg file (the first in an alphabetical list), if present.
- megadetector.utils.directory_listing.create_html_index(dir, overwrite=False, template_fun=<function _create_plain_index>, basepath=None, recursive=True)[source]
Recursively traverses the local directory [dir] and generates an index file for each folder using [template_fun] to generate the HTML output. Excludes hidden files.
- Parameters:
dir (str) – directory to process
overwrite (bool, optional) – whether to overwrite existing index files
template_fun (func, optional) – function taking three arguments (string, list of string, list of string) representing the current root, the list of folders, and the list of files. Should return the HTML source of the index file.
basepath (str, optional) – if not None, the name used for each subfolder in [dir] in the output files will be relative to [basepath]
recursive (bool, optional) – recurse into subfolders
directory_listing - CLI interface
directory_listing [-h] [--basepath BASEPATH] [--overwrite] directory
directory_listing positional arguments
directory - Path to directory which should be traversed.
directory_listing options
--basepath BASEPATH - Folder names will be printed relative to basepath, if specified
--overwrite - If set, the script will overwrite existing index.html files.
utils.md_tests module
md_tests.py
A series of tests to validate basic repo functionality and verify either “correct” inference behavior, or - when operating in environments other than the training environment - acceptable deviation from the correct results.
This module should not depend on anything else in this repo outside of the tests themselves, even if it means some duplicated code (e.g. for downloading files), since much of what it tries to test is, e.g., imports.
“Correctness” is determined by agreement with a file that this script fetches from lila.science.
- class megadetector.utils.md_tests.MDTestOptions[source]
Bases: object
Options controlling test behavior
- alt_model
For comparison tests, use a model that produces slightly different output
- alternative_batch_size
Batch size to use when testing batches of size > 1
- cli_test_pythonpath
PYTHONPATH to set for CLI tests; if None, inherits from the parent process. Only impacts the called functions, not the parent process.
- cli_working_dir
Current working directory when running CLI tests
If this is None, we won’t mess with the inherited working directory.
- cpu_execution_is_error
If GPU execution is requested, but a GPU is not available, should we error?
- default_model
Default model to use for testing (filename, URL, or well-known model string)
- detector_options
Detector options passed to PTDetector
- disable_gpu
Force CPU execution
- force_data_download
Download test data even if it appears to have already been downloaded
- force_data_unzip
Unzip test data even if it appears to have already been unzipped
- iou_threshold_for_file_comparison
IoU threshold used to determine whether boxes in two detection files likely correspond to the same box.
- max_conf_error
How much deviation from the expected confidence values should we allow before a discrepancy becomes an error?
- max_coord_error
How much deviation from the expected detection coordinates should we allow before a discrepancy becomes an error?
- model_folder
Used to drive a series of tests (typically with a low value for python_test_depth) over a folder of models.
- n_cores_for_multiprocessing_tests
Number of cores to use for multi-CPU inference tests
- python_test_depth
Used as a knob to control the level of Python tests, typically used when we want to run a series of simple tests on a large number of models, rather than a deep set of tests on a small number of models. The gestalt is that this is a range from 0-100.
- scratch_dir
Force a specific folder for temporary input/output
- skip_cli_tests
Skip CLI tests
- skip_cpu_tests
Skip force-CPU tests
- skip_download_tests
Skip download tests
- skip_image_tests
Skip tests related to still image processing
- skip_import_tests
Skip module import tests
- skip_localhost_downloads
Skip download tests for local URLs
- skip_python_tests
Skip tests launched via Python functions (as opposed to CLIs)
- skip_video_tests
Skip tests related to video processing
- test_data_url
Where does the test data live?
- test_mode
Currently should be ‘all’ or ‘utils-only’
- warning_mode
By default, any unexpected behavior is an error; this forces most errors to be treated as warnings.
- yolo_working_dir
YOLOv5 installation, only relevant if we’re testing run_inference_with_yolov5_val.
If this is None, we’ll skip that test.
- megadetector.utils.md_tests.compare_detection_lists(detections_a, detections_b, options, bidirectional_comparison=True)[source]
Compare two lists of MD-formatted detections, matching detections across lists using IoU criteria. Generally used to compare detections for the same image when two sets of results are expected to be more or less the same.
- Parameters:
detections_a (list) – the first set of detection dicts
detections_b (list) – the second set of detection dicts
options (MDTestOptions) – options that determine tolerable differences between files
bidirectional_comparison (bool, optional) – reverse the arguments and make a recursive call.
- Returns:
a dictionary with keys ‘max_conf_error’ and ‘max_coord_error’.
- Return type:
dict
- megadetector.utils.md_tests.compare_results(inference_output_file, expected_results_file, options, expected_results_file_is_absolute=False)[source]
Compare two MD-formatted output files that should be nearly identical, allowing small changes (e.g. rounding differences). Generally used to compare a new results file to an expected results file.
- Parameters:
inference_output_file (str) – the first results file to compare
expected_results_file (str) – the second results file to compare
options (MDTestOptions) – options that determine tolerable differences between files
expected_results_file_is_absolute (bool, optional) – by default, expected_results_file is appended to options.scratch_dir; this option specifies that it’s an absolute path.
- Returns:
dictionary with keys ‘max_coord_error’ and ‘max_conf_error’
- Return type:
dict
- megadetector.utils.md_tests.download_test_data(options=None)[source]
Downloads the test zipfile if necessary, unzips if necessary. Initializes temporary fields in [options], particularly [options.scratch_dir].
- Parameters:
options (MDTestOptions, optional) – see MDTestOptions for details
- Returns:
the same object passed in as input, or the options that were used if [options] was supplied as None
- Return type:
MDTestOptions
- megadetector.utils.md_tests.execute(cmd)[source]
Runs [cmd] (a single string) in a shell, yielding each line of output to the caller.
- Parameters:
cmd (str) – command to run
- Returns:
the command’s return code, which is always zero; if the command fails, a CalledProcessError is raised instead
- Return type:
int
- megadetector.utils.md_tests.execute_and_print(cmd, print_output=True, catch_exceptions=False, echo_command=True)[source]
Runs [cmd] (a single string) in a shell, capturing (and optionally printing) output.
- Parameters:
cmd (str) – command to run
print_output (bool, optional) – whether to print output from [cmd]
catch_exceptions (bool, optional) – whether to catch exceptions, rather than raising them
echo_command (bool, optional) – whether to print [cmd] to stdout prior to execution
- Returns:
a dictionary with fields “status” (the process return code) and “output” (the content of stdout)
- Return type:
dict
- megadetector.utils.md_tests.get_expected_results_filename(gpu_is_available, model_string='mdv5a', test_type='image', augment=False, options=None)[source]
Expected results vary just a little across inference environments, particularly between PT 1.x and 2.x, so when making sure things are working acceptably, we compare to a reference file that matches the current environment.
This function gets the correct filename to compare to current results, depending on whether a GPU is available.
- Parameters:
gpu_is_available (bool) – whether a GPU is available
model_string (str, optional) – the model for which we’re retrieving expected results
test_type (str, optional) – the test type we’re running (“image” or “video”)
augment (bool, optional) – whether we’re running this test with image augmentation
options (MDTestOptions, optional) – additional control flow options
- Returns:
relative filename of the results file we should use (within the test data zipfile)
- Return type:
str
- megadetector.utils.md_tests.is_gpu_available(verbose=True)[source]
Checks whether a GPU (including M1/M2 MPS) is available, according to PyTorch. Returns False if PyTorch fails to import.
- Parameters:
verbose (bool, optional) – enable additional debug console output
- Returns:
whether a GPU is available
- Return type:
bool
- megadetector.utils.md_tests.output_files_are_identical(fn1, fn2, verbose=False)[source]
Checks whether two MD-formatted output files are identical other than file sorting.
- Parameters:
fn1 (str) – the first filename to compare
fn2 (str) – the second filename to compare
verbose (bool, optional) – enable additional debug output
- Returns:
whether [fn1] and [fn2] are identical other than file sorting.
- Return type:
bool
- megadetector.utils.md_tests.run_cli_tests(options)[source]
Runs CLI (as opposed to Python-based) package tests.
- Parameters:
options (MDTestOptions) – see MDTestOptions for details
- megadetector.utils.md_tests.run_download_tests(options)[source]
Test automatic model downloads.
- Parameters:
options (MDTestOptions) – see MDTestOptions for details
- megadetector.utils.md_tests.run_python_tests(options)[source]
Runs Python-based (as opposed to CLI-based) package tests.
- Parameters:
options (MDTestOptions) – see MDTestOptions for details
- megadetector.utils.md_tests.run_tests(options)[source]
Runs Python-based and/or CLI-based package tests.
- Parameters:
options (MDTestOptions) – see MDTestOptions for details
- megadetector.utils.md_tests.test_package_imports(package_name, exceptions=None, verbose=True)[source]
Imports all modules in [package_name]
- Parameters:
package_name (str) – the package name to test
exceptions (list, optional) – exclude any modules that contain any of these strings
verbose (bool, optional) – enable additional debug output
- megadetector.utils.md_tests.test_suite_entry_point()[source]
This is the entry point when running tests via pytest; we run a subset of tests in this environment, e.g. we don’t run CLI or video tests.
md_tests - CLI interface
MegaDetector test suite
md_tests [-h] [--disable_gpu] [--cpu_execution_is_error] [--scratch_dir SCRATCH_DIR]
[--skip_image_tests] [--skip_video_tests] [--skip_video_rendering_tests]
[--skip_python_tests] [--skip_cli_tests] [--skip_download_tests]
[--skip_import_tests] [--skip_cpu_tests] [--force_data_download]
[--force_data_unzip] [--warning_mode] [--max_conf_error MAX_CONF_ERROR]
[--max_coord_error MAX_COORD_ERROR] [--cli_working_dir CLI_WORKING_DIR]
[--yolo_working_dir YOLO_WORKING_DIR] [--cli_test_pythonpath CLI_TEST_PYTHONPATH]
[--test_mode TEST_MODE] [--python_test_depth PYTHON_TEST_DEPTH]
[--model_folder MODEL_FOLDER] [--detector_options [KEY=VALUE ...]]
[--default_model DEFAULT_MODEL]
md_tests options
--disable_gpu - Disable GPU operation
--cpu_execution_is_error - Fail if the GPU appears not to be available
--scratch_dir SCRATCH_DIR - Directory for temporary storage (defaults to system temp dir)
--skip_image_tests - Skip tests related to still images
--skip_video_tests - Skip tests related to video
--skip_video_rendering_tests - Skip tests related to rendering video
--skip_python_tests - Skip python tests
--skip_cli_tests - Skip CLI tests
--skip_download_tests - Skip model download tests
--skip_import_tests - Skip module import tests
--skip_cpu_tests - Skip force-CPU tests
--force_data_download - Force download of the test data file, even if it’s already available
--force_data_unzip - Force extraction of all files in the test data file, even if they’re already available
--warning_mode - Turns numeric/content errors into warnings
--max_conf_error MAX_CONF_ERROR - Maximum tolerable confidence value deviation from expected (default 0.005)
--max_coord_error MAX_COORD_ERROR - Maximum tolerable coordinate value deviation from expected (default 0.001)
--cli_working_dir CLI_WORKING_DIR - Working directory for CLI tests
--yolo_working_dir YOLO_WORKING_DIR - Working directory for yolo inference tests
--cli_test_pythonpath CLI_TEST_PYTHONPATH - PYTHONPATH to set for CLI tests; if None, inherits from the parent process
--test_mode TEST_MODE - Test mode: "all" or "utils-only"
--python_test_depth PYTHON_TEST_DEPTH - Used as a knob to control the level of Python tests (0-100)
--model_folder MODEL_FOLDER - Run Python tests on every model in this folder
--detector_options KEY=VALUE - Detector-specific options, as a space-separated list of key-value pairs
--default_model DEFAULT_MODEL - Default model file or well-known model name (used for most tests)
utils.path_utils module
path_utils.py
Miscellaneous useful utils for path manipulation, i.e. things that could almost be in os.path, but aren’t.
- class megadetector.utils.path_utils.TestPathUtils[source]
Bases: object
Tests for path_utils.py
- test_is_executable()[source]
Test the is_executable function. This is a basic test; comprehensive testing is environment-dependent.
- test_parallel_copy_files()[source]
Test the parallel_copy_files function (with max_workers=1 for test simplicity).
- test_parallel_zip_individual_files_and_folders()[source]
Test parallel_zip_files, parallel_zip_folders, and zip_each_file_in_folder.
- megadetector.utils.path_utils.add_files_to_single_tar_file(input_files, output_fn, arc_name_base, overwrite=False, verbose=False, mode='x')[source]
Adds all the files in [input_files] to the tar file [output_fn]. Archive names are relative to arc_name_base.
- Parameters:
input_files (list) – list of absolute filenames to include in the .tar file
output_fn (str) – .tar file to create
arc_name_base (str) – absolute folder from which relative paths should be determined; behavior is undefined if there are files in [input_files] that don’t live within [arc_name_base]
overwrite (bool, optional) – whether to overwrite an existing .tar file
verbose (bool, optional) – enable additional debug console output
mode (str, optional) – compression type, can be ‘x’ (no compression), ‘x:gz’, or ‘x:bz2’.
- Returns:
the output tar file, whether we created it or determined that it already exists
- Return type:
str
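The archive-naming behavior described above can be sketched with the standard tarfile module. This is a minimal illustration, not the package's implementation; the helper name tar_files_relative_to is hypothetical, and it omits the overwrite and verbose handling.

```python
import os
import tarfile
import tempfile

def tar_files_relative_to(input_files, output_fn, arc_name_base, mode='x'):
    """Hypothetical helper: add files to a new tar, naming entries relative to arc_name_base."""
    # Mode 'x' is exclusive creation: tarfile raises FileExistsError if output_fn exists.
    # 'x:gz' and 'x:bz2' add gzip or bzip2 compression.
    with tarfile.open(output_fn, mode) as tar:
        for fn in input_files:
            tar.add(fn, arcname=os.path.relpath(fn, arc_name_base))
    return output_fn

# Usage: archive one file stored under a temporary base folder
base = tempfile.mkdtemp()
os.makedirs(os.path.join(base, 'a'))
src = os.path.join(base, 'a', 'b.txt')
with open(src, 'w') as f:
    f.write('hello')
out = tar_files_relative_to([src], os.path.join(base, 'out.tar'), base)
with tarfile.open(out) as tar:
    names = tar.getnames()
print(names)
```

Because entry names are computed with os.path.relpath, a file outside arc_name_base would get a name starting with "..", which is one reason the behavior is documented as undefined in that case.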
- megadetector.utils.path_utils.clean_filename(filename, allow_list='~-_.() abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789', char_limit=255, force_lower=False, remove_trailing_leading_whitespace=True, replace_whitespace=None)[source]
Removes non-ASCII and other invalid filename characters (on any reasonable OS) from a filename, then optionally trims to a maximum length.
Does not allow :/ by default, use clean_path if you want to preserve those.
Adapted from https://gist.github.com/wassname/1393c4a57cfcbf03641dbc31886123b8
- Parameters:
filename (str) – filename to clean
allow_list (str, optional) – string containing all allowable filename characters
char_limit (int, optional) – maximum allowable filename length, if None will skip this step
force_lower (bool, optional) – convert the resulting filename to lowercase
remove_trailing_leading_whitespace (bool, optional) – remove trailing and leading whitespace from each component of a path, e.g. does not allow a/b/c /d.jpg
replace_whitespace (str, optional) – replace all contiguous whitespace with this string, or None to leave whitespace intact
- Returns:
cleaned version of [filename]
- Return type:
str
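As a rough illustration of allow-list filtering, the sketch below keeps only characters in the allow list and then trims to a length limit. The name keep_allowed_chars is hypothetical, and the real function may perform additional steps (such as Unicode normalization or whitespace handling) that are not shown here.

```python
import string

# Same character set as the documented default allow_list
ALLOWED = '~-_.() ' + string.ascii_letters + string.digits

def keep_allowed_chars(filename, allow_list=ALLOWED, char_limit=255):
    """Hypothetical helper: drop disallowed characters, then optionally truncate."""
    cleaned = ''.join(c for c in filename if c in allow_list)
    if char_limit is not None:
        cleaned = cleaned[:char_limit]
    return cleaned

# ':' '/' and '?' are not in the allow list, so they are removed
result = keep_allowed_chars('my:file/name?.jpg')
print(result)
```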
- megadetector.utils.path_utils.clean_path(pathname, allow_list='~-_.() abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789:\\/', char_limit=255, force_lower=False, remove_trailing_leading_whitespace=True)[source]
Removes non-ASCII and other invalid path characters (on any reasonable OS) from a path, then optionally trims to a maximum length.
- Parameters:
pathname (str) – path name to clean
allow_list (str, optional) – string containing all allowable filename characters
char_limit (int, optional) – maximum allowable filename length, if None will skip this step
force_lower (bool, optional) – convert the resulting filename to lowercase
remove_trailing_leading_whitespace (bool, optional) – remove trailing and leading whitespace from each component of a path, e.g. does not allow a/b/c /d.jpg
- Returns:
cleaned version of [pathname]
- Return type:
str
- megadetector.utils.path_utils.compute_file_hash(file_path, algorithm='sha256', allow_failures=True)[source]
Compute the hash of a file.
Adapted from:
https://www.geeksforgeeks.org/python-program-to-find-hash-of-file/
- Parameters:
file_path (str) – the file to hash
algorithm (str, optional) – the hashing algorithm to use (e.g. md5, sha256)
allow_failures (bool, optional) – if True, read failures will silently return None; if False, read failures will raise exceptions
- Returns:
the hash value for this file
- Return type:
str
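A common way to hash a file without loading it all into memory is to feed hashlib fixed-size chunks. The sketch below assumes that approach; hash_file and chunk_size are illustrative names, and the allow_failures handling is left out.

```python
import hashlib
import os
import tempfile

def hash_file(file_path, algorithm='sha256', chunk_size=65536):
    """Hypothetical helper: return the hex digest of a file, read in chunks."""
    h = hashlib.new(algorithm)
    with open(file_path, 'rb') as f:
        # iter() with a sentinel keeps reading until read() returns b''
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()

# Usage: hash a small temporary file
tmp_path = os.path.join(tempfile.mkdtemp(), 'sample.bin')
with open(tmp_path, 'wb') as f:
    f.write(b'hello')
digest = hash_file(tmp_path)
print(digest)
```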
- megadetector.utils.path_utils.delete_file(input_file, verbose=False)[source]
Deletes a single file.
- Parameters:
input_file (str) – file to delete
verbose (bool, optional) – enable additional debug console output
- Returns:
True if file was deleted successfully, False otherwise
- Return type:
bool
- megadetector.utils.path_utils.file_list(base_dir, convert_slashes=True, return_relative_paths=False, sort_files=True, recursive=False)[source]
Trivial wrapper for recursive_file_list. That name turned out to be a poor choice, since I later wanted to add non-recursive lists, and it doesn’t make sense to have a “recursive” option in a function called “recursive_file_list”.
- Parameters:
base_dir (str) – folder to enumerate
convert_slashes (bool, optional) – force forward slashes; if this is False, will use the native path separator
return_relative_paths (bool, optional) – return paths that are relative to [base_dir], rather than absolute paths
sort_files (bool, optional) – force files to be sorted, otherwise uses the sorting provided by os.walk()
recursive (bool, optional) – enumerate recursively
- Returns:
list of filenames
- Return type:
list
- megadetector.utils.path_utils.fileparts(path)[source]
Breaks down a path into the directory path, filename, and extension.
Note that the ‘.’ lives with the extension, and separators are removed.
Examples:
>>> fileparts('file')
('', 'file', '')
>>> fileparts(r'c:/dir/file.jpg')
('c:/dir', 'file', '.jpg')
>>> fileparts('/dir/subdir/file.jpg')
('/dir/subdir', 'file', '.jpg')
- Parameters:
path (str) – path name to separate into parts
- Returns:
- tuple containing (p,n,e):
p: str, directory path
n: str, filename without extension
e: str, extension including the ‘.’
- Return type:
tuple
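The (p, n, e) decomposition can be reproduced with os.path.split and os.path.splitext. This is a sketch under that assumption, with the hypothetical name fileparts_sketch; it is not necessarily how the package implements fileparts.

```python
import os

def fileparts_sketch(path):
    """Hypothetical helper: split a path into (directory, name, extension)."""
    p, name = os.path.split(path)    # the directory loses its trailing separator
    n, e = os.path.splitext(name)    # the '.' stays with the extension
    return p, n, e

parts = fileparts_sketch('/dir/subdir/file.jpg')
print(parts)
```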
- megadetector.utils.path_utils.find_image_strings(strings)[source]
Given a list of strings that are potentially image file names, looks for strings that actually look like image file names (based on extension).
- Parameters:
strings (list) – list of filenames to check for image-ness
- Returns:
the subset of [strings] that appear to be image filenames
- Return type:
list
- megadetector.utils.path_utils.find_images(dirname, recursive=False, return_relative_paths=False, convert_slashes=True)[source]
Finds all files in a directory that look like image file names. Returns absolute paths unless return_relative_paths is set. Uses the OS-native path separator unless convert_slashes is set, in which case will always use ‘/’.
- Parameters:
dirname (str) – the folder to search for images
recursive (bool, optional) – whether to search recursively
return_relative_paths (bool, optional) – return paths that are relative to [dirname], rather than absolute paths
convert_slashes (bool, optional) – force forward slashes in return values
- Returns:
list of image filenames found in [dirname]
- Return type:
list
- megadetector.utils.path_utils.flatten_path(pathname, separator_chars=':\\/', separator_char_replacement='~')[source]
Removes non-ASCII and other invalid path characters (on any reasonable OS) from a path, then trims to a maximum length. Replaces all valid separators with [separator_char_replacement].
- Parameters:
pathname (str) – path name to flatten
separator_chars (str, optional) – string containing all known path separators
separator_char_replacement (str, optional) – string to insert in place of path separators.
- Returns:
flattened version of [pathname]
- Return type:
str
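The separator-replacement step alone can be sketched as follows. The name flatten_separators is hypothetical, and the character-cleaning and length-trimming steps mentioned in the description are deliberately omitted.

```python
def flatten_separators(pathname, separator_chars=':\\/', separator_char_replacement='~'):
    """Hypothetical helper: replace every path separator character with a placeholder."""
    return ''.join(
        separator_char_replacement if c in separator_chars else c
        for c in pathname
    )

flat = flatten_separators('a/b\\c.jpg')
print(flat)
```

Flattening like this is useful when a relative path needs to become a single unique filename, for example when copying files from a nested tree into one folder.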
- megadetector.utils.path_utils.folder_list(base_dir, convert_slashes=True, return_relative_paths=False, sort_folders=True, recursive=False)[source]
Enumerates folders (not files) in [base_dir].
- Parameters:
base_dir (str) – folder to enumerate
convert_slashes (bool, optional) – force forward slashes; if this is False, will use the native path separator
return_relative_paths (bool, optional) – return paths that are relative to [base_dir], rather than absolute paths
sort_folders (bool, optional) – force folders to be sorted, otherwise uses the sorting provided by os.walk()
recursive (bool, optional) – enumerate recursively
- Returns:
list of folder names
- Return type:
list
- megadetector.utils.path_utils.folder_summary(folder, print_summary=True)[source]
Returns (and optionally prints) a summary of [folder], including:
The total number of files
The total number of folders
The number of files for each extension
- Parameters:
folder (str) – folder to summarize
print_summary (bool, optional) – whether to print the summary
- Returns:
a dict with fields “n_files”, “n_folders”, and “extension_to_count”
- Return type:
dict
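A summary of this shape can be built with os.walk and a Counter. The sketch below is an illustration only; summarize_folder is a hypothetical name, and lowercasing extensions is an assumption rather than documented behavior.

```python
import os
import tempfile
from collections import Counter

def summarize_folder(folder):
    """Hypothetical helper: count files, folders, and files per extension under a folder."""
    n_files = 0
    n_folders = 0
    extension_to_count = Counter()
    for _, dirnames, filenames in os.walk(folder):
        n_folders += len(dirnames)
        n_files += len(filenames)
        for fn in filenames:
            # Assumption: extensions are compared case-insensitively
            extension_to_count[os.path.splitext(fn)[1].lower()] += 1
    return {'n_files': n_files,
            'n_folders': n_folders,
            'extension_to_count': dict(extension_to_count)}

# Usage: build a tiny folder tree and summarize it
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, 'sub'))
for rel in ('a.jpg', os.path.join('sub', 'b.JPG'), os.path.join('sub', 'c.txt')):
    open(os.path.join(root, rel), 'w').close()
summary = summarize_folder(root)
print(summary)
```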
- megadetector.utils.path_utils.get_file_sizes(filenames, max_workers=1, use_threads=True, verbose=False, recursive=True, convert_slashes=True, return_relative_paths=True)[source]
Returns a dictionary mapping every file in [filenames] to the corresponding file size, or None for errors. If [filenames] is a folder, will enumerate the folder (optionally recursively).
- Parameters:
filenames (list or str) – list of filenames for which we should read sizes, or a folder within which we should read all file sizes recursively
max_workers (int, optional) – number of concurrent workers; set <= 1 to disable parallelism
use_threads (bool, optional) – whether to use threads (True) or processes (False); ignored if max_workers <= 1
verbose (bool, optional) – enable debug output
recursive (bool, optional) – enumerate recursively, only relevant if [filenames] is a folder.
convert_slashes (bool, optional) – convert backslashes to forward slashes
return_relative_paths (bool, optional) – return relative paths; only relevant if [filenames] is a folder.
- Returns:
mapping filename to file size in bytes, or None for files that error
- Return type:
dict
- megadetector.utils.path_utils.insert_before_extension(filename, s=None, separator='.')[source]
Insert string [s] before the extension in [filename], separated with [separator].
If [s] is empty, generates a date/timestamp. If [filename] has no extension, appends [s].
Examples:
>>> insert_before_extension('/dir/subdir/file.ext', 'insert')
'/dir/subdir/file.insert.ext'
>>> insert_before_extension('/dir/subdir/file', 'insert')
'/dir/subdir/file.insert'
>>> insert_before_extension('/dir/subdir/file')
'/dir/subdir/file.2020.07.20.10.54.38'
- Parameters:
filename (str) – filename to manipulate
s (str, optional) – string to insert before the extension in [filename], or None to insert a datestamp
separator (str, optional) – separator to place between the filename base and the inserted string
- Returns:
modified string
- Return type:
str
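The behavior in the examples above can be approximated with os.path.splitext. In this sketch the name insert_before_ext is hypothetical, and the timestamp format is inferred from the sample output rather than taken from the source code.

```python
import os
from datetime import datetime

def insert_before_ext(filename, s=None, separator='.'):
    """Hypothetical helper: insert s (or a timestamp) between the base name and the extension."""
    if not s:
        # Assumed format, chosen to match the documented example output
        s = datetime.now().strftime('%Y.%m.%d.%H.%M.%S')
    base, ext = os.path.splitext(filename)
    # If there is no extension, ext is '' and s is simply appended
    return base + separator + s + ext

with_ext = insert_before_ext('/dir/subdir/file.ext', 'insert')
without_ext = insert_before_ext('/dir/subdir/file', 'insert')
print(with_ext, without_ext)
```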
- megadetector.utils.path_utils.is_executable(filename)[source]
Checks whether [filename] is on the system path and marked as executable.
- Parameters:
filename (str) – filename to check for executable status
- Returns:
True if [filename] is on the system path and marked as executable, otherwise False
- Return type:
bool
- megadetector.utils.path_utils.is_image_file(s, img_extensions=('.jpg', '.jpeg', '.gif', '.png', '.tif', '.tiff', '.bmp', '.webp', '.avif'))[source]
Checks a file’s extension against a hard-coded set of image file extensions. Uses case-insensitive comparison.
Does not check whether the file exists, only determines whether the filename implies it’s an image file.
- Parameters:
s (str) – filename to evaluate for image-ness
img_extensions (list, optional) – list of known image file extensions
- Returns:
True if [s] appears to be an image file, else False
- Return type:
bool
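A case-insensitive extension check of this kind is short enough to sketch directly. The name looks_like_image is hypothetical; the extension tuple copies the documented default.

```python
import os

IMG_EXTENSIONS = ('.jpg', '.jpeg', '.gif', '.png', '.tif', '.tiff', '.bmp', '.webp', '.avif')

def looks_like_image(s, img_extensions=IMG_EXTENSIONS):
    """Hypothetical helper: True if the filename's extension is a known image extension."""
    ext = os.path.splitext(s)[1]
    return ext.lower() in img_extensions

# Filtering a list of candidate strings, similar in spirit to find_image_strings
candidates = ['a/B.JPG', 'notes.txt', 'c.png', 'no_extension']
images = [c for c in candidates if looks_like_image(c)]
print(images)
```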
- megadetector.utils.path_utils.make_executable(filename, catch_exceptions=False)[source]
Make [filename] executable.
- Parameters:
filename (str) – filename to make executable
catch_exceptions (bool, optional) – treat errors as warnings
- megadetector.utils.path_utils.open_file(filename, attempt_to_open_in_wsl_host=False, browser_name=None)[source]
Opens [filename] in the default OS file handler for this file type.
If browser_name is not None, uses the webbrowser module to open the filename in the specified browser; see https://docs.python.org/3/library/webbrowser.html for supported browsers. Falls back to the default file handler if webbrowser.open() fails. In this case, attempt_to_open_in_wsl_host is ignored unless webbrowser.open() fails.
If browser_name is ‘default’, uses the system default. This is different from the parameter to webbrowser.get(), where None implies the system default.
- Parameters:
filename (str) – file to open
attempt_to_open_in_wsl_host (bool, optional) – if this is True, and we’re in WSL, attempts to open [filename] in the Windows host environment
browser_name (str, optional) – see above
- megadetector.utils.path_utils.open_file_in_chrome(filename)[source]
Open a file in chrome, regardless of file type. I typically use this to open .md files in Chrome.
- Parameters:
filename (str) – file to open
- Returns:
whether the operation was successful
- Return type:
bool
- megadetector.utils.path_utils.parallel_compute_file_hashes(filenames, max_workers=16, use_threads=True, recursive=True, algorithm='sha256', verbose=False)[source]
Compute file hashes for a list or folder of images.
- Parameters:
filenames (list or str) – a list of filenames or a folder
max_workers (int, optional) – the number of parallel workers to use; set to <=1 to disable parallelization
use_threads (bool, optional) – whether to use threads (True) or processes (False) for parallelization
algorithm (str, optional) – the hashing algorithm to use (e.g. md5, sha256)
recursive (bool, optional) – if [filenames] is a folder, whether to enumerate recursively. Ignored if [filenames] is a list.
verbose (bool, optional) – enable additional debug output
- Returns:
a dict mapping filenames to hash values; values will be None for files that fail to load.
- Return type:
dict
- megadetector.utils.path_utils.parallel_copy_files(input_file_to_output_file, max_workers=16, use_threads=True, overwrite=False, verbose=False, move=False)[source]
Copy (or move) files from source to target according to the dict input_file_to_output_file.
- Parameters:
input_file_to_output_file (dict) – dictionary mapping source files to the target files to which they should be copied
max_workers (int, optional) – number of concurrent workers; set to <=1 to disable parallelism
use_threads (bool, optional) – whether to use threads (True) or processes (False) for parallel copying; ignored if max_workers <= 1
overwrite (bool, optional) – whether to overwrite existing destination files
verbose (bool, optional) – enable additional debug output
move (bool, optional) – move instead of copying
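The dict-driven copy pattern can be sketched with a thread pool from concurrent.futures. This is a simplified illustration with the hypothetical name copy_files_in_parallel; it covers only threads (not processes) and omits the move and verbose options.

```python
import os
import shutil
import tempfile
from concurrent.futures import ThreadPoolExecutor

def copy_files_in_parallel(input_file_to_output_file, max_workers=16, overwrite=False):
    """Hypothetical helper: copy each source file to its mapped target."""
    def copy_one(item):
        src, dst = item
        if os.path.isfile(dst) and not overwrite:
            return  # leave existing targets alone
        os.makedirs(os.path.dirname(os.path.abspath(dst)), exist_ok=True)
        shutil.copyfile(src, dst)

    items = list(input_file_to_output_file.items())
    if max_workers <= 1:
        for item in items:
            copy_one(item)
    else:
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            # list() forces completion and surfaces any worker exception
            list(pool.map(copy_one, items))

# Usage: copy two small files into a new folder
src_dir = tempfile.mkdtemp()
dst_dir = os.path.join(tempfile.mkdtemp(), 'copied')
mapping = {}
for name in ('one.txt', 'two.txt'):
    src = os.path.join(src_dir, name)
    with open(src, 'w') as f:
        f.write(name)
    mapping[src] = os.path.join(dst_dir, name)
copy_files_in_parallel(mapping, max_workers=2)
copied = sorted(os.listdir(dst_dir))
print(copied)
```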
- megadetector.utils.path_utils.parallel_delete_files(input_files, max_workers=16, use_threads=True, verbose=False)[source]
Deletes one or more files in parallel.
- Parameters:
input_files (list) – list of files to delete
max_workers (int, optional) – number of concurrent workers, set to <= 1 to disable parallelism
use_threads (bool, optional) – whether to use threads (True) or processes (False); ignored if max_workers <= 1
verbose (bool, optional) – enable additional debug console output
- megadetector.utils.path_utils.parallel_unzip_files(input_files, output_folder=None, max_workers=16, use_threads=True, verbose=False)[source]
Unzips one or more zipfiles in parallel.
- Parameters:
input_files (list) – list of zipfiles to unzip
output_folder (str, optional) – folder to which we should unzip all files in [input_files], defaults to unzipping each file to the folder where it lives
max_workers (int, optional) – number of concurrent workers, set to <= 1 to disable parallelism
use_threads (bool, optional) – whether to use threads (True) or processes (False); ignored if max_workers <= 1
verbose (bool, optional) – enable additional debug console output
- megadetector.utils.path_utils.parallel_zip_files(input_files, max_workers=16, use_threads=True, compress_level=9, overwrite=False, verbose=False)[source]
Zips one or more files to separate output files in parallel, leaving the original files in place. Each file is zipped to [filename].zip.
- Parameters:
input_files (list) – list of files to zip
max_workers (int, optional) – number of concurrent workers, set to <= 1 to disable parallelism
use_threads (bool, optional) – whether to use threads (True) or processes (False); ignored if max_workers <= 1
compress_level (int, optional) – zip compression level between 0 and 9
overwrite (bool, optional) – whether to overwrite an existing .zip file
verbose (bool, optional) – enable additional debug console output
- megadetector.utils.path_utils.parallel_zip_folders(input_folders, max_workers=16, use_threads=True, compress_level=9, overwrite=False, verbose=False)[source]
Zips one or more folders to separate output files in parallel, leaving the original folders in place. Each folder is zipped to [folder_name].zip.
- Parameters:
input_folders (list) – list of folders to zip
max_workers (int, optional) – number of concurrent workers, set to <= 1 to disable parallelism
use_threads (bool, optional) – whether to use threads (True) or processes (False); ignored if max_workers <= 1
compress_level (int, optional) – zip compression level between 0 and 9
overwrite (bool, optional) – whether to overwrite an existing .zip file
verbose (bool, optional) – enable additional debug console output
- megadetector.utils.path_utils.path_is_abs(p)[source]
Determines whether [p] is an absolute path. An absolute path is defined as one that starts with slash, backslash, or a letter followed by a colon.
- Parameters:
p (str) – path to evaluate
- Returns:
True if [p] is an absolute path, else False
- Return type:
bool
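The definition above (leading slash, leading backslash, or a drive letter followed by a colon) translates into a few string checks. The name is_abs_sketch is hypothetical.

```python
def is_abs_sketch(p):
    """Hypothetical helper: apply the three-part absolute-path definition."""
    if len(p) == 0:
        return False
    if p[0] in ('/', '\\'):
        return True
    # Drive letter followed by a colon, e.g. 'c:'
    return len(p) >= 2 and p[0].isalpha() and p[1] == ':'

checks = [is_abs_sketch(p) for p in ('/a/b', '\\\\server\\share', 'c:\\a', 'a/b', '')]
print(checks)
```

Unlike os.path.isabs, this definition gives the same answer on every platform, so Windows-style paths are recognized as absolute even when the code runs on Linux.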
- megadetector.utils.path_utils.path_join(*paths, convert_slashes=True)[source]
Wrapper for os.path.join that optionally converts backslashes to forward slashes.
- Parameters:
*paths (variable-length set of strings) – Path components to be joined.
convert_slashes (bool, optional) – whether to convert \ to /
- Returns:
A string with the joined path components.
- megadetector.utils.path_utils.read_list_from_file(filename)[source]
Reads a json-formatted list of strings from a file.
- Parameters:
filename (str) – .json filename to read
- Returns:
list of strings read from [filename]
- Return type:
list
- megadetector.utils.path_utils.recursive_file_list(base_dir, convert_slashes=True, return_relative_paths=False, sort_files=True, recursive=True)[source]
Enumerates files (not directories) in [base_dir].
- Parameters:
base_dir (str) – folder to enumerate
convert_slashes (bool, optional) – force forward slashes; if this is False, will use the native path separator
return_relative_paths (bool, optional) – return paths that are relative to [base_dir], rather than absolute paths
sort_files (bool, optional) – force files to be sorted, otherwise uses the sorting provided by os.walk()
recursive (bool, optional) – enumerate recursively
- Returns:
list of filenames
- Return type:
list
- megadetector.utils.path_utils.remove_empty_folders(path, remove_root=False)[source]
Recursively removes empty folders within the specified path.
- Parameters:
path (str) – the folder from which we should recursively remove empty folders.
remove_root (bool, optional) – whether to remove the root directory if it’s empty after removing all empty subdirectories. This will always be True during recursive calls.
- Returns:
True if the directory is empty after processing, False otherwise
- Return type:
bool
- megadetector.utils.path_utils.safe_create_link(link_exists, link_new)[source]
Creates a symlink at [link_new] pointing to [link_exists].
If [link_new] already exists, make sure it’s a link (not a file), and if it has a different target than [link_exists], removes and re-creates it.
Creates a real directory if necessary.
Errors if [link_new] already exists but it’s not a link.
- Parameters:
link_exists (str) – the source of the (possibly-new) symlink
link_new (str) – the target of the (possibly-new) symlink
- megadetector.utils.path_utils.split_path(path)[source]
Splits [path] into all its constituent file/folder tokens.
Examples:
>>> split_path(r'c:\dir\subdir\file.txt')
['c:\\', 'dir', 'subdir', 'file.txt']
>>> split_path('/dir/subdir/file.jpg')
['/', 'dir', 'subdir', 'file.jpg']
>>> split_path('c:\\')
['c:\\']
>>> split_path('/')
['/']
- Parameters:
path (str) – path to split into tokens
- Returns:
list of path tokens
- Return type:
list
- megadetector.utils.path_utils.test_file_write(fn, overwrite=True)[source]
Writes an empty file to [fn], used to test that we have appropriate permissions. If [fn] exists and overwrite is False, this function errors. Creates the directory containing [fn] if necessary. Does not delete the test file.
- Parameters:
fn (str) – the filename to which we should perform a test write
overwrite (bool, optional) – if [fn] exists, whether we should overwrite (True) or error (False)
- Returns:
currently always returns True or errors
- Return type:
bool
- megadetector.utils.path_utils.unzip_file(input_file, output_folder=None)[source]
Unzips a zipfile to the specified output folder, defaulting to the same location as the input file.
- Parameters:
input_file (str) – zipfile to unzip
output_folder (str, optional) – folder to which we should unzip [input_file], defaults to unzipping to the folder where [input_file] lives
- megadetector.utils.path_utils.windows_path_to_wsl_path(filename, failure_behavior='none')[source]
Converts a Windows path to a WSL path, or returns None if that’s not possible. E.g. converts:
e:\a\b\c
…to:
/mnt/e/a/b/c
- Parameters:
filename (str) – filename to convert
failure_behavior (str, optional) – what to do if the path can’t be processed as a Windows path. ‘none’ to return None in this case, ‘original’ to return the original path.
- Returns:
WSL equivalent to the Windows path [filename]
- Return type:
str
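The drive-letter mapping can be sketched with a regular expression. This is an assumption-laden illustration, not the package's code: windows_to_wsl is a hypothetical name, and any failure_behavior value other than 'none' is treated here as 'original'.

```python
import re

def windows_to_wsl(filename, failure_behavior='none'):
    """Hypothetical helper: map 'e:\\a\\b\\c' to '/mnt/e/a/b/c'."""
    m = re.match(r'^([a-zA-Z]):[\\/](.*)$', filename)
    if m is None:
        # Not recognizable as a Windows path with a drive letter
        return None if failure_behavior == 'none' else filename
    drive = m.group(1).lower()
    rest = m.group(2).replace('\\', '/')
    return '/mnt/{}/{}'.format(drive, rest)

converted = windows_to_wsl(r'e:\a\b\c')
print(converted)
```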
- megadetector.utils.path_utils.write_list_to_file(output_file, strings)[source]
Writes a list of strings to either a JSON file or text file, depending on extension of the given file name.
- Parameters:
output_file (str) – file to write
strings (list) – list of strings to write to [output_file]
- megadetector.utils.path_utils.wsl_path_to_windows_path(filename, failure_behavior='none')[source]
Converts a WSL path to a Windows path. For example, converts:
/mnt/e/a/b/c
…to:
e:\a\b\c
- Parameters:
filename (str) – filename to convert
failure_behavior (str, optional) – what to do if the path can’t be processed as a WSL path. ‘none’ to return None in this case, ‘original’ to return the original path.
- Returns:
Windows equivalent to the WSL path [filename]
- Return type:
str
- megadetector.utils.path_utils.zip_each_file_in_folder(folder_name, recursive=False, max_workers=16, use_threads=True, compress_level=9, overwrite=False, required_token=None, verbose=False, exclude_zip=True)[source]
Zips each file in [folder_name] to its own zipfile (filename.zip), optionally recursing. To zip a whole folder into a single zipfile, use zip_folder().
- Parameters:
folder_name (str) – the folder within which we should zip files
recursive (bool, optional) – whether to recurse within [folder_name]
max_workers (int, optional) – number of concurrent workers, set to <= 1 to disable parallelism
use_threads (bool, optional) – whether to use threads (True) or processes (False); ignored if max_workers <= 1
compress_level (int, optional) – zip compression level between 0 and 9
overwrite (bool, optional) – whether to overwrite an existing .zip file
required_token (str, optional) – only zip files whose names contain this string
verbose (bool, optional) – enable additional debug console output
exclude_zip (bool, optional) – skip files ending in .zip
- megadetector.utils.path_utils.zip_file(input_fn, output_fn=None, overwrite=False, verbose=False, compress_level=9)[source]
Zips a single file.
- Parameters:
input_fn (str) – file to zip
output_fn (str, optional) – target zipfile; if this is None, we’ll use [input_fn].zip
overwrite (bool, optional) – whether to overwrite an existing target file
verbose (bool, optional) – enable additional debug console output
compress_level (int, optional) – compression level to use, between 0 and 9
- Returns:
the output zipfile, whether we created it or determined that it already exists
- Return type:
str
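Zipping a single file with a chosen compression level can be sketched with the standard zipfile module. The name zip_one_file is hypothetical, and the overwrite and verbose handling is omitted.

```python
import os
import tempfile
import zipfile

def zip_one_file(input_fn, output_fn=None, compress_level=9):
    """Hypothetical helper: write input_fn into its own zipfile and return the zip path."""
    if output_fn is None:
        output_fn = input_fn + '.zip'
    with zipfile.ZipFile(output_fn, 'w', zipfile.ZIP_DEFLATED,
                         compresslevel=compress_level) as z:
        # Store only the base name, not the full path
        z.write(input_fn, arcname=os.path.basename(input_fn))
    return output_fn

# Usage: zip a small temporary file
src_fn = os.path.join(tempfile.mkdtemp(), 'a.txt')
with open(src_fn, 'w') as f:
    f.write('hello' * 100)
zip_fn = zip_one_file(src_fn)
with zipfile.ZipFile(zip_fn) as z:
    zipped_names = z.namelist()
print(zip_fn, zipped_names)
```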
- megadetector.utils.path_utils.zip_files_into_single_zipfile(input_files, output_fn, arc_name_base, overwrite=False, verbose=False, compress_level=9)[source]
Zip all the files in [input_files] into [output_fn]. Archive names are relative to arc_name_base.
- Parameters:
input_files (list) – list of absolute filenames to include in the .zip file
output_fn (str) – .zip file to create
arc_name_base (str) – absolute folder from which relative paths should be determined; behavior is undefined if there are files in [input_files] that don’t live within [arc_name_base]
overwrite (bool, optional) – whether to overwrite an existing .zip file
verbose (bool, optional) – enable additional debug console output
compress_level (int, optional) – compression level to use, between 0 and 9
- Returns:
the output zipfile, whether we created it or determined that it already exists
- Return type:
str
- megadetector.utils.path_utils.zip_folder(input_folder, output_fn=None, overwrite=False, verbose=False, compress_level=9)[source]
Recursively zip everything in [input_folder] into a single zipfile, storing files as paths relative to [input_folder].
- Parameters:
input_folder (str) – folder to zip
output_fn (str, optional) – output filename; if this is None, we’ll write to [input_folder].zip
overwrite (bool, optional) – whether to overwrite an existing .zip file
verbose (bool, optional) – enable additional debug console output
compress_level (int, optional) – compression level to use, between 0 and 9
- Returns:
the output zipfile, whether we created it or determined that it already exists
- Return type:
str
utils.process_utils module
process_utils.py
Run something at the command line and capture the output, based on:
Includes handy example code for doing this on multiple processes/threads.
- megadetector.utils.process_utils.execute(cmd, encoding=None, errors=None, env=None, verbose=False)[source]
Run [cmd] (a single string) in a shell, yielding each line of output to the caller.
The “encoding”, “errors”, and “env” parameters are passed directly to subprocess.Popen().
“verbose” only impacts output about process management, it is not related to printing output from the child process.
- Parameters:
cmd (str) – command to run
encoding (str, optional) – stdout encoding, see Popen() documentation
errors (str, optional) – error handling, see Popen() documentation
env (dict, optional) – environment variables, see Popen() documentation
verbose (bool, optional) – enable additional debug console output
- Returns:
the command’s return code, which is always zero; if the command exits with a nonzero code, a CalledProcessError is raised instead
- Return type:
int
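The line-by-line streaming pattern described above is commonly written as a generator around subprocess.Popen. The sketch below assumes that pattern; stream_output is a hypothetical name, and merging stderr into stdout is an assumption, not documented behavior.

```python
import subprocess

def stream_output(cmd, encoding=None, errors=None, env=None):
    """Hypothetical helper: run cmd in a shell and yield each line of its output."""
    popen = subprocess.Popen(cmd,
                             stdout=subprocess.PIPE,
                             stderr=subprocess.STDOUT,  # assumption: stderr is merged
                             shell=True,
                             text=True,
                             encoding=encoding,
                             errors=errors,
                             env=env)
    # readline() returns '' only at end of stream
    for line in iter(popen.stdout.readline, ''):
        yield line
    popen.stdout.close()
    return_code = popen.wait()
    if return_code:
        raise subprocess.CalledProcessError(return_code, cmd)

# Usage: collect the output of a trivial shell command
lines = [line.strip() for line in stream_output('echo hello')]
print(lines)
```

Because the function is a generator, the nonzero-exit check only runs once the caller has consumed all of the output.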
- megadetector.utils.process_utils.execute_and_print(cmd, print_output=True, encoding=None, errors=None, env=None, verbose=False, catch_exceptions=True, echo_command=False)[source]
Run [cmd] (a single string) in a shell, capturing and printing output. Returns a dictionary with fields “status” and “output”.
The “encoding”, “errors”, and “env” parameters are passed directly to subprocess.Popen().
“verbose” only impacts output about process management, it is not related to printing output from the child process.
- Parameters:
cmd (str) – command to run
print_output (bool, optional) – whether to print output from [cmd] (stdout is captured regardless of the value of print_output)
encoding (str, optional) – stdout encoding, see Popen() documentation
errors (str, optional) – error handling, see Popen() documentation
env (dict, optional) – environment variables, see Popen() documentation
verbose (bool, optional) – enable additional debug console output
catch_exceptions (bool, optional) – catch exceptions and include in the output, otherwise raise
echo_command (bool, optional) – print the command before executing
- Returns:
a dictionary with fields “status” (the process return code) and “output” (the content of stdout)
- Return type:
dict
utils.split_locations_into_train_val module
split_locations_into_train_val.py
Splits a list of location IDs into training and validation, targeting a specific train/val split for each category, but allowing some categories to be tighter or looser than others. Does nothing particularly clever, just randomly splits locations into train/val lots of times using the target val fraction, and picks the one that meets the specified constraints and minimizes weighted error, where “error” is defined as the sum of each class’s absolute divergence from the target val fraction.
- megadetector.utils.split_locations_into_train_val.split_locations_into_train_val(location_to_category_counts, n_random_seeds=10000, target_val_fraction=0.15, category_to_max_allowable_error=None, category_to_error_weight=None, default_max_allowable_error=0.1, require_complete_coverage=True)[source]
Splits a list of location IDs into training and validation, targeting a specific train/val split for each category, but allowing some categories to be tighter or looser than others. Does nothing particularly clever, just randomly splits locations into train/val lots of times using the target val fraction, and picks the one that meets the specified constraints and minimizes weighted error, where “error” is defined as the sum of each class’s absolute divergence from the target val fraction.
- Parameters:
location_to_category_counts (dict) –
a dict mapping location IDs to dicts, with each dict mapping a category name to a count. Any categories not present in a particular dict are assumed to have a count of zero for that location.
For example:
{'location-000': {'bear':4,'wolf':10}, 'location-001': {'bear':12,'elk':20}}
n_random_seeds (int, optional) – number of random seeds to try, always starting from zero
target_val_fraction (float, optional) – fraction of images containing each species we’d like to put in the val split
category_to_max_allowable_error (dict, optional) – a dict mapping category names to maximum allowable errors. These are hard constraints (i.e., we will error if we can’t meet them). Does not need to include all categories; categories not included will be assigned a maximum error according to [default_max_allowable_error]. If this is None, no hard constraints are applied.
category_to_error_weight (dict, optional) – a dict mapping category names to error weights. You can specify a subset of categories; categories not included here have a weight of 1.0. If None, all categories have the same weight.
default_max_allowable_error (float, optional) – the maximum allowable error for categories not present in [category_to_max_allowable_error]. Set to None (or >= 1.0) to disable hard constraints for categories not present in [category_to_max_allowable_error]
require_complete_coverage (bool, optional) – require that every category appear in both train and val
- Returns:
- A two-element tuple:
list of location IDs in the val split
a dict mapping category names to the fraction of images in the val split
- Return type:
tuple
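The random-search idea can be sketched in a few lines. This is a simplified illustration, not the package's implementation: random_split_search is a hypothetical name, all categories are weighted equally, and the hard constraints and coverage requirement are omitted.

```python
import random

def random_split_search(location_to_category_counts, n_random_seeds=1000,
                        target_val_fraction=0.15):
    """Hypothetical helper: try many random splits and keep the lowest-error one."""
    locations = sorted(location_to_category_counts)

    # Total image count per category across all locations
    totals = {}
    for counts in location_to_category_counts.values():
        for cat, n in counts.items():
            totals[cat] = totals.get(cat, 0) + n

    best = None
    for seed in range(n_random_seeds):
        rng = random.Random(seed)
        val = [loc for loc in locations if rng.random() < target_val_fraction]
        val_counts = dict.fromkeys(totals, 0)
        for loc in val:
            for cat, n in location_to_category_counts[loc].items():
                val_counts[cat] += n
        fractions = {cat: val_counts[cat] / totals[cat]
                     for cat in totals if totals[cat] > 0}
        # Error: sum of each category's absolute divergence from the target
        error = sum(abs(f - target_val_fraction) for f in fractions.values())
        if best is None or error < best[0]:
            best = (error, val, fractions)
    return best[1], best[2]

# Usage with made-up counts
example_counts = {
    'location-000': {'bear': 4, 'wolf': 10},
    'location-001': {'bear': 12, 'elk': 20},
    'location-002': {'bear': 3, 'wolf': 2, 'elk': 5},
    'location-003': {'wolf': 8, 'elk': 1},
    'location-004': {'bear': 6},
    'location-005': {'elk': 9, 'wolf': 4},
}
val_locations, category_to_val_fraction = random_split_search(example_counts, n_random_seeds=200)
print(val_locations, category_to_val_fraction)
```

Seeding each attempt with its own index makes the search reproducible: running it twice on the same input returns the same split.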
utils.string_utils module
string_utils.py
Miscellaneous string utilities.
- class megadetector.utils.string_utils.TestStringUtils[source]
Bases: object
Tests for string_utils.py
- megadetector.utils.string_utils.human_readable_to_bytes(size)[source]
Given a human-readable byte string (e.g. 2G, 10GB, 30MB, 20KB), returns the number of bytes. Will return 0 if the argument has unexpected form.
https://gist.github.com/beugley/ccd69945346759eb6142272a6d69b4e0
- Parameters:
size (str) – string representing a size
- Returns:
the corresponding size in bytes
- Return type:
int
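A parser of this kind can be sketched with a regular expression and a multiplier table. The sketch is an illustration only: to_bytes is a hypothetical name, and the use of 1024-based multipliers is an assumption that the documentation above does not confirm.

```python
import re

# Assumption: binary (1024-based) multipliers
_MULTIPLIERS = {'': 1, 'K': 1024, 'M': 1024 ** 2, 'G': 1024 ** 3, 'T': 1024 ** 4}

def to_bytes(size):
    """Hypothetical helper: parse strings like '2G' or '30MB'; return 0 if unparseable."""
    m = re.fullmatch(r'\s*(\d+(?:\.\d+)?)\s*([KMGT]?)B?\s*', size.upper())
    if m is None:
        return 0
    return int(float(m.group(1)) * _MULTIPLIERS[m.group(2)])

sizes = {s: to_bytes(s) for s in ('2G', '10GB', '30MB', '20KB', 'bogus')}
print(sizes)
```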
- megadetector.utils.string_utils.is_float(s)[source]
Checks whether [s] is an object (typically a string) that can be cast to a float.
- Parameters:
s (object) – object to evaluate
- Returns:
True if s successfully casts to a float, otherwise False
- Return type:
bool
- megadetector.utils.string_utils.is_int(s)[source]
Checks whether [s] is an object (typically a string) that can be cast to an int
- Parameters:
s (object) – object to evaluate
- Returns:
True if s successfully casts to an int, otherwise False
- Return type:
bool
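Both is_float and is_int describe a "try the cast and see" check. A minimal sketch of that pattern follows; the helper name can_cast_sketch is hypothetical, and the real functions may handle edge cases (for example, strings like '3.5' passed to is_int) differently.

```python
def can_cast_sketch(s, target_type):
    """Hypothetical helper: True if target_type(s) succeeds, otherwise False."""
    try:
        target_type(s)
        return True
    except (ValueError, TypeError, OverflowError):
        # ValueError: unparseable string; TypeError: e.g. None;
        # OverflowError: e.g. int(float('inf'))
        return False
```

With this sketch, can_cast_sketch('3.5', float) succeeds while can_cast_sketch('3.5', int) does not, because Python's int() rejects strings containing a decimal point.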
utils.url_utils module
url_utils.py
Frequently-used functions for downloading, manipulating, or serving URLs
- class megadetector.utils.url_utils.DownloadProgressBar[source]
Bases: object
Progress updater based on the progressbar2 package.
https://stackoverflow.com/questions/37748105/how-to-use-progressbar-module-with-urlretrieve
- class megadetector.utils.url_utils.QuietHTTPRequestHandler(*args, directory=None, **kwargs)[source]
Bases: SimpleHTTPRequestHandler
SimpleHTTPRequestHandler subclass that suppresses console printouts
- log_message(format, *args)[source]
Log an arbitrary message.
This is used by all other logging functions. Override it if you have specific logging wishes.
The first argument, FORMAT, is a format string for the message to be logged. If the format string contains any % escapes requiring parameters, they should be specified as subsequent arguments (it’s just like printf!).
The client ip and current date/time are prefixed to every message.
Unicode control characters are replaced with escaped hex before writing the output to stderr.
- class megadetector.utils.url_utils.SingletonHTTPServer[source]
Bases: object
HTTP server that runs on a local port, serving a particular local folder. Runs as a singleton, so starting a server in a new folder closes the previous server. I use this primarily to serve MD/SpeciesNet previews from manage_local_batch, which can exceed the 260-character filename length limitation imposed by browsers on Windows, so really the point here is just to remove characters from the URL.
- classmethod is_running()[source]
Check whether the server is currently running.
- Returns:
True if the server is running
- Return type:
bool
- classmethod start_server(directory, port=8000, host='localhost')[source]
Start or restart the HTTP server with a specific directory
- Parameters:
directory (str) – the root folder served by the server
port (int, optional) – the port on which to create the server
host (str, optional) – the host on which to listen, typically either “localhost” (default) or “0.0.0.0”
- Returns:
URL to the running host
- Return type:
str
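The general pattern behind a quiet, folder-serving local server can be sketched with the standard library alone. This is an illustrative sketch, not the module's code: QuietHandlerSketch and start_quiet_server_sketch are hypothetical names, and unlike the class described above, the sketch does not enforce singleton behavior.

```python
import threading
from functools import partial
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer

class QuietHandlerSketch(SimpleHTTPRequestHandler):
    """Hypothetical handler that drops per-request console output."""

    def log_message(self, format, *args):
        # Intentionally do nothing, so requests are not echoed to stderr
        pass

def start_quiet_server_sketch(directory, port=8000, host='localhost'):
    """Serve [directory] from a background thread; returns (server, url)."""
    # Bind the served folder to the handler class
    handler = partial(QuietHandlerSketch, directory=directory)
    server = ThreadingHTTPServer((host, port), handler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    # Read the port back from the socket, in case port=0 was requested
    url = 'http://{}:{}'.format(host, server.server_address[1])
    return server, url
```

Passing port=0 asks the operating system for any free port. The caller is responsible for calling server.shutdown() and server.server_close() when finished.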
- class megadetector.utils.url_utils.TestUrlUtils[source]
Bases: object
Tests for url_utils.py
- test_download_url_to_specified_file()[source]
Test download_url with a specified destination filename.
- megadetector.utils.url_utils.download_relative_filename(url, output_base, verbose=False)[source]
Download a URL to output_base, preserving relative path. Path is relative to the site, so:
…will get downloaded to:
output_base/xyz/123.txt
- Parameters:
url (str) – the URL to download
output_base (str) – the base folder to which we should download this file
verbose (bool, optional) – enable additional debug console output
- Returns:
the local destination filename
- Return type:
str
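The path mapping described above can be sketched as follows. The helper name relative_download_path_sketch and the URL used in the usage note are hypothetical placeholders; the sketch only computes the destination filename and does not download anything.

```python
import os
from urllib.parse import urlparse

def relative_download_path_sketch(url, output_base):
    """Hypothetical helper: map a URL to a local path under output_base."""
    # Drop the scheme, host, and query string; keep the site-relative path
    relative_path = urlparse(url).path.lstrip('/')
    return os.path.join(output_base, *relative_path.split('/'))
```

For example, relative_download_path_sketch('https://example.com/xyz/123.txt', 'output_base') yields output_base/xyz/123.txt on POSIX systems.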
- megadetector.utils.url_utils.download_url(url, destination_filename=None, progress_updater=None, force_download=False, verbose=True, escape_spaces=True)[source]
Downloads a URL to a file. If no file is specified, creates a temporary file, making a best effort to avoid filename collisions.
Prints some diagnostic information and makes sure to omit SAS tokens from printouts.
- Parameters:
url (str) – the URL to download
destination_filename (str, optional) – the target filename; if None, will create a file in system temp space
progress_updater (object or bool, optional) – can be “None”, “False”, “True”, or a specific callable object. If None or False, no progress updates will be displayed. If True, a default progress bar will be created.
force_download (bool, optional) – download this file even if [destination_filename] exists.
verbose (bool, optional) – enable additional debug console output
escape_spaces (bool, optional) – replace ‘ ‘ with ‘%20’
- Returns:
the filename to which [url] was downloaded, the same as [destination_filename] if [destination_filename] was not None
- Return type:
str
- megadetector.utils.url_utils.get_url_size(url, verbose=False, timeout=None)[source]
Get the size of the file pointed to by a URL, based on the Content-Length property. If the URL is not available, or the Content-Length property is not available, or the Content-Length property is not an integer, returns None.
- Parameters:
url (str) – the url to test
verbose (bool, optional) – enable additional debug output
timeout (int, optional) – timeout in seconds to wait before considering this access attempt to be a failure; see requests.head() for precise documentation
- Returns:
the file size in bytes, or None if it can’t be retrieved
- Return type:
int
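The "return None unless Content-Length is a valid integer" rule can be sketched in isolation. The helper name content_length_or_none_sketch is hypothetical; in practice the header mapping would come from an HTTP HEAD response, for example the headers attribute of the object returned by requests.head(). A plain dict is case-sensitive, whereas the mapping returned by requests is not.

```python
def content_length_or_none_sketch(headers):
    """Hypothetical helper: parse Content-Length from a header mapping."""
    value = headers.get('Content-Length')
    if value is None:
        # Header not present
        return None
    try:
        return int(value)
    except (ValueError, TypeError):
        # Header present but not an integer
        return None
```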
- megadetector.utils.url_utils.get_url_sizes(urls, n_workers=1, pool_type='thread', timeout=None, verbose=False)[source]
Retrieve file sizes for the URLs specified by [urls]. Returns None for any URLs that we can’t access, or URLs for which the Content-Length property is not set.
- Parameters:
urls (list) – list of URLs for which we should retrieve sizes
n_workers (int, optional) – number of concurrent workers, set to <=1 to disable parallelization
pool_type (str, optional) – worker type to use; should be ‘thread’ or ‘process’
timeout (int, optional) – timeout in seconds to wait before considering this access attempt to be a failure; see requests.head() for precise documentation
verbose (bool, optional) – print additional debug information
- Returns:
maps urls to file sizes, which will be None for URLs for which we were unable to retrieve a valid size.
- Return type:
dict
- megadetector.utils.url_utils.parallel_download_urls(url_to_target_file, verbose=False, overwrite=False, n_workers=20, pool_type='thread')[source]
Downloads a list of URLs to local files.
Catches exceptions and reports them in the returned “results” array.
- Parameters:
url_to_target_file (dict) – a dict mapping URLs to local filenames.
verbose (bool, optional) – enable additional debug console output
overwrite (bool, optional) – whether to overwrite existing local files
n_workers (int, optional) – number of concurrent workers, set to <=1 to disable parallelization
pool_type (str, optional) – worker type to use; should be ‘thread’ or ‘process’
- Returns:
- list of dicts with keys:
’url’: the url this item refers to
’status’: ‘skipped’, ‘success’, or a string starting with ‘error’
’target_file’: the local filename to which we downloaded (or tried to download) this URL
- Return type:
list
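The result structure above (one dict per URL, with exceptions captured as status strings) can be sketched with a thread pool. This is an assumption-laden illustration rather than the module's code: download_all_sketch is a hypothetical name, and the caller supplies download_fn, a stand-in for whatever performs the actual transfer.

```python
import os
from concurrent.futures import ThreadPoolExecutor

def download_all_sketch(url_to_target_file, download_fn, overwrite=False, n_workers=4):
    """Hypothetical helper: call download_fn(url, target_file) for every item,
    recording failures in the results instead of raising."""

    def process_one(item):
        url, target_file = item
        result = {'url': url, 'target_file': target_file}
        if os.path.isfile(target_file) and not overwrite:
            result['status'] = 'skipped'
            return result
        try:
            download_fn(url, target_file)
            result['status'] = 'success'
        except Exception as e:
            result['status'] = 'error: {}'.format(e)
        return result

    items = list(url_to_target_file.items())
    if n_workers <= 1:
        # Serial path, matching the "<=1 disables parallelization" convention
        return [process_one(item) for item in items]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # map() returns results in input order
        return list(pool.map(process_one, items))
```

Catching exceptions per item means one bad URL does not abort the whole batch; callers can filter the returned list on the 'status' field afterwards.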
- megadetector.utils.url_utils.test_url(url, error_on_failure=True, timeout=None)[source]
Tests the availability of [url], returning an http status code.
- Parameters:
url (str) – URL to test
error_on_failure (bool, optional) – whether to error (vs. just returning an error code) if accessing this URL fails
timeout (int, optional) – timeout in seconds to wait before considering this access attempt to be a failure; see requests.head() for precise documentation
- Returns:
http status code (200 for success)
- Return type:
int
- megadetector.utils.url_utils.test_urls(urls, error_on_failure=True, n_workers=1, pool_type='thread', timeout=None, verbose=False)[source]
Verify that URLs are available (i.e., return status 200). By default, errors if any URL is unavailable.
- Parameters:
urls (list) – list of URLs to test
error_on_failure (bool, optional) – whether to error (vs. just returning an error code) if accessing this URL fails
n_workers (int, optional) – number of concurrent workers, set to <=1 to disable parallelization
pool_type (str, optional) – worker type to use; should be ‘thread’ or ‘process’
timeout (int, optional) – timeout in seconds to wait before considering this access attempt to be a failure; see requests.head() for precise documentation
verbose (bool, optional) – enable additional debug output
- Returns:
a list of http status codes, the same length and order as [urls]
- Return type:
list
utils.gpu_test module
gpu_test.py
Simple script to verify CUDA availability, used to verify a CUDA environment for TF or PyTorch
- megadetector.utils.gpu_test.directml_test()[source]
Check whether DirectML support is available.
- Returns:
Whether DirectML support is available.
- Return type:
bool
utils.wi_taxonomy_utils module
wi_taxonomy_utils.py
Functions related to working with the SpeciesNet / Wildlife Insights taxonomy.
- class megadetector.utils.wi_taxonomy_utils.TaxonomyHandler(taxonomy_file, geofencing_file, country_code_file)[source]
Bases: object
Handler for taxonomy mapping and geofencing operations.
- binomial_name_to_taxonomy_info
Maps a binomial name (one, two, or three whitespace-delimited tokens) to the same dict described under taxonomy_string_to_taxonomy_info.
- common_name_to_taxonomy_info
Maps a common name to the same dict described under taxonomy_string_to_taxonomy_info
- country_code_to_country
Maps upper-case country codes to lower-case country names
- country_to_country_code
Maps lower-case country names to upper-case country codes
- export_geofence_data_to_csv(csv_fn=None, include_common_names=True)[source]
Converts the geofence .json representation into an equivalent .csv representation, with one taxon per row and one region per column. Empty values indicate non-allowed combinations, positive numbers indicate allowed combinations. Negative values are reserved for specific non-allowed combinations.
- Parameters:
csv_fn (str) – output .csv file
include_common_names (bool, optional) – include a column for common names
- Returns:
the pandas representation of the csv output file
- Return type:
dataframe
- generate_csv_rows_for_species(species_string, allow_countries=None, block_countries=None, allow_states=None, block_states=None)[source]
Generate rows in the format expected by geofence_fixes.csv, representing a list of allow and/or block rules for the specified species and countries/states. Does not check that the rules make sense; e.g. nothing will stop you in this function from both allowing and blocking a country.
- Parameters:
species_string (str) – five-token string in semicolon-delimited WI taxonomy format
allow_countries (list or str, optional) – three-letter country codes, list of country codes, or comma-separated list of country codes to allow
block_countries (list or str, optional) – three-letter country codes, list of country codes, or comma-separated list of country codes to block
allow_states (list or str, optional) – two-letter state codes, list of state codes, or comma-separated list of state codes to allow
block_states (list or str, optional) – two-letter state code, list of state codes, or comma-separated list of state codes to block
- Returns:
lines ready to be pasted into geofence_fixes.csv
- Return type:
list of str
- generate_csv_rows_to_block_all_countries_except(species_string, block_except_list)[source]
Generate rows in the format expected by geofence_fixes.csv, representing a list of allow and block rules to block all countries currently allowed for this species except [block_except_list], and add allow rules for these countries.
- Parameters:
species_string (str) – five-token taxonomy string
block_except_list (list) – list of country codes not to block
- Returns:
strings compatible with geofence_fixes.csv
- Return type:
list of str
- species_allowed_in_country(species, country, state=None, return_status=False)[source]
Determines whether [species] is allowed in [country], according to already-initialized geofencing rules.
- Parameters:
species (str) – can be a common name, a binomial name, or a species string
country (str) – country name or three-letter code
state (str, optional) – two-letter US state code
return_status (bool, optional) – by default, this function returns a bool; if you want to know why [species] is allowed/not allowed, setting return_status to True will return additional information.
- Returns:
typically returns True if [species] is allowed in [country], else False. Returns a more detailed string if return_status is set.
- Return type:
bool or str
- species_string_to_canonical_species_string(species)[source]
Convert a string that may be a 5-token species string, a binomial name, or a common name into a 5-token species string, using taxonomic lookup.
- Parameters:
species (str) – 5-token species string, binomial name, or common name
- Returns:
the canonical 5-token species string
- Return type:
str
- Raises:
ValueError – if [species] is not in our dictionary
- species_string_to_taxonomy_info(species)[source]
Convert a string that may be a 5-token species string, a binomial name, or a common name into a taxonomic info dictionary, using taxonomic lookup.
- Parameters:
species (str) – 5-token species string, binomial name, or common name
- Returns:
taxonomy information
- Return type:
dict
- Raises:
ValueError – if [species] is not in our dictionary
- taxonomy_string_to_geofencing_rules
Dict mapping 5-token semicolon-delimited taxonomy strings to geofencing rules
- taxonomy_string_to_taxonomy_info
Maps a taxonomy string (e.g. mammalia;cetartiodactyla;cervidae;odocoileus;virginianus) to a dict with keys taxon_id, common_name, kingdom, phylum, class, order, family, genus, species
- class megadetector.utils.wi_taxonomy_utils.TestWITaxonomyUtils[source]
Bases: object
Tests for wi_taxonomy_utils.py
- megadetector.utils.wi_taxonomy_utils.clean_taxonomy_string(s, truncate_multiple_description_strings=True)[source]
If [s] is a seven-token prediction string, trim the GUID and common name to produce a “clean” taxonomy string. Else if [s] is a five-token string, return it. Else error.
- Parameters:
s (str) – the seven- or five-token taxonomy/prediction string to clean
truncate_multiple_description_strings (bool, optional) – we use | to delimit multiple descriptions in the same string; if this is True, clean and return just the first, else error.
- Returns:
the five-token taxonomy string
- Return type:
str
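The trimming rule above can be sketched directly from the description: a seven-token string is a GUID, five taxonomy tokens, and a common name. The function name clean_taxonomy_string_sketch is hypothetical, and the real function may validate more than this sketch does.

```python
def clean_taxonomy_string_sketch(s, truncate_multiple=True):
    """Hypothetical re-implementation of the trimming rule described above."""
    if '|' in s:
        if not truncate_multiple:
            raise ValueError('Multiple descriptions in: {}'.format(s))
        # Keep only the first of several |-delimited descriptions
        s = s.split('|')[0]
    tokens = s.split(';')
    if len(tokens) == 7:
        # Drop the leading GUID and the trailing common name
        tokens = tokens[1:-1]
    elif len(tokens) != 5:
        raise ValueError('Expected 5 or 7 tokens, got {}'.format(len(tokens)))
    return ';'.join(tokens)
```

Applied to the prediction string '90d950db-2106-4bd9-a4c1-777604c3eada;mammalia;rodentia;;;;rodent', this sketch returns 'mammalia;rodentia;;;'.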
- megadetector.utils.wi_taxonomy_utils.find_geofence_adjustments(ensemble_json_file, use_latin_names=False)[source]
Count the number of instances of each unique change made by the geofence.
- Parameters:
ensemble_json_file (str) – SpeciesNet-formatted .json file produced by the full ensemble.
use_latin_names (bool, optional) – return a mapping using binomial names rather than common names.
- Returns:
maps strings that look like “puma,felidae family” to integers, where that entry would indicate the number of times that “puma” was predicted, but mapped to family level by the geofence. Sorted in descending order by count.
- Return type:
dict
- megadetector.utils.wi_taxonomy_utils.generate_geofence_adjustment_html_summary(rollup_pair_to_count, min_count=10)[source]
Given a list of geofence rollups, likely generated by find_geofence_adjustments, generate an HTML summary of the changes made by geofencing. The resulting HTML is wrapped in <div>, but not, for example, in <html> or <body>.
- Parameters:
rollup_pair_to_count (dict) – list of changes made by geofencing, see find_geofence_adjustments for details
min_count (int, optional) – minimum number of changes a pair needs in order to be included in the report.
- megadetector.utils.wi_taxonomy_utils.generate_instances_json_from_folder(folder, country=None, admin1_region=None, lat=None, lon=None, output_file=None, filename_replacements=None, tokens_to_ignore=['$RECYCLE.BIN'])[source]
Generate an instances.json record that contains all images in [folder], optionally including location information, in a format suitable for run_model.py. Optionally writes the results to [output_file].
- Parameters:
folder (str) – the folder to recursively search for images
country (str, optional) – a three-letter country code
admin1_region (str, optional) – an administrative region code, typically a two-letter US state code
lat (float, optional) – latitude to associate with all images
lon (float, optional) – longitude to associate with all images
output_file (str, optional) – .json file to which we should write instance records
filename_replacements (dict, optional) – str –> str dict indicating filename substrings that should be replaced with other strings. Replacement occurs after converting backslashes to forward slashes.
tokens_to_ignore (list, optional) – ignore any images with these tokens in their names, typically used to avoid $RECYCLE.BIN. Can be None.
- Returns:
dict with at least the field “instances”
- Return type:
dict
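A rough sketch of building such a record from a folder follows. Only the top-level "instances" field is confirmed by the description above; the per-image key names ('filepath', 'country'), the list of image extensions, and the function name instances_from_folder_sketch are all assumptions made for illustration.

```python
import os

# Assumption: which extensions count as images
IMAGE_EXTENSIONS_SKETCH = ('.jpg', '.jpeg', '.png')

def instances_from_folder_sketch(folder, country=None, tokens_to_ignore=('$RECYCLE.BIN',)):
    """Hypothetical helper: recursively list images as instance records."""
    instances = []
    for root, _, files in os.walk(folder):
        for name in sorted(files):
            if not name.lower().endswith(IMAGE_EXTENSIONS_SKETCH):
                continue
            # Normalize to forward slashes before any token matching
            path = os.path.join(root, name).replace('\\', '/')
            if any(token in path for token in (tokens_to_ignore or ())):
                continue
            instance = {'filepath': path}  # key name is an assumption
            if country is not None:
                instance['country'] = country  # key name is an assumption
            instances.append(instance)
    return {'instances': instances}
```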
- megadetector.utils.wi_taxonomy_utils.generate_md_results_from_predictions_json(predictions_json_file, md_results_file=None, base_folder=None, max_decimals=5, convert_human_to_person=True, convert_homo_species_to_human=True, verbose=False)[source]
Generate an MD-formatted .json file from a predictions.json file, generated by the SpeciesNet ensemble. Typically, MD results files use relative paths, and predictions.json files use absolute paths, so this function optionally removes the leading string [base_folder] from all file names.
Uses the classification from the “prediction” field if it’s available, otherwise uses the “classifications” field.
When using the “prediction” field, records the top class in the “classifications” field to a field in each image called “top_classification_common_name”. This is often different from the value of the “prediction” field.
speciesnet_to_md.py is a command-line driver for this function.
- Parameters:
predictions_json_file (str) – path to a predictions.json file, or a dict
md_results_file (str, optional) – path to which we should write an MD-formatted .json file
base_folder (str, optional) – leading string to remove from each path in the predictions.json file. Typically the folder on which you ran run_model.py. If base_folder does not end in a slash, but filenames start with base_folder + ‘/’, this function assumes that you meant to add the slash.
max_decimals (int, optional) – number of decimal places to which we should round all values
convert_human_to_person (bool, optional) – WI predictions.json files sometimes use the detection category “human”; MD files usually use “person”. If True, this function will change the detection category name “human” to “person”.
convert_homo_species_to_human (bool, optional) – the ensemble often rolls human predictions up to “homo species”, which isn’t wrong, but looks odd. This forces these back to “homo sapiens”.
verbose (bool, optional) – enable additional debug output
- Returns:
results in MD format
- Return type:
dict
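The base_folder handling described above (remove a leading string, adding a slash if the caller evidently meant one) can be sketched as below. The name strip_base_folder_sketch is hypothetical, and normalizing backslashes to forward slashes is an assumption of this sketch.

```python
def strip_base_folder_sketch(path, base_folder):
    """Hypothetical helper: remove the leading string base_folder from path."""
    # Assumption: compare paths with forward slashes only
    path = path.replace('\\', '/')
    base_folder = (base_folder or '').replace('\\', '/')
    if not base_folder:
        return path
    if not base_folder.endswith('/') and path.startswith(base_folder + '/'):
        # Assume the caller meant to include the trailing slash
        base_folder += '/'
    if path.startswith(base_folder):
        return path[len(base_folder):]
    return path
```

Like the behavior described above, this is plain string-prefix removal, so both '/data/images' and '/data/images/' turn '/data/images/cam1/a.jpg' into 'cam1/a.jpg'.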
- megadetector.utils.wi_taxonomy_utils.generate_predictions_json_from_md_results(md_results_file, predictions_json_file, base_folder=None)[source]
Generate a predictions.json file from the MD-formatted .json file [md_results_file]. Typically, MD results files use relative paths, and predictions.json files use absolute paths, so this function optionally prepends [base_folder]. Does not handle classification results in MD format, since this is intended to prepare data for passing through the WI classifier.
md_to_wi.py is a command-line driver for this function.
- Parameters:
md_results_file (str) – path to an MD-formatted .json file
predictions_json_file (str) – path to which we should write a predictions.json file
base_folder (str, optional) – folder name to prepend to each path in md_results_file, to convert relative paths to absolute paths. If [base_folder] is non-empty and doesn’t end in a slash, a slash will be added.
- megadetector.utils.wi_taxonomy_utils.generate_whole_image_detections_for_classifications(classifications_json_file, detections_json_file, ensemble_json_file=None, ignore_blank_classifications=True, verbose=True)[source]
Given a set of classification results in SpeciesNet format that were likely run on already-cropped images, generate a file of [fake] detections in SpeciesNet format in which each image is covered in a single whole-image detection.
- Parameters:
classifications_json_file (str) – SpeciesNet-formatted file containing classifications
detections_json_file (str) – SpeciesNet-formatted file to write with detections
ensemble_json_file (str, optional) – SpeciesNet-formatted file to write with detections and classifications
ignore_blank_classifications (bool, optional) – use non-top classifications when the top classification is “blank” or “no CV result”
verbose (bool, optional) – enable additional debug output
- Returns:
the contents of [detections_json_file]
- Return type:
dict
- megadetector.utils.wi_taxonomy_utils.get_common_name_from_prediction_string(s)[source]
Extract the common name from the seven-token prediction string [s], or generate a reasonable one (e.g. “vulpes genus”). Prediction strings look like:
‘90d950db-2106-4bd9-a4c1-777604c3eada;mammalia;rodentia;;;;rodent’
- Parameters:
s (str) – the string for which we should extract a common name
- Returns:
the extracted common name
- Return type:
str
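One way to read the description above is: use the trailing common-name token if it is present, otherwise build a name from the most specific taxonomic level that is filled in. The sketch below follows that reading. The function name is hypothetical, and the species-level fallback (a binomial) and the 'unknown' fallback are assumptions; only the "vulpes genus" style comes from the text above.

```python
# Levels of the five taxonomy tokens, in order
LEVEL_NAMES_SKETCH = ['class', 'order', 'family', 'genus', 'species']

def common_name_sketch(s):
    """Hypothetical helper: extract or synthesize a common name."""
    tokens = s.split(';')
    if len(tokens) != 7:
        raise ValueError('Expected a seven-token prediction string')
    common_name = tokens[6].strip()
    if common_name:
        return common_name
    taxa = tokens[1:6]
    # Walk from the most specific level to the least specific
    for level, name in reversed(list(zip(LEVEL_NAMES_SKETCH, taxa))):
        if name:
            if level == 'species':
                # Assumption: fall back to the binomial name
                return '{} {}'.format(taxa[3], name).strip()
            return '{} {}'.format(name, level)
    # Assumption: nothing usable in the string
    return 'unknown'
```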
- megadetector.utils.wi_taxonomy_utils.get_kingdom(prediction_string)[source]
Return the kingdom field from a WI prediction string
- Parameters:
prediction_string (str) – a string in the semicolon-delimited prediction string format
- Returns:
the kingdom field from the input string
- Return type:
str
- megadetector.utils.wi_taxonomy_utils.is_animal_classification(prediction_string)[source]
Determines whether the input string represents an animal classification, which excludes, e.g., humans, blanks, vehicles, unknowns
- Parameters:
prediction_string (str) – a string in the semicolon-delimited prediction string format
- Returns:
whether this string corresponds to an animal category
- Return type:
bool
- megadetector.utils.wi_taxonomy_utils.is_human_classification(prediction_string)[source]
Determines whether the input string represents a human classification, which includes a variety of common names (hiker, person, etc.)
- Parameters:
prediction_string (str) – a string in the semicolon-delimited prediction string format
- Returns:
whether this string corresponds to a human category
- Return type:
bool
- megadetector.utils.wi_taxonomy_utils.is_taxonomic_prediction_string(s)[source]
Determines whether [s] is a classification string that has taxonomic properties; this does not include, e.g., blanks/vehicles/no cv result. It also excludes “animal”.
- Parameters:
s (str) – a five- or seven-token taxonomic string
- Returns:
whether [s] is a taxonomic category
- Return type:
bool
- megadetector.utils.wi_taxonomy_utils.is_valid_prediction_string(s)[source]
Determine whether [s] is a valid WI prediction string. Prediction strings look like:
‘90d950db-2106-4bd9-a4c1-777604c3eada;mammalia;rodentia;;;;rodent’
- Parameters:
s (str) – the string to be tested for validity
- Returns:
True if this looks more or less like a WI prediction string
- Return type:
bool
- megadetector.utils.wi_taxonomy_utils.is_valid_taxonomy_string(s)[source]
Determine whether [s] is a valid 5-token WI taxonomy string. Taxonomy strings look like:
‘mammalia;rodentia;;;;rodent’ ‘mammalia;chordata;canidae;canis;lupus dingo’
- Parameters:
s (str) – the string to be tested for validity
- Returns:
True if this looks more or less like a WI taxonomy string
- Return type:
bool
- megadetector.utils.wi_taxonomy_utils.is_vehicle_classification(prediction_string)[source]
Determines whether the input string represents a vehicle classification.
- Parameters:
prediction_string (str) – a string in the semicolon-delimited prediction string format
- Returns:
whether this string corresponds to the vehicle category
- Return type:
bool
- megadetector.utils.wi_taxonomy_utils.load_md_or_speciesnet_file(fn, verbose=True)[source]
Load a .json file that may be in MD or SpeciesNet format. Typically used so SpeciesNet files can be supplied to functions originally written to support MD format.
- Parameters:
fn (str) – a .json file in predictions.json (MD or SpeciesNet) format
verbose (bool, optional) – enable additional debug output
- Returns:
the contents of [fn], in MD format.
- Return type:
dict
- megadetector.utils.wi_taxonomy_utils.merge_prediction_json_files(input_prediction_files, output_prediction_file)[source]
Merge all predictions.json files in [input_prediction_files] into a single .json file.
- Parameters:
input_prediction_files (list) – list of predictions.json files to merge
output_prediction_file (str) – output .json file
- megadetector.utils.wi_taxonomy_utils.split_instances_into_n_batches(instances_json, n_batches, output_files=None)[source]
Given an instances.json file, split it into batches of equal size.
- Parameters:
instances_json (str) – input .json file in
n_batches (int) – number of new files to generate
output_files (list, optional) – output .json files for each batch. If supplied, should have length [n_batches]. If not supplied, filenames will be generated based on [instances_json].
- Returns:
list of output files that were written; identical to [output_files] if it was supplied as input.
- Return type:
list
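The core of the splitting step can be sketched on a plain list. The name split_into_n_batches_sketch is hypothetical, and using contiguous batches whose sizes differ by at most one is an assumption; the real function may distribute instances differently and also handles reading and writing the .json files.

```python
def split_into_n_batches_sketch(items, n_batches):
    """Hypothetical helper: split a list into n_batches contiguous batches
    whose sizes differ by at most one."""
    base_size, remainder = divmod(len(items), n_batches)
    batches = []
    start = 0
    for i in range(n_batches):
        # The first [remainder] batches each get one extra item
        size = base_size + (1 if i < remainder else 0)
        batches.append(items[start:start + size])
        start += size
    return batches
```

Splitting ten items into three batches this way gives batch sizes of 4, 3, and 3.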
- megadetector.utils.wi_taxonomy_utils.taxonomy_info_to_taxonomy_string(taxonomy_info, include_taxon_id_and_common_name=False)[source]
Convert a taxonomy record in dict format to a five- or seven-token semicolon-delimited string
- Parameters:
taxonomy_info (dict) – dict in the format stored in, e.g., taxonomy_string_to_taxonomy_info
include_taxon_id_and_common_name (bool, optional) – by default, this function returns a five-token string of latin names; if this argument is True, it includes the leading (GUID) and trailing (common name) tokens
- Returns:
string in the format used as keys in, e.g., taxonomy_string_to_taxonomy_info
- Return type:
str
- megadetector.utils.wi_taxonomy_utils.taxonomy_level_index(s)[source]
Returns the taxonomy level up to which [s] is defined (0 for non-taxonomic, 1 for kingdom, 2 for phylum, etc.). Empty strings and non-taxonomic strings are treated as level 0. 1 and 2 will never be returned; “animal” doesn’t look like other taxonomic strings, so here we treat it as non-taxonomic.
- Parameters:
s (str) – 5-token or 7-token taxonomy string
- Returns:
taxonomy level
- Return type:
int
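Under the numbering given above (1 for kingdom, 2 for phylum), the five taxonomy tokens would correspond to levels 3 (class) through 7 (species), which is consistent with 1 and 2 never being returned. The sketch below follows that reading; the function name is hypothetical and the exact mapping is an inference from the description, not confirmed behavior.

```python
def taxonomy_level_index_sketch(s):
    """Hypothetical helper: deepest defined level in a 5- or 7-token string."""
    tokens = s.split(';')
    if len(tokens) == 7:
        # Drop the GUID and common name
        tokens = tokens[1:-1]
    if len(tokens) != 5:
        # Empty or non-taxonomic input
        return 0
    level = 0
    for i, token in enumerate(tokens):
        if token:
            # Token 0 is class (level 3), token 4 is species (level 7)
            level = i + 3
    return level
```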
- megadetector.utils.wi_taxonomy_utils.taxonomy_level_string_to_index(s)[source]
Maps strings (‘kingdom’, ‘species’, etc.) to level indices.
- Parameters:
s (str) – taxonomy level string
- Returns:
taxonomy level index
- Return type:
int
- megadetector.utils.wi_taxonomy_utils.taxonomy_level_to_string(k)[source]
Maps taxonomy level indices (0 for kingdom, 1 for phylum, etc.) to strings.
- Parameters:
k (int) – taxonomy level index
- Returns:
taxonomy level string
- Return type:
str
- megadetector.utils.wi_taxonomy_utils.test_wi_taxonomy_utils()[source]
Module-level test entry point.
- megadetector.utils.wi_taxonomy_utils.validate_predictions_file(fn, instances=None, verbose=True)[source]
Validate the predictions.json file [fn].
- Parameters:
fn (str) – a .json file in predictions.json (SpeciesNet) format
instances (str or list, optional) – a folder, instances.json file, or dict loaded from an instances.json file. If supplied, this function will verify that [fn] contains the same number of images as [instances].
verbose (bool, optional) – enable additional debug output
- Returns:
the contents of [fn]
- Return type:
dict
utils.wi_platform_utils module
wi_platform_utils.py
Utility functions for working with the Wildlife Insights platform, specifically:
Retrieving images based on .csv downloads
Pushing results to the ProcessCVResponse() API (requires an API key)
- megadetector.utils.wi_platform_utils.find_images_in_identify_tab(download_folder_with_identify, download_folder_excluding_identify)[source]
Based on extracted download packages with and without the “exclude images in ‘identify’ tab checkbox” checked, figure out which images are in the identify tab. Returns a list of dicts (one per image).
- Parameters:
download_folder_with_identify (str) – the folder containing the download bundle that includes images from the “identify” tab
download_folder_excluding_identify (str) – the folder containing the download bundle that excludes images from the “identify” tab
- Returns:
list of image records that are present in the identify tab
- Return type:
list of dict
- megadetector.utils.wi_platform_utils.generate_blank_prediction_payload(data_file_id, project_id, blank_confidence=0.9, model_version='3.1.2', prediction_source='manual_update')[source]
Generate a payload that will set a single image to the blank classification, with no detections. Suitable for upload via push_results_for_images.
- Parameters:
data_file_id (str) – unique identifier for this image used in the WI DB
project_id (int) – WI project ID
blank_confidence (float, optional) – confidence value to associate with this prediction
model_version (str, optional) – model version string to include in the payload
prediction_source (str, optional) – prediction source string to include in the payload
- Returns:
dictionary suitable for uploading via push_results_for_images
- Return type:
dict
- megadetector.utils.wi_platform_utils.generate_no_cv_result_payload(data_file_id, project_id, no_cv_confidence=0.9, model_version='3.1.2', prediction_source='manual_update')[source]
Generate a payload that will set a single image to the “no CV result” classification, with no detections. Suitable for uploading via push_results_for_images.
- Parameters:
data_file_id (str) – unique identifier for this image used in the WI DB
project_id (int) – WI project ID
no_cv_confidence (float, optional) – confidence value to associate with this prediction
model_version (str, optional) – model version string to include in the payload
prediction_source (str, optional) – prediction source string to include in the payload
- Returns:
dictionary suitable for uploading via push_results_for_images
- Return type:
dict
- megadetector.utils.wi_platform_utils.generate_payload_for_prediction_string(data_file_id, project_id, prediction_string, prediction_confidence=0.8, detections=None, model_version='3.1.2', prediction_source='manual_update')[source]
Generate a payload that will set a single image to a particular prediction, optionally including detections. Suitable for uploading via push_results_for_images.
- Parameters:
data_file_id (str) – unique identifier for this image used in the WI DB
project_id (int) – WI project ID
prediction_string (str) – WI-formatted prediction string to include in the payload
prediction_confidence (float, optional) – confidence value to associate with this prediction
detections (list, optional) – list of MD-formatted detection dicts, with fields [‘category’] and [‘conf’]
model_version (str, optional) – model version string to include in the payload
prediction_source (str, optional) – prediction source string to include in the payload
- Returns:
dictionary suitable for uploading via push_results_for_images
- Return type:
dict
- megadetector.utils.wi_platform_utils.generate_payload_with_replacement_detections(wi_result, detections, prediction_score=0.9, model_version='3.1.2', prediction_source='manual_update')[source]
Generate a payload for a single image that keeps the classifications from [wi_result], but replaces the detections with the MD-formatted list [detections].
- Parameters:
wi_result (dict) – dict representing a WI prediction result, with at least the fields in the constant wi_result_fields
detections (list) – list of WI-formatted detection dicts (with fields [‘conf’] and [‘category’])
prediction_score (float, optional) – confidence value to use for the combined prediction
model_version (str, optional) – model version string to include in the payload
prediction_source (str, optional) – prediction source string to include in the payload
- Returns:
dictionary suitable for uploading via push_results_for_images
- Return type:
dict
- megadetector.utils.wi_platform_utils.parallel_push_results_for_images(payloads, headers, url='https://placeholder', verbose=False, pool_type='thread', n_workers=10)[source]
Push results for the list of payloads in [payloads] to the process_cv_response API, parallelized over multiple workers.
- Parameters:
payloads (list of dict) – payloads to upload to the API
headers (dict) – authorization headers, see prepare_data_update_auth_headers
url (str, optional) – API URL
verbose (bool, optional) – enable additional debug output
pool_type (str, optional) – ‘thread’ or ‘process’
n_workers (int, optional) – number of parallel workers
- Returns:
list of http response codes, one per payload
- Return type:
list of int
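The fan-out over workers can be sketched with the standard library. This is a minimal sketch under stated assumptions, not the package's implementation: fake_push is a hypothetical stand-in for push_results_for_images that returns a status code without making a network request, and only the thread-based pool is shown.

```python
from concurrent.futures import ThreadPoolExecutor

def fake_push(payload, headers):
    # Hypothetical stand-in for push_results_for_images: no network I/O,
    # reports 200 for non-empty payloads and 400 otherwise.
    return 200 if payload else 400

def parallel_push_sketch(payloads, headers, n_workers=10):
    # executor.map preserves input order, so the i-th status code
    # corresponds to the i-th payload.
    with ThreadPoolExecutor(max_workers=n_workers) as executor:
        return list(executor.map(lambda p: fake_push(p, headers), payloads))

status_codes = parallel_push_sketch([{'a': 1}, {}, {'b': 2}], headers={})
```

Because results come back in input order, a caller can zip the payloads with the returned codes to find the uploads that need a retry.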
- megadetector.utils.wi_platform_utils.prepare_data_update_auth_headers(auth_token_file)[source]
Read the authorization token from a text file and prepare http headers.
- Parameters:
auth_token_file (str) – a single-line text file containing a write-enabled API token
- Returns:
http headers, with fields ‘Authorization’ and ‘Content-Type’
- Return type:
dict
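A minimal sketch of what preparing these headers might look like. The ‘Bearer’ scheme and the ‘application/json’ content type are assumptions made for illustration; this documentation only states that the returned dict has the fields ‘Authorization’ and ‘Content-Type’.

```python
import os
import tempfile

def prepare_headers_sketch(auth_token_file):
    # Read the single-line token file and strip the trailing newline
    with open(auth_token_file, 'r') as f:
        token = f.read().strip()
    # Assumption: bearer-token auth with a JSON request body; the values
    # the package actually uses are not shown in this documentation.
    return {
        'Authorization': 'Bearer ' + token,
        'Content-Type': 'application/json',
    }

# Demonstrate with a throwaway token file
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as f:
    f.write('not-a-real-token\n')
    token_file = f.name
headers = prepare_headers_sketch(token_file)
os.remove(token_file)
```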
- megadetector.utils.wi_platform_utils.push_results_for_images(payload, headers, url='https://placeholder', verbose=False)[source]
Push results for one or more images represented in [payload] to the process_cv_response API, to write to the WI DB.
- Parameters:
payload (dict) – payload to upload to the API
headers (dict) – authorization headers, see prepare_data_update_auth_headers
url (str, optional) – API URL
verbose (bool, optional) – enable additional debug output
- Returns:
response status code
- Return type:
int
- megadetector.utils.wi_platform_utils.read_images_from_download_bundle(download_folder)[source]
Reads all images.csv files from [download_folder], returns a dict mapping image IDs to a list of dicts that describe each image. It’s a list of dicts rather than a single dict because images may appear more than once, typically indicating multiple species.
- Parameters:
download_folder (str) – a folder containing one or more images.csv files, typically representing a Wildlife Insights download bundle. If this is a single .csv file, reads just that file.
- Returns:
- Maps image GUIDs to lists of dicts, each with at least the following fields:
project_id (int)
deployment_id (str)
image_id (str, should match the key)
filename (str, the filename without path at the time of upload)
location (str, starting with gs://)
May also contain classification fields: wi_taxon_id (str), species, etc. Returns None if no image .csv files are available.
- Return type:
dict
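The reason for mapping each image ID to a list of dicts can be seen in a small sketch. The .csv excerpt below is hypothetical (real files have more columns), and the grouping uses the standard csv module, which is not necessarily how the package reads these files.

```python
import csv
import io
from collections import defaultdict

# Hypothetical excerpt of an images.csv file: the same image appears
# twice because two species were recorded for it.
csv_text = (
    'project_id,deployment_id,image_id,filename,location,species\n'
    '1,dep_a,img_001,IMG_0001.JPG,gs://bucket/dep_a/img_001.JPG,species_a\n'
    '1,dep_a,img_001,IMG_0001.JPG,gs://bucket/dep_a/img_001.JPG,species_b\n'
)

def group_rows_by_image_id(csv_file):
    # Append every row to the list for its image ID, so repeated
    # images keep all of their rows.
    image_id_to_records = defaultdict(list)
    for row in csv.DictReader(csv_file):
        image_id_to_records[row['image_id']].append(row)
    return dict(image_id_to_records)

records = group_rows_by_image_id(io.StringIO(csv_text))
```

Note that csv.DictReader yields every value as a string, so a field such as project_id would need an explicit int() conversion to match the types listed above.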
- megadetector.utils.wi_platform_utils.read_sequences_from_download_bundle(download_folder)[source]
Reads all sequences.csv files from [download_folder], returns a dict mapping sequence_id values to a list of dicts that describe each sequence. It’s a list of dicts rather than a single dict because sequences may appear more than once, typically indicating multiple species.
- Parameters:
download_folder (str) – a folder containing one or more sequences.csv files, typically representing a Wildlife Insights download bundle. If this is a single .csv file, reads just that file.
- Returns:
- Maps string-formatted sequence IDs to lists of dicts, each with at least the following fields:
project_id (int)
deployment_id (str)
May also contain classification fields: wi_taxon_id (str), species, etc. Returns None if no sequence .csv files are available.
- Return type:
dict
- megadetector.utils.wi_platform_utils.record_is_unidentified(record)[source]
A record is considered “unidentified” if the “identified_by” field is either NaN or “computer vision”.
- Parameters:
record (dict) – dict representing a WI result loaded from a .csv file, with at least the field “identified_by”
- Returns:
True if the “identified_by” field is either NaN or a string indicating that this record has not yet been human-reviewed.
- Return type:
bool
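A minimal sketch of this test, assuming the record came from a .csv file loaded in a way that represents empty cells as float NaN (as pandas does). The case-insensitive comparison is an assumption; the exact string match used by the package is not shown here.

```python
import math

def record_is_unidentified_sketch(record):
    identified_by = record['identified_by']
    # An empty .csv cell typically arrives as float NaN
    if isinstance(identified_by, float) and math.isnan(identified_by):
        return True
    # Assumption: match "computer vision" regardless of case or padding
    return str(identified_by).strip().lower() == 'computer vision'

results = [
    record_is_unidentified_sketch({'identified_by': float('nan')}),
    record_is_unidentified_sketch({'identified_by': 'Computer vision'}),
    record_is_unidentified_sketch({'identified_by': 'A human reviewer'}),
]
```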
- megadetector.utils.wi_platform_utils.record_lists_are_identical(records_0, records_1, verbose=False)[source]
Takes two lists of records in the form returned by read_images_from_download_bundle and determines whether they are the same.
- Parameters:
records_0 (list of dict) – the first list of records to compare
records_1 (list of dict) – the second list of records to compare
verbose (bool, optional) – enable additional debug output
- Returns:
True if the two lists are identical
- Return type:
bool
- megadetector.utils.wi_platform_utils.url_to_relative_path(url, image_flattening='deployment')[source]
Convert a WI gs:// URL to a relative path.
- Parameters:
url (str) – the URL to convert to a relative path
image_flattening (str, optional) – if ‘none’ or None, relative paths will be returned as the entire URL for each image, other than gs://. Can be ‘guid’ (just return [GUID].JPG) or ‘deployment’ (return [deployment]/[GUID].JPG).
- Returns:
converted path
- Return type:
str
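The three flattening modes can be sketched as follows. This assumes the deployment ID is the folder that directly contains the image, which is an illustrative guess about the URL layout; the bucket and folder names are hypothetical.

```python
def url_to_relative_path_sketch(url, image_flattening='deployment'):
    if not url.startswith('gs://'):
        raise ValueError('Expected a gs:// URL')
    path = url[len('gs://'):]
    if image_flattening is None or image_flattening == 'none':
        # Keep the whole URL, minus the gs:// scheme
        return path
    tokens = path.split('/')
    if image_flattening == 'guid':
        return tokens[-1]
    if image_flattening == 'deployment':
        # Assumption: the deployment is the image's parent folder
        return '/'.join(tokens[-2:])
    raise ValueError('Unknown flattening mode: {}'.format(image_flattening))

url = 'gs://bucket/project/deployment_01/abcd.JPG'  # hypothetical URL
by_mode = {mode: url_to_relative_path_sketch(url, mode)
           for mode in ('none', 'guid', 'deployment')}
```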
- megadetector.utils.wi_platform_utils.validate_payload(payload)[source]
Verifies that the dict [payload] is compatible with the ProcessCVResponse() API. Throws an error if [payload] is invalid.
- Parameters:
payload (dict) – payload in the format expected by push_results_for_images.
- Returns:
True if validation succeeds; the return value is just future-proofing, since an invalid payload currently raises an error rather than returning False
- Return type:
bool
- megadetector.utils.wi_platform_utils.wi_result_to_prediction_string(r)[source]
Convert the dict [r] - typically loaded from a row in a downloaded .csv file - to a valid prediction string, e.g.:
1f689929-883d-4dae-958c-3d57ab5b6c16;;;;;;animal
90d950db-2106-4bd9-a4c1-777604c3eada;mammalia;rodentia;;;;rodent
- Parameters:
r (dict) – dict containing WI prediction information, with at least the fields specified in wi_result_fields.
- Returns:
the result in [r], as a semicolon-delimited prediction string
- Return type:
str
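A minimal sketch of how such a string might be assembled. The seven field names below are assumptions inferred from the example strings (an ID, five taxonomic levels, and a common name); the authoritative list is the constant wi_result_fields, which is not reproduced in this documentation.

```python
import math

# Assumed field order, inferred from the example prediction strings
assumed_fields = ['wi_taxon_id', 'class', 'order', 'family',
                  'genus', 'species', 'common_name']

def prediction_string_sketch(r):
    tokens = []
    for field in assumed_fields:
        value = r.get(field, '')
        # Treat missing values and NaN (empty .csv cells) as empty tokens
        if value is None or (isinstance(value, float) and math.isnan(value)):
            value = ''
        tokens.append(str(value))
    return ';'.join(tokens)

s = prediction_string_sketch({
    'wi_taxon_id': '90d950db-2106-4bd9-a4c1-777604c3eada',
    'class': 'mammalia',
    'order': 'rodentia',
    'family': float('nan'),
    'common_name': 'rodent',
})
```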
- megadetector.utils.wi_platform_utils.write_download_commands(image_records, download_dir_base, force_download=False, n_download_workers=25, download_command_file_base=None, image_flattening='deployment')[source]
Given a list of dicts with at least the field ‘location’ (a gs:// URL), prepare a set of “gcloud storage” commands to download images, and write those to a series of .sh scripts, along with one .sh script that runs all the others and blocks.
gcloud commands will use relative paths.
- Parameters:
image_records (list of dict) – list of dicts with at least the field ‘location’. Can also be a dict whose values are lists of record dicts.
download_dir_base (str) – local destination folder
force_download (bool, optional) – include gs commands even if the target file exists
n_download_workers (int, optional) – number of scripts to write (that’s our hacky way of controlling parallelization)
download_command_file_base (str, optional) – path of the .sh script we should write, defaults to “download_wi_images.sh” in the destination folder. Individual worker scripts will have a number added, e.g. download_wi_images_00.sh.
image_flattening (str, optional) – if ‘none’, relative paths will be preserved representing the entire URL for each image. Can be ‘guid’ (just download to [GUID].JPG) or ‘deployment’ (download to [deployment]/[GUID].JPG).
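Splitting commands across worker scripts can be sketched as below. The round-robin assignment is an assumption made for illustration (the package may divide the work differently), and the command strings are placeholders rather than real gcloud invocations; only the numbered script naming follows the description above.

```python
import os

def worker_script_names(command_file_base, n_workers):
    # download_wi_images.sh -> download_wi_images_00.sh, _01.sh, ...
    root, ext = os.path.splitext(command_file_base)
    return ['{}_{:02d}{}'.format(root, i, ext) for i in range(n_workers)]

def split_commands_round_robin(commands, n_workers):
    # Assumption: command i is assigned to worker i % n_workers
    chunks = [[] for _ in range(n_workers)]
    for i, command in enumerate(commands):
        chunks[i % n_workers].append(command)
    return chunks

names = worker_script_names('download_wi_images.sh', 2)
chunks = split_commands_round_robin(['cmd_0', 'cmd_1', 'cmd_2'], 2)
```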
- megadetector.utils.wi_platform_utils.write_prefix_download_command(image_records, download_dir_base, force_download=False, download_command_file=None)[source]
Write a .sh script to download all images (using gcloud) from the longest common URL prefix in the images represented in [image_records].
- Parameters:
image_records (list of dict) – list of dicts with at least the field ‘location’. Can also be a dict whose values are lists of record dicts.
download_dir_base (str) – local destination folder
force_download (bool, optional) – overwrite existing files
download_command_file (str, optional) – path of the .sh script we should write, defaults to “download_wi_images_with_prefix.sh” in the destination folder.
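Finding the longest common URL prefix can be sketched with os.path.commonprefix. Trimming back to the last ‘/’ is an assumption about the desired behavior, and the URLs are hypothetical.

```python
import os

def longest_common_url_prefix(urls):
    prefix = os.path.commonprefix(urls)
    # commonprefix compares character by character, so trim back to the
    # last '/' to avoid ending in the middle of a folder name.
    return prefix[:prefix.rfind('/') + 1]

urls = [
    'gs://bucket/project/dep_01/a.JPG',  # hypothetical URLs
    'gs://bucket/project/dep_02/b.JPG',
]
prefix = longest_common_url_prefix(urls)
```

Without the trim, the raw character-wise prefix here would be ‘gs://bucket/project/dep_0’, which does not correspond to a real folder.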
utils.write_html_image_list module
write_html_image_list.py
Given a list of image file names, writes an HTML file that shows all those images, with optional one-line headers above each.
Each “filename” can also be a dict with elements ‘filename’, ‘title’, ‘imageStyle’, ‘textStyle’, ‘linkTarget’
- megadetector.utils.write_html_image_list.write_html_image_list(filename=None, images=None, options=None)[source]
Given a list of image file names, writes an HTML file that shows all those images, with optional one-line headers above each.
- Parameters:
filename (str, optional) – the .html output file; if None, just returns a valid options dict
images (list, optional) –
the images to write to the .html file; if None, just returns a valid options dict. This can be a flat list of image filenames, or this can be a list of dictionaries with one or more of the following fields:
filename (image filename) (required, all other fields are optional)
imageStyle (css style for this image)
textStyle (css style for the title associated with this image)
title (text label for this image)
linkTarget (URL to which this image should link on click)
options (dict, optional) –
a dict with one or more of the following fields:
f_html (file pointer to write to, used for splitting write operations over multiple calls)
pageTitle (HTML page title)
headerHtml (html text to include before the image list)
subPageHeaderHtml (html text to include before the images when images are broken into pages)
trailerHtml (html text to include after the image list)
defaultImageStyle (default css style for images)
defaultTextStyle (default css style for image titles)
maxFiguresPerHtmlFile (max figures for a single HTML file; overflow will be handled by creating multiple files and a TOC with links)
urlEncodeFilenames (default True, e.g. ‘#’ will be replaced by ‘%23’)
urlEncodeLinkTargets (default True, e.g. ‘#’ will be replaced by ‘%23’)
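The two accepted forms of image entry, and the effect of URL-encoding a filename, can be sketched as follows. Using urllib.parse.quote as the encoder is an assumption; the documentation only states that ‘#’ becomes ‘%23’. The filenames are hypothetical.

```python
from urllib.parse import quote

# Image entries can be plain filenames or dicts; only 'filename' is required
images = [
    'images/cat.jpg',
    {'filename': 'images/deer #2.jpg', 'title': 'Deer, second visit'},
]

def encoded_filename(image):
    # Accept either form of entry, then percent-encode the filename
    filename = image['filename'] if isinstance(image, dict) else image
    return quote(filename)

encoded = [encoded_filename(image) for image in images]
```

A list like this is what would be passed as the images argument of write_html_image_list.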
utils.extract_frames_from_video module
extract_frames_from_video.py
Extracts frames from a source video or folder of videos and writes those frames to jpeg files. For single videos, writes frame images to the destination folder. For folders of videos, creates subfolders in the destination folder (one per video) and writes frame images to those subfolders.
- class megadetector.utils.extract_frames_from_video.FrameExtractionOptions[source]
Bases: object
Parameters controlling the behavior of extract_frames().
- detector_output_file
Path to MegaDetector .json output file. When specified, extracts frames referenced in this file. Mutually exclusive with frame_sample. [source] must be a folder when this is specified.
- frame_sample
Sample every Nth frame starting from the first frame; if this is None or 1, every frame is extracted. If this is a negative value, it’s interpreted as a sampling rate in seconds, which is rounded to the nearest frame sampling rate. Mutually exclusive with detector_output_file.
- max_width
Maximum width for extracted frames (defaults to None)
- n_workers
Number of workers to use for parallel processing
- parallelize_with_threads
Use threads for parallel processing
- quality
JPEG quality for extracted frames
- verbose
Enable additional debug output
- megadetector.utils.extract_frames_from_video.extract_frames(source, destination, options=None)[source]
Extracts frames from a video or folder of videos.
- Parameters:
source (str) – path to a single video file or folder of videos
destination (str) – folder to write frame images to (will be created if it doesn’t exist)
options (FrameExtractionOptions, optional) – parameters controlling frame extraction
- Returns:
- for single videos, returns (list of frame filenames, frame rate).
for folders, returns (list of lists of frame filenames, list of frame rates, list of video filenames)
- Return type:
tuple
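One plausible reading of the frame_sample convention is sketched below. Treating a negative value as an interval in seconds and multiplying it by the frame rate is an interpretation of the description, not the package's code; the clamp to a minimum of 1 is likewise an assumption.

```python
def frame_sample_to_every_n(frame_sample, fps):
    # None or 1 means "extract every frame"
    if frame_sample is None or frame_sample == 1:
        return 1
    if frame_sample < 0:
        # Assumption: -2 means "one frame every 2 seconds", rounded to
        # the nearest whole number of frames and never less than 1.
        return max(1, round(abs(frame_sample) * fps))
    return frame_sample

every_n = [
    frame_sample_to_every_n(None, 30.0),
    frame_sample_to_every_n(10, 30.0),
    frame_sample_to_every_n(-2, 29.97),
    frame_sample_to_every_n(-0.01, 30.0),
]
```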
extract_frames_from_video - CLI interface
Extract frames from videos and save as JPEG files
extract_frames_from_video [-h] [--n_workers N_WORKERS] [--parallelize_with_threads]
[--quality QUALITY] [--max_width MAX_WIDTH] [--verbose]
[--frame_sample FRAME_SAMPLE | --detector_output_file DETECTOR_OUTPUT_FILE]
source destination
extract_frames_from_video positional arguments
source - Path to a single video file or folder containing videos
destination - Output folder for extracted frames (will be created if it does not exist)
extract_frames_from_video options
--n_workers N_WORKERS - Number of workers to use for parallel processing (default: %(default)s)
--parallelize_with_threads - Use threads for parallel processing (default: use processes)
--quality QUALITY - JPEG quality for extracted frames (default: %(default)s)
--max_width MAX_WIDTH - Maximum width for extracted frames (default: no resizing)
--verbose - Enable additional debug output
--frame_sample FRAME_SAMPLE - Sample every Nth frame starting from the first frame; if this is None or 1, every frame is extracted. If this is a negative value, it’s interpreted as a sampling rate in seconds, which is rounded to the nearest frame sampling rate
--detector_output_file DETECTOR_OUTPUT_FILE - Path to MegaDetector .json output file. When specified, extracts frames referenced in this file. Source must be a folder when this is specified.