I wouldn't delete a "duplicate" without checking that the files are bitwise identical (or at least have identical hashes). Keep in mind that bit-rot is possible, especially if files have been stored on DVD-R or other optical media. Ruling out duplicates quickly based on a hash signature of an early part of a file is useful, but you'd still want to do a full check before declaring two files duplicates for most purposes.

You'd probably want to make sure you do a full compare (or hash) on the first and last 1 MiB or so, where metadata can live that might be edited without introducing offsets to the compressed data. Also, read granularity from storage is typically at least 512 bytes, not 16, so you might as well read that much; the tiny bit of extra CPU time to compare more data is trivial. (A write sector size of at least 4096B is typical, but a logical sector size of 512 might allow a SATA disk to only send the requested 512B over the wire, if the kernel doesn't widen the request to a full page itself. Which it probably would; the pagecache is managed in whole pages.)

For other use-cases, maybe you'd be OK with less stringent checking, but given that duplicate file finders exist that only compare or hash when there are files of identical size, you can just let one of those run (overnight, or while you're going out) and come back to a fully checked list. Some, like fslint, have the option to hard-link duplicates to each other (or symlink), so next time you look for duplicates they'll already be the same file. So in my experience, duplicate file finding is not something where I've felt a need to take a faster but risky approach. (fslint never got updated for Python 3; apparently czkawka is a modern clone in Rust, according to an askubuntu answer.) Comments under the question point out that jdupes can use hashes of the first N megabytes of a file after a size compare, so that's a step in the right direction.

Your method could be good for detecting files that are corrupt (or metadata-edited) copies of each other, something that normal duplicate-finders won't do easily. If two files are almost the same but have a few bit-differences, use ffmpeg -i foo.mp4 -f null - to find glitches, decoding but doing nothing with the output. If you do find a bitwise difference but neither file has errors a decoder notices, use ffmpeg -i foo.mp4 -f framecrc foo.fcrc (or -f framemd5) to see which frame has a difference that wasn't an invalid h.264 stream. Then seek to there and visually inspect which copy is corrupt.

Background with the example rsync

rsync has several levels of checking whether files are identical. The standard check is size and other file attributes (for example time stamps); the -c option switches to checksums instead:

  -a, --archive     archive mode; equals -rlptgoD (no -H,-A,-X)
  -c, --checksum    skip based on checksum, not mod-time & size

The most rigorous check is to compare each byte, but as you write, it takes a lot of time when there is a lot of data, for example a whole backup. The manual, man rsync, is very detailed, and you can identify what I describe there, and probably also some other interesting alternatives.

Shellscript implementation of the OP's idea
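A minimal sketch of such a script, assuming the idea is to rule out non-duplicates cheaply (same size first, then a hash of a short prefix) and only then do a full byte-for-byte comparison. The 4 KiB prefix length is arbitrary, GNU find and coreutils are assumed, and file names must not contain spaces or newlines:

#!/bin/bash
# Candidate selection: only files that share their size with at least one
# other file can possibly be duplicates.
dir=${1:-.}
find "$dir" -type f -printf '%s\t%p\n' | sort -n |
awk -F'\t' '$1 == prev { print last; print $2 } { prev = $1; last = $2 }' |
sort -u |
while read -r f; do
    # Cheap filter: hash only the first 4 KiB of each candidate.
    printf '%s %s\n' "$(head -c 4096 "$f" | sha256sum | cut -d' ' -f1)" "$f"
done | sort |
awk '$1 == prev { print last, $2 } { prev = $1; last = $2 }' |
while read -r a b; do
    # Final check: never report a duplicate without a full comparison.
    cmp -s "$a" "$b" && printf 'duplicate: %s == %s\n' "$a" "$b"
done

The cmp at the end is the point: the size and prefix-hash passes only decide which pairs are worth a full comparison; they never declare duplicates on their own.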
Czkawka is an open source tool which was created to find duplicate files (and images, videos or music) and present them through command-line or graphical interfaces, with an emphasis on speed. This part from the documentation may interest you:

Faster scanning for big number of duplicates
By default, a partial hash (a hash from only 2KB of each file) is computed for all files grouped by the same size. Such a hash is usually computed very fast, especially on SSDs and fast multicore processors. But when scanning hundreds of thousands or millions of files with an HDD or a slow processor, this step can typically take much time. With the GUI version, hashes will be stored in a cache so that searching for duplicates later will be way faster.

We generate random images, then copy a.jpg to b.jpg in order to have a duplicate:

$ convert -size 1000x1000 plasma:fractal a.jpg
$ convert -size 1000x1000 plasma:fractal c.jpg
$ convert -size 1000x1000 plasma:fractal d.jpg

Check only the size:

$ linux_czkawka_cli dup --directories /run/shm/test/ --search-method size
Found 2 files in 1 groups with same size (may have different content) which took 361.76 KiB:

Check files by their hashes:

$ linux_czkawka_cli dup --directories /run/shm/test/ --search-method hash
Found 2 duplicated files in 1 groups with same content which took 361.76 KiB:

Check files by analyzing them as images:

$ linux_czkawka_cli image --directories /run/shm/test/
Found 1 images which have similar friends
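Whichever tool produced the report, the earlier advice still applies: confirm a bitwise match before deleting anything. For the pair found above (a.jpg and b.jpg), that confirmation is cheap with standard tools:

$ cmp a.jpg b.jpg && echo identical
$ sha256sum a.jpg b.jpg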