How do you test if 2 large videos are identical?
Problem description
I have a system where video files are ingested and then multiple CPU-intensive tasks are started. As these tasks are computationally expensive, I would like to skip processing a file if it has already been processed.
Videos come from various sources, so file names etc. are not viable options.
If I were using pictures I would compare the MD5 hashes, but on a 5GB - 40GB video this can take a long time to compute.
To compare the 2 videos I am testing this method:
- check relevant metadata matches
- check length of file with ffmpeg / ffprobe
- use ffmpeg to extract frames at 100 predefined timestamps [1-100]
- create MD5 hashes of each of those frames
- compare the MD5 hashes to check for a match
Does anyone know a more efficient way of doing this? Or a better way to approach the problem?
First, you need to properly define under which conditions two video files are considered the same. Do you mean exactly identical, as in byte-for-byte? Or do you mean identical in content? In the latter case you need to define a proper comparison method for the content.
I'm assuming the first (exactly identical files). This is independent of what the files actually contain. When you receive a file, always build a hash for the file and store the hash along with the file.
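As a minimal sketch of building the hash at ingest time, the file can be read in fixed-size chunks so that a 40GB video never has to fit in memory. The function name `file_sha256`, the 1 MiB chunk size, and the choice of SHA-256 are illustrative assumptions, not something prescribed by the answer.

```python
import hashlib


def file_sha256(path, chunk_size=1024 * 1024):
    """Return the hex SHA-256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling read() until it returns b"" (EOF).
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The digest would then be stored next to the file (for example in a database column), so later lookups do not need to re-read the file.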
Checking for duplicates then is a multi-step process:
1.) Compare hashes. If no stored hash matches, the file is new. For a new file this is usually the only step needed: a good hash (SHA1 or something bigger) will have very few collisions for any practical number of files.
2.) If other files have the same hash, check the file length. If the lengths don't match, the file is new.
3.) If both hash and file length match, compare the entire file contents, stopping at the first difference. If the full comparison finds no difference, the file is the same.
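The three steps can be sketched roughly as follows. This is only an illustration under assumptions: `index` is a hypothetical in-memory dict mapping a hex digest to the paths already ingested (a real system would more likely use a database), and the helper names are made up for the example.

```python
import hashlib
import os

CHUNK = 1024 * 1024  # read size in bytes; an arbitrary choice


def file_sha256(path):
    """Hex SHA-256 digest of a file, computed chunk by chunk."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(CHUNK), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_bytes(path_a, path_b):
    """Byte-for-byte comparison that stops at the first differing chunk."""
    with open(path_a, "rb") as a, open(path_b, "rb") as b:
        while True:
            chunk_a, chunk_b = a.read(CHUNK), b.read(CHUNK)
            if chunk_a != chunk_b:
                return False
            if not chunk_a:  # both files reached EOF at the same point
                return True


def find_duplicate(path, index):
    """Return the known path identical to `path`, or None if `path` is new.

    `index` maps digest -> list of known paths (hypothetical store).
    A new file is added to the index before returning None.
    """
    digest = file_sha256(path)
    size = os.path.getsize(path)
    for known in index.get(digest, []):       # step 1: hash lookup
        if os.path.getsize(known) != size:    # step 2: length check
            continue
        if same_bytes(path, known):           # step 3: full comparison
            return known
    index.setdefault(digest, []).append(path)
    return None
```

With an empty `index`, the first call for a file returns `None` and registers it; a later call with a byte-identical copy returns the path of the first file.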
In the worst case (the files are identical) this should take no longer than reading both files at raw IO speed. In the best case (the hashes differ) the test takes only as long as the hash lookup (in a DB, a HashMap, or whatever you use).
EDIT: You are concerned about the IO needed to build the hash. You can partially avoid it by comparing the file length first and skipping everything else if the file length is unique. On the other hand, you then also need to keep track of which files already have a hash. This lets you defer building the hash until you really need it. If a hash is missing, you could skip directly to comparing the two files while building the hashes in the same pass. It's a lot more state to keep track of, but it may be worth it depending on your scenario (to decide, you need solid data on how often duplicate files occur and on their size distribution).
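One possible shape for the deferred-hash idea is sketched below. It is a simplification under stated assumptions: the class `LazyDedupIndex` and its attributes are hypothetical, state is kept in memory only, and a hash match is treated as a duplicate without the final byte-for-byte comparison or the single-pass compare-while-hashing optimisation described above.

```python
import hashlib
import os


class LazyDedupIndex:
    """Tracks files by size and hashes them only when two sizes collide."""

    def __init__(self):
        self.by_size = {}  # file size -> list of known paths
        self.hashes = {}   # path -> hex digest, filled in lazily

    def _hash(self, path):
        # Compute the digest on first use and remember it for later lookups.
        if path not in self.hashes:
            digest = hashlib.sha256()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(1024 * 1024), b""):
                    digest.update(chunk)
            self.hashes[path] = digest.hexdigest()
        return self.hashes[path]

    def add(self, path):
        """Register `path`; return a known path with equal hash, else None."""
        size = os.path.getsize(path)
        peers = self.by_size.setdefault(size, [])
        duplicate = None
        if peers:  # unique size: no hashing and no file IO at all
            new_digest = self._hash(path)
            for known in peers:
                if self._hash(known) == new_digest:
                    duplicate = known
                    break
        if duplicate is None:
            peers.append(path)
        return duplicate
```

A file whose size has never been seen is registered without being read; hashing only happens once a second file of the same size arrives.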