Getting close to the 1.0 release of the metadata library.
If there are no major problems with this release candidate, that is.
tarball: http://dark.fhtr.org/repos/metadata/metadata-1.0-rc2.tar.gz
git: http://dark.fhtr.org/repos/metadata
Changes
- bittorrent .torrent support
- untested ape/musepack/wavpack support with apetag gem
  (couldn’t find any ape samples)
- ‘name’ and ‘author’ parsed from mplayer output
- more documentation
- better handling of images that have no exif info
- tested ra, wma and m4a
- tested more image formats
- fixed audio samplerate and channels for videos
  (i.e. convert the string to int)
- bin/mdh now prints the metadata as the default action.
  Use -c to create an MDH file.
- MDH files have a “MDH#{version_byte}” header now,
  the validity of which is checked by mdh.
Thanks
Konrad M. for his patient testing and bug reports.
Darren Kirby for the heads-up on wmainfo’s ASF-parsing capabilities
(along with being the author of wmainfo-rb and flacinfo-rb.)
Description
This package `Metadata’ comes with a library called `metadata’ and
a small program called `mdh’.
The library probes files for their metadata (e.g. jpeg dimensions
and camera make, mp3 artist, pdf text and word count) and returns the
metadata as a Hash. All strings in the metadata are converted to
UTF-8.
The `mdh’-program can print out file metadata as YAML and package the
metadata with the file.
The metadata hash follows the shared file metadata spec naming, with
some additional fields; see the list at the end of this file (Appendix A).
For details on the MDH file format, see the end of this file (Appendix B).
Usage
print out metadata for myfile.jpg
  mdh myfile.jpg
create myfile.jpg.mdh, which consists of an MDH metadata header + myfile.jpg
  mdh -c myfile.jpg
print out the metadata header from an MDH file
  mdh -e -p myfile.jpg.mdh
strip out the metadata header from an MDH file and save it to myfile.jpg
  mdh -e myfile.jpg.mdh
print out the list of options
  mdh -h
irb> require 'metadata'
irb> Metadata.extract('myfile.jpg')
irb> Metadata.extract_text('myfile.pdf')
irb> Pathname.new("myfile.jpg").metadata
List of supported formats
Audio:
Whatever you manage to make mplayer play.
Plus special handlers for FLAC, m4a, ape, musepack, wavpack and wma.
Successfully tested with:
mp3, flac, ogg, wav, ra, wma, m4a
Should also work:
wv, mpc, ape
Video:
Whatever you manage to make mplayer play.
Successfully tested with:
wmv, mov, divx, xvid, flv, ogm, mpg, mkv
Images:
Should handle pretty much anything (apart from ORF.)
Successfully tested with:
jpeg, png, gif, nef, dng, crw, pef, psd, tga, tif, xcf, xpm, ppm, bmp
Documents:
Successfully tested with:
pdf, ppt, odp, sxi, ps, ps.gz, html, txt
Should work:
- OpenOffice docs work to some degree (personally, I’m using unoconv to
  convert OO docs to temp PDFs for the text & dimensions extraction, so
  those bits of data are missing.)
- MS Office docs to some degree (ppt at least, doc and xls should work too,
  dimensions missing due to the above temp PDF thing.)
Others:
- BitTorrent .torrent files
- Archive contents
- Whatever `extract’ outputs and I am handling
Requirements
- Ruby 1.8
- Tons of metadata extraction programs and libs.
  This package has many dependencies since there is no single universal
  metadata header format that all files use. Blame resource forks, filename
  extensions, bags of bytes and mimetypes.
  List of gems:
    flacinfo-rb
    wmainfo-rb
    MP4info
    id3lib-ruby
    apetag
  List of Debian packages:
    dcraw
    libimlib2-ruby
    extract
    libimage-exiftool-perl
    poppler-utils
    mplayer
    html2text
    imagemagick
    unhtml
    pstotext
    antiword
    catdoc
    shared-mime-info
- You do want to install the latest versions of dcraw and
  shared-mime-info to be able to handle camera raw images.
  http://cybercom.net/~dcoffin/dcraw/
  shared-mime-info
- Python + chardet library
  http://chardet.feedparser.org/
Install
Decompress the archive and enter its top directory.
Then type:
($ su)
# ruby setup.rb
These simple steps install this program under the default
location of Ruby libraries. You can also install files into
your favorite directory by supplying setup.rb some options.
Try “ruby setup.rb --help”.
Appendix A: Additional metadata fields
This list contains the metadata fields added to the shared
file metadata spec.
shared-filemetadata-spec
field name | field type
Archive.Contents array of pathnames
Audio.Band string
Audio.Composer string
Audio.Conductor string
Audio.Copyright string (copyright message)
Audio.Grouping string
Audio.Image binary string (embedded image data)
Audio.InterpretedBy string
Audio.Lyricist string
Audio.Publisher string
Audio.RemixedBy string
Audio.Subtitle string
Audio.Tempo integer
Audio.VariableBitrate boolean
Audio.Writer string
Audio.Publicationright string
Audio.File string
Audio.EAN/UPC string
Audio.ISBN string
Audio.Catalog string
Audio.LC string
Audio.Media string
Audio.Index string
Audio.Related string
Audio.ISRC string
Audio.Abstract string
Audio.Language string
Audio.Bibliography string
Audio.Introplay string
Audio.Dummy string
Audio.DebutAlbum string
Audio.RecordDate string
Audio.RecordLocation string
Doc.Album string
Doc.Artist string
Doc.Charset string
Doc.Description string
Doc.Genre string
Doc.Language string
Doc.ModifyDate date
Doc.PageSizeName string (A4, A5, letter, …)
File.Software string (software used to create the file)
Image.DateCreated date
Image.DateTimeCreated date
Image.DateTimeOriginal date
Image.DimensionUnit string (px, mm, pt, …)
Image.EXIF string (exiftool output)
Image.Frames integer
Image.Modified date
Image.OriginatingProgram string
Location.Latitude float
Location.Longitude float
Video.Album string
Video.Artist string
Video.Bitrate integer
Video.Codec string
Video.Comment string
Video.Duration float
Video.Framerate float (frames per second)
Video.Genre string
Video.ReleaseDate date
Video.Title string
Video.TrackNo integer
BitTorrent.Files array of {‘path’ => string, ‘length’ => integer}
BitTorrent.Length integer (size of single-file torrents)
BitTorrent.Announce string (announce url)
BitTorrent.AnnounceList array of arrays of strings
BitTorrent.Nodes array of [hostname, port] arrays
Appendix B: The MDH file format
MDH files are built as follows:
bytes | content
3 | "MDH" - MDH file format identifier
1 | "\x01" - MDH file format version number
4 | Long, network byte order - the size of the metadata struct in bytes
var | YAML - The MDH metadata struct
var | The actual file contents
All string fields in the metadata are UTF-8.
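The layout above can be sketched in a few lines of Ruby. This is only an illustration of the byte layout in the table, not the code that mdh itself uses: the helper names mdh_pack and mdh_unpack are made up for this example, and the sketch assumes a modern Ruby (String#b needs 2.0 or later) rather than the 1.8 listed under Requirements.

```ruby
require 'yaml'
require 'stringio'

MDH_MAGIC   = "MDH".b    # 3-byte file format identifier
MDH_VERSION = "\x01".b   # 1-byte file format version

# Hypothetical helper: build an MDH blob from a metadata hash and the
# raw file contents (magic, version, 4-byte big-endian size, YAML, file).
def mdh_pack(metadata, contents)
  yaml = metadata.to_yaml.b
  MDH_MAGIC + MDH_VERSION + [yaml.bytesize].pack("N") + yaml + contents.b
end

# Hypothetical helper: split an MDH blob into [metadata, contents],
# checking the "MDH" + version header first.
def mdh_unpack(blob)
  io = StringIO.new(blob.b)
  raise ArgumentError, "not an MDH file" unless io.read(3) == MDH_MAGIC
  raise ArgumentError, "unsupported MDH version" unless io.read(1) == MDH_VERSION
  size = io.read(4).unpack("N").first          # network byte order long
  yaml = io.read(size).force_encoding("UTF-8") # metadata strings are UTF-8
  [YAML.load(yaml), io.read.to_s]              # the rest is the original file
end

# Round trip with a field name from Appendix A and placeholder contents.
blob = mdh_pack({ "Video.Title" => "demo" }, "raw file bytes")
metadata, contents = mdh_unpack(blob)
```

Storing the metadata size up front is what lets a reader skip straight to the original file contents without parsing the YAML.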
License
Ruby’s