Metadata 1.0-rc2

Getting close to the 1.0 release of the metadata library.
If there are no major problems with this release candidate, that is.

tarball: http://dark.fhtr.org/repos/metadata/metadata-1.0-rc2.tar.gz
git: http://dark.fhtr.org/repos/metadata

Changes

  • bittorrent .torrent support
  • untested ape/musepack/wavpack support with apetag gem
    (couldn’t find any ape samples)
  • ‘name’ and ‘author’ parsed from mplayer output
  • more documentation
  • better handling of images that have no exif info
  • tested ra, wma and m4a
  • tested more image formats
  • fixed audio samplerate and channels for videos
    (i.e. convert the string to int.)
  • bin/mdh now prints the metadata as the default action.
    Use -c to create an MDH file.
  • MDH files have a “MDH#{version_byte}” header now,
    the validity of which is checked by mdh.

Thanks

Konrad M. for his patient testing and bug reports.
Darren Kirby for the heads-up on wmainfo’s ASF-parsing capabilities
(along with being the author of wmainfo-rb and flacinfo-rb.)

Description

This package `Metadata' comes with a library called `metadata' and
a small program called `mdh'.

The library probes files for their metadata (e.g. jpeg dimensions
and camera make, mp3 artist, pdf text and word count) and returns the
metadata as a Hash. All strings in the metadata are converted to
UTF-8.

The `mdh' program can print out file metadata as YAML and package the
metadata with the file.

The metadata hash follows the shared file metadata spec naming, with some
additional fields; see the list at the end of this file (Appendix A).
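
As a rough illustration of that naming scheme, here is a minimal sketch that groups the flat "Namespace.Field" keys by namespace. The sample values are copied from the mdh output for a .torrent file later in this thread; `group_by_namespace` is a hypothetical helper, not something the library provides.

```ruby
# Sample hash shaped like the library's output. The values come from the
# .torrent example in this thread; the literal itself is only illustrative.
sample = {
  'Doc.Title'             => 'Fedora-7-i386',
  'File.Size'             => 229377,
  'File.Format'           => 'application/x-bittorrent',
  'BitTorrent.Name'       => 'Fedora-7-i386',
  'BitTorrent.PieceCount' => 11454
}

# Hypothetical helper: split each key at the first dot and nest the
# field under its namespace.
def group_by_namespace(metadata)
  grouped = Hash.new { |hash, key| hash[key] = {} }
  metadata.each do |name, value|
    namespace, field = name.split('.', 2)
    grouped[namespace][field] = value
  end
  grouped
end

grouped = group_by_namespace(sample)
puts grouped.keys.inspect
puts grouped['File'].inspect
```

Splitting at the first dot only keeps field names such as EAN/UPC intact.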

For details on the MDH file format, see the end of this file (Appendix B).

Usage

print out metadata for myfile.jpg

mdh myfile.jpg

create myfile.jpg.mdh, which consists of an MDH metadata header + myfile.jpg

mdh -c myfile.jpg

print out the metadata header from an MDH file

mdh -e -p myfile.jpg.mdh

strip out the metadata header from an MDH file and save it to myfile.jpg

mdh -e myfile.jpg.mdh

print out the list of options

mdh -h

irb> require 'metadata'
irb> Metadata.extract('myfile.jpg')
irb> Metadata.extract_text('myfile.pdf')
irb> Pathname.new("myfile.jpg").metadata

List of supported formats

Audio:
Whatever you manage to make mplayer play.
Plus special handlers for FLAC, m4a, ape, musepack, wavpack and wma.
Successfully tested with:
mp3, flac, ogg, wav, ra, wma, m4a
Should also work:
wv, mpc, ape

Video:
Whatever you manage to make mplayer play.
Successfully tested with:
wmv, mov, divx, xvid, flv, ogm, mpg, mkv

Images:
Should handle pretty much anything (apart from ORF.)
Successfully tested with:
jpeg, png, gif, nef, dng, crw, pef, psd, tga, tif, xcf, xpm, ppm,
bmp

Documents:
Successfully tested with:
pdf, ppt, odp, sxi, ps, ps.gz, html, txt
Should work:
- OpenOffice docs work to some degree (personally, I’m using unoconv to
  convert OO docs to temp PDFs for the text & dimensions extraction, so
  those bits of data are missing.)
- MS Office docs to some degree (ppt at least, doc and xls should work too,
  dimensions missing due to the above temp PDF -thing.)

Others:
- BitTorrent .torrent files
- Archive contents
- Whatever `extract' outputs and I am handling

Requirements

  • Ruby 1.8

  • Tons of metadata extraction programs and libs.
    This package has many dependencies since there is no single universal
    metadata header format that all files use. Blame resource forks, filename
    extensions, bags of bytes and mimetypes.

    List of gems:
    flacinfo-rb
    wmainfo-rb
    MP4info
    id3lib-ruby
    apetag

    List of Debian packages:
    dcraw
    libimlib2-ruby
    extract
    libimage-exiftool-perl
    poppler-utils
    mplayer
    html2text
    imagemagick
    unhtml
    pstotext
    antiword
    catdoc
    shared-mime-info

  • You do want to install the latest versions of dcraw and
    shared-mime-info to be able to handle camera raw images.
    http://cybercom.net/~dcoffin/dcraw/
    shared-mime-info

  • Python + chardet library
    http://chardet.feedparser.org/

Install

De-compress archive and enter its top directory.
Then type:

($ su)
# ruby setup.rb

This simple step installs this program under the default
location of Ruby libraries. You can also install files into
your favorite directory by supplying setup.rb some options.
Try “ruby setup.rb --help”.

Appendix A: Additional metadata fields

This list contains the metadata fields added to the shared
file metadata spec.
shared-filemetadata-spec

field name | field type

Archive.Contents array of pathnames

Audio.Band string
Audio.Composer string
Audio.Conductor string
Audio.Copyright string (copyright message)
Audio.Grouping string
Audio.Image binary string (embedded image data)
Audio.InterpretedBy string
Audio.Lyricist string
Audio.Publisher string
Audio.RemixedBy string
Audio.Subtitle string
Audio.Tempo integer
Audio.VariableBitrate boolean
Audio.Writer string
Audio.Publicationright string
Audio.File string
Audio.EAN/UPC string
Audio.ISBN string
Audio.Catalog string
Audio.LC string
Audio.Media string
Audio.Index string
Audio.Related string
Audio.ISRC string
Audio.Abstract string
Audio.Language string
Audio.Bibliography string
Audio.Introplay string
Audio.Dummy string
Audio.DebutAlbum string
Audio.RecordDate string
Audio.RecordLocation string

Doc.Album string
Doc.Artist string
Doc.Charset string
Doc.Description string
Doc.Genre string
Doc.Language string
Doc.ModifyDate date
Doc.PageSizeName string (A4, A5, letter, …)

File.Software string (software used to create the file)

Image.DateCreated date
Image.DateTimeCreated date
Image.DateTimeOriginal date
Image.DimensionUnit string (px, mm, pt, …)
Image.EXIF string (exiftool output)
Image.Frames integer
Image.Modified date
Image.OriginatingProgram string

Location.Latitude float
Location.Longitude float

Video.Album string
Video.Artist string
Video.Bitrate integer
Video.Codec string
Video.Comment string
Video.Duration float
Video.Framerate float (frames per second)
Video.Genre string
Video.ReleaseDate date
Video.Title string
Video.TrackNo integer

BitTorrent.Files array of {'path' => string, 'length' => integer}
BitTorrent.Length integer (size of single-file torrents)
BitTorrent.Announce string (announce url)
BitTorrent.AnnounceList array of arrays of strings
BitTorrent.Nodes array of [hostname, port] -arrays
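
For readers curious how fields like these can be derived, here is a minimal sketch of a bencode decoder plus a mapping onto a few of the BitTorrent.* names above. It is not the library's code: `bdecode`, `read_until`, `peek` and `torrent_fields` are hypothetical names, and the only outside facts relied on are the standard .torrent keys (announce, info, files, length, path).

```ruby
require 'stringio'

# Read characters up to (and consuming) the terminator.
def read_until(io, terminator)
  out = +''
  while (ch = io.getc) != terminator
    raise ArgumentError, 'unexpected end of input' if ch.nil?
    out << ch
  end
  out
end

# Look at the next character without consuming it.
def peek(io)
  ch = io.getc
  raise ArgumentError, 'unexpected end of input' if ch.nil?
  io.ungetc(ch)
  ch
end

# Decode one bencoded value: integer, list, dictionary or byte string.
def bdecode(io)
  case (c = io.getc)
  when 'i'
    read_until(io, 'e').to_i
  when 'l'
    list = []
    list << bdecode(io) until peek(io) == 'e'
    io.getc
    list
  when 'd'
    dict = {}
    until peek(io) == 'e'
      key = bdecode(io)
      dict[key.force_encoding('UTF-8')] = bdecode(io)
    end
    io.getc
    dict
  when /\d/
    length = (c + read_until(io, ':')).to_i
    io.read(length)
  else
    raise ArgumentError, "unexpected byte #{c.inspect}"
  end
end

# Map a decoded torrent onto some of the field names listed above.
def torrent_fields(torrent)
  info = torrent['info'] || {}
  fields = { 'BitTorrent.Announce' => torrent['announce'] }
  if info['files']
    fields['BitTorrent.Files'] = info['files'].map do |file|
      { 'path' => file['path'].join('/'), 'length' => file['length'] }
    end
  else
    fields['BitTorrent.Length'] = info['length']
  end
  fields
end

url  = 'http://torrent.linux.duke.edu:6969/announce'
data = "d8:announce#{url.bytesize}:#{url}4:infod5:filesld6:lengthi359e4:pathl7:SHA1SUMeeeee"
fields = torrent_fields(bdecode(StringIO.new(data.b)))
puts fields.inspect
```

The decoder leaves string values as raw bytes; converting them to UTF-8 would be a separate step.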

Appendix B: The MDH file format

MDH files are built as follows:

bytes | content

3   | "MDH"  - MDH file format identifier
1   | "\x01" - MDH file format version number
4   | Long, network byte order - the size of the metadata struct in bytes
var | YAML - The MDH metadata struct
var | The actual file contents

All string fields in the metadata are UTF-8.
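
The layout above can be sketched in a few lines of Ruby. This is an illustration of the byte layout only, not the code bin/mdh uses; `mdh_pack` and `mdh_unpack` are hypothetical names, and YAML.safe_load is an assumption that only holds for plain strings and numbers (date or time values would need extra permitted classes).

```ruby
require 'yaml'

MDH_MAGIC   = 'MDH'.b  # 3-byte format identifier
MDH_VERSION = 1        # 1-byte format version, "\x01"

# Build an MDH blob: magic, version, 4-byte big-endian YAML size,
# the YAML metadata, then the original file contents.
def mdh_pack(metadata, contents)
  yaml = metadata.to_yaml.b
  MDH_MAGIC + [MDH_VERSION, yaml.bytesize].pack('CN') + yaml + contents.b
end

# Split an MDH blob back into [metadata hash, file contents],
# checking the header first.
def mdh_unpack(data)
  data = data.b
  magic, version, size = data.unpack('a3CN')
  raise ArgumentError, 'not an MDH file' unless magic == MDH_MAGIC
  raise ArgumentError, "unsupported MDH version #{version.inspect}" unless version == MDH_VERSION
  raise ArgumentError, 'truncated MDH header' if size.nil?
  yaml     = data.byteslice(8, size).force_encoding('UTF-8')
  contents = data.byteslice(8 + size, data.bytesize) || ''.b
  [YAML.safe_load(yaml), contents]
end

blob = mdh_pack({ 'File.Format' => 'image/jpeg' }, "\xFF\xD8jpeg-bytes".b)
meta, body = mdh_unpack(blob)
puts meta.inspect
puts body.bytesize
```

The 'N' pack directive is a 32-bit unsigned integer in network (big-endian) byte order, which matches the "Long, network byte order" row.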

License

Ruby’s

Ilmari H. <ilmari.heikkinen gmail com>
http://fhtr.blogspot.com

I’m getting some issues with .torrents. Ex:
$ mdh -p Fedora-7-i386.torrent
undefined method `to_utf8' for ["F-7-i386-DVD.iso"]:Array
undefined method `to_utf8' for ["F-7-i386-DVD.iso"]:Array
/usr/lib/ruby/site_ruby/1.8/metadata/extract.rb:822:in `enc_utf8'
/usr/lib/ruby/site_ruby/1.8/metadata/extract.rb:452:in `application_x_bittorrent'
/usr/lib/ruby/site_ruby/1.8/metadata/extract.rb:451:in `map'
/usr/lib/ruby/site_ruby/1.8/metadata/extract.rb:451:in `application_x_bittorrent'
/usr/lib/ruby/site_ruby/1.8/metadata/extract.rb:146:in `__send__'
/usr/lib/ruby/site_ruby/1.8/metadata/extract.rb:146:in `extract'
/usr/lib/ruby/site_ruby/1.8/metadata/extract.rb:38:in `metadata'
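
The error comes from a string conversion being called on an Array (a torrent path is a list of components). One way to guard against that class of problem, sketched here under assumptions, is to convert nested values recursively. `deep_utf8` is a hypothetical helper written for current Ruby, not the library's enc_utf8, and the ISO-8859-1 fallback is an assumption (the library itself relies on chardet for detection).

```ruby
# Hypothetical helper: walk strings, arrays and hashes and return a copy
# in which every string is valid UTF-8.
def deep_utf8(value, fallback_encoding = 'ISO-8859-1')
  case value
  when String
    utf8 = value.dup.force_encoding('UTF-8')
    return utf8 if utf8.valid_encoding?
    # Assumed fallback: treat the bytes as fallback_encoding and transcode.
    value.dup.force_encoding(fallback_encoding)
         .encode('UTF-8', invalid: :replace, undef: :replace)
  when Array
    value.map { |item| deep_utf8(item, fallback_encoding) }
  when Hash
    value.each_with_object({}) do |(key, item), out|
      out[deep_utf8(key, fallback_encoding)] = deep_utf8(item, fallback_encoding)
    end
  else
    value # integers, dates and so on pass through untouched
  end
end

converted = deep_utf8('path' => ['F-7-i386-DVD.iso'], 'length' => 2900602880)
puts converted.inspect
```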

HTH,

On 9/20/07, Konrad M. wrote:

Thanks, fixed. Along with extracting more metadata from .torrents
and handling flacs with id3 tags better.

http://dark.fhtr.org/repos/metadata/metadata-1.0-rc3.tar.gz

Quoth Ilmari H.:

Oh, and, if there are any other filetypes that could do with more
metadata, feel free to request. Especially if some other program
shows it. And still more especially if the other program is a
cmdline app with a debian package ;)

I’ve actually been doing all this testing under Fedora 7. All the packages
needed are here IIRC, but they may have slightly different names (especially
debian -dev → fedora -devel).

Very cool:

$ mdh a-downloads/Fedora-7-i386.torrent
---
Doc.Created: 2007-05-29T10:58:50-07:00
BitTorrent.Name: Fedora-7-i386
BitTorrent.Files:
- length: 2900602880
  path: F-7-i386-DVD.iso
- length: 101816320
  path: F-7-i386-rescuecd.iso
- length: 359
  path: SHA1SUM
BitTorrent.PieceLength: 262144
Doc.Title: Fedora-7-i386
BitTorrent.PieceCount: 11454
File.Size: 229377
File.Modified: 2007-05-31T06:28:34-07:00
File.Format: application/x-bittorrent
BitTorrent.Announce: http://torrent.linux.duke.edu:6969/announce

Thanks a lot!

Quoth Peña, Botp:

is it possible for a gem?
will this run on windows?

kind regards -botp

In short, no (to both).
In earlier messages I asked for a gem too, and it turns out this is hard to do
because of a python script that’s included. And as far as windows, I think a
lot of this relies on shelling out, which windows… doesn’t have. So you
might look for another solution (or go to *nix).

HTH,

On 9/21/07, Peña, Botp wrote:

is it possible for a gem?

If there’s a way to install bin/chardet as-is. Right now gem adds
“#!/usr/bin/ruby” boilerplate (which doesn’t quite work for a
python script.) …Though, I could always turn it into a ruby script
that executes a bit of python. So, yeah, it is possible. I’ll build one.
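
A minimal sketch of what such a wrapper might look like, assuming a python3 interpreter with the chardet module on the PATH. The names `chardet_command` and `detect_charset` are hypothetical; `chardet.detect` is the chardet library's detection call and `Open3.capture2` is from Ruby's standard library.

```ruby
require 'open3'

# Python source handed to the interpreter with -c. It reads raw bytes
# from stdin and prints the detected encoding name (or an empty line).
CHARDET_SNIPPET = <<~PYTHON
  import sys, chardet
  data = sys.stdin.buffer.read()
  print(chardet.detect(data)["encoding"] or "")
PYTHON

# Build the argv for the Python call; the interpreter name is an assumption.
def chardet_command(python = 'python3')
  [python, '-c', CHARDET_SNIPPET]
end

# Return the detected charset name, or nil if detection failed.
# Raises Errno::ENOENT if the interpreter is not installed.
def detect_charset(data, python = 'python3')
  out, status = Open3.capture2(*chardet_command(python),
                               stdin_data: data, binmode: true)
  name = out.strip
  status.success? && !name.empty? ? name : nil
end

puts chardet_command.first(2).inspect
```

Because the wrapper is a plain Ruby file, the Ruby shebang that gem adds would no longer be a problem.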

will this run on windows?

In theory. At least the Ruby parts (i.e. wma, flac, mp3, torrents.)
No idea if you can get the external programs and libs installed and
running, sounds unlikely. But hey, feel free to prove me wrong :)

Quoth Ilmari H.:

Or, one could make a VMWare image or somesuch and use it as a
metadata appliance. Anyone know how that might work best?

Uglily, at best.

Or perhaps build the external programs with cygwin?

(I think most cygwin programs only depend on cygwin1.dll which is
about 1.2 meg on my system.)

Regards,

Bill

interesting, i didn’t know that. So what is the minimum set of files
needed to run cygwin? i’m asking since i haven’t used one for a long
long time now…

kind regards -botp

It depends. (lol, pun not intended, but my next sentence was going to involve
depends.exe)

Er… using depends.exe [1] one can view a tree of dependencies for a .exe or
.dll file (among other object formats.)

A number of cygwin-compiled binaries only depend on cygwin1.dll. (Not
counting windows modules like kernel32.dll.)

To verify this, I was successfully able to copy just cygwin1.dll and
cygwin/bin/ftp.exe into a subdirectory elsewhere, and change the shell
execution path to simply “.”, and the program worked.

Other binaries, like cygwin/bin/ls.exe, required a couple extra .dll’s
(in this case for handling character set encodings) like cygintl-3.dll and
cygiconv-2.dll.

A more complex program like cygwin/bin/wget.exe still had minimal
and reasonable dependencies: character set handling and crypto libraries:

cygwin1.dll, cygintl-3.dll, cygiconv-2.dll, cygcrypto-0.9.8.dll,
cygssl-0.9.8.dll

So it appears the cygwin folks have done a nice job at avoiding
dependency bloat.

[1] http://www.dependencywalker.com/

Regards,

Bill