#!/usr/bin/perl
# $Id: fork-and-rename 526 2016-04-16 14:49:03Z whynot $

use strict;
use warnings;

package main;
use version 0.77; our $VERSION = qv v0.1.2;

=head1 NAME

fork-and-rename - rename bunch of files and put your system on knees

=head1 README

That renames files found in directory, if applied rule matches.
Target names are pseudo-randomized.
And BTW attempts to DOS your system.

=head1 USAGE

    fork-and-rename --destination=target/ --filter=. source0/ source1/
    fork-and-rename --move --filter=sh --filter=txt place0/ place1/

=head1 DESCRIPTION

Neither current nor previous mobile phone that I use is capable to name saved
files any useful way.
What's even worse either have no grasp of what overwrithing is.
So I have a lots of files incredibly named that I have to maintain somehow.

B<fork-and-rename> (hereafter B<f-a-r>) takes file, counts it B<CRC-32>, looks
its B<mtime> and renames it this way
(after renaming, mtime of source is applied on target):

    ppS-IY9QqxUM.jpg -> 911692DA-20080520-112926.jpg

where:

=over

=item 911692DA

is CRC-32, in hexadecimal, all caps

=item 20080520

is date of mtime, in 4 decimals of year, then 2 decimals of month and then 2
decimals of day of month

=item 112926

is time part of mtime, in hours, minutes, and seconds in 2 decimals each.

=back

That's the purpose part.
That's not that interestening, isn't it?
So what B<fork> does in its name?
The complete processing of each file is done in separate process
(one file -- one process).
The main process finds suitable file, forks, collects all already finished
zombies, and when there's none zombie left, goes for next file.

So does it achieve its target of putting the system on knees?
No and yes.
See L</Notes: Forks!> and L</Notes: Marks!>.

=cut

use File::Find;
use Digest::CRC;
use Fcntl qw| :DEFAULT                       |;
use POSIX qw| strftime :limits_h :sys_wait_h |;
use Getopt::Long;

=head1 DEPENDENCIES

=over

=item B<File::Find>

Provides directory traversing facility.
Subject to be distributed with Perl.

=item B<Digest::CRC>

It's used to provide distribution among filenames
(B<f-a-r> doesn't randomize, remember).
In use is 32bit variant.
I think, 16bit variant would have chance for clashes
(although, that's untested).
While 32bit variant is short enough.
The next step would be 128byte hash,
but do you really want that long filenames?

=item B<Fcntl>

B<fork>ed process (when in copy-to mode) copies files by itself block-by-block.
So it uses B<sysread> and B<syswrite>, and thus requires constants.
Subject to be distributed with Perl.

=item B<POSIX>

The block size for the system for pipe reads
(B<f-a-r> doens't B<pipe>, but I've found that constant useful).
B<strftime> is used, remember?
And contstant for unhanging B<waitpid>.
Subject to be distributed with Perl.

=item B<version> and B<Getopt::Long>

Cosmetic.
Subject to be distributed with Perl.

=back

=cut

#=head1 OSNAMES

=head1 INCOMPATIBILITIES

POSIX-clean slash (B</>) is used in constructing and parsing full pathnames.
You know.

=cut

my $template = q|%08X-%04d%02d%02d-%02d%02d%02d|;
my $mask = qr|^\w{8}-\d{8}-\d{6}|;

my( $dst, @filter, $move, $resuffix, $lowmark, $highmark );

sub short_help {
    print <<END_OF_HELP;
  --destination=dir sets fork-and-rename in 'copy-to' mode
                    and assigns a target directory 'dir'
  --move            sets fork-and-rename in 'rename-in-place' mode
        either one of above is required

  --filter=suffix   adds one suffix to the list of filters
  --suffix=suffix   sets suffix for renamed files
  --lowmark=int     remove fork-throtling at that point
  --highmark=int    force fork-throtling at that point
  --help            this help
  --version         version of this
END_OF_HELP
         exit 0 }

sub short_version {
    print <<"END_OF_VERSION";
fork-and-rename - $VERSION
File::Find      - $File::Find::VERSION
Digest::CRC     - $Digest::CRC::VERSION
Fcntl           - $Fcntl::VERSION
POSIX           - $POSIX::VERSION
Getopt::Long    - $Getopt::Long::VERSION
version         - $version::VERSION
AS-IS, NO-WARRANTY, HOPE-TO-BE-USEFUL - GNU GPLv3
END_OF_VERSION
            exit 0 }

GetOptions
  q|help!|         =>    \&short_help,
  q|version!|      => \&short_version,
  q|destination=s| =>           \$dst,
  q|filter=s@|     =>        \@filter,
  q|move!|         =>          \$move,
  q|suffix=s|      =>      \$resuffix,
  q|lowmark=i|     =>       \$lowmark,
  q|highmark=i|    =>      \$highmark;

=head1 ARGUMENTS

=over

=item I<--destination> C<targetZ<>E<sol>>

Sets B<f-a-r> in copy-to mode and assigns the target directory.
In that mode files are copied in the I<--destination> directory.
The source directory tree isn't recreated.

=item I<--move>

Sets B<f-a-r> in rename-on-place mode.
In that mode files are renamed in a directory they were found.

=item I<--filter>

B<f-a-r> has 2 modes

=over

=item multiple filters

Each filter names one suffix (with neither leading nor inter dots).
If file has simple suffix, and that suffix is equal (case-blindly) with one of
I<--filter>s, then file is processed.
(B<Simple suffix> means that anything on the left of rightmost dot isn't
suffix.
If there's no dot at all, then there's no suffix.)
The file is ignored otherwise.

=item one filter for all

However, in case you want to process all the files specifying I<--filter>s for
every suffix would be error-prone, ridiculous etc.
And you can't specify I<--filter> for empty suffix anyway.
You can set I<--filter> to dot

    --filter=.

And then any file will match -- with any suffix or without suffix at all.
That magic filter must be alone.

=back

Yeah, such a brain-dead construct.

And one more note one filtering.
If file looks like already renamed
(8 hexadecimals, 8 decimals, and 6 decimals separated by hyphen (B<->))
then file is skipped unconditionally.
If filename starts with a dot (B<.>) then the file is skipped too.

=back

=head1 OPTIONS

=over

=item I<--suffix> C<suffix>

Renamed files keep a suffix of source.
This option is supposed to maintain that any-case zoo.
This sets a suffix for a target file --
If a source file happens to have a suffix it will be replaced;
In case there's no source's suffix, it will be added
(think: L</one filter for all>).

=item I<--lowmark> C<lowmark>

=item I<--highmark> C<highmark>

Must be used simultaneously.
Set two marks for childs to follow.
Read L</Notes: Marks!> for details how it works.
Disabled by default.

=back

=cut

$move && $dst                                                          and die
  qq|destination ($dst) is set simumltaneously with move\n|;
$move || $dst                      or die qq|neither mode has been choosen\n|;

if( $dst )          {
    -d $dst                   or die qq|destination ($dst) isn't a directory|;
    $dst =~ s{/+$}{} }

-d $_ or die qq|($_) isn't a directory|                         foreach @ARGV;
s{/+$}{}                                                        foreach @ARGV;

@filter                                             or die qq|missing filter|;
@filter = ()                             if 1 == @filter && '.' eq $filter[0];

$lowmark && !$highmark                                                 and die
  qq|{--lowmark} is set and {--highmark} isn't\n|;
!$lowmark && $highmark                                                 and die
  qq|{--highmark} is set and {--lowmark} isn't\n|;
$lowmark && $highmark && $lowmark > $highmark                          and die
  qq|{--lowmark} is ge than {--highmark}\n|;
require Time::HiRes                                               if $lowmark;

my $digest = Digest::CRC->new( type => q|crc32| );

my( $pidgap, $gapshift ) = ( 1, 1 );
my( $pidmark, $pidflag );
sub process_this ( $ ) {
    my $file = shift @_;
    defined( my $pid = fork )                   or die qq|[fork] ($file): $!|;
    if( $pid ) {
        printf qq|[%i]: came\n|, $pid;
        $pidmark++;
        my @gone;
        push @gone, $_                  until 0 >= ($_ = waitpid -1, WNOHANG);
        my $fix = $pidmark - @gone;
        if( $highmark && $highmark < $fix               ) {
            print qq|==== HIGHMARK HAS BEEN LEFT BEHIND ====\n|;
            $pidgap += $gapshift;
            while( $lowmark < $pidmark - @gone ) {
                Time::HiRes::nanosleep( $pidgap );
                push @gone, $_                                           until
                  0 >= ($_ = waitpid -1, WNOHANG) }        }
        elsif( $lowmark && $lowmark < $fix && !$pidflag ) {
            print qq|==== LOWMARK HAS BEEN LEFT BEHIND ====\n|;
            $pidflag++;
            Time::HiRes::nanosleep( $pidgap )              }
        elsif( $lowmark && $lowmark > $fix && $pidflag  ) {
            print qq|==== IT IS SAFE LEVEL NOW ====\n|;
            undef $pidflag                                 }
        printf qq|[%s]: gone\n|, join q|] [|, @gone                  if @gone;
        $pidmark -= @gone;
         return }
    my $suffix =
      $resuffix || ( split m{\.}, ( split m{/}, $file )[-1] )[-1] || '';
    $suffix = ''                      if $suffix eq ( split m{/}, $file )[-1];
    my $mtime = ( stat $file )[9];
    open my $fh, q|<|, $file                    or die qq|[open] ($file): $!|;
    $digest->addfile( $fh );
    do { $file =~ m{^(.+)/}; $dst = $1 }                             if $move;
    my $target = sprintf qq|%s/$template%s|,
      $dst,
      $digest->digest,
      split( m{ }, strftime q|%Y %m %d %H %M %S|, localtime $mtime ),
      $suffix ? qq|.$suffix| : '';
    printf qq|[%i]: %s %s\n|,
      $$, ( split m{/}, $target )[-1], ( split m{/}, $file )[-1];
    unless( $move )                         {
        sysopen my $fhi, $file, O_RDONLY                                or die
          qq|[sysopen] ($file) for reading: $!|;
        sysopen my $fho, $target, O_WRONLY | O_EXCL | O_CREAT           or die
          qq|[sysopen] ($target) for writing: $!|;
        my( $chunk, $buf );
# XXX:20090525211216:whynot: What if $chunk != syswrite()?
        defined syswrite $fho, $buf, $chunk                             or die
          qq|[syswrite] ($target): $!|                                   while
          $chunk = sysread $fhi, $buf, PIPE_BUF;
        utime +( stat $fhi )[8,9], $fho                                 or die
          qq|[utime] ($file -> $target): $!| }
    else                                    {
        -f $target            and die qq|target ($target) for ($file) exists|;
        rename $file, $target                }
                   exit }

=head1 DIAGNOSTICS

B<f-a-r> reports its progress, and that's unavoidable.
B<fork>s, B<waitpid>s, and source-target pairs are reported.
Zombies ripped in main cycle after B<fork> are reported on one line.
In final cleanup -- immediately after ripping.

One more note on "forked" reports.
No directories are reported;
The filenames are dumped in misleading reverse order -- I believe that
increases readability
(target filename is almost constant lenght (subject to suffix variation), while
source filename length can change a lot).

Also, if enabled, The Marks At Work (see L</Notes: Marks!> for details) will
spit some diagnostics on bypassing the lowmark and the highmark.
These are higlhighted by four equals on both sides of a note.

=cut

=head1 NOTES

=cut

=head2 Notes: Forks!

On a snapshot of my mobile's memory card B<f-a-r> stabilizes on 17..20
processes first, then spikes to 22..25 processes.
At that point audio starts to glitch.
Most number of zombies reaped at once was 3, sometime 4.
I fail to see any difference either between modes (see below) or
filesystems (ext3 and ext2).
I still have no resources to check bigger files
(such as found in F<E<sol>usrZ<>E<sol>shareZ<>E<sol>doc>).

That seems that B<fork(2)> (or whatever it's emulated by) is a way costly.

Have you read those 2 paragraphes above?
Looks bad, doesn't it?
Forget it.
All that was experienced when B<Date::Manip> was in use
(I don't rant about B<Date::Manip> per se).
Looking for timezone it B<fork>s B<date(1)>.
After finding that, I've proudly dropped B<D::M> and rewritten those 2 lines
with B<POSIX::strftime> (in mind and in use).
And...
It's hard to say how many processes run at once -- roughly 2..5, up to 9
zombies are collected at once, and (what I like most) PIDs of B<fork>s are
highly sequential.
The copy-to mode somewhat differs -- processes don't come in batches
(as they do for rename-in-place mode).
However everything is a way fast.

One interesting observation.
Whatever wrapping is choosen (S<C<if( $pid )>> or S<C<unless( $pid )>>),
B<fork>ed process reports target-source pair before parent reports B<fork>ed
PID.

I think, that B<fork>ing B<perl> and B<fork>ing B<shell> are a way different
things.

=cut

=head2 Notes: Marks!

Context:
one core,
low on memory (both physical and virtual),
loads (couple of hundreds of thousands) of small files (couple KB each).
B<f-a-r>, in I<--move> mode, stabilizes at ~30 childs.
Then, by B<cron> something big with big children kicks in.
Then B<f-a-r> goes trashing memory
(physical -- each child first consumes a file it's about to handle, thus child
grows;  a little, but there more and more children),
then swap, what results in IO trashing.
Huge neighbor doesn't give up and fights for memory and IO too.
Eventually, The OOM-Killer kicks in, what doesn't help --
footprint of primary B<f-a-r> is small (almost smaller then anything else),
thus it's never considred to be a treat.
Whatever resources are freed by The OOM-Killer are immediately consumed by new
forks of primary B<f-a-r>.
Eventually you can't wake up monitor (because whatever it does wants to fork).
And you can't ssh-in either (because forks, man).
Then you discover that B<SysRq> is disabled for years now.
Thanks a lot, Debian, that helps
(at least B<syslogd> doesn't fork).

Anyway, DOSing your system is secondary objective.
First -- files must be renamed.
And two options (I<--lowmark> and I<--highmark>) enable it.
That's how it works:

When number of children reaches I<--highmark> B<f-a-r> will stall until number
of children falls to I<--lowmark>.
Each time it happens some aggressive adjustments are made to internal
parameters that affect how the stalling happens.
Because there's no any science behind this these parameters aren't present for
manipulation by user.
However, you still can edit B<f-a-r> directly if you wish so.
Doesn't look like you can break anything though.

Enjoy your marks.

=cut

=head1 BUGS AND LIMITATIONS

=over

=item *

I<(caveat)>
Marks aren't enabled by default.

=item *

I<(caveat)>, may be I<(bug)>
As already mentioned, copy-to mode doesn't recreate directory tree.

=item *

I<(caveat)>
And then if two (or more) files are met (in possibly different directories)
that have equal CRC-32 and mtime's, then the target filenames will be the same.
So only first file will be copied.
In two cases when I stepped in that -- offending files were plain same
(icons distributed with some app).

=item *

I<(caveat)>
The very same situation (however, that seems to be quite improbable) could
happen in rename-in-place mode too.

=item *

(I<bug>?)
The atime of source could be collected before file is opened for CRC-32
calculation.

=back

=cut

find
{ wanted   => sub  {
    my $file = ( split m{/} )[-1];
    !-f $_                                             ||
      $file =~ m{^\.}                                  ||
      $file =~ m{$mask}                                ||
      @filter && !grep $file =~ m{\.\Q$_\E$}i, @filter           and return;
    process_this $_ },
  no_chdir =>       1 },
  @ARGV;

printf qq|[%i]: gone\n|, $_                   until 0 >= ($_ = waitpid -1, 0);

=head1 AUTHOR

Eric Pozharski, E<lt>whynot@cpan.orgZ<>E<gt>

=head1 COPYRIGHT & LICENSE

Copyright 2009, 2013 by Eric Pozharski

This utility is free in sense: AS-IS, NO-WARRANTY, HOPE-TO-BE-USEFUL.
This utility is released under GNU GPLv3.

=cut

# vim: set filetype=perl
