Tag Archives: cygwin

CommandLineFu: FOR LOOP – report file count in pwd/* && print disk usage

#!/bin/bash
# Append "dir/:file-count" to the report file and print each directory's disk usage.
out="/path/to/some/file/already/created.txt"
for i in */
do
    printf '%s:' "$i" >> "$out"
    find "$i" -type f | wc -l >> "$out"
    du -hs "$i"
done
exit 0

Quick and dirty…
Actually, this is quite slow on directories containing thousands of little files, since find and du each walk the whole tree.
But it gets the job done.  I’ll play around with it and see if forking helps.
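
One way to try the forking idea is to give each directory its own background job and wait for all of them. This is only a sketch: report_counts is a hypothetical helper name, it covers just the file count (not du), and whether it runs any faster depends on the disk, since the work is mostly I/O.

```bash
#!/bin/bash
# Sketch: count files per subdirectory, one background job per directory.
# report_counts is a hypothetical name, not part of the original script.
report_counts() {
    local parent=${1:-.} dir
    for dir in "$parent"/*/; do
        [ -d "$dir" ] || continue      # skip when the glob matches nothing
        (
            # each subshell prints one complete line: "name:count"
            count=$(find "$dir" -type f | wc -l | tr -d '[:space:]')
            printf '%s:%s\n' "$(basename "$dir")" "$count"
        ) &
    done
    wait    # block until every background job has finished
}
```

The jobs finish in no particular order, so something like `report_counts . | sort >> report.txt` keeps the output stable (report.txt is a placeholder name).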

CommandLineFu: Split a Text file based on line numbers

Assumptions: I have a text file that contains 2.5+ million lines, and I want to split it into files of 100,000 lines each.

#!/bin/bash
x=1
y=100000
z=1
while [ $z -le 26 ]
do
    sed -n "$x,${y}p;${y}q;" tbl_001.txt > "t$z.txt"
    x=$(( $x + 100000 ))
    y=$(( $y + 100000 ))
    z=$(( $z + 1 ))
done

To split by a different amount, change the initial value of “y” and the 100000 added to “x” and “y” on each pass to the number of lines you want per file.

The “z” variable serves as both the number in the output filename and the loop cutoff.  If my original file only had one million lines, I would change the “while” condition to 10 instead of 26.

I’m sure there’s a way to have the machine do the math for me, but I don’t have the patience to hunt it down right now.  I imagine it would involve storing the line count (wc -l) in a variable, prompting the end user for the max line count (read maxcount), and looping until the file is completely done.  (Not sure how to do this last part.)  A project for another day.
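
For what it’s worth, here is a rough sketch of letting the machine do the math: take the line count from wc -l and loop until the start line passes it. The function name split_by_lines and its arguments are hypothetical, and it assumes the input file ends with a newline (otherwise wc -l undercounts by one).

```bash
#!/bin/bash
# Sketch: split a file into chunks of N lines without hard-coding the loop bound.
# split_by_lines is a hypothetical name; arguments: input file, lines per chunk,
# output prefix (defaults to "t", as in the script above).
split_by_lines() {
    local infile=$1 chunk=$2 prefix=${3:-t}
    local total start=1 n=1 end
    total=$(( $(wc -l < "$infile") ))   # arithmetic strips any padding from wc
    while [ "$start" -le "$total" ]; do
        end=$(( start + chunk - 1 ))
        # print lines start..end, then quit so sed does not read the rest
        sed -n "${start},${end}p;${end}q" "$infile" > "${prefix}${n}.txt"
        start=$(( end + 1 ))
        n=$(( n + 1 ))
    done
}
```

Usage would be along the lines of `read maxcount; split_by_lines tbl_001.txt "$maxcount"`. If the t1.txt, t2.txt naming is not important, `split -l 100000 tbl_001.txt t` does the same job in a single pass and names the pieces taa, tab, and so on.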

CommandLineFu – REMOVE BATES NUMBER REFERENCE FROM OCR TEXT FILES

find . -type f -iname "*.TXT" -exec sed -i 's/<< ...-.-........ >>//g' '{}' \;
*Assuming the original Bates number uses the format ABC-X-01234567
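
Because sed -i rewrites files in place, it can be worth trying the expression on a sample line first. A minimal sketch, assuming the Bates format noted above (three capital letters, one letter or digit, eight digits): strip_bates is a hypothetical helper name, and the stricter character classes are an assumption about the format rather than the original expression.

```bash
#!/bin/bash
# Hypothetical helper: remove "<< ABC-X-01234567 >>" style markers from stdin.
# The character classes are an assumption about the Bates number format.
strip_bates() {
    sed 's/<< [A-Z]\{3\}-[A-Z0-9]-[0-9]\{8\} >>//g'
}

# Dry run on one sample line; nothing on disk is modified.
printf 'page text << ABC-X-01234567 >> more text\n' | strip_bates
```

Once the output looks right, the same expression can be dropped into the find ... -exec sed -i command.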

Depending on how cygwin is configured, be sure to convert the files back to ‘DOS’ format after using sed:
find . -type f -iname "*.TXT" -exec unix2dos '{}' \;

The find command is good if your files are not organized in a cleanly numbered subfolder structure, or are mixed in with other file types.  Usually plain globbing will do the trick a bit faster:

sed -i 's/foo/bar/g' IMAGES/00/0[0-9]/*.TXT ; unix2dos IMAGES/00/0[0-9]/*.TXT

sed == http://sed.sourceforge.net/sedfaq.html
