randomfox (randomfox) wrote,

Using Lingua::EN::Fathom to analyze readability and list the most common words

This script is almost identical to the example given in the Lingua::EN::Fathom documentation. The only addition is that it sorts the word frequency table by count and prints the 50 most common words.


#!perl -w
use strict;
use Lingua::EN::Fathom;

die "Usage: $0 <filename>...\n" unless @ARGV > 0;

my $text = Lingua::EN::Fathom->new;

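# Expand wildcards ourselves, since the Windows shell does not.
# The second argument (1) tells analyse_file to accumulate statistics
# across files instead of resetting them for each one.
for my $arg (@ARGV) {
    for (glob $arg) {
        $text->analyse_file($_, 1);
    }
}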

# $num_chars             = $text->num_chars;
# $num_words             = $text->num_words;
# $percent_complex_words = $text->percent_complex_words;
# $num_sentences         = $text->num_sentences;
# $num_text_lines        = $text->num_text_lines;
# $num_blank_lines       = $text->num_blank_lines;
# $num_paragraphs        = $text->num_paragraphs;
# $syllables_per_word    = $text->syllables_per_word;
# $words_per_sentence    = $text->words_per_sentence;

print "Top 50 words:\n";
my %words = $text->unique_words;
my @top = sort { $words{$b} <=> $words{$a} } keys %words;
splice @top, 50 if @top > 50;    # keep at most 50; no undef entries for short texts
for my $word (@top)
{
    printf "%-15s %d\n", $word, $words{$word};
}

print "\n";

# $fog     = $text->fog;
# $flesch  = $text->flesch;
# $kincaid = $text->kincaid;

print $text->report;

__END__
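
The word-frequency step in the script above can be tried on its own, without the module. The following is only a minimal sketch: the %words hash holds made-up counts standing in for what $text->unique_words returns, and top_words is a hypothetical helper that is not part of Lingua::EN::Fathom. It also breaks ties alphabetically, which the script above does not do.

```perl
#!perl -w
use strict;

# Made-up counts standing in for the hash returned by $text->unique_words.
my %words = (the => 7, job => 4, restore => 4, tape => 1);

# Hypothetical helper: return up to $n words, most frequent first.
# Words with equal counts are ordered alphabetically.
sub top_words {
    my ($n, %count) = @_;
    my @sorted = sort { $count{$b} <=> $count{$a} or $a cmp $b } keys %count;
    splice @sorted, $n if @sorted > $n;    # drop everything past the first $n
    return @sorted;
}

printf "%-15s %d\n", $_, $words{$_} for top_words(3, %words);
```

Asking for more words than the table holds, for example top_words(50, %words), simply returns all of them.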


Example:
C:\temp>lingua.pl c:\new.txt
Top 50 words:
the             1178
to              441
a               395
in              315
is              314
job             276
and             262
that            231
of              194
for             177
on              155
it              148
be              125
handler         124
restore         121
this            120
if              115
not             115
from            104
i               103
server          101
backup          101
error           99
with            90
so              90
are             82
there           75
at              75
drive           73
was             71
will            71
message         70
when            68
an              65
task            62
as              59
file            56
tape            56
set             54
or              54
code            53
com             53
has             52
one             51
but             51
problem         49
oct             49
event           49
log             47
jobs            46

File name                  : c:\new.txt
Number of characters       : 146953
Number of words            : 16132
Percent of complex words   : 11.18
Average syllables per word : 1.5523
Number of sentences        : 4501
Average words per sentence : 3.5841
Number of text lines       : 4265
Number of blank lines      : 2436
Number of paragraphs       : 1602


READABILITY INDICES

Fog                        : 5.9067
Flesch                     : 71.8710
Flesch-Kincaid             : 4.1252
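
The three indices can be roughly reproduced from the averages in the report. This is a sketch that assumes the formulas as I understand them from the Lingua::EN::Fathom documentation; the input values are copied from the sample output above. Because the report prints rounded averages, the results differ slightly from the reported indices in the last decimal places.

```perl
#!perl -w
use strict;

# Averages copied from the sample report above (already rounded there).
my $words_per_sentence    = 3.5841;
my $syllables_per_word    = 1.5523;
my $percent_complex_words = 11.18;

# Formulas assumed from the Lingua::EN::Fathom documentation.
my $fog     = 0.4 * ($words_per_sentence + $percent_complex_words);
my $flesch  = 206.835 - 1.015 * $words_per_sentence - 84.6 * $syllables_per_word;
my $kincaid = 11.8 * $syllables_per_word + 0.39 * $words_per_sentence - 15.59;

printf "Fog            : %.4f\n", $fog;
printf "Flesch         : %.4f\n", $flesch;
printf "Flesch-Kincaid : %.4f\n", $kincaid;
```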
