Juta - Usenet Traffic Analyzer

Introduction

Juta is a set of Java command line programs that process Usenet postings. The postings must be stored in a text file or piped to a program's standard input as a single text stream. To run any of the tools, you need a JRE (Java Runtime Environment), version 1.2 or higher. The tools are distributed as a single JAR file which contains both bytecode and source code.

The programs

These are short descriptions of all the programs included in Juta. Call a program with --help as its only parameter to get usage information. If you want to use any program other than juta, you will have to unpack the JAR archive, e.g. by calling unzip juta.jar or jar xf juta.jar. The tools whose names start with news are designed like the Unix command line tools: they can easily be combined by connecting one program's output to another program's input using pipes. See the Examples section for some ideas.

juta

The most important tool in the Juta package is called juta. It will create HTML output with statistics on Usenet postings from one month. The juta.jar distribution has a manifest file which points to the juta application so that you can start it by simply entering java -jar juta.jar on the shell.

For example HTML output created by juta see the homepage of the German Java newsgroup de.comp.lang.java.

Note that publishing the HTML files as created by juta on the Internet might be against the law depending on where you live and what kind of information you make juta include in that HTML output file. A person that is listed in the authors section might not want to appear there. To learn more about the details for Germany, see the Datenschutz section of the Open Directory.

juta comes with parameters that let you turn off the inclusion of any personal information (like number of postings per user per month) and email addresses. If a posting contains X-No-Archive: Yes in its header, the person appears as Anonymous (if you have chosen to include authors in the first place).

juta users

newscount

Reads postings from standard input, counts them and prints the number of postings to standard output.

newsfilter

This tool will read postings from standard input and write all postings to standard output that match the parameters. You can filter postings for a specific newsgroup or month. You can also make the program write only part of each posting to output, e. g. only bodies, only headers or only signatures.
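
To illustrate what "only bodies, only headers or only signatures" could mean, here is a minimal Java sketch that splits one posting into those three parts. It is not taken from newsfilter's source. The class name PostingParts is made up, and the sketch assumes LF line endings, a blank line between headers and body, and the common Usenet convention that a line consisting of "-- " (dash, dash, space) starts the signature.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper for illustration only; not part of Juta.
class PostingParts {
    final String headers;
    final String body;
    final String signature;

    PostingParts(String headers, String body, String signature) {
        this.headers = headers;
        this.body = body;
        this.signature = signature;
    }

    /** Splits a single posting (LF line endings assumed) into its three parts. */
    static PostingParts split(String posting) {
        List<String> lines = Arrays.asList(posting.split("\n", -1));

        // The first empty line separates the headers from the rest.
        int blank = lines.indexOf("");
        if (blank < 0) {
            blank = lines.size(); // no empty line: everything is header
        }
        List<String> headers = lines.subList(0, blank);
        List<String> rest = blank < lines.size()
                ? lines.subList(blank + 1, lines.size())
                : Collections.<String>emptyList();

        // Assumption: the last "-- " line marks the start of the signature.
        int sig = rest.lastIndexOf("-- ");
        List<String> body = sig < 0 ? rest : rest.subList(0, sig);
        List<String> signature = sig < 0
                ? Collections.<String>emptyList()
                : rest.subList(sig + 1, rest.size());

        return new PostingParts(
                String.join("\n", headers),
                String.join("\n", body),
                String.join("\n", signature));
    }

    public static void main(String[] args) {
        PostingParts p = split("Subject: hi\nFrom: a@example.org\n\nHello\nWorld\n-- \nJoe");
        System.out.println(p.body);      // the two body lines
        System.out.println(p.signature); // the single line after the delimiter
    }
}
```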

newsgrep

This tool has one mandatory parameter, a regular expression. It will read postings from standard input, test each line from each posting and output each matching line to standard output.

The regular expression matching part of this program could of course also be done by the normal grep program that is available on almost any platform. However, newsgrep not only finds the matching lines, it also prints a one-line summary of each posting that contains at least one matching line, which makes it much easier to see quickly which postings contain a match.
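
The idea of printing a summary line ahead of the matching lines can be sketched in a few lines of Java. This is only an illustration: Juta itself bundles the Jakarta ORO library, while this sketch uses java.util.regex instead, and both the class name GrepSketch and the summary format ("== " plus the Subject header) are invented here, not newsgrep's actual output.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch; not newsgrep's real implementation or output format.
class GrepSketch {
    /** Returns the Subject header value, or a placeholder if there is none. */
    static String summaryOf(String posting) {
        for (String line : posting.split("\n")) {
            if (line.isEmpty()) {
                break; // end of the header block
            }
            if (line.regionMatches(true, 0, "Subject:", 0, 8)) {
                return line.substring(8).trim();
            }
        }
        return "(no subject)";
    }

    /** For each posting with at least one hit: one summary line, then the hits. */
    static List<String> grep(List<String> postings, Pattern pattern) {
        List<String> out = new ArrayList<>();
        for (String posting : postings) {
            List<String> hits = new ArrayList<>();
            for (String line : posting.split("\n")) {
                if (pattern.matcher(line).find()) {
                    hits.add(line);
                }
            }
            if (!hits.isEmpty()) {
                out.add("== " + summaryOf(posting));
                out.addAll(hits);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> postings = Arrays.asList(
                "Subject: Swing\n\nuse JTable here\nbye",
                "Subject: IO\n\nnothing to see");
        for (String line : grep(postings, Pattern.compile("JTable"))) {
            System.out.println(line);
        }
    }
}
```

Postings without any matching line produce no output at all, so the summary lines double as a list of the postings that matched.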

newslist

Reads postings from standard input and writes a one-line summary of each posting to standard output. You can configure what information will be included in the one-liners.

newssort

Will read all postings from standard input into an internal list, sort them by their message ID values and write the sorted postings to standard output. This program might need more memory than the virtual machine has with its standard settings. Give the VM more memory using the -mx switch. Example: java -mx128m newssort will give 128 MB to it. This program is particularly useful in combination with newsuniq to remove duplicates.
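
A minimal sketch of sorting by message ID, assuming each posting is held in memory as one string with LF line endings and that IDs are compared as plain strings. How newssort actually compares IDs is not stated on this page, and the names SortSketch, messageIdOf and sortByMessageId are made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch; not newssort's real implementation.
class SortSketch {
    /** Returns the Message-ID header value, or "" if the posting has none. */
    static String messageIdOf(String posting) {
        for (String line : posting.split("\n")) {
            if (line.isEmpty()) {
                break; // headers end at the first empty line
            }
            if (line.regionMatches(true, 0, "Message-ID:", 0, 11)) {
                return line.substring(11).trim();
            }
        }
        return "";
    }

    /** Returns a sorted copy; all postings are kept in memory at once. */
    static List<String> sortByMessageId(List<String> postings) {
        List<String> sorted = new ArrayList<>(postings);
        sorted.sort(Comparator.comparing(SortSketch::messageIdOf));
        return sorted;
    }

    public static void main(String[] args) {
        List<String> postings = Arrays.asList(
                "Message-ID: <b@example.org>\n\nsecond",
                "Subject: s\nMessage-ID: <a@example.org>\n\nfirst");
        for (String p : sortByMessageId(postings)) {
            System.out.println(messageIdOf(p));
        }
    }
}
```

Because the whole input has to be in the list before sorting can start, memory use grows with the size of the input, which is why a larger heap may be needed.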

newsuniq

This program assumes that its input is sorted by message ID values. Reads sorted postings from standard input and writes postings to standard output. If a posting appears more than once, it will be written only once to output. As the postings are sorted, this is particularly easy to implement - the program will compare a posting simply to its predecessor and can decide whether it is equal by comparing the message ID values. To be used in combination with newssort.
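
The compare-with-predecessor idea can be sketched as follows. This is an illustration under stated assumptions, not newsuniq's source: the Posting class is a made-up minimal type holding only a message ID and the raw text, and the input list is assumed to be sorted by message ID already.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; not newsuniq's real implementation.
class UniqSketch {
    /** Made-up minimal posting type: just an ID and the raw text. */
    static final class Posting {
        final String messageId;
        final String text;

        Posting(String messageId, String text) {
            this.messageId = messageId;
            this.text = text;
        }
    }

    /** Keeps the first posting of every run of equal message IDs. */
    static List<Posting> dropAdjacentDuplicates(List<Posting> sorted) {
        List<Posting> result = new ArrayList<>();
        String previousId = null;
        for (Posting p : sorted) {
            // Only the predecessor needs to be checked because input is sorted.
            if (!p.messageId.equals(previousId)) {
                result.add(p);
            }
            previousId = p.messageId;
        }
        return result;
    }

    public static void main(String[] args) {
        List<Posting> sorted = Arrays.asList(
                new Posting("<a@example.org>", "first copy"),
                new Posting("<a@example.org>", "second copy"),
                new Posting("<b@example.org>", "other"));
        for (Posting p : dropAdjacentDuplicates(sorted)) {
            System.out.println(p.messageId + " " + p.text);
        }
        // prints:
        // <a@example.org> first copy
        // <b@example.org> other
    }
}
```

If the input is not sorted, duplicates that are not adjacent survive, which is why the sorting step has to come first.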

How to create the input text file

How does one get an input file to be used with the various tools? I can only describe how I do it with Forte Agent.

  1. Select the newsgroup in the newsgroup list (typically on the left).
  2. In the message list, sort the messages by clicking the Date column header.
  3. Select all messages from the month that you want to examine.
  4. In the menu, pick File | Save Messages as...
  5. In the dialog box, enter a file name, make sure the checkbox Save raw (unformatted) message is checked and pick Unix file message file format.
  6. If you use a file name in the format YYYYMM.txt (e. g. 200108.txt for August 2001, note the leading zero before the 8) you will not have to specify the month.
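
How a month could be derived from a YYYYMM.txt file name is sketched below. This is an assumption about the naming rule only, not Juta's actual code; the class name MonthFromFileName and its parse method are hypothetical.

```java
import java.time.YearMonth;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the YYYYMM.txt naming rule; not Juta's real code.
class MonthFromFileName {
    // Four digits for the year, two for the month, then ".txt".
    private static final Pattern NAME = Pattern.compile("(\\d{4})(\\d{2})\\.txt");

    /** Returns the month encoded in the file name, or empty if it does not fit. */
    static Optional<YearMonth> parse(String fileName) {
        Matcher m = NAME.matcher(fileName);
        if (!m.matches()) {
            return Optional.empty(); // e.g. "20018.txt": leading zero missing
        }
        int year = Integer.parseInt(m.group(1));
        int month = Integer.parseInt(m.group(2));
        if (month < 1 || month > 12) {
            return Optional.empty();
        }
        return Optional.of(YearMonth.of(year, month));
    }

    public static void main(String[] args) {
        System.out.println(parse("200108.txt")); // prints Optional[2001-08]
        System.out.println(parse("20018.txt"));  // prints Optional.empty
    }
}
```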

Credits

Mike Campbell
Mike was kind enough to send me a list of first names generated from the database that also serves as the basis for his site http://www.behindthename.com. This site explains the background for a large number of first names. Go visit it!

Download

Juta is distributed under the GNU General Public License version 2.

The Jakarta ORO library that is included in juta.jar is distributed under the Apache Software License, version 1.1. The copyright statement of that library is as follows:

Copyright (c) 2000 The Apache Software Foundation.
All rights reserved.

Download the latest version: juta.jar (383 KB). This .jar file contains both bytecode and source code.

Examples

Make sure that a Java Runtime Environment (JRE) version 1.2 or higher is installed and juta.jar (the file with all the command line programs) is in your classpath. Open a shell (console, prompt, whatever you call that on your system) and optionally go to the directory with the input text files you want to create statistics for.

Create statistics for a single month and a single newsgroup

Let's say the file cljg200108.txt contains all postings of newsgroup comp.lang.java.gui for August 2001. The following will create a statistics file cljg-stat200108.html:

java -jar juta.jar -i cljg200108.txt -n comp.lang.java.gui -m 8 2001 -o cljg-stat200108.html

Count postings from a particular newsgroup

If the file all.txt holds all your newspostings, the following command will count all postings from group comp.lang.java.programmer:

java newsfilter -i all.txt -n comp.lang.java.programmer | java newscount

Extract postings from a specific month

Assuming that file 2001.txt holds all postings from the year 2001 for a specific newsgroup, the following command will extract postings from May of that year and write them to a new file 200105.txt:

java newsfilter -i 2001.txt -m 5 2001 > 200105.txt

Merge several newsfeeds

Let's say you have three files with postings of May 2001 for a specific newsgroup. The files will contain mostly duplicates. This will create a single file 200105.txt which will hold all unique postings:

cat 200105a.txt 200105b.txt 200105c.txt | java newssort | java newsuniq > 200105.txt

Process a number of newsgroups at a time

This section requires you to create a directory structure that reflects the names of the newsgroups you are archiving. Example: you want to store postings from the two newsgroups alt.tv.friends and comp.compression. Let's further assume that the main directory of your Usenet archive on disk is c:\usenet\.

Then you'd have to create directories c:\usenet\alt\tv\friends\ and c:\usenet\comp\compression\. In each directory you'd store all postings of one month in one file that reflects the month. Example: postings for the Friends newsgroup for May 2001 would be put into c:\usenet\alt\tv\friends\200105.txt (note the leading 0 for the month), postings from March 1999 for the compression newsgroup would have to be stored as c:\usenet\comp\compression\199903.txt.
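
The mapping from a newsgroup name and a month to a file in the archive tree can be sketched like this. The class ArchivePaths and its method are hypothetical helpers for illustration; they only mirror the directory layout described above and are not part of Juta.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper that mirrors the described layout; not part of Juta.
class ArchivePaths {
    /** Builds root/alt/tv/friends/200105.txt from "alt.tv.friends", 2001, 5. */
    static Path postingsFile(Path root, String newsgroup, int year, int month) {
        Path dir = root;
        // Every dot-separated part of the newsgroup name becomes a directory.
        for (String part : newsgroup.split("\\.")) {
            dir = dir.resolve(part);
        }
        // %02d yields the leading zero for months 1 to 9.
        return dir.resolve(String.format("%04d%02d.txt", year, month));
    }

    public static void main(String[] args) {
        // The separator in the printed path depends on the platform.
        System.out.println(postingsFile(Paths.get("usenet"), "alt.tv.friends", 2001, 5));
        System.out.println(postingsFile(Paths.get("usenet"), "comp.compression", 1999, 3));
    }
}
```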

Now you can make juta process a complete tree of Usenet postings in one run:

java juta -u c:\usenet\ 

juta will load all text files YYYYMM.txt and create the corresponding YYYYMM.html file with statistics. It will not do that if there is a YYYYMM.html with a last modification date newer than the text file. You can force the creation of all statistics files (e.g. when you get a newer version of juta with modified features) using the -f (force output) switch. If you don't want the HTML files in the same tree as the postings, specify an output root directory after the -h (HTML output directory) switch. Example: -h c:\usenet-stats\. All necessary subdirectories will be created by juta (e.g. c:\usenet-stats\alt\tv\friends).
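
The decision whether a statistics file has to be regenerated can be written as a small predicate. This is a sketch of the rule as described here, not juta's source; the class RebuildCheck is hypothetical, and the handling of identical timestamps (treated as up to date) is an assumption because the page does not specify it.

```java
import java.io.File;

// Hypothetical sketch of the up-to-date rule; not juta's real implementation.
class RebuildCheck {
    /**
     * Regenerate when forced (-f), when the HTML file is missing,
     * or when it is older than the text file.
     * Assumption: equal timestamps count as up to date.
     */
    static boolean needsRebuild(boolean htmlExists, long htmlModified,
                                long textModified, boolean force) {
        return force || !htmlExists || htmlModified < textModified;
    }

    /** Convenience wrapper that reads the values from the file system. */
    static boolean needsRebuild(File text, File html, boolean force) {
        return needsRebuild(html.exists(), html.lastModified(),
                text.lastModified(), force);
    }

    public static void main(String[] args) {
        System.out.println(needsRebuild(true, 2000L, 1000L, false)); // prints false
        System.out.println(needsRebuild(true, 1000L, 2000L, false)); // prints true
        System.out.println(needsRebuild(false, 0L, 1000L, false));   // prints true
        System.out.println(needsRebuild(true, 2000L, 1000L, true));  // prints true
    }
}
```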

If you have newsgroups with a lot of postings, you might want to give more memory to the JVM. I use java -mx192m juta OTHERPARAMETERS to give it 192 MB.

Switches

% java -jar juta.jar --help
juta - Java Usenet Traffic Analyzer
Creates HTML output with statistics from Usenet postings.
Usage: java <VM args> juta <PROGRAM-ARGUMENTS>
        --help        show this usage screen and exit
        -f            process all newsgroup files
        -i FILE       input file ("-" for standard input)
        -m MM YYYY    month and year to be examined
        -n NEWSGROUP  newsgroup for which the statistics are created
        -o FILE       output file ("-" for standard output)
        -u DIR        Usenet archive root directory
 Output switches
        -a            do not include authors
        -c ATTR       CSS file name attribute
        -C            do not include clients
        -D            do not include top level domains
        -e            do not include email addresses
        -g            do not include Google Groups links
        -G            do not include gender distribution
        -k            keep original text file (when in -z mode)
        -L            do not include thread lengths
        -O            do not include operating systems
        -s NUM        number of threads in Top threads section
                      (default is 50, use 0 for all threads)
        -t            do not include tags
        -T            do not include top threads
        -U            do not include top URLs
        -x            do not include explanations
        -y            ignore "X-No-Archive: Yes" header field
        -z            gzip uncompressed text files after processing them

This program is distributed under the GNU General Public License (GPL)
version 2. See http://www.gnu.org/copyleft/gpl.html or the LICENSE file
within juta.jar (use a Zip utility to access that file).

Version: 0.4.3

Juta's homepage is at http://schmidt.devlib.org/software/juta.html

Also see the Freshmeat entry at http://freshmeat.net/projects/juta

TODO

ChangeLog

2005-06-30
Moved page to new domain.
2005-01-06
Released version 0.4.4
Adjusted Google Groups links, they didn't work anymore with GG2.
2001-12-15
Released version 0.4.3
juta can now process text input files that are gzipped (.gz extension); the zip format (.zip extension) is not supported.
Added switch -z that makes juta replace all text files with gzipped versions of those text files.
Added switch -k that makes juta keep the text files when running in -z mode (so a .txt and a .txt.gz file will be in the same directory).
Added some names to the file with the first names.
2001-09-13
Released version 0.4.1
The only thing changed is the longer list of first names for gender detection, thanks to Mike Campbell.
2001-09-10
Released version 0.4.
Added various switches to define what will be included in output.
Added output section gender distribution by postings and by authors (detection is done by searching names from postings in a list of first names).
Words encoded with =?...?= (as defined in RFC 1342) are now decoded, both ?Q? and ?B? style (e. g. =?iso-8859-1?Q?encodedword?=).
At various points in the output percentage values have been added to the tables where previously only absolute values were shown.
Directory processing now only creates HTML output where necessary (HTML file missing or older than corresponding text file).
2001-09-04
Released version 0.3.
Changed juta program parameters.
The inclusion of all the various statistics information can now be changed with arguments.
Added directory processing mode. This lets you create statistics for a complete archive of Usenet messages.
Improved quality of HTML output.
Added support for messages stored in rnews file format.
All frequency lists in output now show empty table cells for everything but the first appearance of a specific frequency.
2001-08-26
Released version 0.2.
Added various small command line tools.
juta can now also read from stdin and write to stdout (use - instead of file names).
2001-08-07
Added Top URL category.
2001-07-19
Created this homepage.
2001-07-17
Added thread length histogram.
Added news client normalization (to have fewer rows in the clients section).
Added average number of lines per posting in the General section.

Links

AgtFind
A couple of Windows tools to deal with Forte Agent (Windows newsreader) group files.