Juta is a set of Java command line programs that process Usenet postings. The postings are expected to be stored in a text file (or piped to a program's standard input as one text stream). To run any of the tools, you must have a JRE (Java Runtime Environment) installed, version 1.2 or higher. The tools are distributed as a single JAR file which contains both bytecode and source code.
These are short descriptions of all the programs included in Juta.
Call a program with --help as the single parameter to get usage information.
If you want to use any program other than juta, you will have to
decompress the JAR archive, e. g. by calling
unzip juta.jar or jar xf juta.jar.
The various tools whose names start with news are designed in a way similar
to the Unix command line tools - they can be easily combined by connecting one program's output
to another program's input using pipes.
See the Examples section for some ideas.
The most important tool in the Juta package is called juta.
It will create HTML output with statistics on Usenet postings from one month.
The juta.jar distribution has a manifest file which points to
the juta application so that you can start it by simply entering
java -jar juta.jar on the shell.
For an example of the HTML output created by juta, see the homepage of the German Java newsgroup de.comp.lang.java.
Note that publishing the HTML files as created by juta on the Internet
might be against the law depending on where you live and what kind of
information you make juta include in that HTML output file.
A person who is listed in the authors section might not want to appear
there.
To learn more about the details for Germany, see the
Datenschutz
section of the Open Directory.
juta comes with parameters that let you turn off the inclusion of any
personal information (like number of postings per user per month) and
email addresses.
If a posting contains X-No-Archive: Yes in its header,
the person appears as Anonymous (if you have chosen to include
authors in the first place).
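The X-No-Archive rule described above could be sketched in Java roughly as follows. This is an illustration only, not Juta's actual code: the class and method names (AuthorDisplay, displayName) are hypothetical, and comparing the header name and value case-insensitively is an assumption.

```java
import java.util.List;

// Hypothetical helper for illustration; not taken from Juta's source code.
class AuthorDisplay {
    /**
     * Returns "Anonymous" if one of the header lines is "X-No-Archive: Yes",
     * otherwise returns the author name unchanged.
     */
    static String displayName(List<String> headerLines, String author) {
        for (String line : headerLines) {
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // not a "Name: value" header line
            }
            String name = line.substring(0, colon).trim();
            String value = line.substring(colon + 1).trim();
            // Assumption: both name and value are compared ignoring case.
            if (name.equalsIgnoreCase("X-No-Archive") && value.equalsIgnoreCase("yes")) {
                return "Anonymous";
            }
        }
        return author;
    }
}
```

A switch such as -y (ignore the header field) would simply skip this check and always use the author name.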
newscount reads postings from standard input, counts them and prints the number of postings to standard output.
newsfilter will read postings from standard input and write all postings that match the parameters to standard output. You can filter postings for a specific newsgroup or month. You can also make the program write only part of each posting to output, e. g. only bodies, only headers or only signatures.
newsgrep has one mandatory parameter, a regular expression. It will read postings from standard input, test each line of each posting and write each matching line to standard output.
The regular expression matching part of this program could of course also be done by the
normal grep program that is available on almost any platform.
However, newsgrep not only finds the matching lines, it also prints a one-line
summary of each posting that contains at least one matching line, so that it is
much easier to find out quickly which postings contain a match.
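The line matching step can be sketched as follows. This is a minimal sketch that assumes java.util.regex (available from Java 1.4) rather than the Jakarta ORO library bundled with Juta, and it assumes that a line matches when the pattern is found anywhere in it. The class and method names are hypothetical, and the one-line summary is left out because its format is not described here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration of per-line regular expression matching.
class LineMatcher {
    /** Returns all lines in which the regular expression is found at least once. */
    static List<String> matchingLines(List<String> lines, String regex) {
        Pattern pattern = Pattern.compile(regex);
        List<String> result = new ArrayList<>();
        for (String line : lines) {
            // find() looks for the pattern anywhere in the line,
            // which corresponds to grep's default behavior.
            if (pattern.matcher(line).find()) {
                result.add(line);
            }
        }
        return result;
    }
}
```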
This tool will read postings from standard input and write a one-line summary of each posting to standard output. You can configure what information is included in the one-liners.
newssort will read all postings from standard input into an internal list, sort them by their message ID values
and write the sorted postings to standard output.
This program might need more memory than the virtual machine has with its standard settings.
Give the VM more memory using the -mx switch.
Example: java -mx128m newssort will give 128 MB to it.
This program is particularly useful in combination with newsuniq to remove duplicates.
newsuniq assumes that its input is sorted by message ID values.
It reads the sorted postings from standard input and writes postings to standard output.
If a posting appears more than once, it will be written only once to output.
As the postings are sorted, this is particularly easy to implement - the program
will compare a posting simply to its predecessor and can decide whether it is
equal by comparing the message ID values.
To be used in combination with newssort.
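The sort-then-compare-with-predecessor idea can be sketched like this. To keep the example short, each posting is represented only by its message ID string; the class name and the sample IDs are made up for illustration and do not come from Juta's source code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration: postings are reduced to their message IDs.
class SortUniqSketch {
    /** Removes duplicates from a list that is already sorted by message ID. */
    static List<String> uniq(List<String> sortedIds) {
        List<String> result = new ArrayList<>();
        String previous = null;
        for (String id : sortedIds) {
            // In sorted input, duplicates are always adjacent, so comparing
            // with the predecessor is enough.
            if (!id.equals(previous)) {
                result.add(id);
            }
            previous = id;
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> ids = new ArrayList<>(Arrays.asList(
            "<b@example.org>", "<a@example.org>", "<b@example.org>", "<c@example.org>"));
        Collections.sort(ids);         // the newssort step
        System.out.println(uniq(ids)); // the newsuniq step
    }
}
```

Only one previous ID has to be remembered during the uniq pass, whereas sorting needs the whole input in memory.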
How does one get an input file to be used with the various tools? I can only describe how I do it with Forte Agent.
If you name the file YYYYMM.txt (e. g. 200108.txt for August 2001, note the leading zero before the 8),
you will not have to specify the month.
Juta is distributed under the GNU General Public License version 2.
The Jakarta ORO library
that is included in juta.jar is distributed under the
Apache
Software License, version 1.1.
The copyright statement of that library is as follows:
Copyright (c) 2000 The Apache Software Foundation. All rights reserved.
Download the latest version: juta.jar (383 KB). This .jar file contains
the bytecode (.class files), the source code (.java files), tld.txt, LICENSE and README.
Make sure that a Java Runtime Environment (JRE) version 1.2 or higher is installed and
juta.jar (the file with all the command line programs) is in your classpath.
Open a shell (console, prompt, whatever you call that on your system) and optionally go to the
directory with the input text files you want to create statistics for.
Let's say the file cljg200108.txt contains all postings of newsgroup
comp.lang.java.gui for August 2001.
The following will create a statistics file cljg-stat200108.html:
java -jar juta.jar -i cljg200108.txt -n comp.lang.java.gui -m 8 2001 -o cljg-stat200108.html
If the file all.txt holds all your newspostings, the following
command will count all postings from group comp.lang.java.programmer:
java newsfilter -i all.txt -n comp.lang.java.programmer | java newscount
Assuming that file 2001.txt holds all postings from the year 2001 for
a specific newsgroup, the following command will extract postings from May of that year
and write them to a new file 200105.txt:
java newsfilter -i 2001.txt -m 5 2001 > 200105.txt
Let's say you have three files with postings of May 2001 for a specific newsgroup.
The files will contain mostly duplicates.
This will create a single file 200105.txt which will hold all unique
postings:
cat 200105a.txt 200105b.txt 200105c.txt | java newssort | java newsuniq > 200105.txt
This section requires you to create a directory structure that reflects the names of the newsgroups you are archiving. Example: you want to store postings from the two newsgroups alt.tv.friends and comp.compression. Let's further assume that the main directory of your Usenet archive on disk is c:\usenet\.
Then you'd have to create directories c:\usenet\alt\tv\friends\ and c:\usenet\comp\compression\. In each directory you'd store all postings of one month in one file that reflects the month. Example: postings for the Friends newsgroup for May 2001 would be put into c:\usenet\alt\tv\friends\200105.txt (note the leading 0 for the month), postings from March 1999 for the compression newsgroup would have to be stored as c:\usenet\comp\compression\199903.txt.
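The mapping from newsgroup name and month to a file location can be sketched as follows. The class and method names are hypothetical. The sketch returns a path relative to the archive root and uses / as the separator for readability; on Windows the result would be appended to a root such as c:\usenet\.

```java
// Hypothetical illustration of the directory layout described above.
class ArchivePaths {
    /**
     * Builds the path of a monthly postings file relative to the archive root,
     * e.g. alt.tv.friends, 2001, 5 becomes alt/tv/friends/200105.txt.
     */
    static String postingFile(String newsgroup, int year, int month) {
        // Each dot-separated part of the newsgroup name is one directory level.
        String directory = newsgroup.replace('.', '/');
        // %02d adds the leading zero for months 1 to 9.
        String fileName = String.format("%04d%02d.txt", year, month);
        return directory + "/" + fileName;
    }

    public static void main(String[] args) {
        System.out.println(postingFile("alt.tv.friends", 2001, 5));
        System.out.println(postingFile("comp.compression", 1999, 3));
    }
}
```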
Now you can make juta process a complete tree of Usenet postings in one run:
java juta -u c:\usenet\
juta will load all text files YYYYMM.txt and create the corresponding YYYYMM.html file with statistics. It will not do that if there is a YYYYMM.html with a last modification date newer than the text file. You can force the creation of all statistics files (e.g. when you get a newer version of juta with modified features) using the -f (force output) switch. If you don't want the HTML files in the same tree as the postings, specify an output root directory after the -h (HTML output directory) switch. Example: -h c:\usenet-stats\. All necessary subdirectories will be created by juta (e.g. c:\usenet-stats\alt\tv\friends).
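The decision whether a statistics file has to be (re)created can be sketched as below. This is only an illustration of the rule stated above, not Juta's actual code: the names are hypothetical, and treating equal timestamps as "not newer" (so the file is recreated) is an assumption.

```java
import java.io.File;

// Hypothetical illustration of the "skip if the HTML file is newer" rule.
class RegenerationCheck {
    /** Pure decision logic, based on modification times in milliseconds. */
    static boolean needsRegeneration(long textModified, boolean htmlExists,
                                     long htmlModified, boolean force) {
        if (force || !htmlExists) {
            return true; // -f given, or no statistics file yet
        }
        // Skip only if the HTML file is strictly newer than the text file.
        return htmlModified <= textModified;
    }

    /** Convenience variant that reads the timestamps from the file system. */
    static boolean needsRegeneration(File textFile, File htmlFile, boolean force) {
        return needsRegeneration(textFile.lastModified(), htmlFile.exists(),
                                 htmlFile.lastModified(), force);
    }
}
```

With this rule, a second run over an unchanged archive does no work unless -f is given.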
If you have newsgroups with a lot of postings, you might want to give more memory to the JVM. I use java -mx192m juta OTHERPARAMETERS to give it 192 MB.
% java -jar juta.jar --help
juta - Java Usenet Traffic Analyzer
Creates HTML output with statistics from Usenet postings.
Usage: java <VM args> juta <PROGRAM-ARGUMENTS>
--help show this usage screen and exit
-f process all newsgroup files
-i FILE input file ("-" for standard input)
-m MM YYYY month and year to be examined
-n NEWSGROUP newsgroup for which the statistics are created
-o FILE output file ("-" for standard output)
-u DIR Usenet archive root directory
Output switches
-a do not include authors
-c ATTR CSS file name attribute
-C do not include clients
-D do not include top level domains
-e do not include email addresses
-g do not include Google Groups links
-G do not include gender distribution
-k keep original text file (when in -z mode)
-L do not include thread lengths
-O do not include operating systems
-s NUM number of threads in Top threads section
(default is 50, use 0 for all threads)
-t do not include tags
-T do not include top threads
-U do not include top URLs
-x do not include explanations
-y ignore "X-No-Archive: Yes" header field
-z gzip uncompressed text files after processing them
This program is distributed under the GNU General Public License (GPL)
version 2. See http://www.gnu.org/copyleft/gpl.html or the LICENSE file
within juta.jar (use a Zip utility to access that file).
Version: 0.4.3
Juta's homepage is at http://schmidt.devlib.org/software/juta.html
Also see the Freshmeat entry at http://freshmeat.net/projects/juta
newssort sort by other criteria than just message IDs.
=?iso-8859-1?Q?encodedword?=).
juta can now also read from stdin and write to stdout (use - instead of file names).