Theodoros Emmanouilidis

Notes & Thoughts

Substitute CSV Column Values Plus One With awk

September 15

Suppose you have a three-column CSV data file and the values of one column, say the third, must be replaced with their current value plus one. This is very easy to do with awk.

Just type:

awk -F, '{$3++; print $1","$2","$3}' input-file.csv

By default, awk prints the result to the console. To save the result, redirect the output to a file:

awk -F, '{$3++; print $1","$2","$3}' input-file.csv > output-file.csv

The third column of the output file contains the input file values plus one.
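
As a side note, the same result can be expressed a bit more compactly by setting awk's output field separator, so the print statement needs no explicit commas. A quick sketch on a throw-away sample file (the file name and contents are only illustrative):

printf 'a,1,10\nb,2,20\n' > sample.csv
awk -F, -v OFS=, '{$3++; print}' sample.csv
# prints:
# a,1,11
# b,2,21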

Convert PDF Document To Text From Command Line

September 15

pdftotext is a nifty command line utility that can be used to convert a PDF document to text. Most Linux distributions include pdftotext as part of the poppler-utils package. Installation in Ubuntu is very easy using apt. Just type

sudo apt-get install poppler-utils

and pdftotext is installed.

Usage: pdftotext [options] <PDF-file> [<text-file>]
  -f <int>          : first page to convert
  -l <int>          : last page to convert
  -r <fp>           : resolution, in DPI (default is 72)
  -x <int>          : x-coordinate of the crop area top left corner
  -y <int>          : y-coordinate of the crop area top left corner
  -W <int>          : width of crop area in pixels (default is 0)
  -H <int>          : height of crop area in pixels (default is 0)
  -layout           : maintain original physical layout
  -raw              : keep strings in content stream order
  -htmlmeta         : generate a simple HTML file, including the meta information
  -enc <string>     : output text encoding name
  -listenc          : list available encodings
  -eol <string>     : output end-of-line convention (unix, dos, or mac)
  -nopgbrk          : don't insert page breaks between pages
  -opw <string>     : owner password (for encrypted files)
  -upw <string>     : user password (for encrypted files)
  -q                : don't print any messages or errors
  -v                : print copyright and version info

The simplest way to use it is by typing

pdftotext file-to-convert.pdf

and the utility will create a text file with the same name in the directory where file-to-convert.pdf resides.
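
You can also pass an explicit output file name as the second argument, or use - to send the text to standard output, which is handy for piping into other tools (the file names here are only illustrative):

pdftotext file-to-convert.pdf notes.txt
pdftotext file-to-convert.pdf - | less
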
Wildcards (*) cannot be used to convert multiple files, for example:

pdftotext *.pdf

because pdftotext expects only one file name. Instead, a loop can be used for batch conversions:

for f in *.pdf
do
pdftotext "$f"
done
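
The options listed above can of course be combined. For instance, to keep the original physical layout and convert only the first five pages of a document into a specific output file (file names again illustrative):

pdftotext -layout -f 1 -l 5 report.pdf report-excerpt.txt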

Running LDA Algorithm With Mahout

September 15

The following article explains the usage of Apache Mahout's implementation of the Latent Dirichlet Allocation (LDA) learning algorithm. The idea behind the example is to extract a given number of topics from a collection of text files located in a given directory. For simplicity we will use Mahout's ability to run locally, but the procedure is no different if it's applied to a Mahout build configured to use a Hadoop cluster.

The first step is to download and install Mahout. Download the latest stable release from here and extract the archive to your preferred working directory. The only thing that needs to be configured for a minimal run is the JAVA_HOME environment variable.

For Ubuntu, just type:

export JAVA_HOME=/usr/lib/jvm/java-6-sun

in the console, adjusting the declared path to the appropriate one for your release / set-up. Since the export command works only for the current session, you can also edit the {mahout directory}/bin/mahout file and paste the export command there.
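
If you would rather not touch the mahout script, an alternative sketch (assuming a Bash shell and the same JDK path as above) is to append the export to your shell profile so it survives new sessions:

echo 'export JAVA_HOME=/usr/lib/jvm/java-6-sun' >> ~/.bashrc
source ~/.bashrc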

The second step is to create the directory that will contain the text files we want to extract topics from and populate it.
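
For example, mirroring the paths used later in this article (the source location of the text files is purely illustrative):

mkdir -p /home/tgi/Desktop/ruby
# copy the plain text files you want to extract topics from into the new directory
cp ~/Documents/ruby-book/*.txt /home/tgi/Desktop/ruby/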

1) Converting text documents to SequenceFile format

Mahout has a built-in utility to convert text files contained in a given directory to SequenceFile format:

seqdirectory

The options of the command are as follows:

  --input (-i) input                             Path to job input directory.
  --output (-o) output                           The directory pathname for
                                                 output.
  --overwrite (-ow)                              If present, overwrite the
                                                 output directory before
                                                 running job
  --chunkSize (-chunk) chunkSize                 The chunkSize in MegaBytes.
                                                 Defaults to 64
  --fileFilterClass (-filter) fileFilterClass    The name of the class to use
                                                 for file parsing. Default:
                                                 org.apache.mahout.text.PrefixAd
                                                 ditionFilter
  --keyPrefix (-prefix) keyPrefix                The prefix to be prepended to
                                                 the key
  --charset (-c) charset                         The name of the character
                                                 encoding of the input files.
                                                 Default to UTF-8
  --help (-h)                                    Print out help
  --tempDir tempDir                              Intermediate output directory
  --startPhase startPhase                        First phase to run
  --endPhase endPhase                            Last phase to run

but the absolute minimum needed to run the utility is (every command is executed from the {mahout directory}/bin directory):

./mahout seqdirectory --input {full path to the folder containing text files} --output {full path to folder that we want to save sequence files} -c UTF-8

or, in my case:

./mahout seqdirectory --input /home/tgi/Desktop/ruby --output /home/tgi/Desktop/1 -c UTF-8

The document id generated is {prefix}{relative path from parent}/document.txt.

2) Creating vectors from SequenceFile

Again, Mahout has a built-in utility for this:

seq2sparse

All options of the command are:

  --minSupport (-s) minSupport        (Optional) Minimum Support. Default
                                      Value: 2
  --analyzerName (-a) analyzerName    The class name of the analyzer
  --chunkSize (-chunk) chunkSize      The chunkSize in MegaBytes. 100-10000 MB
  --output (-o) output                The output directory
  --input (-i) input                  input dir containing the documents in
                                      sequence file format
  --minDF (-md) minDF                 The minimum document frequency.  Default
                                      is 1
  --maxDFPercent (-x) maxDFPercent    The max percentage of docs for the DF.
                                      Can be used to remove really high
                                      frequency terms. Expressed as an integer
                                      between 0 and 100. Default is 99.
  --weight (-wt) weight               The kind of weight to use. Currently TF
                                      or TFIDF
  --norm (-n) norm                    The norm to use, expressed as either a
                                      float or "INF" if you want to use the
                                      Infinite norm.  Must be greater or equal
                                      to 0.  The default is not to normalize
  --minLLR (-ml) minLLR               (Optional)The minimum Log Likelihood
                                      Ratio(Float)  Default is 1.0
  --numReducers (-nr) numReducers     (Optional) Number of reduce tasks.
                                      Default Value: 1
  --maxNGramSize (-ng) ngramSize      (Optional) The maximum size of ngrams to
                                      create (2 = bigrams, 3 = trigrams, etc)
                                      Default Value:1
  --overwrite (-ow)                   If set, overwrite the output directory
  --help (-h)                         Print out help
  --sequentialAccessVector (-seq)     (Optional) Whether output vectors should
                                      be SequentialAccessVectors. If set true
                                      else false
  --namedVector (-nv)                 (Optional) Whether output vectors should
                                      be NamedVectors. If set true else false
  --logNormalize (-lnorm)             (Optional) Whether output vectors should
                                      be logNormalize. If set true else false

The minimum command to issue, continuing the example, is:

./mahout seq2sparse -i /home/tgi/Desktop/1 -o /home/tgi/Desktop/2 -wt tf
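
If you would rather experiment with tf-idf weighting, the same utility accepts -wt tfidf, and -ow overwrites the output of a previous run (both options appear in the list above); the vectors then end up in a tfidf-vectors subdirectory instead of tf-vectors:

./mahout seq2sparse -i /home/tgi/Desktop/1 -o /home/tgi/Desktop/2 -wt tfidf -ow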

3) Invoking LDA algorithm

With the documents prepared we can now invoke the LDA algorithm. The options to run the algorithm are:

  --input (-i) input                      Path to job input directory.
  --output (-o) output                    The directory pathname for output.
  --overwrite (-ow)                       If present, overwrite the output
                                          directory before running job
  --numTopics (-k) numTopics              The total number of topics in the
                                          corpus
  --numWords (-v) numWords                The total number of words in the
                                          corpus (can be approximate, needs to
                                          exceed the actual value)
  --topicSmoothing (-a) topicSmoothing    Topic smoothing parameter. Default is
                                          50/numTopics.
  --maxIter (-x) maxIter                  The maximum number of iterations.
  --help (-h)                             Print out help
  --tempDir tempDir                       Intermediate output directory
  --startPhase startPhase                 First phase to run
  --endPhase endPhase                     Last phase to run

Continuing the example we issue the command:

./mahout lda -i /home/tgi/Desktop/2/tf-vectors -o /home/tgi/Desktop/3 -k 50 -v 200000

choosing to compute 50 topics in our corpus. The numWords parameter must exceed the total number of words in the previously computed dictionary. The easiest way to handle this is to run the command with a placeholder number the first time, find the actual value in the job initialization log printed to the console

INFO: record buffer = 262144/327680

and re-run LDA with -v greater than 327680 in the example, as shown below.
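
In this case a second run could look like the following, where 400000 is simply a number comfortably above 327680 and -ow (from the options list above) tells Mahout to overwrite the previous output directory:

./mahout lda -i /home/tgi/Desktop/2/tf-vectors -o /home/tgi/Desktop/3 -k 50 -v 400000 -ow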

The input directory must point to the output directory of the previous phase, /tf-vectors or /tfidf-vectors, depending on our previous choice.

4) Output the computed topics

After running LDA you can obtain an output of the computed topics using another Mahout utility:

ldatopics

All options of the command are:

  --dict (-d) dict                         Dictionary to read in, in the same
                                           format as one created by
                                           org.apache.mahout.utils.vectors.lucen
                                           e.Driver
  --output (-o) output                     Output directory to write top words
  --words (-w) words                       Number of words to print
  --input (-i) input                       Path to an LDA output (a state)
  --dictionaryType (-dt) dictionaryType    The dictionary file type
                                           (text|sequencefile)

and to print out the topics of the example corpus we type:
./mahout ldatopics -i /home/tgi/Desktop/3/state-9/ -d /home/tgi/Desktop/2/dictionary.* -o /home/tgi/Desktop/4 --dictionaryType sequencefile

Be careful here to use as the input directory the last state before convergence of the algorithm (state-9 in the example). The output should be a set of files, each representing a computed topic and containing the words of that topic. Something like this:

end [p(end|topic_47) = 0.02054285450740693
class [p(class|topic_47) = 0.019173433558138983
you [p(you|topic_47) = 0.011081809083779152
ruby [p(ruby|topic_47) = 0.010939372894063723
code [p(code|topic_47) = 0.009912782417548813
pm [p(pm|topic_47) = 0.009565600712295945
07 [p(07|topic_47) = 0.008512880092545057
x [p(x|topic_47) = 0.007979504372069835
3 [p(3|topic_47) = 0.007552053207748147
from [p(from|topic_47) = 0.007501607551193776
your [p(your|topic_47) = 0.0074405804416695824
use [p(use|topic_47) = 0.007428773869438666
page [p(page|topic_47) = 0.007026473801340893
def [p(def|topic_47) = 0.006772484477458351
chapter [p(chapter|topic_47) = 0.006635243376794015
10 [p(10|topic_47) = 0.00600798063831032
have [p(have|topic_47) = 0.005999418317965101
2 [p(2|topic_47) = 0.005959277207972052
need [p(need|topic_47) = 0.005849514834292144
method [p(method|topic_47) = 0.005676919181434945

If you run LDA using a Hadoop cluster, you should substitute all paths used in the example with hdfs paths.

Currently (Mahout version 0.5) the LDA implementation is not completely integrated with the other Mahout clustering applications. There is no classification step to match an initial input document with a computed topic.

nginx Installation On Amazon Linux AMI

September 11

Since Amazon offers its Amazon Linux AMI for free when used with a micro instance, it is a nice idea to use such a micro instance as a reverse proxy front-end for your actual web server. This setup offers many advantages, enhanced security and failover capabilities being two of them.

1) Set up AMI

You have to select an Amazon Linux AMI to run as a micro instance. Store the Key Pair in order to be able to log in to the machine, and associate an elastic IP with it. Then alter the default security group, opening ports 80 and 443 to the world and port 22 to the IP / IPs you will use to connect to the machine via ssh.

2) Log in and become root

By default you can log in to the machine via ssh using the user name “ec2-user” and the Key Pair that you downloaded upon AMI creation. In order to continue the installation it is convenient to become root. Type

sudo su

3) Update

yum update

4) Install needed packages

yum install pcre-devel zlib-devel openssl-devel
yum install gcc
yum install make

5) Download the latest stable release of nginx from here

wget http://nginx.org/download/nginx-1.1.0.tar.gz

6) Extract

tar xzf nginx-1.1.0.tar.gz

7) Install

cd nginx-1.1.0
./configure --sbin-path=/usr/local/sbin --with-http_ssl_module
make
make install
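
Before starting the server you can confirm that the binary was built the way you asked; nginx -V prints the version together with the configure arguments, so the SSL module should show up there:

/usr/local/sbin/nginx -V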

8) Start nginx

/usr/local/sbin/nginx

9) Test

Navigate to your AMI's IP address and check whether nginx responds with its default web page.
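
If you prefer the command line, a quick curl from your workstation against the instance's elastic IP (substitute your own address) should come back with a 200 status and an nginx Server header:

curl -I http://<your-elastic-ip>/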

10) Make a start / stop script

nano /etc/init.d/nginx

Copy the following code to the newly created file.

#!/bin/sh
#
# processname: nginx
# config:      /usr/local/nginx/conf/nginx.conf
# pidfile:     /usr/local/nginx/logs/nginx.pid
 
# Source function library.
. /etc/rc.d/init.d/functions
 
# Source networking configuration.
. /etc/sysconfig/network
 
# Check that networking is up.
[ "$NETWORKING" = "no" ] && exit 0
 
nginx="/usr/local/sbin/nginx"
prog=$(basename $nginx)
 
NGINX_CONF_FILE="/usr/local/nginx/conf/nginx.conf"
 
lockfile=/var/lock/subsys/nginx
 
start() {
    [ -x $nginx ] || exit 5
    [ -f $NGINX_CONF_FILE ] || exit 6
    echo -n $"Starting $prog: "
    daemon $nginx -c $NGINX_CONF_FILE
    retval=$?
    echo
    [ $retval -eq 0 ] && touch $lockfile
    return $retval
}
 
stop() {
    echo -n $"Stopping $prog: "
    killproc $prog -QUIT
    retval=$?
    echo
    [ $retval -eq 0 ] && rm -f $lockfile
    return $retval
}
 
restart() {
    configtest || return $?
    stop
    start
}
 
reload() {
    configtest || return $?
    echo -n $"Reloading $prog: "
    killproc $nginx -HUP
    RETVAL=$?
    echo
}
 
force_reload() {
    restart
}
 
configtest() {
  $nginx -t -c $NGINX_CONF_FILE
}
 
rh_status() {
    status $prog
}
 
rh_status_q() {
    rh_status >/dev/null 2>&1
}
 
case "$1" in
    start)
        rh_status_q && exit 0
        $1
        ;;
    stop)
        rh_status_q || exit 0
        $1
        ;;
    restart|configtest)
        $1
        ;;
    reload)
        rh_status_q || exit 7
        $1
        ;;
    force-reload)
        force_reload
        ;;
    status)
        rh_status
        ;;
    condrestart|try-restart)
        rh_status_q || exit 0
            ;;
    *)
        echo $"Usage: $0 {start|stop|status|restart|condrestart|try-restart|reload|force-reload|configtest}"
        exit 2
esac

Make the file executable.

chmod 755 /etc/init.d/nginx

Test.

/etc/init.d/nginx stop
/etc/init.d/nginx start
/etc/init.d/nginx reload

Finally, make the service start automatically at every reboot.

/sbin/chkconfig nginx on
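
You can verify that the service was registered for the usual runlevels with:

/sbin/chkconfig --list nginx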

Now you can configure your newly installed server.
