Theodoros Emmanouilidis

Notes & Thoughts

Install Yahoo! LDA In Ubuntu 11.04 Server

August 9

This tutorial guides you through installing the Yahoo! LDA code on a newly installed Ubuntu 11.04 server. Apart from the default installation, the only package assumed to have been selected from the installation menu is the OpenSSH server.

1) Install JAVA

sudo apt-get install python-software-properties
sudo add-apt-repository "deb http://archive.canonical.com/ lucid partner"
sudo apt-get update
sudo apt-get install sun-java6-jdk

The JDK can now be found in /usr/lib/jvm/java-6-sun, which is actually a symlink on Ubuntu.

2) Download source code

To continue with the installation, make a directory in your home folder (or wherever you like). This is the folder where the application will reside after the installation. I have an apps folder inside my home, so I created an LDA folder inside apps.

mkdir -p ~/apps/LDA
cd ~/apps/LDA

Download the source code from GitHub and extract it.

wget https://github.com/shravanmn/Yahoo_LDA/tarball/master
# extract code
tar -xzf master
cd shravanmn-Yahoo_LDA-*
mv * ../
cd ..
rm -rf shravanmn-Yahoo_LDA-*
rm master
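
The extract, move and clean-up sequence above can also be collapsed into a single tar call. This is only a minimal sketch, assuming GNU tar (the default on Ubuntu) and its --strip-components option; the function name extract_flat is made up for illustration.

```shell
# Hypothetical helper: unpack a tarball into DEST and drop the single
# top-level directory (e.g. shravanmn-Yahoo_LDA-<hash>) that GitHub adds.
extract_flat() {
    tarball=$1
    dest=$2
    mkdir -p "$dest" &&
    tar -xzf "$tarball" -C "$dest" --strip-components=1
}

# Only runs when the file downloaded by the wget step is present.
if [ -f master ]; then
    extract_flat master . && rm master
fi
```

Unlike mv *, which skips hidden files, this keeps everything that was inside the archive's top-level directory.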

3) Install system prerequisites

Install build-essential

sudo apt-get install build-essential

Install emacs

sudo apt-get install emacs

Install ant

sudo apt-get install ant
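
Before running make, it can help to confirm that the tools from these packages are actually on the PATH. The following is a rough sketch rather than part of the original procedure; the helper name missing_tools is hypothetical, and the tool list assumes build-essential supplies gcc, g++ and make.

```shell
# Hypothetical helper: print each named command that is not on the PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# gcc, g++ and make come from build-essential; ant and emacs from their own packages.
missing=$(missing_tools gcc g++ make ant emacs)
if [ -n "$missing" ]; then
    echo "Still missing:" $missing
else
    echo "All build tools found"
fi
```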

4) Make

Run make inside the LDA directory we created.

make

5) Install Ice

The only problem with make is Ice, which must be compiled separately.

Install system prerequisites.

Install libbz2-dev package.

sudo apt-get install libbz2-dev

Install the Expat XML parser.

sudo apt-get install expat libexpat-dev

Also install libssl-dev.

sudo apt-get install libssl-dev

Install Mono and the packages it needs.

sudo apt-get install mono-complete

Finally, install the mandatory dev packages.

sudo apt-get install python-dev ruby ruby-dev php5-dev

Download the third-party sources for Ice.

wget http://www.zeroc.com/download/Ice/3.4/ThirdParty-Sources-3.4.2.zip
sudo apt-get install unzip
unzip ThirdParty-Sources-3.4.2.zip
rm ThirdParty-Sources-3.4.2.zip
cd ThirdParty-Sources-3.4.2

Install Berkeley DB.

unzip db-4.8.30.NC.zip
cd db-4.8.30.NC
cd build_unix
../dist/configure --prefix=/usr/local/berkeleydb --enable-compat185 --enable-cxx --enable-debug_rop --enable-debug_wop --enable-java
make
sudo make install

Locate the db.jar file, which provides the extra Java classes. It should be here:

/usr/local/berkeleydb/lib/db.jar

Copy the jar file to the JDK lib directory, which will be put on the classpath later.

sudo cp /usr/local/berkeleydb/lib/db.jar /usr/lib/jvm/java-6-sun/lib

You will also need classes from the following packages, which are included in the third-party sources directory.

cd ../../
unzip jgoodies-common-1_2_0.zip
cd jgoodies-common-1.2.0
sudo cp jgoodies-common-1.2.0.jar /usr/lib/jvm/java-6-sun/lib
cd ../
unzip jgoodies-forms-1_4_1.zip
cd jgoodies-forms-1.4.1
sudo cp jgoodies-forms-1.4.1.jar /usr/lib/jvm/java-6-sun/lib
cd ../
unzip jgoodies-looks-2_4_1.zip
cd jgoodies-looks-2.4.1/
sudo cp jgoodies-looks-2.4.1.jar /usr/lib/jvm/java-6-sun/lib
cd ../

You have to compile some Slice tools manually in the build/Ice-3.4.1/cpp/src directory. These are slice2php, slice2cs, slice2freezej and slice2java.

cd ~/apps/LDA/build/Ice-3.4.1/cpp/src
cd slice2php/
make
cd ../
cd slice2cs/
make
cd ../
cd slice2freezej/
make
cd ../
cd slice2java/
make
cd ../../../
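
The four cd/make pairs above follow one pattern, so they can also be written as a loop that stops at the first failure. This is a sketch only: the function name build_each and the MAKE override variable are assumptions made for illustration, while the directory names and path are the ones used in this tutorial.

```shell
# Hypothetical helper: run the build command in each given directory,
# stopping at the first failure. MAKE defaults to plain "make".
build_each() {
    for dir in "$@"; do
        echo "Building $dir"
        ( cd "$dir" && ${MAKE:-make} ) || return 1
    done
}

# Directory names are the four listed above; the path is the one used in this tutorial.
SRC=~/apps/LDA/build/Ice-3.4.1/cpp/src
if [ -d "$SRC" ]; then
    ( cd "$SRC" && build_each slice2php slice2cs slice2freezej slice2java )
fi
```

Because the loop runs in a subshell, the current directory is unchanged afterwards, so cd to ~/apps/LDA/build/Ice-3.4.1 before building Ice.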

Point your classpath to the jar files you copied.

export CLASSPATH=/usr/lib/jvm/java-6-sun/lib/jgoodies-common-1.2.0.jar:/usr/lib/jvm/java-6-sun/lib/jgoodies-forms-1.4.1.jar:/usr/lib/jvm/java-6-sun/lib/db.jar:/usr/lib/jvm/java-6-sun/lib/jgoodies-looks-2.4.1.jar
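
A long colon-separated value like this is easy to mistype, so one option is to assemble it from the jar names. The sketch below should produce the same value as the export above; the helper name join_classpath and the JAR_DIR variable are hypothetical.

```shell
# Hypothetical helper: join DIR/jar for every jar name with ":" separators.
join_classpath() {
    dir=$1
    shift
    result=""
    for jar in "$@"; do
        # ${result:+$result:} adds the separator only after the first entry
        result="${result:+$result:}$dir/$jar"
    done
    echo "$result"
}

# Directory and jar names are the ones copied in the earlier steps.
JAR_DIR=/usr/lib/jvm/java-6-sun/lib
CLASSPATH=$(join_classpath "$JAR_DIR" \
    jgoodies-common-1.2.0.jar \
    jgoodies-forms-1.4.1.jar \
    db.jar \
    jgoodies-looks-2.4.1.jar)
export CLASSPATH
echo "$CLASSPATH"
```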

Now build Ice.

make

When the build finishes, copy all the created files from the lib folder to the system's lib folder.

cd ../../
sudo cp lib/* /usr/lib

This is all you need to run the Yahoo! LDA code on a single machine.

6) Test installation (Batch Mode)

Follow the example in the documentation that accompanies the code (inside the docs folder). The commands are slightly altered so that they work for the single-node example.

Phase 1 – Tokenization and Formatting

cd ut_out
cp ../Tokenizer.java .
javac Tokenizer.java
cat ydir_1k.txt | java -classpath . Tokenizer | ../formatter

Phase 2 – Learning the topic mixtures

../learntopics --topics=100 --iter=500

If everything is OK, you will be able to see the word mixtures for each topic

cat lda.topToWor.txt

and the topic assignments

cat lda.worToTop.txt
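
Rather than reading both files in full, a quick check that they exist and are non-empty may be enough at this point. This is a minimal sketch, to be run from inside ut_out; the helper name check_outputs is made up, and only the two file names shown above are assumed.

```shell
# Hypothetical helper: report any listed file that is missing or empty.
check_outputs() {
    status=0
    for f in "$@"; do
        if [ -s "$f" ]; then
            echo "ok: $f"
        else
            echo "missing or empty: $f"
            status=1
        fi
    done
    return $status
}

# File names are the two shown above; run from inside ut_out.
check_outputs lda.topToWor.txt lda.worToTop.txt || echo "learntopics may not have finished"
```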

Phase 3 – Testing

Go to the ut_test directory.

cd ../ut_test/

Copy the compiled Tokenizer class into the current directory.

cp ../ut_out/Tokenizer.class .

Format the test data.

cat ydir_1k.tst.txt | java -classpath . Tokenizer | ../formatter --dumpfile=../ut_out/lda.dict.dump

Learn the test topics.

../learntopics -test --dumpprefix=../ut_out/lda --topics=100

The output files are created inside ut_test.