
Showing posts from 2012

Apache HBase unit test error

While building and testing HBase 0.94.3 on Linux (Red Hat and SLES) I came across an error a couple of times, and solved it with three different approaches that I think might be good to share. Several test cases were failing because of it, such as TestCatalogTrackerOnCluster and TestLogRollAbort. I ran mvn clean compile package and got the following error:

java.io.IOException: Shutting down
    at org.apache.hadoop.hbase.MiniHBaseCluster.init(MiniHBaseCluster.java:203)
    at org.apache.hadoop.hbase.MiniHBaseCluster.<init>(MiniHBaseCluster.java:76)
    at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniHBaseCluster(HBaseTestingUtility.java:632)
    at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:606)
    at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:554)
    at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:541)
    at org.apache.giraffa.TestRestartGiraffa.beforeClass(TestRes…

Dependencies on a Hadoop Ecosystem

When building a Hadoop cluster and the other Apache projects related to it, it might be tricky to know what to install and how to install it. You should first understand your data and what you want to do with it. If you have log-like data that keeps growing all the time and you need to keep pushing it into the Hadoop cluster, you might want to consider Flume. Apache Flume is a distributed, daemon-like piece of software (and for that reason offers high availability) that can keep feeding data into the Hadoop cluster. If you need a NoSQL random read/write database, you can use HBase, which is modeled on Google's BigTable. If you have relatively structured data and you want to run query-like analyses on it, you might consider Pig or Hive. Both work on top of the Hadoop cluster, executing commands instead of hand-written Java MapReduce jobs. Each provides its own language for such commands: Pig uses a textual language called Pig Latin, and Hive uses a syntax ver…
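To give a sense of what those hand-written jobs look like, here is the mapper half of the classic word-count example in the Hadoop Java API (a sketch only; the reducer and job configuration are omitted). Pig Latin or HiveQL can express the same computation in a line or two.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits (word, 1) for every word in the input split; a reducer then sums the counts.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}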

Duine Open Source Recommender

Duine is an open source Recommender System. It is a collection of software libraries developed by Telematica Instituut/Novay that aims to predict how interesting a piece of information will be to a user. It provides collaborative filtering and content-based recommenders, as well as other features such as an Explanation API (explanations of why a given recommendation was made). Its recommendations are quantified by a number ranging from -1 to +1: the greater the value, the more interesting the item should be to the user. One of the main advantages of Duine is its well-designed architecture. When it performs a recommendation, it can incorporate the user's feedback into its predictions. It also has a switching engine, which analyses which method (content-based or collaborative) works better for the current state of the data and changes between them dynamically. a. Architecture: The following picture describes the main concept of the Duine framework. b. Installation: To install…

Open Source Recommendation Systems Survey

Here follows a survey I did back in 2010 when I was studying Recommender Systems. I hope it is useful. The growth of web content and the expansion of e-commerce have deeply increased the interest in recommender systems. This has led to the development of several open source projects in the area. Among the recommender system implementations available on the web, we can distinguish the following: Duine, Apache Mahout, OpenSlopeOne, Cofi, SUGGEST and Vogoo. All of these projects offer collaborative-filtering implementations, in different programming languages. The Duine Framework also supplies a hybrid implementation. It is Java software that combines content-based and collaborative filtering in a switching engine: it dynamically switches between the two predictors given the current state of the data. For example, if there aren't many ratings available, it uses the content-based approach, and switches to the collaborative one when the scenario changes.
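As a rough illustration of the switching idea, here is a small sketch in Java; the class and method names are hypothetical and are not Duine's actual API.

import java.util.Map;

// Hypothetical sketch of a "switching" hybrid recommender: with few ratings
// (cold start) it falls back to content-based prediction, otherwise it uses
// collaborative filtering. Scores follow Duine's convention of -1 to +1.
interface Predictor {
    double predict(long userId, long itemId); // predicted interest in [-1, +1]
}

class SwitchingRecommender implements Predictor {
    private final Predictor contentBased;
    private final Predictor collaborative;
    private final Map<Long, Integer> ratingsPerUser; // how many ratings each user has given
    private final int minRatings;                    // threshold for trusting collaborative filtering

    SwitchingRecommender(Predictor contentBased, Predictor collaborative,
                         Map<Long, Integer> ratingsPerUser, int minRatings) {
        this.contentBased = contentBased;
        this.collaborative = collaborative;
        this.ratingsPerUser = ratingsPerUser;
        this.minRatings = minRatings;
    }

    @Override
    public double predict(long userId, long itemId) {
        // Switch on the amount of data available for this user.
        int known = ratingsPerUser.getOrDefault(userId, 0);
        Predictor chosen = (known < minRatings) ? contentBased : collaborative;
        return chosen.predict(userId, itemId);
    }
}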

Introduction to Apache Hive

Hive is a distributed data warehouse that runs on top of Apache Hadoop and enables analyses on huge amounts of data. It provides its own query language, HiveQL (similar to SQL), for querying data on a Hadoop cluster. It can manage data in HDFS and run jobs in MapReduce without translating the queries into Java. The mechanism is explained below: "When MapReduce jobs are required, Hive doesn’t generate Java MapReduce programs. Instead, it uses built-in, generic Mapper and Reducer modules that are driven by an XML file representing the “job plan.” In other words, these generic modules function like mini language interpreters and the “language” to drive the computation is encoded in XML." This text was extracted from Programming Hive. Hive was initially developed by Facebook with the intention of making it easier to run MapReduce jobs on a Hadoop cluster, since writing Java programs can sometimes be challenging for non-Java developers (and for some Java develope…
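To make the idea concrete, here is a minimal sketch of running a HiveQL query from Java through Hive's JDBC driver. It assumes a Hive server listening on localhost:10000 and a table called logs; the driver class name and URL scheme vary between Hive versions, so treat this as an illustration rather than a recipe.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // Driver class and URL for the original HiveServer; newer versions use
        // org.apache.hive.jdbc.HiveDriver and a jdbc:hive2:// URL instead.
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
        Connection conn = DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", "");
        Statement stmt = conn.createStatement();
        // Hive compiles this query into a "job plan" executed by its generic Mapper/Reducer modules.
        ResultSet rs = stmt.executeQuery("SELECT level, COUNT(*) FROM logs GROUP BY level");
        while (rs.next()) {
            System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
        }
        conn.close();
    }
}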

Working and Testing with Ant

Recently I've been working with Hive and had some trouble working with Ant. For this reason, I bought the book "Ant in Action" and I'll share some things I've learned from it, along with some other experiences I've had working with Ant. Ant is a tool written in Java to build Java. By build I mean compile, run, package and even test it. It is designed to help software teams develop big projects by automating the tasks of compiling code. To run Ant in your project you should have a build.xml file that describes how the project should be built. There should be one build file per project, except if the project is too big; then you might have subprojects with "sub build" files. These sub builds are coordinated by the main build file. The goals of the build are listed as targets. You can have, for example, a target init that creates the necessary initial directories. You can also have a target compile that depends on the previous ini…

Altering a FOREIGN KEY Data Type

Today another curious issue happened regarding MySQL. I had a scenario where the number of rows in a table grew so much that the previous PRIMARY KEY type, a SMALLINT (up to 32767 different ids), could not hold the new amount of data. So I had to modify the PRIMARY KEY type. When trying to alter the field, the following error appeared:

ERROR 1025 (HY000): Error on rename of '.\temp\#sql-248_4' to '.\temp\country' (errno: 150)

Checking this error number in a shell:

$ perror 150
MySQL error code 150: Foreign key constraint is incorrectly formed

Basically this error was a consequence of trying to modify a field that was a FOREIGN KEY with the command

ALTER TABLE mytable MODIFY COLUMN id INT NOT NULL AUTO_INCREMENT;

The problem is that in MySQL there is no way of updating a field's type "on cascade". So to update a field that is a FOREIGN KEY, one should drop the FOREIGN KEY relation, change the field type, and then bring the relation b…
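For reference, here is a minimal sketch of that drop / alter / re-add sequence driven from Java via JDBC. The table, column and constraint names (child, country_id, fk_child_country) are hypothetical, and the same statements can of course be run directly in the mysql client.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class WidenForeignKeyColumn {
    public static void main(String[] args) throws Exception {
        Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/temp", "user", "password");
        Statement stmt = conn.createStatement();
        // 1. Drop the foreign key constraint on the referencing table.
        stmt.executeUpdate("ALTER TABLE child DROP FOREIGN KEY fk_child_country");
        // 2. Widen the referenced column and the referencing column so their types still match.
        stmt.executeUpdate("ALTER TABLE country MODIFY COLUMN id INT NOT NULL AUTO_INCREMENT");
        stmt.executeUpdate("ALTER TABLE child MODIFY COLUMN country_id INT NOT NULL");
        // 3. Bring the relation back by re-creating the foreign key.
        stmt.executeUpdate("ALTER TABLE child ADD CONSTRAINT fk_child_country "
                + "FOREIGN KEY (country_id) REFERENCES country (id)");
        conn.close();
    }
}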

How to Deal with Duplicate Primary Keys on MySQL

Recently, I came across a problem where I had to copy and insert some MySQL data from one database to another. Both databases had the same table structure, but different data content in each. It was vital not to lose any information when importing the dump from one table into the other. One of the most important details was that the same tuple could appear in both tables, and I could not afford to overwrite any content. So, here follows the command used:

mysqldump -u user -p --no-create-info --insert-ignore mydatabase mytable > file.dump

The --no-create-info option makes sure no DROP or CREATE statements are added, so your table is not deleted and recreated and you don't lose any information. With the --insert-ignore parameter, the data is written as INSERT IGNORE statements. This way, when a tuple with an already existing primary key is inserted, it is simply ignored. The duplicated-key tuple is discarded and the…

Recommender System Implementation

Hello, I have just posted an implementation of the Slope One recommender system in Java. I designed it to work with the MovieLens dataset (a collection of ratings on movies). I hope it helps everybody who is starting to develop their own recommender system. https://github.com/renataghisloti/SlopeOne-with-Movielens-Dataset
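For readers who want the gist before looking at the repository, here is a condensed sketch of the weighted Slope One idea (an illustration only, not the code from the linked project): it learns the average rating difference between every pair of items, then predicts a user's rating for an item as the count-weighted average of (pairwise difference + the user's existing rating) over the items the user has already rated.

import java.util.HashMap;
import java.util.Map;

// Condensed sketch of weighted Slope One. Ratings are given as userId -> (itemId -> rating).
public class SlopeOneSketch {
    // diff[i][j]  = average of (rating(i) - rating(j)) over users who rated both items
    // count[i][j] = number of users who rated both i and j
    private final Map<Long, Map<Long, Double>> diff = new HashMap<>();
    private final Map<Long, Map<Long, Integer>> count = new HashMap<>();

    public void train(Map<Long, Map<Long, Double>> ratings) {
        for (Map<Long, Double> userRatings : ratings.values()) {
            for (Map.Entry<Long, Double> a : userRatings.entrySet()) {
                for (Map.Entry<Long, Double> b : userRatings.entrySet()) {
                    long i = a.getKey(), j = b.getKey();
                    diff.computeIfAbsent(i, k -> new HashMap<>())
                        .merge(j, a.getValue() - b.getValue(), Double::sum);
                    count.computeIfAbsent(i, k -> new HashMap<>()).merge(j, 1, Integer::sum);
                }
            }
        }
        // Turn the accumulated sums of differences into averages.
        for (Map.Entry<Long, Map<Long, Double>> row : diff.entrySet()) {
            for (Map.Entry<Long, Double> cell : row.getValue().entrySet()) {
                cell.setValue(cell.getValue() / count.get(row.getKey()).get(cell.getKey()));
            }
        }
    }

    // Predicts the rating a user would give targetItem from the items they already rated.
    public double predict(Map<Long, Double> userRatings, long targetItem) {
        double weightedSum = 0;
        int totalCount = 0;
        for (Map.Entry<Long, Double> e : userRatings.entrySet()) {
            long ratedItem = e.getKey();
            if (ratedItem == targetItem || !diff.containsKey(targetItem)) continue;
            Double d = diff.get(targetItem).get(ratedItem);
            Integer c = count.get(targetItem).get(ratedItem);
            if (d == null || c == null) continue;
            weightedSum += (d + e.getValue()) * c;
            totalCount += c;
        }
        return totalCount == 0 ? 0 : weightedSum / totalCount;
    }
}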

Articles about Recommender Systems, Mahout and Hadoop Framework

Seeing that Recommender Systems have drawn a lot of attention this past year, I would like to recommend further reading to those who want to gain deeper knowledge of the subject. I will point out some articles that have helped me study the matter:

G. Adomavicius and A. Tuzhilin, "Toward the Next Generation of Recommender Systems: A Survey of the State-of-the-Art and Possible Extensions", 2005. This article by Adomavicius and Tuzhilin introduces Recommender Systems very well. It explains the three main types of these systems (Content-Based, Collaborative Filtering and Hybrid Recommendation). It also gives a formal mathematical definition of a Recommender System, which for some people can be great. I greatly recommend any other article you may find by Adomavicius.

Laurent Candillier, Frank Meyer, Kris Jack, Françoise Fessant, "State-of-the-Art Recommender Systems". This paper also provides a great overview of Recommender Systems and a very interesting comparison between…