tag:blogger.com,1999:blog-64617233969865711032024-03-09T18:46:51.019-08:00Just a girl in TechRenata Ghisloti Duarte de Souza Granhahttp://www.blogger.com/profile/18336442605720194782noreply@blogger.comBlogger45125tag:blogger.com,1999:blog-6461723396986571103.post-17256465598711627092023-07-07T17:29:00.002-07:002023-07-07T18:16:00.923-07:00An Introduction to Statistical Learning - in Python<p> The day we have all been waiting for is here! </p><p>The authors of <a href="https://www.statlearning.com/" target="_blank">An Introduction to Statistical Learning with Applications in R </a> have finally launched the <a href="https://hastie.su.domains/ISLP/ISLP_website.pdf" target="_blank">Python</a> version of the book. </p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEidZnOs_C-USNoOJGVqqlPfSyvy3ldQHph_CmEP6Hh_Hxvwh73UsFEWLJZtLPxhhkxydcGWf-thkGesxHOQ7dR45XGiy3N7hznbTx_HiFsQDs2EBZsgju6dKiRMIZBEdbLN0S3WRfo2D0B92jlC4ZMH1z6v7amt0Sapi4kh2cMchcxOomxvpr7rpohMQF4/s784/Screenshot%202023-07-07%20at%209.06.32%20AM.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="784" data-original-width="480" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEidZnOs_C-USNoOJGVqqlPfSyvy3ldQHph_CmEP6Hh_Hxvwh73UsFEWLJZtLPxhhkxydcGWf-thkGesxHOQ7dR45XGiy3N7hznbTx_HiFsQDs2EBZsgju6dKiRMIZBEdbLN0S3WRfo2D0B92jlC4ZMH1z6v7amt0Sapi4kh2cMchcxOomxvpr7rpohMQF4/s16000/Screenshot%202023-07-07%20at%209.06.32%20AM.png" /></a></div><div class="separator" style="clear: both; text-align: center;"><br /></div><p></p><p>I am a big fan of this material, as well as the FREE online course made available by the authors (you can find it <a href="https://learning.edx.org/course/course-v1:StanfordOnline+STATSX0001+1T2022">here</a>). Honestly, there is no better introduction to Machine Learning with such a solid footing in statistics as this one. 
</p><p>The book's exercises and examples were originally in R, and now the authors have released a Python version of it!!!</p><p>Chapter 10, on Deep Learning, was slightly changed to use PyTorch instead of TensorFlow (which the previous R version used). </p><p>When I was studying with this book, I implemented a TensorFlow Python version of the labs and exercises they made available in R. If you are curious and want to <b>check the TensorFlow Python version of the Deep Learning chapter</b>, you can find it on my <a href="https://github.com/renataghisloti/ILSR-Python-DeepLearningChapter/tree/main" target="_blank">GitHub</a>.</p>Renata Ghisloti Duarte de Souza Granhahttp://www.blogger.com/profile/18336442605720194782noreply@blogger.com0tag:blogger.com,1999:blog-6461723396986571103.post-46192840979533290082022-05-22T06:52:00.001-07:002022-05-22T06:53:16.739-07:00Linux/POSIX commands that every Data Scientist should know<span style="color: #666666; font-size: x-small;">Sometimes, we face the challenge of working on legacy projects or systems that have very little documentation, if any.<br />I see a lot of data scientists struggling to find their way around these projects, so I decided to write down a few very useful and basic Linux/POSIX-compliant commands that every data scientist/engineer/programmer should know (imho). </span><div><br /><div><br />
First, remember that you can always type<br />
<br />
<span style="background-color: #f3f3f3;"><span style="font-family: courier;">$<b> </b>man <i>command</i> </span> </span><br />
<br />
to get <b>more information </b>on the <i>command</i>. This should tell you what the command is and how you can use it. For example, the following should give you the manual of the awk command.</div><div><br /></div><div><span style="font-family: courier;"><span style="background-color: #f3f3f3;">$</span><b> </b><span style="background-color: #f3f3f3;">man awk</span></span><span style="background-color: #f3f3f3;"><span style="font-family: courier;"> </span> </span><br /><br /></div><div>Let's say you have a <b>File/Library not found error.</b> One thing you can try is the <i>locate</i> command.<br />
<span style="background-color: #f3f3f3;"><br /></span></div><div><span style="background-color: #f3f3f3;"><span style="font-family: courier;">$ locate </span><i><span style="font-family: courier;">pattern </span> </i></span><br /><br /></div><div> <i>Locate</i> will return every file or directory path that matches the pattern passed. With this, you can check if a file is on your computer, and where it is.</div><div> <b>whereis</b> <i>file</i> is also a good tool to find programs, but with <b>whereis</b> you have to specify the exact name of the program you want to find. For example:</div><div><br /></div><div><span style="background-color: #eeeeee;"><span style="font-family: courier;">$ whereis </span><span style="font-family: courier;">python<i> </i></span></span></div><div><br />will show you where the program <i>python</i> (the one in your PATH, what you execute when you type "python" in the command line on your terminal) is located.</div><div><br /></div><div>
Let's say you realize that you do have the file you were looking for, but you still get an error. In this case, you <b>might not have the right permission to access it</b>. You can change its permissions with:<br />
<br />
<span style="background-color: #f3f3f3;"><span style="font-family: courier;">$ chmod 755 <i>file</i> </span> </span><br />
<br />or</div><div><br /></div><div><span style="background-color: #f3f3f3; font-family: courier;">$ chmod u+x </span><i style="background-color: #f3f3f3; font-family: courier;">file</i><span style="background-color: #f3f3f3; font-family: courier;"> </span></div><div><br /></div><div>
Let's say the program you want is not installed at all on your system. If you are on an <b>Ubuntu</b> environment, you can first search for it with:<br />
<br />
<span style="background-color: #f3f3f3;"><span style="font-family: courier;">$ apt-cache search </span><i><span style="font-family: courier;">pattern </span> </i></span><br /><br /></div><div>With this you will get a list of packages matching <i>pattern</i>. Find the program you want in the list, then install it with:<br /><br /><span style="background-color: #f3f3f3;"><span style="font-family: courier;">$ sudo apt-get install </span><i><span style="font-family: courier;">program </span> </i></span></div><div> </div><div>On <b>macOS</b>, we usually use <a href="https://brew.sh/" target="_blank">brew</a>:</div><div><br /></div><div><span style="background-color: #f3f3f3;"><span style="font-family: courier;">$ brew install </span><i><span style="font-family: courier;">package </span></i></span><br /><br />If what you need is a <b>Python</b> package, you can run:</div><div><br /></div><div><span style="background-color: #f3f3f3;"><span style="font-family: courier;">$ pip install <i>package</i></span><i><span style="font-family: courier;"> </span></i></span></div><div> </div><div>BTW if you ever want to check the list of Python packages installed on your computer, you can run:</div><div><br /></div><div><span style="background-color: #f3f3f3;"><span style="font-family: courier;">$ pip freeze</span><i><span style="font-family: courier;"> </span></i></span></div><div><br /></div><div>Let's say you are compiling a program and getting "Error 1" as output, but you have no idea what error 1 is, or where in the code it could be. You can type:<br />
<br />
<span style="background-color: #f3f3f3;"><span style="font-family: courier;">$ grep -r "Error 1" . </span> </span><br />
<br />
This will search recursively for the string, starting from your current directory, and print every matching line together with the file that contains it.<br />
If there are too many results, you can type instead<br />
<br />
<span style="background-color: #f3f3f3;"><span style="font-family: courier;">$ grep -r "Error 1" . | less </span> </span><br />
<br />
This will give you the ability to scroll the screen up and down and see results better.<br />
<br />
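If you would rather do the same search from Python (for example inside a notebook), here is a minimal sketch of what <i>grep -r</i> does. The function name <i>grep_recursive</i> is made up for this example, and it is only an approximation: it does plain substring matching, not regular expressions.<br />

```python
import os


def grep_recursive(needle, root="."):
    """Yield (path, line_number, line) for each line under root that contains needle."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # errors="ignore" lets us read binary-ish files without crashing
                with open(path, "r", errors="ignore") as handle:
                    for number, line in enumerate(handle, start=1):
                        if needle in line:
                            yield path, number, line.rstrip("\n")
            except OSError:
                # Skip files we cannot open (permissions, broken links, ...)
                continue
```

Calling <i>list(grep_recursive("Error 1", "."))</i> gives you the matches as a list you can filter or sort, which plays the role of piping to <i>less</i>.<br />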
Ok, so you ran your program, but it is still not working properly. Let's say some application is getting stuck. If the program you want to kill is running in your terminal, you can stop its execution by pressing <b>CTRL + C</b>. If not, or if it is running in the background, you can look for its process id with<br />
<br />
<span style="background-color: #f3f3f3;"><span style="font-family: courier;">$ ps -e </span> </span><br />
<br />
Look for your application <i>pid</i> (process id, the number beside your program's name) and type<br />
<br />
<span style="background-color: #f3f3f3;"><span style="font-family: courier;">$ kill </span><i><span style="font-family: courier;">number </span> </i></span><br />
<br />
Another great resource is the <i>find</i> command. You can find files by name or size! For example:<br /><br /><span style="background-color: #eeeeee; font-family: courier;">$ find . -name "*.jar" </span></div><div><br /></div><div>will find all files with the .jar extension in any directory under your current directory. You can also use it to find large files, like:</div><div><br /></div><div><span style="background-color: #eeeeee; font-family: courier;">$ find / -size +100M </span></div><div> </div><div>The above command finds all files larger than 100 MB on your computer!</div><div><br /></div><div>Last, my favorite of all time: <i>nohup</i>. Nohup is a great tool to let a script or program keep running on a remote system even if you get disconnected from it! So let's say you have sshed into whatever system you need, and need to execute a program that takes hours to finish. With nohup, you can exit the system and the program continues to run!</div><div><br /></div><div><span style="background-color: #eeeeee; font-family: courier;">$ nohup python potato.py & </span><br />
<br />will leave potato.py running while you go and finish your business elsewhere.</div><div><br /></div><div>Of course you can still be an absolutely amazing data scientist without knowing any of these, but they can definitely be life savers, and it might be worth taking the time to learn them! </div><div><br /></div><div>:D<br />
<br /></div></div>Renata Ghisloti Duarte de Souza Granhahttp://www.blogger.com/profile/18336442605720194782noreply@blogger.com0tag:blogger.com,1999:blog-6461723396986571103.post-41954361401146900642022-05-21T11:00:00.010-07:002022-05-22T06:12:56.142-07:00Top 5 Best Data Science and Machine Learning Courses<div><span style="background-color: white; color: #666666; font-family: inherit;">New Data Science enthusiasts usually wonder what the best resources are to master this area. I am a huge fan of online courses (especially if they are free 😆) and decided to share my top 5 favorite ones. All courses below should have their main content available for free, so you can learn Machine Learning without investing too much!</span></div><div><span style="background-color: white; color: #666666; font-family: inherit;"><br /></span></div><div><span style="background-color: white; color: #666666; font-family: inherit;"><br /></span></div><h3><ul style="text-align: left;"><li><span style="background-color: white; font-family: inherit; font-size: large;"><a href="https://www.edx.org/course/statistical-learning" target="_blank">Statistical Learning</a> </span></li></ul></h3><div><span style="background-color: white; color: #444444; font-family: inherit; font-size: medium;"><span> </span><span> </span>This course from Stanford University, taught by <span>Trevor Hastie and </span><span>Robert Tibshirani, is an </span><span>absolutely</span><span><b> amazing introduction to Machine Learning</b>. </span><span>You might have heard of Prof. Tibshirani as the person responsible for developing the Lasso method. </span><span>The classes are a great mix of practical intuition and theoretical concepts. Besides, the professors are funny and adorable (if you don't mind me saying). 
</span></span></div><div><span style="background-color: white; font-family: inherit; font-size: medium;"><span style="color: #222222;"><br /></span></span></div><div><ul style="text-align: left;"><li><a href="https://www.coursera.org/learn/python-machine-learning?specialization=data-science-python" target="_blank"><span style="background-color: white; font-family: inherit; font-size: large;"><b>Applied Machine Learning in Python</b></span></a></li></ul></div><div><span style="background-color: white; color: #444444; font-family: inherit; font-size: medium;"><span> </span><span> </span>Here we have a much more <b>practical introduction to Machine Learning</b> and Data Science, with amazing examples in Python and details about arguments to be used in specific packages. This course from the University of Michigan by <div class="_1qfi0x77" style="-webkit-box-align: center; -webkit-font-smoothing: antialiased; align-items: center; box-sizing: inherit; display: inline;"><div class="_1qfi0x77" style="-webkit-box-align: center; -webkit-font-smoothing: antialiased; align-items: center; box-sizing: inherit; display: inline;"><span style="display: inline;"><span style="-webkit-font-smoothing: antialiased; box-sizing: inherit;">Kevyn Collins-Thompson</span> </span></div></div>is a great resource if you want to have a good idea of what a Machine Learning project could be in many industry scenarios. Also recommended for people with less of a mathematical background. 
The Professor does a great job covering complex methods in a very simple way!</span></div><div><span style="background-color: white; font-family: inherit; font-size: medium;"><br /></span></div><div><div><ul style="text-align: left;"><li><a href="https://work.caltech.edu/telecourse" target="_blank"><span style="background-color: white; font-family: inherit; font-size: large;"><b>Learning from Data</b></span></a></li></ul></div><div><span style="background-color: white; color: #444444; font-family: inherit; font-size: medium;"><span> </span><span> </span>The great Professor Yaser Abu-Mostafa from Caltech shares <b>great ideas about Machine Learning, Statistics and Probability</b> in this course. To me, it feels like seeing Machine Learning from a different angle. Even experienced professionals can learn a lot from this course!</span></div></div><div><span style="font-size: medium;"><br /></span></div><div><ul style="text-align: left;"><li><a href="https://openlearninglibrary.mit.edu/courses/course-v1:MITx+6.036+1T2019/about" target="_blank"><span style="background-color: white; font-family: inherit; font-size: large;"><b>Introduction to Machine Learning</b></span></a></li></ul></div><div><span style="color: #444444; font-size: medium;"><span style="background-color: white; font-family: inherit;"><span> <span> </span></span>Another valuable introduction to Machine Learning by MIT. This course covers the most important topics, with g</span><span style="background-color: white; font-family: inherit;">reat content! 
I especially loved the <b>exercises</b> available on the platform.</span></span></div><div><span style="font-size: medium;"><span style="background-color: white; font-family: inherit;"><br /></span></span></div><div><ul style="text-align: left;"><li><a href="https://www.coursera.org/learn/machine-learning" target="_blank"><span style="background-color: white; font-family: inherit; font-size: large;"><b>Machine Learning</b></span></a></li></ul></div><div><span style="color: #444444; font-size: medium;"><span style="background-color: white; font-family: inherit;"><span> </span><span> </span>This Stanford University course by </span><span style="background-color: white;"><span style="font-family: inherit;">Andrew Ng was the main go-to Machine Learning course for a long time. The Professor explores the mechanics of each Machine Learning method, discussing not only how they are used, but actually <b>how they work </b></span><b>under</b><span style="font-family: inherit;"><b> the hood</b>. </span></span></span></div><div><br /></div><div><br /></div>Renata Ghisloti Duarte de Souza Granhahttp://www.blogger.com/profile/18336442605720194782noreply@blogger.com0tag:blogger.com,1999:blog-6461723396986571103.post-41827412243583229262018-05-24T19:20:00.003-07:002022-05-22T07:13:29.409-07:00Apache Hadoop Admin Tricks and Tips<span style="font-family: inherit;">In this post I will share some tips I learned after using the Apache Hadoop environment for some years, and doing many, many workshops and courses. The information here assumes Apache Hadoop around version 2.9, but it probably applies to other similar versions.</span><br />
<span style="font-family: inherit;"><br /></span>
<span style="font-family: inherit;">These are things to consider when building or using a Hadoop cluster. <i>Some</i> are specific to the Cloudera distribution. Anyway, hope it helps! </span><br />
<br />
<ul>
<li><span style="font-family: inherit;">Don't use Hadoop for millions of small files. It overloads the namenode and makes it slower. It is not difficult to overload the namenode. Always check the namenode's capacity against the number of files. Files on Hadoop should usually be larger than 100 MB.</span></li>
</ul>
<ul>
<li><span style="font-family: inherit;">You need about 1 GB of memory in the namenode for every 1 million files or so.</span></li>
</ul>
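<span style="font-family: inherit;">To see why small files hurt, here is a rough back-of-the-envelope sketch of the rule of thumb above. The function name and the 10 TB dataset are made up for illustration, and the real memory use depends on blocks per file and on the Hadoop version, so treat the result as an order of magnitude only.</span><br />

```python
def namenode_memory_gb(num_files, gb_per_million_files=1.0):
    """Rough namenode memory estimate from the '1 GB per million files' rule of thumb."""
    if num_files < 0:
        raise ValueError("num_files must be non-negative")
    return num_files / 1_000_000 * gb_per_million_files


# Hypothetical example: the same 10 TB stored as 100 MB files versus 1 MB files.
TOTAL_MB = 10 * 1000 * 1000
files_at_100_mb = TOTAL_MB // 100  # 100,000 files
files_at_1_mb = TOTAL_MB // 1      # 10,000,000 files

print(namenode_memory_gb(files_at_100_mb))
print(namenode_memory_gb(files_at_1_mb))
```

<span style="font-family: inherit;">The data volume is identical in both cases, but the estimate grows a hundredfold when the files are a hundred times smaller.</span><br />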
<ul>
<li><span style="font-family: inherit;">Nodes usually fail after 5 years. Node failures are one of the most frequent problems in H<span class="il" style="background-color: white; color: #222222;">adoop</span><span style="background-color: white;"><span style="color: #222222;">. Big companies like Facebook and Google probably see node failures every minute.</span></span></span></li>
</ul>
<ul>
<li><span style="font-family: inherit;">The MySQL on Cloudera Manager does not have redundancy. This could be a point of failure.</span></li>
</ul>
<ul>
<li><span style="font-family: inherit;">Information: the merging of the fsimage file with the edit log (checkpointing) happens on the secondary namenode.</span></li>
</ul>
<ul>
<li><span style="font-family: inherit;">Hadoop can cache blocks to improve performance. By default the cache size is 0, so nothing is cached. </span></li>
</ul>
<ul>
<li><span style="font-family: inherit;"><span style="color: #222222;">You can set a parameter that sends an acknowledgment message from datanodes back to the namenode after only the first </span><span style="background-color: white;"><span style="color: #222222;">or second data block has been copied to the datanodes. That might make writing data faster. </span></span></span></li>
</ul>
<ul>
<li><span style="font-family: inherit;">Hadoop has rack awareness: it knows which node is connected to which switch. Actually, it is the Hadoop admin who configures that.</span></li>
</ul>
<ul>
<li><span style="font-family: inherit;">Files are checked from time to time to verify whether any data has been corrupted (usually every three weeks). This is possible because datanodes store checksums for the files.</span></li>
</ul>
<ul>
<li><span style="font-family: inherit;">Log files are kept for 7 days by default.</span></li>
</ul>
<ul>
<li><span style="font-family: inherit;">part-m-00000 files come from mapper tasks and part-r-00000 files come from reducer tasks. The number at the end is the index of the task that wrote the file, starting from 0. So if you see part-r-00008, the job had at least 9 reducers.</span></li>
</ul>
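<span style="font-family: inherit;">As a small illustration of that naming scheme, here is a sketch that reads a list of output file names and works out how many reducers must have run. The helper names are made up for this example, and the pattern assumes names of the form part-m-NNNNN or part-r-NNNNN.</span><br />

```python
import re

# Assumed naming pattern: "part-", then m (mapper) or r (reducer), then the task index.
PART_FILE = re.compile(r"^part-(?P<kind>[mr])-(?P<index>\d+)$")


def describe_part_file(name):
    """Return ("mapper" or "reducer", task_index), or None if name is not a part file."""
    match = PART_FILE.match(name)
    if match is None:
        return None
    kind = "mapper" if match.group("kind") == "m" else "reducer"
    return kind, int(match.group("index"))


def min_reducers(names):
    """Smallest number of reducers consistent with the part-r files seen."""
    described = (describe_part_file(name) for name in names)
    indexes = [info[1] for info in described if info and info[0] == "reducer"]
    # Indexes start at 0, so the highest index plus one is the reducer count.
    return max(indexes) + 1 if indexes else 0
```

<span style="font-family: inherit;">For example, a directory listing containing _SUCCESS, part-r-00000 and part-r-00008 gives min_reducers(...) equal to 9, while files such as _SUCCESS are simply ignored.</span><br />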
<ul>
<li><span style="font-family: inherit;">You can change the log.level of mapper and reducer tasks to get more information.</span></li>
</ul>
<ul>
<li><span style="font-family: inherit;">mapreduce.reduce.log.level=<wbr style="background-color: white; color: #222222;"></wbr><span style="background-color: white; color: #222222;">DEBUG</span></span></li>
</ul>
<ul>
<li><span style="background-color: white; color: #222222;"><span style="font-family: inherit;">The YARN web UI lets you check what Spark did. The Spark UI at localhost:4040 also shows what has been done.</span></span></li>
</ul>
<ul>
<li><span style="font-family: inherit;">It is important to check where to put the namenode fsimage file. You might want to replicate this file.</span></li>
</ul>
<ul>
<li><span style="font-family: inherit;">Y<span style="background-color: white; color: #222222;">ou have to reserve a lot of disk space </span><span style="background-color: white; color: #222222;">(25%)</span><span style="background-color: white; color: #222222;"> with dfs.datanode.du.reserved, for the shuffle phase.</span></span></li>
</ul>
<ul>
<li><span style="font-family: inherit;">The shuffle phase writes its data to disk, so there needs to be space!</span></li>
</ul>
<ul>
<li><span style="font-family: inherit;">When you remove files, they stay in the .Trash directory for a while. The default time is 1 day.</span></li>
</ul>
<ul>
<li><span style="font-family: inherit;">You can build a <i>lambda</i> architecture with Flume (consume data in one way and save it on disk for example).</span></li>
</ul>
<ul>
<li><span style="font-family: inherit;"><span style="color: #222222;">Regarding hardware, </span><span style="background-color: white; color: #222222;">worker nodes need more cores for more processing. The master nodes don't process that much.</span></span></li>
</ul>
<ul>
<li><span style="font-family: inherit;"><span style="color: #222222;">For the</span><span style="background-color: white; color: #222222;"> namenode you want higher-quality disks and better hardware (like RAID - and RAID makes no sense on worker nodes).</span></span></li>
</ul>
<ul>
<li><span style="font-family: inherit;">The rule of thumb is: if you want to store 1 TB of data, you need 4 TB of space.</span></li>
</ul>
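<span style="font-family: inherit;">Here is a small sizing sketch for that rule of thumb. The breakdown into 3 replicas plus 25% of each disk kept free for temporary data is my assumption about where the factor of 4 comes from, and the function names are made up for this example.</span><br />

```python
def raw_capacity_tb(data_tb, raw_per_data_tb=4.0):
    """Raw disk space suggested by the '1 TB of data needs 4 TB of space' rule."""
    if data_tb < 0:
        raise ValueError("data_tb must be non-negative")
    return data_tb * raw_per_data_tb


def raw_capacity_breakdown_tb(data_tb, replication=3, reserved_fraction=0.25):
    """One possible derivation (an assumption): replicas plus reserved scratch space.

    Returns (space taken by replicas, total raw space needed).
    """
    replicated = data_tb * replication
    # If reserved_fraction of every disk is kept free, only the rest holds replicas.
    total = replicated / (1 - reserved_fraction)
    return replicated, total


print(raw_capacity_tb(50))
print(raw_capacity_breakdown_tb(50))
```

<span style="font-family: inherit;">With these assumed defaults both functions agree: 1 TB becomes 3 TB of replicas, and 3 TB is 75% of 4 TB.</span><br />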
<ul>
<li><span style="font-family: inherit;">H<span class="il" style="background-color: white; color: #222222;">adoop</span><span style="background-color: white;"><span style="color: #222222;"> applications are typically not CPU-bound. </span></span></span></li>
</ul>
<ul>
<li><span style="font-family: inherit;">Virtualization might give you some benefits (easier to manage), but it<span style="background-color: white; color: #222222;"> hurts performance. </span><span style="background-color: white; color: #222222;">It usually adds between 5% and 30% overhead.</span></span></li>
</ul>
<ul>
<li><span style="color: #222222; font-family: inherit;"><span style="background-color: white;">H</span></span><span class="il" style="background-color: white; color: #222222; font-family: inherit;">adoop</span><span style="background-color: white; color: #222222; font-family: inherit;"> does not support IPv6. You can disable IPv6.</span><span style="background-color: white;"><span style="color: #222222; font-family: inherit;"> You can also </span><span style="color: #222222;">disable</span><span style="color: #222222; font-family: inherit;"> SELinux inside the cluster. Both add overhead.</span></span></li>
</ul>
<ul>
<li><span style="font-family: inherit;">A good size for a starting cluster is around 6 nodes.</span></li>
</ul>
<ul>
<li><span style="font-family: inherit;">Sometimes, when the cluster is too full, you might have to remove a small file first before you can remove a bigger one.</span></li>
</ul>
<br />
<br />
<div>
<span style="background-color: white; color: #222222;"><span style="font-family: inherit;"><br /></span></span>
<span style="background-color: white; color: #222222;"><span style="font-family: inherit;">That is it for now. I will try to write a part 2 soon. Let me know if there is anything I missed here!</span></span></div>
Renata Ghisloti Duarte de Souza Granhahttp://www.blogger.com/profile/18336442605720194782noreply@blogger.com15tag:blogger.com,1999:blog-6461723396986571103.post-55912356971966018972017-11-10T19:23:00.001-08:002018-05-24T19:27:33.088-07:00BigData White Papers<span style="font-family: inherit;">I don't know about you, but I always like to read the white papers that originate OpenSource projects (when available of course :) ).</span><br />
<span style="font-family: inherit;"><br /></span>
<span style="font-family: inherit;">I have been working with BigData quite a lot lately and this area is mostly dominated by Apache OpenSource projects.</span><br />
<span style="font-family: inherit;"><br /></span>
<span style="font-family: inherit;"> So, naturally (given the nerd that I am) I tried to investigate their history. I created a list of articles and companies that originated most BigData Apache projects.</span><br />
<span style="font-family: inherit;"><br /></span>
<span style="font-family: inherit;">Here it is! Hope you guys find it interesting too. :)</span><br />
<span style="font-family: inherit;"><br /></span>
<span style="font-family: inherit;"><br /></span>
<br />
<div style="font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-ligatures: normal; font-variant-numeric: normal; font-variant-position: normal; line-height: normal;">
<b><span style="font-family: inherit;">Apache Hadoop </span></b></div>
<div style="font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-ligatures: normal; font-variant-numeric: normal; font-variant-position: normal; line-height: normal;">
<span style="font-family: inherit;"><br /></span></div>
<div style="font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-ligatures: normal; font-variant-numeric: normal; font-variant-position: normal; line-height: normal;">
<span style="font-family: inherit;">Based on: Google MapReduce and GFS </span></div>
<div style="font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-ligatures: normal; font-variant-numeric: normal; font-variant-position: normal; line-height: normal;">
<span style="font-family: inherit;">Papers:</span></div>
<div style="font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-ligatures: normal; font-variant-numeric: normal; font-variant-position: normal; line-height: normal;">
<a href="https://static.googleusercontent.com/media/research.google.com/en//archive/mapreduce-osdi04.pdf"><span style="font-family: inherit;">https://static.googleusercontent.com/media/research.google.com/en//archive/mapreduce-osdi04.pdf</span></a></div>
<div style="font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-ligatures: normal; font-variant-numeric: normal; font-variant-position: normal; line-height: normal;">
<a href="https://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf"><span style="font-family: inherit;">https://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf</span></a></div>
<div style="font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-ligatures: normal; font-variant-numeric: normal; font-variant-position: normal; line-height: normal;">
<span style="font-family: inherit;"><br /></span></div>
<div style="font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-ligatures: normal; font-variant-numeric: normal; font-variant-position: normal; line-height: normal;">
<span style="font-family: inherit;"><br /></span></div>
<div style="font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-ligatures: normal; font-variant-numeric: normal; font-variant-position: normal; line-height: normal;">
<div style="font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-ligatures: normal; font-variant-numeric: normal; font-variant-position: normal; line-height: normal;">
<span style="font-family: inherit;"><b>Apache Spark</b> </span></div>
<div style="font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-ligatures: normal; font-variant-numeric: normal; font-variant-position: normal; line-height: normal;">
<span style="font-family: inherit;"><br /></span></div>
<div style="font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-ligatures: normal; font-variant-numeric: normal; font-variant-position: normal; line-height: normal;">
<span style="font-family: inherit;">Created by: University of California, Berkeley </span></div>
<div style="font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-ligatures: normal; font-variant-numeric: normal; font-variant-position: normal; line-height: normal;">
<span style="font-family: inherit;">Papers: </span></div>
<div style="font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-ligatures: normal; font-variant-numeric: normal; font-variant-position: normal; line-height: normal;">
<span style="font-family: inherit;"><br /></span></div>
<div style="font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-ligatures: normal; font-variant-numeric: normal; font-variant-position: normal; line-height: normal;">
<a href="http://people.csail.mit.edu/matei/papers/2012/nsdi_spark.pdf"><span style="font-family: inherit;">http://people.csail.mit.edu/matei/papers/2012/nsdi_spark.pdf</span></a></div>
<div style="font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-ligatures: normal; font-variant-numeric: normal; font-variant-position: normal; line-height: normal;">
<a href="http://people.csail.mit.edu/matei/papers/2010/hotcloud_spark.pdf"><span style="font-family: inherit;">http://people.csail.mit.edu/matei/papers/2010/hotcloud_spark.pdf</span></a></div>
<div style="font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-ligatures: normal; font-variant-numeric: normal; font-variant-position: normal; line-height: normal;">
<a href="http://people.csail.mit.edu/matei/papers/2012/hotcloud_spark_streaming.pdf"><span style="font-family: inherit;">http://people.csail.mit.edu/matei/papers/2012/hotcloud_spark_streaming.pdf</span></a></div>
<div style="font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-ligatures: normal; font-variant-numeric: normal; font-variant-position: normal; line-height: normal;">
<a href="http://people.csail.mit.edu/matei/papers/2015/sigmod_spark_sql.pdf"><span style="font-family: inherit;">http://people.csail.mit.edu/matei/papers/2015/sigmod_spark_sql.pdf</span></a></div>
<div style="font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-ligatures: normal; font-variant-numeric: normal; font-variant-position: normal; line-height: normal;">
<a href="http://www.jmlr.org/papers/volume17/15-237/15-237.pdf"><span style="font-family: inherit;">http://www.jmlr.org/papers/volume17/15-237/15-237.pdf</span></a></div>
<div style="font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-ligatures: normal; font-variant-numeric: normal; font-variant-position: normal; line-height: normal;">
<span style="font-family: inherit;"><br /></span></div>
<div style="font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-ligatures: normal; font-variant-numeric: normal; font-variant-position: normal; line-height: normal;">
<span style="font-family: inherit;"><br /></span></div>
<div style="font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-ligatures: normal; font-variant-numeric: normal; font-variant-position: normal; line-height: normal;">
<div style="font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-ligatures: normal; font-variant-numeric: normal; font-variant-position: normal; line-height: normal;">
<b><span style="font-family: inherit;">Apache Hive </span></b></div>
<div style="font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-ligatures: normal; font-variant-numeric: normal; font-variant-position: normal; line-height: normal;">
<b><span style="font-family: inherit;"><br /></span></b></div>
<div style="font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-ligatures: normal; font-variant-numeric: normal; font-variant-position: normal; line-height: normal;">
<span style="font-family: inherit;">Created by: Facebook</span></div>
<div style="font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-ligatures: normal; font-variant-numeric: normal; font-variant-position: normal; line-height: normal;">
<span style="font-family: inherit;">Papers: </span></div>
<div style="font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-ligatures: normal; font-variant-numeric: normal; font-variant-position: normal; line-height: normal;">
<a href="http://infolab.stanford.edu/~ragho/hive-icde2010.pdf"><span style="font-family: inherit;">http://infolab.stanford.edu/~ragho/hive-icde2010.pdf</span></a></div>
</div>
<div style="font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-ligatures: normal; font-variant-numeric: normal; font-variant-position: normal; line-height: normal;">
<span style="font-family: inherit;"><br /></span></div>
<div style="font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-ligatures: normal; font-variant-numeric: normal; font-variant-position: normal; line-height: normal;">
<span style="font-family: inherit;"><br /></span></div>
<div style="font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-ligatures: normal; font-variant-numeric: normal; font-variant-position: normal; line-height: normal;">
<div style="font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-ligatures: normal; font-variant-numeric: normal; font-variant-position: normal; line-height: normal;">
<b><span style="font-family: inherit;">Apache Kafka </span></b></div>
<div style="font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-ligatures: normal; font-variant-numeric: normal; font-variant-position: normal; line-height: normal;">
<b><span style="font-family: inherit;"><br /></span></b></div>
<div style="font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-ligatures: normal; font-variant-numeric: normal; font-variant-position: normal; line-height: normal;">
<span style="font-family: inherit;">Created by: LinkedIn</span></div>
<div style="font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-ligatures: normal; font-variant-numeric: normal; font-variant-position: normal; line-height: normal;">
<span style="font-family: inherit;">Papers:</span></div>
<div style="font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-ligatures: normal; font-variant-numeric: normal; font-variant-position: normal; line-height: normal;">
<span style="font-family: inherit;"><a href="http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=DAD03005F46187B58030A748A87A13FE?doi=10.1.1.233.1726&rep=rep1&type=pdf">http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=DAD03005F46187B58030A748A87A13FE?doi=10.1.1.233.1726&rep=rep1&type=pdf</a></span></div>
</div>
<div style="font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-ligatures: normal; font-variant-numeric: normal; font-variant-position: normal; line-height: normal;">
<span style="font-family: inherit;"><br /></span></div>
<div style="font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-ligatures: normal; font-variant-numeric: normal; font-variant-position: normal; line-height: normal;">
<span style="font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-ligatures: normal; font-variant-numeric: normal; font-variant-position: normal; line-height: normal;"><b><span style="font-family: inherit;"><br /></span></b></span></div>
<div style="font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-ligatures: normal; font-variant-numeric: normal; font-variant-position: normal; line-height: normal;">
<span style="font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-ligatures: normal; font-variant-numeric: normal; font-variant-position: normal; line-height: normal;"><b><span style="font-family: inherit;">Apache Impala </span></b></span></div>
<div style="font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-ligatures: normal; font-variant-numeric: normal; font-variant-position: normal; line-height: normal;">
<span style="font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-ligatures: normal; font-variant-numeric: normal; font-variant-position: normal; line-height: normal;"><b><span style="font-family: inherit;"><br /></span></b></span></div>
<div style="font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-ligatures: normal; font-variant-numeric: normal; font-variant-position: normal; line-height: normal;">
<span style="font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-ligatures: normal; font-variant-numeric: normal; font-variant-position: normal; line-height: normal;"><span style="font-family: inherit;">Based on: Google F1</span></span></div>
<div style="font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-ligatures: normal; font-variant-numeric: normal; font-variant-position: normal; line-height: normal;">
<span style="font-family: inherit;">Papers:</span><br />
<div style="font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-ligatures: normal; font-variant-numeric: normal; font-variant-position: normal; line-height: normal;">
<a href="http://static.googleusercontent.com/media/research.google.com/en//archive/spanner-osdi2012.pdf"><span style="font-family: inherit;">http://static.googleusercontent.com/media/research.google.com/en//archive/spanner-osdi2012.pdf</span></a></div>
<div style="font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-ligatures: normal; font-variant-numeric: normal; font-variant-position: normal; line-height: normal;">
<span style="font-family: inherit;"><br /></span></div>
<div style="font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-ligatures: normal; font-variant-numeric: normal; font-variant-position: normal; line-height: normal;">
<span style="font-family: inherit;"><b><br /></b></span></div>
<div style="font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-ligatures: normal; font-variant-numeric: normal; font-variant-position: normal; line-height: normal;">
<span style="font-family: inherit;"><b>Apache HBase</b></span></div>
<div style="font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-ligatures: normal; font-variant-numeric: normal; font-variant-position: normal; line-height: normal;">
<span style="font-family: inherit;"><br /></span></div>
<div style="font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-ligatures: normal; font-variant-numeric: normal; font-variant-position: normal; line-height: normal;">
<span style="font-family: inherit;">Based on: Google BigTable</span></div>
<div style="font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-ligatures: normal; font-variant-numeric: normal; font-variant-position: normal; line-height: normal;">
<span style="font-family: inherit;">Papers:</span></div>
<div style="font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-ligatures: normal; font-variant-numeric: normal; font-variant-position: normal; line-height: normal;">
<a href="https://static.googleusercontent.com/media/research.google.com/en//archive/bigtable-osdi06.pdf"><span style="font-family: inherit;">https://static.googleusercontent.com/media/research.google.com/en//archive/bigtable-osdi06.pdf</span></a></div>
<div style="font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-ligatures: normal; font-variant-numeric: normal; font-variant-position: normal; line-height: normal;">
<span style="font-family: inherit;"><br /></span></div>
<div style="font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-ligatures: normal; font-variant-numeric: normal; font-variant-position: normal; line-height: normal;">
<span style="font-family: inherit;"><br /></span></div>
<div style="font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-ligatures: normal; font-variant-numeric: normal; font-variant-position: normal; line-height: normal;">
<span style="font-family: inherit;"><b>Apache Drill </b></span></div>
<div style="font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-ligatures: normal; font-variant-numeric: normal; font-variant-position: normal; line-height: normal;">
<span style="font-family: inherit;"><br /></span></div>
<div style="font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-ligatures: normal; font-variant-numeric: normal; font-variant-position: normal; line-height: normal;">
<span style="font-family: inherit;">Based on: Google Dremel</span></div>
<div style="font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-ligatures: normal; font-variant-numeric: normal; font-variant-position: normal; line-height: normal;">
<span style="font-family: inherit;">Papers: </span></div>
<div style="font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-ligatures: normal; font-variant-numeric: normal; font-variant-position: normal; line-height: normal;">
<a href="https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36632.pdf"><span style="font-family: inherit;">https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36632.pdf</span></a></div>
<div style="font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-ligatures: normal; font-variant-numeric: normal; font-variant-position: normal; line-height: normal;">
<span style="font-family: inherit;"><br /></span></div>
<div style="font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-ligatures: normal; font-variant-numeric: normal; font-variant-position: normal; line-height: normal;">
<span style="font-family: inherit;"><br /></span></div>
<div style="font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-ligatures: normal; font-variant-numeric: normal; font-variant-position: normal; line-height: normal;">
<span style="font-family: inherit;"><b>Apache Pig </b></span></div>
<div style="font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-ligatures: normal; font-variant-numeric: normal; font-variant-position: normal; line-height: normal;">
<span style="font-family: inherit;"><br /></span></div>
<div style="font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-ligatures: normal; font-variant-numeric: normal; font-variant-position: normal; line-height: normal;">
<span style="font-family: inherit;">Created by: Yahoo!</span></div>
<div style="font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-ligatures: normal; font-variant-numeric: normal; font-variant-position: normal; line-height: normal;">
<span style="font-family: inherit;">Papers: </span></div>
<div style="font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-ligatures: normal; font-variant-numeric: normal; font-variant-position: normal; line-height: normal;">
<a href="http://infolab.stanford.edu/~olston/publications/sigmod08.pdf"><span style="font-family: inherit;">http://infolab.stanford.edu/~olston/publications/sigmod08.pdf</span></a></div>
<div style="font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-ligatures: normal; font-variant-numeric: normal; font-variant-position: normal; line-height: normal;">
<span style="font-family: inherit;"><br /></span></div>
<div style="font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-ligatures: normal; font-variant-numeric: normal; font-variant-position: normal; line-height: normal;">
<span style="font-family: inherit;"><br /></span></div>
<div style="font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-ligatures: normal; font-variant-numeric: normal; font-variant-position: normal; line-height: normal;">
<span style="font-family: inherit;"><b>Apache Oozie </b></span></div>
<div style="font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-ligatures: normal; font-variant-numeric: normal; font-variant-position: normal; line-height: normal;">
<span style="font-family: inherit;"><br /></span></div>
<div style="font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-ligatures: normal; font-variant-numeric: normal; font-variant-position: normal; line-height: normal;">
<span style="font-family: inherit;">Created by: Yahoo!</span></div>
<div style="font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-ligatures: normal; font-variant-numeric: normal; font-variant-position: normal; line-height: normal;">
<span style="font-family: inherit;">Papers: </span></div>
<div style="font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-ligatures: normal; font-variant-numeric: normal; font-variant-position: normal; line-height: normal;">
<a href="https://dl.acm.org/citation.cfm?id=2443420"><span style="font-family: inherit;">https://dl.acm.org/citation.cfm?id=2443420</span></a></div>
<div style="font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-ligatures: normal; font-variant-numeric: normal; font-variant-position: normal; line-height: normal;">
<span style="font-family: inherit;"><br /></span></div>
<div style="font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-ligatures: normal; font-variant-numeric: normal; font-variant-position: normal; line-height: normal;">
<span style="font-family: inherit;"><b><br /></b></span></div>
<div style="font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-ligatures: normal; font-variant-numeric: normal; font-variant-position: normal; line-height: normal;">
<span style="font-family: inherit;"><b>Apache Sqoop</b> </span></div>
<div style="font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-ligatures: normal; font-variant-numeric: normal; font-variant-position: normal; line-height: normal;">
<span style="font-family: inherit;"><br /></span></div>
<div style="font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-ligatures: normal; font-variant-numeric: normal; font-variant-position: normal; line-height: normal;">
<span style="background-color: white; font-family: inherit;">Started as a module for Apache Hadoop on issue <a href="https://issues.apache.org/jira/browse/HADOOP-5815">https://issues.apache.org/jira/browse/HADOOP-5815</a> by <span style="color: #222222; white-space: pre-wrap;">Aaron Kimball.</span></span></div>
<div style="font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-ligatures: normal; font-variant-numeric: normal; font-variant-position: normal; line-height: normal;">
<span style="color: #222222; font-family: inherit;"><span style="background-color: white; white-space: pre-wrap;">Links:</span></span></div>
<div style="font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-ligatures: normal; font-variant-numeric: normal; font-variant-position: normal; line-height: normal;">
<a href="https://blogs.apache.org/sqoop/entry/apache_sqoop_graduates_from_incubator"><span style="font-family: inherit;">https://blogs.apache.org/sqoop/entry/apache_sqoop_graduates_from_incubator</span></a></div>
<div style="font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-ligatures: normal; font-variant-numeric: normal; font-variant-position: normal; line-height: normal;">
<span style="font-family: inherit;"><br /></span></div>
<div style="font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-ligatures: normal; font-variant-numeric: normal; font-variant-position: normal; line-height: normal;">
<span style="font-family: inherit;"><br /></span></div>
<div style="font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-ligatures: normal; font-variant-numeric: normal; font-variant-position: normal; line-height: normal;">
<span style="font-family: inherit;"><b>Apache Flume</b></span></div>
<div style="font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-ligatures: normal; font-variant-numeric: normal; font-variant-position: normal; line-height: normal;">
<span style="font-family: inherit;"><br /></span></div>
<div style="font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-ligatures: normal; font-variant-numeric: normal; font-variant-position: normal; line-height: normal;">
<span style="font-family: inherit;">Links:</span></div>
<div style="font-variant-alternates: normal; font-variant-east-asian: normal; font-variant-ligatures: normal; font-variant-numeric: normal; font-variant-position: normal; line-height: normal;">
<a href="https://blogs.apache.org/flume/entry/flume_ng_architecture"><span style="font-family: inherit;">https://blogs.apache.org/flume/entry/flume_ng_architecture</span></a></div>
<div>
<br /></div>
</div>
</div>
Renata Ghisloti Duarte de Souza Granhahttp://www.blogger.com/profile/18336442605720194782noreply@blogger.com2tag:blogger.com,1999:blog-6461723396986571103.post-58949057686053537592017-08-11T17:00:00.002-07:002017-08-11T17:05:01.109-07:00Deep Learning, TensorFlow and Tensor CoreI was lucky enough to get a ticket to <a href="https://events.google.com/io/">Google I/O 2017</a> through the <a href="https://code.google.com/codejam/contest/12224486/dashboard">Google Code Jam for Women</a> (for girls who don't know, Google runs a programming contest for women, and the top-ranked contestants win tickets to the conference).<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh_PrlFHpWSThCPVUbUQI4ypBLlgKkjR-qbDr9SB5WuIq7bd5x9S-2-hTBiP2xCkAFpvvLrssTztwAIXPps4jye63aHd1A9U2uMoBB8L5lwkwLqvf_PLKvIOazaQNTxTXeUXvE1x41Q0SU/s1600/Screen+Shot+2017-08-11+at+6.49.53+PM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1146" data-original-width="1184" height="309" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh_PrlFHpWSThCPVUbUQI4ypBLlgKkjR-qbDr9SB5WuIq7bd5x9S-2-hTBiP2xCkAFpvvLrssTztwAIXPps4jye63aHd1A9U2uMoBB8L5lwkwLqvf_PLKvIOazaQNTxTXeUXvE1x41Q0SU/s320/Screen+Shot+2017-08-11+at+6.49.53+PM.png" width="320" /></a></div>
<br />
<br />
One of the main topics of the conference was, for sure, Google's Deep Learning library <a href="https://www.tensorflow.org/">TensorFlow</a>. TensorFlow is Google's open-source Machine Learning library, and it runs on both CPU and GPU.<br />
<br />
Two very cool things were presented at Google I/O:<br />
<br />
<ul>
<li> TPU (Tensor Processing Unit) - a custom chip (not a GPU) that Google designed specifically for Machine Learning workloads such as TensorFlow, and that can be used on Google Compute Engine</li>
<li> TensorFlow Lite - a lightweight version of TensorFlow that runs on Android and makes developers' lives easier</li>
</ul>
<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjbwwsKZgjsqKFFkYofJ3NzDsXNZICoY5CqBqkV-jH1-OyfQLDG0IYafMPKnW31UTMP96bI5teHnWrZiXqml5lqNsFCl8nOuDN1iI8rS05VhYR3-dRs88SdISSXJpHkiZes2jsL65zZlfg/s1600/Screen+Shot+2017-08-11+at+6.49.41+PM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1190" data-original-width="1190" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjbwwsKZgjsqKFFkYofJ3NzDsXNZICoY5CqBqkV-jH1-OyfQLDG0IYafMPKnW31UTMP96bI5teHnWrZiXqml5lqNsFCl8nOuDN1iI8rS05VhYR3-dRs88SdISSXJpHkiZes2jsL65zZlfg/s320/Screen+Shot+2017-08-11+at+6.49.41+PM.png" width="320" /></a></div>
<br />
<br />
Last week, at a BigData meetup in Chicago, I discovered that Nvidia also created specialized hardware for Deep Learning: the <a href="https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/">Tensor Core</a>, a dedicated unit inside its Volta GPU architecture.<br />
<br />
With all this infrastructure and these APIs being made available, Deep Learning can be done considerably more easily and faster. At Google I/O, Sundar Pichai mentioned that Google has been using Machine Learning for almost everything, and is even using Deep Learning to train the Deep Learning networks!<br />
<br />
TensorFlow's API is so high level that even someone with little technical background can develop something interesting with it. Sundar also shared the <a href="http://www.businessinsider.com/google-showcases-teen-programmer-helping-diagnose-cancer-2017-5">story</a> of a high school student who used the library to help detect some types of cancer.<br />
<br />
It seems that Data Science is becoming attainable.<br />
<br />Renata Ghisloti Duarte de Souza Granhahttp://www.blogger.com/profile/18336442605720194782noreply@blogger.com2tag:blogger.com,1999:blog-6461723396986571103.post-72656992546290243222017-08-02T12:46:00.004-07:002020-09-01T17:32:18.807-07:00Errors when using the neuralnet package in R<span style="font-family: inherit;">OK, so you read a bunch of stuff on how to build Neural Networks, how many layers or nodes you should add, etc. But when you start to implement an actual Neural Network, you face a ton of silly errors that stop your beautiful, inspired programming.</span><br />
<span style="font-family: inherit;"><br /></span>
<span style="font-family: inherit;">This post talks about some errors you might face when using the neuralnet package in R.</span><br />
<span style="font-family: inherit;"><br /></span>
<span style="font-family: inherit;">First, remember that you need to install the package before you can use it:</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">install.packages("neuralnet")</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: inherit;">Then</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">library("neuralnet")</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: inherit;">to load the package.</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: inherit; font-size: large;"><b>Error 1</b></span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: inherit;">One error that might happen when training your neural network is this:</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;"><span style="font-family: "courier new" , "courier" , monospace;">nn <- neuralnet(formula1,data=new_data, hidden=c(5,3))</span></span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="background-color: white;">Error in terms.formula(formula) : invalid model formula in ExtractVars</span></span><br />
<br />
This happens when the variable names in the formula "formula1" are in a format that R cannot use as variable names. For example, if you named your columns (or variables) with plain numbers, you would get this error. So rename your columns and re-run the model!<br />
<br />
Example:<br />
<br />
label ~ 1 + 2 + 3 + 4 + 5<br />
<br />
Change to:<br />
<br />
label ~ v1 + v2 + v3 + v4 + v5<br />
<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<span style="color: black; margin-left: 1em; margin-right: 1em;"></span></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhYgWKR0M1XbO-pEVHCJ1pWr_RbMg1g5DspZjaC9W7Tse9-k76yr0QEVfgDga3qyQTmPxD34HL6zm-OgA7t_G5JQe3i-yHAptzAr5ClzGZleJrSnnx9hJUItX-KyfwslQEMOvsnqRqDQOE/s1600/Screen+Shot+2017-08-06+at+7.41.28+AM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="348" data-original-width="1600" height="86" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhYgWKR0M1XbO-pEVHCJ1pWr_RbMg1g5DspZjaC9W7Tse9-k76yr0QEVfgDga3qyQTmPxD34HL6zm-OgA7t_G5JQe3i-yHAptzAr5ClzGZleJrSnnx9hJUItX-KyfwslQEMOvsnqRqDQOE/s400/Screen+Shot+2017-08-06+at+7.41.28+AM.png" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<br />
<br />
<br />
<span style="font-size: large;"><b>Error 2</b></span><br />
<br />
Another error you might get is the following:<br />
<br />
<span style="background-color: white; font-family: "courier new" , "courier" , monospace;">nn <- neuralnet(f, data=train[,-1], hidden=c(3,3))</span><br />
<span style="background-color: white; font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="background-color: white; font-family: "courier new" , "courier" , monospace;">Warning message: algorithm did not converge in 1 of 1 repetition(s) within the stepmax</span><br />
<br />
<span style="background-color: white;"><br /></span>
<span style="background-color: white;">To solve this</span>, you can increase the value of the "stepmax" parameter:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">nn <- neuralnet(f, data=train[,-1], hidden=c(3,3), stepmax=1e6)</span><br />
<br />
If that doesn't work, you might have to change other parameters to make it converge. Try reducing the number of hidden nodes or layers, or changing the size of your training data.<br />
<br />
<br />
<span style="font-size: large;"><b>Error 3</b></span><br />
<br />
The third error I want to discuss happens when actually computing the output of the neural network:<br />
<br />
<pre class="lang-r prettyprint prettyprinted" style="border: 0px; line-height: inherit; margin-bottom: 1em; margin-top: 0.5em; max-height: 600px; overflow: auto; padding: 5px; vertical-align: baseline; width: auto; word-wrap: normal;"><code style="background-color: white; border: 0px; font-style: inherit; line-height: inherit; margin: 0px; padding: 0px; vertical-align: baseline; white-space: inherit;"><span style="font-family: "courier new" , "courier" , monospace;">net.compute <- compute(net, matrix.train2[,1:10])</span></code></pre>
<pre class="lang-r prettyprint prettyprinted" style="border: 0px; line-height: inherit; margin-bottom: 1em; margin-top: 0.5em; max-height: 600px; overflow: auto; padding: 5px; vertical-align: baseline; width: auto; word-wrap: normal;"><code style="background-color: white; border: 0px; font-style: inherit; line-height: inherit; margin: 0px; padding: 0px; vertical-align: baseline; white-space: inherit;"><span style="font-family: "courier new" , "courier" , monospace;">Error in neurons[[i]] %*% weights[[i]] : non-conformable arguments</span></code></pre>
This error occurs when the number of columns in the data frame you are using to predict is different from the number of columns used to train the neural network. <b>The data frames used in neuralnet and compute should have the same columns and the same names!</b><br />
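<br />
To see why the shapes matter, here is a small sketch in plain Python (not R, and not the neuralnet internals). It only illustrates the general rule behind the message: a matrix product works only when the number of columns on the left equals the number of rows on the right. The function name, the toy weights and the toy inputs are all made up for this illustration.<br />

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows; raise if shapes do not conform."""
    inner_a = len(a[0])  # number of columns of the left matrix
    inner_b = len(b)     # number of rows of the right matrix
    if inner_a != inner_b:
        raise ValueError(
            f"non-conformable arguments: {inner_a} columns vs {inner_b} rows"
        )
    # Each output cell is the dot product of a row of `a` and a column of `b`
    return [
        [sum(x * w for x, w in zip(row, col)) for col in zip(*b)]
        for row in a
    ]


# Hypothetical toy layer: 2 input columns feeding 1 hidden node
weights = [[0.5], [-1.0]]        # 2 rows, one per input column seen in training
good_inputs = [[1.0, 2.0]]       # 2 columns, same as in training
bad_inputs = [[1.0, 2.0, 3.0]]   # 3 columns: one column too many

result = matmul(good_inputs, weights)  # shapes agree, so this works

try:
    matmul(bad_inputs, weights)
    mismatch_detected = False
except ValueError:
    mismatch_detected = True           # the extra column breaks the product
```

So if you see this message, the first thing I would check is that the columns you pass to compute match the ones the network was trained on.<br />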
<br />
<br />
That's it! If you have faced any other silly error with the neuralnet package, send it to me and I can add it to the post! Good luck! :D<br />
<br />
<b>Running k-Means Clustering on Spark with Cloudera in your Machine</b> (published 2016-11-08, updated 2017-08-06)<br />
<br />
Here are some steps to start using Spark. You can download VirtualBox and a Cloudera Hadoop distribution and start testing it locally on your machine.<br />
<br />
<b>Steps</b>:<br />
<br />
Download the <a href="https://github.com/apache/spark/blob/master/examples/src/main/python/mllib/k_means_example.py">kmeans.py</a> example, which uses the MLlib library that ships with Spark.<br />
<br />
Create a kmeans_data.txt file that looks like this:<br />
<br />
0.0 0.0 0.0<br />
0.1 0.1 0.1<br />
0.2 0.2 0.2<br />
9.0 9.0 9.0<br />
9.1 9.1 9.1<br />
9.2 9.2 9.2<br />
<br />
Download <a href="https://www.virtualbox.org/wiki/Downloads">VirtualBox</a>.<br />
<br />
Download the <a href="http://www.cloudera.com/downloads/quickstart_vms/5-8.html">Cloudera</a> CDH5 trial version.<br />
Open VirtualBox, import the downloaded Cloudera virtual machine and run it.<br />
<br />
Inside the virtual machine:<br />
<br />
1 - (needs internet access) Install the Python numpy library. In a terminal, type: <br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">$ sudo yum install numpy</span><br />
<br />
2 - Copy kmeans_data.txt and kmeans.py to /home/cloudera/ (or wherever you want)<br />
<br />
3 - Launch Cloudera Enterprise Trial by clicking on an icon on Cloudera's Desktop or run this command: <br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">$ sudo cloudera-manager --force --enterprise</span><br />
<br />
4 - Open the Cloudera Manager web interface in your browser. Here are the credentials for that: <br />
<br />
user: cloudera <br />
password: cloudera<br />
<br />
5 - Start HDFS in the Cloudera Manager web interface (in your browser)<br />
<br />
6 - Start Spark in the Cloudera Manager web interface (in your browser)<br />
<br />
7 - Put the kmeans_data.txt into HDFS. Run: <br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">$ hadoop fs -put kmeans_data.txt</span><br />
<br />
8 - Run the Spark job kmeans.py locally with 2 threads: <br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">$ spark-submit --master local[2] kmeans.py</span><br />
<br />
9 - Get the result from HDFS and put it in your current directory: <br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">$ hadoop fs -get KMeansModel/*</span><br />
<br />
10 - The result will be stored in Parquet format. Read the result with parquet-tools: <br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">$ parquet-tools cat KMeansModel/data/part-r-000..</span><br />
<br />
<span style="font-family: inherit;">Here is an example output of what this command should give:</span><br />
<span style="font-family: "courier new";"></span><br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj09mdPsXSK_R9X6xmBWKL86tiSu6tNUMYerI1NJL1KIYJNHVEJ2Q6hNewsJhRXZo9Lhlc74TDCNQtnYM68ASdA_r7JYKA_NwPjAgf-5ruIuXrvKilINoyFmLZXN6bLz25RH9xsxWAWPKU/s1600/2016-10-27_15h14_27.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="159" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj09mdPsXSK_R9X6xmBWKL86tiSu6tNUMYerI1NJL1KIYJNHVEJ2Q6hNewsJhRXZo9Lhlc74TDCNQtnYM68ASdA_r7JYKA_NwPjAgf-5ruIuXrvKilINoyFmLZXN6bLz25RH9xsxWAWPKU/s320/2016-10-27_15h14_27.png" width="320" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<span style="font-size: xx-small;">Small note: While running these steps, <i>errors might appear in some part of the process due to initialization timing issues</i>. I know that is annoying advice, but if that happens just try running the command again in a couple of minutes. Also, you have to change the location of the kmeans_data.txt file inside kmeans.py to point it to your data, and maybe also change where the output will be written (target/org/apache/spark/PythonKMeansExample/KMeansModel).</span><br />
<div>
<span style="font-size: xx-small;"><br /></span></div>
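To judge whether the Spark output looks right, it helps to know what a correct clustering of these six points is. The following is a minimal sketch in base R, not part of the Spark workflow above; it assumes k = 2 clusters, and the two centers should come out close to (0.1, 0.1, 0.1) and (9.1, 9.1, 9.1).<br />

```r
# The six points from kmeans_data.txt, one row per point
points <- matrix(c(0.0, 0.0, 0.0,
                   0.1, 0.1, 0.1,
                   0.2, 0.2, 0.2,
                   9.0, 9.0, 9.0,
                   9.1, 9.1, 9.1,
                   9.2, 9.2, 9.2),
                 ncol = 3, byrow = TRUE)

set.seed(42)                                    # reproducible start points
fit <- kmeans(points, centers = 2, nstart = 5)  # k = 2, best of 5 random starts

# Sort the centers so the printed order does not depend on the random start
centers <- fit$centers[order(fit$centers[, 1]), ]
print(round(centers, 2))
```

If the centers read from the parquet file are far from these values, check that kmeans.py points to the right input file.<br />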
<b>Error when using smooth.spline</b> (published 2016-05-16, updated 2017-12-21)<br />
<br />
<span style="font-family: inherit;">When interpolating a data series, the cubic <a href="https://en.wikipedia.org/wiki/Spline_(mathematics)" target="_blank">spline</a> is a great technique to use.</span><br />
<span style="font-family: inherit;">I chose to use the <a href="http://smooth.spline/">smooth.spline</a> function, from the R stats package.</span><br />
<span style="font-family: inherit;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">> smooth.spline(data$x, data$y)</span><br />
<span style="font-family: inherit;"><br /></span>
<span style="font-family: inherit;">Nevertheless, while running smooth.spline on a collection of datasets with different sizes I got the following error:</span><br />
<span style="font-family: inherit;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">Error in smooth.spline(<span style="font-family: "courier new" , "courier" , monospace;">data$x, </span>data$y), :<br /> 'tol' must be strictly positive and finite</span><br />
<span style="font-family: inherit;"><br />After digging a little bit I discovered that the problem was that some datasets were really small and smooth.spline was not able to compute anything.<br />Hence, make sure your dataset is big enough before applying smooth.spline to it.</span><br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">> if(length(data$x) > 30) { <span style="font-size: small;">smooth.spline(<span style="font-family: "courier new" , "courier" , monospace;">data$x, </span>data$y) </span>}</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: inherit;">UPDATE: </span><br />
<span style="font-family: inherit;"><br /></span>
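<span style="font-family: inherit;">A more general solution is to check the interquartile range of x, since the default tolerance of smooth.spline is tol = 1e-6 * IQR(x), which is zero whenever IQR(x) is zero:</span><br />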
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">> if(IQR(data$x) > 0) { smooth.spline(data$x, data$y) }</span><br />
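When looping over many datasets, the check can be wrapped in a small function. This is only a sketch: <i>safe_smooth_spline</i> is a hypothetical name, and the extra test for at least four distinct x values is my addition, based on smooth.spline refusing to fit with fewer unique x values.<br />

```r
# Hypothetical wrapper: fit only when smooth.spline's preconditions hold,
# otherwise return NULL so the caller can skip that dataset.
safe_smooth_spline <- function(x, y) {
  if (length(unique(x)) < 4 || IQR(x) <= 0) {
    return(NULL)
  }
  smooth.spline(x, y)
}

good_x <- 1:20
good_y <- sin(good_x / 3)
fit <- safe_smooth_spline(good_x, good_y)   # regular data: a fit is returned

bad_x <- c(1, 1, 1, 1, 2)                   # IQR(bad_x) is 0
skipped <- safe_smooth_spline(bad_x, c(5, 6, 5, 7, 9))
is.null(skipped)
```

Returning NULL instead of stopping lets a loop over datasets carry on and report which ones were skipped.<br />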
<br />
<b>Working with Big Datasets in R</b> (published 2015-09-26, updated 2017-08-06)<br />
<br />
When dealing with a significant amount of data in <a href="https://www.r-project.org/" target="_blank">R</a> there are some points to consider.<br />
<br />
<b>How do I know if my data is too big?</b><br />
<br />
Well, the term "BigData" can be thought of as data that is too big to fit in the available memory.<br />
<br />
As R works with the entire dataset in memory (unless you specify it not to do so), the first thing is to <b>check how large the dataset in question is, and whether it fits in memory</b>.<br />
<br />
Remember that you should actually have at least twice as much memory as the size of your dataset.<br />
So, for example, if your dataset has a size of 2 GB, you should have at least 4 GB of memory.<br />
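The rule of thumb can be written down as a tiny check. This is a sketch under my own assumptions: <i>needs_chunking</i> is a hypothetical helper and you have to supply the amount of available memory yourself, since base R has no portable way to query it.<br />

```r
# Rule of thumb from above: available memory should be at least
# twice the size of the dataset.
needs_chunking <- function(dataset_bytes, available_bytes, safety_factor = 2) {
  dataset_bytes * safety_factor > available_bytes
}

gb <- 1024^3
needs_chunking(2 * gb, 4 * gb)   # 2 GB of data, 4 GB of memory
needs_chunking(2 * gb, 3 * gb)   # 2 GB of data, 3 GB of memory

# For a file on disk, file.size() gives the size in bytes;
# for an object already loaded, object.size() does.
small <- data.frame(a = 1:1000, b = rnorm(1000))
format(object.size(small), units = "Kb")
```

Keep in mind that the size of a text file on disk is only a rough guide to the size of the same data once loaded into R.<br />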
<br />
If you don't have enough memory, you should consider breaking your data into smaller chunks and working with them separately.<br />
<br />
You can use the command split to do this in Linux:<br />
<br />
<pre style="background-color: #eeeeee; border: 0px; color: #111111; font-family: Consolas, Menlo, Monaco, 'Lucida Console', 'Liberation Mono', 'DejaVu Sans Mono', 'Bitstream Vera Sans Mono', 'Courier New', monospace, sans-serif; font-size: 13px; margin-bottom: 1em; max-height: 600px; overflow: auto; padding: 5px; width: auto; word-wrap: normal;"><code style="border: 0px; font-family: Consolas, Menlo, Monaco, 'Lucida Console', 'Liberation Mono', 'DejaVu Sans Mono', 'Bitstream Vera Sans Mono', 'Courier New', monospace, sans-serif; margin: 0px; padding: 0px; white-space: inherit;">split -l 10000 file.txt new_file</code></pre>
<br />
This should create several new files (new_fileaa, new_fileab, etc.) with ten thousand lines each (the last one may have fewer).<br />
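If you would rather stay inside R than split the file on disk, one alternative is to read from an open connection a fixed number of rows at a time. The sketch below is an assumption-laden example, not the method used above: it writes a tiny demo file first, and it treats any read error as the end of the file, which is too blunt for production code.<br />

```r
# Write a small demo file: 25 rows, 2 numeric columns (stand-in for a big file)
path <- tempfile(fileext = ".txt")
write.table(data.frame(a = 1:25, b = (1:25) / 10), path,
            row.names = FALSE, col.names = FALSE)

chunk_rows <- 10   # rows per chunk (something like 10000 for real data)
total <- 0         # running result computed chunk by chunk
rows_read <- 0

con <- file(path, open = "r")
repeat {
  # read.table continues from the current position of an open connection;
  # it raises an error once no lines are left, which ends the loop here
  chunk <- tryCatch(
    read.table(con, nrows = chunk_rows, colClasses = "numeric"),
    error = function(e) NULL
  )
  if (is.null(chunk)) break
  total <- total + sum(chunk$V1)
  rows_read <- rows_read + nrow(chunk)
  if (nrow(chunk) < chunk_rows) break   # last, shorter chunk
}
close(con)

c(rows = rows_read, sum_a = total)
```

Only one chunk is held in memory at a time, so this works for any computation that can be accumulated piece by piece (sums, counts, minimum and maximum).<br />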
<br />
Well, once you know your data will fit into memory, you can read it with the commands <i>read.table</i> or <i>read.csv</i>. The main difference between them is that <i>read.csv</i> sets the parameter sep (the separator) to "," and header to TRUE by default.<br />
<br />
If your data does fit in memory but occupies almost the entire available space, <b>there are some parameters you can tune to make R faster</b>.<br />
<br />
We know that not all parameters are mandatory when calling the read.table command. When we leave some parameters blank, R tries to work out their values automatically. Setting them beforehand spares R some computation, which for large datasets can take considerable time.<br />
Some of these parameters are:<br />
<br />
<br />
<ul>
<li>comment.char - defines the comment character in your text. If there is none, you can set it to the empty string ""</li>
<li>colClasses - defines the class of each column of your data.frame. If they are all numeric, for example, just put "numeric"</li>
</ul>
<br />
<br />
If <i>colClasses</i> is not specified, all columns are read as characters and then converted to the appropriate class.<br />
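Put together, a call with these parameters set explicitly could look like the sketch below. The demo file is generated on the spot so the example runs anywhere; the <i>nrows</i> argument is my addition and is not discussed above.<br />

```r
# Demo file: 3 numeric columns, no header, no comments
path <- tempfile(fileext = ".txt")
write.table(matrix(runif(3000), ncol = 3), path,
            row.names = FALSE, col.names = FALSE)

data <- read.table(path,
                   header = FALSE,
                   colClasses = "numeric",  # recycled to every column
                   comment.char = "",       # turn off comment scanning
                   nrows = 1000)            # known (or slightly overestimated) row count

sapply(data, class)
```

On a file this small the gain is not noticeable; the settings start to pay off on files with millions of rows.<br />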
<br />
For more information:<br />
https://stat.ethz.ch/R-manual/R-devel/library/utils/html/read.table.html<br />
<br />
<br />
<br />
<br />
<br />
<b>Removing Outliers to Plot Data</b> (published 2015-07-31, updated 2017-08-06)<br />
<br />
<span style="font-family: "times" , "times new roman" , serif;">I am currently working a lot with <a href="https://www.r-project.org/" target="_blank">R</a>. One simple thing that helps me to better visualize data is to plot it excluding outliers.</span><br />
<span style="font-family: "times" , "times new roman" , serif;"><br /></span>
<span style="font-family: "times" , "times new roman" , serif;">To do so, first read the data</span><br />
<div style="color: black;">
<span style="font-family: "times" , "times new roman" , serif;"><br /></span></div>
<div style="color: black;">
<span style="background-color: #eeeeee; font-family: "courier new" , "courier" , monospace;">data = read.table("myfile.txt") </span></div>
<div style="color: black;">
</div>
<div style="color: black;">
<span style="font-family: "times" , "times new roman" , serif;"><br /></span>
<span style="font-family: "times" , "times new roman" , serif;">Then, you can check how the data is distributed. quantile expects a numeric vector, so pass it a column of the data frame (read.table names the first column V1 when the file has no header)</span><br />
<span style="font-family: "times" , "times new roman" , serif;"><br /></span></div>
<div style="color: black;">
</div>
<div style="color: black;">
<span style="background-color: #eeeeee; font-family: "courier new" , "courier" , monospace;">quantile(data$V1, c(.02, .05, .10, .50, .90, .95, .98)) </span></div>
<div style="color: black;">
<span style="font-family: "times" , "times new roman" , serif;"><br /></span></div>
<div style="color: black;">
<span style="font-family: "times" , "times new roman" , serif;">An example output would be</span></div>
<div style="color: black;">
<div style="margin: 0px;">
<br /></div>
<div style="margin: 0px;">
<span style="font-family: "times" , "times new roman" , serif;"> </span><span style="font-family: "courier new" , "courier" , monospace;"> 2% 5% 10% 50% 90% 95% 98% </span></div>
<div style="margin: 0px;">
<span style="font-family: "courier new" , "courier" , monospace;"> 189 190 190 194 241 275 316 </span><br />
<br />
<span style="font-family: "times" , "times new roman" , serif;">Now, to plot your data discarding the 1% lowest and the 1% highest values of its first column (V1), you could use</span><br />
<span style="font-family: "times" , "times new roman" , serif;"><br /></span>
<span style="background-color: #eeeeee; font-family: "courier new" , "courier" , monospace;">x <- quantile(data$V1, c(.01, .99)) </span></div>
<div style="margin: 0px;">
<span style="font-family: "times" , "times new roman" , serif;"><br /></span></div>
<div style="margin: 0px;">
<span style="font-family: "times" , "times new roman" , serif;">And then</span><br />
<span style="font-family: "times" , "times new roman" , serif;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;"><span style="background-color: #eeeeee;">plot(data, xlim=c(x[[1]], x[[2]])) </span> </span></div>
<div style="font-family: Menlo; font-size: 11px; margin: 0px;">
<br /></div>
<div style="font-family: Menlo; font-size: 11px; margin: 0px;">
<br /></div>
</div>
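Restricting xlim only hides the extreme values from the plot. If you also want a copy of the data without them, the same quantiles can be used as a filter. A minimal sketch on simulated data (the numbers are made up for illustration):<br />

```r
set.seed(1)
# Simulated measurements with two extreme values standing in for outliers
values <- c(rnorm(1000, mean = 200, sd = 10), 5000, -3000)

# 1st and 99th percentiles, as in the quantile() call above
bounds <- quantile(values, c(.01, .99))

# Keep only the observations inside the bounds
kept <- values[values >= bounds[[1]] & values <= bounds[[2]]]

length(values) - length(kept)   # number of points discarded
range(kept)
```

The vector <i>kept</i> can then be passed to plot or hist without any xlim argument.<br />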
<b>SVM in Practice</b> (published 2015-02-11, updated 2022-05-21)<br />
<br />
Many Machine Learning articles and papers describe the wonders of the <a href="http://en.wikipedia.org/wiki/Support_vector_machine" target="_blank">Support Vector Machine</a> (SVM) algorithm. Nevertheless, when I used it on real data to obtain a high-accuracy classification, I stumbled upon several issues.<br />
I will try to describe the steps I took to make the algorithm work in practice.<br />
<br />
This model was implemented using <a href="https://cran.r-project.org/bin/windows/base/" target="_blank">R </a>and the library "e1071".<br />
To install and use it type:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">> install.packages("e1071")</span><br />
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">> library("e1071")</span><br />
<br />
When you want to <b>classify data into two categories</b>, few algorithms are better than SVM.<br />
It usually divides data into two different sets by finding a "line" that best separates the points. It is capable of classifying data linearly (using a straight line to differentiate the sets) or doing a nonlinear classification (separating the sets with a curve). This "separator" is called a <a href="http://en.wikipedia.org/wiki/Hyperplane" target="_blank">hyperplane</a>.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiDwhowVGgfWyrCPy-ZibeSxhyjQT2JGZRheOQG0-bteX0JEnSXQqPZ9hS19dW6vRxoHL8xGIo12kZCFjNVJ3HsfqkyycJUwinGUilnr1NRZpsbvE-5XEQVNYp9Bh_zkiWKKELVgAGO6Xk/s1600/Screen+Shot+2015-02-11+at+4.14.28+PM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="246" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiDwhowVGgfWyrCPy-ZibeSxhyjQT2JGZRheOQG0-bteX0JEnSXQqPZ9hS19dW6vRxoHL8xGIo12kZCFjNVJ3HsfqkyycJUwinGUilnr1NRZpsbvE-5XEQVNYp9Bh_zkiWKKELVgAGO6Xk/s1600/Screen+Shot+2015-02-11+at+4.14.28+PM.png" width="320" /></a></div>
<div style="text-align: center;">
<i>Picture 1 - Linear hyperplane separator</i></div>
<div style="text-align: center;">
<br /></div>
<b><span style="font-size: large;"><br /></span></b>
<b><span style="font-size: large;">Normalize Features</span></b><br />
<br />
Before you even start running the algorithm, the first thing needed is to <b>normalize your data features.</b> SVM uses features to classify data, and these should be obtained by analyzing the dataset and seeing what best represents it (like what is done with SIFT and SURF for images). Remember: <b>the better these features describe your data, the better your classification is going to be</b>. You might want to use or combine the mean value, the derivative, the standard deviation or several others. When features are not normalized, <b>the ones with greater absolute value have a greater effect on the hyperplane margin</b>. This means that some features are going to influence your algorithm more than others. If that is not what you want, make sure all data features have the same value range.<br />
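One common way to do this in R is the base function scale, which standardizes each column to mean 0 and standard deviation 1. The sketch below uses made-up feature values; the important detail is that the test set is scaled with the means and standard deviations of the training set. As far as I know, svm() in e1071 also has a <i>scale</i> argument that is TRUE by default, so doing it by hand matters mostly when you turn that off or prepare the features yourself.<br />

```r
# Two features on very different scales
train <- data.frame(f1 = c(0.10, 0.12, 0.78, 0.82),
                    f2 = c(1500, 1490, 300, 310))
test  <- data.frame(f1 = c(0.11, 0.80),
                    f2 = c(1510, 305))

# Standardize the training features to mean 0 and standard deviation 1
train_scaled <- scale(train)

# Reuse the training means and standard deviations on the test set,
# so both sets live on the same scale
test_scaled <- scale(test,
                     center = attr(train_scaled, "scaled:center"),
                     scale  = attr(train_scaled, "scaled:scale"))

round(colMeans(train_scaled), 10)
apply(train_scaled, 2, sd)
```

After scaling, f1 and f2 contribute on comparable terms, instead of f2 dominating just because its raw values are thousands of times larger.<br />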
<br />
<br />
<span style="font-size: large;"><b>Tune Parameters</b></span><br />
<br />
Another important point is <b>to check the SVM algorithm parameters</b>. Like many Machine Learning algorithms, SVM has some parameters that have to be tuned to gain better performance. This is very important: <b>SVM is very sensitive to the choice of parameters</b>. Even close parameter values might lead to very different classification results. Really! In order to find the best ones for your problem, you might want to test some different values. A great tool to help with this job in R is the <b>tune.svm() </b>method. It can test several different values and return the ones which minimize the classification error under 10-fold <a href="https://en.wikipedia.org/wiki/Cross-validation_(statistics)" target="_blank">cross validation</a>.<br />
<br />
Example of tune.svm() output:<br />
<div style="font-size: 11px;">
<span style="letter-spacing: 0px;"><span style="font-family: "courier new" , "courier" , monospace;"><br /></span></span></div>
<div style="font-size: 11px;">
<span style="letter-spacing: 0.0px;"><span style="font-family: "courier new" , "courier" , monospace;">> parameters <- tune.svm(class~., data = train_set, gamma = 10^(-5:-1), cost = 10^(-3:1))</span></span></div>
<div style="font-size: 11px; min-height: 13px;">
<span style="font-family: "courier new" , "courier" , monospace;">> summary(parameters )</span></div>
<div style="font-size: 11px; min-height: 13px;">
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span></div>
<div style="font-size: 11px;">
<span style="letter-spacing: 0.0px;"><span style="font-family: "courier new" , "courier" , monospace;">Parameter tuning of ‘svm’:</span></span></div>
<div style="font-size: 11px; min-height: 13px;">
<span style="font-family: "courier new" , "courier" , monospace;"><span style="letter-spacing: 0.0px;"></span><br /></span></div>
<div style="font-size: 11px;">
<span style="letter-spacing: 0.0px;"><span style="font-family: "courier new" , "courier" , monospace;">- sampling method: 10-fold cross validation </span></span></div>
<div style="font-size: 11px; min-height: 13px;">
<span style="font-family: "courier new" , "courier" , monospace;"><span style="letter-spacing: 0.0px;"></span><br /></span></div>
<div style="font-size: 11px;">
<span style="letter-spacing: 0.0px;"><span style="font-family: "courier new" , "courier" , monospace;">- best parameters:</span></span></div>
<div style="font-size: 11px;">
<span style="letter-spacing: 0.0px;"><span style="font-family: "courier new" , "courier" , monospace;"> gamma cost</span></span></div>
<div style="font-size: 11px;">
<span style="letter-spacing: 0.0px;"><span style="font-family: "courier new" , "courier" , monospace;"> 0.1 1</span></span></div>
<div style="font-size: 11px; min-height: 13px;">
<span style="font-family: "courier new" , "courier" , monospace;"><span style="letter-spacing: 0.0px;"></span><br /></span></div>
<div style="font-size: 11px;">
<span style="letter-spacing: 0.0px;"><span style="font-family: "courier new" , "courier" , monospace;">- best performance: 0.1409453 </span></span></div>
<div style="font-size: 11px; min-height: 13px;">
<span style="font-family: "courier new" , "courier" , monospace;"><span style="letter-spacing: 0.0px;"></span><br /></span></div>
<div style="font-size: 11px;">
<span style="letter-spacing: 0.0px;"><span style="font-family: "courier new" , "courier" , monospace;">- Detailed performance results:</span></span></div>
<div style="font-size: 11px;">
<span style="letter-spacing: 0.0px;"><span style="font-family: "courier new" , "courier" , monospace;"> gamma cost error dispersion</span></span></div>
<div style="font-size: 11px;">
<span style="letter-spacing: 0.0px;"><span style="font-family: "courier new" , "courier" , monospace;">1 1e-05 0.1 0.2549098 0.010693238</span></span></div>
<div style="font-size: 11px;">
<span style="letter-spacing: 0.0px;"><span style="font-family: "courier new" , "courier" , monospace;">2 1e-04 0.1 0.2548908 0.010689828</span></span></div>
<div style="font-size: 11px;">
<span style="letter-spacing: 0.0px;"><span style="font-family: "courier new" , "courier" , monospace;">3 1e-03 0.1 0.2546062 0.010685683</span></span></div>
<div style="font-size: 11px;">
<span style="letter-spacing: 0.0px;"><span style="font-family: "courier new" , "courier" , monospace;">4 1e-02 0.1 0.2397427 0.010388229</span></span></div>
<div style="font-size: 11px;">
<span style="letter-spacing: 0.0px;"><span style="font-family: "courier new" , "courier" , monospace;">5 1e-01 0.1 0.1776163 0.014591070</span></span></div>
<div style="font-size: 11px;">
<span style="letter-spacing: 0.0px;"><span style="font-family: "courier new" , "courier" , monospace;">6 1e-05 1.0 0.2549043 0.010691266</span></span></div>
<div style="font-size: 11px;">
<span style="letter-spacing: 0.0px;"><span style="font-family: "courier new" , "courier" , monospace;">7 1e-03 1.0 0.2524830 0.010660262</span></span></div>
<div style="font-size: 11px;">
<span style="letter-spacing: 0.0px;"><span style="font-family: "courier new" , "courier" , monospace;">8 1e-02 1.0 0.2262167 0.010391502</span></span></div>
<div style="font-size: 11px;">
<span style="letter-spacing: 0.0px;"><span style="font-family: "courier new" , "courier" , monospace;">9 1e-01 1.0 0.1409453 0.009898745</span></span></div>
<div style="font-size: 11px;">
<span style="letter-spacing: 0.0px;"><span style="font-family: "courier new" , "courier" , monospace;">10 1e-05 10.0 0.2548687 0.010690819</span></span></div>
<div style="font-size: 11px;">
<span style="letter-spacing: 0.0px;"><span style="font-family: "courier new" , "courier" , monospace;">11 1e-04 10.0 0.2545997 0.010686525</span></span></div>
<div style="font-size: 11px;">
<span style="letter-spacing: 0.0px;"><span style="font-family: "courier new" , "courier" , monospace;">12 1e-03 10.0 0.2403118 0.010394169</span></span></div>
<div style="font-size: 11px;">
<span style="letter-spacing: 0.0px;"><span style="font-family: "courier new" , "courier" , monospace;">13 1e-02 10.0 0.1932509 0.009984875</span></span></div>
<div style="font-size: 11px;">
<span style="letter-spacing: 0px;"><span style="font-family: "courier new" , "courier" , monospace;">14 1e-01 10.0 0.1529182 0.013780632</span></span></div>
<div>
<span style="letter-spacing: 0.0px;"><br /></span></div>
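Besides printing the summary, the object returned by tune.svm() can be used directly in code. The sketch below assumes the e1071 package is installed and uses a two-class subset of the built-in iris data as a stand-in for train_set; the grid is kept small so it runs quickly.<br />

```r
library(e1071)

set.seed(1)   # cross-validation folds are random

# Two-class subset of iris as a stand-in for train_set
train_set <- droplevels(iris[iris$Species != "setosa", ])

tuned <- tune.svm(Species ~ ., data = train_set,
                  gamma = 10^(-3:0), cost = 10^(-1:1))

tuned$best.parameters            # data frame with the winning gamma and cost
tuned$best.performance           # lowest cross-validated error
best_model <- tuned$best.model   # svm refitted with those parameters
```

Using <i>best.model</i> saves you from copying the best gamma and cost by hand into a second svm() call.<br />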
<br />
The <span style="background-color: white; color: #222222; font-family: "times new roman"; font-size: 22px; text-align: center;">γ </span><span style="background-color: white; color: #222222; font-family: "times new roman"; text-align: center;">(</span>gamma) parameter has to be tuned to better fit the hyperplane to the data. It controls how nonlinear the hyperplane can be, which is why it is not present when using linear kernels. <b>The smaller </b><span style="background-color: white; color: #222222; font-family: "times new roman"; font-size: 22px; text-align: center;">γ</span><b> is, the more the hyperplane is going to look like a straight line</b>. <b>If </b><span style="background-color: white; color: #222222; font-family: "times new roman"; font-size: 22px; text-align: center;">γ</span><b> is too large, the hyperplane will be more curvy</b> and might delineate the data too well and lead to <a href="http://en.wikipedia.org/wiki/Overfitting" target="_blank">overfitting</a>.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjI_513Sz_UxEHl-3atuPpde04pfHbn7kPat2JB0zRosNHxgFIVMuMosr-PZh20ov692plxMXfTaOMcqmbHIScRdmpyfg3A8mmkNaEiZjg5jhu3MqIKH5xgOKmmuH78txTbdSX-NF-Eamc/s1600/Screen+Shot+2015-02-11+at+4.18.15+PM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="244" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjI_513Sz_UxEHl-3atuPpde04pfHbn7kPat2JB0zRosNHxgFIVMuMosr-PZh20ov692plxMXfTaOMcqmbHIScRdmpyfg3A8mmkNaEiZjg5jhu3MqIKH5xgOKmmuH78txTbdSX-NF-Eamc/s1600/Screen+Shot+2015-02-11+at+4.18.15+PM.png" width="320" /></a></div>
<div style="text-align: center;">
<i>Picture 2 - Large value of <span style="background-color: white; color: #222222; font-family: "times new roman"; font-size: 22px; text-align: center;">γ</span></i></div>
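To get a feeling for what γ does, it helps to look at the radial kernel itself, which measures the similarity of two points as exp(-γ * squared distance). The sketch below is plain base R and only illustrates the formula; <i>rbf_kernel</i> is a helper written for this example, not a function from e1071.<br />

```r
# Radial (RBF) kernel: similarity between two points u and v
rbf_kernel <- function(u, v, gamma) {
  exp(-gamma * sum((u - v)^2))
}

u <- c(0, 0)
v <- c(1, 2)   # squared distance from u is 5

gammas <- 10^(-3:1)
similarity <- sapply(gammas, function(g) rbf_kernel(u, v, g))
data.frame(gamma = gammas, similarity = signif(similarity, 3))
```

With a small γ even distant points still look similar, so each training point influences a wide region and the boundary stays smooth. With a large γ the similarity drops to almost zero a short distance away, so the boundary can bend around individual points.<br />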
<br />
<br />
Another parameter to be tuned to help improve accuracy is C (the cost argument in the calls above). It is responsible for the size of the "soft margin" of SVM. The soft margin is a "gray" area around the hyperplane: training points are allowed to fall inside it, or even on the wrong side of the hyperplane, at a penalty that grows with C. <b>The smaller the value of C, the greater the soft margin</b>.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiKemCoQ0vHpOuZG7PbWez5iuENkpryISc-7qBVNDT4UQkAyjjRjoSjWQ5mv7TMG9GUvWrTeRaPct_NnXIA8UAhLq511lIyAXyovCnUY5G89oGRx1HnQkeJSqJVTgfLX38haiT2xaO2eEw/s1600/Screen+Shot+2015-02-11+at+4.23.44+PM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="217" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiKemCoQ0vHpOuZG7PbWez5iuENkpryISc-7qBVNDT4UQkAyjjRjoSjWQ5mv7TMG9GUvWrTeRaPct_NnXIA8UAhLq511lIyAXyovCnUY5G89oGRx1HnQkeJSqJVTgfLX38haiT2xaO2eEw/s1600/Screen+Shot+2015-02-11+at+4.23.44+PM.png" width="320" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<i>Picture 3 - Large values of C</i></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhaP7C12-cxbNJ0bTt86uPWPSD2DZNGmsthvHvufE3heHe_o4hRRQddjHtquNEGCtaHv-UEpwuI5XIGanDwCwnvvWIlMp0X8zdKs7oBjFawtw4_zwdkotISnjetWukGC8WY6AVXbuITrcw/s1600/Screen+Shot+2015-02-11+at+4.23.29+PM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="239" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhaP7C12-cxbNJ0bTt86uPWPSD2DZNGmsthvHvufE3heHe_o4hRRQddjHtquNEGCtaHv-UEpwuI5XIGanDwCwnvvWIlMp0X8zdKs7oBjFawtw4_zwdkotISnjetWukGC8WY6AVXbuITrcw/s1600/Screen+Shot+2015-02-11+at+4.23.29+PM.png" width="320" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<i>Picture 4 - Small values of C</i></div>
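One rough way to see the effect of C on your own data is to fit the same model with different cost values and count the support vectors. This is a sketch assuming e1071 is installed, with a two-class subset of iris as example data; a wider soft margin typically means more support vectors, but the exact counts depend on the data.<br />

```r
library(e1071)

# Two overlapping classes of iris as example data
two_class <- droplevels(iris[iris$Species != "setosa", ])

for (cost in c(0.01, 1, 100)) {
  model <- svm(Species ~ ., data = two_class,
               kernel = "radial", gamma = 0.1, cost = cost)
  cat("cost =", cost, " support vectors =", model$tot.nSV, "\n")
}
```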
<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: inherit; font-size: large;"><b>How to Prepare Data for SVM </b></span><br />
<span style="font-family: inherit;"><br /></span>
The <b>svm() method in R expects a matrix or dataframe with one column identifying the class of each row and several features that describe the data</b>. The following table shows an example of two classes, 0 and 1, and some features. Each row is a data entry.<br />
<br />
<table border="0" cellspacing="0" cols="5" frame="VOID" rules="NONE">
<tr><th></th><th>class</th><th>f1</th><th>f2</th><th>f3</th></tr>
<tr><td>1</td><td>0</td><td>0.100</td><td>0.500</td><td>0.900</td></tr>
<tr><td>2</td><td>0</td><td>0.101</td><td>0.490</td><td>0.901</td></tr>
<tr><td>3</td><td>0</td><td>0.110</td><td>0.540</td><td>0.890</td></tr>
<tr><td>4</td><td>0</td><td>0.100</td><td>0.501</td><td>0.809</td></tr>
<tr><td>5</td><td>1</td><td>0.780</td><td>0.730</td><td>0.090</td></tr>
<tr><td>6</td><td>1</td><td>0.820</td><td>0.790</td><td>0.100</td></tr>
<tr><td>7</td><td>1</td><td>0.870</td><td>0.750</td><td>0.099</td></tr>
<tr><td>8</td><td>1</td><td>0.890</td><td>0.720</td><td>0.089</td></tr>
</table>
<br />
<span style="font-size: medium;">A call to the svm() method could be:</span><br />
<span style="font-size: medium;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">> svm(class ~., data = my_data, kernel = "radial", gamma = 0.1, cost = 1)</span><br />
<span style="font-size: medium;"><br /></span>
<span style="font-size: medium;">Here "class" is the name of the column that holds the class of each row and "my_data" is your dataset. The parameters should be the ones best suited to your problem.</span><br />
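As a concrete sketch, the example table can be built as a data frame and passed to svm(). It assumes e1071 is installed. Storing the class column as a factor is my choice here: with a factor response svm() defaults to classification, while a numeric 0/1 column would be treated as a regression target.<br />

```r
library(e1071)

# The example table from above; class is stored as a factor so that
# svm() treats the problem as classification rather than regression
my_data <- data.frame(
  class = factor(c(0, 0, 0, 0, 1, 1, 1, 1)),
  f1 = c(0.100, 0.101, 0.110, 0.100, 0.780, 0.820, 0.870, 0.890),
  f2 = c(0.500, 0.490, 0.540, 0.501, 0.730, 0.790, 0.750, 0.720),
  f3 = c(0.900, 0.901, 0.890, 0.809, 0.090, 0.100, 0.099, 0.089)
)

model <- svm(class ~ ., data = my_data, kernel = "radial", gamma = 0.1, cost = 1)
fitted(model)   # predicted class for each training row, as a factor
```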
<span style="font-size: large;"><b><br /></b></span>
<br />
<span style="font-size: medium;"><b style="font-size: x-large;">Test Your Results</b></span><br />
<span style="font-size: medium;"><b style="font-size: x-large;"><br /></b></span><span style="font-size: small;">Always set aside a part of your data for testing. It is common practice to use 2/3 of the data as the training set (to fit your model) and 1/3 as the test set. Your final error should be reported on the test set, otherwise it can be biased.</span><br />
<span style="font-size: small;"><br /></span>You can divide your data in R as in the following example (which holds out 30% for testing):<br />
<span style="font-size: small;"><br /></span><span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">> data_index <- 1:nrow(my_data)</span><br />
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">> testindex <- sample(data_index, trunc(length(data_index)*30/100))</span><br />
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">> test_set <- my_data[testindex,]</span><br />
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">> train_set <- my_data[-testindex,]</span><span style="font-size: x-small;"></span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: inherit;">So when you actually run the svm() method, you train it on the training set only:</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;"><span style="font-size: x-small;">> my_model <- svm(class ~., data = train_set, kernel = "radial", gamma = 0.1, cost = 1)</span></span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: inherit;">And then to test the results on the test_set:</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">> my_prediction <- predict(my_model, test_set[,-1])</span><br />
<br />
test_set[,-1] drops the first column (assumed here to be the class column), so the predictions are based only on the features of the data. If your labels are in a different column, drop that column instead.<br />
<span style="font-size: large;"><b><br /></b></span>
<span style="font-size: large;"><b>Final Considerations</b></span><br />
<br />
<ol>
<li>The tune.svm() method might take a while to run depending on your data size. Nevertheless, usually it is worth it.</li>
<li>We usually use logarithmically spaced values for the SVM parameters, varying from 10^-6 to 10^6. Here is some explanation: <a href="http://stats.stackexchange.com/questions/81537/gridsearch-for-svm-parameter-estimation">http://stats.stackexchange.com/questions/81537/gridsearch-for-svm-parameter-estimation</a></li>
<li>If your label classes are numeric (as in our example 0 and 1) your prediction results will probably be real numbers indicating how close this test input is to one class or the other, because svm() defaults to regression when the labels are numeric. If you want to receive the integer and original class values, set the parameter "<span style="font-family: "courier new" , "courier" , monospace;">type</span>" to "<span style="background-color: white; font-family: "andale mono" , "lucida console" , monospace; font-size: 14px; line-height: 21px;">C</span><span style="background-color: white; font-family: "andale mono" , "lucida console" , monospace; font-size: 14px; line-height: 21px; margin: 0px; padding: 0px;">-</span><span style="background-color: white; font-family: "andale mono" , "lucida console" , monospace; font-size: 14px; line-height: 21px;">classification" </span><span style="background-color: white; font-size: 14px; line-height: 21px;"><span style="font-family: inherit;">when calling th</span></span><span style="background-color: white; font-family: "andale mono" , "lucida console" , monospace; font-size: 14px; line-height: 21px;">e svm() </span><span style="background-color: white; font-size: 14px; line-height: 21px;"><span style="font-family: inherit;">method.</span></span> Be aware that this is an SVM parameter, and changing it will change your classifier.</li>
<li>If you try to run tune.svm() with a dataset of fewer than 10 rows, you will get this error:</li>
</ol>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"> Error in tune("svm", train.x = x, data = data, ranges = ranges, ...) : </span><br />
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"> ‘cross’ must not exceed sampling size!</span><br />
<br />
So make sure your dataset has at least 10 rows, since tune.svm() uses 10-fold cross-validation by default.<br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-size: large;"><b><br /></b></span>
<span style="font-size: large;"><b>More about SVM</b></span><br />
<br />
<a href="https://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Classification/SVM">https://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Classification/SVM</a><br />
<a href="http://rischanlab.github.io/SVM.html">http://rischanlab.github.io/SVM.html</a><br />
<a href="https://cran.r-project.org/web/packages/e1071/vignettes/svmdoc.pdf">https://cran.r-project.org/web/packages/e1071/vignettes/svmdoc.pdf</a><br />
<a href="http://pyml.sourceforge.net/doc/howto.pdf">http://pyml.sourceforge.net/doc/howto.pdf</a><br />
<a href="http://neerajkumar.org/writings/svm/">http://neerajkumar.org/writings/svm/</a>Renata Ghisloti Duarte de Souza Granhahttp://www.blogger.com/profile/18336442605720194782noreply@blogger.com5tag:blogger.com,1999:blog-6461723396986571103.post-7920752441521779542014-08-05T11:22:00.003-07:002017-08-06T05:52:20.925-07:00Lecture on Recommender Systems<span style="font-family: "times" , "times new roman" , serif;">Great lecture on Recommender Systems by <span style="color: #222222;">Xavier Amatriain, researcher at Netflix. </span></span><br />
<span style="color: #222222;"><span style="font-family: "times" , "times new roman" , serif;"><br /></span></span>
<br />
<h1 class="yt" id="watch-headline-title" style="border: 0px; color: #222222; font-weight: normal; margin: 0px 0px 5px; overflow: hidden; padding: 0px;">
<span style="font-family: "times" , "times new roman" , serif; font-size: small;"><a href="https://www.youtube.com/watch?v=bLhq63ygoU8">https://www.youtube.com/watch?v=bLhq63ygoU8</a></span></h1>
<div>
<span style="font-family: "times" , "times new roman" , serif;"><br /></span></div>
<div>
<span style="font-family: "times" , "times new roman" , serif;"><a href="https://www.youtube.com/watch?v=mRToFXlNBpQ">https://www.youtube.com/watch?v=mRToFXlNBpQ</a></span></div>
Renata Ghisloti Duarte de Souza Granhahttp://www.blogger.com/profile/18336442605720194782noreply@blogger.com0tag:blogger.com,1999:blog-6461723396986571103.post-57755990354720875402014-07-16T17:25:00.002-07:002018-05-24T19:28:52.220-07:00Genetic Algorithm for Knapsack using Hadoop<style type="text/css">p { margin-bottom: 0.1in; line-height: 120%; }a:link { }</style>
<br />
<div align="center" style="line-height: 100%; margin-bottom: 0in;">
<i><span style="font-size: x-small;">Development
of a Genetic Algorithm using the Apache Hadoop framework to solve
optimization problems</span></i></div>
<div align="center" style="line-height: 100%; margin-bottom: 0in;">
<br /></div>
<div align="center" style="line-height: 100%; margin-bottom: 0in;">
<br /></div>
<div align="center" style="line-height: 100%; margin-bottom: 0in;">
<br /></div>
<div style="line-height: 100%; margin-bottom: 0in;">
</div>
<div style="line-height: 100%; margin-bottom: 0in;">
<span style="color: #3d85c6;"><span style="font-size: medium;"><b>Introduction</b></span></span></div>
<div style="line-height: 100%; margin-bottom: 0in;">
<br /></div>
<div style="line-height: 100%; margin-bottom: 0in;">
<br /></div>
<div style="line-height: 100%; margin-bottom: 0in;">
This project, which I developed during a course in my Master's program, builds a genetic algorithm to solve optimization
problems, focusing on the Knapsack Problem. It is built on top of the
distributed framework Apache Hadoop.</div>
<div style="line-height: 100%; margin-bottom: 0in;">
The idea is to show
that the MapReduce paradigm implemented by Hadoop is a good fit for
several NP-Complete optimization problems. Like knapsack, many of these problems
have a simple structure and can converge to optimal solutions given
enough computation.</div>
<div style="line-height: 100%; margin-bottom: 0in;">
<br /></div>
<div style="line-height: 100%; margin-bottom: 0in;">
<br /></div>
<div style="line-height: 100%; margin-bottom: 0in;">
<br /></div>
<div style="line-height: 100%; margin-bottom: 0in;">
<span style="color: #3d85c6;"><span style="font-size: medium;"><b>Genetic
Algorithm</b></span></span></div>
<div style="line-height: 100%; margin-bottom: 0in;">
<br /></div>
<div style="line-height: 100%; margin-bottom: 0in;">
<br /></div>
<div style="line-height: 100%; margin-bottom: 0in;">
The algorithm was
developed based on a genetic paradigm. It starts with an <i>initial
random population</i> (random candidate solutions to the problem). Then, the best
<i>individuals are selected</i> from the population (the candidates that
generate the best profits for the knapsack). A <i>crossover</i> phase
then generates new candidates as combinations of the
selected individuals: parts of each selected
individual are randomly chosen to form a new candidate.
</div>
<div style="line-height: 100%; margin-bottom: 0in;">
The next step of
the process is the <i>mutation</i> phase. In this phase, “genes” are
also selected at random and changed. In the knapsack context, this
means that random items in the solutions found are replaced.</div>
<div style="line-height: 100%; margin-bottom: 0in;">
This cycle can be
repeated several times, until good individuals are generated.</div>
<div style="line-height: 100%; margin-bottom: 0in;">
The steps are
summarized below:</div>
<div style="line-height: 100%; margin-bottom: 0in;">
<br /></div>
<div align="center" style="line-height: 100%; margin-bottom: 0in;">
<span style="background-color: #444444;"><b><span style="color: #444444;"><span style="background-color: white;">Initial
Population → Selection → Crossover → Mutation</span></span> </b></span><span style="color: #444444;">
</span></div>
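<div style="line-height: 100%; margin-bottom: 0in;">
The cycle above can be sketched in a few lines of plain Python. This is only an illustrative, single-machine sketch and not the Hadoop code of the project: the function name <i>evolve</i>, its parameters, the "keep the better half" selection rule and the gene-by-gene crossover are all assumptions made for the example.</div>

```python
import random

def evolve(fitness, n_genes, pop_size=20, generations=30,
           mutation_rate=0.05, seed=None):
    """Hypothetical GA loop: initial population -> selection -> crossover -> mutation."""
    rng = random.Random(seed)
    # Initial population: random 0/1 vectors, one gene per position.
    population = [[rng.randint(0, 1) for _ in range(n_genes)]
                  for _ in range(pop_size)]
    history = []  # best fitness seen at the start of each generation
    for _ in range(generations):
        # Selection: rank by fitness and keep the better half as parents.
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        parents = population[:pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            # Crossover: each gene is taken at random from one of the two parents.
            child = [rng.choice(pair) for pair in zip(a, b)]
            # Mutation: flip each gene with a small probability.
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            children.append(child)
        population = parents + children
    return max(population, key=fitness), history

# Toy fitness (number of 1s), used only to exercise the loop.
best, history = evolve(lambda ind: sum(ind), n_genes=8, seed=1)
print(best, history[-1])
```

<div style="line-height: 100%; margin-bottom: 0in;">
Because the parents are carried over unchanged, the best fitness in this sketch never decreases from one generation to the next.</div>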
<span style="color: #444444;">
</span>
<br />
<div style="line-height: 100%; margin-bottom: 0in;">
<br /></div>
<div style="line-height: 100%; margin-bottom: 0in;">
<br /></div>
<div style="line-height: 100%; margin-bottom: 0in;">
<br /></div>
<div style="line-height: 100%; margin-bottom: 0in;">
<span style="color: #3d85c6;"><span style="font-size: medium;"><b>The Knapsack Problem</b></span></span></div>
<div style="line-height: 100%; margin-bottom: 0in;">
<br /></div>
<div style="line-height: 100%; margin-bottom: 0in;">
<img align="left" border="0" height="78" name="Image1" src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAASgAAABOCAIAAAAcvxy9AAAQVElEQVR4nO2dfVBU1RvHl9IQELEUSSigCQUiEtNCHUEkyHJ0FHI00BRFRcs0zPClmDQbNVHI3nxLEFdEexELxRcqwaIXRBQtYn0bXhQ0QTFWpDT29517Zu7c375cdpe7e1l8Pn8wZ88959yz957veZ7n3OWeLhqNRkEQhHXpIncHCOJehIRHEDJAwiMIGSDhEYQMkPAIQgZIeAQhAyQ8gpABEh5ByAAJjyBkgIRHEDJAwiMIGSDhEdKTnZ197ty5hISEGzdu5OfnNzQ0REdHP/XUU3L3qwNBwjMKtVrt6+sbExOzbt26jtOURfHy8qqurh48eLCTk5PWIY1Gc+fOnevXr1dVVbW0tLDMnTt3Tp48GYnTp087Ozs/8MADENvMmTPnzZt36tSpCRMmnD171trfoQNDwjMKDLLa2to//vijQzVlUTZs2BAVFRUQELB9+3ZDZe7evQtRKZXKzZs379ixgwmvtLQ0Njb2iy++6Nev3/Tp05Fz+/btW7duWa3nNgEJzyg8PT1VKpW7u3uHasqijB8//rXXXvv0008jIyOZonTp0qXLYI65c+eOGTPm8uXLHh4ecXFxMIlHjx7NzMxkxb7//vuQkBAr9t0GIOEZS//+/TtgUxYlJSXl2LFjc+bMCQ4O9vHxESnp5+eXk5MDK5eYmIiPFy9evHLlypAhQxScX7pnz57ly5dfu3YN/qeLi4uVet+xIeERBnFwcNi9ezcMGiLSoqIiyEakcGBg4L///svSBQUFQ4cOdXR0RPrPP/+EJRw7diyM5+zZs63Rb1vAloQ3cuTIxsZGBBW4rwiTWLCem5ubnJw8bty4wsLCffv29ezZE3/h+QjvcVVVFQtUWHXMviNGjBA2y6eDgoLS0tLKysreeOMNloMRs3DhwoSEhOrq6t69eyOeaWdndJuCS4buwQzy/ifaxF94a3wtlDly5AjK4ERoTdh/i/LEE08g2MMZly1b1uZq0KBBg1ji119/fe6551j6oYce8vLyys7OhhfavXv39neppaUFXUJ4ef78+W3btt13333IxMV8++23f/nlF5xFby1MCp999hlqVVRUjBo1Co5xampq165dcT0XLFiAWaP9HTMJWxIeBiKuOCSB0Yz7+u677yLT2dkZ0ciuXbtwZaEZBSce5ODi4n7jY11d3bBhw958803oBx9xoYcPH56eno5LzzcLZaI8Qi+0jJwBAwagkYMHD27evJkNcSSgNEze7eyM3qYqKyvDwsJYCwCixdSADvMFoFV0+8CBA7AhGDf+/v6HDh1Cmxa70v/HzJkz8/Pz169fHxER8cILLxhTZe3atbzGHn74Yfirt2/fdnNzk6Q/uLCzZs3CpGZvbx8fH4+7qeDUWFNTc/LkSUPCe//99zED9u3bt6mpCT1BIPree+/BFC9durRHjx4QoSR9Mx5bEp6CG8f4e+HCBTasARPGpk2bMF5ZDgwO/sKGsKGMIQuLxAsPRgOxx+uvv84LT8EtnW/dujU0NDQpKQnCQE5zc/NPP/0Eu8SXQcUTJ060szN6m0IPs7KyWFqtVmN8+Pr6YliwnG+++WbLli0QJ/PcEE1ByXDbrCY8Ozs7dKC4uHjatGnwBSCkNqs8+OCDwo89OCTpDG4lrgPuC+5Oa2srL2/MDhkZGa6urnprYeJAmArVIQ2H+Z9//unVqxcaQQKTGgaDJH0zCRsTHgPznFYOG996gRjgAbKLzkAaOVrFQkJClnPA+EAJ8KyEqpOqM3oRrrXAU4UxhAfLZAY+/PBDLw6+DASfkpJi0inaCcwLzDgmpqlTp8LYMu9OFjDNTZgwQcH53gEBAfBNWD5mBzi6cIz11qqvr580aRJLwypCsWzagnmE/2mVjm
tjk8LDONDKEQ8emHFDmIFpG4HZmTNn9BaDUWIeP3xI4UCXtjN6z8sSMCyQGWIVfjwpuHgPsd+KFSuEOdZ/LAZ3HX145513Tp8+zUy9LLDHErBUX375JbxE6I0/5MzB0nDmX3nllRkzZrCPMTExfDE4I5jUnn32Wb3ta1W0HDYpPFOBt4Zg7Pnnn0fs4eLiAu9Cy2nkWbVqFYwhBCC8VdYBmkfHvL29lyxZonUIJlroqQrT1qRbt24Qnoyq40GAffPmzejoaD4HoYHwQQW8R0PrTxAe1GtohVakorR0fuEhbkaAN3v2bBa8iQNXEz7V2LFjMbtbeXwzm7Zv3z5mMCF+thaKeA+jypo90cuRI0d+/PHHvXv3Gl/FVOuBrwlP283NDa5gQ0MDolwHBwe9JWHzH3nkkX79+vE56FhUVBT/UahJIXfv3sW3gE/B58CrF04lhipKTucX3sqVKxXcOpvuIX5wMzD0YfFgXpAJBYaHh1vt9xZwgyF4TA1CJ5MB24vOYIIXzuiw2PzavRU4f/48+nDw4EGTojtTrce8efMiIiJiY2ORTk9PT0xM3LRpk96SV69eFQbG//3337lz56ZMmYL0b7/9Bnv4zDPP8M8zNBrNDz/8EBgY2KdPH8QaarV66NCh7FBLSwsUy4SnW9GidH7h4a4oBK4IrjvmPN1iUB1uAFuDgQBwh2bNmoWbYYVfWlRVVb366qthYWHCWWD//v3sI8x1RkYGYj+hBc7JybGa8JqamqZNmwYlmHopTLIe165dy8zM5Jf1R48ePWfOnA8++EDvSWHrSktL+Y8bN26Mi4tDorq6+vjx4y+++OJLL72EyYIdPXToEFrDlIqYEHGEk5MT3+Ynn3yCu2yookWxJeGxx1kKziGMjIxctmwZ7lN+fr6Cs10wAhBMVlZWXl6egrMhtbW1yPn4448R4GFkK5VKODAYwUgMHjyYX65As/D7VSoVhj5bkMSgv3DhAgZccHAw5mw0gjKHDx+ur69Hgrms5nWGVdRqatSoUY2NjQMHDkRFBbcEh2mY1xU8T7h5kyZNwhwP64eTYpzNnTvXOtccXh/GNIasn59fmyU/+ugj9sMDM6xHSUnJ/fffz69Uubq63rlzB+oS/ryBB3cTcsIdDA0NhWCGDx/+2GOPKbjYLz4+HrdY6DggbIa7/uSTT6alpSHOR0kkpk+fXlRUhMYfffRRQxUtii0JTzdIW8ghzJnMwT8WA5itMY1BCfDl3N3d4XliKOM2I4dNybrNjhkz5u+//xY/tXmd0VuxoqJCKwcyFn6E5E6ePAlDDWV6eXnxjw2tAAKtp59+WvjM0xApKSks6NK1Hrdu3cJdQHxlqC5kcP36deGCB0RoZ2f3119/6S0PfWKuvHjxIiZHeKRdu3Zl+TgL9L9t27YNGzbwhf39/XG7cZEROKAixDZ+/PhLly4tWrTI3t5epKJFsSXhmQ2UNoKDzxnEIWOXzCCEw5pnhENbXl6+Z88e8WLQFYwPJoUrV64o9FkPeHdfffWVeCOff/65bqaIViHLxx9/XDe/oKAAYQWcCER9/OoLBgB8HL5MHw5jKlqOe0J4hBn8/vvvMOBbt26F36h7FJKAe1xZWQlj8u233964cQNGo1u3bgpzrUePHj3431gruMhco9GY8XuX3bt3owNwU+GuL1682AoVzYOER+hnzZo1Li4u8MfaLOnJgVHL52hZD6TRGtRoqAUEyQEBARj0arWaPU2BkhXcT7RN7baPj8/ly5dXr149f/5861Q0DxIeoZ+dO3eaXVfLejg4OLz88ssiwvPw8ED0FRgYWFxcHB4ejhwkhgwZoteZFCcpKammpsbNzU38n5gkrGgeJDxCerSsB+IxY2xXZmbmkiVLnJ2d2QJpenq6eWdnC5XWrGgGJDxCesyzHgMHDszJyfn5558hVCQM/Wylc0DCIyyCedbD0dExIiJC8s50QEh4BCEDJDyCkAESHkHIAAmPIGSAhEcQMkDCIwgZIO
ERhAyQ8AhCBkh4BCEDHUV4hYWF7EXocr1CiyCsSUcRXkFBQXZ2tkqlalN4N2/eDAoK8vb2Fm4tQBC2RUcRHvTWt2/fhISENktevXq1srJS2je67t+/35i3G5hdniC06CjCM57+/ftDeNL+09SJEydMEpKp5QlCC9sTnoLbY0TC1tiuJpYrTxC62KTwpAX+rUlbtJpaniB0kVJ4K1asaGpqYhtH9OzZs6amhm1lOHLkyLq6OpVKpdFoFFyAtH79enhrMTExel+rvmXLFpRvbGycMmWK1rvAUDc3Nxc2BycqKSkRHhLfurGwsBDNMsGgb/Hx8d27d09NTd21axd6griRhZehoaGG9vtWcG+DFy+P7n399dfe3t4wicOGDRO+hoQghEgmPIw/iIRfk8zOzi4vL2fpo0ePLlq0CMJjHxEdhYSEGHoXAErOnz/f09OT7RTn6uoq3DRwDEdUVBTbM5VHfOvGhQsX5uXllZaW4mhzc3NiYiJaViqV7E2YdnZ26I8xOyuIl2fvsc3IyGAfMQ3xr6wlCC0kEx5MwcSJE/mPsGawWvxHLd/MxcVFuGGdkOTkZPaGbVikjRs3wn6Gh4drrWQEBQWxrVsZ4ls3wtalpaVBqPyW3Cgs+Us1y8rKIDNh7Ic5yMPDAxOQ9TceIjo+kgnP3d198eLFra2tkZGR7A1to0ePNqMd4bvy0Q58OUhIfAlRfOtGuLVOTk685wnJwVn18fExo28irF27FmZWazbBlLFjxw4SHqGLZMKDLxcWFsZeiu7r67t06VJJIhyYL7ajgAjiWzdCZlr2zRJPAvD1dd9FBxtbVFQk+bmIToBkwhswYMCpU6dWrVpVXFyMRFxcXENDg9ZeApZD9q0bGxsb9eY3NTVZuSeETSCZ8NiPOdhaQn19/YwZM9asWdN+4TU3NxvaUZ5HfOtGGEPjN3bU2jHP+PI4i94CbXaeuDeRTHjJycm8C9e7d2+lUqm7ObiQs2fPGrPCcfz4ceFOn3oR37px4sSJCAItvbEj28tSKxNnabPzxL2JZMKDe1lYWMivYcDI+Pv780dZuq6uji0/wFC0trbqGiInJyfhzyAPHz586dIl4ca5ehHfunHlypVIJyUlCVf2hRs7wmDCRPPdbvOb6i3P+pCVlcU/1oPqKioq2vMidKITI+UD9O3bt+MvtAeBxcbGCnc/DgkJwdBEJswCRmSvXr38/PwwKKEr4T8ZpKam4qiCW/+AjBcsWFBaWurp6Sl+XvGtG/mjaByuL/qGjgk3doQg0bGysjJvb28Yxja/pt7y7CzIR2LcuHHoPKLcAwcOtLmfI3FvIpnwCgoKIDnIBtYMgU1ubi57qMCzbt06tVqtUqmmTp0KXzQ4ONjR0VEoqujoaOQrOFuBRhA1sUfexpxdfOtGdhT2k/WN7U3JH0W3oaK9e/eWl5e/9dZbbZ7LUHmc5dixY/CNWR/OnDmjdQUIgkcy4TEnU3zDRwxE/qiuKWCqa7MREcS3buR/yKILTm3Smoqh8pgmtHbAJAi92MyPpCHU2trakpIS9mtMYQBJEDaHzQjPwcEBkaG9vT3Cqry8PKVSKXePCMJ8bEZ4iN8QNyJOQ6D43Xff2dwO5gQhxGaEh/iQfvRIdBpsRngE0Zkg4RGEDJDwCEIGSHgEIQMkPIKQARIeQcgACY8gZICERxAyQMIjCBkg4RGEDJDwCEIG/gfQHqMcH4l+rgAAAABJRU5ErkJggg==" width="296" /></div>
<div style="line-height: 100%; margin-bottom: 0in;">
<br /></div>
<div style="line-height: 100%; margin-bottom: 0in;">
<br /></div>
<div style="line-height: 100%; margin-bottom: 0in;">
<br /></div>
<div style="line-height: 100%; margin-bottom: 0in;">
<br /></div>
<div style="line-height: 100%; margin-bottom: 0in;">
<img align="left" border="0" height="76" name="Image2" src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAKoAAABMCAIAAAByCbtKAAAP4ElEQVR4nO2de1QU5R/GtyJBC00S8dZPSkUISynvlRlmloGK2cnI0i4GhlYSh4umiZmKeSMlBTWVOKZpKiqJdlK8YKYhaSYoaWIiGIoJapkpv8/Z9zRn2uvM7IKuu88fe4bZeed9532+l+c7O/PiVl1drXPBWeF2vQfgwvWEi36nhot+p4aLfqeGi36nhot+p4aLfqeGi36nRg3S/8UXXxQVFUVERJw7d+6bb745e/bswIEDH3zwwZrr0QW1sEJ/y5YtT5w40bFjxzvuuMPgq+rq6itXrlRUVBQXF//1119iZ0ZGxksvvcTGgQMHPD0969SpA+VvvPHGyJEjf/zxx0GDBh05cqQmLuOmwYYNG5jSW2+99dq1a3///XebNm0ef/xx9hcUFOzevfv2228X086e1157TTQ5c+bM6tWr3d3db7vtNr5ikpl5hd1ZoT85OTksLCwwMHDJkiXmjvnnn3+g9vPPP09NTU1PTxf079u3Lzw8/Msvv+QCXn31Vfb8+eefFy9eVDismxUlJSVEQQshsHnz5vv373///feZtylTprRo0ULsr1+/fpMmTQYPHlxVVTV79uyHH35YaoKPNWjQgCjbvXv3t956y8PDQ/l4rNA/YMCAqKiolJSU3r17C15NnMLNraMeI0aMCAkJ4Qq5hmHDhmGnW7duXbp0qTjs22+/feyxx5SP7CYDeXDatGmbNm1KSkqyQH9QUNCdd94J/Z07d37uueek/c31wCbwtFGjRt1yyy3SV1gGHs+ZFyxYQABQNSrruf/jjz/evn17ZGRkly5dWrdubeFIf3//NWvW4PGjR4/mz2PHjpWVlXXt2lWnzxQrVqyYMGFCeXm5sFZVo3RoEAinTp3KZ3R09CeffFK3bl3Lx3t5efFJCjDYzxwSaK9evVpZWWkwgcTmmJgYtdzrlNDPcJcvX45zv/jii7m5uZBn4eAHHniAjCW2c3JyunXrVq9ePZ0+dREVQkNDCSRvvvmm2lE6ImALtyGAc+FxcXHLli0jTCppeNddd+lM0Z+ZmUkCZQMpLaf//PnzpaWl999/v4ZBKhoQp0YEQNuYMWOmT59u+WApLSFVevXqJbaxaFQktQARjOCmYaAOBFRbVlYWxGMBCQkJJESknPLmODEWYED/5cuXCfvBwcFoZ77y9fWVvpo3bx4pX9tQlRZ+qHeKtxkzZjz55JNPP/20kiakOolpZAuugPH6+PhoG6gcFBqYIyr36NGjixYtkiY3NjYWg+vTpw/bhNk5c+bk5eWRGs2dh0D16aefElELCwtpBU8zZ85EXTPF77zzDpFM7cA4FTmOUN+sWbPJkycj2uVJWjnuvvtuA/q5TCiAad1/A8Mvv/zipYeGXnTK6ecy0tLS9uzZM3ToUKQpdFpt0rBhQ/mf9fXQMkYjQBKhCBeh2nn99dcfffRRdp46dQqZQpISx1y6dImpOXTokBAfJjFp0iTkatOmTZHT2CVydeLEicRqXJah0ovyIeHxzA+hEe2G2n3ooYdsuUDo/PXXXzmnsOzff/8dS6UKwCx0/6V/8eLFKCrNHam47cN0k8B69OjxyiuvZGdnqwpodgQ0E1QaNWq0Y8cOJki6IUGVwSfDE38SCfAYb29vc+chmCFm4V6nr52Irkwup2WDIgp1rWpUWFtGRgbnocrFAjRe27+Afi6NpC5cCEkv1LTwcnK/OGzLli1cL+FKc0fq7vpRWSYmJlKWHDhwoEOHDpp7tQUEfFERMd1+fn7SMLZt29a2bVspLGGdRIV7771XavjEE0+8/PLL8rslL7zwgtjOz89nukXWQJ2QEeQ9GjQ0CSxyhx6kfBRSfHw8xZhCrWcMSfxDP1NNwSUUtLwooASglv7oo4
+0dSGgenweHh7Qf724B+LmAcGQCvO9996TkivaQtwgkwCR8hCFQ8sPoJCRtjEd5pdS22SPBg3NgZH00AONRvofP348wyNXqroPIyDRfN99961cuZKUZLCfz/T0dMKw2jMbQB39mzdvxsBXr16tvIkS15GDKDpu3DgyMe549uxZrtxkoUyo/+OPP6QbI6dPnz58+DAzLh3AHBkIlIEDB5rrFPqxKnM1rYWGJoFvUCqjPBC/2AGynGyiSvdINFNBIEglE5eCP3mBSybaqRqYMVTQz/WgMjZu3Kgq6yt0HQkjR46kuAgPD2f7s88+I+fNnz/f+LCcnBzY9ff3F3/u2rVL929gEFi/fn1oaKjY/v777zGXTp06SYWoHMh1bHrs2LHSHtxXhDfLDS2DiI0YRKnMmjWrXbt2eOrbb7/duHFjJW2FxCsrKztx4gTOYLAfs0hNTcWk1A7JGErpRxsTx+BD7Q07Va5TXl6ObJYkd9++fSMjI5OSkow75chWrVpJbvHbb78RvaU75BTcxcXFLVu2ZJsZ3Lt37zPPPEOowIKlA9BNlHbwQTlz4cKFbt26ia+oKglv0G+yoVpQAVKPoAZSUlIwo379+sXFxUnjNAfh5XPnzl2zZo18v1CCVF7PPvusQWGlDYroJw5TFFEOSd5m4UgK7nfffVenyXV++OEHcdND/Ilup7jft28fGcTgSJgjAMAiFsDnzp07UewXL14UdxrWrl0r3ZxgDBSHs2fPbt++vdScygXbojTnoogTlA+ShTHpw4cPN9dQG6AKwRQdHU0xkpmZGRUVZfl4QT/Ry8BQSE9cINkQV7RxSAKK6CcBU8iShKweiaW3adNGZ+RzEMOICbPmGlIv9e/fn3AnT8CYAuxS9RofT+iDHrQVrRB9REj4o4iHuYKCAqZP0nH0i1Ey78nJyVJzX19fEicxmcj81FNPUSCwwRhyc3MxtXvuucdcQ1tAfFJYTzJ+iI+JiTH5FRapuaYwgPWzEH8OHTq0YsUKy4dBMDUh2Q4KdUaug3utWrXKal8UM8Y7TRoNchoXP3nyZGlpKV7F+SmO2SYLUM5J8UOAOIGipK4rKioS1hkQEECkKSwsRC5wMJQPGDCAszHj7u7uFhrWDjBNJlwUewagKEUb2asjK/QfPHiQyWVmieTG30IM8vv48eNM5bp161CkUp2jzXWQx9IvRjq9NRDYLWjmFnpIfzbVw/gwdDjjIY+Q1Em9YidRVLpFCBrroaRhLQAX7969u8mvyPp27MgK/dQtBFWTUcgA/9NDnpPkrkMJzqkwCHPNe/fuTYEQGBjIXCPERAoXt7e0/ZYlByK8pKRkypQpaO8aalhZWYlTym3XHChqYmNjVQ2j5mCF/oyMDM2nlrsOFzx48GAL9GMfOj3TaDqkeHBwMH+y0bVrVxS+5jEI0DtJwcfHx/Kv1bY0JER9/fXXFsSNhBvqWYcafNRT7jooOCVOzGEUfvHx8Z6enqKIoNS0y2CEmqvRho0aNdLWxXVEDdKvzeeCgoIQm7t27cIU2LD6bIwLtqBmn/PX5nMoXjuKWxcswPWah1PDRb9Tw0W/U8NFv1PDRb9Tw0W/U8NFv1PDRb9Tw4Hpz8rKWrp0aYMGDUJDQ/v161cLPRYUFKSlpR0/frxnz54REREanuG80eCo9CckJCxcuFA8sFVrnQYEBMyaNauioiIkJGTlypX0rvY3pBsNDkk/BCQlJSUmJhpzf+3atZSUFPET86lTp1q0aBEdHa321ddLly5NnDhRvGht/K2Xl9f06dMfeeSR7Ozs2ok6NQeHpL+ysrK6utrkkx2jR4++evXqnDlzxFOgMTExRGnihMIz07y0tJS2y5cv/+CDD8wdJro2+RSaY+F60n/u3LmSkhLpN3I3N7fAwEBt70QK5OXl4fplZWXSScaNG+ft7T106FCFK0uMHTu2YcOGVVVV0G/hMFsGeUPhOtCPdy5evHju3LnMsq+vr/SKGiE6PT1dPMquDYsWLf
L395f/7o4wbNeuHd0ppN8Rf7O3BbVNP2k7LCyMlIx7WX1sXC22bdsmnu2Xg762b99u345uGliiX/50fYcOHRC9+/fvF8/w6/RPoaOqVHV25coVWg0ZMmTEiBEaxmoZiD4UH75usL9+/frHjh0jxdjr4eibCZZmZOvWrcXFxX369Dl8+PDatWvZ0759e+xg48aNqampql7dEpg3bx61k+3cC7lg8K7ZhQsXMC9jjsViaOgMCy97q4XI/RYeXXQUWF/Xb8GCBT169IiNjYVynb4o2rlzp7YcyRmysrK0DPO/2L17N5/yx7R1+nd+daZEmbASJc/gKgfKH2Pau3evo69TZD0eIpom6NGzZ88jR46MGTNGG/fnz5/HQT08PMR7ICaG4uZm9czIxvXr11POJScnGyyPJog3/t80Yo99F6OoU6fOqlWrBg0a1KpVq1GjRhmveekoUJQOqYDFcjdMurG2ErD6Ijf0UyhbWBcI5b9p0yYLFlBQUPDhhx9mZ2fHx8cbL2YkODB+1Frssft6UkTEjIyMYcOG5eTkREVFSW8TOxaUqqHJkyfjdmlpafJVEeSw+iJ348aNvby88vPzNRfN6IZly5adPn26b9++O3bsyMzMlPs0cYXzGy8cSsjx9PS0O/0zZsyYNGnS5s2bO3XqZN8z1yaU0k/wZ+qx8cTERJO3w6y+yA09fn5+3333nbnXlxTCx8dn5syZZCKm3iCWIEuNMwt7goKC7HujBiMjCaKHHJp7nUL6oRzvR++gdLCD4OBg+V0U5S9yx8XFkbapzm1ZjUinf6GMz5MnTxrsDwsLowv0v3R+Iv/Bgwfl9lpRUcFOhcssmEN5eTlaUgzDoWGdfrjHq8RdbqT7nj17hg8fDuXiZSVVayDgrGTu8PDwJUuW2CKXzLkymXjq1KkkBUSZ2LNhwwYiv3hfX6dXjoGBgfhuYWFhs2bNzJ1fLFB++fJlbcNwIFiiPyIiAk+l6CfS9u/fX6efzaNHj1ZVVXXp0oVMjzWoXQOBuE3KxJ7QCqRwhKSNkUAOEjwjhGxEAMPLzc1FKq5bt056RxitgMsSNrZs2TJkyBDjMxDSMeiff/65Xr16CxcuLCoqqlu3bkJCgu2vmd6YsES/KPTlCAkJqayslO9R+yI3BIwfP54AgIrks6SkRHqn393dPS8vz8a77hgWJpuVlYWdNW/eHFkuX9wefyVunTlzBhlrsjk5zpbeHQ52uA+qYQ2E1q1bT5s2zfauTQLHff755y0c8NNPP9n95wYHhR3or/01EMzd4VEIEpYtC6HqborbvQJ2oF/z4gma0aRJE9J5fn6+hraIg86dO9t4E5AkxScq0paT3AiwA/2aF0/QDFSCWNOybdu2ah+5xFhtifwoFYoXDJ1iUloLznFhn99ANS+eoBlUItSZX331FbK8V69eSlYdE7CF+4KCgvnz5xN7UJd+fn6az3PjwIF/Avf29o6MjKzNHgMCAuy1yNsNAgem3wXb4aLfqeGi36nhot+p4aLfqeGi36nhot+p4aLfqeGi36nhot+p4aLfqfF/2UqdCW9vE1sAAAAASUVORK5CYII=" width="170" /></div>
<div style="line-height: 100%; margin-bottom: 0in;">
<br /></div>
<div style="line-height: 100%; margin-bottom: 0in;">
<br /></div>
<div style="line-height: 100%; margin-bottom: 0in;">
<br /></div>
<div style="line-height: 100%; margin-bottom: 0in;">
<br /></div>
<div style="line-height: 100%; margin-bottom: 0in;">
<br /></div>
<div style="line-height: 100%; margin-bottom: 0in;">
<br />
The goal is to
maximize the objective function, where v<sub>i</sub> is the value of
element i and the variable x<sub>i</sub> indicates whether element i is
taken or not. The maximization has to take into account the
weight of each element, which is bounded by the first constraint.</div>
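<div style="line-height: 100%; margin-bottom: 0in;">
As a small illustration, the objective and the weight constraint can be written as a fitness function. The name <i>knapsack_fitness</i>, the 4-item instance and the choice of scoring infeasible solutions as 0 are assumptions for this sketch; the project may handle overweight solutions differently.</div>

```python
def knapsack_fitness(x, values, weights, capacity):
    """Return sum(v_i * x_i) if sum(w_i * x_i) <= capacity, else 0."""
    total_weight = sum(w * xi for w, xi in zip(weights, x))
    if total_weight > capacity:
        return 0  # assumption: infeasible selections simply score 0
    return sum(v * xi for v, xi in zip(values, x))

# Hypothetical 4-item instance, for illustration only.
values = [10, 40, 30, 50]
weights = [5, 4, 6, 3]
capacity = 10

print(knapsack_fitness([0, 1, 0, 1], values, weights, capacity))  # weight 7  -> 90
print(knapsack_fitness([1, 1, 1, 0], values, weights, capacity))  # weight 15 -> 0
```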
<div style="line-height: 100%; margin-bottom: 0in;">
<br /></div>
<div style="line-height: 100%; margin-bottom: 0in;">
This problem was
chosen because solving it requires only a sufficient amount of processing,
and this processing can be done in parallel. The space of possible
solutions can be seen as a set of local maxima that are
explored by the genetic algorithm. With enough computation, the global
maximum can hopefully be reached.</div>
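<div style="line-height: 100%; margin-bottom: 0in;">
To see why this work parallelizes well, note that each individual can be scored without looking at any other. The sketch below mimics that split with a map step and a reduce step in plain Python. It does not use the Hadoop API and is not the project's actual job layout; <i>map_phase</i>, <i>reduce_phase</i> and the small instance are hypothetical.</div>

```python
from functools import reduce

def map_phase(individual, values, weights, capacity):
    """Mapper: score one individual independently of all the others."""
    weight = sum(w for w, bit in zip(weights, individual) if bit)
    value = sum(v for v, bit in zip(values, individual) if bit)
    return (value if weight <= capacity else 0, individual)

def reduce_phase(best_so_far, scored):
    """Reducer: keep whichever scored individual has the higher fitness."""
    return scored if scored[0] > best_so_far[0] else best_so_far

# Hypothetical instance and population, for illustration only.
values, weights, capacity = [10, 40, 30, 50], [5, 4, 6, 3], 10
population = [[1, 0, 0, 0], [0, 1, 0, 1], [1, 1, 1, 1], [0, 0, 1, 1]]

scored = [map_phase(ind, values, weights, capacity) for ind in population]
best = reduce(reduce_phase, scored)
print(best)  # (90, [0, 1, 0, 1])
```

<div style="line-height: 100%; margin-bottom: 0in;">
In a distributed setting the map calls could run on different machines, since none of them depends on another's result.</div>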
<div style="line-height: 100%; margin-bottom: 0in;">
<br /></div>
<div style="line-height: 100%; margin-bottom: 0in;">
<br /></div>
<div style="line-height: 100%; margin-bottom: 0in;">
<span style="color: #3d85c6;"><br /></span>
</div>
<span style="color: #3d85c6;">
</span>
<br />
<div style="line-height: 100%; margin-bottom: 0in;">
<span style="color: #3d85c6;"><span style="font-size: medium;"><b>Validation
Results</b></span></span></div>
<div style="line-height: 100%; margin-bottom: 0in;">
<br /></div>
<div style="line-height: 100%; margin-bottom: 0in;">
<br /></div>
<div style="line-height: 100%; margin-bottom: 0in;">
To demonstrate that
the algorithm finds a correct result with high probability, it was
tested against two validation datasets: one with
10 items and another with 15 items.
</div>
<div style="line-height: 100%; margin-bottom: 0in;">
<br /></div>
<div style="line-height: 100%; margin-bottom: 0in;">
With the validation
input P01, we have:</div>
<div style="line-height: 100%; margin-bottom: 0in;">
<br /></div>
<div style="line-height: 100%; margin-bottom: 0in;">
P01 is a set of 10
weights and profits for a knapsack of capacity 165.
</div>
<div style="line-height: 100%; margin-bottom: 0in;">
<br /></div>
<div style="line-height: 100%; margin-bottom: 0in;">
<img align="left" height="188" name="Object1" src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAARMAAAC8CAIAAADU9t1gAAAgZElEQVR4nO2deUBM6x/G32maVu2LFE0lskQhS6mouJdEiy3SpihtQtmKuJaoqFCh7EsKFfcm/MjNFdnXSwpX4WohdNub5TeV0jKnctw7Ted8P/4wnTkz877f933mnDnnPOcRZLPZCACA70SwqxsAAN0SUA4A4AGUAwB4wKWcyjtL+o+M+Ht8/Pt0W0UKu/jkBOVZ6QyhScnvz1nJUhh5UaPUvO4PDM5+uFKL1vx1jL8iR2j4PjJM+HB1lhyF+3tjr8P6cjt2581B7ouM5Kl4Wv2fwHwTa0hfmKXkmfVi12ix+kVVd5f119v+ZvTuvEw31VYt7UwFuhLWpzv7g9ZEJqQ/KapBVGlNQxuPdZu8xivR2ntN07hI5/8LvWuvRLUFV3auXheddO3lFxYSlB80wW75ll/sdcTyvu9z/5VRwKUcUa0pYyUjEh6dz66wVRSvyD73gMFZWnPvXHallYFY+Z8XsxFSNjNVbV1vas+pEUnKX+QNJHA0l1V4drnXmvfrZizkJ+VQVSw8xwlm/X7mxNPQ0XqinCVVz08lvUHUcR5TVfinmZ2C/TnDb8z48BwkO3KWl7OG4LvME0f3LzVJu598/4BVT6zONBsXuR8Z3w5hFZ9z15uy/x1SGjt3sZEqyss4Hh/pdPl2yeOLVt/3Tj80DxvBt7fWY6j1CGrClRtX82uM++ZduV6C6IYqedf+yHhTa0B/mf6gEvWYYi53c8vMJTvOPXhfITloRtCBON9R4oW/+to0ab3yWdzC2cuOPi6V0nFaNy93mX/miH3vshwaPoFVmOpn6B+eWSQ2yD7qbOzs0kCt4WH5nCcCBwonRry4NPqMj8fWU/eLGEiUPs5lc1zoXE0R/FX4AQSUfvaaKPJ7WlJidrDeMBFU/eL0yddI+CePSUq1ucd8nFfsy3xXjaS1Z66J2+s7ukezV1beXdpfL7zUNv1dvEkPzoZ6pJrXA4NjRdfmKrCKfw/1aFk6KfQpK8LjP+1y7Yu4xXWysU16etS6XiistU5eQybEHPHe4vNziOQOHa2VuWN/CVQ6Fnb6eZnkEMfI07tty9Y0G5ewyw6HbfwaxlcyN0RXa8Uby90bxSNWHc9mac6KTPSr3jBnefJLZt/Zu88fnq8pxPmayT22rJ0SNacme/dSjmx6zj//JPZnOQHOEubqOZ42W4o/vvrErF+j1Zxx7CuMWKX34/x9Nh29ll+BxFTHzl29c9uCYZKsFvOQVZIZ7um19eSDYqaIqrHr5tiQuf1FuQ2BQIv24FMORXaUlTa68ujivU8rZe5ezEWafg6Gh68dPn+3xE/87tV8JDB2xH2XnwMyZO32/+EtlTR/+jKLHjq5ezWajVNujN2Co497jA88uqTfo9DFB1kI0YQFKai6/ulboTFabgFBPUPWJR3x9Js97Zj3sbA/jP1uKrsejfcx/BhusuxE2cyoC746Vb+vcw5asmysSZJtry75jqfIm3qa90hLOnUqZ8OwoZRXKYkvkbilp5n0q5jJ86Lv9fU5de3nx77T1i+b2WdczlbZjt+QkX9w9s+rrrQsXYx86Nz/tsusjzeSHiLUa+6SSY3bFwEZI59FWjGBz9MuvN40R5izC8HI3Pti38lbfg9Xm7sfcp1vaHqu+biMU0g/3FQWzmAi9M9vYVlhwVHSAc7RiW4m92wCgvfS1ztEJCxdu3DmcVORFzG2nS0R68Ot1OcIqdguMpb7OoWpilN3X5taV7G/cur+bjVnLJMnVyc7jXdLrtH3OxA1jn15vXuEu2mpcm785Gbvy3yfYP+T3zmhyZsSFylnrl0QOc+8x6AHi+61HYL9ZlLNN1E4jxAI9jKepIYePj7/rFgp9TFbetrkn6Y9EjqY9dufRQoXnyE0cEx
Z0rYaNMA70M5wAK1/oFHE3NN7b2/f0qwQmYn3EVLzigi00xG2UXkQrxf+BlGamtZ7bkx0gK7Qe8m0JOeb2Q8LBacMHSTHGboeaiNGD+n5+CBn9/Bz9s1b2X0m2p4sXC3dlQc6KDJGHpYySccST+euHSh0Nv45krL1GCcnKGB94O44AQWtQX2EdGYPWH/nz9uvylHHymEVXY690qZ0Icur/+MuM0sLvnD+k1OXa7aTTZPvK4/Q8y9/lzIaxkbbN8h+jCZNZ5Xd+pRdN0/eZ7iafBsXpXfprd+1v+s6zxka+W+2Ry99LD9nvc/MgUWfIiPc7j7P/cw0VenZ+RIxywr/4fwnS5fB/M3Vas4UVOndjUwpReqrYjY66QgjU7mb8QZ7kndd+zxJp+k1rOLL0WkVSN13y7KZQ4Wn6WqMz/ooJVlwicsQ7DCbINHs0/DWX0RzspHM1iO3Lv2hcqeUqjtlcC9d84Howq3z6Yq3y5Cq6TB0gbNWdsBA4YDGl+TmlDC/FaK0oBTVj1NdC4R6DlBE6E3zKgzrLcyZlOKKspxCMRmslh+uu/LQL7nuWw+vcTm8hvNqrblRZ/a5anXN7hoHyTHuMxSPxSakPLcVP/4UyTu760shdvHL5CDvXReffaj5uhqL2c5Z56anGF/efUJtSlcx5L/uMlVKpW7OFucU1aC+jdOitiinmPOfrGrjfJVpeESVVJJE6P2nDxUs7u/WiJyarCASEJetO3gip84ZTGrDYxaDxUbsqs6XiCqpIsP578PLD7XfmoeYNUyqUON2t/WcYZT8VcR5N4V+CvWNp8lpyCFUWPS2lPlNOYxPr4s563C+IOrWEe5jbNWHs2P4LJPLEJQwJ0g028Tj/uYS17YaJXTkQtKxMy+R1sLhMkJKBuNVUOSZo2fzkNTcaSP73eGs1G/VpRSnPg2fISCiqMw81FQICcU6AZe8/shAvak1Bc8KW70/BevHW11lqQomgSnZgbUfc+78nhTmufq4t7/dzLPmUng786OIj1g4t3dsROKxI7RHqJfnwpESqPLOZqd1v/09MuRu7lLtgq06AwOyW72IIiDI2e2o/FxWd3iF8fGvjw2LBaX7cAa4TemEpOn/bZcFZMfMGI6u3YvfkhJ4em7vuk9mlWRE7uHsCPW1nkz/qhzOQNUgOaHa4pcf6uaignjTzj/GjG8xji3HtPJ+RyVq3jy5MTa6KPNBfMSFNUcbjlcwi8446Npljd56NpjLZ3EqKaOmQEHPC7M53wXKIqimOJfzLSDcS1Wq2aTnVLvu+6LkdUktUhaufnVmX0qelNEEFW5D0HLPGLdyKNJ6VkPQhbtns1FPd6PeNCSq+dMI0cizF/9CtAnWuuqjXEyEblw5cey6qYvqi30B2+6qrUw5bPqtEPIG1tro1pNdy7boLNa4F3LgbYcfKCjKGb3X50+lGhp/2D5z4ZVB6w8HW/RR6NOLI0GmXI8uPTMlOsTZQT1icwhnd1TVz0WX861aUf1PFWfu1ZR/yLkcExaXx1np/Z17r7W/fUcL9dLRoKL89M1hp4QG3Q870bDJZQsomLQt3eltPZz1bC//p12maTjvWhlrsOWsnfaYFIeJmtS3104c/aMAqS+KWaItjPLqV8rduXSLlnffB5viPyKaie0IiWbjMm66RgdboJawOyxRi+b1nR+5ZM+48BPW2nlz7MapCby9euxoZpHgqDHj6MKXub1CQHHiYkvJaymR7msGrzWuPR90sBgpOPoaSaPPzdYxc5sgcuPSzhXhw7xVb66fvzWr55KbTr5cZu9xF+nm746/+NSehlM00d0XSHjE5AF1G2CJoRba6Oxtzr6wlZ40TdXpxLm/F3pvWzDxABJSGeu8YZ2tGq3o28uFtLyO7rg9Y3nSOtv7w5393bWvbX2C2jtIKD5ikevQ1Mhra+etCs2MDffwCAixNgji9EBx2OyQA2FjxXH35N9AeIC968DNAc+QprPjoLp9KDG9lSF2mV7H1v88Udt
u65GTivOnRW1zWDvIu+klAj2tdoTbWPsnbZo5b7R3aOD4Kx6/11Yx2IJcStdXlRoS7uH633aZIqm/KfNe/6DV246l7NxSi6iyWmbuazdsXKjP+U3OaFhnhJfhk6BZQc8rZIYvjN43R4WKWE3jspx2yvp7Pq/jErVsnrRx2I2bWmsCIxJORt5gcP7uP9F7S8hGR22RPK7K4VTY5uCVmGVem3Y6TglD4uom3keigifKUJjflIOovecdT32z0GvbKptzSKi3oceRuE0jJUVHcJm9Ld/8B762hLXX57LXN2/DglvsBd/+VjRbk/J0TYuXqC9+yF7c+EcPbZfd6dMPqSj3EKh5tnHoKkSTlBCiCLZYR9L8XE3jboCsWcTDyojGZ6Iy7KLwt/3fR2jA6qfs1c0WCGs6H33mfLTp7+fMXQ0P5jftO4sP8T794ts8cVrU+IhL6VA/Ox50WUBqmHNEmnME5gqiAxclZAcmNn9Ji3FZyfb/+qj5OCo5ZrEdvz6WnXG16bdMxyVq1TzZUW5RF93aVAFzziCp4e5x193jWq7eco5RFUwDk58GtlyFym0IWr5JO8/9p7AKjk+m2/1PdHJofMDQV5E7n3M2/rN0sQ7mAwCf0WXKEVCauS/xgcuyGH9zQ0SRHWy57vBu+97d7Kw7QF668Hc1rY9lyEXLkK5rAPAdtNzDAeBaaQDABSgHAPAAygEAPPBOObVvf13j4rvr4qtyqrzOjFW7Y3zHyNSdgK7JT1mzYPne9NzPbJnBlstj9vobyRHwQEFNfnLA/KXRl19XitFNfaKPbDDvVV97rLKQm5oXu0y0A1RS3yaaSXS8dpfAK+Uw8g/Ntd2vFHWr1L5f1Y2NP02YHqCfG60vVpu7y8YuYdihO19s6BWZ683MZq40fBlr2LWnNf99anN2WtslDj/2Z8U0xfwjDka287SeXnTqTcUqS1e3t+tglr9/cmH3Kp+tN6qFZ3R1Y9qDV8phsxWsN8ROtxskQUUSo2yn00NSX/zD1hdjC6o5bD9ka91XTACJjZptSQ+7/LqMbSjOh37JH4BZdDXhqbpHvIWaKBVp2W1aGjp876VCBydlrLIQq/udh/X+xEz9deUmC6OPKrm7lXV1c9qDV8qh0S2XLP36uCon6WS+qqWuNGd+CKnb+LjVLWRVvL9zbPPBzya/GPOjzfjHoNRdV8RishpObVMllSRq8x+8q0bKohhlISsCvezOvbbjPKi44dbVbekAnh8hqHkdv8B8M9v3f8u0hRuXsQoOGfRyuomoWvOPps1RJd5RCwFFo1kDPcODU+bsma749sym0Hu1tUNrm19dzK0sAD/D01nKLr0bMdv8l1K31CtBBs0MdgJKjlnMWcXZ6Xt8HPXthR7G22C63rsptP7epw7lO/npyC6QHGzpbT9War9Sk9kDqywAP8M75bC/3NwwafKBvpG3k+2bDPTMjzf2xT7R83AdLimqMGjKknXTdk49/rjCpie/HlHBjbD6jPCM2TvqdFF1z39w2FAD1fqNC9eyAPwPr5TDfH9qvkWkQsT9A/bNb4lDpVVnhXtHl2leCDJRZLxKjTv/ib6UTrwJxPgr2kg3yjQtc8NodHfHkjj2nFTDOlMaRlkA/odHymEWpIYmfShBLprirg1LaP18r94JGSFpHJ4a5r1oLn1zQTWlh5qx88GTnv2IN4cE1Rz3bM6wnSQXXCmkOGxe+Lkt+nWXhWOWRbRLW9uFlKZZalimlSJ2bS0DoQmyQjSRwYHXs9YO4btffzxSDlXF9RbbldszAlJ6Xodvex3m9hyBoIgP9Ux46tlqKXZZyIrk5DNNNyXgb4h3HAsAeAEoBwDwAMoBADyAcgAAD6AcAMADKAcA8ADKAQA88E457TrYuoGT6cf4cnaivOUlliCNWn9dmqhe2L0Mn760xhN/DdSd/hscmnPfj4Dngr+PbjAfeKUcTAdbt3Ey/RCs8o+lIqMP5l53VGph+Gx+4o/x9oiV/u4JlnRSy6b
bzAeeOdu4O9hEC7qNk+mHYJYVlVNl5MWwfdLMwjO+q/KcfnPrJ8TDdvEb4GxrA4aDrRs5mX4IZllxeVV2yCSNObfzamR1bFZGRy1ucb+B8jtblmca7dijQ9or1urpRvOBp0cICO9gw4QipmEyxWyI59q0cT0/Z2yynGK+UDMnYar8V+2wCn4NOkRbkDGBeHZYwsLT2Ut8BxsWwgM89p/++ljM1D/Yctfs+McVU00a7qPNeJMcnkl3ixrAdxcEA5jwymVAGgcbV9jlL6+m56lNNG2wHrFZDCQoSWvcvrAK04886TPLHBw63QleuQxI4mDDovZVnItV7uJLqSvGSHzI2BaQJmLx25DGe0NVPL/wXEwvuDcIpzvBq701LAdb93Ey/QgUabMdKevd3C2UAksYwspj7CPObx/feMcBVtnbvAqZSfIgHNSd5gPPfudgONi6j5PpxxCQMfBLfOTH9SklhxuVDrxuEJ/SfeYDiY5vAcC/CCgHAPAAygEAPIByAAAPoBwAwAMoBwDwwA/JU9wjmYgJ8/3JWQNnpZlcKEz+qf76CXJ1v7OAP6cJzOQpjEgmHjWLxzALknyX3xITaqo6ubrfGcCf0wqMiCVhrEgmIgb+MQtSlgSV+0dO3+ry9usSMnW/M4A/pw0YyVPYkUw8ahfvYBamLFlbvjzNnh6xtXEZibrfOcCfg02riKUOI5kIArPwzJKgihVp81QFH35b2nEiFcCvdHXyVLuRTISBWXh2aVDlirS5fQRRZfMnyNF9QtLFyVMIO5KJSPxzKy4t54+0oYqLOXWo+fS5HFn1Nwm+enFxPxoZuk9Iujh5CjOSiVhIT00tqf76uPLu0gET/orNqT8qTY7uE5KuTp4S5R7JRBYwEqnIC/hzWoEdscQ9konAiI7Ynvep6S/Sdb8DwJ8DAMQGlAMAeADlAAAeQDkAgAdQDgDgAZQDAHgA5QAAHniunDbWrg6WEwbmxz9CXRcE/5pTIabx04oDR1YZydYb+/KSV7ssP/zwC5strG4RELvTTUcCbssOzrbWtLF2dbCcMDDfxdvb7FXZ/bDMWvGv2Fk/e2+fkrVRV4Txau9s+7PG53MuGkpVPdluMtpy3cjsbSPIc9vgNoCzjRttrV3tLycOjLxT22/qBj+zVhMRQAPdfsv/6j6pyrv+QmzM1mHSnM2PWL+JE5QCr+b+wx4hQtatDjjbuMHN2tXeciJRkXP5tXTf/BCLQcez/mb3meAbtcffUE4AiWnPGo+8D/7vrb6VcuWDpNRPw+brSZNVNgicbdzgbu3CXk4o2NUlHz+9upGheu5a/nDqo502Jpbug3ISp8oJKFjs2POrkXWfHhKitf+IT45Md+0Ld2bvFvDqWmkMaxem5YtYUATFRITlLFa4jJTlFHy4a8CUjban/qyYaky5u/bnBc8XZH30Hy1T/eLEgnETnVQfnbBSJOt9CLoRPFIOlrXLMRvT8sWbhvEIUfVRKrX3i6vYSJxSdzsTFqLW5bvX5J1PytVY7jBSlsr5naNp4WzgMSfpWaWVonhXNxjoCB4pB9Pa1Q9jOcEQGejgqr4rcEu6YbCJ6NMDm9MEzBIHiyKakq622Nb4szlz3AaIsoqvn8qq1gyg850VBeACYQ8D8xlCA5edOVA4z1FN5O9qsf5TV53eMbHuSIDMpN3JAR6LTVUCGYIUirTuvMPxzmpkHhNwtrVDS2tXx8sJghB9RsQfMyJaLaXKj1uR+GBFl7SIHwFnGwAQG1AOAOABlAMAeADlAAAeQDkAgAdQDgDgAZQDAHjgmXK+nJ0ob3mJJVh30QkHUb2wexk+dVc31uSnrFmwfG967me2zGDL5TF7/Y3kiHhPcgxnG7M4feP8ReHnX1ZL0EfNCz4YOkud70768ZTuUhBeKYdV/rFUZPTB3OuOSi2uZqzN3WVjlzDs0J0vNvSKzPVmZjNXGr6MNSTcdVsYzjZW0ZmFM6NFIu6UzlP758qq8da2McOv+WoS66K976H
7FIRnd2QvKyqnysiLtb4ImC2o5rD9kK11X84zYqNmW9LDLr8uYxuKE8yjguFsY5dcjbks5359lqYoBYmO9wvQ3bMl+bWnP8Gud+083aggvFNOcXlVdsgkjTm382pkdWxWRkctrkvYFVK38amfRayK93eObT742eQXYzmCyQZhOttq/r7/iq3u07thd0RAst9Qqb9v5lUhPpwovKEbFYRXyqGIaZhMMRviuTZtXM/PGZssp5gv1MxJmCpfvw1iFRwy6OV0E1G15h9Nm6NKvKMWWM42scov1bQeIrX3/enDIwenvg2WoFU/r2R1dXO7Dlb3KQivZqnwAI/9p78+FjP1D7bcNTv+ccVUk/rMCwElxyzmrOLs9D0+jvr2Qg/jbXoS6xgBlrNttKiUcG1uJZLXt3OpUFYVrPxcIywlSmJbm0D3KQiPlMMuf3k1PU9toim9/rYubBYDCUrSKIj58ca+2Cd6Hq7DJUUVBk1Zsm7azqnHH1fY9CSYRQfD2SakotdPIPF2geKGzXE2iPlu373PvWerkzVel0M3Kgivtjm1r+JcrHIXX0pdMUbiQ8a2gDQRi9+GiCEqszor3Du6TPNCkIki41Vq3PlP9KV04t00CcPZRhE19p78Zf76BMcj9vTPF4ODnw3zs6YTb2+101Bku01BeNQoirTZjpT1bu4WSoElDGHlMfYR57ePr8vYlTQOTw3zXjSXvrmgmtJDzdj54ElPPvw5+MNgONuQ3OTo5CUuLiMkHMuE6SZeR0+4kNrYhijdpiA8a5WAjIFf4iO/tsul9LwO3/Y6zKt2dBncnW1IQHbsyuTslV3RIv6kuxSEP/UMAPwOKAcA8ADKAQA8gHIAAA+gHADAAygHAPAAygEAPPBOOVgONnKFlrWJpiOLse/7gMy2JrAcbOQKLWsTTUcWY1/ngcy2VmA52MgUWsYlmo4kxr5OA5ltbcBysJEntIxrNB1JjH2dBjLbuMPFwUaW0LL2oukIb+wjJDwdJi4OtmpShJa1H01HeGMfIeFV2iGGg03mLSlCy7Ai67xk75DC2EdEeKQcKo27g40koWWYkXWl5DD2ERFe7a1hOdhIHlpGFmNfp4HMtjZgOdhIF1rWMpqONMa+TgKZbQBAbEA5AIAHUA4A4AGUAwB4AOUAAB5AOQCAB1AOAOCBh8rBCC0ji7WLa/cbT/w1UHf6b3Bozn0/Mp8Mhcy2VmCElpHF2oXR/eYn/hhvj1jp755gSSexbCCzrQ0YoWVksXZhdf8bzMIzvqvynH5z6yfUFQ3kEyCzrQ0YoWVksXZhdb+J8jtblmca7dijw5eJFzwDMttagxVa1iASwlu72u9+XQF+DTpEW5AxgYBfGt8FZLa1Biu0zLj+Bw3hrV3tdx8x3iSHZ9Ldogbw5U9hXgKZbW3ACC0jeWZbA6zC9CNP+swyV+W/XRJeA5ltbcAILaMyyGHtwuh+AxXPLzwX0wvuDcKBzDYuYISWkcXahZXZxoFV9javQmaSPAF7/f1AZhsXuIeWkcbahZHZVvcrz+FGpQPvG8SnQGYbABAZUA4A4AGUAwB4AOUAAB5AOQCAB1AOAOABlAMAeOBhZhv3bLYvZyfKW15iCX69GEVUL+xehg8B4ww6iKZrk+VGWsDZ1hKsbDZW+cdSkdEHc687KvHldX3/Eh1E07XJciMt4GxrDVY2G7OsqJwqIy9GZNkg7O7Xb3W4ZLmRFXC2tQErm41ZVlxelR0ySWPO7bwaWR2bldFRi8fIEE5H7UTTcc1yIyvgbGsDVjYbRUzDZIrZEM+1aeN6fs7YZDnFfKFmTsJUeYJpBzOarr0sNxICzrY2VGBkswkP8Nh/+us6Yqb+wZa7Zsc/rphq0oNH7eIRGN2XK24vy42EgLOtNTV53LPZFNDLq+l5ahNNGzw5bBYDCUrSCGcpxuq+DkaW22I+3DvhCeBsaw1mNlvZqzgXq9zFl1JXjJH4kLEtIE3E4rchYrxpFO/A6j5mlhtZAWdbayh
Y2WzSZjtS1ru5WygFljCElcfYR5zfPl6KcNsczO4DrQBnWxuwstkEZAz8Eh/58aoZXUXH0XQts9zICzjbAIDIgHIAAA+gHADAAygHAPAAygEAPIByAAAPoBwAwEOXO9tQTX5ywPyl0ZdfV4rRTX2ij2ww70VEOXPvPmS2tQGcbS3BsnbV5uy0tkscfuzPimmK+UccjGznaT296NSbWFEG2N2HzLZWgLOtNRjWLlrR1YSn6h7xFmqiVKRlt2lp6PC9lwodnJT58vJY/LTrbKsHMtvqAGdbGzCsXXX/EIvJYtevRJVUkqjNf/CuGinz5eWx+GnH2dYAZLbVA862NmBZuxSNZg30DA9OmbNnuuLbM5tC79XWDq1l86hRvAPT2dYAZLZ9BZxtbcByttH6e586lO/kpyO7QHKwpbf9WKn9ShJE+5WD3f2GnVLIbGsEnG2twbJ2KYojYfUZ4Rmzd9R92Vbd8x8cNtRAlXATqJ3uI8hsawY421qD6Wxj/BVtpBtlmpa5YTS6u2NJHHtOqqEUb9rEQzC7Xw9ktjUBzrbWYFu71Bz3bM6wnSQXXCmkOGxe+Lkt+gS7BUEd7TrbILOtGeBsawOWtYsiPtQz4aknr5rRVbTjbIPMthaAsw0AiAwoBwDwAMoBADyAcgAAD6AcAMADKAcA8ADKAQA8dL2zrfbtr2tcfHddfFVOldeZsWp3jC8BU0CwHWyk6P73AM62lmBZuxj5h+ba7leKulVq36/qxsafJkwP0M+N1ifanaWxHGwk6X7nAWdba7CsXWy2gvWG2Ol2gySoSGKU7XR6SOqLf9j6YoS92r6lg4103e8AcLa1AcvaRaNbLln6dZ2qnKST+aqWutIEnjetHGxk635HgLOtDR1Yuzg1ex2/wHwz2/d/y7T5crf2XwHbwUaK7ncMONva0K61i116N2K2+S+lbqlXggyIlwHSBIaDjSzd7wTgbGtNO9Yu9pebGyZNPtA38nayvaZIx2/VfeHqYCNP9zsDONtag2ntYr4/Nd8iUiHi/gF7wlsiuTjYyNT9zgDOttZgWbuY71JDkz6UIBdNcdeGNWn9fK/eCRnBl18zPwYXBxuzgDzd7xzgbGsDd2sXVcX1FtuVV23oWrg42MjU/c4CzjYAIDKgHADAAygHAPAAygEAPIByAAAPoBwAwMP/ARweUMKTO9PsAAAAAElFTkSuQmCC" width="275" /></div>
<div style="line-height: 100%; margin-bottom: 0in;">
<br /></div>
<div style="line-height: 100%; margin-bottom: 0in;">
<br /></div>
<div style="line-height: 100%; margin-bottom: 0in;">
<br /></div>
<div style="line-height: 100%; margin-bottom: 0in;">
<br /></div>
<div style="line-height: 100%; margin-bottom: 0in;">
<br /></div>
<div style="line-height: 100%; margin-bottom: 0in;">
<br /></div>
<div style="line-height: 100%; margin-bottom: 0in;">
<br /></div>
<div style="line-height: 100%; margin-bottom: 0in;">
<br /></div>
<div style="line-height: 100%; margin-bottom: 0in;">
<br /></div>
<div style="line-height: 100%; margin-bottom: 0in;">
<br /></div>
<div style="line-height: 100%; margin-bottom: 0in;">
<br /></div>
<div style="line-height: 100%; margin-bottom: 0in;">
<br /></div>
<div style="line-height: 100%; margin-bottom: 0in;">
<br />
<br />
<br />
The optimal value is 309, obtained by selecting items 0, 1, 2, 3 and 5.</div>
<div style="line-height: 100%; margin-bottom: 0in;">
<br /></div>
<div style="line-height: 100%; margin-bottom: 0in;">
The algorithm finds
the correct answer:</div>
<div style="line-height: 100%; margin-bottom: 0in;">
<br /></div>
<div style="line-height: 100%; margin-bottom: 0in;">
Best 5 answers:</div>
<div style="line-height: 100%; margin-bottom: 0in;">
<br /></div>
<div style="line-height: 100%; margin-bottom: 0in;">
<b>Value</b> <b>Items
Selected</b></div>
<div style="line-height: 100%; margin-bottom: 0in;">
270 0 1 2 9
</div>
<div style="line-height: 100%; margin-bottom: 0in;">
276 0 2 3 6
</div>
<div style="line-height: 100%; margin-bottom: 0in;">
277 0 1 3 4
</div>
<div style="line-height: 100%; margin-bottom: 0in;">
284 0 1 3 6
</div>
<div style="line-height: 100%; margin-bottom: 0in;">
309 0 1 2 3 5
</div>
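<div style="line-height: 100%; margin-bottom: 0in;">
The optimal value can be double-checked with a small exact dynamic program. This is only a sketch: it is not the genetic algorithm from this post, and the values, weights and capacity are an assumed transcription of instance P01 from the dataset page referenced in this post, so they should be checked against that page.</div>

```python
def knapsack_exact(values, weights, capacity):
    """Exact 0/1 knapsack by dynamic programming over capacities."""
    # best[c] = (best value, items used) for a knapsack of capacity c
    best = [(0, ())] * (capacity + 1)
    for item, (value, weight) in enumerate(zip(values, weights)):
        # Iterate capacities downwards so each item is used at most once.
        for c in range(capacity, weight - 1, -1):
            candidate = best[c - weight][0] + value
            if candidate > best[c][0]:
                best[c] = (candidate, best[c - weight][1] + (item,))
    return best[capacity]


# Assumed transcription of instance P01 (10 items, capacity 165).
P01_VALUES = [92, 57, 49, 68, 60, 43, 67, 84, 87, 72]
P01_WEIGHTS = [23, 31, 29, 44, 53, 38, 63, 85, 89, 82]
P01_CAPACITY = 165

best_value, best_items = knapsack_exact(P01_VALUES, P01_WEIGHTS, P01_CAPACITY)
print(best_value, best_items)
```

<div style="line-height: 100%; margin-bottom: 0in;">
The table is small (one entry per capacity unit), so this check is only practical for instances with modest capacities.</div>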
<div style="line-height: 100%; margin-bottom: 0in;">
<br /></div>
<div style="line-height: 100%; margin-bottom: 0in;">
The input P02 has
optimal profit value 1458, and the algorithm also finds the correct
answer.</div>
<div style="line-height: 100%; margin-bottom: 0in;">
<br /></div>
<div style="line-height: 100%; margin-bottom: 0in;">
These datasets were found at:</div>
<div style="line-height: 100%; margin-bottom: 0in;">
<a href="http://people.sc.fsu.edu/~jburkardt/datasets/knapsack_01/knapsack_01.html">people.sc.fsu.edu/~jburkardt/datasets/knapsack_01/knapsack_01.html</a></div>
<div style="line-height: 100%; margin-bottom: 0in;">
<br /></div>
<div style="line-height: 100%; margin-bottom: 0in;">
<br /></div>
<div style="line-height: 100%; margin-bottom: 0in;">
<span style="color: #3d85c6;"><span style="font-size: medium;"><b>Setting
up the Hadoop Cluster on a Single Machine</b></span></span></div>
<div style="line-height: 100%; margin-bottom: 0in;">
<br /></div>
<div style="line-height: 100%; margin-bottom: 0in;">
<br /></div>
<div style="line-height: 100%; margin-bottom: 0in;">
<br /></div>
<div style="line-height: 100%; margin-bottom: 0in;">
For testing purposes, here is how to get started with Hadoop and set it up on a single machine. </div>
<div style="line-height: 100%; margin-bottom: 0in;">
<br /></div>
<div style="line-height: 100%; margin-bottom: 0in;">
Download Hadoop 1.2.1
from:</div>
<div style="line-height: 100%; margin-bottom: 0in;">
<br /></div>
<div style="line-height: 100%; margin-bottom: 0in;">
http://hadoop.apache.org/releases.html</div>
<div style="line-height: 100%; margin-bottom: 0in;">
<br /></div>
<div style="line-height: 100%; margin-bottom: 0in;">
Install Java 6:</div>
<div style="line-height: 100%; margin-bottom: 0in;">
<br /></div>
<div style="line-height: 100%; margin-bottom: 0in;">
$ sudo apt-get
install oracle-java6-installer</div>
<div style="line-height: 100%; margin-bottom: 0in;">
<br /></div>
<div style="line-height: 100%; margin-bottom: 0in;">
Inside the Hadoop
home directory, open the configuration file conf/hadoop-env.sh and set the variables below:</div>
<div style="line-height: 100%; margin-bottom: 0in;">
<br /></div>
<div style="line-height: 100%; margin-bottom: 0in;">
$ vim
conf/hadoop-env.sh
</div>
<div style="line-height: 100%; margin-bottom: 0in;">
<br /></div>
<div style="line-height: 100%; margin-bottom: 0in;">
export
JAVA_HOME=/usr/lib/jvm/java-6-oracle/
</div>
<div style="line-height: 100%; margin-bottom: 0in;">
export
HADOOP_HOME=/home/renata/hadoop-1.2.1
</div>
<div style="line-height: 100%; margin-bottom: 0in;">
export
HADOOP_VERSION=1.2.1
</div>
<div style="line-height: 100%; margin-bottom: 0in;">
<br /></div>
<div style="line-height: 100%; margin-bottom: 0in;">
Create an input
directory to hold your input files:
</div>
<div style="line-height: 100%; margin-bottom: 0in;">
<br /></div>
<div style="line-height: 100%; margin-bottom: 0in;">
$ mkdir input
</div>
<div style="line-height: 100%; margin-bottom: 0in;">
<br /></div>
<div style="line-height: 100%; margin-bottom: 0in;">
Put some content in:</div>
<div style="line-height: 100%; margin-bottom: 0in;">
<br /></div>
<div style="line-height: 100%; margin-bottom: 0in;">
$ cp conf/*.xml
input
</div>
<div style="line-height: 100%; margin-bottom: 0in;">
<br /></div>
<div style="line-height: 100%; margin-bottom: 0in;">
To test whether
Hadoop is working properly:</div>
<div style="line-height: 100%; margin-bottom: 0in;">
<br /></div>
<div style="line-height: 100%; margin-bottom: 0in;">
$ bin/hadoop jar
hadoop-examples*.jar grep input output 'dfs[a-z.]+'
</div>
<div style="line-height: 100%; margin-bottom: 0in;">
<br /></div>
<div style="line-height: 100%; margin-bottom: 0in;">
Don't forget to
remove the output directory after running a job. Hadoop doesn't do this
for you, and a job fails if its output directory already exists.</div>
<div style="line-height: 100%; margin-bottom: 0in;">
<br /></div>
<div style="line-height: 100%; margin-bottom: 0in;">
$ rm -rf output
</div>
<div style="line-height: 100%; margin-bottom: 0in;">
<br /></div>
<div style="line-height: 100%; margin-bottom: 0in;">
</div>
<div style="line-height: 100%; margin-bottom: 0in;">
Now, to test the
Genetic jar file, first set the environment variables:</div>
<div style="line-height: 100%; margin-bottom: 0in;">
<br /></div>
<div style="line-height: 100%; margin-bottom: 0in;">
$ export
HADOOP_HOME=/home/renata/hadoop-1.2.1
</div>
<div style="line-height: 100%; margin-bottom: 0in;">
$ export
HADOOP_VERSION=1.2.1
</div>
<div style="line-height: 100%; margin-bottom: 0in;">
<br /></div>
<div style="line-height: 100%; margin-bottom: 0in;">
To compile the code,
run (with Java 6, the genetic_classes directory must already exist):</div>
<div style="line-height: 100%; margin-bottom: 0in;">
<br /></div>
<div style="line-height: 100%; margin-bottom: 0in;">
$ javac -classpath
${HADOOP_HOME}/hadoop-core*.jar -d genetic_classes Genetic.java
</div>
<div style="line-height: 100%; margin-bottom: 0in;">
<br /></div>
<div style="line-height: 100%; margin-bottom: 0in;">
And create the jar:</div>
<div style="line-height: 100%; margin-bottom: 0in;">
<br /></div>
<div style="line-height: 100%; margin-bottom: 0in;">
$ jar -cvf
genetic.jar -C genetic_classes/ .<br />
<br />
Now, to run the code once you have put the input data in the "input" directory:<br />
<br />
$ bin/hadoop jar genetic.jar org.myorg.Genetic input output </div>
<div style="line-height: 100%; margin-bottom: 0in;">
<br /></div>
<div style="line-height: 100%; margin-bottom: 0in;">
<span style="font-size: large;"><b><br /></b></span></div>
<div style="line-height: 100%; margin-bottom: 0in;">
<span style="font-size: large;"><b>The code can be found here:</b></span><br />
<br />
<a href="https://github.com/renataghisloti/GeneticKnapsackHadoop"><span style="font-size: small;"><b>https://github.com/renataghisloti/GeneticKnapsackHadoop</b></span></a><i><br /></i>
<br />
<i><br /></i>
<br />
<i><br /></i>
</div>
<div style="line-height: 100%; margin-bottom: 0in;">
<i><span style="font-family: "liberation" serif , serif;">PS: A large dataset consisting of
10,000 items, with an optimal profit value of 7145 and a weight capacity
of 431, can be found at:
<a href="https://github.com/jaredtking/knapsack-lisp/blob/master/datasets/large01">https://github.com/jaredtking/knapsack-lisp/blob/master/datasets/large01</a>.</span></i>
</div>
<div style="line-height: 100%; margin-bottom: 0in;">
<br /></div>
Renata Ghisloti Duarte de Souza Granhahttp://www.blogger.com/profile/18336442605720194782noreply@blogger.com2tag:blogger.com,1999:blog-6461723396986571103.post-47304405726832742092014-01-31T11:55:00.000-08:002015-02-11T10:06:13.546-08:00Dealing with NP-Hard Problems: An Introduction to Approximation Algorithms<br />
<i><span style="font-size: xx-small;">This is just a quick overview of approximation algorithms, which is a broad topic. For more information, see the References.</span></i><br />
<br />
The famous <b>NP-Complete class is known for its possible intractability.</b> NP stands for <i>nondeterministic polynomial time</i>, and for a problem to be NP-Complete it has to be<br />
<ul>
<li> in NP (a candidate solution can be verified in polynomial time), and</li>
</ul>
<ul>
<li> NP-Hard (at least as hard as any other problem in the NP class). </li>
</ul>
<br />
Among the many important problems that are NP-Complete, or NP-Hard in
their optimization form, we can name the Knapsack, the Traveling
Salesman, and the Set Cover problems.<br />
<br />
Even though no efficient exact algorithm may exist for NP-Complete problems, we still need to tackle them, because so many practical problems fall into this class.<br />
<br />
Since exponential <i>brute-force </i>search is impractical even for medium volumes of data, the alternative is to give up as little of the optimal solution as possible in exchange for an efficient algorithm.<br />
<br />
<span style="color: #3d85c6;"><b>Approximation algorithms are a way to deal with NP-Hard problems</b></span>. They are <span style="color: #3d85c6;"><b>polynomial-time algorithms that return guaranteed near-optimal solutions</b></span> for any instance of the problem. The objective is to make the output as close to the optimal value as possible. An <i>α-approximation</i> algorithm is one whose output is always within a factor α of the optimal value.<br />
<br />
<h3>
<span style="font-size: large;">Measures of Approximation</span></h3>
<br />
Two common measures of approximation are:<br />
<ul>
<li>Absolute approximation</li>
<li>Approximation factor</li>
</ul>
We say that an algorithm A is an <b>absolute approximation</b> if, for every
instance I, its value stays within an absolute distance k of the optimal
value OPT(I):<br />
<br />
|A(I) − OPT(I)| ≤ k ∀ instance I.<br />
<br />
An algorithm has an <b>approximation factor</b> of α if:<br />
<br />
A(I) ≤ α · OPT(I) ∀I, for minimization<br />
A(I) ≥ α · OPT(I) ∀I, for maximization<br />
<br />
where α &gt; 1 for minimization and α &lt; 1 for maximization problems.<br />
<br />
<h3>
<span style="font-size: large;">Approximation Techniques </span></h3>
<br />
<b>Greedy algorithms</b> are highly intuitive and usually easy to implement. They consist of making the "best" local choice at each step. For example, for the <span style="color: black;"><a href="http://en.wikipedia.org/wiki/Set_cover_problem">Set Cover problem</a>, a greedy algorithm would be:</span><br />
<br />
"At each round, choose the set that minimizes the ratio of its weight to the number of currently uncovered elements it contains."<br />
<br />
A proof of its approximation guarantee can be found in [1].<br />
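<br />
The greedy rule quoted above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name, the dictionary-based input format and the tiny instance are all hypothetical.<br />

```python
def greedy_set_cover(universe, subsets, weights):
    """Weighted greedy set cover: repeatedly pick the cheapest set per newly covered element."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # Only sets that still cover something new are candidates.
        best = min(
            (name for name in subsets if subsets[name] & uncovered),
            key=lambda name: weights[name] / len(subsets[name] & uncovered),
            default=None,
        )
        if best is None:
            raise ValueError("the given subsets do not cover the universe")
        chosen.append(best)
        uncovered -= subsets[best]
    return chosen


# Hypothetical instance, for illustration only.
universe = {1, 2, 3, 4, 5}
subsets = {"A": {1, 2, 3}, "B": {2, 4}, "C": {3, 4}, "D": {4, 5}}
weights = {"A": 2, "B": 1, "C": 1, "D": 1}

cover = greedy_set_cover(universe, subsets, weights)
print(cover, sum(weights[name] for name in cover))
```

On this instance the greedy choice picks B, then A, then D, for a total weight of 4, while A and D alone cover everything with weight 3. The greedy answer is feasible but not optimal, which is exactly what an approximation guarantee has to account for.<br />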
<br />
<br />
Another way to create an approximation is by making a <b>linear relaxation</b> of an integer program. Recall that many problems can be written in a more mathematical form, where we have an <i>objective function</i> to maximize (or minimize), <i>constraints</i> and <i>variables</i>.<br />
The Knapsack problem is a good example:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<span style="color: black;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj6iy_WRI1_yQqZwXDuUJAh0pnHsPdC-EPnj7vGUttEQG7lmsCvEEgcdH-wtPo7NEIw0n64iQ6kDLafVmdZKnkSQiPVb4GZTGALN9L-65jHiNeX0JqgJ-nK7YxTj_A9sXJi0W0iJlGRbcY/s1600/knapsack1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj6iy_WRI1_yQqZwXDuUJAh0pnHsPdC-EPnj7vGUttEQG7lmsCvEEgcdH-wtPo7NEIw0n64iQ6kDLafVmdZKnkSQiPVb4GZTGALN9L-65jHiNeX0JqgJ-nK7YxTj_A9sXJi0W0iJlGRbcY/s1600/knapsack1.png" /></a></span></div>
<div class="separator" style="clear: both; text-align: center;">
<span style="color: black;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgq15nM_mU24ZnIx0tBtdHTjDbeUjDmQjHO6mVdOwTrr1bxyvtAkY9ES5zypupxoF1VwGedGLLKPHmwPZ34U4s1QP9ra5U1sJ-k-Vm_xywtDPzFD2cgQkVKH-OMqX_uyY7dMaD6RB2gg_0/s1600/knapsack2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgq15nM_mU24ZnIx0tBtdHTjDbeUjDmQjHO6mVdOwTrr1bxyvtAkY9ES5zypupxoF1VwGedGLLKPHmwPZ34U4s1QP9ra5U1sJ-k-Vm_xywtDPzFD2cgQkVKH-OMqX_uyY7dMaD6RB2gg_0/s1600/knapsack2.png" /></a></span></div>
<br />
<br />
The goal here is to maximize the objective function, where <i>v<span style="font-size: xx-small;">i</span></i> represents the value of each element <i>i</i>, and variable <i>x<span style="font-size: xx-small;">i</span></i> represents whether element <i>i</i> is going to be taken or not. The maximization has to take into account the weight of each element, which is captured by the first constraint.<br />
<br />
Approximation through linear relaxation can be done in a very simple way. While integer programming is NP-Hard, linear programming can be solved in polynomial time. If we relax (substitute) the constraint<br />
<div class="separator" style="clear: both; text-align: center;">
<span style="color: black;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhrWZJ4khrcaM1-sx3M2a3ocCsY99GvT31I2HJa8gKUHFIOUC7tkICYS0gj0YWlXgsNWNUuCpezcWWdzkGxNrjaNcFBnovc6C5nxFijDuZxYgr-1y4igxjRRilzBSseW0902e1vihuD7c0/s1600/k5.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhrWZJ4khrcaM1-sx3M2a3ocCsY99GvT31I2HJa8gKUHFIOUC7tkICYS0gj0YWlXgsNWNUuCpezcWWdzkGxNrjaNcFBnovc6C5nxFijDuZxYgr-1y4igxjRRilzBSseW0902e1vihuD7c0/s1600/k5.png" /></a></span></div>
<div class="separator" style="clear: both; text-align: center;">
to </div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj3BiOXvRpNys3SJ6tkeyapEAj8Yush8FjZ0L-D1AyRJEtEHwTgKNmT4466yjWCDSShs8aCmAS2l3hyphenhyphenHKrSNWNOKQVfsvSRHSlSkhnMNs9Ks95bFUxhF4xe60Ly0DUzELEmMKB5DEIVrSU/s1600/k6.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj3BiOXvRpNys3SJ6tkeyapEAj8Yush8FjZ0L-D1AyRJEtEHwTgKNmT4466yjWCDSShs8aCmAS2l3hyphenhyphenHKrSNWNOKQVfsvSRHSlSkhnMNs9Ks95bFUxhF4xe60Ly0DUzELEmMKB5DEIVrSU/s1600/k6.png" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<span style="color: black;"></span></div>
we can obtain a solution in polynomial time, although it may be fractional and so is not necessarily a solution to the original integer problem.<br />
<br />
Here is what it would look like:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<span style="color: black;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjHPYdSEgGEUaER5MZ5T4SHj7ZTr1HoOOfLtnF6fvZ5oVVItG2rV8eHEqy9nzyRwCwsf2BdBbIjEsKT-uOqFz2uFf-5iRm3LnYfcGcStqZ60cEwNEg9AzhIK4iQXog37JJIBgM7YcYnSZo/s1600/knapsack1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjHPYdSEgGEUaER5MZ5T4SHj7ZTr1HoOOfLtnF6fvZ5oVVItG2rV8eHEqy9nzyRwCwsf2BdBbIjEsKT-uOqFz2uFf-5iRm3LnYfcGcStqZ60cEwNEg9AzhIK4iQXog37JJIBgM7YcYnSZo/s1600/knapsack1.png" /></a></span></div>
<div class="separator" style="clear: both; text-align: center;">
<span style="color: black;"></span></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiWcF74y1WFHpcaS_HYzpgc8oPsRRBZWFDMiA7KsGHHln7NF0ymAL02CDMwQM2bWknI7RkExuKvC54X2iHvSyQyclciZ3rTIYMyuW9HLUDu1FOkOEs2qIYK22FOJdF8ap_Yku7ZP_DhdoY/s1600/k7.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiWcF74y1WFHpcaS_HYzpgc8oPsRRBZWFDMiA7KsGHHln7NF0ymAL02CDMwQM2bWknI7RkExuKvC54X2iHvSyQyclciZ3rTIYMyuW9HLUDu1FOkOEs2qIYK22FOJdF8ap_Yku7ZP_DhdoY/s1600/k7.png" /></a></div>
<br />
<br />
The optimal value of the linear relaxation serves as a <i>lower bound</i> on the optimal integer solution in minimization problems and as an <i>upper bound</i> in maximization problems.<br />
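<br />
For the Knapsack problem this bound is easy to compute by hand. The sketch below assumes the relaxed constraint is 0 ≤ x<span style="font-size: xx-small;">i</span> ≤ 1; under that assumption the relaxation is solved by the standard greedy rule of filling the knapsack in decreasing order of value per unit of weight and splitting the last item. The function name and the small instance are hypothetical.<br />

```python
from fractions import Fraction


def knapsack_lp_bound(values, weights, capacity):
    """Solve the LP relaxation (0 <= x_i <= 1) of the knapsack greedily."""
    # Sort items by value per unit of weight, best first.
    order = sorted(range(len(values)),
                   key=lambda i: Fraction(values[i], weights[i]),
                   reverse=True)
    x = [Fraction(0)] * len(values)
    remaining = Fraction(capacity)
    for i in order:
        if remaining == 0:
            break
        # Take the whole item if it fits, otherwise the fraction that fits.
        x[i] = min(Fraction(1), remaining / weights[i])
        remaining -= x[i] * weights[i]
    bound = sum(x[i] * values[i] for i in range(len(values)))
    return bound, x


# Hypothetical instance, for illustration only.
bound, x = knapsack_lp_bound([60, 100, 120], [10, 20, 30], 50)
print(bound, x)
```

On this instance the relaxation takes the first two items whole and two thirds of the third, giving a bound of 240, whereas the best integer solution (the second and third items) is worth 220. Exact fractions are used so that the bound is not blurred by floating-point rounding.<br />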
<br />
<br />
There is also a very powerful method called the <b>primal-dual schema</b>.<br />
<br />
It is well known that optimal solutions of a linear program and its dual satisfy the <span style="color: black;"><a href="http://en.wikipedia.org/wiki/Linear_programming">complementary slackness conditions</a>.</span><br />
<br />
The primal-dual schema works with a relaxed version of the complementary slackness conditions. It typically starts with an integral (but possibly infeasible) solution for the primal and a feasible solution for the dual. Iteratively, it improves the feasibility of the primal and the optimality of the dual, keeping the primal integral and the dual feasible throughout. When the primal is feasible and all the relaxed conditions are satisfied, the algorithm ends.<br />
<br />
It can be a little complex, so I am going to write a post about it in the future. For now, to understand it in detail, please read [2]. <br />
<br />
<h3>
<span style="font-size: large;">Classes of Approximation</span></h3>
<br />
Optimization problems can be classified by their "degree" of approximability, that is, how closely a polynomial-time algorithm can approach the optimal solution. <br />
The classes are listed below: <br />
<br />
<ul>
<li>PO - problems for which there are exact polynomial-time algorithms. This is an extension of the P class to optimization problems. </li>
</ul>
<div>
<ul>
<li>PTAS (Polynomial Time Approximation Scheme) - problems that admit an approximation scheme: for any fixed relative error <i>e</i> &gt; 0 (a rational number), it returns in polynomial time a solution of value at most (1 + <i>e</i>)OPT for minimization, or at least (1 - <i>e</i>)OPT for maximization.</li>
</ul>
</div>
<ul>
<li>FPTAS (Fully Polynomial Time Approximation Scheme) - like PTAS, but the running time of the scheme is polynomial both in the size of the input and in the inverse of the error (1/<i>e</i>).</li>
</ul>
<ul>
<li>APX - problems for which there is at least one polynomial-time α-approximation (for some constant α).</li>
</ul>
<ul>
<li>NPO - an extension of the NP class to optimization problems. A problem <i>A</i> is in the NP Optimization class if its instances can be recognized in polynomial time, its solutions are polynomial in size and can be verified in polynomial time, and the value of the objective function can be computed in polynomial time. </li>
</ul>
This is the relation among the classes:<br />
<br />
<div style="text-align: center;">
PO ⊆ FPTAS ⊆ PTAS ⊆ APX ⊆ NPO </div>
<br />
If a problem is in the FPTAS class, it can usually be approximated better than a problem that is only known to be in the APX class.<br />
<br />
The knapsack problem, for example, has been proved to be in FPTAS [3]. Since it is NP-Hard, if someone proved it to also be in PO, then P=NP.<br />
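<br />
To make the FPTAS idea concrete, here is a sketch of the classic value-scaling scheme for the knapsack problem. It is not necessarily the construction presented in [3]; the function name, the use of exact fractions and the small instance are assumptions made for illustration. Values are divided by K = e · v<span style="font-size: xx-small;">max</span> / n and rounded down, and an exact dynamic program is then run over the scaled values.<br />

```python
from fractions import Fraction


def knapsack_fptas(values, weights, capacity, eps):
    """Return (value, items) with value >= (1 - eps) * OPT via value scaling."""
    # Items that do not fit on their own can never be part of a solution.
    items = [i for i in range(len(values)) if weights[i] <= capacity]
    if not items:
        return 0, []
    vmax = max(values[i] for i in items)
    scale = Fraction(eps) * vmax / len(items)          # the factor K
    scaled = {i: int(values[i] / scale) for i in items}  # floor of v_i / K
    total = sum(scaled.values())

    # best[s] = (minimum weight, items) reaching scaled value exactly s
    best = [None] * (total + 1)
    best[0] = (0, ())
    for i in items:
        # Go downwards so each item is used at most once.
        for s in range(total, scaled[i] - 1, -1):
            previous = best[s - scaled[i]]
            if previous is None:
                continue
            weight = previous[0] + weights[i]
            if weight <= capacity and (best[s] is None or weight < best[s][0]):
                best[s] = (weight, previous[1] + (i,))

    top = max(s for s in range(total + 1) if best[s] is not None)
    chosen = list(best[top][1])
    return sum(values[i] for i in chosen), chosen


# Hypothetical instance, for illustration only.
value, chosen = knapsack_fptas([60, 100, 120], [10, 20, 30], 50, Fraction(1, 10))
print(value, chosen)
```

The table has at most n · (n / e) entries, which is where the polynomial dependence on 1/e comes from: a smaller error buys a better solution at the cost of a larger table.<br />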
<br />
<br />
<h3>
<span style="font-size: large;">References</span></h3>
<br />
<i><span style="font-family: inherit;"><span style="font-size: small;"><i><span style="font-family: inherit;"><span style="font-size: small;">[<span style="font-size: small;">1]</span></span><span style="font-size: small;"> D. P. Williamson and D. B. Shmoys. <a href="http://www.designofapproxalgs.com/">The Design of Approximation Algorithms. 2010. Cambridge University Press</a></span></span></i></span></span></i><br />
<span style="font-family: inherit;"><i><span style="font-size: small;"> </span></i></span><br />
<span style="font-family: inherit;"><i><span style="font-size: small;">[2] V. Vazirani. Approximation Algorithms.
2001. Springer-Verlag. </span></i></span><br />
<br />
<br />
<span style="font-family: inherit;"><i><span style="font-size: small;">[3]<b> (PORTUGUESE)</b> M.H. Carvalho, M.R. Cerioli, R. Dahab, P.
Feofiloff, C.G. Fernandes, C.E. Ferreira, K.S. Guimarães, F.K.
Miyazawa, J.C. Pina Jr., J. Soares, Y. Wakabayashi. Uma introdução
sucinta a algoritmos de aproximação. 23o Colóquio Brasileiro de
Matemática, IMPA, Rio de Janeiro. </span></i></span><br />
<br />
<i><span style="font-size: small;"><span style="font-family: inherit;">[4] </span><span style="font-weight: normal;"><span style="font-family: inherit;"><span style="font-size: small;"><span class="authorEditorList">Giorgio Ausiello, Pierluigi Crescenzi, and Marco Protasi. </span>Approximate Solution of NP Optimization Problems.<span class="authorEditorList"> </span><i>Theor. Comput. Sci.</i> <i>150(1):1-55</i> (<i>1995</i></span></span>)</span></span></i><br />
<br />
<br />Renata Ghisloti Duarte de Souza Granhahttp://www.blogger.com/profile/18336442605720194782noreply@blogger.com0tag:blogger.com,1999:blog-6461723396986571103.post-33603714243671753052013-12-20T13:20:00.000-08:002017-08-06T05:53:13.319-07:00Overview of Digital Cloning <h2>
<span style="color: #444444; font-size: x-large;">Introduction</span></h2>
<br />
The growing availability of image processing and editing software has made it easy to manipulate digital images.<br />
With the amount of digital content being generated nowadays, techniques to verify the authenticity and integrity of digital content can be essential to provide trustworthy evidence in a forensic case.<br />
In this context, <b>copy-move is a type of forgery in which a part of an image is copied and pasted somewhere else in the same image</b>. This forgery can be particularly challenging to discover because properties such as illumination and noise match between the source and the tampered regions. An example of copy-move forgery can be seen in Picture 1: first the original image, followed by the tampered one, and then a picture indicating the cloned areas.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhaWwOi_Tg4sIFlVivaTMxnCupT6cNlM-a6MiQ1-yiVt7gOn9g1GKzZcZktlKGoZkizi7F-BTVjoOcGBZVcghYjoSmg3j888hD7fxb9hnuz53C2cl66SKRuq1VtBibFORB_N_Uz4XgAtcI/s1600/image_cloned.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="176" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhaWwOi_Tg4sIFlVivaTMxnCupT6cNlM-a6MiQ1-yiVt7gOn9g1GKzZcZktlKGoZkizi7F-BTVjoOcGBZVcghYjoSmg3j888hD7fxb9hnuz53C2cl66SKRuq1VtBibFORB_N_Uz4XgAtcI/s640/image_cloned.png" width="400" /></a></div>
<br />
Several techniques have been proposed to solve this problem. <b>Block-based methods</b> [1] divide an image into blocks of pixels and compare them to find a forgery.<br />
<br />
<b>Keypoint-based methods </b>[2], on the other hand, extract keypoints from an image and use them to find tampered regions.<br />
<br />
While keypoint-based methods might yield better and more computationally efficient detectors [3], they have difficulty finding tampering in homogeneous regions.<br />
<br />
<h2>
<span style="background-color: white; font-size: x-large;"><span style="color: #444444;">State-of-the-Art </span></span></h2>
<br />
Many <b>block-based</b> algorithms have been proposed over the years. They mainly involve dividing the image into overlapping blocks of b x b pixels, and they differ in how they compare the blocks.<br />
Once the blocks are extracted from an N x M image, each one is usually stored as a row of a matrix with (M - b + 1) x (N - b + 1) rows, which is then sorted lexicographically. This matrix is later analysed to see whether any two rows match, which indicates a cloned region. Authors have proposed ways to improve this method, such as using PCA to reduce the dimensionality of the blocks.<br />
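<br />
In its simplest exact-match form, the block-matching idea can be sketched as follows. This is only an illustration: the function name and the toy image are hypothetical, and the methods in the literature compare robust features of each block (for instance after PCA) rather than raw pixel values.<br />

```python
def find_duplicate_blocks(image, b):
    """Return pairs of top-left corners whose b x b blocks are identical."""
    rows, cols = len(image), len(image[0])
    blocks = []
    # Slide a b x b window over the image, one pixel at a time.
    for r in range(rows - b + 1):
        for c in range(cols - b + 1):
            flat = tuple(image[r + dr][c + dc]
                         for dr in range(b) for dc in range(b))
            blocks.append((flat, (r, c)))
    # Lexicographic sort puts identical blocks next to each other.
    blocks.sort()
    matches = []
    for (flat1, pos1), (flat2, pos2) in zip(blocks, blocks[1:]):
        if flat1 == flat2:
            matches.append((pos1, pos2))
    return matches


# Toy 6 x 6 image with distinct pixel values (hypothetical data).
image = [[r * 6 + c for c in range(6)] for r in range(6)]
# Simulate a copy-move forgery: paste the 2 x 2 block at (0, 0) onto (3, 3).
for dr in range(2):
    for dc in range(2):
        image[3 + dr][3 + dc] = image[dr][dc]

matches = find_duplicate_blocks(image, 2)
print(matches)
```

Sorting is what keeps the comparison cheap: instead of comparing every block with every other block, only neighbouring rows of the sorted list need to be checked.<br />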
<br />
Among several approaches in the literature, the <b>cloning detection via multiscale analysis and voting method</b>, proposed by Ewerton Silva et al. [2], can be highlighted. The first step is to extract keypoints of interest from the image using SURF, which is robust to scaling and rotation, and then match these points against each other. The matched points are then grouped based on their spatial distance, to limit the search space. The image is then resized, creating a pyramid of images that represents the image at several scales. At each level of the pyramid, a search is made for possible duplicated regions. This search only occurs within the point groups discovered. The final decision is made by a voting process: a region is considered cloned if it is detected in more than a threshold number of pyramid levels.<br />
<br />
Another work worth mentioning is the evaluation of copy-move forgery detection approaches by Christlein et al. [3]. Their work compared 15 different copy-move detection approaches, based on features such as SIFT, SURF, PCA, KPCA, Zernike, DCT and DWT. They aimed to determine which algorithm performed best under realistic scenarios such as different compression levels and noise. Results showed that keypoint-based methods (SIFT and SURF) had a clear advantage in computational complexity, but block-based methods such as Zernike achieved quite precise results.<br />
<br />
<br />
<h2>
<span style="color: #444444; font-size: x-large;">References</span></h2>
<i>[1] Connelly Barnes, Eli Shechtman, Adam Finkelstein, and Dan B. Goldman. The generalized patchmatch correspondence algorithm. 2010. <br />[2] Ewerton Silva and Anderson Rocha. Cloning detection. In Elsevier JVCI 2013 (Submitted).<br />[3] Christian Riess, Johannes Jordan, Corinna Riess, Elli Angelopoulou, and Vincent Christlein. Evaluation of popular copy-move forgery detection approaches. In IEEE Transactions on Information Forensics and Security (TIFS) 2012, pages 015–021, Graz, Austria, 2010. <br />[4] Andrea Vedaldi. An implementation of multi-dimensional maximally stable extremal regions. 2007. </i><br />
<br />
<span style="font-size: x-small;">This topic was inspired by a class I took with <a href="http://www.ic.unicamp.br/~rocha/">Anderson Rocha</a>. See his publications on digital forensics here: <a href="http://www.ic.unicamp.br/~rocha/pub/index.html">http://www.ic.unicamp.br/~rocha/pub/index.html</a></span>.Renata Ghisloti Duarte de Souza Granhahttp://www.blogger.com/profile/18336442605720194782noreply@blogger.com0tag:blogger.com,1999:blog-6461723396986571103.post-72649396233291224232013-08-20T16:19:00.000-07:002014-01-28T10:13:49.438-08:00Understanding Apache Hive<style type="text/css">P { margin-bottom: 0.08in; }A:link { }</style>
<br />
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
<span style="color: #6fa8dc;">
<span style="font-size: large;"><b></b></span></span></div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
<span style="color: #6fa8dc;"><span style="font-size: large;"><b>Introduction</b></span></span></div>
<div style="line-height: 115%; margin-bottom: 0in;">
<br /></div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
<span style="font-size: medium;"><b> BigData and Hive</b></span></div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
<br /></div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
Apache Hive is a software application created to facilitate data
analysis on Apache Hadoop. It is a Java framework that helps
extract knowledge from data stored on an HDFS cluster by providing
a SQL-like interface to it.</div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
<br /></div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
The Apache Hadoop platform is a major distributed
computing project, and it is commonly regarded as one of the best approaches for
dealing with Big Data challenges.</div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
It is now well established that a great volume of data is produced
every day. Whether it comes from system logs or from user purchases, the
amount of information generated is such that existing
database and data warehouse solutions do not seem to scale well
enough.</div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
The MapReduce programming paradigm was introduced by Google in 2004 as a new
approach to processing large datasets. In 2005 an open-source
implementation, Hadoop, was created by Doug Cutting. Although Hadoop is not
meant to replace relational databases, it is a good solution for
big data analysis.
</div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
<br /></div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
Hadoop facilitates large data processing, but it still requires
skillful programmers to write the Map and Reduce functions that
analyze the data. Every analysis made through Hadoop has to be
expressed in terms of these two functions. Applications of this type
can be challenging to create and difficult to maintain, so data
developers had difficulty extracting intelligence from their data.
Hive was created to overcome these issues.
</div>
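<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
To give a feeling for what "condensing an analysis into two functions" means, here is a small word count written in that style. It is a plain Python simulation of the programming model, not the Hadoop API; the function names are hypothetical and nothing here runs on a cluster.</div>

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.split():
        yield word.lower(), 1


def reduce_phase(word, counts):
    """Reduce: combine all the values emitted for one key."""
    return word, sum(counts)


def run_job(lines):
    """Simulate the shuffle step that groups mapped values by key."""
    grouped = defaultdict(list)
    for line in lines:
        for key, value in map_phase(line):
            grouped[key].append(value)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


word_counts = run_job(["big data", "Big Hive"])
print(word_counts)
```

<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
Even a simple count needs a map function, a reduce function and a grouping step in between, which is the boilerplate a single SQL-like statement is meant to hide.</div>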
<div style="line-height: 115%; margin-bottom: 0in;">
<br /></div>
<div style="line-height: 115%; margin-bottom: 0in;">
<span style="font-size: medium;"><b> Apache Hive</b></span></div>
<div style="line-height: 115%; margin-bottom: 0in;">
<br /></div>
<div style="margin-bottom: 0in;">
First introduced by Facebook and later
donated to the Apache Software Foundation, Hive is a data warehouse
interface for Hadoop. With Hive, users can write SQL statements that
are automatically converted to MapReduce jobs and run on an HDFS
cluster.</div>
<div style="margin-bottom: 0in;">
<br /></div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
Data can be inserted or manipulated on the Hadoop cluster through a
command-line interface, using statements from the Hive Query Language,
or HiveQL, such as SELECT, INSERT or CREATE TABLE. Users can also
create their own User Defined Functions by extending the UDF class
already provided. Within these statements, tables can be defined
using primitive types such as integers, floating points, strings, dates
and booleans. Furthermore, new types can be created by grouping these
primitive types into maps and arrays. Please check
<a href="https://cwiki.apache.org/Hive/languagemanual.html">https://cwiki.apache.org/Hive/languagemanual.html</a>
for more information on HiveQL.</div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
<br /></div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
Although Hive presents a data warehouse interface for Hadoop, it is
still a batch processing framework. As Hive’s data is located on
Hadoop, it is limited to Hadoop’s constraints.</div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
Hadoop does not index data and is not made for editing data. In the Hive
version covered here (0.11) there is no UPDATE, because this operation cannot be executed
on data stored in HDFS. Hive does not support transactions either. If you want
this kind of database on top of Hadoop, you should look at options
such as HBase. Check <a href="http://wiki.apache.org/hadoop/HadoopIsNot"><span style="color: #1155cc;"><u>http://wiki.apache.org/hadoop/HadoopIsNot</u></span></a>
to read more about these Hadoop limitations.</div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
<br /></div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
Even so, Apache Hive made it possible for developers with basic SQL
knowledge to write complex, meaningful queries and quickly
extract value from big data.</div>
<div style="line-height: 115%; margin-bottom: 0in;">
<br /></div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
</div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
<span style="color: #6fa8dc;"><span style="font-size: large;"><b>Architecture</b></span></span></div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
<br /></div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
Users can interact with Hive through a Command Line Interface
(CLI), the Hive Web Interface (HWI), JDBC or ODBC.</div>
<div style="line-height: 115%; margin-bottom: 0in;">
The CLI is
a command line tool accessed through a terminal. It can be started
by calling the <i>HIVE_HOME/bin/hive</i> script inside the
downloaded Hive source tree. Hive also provides a Hive server, so that
users can communicate with it over JDBC or ODBC.</div>
<div style="line-height: 115%; margin-bottom: 0in;">
<br /></div>
<div style="line-height: 115%; margin-bottom: 0in;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhinoHK_y-3RCzHQbkyuWmGgVYaZr46oBLFQJ7oS65iwZxGgBlwzuq2uPSswsSURrkqNPgpkWWk_n5hHJ4rgSrQ5WixL9FwleNCR06jyNjlTJcGx8fker8ZD_wVCCC-MN0JPDHLXpTMIvo/s1600/hive1.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhinoHK_y-3RCzHQbkyuWmGgVYaZr46oBLFQJ7oS65iwZxGgBlwzuq2uPSswsSURrkqNPgpkWWk_n5hHJ4rgSrQ5WixL9FwleNCR06jyNjlTJcGx8fker8ZD_wVCCC-MN0JPDHLXpTMIvo/s320/hive1.jpg" height="320" width="320" /></a></div>
<br />
<br />
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
When you type a query into the CLI, the HiveQL
statement is handled by the <b>Driver </b>component. The Driver
connects a series of modules that transform the statement into
MapReduce jobs to be run on Hadoop. It is important to note that the
query is not translated into Java code in this process; it goes
directly to MapReduce jobs. The modules involved in this process are:
Parser, Semantic Analyzer, Logical Plan Generator, Optimizer,
Physical Plan Generator and Executor.</div>
<div style="line-height: 115%; margin-bottom: 0in;">
<br /></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjtNguOCqyek6hO7a06dXMSAtcW5VhMQgVtoxeXere63BvRZ4eZhopCmk20z5Oe_xOHMSfK0bdwLlY4feH4emeE6OmmSB9D7nVTmNbbznl5F0mdr7y0EfXovJHSJut5N85-GjPd-CQhzbk/s1600/Selection_003.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjtNguOCqyek6hO7a06dXMSAtcW5VhMQgVtoxeXere63BvRZ4eZhopCmk20z5Oe_xOHMSfK0bdwLlY4feH4emeE6OmmSB9D7nVTmNbbznl5F0mdr7y0EfXovJHSJut5N85-GjPd-CQhzbk/s640/Selection_003.png" height="323" width="640" /> </a></div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
First, the Driver creates a session that keeps track of details about the
process, such as dates and statistics.</div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
<br /></div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
Metadata (information about tables and columns) is
collected and stored in the <b>Metastore</b> as soon as the input
tables are created. This metadata is actually kept in a
relational database, and it is used later on by the Semantic
Analyzer.</div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
<br /></div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
ANTLR is used to generate the parser in the <b>Parser</b> module,
which parses the query. As in a compiler, the statement is broken down
into tokens and an Abstract Syntax Tree (AST) is created.</div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
<br /></div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
The following HiveQL statement</div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
<br /></div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
<span style="font-family: "Courier New",Courier,monospace;">FROM src src1 JOIN src src2 ON (src1.key = src2.key) JOIN src src3 ON
(src1.key + src2.key = src3.key)</span></div>
<span style="font-family: "Courier New",Courier,monospace;">
</span>
<br />
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
<span style="font-family: "Courier New",Courier,monospace;">
INSERT OVERWRITE TABLE dest1 SELECT src1.key, src3.value</span></div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
<br /></div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
would become this AST:</div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
<span style="font-family: "Courier New",Courier,monospace;"><br />
</span></div>
<span style="font-family: "Courier New",Courier,monospace;">
</span>
<br />
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
<span style="font-family: "Courier New",Courier,monospace;">
(TOK_QUERY (TOK_FROM (TOK_JOIN (TOK_JOIN (TOK_TABREF (TOK_TABNAME
src) src1) (TOK_TABREF (TOK_TABNAME src) src2) (= (.
(TOK_TABLE_OR_COL src1) key) (. (TOK_TABLE_OR_COL src2) key)))
(TOK_TABREF (TOK_TABNAME src) src3) (= (+ (. (TOK_TABLE_OR_COL src1)
key) (. (TOK_TABLE_OR_COL src2) key)) (. (TOK_TABLE_OR_COL src3)
key)))) (TOK_INSERT (TOK_DESTINATION (TOK_TAB (TOK_TABNAME dest1)))
(TOK_SELECT (TOK_SELEXPR (. (TOK_TABLE_OR_COL src1) key))
(TOK_SELEXPR (. (TOK_TABLE_OR_COL src3) value))))) null</span></div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
<br /></div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
The process continues with a <b>Semantic Analysis</b> of the
generated AST. The information in the query is validated
against the schema information of the input tables,
stored in the Metastore component.</div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
Type checking is an example of the operations performed by the Semantic
Analyzer.</div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
<br /></div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
After this analysis, an operator tree is created by the <b>Logical
Plan Generator</b>, based on the parsed information and on the
AST. This operator tree is then passed to the <b>Optimizer</b>,
which performs a set of transformations to, not
surprisingly, optimize the operations. The improvements made
by the Optimizer include column pruning (only the columns actually needed
are fetched) and join reordering (to make sure only small tables
are kept in memory).
</div>
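<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
Column pruning can be illustrated with a small Python sketch. This is not Hive's Optimizer code, only the idea behind it: collect every column the query mentions and drop the rest before reading. The helper names (referenced_columns, prune_columns) and the three-column schema, including the <i>ds</i> column, are assumptions made up for the example; the aliases come from the query shown earlier.</div>

```python
def referenced_columns(select_exprs, join_conditions):
    # Collect every (table_alias, column) pair the query actually mentions.
    needed = set(select_exprs)
    for left, right in join_conditions:
        needed.add(left)
        needed.add(right)
    return needed


def prune_columns(schema, needed):
    # Keep, per table alias, only the columns that appear in `needed`.
    return {
        alias: [col for col in columns if (alias, col) in needed]
        for alias, columns in schema.items()
    }


# Hypothetical schema: both aliases point at a table with three columns.
schema = {"src1": ["key", "value", "ds"], "src3": ["key", "value", "ds"]}
select_exprs = [("src1", "key"), ("src3", "value")]
join_conditions = [(("src1", "key"), ("src3", "key"))]

pruned = prune_columns(schema, referenced_columns(select_exprs, join_conditions))
print(pruned)
```

<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
In this sketch only <i>key</i> survives for src1, while src3 keeps <i>key</i> and <i>value</i>; the unused <i>ds</i> column is never fetched.</div>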
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
The <b>Physical Plan Generator</b> takes the optimized operator tree
and builds a Directed Acyclic Graph (DAG) of MapReduce jobs from it. This
physical plan is written to an XML file and delivered to the
<b>Executor</b>, which finally runs it on the Hadoop cluster.</div>
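<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
Whoever runs a graph of jobs has to respect its dependencies: a job can only start once the jobs it depends on have finished. A rough sketch of that ordering, using Python's standard graphlib module (Python 3.9 or newer). The job names are hypothetical, and Hive may compile the example query into a different set of jobs.</div>

```python
from graphlib import TopologicalSorter

# Hypothetical plan: each job maps to the set of jobs it depends on.
plan = {
    "join_src1_src2": set(),
    "join_with_src3": {"join_src1_src2"},
    "insert_dest1": {"join_with_src3"},
}

# static_order() yields the jobs so that dependencies always come first.
execution_order = list(TopologicalSorter(plan).static_order())
print(execution_order)
```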
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
<br /></div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
<br /></div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
<span style="font-size: large;"><br />
</span></div>
<span style="font-size: large;">
</span>
<br />
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
<span style="color: #6fa8dc;"><span style="font-size: large;"><b>Hive and the different Hadoop versions</b></span></span></div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
<br /></div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
Hive can be built against Hadoop 1.x or Hadoop 2.x. It provides
version-specific interfaces for this purpose, and these are defined in the
<i>shims</i> module.</div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
<br /></div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
There are three interfaces for Hadoop: 0.20, 0.20S and 0.23. 0.20 is
meant to work with Hadoop 1.x, 0.20S is for the secure version of
Hadoop 1.x, and 0.23 is for building Hive against Hadoop 2.x.</div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
<br /></div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
You can prevent an interface from being built by editing the property
<b><i>shims.include</i></b> in HIVE_HOME/shims/build.xml:</div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
<br /></div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
<span style="background-color: #f3f3f3;">&lt;property name="shims.include"
value="0.20,0.20S,0.23"/&gt; </span></div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
<br /></div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
Hive uses a <i>Factory Method</i> to decide which Hadoop interface to use,
based on the version of Hadoop found on the classpath. This logic lives in
</div>
<div style="line-height: 115%; margin-bottom: 0in;">
<br /></div>
<div style="line-height: 115%; margin-bottom: 0in;">
HIVE_HOME/shims/src/common/java/org/apache/hadoop/hive/shims/ShimLoader.java.</div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
<br /></div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
HIVE_HOME/shims/src/common/java/org/apache/hadoop/hive/shims/HadoopShims.java
</div>
<div style="line-height: 115%; margin-bottom: 0in;">
<br /></div>
<div style="line-height: 115%; margin-bottom: 0in;">
encapsulates the
interfaces.</div>
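<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
The Factory Method idea can be sketched in a few lines of Python. This is a simplified illustration, not a translation of ShimLoader.java: the class names, the load_shims function and the <i>secure</i> flag are assumptions chosen for the example, and the version mapping simply follows the description above.</div>

```python
class Hadoop20Shims:
    name = "0.20"


class Hadoop20SShims:
    name = "0.20S"


class Hadoop23Shims:
    name = "0.23"


def load_shims(hadoop_version, secure=False):
    # Factory method: pick the shim implementation from the Hadoop major version.
    major = hadoop_version.split(".")[0]
    if major == "1":
        return Hadoop20SShims() if secure else Hadoop20Shims()
    if major == "2":
        return Hadoop23Shims()
    raise ValueError("unsupported Hadoop version: " + hadoop_version)


print(load_shims("1.1.2").name)
print(load_shims("2.0.4-alpha").name)
```

<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
The calling code only ever sees the common interface, so the rest of Hive does not need to know which Hadoop version it is running against.</div>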
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
<br /></div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
<i>hadoop.version</i> is defined in HIVE_HOME/build.properties, but
you can override it with the <b>-Dhadoop.version</b> flag.</div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
<br /></div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
To build Hive with Hadoop 1.1.2, use</div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
<span style="background-color: #f3f3f3;"><br />
</span></div>
<span style="background-color: #f3f3f3;">
</span>
<br />
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
<span style="background-color: #f3f3f3;"><span style="font-family: "Courier New",Courier,monospace;">$ ant clean package -Dhadoop.version=1.1.2 </span></span></div>
<span style="font-family: "Courier New",Courier,monospace;">
</span>
<br />
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
<span style="font-family: "Courier New",Courier,monospace;">
</span></div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
<br />
To build Hive with Hadoop 2.0.4-alpha, use</div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
<br /></div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
<span style="background-color: #f3f3f3;"><span style="font-family: "Courier New",Courier,monospace;">$ ant clean package -Dhadoop.version=2.0.4-alpha
-Dhadoop-0.23.version=2.0.4-alpha -Dhadoop.mr.rev=2</span></span></div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
<br /></div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
<br /></div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
<span style="color: #6fa8dc;"><span style="font-size: large;"><b>Building</b></span></span></div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
<br /></div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
To use some of Hive’s features, such as UDFs, or even to adapt Hive’s
code to your own needs, you might have to build Hive from
source.</div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
Before you start, make sure you have a Java JDK, Ant and Subversion
installed on your computer.</div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
<br /></div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
Then, start by checking out the latest stable release from the Hive
repository.</div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
<br /></div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
<span style="background-color: #f3f3f3;"><span style="font-family: "Courier New",Courier,monospace;">$ svn checkout
https://svn.opensource.ibm.com/svn/stg-hadoop/hive/0.11.0/trunk
hive-0.11.0
</span></span></div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
<br /></div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
Enter your Hive home directory (which, from now on, we will call
HIVE_HOME):</div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
<br /></div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
<span style="background-color: #f3f3f3;"><span style="font-family: "Courier New",Courier,monospace;">$ cd hive-0.11.0
</span></span></div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
<br /></div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
And finally build the code with:</div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
<br /></div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
<span style="background-color: #f3f3f3;"><span style="font-family: "Courier New",Courier,monospace;">$ ant package </span></span></div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
<br /></div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
This will automatically download and install all the dependencies
that Hive requires.</div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
<br /></div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
Hive depends on (or uses) other Hadoop-related components. As of
Hive 0.11, these are:</div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
<br /></div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
Apache Hadoop</div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
Apache HBase</div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
Apache Avro</div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
Apache ZooKeeper</div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
<br /></div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
These components are automatically downloaded by Ant and Ivy when
you run the <i>ant package</i> command. You can check which version
of each component will be downloaded in
</div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
<br /></div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
<span style="font-family: "Courier New",Courier,monospace;">HIVE_HOME/ivy/libraries.properties</span></div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
<br /></div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
and, as explained in the previous section, the Hadoop version can be checked
here:</div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
<br /></div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
<span style="font-family: "Courier New",Courier,monospace;">HIVE_HOME/build.properties</span></div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
<br /></div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
To list all the Ant targets available for Hive, type:</div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
<br /></div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
<span style="background-color: #f3f3f3;"><span style="font-family: "Courier New",Courier,monospace;">$ ant -p </span></span></div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
<br /></div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
This should show how to build Hive, test it and even create a
tar file from the source. Testing is explained in a little more
detail next.</div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
<br /></div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
<br /></div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
<span style="color: #6fa8dc;"><span style="font-size: large;"><b>Unit Tests</b></span></span></div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
<br /></div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
Hive provides several built-in unit tests to verify the functionality of its own modules
and features. They are written with <b>JUnit 4</b> and
run queries (.q files) already provided by the framework.
</div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
<br /></div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
To generate the JUnit classes, execute:</div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
<br /></div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
<span style="background-color: #f3f3f3;"><span style="font-family: "Courier New",Courier,monospace;">$ ant package </span></span></div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
<br /></div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
To run the unit tests, type:</div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
<br /></div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
<span style="background-color: #f3f3f3;"><span style="font-family: "Courier New",Courier,monospace;">$ ant test </span></span></div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
<br /></div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
To run a specific test case, run:</div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
<span style="background-color: #f3f3f3;"><span style="font-family: "Courier New",Courier,monospace;"><br />
</span></span></div>
<span style="background-color: #f3f3f3;"><span style="font-family: "Courier New",Courier,monospace;">
</span></span>
<br />
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
<span style="background-color: #f3f3f3;"><span style="font-family: "Courier New",Courier,monospace;">
$ ant test -Dtestcase=TestCliDriver </span></span></div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
<br /></div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
To run a specific query file within one unit test, run:</div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
<br /></div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
<span style="background-color: #f3f3f3;"><span style="font-family: "Courier New",Courier,monospace;">$ ant test -Dtestcase=TestCliDriver -Dqfile=alter5.q </span></span></div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
<br /></div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
<br /></div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
The command above produces an output that is
compared with Hive’s expected output. It also generates an .xml
log file, which is very helpful for debugging:</div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
<br /></div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
HIVE_HOME/build/ql/test/TEST-org.apache.hadoop.hive.cli.TestCliDriver.xml</div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
<br /></div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
If you are having trouble with a certain test case and are trying to
debug it, pay attention: some Java test files (all files under the <b>ql</b>
module) are generated at build time from Velocity templates
(.vm). If you want to modify these tests, you have to change the .vm
file, not the .java one.</div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
<br /></div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
<br /></div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
<br /></div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
<span style="color: #6fa8dc;"><span style="font-size: large;"><b>References:</b></span></span></div>
<span style="color: #6fa8dc;">
</span>
<br />
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
<span style="color: #6fa8dc;">
</span></div>
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
<a href="http://hive.apache.org/"><i><span style="color: #1155cc;"><u>http://hive.apache.org/</u></span></i></a></div>
<i>
</i>
<br />
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
<i><br /></i>
</div>
<i>
</i>
<br />
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
<i>
Book: Programming Hive: Data Warehouse and Query Language for Hadoop.
<span style="color: black;">Edward Capriolo, Dean Wampler, Jason Rutherglen</span></i></div>
<i>
</i>
<br />
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
<i><br /></i>
</div>
<i>
</i>
<br />
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
<i>
Article: Hive - A Warehousing Solution Over a Map-Reduce Framework.
Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao,
Prasad Chakka, Suresh Anthony, Hao Liu, Pete Wyckoff and Raghotham Murthy.
Facebook Data Infrastructure Team</i></div>
<i>
</i>
<br />
<br />
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
<br />
<a href="http://research.google.com/archive/mapreduce.html"><i><span style="color: #1155cc;"><u>http://research.google.com/archive/mapreduce.html</u></span></i></a></div>
<i>
</i>
<br />
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
<i><br /></i>
</div>
<i>
</i>
<br />
<i>
</i>
<br />
<div style="line-height: 115%; margin-bottom: 0in; page-break-after: auto; page-break-before: auto;">
<i><a href="http://ant.apache.org/">http://ant.apache.org/</a></i></div>
<i>
</i>
<br />
Renata Ghisloti Duarte de Souza Granhahttp://www.blogger.com/profile/18336442605720194782noreply@blogger.com1tag:blogger.com,1999:blog-6461723396986571103.post-64933899244021522432013-07-27T11:50:00.000-07:002017-08-06T05:53:48.235-07:00Is there such a thing as "best" Recommender System algorithm?I received emails from users asking <b>which recommender system algorithm they should use</b>. Usually people start looking for articles on which approach has a better performance, and once they find something convincing they start to implement it.<br />
<br />
I believe that <b>the best recommender system depends on the data and the problem you have</b> to deal with.<br />
<br />
With that in mind, I decided to publish here some pros and cons of each recommender type (collaborative, content-based and hybrid), so people can decide for themselves which algorithms best suit their needs.<br />
<br />
I've already presented these approaches <a href="http://girlincomputerscience.blogspot.com.br/2010/10/recommender-systems.html">here</a>, so if you know nothing about recommender systems, you can read it there first.<br />
<br />
<span style="color: #6fa8dc;"><span style="font-size: large;">Collaborative Filtering</span></span><br />
<br />
<b>Pros</b><br />
<br />
<ul>
<li>It recommends diverse items to users, so its suggestions can be novel;</li>
<li>Good practical results (read Amazon's <a href="http://www.cs.umd.edu/~samir/498/Amazon-Recommendations.pdf">article</a>); </li>
<li>It is widely used, and you can find several open-source implementations of it (<a href="http://mahout.apache.org/">Apache Mahout</a>);</li>
<li>It works directly on the ratings users give to items;</li>
<li>It can deal with video and audio data, since it relies on ratings rather than on item content;</li>
</ul>
<br />
<b>Cons</b><br />
<br />
<ul>
<li>It suffers from data sparsity: if you don't have many ratings, for example, you might end up with bad results;</li>
<li>When the number of ratings grows, scalability becomes an issue, since it can be expensive to calculate the similarity between all users;</li>
</ul>
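<br />
To make the two cons above more concrete, here is a minimal sketch of a user-to-user similarity, using cosine similarity over the items both users have rated. The class name, the method name and the sample ratings are made up for illustration; this is not how Mahout or any particular library implements it. Note that with n users there are n(n-1)/2 pairs to compare, which is where the scalability problem comes from, and that users with no items in common get a similarity of zero, which is the sparsity problem.<br />
<br />
```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only: names and data are assumptions, not a library API.
class UserSimilaritySketch {

    // Cosine similarity computed only over the items both users have rated.
    // Returns 0.0 when the users share no rated items (the sparse-data case).
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other == null) {
                continue; // item not rated by both users
            }
            dot += e.getValue() * other;
            normA += e.getValue() * e.getValue();
            normB += other * other;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        Map<String, Double> alice = new HashMap<>();
        alice.put("item1", 5.0);
        alice.put("item2", 3.0);

        Map<String, Double> bob = new HashMap<>();
        bob.put("item1", 5.0);
        bob.put("item2", 3.0);
        bob.put("item3", 1.0);

        Map<String, Double> carol = new HashMap<>();
        carol.put("item9", 4.0);

        // Alice and Bob agree on everything they both rated, so this is close to 1
        System.out.println(cosine(alice, bob));
        // Alice and Carol share no items, so nothing can be learned from the pair
        System.out.println(cosine(alice, carol));
    }
}
```
<br />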
<br />
<br />
<span style="color: #6fa8dc;"><span style="font-size: large;">Content Based Filtering</span></span><br />
<br />
<b>Pros</b><br />
<br />
<ul>
<li>It works better with smaller amounts of information than Collaborative Filtering;</li>
<li>It uses the descriptions of items, so it works well with tagged items, and it usually matches the user's preference profile well;</li>
</ul>
<br />
<b>Cons</b><br />
<br />
<ul>
<li>It doesn't work so well for video or audio data that has no text tags;</li>
<li>It frequently recommends repetitive items, staying close to things the user has already seen;</li>
</ul>
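<br />
As a rough illustration of the points above, a content based recommender can be sketched as a comparison between the tags of an item the user liked and the tags of a candidate item. The example below uses Jaccard similarity, which is just one possible choice; the class name and the tags are made up. It also shows why untagged items are a problem: with no tags there is nothing to compare.<br />
<br />
```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch only: names and tags are assumptions.
class TagSimilaritySketch {

    // Jaccard similarity: size of the intersection divided by size of the union.
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 0.0;
        }
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return (double) intersection.size() / union.size();
    }

    public static void main(String[] args) {
        Set<String> liked = new HashSet<>(Arrays.asList("sci-fi", "space", "robots"));
        Set<String> candidate = new HashSet<>(Arrays.asList("sci-fi", "space", "aliens"));
        Set<String> untagged = new HashSet<>();

        // 2 shared tags out of 4 distinct tags: prints 0.5
        System.out.println(jaccard(liked, candidate));
        // An item with no tags can never match: prints 0.0
        System.out.println(jaccard(liked, untagged));
    }
}
```
<br />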
<br />
<span style="color: #6fa8dc;"><span style="font-size: large;">Hybrid Systems</span></span><br />
<br />
<b>Pros</b><br />
<br />
<ul>
<li>Usually the most effective approach (more accurate results);</li>
<li>It overcomes the limitations of each single approach; </li>
</ul>
<br />
<b>Cons</b><br />
<br />
<ul>
<li>Hard to find a balance when combining the two approaches;</li>
<li>Challenging to implement;</li>
</ul>
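<br />
One simple way to combine the two approaches (there are many others) is a weighted average of the score from each recommender. The sketch below assumes both scores are already on the same scale; the class name, the method name and the alpha parameter are made up for illustration. Choosing alpha is exactly the "balance" problem mentioned in the cons.<br />
<br />
```java
// Illustrative sketch only: a linear blend is one of many hybridization strategies.
class HybridScoreSketch {

    // alpha = 1.0 means pure collaborative, alpha = 0.0 means pure content based.
    static double blend(double collaborativeScore, double contentScore, double alpha) {
        if (alpha < 0.0 || alpha > 1.0) {
            throw new IllegalArgumentException("alpha must be between 0 and 1");
        }
        return alpha * collaborativeScore + (1.0 - alpha) * contentScore;
    }

    public static void main(String[] args) {
        // The same item scored by both recommenders, with three different weightings
        System.out.println(blend(0.8, 0.4, 1.0));
        System.out.println(blend(0.8, 0.4, 0.5));
        System.out.println(blend(0.8, 0.4, 0.0));
    }
}
```
<br />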
<br />
<br />Renata Ghisloti Duarte de Souza Granhahttp://www.blogger.com/profile/18336442605720194782noreply@blogger.com0tag:blogger.com,1999:blog-6461723396986571103.post-65868435479362370042013-07-26T17:50:00.001-07:002013-07-26T17:50:02.867-07:00 Recommender Systems Online Free Course on CourseraI already talked about Coursera's great courses <a href="http://girlincomputerscience.blogspot.com.br/2013/05/bigdata-free-web-course-online.html">here</a>.<br />
There is a new course on Recommender Systems starting in September:<br />
<br />
<a href="https://www.coursera.org/course/recsys">https://www.coursera.org/course/recsys</a><br />
<br />
I don't know how it is going to be, but based on the courses I've done so far, it looks good.Renata Ghisloti Duarte de Souza Granhahttp://www.blogger.com/profile/18336442605720194782noreply@blogger.com3tag:blogger.com,1999:blog-6461723396986571103.post-2290706307149989382013-06-18T09:01:00.002-07:002013-06-18T09:01:34.849-07:00Apache Hive .orig test file and "#### A masked pattern was here ####"<br />Just a quick note about something in Hive.<br />
<br />
If you ever typed: <br />
<br />
<div style="background-color: #f3f3f3;">
$ ant clean package test</div>
<br />
to run <a href="http://hive.apache.org/">Apache Hive</a> unit tests, you may have seen that <b>Hive sometimes creates two output files</b>.<br />
If you run, for example:<br />
<br />
<div style="background-color: #f3f3f3;">
$ ant test -Dtestcase=TestCliDriver -Dqfile=alter5.q</div>
<br />
Hive sometimes generates both an alter5.q.out and an alter5.q.out.orig file:<br />
<br />
build/ql/test/logs/clientpositive/alter5.q.out<br />
build/ql/test/logs/clientpositive/alter5.q.out.orig <br />
<br />
This happens because Hive masks any local information, such as the local time or a local path, with the following line:<br />
<br />
#### A masked pattern was here ####<br />
<br />
So, if you check <b>your .q.out file, it should contain several copies of the line above, each one covering a piece of local information</b>. This information needs to be masked so that the test outputs are the same on every computer.<br />
<br />
The .q.out.orig file has the original test output, with all the local information unmasked.<br />
<br />
Out of curiosity, the method that masks the local patterns (private void maskPatterns(String[] patterns, String fname) throws Exception) is located in:<br />
<br />
ql/src/test/org/apache/hadoop/hive/ql/QTestUtil.java<br />
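<br />
<br />
To give an idea of what this kind of masking does, here is a minimal sketch. It is not Hive's actual maskPatterns code: the class name, the patterns and the sample output lines are all made up for illustration. It simply replaces every line that matches one of the patterns with the mask line.<br />
<br />
```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Illustrative sketch only: this is NOT the code from QTestUtil.java.
class MaskSketch {

    static final String MASK = "#### A masked pattern was here ####";

    // Replaces every line that contains a match for any pattern with the mask line.
    static List<String> mask(List<String> lines, List<Pattern> patterns) {
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            boolean hit = false;
            for (Pattern p : patterns) {
                if (p.matcher(line).find()) {
                    hit = true;
                    break;
                }
            }
            out.add(hit ? MASK : line);
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical patterns, not Hive's real pattern list
        List<Pattern> patterns = Arrays.asList(
                Pattern.compile("file:/"),
                Pattern.compile("CreateTime:"));

        // Made-up test output lines
        List<String> original = Arrays.asList(
                "col1 string",
                "location: file:/home/me/warehouse/alter5",
                "col2 string");

        for (String line : mask(original, patterns)) {
            System.out.println(line);
        }
    }
}
```
<br />
In this sketch, the unmasked list plays the role of the .q.out.orig file and the returned list plays the role of the .q.out file.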
<br />
<br />Renata Ghisloti Duarte de Souza Granhahttp://www.blogger.com/profile/18336442605720194782noreply@blogger.com2tag:blogger.com,1999:blog-6461723396986571103.post-62803779417929611292013-05-21T06:34:00.000-07:002013-05-21T11:59:56.259-07:00BigData Free Course Online<a href="https://www.coursera.org/">Coursera</a> offers several great online courses from the best universities around the world. The courses involve weekly video lectures, assignments for the student, and suggested reading material.<br />
<br />
I enrolled in this <a href="https://www.coursera.org/course/bigdata">course about BigData</a> a couple of months ago, and I confess I didn't have time to start it until last week.<br />
<br />
Once I started the course I was pleased with the content presented.<br />
They talk about important <b>Data Mining algorithms</b> for dealing with large amounts of data, such as <b>PageRank</b>.<br />
<b>MapReduce</b> and <b>Distributed File Systems</b> are also two very well explained topics in this course.<br />
<br />
So, for those who want to know more about computing related to BigData this course is certainly recommended!<br />
<br />
<a href="https://www.coursera.org/course/bigdata">https://www.coursera.org/course/bigdata</a><br />
<br />
<span style="font-size: x-small;"><i>PS: The course has been offered since March, and its enrollment period will probably be over soon. But keep watching the course page, because they open new courses often.</i></span>Renata Ghisloti Duarte de Souza Granhahttp://www.blogger.com/profile/18336442605720194782noreply@blogger.com0tag:blogger.com,1999:blog-6461723396986571103.post-24387101422042870932013-04-23T08:53:00.001-07:002013-06-18T09:14:18.979-07:00How to Build Oozie with Different Versions of HadoopAfter downloading the <a href="http://oozie.apache.org/">Oozie</a> code with<br />
<br />
<div style="font-family: "Courier New",Courier,monospace;">
svn checkout http://svn.apache.org/repos/asf/oozie/tags/release-3.3.0/ .</div>
<br />
and then building it with Hadoop 1.1.0 with the familiar<br />
<br />
<div style="font-family: "Courier New",Courier,monospace;">
mvn clean compile -Dhadoop.version=1.1.0</div>
<br />
I got the following error:<br />
<br />
<div style="font-family: "Courier New",Courier,monospace;">
<span style="background-color: white;">[INFO] BUILD FAILURE</span><br />
<span style="background-color: white;">[INFO] ------------------------------------------------------------------------</span><br />
<span style="background-color: white;">[INFO] Total time: 1:06.497s</span><br />
<span style="background-color: white;">[INFO] Finished at: Tue Apr 23 12:36:53 BRT 2013</span><br />
<span style="background-color: white;">[INFO] Final Memory: 20M/67M</span><br />
<span style="background-color: white;">[INFO] ------------------------------------------------------------------------</span><br />
<span style="background-color: white;">[ERROR] Failed to execute goal on project oozie-sharelib-distcp: Could not resolve dependencies for project org.apache.oozie:oozie-sharelib-distcp:jar:3.3.0: Could not find artifact org.apache.oozie:oozie-hadoop-distcp:jar:1.1.0.oozie-3.3.0 in central (http://repo1.maven.org/maven2) -> [Help 1]</span></div>
<br />
<br />
<span style="font-family: inherit;">Reading a bit about it, and checking some pom files, I realized that inside the <span style="font-family: "Courier New",Courier,monospace;">hadooplibs</span> directory (inside the Oozie home), there are three sub-directories with the<b> Hadoop version</b></span><b> hard-coded in their poms</b>.<br />
So when you pass -Dhadoop.version, these poms don't change, and they keep using their predefined version of Hadoop!<br />
<br />
I talked to the folks in the Oozie community, and they said that the recommended approach is to change the pom files themselves rather than pass the version as a parameter.<br />
<br />
In summary, if you want to build Oozie 3.3 with a different Hadoop version, edit these pom files:<br />
<br />
<div style="background-color: #f3f3f3;">
</div>
<div style="background-color: #f3f3f3; font-family: "Courier New",Courier,monospace;">
oozie_home/hadooplibs/hadoop-1/pom.xml</div>
<div style="background-color: #f3f3f3; font-family: "Courier New",Courier,monospace;">
oozie_home/hadooplibs/hadoop-distcp-1/pom.xml</div>
<div style="background-color: #f3f3f3; font-family: "Courier New",Courier,monospace;">
oozie_home/hadooplibs/hadoop-test-1/pom.xml</div>
<div style="background-color: #f3f3f3; font-family: "Courier New",Courier,monospace;">
oozie_home/pom.xml</div>
<br />
Set the desired version of Hadoop in each of them. This, of course, is if you are building against Hadoop 1.x. If you are building Oozie with Hadoop 2.x, edit:<br />
<br />
<br />
<div style="background-color: #f3f3f3; font-family: "Courier New",Courier,monospace;">
oozie_home/hadooplibs/hadoop-2/pom.xml</div>
<div style="background-color: #f3f3f3; font-family: "Courier New",Courier,monospace;">
oozie_home/hadooplibs/hadoop-distcp-2/pom.xml</div>
<div style="background-color: #f3f3f3; font-family: "Courier New",Courier,monospace;">
oozie_home/hadooplibs/hadoop-test-2/pom.xml</div>
<div style="background-color: #f3f3f3; font-family: "Courier New",Courier,monospace;">
oozie_home/pom.xml<br />
<div style="background-color: white;">
<br /></div>
<div style="background-color: white; font-family: Verdana,sans-serif;">
</div>
<div style="background-color: white; font-family: Verdana,sans-serif;">
<br /></div>
</div>
Renata Ghisloti Duarte de Souza Granhahttp://www.blogger.com/profile/18336442605720194782noreply@blogger.com4tag:blogger.com,1999:blog-6461723396986571103.post-77713763316628192062013-04-05T16:02:00.000-07:002013-04-19T13:23:57.428-07:00HashMap JVM DifferencesAlthough Java's slogan is <a href="http://en.wikipedia.org/wiki/Write_once,_run_anywhere">"Write once, run anywhere"</a>, which emphasizes its cross-platform benefit, in practice this is unfortunately not entirely true.<br />
<br />
One known difference between Sun and other JVMs is the iteration order of a <b>HashMap</b>.<br />
<br />
When executing the exact same program and iterating through the exact same HashMap input, a Sun JVM can produce its output in a different order than another JVM.<br />
<br />
See the code below as an example:<br />
<br />
<pre class="brush: java">
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

public class HashMapTest {
    static HashMap<String, String> result = new HashMap<String, String>();
    static Iterator<Map.Entry<String, String>> entryIter;
    static HashMap<String, String> thash = new HashMap<String, String>();

    public static void main(String[] args) {
        // Insert the keys "10" down to "1", all with the same value
        for (int i = 0; i < 10; i++) {
            thash.put(Integer.toString(10 - i), "abc");
        }
        result.putAll(thash);
        // Print the entries in whatever order the HashMap iterates over them
        entryIter = result.entrySet().iterator();
        while (entryIter.hasNext()) {
            Map.Entry<String, String> entry = entryIter.next();
            String key = entry.getKey();
            String val = entry.getValue();
            System.out.println("Key: " + key + " Value: " + val);
        }
    }
}
</pre>
<br />
Compiling and executing this code with Sun Java will create the following output:<br />
<br />
<span style="font-family: "Courier New",Courier,monospace;">Key: 3 Value: abc</span><br />
<span style="font-family: "Courier New",Courier,monospace;">Key: 2 Value: abc</span><br />
<span style="font-family: "Courier New",Courier,monospace;">Key: 10 Value: abc</span><br />
<span style="font-family: "Courier New",Courier,monospace;">Key: 1 Value: abc</span><br />
<span style="font-family: "Courier New",Courier,monospace;">Key: 7 Value: abc</span><br />
<span style="font-family: "Courier New",Courier,monospace;">Key: 6 Value: abc</span><br />
<span style="font-family: "Courier New",Courier,monospace;">Key: 5 Value: abc</span><br />
<span style="font-family: "Courier New",Courier,monospace;">Key: 4 Value: abc</span><br />
<span style="font-family: "Courier New",Courier,monospace;">Key: 9 Value: abc</span><br />
<span style="font-family: "Courier New",Courier,monospace;">Key: 8 Value: abc</span><br />
<br />
Doing the same thing with IBM Java, you should get:<br />
<div style="font-family: "Courier New",Courier,monospace;">
<br /></div>
<span style="font-family: "Courier New",Courier,monospace;">Key: 10 Value: </span><span style="font-family: "Courier New",Courier,monospace;">abc</span><br />
<span style="font-family: "Courier New",Courier,monospace;">Key: 9 Value: </span><span style="font-family: "Courier New",Courier,monospace;">abc</span><br />
<span style="font-family: "Courier New",Courier,monospace;">Key: 8 Value: </span><span style="font-family: "Courier New",Courier,monospace;">abc</span><br />
<span style="font-family: "Courier New",Courier,monospace;">Key: 7 Value: </span><span style="font-family: "Courier New",Courier,monospace;">abc</span><br />
<span style="font-family: "Courier New",Courier,monospace;">Key: 6 Value: </span><span style="font-family: "Courier New",Courier,monospace;">abc</span><br />
<span style="font-family: "Courier New",Courier,monospace;">Key: 5 Value: </span><span style="font-family: "Courier New",Courier,monospace;">abc</span><br />
<span style="font-family: "Courier New",Courier,monospace;">Key: 4 Value: </span><span style="font-family: "Courier New",Courier,monospace;">abc</span><br />
<span style="font-family: "Courier New",Courier,monospace;">Key: 3 Value: </span><span style="font-family: "Courier New",Courier,monospace;">abc</span><br />
<span style="font-family: "Courier New",Courier,monospace;">Key: 2 Value: </span><span style="font-family: "Courier New",Courier,monospace;">abc</span><br />
<span style="font-family: "Courier New",Courier,monospace;">Key: 1 Value: </span><span style="font-family: "Courier New",Courier,monospace;">abc</span><br />
<br />
I don't want to get into which one is right and which one is wrong: the HashMap documentation makes no guarantees about iteration order, so both outputs are valid. I just want to alert people that this issue can cause serious differences in a program's output.<br />
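<br />
If a program needs the same order on every JVM, one option is to use a map that defines its iteration order. The sketch below (the class and method names are made up for illustration) uses the same keys as the example above: LinkedHashMap iterates in insertion order, and TreeMap iterates in sorted key order, which for String keys means lexicographic order.<br />
<br />
```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch only: class and method names are assumptions.
class OrderedMapSketch {

    // LinkedHashMap iterates in the order the keys were inserted.
    static List<String> insertionOrderKeys() {
        Map<String, String> map = new LinkedHashMap<>();
        for (int i = 0; i < 10; i++) {
            map.put(Integer.toString(10 - i), "abc");
        }
        return new ArrayList<>(map.keySet());
    }

    // TreeMap iterates in the natural (lexicographic) order of its String keys.
    static List<String> sortedKeys() {
        Map<String, String> map = new TreeMap<>();
        for (int i = 0; i < 10; i++) {
            map.put(Integer.toString(10 - i), "abc");
        }
        return new ArrayList<>(map.keySet());
    }

    public static void main(String[] args) {
        System.out.println(insertionOrderKeys()); // [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
        System.out.println(sortedKeys());         // [1, 10, 2, 3, 4, 5, 6, 7, 8, 9]
    }
}
```
<br />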
<br />Renata Ghisloti Duarte de Souza Granhahttp://www.blogger.com/profile/18336442605720194782noreply@blogger.com0tag:blogger.com,1999:blog-6461723396986571103.post-57578445248954854312013-03-28T13:30:00.002-07:002018-05-26T07:40:09.407-07:00 IBM BigData approach: BigInsightsHadoop and BigData have been two tremendously hot topics lately.<br />
<br />
<br />
Although many people want to dig into Hadoop and enjoy the benefits of Big Data, most of them don't know exactly how to do it or where to start. This is where <a href="http://www-01.ibm.com/software/data/infosphere/biginsights/">BigInsights</a> is most beneficial.<br />
<br />
<b>BigInsights is the <a href="http://hadoop.apache.org/">Apache Hadoop</a> related software from IBM</b>, and its many built-in features and capabilities give you a head start.<br />
<br />
First, besides having all Hadoop ecosystem components (Hadoop, Hbase, Hive, Pig, Oozie, Zookeeper, Flume, Avro and Lucene) already working together and tested, it has a very easy-to-use install utility.<br />
<br />
If you have ever downloaded and installed Hadoop and all its components, and tried to make sure everything was working, you know how much time an automatic installer can save.<br />
<br />
<br />
The main value brought by BigInsights is, in my opinion, the <b>friendly web interface</b> to the Hadoop tools. You don't have to program in "vim" or create MapReduce Java applications. You can use web tools, in a spreadsheet-like utility, to run queries on your data.<br />
You can import and export data to your cluster through the web-interface, and manage it too.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhidqALlMTIiqveqjRv4Iyv6771SaVHgH9xh9sRVR06wWc6qbZcxPvLOT5vPv1O9TvzJg9WYaBAOtq01fQNMt_39k1Wd_WYjZdaMt2V268ZcvhjqwnQTVO2VV8upcbsOZERheep4v00ZbQ/s1600/chap09_download_flume.gif" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="187" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhidqALlMTIiqveqjRv4Iyv6771SaVHgH9xh9sRVR06wWc6qbZcxPvLOT5vPv1O9TvzJg9WYaBAOtq01fQNMt_39k1Wd_WYjZdaMt2V268ZcvhjqwnQTVO2VV8upcbsOZERheep4v00ZbQ/s640/chap09_download_flume.gif" width="400" /></a></div>
<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhUHDGTQFfitkSlqksqfshyevqRo6QFMP1NXAISpNZ0JSA3tVJrnEOAVktbj7gLvLH57caQ-e4Fg-OkMaliuwOdgrpg4mutL3RXD45jO34YlcZdpWZA1YfCbkBWH89UrZAVo1Z6IxOZpJI/s1600/database_import_v2.gif" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="190" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhUHDGTQFfitkSlqksqfshyevqRo6QFMP1NXAISpNZ0JSA3tVJrnEOAVktbj7gLvLH57caQ-e4Fg-OkMaliuwOdgrpg4mutL3RXD45jO34YlcZdpWZA1YfCbkBWH89UrZAVo1Z6IxOZpJI/s640/database_import_v2.gif" width="400" /></a></div>
<br />
<br />
<br />
<br />
I wrote a book about BigInsights, describing what it is, how to install it and how to use it.<br />
You can find it here:<br />
<a href="http://www.redbooks.ibm.com/Redbooks.nsf/RedpieceAbstracts/sg248077.html?Open">http://www.redbooks.ibm.com/Redbooks.nsf/RedpieceAbstracts/sg248077.html?Open</a><br />
<br />
You can download the free version <a href="http://www-01.ibm.com/software/data/infosphere/biginsights/basic.html">here</a>. Renata Ghisloti Duarte de Souza Granhahttp://www.blogger.com/profile/18336442605720194782noreply@blogger.com0tag:blogger.com,1999:blog-6461723396986571103.post-85009995649469576282013-03-20T12:23:00.002-07:002013-04-05T16:04:19.064-07:00Dummy Mahout Recommender System ExampleI already talked about the open source Apache Mahout <a href="http://girlincomputerscience.blogspot.com.br/2010/11/apache-mahout.html">here</a>, and now I'll show a very simple first example of how to use its recommender system.<br />
<br />
It is a basic Java example that I used to try out Mahout. Hope it helps people starting to work with it.<br />
<br />
<br />
<pre class="brush: java">
package myexample;
import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.impl.model.XmlFile;
import org.apache.mahout.cf.taste.impl.recommender.CachingRecommender;
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity;
import org.apache.mahout.cf.taste.impl.recommender.slopeone.SlopeOneRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.ItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.AveragingPreferenceInferrer;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.List;
/*
* Renata Ghisloti - Dummy Mahout Example
*/
public class GeneralRecommender {
public static void main(String[] args) throws IOException, TasteException, SAXException, ParserConfigurationException {
String recsFile = args[0];
long userId = Long.parseLong(args[1]);
String categoriesFile = args[2];
String outputPlace = args[3];
Integer neighborhoodSize = Integer.parseInt(args[4]);
Integer method = 0;
String version = null;
//args[5] (method) and args[6] (similarity version) are optional
if(args.length >= 6 )
{
method = Integer.parseInt(args[5]);
}
if(args.length >= 7 )
{
version = args[6];
}
//Default - needed to initiate the recommendation
InputSource is = new InputSource(new FileInputStream(recsFile));
SAXParserFactory factory = SAXParserFactory.newInstance();
factory.setValidating(false);
SAXParser sp = factory.newSAXParser();
ContentHandler handler = new ContentHandler();
sp.parse(is, handler);
//Here is where you should load your own input
XmlFile dataModel = new XmlFile(new File(recsFile));
switch(method){
case 0:
recommenderItemBased(dataModel, userId , categoriesFile, outputPlace, handler, version);
break;
case 1:
recommenderItemBased(dataModel, userId , categoriesFile, outputPlace, handler, version);
break;
case 2:
recommenderSlopeOne(dataModel, userId , categoriesFile, outputPlace, handler);
break;
case 3:
recommenderUserBased(dataModel, userId , categoriesFile, outputPlace, handler, neighborhoodSize, version);
break;
}
}
//Item Based Recommender System
public static void recommenderItemBased(XmlFile dataModel, long userId ,
String categoriesFile, String outputPlace, ContentHandler handler, String version) throws TasteException{
System.out.println("Recommending with Item Based");
ItemSimilarity itemSimilarity;
//Compare string contents with equals(); == only compares references
if("LogLikelihoodSimilarity".equals(version))
itemSimilarity = new LogLikelihoodSimilarity(dataModel);
else {
itemSimilarity = new PearsonCorrelationSimilarity(dataModel);
System.out.println("Recommending with Item Based Pearson");
}
ItemBasedRecommender recommender =
new GenericItemBasedRecommender(dataModel, itemSimilarity);
//Just get top 5 recommendations
List<RecommendedItem> recommendations =
recommender.recommend(userId, 5);
//This is where you should add your own print output method
PrintXml.printRecs(dataModel, userId, recommendations, handler.map, categoriesFile, outputPlace);
}
//Slope One Recommender System
public static void recommenderSlopeOne(XmlFile dataModel, long userId ,
String categoriesFile, String outputPlace, ContentHandler handler) throws TasteException{
System.out.println("Recommending with Slope One");
CachingRecommender cachingRecommender = new CachingRecommender(new SlopeOneRecommender(dataModel));
List<RecommendedItem> recommendations =
cachingRecommender.recommend(userId, 5);
PrintXml.printRecs(dataModel, userId, recommendations, handler.map, categoriesFile, outputPlace);
}
//User based Recommender System
public static void recommenderUserBased(XmlFile dataModel, long userId ,
String categoriesFile, String outputPlace, ContentHandler handler, Integer neighborhoodSize, String version) throws TasteException{
System.out.println("Recommending with User Based");
UserSimilarity userSimilarity;
//Compare string contents with equals(); == only compares references
if("LogLikelihoodSimilarity".equals(version))
userSimilarity = new LogLikelihoodSimilarity(dataModel);
else
userSimilarity = new PearsonCorrelationSimilarity(dataModel);
userSimilarity.setPreferenceInferrer(new AveragingPreferenceInferrer(dataModel));
UserNeighborhood neighborhood =
new NearestNUserNeighborhood(neighborhoodSize, userSimilarity, dataModel);
Recommender recommender =
new GenericUserBasedRecommender(dataModel, neighborhood, userSimilarity);
List<RecommendedItem> recommendations =
recommender.recommend(userId, 5);
PrintXml.printRecs(dataModel, userId, recommendations, handler.map, categoriesFile, outputPlace);
}
}
</pre>
Renata Ghisloti Duarte de Souza Granhahttp://www.blogger.com/profile/18336442605720194782noreply@blogger.com0