Linux/POSIX commands that every Data Scientist should know

May 22, 2022

Sometimes we have to work on legacy projects or systems that have very little documentation, if any.
I see a lot of data scientists struggling to find their way around these projects, so I decided to write down a few basic and very useful Linux/POSIX commands that every data scientist/engineer/programmer should know (imho).

First, remember that you can always type

$ man command

to get more information on the command. This should tell you what the command is and how you can use it. For example, the following should give you the manual of the awk command.

$ man awk

Let's say you get a "file not found" or "library not found" error. One thing you can try is the locate command.

$ locate pattern

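locate returns every path (file or directory) that matches the pattern. With this, you can check whether a file is on your computer and where it is. Keep in mind that locate searches a prebuilt index, so very recent files may not show up until the index is refreshed (usually with sudo updatedb).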

whereis is also a good tool for finding programs, but with whereis you have to give the exact name of the program you want to find. For example:

$ whereis python

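will show you where the python program is located, along with its man pages and sources if whereis finds them. To see exactly which executable runs when you type "python" in your terminal (the first match in your PATH), use command -v python or which python.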
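If you want to check from a script which executable a command name resolves to, a minimal sketch using command -v (a POSIX shell builtin) might look like this. The helper name resolve_cmd is made up for this example.

```shell
# resolve_cmd is a hypothetical helper name, not a standard command.
# It prints the path the shell would execute for the given command name,
# or a short message if the command is not found in PATH.
resolve_cmd() {
    command -v "$1" 2>/dev/null || echo "not found: $1"
}

sh_path=$(resolve_cmd sh)                    # sh exists on practically every POSIX system
missing=$(resolve_cmd no-such-program-xyz)   # a name that should not exist

echo "$sh_path"
echo "$missing"
```

Swap sh for python (or any other program name) to check your own setup.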

Let's say you realize that you do have the file you were looking for, but you still get an error. In this case, you might not have the right permissions to access it. You can change its permissions with:

$ chmod 755 file

or

$ chmod u+x file
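In case the numbers look cryptic: each digit of 755 covers one group of users (owner, group, others), and is the sum of 4 (read), 2 (write) and 1 (execute). The sketch below tries both forms on a throwaway file; the directory and file names are only placeholders for this example.

```shell
# Work in a throwaway directory so no real files are touched.
demo_dir=$(mktemp -d)
touch "$demo_dir/script.sh"

# 755 = owner rwx (4+2+1), group r-x (4+1), others r-x (4+1).
chmod 755 "$demo_dir/script.sh"
mode_755=$(ls -l "$demo_dir/script.sh" | cut -c1-10)

# Reset to 644 (rw-r--r--), then add execute permission for the owner only.
chmod 644 "$demo_dir/script.sh"
chmod u+x "$demo_dir/script.sh"
mode_ux=$(ls -l "$demo_dir/script.sh" | cut -c1-10)

echo "$mode_755"   # -rwxr-xr-x
echo "$mode_ux"    # -rwxr--r--

rm -r "$demo_dir"
```

The symbolic form (u+x) only touches the bit you name, while the numeric form replaces all the permission bits at once.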

Let's say the program you want is not installed on your system at all. If you are on Ubuntu, you can search for it with:

$ apt-cache search pattern

This returns a list of packages matching the pattern. Find the program you want in that list, then install it with:

$ sudo apt-get install program

On macOS, we usually use brew:

$ brew install program

If what you need is a Python package, you can run:

$ pip install package

BTW, if you ever want to check the list of Python packages installed in your current environment, you can run:

$ pip freeze

Let's say you are compiling a program and getting "Error 1" as output, but you have no idea what Error 1 is or where in the code it comes from. You can type:

$ grep -r "Error 1" .

This searches recursively for the string, starting from your current directory, and prints every matching line along with the name of the file it was found in.
If there are too many results, you can type instead:

$ grep -r "Error 1" . | less

This lets you scroll up and down and read the results more comfortably.
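Two grep options that tend to help here are -n (show the line number of each match) and -l (list only the file names). Here is a small sketch on a made-up directory tree; the file names and log contents are invented for the example.

```shell
# Build a tiny directory tree with one file that contains the error text.
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/src"
printf 'all good\n' > "$demo_dir/ok.log"
printf 'step one\nmake: *** [all] Error 1\n' > "$demo_dir/src/build.log"

# -r: search recursively, -n: prefix each match with its line number.
matches=$(grep -rn "Error 1" "$demo_dir")

# -l: print only the names of the files that contain a match.
files=$(grep -rl "Error 1" "$demo_dir")

echo "$matches"   # <demo_dir>/src/build.log:2:make: *** [all] Error 1
echo "$files"     # <demo_dir>/src/build.log

rm -r "$demo_dir"
```

In a real project you would replace "$demo_dir" with . to search from the current directory, as in the commands above.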

OK, so you ran your program, but it is still not working properly. Let's say some application is getting stuck. If the program you want to kill is running in the foreground of your terminal, you can stop it by pressing Ctrl + C. If not, or if it is running in the background, you can look up its process ID with

$ ps -e

Look for your application's PID (process ID, the number beside your program's name) and type

$ kill number
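To see the whole cycle in one place, here is a minimal sketch that starts a stand-in process (sleep 300 plays the role of the stuck application), checks that it is alive, and then kills it. It uses kill -0, which sends no signal and only tests whether the process exists.

```shell
# Start a long-running stand-in process in the background.
sleep 300 &
pid=$!    # $! holds the PID of the most recent background job

# Signal 0 sends nothing; it only checks that the process exists.
if kill -0 "$pid" 2>/dev/null; then
    before="running"
else
    before="gone"
fi

# Plain kill sends SIGTERM, which asks the process to exit.
kill "$pid"

# wait reports how the process ended: 128 + 15 (SIGTERM) = 143.
status=0
wait "$pid" 2>/dev/null || status=$?

echo "$before, exit status $status"
```

If a process ignores the plain kill, kill -9 followed by the PID sends SIGKILL, which cannot be caught. Treat that as a last resort, since the program gets no chance to clean up.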

Another great resource is the find command. You can find files by name or size! For example:

$ find . -name "*.jar"

This will find all files with the .jar extension in any directory under your current directory. You can also use it to find large files, like:

$ find / -size +100M

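The above command finds all files larger than 100 MiB on your computer! Since it starts from /, it can take a while and may print "Permission denied" messages for directories you cannot read.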
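Both kinds of search can be tried safely on a scratch directory first. The sketch below uses invented file names and a byte-based size test (+1000c, meaning "more than 1000 bytes") so that it does not need a 100 MB file.

```shell
# Scratch directory with one empty .jar file and one 2048-byte file.
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/lib"
: > "$demo_dir/lib/app.jar"
dd if=/dev/zero of="$demo_dir/big.bin" bs=1024 count=2 2>/dev/null

# Search by name: the quotes stop the shell from expanding *.jar itself.
jars=$(find "$demo_dir" -name "*.jar")

# Search by size: regular files larger than 1000 bytes (c = bytes).
big=$(find "$demo_dir" -type f -size +1000c)

echo "$jars"   # <demo_dir>/lib/app.jar
echo "$big"    # <demo_dir>/big.bin

rm -r "$demo_dir"
```

Adding -type f keeps directories out of the size search, which is usually what you want when hunting for large files.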

Last, my all-time favorite: nohup. nohup is a great tool for letting a script or program keep running on a remote system even if you get disconnected from it! So let's say you have connected over ssh to whatever system you need, and you need to run a program that takes hours to finish. With nohup, you can log out of the system and the program continues to run!

$ nohup python potato.py &

will leave potato.py running while you go and finish your business elsewhere. Unless you redirect it, the program's output is saved to a file called nohup.out.
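In practice it often helps to send the output to a log file of your own and keep the PID around, so you can check on the job later. Here is a minimal sketch in which a short sh command stands in for the long-running script; the log file name is just an example.

```shell
demo_dir=$(mktemp -d)
log="$demo_dir/job.log"

# Run the job under nohup, sending stdout and stderr to our own log file.
# Reading stdin from /dev/null detaches the job from the terminal's input.
nohup sh -c 'sleep 1; echo finished' > "$log" 2>&1 < /dev/null &
job_pid=$!
echo "started job $job_pid, logging to $log"

# Only for this demo: wait for the job so we can look at its output.
# In real use you would log out here and read the log later.
wait "$job_pid"
result=$(cat "$log")
echo "$result"

rm -r "$demo_dir"
```

While a real job is still running, tail -f on the log file lets you follow its output as it is written.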

Of course, you can still be an absolutely amazing data scientist without knowing any of these, but they can definitely be life savers and might be worth the time it takes to learn them!

Just a girl in Tech
