Linux/POSIX commands that every Data Scientist should know
Sometimes we have to work on legacy projects or systems that have very little documentation, if any.
I see a lot of data scientists struggling to find their way around these projects, so I decided to write down a few basic but very useful Linux/POSIX commands that every data scientist/engineer/programmer should know (imho).
First, remember that you can always type
$ man command
to get more information on the command. This should tell you what the command is and how you can use it. For example, the following should give you the manual of the awk command.
$ man awk
Let's say you have a File/Library not found error. One thing you can try is the locate command.
$ locate pattern
locate will return every file path that matches the pattern passed. With this, you can check whether a file is on your computer, and where it is. Keep in mind that locate searches a prebuilt index, so very recent files may not show up until that index is refreshed.
whereis is also a good tool for finding programs, but you have to give it the exact name of the program you want. For example
$ whereis python
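will show you where the python program is installed, along with its manual pages if it finds them. Note that whereis searches a set of standard directories, so it may list several matches rather than only the one that runs when you type "python" in your terminal.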
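If whereis is not available (it is not part of POSIX), the shell built-in command -v gives a similar answer for anything in your PATH. Here is a minimal sketch. It uses sh as the example program only because it exists on practically every system, and no-such-program-xyz is a made-up name to show the failure case.

```shell
# Print the path of the program the shell would run for "sh".
# "sh" is just an example name; substitute python, awk, etc.
sh_path=$(command -v sh)
echo "sh resolves to: $sh_path"

# command -v exits non-zero when nothing is found, so it works in an if.
# "no-such-program-xyz" is a hypothetical name that should not exist.
if command -v no-such-program-xyz >/dev/null 2>&1; then
    found=yes
else
    found=no
fi
echo "no-such-program-xyz found? $found"
```

The exit status is what makes this handy in scripts: you can check for a dependency before trying to use it.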
Let's say you find that you do have the file you were looking for, but you still get an error. In that case, you might not have the right permissions to access it. You can change its permissions with:
$ chmod 755 file
or
$ chmod u+x file
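To make the two forms concrete, here is a small sketch that runs them on a throwaway file and prints the permission bits after each step. The file name hello.sh and the temporary directory are placeholders made up for the demo, and mktemp -d is assumed to be available (it is on most Linux and macOS systems).

```shell
# Work in a throwaway directory so nothing real is touched.
tmpdir=$(mktemp -d)
script="$tmpdir/hello.sh"
printf '#!/bin/sh\necho hello\n' > "$script"

chmod 644 "$script"      # rw-r--r-- : readable, but not executable
ls -l "$script"

chmod u+x "$script"      # add execute for the owner only: rwxr--r--
ls -l "$script"

chmod 755 "$script"      # rwxr-xr-x : owner can write, everyone can read and run
ls -l "$script"

# [ -x file ] tests whether the current user may execute the file.
if [ -x "$script" ]; then is_exec=yes; else is_exec=no; fi
echo "executable after chmod 755? $is_exec"

rm -rf "$tmpdir"         # clean up the demo directory
```

In the numeric form each digit is a sum of read (4), write (2) and execute (1), for owner, group and others in that order. The symbolic form (u+x) changes only the bits you name and leaves the rest alone.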
Let's say the program you want is not installed at all on your system. If you are on Ubuntu, you can search for it with:
$ apt-cache search pattern
With this you will get a list of packages matching pattern. Find the program you want in the list, then install it with:
$ sudo apt-get install program
On macOS, we usually use Homebrew (brew):
$ brew install file
If what you need is a python package, you can run:
$ pip install package
BTW if you ever want to check the list of python packages installed on your computer, you can run:
$ pip freeze
Let's say you are compiling a program and getting "Error 1" as output, but you have no idea what Error 1 is or where in the code it could come from. You can type:
$ grep -r "Error 1" .
This will search recursively for the string, starting from your current directory, and print every matching line along with the name of the file it is in.
If there are too many results, you can type instead
$ grep -r "Error 1" . | less
This lets you scroll up and down through the results and read them more comfortably (press q to quit).
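Two extra grep flags are often useful here: -n to show line numbers and -l to list only the file names. The sketch below builds a tiny directory tree to search. The file names and contents are invented for the demo, and mktemp -d is assumed to be available.

```shell
# Build a small throwaway tree with one file that contains the message.
tmpdir=$(mktemp -d)
mkdir -p "$tmpdir/src/lib"
printf 'all good here\n' > "$tmpdir/src/main.txt"
printf 'line one\nmake: *** Error 1\n' > "$tmpdir/src/lib/build.log"

# -r: recurse into directories, -n: show line numbers,
# so each hit is printed as file:line:text
grep -rn "Error 1" "$tmpdir"

# -l: print only the names of the files that contain a match
matches=$(grep -rl "Error 1" "$tmpdir")
echo "$matches"

rm -rf "$tmpdir"         # clean up the demo directory
```

The -l form is a good first pass on a big project: it tells you which files to open before you look at individual lines.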
OK, so you ran your program, but it is still not working properly. Let's say some application is getting stuck. If the program you want to kill is running in the foreground of your terminal, you can stop it by pressing CTRL + C. If not, or if it is running in the background, you can look up its process ID with
$ ps -e
Look for your application's PID (process ID, the number beside your program's name) and type
$ kill number
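Here is the whole cycle as a small sketch you can run safely. The sleep 300 command is only a stand-in for a stuck program, and the script grabs its PID from the shell instead of reading it off the ps -e listing.

```shell
# "sleep 300" stands in for a stuck program; it is only a placeholder.
sleep 300 &
pid=$!                       # $! holds the PID of the last background job
echo "started sleep with PID $pid"

# Show just that process instead of scanning the whole ps -e listing.
ps -p "$pid" || echo "ps could not show PID $pid"

kill "$pid"                  # sends SIGTERM, a polite request to exit

# wait reaps the job and returns its exit status, which is non-zero here
# because the process was terminated by a signal.
status=0
wait "$pid" 2>/dev/null || status=$?
echo "exit status: $status"

# If a process ignores SIGTERM, "kill -9 PID" (SIGKILL) is the last resort.
```

Plain kill gives the program a chance to clean up, so it is worth trying before kill -9, which stops the process immediately.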
Another great resource is the find command. You can find files by name or size! For example:
$ find . -name "*.jar"
This will find all files with the .jar extension in any directory under your current directory. You can also use it to find large files, like:
$ find / -size +100M
The above command finds all files larger than 100 MB on your computer! Expect some "Permission denied" messages along the way unless you run it as root.
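Both kinds of search can be tried on a small scale first. The sketch below creates a few demo files (the names app.jar, legacy.jar and big.bin are invented) and searches them by name and by size. It assumes mktemp -d is available and that your find accepts the k and M size suffixes, which GNU and BSD find do even though POSIX does not require them.

```shell
# Build a throwaway tree: two empty .jar files and one 200 KiB file.
tmpdir=$(mktemp -d)
mkdir -p "$tmpdir/libs/old"
: > "$tmpdir/libs/app.jar"
: > "$tmpdir/libs/old/legacy.jar"
dd if=/dev/zero of="$tmpdir/big.bin" bs=1024 count=200 2>/dev/null

# By name: quote the pattern so the shell does not expand it first.
jars=$(find "$tmpdir" -type f -name "*.jar" | sort)
echo "$jars"

# By size: +100k means "more than 100 KiB".
big=$(find "$tmpdir" -type f -size +100k)
echo "$big"

# When searching from /, silencing permission errors keeps output readable:
#   find / -type f -size +100M 2>/dev/null

rm -rf "$tmpdir"         # clean up the demo directory
```

Adding -type f restricts the results to regular files, which is usually what you want when hunting for whatever is filling the disk.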
Last, my favorite of all time: nohup. nohup is a great tool for letting a script or program keep running on a remote system even if you get disconnected from it! So let's say you have sshed into whatever system you need, and have to execute a program that takes hours to finish. With nohup, you can log out of the system and the program continues to run!
$ nohup python potato.py &
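will leave potato.py running in the background while you go and finish your business elsewhere. By default, anything it prints is saved to a file named nohup.out in the directory you started it from.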
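In practice it helps to pick your own log file and to save the PID, so you can check on the job or kill it later. Here is a minimal sketch. The sh -c command is a short stand-in for a long job like python potato.py, the names job.log and job.pid are invented for the demo, and mktemp -d is assumed to be available.

```shell
tmpdir=$(mktemp -d)
log="$tmpdir/job.log"

# The sh -c '...' command is a stand-in for a long job.
# Redirecting stdout and stderr explicitly replaces the default nohup.out.
nohup sh -c 'sleep 1; echo finished' > "$log" 2>&1 &
pid=$!
echo "$pid" > "$tmpdir/job.pid"   # save the PID so you can find the job later

# After logging back in you could follow the log with:  tail -f job.log
# Here we simply wait for the short demo job and read its log.
wait "$pid"
output=$(cat "$log")
echo "$output"

rm -rf "$tmpdir"         # clean up the demo directory
```

With the PID saved in a file, stopping the job later is just a matter of passing that number to kill.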
Of course you can still be an absolutely amazing data scientist without knowing any of these, but they can definitely be life savers and might be worth taking the time to learn them!
:D