DM565 - Formal Languages and Data Processing
 
Fall 2023
Kim Skak Larsen


Exercises
The file participants.txt will be used in many of the next examples. It will be useful to save your results (direct output to a file), since some results are reused as input in later exercises.
  1. Use the command-line tool sort to sort on the first names. What effect does sort -k 2 have? Now sort on the email addresses.
  2. Use tr twice to delete ":" and change "@" to "$".
  3. Use cut several times to remove anything other than the user name, i.e., first remove @student.sdu.dk, and then continue.
  4. Make a file with as many copies of the line @student.sdu.dk as there are lines in the original file. You can of course find out how many lines are needed using wc. ツ One way to create that file is to use seq n, where n is the number of lines you need, followed by an appropriate sed substitution using a regular expression. Now paste this file together with the file of user names from above to create full email addresses again.
  5. Use gawk to put a line number and a colon in front of the full emails from above. Remove the space following the colon in the original file, and then join these two results on the email address field.
  6. Consider the UTF-8 encoded files from last time. Using command-line tools, place a filter before wc so that characters (allowing the Danish ones) are counted, as opposed to counting bytes, as wc does.
  7. Consider the different end-of-line formats from last time. Write sed code that translates from MS-DOS to Unix style and the other way around.
  8. Mask credit card numbers so that 1234 5678 9012 3456 becomes **** **** **** 3456.
  9. With input
    Name,Team,First Test,Second Test,Third Test
    Tom,Red,5,17,22
    Joe,Green,3,14,22
    Maria,Blue,6,18,21
    Fred,Blue,2,15,23
    Carlos,Red,-1,15,24
    Phuong,Green,7,19,21
    Enrique,Green,3,16,20
    Nancy,Red,9,12,24
    
    use gawk to create individual, test, and team averages (-1 represents a non-existing result and should simply be ignored) as
    Name         Average
    ----         -------
    Tom            14.67
    Joe            13.00
    Maria          15.00
    Fred           13.33
    Carlos         19.50
    Phuong         15.67
    Enrique        13.00
    Nancy          15.00
    ----------------------
    Average, Test 1:  5.00
    Average, Test 2: 15.75
    Average, Test 3: 22.12
    ----------------------
    Average, Red:    16.00
    Average, Blue:   14.17
    Average, Green:  13.89
    
    Write the program in a script file and run it using the -f option to gawk. Do not try to write the entire program from the beginning. Start by printing the names, then try to compute the average of each individual and print that, and so on. This is prototyping! The formatting is the least important.
  10. What happens if the field separator is the empty string? Write a gawk script that counts occurrences of the normal (English) letters and digits and writes each letter and digit and the number of times it occurred. Now, to use this on program source files, count only in lines that are not comment lines (you can decide what a comment line starts with). Thus, you have to use a pattern.
  11. Use tr, sort, and uniq to take an input (text) file and produce a list of words and their frequencies in the text, sorted with the highest frequencies first.
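
As a starting point for exercises 8 and 11, the following is a minimal sketch of one possible approach, not a reference solution. The function names mask_card and word_freq are hypothetical and chosen only for illustration. The sketch assumes plain ASCII text (tr works on bytes in many implementations, so Danish letters would need extra care) and a sed that supports the -E option for extended regular expressions.

```shell
#!/bin/sh

# mask_card (hypothetical name): read lines on stdin and replace the first
# three groups of a 16-digit card number, written as four space-separated
# groups of four digits, with asterisks. The last group is kept via \1.
mask_card() {
    sed -E 's/[0-9]{4} [0-9]{4} [0-9]{4} ([0-9]{4})/**** **** **** \1/g'
}

# word_freq (hypothetical name): read text on stdin and print one
# "count word" line per distinct word, most frequent first.
word_freq() {
    tr -cs '[:alpha:]' '\n' |      # replace each run of non-letters by one newline
    tr '[:upper:]' '[:lower:]' |   # fold case so "The" and "the" are counted together
    sed '/^$/d' |                  # drop the empty line that leading non-letters can leave
    sort |                         # uniq only merges adjacent lines, so sort first
    uniq -c |                      # prefix each distinct word with its count
    sort -k1,1nr -k2,2             # highest count first, ties in alphabetical order
}

printf '1234 5678 9012 3456\n' | mask_card   # prints: **** **** **** 3456
```

To try the second function on a file, run something like word_freq < input.txt > frequencies.txt, where both file names are placeholders. Note that uniq -c pads the counts with leading spaces, which the numeric sort ignores.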

 

