DM565 - Formal Languages and Data Processing
 
Fall 2023
Kim Skak Larsen


Exercises
  1. Consider the given files. Use the command-line tools to inspect these files, i.e., what do file and wc say about them? Inspect them using od with appropriate options; start with od -tcuC, for instance. How do they differ? Discuss pros and cons of the formats. Try recode to change from one character encoding to another.
  2. Using the Python csv package, read a file in the default csv format and output it in tsv format.
  3. Define separate grep -E regular expressions matching lines with
    1. Scandinavian email addresses.
    2. CPR numbers.
    3. phone numbers written as 2 groups of 4 digits or 4 groups of 2 digits; groups separated by one space.
    4. dates in the Danish format 1/1 1970.
    In all of these problems, we are interested in the format only. Thus, you do not have to worry about exactly which characters are legal in email addresses, whether months have 30 or 31 days, or whether CPR numbers are legal according to checksum rules, etc.
  4. Using /usr/share/dict/words (or similar), define separate grep -E regular expressions matching lines (words, since there is only one word per line in that file) with
    1. a consecutive repetition of the same character at least three times.
    2. a consecutive repetition of the same sequence of four characters.
    3. a repetition of total length 4 and a palindrome of total length 4.
    4. words without vowels (a, e, i, o, u, y); use an option.
  5. Define separate grep -E regular expressions matching lines with
    1. an opening and closing HTML headline tag, e.g., <h2>My Headline</h2>; use an option to make the match case-insensitive, then use an option to print the line number for every match. You may require that headlines are on a line by themselves (and of course not nested).
    2. numbers in the range 1000 through 9999.
    3. numbers in the range 100 through 9999.
  6. Using ls -l | grep -E REGULAR_EXPRESSION, list all files in some directory that
    1. others can read or write (it is the 8th and 9th characters that are relevant).
    2. were created in November and are pdf files.
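As a starting point for Exercise 2, the following is a minimal sketch of a CSV-to-TSV conversion with the Python csv package. The function name csv_to_tsv and the sample data are hypothetical, and in-memory streams are used instead of real files so that the example is self-contained; it is one possible approach, not a reference solution.

```python
import csv
import io

def csv_to_tsv(src, dst):
    """Copy rows from a default-dialect CSV stream to a tab-separated stream."""
    reader = csv.reader(src)  # the default dialect is "excel" (comma-separated)
    writer = csv.writer(dst, delimiter="\t", lineterminator="\n")
    for row in reader:
        writer.writerow(row)

# Hypothetical sample data; the quoted field contains a comma.
sample = io.StringIO('name,city\n"Hansen, Anna",Odense\n')
out = io.StringIO()
csv_to_tsv(sample, out)
print(out.getvalue(), end="")
```

With real files, the csv documentation recommends opening them with newline="" so that the module handles line endings itself.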
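For the format-only patterns in Exercise 3, candidate expressions can be tried out on a few test lines before running them through grep -E. The sketch below does this with Python's re module, whose syntax overlaps with, but is not identical to, extended regular expressions; in particular, \b is a GNU extension in grep. The patterns, the helper matching_lines, and the test lines are all assumptions chosen for illustration.

```python
import re

# Assumed format-only patterns; Python's re stands in for grep -E here.
# Phone: two groups of four digits or four groups of two digits, single spaces.
PHONE = re.compile(r"\b([0-9]{4} [0-9]{4}|[0-9]{2}( [0-9]{2}){3})\b")
# Date: day/month, one space, four-digit year, as in 1/1 1970.
DATE = re.compile(r"\b[0-9]{1,2}/[0-9]{1,2} [0-9]{4}\b")

def matching_lines(pattern, lines):
    """Return the lines in which the pattern matches somewhere, like grep."""
    return [line for line in lines if pattern.search(line)]

# Hypothetical test lines.
lines = [
    "call 1234 5678 today",
    "call 12 34 56 78 today",
    "call 123 45 678",
    "born 1/1 1970",
    "version 1/1",
]

print(matching_lines(PHONE, lines))
print(matching_lines(DATE, lines))
```

Neither pattern checks that the digits form a valid phone number or date, in line with the remark that only the format matters.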
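For Exercise 6, it can help to test an expression against a few hand-written ls -l lines first. The sketch below again uses Python's re module as a stand-in for grep -E. The sample lines are invented, and the date layout in real ls -l output depends on the locale and shows the modification time, so both patterns are assumptions rather than definitive answers.

```python
import re

# Characters 8 and 9 of an ls -l line are the read and write bits for others.
OTHERS_RW = re.compile(r"^.{7}(r.|.w)")
# Assumes an English locale where the month appears as " Nov " in the listing.
NOV_PDF = re.compile(r" Nov .*\.pdf$")

# Hypothetical ls -l output lines.
listing = [
    "-rw-r--r-- 1 user group 0 Nov  3 10:00 notes.pdf",
    "-rw------- 1 user group 0 Nov  4 10:00 secret.pdf",
    "-rw-rw--w- 1 user group 0 Oct 12 10:00 Nov.pdf",
]

others_rw = [line for line in listing if OTHERS_RW.search(line)]
nov_pdf = [line for line in listing if NOV_PDF.search(line)]

print(others_rw)
print(nov_pdf)
```

The third line shows why the spaces around Nov matter: a file merely named Nov.pdf does not count as a November file.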

 

