Metadata-Version: 2.1
Name: eerah-A5
Version: 0.0.47
Summary: Contains programs for the course Algorithmen und Programmentwicklung für die Biologische Chemie
Home-page: UNKNOWN
Author: Philipp Fischer
Author-email: fischer-philipp@gmx.net
License: UNKNOWN
Description: # APBC 2021
        ---------------------------------------------------------------------
        270056 Algorithmen und Programmentwicklung für die Biologische Chemie
        ---------------------------------------------------------------------
        (Algorithms and program development for biological chemistry) 
        
        Usage eerah_A5 [-A* <parameters> ] [inputfile]
        
        
        
        
        * Mar-03-2021 (A0 and A1) --- Word Count
          e.g eerah_A5 -A1 -l -i <inputfile>
            * [-l (list of words)]
            * [-I ignore case]
            * [inputfile]
        
        * Mar-10-2021 (A2) --- Optimization
          e.g eerah_A5 -A2 -o <inputfile>
            * [-o  optimize the cost (instead of enumerating)]
            * [inputfile]
          
        * Mar-24-2021 (A3) --- Dynamic programming
          e.g eerah_A5 -A3 -d <inputfile>
            * [-t trace  prints one optimal solution]
            * [-d use diagonal elements]
            * [inputfile]
        
        * Apr-14-2021 (A4) --- Random Sequences
          e.g eerah_A5 -A4 -N 4 -K 3 KLetShuffle <inputfile>
            * [RollingDive, MonoShuffle, KLettShuffle]
            * [-N number of output sequences]
            * [-k for KLetShuffle - klet frequency]
            * [inputfile]
            
        
        ##A1
        Write a program that reads in a text (file name given on the command
        line) and counts the total number of words and the number of different
        words. On request, the program should print a list of the words.
        
        * If option -I (for 'Ignore') is given, case shall be ignored (by
          converting all upper case to lower case, see below).
        
        * If option -l (for 'list') is present, the program must print a list
          of words instead only counts.  This list must be sorted by word
          frequency, starting with the most common words.  ( per line print
          one word and its frequency, separated by a tab symbol; in case of
          equal frequencies, the words must be sorted alphabetically.). Please
          have a look at the example inputs and outputs; your program should
          produce outputs in exactly the same format.
          
        ##A2
        ### Optimizing the administration of Atirasu
        
        The government of the federal state Atirasu plans to modernize its administration by creating four new authorities for generally
        unremarkable affairs.
        These authorities shall be distributed to the capitals of eight
        provinces such that each two provinces share one authority.
        Consequently, the set of capitals
        
        ```
          {B,E,G,I,K,L,P,S}
        ```
        
        shall be partitioned into four subsets of two elements each.
        
        The cost of a partition is calculated as sum over the costs per
        authority, where the cost per authority depends on the two assigned
        cities and can be read from the following matrix (in million Euros /
        year). For example, putting authority 1 in charge of the provinces
        with capitals B and I costs 2 million Euros per year.
        
        ```
          8 10                     # number of capitals; cost limit
           B  E  G  I  K  L  P  S  # names of capitals
           - 10 10  2 10 10 10 10  # symmetric cost matrix
          10  -  2 10 10 10  1 10
          10  2  - 10  2  3  3  3
           2 10 10  -  4 10 10  2
          10 10  2  4  - 10 10  3
          10 10  3 10 10  -  2  2
          10  1  3 10 10  2  - 10
          10 10  3  2  3  2 10  -
        ```
        
        The federal government can provide 10 million Euros per year; it wants to
        know all the possible combinations that don't exceed this cost limit.
        
        The output lists like:
        
        ```
          BE GI KL SP
        ```
        
        When given the flag -o, the program must optimize the cost (instead
        of enumerating). The cost limit should be used as initial bound. As
        result it print the score of the best solution.
        
        
        ##A3
        ###dynamic programming.
        The Manhattan Tourist Problem:
        ------------------------------
        
        The street network of Manhattan has the form of a grid. Our tourist values streets (from crossing to crossing) by the number of sights along them.
        
        ```
               start here
               |
               v
               +--3--+--3--+
               |     |     |
               1     6     2
               |     |     |
               +--3--+--2--+
               |     |     |
               4     0     7
               |     |     |
               +--5--+--7--+
                           ^
                           |
                          end here
        ```
        
        The problem is to find a path from the top left to the bottom right corner of the grid with maximum weight. In each step, the tourist moves either to the east or to the south.
        
        The program reads <inputfile> as followed:
        
        ```
             # size (north-south dimension times west-east dimension)
             # 3 3
             # north-south streets
             1 6 2
             4 0 7
             # west-east streets
             3 3
             3 2
             5 7
        ```
        
        
        ```
        #G_down: 4 5
          0.60   0.65   0.91   0.94   0.14
          0.85   0.27   0.70   0.31   0.63
          0.63   0.23   0.35   0.77   0.20
          0.37   0.76   0.41   0.30   0.67
        #---
        #G_right: 5 4
          0.76   0.41   0.72   0.13
          0.57   0.64   0.62   0.62
          0.37   0.98   0.36   0.24
          0.99   0.77   0.39   0.35
          0.37   0.34   0.62   0.82
        #---
        #G_diag: 4 4
          6.74   7.03   2.47   6.25
          4.48   3.75   2.98   3.62
          7.90   3.63   3.67   3.18
          9.30   8.40   9.02   2.58
        #---
        ```
        
        ##A4
        ###Generating random sequences 
        
        Create a random sequence based on either RollingDive, MonoShuffle or KLetShuffle  (k-mer
        frequencies).
        
        ## Random sequences (preserving the frequencies of the single symbols):
        
        ### Rolling dice:
        
        The program reads in a  sequence and determines the
        frequencies of its symbols. Then, it generates N random sequences of
        the same length, where symbols are drawn at random with the same
        frequencies. 
        
        Here is an example of input, frequencies and possible output
        sequences.
        
        ```
              CUUUUGCUAG
        
                  |
        	  V
              
              A => 0.1             UGUACGCUGA
              C => 0.2    -----\   ACCUUCAUCU
              G => 0.2    -----/   UCUUCCUCCC
              U => 0.5             GUGCAUUUGU
        ```
        
        ### MonoShuffle:
        
        The program reads in a sequence and generates random
        sequences by randomly shuffling (i.e. swapping symbols of) the input
        sequence.
        
        e.g
        For input of length n, perform exactly n-1 swaps like in this example
        trace of an algorithm run:
        
        ```
              C <-+ G     G     G     G     G     G     G     G     G
              U   | U <-+ C     C     C     C     C     C     C     C
              U   | U   | U <-+ U     U     U     U     U     U     U
              U   | U   | U   | U <-+ C     C     C     C     C     C
              U   | U   | U <-+ U   | U <-+ G     G     G     G     G
              G <-+ C <-+ U     U   | G   | G <-+ A     A     A     A
              C     C     C     C <-+ U   | U   | U <-+ G     G     G
              U     U     U     U     U   | U   | U   | U <-+ U     U
              A     A     A     A     A   | A <-+ G <-+ U   | U <-+ U
              G     G     G     G     G <-+ U     U     U <-+ U <-+ U
        ```
        
        Remark: this method produces permutations of the input, preserving the
        frequencies of symbols in the input exactly; pay attention to produce
        each possible permutation with the same probability. This method is
        known as Fisher-Yates or Knuth shuffling.
        
        
        ## K-let-Shuffling (Shuffling preserving k-let frequencies)
        
        The program reads in a sequence and generates random
        sequences that preserve the k-let frequencies of the original
        sequence. The _k-lets_ of a sequence s are its subsequences
        s[i..i+k-1] of length $k$ (e.g. for dinucleotide frequencies, we
        consider the 2-lets).
        
        The k-let shuffling algorithm (for k>=2) works as follows
        
        1. Determine all subsequences s[i..i+k-2] (of length k-1) of the
           input sequence s. 
        
        2. These subsequences (the (k-1)-lets of s) form the vertices of
           a graph (which initially has no edges).
        
        ```
             CUUUUGCUAG               
                                        
                 | for k=2             
                 V                      
                                        
                A     U                  
                                        
                
                 
                 G    C                    
        ```
                                        
        3. Run through the original sequence and insert an edge between
           each successive (k-1)-lets. Note that two (k-1)-lets could
           occur several times in direct succession; in this case, one
           must record an additional edge for each occurrence. (Thus, we
           construct a a _multi_-graph.)
             
        ```
        	 CUUUUGCUAG                      +-----+
                                                 |+---+|
                     | for k=2                   ||+-+||
        	     V                           ||| |||
        	                                 vvv |||
        	    A     U                  A<---U--+++
                                     ->      |  / ^^	
        	                             | /  ||	
        				     vv   ||	
        	     G    C                   G-->C     
        ```
        
        4.  Each Euler-path in this multigraph corresponds to a shuffled
            version of the original sequence; this shuffled sequence
            preserves the k-let frequencies!  Recall: Euler-paths are
            paths that contain each edge of the graph exactly once (but
            can visit some vertices more than once.)
        
        5.  The true art is to find a method that randomly selects
            Euler-paths, such that each Euler-path is selected with equal
            probability (meaning uniform distribution of shuffled
            sequences).
        
            The following two papers present solutions to this problem:
        
            1. Kandel D et al (1996) Shuffling biological sequences.
               Discrete Appl Math 71:171-185
               [doi:10.1016/S0166-218X(97)81456-4](https://doi.org/10.1016/S0166-218X(97)81456-4)
               
            2. Jiang M et al (2008) uShuffle: a useful tool for shuffling
               biological sequences while preserving the k-let counts.
               BMC Bioinformatics 9:192
               [doi:10.1186/1471-2105-9-192](https://doi.org/10.1186/1471-2105-9-192)
               
        
Keywords: APBC2021,python3
Platform: UNKNOWN
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.6
Description-Content-Type: text/markdown
