Exam 17.1.2023: Protein histogram

NPRG041 Programování v C++ (Programming in C++): a basic course in object-oriented programming in C++. Classes and objects, encapsulation, methods, polymorphism. Abstract data types, overloading. Containers, iterators, algorithms. Templates, generic programming, compile-time polymorphism. Exceptions. Safe and portable programming, OS bindings.

Post by dalsineznamymatfyzak

Intro: Proteins are large biological molecules consisting of amino acids. In general, the genetic code specifies 20 standard amino acids. This assignment is based on the systematic exploration of the distribution of certain amino acids in proteins’ structures.

3-letter codes of the 20 standard amino acids:

ALA ARG ASN ASP CYS GLU GLN GLY HIS ILE LEU LYS MET PHE PRO SER THR TRP TYR VAL

Your task

Implement a program invoked like:

program_name configuration_file [output_file]

The command line always contains the configuration file name and may also contain an output file name. If no output file name is given, standard output is used.
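
A minimal sketch of how the output target could be chosen, assuming the arguments are passed straight through from main(). The class name Output and its members are hypothetical, and falling back to standard output when the output file cannot be opened is an assumption, since the assignment does not say what should happen in that case.

```cpp
#include <cassert>
#include <fstream>
#include <iostream>

// Hypothetical helper: owns the optional output file and hands out
// whichever stream should receive the results.
class Output {
public:
    // argc/argv as received by main(); argv[2] is the optional output file.
    Output(int argc, const char* const* argv) {
        assert(argv != nullptr);
        if (argc >= 3) file_.open(argv[2]);
    }

    // The file stream if an output file was given and opened,
    // otherwise standard output (assumed fallback).
    std::ostream& stream() {
        return file_.is_open() ? static_cast<std::ostream&>(file_) : std::cout;
    }

    bool to_file() const { return file_.is_open(); }

private:
    std::ofstream file_;
};
```

In main() this would be used as Output out(argc, argv); followed by writes to out.stream(), so the rest of the program never needs to know which target is active.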

Both the configuration file and the data files are row-oriented.

Data file structure

One file describes one protein and contains information about all its amino acids and their spatial coordinates (x, y, z – discrete values). Each row begins with the 3-letter amino acid code and continues with the spatial coordinates for that amino acid.

Keep in mind: coordinates can be negative, and the fields on a line can be separated by one or more whitespace characters.

(Note: This is a simplification of a real PDB file describing protein structures).
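
The row format could be parsed roughly as follows. This is only a sketch: Residue and parse_row are made-up names, and rejecting rows with extra fields is an assumption about what counts as a wrong number of coordinates. Coordinates are read in base 10 on purpose, because the test data contains values such as 0298 and -0149 that must not be taken as octal. Checking that the code is one of the 20 standard ones is left to the caller.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <optional>
#include <sstream>
#include <stdexcept>
#include <string>

// One parsed row of a data file (illustrative type).
struct Residue {
    std::string code;         // 3-letter amino acid code, e.g. "ARG"
    std::array<long, 3> xyz;  // x, y, z
};

// Parses "CODE x y z"; returns std::nullopt when the row is malformed.
std::optional<Residue> parse_row(const std::string& line) {
    std::istringstream in(line);  // operator>> skips any run of whitespace
    Residue r;
    if (!(in >> r.code)) return std::nullopt;  // empty row
    for (long& c : r.xyz) {
        std::string tok;
        if (!(in >> tok)) return std::nullopt;  // too few coordinates
        std::size_t used = 0;
        try {
            c = std::stol(tok, &used, 10);  // base 10: "0298" is 298
        } catch (const std::exception&) {
            return std::nullopt;  // not a number, or out of range
        }
        if (used != tok.size()) return std::nullopt;  // trailing junk, e.g. "12a"
    }
    std::string extra;
    if (in >> extra) return std::nullopt;  // too many fields (assumed invalid)
    return r;
}
```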

Configuration file structure

R-neighborhood .. is an integer
Pattern ......... is a sequence of one or more amino acids separated by one or more whitespaces
protein_1 ....... is the name of the first data file
...
protein_N ....... is the name of the N-th data file
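
One possible way to read this layout is sketched here. Config and parse_config are hypothetical names. The sketch assumes one file name per line without embedded whitespace and skips blank lines; it does not check that the pattern entries are valid amino acid codes, which would still have to be done separately.

```cpp
#include <cassert>
#include <istream>
#include <optional>
#include <sstream>
#include <string>
#include <vector>

struct Config {
    long r = 0;                        // R-neighborhood
    std::vector<std::string> pattern;  // amino acid codes, in the given order
    std::vector<std::string> files;    // data file names
};

// Returns std::nullopt when the first two lines do not match the layout.
std::optional<Config> parse_config(std::istream& in) {
    Config cfg;
    std::string line, token;

    // Line 1: exactly one integer.
    if (!std::getline(in, line)) return std::nullopt;
    std::istringstream first(line);
    if (!(first >> cfg.r) || (first >> token)) return std::nullopt;

    // Line 2: one or more whitespace-separated codes.
    if (!std::getline(in, line)) return std::nullopt;
    std::istringstream second(line);
    while (second >> token) cfg.pattern.push_back(token);
    if (cfg.pattern.empty()) return std::nullopt;

    // Remaining lines: one data file name each (blank lines skipped).
    while (std::getline(in, line)) {
        std::istringstream rest(line);
        if (rest >> token) cfg.files.push_back(token);
    }
    return cfg;
}
```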

Histogram

The R-neighborhood of an amino acid is the set of points within distance R of it, measured separately along each axis. For an amino acid at coordinates [x,y,z], it consists of all points whose coordinates lie in the range [x-R..x+R, y-R..y+R, z-R..z+R].
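
Read this way, membership in the R-neighborhood is a per-coordinate distance test. A small sketch, with Point and in_neighborhood as illustrative names:

```cpp
#include <array>
#include <cassert>
#include <cstdlib>

using Point = std::array<long long, 3>;

// True when q lies in the R-neighborhood of p, i.e. every coordinate of q
// is within r of the matching coordinate of p (boundaries included).
bool in_neighborhood(const Point& p, const Point& q, long long r) {
    for (int i = 0; i < 3; ++i) {
        if (std::llabs(p[i] - q[i]) > r) return false;
    }
    return true;
}
```

The test is symmetric, so q is in the neighborhood of p exactly when p is in the neighborhood of q.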

A histogram is constructed over the points in discrete 3D space at which an amino acid from the set of specified proteins is located. For each such point, we count, for every amino acid type specified in the pattern, the amino acids of that type in its R-neighborhood. Let these numbers be (in the order given by the pattern) [c1..cn]. Then the record corresponding to the values [c1..cn] is incremented. The resulting histogram is built by incrementing the records in this way for the R-neighborhoods of all points corresponding to the amino acids of all input proteins.

Output

The output is row-oriented. Each line has the form [c1 c2 ... cn]: count, where count is the number of points that produced the values [c1..cn].

The output is sorted lexicographically, i.e.

[0 0 0 1]: xxx
[0 0 0 2]: xxx
...
[0 0 1 0]: xxx
[0 0 1 2]: xxx
...
[0 0 2 1]: xxx

Only non-zero occurrences are included in the output.
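
The required ordering comes almost for free if the records are kept in a std::map keyed by the count vector, because std::vector compares lexicographically. The following is a sketch under that assumption; Histogram and print_histogram are illustrative names, not something prescribed by the assignment.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <ostream>
#include <sstream>
#include <vector>

// Records keyed by [c1..cn]; map iteration order is lexicographic.
using Histogram = std::map<std::vector<int>, long>;

// Writes one "[c1 c2 ... cn]: count" line per record with a non-zero count.
void print_histogram(const Histogram& h, std::ostream& out) {
    for (const auto& [counts, occurrences] : h) {
        if (occurrences == 0) continue;
        out << '[';
        for (std::size_t i = 0; i < counts.size(); ++i) {
            out << (i ? " " : "") << counts[i];
        }
        out << "]: " << occurrences << '\n';
    }
}
```

Incrementing a record is then just ++h[counts], and no separate sorting step is needed before printing.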

Example

Configuration file:

6000
ARG LYS
simple.pdb

Data file (simple.pdb):

ARG 14872 -18107 30327
LYS 16112 -17325 26790
HIS 17615 -20594 25563
ILE 18797 -24042 26472
ARG 21860 -24523 24296
ARG 24156 -21734 23132
GLY 27393 -22378 21345
HIS 29225 -19697 19391
ALA 32741 -18808 18304

Output:

[1 0]: 1
[1 1]: 2
[2 0]: 3
[2 1]: 1
[3 0]: 1
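
Working through this example by hand suggests two details that the text leaves implicit. First, each amino acid counts as part of its own neighborhood (the first ARG only reaches [1 1] if it counts itself). Second, the all-zero record does not appear in the output: the final ALA has no ARG or LYS within 6000, and the nine rows yield only eight histogram entries. The brute-force sketch below encodes both as assumptions and can serve as a reference to compare a faster solution against. Residue, Histogram and brute_force are illustrative names, and the function treats everything it is given as one set of points; whether neighborhoods should span several proteins is not settled by this example.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <map>
#include <string>
#include <vector>

struct Residue {
    std::string code;
    std::array<long, 3> xyz;
};
using Histogram = std::map<std::vector<int>, long>;

// O(n^2) reference: for every residue, count pattern matches in its
// R-neighborhood and increment the record for the resulting vector.
Histogram brute_force(const std::vector<Residue>& rs,
                      const std::vector<std::string>& pattern, long r) {
    Histogram h;
    for (const Residue& center : rs) {
        std::vector<int> counts(pattern.size(), 0);
        bool any = false;
        for (const Residue& other : rs) {  // includes center itself (assumed)
            bool inside = true;
            for (int i = 0; i < 3; ++i) {
                inside = inside && std::labs(center.xyz[i] - other.xyz[i]) <= r;
            }
            if (!inside) continue;
            for (std::size_t k = 0; k < pattern.size(); ++k) {
                if (other.code == pattern[k]) { ++counts[k]; any = true; }
            }
        }
        if (any) ++h[counts];  // all-zero record left out, as in the example
    }
    return h;
}
```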

One of the data files used in tests:

(Originally attached as a link that will probably die soon, so the file is copied at the bottom of this post.)

Assumptions and efficiency requirements

The discrete 3D space where all the amino acids are located is large, think on the order of 100000^3. It is therefore not possible to store in memory a map with data for every point in this space.

The space is only sparsely occupied by amino acids: assume tens to low hundreds of amino acids (occupied points). Choose a data representation that makes the necessary operations as efficient as possible.

It is certainly not efficient to search every point of the entire space for each amino acid, nor to go through all other amino acids entered.

You may find it useful to observe that, for each amino acid and each dimension, the slab covered by its R-neighborhood in that dimension (e.g. the subspace [x-R..x+R, *, *]) contains few enough other amino acids that it can be searched sequentially.
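
One way to act on this hint, offered as a sketch rather than the intended solution: sort the amino acids by x once, locate the start of the slab [x-R..x+R, *, *] with a binary search, and scan only that slab while checking y and z. Residue, by_x, sort_by_x and count_in_neighborhood are hypothetical names.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstdlib>
#include <string>
#include <vector>

struct Residue {
    std::string code;
    std::array<long, 3> xyz;
};

bool by_x(const Residue& a, const Residue& b) { return a.xyz[0] < b.xyz[0]; }

// Sorting by x makes every slab [x-R, x+R] a contiguous range.
void sort_by_x(std::vector<Residue>& rs) { std::sort(rs.begin(), rs.end(), by_x); }

// Counts residues with the given code in the R-neighborhood of center.
// Precondition: rs is sorted by x.
int count_in_neighborhood(const std::vector<Residue>& rs, const Residue& center,
                          const std::string& code, long r) {
    assert(std::is_sorted(rs.begin(), rs.end(), by_x));
    // First residue with x >= center.x - r.
    auto it = std::lower_bound(
        rs.begin(), rs.end(), center.xyz[0] - r,
        [](const Residue& a, long x) { return a.xyz[0] < x; });
    int n = 0;
    // Sequential scan of the slab; stops at the first x beyond center.x + r.
    for (; it != rs.end() && it->xyz[0] <= center.xyz[0] + r; ++it) {
        if (it->code != code) continue;
        if (std::labs(it->xyz[1] - center.xyz[1]) <= r &&
            std::labs(it->xyz[2] - center.xyz[2]) <= r) {
            ++n;
        }
    }
    return n;
}
```

With n amino acids, this costs one O(n log n) sort plus, per query, a binary search and a scan over the slab only.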

Configuration and data file syntax checking requirements

The primary evaluation criterion is functional correctness and efficiency on correctly entered data. The program must also be stable on any (i.e. arbitrarily corrupted) data: it must not perform undefined operations, leave exceptions unhandled, exit uncontrollably, etc.

To achieve full points, the syntax of the configuration and data files must be checked. If it is violated, the program writes the string "error" (to the output file or to standard output, according to the command-line parameters) and ends with a return code of 0. Treat as syntax violations, for example: anything other than a valid 3-letter amino acid code, a wrong number of coordinates, non-numeric characters at coordinate positions, etc.

If a data file specified in the configuration file cannot be opened (e.g. because it does not exist), this is not considered an error; simply skip the file. Being unable to open the configuration file is an error.
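
A sketch of how these rules might be wired together. The names kCodes, is_amino_acid, SyntaxError and run_guarded are hypothetical. The sketch assumes that all input is validated before any result is written, so that "error" is the only output in the failure case, and the newline after "error" is also an assumption. An unopenable data file would simply be skipped inside the body and never reach the error path.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <ostream>
#include <sstream>
#include <stdexcept>
#include <string_view>

// The 20 standard codes from the assignment, sorted for binary search.
constexpr std::array<std::string_view, 20> kCodes = {
    "ALA", "ARG", "ASN", "ASP", "CYS", "GLN", "GLU", "GLY", "HIS", "ILE",
    "LEU", "LYS", "MET", "PHE", "PRO", "SER", "THR", "TRP", "TYR", "VAL"};

bool is_amino_acid(std::string_view code) {
    return std::binary_search(kCodes.begin(), kCodes.end(), code);
}

// Hypothetical exception type thrown on any syntax violation.
struct SyntaxError : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Runs body(out). Whatever goes wrong, writes "error" and still returns 0,
// so no exception escapes and the return code matches the assignment.
template <class Body>
int run_guarded(std::ostream& out, Body body) {
    try {
        body(out);
    } catch (const std::exception&) {
        out << "error\n";
    } catch (...) {
        out << "error\n";
    }
    return 0;
}
```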

----
The file that was downloadable from the link:

GLY  -5902  73707  44647
PRO  -6264  73743  40764
TYR  -3705  71988  38542
LEU  -3494  70898  34880
VAL  -2843  67241  34000
ILE  -2073  65571  30665
VAL  -5000  63138  30106
GLU  -3371  61798  26891
GLN   0298  62526  26068
PRO   1548  63200  22516
LYS   2971  60182  20620
GLN   6732  60114  21213
ARG   7588  58664  17782
GLY   6245  58612  14266
PHE   5232  62279  13870
ARG   6704  64364  11105
PHE   7776  67992  11705
ARG   7017  70083   8594
TYR   9026  72923   7063
GLY   7212  76167   6207
CYS   7470  75226   2529
GLU   5463  71982   2791
GLY   2067  73278   3865
PRO  -0149  73817   6887
SER  -2581  70903   6312
HIS  -0429  67805   6799
GLY  -1985  66380   9962
GLY  -2112  67101  13681
LEU   0271  65758  16340
PRO  -1704  62677  17591
GLY  -2080  61172  21099
ALA  -0097  58364  22710
SER  -2608  55734  21533
SER  -3856  56280  17981
GLU  -4190  52984  16070
LYS  -6365  52493  12940
GLY  -9823  54116  12842
ARG  -9728  55799  16290
LYS  -7067  58485  16685
THR  -6530  60460  19863
TYR  -4810  63883  20210
PRO  -2865  65278  23197
THR  -5445  66082  25941
VAL  -5149  67919  29242
LYS  -7630  68753  31976
ILE  -7906  71378  34717
CYS  -8719  69194  37749
ASN -10404  70905  40716
TYR -11209  73822  38373
GLU -11458  77453  39415
GLY -13284  79824  37052
PRO -13557  79329  33267
ALA  -9954  79073  32030
LYS  -7299  79768  29439
ILE  -4861  77005  28304
GLU  -1846  77730  26095
VAL   0660  75301  24561
ASP   4077  75922  23036
LEU   7228  74017  22190
VAL   9966  73747  24855
THR  13386  72129  24601
HIS  14112  68506  25486
SER  16709  69869  27923
ASP  15438  70035  31507
PRO  14855  73696  32442
PRO  12178  73987  29700
ARG  13172  76853  27501
ALA  11308  77859  24303
HIS  12261  75907  21143
ALA  13204  77563  17826
HIS  10407  75400  16314
SER   6815  76712  16480
LEU   3416  75054  16816
VAL   1340  75471  13671
GLY  -2446  75104  13090
LYS  -5815  75395  14808
GLN  -5961  78357  17279
CYS  -2120  78878  17123
SER  -0589  82386  16999
GLU   2146  83357  14620
LEU   4788  82755  17230
GLY   3830  79151  17958
ILE   1670  79732  21043
CYS  -1672  77951  21035
ALA  -4276  79584  23267
VAL  -7716  78245  24030
SER -10487  78742  26569
VAL -11923  75715  28467
GLY -15648  76296  29200
PRO -17220  76803  32657
LYS -18205  73099  32658
ASP -16010  70721  30632
MET -12476  71383  31856
THR -10707  69209  29289
ALA  -8884  70494  26198
GLN  -8229  68097  23312
PHE  -5664  69825  21044
ASN  -7002  68392  17826
ASN  -4686  70039  15291
LEU  -1173  70852  16446
GLY   1745  70832  14001
VAL   5462  71549  14380
LEU   6882  74031  11870
HIS  10501  73041  11348
VAL  13044  75904  11296
THR  15901  75530   8833
LYS  19433  75655  10285
LYS  19805  78943   8383
ASN  16548  80073  10015
MET  17317  79109  13580
MET  19235  82218  14690
GLY  16743  84782  13422
THR  13784  82759  14770
MET  15698  82476  18034
ILE  16634  86225  18473
GLN  12996  87040  17579
LYS  11470  84834  20294
LEU  14178  86005  22727
GLN  13369  89704  22123
ARG   9699  88988  22587
GLN  10682  87286  25841
ARG  12753  90306  26946
LEU  10048  92859  26526
ARG   7623  90445  28155
SER   7480  90991  31936
ARG  10089  93811  31835
PRO  10058  97252  29870
GLN   9652  98057  26142
GLY  12949  98079  24275
LEU  16442  96818  23593
THR  19556  98978  23660
GLU  22375  97823  21343
ALA  24145  96236  24282
GLU  20912  94388  25119
GLN  20607  93331  21481
ARG  24262  92206  21301
GLU  23787  89912  24374
LEU  20410  88680  23083
GLU  22215  87556  19909
GLN  24750  85902  22229
GLU  22082  83603  23681
ALA  20996  82677  20179
LYS  24443  81409  19128
GLU  25101  79517  22353
LEU  21496  78247  22381
LYS  21766  76912  18827
LYS  24900  74822  19619
VAL  23010  72824  22323
MET  19482  72631  20887
ASP  18136  69379  19421
LEU  16045  70184  16340
SER  14729  66589  16289
ILE  13135  66618  19726
VAL  10525  68948  21257
ARG   8191  68845  24255
LEU   4723  70383  24547
ARG   4344  72756  27479
PHE   0787  73383  28638
SER   0296  76558  30645
ALA  -3150  76794  32224
PHE  -4588  80131  33465
LEU  -7519  80693  35772
ARG  -9333  83809  34513
SER  -4554  86056  33994
LEU  -3785  83785  37002
PRO  -1175  81039  36386
LEU  -1311  77353  37450
LYS   1867  75230  37378
PRO   3065  74417  33818
VAL   2985  70745  32807
ILE   5618  69450  30341
SER   4877  66389  28188
GLN   7291  63642  27201
PRO   9670  64213  24228
ILE   8346  64193  20665
HIS  10762  62906  18060
ASP  10791  64095  14467
SER  10061  61036  12283
LYS  12033  62736   9520
SER  15216  62997  11542
PRO  17245  59950  10317
GLY  17438  58727  13939
ALA  13952  59279  15418
SER  11119  57328  13782
ASN   9027  54290  14618
LEU  11099  51143  14313
LYS   9026  49054  11907
ILE   9506  45573  10446
SER   7956  45698   6957
ARG   8694  42101   6048
MET  11300  39415   6458
ASP  12873  36586   4550
LYS  12076  33624   6745
THR  10020  33044   9879
ALA  11498  29718  10916
GLY  14891  28175  11319
SER  16964  25704  13281
VAL  17143  25680  17052
ARG  20963  25794  16510
GLY  20824  29465  15507
GLY  22848  30896  12662
ASP  20178  31069   9967
GLU  20493  34243   7847
VAL  17354  36452   7654
TYR  16848  39532   5394
LEU  14658  42018   7295
LEU  13119  45024   5485
CYS  12551  47987   7815
ASP  11835  51741   7949
LYS  14999  53877   8126
VAL  17326  52835  10966
GLN  20796  53806  12137
LYS  23533  51175  11547
ASP  25093  51639  14954
ASP  22001  52243  17053
ILE  19859  49238  16143
GLU  19426  45655  17326
VAL  17308  42563  16717
ARG  15951  41370  20044
PHE  14592  37820  20221
TYR  12631  36842  23322
GLU   9749  34819  24787
ASP   8051  36240  27899
ASP   6886  33394  30254
GLU   7554  33506  34056
ASN  11262  33452  33047
GLY  12252  35701  30103
TRP  14890  34456  27603
GLN  16442  37019  25278
ALA  19102  36908  22576
PHE  20204  38978  19590
GLY  20627  38977  15814
ASP  24156  38799  14470
PHE  25101  41451  11922
SER  27787  44113  11384
PRO  26997  47664  10160
THR  27811  46422   6610
ASP  24814  44143   7078
VAL  22504  47113   7550
HIS  21818  48009   3911
LYS  21168  51714   3043
GLN  19246  52071   6376
TYR  16281  50029   5037
ALA  17352  46377   5349
ILE  19238  44272   7870
VAL  20851  40933   6960
PHE  21363  39049  10191
ARG  22067  35534  11465
THR  19508  34280  14044
PRO  20741  33513  17538
PRO  20765  29990  19008
TYR  17734  28967  21094
HIS  18353  28429  24837
LYS  17559  24665  24766
MET  19471  22662  22095
LYS  17386  19628  23051
ILE  13946  20695  21789
GLU  11678  17961  20507
ARG   8822  20158  19311
PRO   9074  23447  17403
VAL   8841  26555  19640
THR   7655  30009  18645
VAL   9550  32969  20076
PHE   9433  36593  18957
LEU  11879  39031  17452
GLN  11742  42784  17347
LEU  13867  45695  16363
LYS  15128  47655  19342
ARG  17014  50927  19625
LYS  20262  50606  21626
ARG  20462  54056  23188
GLY  16657  54294  23601
GLY  14906  50997  24262
ASP  11870  51519  22010
VAL  10922  48525  19942
SER   9069  47812  16756
ASP   5930  45668  16633
SER   7200  42113  17169
LYS   7071  39374  14520
GLN   7145  35655  15455
PHE   9903  33186  14637
THR   9762  29446  15181
TYR  12667  27357  16209
TYR  12735  23834  14772
PRO  14682  20629  15825
GLY  37829  72937 -44895
PRO  39423  72794 -41437
TYR  37085  70892 -39150
LEU  37150  69964 -35506
VAL  36522  66345 -34541
ILE  35992  64789 -31118
VAL  38760  62171 -30602
GLU  37605  60944 -27176
GLN  33947  61358 -26113
PRO  32931  62292 -22574
LYS  31774  59246 -20589
GLN  27986  58978 -20925
ARG  27254  57588 -17370
GLY  29050  57545 -14042
PHE  29709  61297 -13617
ARG  27942  63593 -11120
PHE  26922  67248 -11435
ARG  27902  69285  -8385
TYR  26066  72303  -6889
GLY  27852  75521  -5996
CYS  27794  74536  -2296
GLU  29922  71490  -2950
GLY  33186  73050  -4097
PRO  35508  73107  -7144
SER  37760  70053  -6401
HIS  35373  67158  -7101
GLY  36772  65325 -10144
GLY  36952  66091 -13863
LEU  34560  64640 -16456
PRO  36434  61522 -17763
GLY  36657  60146 -21346
ALA  34448  57382 -22743
SER  37231  54895 -21834
SER  38620  55456 -18366
GLU  39077  52164 -16546
LYS  41392  51549 -13499
GLY  44617  53604 -13698
ARG  44243  54637 -17369
LYS  42031  57687 -17194
THR  41083  59663 -20266
TYR  39365  63079 -20564
PRO  37316  64423 -23507
THR  39574  65366 -26395
VAL  39122  67040 -29740
LYS  41373  68063 -32609
ILE  41324  70539 -35486
CYS  41946  68584 -38686
ASN  43403  69977 -41943
TYR  44839  72761 -39735
GLU  44771  76441 -40494
GLY  46301  79189 -38229
PRO  47095  78906 -34391
ALA  43648  78184 -32947
LYS  41316  78587 -30054
ILE  38732  76075 -28770
GLU  35780  77030 -26562
VAL  33632  74468 -24686
ASP  30188  75142 -23224
LEU  27199  73091 -22019
VAL  24297  72664 -24407
THR  21014  70880 -24011
HIS  20088  67351 -24992
SER  17220  68465 -27306
ASP  18198  69046 -30976
PRO  18611  72685 -31991
PRO  21409  72986 -29380
ARG  20731  75777 -26945
ALA  22892  76815 -23963
HIS  22058  74953 -20771
ALA  21474  76500 -17328
HIS  24519  74535 -16100
SER  28005  75924 -16368
LEU  31391  74300 -16789
VAL  33790  74805 -13893
GLY  37519  74517 -13335
LYS  40668  74478 -15402
GLN  40425  77592 -17526
CYS  36675  78042 -17053
SER  35058  81522 -17152
GLU  32580  82666 -14533
LEU  29870  82035 -17146
GLY  30647  78441 -17876
ILE  32746  78936 -21037
CYS  35977  76948 -21311
ALA  38437  78503 -23737
VAL  41905  77243 -24548
SER  44516  77856 -27256
VAL  45636  74954 -29506
GLY  49323  75501 -30280
PRO  50759  76164 -33752
LYS  51624  72523 -34351
ASP  49767  70304 -31866
MET  46101  70914 -32799
THR  44598  68639 -30157
ALA  43079  69783 -26877
GLN  42566  67450 -23924
PHE  39931  68678 -21442
ASN  41585  67449 -18259
ASN  39319  69052 -15668
LEU  35681  69631 -16569
GLY  33250  70024 -13721
VAL  29536  70474 -14319
LEU  27859  73088 -12154
HIS  24307  72231 -11125
VAL  22013  75266 -11054
THR  19235  75040  -8477
LYS  15623  75391  -9703
LYS  15398  78571  -7664
ASN  18379  79795  -9696
MET  17636  78545 -13167
MET  15411  81499 -14058
GLY  18078  83957 -12924
THR  21083  82429 -14703
MET  19053  81406 -17729
ILE  18243  85108 -18162
GLN  21916  86068 -17595
LYS  22788  83669 -20436
LEU  20118  84884 -22882
GLN  21005  88549 -22110
ARG  24672  87693 -22616
GLN  23671  85807 -25723
ARG  21295  88476 -27022
LEU  23322  91574 -26330
ARG  26019  89432 -27930
SER  25805  89806 -31728
ARG  23461  92803 -31639
PRO  23802  95989 -29396
GLN  24416  96566 -25613
GLY  21159  96897 -23672
LEU  17588  95752 -23213
THR  14599  98104 -23056
GLU  11835  97171 -20554
ALA   9898  95329 -23247
GLU  13115  93418 -24002
GLN  13797  92428 -20363
ARG  10196  91204 -20045
GLU  10419  88862 -23062
LEU  13685  87547 -21613
GLU  12001  86316 -18456
GLN   9461  84426 -20610
GLU  11916  82414 -22732
ALA  13423  81462 -19352
LYS  10147  80698 -17671
GLU   9112  78601 -20684
LEU  12647  77405 -21077
LYS  12712  76076 -17490
LYS   9600  73963 -18084
VAL  11260  72077 -20982
MET  14891  71883 -19786
ASP  16243  68527 -18513
LEU  18654  69339 -15747
SER  20064  65778 -15666
ILE  21517  65642 -19203
VAL  23951  68034 -20892
ARG  26031  67834 -24067
LEU  29538  69313 -24601
ARG  29970  71840 -27425
PHE  33360  72571 -28962
SER  33548  75789 -30986
ALA  36966  76344 -32621
PHE  38292  79719 -33893
LEU  41297  80162 -36201
ARG  43042  83517 -35791
SER  38086  85130 -34988
LEU  37045  82737 -37790
PRO  34565  80041 -36628
LEU  34793  76331 -37577
LYS  31477  74455 -37338
PRO  30630  73532 -33676
VAL  30917  69785 -32897
ILE  28544  68398 -30256
SER  29373  65369 -28130
GLN  27045  62567 -27060
PRO  24822  63154 -23987
ILE  26287  63144 -20432
HIS  23973  61973 -17642
ASP  23935  63028 -13955
SER  24990  60041 -11793
LYS  23222  61538  -8765
SER  19909  62085 -10494
PRO  17509  59227  -9469
GLY  16966  58164 -13095
ALA  20657  57955 -14074
SER  22205  56453 -10944
ASN  24644  53523 -11154
LEU  22663  50243 -10992
LYS  24099  47657  -8536
ILE  22975  44300  -7088
SER  24336  44221  -3560
ARG  23105  40924  -2045
MET  20579  38271  -2955
ASP  18332  36099  -0843
LYS  18801  32885  -2669
THR  20739  32013  -5772
ALA  18894  28851  -6708
GLY  15322  27664  -7134
SER  13047  25395  -9187
VAL  12766  25518 -12968
ARG   9037  26422 -12579
GLY  10151  29791 -11316
GLY   7869  31477  -8797
ASP  10454  31473  -6040
GLU  10393  34614  -3941
VAL  13711  36507  -3830
TYR  14814  39437  -1636
LEU  17392  41375  -3661
LEU  19020  44375  -2038
CYS  20329  46965  -4451
ASP  21480  50568  -4929
LYS  18665  53141  -5412
VAL  16190  52357  -8208
GLN  12876  53968  -9091
LYS   9999  51542  -8573
ASP   8331  52370 -11854
ASP  11563  52273 -13922
ILE  13324  49080 -12825
GLU  13381  45447 -13927
VAL  15164  42232 -12928
ARG  16189  40548 -16147
PHE  17234  36834 -16128
TYR  19061  35699 -19259
GLU  21626  33484 -20983
ASP  23772  34090 -24092
ASP  24587  31202 -26397
GLU  24101  31390 -30186
ASN  20364  31765 -29327
GLY  19543  33559 -26099
TRP  16677  33700 -23613
GLN  15344  36329 -21221
ALA  12637  36581 -18586
PHE  11740  39050 -15875
GLY  11248  38679 -12138
ASP   7632  39425 -11182
PHE   6800  42104  -8657
SER   4558  45119  -8051
PRO   5610  48719  -7102
THR   4730  47736  -3511
ASP   7510  45118  -3736
VAL  10258  47658  -4256
HIS  11094  48572  -0658
LYS  12111  52249  -0233
GLN  14117  52266  -3519
TYR  16778  49927  -2025
ALA  15286  46475  -2137
ILE  13155  44447  -4511
VAL  11069  41418  -3389
PHE  10212  39404  -6508
ARG   9213  35982  -7791
THR  11689  34430 -10225
PRO  10526  33672 -13764
PRO  10450  30155 -15221
TYR  13435  28807 -17212
HIS  12592  27953 -20867
LYS  13119  24168 -20635
MET  11163  22511 -17793
LYS  12513  19051 -18392
ILE  16171  19712 -17420
GLU  17906  16675 -16002
ARG  20739  18749 -14442
PRO  20795  22205 -12656
VAL  21420  25231 -14950
THR  23035  28549 -13979
VAL  21787  31726 -15708
PHE  22609  35461 -15284
LEU  20442  38158 -13807
GLN  20811  41936 -13908
LEU  18869  45057 -12875
LYS  17767  47105 -15890
ARG  15886  50384 -16147
LYS  12986  49924 -18601
ARG  13828  53217 -20348
GLY  17599  53680 -20335
GLY  18712  50174 -21345
ASP  21222  50776 -18587
VAL  22137  47629 -16586
SER  23898  46535 -13407
ASP  26794  44083 -13346
SER  25489  40525 -13460
LYS  24824  38022 -10674
GLN  23829  34354 -11076
PHE  20884  32126 -10268
THR  20616  28329 -10690
TYR  17440  26526 -11742
TYR  16906  23050 -10274
PRO  15210  20064 -12063