Calculating Molecular Descriptors using RDKit and Mordred

Machine Learning in Chemistry from Scratch


Published 2 Jan 2025

COMMENTS • 59

@licidamarcristinadiazbamba8061 2 years ago ⁺⁸
Thanks for uploading this! It helps a lot!
@Kloppenhiemer 2 years ago ⁺²
Thank you so much for your efforts.
@kevinender5409 2 years ago ⁺⁵
Very useful! Thanks for sharing c:
@Dr.Gashaw.M_Goshu 2 years ago ⁺¹
I am glad it helps!
@kil1ua 1 year ago ⁺²
Thanks a lot for this amazing tutorial! It helps me a lot.
@АнтонГрадов-ъ8з 2 years ago ⁺³
Thank you very much. It is really helpful for my university project! It would also be really helpful if you prepared a video about applying descriptors in QSAR and machine learning.
Please!
@Dr.Gashaw.M_Goshu 2 years ago ⁺¹
I am glad it helps. Thanks for your suggestion.
@DrHenryNguyen 1 year ago
@Dr.Gashaw.M_Goshu Dear Sir, I am new to this. How can I run my data on your link? Thanks so much.
@SinaGilassi 2 years ago ⁺²
Very interesting, thanks.
@Dr.Gashaw.M_Goshu 2 years ago
You are welcome!
@sumitkumar-el3kc 2 years ago ⁺³
Thank you so much, sir. Can you teach me about protein descriptors too?
@awomutiadeboye5251 2 years ago ⁺²
Thanks for this tutorial, Dr. Gashaw. Could you give a line of code that lists the 1826 Mordred descriptors, just like you listed the RDKit descriptors during the video? I tried to do the same for Mordred, but it didn't work.
@Dr.Gashaw.M_Goshu 2 years ago
Please try the following:
print(Calculator(descriptors, ignore_3D=False).descriptors)
It should print all 1826 descriptors.
For further information, please take a look at the following reference: github.com/mordred-descriptor/mordred/blob/develop/examples/020-single_mol-multiple_desc.py
@jval3614 2 years ago ⁺³
Amazing video! About the fingerprints: every time they are generated, does column 1 always correspond to the same chemical fragment?
@Dr.Gashaw.M_Goshu 2 years ago ⁺²
Very good question! In this chemical series, every time I run the notebook, only three rows have "on" bits in the first column, and the rest have zero values.
Morgan_fingerprints.Col_0.value_counts()
0 2870
1 3
You can rerun it and will probably get the same result. My guess is that it encodes the same chemical fragment, otherwise it would give me different values every time I run it, but I am not sure which fragment is encoded in the first column. You can take a look at the following RDKit blog post to learn more about bit rendering: rdkit.blogspot.com/2018/10/using-new-fingerprint-bit-rendering-code.html
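One way to check the guess above is to ask RDKit which atom environments set each bit. The following is a minimal sketch, assuming the fingerprints are built with `AllChem.GetMorganFingerprintAsBitVect` and its `bitInfo` argument; the helper name `morgan_bits`, the ethanol example, the radius of 2, and the 2048-bit length are illustrative choices and not necessarily the notebook's settings. Morgan bits come from hashing atom environments, so a given environment maps to the same bit position on every run, although different environments can collide on one bit.

```python
from rdkit import Chem
from rdkit.Chem import AllChem


def morgan_bits(smiles, radius=2, n_bits=2048):
    """Return {bit index: ((atom index, radius), ...)} for one molecule.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles!r}")
    bit_info = {}
    # bitInfo is filled in place with the environments behind each "on" bit
    AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits, bitInfo=bit_info)
    return bit_info


first_run = morgan_bits("CCO")
second_run = morgan_bits("CCO")

# The hashing is deterministic, so both runs set the same bits
print(first_run == second_run)
print(sorted(first_run))
```

Looking up a column index in the returned dictionary shows which atom and radius switched that bit on for a particular molecule.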
@juandavidrangel6915 2 years ago ⁺²
Thank you! Amazing tutorial. How can I download the Mordred data frame to an Excel file? I have tried, but several errors are shown because the type of mordred_descriptors is MordredDataFrame, not a "normal" DataFrame. I mean, it has different methods, and it is not possible to create the file.
@Dr.Gashaw.M_Goshu 2 years ago
If you have a Mordred data frame in a Jupyter notebook like mine, you can save it as a CSV file like this:
mordred_descriptors.to_csv('mordred_descriptors.csv',index=False)
Then you can open it in Excel without any problem. Once you have opened it in Excel, you can save it as an Excel file. I hope this helps!
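If the Mordred-specific frame type gets in the way, one option is to rebuild it as a plain pandas DataFrame before saving. This is a minimal sketch under stated assumptions: the small frame below is a stand-in for the notebook's `mordred_descriptors` with made-up values, and the string cell only imitates a descriptor that could not be computed.

```python
import os
import tempfile

import pandas as pd

# Stand-in for the notebook's `mordred_descriptors`; the values are made up.
# The string cell imitates a descriptor that could not be computed.
mordred_descriptors = pd.DataFrame(
    {"ABC": [4.2, 5.1], "SLogP": [0.3, "missing value"]}
)

# Rebuild as a plain DataFrame and turn every non-numeric cell into NaN
plain = pd.DataFrame(mordred_descriptors).apply(pd.to_numeric, errors="coerce")

# Write to a temporary folder here; any writable path works
out_path = os.path.join(tempfile.gettempdir(), "mordred_descriptors.csv")
plain.to_csv(out_path, index=False)

# Read the file back to confirm it was written as expected
reloaded = pd.read_csv(out_path)
print(reloaded.shape)
```

pandas also offers `DataFrame.to_excel` for writing an Excel file directly, which needs an Excel writer package such as openpyxl to be installed.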
@satyarahul8007 2 years ago ⁺²
During the canonical SMILES part, the number of compounds (2904) is the same before and after running the function. Does that mean there aren't any canonical SMILES?
@Dr.Gashaw.M_Goshu 2 years ago
The number of SMILES before and after running the function is the same because the function does not remove duplicate SMILES. That is why we need to remove duplicate structures after running the function.
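The two steps described in the reply can be sketched as follows. This is a minimal example, assuming RDKit's `Chem.MolFromSmiles` and `Chem.MolToSmiles` for canonicalization and pandas `drop_duplicates` for the clean-up; the helper `canonical_smiles`, the column names, and the four example molecules are illustrative and not taken from the notebook.

```python
import pandas as pd
from rdkit import Chem


def canonical_smiles(smiles):
    """Return RDKit's canonical SMILES, or None if the input cannot be parsed."""
    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(mol) if mol is not None else None


# Three different spellings of ethanol plus benzene; illustrative data only
dataset = pd.DataFrame({"SMILES": ["OCC", "C(O)C", "CCO", "c1ccccc1"]})
dataset["canonical_SMILES"] = dataset["SMILES"].apply(canonical_smiles)

# Canonicalization keeps every row; only drop_duplicates shrinks the table
deduplicated = dataset.drop_duplicates(subset="canonical_SMILES").reset_index(drop=True)

print(len(dataset), len(deduplicated))
```

The row count only changes at the `drop_duplicates` step, which matches the behavior described above.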
@DrHenryNguyen 1 year ago ⁺²
Dear Sir, I am new to this. How can I run my data on your link? Thanks so much.
@Dr.Gashaw.M_Goshu 1 year ago ⁺¹
No problem. It is prepared for beginners. Use the following YouTube video to get a sense of how Google Colab works: ua-cam.com/video/6Xt6L1I5jScI/v-deo.htmlt
After that, you can open it in Google Colab, save it in your account, and then replace my data with yours.
@DrHenryNguyen 1 year ago
@Dr.Gashaw.M_Goshu Thanks so much for your replies. Dear Sir, I only have SMILES. Is it possible to calculate the energy gap from SMILES?
@DrHenryNguyen 1 year ago ⁺¹
@Dr.Gashaw.M_Goshu Is there any file that explains the 200 descriptors, the fingerprints, and the 1826 descriptors? Thanks.
@Dr.Gashaw.M_Goshu 1 year ago
Good question! Here is the publication for Mordred, which is free to read, but I do not remember whether it includes detailed information about each descriptor. jcheminf.biomedcentral.com/articles/10.1186/s13321-018-0258-y#abbreviations
@abdallahabouhajal 1 year ago
Thank you so much for this video, it's really helpful. A quick question: can I choose which descriptors I want to calculate from Mordred instead of calculating all 1826? After feature selection I know which ones I want to include in my model, and thus which ones I need when testing new inputs. This would save a lot of time.
@mrnicjic3511 1 year ago ⁺¹
Great video, Sir! I am a beginner in software but very interested, and I just didn't fully understand how to plug my SMILES strings into the program. Could somebody explain it? Thanks in advance 🙏🏻
@Dr.Gashaw.M_Goshu 1 year ago
As long as you are interested, learning programming, or anything for that matter, is much easier than it was 10 years ago. Basic Python is needed to understand my videos. You can learn it using W3Schools www.w3schools.com/python/default.asp or YouTube videos from freeCodeCamp ua-cam.com/video/eWRfhZUzrAc/v-deo.html and many others.
@robertcormia7970 2 months ago
A good demonstration, especially of the coding procedure. Some pronunciation was a little difficult to understand, for instance "Morgan fingerprint", so it was useful to have captioning available.
@milindrahatwal 1 year ago
Thank you for sharing this video.
I keep getting this error on the RDKit features:
XGBoostError: [01:58:25] ../src/data/data.cc:1104: Check failed: valid: Input data contains `inf` or `nan`.
I have used dropna(), but I still can't get rid of the error. Also, when I check for NaN or infinite values, it shows 0.
@Dr.Gashaw.M_Goshu 1 year ago ⁺¹
If some RDKit descriptors have infinity or NaN values for your compounds, you need to identify those columns and drop them before feeding the molecular descriptors to the XGBoost algorithm.
@davidl.e5203 1 year ago
Try a = np.where(np.isnan(a) | np.isinf(a), 0, a)
This line replaces all nan, inf, and -inf values with 0.
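The advice to identify and drop the offending columns can be sketched as below. One possible explanation for seeing zero NaN or infinite values while XGBoost still complains, offered here as an assumption rather than a diagnosis, is that XGBoost stores features as 32-bit floats, so a value that is finite in float64 can overflow to `inf`. The helper `drop_non_finite_columns`, the column names, and the numbers are illustrative only.

```python
import numpy as np
import pandas as pd


def drop_non_finite_columns(features):
    """Drop columns holding NaN, +/-inf, or values too large for float32.

    Hypothetical helper for illustration; returns the cleaned frame and
    the names of the dropped columns.
    """
    values = features.astype("float64")
    float32_max = float(np.finfo(np.float32).max)
    # A cell is bad if it is NaN/inf or would overflow when cast to float32
    bad = ~np.isfinite(values) | (values.abs() > float32_max)
    bad_columns = bad.any(axis=0)
    dropped = bad_columns[bad_columns].index.tolist()
    return features.loc[:, ~bad_columns], dropped


# Illustrative values; the column names are examples only
features = pd.DataFrame(
    {
        "MolWt": [46.07, 78.11],
        "Ipc": [1e40, 2.0],          # finite in float64, too large for float32
        "BadRatio": [np.inf, 1.0],
        "Missing": [np.nan, 3.0],
    }
)

clean, dropped = drop_non_finite_columns(features)
print(dropped)
print(list(clean.columns))
```

Dropping columns keeps the remaining values untouched, whereas replacing bad cells with 0 keeps the columns but changes their meaning, so the choice depends on the model and the data.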
@boufissiou 2 years ago ⁺²
How do I get chemical compounds into the Colab server from a file in SDF format?
@Dr.Gashaw.M_Goshu 2 years ago ⁺¹
Hi Ahmed,
If your file is not on GitHub, you need to upload it to your Google Drive, mount the drive, and work from there. Please take a look at the following video starting from 9:00 min, where he describes it clearly. ua-cam.com/video/6Xt6L1I5jSc/v-deo.html
I hope this helps.
@attaullah4998 1 year ago
Thank you, Sir, for your valuable knowledge. I followed the tutorial, but I was not getting the Mordred molecular descriptors properly; for some descriptors it was showing 3D missing. What should I do? Do I need to optimize the geometry? I ask because I use the canonical SMILES from PubChem for my molecules. Thanks.
@Dr.Gashaw.M_Goshu 1 year ago
Try ignoring 3D descriptors as shown below:
calc = Calculator(descriptors, ignore_3D=True)
@fatemehf2566 2 years ago ⁺²
Thank you so much for your great video. Could you please make a video about 4D descriptors? Could you also name some software that I can use to calculate 4D descriptors?
@Dr.Gashaw.M_Goshu 2 years ago ⁺²
When I was in graduate school, I used commercial software called SYBYL-X to calculate CoMFA. That software is no longer available. I am not sure whether there is free and reliable software that can calculate 4D descriptors. 4D descriptors are computationally expensive, which may be the main reason we do not have many options. I am not sure how good they are, but take a look at the following links. www.ra.cs.uni-tuebingen.de/software/4DFAP/
github.com/rougeth/Web-4D-QSAR
@aisiazmi1983 1 year ago ⁺²
Prof, what is the code to save the results of RDKit and Mordred?
@Dr.Gashaw.M_Goshu 1 year ago ⁺²
Good question! They are stored in pandas data frames, so we can save them as CSV files. For example, in the notebook, the Mordred descriptors were stored in a variable named mordred_descriptors. It can be saved as follows.
mordred_descriptors.to_csv('mordred_descriptors.csv', index = False)
This should save your Mordred descriptors in your current working directory.
@aisiazmi1983 1 year ago ⁺²
@Dr.Gashaw.M_Goshu Thank you, Prof.
@sathyashiva3363 4 months ago
Nice
@aisiazmi2753 1 year ago ⁺²
Sir, can you explain more about canonical SMILES? What is the difference between SMILES and canonical SMILES? Thank you, Sir.
@Dr.Gashaw.M_Goshu 1 year ago ⁺¹
A molecule can be represented by more than one SMILES, but a molecule cannot have more than one canonical SMILES. That is the basic difference.
@aisiazmi2753 1 year ago
@Dr.Gashaw.M_Goshu I understand, thank you so much, Sir.
@aisiazmi1983 1 year ago
@Dr.Gashaw.M_Goshu Prof, when I calculated the descriptors using Mordred, each feature had 2 values. I need each canonical SMILES to have one value per feature. May I delete one of them, Prof? And why does this happen, Prof? 🙏
@Dr.Gashaw.M_Goshu 1 year ago
I am not sure why you are getting duplicate values. If you follow the steps in the notebook, you should not get duplicate SMILES. I would suggest removing duplicate SMILES before calculating molecular descriptors.
@aisiazmi1983 1 year ago
@Dr.Gashaw.M_Goshu I mean, in the previous calculation using RDKit, my dataset had 270 canonical SMILES. But after calculating with RDKit, there were more than 270 results. So, is that okay, Prof?
@sriramvaidyanathan5094 2 years ago ⁺¹
Here the molecular descriptors are calculated after making the hydrogens explicit, so the descriptors are found with hydrogens. Can you confirm?
@Dr.Gashaw.M_Goshu 2 years ago
Yes, hydrogen atoms are added to the structures, and then molecular descriptors are calculated for each compound.
@sriramvaidyanathan5094 2 years ago
@Dr.Gashaw.M_Goshu But for some 2D indices we usually deplete the hydrogens. Will it do that by itself when calculating those?
@AGnanaprakasam 1 year ago
@Dr.Gashaw.M_Goshu Can you please explain why hydrogen atoms are added to the structures before the molecular descriptors are calculated for each compound?
@Dr.Gashaw.M_Goshu 1 year ago
@AGnanaprakasam The hydrogens are included in the structure either way. What I did was make them explicit. We can try with and without adding hydrogens, and then take the one that gives better results. In my case, I got slightly better results when the hydrogens were explicit. That is why I added hydrogens.
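The distinction between implicit and explicit hydrogens can be seen in a small sketch, assuming RDKit's `Chem.AddHs`; ethanol is only an illustrative molecule. The molecular weight already accounts for implicit hydrogens, so it does not change, while the molecular graph that graph-based descriptors work on gains atoms and bonds.

```python
from rdkit import Chem
from rdkit.Chem import Descriptors

mol = Chem.MolFromSmiles("CCO")   # ethanol, hydrogens implicit
mol_with_h = Chem.AddHs(mol)      # same molecule, hydrogens as explicit atoms

# The graph grows: 3 heavy atoms become 9 atoms, 2 bonds become 8 bonds
print(mol.GetNumAtoms(), mol_with_h.GetNumAtoms())
print(mol.GetNumBonds(), mol_with_h.GetNumBonds())

# The molecular weight counts implicit hydrogens too, so it stays the same
print(Descriptors.MolWt(mol), Descriptors.MolWt(mol_with_h))
```

Whether the explicit-hydrogen graph gives better model results is an empirical question, which is why trying both variants, as described above, is a reasonable approach.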
@AGnanaprakasam 1 year ago
@Dr.Gashaw.M_Goshu Thank you
@bomcimtube 7 months ago
Thank you. In 2024, the functions to calculate RDKit and Mordred descriptors give errors.
@Dr.Gashaw.M_Goshu 7 months ago
I ran it in Colab a few minutes ago and did not see any issues. I am not sure why it did not work for you.
@bomcimtube 7 months ago
@Dr.Gashaw.M_Goshu You are right. I just ran it in Colab and didn't see any issues either. When I installed the modules and Python 3.9 on my local computer and ran the cells, it gave errors (maybe due to changes in the packages).
@bomcimtube 7 months ago
OK, I rewrote the first function, and now it works:
import pandas as pd
from rdkit import Chem
from rdkit.Chem import Descriptors
from rdkit.ML.Descriptors import MoleculeDescriptors

# get descriptor table
def calculate_descriptors(smiles_list):
    # List of descriptor names
    descriptor_names = [desc_name[0] for desc_name in Descriptors._descList]
    # Create a MoleculeDescriptors calculator
    calculator = MoleculeDescriptors.MolecularDescriptorCalculator(descriptor_names)
    # Initialize a list to hold the descriptors for each molecule
    descriptors_list = []
    for smiles in smiles_list:
        # Convert SMILES to RDKit molecule
        mol = Chem.MolFromSmiles(smiles)
        if mol is not None:
            # Calculate descriptors
            descriptors = calculator.CalcDescriptors(mol)
            descriptors_list.append(descriptors)
        else:
            # Invalid SMILES: append a row of NaN so the columns still line up
            descriptors_list.append([float("nan")] * len(descriptor_names))
    return descriptor_names, descriptors_list

# function usage (dataset_new is the data frame from the notebook):
smiles_list = dataset_new['SMILES']
descriptor_names, descriptors = calculate_descriptors(smiles_list)
df_with_200_descriptors = pd.DataFrame(descriptors, columns = descriptor_names)
