I'm trying to extract a title from the famous "Titanic" dataset, where the format is like this:
[Name] [1]: https://i.sstatic.net/HlkH8zHO.png
I'm trying to avoid an iterative solution, so I've tried something like this:
df['Title'] = df['Name'].str[df['Name'].str.find(' ') + 1 : df['Name'].str.find('.')]
This doesn't work since I'm passing Series as the slice bounds instead of single values. What would be the correct way to do this?
This works, but seems too complex:
# Per-row positions of the first space and the first dot
space_pos = data.Name.str.find(" ")
dot_pos = data.Name.str.find(".")
data["Title"] = [data.Name[i][space_pos[i] + 1:dot_pos[i]] for i in range(len(data.Name))]
You can use regular expressions to pull out the text without looping through each row. Here's a way to do it using str.extract in pandas:
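import pandas as pd
# Assuming df is your DataFrame and 'Name' is the column with the names
# Raw string so that \. is passed to the regex engine as a literal dot;
# expand=False returns a Series for the single capture group
df['Title'] = df['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)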
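As a quick check, here is a minimal sketch of the same idea on a few rows. The sample names are assumptions written in the dataset's "LastName, Title. FirstName" style, not read from the actual file:

```python
import pandas as pd

# Hypothetical sample rows in the "LastName, Title. FirstName" style
df = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
        "Heikkinen, Miss. Laina",
    ]
})

# Capture the run of letters that sits between a space and a dot;
# expand=False gives back a Series because there is one capture group
df["Title"] = df["Name"].str.extract(r" ([A-Za-z]+)\.", expand=False)

print(df["Title"].tolist())  # ['Mr', 'Mrs', 'Miss']
```

Rows where the pattern does not match should come back as NaN, so it may be worth checking df["Title"].isna().sum() on the real data.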
According to the image, the format seems to be "LastName, Title. FirstName Other", with no comma or dot in Other. So we can split the name on the comma and take element 1 (the second piece), then split that on the dot and take element 0 (the first piece), which is the title. You can use:
data['Title'] = data['Name'].map(lambda s: s.split(",")[1].split(".")[0].strip())
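If you would rather avoid the Python-level lambda, the same two splits can probably be chained through the .str accessor. This is a sketch under the same format assumption, with hypothetical sample rows:

```python
import pandas as pd

# Hypothetical sample rows; the last one has a two-word title
data = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Heikkinen, Miss. Laina",
        "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
    ]
})

data["Title"] = (
    data["Name"]
    .str.split(",").str[1]   # text after the comma: " Mr. Owen Harris"
    .str.split(".").str[0]   # text before the first dot: " Mr"
    .str.strip()             # drop the leading space
)

print(data["Title"].tolist())  # ['Mr', 'Miss', 'the Countess']
```

Note that splitting keeps multi-word titles whole ("the Countess"), whereas a single-word pattern such as ' ([A-Za-z]+)\.' would return only the word right before the dot ("Countess").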